Is a custom word list okay to add to the repository? #43

Open
luc-x41 opened this issue Apr 10, 2024 · 2 comments

@luc-x41

luc-x41 commented Apr 10, 2024

Hi! I've been checking and adding words to an internal dictionary that we've used for our penetration testing reports, blog posts, etc. over the past years, when a colleague pointed out that I should maybe just be using a better dictionary than the default one and pointed me here :)

Many of these words, such as cryptographically, canonicalized, satisfiable, and transpiling, are not yet in this repository, so I want to contribute/consolidate those. I have no automatically updating source for them (and we certainly could not publish customers' reports for the project to scrape words from 😅), so my question is whether you are interested in including a list of custom words that does not get automatic updates. The words are split out into:

  • about 250 (jargon) words like the aforementioned four, all American English spellings because that's what we've standardized on;
  • about 100 acronyms and 200 brand names that I think you will mostly already have (didn't do a diff yet); and
  • about 60 names of different protocols or standards (WHOIS, WebAuthn), tools (sudo, rsync), attacks (Heartbleed, Clickjacking), encodings (Base64), etc., few of which seem to be in the repository.
    • (By the way, if someone has an overarching name for these sixty, I'd be interested. I guess the distinction I'm making is that brands are something you may recommend, whereas Heartbleed, rsync, or USB... I mean, they can be trademarked, and you could recommend something like USB or rsync over some other connector or transfer software, but they're very broadly used as neutral terms (no risk of looking like an endorsement), and they're also part of compound nouns such as USB drive, which I wouldn't think is trademarked or considered branded.)
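
The diff mentioned above (checking which custom words the repository already has) could be sketched roughly as follows. This is only an illustration: it assumes the lists are plain text files with one word per line, a `.words` extension, and `#` for comment lines. `load_words` and `new_words` are hypothetical helper names, not part of the repository.

```python
from pathlib import Path


def load_words(path: Path) -> set[str]:
    """Read one word per line, skipping blank lines and '#' comments (assumed format)."""
    words = set()
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            words.add(line)
    return words


def new_words(custom: set[str], wordlist_dir: Path) -> list[str]:
    """Return the custom words that appear in no *.words file under wordlist_dir."""
    existing: set[str] = set()
    for path in wordlist_dir.glob("*.words"):
        existing |= load_words(path)
    # Exact, case-sensitive comparison: "USB" and "usb" count as different words.
    return sorted(custom - existing)
```

Calling `new_words(load_words(Path("custom.txt")), Path("wordlists"))` would then list only the entries worth contributing; both paths are placeholders.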

If yes, follow-up questions are:

  • What should the structure be? If anything in ./wordlists/ is considered to be automatically generated, I could simply make a script that does no more than echoing the words. Alternatively, a plain text file could be included among the generated word lists, perhaps with a comment on top that indicates it is custom.
  • Should they be in one list/file with a blank line and comment separating each category (that's my current structure), or rather separate lists? The repository currently uses separate lists per category, but because these words will not update, it also feels potentially sensible to just collect these in one place.
    • The acronyms and brand names are a bit different because those categories already exist. I will check whether there's a point merging them in the first place: if there are more than, say, a dozen new entries then it probably makes sense to incorporate these into the existing scripts for acronyms and brands.
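
If separate lists end up being preferred, the single-file layout described above (categories separated by a blank line and a comment) could be split mechanically. A minimal sketch, assuming the comment marker is `#` and that each comment line names the category; `split_by_category` is a hypothetical helper:

```python
def split_by_category(text: str) -> dict[str, list[str]]:
    """Group words under the most recent '# category' comment line."""
    groups: dict[str, list[str]] = {}
    current = "uncategorized"  # bucket for words that precede any comment
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank separator lines carry no information
        if line.startswith("#"):
            # An empty comment ("#") keeps the previous category.
            current = line.lstrip("#").strip() or current
            continue
        groups.setdefault(current, []).append(line)
    return groups


sample = "# jargon\ncryptographically\ntranspiling\n\n# protocols\nWHOIS\nWebAuthn\n"
for category, words in split_by_category(sample).items():
    print(category, words)
```

Each resulting group could then be written to its own file, or kept together, depending on which structure the maintainers prefer.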
psliwka added a commit that referenced this issue Apr 13, 2024
Add answers to some questions posed in #43.
@psliwka
Owner

psliwka commented Apr 13, 2024

Hi, and thanks for your interest in expanding this project! New wordlists are certainly a welcome addition, and having them auto-generated is not a must – so far they're all rendered by scripts because I was simply too lazy to ever craft one manually 😅

I've addressed some of your follow-up questions in fed05e1 – TL;DR: static lists in ./wordlists/ are okay; try to add multiple smaller lists rather than one big one. Also, it's fine to extend existing scripts for e.g. brands.words and acronyms.words to pull words from somewhere else (IDK, maybe a script-embedded list? a separate static "include" file?) to enrich their scraped output with extra words ;)
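
To illustrate the static "include" file idea, here is a rough sketch of merging scraped words with a hand-maintained file. It is not how the existing scripts work (their language and structure are not shown in this thread); `merge_words` and the include-file format (one word per line, `#` comments) are assumptions for the example.

```python
from pathlib import Path
from typing import Iterable


def merge_words(scraped: Iterable[str], include_file: Path) -> list[str]:
    """Combine scraped words with extras from a static include file, deduplicated and sorted."""
    merged = {word.strip() for word in scraped}
    if include_file.exists():  # a missing include file simply adds nothing
        for line in include_file.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if not line.startswith("#"):
                merged.add(line)
    merged.discard("")  # drop blanks that came from empty lines or entries
    return sorted(merged)
```

A generator script could pass its scraped output through `merge_words` just before writing the final list, so the static extras survive every regeneration.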

Let me know if you have any more questions :)

@mamekoro

To which word list should I add words that are used in various fields and difficult to classify? (e.g. "resize" and "despawn")

In my opinion, the existing word lists are already chaotic. The word "iterator" is in python.words even though it is not Python-specific. "Btrfs" is in docker.words even though it is independent of Docker.

So, how about creating a new word list like misc.words?
The idea is that words that are easy to classify will be added to existing or newly-created word lists, while other words will be added to misc.words.
