Other words too, e.g. "from".

My first thought was that the creator used a search library that filters common words by default, but the search code is all in the page and doesn't do that.

My second thought was that the 10k word corpus doesn't include those most common words. But it does.

Then I realized that the creator filtered them out. The page does say "7931 words", and the title here on HN says "10k* most common". The original corpus has exactly 10,000 words.

https://github.com/first20hours/google-10000-english/blob/d0...

The first 21 include all four we've mentioned:

the, of, and, to, a, in, for, is, on, that, by, this, with, i, you, it, not, or, be, are, from

The reason for this (in hindsight, I should probably have added a note to the site) is that WordNet doesn't include definitions for these words in its corpus. This is why the count is less than 10,000: anything that WordNet doesn't have a definition for isn't included. I left a nod to this in the asterisk, but I realise now I didn't explain it anywhere.
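
For anyone curious, the filtering step amounts to something like the sketch below. This is not the site's actual code: `filter_defined`, `frequency_list`, and `toy_definitions` are hypothetical names, and the small dictionary is stand-in data rather than real WordNet entries. With NLTK, the lookup could presumably be replaced by checking whether `wordnet.synsets(word)` returns a non-empty list.

```python
def filter_defined(words, definitions):
    """Keep only the words that have at least one definition, preserving order."""
    return [w for w in words if definitions.get(w)]


# Toy stand-in data (hypothetical, not real WordNet entries).
frequency_list = ["the", "of", "and", "time", "from", "people", "new"]
toy_definitions = {
    "time": ["an instance or single occasion for some event"],
    "people": ["any group of human beings"],
    "new": ["not of long duration"],
}

kept = filter_defined(frequency_list, toy_definitions)
dropped = [w for w in frequency_list if w not in kept]

print(kept)     # words that would appear on the site
print(dropped)  # words silently removed because no definition exists
```

Run over the full 10,000-word list, the same kind of filter would account for the gap between 10,000 and the 7931 shown on the page.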

From the old Princeton WordNet FAQ page (https://wordnet.princeton.edu/frequently-asked-questions):

> WordNet only contains "open-class words": nouns, verbs, adjectives, and adverbs. Thus, excluded words include determiners, prepositions, pronouns, conjunctions, and particles.

I suppose I could have included them as source nodes (only outgoing), but I think they would have ended up connecting to a whole bunch of definitions without adding much of interest.
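
A quick way to see the effect is to count how many definitions each function word appears in. The sketch below assumes an edge runs from a word to every headword whose definition mentions it, which is only one reading of "source node"; `out_degree` and the toy definitions are hypothetical and not taken from the site or from WordNet.

```python
import re
from collections import Counter


def out_degree(definitions, candidates):
    """Count, for each candidate word, how many definitions mention it."""
    counts = Counter()
    for headword, text in definitions.items():
        # Use a set so a word repeated within one definition counts once.
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        for word in candidates:
            if word in tokens:
                counts[word] += 1
    return counts


# Toy stand-in definitions (hypothetical data).
toy_definitions = {
    "time": "an instance or single occasion for some event",
    "people": "any group of human beings",
    "home": "the place where you live",
    "page": "one side of a sheet of paper in a book",
    "map": "a diagram of the surface of an area",
}

degrees = out_degree(toy_definitions, ["the", "of", "from"])
print(degrees)
```

Even in five short definitions, "of" already touches three of them; across thousands of definitions such nodes would presumably connect to a large share of the graph.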