Lately I have been experimenting with some text mining ideas using the U.S. patent corpus, which Google has conveniently provided for free download. Each raw data file contains all the patents issued in one week, in XML format, typically on the order of 50 to 100 MB compressed (up to 500 MB uncompressed).

One of the things I was curious about was the size of the vocabulary used in patent claims. It is commonly supposed that an average educated person has a vocabulary of about 20,000 words. A large English dictionary includes on the order of 250,000. How big a vocabulary is encompassed by the words commonly used in patents?

So I made a crude count of the words in the claims in slightly more than four years of issued U.S. patents (all patents issued from 2009 through January 2013) — a total of 14,717,173 claims from 878,461 patents. I counted the number of distinct words, and the number of times each word appeared, then sorted the list by frequency of appearance.

I did not do word-stemming: words with the same stem, e.g. “run” and “running”, were treated as different words. This was just a quick and dirty experiment, and including word-stemming would have made the program considerably more complicated. I also did not remove stop words (common words like “with” and “from”), but I did exclude all words fewer than four letters long.

Patent claims often include numbers and other character strings such as genetic sequences that would not properly be considered words, and capitalization introduces additional variability. Before counting each word, I converted to lower case and removed all non-alphabetic characters, so as to eliminate these confounding factors. I did not attempt to identify and remove genetic sequences since I assume any such tokens would appear at most one or a few times, so that they would appear at the extreme bottom of the frequency-sorted output list.
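The normalization and counting described above can be sketched in Python roughly as follows. This is only an illustration, not the program used for the counts reported here. The function names, the whitespace tokenization, the restriction to the letters a-z, and the two sample claims are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: "alphabetic" means the 26 ASCII letters; anything else is dropped.
NON_ALPHA = re.compile(r"[^a-z]")
MIN_LEN = 4  # words shorter than four letters are excluded


def normalize(token):
    """Lower-case a token and strip every non-alphabetic character."""
    return NON_ALPHA.sub("", token.lower())


def count_words(claims):
    """Count normalized words across an iterable of claim strings.

    Tokens are split on whitespace (an assumption). No stemming and no
    stop-word removal is applied, matching the description in the text.
    """
    counts = Counter()
    for claim in claims:
        for token in claim.split():
            word = normalize(token)
            if len(word) >= MIN_LEN:
                counts[word] += 1
    return counts


# Made-up sample claims, for illustration only.
sample_claims = [
    "1. A method comprising a first step and a second step.",
    "2. The method of claim 1, wherein said first step is running.",
]
counts = count_words(sample_claims)

# Frequency-sorted output, most common first.
for word, n in counts.most_common():
    print(word, n)
```

Note that in this sketch a hyphen is stripped like any other non-letter, so a hyphenated compound collapses into one run of letters. That is a simplification.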

Hyphenation presents a particular challenge. It is very common in patents to coin new words by combining two existing words with a hyphen. On my first pass, I treated each hyphenated combination as a distinct word in its own right. Counted that way, there were 966,377 distinct words, out of a total of 545,588,265 words in the dataset.

A quick skim of the frequency-sorted word list revealed that a large proportion of those 966,377 words were hyphenated compound words. In fact, when I repeated the count, this time splitting all hyphenated words (e.g. “phase-shifted” becomes two words, “phase” and “shifted”), the total number of distinct words dropped by more than half, to 451,452.
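The effect of the two hyphen treatments can be illustrated with a small Python sketch. The two function names and the sample text are hypothetical, and the regular expressions treat every non-letter other than a hyphen as a separator, which is a simplification rather than a reconstruction of the original program.

```python
import re


def tokens_keep_compounds(text):
    """A hyphenated compound such as "phase-shifted" stays one token."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def tokens_split_compounds(text):
    """Hyphens act as separators, so "phase-shifted" yields two tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Made-up sample text in which a few base words recombine into many compounds.
text = (
    "a phase-shifted signal, a phase-locked loop, a time-shifted signal, "
    "a time-locked loop, and a frequency-shifted, frequency-locked signal"
)

# Apply the same minimum length of four characters in both cases.
kept = [w for w in tokens_keep_compounds(text) if len(w) >= 4]
split = [w for w in tokens_split_compounds(text) if len(w) >= 4]

print("distinct words, compounds kept: ", len(set(kept)))
print("distinct words, compounds split:", len(set(split)))
```

Because a handful of base words can combine into many different compounds, splitting on hyphens tends to shrink the distinct-word count as the text grows, which is consistent with the drop reported above.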

A large proportion of those 451,452 were chemical names like “dihydrobenzofuranyltrifluorohindolylmethylmethylpentanol.” As a somewhat crude stab at filtering these from the results, I re-ran the count, this time also excluding any words longer than 16 letters. This dropped the number of distinct words to 356,000. Of these, more than a third appeared at most twice; they are likely chemical names, coined words, or misspellings that appeared in a single patent. Excluding these leaves a patent claim vocabulary of 194,426 words, still including many chemical names and far from perfectly filtered, but probably reasonably representative, at least at the higher frequency end of the list.
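The two filters described above (a maximum word length and a minimum number of appearances) could be applied to a table of word counts along these lines. This is a sketch only: the function name and the toy counts are invented for illustration, and here both filters are applied after counting rather than during it.

```python
from collections import Counter

MAX_LEN = 16   # words longer than 16 letters are excluded
MIN_COUNT = 3  # words appearing at most twice are excluded


def filter_vocabulary(counts, max_len=MAX_LEN, min_count=MIN_COUNT):
    """Keep only words that are short enough and frequent enough."""
    return Counter(
        {
            word: n
            for word, n in counts.items()
            if len(word) <= max_len and n >= min_count
        }
    )


# Toy counts, invented for illustration; they are not real corpus figures.
toy_counts = Counter(
    {
        "wherein": 50,
        "said": 48,
        "comprising": 20,
        "widget": 2,  # too rare: appears at most twice
        "dihydrobenzofuranyltrifluorohindolylmethylmethylpentanol": 5,  # too long
    }
)

vocabulary = filter_vocabulary(toy_counts)
print(sorted(vocabulary))
```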

The most common 1,000 words accounted for 75.40 percent of the dataset. The most common 10,000 accounted for 97.44 percent. To get to 99 percent of the dataset required 20,338 distinct words. Only 16,097 words were used at a frequency exceeding 1.0 per million words. These accounted for 98.73 percent of all the words in the dataset.
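Coverage figures of this kind follow directly from the frequency-sorted counts. The Python sketch below shows one way to compute them. The function names and the toy counts are assumptions made for the example, and the functions expect a non-empty table of counts.

```python
from collections import Counter


def coverage_of_top(counts, k):
    """Percentage of all word occurrences covered by the k most common words."""
    total = sum(counts.values())
    top = sum(n for _, n in counts.most_common(k))
    return 100.0 * top / total


def words_needed(counts, target_percent):
    """Smallest number of most-common words whose coverage reaches the target."""
    total = sum(counts.values())
    running = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if 100.0 * running / total >= target_percent:
            return rank
    return len(counts)


def words_above_rate(counts, per_million=1.0):
    """Number of words used more often than `per_million` times per million words."""
    total = sum(counts.values())
    return sum(1 for n in counts.values() if n * 1_000_000 / total > per_million)


# Toy counts, invented for illustration; they are not real corpus figures.
toy_counts = Counter(
    {"wherein": 50, "said": 30, "claim": 15, "method": 4, "widget": 1}
)

print(coverage_of_top(toy_counts, 2))
print(words_needed(toy_counts, 99))
print(words_above_rate(toy_counts, per_million=20_000))
```

With the real counts, the same calls with k set to 1,000 or 10,000, a target of 99 percent, and a rate of 1.0 per million would correspond to the figures quoted above.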

The two most common words were “wherein”, making up 2.64 percent of the words in the dataset, and “said”, making up 2.51 percent. The ten most common words (“wherein”, “said”, “claim”, “first”, “second”, “from”, “method”, “comprising”, “with”, and “least”) accounted for 15.68 percent of the words in the dataset.

Here is a list of the 1,000 most common words (tab delimited with counts and percentages).
