Lately I have been experimenting with some text mining ideas using the U.S. patent corpus, which Google has conveniently provided for free download. Each raw data file contains all the patents issued in one week, in an XML format, typically on the order of 50 to 100 MB compressed (up to 500 MB when uncompressed).

One of the things I was curious about was the size of the vocabulary used in patent claims. It is commonly supposed that an average educated person has a vocabulary of about 20,000 words. A large English dictionary includes on the order of 250,000. How big a vocabulary is encompassed by the words commonly used in patents?

So I made a crude count of the words in the claims in slightly more than four years of issued U.S. patents (all patents issued from 2009 through January 2013), a total of 14,717,173 claims from 878,461 patents. I counted the number of distinct words and the number of times each word appeared, then sorted the list by frequency of appearance.
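A crude count of this kind can be sketched in a few lines of Python. This is only an illustration, not the script actually used: the tokenization rule (lowercased runs of ASCII letters), the function name `count_words`, and the two sample claims are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters, compared case-insensitively.
# Digits, punctuation and hyphens act as separators.
WORD_RE = re.compile(r"[a-z]+")


def count_words(claims):
    """Return a Counter mapping each distinct word to its number of appearances."""
    counts = Counter()
    for claim in claims:
        counts.update(WORD_RE.findall(claim.lower()))
    return counts


# Hypothetical sample claims, standing in for text extracted from the patent files.
sample_claims = [
    "A method comprising a processor and a memory.",
    "The method of claim 1, wherein the processor is coupled to the memory.",
]

counts = count_words(sample_claims)
vocabulary_size = len(counts)        # number of distinct words
by_frequency = counts.most_common()  # (word, count) pairs, most frequent first

print(vocabulary_size)
for word, n in by_frequency[:5]:
    print(word, n)
```

A different definition of "word" (keeping hyphenated terms together, say, or stemming plurals) would change the vocabulary size, so any figure produced this way depends on the tokenization choice.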