It’s now possible to download from Google the full text of all issued U.S. patents back to 1976 (here) and all published U.S. patent applications back to 2001 (here). Getting anything useful out of them is not a task for the faint of heart: they use a fairly complicated XML schema, which has changed several times, and it’s a lot of downloading (about 70 GB, zipped, for 2007 to the present).

To put all that data into a form convenient for searching and extracting statistics, I wrote a Python utility that reads the XML, parses it into a standard set of fields, cleans up most of the Unicode oddities, and writes everything to a single large text file. The file has one field per line, and each line begins with a four-letter identifier indicating which part of the document it is. I have posted the latest version on GitHub here, with detailed instructions and a description.
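As a rough illustration of how a file in that layout could be consumed, here is a minimal Python sketch. The tags `PATN`, `TITL`, and `INVT`, the single space after each tag, and the idea that every record starts with a `PATN` line are all assumptions made for this example, not necessarily what the utility emits; the GitHub description has the actual field codes.

```python
from collections import Counter


def parse_line(line):
    """Split one line into (four-letter tag, value)."""
    line = line.rstrip("\n")
    # Assumption: the tag is exactly the first four characters,
    # and the value follows after optional whitespace.
    return line[:4], line[4:].lstrip()


def iter_records(lines, start_tag="PATN"):
    """Group lines into records.

    Assumption: each document begins with a line tagged `start_tag`
    (a hypothetical tag name chosen for this sketch).
    """
    record = None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        tag, value = parse_line(line)
        if tag == start_tag:
            if record is not None:
                yield record
            record = []
        if record is not None:
            record.append((tag, value))
    if record is not None:
        yield record


# Made-up sample data in the assumed layout.
sample = [
    "PATN 1234567\n",
    "TITL Widget with improved handle\n",
    "INVT Smith; John\n",
    "PATN 7654321\n",
    "TITL Method for sorting widgets\n",
]

records = list(iter_records(sample))
# Count how often each tag appears across all records.
tag_counts = Counter(tag for rec in records for tag, _ in rec)
```

Because `iter_records` accepts any iterable of lines, an open file handle can be passed in place of `sample`, so the whole file never has to be held in memory at once.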