Often when reading a patent, the focus is not so much on what the patent says as on being sure about what it doesn’t say. If an examiner has cited a patent as prior art, and you want to argue that the claimed invention is different because it includes some feature that the cited reference doesn’t disclose, you need to satisfy yourself that the feature is not anywhere in the cited reference.

That can be a very tedious undertaking. The particular reference that finally overcame my inertia and got me to scribble up the Python script that is the subject of this post is a patent application, cited as a prior art reference in an office action in one of my cases, that is 76 pages of two-column fine print. No way does the client want to pay me for the hours it would take to read it in detail.

It’s now possible to download from Google the full text of all issued U.S. patents back to 1976 (here) and all published U.S. patent applications back to 2001 (here). Getting anything useful out of them is not a task for the faint of heart: they use a fairly complicated XML schema, which has changed several times, and it’s a lot of downloading (about 70 GB, zipped, for 2007 to the present).

To put all that data into a form convenient for searching and extracting statistics, I wrote a Python utility that reads the XML, parses it into a standard set of fields, cleans up most of the Unicode weirdness, and outputs everything into a single large text file, one field per line, with each line beginning with a four-letter identifier indicating what part of the document it is. I have posted the latest version on GitHub here, with detailed instructions and a description.
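
To give a feel for the "one field per line, four-letter identifier" idea, here is a minimal sketch. It is not the utility itself: the element names (`number`, `title`, `abstract`), the field codes (`PATN`, `TITL`, `ABST`), the single-space separator, and the sample record are all made up for illustration, and the real patent XML schemas are far more complicated.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical mapping from four-letter field codes to element paths.
# Neither the codes nor the paths come from the real schemas.
FIELD_PATHS = {
    "PATN": "./number",
    "TITL": "./title",
    "ABST": "./abstract",
}


def clean(text):
    """Normalize Unicode (NFKC) and collapse whitespace so a field fits on one line."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())


def to_field_lines(xml_string):
    """Return one 'CODE value' line for each mapped field found in the record."""
    root = ET.fromstring(xml_string)
    lines = []
    for code, path in FIELD_PATHS.items():
        for element in root.findall(path):
            # itertext() also picks up text inside nested markup such as <b>.
            value = clean("".join(element.itertext()))
            if value:
                lines.append(f"{code} {value}")
    return lines


# Made-up sample record containing a non-breaking space and inline markup.
SAMPLE = """<patent>
  <number>1234567</number>
  <title>Widget\u00a0with   improved  handle</title>
  <abstract>A widget <b>having</b> a handle.</abstract>
</patent>"""

for line in to_field_lines(SAMPLE):
    print(line)
```

With every field on its own tagged line, pulling out just the titles or just the abstracts becomes a matter of filtering on the first four characters of each line, which is what makes the flat format handy for searching and statistics.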