It’s now possible to download from Google the full text of all issued U.S. patents back to 1976 (here) and all published U.S. patent applications back to 2001 (here). Getting anything useful out of them is not a task for the faint of heart — they use a fairly complicated XML schema, which has changed several times, and it’s a lot of downloading (about 70G, zipped, for 2007 to the present).

To put all that data into a form convenient for searching and extracting statistics, I wrote a Python utility that reads the XML, parses it into a standard set of fields, cleans up most of the Unicode weirdness, and outputs everything into a single large text file, one field per line, each line beginning with a four-letter identifier indicating what part of the document it is. I have posted the latest version on github here, with detailed instructions/description. (Yes, I know there are Python libraries for parsing XML, and I’ve used them; and yes, I agree that it’s hazardous to write XML parsers from scratch. I probably should have done it that way, but when I started this I was only going to parse out a few fields and it wasn’t worth the complication. I also know that it’s possible to search the XML directly, but that makes writing search scripts a lot more tedious and complicated, and it makes the file roughly twice as big. For what I wanted, a simple line-delimited flat file with line labels is much easier.)
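To give a feel for how easy that format is to work with, here’s a minimal sketch of reading it. (The four-letter field codes in the example, “TITL” and “ABST”, are hypothetical placeholders; substitute whatever labels the utility actually emits.)

```python
def read_fields(path, wanted):
    """Yield (code, text) pairs for lines whose four-letter label is in `wanted`.

    Assumes the format described above: each line starts with a
    four-letter identifier, followed by the field text.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            code, text = line[:4], line[4:].strip()
            if code in wanted:
                yield code, text
```

So, for example, `read_fields("patents.txt", {"TITL"})` would pull out just the titles.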

Also in the github repository is a sample search script.

You might reasonably wonder, why would anyone bother to do this? Surely there are plenty of patent search databases available that are far better than anything I would cobble together in a few hours of misguided code-monkeying?

It’s true, if your only goal is routine patent searching, you probably wouldn’t do this, even just for fun. The U.S. Patent Office’s public-facing search utility doesn’t even rise to the level of ‘pathetic’ (I’ll save that rant for another time), but fortunately there are some quite good free search tools (Freepatentsonline, Patentlens), some very good reasonably inexpensive ones (my personal favorite is Acclaim), and of course there’s Google. Added to which, my database only goes back ~8 years, and if you’re doing serious patent searching, you need the whole data set.

However, my main purpose for doing this wasn’t to create my own search tool, it was to do some text mining and tinkering with semantic analysis. I’ve already posted some results about the size of the vocabulary used in patent claims (here), and I’ll be posting some other interesting style and word usage statistics in due course.

Using the database for searching was something of an afterthought, in response to what I have long perceived as some specific shortcomings of the standard search model.

The first problem with nearly all patent search tools — also an issue with non-patent search tools like Google and Bing — is that there are two fundamentally different kinds of search goals, and the standard search model is adapted for only one of them.

Goal 1: return at least one document that answers the question implicit in search query X

Goal 2: return every document that meets the specific criteria of search query X

Goal 1 is the ‘normal’ goal, the one that most search engines are optimized for. If you just want to know what new features will be in Windows 10, you don’t need to find every document that has the answer, you just need one. And if you’re looking for a specific document that you know exists — the Amazon “1 click” patent, say — you don’t care what else is out there, you just want the search engine to find that document. Standard search engines like Google and Bing have become amazingly good at guessing what we want and finding it.

But for some problems — some kinds of patent searches are in this category — it isn’t enough to find a representative sampling of the relevant documents. You need to be sure that you’ve seen every relevant document.

Part of the price we pay for having search algorithms that cleverly figure out which documents are most relevant is that the search engine is now a black box — you don’t know exactly what it’s doing or how it’s doing it. So when you do a search on a boolean combination of words, instead of getting a list of exactly every document that matches the query, you get a much larger list that includes other documents the search algorithm thinks are ‘close’. Hopefully, all the documents that match your query exactly are in there somewhere, but you have no way to distinguish them from all the others, short of reading them all. You have no fine-grained control.

The second shortcoming of the standard search tools is that once I’ve found a list of documents that match a query, I want to be able to zero in on the exact places inside each document where the match occurs. For example, I want to be able to easily generate a report that contains only the paragraphs with ‘hits’. This is not rocket science, and some of the (very) expensive search tools like Westlaw have that kind of functionality. But the readily available search tools — even the otherwise excellent Acclaim patent search tool — don’t provide for searching within the returned text and pulling out the relevant bits.
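With the flat file, that kind of hit report is a few lines of Python. A minimal sketch (the “PATN” record-separator label is an assumption; substitute whatever label the utility uses to mark the start of each document):

```python
import re

def hit_report(path, pattern, doc_code="PATN"):
    """Print only the lines (paragraphs) where `pattern` matches,
    tagged with the current document's identifier.

    `doc_code` is a hypothetical label marking the start of each
    document record; its field text is taken to be the document number.
    """
    rx = re.compile(pattern)
    current = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            code, text = line[:4], line[4:].strip()
            if code == doc_code:
                current = text  # remember which document we're in
            elif rx.search(text):
                print(f"{current}\t{code}\t{text}")
```

The output is exactly the paragraphs with hits, one per line, prefixed by the patent number and field label — easy to skim or to pipe into further processing.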

One other thing that search engines don’t handle is searches that aren’t readily expressed as simple word searches. This actually comes up in patent searching more often than you might think. Here’s an example: recently I wanted to see patents where there are multiple peptide sequences in a single claim. (The patent office has become very draconian about making applicants split biological sequence claims into separate divisional applications, and I wanted to get a better understanding of when and in what circumstances they were allowing claims with more than one sequence.) It’s easy to search for claims that contain sequences, because they have to reference the ID number of the sequence in the filed sequence listing using the exact words “SEQ ID NO”. But with a standard search query, there is no way to search for claims that contain “SEQ ID NO” five times (say) in the same claim.
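With the full text in hand, though, that search is trivial. A sketch, assuming (hypothetically) that each claim occupies one line under a “CLM ” label:

```python
def claims_with_n_sequences(path, n=5, claim_code="CLM "):
    """Yield claim texts containing at least `n` occurrences of 'SEQ ID NO'.

    Assumes one claim per line, labeled with the (hypothetical)
    four-letter code in `claim_code`.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line[:4] == claim_code and line.count("SEQ ID NO") >= n:
                yield line[4:].strip()
```

Any criterion you can express as a function of the claim text — a count, a regular expression, a length threshold — drops into that loop the same way.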

All of the foregoing problems are easily solved if you have all the full text. Then you can search for anything you want — just write a simple script with whatever criteria you want. It isn’t going to be fast — a brute force search through the whole 70G might take half an hour or so, but to me, it’s often worth the wait just for the convenience of being able to output exactly the parts I want.
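The “simple script” is really just a predicate applied line by line, with no relevance ranking and no fuzziness. For instance, an exact boolean criterion (the query terms here are made up for illustration):

```python
def matches(text):
    """Exact boolean test: ("polymerase" AND "thermostable") NOT "ribozyme".
    (A made-up example query; every document either satisfies it or doesn't.)"""
    t = text.lower()
    return "polymerase" in t and "thermostable" in t and "ribozyme" not in t

def search(path):
    """Yield every line of the flat file that satisfies `matches` exactly."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if matches(line):
                yield line.rstrip("\n")
```

Slow, yes — but the result set is exactly the set of lines that satisfy the predicate, nothing more and nothing less.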



