Often when reading a patent, the focus is not so much on what the patent says as on being sure about what it doesn’t say. If an examiner has cited a patent as prior art, and you want to argue that it’s different from the claimed invention because the claimed invention includes some feature the cited reference doesn’t disclose, you need to satisfy yourself that the feature is not in the cited reference anywhere.

That can be a very tedious undertaking. The particular reference that finally overcame my inertia, and got me to scribble up the Python script that is the subject of this post, is a patent application cited as a prior art reference in an office action in one of my cases: 76 pages of two-column fine print. No way does the client want to pay me for the hours it would take to read it in detail.

It isn’t hard to exclude a lot of it. Since the examiner cited the reference as disclosing a feature relating to a “sample”, pretty clearly any paragraph that doesn’t include the word “sample” (or a synonym) can be safely ignored. In this case, though, a lot of paragraphs did include that word.

So what I really wanted was an easy way to pull out short “blurbs”, including half a line or so before and after, for everywhere in the reference that the word “sample” appears. Obviously, I could just do a word search and go through the document that way, but that’s not as useful as having a concise list of all the blurbs. For one thing, patents tend to be repetitious; things are likely to be described over and over using the same phrases. If you have a list where they’re all lined up in short blurbs, it’s easy to pick out any blurbs that are introducing something different.

Here (below) is a Python script that will scan through a text document, pull out blurbs around a specified keyword, and line them up in a list. If there are paragraph numbers in brackets (which the PTO rules recommend, though not everyone includes them), the program will also indicate which paragraph each blurb comes from. The input is a text file; the output is another text file containing the list. The easiest place to get a text file of a patent or patent application is the USPTO’s search page (one of the few good uses for a search tool that is otherwise pathetic; usually I much prefer freepatentsonline or patentlens, but for plain text the USPTO site is better since it doesn’t format or otherwise alter the text).

The output looks like this:

[0349]	to the SPIN filter tube. The sample was again centrifuged (14000
[0349]	the SPIN filter tube and the sample was centrifuged at the same 
[0350]	0] The extracted genomic DNA samples were delivered to Beijing G

This script also illustrates a potential pitfall in text searching. You might be wondering: why not just do a regex search for the whole blurb? One reason is that a regex search in Python using re.findall() or re.finditer() won’t pick up overlapping hits, and with blurbs like this there are likely to be some. Suppose you’re searching for blurbs containing the word “foo”, with each blurb including (say) 50 characters to either side of the hit, so that you get 103-character strings with “foo” in the middle. If any passage has a second “foo” less than 50 characters after the last one, the regex won’t pick up the second one, because regex matches in Python don’t overlap: the search for the next hit resumes at the end of the last match. (Yes, in theory there is a way to do it in one regex, using a capturing group inside a zero-width lookahead. I’ll pass; it’s too complicated, it would take me too long to get it to work, and even then I wouldn’t be sure it was working right without a lot of testing. I’d rather use code that I actually understand.)
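Here’s a toy demonstration (a made-up snippet of my own, not part of the script below), using tiny 6-character context windows so the overlap is easy to see. The single-regex approach silently drops the second blurb; finding the bare keyword positions and slicing out the context by hand, which is what the script does, catches both:

import re
text = "one foo two foo three"
# one regex for the whole blurb: the next search resumes at the end
# of the previous match, so the overlapping second blurb is lost
print(re.findall(r'.{0,6}foo.{0,6}', text))
# ['one foo two f']
# find the bare keyword positions, then slice the context manually
hits = [m.start() for m in re.finditer('foo', text)]
print([text[max(h - 6, 0):h + len('foo') + 6] for h in hits])
# ['one foo two f', 'o two foo three']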

Anyway, here’s the code:

#save as blurbextractor.py
import sys, os, shutil, re
usage = """
    usage: >python blurbextractor.py txt_file out_file searchword size
    txt_file is the file to be searched for all instances of the word.
    All files are in text format. Search is case insensitive.
    out_file is the filename for the output file; it will be created
    if not present, otherwise overwritten.
    size is the desired number of characters in each blurb
"""
if len(sys.argv) != 5:
    sys.exit("args missing" + '\n' + usage)
(progpath, txtfn, outfn, searchword, size) = sys.argv[0:5]
progpath = os.path.dirname(progpath)
if not os.path.exists(txtfn):
    txtfn = os.path.join(progpath, txtfn)
outfn = os.path.join(os.path.dirname(txtfn), outfn)
outf = open(outfn, 'w')
span = int((int(size) - len(searchword))/2)

rxpara = re.compile(r'\[\d{3,5}\]')  #bracketed paragraph numbers, e.g. [0349]
rxsw = re.compile(re.escape(searchword), re.I)  #escape any regex metacharacters
txt = ''

def getpara(loc): #return paragraph number for hit location
    pno = "[0000]"
    for p in paraidx:
        if loc < p[0]: break
        pno = p[1]
    return pno + '\t'

def getblurb(loc): #return blurb of specified span at loc 
    global txt, span
    a = loc - span
    if a < 0: a = 0     b = loc + len(searchword) + span     if b > len(txt): b = len(txt)
    return txt[a:b]

with open(txtfn, 'r') as txtf:
    txt = txtf.read()
    txt = txt.replace('\r', '')   #drop carriage returns
    txt = txt.replace('\n', ' ')  #join lines with spaces
    #first locate paragraph numbers
    paraidx = [(m.start(), m.group(0)) for m in rxpara.finditer(txt)]
    hits = [m.start() for m in rxsw.finditer(txt)]
    blurbs = [getpara(hit) + getblurb(hit) for hit in hits]
    outf.write('\n'.join(blurbs))
outf.close()
print('done')
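For example, to pull 120-character blurbs around the word “sample” from an application saved as app.txt (the file names here are just placeholders):

>python blurbextractor.py app.txt blurbs.txt sample 120

The list lands in blurbs.txt, which the script creates in the same directory as the input file.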
