Get your job done faster with pyPdf

A friend of mine was writing a mathematical paper, and she split the work into multiple chapters. For maintenance purposes, I think, each chapter lived in a separate file. When she was done, she wanted to merge all the chapters into one big PDF file. Because she didn’t have many chapters, she easily did this by hand.

But I started to think: what would you do if you had dozens of files? Hundreds? Thousands? You could hire some chimps to do it … or you could write a script. Since I hadn’t written any Python code in a while, I decided to go with Python. A quick Google search revealed pyPdf.

The great thing was that they had an example on their website that was very easy to understand. The only thing that looked a bit weird to me (but I’m no Python expert, so what do I know) was that they used the file constructor directly in their example. I thought you were supposed to use open. Never mind.

For this article’s sake, let’s assume we have 100 chapters in one of our directories, from chap1.pdf to chap100.pdf. The first thing you’d want to do is get them into a list. You can achieve this by using os.listdir:


files = os.listdir("/home/me/math_work")

The thing is, os.listdir returns only the bare file names, not paths to the files. To fix that, we join each name with the directory using os.path.join:


files = map(lambda file: os.path.join("/home/me/math_work",file),files)

and that will get you full paths (absolute ones here, since the directory we join with is absolute). An alternative is this list comprehension:


files = [os.path.join("/home/me/math_work",file) for file in os.listdir("/home/me/math_work")]

Use whichever seems clearer to you. The next thing we need to do is sort the file list. You wouldn’t want chapter 10 to land right after chapter 1, would you? Python’s lists have a sort method that we’ll make use of. If we just sort them normally, using sort:


files.sort()

the files list would contain the files in this order: chap1.pdf, chap10.pdf, chap100.pdf, chap11.pdf … chap2.pdf, chap20.pdf, because strings are compared character by character. This is not what we want. To get around this, we can pass a custom comparison function to sort. Here’s the one we will use:


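def chapsort(first,second):
    # compare two paths by chapter number (needs: import os, re)
    b1 = os.path.basename(first)
    b2 = os.path.basename(second)
    # drop the ".pdf" extension
    b1 = os.path.splitext(b1)[0]
    b2 = os.path.splitext(b2)[0]
    # raw string, so the backslash reaches the regex engine untouched
    reg = r"chap(\d+)"
    m1 = re.search(reg,b1,re.I)
    m2 = re.search(reg,b2,re.I)
    group_1 = int(m1.group(1))
    group_2 = int(m2.group(1))
    # cmp returns a negative, zero or positive number (Python 2)
    return cmp(group_1,group_2)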

This function strips the extension from each file name and compares the files by chapter number. This ensures that chapter 10 comes after chapter 9, and not after chapter 1. To use our custom sort, we call sort like this:


files.sort(chapsort)
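
A side note for anyone on a newer Python than mine: passing a comparison function to sort, and cmp itself, only exist in Python 2. As far as I know, the usual replacement is a key function. Here is a minimal sketch of the same ordering done that way. The name chapter_number is something I made up for illustration, and it assumes every file name contains chap followed by digits:

```python
import os
import re

def chapter_number(path):
    # Hypothetical helper: pull the number out of names like "chap12.pdf".
    # Assumes the base name contains "chap" followed by digits.
    base = os.path.splitext(os.path.basename(path))[0]
    match = re.search(r"chap(\d+)", base, re.I)
    return int(match.group(1))

names = ["chap10.pdf", "chap2.pdf", "chap1.pdf", "chap100.pdf", "chap11.pdf"]

# Plain string sort: "chap10.pdf" lands before "chap2.pdf".
print(sorted(names))

# Numeric sort on the chapter number.
print(sorted(names, key=chapter_number))
```

With that helper in place, files.sort(key=chapter_number) should do the same job as files.sort(chapsort).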

The PDF part is adapted from the example provided on the pyPdf website. It’s really easy to understand:


import os,re,sys
from pyPdf import PdfFileWriter, PdfFileReader

def chapsort(first,second):
    b1 = os.path.basename(first)
    b2 = os.path.basename(second)
    b1 = os.path.splitext(b1)[0]
    b2 = os.path.splitext(b2)[0]
    reg = r"chap(\d+)"
    m1 = re.search(reg,b1,re.I)
    m2 = re.search(reg,b2,re.I)
    group_1 = int(m1.group(1))
    group_2 = int(m2.group(1))
    return cmp(group_1,group_2)


print "starting"

files = [os.path.join(sys.argv[1],file) for file in os.listdir(sys.argv[1])]
output = PdfFileWriter()
files.sort(chapsort)
for f in files:
    print "appending %s" % f
    reader = PdfFileReader(open(f,"rb"))
    for page in range(reader.getNumPages()):
        print "\tadding page %d" % page
        output.addPage(reader.getPage(page))
outputFile = sys.argv[2]
outStream = open(outputFile,"wb")
output.write(outStream)
outStream.close()

This will concatenate all the chapters in the folder given as the first argument into the file given as the second argument. Note that on a UNIX platform, the binary mode flag is not necessary. But if you want your script to work under Windows too, keep it.
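
One thing the script does not handle: os.listdir returns everything in the folder, so a stray notes.txt would make re.search return None and chapsort would blow up. Here is a minimal sketch of a more defensive listing, assuming the chapN.pdf naming used in this article. list_chapters and CHAPTER_RE are hypothetical names of mine, not part of pyPdf:

```python
import os
import re

# Assumed naming scheme from this article: chap<number>.pdf, case-insensitive.
CHAPTER_RE = re.compile(r"^chap(\d+)\.pdf$", re.I)

def list_chapters(directory):
    """Return full paths of the chapter PDFs in directory, in chapter order."""
    numbered = []
    for name in os.listdir(directory):
        match = CHAPTER_RE.match(name)
        if match:  # skip anything that is not a chapter PDF
            numbered.append((int(match.group(1)), os.path.join(directory, name)))
    numbered.sort()  # tuples sort by chapter number first
    return [path for _, path in numbered]
```

The list it returns could be fed straight into the for f in files loop, with no separate sort call needed.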

Have fun with pyPdf!

ssssssssssscripting !!!