Making Text from the Facebook Papers More Accessible

tl;dr

I’ve been working on extracting text from the released pdfs of the Facebook Papers. The cleaned pdfs, the extracted text, and the code used to clean the text are all available on GitHub.

Original pdf on the left; processed pdf on the right

The script requires Python 3.6 or higher, and has only been tested on Linux. Enjoy!

The Details

Like many of us, I’ve been following the reporting on internal Facebook documents: how they confirm and reinforce details that have been clear about Facebook for years, and how they illustrate exactly how much Facebook knew about the problems it created and how little it did to solve them.

Also like many of us, I’ve been dying to see the original docs, so when the team at Gizmodo started releasing the docs I was pretty darn excited.

Pretty.

Darn.

Excited.

Seriously, the team at Gizmodo (Shoshana Wodinsky, Dell Cameron, Andrew Couts) have been doing stellar work reporting on these docs, and getting the core docs released publicly.

Due to the provenance of these documents, the “pdfs” released were actually worse than your normal PDF, and that’s saying something, because on the best of days PDFs are where information goes to die. Each file appears to be a collection of images taken of a computer screen and stitched together into a single pdf.

But the information in these pdfs is incredibly valuable, and we are lucky to have it.

Fortunately, from an old side project, I had some dirty, ugly, functional code lying around that cleaned up PDFs. I grabbed some of the early docs released by Gizmodo, did a test run, and lo and behold, it worked. It was ugly, but it worked.

Last night, I reworked my original (dirty, ugly) script into something cleaner that generates better output. I generally don’t write code, except when I need to, so about the only thing I will ever say about code I write is that it solves a clearly defined problem for me at a point in time, which is a far cry from actually writing code that is good. In this improved version, I had some invaluable help from Smart People Who Know Things (~~I have asked permission to credit them here; I’ll update this post if/when I receive their consent~~).

The resulting code is now up on GitHub, along with the text files and the cleaned pdfs. I’m keeping my fingers crossed that I don’t bump into any repository size restrictions on GitHub anytime soon.

And: if there are any improvements you’d like to make or questions you have, let me know.