
Graphing Hillary Clinton’s E-mails in Neo4j

Technologies: Neo4j, OpenRefine, Prismatic Topics API, Python, Py2neo


Bernie is sick and tired of hearing about Hillary’s e-mails and so am I.  So why am I writing about them?  Well, they could provide interesting insight into how our government works (or doesn’t work), if only they were in a better format than PDFs!  With senders, recipients, and messages all linked to one another, they’re a perfect fit for a graph!

I started off by downloading the CSV files created by Ben Hamner.  The sender and recipient fields in that dataset aren’t well normalized, so I used OpenRefine’s faceting feature to clean them up and created emails-refined.csv.

I then imported the refined data into Neo4j.
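The import code itself isn’t shown here, but a minimal sketch of one way to do it with Py2neo might look like this.  Everything about the schema is an assumption for illustration: the CSV column names (id, from, to, subject, body), the SENT and TO relationship types, and the connection URL and credentials are all hypothetical.  It also assumes a Py2neo version that provides Graph.run and a Neo4j version that accepts $param syntax.

```python
import csv

# Hypothetical schema: (:Person)-[:SENT]->(:Email)-[:TO]->(:Person)
# MERGE keeps the import idempotent, so re-running it does not duplicate nodes.
IMPORT_QUERY = """
MERGE (e:Email {id: $id})
SET e.subject = $subject, e.body = $body
MERGE (s:Person {name: $sender})
MERGE (r:Person {name: $recipient})
MERGE (s)-[:SENT]->(e)
MERGE (e)-[:TO]->(r)
"""


def rows_to_params(csv_lines):
    """Yield one parameter dict per CSV row, skipping rows with no sender or recipient.

    The column names used here are assumptions, not the real file layout.
    """
    for row in csv.DictReader(csv_lines):
        sender = (row.get("from") or "").strip()
        recipient = (row.get("to") or "").strip()
        if not sender or not recipient:
            continue
        yield {
            "id": row["id"],
            "subject": row.get("subject") or "",
            "body": row.get("body") or "",
            "sender": sender,
            "recipient": recipient,
        }


def import_emails(graph, csv_lines):
    """Run the import query once per usable row and return how many rows were sent."""
    count = 0
    for params in rows_to_params(csv_lines):
        graph.run(IMPORT_QUERY, params)
        count += 1
    return count


def main():
    # Needs py2neo installed and a running Neo4j instance.
    # The URL and credentials below are placeholders.
    from py2neo import Graph

    graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))
    with open("emails-refined.csv", newline="", encoding="utf-8") as f:
        print(import_emails(graph, f), "e-mails imported")
```

Calling main() runs the import against a live database.  For a dataset this small, one query per row is fine; a larger import would be better served by batching or by Cypher’s LOAD CSV.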

With the data in Neo4j, I got to explore the Person nodes Hillary sent the most Email nodes to.

[Image: hillary_emails_to]
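A query along these lines could produce that kind of ranking.  This is only a sketch: the SENT and TO relationship types, the name property, and the exact spelling of the sender’s name in the database are assumptions, and graph is expected to be a Py2neo Graph object (or anything with a compatible run method).

```python
# Hypothetical schema: (:Person)-[:SENT]->(:Email)-[:TO]->(:Person)
TOP_RECIPIENTS_QUERY = """
MATCH (:Person {name: $sender})-[:SENT]->(e:Email)-[:TO]->(p:Person)
RETURN p.name AS recipient, count(e) AS emails
ORDER BY emails DESC
LIMIT $limit
"""


def top_recipients(graph, sender="Hillary Clinton", limit=10):
    """Return (recipient, email_count) pairs, most frequent recipient first."""
    cursor = graph.run(TOP_RECIPIENTS_QUERY, {"sender": sender, "limit": limit})
    return [(record["recipient"], record["emails"]) for record in cursor]
```

The same Cypher text can be pasted into the Neo4j Browser with the parameters filled in by hand, which gives the graph view instead of a list.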

Knowing the e-mails and their senders and receivers is interesting, but I wanted to see what the e-mails are about!  While the subject lines are included with the e-mails, they’re often opaque, like the meaningful subject “HEY” used in an e-mail from Jake Sullivan to Hillary Clinton.  Natural language processing to the rescue!

I built a small Python script that uses Py2neo to query all e-mails without attached topics.  It goes through each e-mail and sends the raw body text and subject to the Prismatic Topics API.  The API returns a set of topics, which the script uses to create REFERENCES relationships between the e-mails and topics.  The code is based on the excellent post on the topic by Mark Needham.
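The overall shape of such a script could be sketched as follows.  This is not the original code: the Topic label, the id, subject, body, and name properties are assumptions, and the call to the topics API is left as a hypothetical fetch_topics callable (taking a subject and a body, returning a list of topic names) rather than guessing at the API’s endpoint or response format.

```python
# Find e-mails that have no REFERENCES relationship to any topic yet.
# Property names (id, subject, body) are assumptions about the schema.
UNTAGGED_QUERY = """
MATCH (e:Email)
WHERE NOT (e)-[:REFERENCES]->(:Topic)
RETURN e.id AS id, e.subject AS subject, e.body AS body
"""

# Attach one topic to one e-mail, creating the Topic node if needed.
LINK_QUERY = """
MATCH (e:Email {id: $id})
MERGE (t:Topic {name: $topic})
MERGE (e)-[:REFERENCES]->(t)
"""


def tag_untagged_emails(graph, fetch_topics):
    """Link every untagged e-mail to the topics returned by fetch_topics.

    fetch_topics is a hypothetical callable standing in for the topics API.
    Returns the number of REFERENCES links requested.
    """
    # Read all rows up front so writes do not happen while the cursor is open.
    emails = [(r["id"], r["subject"], r["body"]) for r in graph.run(UNTAGGED_QUERY)]
    linked = 0
    for email_id, subject, body in emails:
        for topic in fetch_topics(subject or "", body or ""):
            graph.run(LINK_QUERY, {"id": email_id, "topic": topic})
            linked += 1
    return linked
```

Because the first query only returns e-mails without topics, the script can be stopped and restarted without redoing work, which is handy when an external API is rate-limited.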

Now I can explore e-mails by topic, like the graph below showing e-mails related to David Cameron.  When I double-click the e-mail with subject ‘GUARDIAN’ in the Neo4j Browser, I can see all the other topics that e-mail references, including Sinn Féin, Northern Ireland, Ireland, and Peace.

[Image: david_cameron]
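The same exploration can be done from Python instead of the Browser.  As a rough sketch, assuming Topic nodes carry a name property and e-mails a subject property (both assumptions), this lists each e-mail that references a topic along with the other topics it references:

```python
# For a given topic, find its e-mails and the other topics each one references.
# OPTIONAL MATCH keeps e-mails that reference only the starting topic.
EMAILS_FOR_TOPIC_QUERY = """
MATCH (t:Topic {name: $topic})<-[:REFERENCES]-(e:Email)
OPTIONAL MATCH (e)-[:REFERENCES]->(other:Topic)
WHERE other <> t
RETURN e.subject AS subject, collect(other.name) AS other_topics
ORDER BY subject
"""


def emails_for_topic(graph, topic):
    """Return (subject, other_topics) pairs for e-mails that reference the topic."""
    cursor = graph.run(EMAILS_FOR_TOPIC_QUERY, {"topic": topic})
    return [(record["subject"], list(record["other_topics"])) for record in cursor]
```

Calling emails_for_topic(graph, "David Cameron") with a connected Py2neo Graph would return one entry per matching e-mail, assuming the topic is stored under that exact name.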

With this additional topic information, I can start to understand more context around Hillary’s e-mails.

What fun things can you find in her e-mails?

I’ve opened up the Neo4j instance with this data for the world to explore.  Check it out at http://ec2-54-209-65-47.compute-1.amazonaws.com:7474/browser/.  The dataset is open to the public, but I’ve marked it as read-only.  Mention me on Twitter with @ryguyrg if you discover any interesting nuggets of knowledge in Hillary’s e-mails!

 
