Skip to content

Category: Uncategorized

Lakehouse isn’t an architecture; it’s a way of life

Recently a tweet of mine was revealed to have been included in a Twilio board of directors presentation from 2011. The tweet was about the simplicity of the developer experience for both Twilio and Google App Engine. What’s this have to do with Lakehouses? Everything.

Apparently screen caps in 2011 weren’t high res

My entire career has been about enabling simplified experiences for technologists so they can focus on what matters — the key differentiators of their business or application.

Google App Engine, though released before its time, made it a lot easier to launch and maintain production-ready web applications.
Google Apps Marketplace made it easier to market B2B apps to a large audience.
Neo4j and the property graph data model makes it easier to understand and query the relationships between data.

The same is true with Databricks and the Lakehouse architecture.

Nearly all large enterprises today have two-tier data architectures — combining a data lake and a data warehouse to power their data science, machine learning, data analytics, and business intelligence (BI).

Data Lake, storing all the fresh enterprise data, populated directly from key business applications and sources. By using popular open data formats, the data lake is great for compatibility with popular distributed computing frameworks and data science tools. However, a traditional data lake is typically slow to query plus lacks schema validation, transactions, and other features needed to ensure data integrity.
Data Warehouse, with a subset of the enterprise data ETLd from the data lake, stores mission critical data “cleanly” in proprietary data formats. It’s fast to query this subset, but the underlying data is often stale due to the complex ETL processes used to move the data from the business apps -> lake -> warehouse. The proprietary data formats also make it difficult to work with the data in other systems and keep users locked in to their warehouse engine.

Simplicity is king

Why have a two-tier data architecture when a single tier will satisfy the performance and data integrity requirements, improve data freshness and reduce cost?

It simply wasn’t possible before the advent of data technologies like Delta Lake, enabling highly-performant access to data stored in open data formats (like Parquet) with the data integrity constraints and ACID transactions only previously possible in data warehouses.

bestOf(DataLake) + bestOf(DataWarehouse) => (DataLakehouse)

Reduce your (mental, financial, ops) overhead

I’d encourage y’all to invest in simplicity and reduce the complexity of your data architecture. The first step is reading up on new technologies like Delta Lake. There is a great VLDB paper on the technology as well as a Getting Started with Delta Lake tech talk series by some of my esteemed colleagues.

DevRel @ Databricks: Year 1

I joined Databricks in October 2019. My first day was also the first day of the Spark + AI Summit in Amsterdam – a heck of an exciting introduction to a new team, new company and new community.

Why Databricks? I was very happy at Neo4j and certainly loved working with the Neo4j community. Databricks brought with it an exciting opportunity though – build a talented team at one of the fastest growing cloud data startups in history. I also have the opportunity to work directly with the founders and senior leadership at the company who understand the value of developer relations and the importance of building a great community of data scientists, data engineers and data analysts.

Our Team

Databricks DevRel Team

The team is now 6 located in San Francisco CA, Seattle WA, Boulder CO, Sante Fe NM, and Blacksburg VA. We have Developer Advocates and Program Managers all working together to grow awareness and adoption of Databricks and the open source projects which we support.

First-Year Team Accomplishments

Here’s some of our accomplishments from the first-year, working along with amazing collaborators across the company and community.

Keep in mind: a lot of work our team has done in 2020 hasn’t yet been released – stay tuned! And, of course, this is all within the context of the broader accomplishments of the company and community in 2020.

Looking Forward

We have a really excited 2021 planned as the team continues many of the initiatives above and takes on new challenges. We’ll be focused on making it easier to learn data science, data engineering and data analytics, as well as making it simple to apply these learnings using Databricks. An important part of this mission will be growing and strengthening the community so we can all learn from each other.

Are you a data geek and want to join the adventure? We have data engineering/analytics advocate roles and developer (online) experience advocate roles open in the US as well as a regional advocate role in Europe. Reach out to me at (firstname).(lastname) if you want to learn more!

Moving RDBMS data into a Graph Database

One of the most common questions we get at Neo4j is how to move from a SQL database to a Graph Database like Neo4j. The previous solution for accomplishing this was to export the SQL tables into CSV files and then importing the CSV files with neo4j-import or LOAD CSV. There’s a much better way: JDBC!

Neo4j JDBC Support

There are two distinct ways you can use JDBC within Neo4j:

  1. Access Neo4j Data via JDBC. Do you have existing code that accesses your SQL database using JDBC, and you want to move that code to access Neo4j instead? Neo4j has a JDBC Driver. Just update your code to use the awesome power of the Cypher query language instead of SQL, and switch over the JDBC driver you’re using, and you’re off to the races!
  2. Import SQL Databases into Neo4j. Do you have data in your SQL database that you want to move into a Graph? The APOC library for Neo4j has a set of procedures in apoc.load.jdbc to make this simple. This blog post will cover this use case.

Loading Sample Northwind SQL tables into MySQL

In order to run the code snippets in the following sections, you’ll need to have the Northwind SQL tables in a MySQL database accessible from your Neo4j server. I’ve published a GitHub Gist of the SQL script which you can execute in MySQL Workbench or using the command-line client.

In order to run this, I created a blank MySQL database in Docker:

Loading data from RDBMS into Neo4j using JDBC

With the APOC JDBC support, you can load data from any type of database which supports JDBC. In this post, we’ll talk about moving data from a MySQL database to Neo4j, but you can apply this concept to any other type of database: PostgreSQL, Oracle, Hive, etc. You can use it for other NoSQL databases too, but APOC has direct support for MongoDB, Couchbase and more.

1. Install APOC and JDBC Driver into Neo4j plugins directory

Note: This step is not necessary if you’re using the Neo4j Sandbox and MySQL or PostgreSQL. Each Sandbox comes with APOC and the JDBC drivers for these database systems.

All JAR files placed in the Neo4j plugins directory are made available for use by Neo4j. We need to copy the APOC library and JDBC drivers into this directory.

First, download APOC. Be sure to grab the download that is for your version of Neo4j.

Next, download the JDBC driver. Then, copy the file into your plugins directory:

Finally, restart Neo4j on your system.

2. Register the JDBC Driver with APOC

Open up the Neo4j Browser web interface:

In the Neo4j Browser, enter a Cypher statement to load the required JDBC driver:

3. Start pulling Northwind SQL tables into Neo4j with JDBC and Cypher

Run the following Cypher queries, courtesy of William Lyon, separately in the Neo4j Browser:

Running Cypher Queries on Imported Data

Here’s a simple Cypher query for collaborative filtering product recommendations:


Next Steps

If this was your first experience with Neo4j, you probably want to learn more about Neo4j’s Cypher query language. Neo4j has some great (free) online training you can take to learn more. You can also use the Cypher Refcard to power your journey to becoming a Graphista.

Graphing Hillary Clinton’s E-mails in Neo4j

Technologies: Neo4j, OpenRefine, Prismatic Topics API, Python, Py2neo


Bernie is sick and tired of hearing about Hillary’s e-mails and so am I.  So, why am I writing about them?  Well, they can possibly provide an interesting insight into how our government works (or doesn’t work) — if only they were in a better format than PDFs!!  They represent a perfect graph!

I started off by downloading the CSV files created by Ben Hammer.  Some of the information about who messages were from/to aren’t very normalized in that dataset, so I used the OpenRefine faceting feature and created emails-refined.csv.

I imported these into Neo4j:

With the data in Neo4j, I got to explore the Person nodes Hillary sent the most Email nodes to.


Knowing the e-mails and senders+receivers is interesting, but I wanted to see what the e-mails are about!  While the subject lines are included with the e-mails, they’re often opaque, like the meaningful subject “HEY” used in an e-mail from Jake Sullivan to Hillary Clinton.  Natural language processing to the rescue!

I built a small Python script and used Py2neo to query all e-mails without attached topics.  I then go through each e-mail and send the raw body text and subject to the Prismatic Topics API.  The API returns a set of topics, which I then use to create REFERENCES relationships between the e-mails and topics.  This code is based on the excellent post on the topic by Mark Needham.

Now I can explore e-mails by topic, like the graph below where I see e-mails related to David Cameron.  When I double-clicked on the e-mail with subject ‘GUARDIAN’ in the Neo4j Browser, I can see all the other topics that e-mail references, including Sin Fein, Northern Ireland, Ireland, and Peace.


With this additional topic information, I can start to understand more context around Hillary’s e-mails.

What fun things can you find in her e-mails?

I’ve opened up the Neo4j instance with this data for the world to explore.  Check it out at  The dataset is open to the public, but I’ve marked it as read-only.  Mention me on Twitter with @ryguyrg if you discover any interesting nuggets of knowledge in Hillary’s e-mails!