17/04/2020

Neo4j Cypher is great but sometimes strange

Home



My journey with Neo4j started a few months ago and I really think that it is a great database and I'm in love with its query language called Cypher. However, especially having SQL background, some things might be surprising (in this negative way). I will give you an example. Let's assume that we have a graph that among other stores movies. We may write such a query to retrieve all the movies:
MATCH (m:Movie) 
RETURN m
Here is a sample result:


If we slightly modify this query we can select movies that were released in 1999:
MATCH (m:Movie)
WHERE m.released = 1999
RETURN m
This time we will get just 4 movies:


Now let's make something more complex i.e. additionally let's return information about producers of movies (if they exist):
MATCH (m:Movie)<-[:PRODUCED]-(p:Person)
WHERE m.released = 1999
RETURN m, p
This query has one small bug. The problem is that it returns only 1 movie while we know that there are 4 movies released in 1999. To fix a problem we need to use an equivalent of LEFT/RIGHT JOIN from SQL which is called OPTIONAL MATCH:
MATCH (m:Movie)
OPTIONAL MATCH (m)<-[:PRODUCED]-(p:Person)
WHERE m.released = 1999
RETURN m, p
Better but still something is wrong. Instead of getting 4 movies, we have dozens of them and if we analyze results we will notice something very strange. Movies returned by a query above were released in many different years e.g. in 1975 even though we have this condition WHERE m.released = 1999.

Does it mean that WHERE statement in Cypher does not work in some cases? Or maybe something is wrong with OPTIONAL MATCH. Well, everything is ok if we know how Cypher exactly works. What is not obvious is that:

OPTIONAL MATCH ... WHERE never removes rows from the result.

Yes, it is awkward but it is how Cypher works. Everything is also explained in this nice article from Neo4j. Any way to fix a problem we have 2 options. We can either change the order of statements in a query:
MATCH (m:Movie)
WHERE m.released = 1999
OPTIONAL MATCH (m)<-[:PRODUCED]-(p:Person)
RETURN m, p
Or use WITH statement:
MATCH (m:Movie)
OPTIONAL MATCH (m)<-[:PRODUCED]-(p:Person)
WITH m, p
WHERE m.released = 1999
RETURN m, p
And now results are perfect: