Using NEO4J for Contact Tracing: Uncovering the Hidden links

Wilson Chua
16 min readOct 27, 2020

--

This 7 minute video explains the theory:

https://www.youtube.com/watch?v=uKVQERi83lM&feature=emb_logo

Then the rest here demonstrates the implementation of the above theory.

We can use NEO4J to uncover any hidden clusters of Covid19 infections. NEO4J community edition is a FREE software. We can also PREDICT the key people in a community that are likely to get infected FIRST. Together these two abilities help LGUs immensely in conducting efficient and effective CONTACT tracing.

And you only need to copy and paste the ready code below to generate the analysis. You don’t need to be an SQL guru. No more need to do the complicated JOINS to connect various tables together. It should be EASY to do.

First we start with a dummy data that is collected by a BlueTrace or similar APP. I am currently working with local startup CLEAR, Staysafe.ph and Curve. NOVID is also possible.

https://www.theverge.com/2020/4/10/21216484/google-apple-coronavirus-contract-tracing-bluetooth-location-tracking-data-app

Only the locations are actual names of places ( and GPS Lat /Long) . The rest are randomly generated. (Standard disclaimer: the cases are fictitious and if there are similarities, these are purely coincidental):

Take note of the 3rd row data. We will use it for our demo

Tech Note: (You can skip these 3 paragraph)
User1 and User2 have some sort of BlueTrace Algorithm installed on their smartphones. When they are near each other, the Bluetooth will sense this and record each other’s ID including date, time and location (GPS lat and long). Then using Google Geo-lookup, we have established the names of places that correspond to the GPS coordinates.

Imagine that you are the Analytics guru that is asked to generate insights out of millions of records like the above. You could do it by hand, using EXCEL pivot. But it would be extremely difficult to tease out the relationships between People, and Places to uncover the hidden clusters.

The instructions to install NEO4j is in this video. And the instructions to load the .csv into NEO4j is in the APPENDIX section at the end of this NOTE.

I call your attention to row 3 of the excel file. We see a record that says:
PH20 was near PH91 on 4/6/2020 and ate at a place called ‘Dagupenas’.

Let us check how NEO4J visualizes his particular data:

Match (n:Person)-[]->(p:Place)<-[]-(m:Person)
WHERE n.userID=”PH20" AND m.userID=”PH91" AND p.name=”Dagupenas”
RETURN n,p,m

Notice that the GPS coordinate is a property of the NODE: Place

If we click on the Relationship LINK, we can see additional information on the event:

We also add the Date and Time Info as a property of the “LINK”

How are two people related?

Let us say that both PH48 and PH10 became INFECTED. And we want to investigate if they came into contact with each other. Do we see them ‘NEAR’ each other? We type this command to reveal how they came into contact with each other: (Remember, you can cut and paste the command, just substitute the userID portion)

Match (n:Person)-[]->(m:Person)
WHERE n.userID=”PH20" AND m.userID=”PH91"
RETURN n,m

Both nodes were near each other.

We can expand PH91 node to see all the records we have on them:

It looks messy. Each of the BLUE circles represent People that they came into contact with. And each RED circle denotes the locations that they were in. The relationship between people to other people, or people to places are represented by lines. These lines link people to each other and to places. They are also color coded to represent various actions that took place. Yellow lines represent an action of ‘Study’. Green lines represent action of ‘ Shop’ and so on.

We can filter this result to zero in to the interaction of interest. We do this with a new Pattern:

MATCH (n:Person)-[]-(z:Place)-[]-(l:Person)
WHERE n.userID=”PH48" AND l.userID=”PH10"
RETURN n,z,l

Places where both PH10 and PH48 visited.

Suppose PH10 is the infected person. And we want to see all places that were common to them with PH48. But this time, only AFTER PH10 visited it.

MATCH (n:Person)-[r]-(z:Place)-[m]-(l:Person)
WHERE n.userID=”PH48" AND l.userID=”PH10" AND r.Date >= m.Date AND r.Time >= m.Time
RETURN n,z,l

They were exposed at the hospital

If we click on the LINK, we also get additional information:

If we want to see who else went to a particular location (and possibly became infected there as well) , we click on the location and expand the links:

The blue circles represent all the other people that visited the hospital.

In the above case, we not only identified how the two infected people are linked, but we are also able to generate a list of additional people that could be at risk. These are the people that visited the same location that the two confirmed cases went to.

However, some of these people visited the hospital PRIOR to the visit made by the two confirmed cases. To filter this out to a particular Date and Time:

MATCH (n:Person)-[r]-(z:Place)
WHERE z.name=”Villaflor Hospital” AND r.Date >= “04/07/2020” and r.Time <=”23:59:00"
RETURN n,z,r
ORDER BY r.Date DESC

These blue circles represent people that should be contact traced.

If we take another pair of cases (PH14 and PH 53), we can see the common places (from 1 to 2nd degree) that two people have been to:

MATCH p=(n:Person)-[*1..2]-(m:Person)
WHERE n.userID=”PH14" AND m.userID=”PH53"
RETURN p

We can also show the mesh of people that are near each other. Limited to the first 25

MATCH p=((n:Person)-[r:NEAR]->()<-[]-(m:Person))
RETURN p LIMIT 25

The RED circle represents an infected individual.

Or, we can also show the people that shopped , limited to the first 25

MATCH (n:Person)-[:Shop]->(m:Place)
RETURN n, m LIMIT 50

Notice that all the links are in GREEN. This is the color code for SHOP action.

Now, supposed PH14 became POSITIVE for covid. We want to know who his close contacts were.

MATCH (n:Person)-[:NEAR]-(m:Person)
WHERE n.userID=”PH14"
RETURN n,m

And we also need to know:
when and where did these contacts happen?

MATCH (n:Person)-[]-(p:Place),
(m:Person)-[:NEAR]-(n)
WHERE n.userID=”PH14"
RETURN n,m,p

The places will need to be deep cleaned or closed. The persons that came into contact with PH14 will need to be contact traced.

Now, suppose we are asked to IDENTIFY the potential super spreaders to TEST for covid infections, we can use this cypher command to show the people that have a lot of ‘contacts’

match (n:Person)
WITH n, size((n)-[:NEAR]->(:Person)) as connections
WHERE connections >=5
RETURN n.userID, connections ORDER BY connections DESC

PH53 and PH99 should be tested FIRST.

We can also do the same for LOCATIONs.

Which places in Dagupan has the most traffic and should be disinfected, or placed under strict social distancing ?

match (n:Place)
WITH n, size((n)<-[]-(:Person)) as connections
WHERE connections >=5
RETURN n.name, n.Lat, n.Long, connections ORDER BY connections DESC

The results:

Nepo Mall needs to ensure strict social distancing and take extra precautions

We can export GPS locations (longitude and latitude) out as a CSV file, and using Python, we might visualize the GPS data to create a heat map similar to this one :

Simulated Heat Map of GPS locations

Update: Sept 10, 2020 Using NeoMap (applications found in the Graph Applications Library)

Click on the link circled and copy the URL

This gives the URL as : https://registry.npmjs.org/neomap

Input this into the install field below:

To launch NeoMap:

Then scroll down to add another layer (like QGIS) and add a heatmap
For the heatmap, we added Person details (like brgy, long and lat of his residence) see appendix optional step.

This is the resulting heatmap from NeoMap:

The heat map shows the residence of persons in the study:
MATCH (n:Positive)-[r]->(m:Place)
Where n.userID=”PH10" AND r.Date=”4/6/2020"
return n.userID,r.Date,r.Time, m.Lat as longitude, m.Long as latitude
order by n.userID, r.Date,r.Time

Additionally, we can use the Advanced mode to specify a neo4j query that returns output of a path, that we can use POLYLINE on:

We can also setup relationship among the Place nodes to set those nearest each other:

MATCH (p:Place)
set p.latitude=toFloat(p.Long), p.longitude=toFloat(p.Lat)
return p;

Note: i placed the lat into the long field and vice versa hence the need to use the wrong labels. Then we need to convert it to Floating point.

MATCH (p1:Place), (p2:Place)
where id(p1) <> id(p2)
with p1,p2, distance(point(p1),point(p2)) as distance
ORDER BY distance ASC
with p1, collect({node:p2,distance:distance}) [..5] as nearest
UNWIND nearest as near
with p1, near, near.node as nearest_node
MERGE (p1)-[m:NEAROTHERLOCATION]-(nearest_node)
SET m.distance = near.distance

So we can now plot the shortest path neo4j command:

MATCH p=shortestpath((n1:Place)-[r:NEAROTHERLOCATION*..5]-(n2:Place{name:”CSI Lucao”}))
WHERE n1.name=”Eastgate Plaza”
WITH p
unwind nodes(p) as places
RETURN places.Lat as longitude, places.Long as latitude

And putting it as an advanced cyper query in NeoMap:

The shortest path here did not really take into account the distance…

We can see that the chosen paths did not take into account the distance parameter. It basically displayed the ‘shortest hops’. As shown using Graph Data Science Playground, the shortest path uses other nodes:

So now we need to figure out how to use the distance in the NeoMap-ing.

MATCH (from:Place {name:”Eastgate Plaza” }), (to:Place {name: “CSI Lucao”}) , path = (from)-[:NEAROTHERLOCATION*..4]-(to)

WITH path AS shortestPath,reduce(distance = 0, r in relationships(path) | distance+r.distance) AS totalDistance
order by totalDistance asc limit 1

UNWIND nodes(shortestPath) as places
RETURN places.Lat as longitude, places.Long as latitude

We can then plug this into NEOMAP and show this:

Supposed the DOH asked you to produce the CHAIN of infections from PH 98 to PH 10 ?

MATCH (n1:Person)-[]->(n2:Person),
(n2:Person)-[]->(n3:Person),
(n3:Person)-[]->(n4:Person),
(n4:Person)-[]->(n5:Person),
(n5:Person)-[]->(n6:Person)
WHERE n1.userID=”PH98" and n6.userID=”PH10"
RETURN n1,n2,n3,n4,n5,n6

Another way to code it:
MATCH p=shortestpath((n1:Person)-[*1..7]->(n2:Person))
WHERE n1.userID=”PH98" and n2.userID=”PH10"
RETURN p

Suppose we want to see all contact and contacts of those contacts to nth degree of userID PH98?

MATCH (n:Person)
WHERE n.userID=”PH98"
SET n:INFECTED

Who are THIS USER’s CLOSES Contacts?
MATCH (n:Person)-[:NEAR]-(m:Person)
WHERE n.userID=”PH98"
RETURN n,m

Your Boss is amazed. But then he thinks to himself. Hang on. This USER98 could also have visited PLACES and left the Covid19 virus there.

He needs you to print out all the places the PH98 has been to, and all the USERS that have been to those places as well.

You smile at this. For you know how to do this quickly:

MATCH (n:Person)-[*1]->(m:Place)<-[*1]-(r:Person)
WHERE n.userID=”PH98"
RETURN n,m,r

And we can view this in table form:

Or Export that out in several formats:

Now suppose Mayor Brian Lim of Dagupan wants you to provide data on the effect of closing down schools. How would you do it?

First, let me suggest the use of a metric to base his decision on. I propose we look at the [Total Number of interactions]. For this we look at the Total number of “Near contacts’ when all the activities are taken into account: (Eating, Praying, Shopping, Studying, Meetings etc)

match (n:Person)-[:Eat|:Study|:Massage|:Pray|:Trial|:Shop|:Travel|:Checkup]-(p:Place)-[]-
(m:Person)
RETURN n,count((n)-[:NEAR]-(m)) as count
ORDER by count desc

Total NEAR connections = 21,462

Now We take out the :Study LINKS:
match (n:Person)-[:Eat|:Massage|:Pray|:Trial|:Shop|:Travel|:Checkup]-(p:Place)-[]-
(m:Person)
RETURN n,count((n)-[:NEAR]-(m)) as count
ORDER by count desc

Total NEAR connections, 17,694

The difference in the number of interactions is 17.5% reductions!
This provides LGUs like Dagupan a solid basis for DATA-driven decisions on which to open, and which to close down.

TIP: A quick and dirty way to figure out what types of interactions to stop:
CALL db.relationshipTypes() YIELD relationshipType as type
CALL apoc.cypher.run(‘MATCH ()-[:`’+type+’`]->() RETURN count(*) as count’,{}) YIELD value RETURN type, value.count
ORDER by value.count DESC

Next Steps?
If you are not yet bored out of your mind, here is more.
We can do a lot of network graph stuff.

The demo above is the REACTIVE things an LGU can do. The above commands are executed when an infectious case is confirmed. But why wait? We can use Data Analytics to do PREDICTION. Remember the movie Minority Report? Kaya na natin now:

Pre-emptive/Predictive COVID TESTING?

What if instead of waiting for the first infection to popup, your BOSS decides to take the proc-active route. He wants you to identify all the potential SUPER spreaders in the community.

Test them. And if positive, ISOLATE them before are INFECTED.
Tall Order? You bet? Programming skills? No need.

Since we have contact tracing data, we can reasonably estimate the date of exposure, and ask for exposed people to come in after 4th day of exposure to take the Blood based antibody tests. This will reduce the number of false negatives!

Now, you do need to brush up on Network Graph theory. Check out Social Network Terms: https://cambridge-intelligence.com/keylines-faqs-social-network-analysis/

Degree Centrality Graph :
Degree centrality assigns an importance score based simply on the number of links held by each node. In our graph, these are the people that are exposed to a lot of OTHER people. They are most likely to catch the virus ahead of other people. And if they get infected, they are most likely to become super spreaders.

If we had only 3 covid tests left, which 3 persons do we test?

Betweenness Graph :
Betweenness centrality measures the number of times a node lies on the shortest path between other nodes. That is tech speak for “These guys are the connectors. Infections from one barangay only happens when this central person gets infected. They are the ‘tulay’. If they are protected, the virus never gets to jump from one barangay/community to the next community.

If we wanted to Isolate 3 persons, who would these be to compartamentalize infections

Closeness Graph :
Closeness centrality scores each node based on their ‘closeness’ to all other nodes in the network. Closeness centrality can help find good ‘broadcasters’

Eigenvector Graph :
EigenCentrality measures a node’s influence based on the number of links it has to other nodes in the network but it then goes a step further by also taking into account how well connected a node is, and how many links their connections have, and so on through the network.

Similar to Degree Centrality, but it adds the value of the connections’ connections

Strongly Connected Components (either family or very close friends) Detection:

Created with Neo4j 3.5.17 and Graph Data Science Playground Plugin. To Learn more, watch this video:

Some Notes from the video:

Addendum:
Hand coding using Graph Algorithms library instead of Graph Data Science Library (both are mutually exclusive) based on the codes from the book : Graph Algorithms by Mark Needham and Amy

If you are hardcore, and don’t want the convenience of Drag and Drop of the Graph Data Science Playground, you can also do it manually. Use Graph Algorithm Plugins instead of the Graph Data Science Library plugin (these two don’t get along well together). There can only be ONE plugin of this type.

Degree Centrality
MATCH (u:Person)
RETURN u.userID as name,
size((u)-[:NEAR]->()) AS outDegree,
size((u)<-[:NEAR]-()) AS inDegree,
size((u) — ()) AS Degree
ORDER by Degree desc

Closeness Centrality (Assuming all are connected)

CALL algo.closeness.stream(“Person”, “NEAR”)
YIELD nodeId, centrality
RETURN algo.getNodeById(nodeId).userID, centrality
ORDER BY centrality DESC

Closeness Centrality using Harmonics (assuming there are unconnected nodes)

CALL algo.closeness.harmonic.stream(“Person”, “NEAR”)
YIELD nodeId, centrality
RETURN algo.getNodeById(nodeId).userID, centrality
ORDER BY centrality DESC

Betweenness Centrality

CALL algo.betweenness.stream(“Person”,”NEAR”)
YIELD nodeId, centrality
RETURN algo.getNodeById(nodeId).userID AS user, centrality
ORDER by centrality DESC

Page Rank computation

CALL algo.pageRank.stream(‘Person’, ‘NEAR’, {iterations:50, dampingFactor:0.85})
YIELD nodeId, score
RETURN algo.getNodeById(nodeId).userID AS user, score
ORDER BY score DESC

Community related computations: (Triangle Count)

CALL algo.triangleCount.stream(‘Person’, ‘NEAR’)
YIELD nodeId, triangles, coefficient
WHERE coefficient > 0
RETURN algo.getNodeById(nodeId).userID AS user, coefficient
ORDER BY coefficient DESC

Community related computations: (Strongly connected components)

CALL algo.scc.stream(“Person”, “NEAR”)
YIELD nodeId, partition
RETURN partition, collect(algo.getNodeById(nodeId)) AS users
ORDER BY size(users) DESC

Finding connected components

CALL algo.unionFind.stream(“Library”, “DEPENDS_ON”)
YIELD nodeId,setId
RETURN setId, collect(algo.getNodeById(nodeId)) AS libraries
ORDER BY size(libraries) DESC

Communities via Label Propagation:

CALL algo.labelPropagation.stream(“Person”, “NEAR”,{ iterations: 10 })
YIELD nodeId, label
RETURN label, collect(algo.getNodeById(nodeId).userID) AS users
ORDER BY size(users) DESC

Communities via Louvain Modularity

CALL algo.louvain.stream(“Person”, “NEAR”)
YIELD nodeId, communities
RETURN algo.getNodeById(nodeId).userID AS users, communities

Data Aggregation to reduce dataset size

Aggregate to a Daily level of detail

We may need to reduce the number of records to improve Neo4J processing speed. I think we can do this by assuming that each record of contact is worth a duration of 5 minutes (assuming that the tracing app sends contact details every 5 minutes).

We can then aggregate into a daily record of contacts by adding a property of [Duration] and consolidating all the related records into one.

MATCH (n:Person)-[r:NEAR]-(m:Person)
MATCH (n)-[a:LOCATION {Date:r.Date,Time:r.Time}]->(p:Place)
RETURN r.Date as Date,n.userID,m.userID,p.name as Place,count(*)*5 AS Duration
ORDER BY r.Date,n.userID,m.userID

Then export as CSV and load as a new dataset where duration is a property of the [:NEAR] relationship.

Share This out, PLEASE

Hope you can share this to the people doing contact tracing. I hope this helps. Note: you don’t need to study Neo4J. Just prepare your data in the exact format as the csv file above. then copy and paste the code, only substituting the PH# you need. Attribution of this work is appreciated but not necessary. I hope that this helps us pull ourselves OUT of lockdowns.

Note: The instructions for how to load the CSV into NEO4j is in the comment section. The CSV file has to be loaded into the Neo4j import folder.

APPENDIX :

A. How to Install

B. How to convert CSV and import into NEO4j:

Cypher command to test for completeness:
(You need to save the TEST-NEO4J-COVID.csv into the NEO4J …/import folder.)

LOAD CSV WITH HEADERS
FROM ‘file:///TEST-NEO4J-COVID.csv’
AS line
RETURN count(*)

Cypher command to see the field names:

LOAD CSV WITH HEADERS
FROM ‘file:///TEST-NEO4J-COVID.csv’
AS line
RETURN * limit 1

Cypher data to load the csv file:

LOAD CSV WITH HEADERS
FROM ‘file:///TEST-NEO4J-COVID.csv’
AS csvrow

// create the User1 node
MERGE (n:Person:UNINFECTED {userID:csvrow.USER1})
// create the user2 node
MERGE (f:Person:UNINFECTED {userID:csvrow.USER2})
// create the places node
MERGE (p:Place {name:csvrow.Place})
ON CREATE SET p.Lat=csvrow.Lat, p.Long=csvrow.Long

// now we create the links
MERGE (n)-[u:NEAR]->(f)
ON CREATE SET u.Date=csvrow.Date, u.Time=csvrow.time, u.Action=csvrow.Activity
MERGE (n)-[b:LOCATION]->(p)
ON CREATE SET b.Date=csvrow.Date, b.Time=csvrow.time, b.Action=csvrow.Activity
MERGE (f)-[c:LOCATION]->(p)
ON CREATE SET c.Date=csvrow.Date, c.Time=csvrow.time, c.Action=csvrow.Activity

// am actually thinking of using CREATE for the links instead of MERGE

Optional Step: Adding Person details

LOAD CSV WITH HEADERS
FROM ‘file:///TEST-NEO4J-COVID-People.csv’
AS csvrow

MERGE (n:Person{userID:csvrow.USER1})
on match
set n.Municipality = csvrow.Municipality, n.Brgy =csvrow.Brgy,
n.Province= csvrow.Province, n.Longitude=csvrow.Longitude, n.Latitude=csvrow.Latitude

RETURN n

Then to be able to map ACTIONs to the relationship labels:

MATCH (n:Person)-[l:LOCATION]-(m:Place)
call apoc.create.relationship(n,l.Action,l{.Date,.Time},m) yield rel
return rel

Then delete the generic :LOCATION location relationship :

match (n:Person)-[r:LOCATION]->(m:Place)
DELETE r

Note:
A better way would be to create Dynamic RELATION LINKS using FOREACH using this template.

LOAD csv with headers from “file://…” as row
MERGE (P1:PERSON (name:row.from})
MERGE (p2:Person {name:row.to})
foreach (t in case when row.type=”email” THEN [1] ELSE [] END |
merge (p1)-[:EMAIL]->(p2) )
foreach (t in case when row.type=”sms” THEN [1] ELSE [] END |
merge (p1)-[:SMS]->(p2))

In case LGUs can get Telco Data, i foresee the need to be able to compute social contact distance based on GPS data. Cypher has a spatial function distance() that can be used for this purpose. Using our test data, le’s compute the distance between places:

Match (n:Person)-[:Study]-(m:Place)-[:Study]-(o:Person)-[:Eat]->(p:Place)
RETURN n,m,o,p

So the above is a graph of all people that studied where one of them later went on to eat. We can use the spatial function of distance() to find the nearest pair of school and restaurant. Consider that all the places in our graph had the additional GPS property of Lat and Long.

Match (n:Person)-[:Study]-(m:Place)-[:Study]-(o:Person)-[:Eat]->(p:Place)
WITH point({longitude:toFloat(m.Long),latitude:toFloat(m.Lat)}) as p1,
point({longitude:toFloat(p.Long),latitude:toFloat(p.Lat)}) as p2,
m,p
RETURN DISTINCT m,p,distance(p1,p2) as socialDistance
ORDER by socialDistance limit 10

Video demo https://www.facebook.com/webmobile.ph/videos/852275298618880/?hc_location=ufi

--

--

Wilson Chua
Wilson Chua

Written by Wilson Chua

Data Analyst, Startup Founder, Tech Columnist

No responses yet