Noe4J@Cybersecurity Conference 2020

These are my notes in preparing the presentation for the upcoming conference on Oct 24, 2020. I decided to work on Kaggel’s Data for Network security. The solution was already published using Python.Source of Data and Kaggel Competition Notes: https://www.kaggle.com/hawkcurry/data-analysis-for-network-security-101-solution/notebook?select=gowiththeflow_20190826.csv

Why should we care about analyzing logs?
Because logs are the clues left behind for us to piece together and learn what happened. However, it is not humanly possible to go through 114 millions of lines of the security logs and piece together — at least not in the time frame we need it. You will need tools to automate this. Kaggel’s solution used Python. Others use R or SQL. I plan to use another tool called Neo4J.

NEO4J is a Network graph analytics software that not only arrives at the same answers but also shine by in certain scenarios. These capabilities help guide data security governance in NEW ways. These will be shown at the end.

Using Tableau, I attempted to get a sense of the traffic. Here are some cursory visuals to help us understand the underlying data.

114 million lines/rows form the 5gb data source
One week view of the Logs
First Day, January 8. This is the activity during Office hours. Not much activity after office hours
2nd Day: Now we see the difference between Office hours and After Office hours. Did you notice that slight bump in outbound traffic around 10am ?
3rd Day: Same bump in outbound traffic around 10am
4th day: Same pattern
5th Day: Same Pattern still
Final Day: the spike in outbound traffic during AFTER officehours is now more pronounced.

First Attempt : Failed

The larger 5gb filesize took NEO4j forever to load.

:auto
USING PERIODIC COMMIT 1500
LOAD CSV FROM ‘file:///gowiththeflow_20190826.csv’ as csvrow
MERGE (n:Node {IPv4:csvrow[1]})
MERGE (m:Node {IPv4:csvrow[2]});

:auto
USING PERIODIC COMMIT 1500
LOAD CSV FROM ‘file:///gowiththeflow_20190826.csv’ as csvrow
MATCH (n:Node {IPv4:csvrow[1]})
MATCH (m:Node {IPv4:csvrow[2]})
// WHERE n.IPv4 <> m.IPv4
Create (n)-[:Traffic{date:apoc.date.format(toInteger(csvrow[0]),’ms’,’mm/dd/yyyy HH:mm:ss’),bytes:toInteger(csvrow[4]),
port:csvrow[3]}]->(m)

I “chopped” this into smaller segments and then let NEO4j load these in batches.

Tableau Prep Flow showing how the large file is chopped up and exported
Tableau Prep Flow showing how the large file is chopped up and exported
Tableau Prep Flow
Actual output of the Tableau Prep flow
Actual output of the Tableau Prep flow
Files exported by Tableau Prep Flow

I used Tableau Prep to extract the first octet by using CustomSplitusing [.] and getting the first part. Then all SourceIP whose first octet is 12 will be use as the filter and saved as [output12.csv] etal.

Update: I also used Tableau Prep to extract: Hour, Minute, Day and Month fields from the F1 column.

Another strategy is to separate the NODE creation from LINKS creation operations. In NEO4J Browser, I increased the default memory to 20g:

// the use of periodic commit helps with limited memory environments
:auto
USING PERIODIC COMMIT 5000
LOAD CSV with header FROM “file:///output12.csv” as csvrow
MERGE (n:Node {IPv4:csvrow.F2})
MERGE (m:Node {IPv4:csvrow.F3});

//Then create the links
:auto
USING PERIODIC COMMIT 1500
LOAD CSV with HEADERS FROM “file:///Output12.csv” as csvrow
Match (n:Node {IPv4:csvrow.F2})
Match (m:Node {IPv4:csvrow.F3})
// WHERE n.IPv4 <> m.IPv4
Create (n)-[:Traffic{date:apoc.date.format(toInteger(csvrow.F1),’ms’,’mm/dd/yyyy HH:mm:ss’),bytes:toInteger(csvrow.F5),port:csvrow.F4}]->(m)

// then repeat until you finish all the OutputXX.csv files

6K nodes and 114m links

Now we do a backup here.

Then we add TYPE property to the NODE object to set either “Internal” or “External” based on the first octet values. Internal IPv4s are those that begin with 12,13, and 14. All the rest are External

Match (n:Node)
set n.type = (
CASE
WHEN toInteger(split(n.IPv4,’.’)[0])<15
THEN ‘Internal’
ELSE ‘External’
END)
return *

Display the [type] property in the Nodes. Notice that without the FILTER to exclude records where the sourceIP and destIP are the same, you will see the above display where lines start and end (loop back to itself).

Next step, If did not compute the hour, day and minute in Tableau Prep, this is how would do it in NEO4J:

MATCH (n:Node)-[r:Traffic]->(m:Node)
WITH r, toInteger(split(split(r.date,’ ‘)[1],’:’)[0]) as Hours
SET r.OfficeHours= (CASE WHEN Hours>16 THEN ‘YES’ ELSE ‘NO’END), r.Hours= Hours;

The inner split gets us the TIME portion.
The outer split gets the hour from the TIME portion and checks if it is greater than 16. This is the office start hour.

I had to give up on this attempt because the resulting NEO4J database was so large that each statement I execute would take HOURS. It proved to be too slow and costly to execute. I had to find a faster method.

Second Attempt: Moderately Successful

It was much faster for us to calculate and create fields for Hour, Day, Month from the F1 column using Tableau Prep and allow us to import them into NEO4J.

MATCH (n:Node)-[r:Traffic]->(m:Node)
RETURN n.IPv4 as SourceIP, m.IPv4 as DestIP, r.Month, r.Day, r.Hour, count(*) as counts,
sum(r.bytes) as TotalBytes, avg(r.bytes) as AvgBytes, min(r.bytes) as minBytes,
max(r.bytes) as Maxbytes, stdev(r.bytes) as stdev,
(sum(r.bytes)-avg(r.bytes))/stdev(r.bytes) as zScore;

Or we can use Tableau Prep to aggregate the records and consolidate them:

Tableau Prep Flow to aggregate by SourceIP, DestIP, Port, OfficeHours
Already we can see that certain IPs have more traffic than others.
Traffic by Ports also show distribution of traffic.
Sample Output

So we start from scratch (again) but this time:

Set up INDEX:

// having an index speeds up the loading,creation of links and results
CREATE INDEX for (m:Node) on (m.IPv4);
CREATE INDEX for (m:Node) on (m.type);
CREATE INDEX for (p:Port) on (p.number);
CREATE INDEX for (n:Node) on (n.louvain);

Test load the file:

LOAD CSV with headers from ‘file:///Output.csv’ as csvrow
RETURN * limit 1

{
AvgBytes”: “937”,
OfficeHours”: “No”,
Bytes”: “937”,
SourceIP”: “14.35.21.57”,
DestIP”: “15.112.80.35”,
Port”: “94”,
Count”: “1”,
StdDevBytes”: null
}

// the use of periodic commit helps with limited memory environments
:auto
USING PERIODIC COMMIT 1500
LOAD CSV with headers from ‘file:///Output.csv’ as csvrow
MERGE (n:Node {IPv4:csvrow.SourceIP})

:auto
USING PERIODIC COMMIT 1500
LOAD CSV with headers from ‘file:///Output.csv’ as csvrow
MERGE (m:Node {IPv4:csvrow.DestIP});

Then we re-create the links between sourceIP and destIP using the aggregated values of “Number of Rows (Aggregated)” as count and Bytes

:auto
USING PERIODIC COMMIT 1500
LOAD CSV with HEADERS FROM ‘file:///Output.csv’ as csvrow
Match (n:Node {IPv4:csvrow.SourceIP})
Match (m:Node {IPv4:csvrow.DestIP})
// WHERE n.IPv4 <> m.IPv4
Create (n)-[:Traffic{Bytes:toInteger(csvrow.Bytes), AvgBytes:toFloat(csvrow.AvgBytes),
StdDevBytes:toFloat(csvrow.StdDevBytes),
OfficeHours: csvrow.OfficeHours,
Port:csvrow.Port,
Count:toInteger(csvrow.Count),
Zscore:(toInteger(csvrow.Bytes)-toFloat(csvrow.AvgBytes))/toFloat(csvrow.StdDevBytes)}]->(m)

These steps will create links from NODEs to Ports:

MATCH (n:Node)-[r:Traffic]->(m:Node)
WITH n, r.Port as Port, sum(r.Count) as totalcount, sum(r.Bytes) as totalbyte , avg(abs(r.Zscore)) as AvgZscore
MERGE (p:Port{number:Port})
create (n)-[:ToPort{Count:totalcount,Byte:totalbyte,Zscore:AvgZscore}]->(p);

MATCH (n:Node)-[r:Traffic]->(m:Node)
WITH m, r.Port as Port, sum(r.Count) as totalcount, sum(r.Bytes) as totalbyte, avg(abs(r.Zscore)) as AvgZscore
MERGE (p:Port{number:Port})
create (p)-[:FromPort{Count:totalcount,Byte:totalbyte,Zscore:AvgZscore}]->(m);

However, to save time, we aggregate Count,Bytes and Zscore using tableau prep and then use the following to create the links in NEO4J:

:auto
USING PERIODIC COMMIT 1500
LOAD CSV with HEADERS FROM ‘file:///Output-Src.csv’ as csvrow
MERGE (p:Port{number:csvrow.Port})
MERGE (n:Node{IPv4:csvrow.SourceIP})
Create (n)-[:ToPort{Count:toInteger(csvrow.Count),Byte:toInteger(csvrow.Bytes),Zscore:toFloat(csvrow.Zscore)}]->(p);

:auto
USING PERIODIC COMMIT 1500
LOAD CSV with HEADERS FROM ‘file:///Output-Dest.csv’ as csvrow
MERGE (p:Port{number:csvrow.Port})
MERGE (n:Node{IPv4:csvrow.DestIP})
Create (p)-[:FromPort{Count:toInteger(csvrow.Count),Byte:toInteger(csvrow.Bytes),Zscore:toFloat(csvrow.Zscore)}]->(n);

Finally, we add the property type to Node. This helps us mark traffic between internal and external IPs.

// add type property to Node
Match (n:Node)
set n.type = (
CASE
WHEN toInteger(split(n.IPv4,’.’)[0])<15
THEN ‘Internal’
ELSE ‘External’
END)
return *

Exercise Questions:

Question 1: Discover Data Exfiltration 1
Our intellectual property is leaving the building in large chunks. A machine inside is being used to send out all of our widget designs. One host is sending out much more data from the enterprise than the others. What is its IP?

MATCH (n:Internal)-[r:Traffic]->(m:External)
RETURN n.IPv4 as SourceIP, sum(r.Bytes) as TotalBytes
ORDER by TotalBytes DESC
LIMIT 5

Results after 10,084 seconds

Below is a Tableau Chart of the 114m rows:

Among the Internal IPs, 13.37.84.125 clearly has the largest traffic going to EXTERNAL IPs.
And this data leaker sent the data out to 15 external IPs (note the colors) and used 5 different ports.

Let’s tag this device as a “DataLeaker” and “Interesting

MATCH (n:Node)
WHERE n.IPv4 IN [‘13.37.84.125’ , ‘12.59.25.34’, ‘14.57.50.29’,’12.45.94.34',’14.57.60.122']
set n:Interesting

MATCH (n:Node)
WHERE n.IPv4 =‘13.37.84.125’
set n:DataLeaker

Who else is sending data to our exfiltrator?

// what is the shortest path from a node to the exfiltrator

MATCH (from:Internal{IPv4:”12.58.122.56"}),
(to:External { IPv4: “18.37.102.33”}),
path = (from)-[:Traffic*]->(to)
RETURN path AS shortestPath,
reduce(Bytes = 0, r in relationships(path) | Bytes+r.Bytes) AS totalbytes
ORDER BY totalbytes ASC
LIMIT 1

Question 2: Discover Data Exfiltration 2
Another attacker has a job scheduled that export the contents of our internal wiki. One host is sending out much more data during off hours we filter for (0:00 to 16:00) from the enterprise than the others, different from the host in the Question 1. What is its IP?

// create node property for Officehours
MATCH (n:INTERNAL)-[r:Traffic]->(m:EXTERNAL)
WITH r, toInteger(split(split(r.date,’ ‘)[1],’:’)[0]) as Hours
SET r.OfficeHours= (CASE WHEN Hours>16 THEN ‘YES’ ELSE ‘NO’ END), r.Hours= Hours;

// do the calculations
MATCH (n:INTERNAL)-[r:Traffic]->(m:EXTERNAL)
WHERE n.type=’Internal’ and m.type=’External’ AND r.OfficeHours=’NO’
RETURN n.IPv4 as SourceIP, sum(r.Bytes) as TotalBytes
ORDER by TotalBytes DESC
LIMIT 5

Python ANSWER: 12.55.77.96

Let’s tag this device as a “DataLeaker” and “Interesting

MATCH (n:Node)
WHERE n.IPv4=”12.55.77.96"
set n:DataLeaker, n:Interesting

Our two nodes with label “DataLeaker”

Question 3: Discover Data Exfiltration 3
Some assailant is grabbing all the employee and vendor email addresses, and sending them out on a channel normally reserved for other uses. This is similar to attackers abusing DNS for data exfiltration. One host is sending out much more data on a some port from the enterprise than other hosts do, different from the hosts in Questions 1 and 2. What is its port?

Python ANSWER: 124

MATCH (n:Node{type:’Internal’})-[r:ToPort]->(p:Port)-[r1:FromPort]->(m:Node{type:’External’})
WHERE r.Zscore > 1 and r.Zscore <100000
RETURN n.IPv4 as SourceIP, p.number as Port, avg(r.Zscore) as Variance
ORDER by Variance Desc
Limit 5

Tag these IPs as “DataLeaker” and ‘Interesting”

MATCH (n:Node)
WHERE n.IPv4 IN [‘12.30.96.87']
set n:DataLeaker, n:Interesting

Question 4: Private C&C channel
We’re always running a low-grade infection; some internal machines will always have some sort of malware. Some of these infected hosts phone home to C&C on a private channel.

What unique port is used by external malware C&C to marshal its bots?
We are looking at ports that are publicly accessible. If a public facing port is legitimate, which might mean it is part of typical business functions like a company website or corporate email server, then we would expect it to be used frequently and from many unique sources.

Match (p:Port)<-[r:ToPort]-(m:Node{type:’External’})
WITH p, m.IPv4 as External
return p.number as Port, count(distinct External) as Servers
ORDER by Servers
limit 5

Port 113 was seen coming from only ONE external IP

Match (n:Node{type:’External’})-[]->(p:Port{number:’113'})-[]->(m:Node{type:’Internal’})
return *

From External IP to Internal IP using Port 113

ANSWER: 113

Then we tag these sourceIPs as “interesting” and “Infected”

Match (n:Node{type:’External’})-[]->(p:Port{number:’113'})-[]->(m:Node{type:’Internal’})
with m.IPv4 as Infected
Match (s:Node{IPv4:Infected})
set s:Infected,s:Interesting

Why do we keep on tagging (labeling) these interesting IPv4s?
We plan to run similarity algorithms to find OTHERs that may be similarly infected!

Question 5: Internal P2P
Sometimes our low-grade infection is visible in other ways. One particular virus has spread through a number of machines, which now are used to relay commands to each other. The malware has created an internal P2P network. What unique port is used by the largest internal clique, of all hosts talking to each other?

My solution is to find the strongly connected components first as shown below:

First we create Labels for NODEs as Internal and External based on Node.type. This enables us to limit community detection to ONLY the INTERNAL NODES.

Match (n:Node{type:’Internal’})
set n:Internal;
Match (n:Node{type:’External’})
set n:External;

Then detect the communities using Louvain :

NEO4J detected two large communities

Then create an index on Node for ‘louvain’ property. Then for members of this strongly connected components (Community=29), we display a sorted list of ports based on count of ports in descending order where Node.type=’Internal’.

Then looking at the biggest community, find the port with the most interaction.

Match (n:Internal)-[]->(p:Port)-[]->(m:Internal)
WHERE n.louvain=29
RETURN n.louvain as Community, p.number, count(distinct n) as sizeofcluster
ORDER by sizeofcluster
LIMIT 5

Then we tag these IPs as “Interesting” and “Infected”

Python solution:
ANSWER: Port 83 has approx. max clique size 264
If you are wondering where the other members are? They are in Louvain community 906!

So we shift to using Strongly Connected Components Algorithm:

Match (n:Internal)-[]->(p:Port)-[]->(m:Internal)
WHERE n.scc=0
RETURN n.scc as Community, p.number, count(distinct n) as sizeofcluster
ORDER by sizeofcluster
LIMIT 50

Port 83 is actually the Port with the smaller number of members. Other Ports have 417 members! like Ports 65,55,75,121,72,40,114,119,37,105. This is, I think where Neo4J shines!!

First we Tag all the INTERESTING traffic:

Match (n:Internal)
WHERE n.scc=0
Set n.Interesting

Then we tag all the Internal NODES using port 83 as INFECTED:

Match (n:Internal)-[]->(p:Port)-[]->(m:Internal)
WHERE n.scc=0
WITH n, n.scc as Community, p.number as Port, count(distinct n) as sizeofcluster
Where Port=’83'
SET n:Infected

My thoughts: here we see the advance community detection algorithms from NEO4J can actually identify MORE infected clients …and Ports!

Question 6: Malware Controller
We were just blacklisted by an IP reputation service, because some host in our network is behaving badly. One host is a bot herder receiving C&C callbacks from its botnet, which has little other reason to communicate with hosts in the enterprise. What is its IP?

For this question, we just follow the clue “which has little other reason to communicate with hosts in the enterprise”. We look for external IPs that established a connection only to a single host.

Match (n:External)-[r:Traffic]->(m:Internal)
WITH n,count(distinct m.IPv4) as HOSTCOUNT, collect(m.IPv4) as HOST
WHERE HOSTCOUNT=1
RETURN n.IPv4, HOST

Again we tag the IP as INTERESTING and INFECTED:

Match (m:Internal {IPv4:’14.45.67.46'})
SET m:Interesting, m:Infected
RETURN labels(m)

Python ANSWER: 14.45.67.46

Question 7: Infected Host
One host is part of the botnet from Question 6, what is its IP?

From the previous question we know that hosts in the botnet contact 14.45.67.46 on port 27, so we look for internal hosts that have this similar behavior.

// find all Internal IPs that connect to 14.45.67.46 on port 27
MATCH (n:Internal{IPv4:’14.45.67.46'})<-[r:Traffic{Port:’27'}]-(m:Internal)
RETURN m.IPv4

Python ANSWER: 14.51.84.50

Match (m:Internal )
WHERE m.IPv4=”14.51.84.50" or m.IPv4=”13.42.70.40"
SET m:Interesting, m:Infected
RETURN labels(m)

Question 8: Botnet Inside
There is a stealthier botnet in the network, using low frequency periodic callbacks to external C&C, with embedded higher frequency calls. What port does it use?

One way we can detect this is to look for ports used more for OUTBOUND traffic. Then we use the std deviation value to rank these. That low frequency periodic callbacks looks like a beacon with a set number of bytes. These bytes will then have a std deviation close to zero. Besides, if one were the hacker, you wouldn’t want the beacon traffic to be larger than is necessary. It will attract attention.

ANSWER: 51

Match (n:Internal)-[r1:ToPort]->(p:Port)
return p.number as Port, sum(r1.Count) as Instance
ORDER by Instance

So which Internal Hosts are infected? Let’s tag them as Interesting and Infected.

Match (n:Internal)-[r1:ToPort]->(p:Port)
where p.number=’51'
SET n:Intersting, n:Infected

Question 9: Lateral Brute
Once a machine is popped, it’s often used to explore what else can be reached. One host is being used to loudly probe the entire enterprise, trying to find ways onto every other host in the enterprise. What is its IP?

Here we look for telltale signs of scanning which are one of:

large number of distinct destination ports
large number of distinct destination ips

Python ANSWER: 13.42.70.40

Find the top 500 internal IPs that are sending out traffic.
Pass these 500 internal IPs to find the top ones that are sending ONLY to INTERNAL IPs, and sort them by sum(counts)

Match (n:Internal)-[r1:ToPort]->(p:Port)
WITH n, sum(r1.Count) as Portscans
ORDER by Portscans desc
Limit 150
Match (n)-[r:Traffic]->(m:Internal)
return distinct n.IPv4 as SourceIP, sum(r.Count) as TotalScans
ORDER by TotalScans desc
LIMIT 5

Now we tag this 13.42.70.40 as Interesting and Infected
Ah it was tagged in the earlier question!

Question 10:
Lateral Spy
One host is trying to find a way onto every other host more quietly. What is its IP?

Python ANSWER: 12.49.123.62

Match (n:Internal)-[r:Traffic]->(m:Internal)
WITH n ,m,r.Port as Port,sum(r.Count) as CountT
WHERE CountT=1
Match (n)-[r:Traffic]->(m)
RETURN n.IPv4, sum(r.Count) as Instance
ORDER by Instance desc
Limit 100

or:

Match (n:Internal)-[r:Traffic]->(m:Internal)
WITH n ,m,r.Port as Port,sum(r.Count) as CountT
WHERE CountT=1
Match (n)-[t:ToPort]->(p:Port)-[s:FromPort]->(m)
WHERE p.number IN [Port]
RETURN n.IPv4,count(distinct p.number), count(distinct m.IPv4)
Limit 10

Advantages of NEO4J
Data For Security Governance

It can show the network path
It shows the shortest path between Hacker and Victim

It can identify other potential victims with the use of communities
It can help management do risk rankings based on degree centralities and
betweeness scores.

It can find other victims or hacker using jaccard similarities

Data Analyst, Startup Founder, Tech Columnist