Covid Bibliometrics Using Graph Analytics

Wilson Chua
14 min read · Oct 19, 2020


For the Philippine National Librarians Convention

Group Members:
Stephen Alayon, Janny Surmieda, Paul Jayson Perez, Dan Anthony Dorado and Wilson Chua

Data Schema for Covid19 Bibliometrics

Why this may interest you:
We hope to answer some questions about Covid19 research:

1. Who are the most influential authors for Covid19 topics?
2. Which institutions have produced the most Cited Papers?
3. Which authors are LIKELY to collaborate for future papers?
4. Which authors form a cabal (a tight-knit cluster)?
5. Recommendation Engine: If a library borrower is looking for one title, what other titles might we recommend to this person?
6. Who are the top Publishers based on Number of Publication by Authors?
7. Tag cloud of keywords and their co-occurrence (a mapping of the keywords).
8. Would the leading institutions be a good predictor of eventual vaccine production?
9. Can the insight help focus on new research directions?
10. Philippines, quo vadis? (Which institutions, hospitals, etc.?)

The following are the actual steps taken to ingest and analyze the data. If you don’t have Neo4j installed, this YouTube video will help you get set up:
https://www.youtube.com/watch?v=hYZ_qlXlj90

Create a new database in Neo4j. (Tip: don’t upgrade to version 1.3.8; it is buggy.)

Now, fire up the Neo4j Browser and start ingesting the data. First, download and save the file. Source data extracted by Paul Jayson Perez: https://drive.google.com/drive/folders/1qZlpbRJ7r7U0IN5SvhrLhOwIFZ3Gpguf

Make sure the file is saved into the import folder of the Neo4j database.

Note: we need to add this line to the Neo4j configuration settings:
apoc.import.file.enabled=true

// First attempt to load the JSON using the command:
CALL apoc.load.json('file:///all_works.json') yield value
unwind value.item as insti
return insti

{
  "is-referenced-by-count": 0,
  "indexed": {
    "date-time": "2020-04-03T04:46:08Z",
    "date-parts": [[2020, 4, 3]],
    "timestamp": 1585889168985
  },
  "created": {
    "date-time": "2020-04-02T16:09:27Z",
    "date-parts": [[2020, 4, 2]],
    "timestamp": 1585843767000
  },
  "prefix": "10.1257",
  "author": [
    {
      "given": "Daniel",
      "sequence": "first",
      "family": "Bennett",
      "affiliation": []
    }
  ],
  "deposited": {
    "date-time": "2020-04-02T16:09:27Z",
    "date-parts": [[2020, 4, 2]],
    "timestamp": 1585843767000
  },
  "source": "Crossref",
  "content-created": { "date-parts": [[2020, 4, 2]] },
  "type": "dataset",
  "title": ["The Labor Market Impact of the COVID-19 Epidemic"],
  "references-count": 0,
  "URL": "http://dx.doi.org/10.1257/rct.5633",
  "institution": {
    "name": "American Economic Association",
    "place": ["-"],
    "acronym": ["AEARCT"]
  },
  "score": 8.29216,
  "container-title": ["AEA Randomized Controlled Trials"],
  "content-domain": { "crossmark-restriction": false, "domain": [] },
  "member": "11",
  "reference-count": 0,
  "publisher": "American Economic Association",
  "issued": { "date-parts": [[2020, 4, 2]] },
  "DOI": "10.1257/rct.5633"
}

To get the keys of each record:

CALL apoc.load.json('file:///all_works.json') yield value
unwind value.item as insti
return keys(insti)

Sample Test Code:

CALL apoc.load.json('file:///all_works.json') yield value
UNWIND value.item as insti
RETURN insti.indexed.`date-time`
,insti.indexed.`date-parts`[0][0] as Year
,insti.indexed.`date-parts`[0][1] as Month
,insti.indexed.`date-parts`[0][2] as Day

Consolidated Version to INGEST and create NODES/LINKS:

A word of caution: ingesting a 500 MB JSON file into Neo4j takes a lot of memory and CPU.

Ingestion Code, Optimized V4 (use this!)

// OPTIONAL: in case you need to delete the entire database
// (batched parallel deletes can deadlock; set parallel:false if this fails)
call apoc.periodic.iterate(
"MATCH (n) RETURN n",
"DETACH DELETE n",
{batchSize:2500,iterateList:true,parallel:true})

// create the indexes to speed up data ingestion
CREATE INDEX FOR (m:Author) ON (m.First,m.Last,m.Orcid);
CREATE CONSTRAINT ON (p:Institution) ASSERT p.Name IS UNIQUE;
CREATE CONSTRAINT ON (p:Publication) ASSERT p.DOI IS UNIQUE;
CREATE CONSTRAINT ON (p:Publisher) ASSERT p.Name IS UNIQUE;

CREATE INDEX FOR (p:Publication) ON (p.Subject);
CREATE INDEX FOR (e:Entity) ON (e.Name);


// Since the files are list_data_0.json up to list_data_18.json,
// we cycle through them automatically with UNWIND range
UNWIND range(0,18) as inner
CALL apoc.load.json('file:///list_data_'+toString(inner)+'.json') yield value
UNWIND value.items as insti
WITH insti, insti.author as authorname
UNWIND authorname as authors
WITH insti, count(authors) as cntauthors

MERGE (p:Publication{DOI:coalesce(insti.DOI,'NONE')})
ON CREATE SET p.Count= insti.`is-referenced-by-count`,
p.DateTime= insti.indexed.`date-time`,p.IndexYear=insti.indexed.`date-parts`[0][0],
p.IndexMonth=insti.indexed.`date-parts`[0][1],p.IndexDay=insti.indexed.`date-parts`[0][2],
p.Prefix=coalesce(insti.prefix,'NONE'),p.DateDeposited=insti.deposited.`date-time`,
p.Type=coalesce(insti.type,'NONE'),p.Title=coalesce(insti.title[0],'NONE'),
p.URL=coalesce(insti.URL,'NONE'),p.Score=coalesce(insti.score,'NONE'),
p.ContainerTitle=coalesce(insti.`container-title`,'NONE'),
p.Restrictions=coalesce(insti.`content-domain`.`crossmark-restriction`,'NONE'),

p.FunderDOI= coalesce(insti.funder[0].DOI,'NONE'),
p.FunderName= coalesce(insti.funder[0].name,'NONE'),
p.InstitutionName=coalesce(insti.institution.name,'NONE'),
p.InstitutionAcronym=coalesce(insti.institution.acronym,'NONE'),
p.InstitutionDepartment=coalesce(insti.institution.department,'NONE'),
p.UpdateTo= coalesce(insti.`update-to`[0].DOI,'NONE'),

p.Member=coalesce(insti.member,'NONE'),
p.Language=coalesce(insti.language,'NONE'),
p.Issntype=coalesce(insti.`issn-type`[0].type,'NONE'),
p.IssnValue=coalesce(insti.`issn-type`[0].value,'NONE'),
p.LinkURL=coalesce(insti.link[0].URL,'NONE'),
p.Subject=coalesce(insti.subject[0],'NONE'),
p.IntendedApplication=coalesce(insti.link[0].`intended-application`,'NONE'),
p.Publisher=coalesce(insti.publisher,'NONE'),
p.Subtype=coalesce(insti.subtype,'NONE'),
p.OrigCorpus="True"

// range() is inclusive on both ends, so stop at count-1
FOREACH (g in range(0,insti.`reference-count`-1) |
MERGE (k:Publication{DOI:coalesce(insti.reference[g].DOI,'NONE')})
ON CREATE SET k.key=coalesce(insti.reference[g].key,'NONE')
MERGE (p)-[:References]->(k))

FOREACH ( l in range(0,cntauthors-1)|
MERGE (a:Author{First:coalesce(insti.author[l].given,'NONE'),
Last:coalesce(insti.author[l].family,'NONE'),
Orcid:coalesce(insti.author[l].ORCID,'NONE')})
MERGE (i:Institution{Name:coalesce(insti.author[l].affiliation[0].name,'NONE')})
MERGE (a)-[r:BelongsTo]->(i)
MERGE (a)-[s:Authored{Seq:coalesce(insti.author[l].sequence,'NONE')}]->(p)
)

// Optionally complete the details of referenced publications,
// since most reference entries did not contain full details
UNWIND range(0,58) as inner
CALL apoc.load.json('file:///list_doi_'+toString(inner)+'.json') yield value
UNWIND value.items as insti
WITH insti, insti.author as authorname
UNWIND authorname as authors
WITH insti, count(authors) as cntauthors
MATCH (p:Publication{DOI:coalesce(insti.DOI,'NONE')})
SET p.Count= insti.`is-referenced-by-count`,
p.DateTime= insti.indexed.`date-time`,p.IndexYear=insti.indexed.`date-parts`[0][0],
p.IndexMonth=insti.indexed.`date-parts`[0][1],p.IndexDay=insti.indexed.`date-parts`[0][2],
p.Prefix=coalesce(insti.prefix,'NONE'),p.DateDeposited=insti.deposited.`date-time`,
p.Type=coalesce(insti.type,'NONE'),p.Title=coalesce(insti.title[0],'NONE'),
p.URL=coalesce(insti.URL,'NONE'),p.Score=coalesce(insti.score,'NONE'),
p.ContainerTitle=coalesce(insti.`container-title`,'NONE'),
p.Restrictions=coalesce(insti.`content-domain`.`crossmark-restriction`,'NONE'),
p.Member=coalesce(insti.member,'NONE'),
p.Language=coalesce(insti.language,'NONE'),
p.Issntype=coalesce(insti.`issn-type`[0].type,'NONE'),
p.IssnValue=coalesce(insti.`issn-type`[0].value,'NONE'),
p.LinkURL=coalesce(insti.link[0].URL,'NONE'),
p.Subject=coalesce(insti.subject[0],'NONE'),
p.IntendedApplication=coalesce(insti.link[0].`intended-application`,'NONE'),
p.Publisher=coalesce(insti.publisher,'NONE'),
p.Subtype=coalesce(insti.subtype,'NONE'),
p.IsReferenced= "True"

FOREACH ( l in range(0,cntauthors-1)|
MERGE (a:Author{First:coalesce(insti.author[l].given,'NONE'),
Last:coalesce(insti.author[l].family,'NONE'),
Orcid:coalesce(insti.author[l].ORCID,'NONE')})
MERGE (i:Institution{Name:coalesce(insti.author[l].affiliation[0].name,'NONE')})
MERGE (a)-[r:BelongsTo]->(i)
MERGE (a)-[s:Authored{Seq:coalesce(insti.author[l].sequence,'NONE')}]->(p)
)

// Then we create a Publisher node from each Publication's publisher
Match (p:Publication)
MERGE (p1:Publisher{Name:coalesce(p.Publisher,'NONE')})
MERGE (p)-[:WasPublishedby]->(p1)

// Create links from Covid19 papers to Institutions
MATCH (n:COVID19)
WHERE n.InstitutionName <> 'NONE'
MERGE (i:Institution{Name:n.InstitutionName})
MERGE (n)-[:HasInstitution]->(i)

// Create links from Covid19 papers to Funders
MATCH (n:COVID19)
WHERE n.FunderDOI <> 'NONE'
MERGE (f:Funder{DOI:n.FunderDOI,Name:n.FunderName})
MERGE (n)-[:HasFunder]->(f)

// Now we delete the NONE authors
MATCH (n:Author)
WHERE n.First='NONE' and n.Last='NONE' and n.Orcid='NONE'
DETACH DELETE n

// Now we delete the NONE Institutions
MATCH (n:Institution)
WHERE n.Name='NONE'
DETACH DELETE n

// Now we delete the NONE References
MATCH (n:Reference)
WHERE n.key='NONE'
DETACH DELETE n

// Now we delete the NONE Publications
MATCH (n:Publication)
WHERE n.key='NONE'
DETACH DELETE n

// Set Covid19 labels
MATCH (n:Publication)
WHERE n.OrigCorpus = 'True'
SET n:COVID19
RETURN n

// Create labels for ReferencedWork (vs. the original works)
MATCH (n:Publication)
WHERE n.OrigCorpus is null
SET n:ReferencedWork
RETURN n

Sample Output based on initial test load

// There is a need to do data cleansing:
// there are a lot of duplicates among Authors, Institutions and Publishers

// First, export the duplicated Author names
MATCH (a:Author)
WITH a, a.Last+", "+a.First as Author, a.Orcid as OrcidNumber
WHERE size(Author)>1
RETURN Author, count(Author), collect(Author+":"+OrcidNumber)
ORDER by count(Author) desc

(Results show that the duplicate names have different ORCID numbers.)

Note that Neo4j will treat these entities as different persons.

// test for duplicate Publications
Match (p:Publication)
WITH p.DOI as DOI
return distinct DOI as DOIs, count(DOI) as cntDOI
Order by cntDOI desc
limit 1000

// Export Referenced Publications that have incomplete data
Match (p:Publication)
Where p.Title is null
return distinct p.DOI as DOIs
Order by DOIs

Then export as CSV

// Create co-author links
// (compare node identity, so authors who merely share a name part aren't excluded)
Match (n:Author)-[:Authored]->(p:Publication)
MATCH (m:Author)-[:Authored]->(p)
WHERE id(n) <> id(m)
MERGE (n)-[:CoAuthored]->(m)

// Create same-institution links
Match (n:Author)-[:BelongsTo]->(i:Institution)
Match (m:Author)-[:BelongsTo]->(i)
WHERE id(n) <> id(m)
MERGE (n)-[:SameInstitution]->(m)

// Create Author 'HasPublisher' links to Publisher
Match (n:Author)-[:Authored]->(p:Publication)-[:WasPublishedby]->(p1:Publisher)
MERGE (n)-[:HasPublisher]->(p1)

// Then we create HasSamePublisher links ONLY for COVID19 publications
// (this took forever to finish)
Match (n:Author)-[:Authored]->(p1:COVID19)-[:WasPublishedby]->(p:Publisher)
WITH n,p
Match (m:Author)-[:HasPublisher]->(p)
WHERE id(n) <> id(m)
MERGE (n)-[:HasSamePublisher]->(m)

Using the traditional co-occurrence link query above, the EXPLAIN execution plan shows about 41 million rows.

Alternative, faster co-occurrence query using apoc.periodic.iterate:

CALL apoc.periodic.iterate('
MATCH (a:Author)-[:HasPublisher]->(p:Publisher)<-[:HasPublisher]-(b:Author)
WHERE EXISTS(a.Last) AND EXISTS(a.First) AND EXISTS(b.Last) AND EXISTS(b.First)
RETURN a, b, p',
'MERGE (a)-[r:SamePublisher {Name: p.Name}]->(b)',
{batchSize: 1500, parallel:true, iterateList:true})

The EXPLAIN execution plan for this query shows only about 3 million rows read.
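To see why this pattern fans out so quickly, here is the same co-occurrence expansion sketched outside Neo4j, in plain Python (the author and publisher names are purely hypothetical): one row is produced for every pair of authors sharing a publisher, so row counts grow quadratically per publisher.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical author -> publishers mapping (illustrative data only)
author_pubs = {
    "alice": {"Elsevier", "Springer"},
    "bob": {"Elsevier"},
    "carol": {"Springer"},
}

# Invert to publisher -> authors, then enumerate author pairs per publisher.
# This is what (a)-[:HasPublisher]->(p)<-[:HasPublisher]-(b) expands to:
# one row per (publisher, author, author) combination.
by_publisher = defaultdict(set)
for author, pubs in author_pubs.items():
    for p in pubs:
        by_publisher[p].add(author)

pairs = sorted(
    (p, a, b)
    for p, authors in by_publisher.items()
    for a, b in combinations(sorted(authors), 2)
)
# pairs -> [('Elsevier', 'alice', 'bob'), ('Springer', 'alice', 'carol')]
```

With thousands of authors per large publisher, those per-publisher pairs are what drive the row count into the millions.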

Now we run the community and centrality algorithms on this graph.
Publication Centralities:

Degree Centrality for Publications
Node Projection: COVID19
Relationship Projection: References,
natural

MATCH (n:Publication)
RETURN n.DOI, n.Title,n.Centrality_Degree
ORDER by n.Centrality_Degree desc
LIMIT 25

If we expand this node, we see a LOT of other COVID19 publications referencing it!

Betweenness Centrality (Bridges) for Publications
Node Projection: COVID19
Relationship Projection: *,
natural

MATCH (n:Publication)
RETURN n.DOI, n.Title,n.CentralityBetweenness
ORDER by n.CentralityBetweenness desc
LIMIT 25

Page Rank for Publications (Most Influential)
Node Projection: COVID19
Relationship Projection: References,
natural

MATCH (n:COVID19)
RETURN n.DOI, n.Title,n.CentralityPagerank
ORDER by n.CentralityPagerank desc
LIMIT 20

Publication Communities Using Louvain Score
Node Projection: COVID19
Relationship Projection: ANY,
natural

A look at Louvain Community 1037 using Neo4J Bloom :
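For context, Louvain works by greedily maximizing the modularity Q of the partition. As an illustration of what is being optimized (this computes the modularity score of a given partition, not the Louvain algorithm itself), Q can be calculated directly in Python:

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman modularity Q of a partition on an undirected graph.
    edges: list of (u, v) pairs; community: dict node -> community id."""
    m = len(edges)
    deg = defaultdict(int)
    intra = 0
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
        if community[u] == community[v]:
            intra += 1
    # total degree per community, for the expected-edges term
    deg_sum = defaultdict(int)
    for node, d in deg.items():
        deg_sum[community[node]] += d
    # fraction of edges inside communities, minus the expected fraction
    return intra / m - sum((s / (2 * m)) ** 2 for s in deg_sum.values())

# Two disconnected triangles, each assigned its own community
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6)]
part = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
print(modularity(edges, part))  # 0.5
```

The two-triangle partition scores Q = 0.5, while lumping all six nodes into one community scores 0; Louvain repeatedly moves nodes between communities whenever doing so increases Q.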

Jaccard Similarity for Publications
Node Projection: Publication (instead of just COVID19)
Relationship Projection: ANY, natural
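Jaccard similarity between two publications is simply the overlap of their neighbor sets divided by the union of those sets. A minimal Python sketch (the DOI values are illustrative):

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# e.g. two publications sharing 2 of 4 distinct referenced DOIs
print(jaccard({"doi1", "doi2", "doi3"}, {"doi2", "doi3", "doi4"}))  # 0.5
```

The GDS node-similarity procedure computes this same score pairwise over the projected relationships.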

Author Centralities: Degree
Node Projection: Author
Relationship Projection: *, natural

MATCH (n:Author)
RETURN n.Last,n.First,n.Orcid,n.Centrality_Degree
ORDER by n.Centrality_Degree desc
LIMIT 25

Author Centralities: (Approx) Betweenness
Node Projection: Author
Relationship Projection: CoAuthored,
natural

MATCH (n:Author)
RETURN n.Last, n.First, n.Orcid, n.CentralityBetweenness
ORDER by n.CentralityBetweenness DESC LIMIT 25

Author Centralities: PageRank
Node Projection: Author
Relationship Projection: any,
natural

Sample Partial Visualization (limited results)

MATCH (n:Author)
RETURN n.Last, n.First, n.Orcid, n.CentralityPagerank
ORDER by n.CentralityPagerank desc
LIMIT 25

Author Communities using Louvain
Node Projection: Author
Relationship Projection: any, natural

MATCH (n:Author)
RETURN n.CommunityLouvain, count(n) as members, collect(n.Orcid)
ORDER by members desc
LIMIT 25

Natural Language Processing: Named Entity Extraction
Source: https://medium.com/neo4j/using-nlp-in-neo4j-ac40bc92196f
See Annex B for the solution.

Analysis:

// Authors with the highest number of published papers
Match (n:Author)-[:Authored]->(p:COVID19)
WHERE p.Type IN ['journal-article','dataset','peer-review']
Return n.First,n.Last,n.Orcid, count(n) as papers
order by papers desc
LIMIT 100

(The Type filter above restricts results to journal articles, peer reviews, and datasets.)

// Find Published papers with the most authors
Match (n:Author)-[:Authored]->(p:COVID19)
Return p.Title, count(n) as authors
order by authors desc
LIMIT 100

// find authors of most cited (top20) Covid19 publications
MATCH (p:Publication)-[r:References]->(n:COVID19)
WITH n, count(r) as cntrefs
ORDER by cntrefs desc
Limit 20
Match (n)<-[:Authored]-(a:Author)
WITH n, collect(a.First+a.Last) as authors, cntrefs
RETURN n.Title, cntrefs, authors
ORDER BY cntrefs desc
LIMIT 20;

// Faster code
MATCH (p:COVID19)
WHERE p.Type IN ['journal-article','dataset','peer-review']
RETURN p.Title, SIZE( ()-[:References]->(p) ) AS cite_count
ORDER BY cite_count DESC
LIMIT 10;

Find the institutions that the authors of the most cited papers belong to:

MATCH (p:COVID19)
WHERE p.Type IN ['journal-article','dataset','peer-review']
WITH p, SIZE( ()-[:References]->(p) ) AS cite_count
ORDER BY cite_count DESC
LIMIT 100
MATCH (p)<-[:Authored]-(a:Author)-[:BelongsTo]->(i:Institution)
RETURN i.Name, count(p), collect(p.Title) as Papers, collect(a.Last) as Authors
ORDER by count(p) DESC

Which authors are LIKELY to collaborate for future papers?

MATCH (a1:Author)-[:Authored]->(p:COVID19)-[r:SIMILAR_JACCARD]->(p1:COVID19)<-[:Authored]-(a:Author)
WHERE a1.ConnnectComponents=a.ConnnectComponents
return a,a1,p,p1
LIMIT 25

Find VIP Authors

Using Degree Centrality

MATCH (n:Author)
RETURN n
ORDER by n.Centrality_Degree DESC LIMIT 25

Using PageRank

MATCH (n:Author)
RETURN n
ORDER by n.CentralityPagerank DESC LIMIT 25

Using Betweenness Scores:

MATCH (n:Author)
RETURN n
ORDER by n.CentralityBetweenness DESC LIMIT 25

But the question here is: are the most cited papers also the most influential? Here we use PageRank to find out:

MATCH (n:COVID19)
WHERE n.Type IN ['journal-article','dataset','peer-review']
RETURN n.Title,n.CentralityPagerank as PageRank
ORDER by PageRank desc
LIMIT 20
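For intuition: PageRank weights each citation by the rank of the citing paper rather than counting citations equally. A toy power-iteration sketch in Python (illustrative only; the article itself uses the Neo4j GDS implementation):

```python
def pagerank(links, d=0.85, iters=20):
    """Toy PageRank by power iteration. links: node -> list of cited nodes."""
    nodes = set(links) | {t for ts in links.values() for t in ts}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in nodes}
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                # each cited node receives a damped share of u's rank
                share = d * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:
                # dangling node: spread its rank evenly over all nodes
                for v in nodes:
                    new[v] += d * rank[u] / n
        rank = new
    return rank

# A symmetric three-paper citation cycle: every paper converges to 1/3
r = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
```

A paper cited once by a highly ranked paper can outrank one cited many times by obscure papers, which is exactly the gap between raw citation counts and the PageRank table above.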

// Find publishers of the top 20 most-referenced COVID19 publications

MATCH (p:Publication)-[r:References]->(n:Publication)
WHERE n.OrigCorpus='True'
WITH n, count(r) as cntrefs
ORDER by cntrefs desc
LIMIT 20
Match (n)-[:WasPublishedby]->(m:Publisher)
Return n.Title, cntrefs, m.Name
ORDER by cntrefs desc

Find the shortest path between any two nodes:

Match p=shortestPath((a:Author{Last:'Mahase',First:'Elisabeth'})-[*1..6]-(m:Institution{Name:'Singapore General Hospital'}))
return p

Match p=shortestPath((a:Author{Last:'Wiwanitkit',First:'Viroj'})-[*]-(m:Author{Last:'Bennett',First:'Daniel'}))
return p

Match p=shortestPath((i:Institution{Name:'NTUSingapore'})-[*]-(m:Author{Last:'Tan',First:'Heang Kuan Joel'}))
return p

Network Graph Metrics

This helps librarians answer the question: what other papers are related to a research topic, given an initial choice of paper? (See Annex B on the use of entities.)

(To do: compute Louvain among papers, among authors, and among publishers.)
(To do: compute similarity among papers, among authors, and among publishers.)

Finding the longest reference path:

MATCH p1=(n:COVID19)-[r:References*1..12]->(p:COVID19)
RETURN p1, length(p1) as depth
ORDER by depth desc
LIMIT 100

Some Cool Search using Neo4j Bloom

Try the pre-built Search Queries:

//s7 Find 10 Highest PageRanked Publications AND their (First) Authors
MATCH (p:COVID19)
WITH p , p.CentralityPagerank as Rank
ORDER by p.CentralityPagerank desc
limit 10
MATCH (p)-[r:Authored{Seq:'first'}]-(a:Author)
RETURN p,r,a

Backup and Restoration:

https://medium.com/@niazangels/export-and-import-your-neo4j-graph-easily-with-apoc-4ea614f7cbdf

Related Bibliographic Predictive Models

Practice loading JSON:

:param url =>"https://api.stackexchange.com/2.2/questions?pagesize=100&order=desc&sort=creation&tagged=neo4j&site=stackoverflow&filter=!5-i6Zw8Y)4W7vpy91PMYsKM-k9yzEsSC1_Uxlf"

call apoc.load.json($url) yield value
unwind value.items as q
return q.question_id, q.title, q.tags, q.is_answered, q.owner.display_name

Annex A

:param limit => ( 100);
:param config => ({
nodeProjection: 'Publication',
relationshipProjection: {
relType: {
type: 'References',
orientation: 'NATURAL',
properties: {}
}
},
relationshipWeightProperty: null,
dampingFactor: 0.85,
maxIterations: 20
});
:param communityNodeLimit => ( 100);

CALL gds.pageRank.stream($config) YIELD nodeId, score
WITH gds.util.asNode(nodeId) AS node, score
RETURN node.Title, node.Publisher,score
ORDER BY score DESC
LIMIT toInteger($limit);

Annex B

You also need to download apoc-nlp-dependencies-4.1.02.jar from
https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases
and save it into the plugins directory.

Otherwise, you will get a "procedure not found" error. You also need to enable the Google Cloud Natural Language API.

Before we can run this below:

// setup Google Cloud Key

// then call the procedures (trial only)
MATCH (p:Publication{DOI:'10.2196/preprints.19283'})
CALL apoc.nlp.gcp.entities.stream(p, {
key: $apikey,
nodeProperty: "Title"
})
YIELD value
UNWIND value.entities AS entity
RETURN entity.name, entity.salience,entity.type;

Since the trial run was successful, we now execute the actual code to extract the entities and to create the links from the publications to the entities:

MATCH (n:COVID19)
WHERE n.Subject<>'NONE' AND NOT EXISTS(n.Processed)
AND n.Type IN ['journal-article','dataset','peer-review']
WITH n,n.CentralityPagerank as PageRank
ORDER by PageRank desc
LIMIT 1000
WITH n, trim(toUpper(replace(n.Title," Medicine", "-Medicine"))) as Title
CALL apoc.nlp.gcp.entities.stream(n, {key:$apikey,nodeProperty: "Title"})
YIELD value
UNWIND value.entities AS entity
MERGE (e:Entity{Name:toUpper(entity.name),Type:toUpper(entity.type)})
WITH n.DOI as nDOI, entity.salience as nSalience,e
Match (p:COVID19{DOI:nDOI})
set p.Processed=1
MERGE (p)-[:HasEntity{Salience:nSalience}]->(e)

Note: our librarian teammates suggested using the Subject field in addition to the Title field to get the entities. I will re-run this using p.Subject as Title, and then re-run the Jaccard similarity based on Entity and the Jaccard similarity based on ANY fields.

Note: there seems to be a bug with the apoc procedure. Notice we don't just pass the field contents as-is; we need to trim, and replace the " " with "-" in front of "Medicine".
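For clarity, the transformation being applied is small; here is a hypothetical Python mirror of the Cypher expression trim(toUpper(replace(...))), with the same order of operations:

```python
def clean_title(title: str) -> str:
    """Hypothetical helper mirroring the Cypher workaround:
    trim(toUpper(replace(title, " Medicine", "-Medicine")))."""
    # replace runs first (case-sensitive), then uppercase, then trim
    return title.replace(" Medicine", "-Medicine").upper().strip()

print(clean_title("Impact on Medicine "))  # IMPACT ON-MEDICINE
```

Note that the replace is case-sensitive and happens before the uppercasing, matching the nesting of the Cypher functions.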

After the named entities have been extracted and the links to publications created, let us now see which entities are most used by the top publications. (See Annex C.)

MATCH ()-[r:HasEntity]->(e:Entity)
RETURN e, count(r) as occurrence
order by occurrence desc
LIMIT 25

Interesting things to do with Entity:
MATCH (n:Entity) RETURN distinct n.Type LIMIT 25

MATCH (n:Entity)
WHERE n.Type='PERSON'
RETURN n LIMIT 25

"CANDIDATES"
"PATIENTS"
"HEALTH CARE WORKERS"
"RHEUMATOLOGIST"
"POPULATION"
"SMOKERS"
"MEN"
"INDIVIDUALS"
"PEOPLE"
"THERAPY CANDIDATES"
"RESIDENTS"
"STAFF"
"COVID-19"
"CHILDREN"
"NEONATES"
"FAMILIES"
"PSYCHOSOCIAL CONTRIBUTORS"

MATCH (n:Entity)
WHERE n.Type='LOCATION'
RETURN distinct toUpper(n.Name) as Locations

"STATE"
"INDIA"
"COUNTRIES"
"UNITED STATES"
"COVID-19"
"AFRICA"
"WORLD"
"PAKISTAN"
"CROSSROADS"
"NURSING HOMES"
"RUSSIA"
"CLINICAL AND ACADEMIC INSTITUTIONS"
"TAIWAN"
"ITALY"
"NATIONAL CAPITAL TERRITORY OF DELHI"
"DERMATOLOGY OUTPATIENT CLINIC"
"STATES"
"US"
"SUB-SAHARAN AFRICA"

Annex C

We used Tableau Prep to pivot the Subject fields into entities, and then used Neo4j to import them, create Entity nodes, and create links from Publications to Entities.

First, we export all the Subject fields with their DOIs:
MATCH (n:Publication)
WHERE n.Subject<>'NONE' AND n.Type IN ['journal-article','dataset','peer-review']
RETURN n.DOI, n.Subject

We export this as Export-Subject.csv and then import it into Tableau Prep.

The OUTPUT.csv looks like this:

We then opened this in Excel, removed Column B, and removed the blank rows in Column C. We further replaced all occurrences of "AND CHILD HEALTH" with "CHILD HEALTH", then saved it to the import folder as Output-Subject.csv.
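The Tableau Prep/Excel pivot could equally be scripted. Here is a minimal Python sketch, under the assumption that the subjects arrive as a single comma-joined column (the real export layout may differ):

```python
import csv
import io

# Hypothetical stand-in for the exported CSV: a DOI plus a comma-joined
# Subject field. The actual Export-Subject.csv may be shaped differently.
raw = 'DOI,Subject\n10.1/x,"Medicine, Virology"\n10.2/y,Immunology\n'

# Pivot: one (DOI, SUBJECT) row per individual subject, trimmed and uppercased
rows = []
for rec in csv.DictReader(io.StringIO(raw)):
    for subj in rec["Subject"].split(","):
        subj = subj.strip().upper()
        if subj:
            rows.append((rec["DOI"], subj))
# rows -> [('10.1/x', 'MEDICINE'), ('10.1/x', 'VIROLOGY'), ('10.2/y', 'IMMUNOLOGY')]
```

The resulting long-format rows are exactly what the LOAD CSV step below expects: one subject per line, ready to MERGE into Entity nodes.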

We then re-insert that into NEO4j as follows:

LOAD CSV with headers FROM 'file:///Output-Subject.csv' as csvrow
MERGE (e:Entity{Name:csvrow.SubjectField})
MERGE (p:Publication{DOI:csvrow.DOI})
MERGE (e)<-[:HasEntity{Salience:'CleanedData'}]-(p)

The Salience value of 'CleanedData' lets us know that these links did not come from the GCP NLP process.

Written by Wilson Chua
Data Analyst, Startup Founder, Tech Columnist