Skip to content

Scrapping diseases information from Mayo Clinic and saving it in Neo4j

License

Notifications You must be signed in to change notification settings

fediazgon/mayoclinic-scrapper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

mayoclinic-scrapper

Does not work anymore. It also infringes Mayo's Clinic TOS

Scrapping diseases information from Mayo Clinic and saving it in Neo4j

SetupUsageData formatPossible improvements

screenshot

Setup

This project has been developed using Python 3 (Python 2 may work). You need to install Scrapy and the Neo4j Bolt Driver for Python. Execute the following command in the project's root directory to install all the required dependencies (using a virtual environment is recommended):

pip install -r requirements

A running Neo4j instance is needed. For development purposes, the easiest way of starting an instance is running the official Neo4j Docker image using this command:

docker run --publish=7474:7474 --publish=7687:7687 --env=NEO4J_AUTH=none neo4j

Usage

There are two scripts: scraper.py and neo4j_importer.py. The first one does not need any parameter and it will extract diseases data from the Mayo Clinic's diseases and conditions index, generating a JSON file. The second script receives this file as a parameter and will import the data into the Neo4j instance at http://localhost:7687. If you haven not used the command in the previous section to start Neo4j, make the necessary modifications in the second script.

Finally go to the Neo4j dashboard and start playing! For example, in the next gif you can see how the causes that are related with more than 3 diseases are retrieved:

demo

Data format

This file generated by the first script is a JSON Array containing the extracted diseases. An example of a disease is:

{
   "disease_id": 0,
   "disease_name": "Sweet's syndrome",
   "causes": [
      {
         "cause_id": 0,
         "cause_name": "Sex"
      },
      {
         "cause_id": 1,
         "cause_name": "Age"
      },
      {
         "cause_id": 2,
         "cause_name": "Cancer"
      }
   ],
   "risk_factors": [
      {
         "risk_id": 0,
         "risk_name": "Sex"
      },
      {
         "risk_id": 1,
         "risk_name": "Age"
      },
      {
         "risk_id": 2,
         "risk_name": "Cancer"
      }
    ]
}

The data from this file is inserted in Neo4j with the following schema:

(d:Disease { id, name })-[:CAUSED_BY]->(:Cause { id, name })
(d:Disease { id, name })-[:HAS_RISK]->(:RiskFactor { id, name })

Possible improvements

The scrapping is imperfect. There are some disease's causes that should be processed as the same one. For example, Smoking can also appear as Smokin or You smoke. It would be cool to extract entities from the text and perform some fuzzy matching. Maybe using NLTK.

The same idea could be applied to extract the symptoms, since in the webpage the symptoms are contained in a free-text box as opposed to causes and risk factors that are bullet points.

About

Scrapping diseases information from Mayo Clinic and saving it in Neo4j

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages