gedcom-to-mongo

Ingest a file in GEDCOM format into MongoDB.

Summary

Execution

Testing

Conversion Conventions

Example MongoDB Queries

Integrating with R

Summary

Embed a lineage-linked Genealogical Data Communication (GEDCOM) data model, which is structured around individuals and families, into a MongoDB database.

An example of the GEDCOM file format is:

0 HEAD
1 SOUR PAF
2 NAME Personal Ancestral File
2 VERS 5.0
1 DATE 30 NOV 2000
1 GEDC
2 VERS 5.5
2 FORM LINEAGE-LINKED
1 CHAR ANSEL
1 SUBM @U1@
0 @I1@ INDI
1 NAME John /Smith/
1 SEX M
1 FAMS @F1@
0 @I2@ INDI
1 NAME Elizabeth /Stansfield/
1 SEX F
1 FAMS @F1@
0 @I3@ INDI
1 NAME James /Smith/
1 SEX M
1 FAMC @F1@
0 @F1@ FAM
1 HUSB @I1@
1 WIFE @I2@
1 MARR
1 CHIL @I3@
0 @U1@ SUBM
1 NAME Submitter
0 TRLR

The ingestion of this file results in the following MongoDB structure:

/* 1 */
{
    "_id" : "@I1@",
    "name" : {
        "name" : "John /Smith/"
    },
    "sex" : "M",
    "fams" : "@F1@"
}

/* 2 */
{
    "_id" : "@I2@",
    "name" : {
        "name" : "Elizabeth /Stansfield/"
    },
    "sex" : "F",
    "fams" : "@F1@"
}

/* 3 */
{
    "_id" : "@I3@",
    "name" : {
        "name" : "James /Smith/"
    },
    "sex" : "M",
    "famc" : "@F1@"
}

Execution

Dependencies are contained in requirements.txt:
```
$pip install -r requirements.txt
```
Ingesting a data file, say data/wiki.ged, may be done as follows:
```
$python main.py -i data/wiki.ged
```
Usage information may be gathered from the main runner via:
```
$python main.py -h
```

Testing

Unittests may be run via:

  python -m unittest tests

Two sample files are provided in data/

Conversion Conventions

Keys in the GEDCOM file (e.g. FAMS, NAME, ...) are retained as keys in MongoDB, as far as possible. The instances where this is not the case, is:
- when an individuals record is edited, a change record is added. In this instance, the DATE and TIME are concatenated to form chan_date, an ISO datetime:

  1 CHAN
  2 DATE 14 Jun 2020
  3 TIME 18:24:09
  
  >> chan_date: '2020-06-14 18:24:09.000Z'

ISO dates are used wherever possible to allow querying over time ranges.
An individual record (e.g. @I2@) is currently used as the document _id in MongoDB.
An embedded database structure is sought, but a secondary family database has been retained until the data is fully embedded. These two databases are the
- person, and
- family databases.
It is possible to join between the them in MongoDB.
Transformation are always repeatable and embedded in the data flow. There are currently the following transformations applied to the incoming data:
- embed_p1:
  - [parents] insert the parents of each individual in a list (there can be more than two parents)
  - [married_count] count the number of times an individual has been married
  - [spouses] create a list of spouses that an individual had
  - [children_count] provide a counter of the number of children an individual had
Each attribute

Example MongoDB Queries

Q: Get all husbands and their marriage dates

db.coll.aggregate([
    {
        "$lookup": {
            "from": "family",
            "localField": "_id",
            "foreignField": "husb",
            "as": "husband"
        }
    },
    {
        "$match": {
            "husband": {"$ne": []}
        }
    },
    {
        "$unwind": "$husband"
    },
    {
        "$project": {"husband.marr": 1}
    }, 
])

Q: Which town had the most marriages

db.coll.aggregate([
     {
        "$match": { "plac": {$ne: null}}
     },
     {  
        "$group": {
            "_id": "$plac",
            "count": {
                "$sum": 1
            }
        }
    },
    {
        "$sort": {"_id":-1}
    }
])

Q: Average number of children per family

db.coll.aggregate([
    {
        "$match":{ "chil": {"$ne": null} }
    },
    {
        "$group":
        {
            "_id": "res",
            "total":
            {
                "$sum": { "$size": "$chil" }
            },
            "nr_arrays":
            {
                "$sum": 1
            }
        }
    },
    {
        "$project":
        {
              "total": 1, "nr_arrays": 1, "average": {"$divide": ["$total", "$nr_arrays"]}
        }
    }
])

Q: Get all ancestors of of an individual @I3@

db.coll.aggregate(
    [ 
        { "$graphLookup": { 
            "from": "coll", 
            "startWith": "$parents", 
            "connectFromField": "parents", 
            "connectToField": "_id", 
            "as": "ancestors"
        }}  , 
        { "$match": { "_id": "@I3@" } }, 
        { "$addFields": { 
            "ancestors": { 
                "$reverseArray": { 
                    "$map": { 
                        "input": "$ancestors", 
                        "as": "t", 
                        "in": { "_id": "$$t._id" }
                    } 
                } 
            }
        }},
        { "$project": {"ancestors": 1}}
    ]
)

Integrating with R

R packages requirements include:
- mongolite
todo: an R notebook will be provided (e.g. here)
Connecting to a MongoDB instance with database gedcom, collections person and family, and located on a host with IP 10.22.14.2 and port 27017

library(mongolite)
   person_coll = mongo(collection = "person", db = "gedcom", url = "mongodb://10.22.14.2:27017")
   family_coll = mongo(collection = "family", db = "gedcom", url = "mongodb://10.22.14.2:27017")

Name		Name	Last commit message	Last commit date
Latest commit History 102 Commits
datatests		datatests
tests		tests
.coveragerc		.coveragerc
.gitignore		.gitignore
.travis.yml		.travis.yml
LICENSE		LICENSE
README.md		README.md
app_logger.py		app_logger.py
build.py		build.py
dbinterface.py		dbinterface.py
elements.py		elements.py
embedding_person.py		embedding_person.py
export.sh		export.sh
ingest.py		ingest.py
main.py		main.py
requirements.txt		requirements.txt
settings.py		settings.py
utils.py		utils.py

License

nand8a/gedcom-to-mongo

Folders and files

Latest commit

History

Repository files navigation

gedcom-to-mongo

Summary

Execution

Testing

Conversion Conventions

Example MongoDB Queries

Integrating with R

About

Topics

Resources

License

Stars

Watchers

Forks

Languages