Experimental yaml input format #1842

jeromekelleher · 2021-09-18T15:31:39Z

This is an experiment to see what a yaml/json input format (building on demes) would look like. It mostly works I think, except for the basic confusion about the direction of time. We can easily imagine adding to this to allow for things like recombination maps.

Here's an example input file:

demography:
  # This is an **embedded** Demes yaml model.
  time_units: generations
  demes:
    - name: X
      epochs: [{end_time: 1000, start_size: 2000}]
    - name: A
      ancestors: [X]
      epochs: [{start_size: 2000}]
    - name: B
      ancestors: [X]
      epochs: [{start_size: 2000}]

# Note: We are **referring** to the Demes model here.
samples: {A: 100, B: 100}
sequence_length: 100000
recombination_rate: 1e-8
ploidy: 1
model: hudson

The idea is that we embed the Demes yaml description within the larger simulation configuration context. When we're parsing the input yaml, we just hand off the parsing of the demography object to demes-python, which will do all the hard work for us.
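A minimal sketch of that hand-off, not the code in this PR. It assumes demes-python is installed and uses its `demes.Graph.fromdict` classmethod; the `parse_config` name and the `demography_parser` hook are hypothetical, and JSON stands in for YAML so the sketch needs only the standard library.

```python
import json


def parse_config(text, demography_parser=None):
    """Parse a simulation config and hand its demography block to Demes.

    `parse_config` and `demography_parser` are hypothetical names used
    for illustration only.
    """
    if demography_parser is None:
        # Assumption: demes-python is installed. Graph.fromdict builds a
        # graph from an already-parsed dictionary.
        import demes

        demography_parser = demes.Graph.fromdict

    # JSON keeps the sketch stdlib-only; a YAML loader could be swapped
    # in here for json.loads.
    config = json.loads(text)
    # Everything under "demography" is opaque to the simulator config:
    # Demes is responsible for validating and resolving it.
    graph = demography_parser(config.pop("demography"))
    return graph, config
```

The remaining keys (samples, sequence_length and so on) stay with the simulator, which is the separation of concerns the embedded model is meant to show.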

I'm not suggesting this as a general specification for popgen simulations, I just want to illustrate the power that we get from keeping Demes simple and self-contained. To me, the ability to make a simple configuration file for a specific simulator like this is a powerful argument for not over-specifying the standard. The more bells and whistles we add to the spec the less likely it is that it'll be compatible across different simulators.

Any thoughts @molpopgen @grahamgower @apragsdale? I've been talking about simulation configurations being able to "refer" to elements of the Demes model for a while, and this is an attempt to make things concrete. (I guess we shouldn't get into detailed discussions about Demes itself here though: if someone wants to follow up, maybe create an issue on the spec repo to discuss?)

codecov · 2021-09-18T15:43:01Z

Codecov Report

Merging #1842 (2f8956e) into main (6a9c603) will decrease coverage by 0.18%.
The diff coverage is 50.98%.

@@            Coverage Diff             @@
##             main    #1842      +/-   ##
==========================================
- Coverage   90.46%   90.28%   -0.19%     
==========================================
  Files          20       21       +1     
  Lines       10682    10733      +51     
  Branches     2167     2174       +7     
==========================================
+ Hits         9664     9690      +26     
- Misses        572      597      +25     
  Partials      446      446

Flag	Coverage Δ
C	`90.28% <50.98%> (-0.19%)`	⬇️
python	`96.89% <50.98%> (-0.63%)`	⬇️

Impacted Files	Coverage Δ
msprime/json_input.py	`45.16% <45.16%> (ø)`
msprime/cli.py	`96.94% <52.94%> (-1.58%)`	⬇️
msprime/mutations.py	`98.59% <100.00%> (+0.02%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update abe6116...2f8956e.

petrelharp · 2021-09-19T22:50:41Z

Two thoughts:

- It seems like this could be a nice bridge for the folks who aren't comfortable in Python. It might be worth finding some of those people to test it out on.
- Perhaps this should all be within an ancestry: block, to be followed by a mutations: block (and then maybe an output: block?) for a more complete specification?

grahamgower · 2021-09-20T07:07:55Z

msprime/json_input.py

+        # TODO nasty going back to JSON here - can we make a demes.fromdict()
+        # function to do this directly?
+        demes_model = demes.loads(json.dumps(demes_dict), format="json")


demes.Graph.fromdict()?

Aha! Thanks @grahamgower.

grahamgower · 2021-09-20T07:12:24Z

I agree with @petrelharp that it maybe needs to have separate ancestry: and mutations: blocks. But then it doesn't neatly align with the current CLI msp ancestry subcommand. Also, maybe the demography could be either inline or refer to a file path?
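One way the inline-or-path idea could be handled is to dispatch on the type of the `demography` value. This is only a sketch under assumptions: `resolve_demography`, `base_dir`, `from_dict` and `from_file` are hypothetical names, and the defaults rely on `demes.Graph.fromdict` and `demes.load` from demes-python.

```python
from pathlib import Path


def resolve_demography(value, base_dir=".", from_dict=None, from_file=None):
    """Return a Demes graph from an inline mapping or a file path.

    All names here are hypothetical; nothing like this exists in the PR.
    """
    if from_dict is None or from_file is None:
        import demes  # assumption: demes-python is installed

        from_dict = from_dict or demes.Graph.fromdict
        from_file = from_file or demes.load

    if isinstance(value, dict):
        # Inline model: the mapping is the Demes description itself.
        return from_dict(value)
    if isinstance(value, str):
        # File reference: relative paths are resolved against base_dir,
        # for example the directory containing the config file.
        return from_file(Path(base_dir) / value)
    raise TypeError(
        f"demography must be a mapping or a path, got {type(value).__name__}"
    )
```

With this shape, `demography: model.yaml` and an embedded model would both end up as the same kind of graph object before the simulation is set up.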

jeromekelleher · 2021-09-20T09:03:32Z

Thanks, great points @petrelharp and @grahamgower! I think a combined ancestry and mutation format is the right approach, and yes, this would be a good bridge for people who aren't comfortable with Python.

WRT the CLI, I've already created an msp ancestry-yaml subcommand as a quick way of getting something working without having to worry about the semantics of msp ancestry. So, we just need a command to run a simulation from a yaml config. Unfortunately msp simulate is already used as the legacy interface. We could do msp yaml?

jeromekelleher · 2021-09-20T14:12:40Z

Update: I've added the proposed mutations/ancestry sections and the config looks like this now:

ancestry:
  sequence_length: 100000
  recombination_rate: 1e-8
  samples: {A: 100, B: 100}
  ploidy: 1
  model: hudson
  demography:
    time_units: generations
    demes:
      - name: X
        epochs: [{end_time: 1000, start_size: 2000}]
      - name: A
        ancestors: [X]
        epochs: [{start_size: 2000}]
      - name: B
        ancestors: [X]
        epochs: [{start_size: 2000}]

mutations:
  rate: 1e-8
  model: blosum62

To make this fully general we'd need to:

- Add support for reading RateMaps from dictionaries (easy).
- Support parsing Ancestry and Mutation models from dictionaries (should be pretty easy, since this is basically what we turn the classes into anyway). Since the ancestry models use a duration, we actually sidestep the awkward time business.
- Think properly about time and implement start_time and end_time accordingly (but these are pretty niche options, so they could just be dropped).
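As an illustration of the first item, reading a RateMap from a dictionary might look roughly like the sketch below. The `{"position": [...], "rate": [...]}` layout is an assumption that mirrors the keyword arguments of `msprime.RateMap`; `rate_map_args` and `rate_map_from_dict` are hypothetical helper names, not part of msprime.

```python
def rate_map_args(spec):
    """Check a {"position": [...], "rate": [...]} mapping (assumed layout)."""
    position = [float(x) for x in spec["position"]]
    rate = [float(x) for x in spec["rate"]]
    # A rate map has one rate per interval, so it needs one more
    # breakpoint than it has rates.
    if len(position) != len(rate) + 1:
        raise ValueError("position must have exactly one more entry than rate")
    return {"position": position, "rate": rate}


def rate_map_from_dict(spec):
    """Build an msprime.RateMap from the assumed dictionary layout."""
    import msprime  # assumption: msprime 1.x is installed

    # msprime performs its own, stricter validation of the values.
    return msprime.RateMap(**rate_map_args(spec))
```

A `recombination_rate` entry in the config could then accept either a single number or such a mapping.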

apragsdale · 2021-09-20T14:45:01Z

This looks really nice to me. Agree that ancestry/mutations/output blocks make a lot of sense, and those updates look clean. If I'm reading the changes correctly, you can place any valid argument to sim_ancestry and sim_mutations into this yaml? So specify seeds, or more complicated models (e.g. dtwf then switch to hudson), etc. For an "output" block, it might be nice to be able to specify "trees" vs "vcf", plus all the bells and whistles that go with those. Not sure how general you intend this input approach to be.

Overall, I think this would be a nice middle ground between Python scripting and the CLI (which some people find confusing). Looking forward to discussing more today in a bit.
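The reading above, where each block is forwarded as keyword arguments, could be sketched as follows. This is an assumption about the design rather than the PR's implementation: `split_blocks` and `simulate` are hypothetical names, and `simulate` relies on `demes.Graph.fromdict`, `msprime.Demography.from_demes`, `msprime.sim_ancestry` and `msprime.sim_mutations`.

```python
def split_blocks(config, demography_parser):
    """Return (ancestry_kwargs, mutations_kwargs) from a parsed config.

    Hypothetical helper: every key is passed through unchanged except
    the embedded demography, which is converted by demography_parser.
    """
    ancestry = dict(config["ancestry"])  # shallow copy, input untouched
    if "demography" in ancestry:
        ancestry["demography"] = demography_parser(ancestry["demography"])
    mutations = dict(config.get("mutations", {}))
    return ancestry, mutations


def simulate(config):
    """Run ancestry, then mutations if a mutations block is present."""
    import demes  # assumption: demes-python is installed
    import msprime  # assumption: a version providing Demography.from_demes

    def parse(demes_dict):
        graph = demes.Graph.fromdict(demes_dict)
        return msprime.Demography.from_demes(graph)

    ancestry, mutations = split_blocks(config, parse)
    # Because the blocks are forwarded as keyword arguments, options
    # such as random_seed can be given in the config without extra code.
    ts = msprime.sim_ancestry(**ancestry)
    if mutations:
        ts = msprime.sim_mutations(ts, **mutations)
    return ts
```

Structured values, such as a sequence of ancestry models with durations, would still need their own dictionary-to-object conversion before being forwarded.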

molpopgen · 2021-09-20T21:18:50Z

I like the approach overall. I think embedding the demes bits is quite elegant.

Experimental yaml input format

d033b1d

jeromekelleher marked this pull request as draft September 18, 2021 15:31

grahamgower reviewed Sep 20, 2021

Add mutations support for yaml

2f8956e
