Source-specific stats? #37

Open
josephwb opened this issue Dec 22, 2016 · 3 comments

Comments

@josephwb
Member

I think it might be useful to generate statistics for the individual source trees. For example, how many nodes in the supertree are due to a particular source (that is, it is the highest ranked source that displays the split)? How many nodes in a source simply support a node created by a higher ranked tree? How many nodes in a source conflict (i.e. were overruled) with the supertree? How many nodes in a source are present in the supertree at all? These types of statistics might give us some sort of "impact score" for individual sources.

Something I've been considering with the Aves supertree: can we identify inconsequential source trees (that is, the supertree would be the same without the source being present)? This is slightly different from the simple counts above, as a tree may be inconsequential because it agrees or conflicts with a higher ranking tree.

@mtholder
Member

  1. The number of supertree nodes for which tree XYZ is the highest-ranked source can be calculated from the annotations file. The sources property is a list of trees in ranked order; combined with the annotation for each node in the nodes property, it can be used to answer this question. Let me know if you have questions on how to do that. See https://github.com/OpenTreeOfLife/germinator/wiki/Open-Tree-API-datatypes

  2. "How many nodes in a source simply support a node created by a higher ranked tree?" Same info as mentioned above can answer this.

  3. "How many nodes in a source conflict (i.e. were overruled) with the supertree?" Also implied by the info in the nodes property.

  4. "How many nodes in a source are present in the supertree at all?" Sum of supported_by, partial_path_of and resolves statements that refer to a source tree.

  5. "Can we identify inconsequential source trees (that is, the supertree would be the same without the source being present)?" Hard to do in general without removing the tree and rerunning. If a tree does not contest a taxon (see, for example, the report at http://files.opentreeoflife.org/synthesis/opentree8.0/output/subproblems/index.html#contested) and all of the nodes in the source tree are listed as conflicts_with, then the tree had no effect.

In any other case, it is impossible to tell that a tree has no effect.

A source tree definitely has an effect if there exists a supertree node for which there is a supported_by or resolves entry which only lists that source tree.

Other cases are grey areas.
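
A minimal Python sketch of these three cases, in case it helps. The layout of the annotations dict is an assumption that has not been checked against a real annotations file: `sources` is taken to be a list of source tree IDs, and each entry in `nodes` is taken to map a relation name (`supported_by`, `resolves`, `conflicts_with`, ...) to a dict keyed by source tree ID. The function name `classify_sources`, the `contesting_sources` argument (source IDs copied by hand from the contested-taxa report) and the choice to ignore a `terminal` relation for tips are all hypothetical. The "only lists that source tree" rule is read literally, one entry at a time.

```python
from collections import defaultdict

# Relations under which a single listed source implies a definite effect.
DECISIVE = ("supported_by", "resolves")


def classify_sources(annotations, contesting_sources=(), ignored=("terminal",)):
    """Label each source 'has_effect', 'no_effect' or 'grey_area'.

    Assumed (hypothetical) layout:
      annotations["sources"] -> list of source tree IDs
      annotations["nodes"]   -> {supertree_node: {relation: {source_id: ...}}}
    """
    relations = defaultdict(set)  # source ID -> relation names it appears under
    sole = set()                  # sources that are alone in a decisive entry
    for props in annotations["nodes"].values():
        for relation, by_source in props.items():
            # Skip tip annotations and any non-dict flags on the node.
            if relation in ignored or not isinstance(by_source, dict):
                continue
            for source in by_source:
                relations[source].add(relation)
            if relation in DECISIVE and len(by_source) == 1:
                sole.update(by_source)

    verdicts = {}
    for source in annotations["sources"]:
        if source in sole:
            verdicts[source] = "has_effect"
        elif (relations[source] == {"conflicts_with"}
              and source not in contesting_sources):
            # Only ever overruled, and does not contest a taxon.
            verdicts[source] = "no_effect"
        else:
            verdicts[source] = "grey_area"
    return verdicts
```

This treats "appears only under conflicts_with" as a stand-in for "all of the nodes in the source tree are listed as conflicts_with", which is only right if every internal node of the source shows up somewhere in the annotations.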

Note that one could use the subproblem solver on modified versions of the subproblems to test for effects of including a tree without rerunning the entire pipeline. That would require some scripting. If a source tree does not affect any subproblem solution, then it will not affect the whole tree, unless it is the only tree that contests any particular taxon.

hope that helps...

@josephwb
Member Author

Sounds good. This is what I was thinking as well. Thanks. I wonder if users might want access to such processed information. Alternatively, I could write something (in Python, say) and make it available.
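
A rough sketch of what such a script might compute for the four counts discussed above. It is untested against a real annotations file and rests on assumptions: `sources` is a ranked list of source tree IDs, each node in `nodes` maps a relation name to a dict keyed by source tree ID, and each value is either one source node ID or a list of them. The function name `source_stats` and the count labels are made up for illustration, and "displays the split" is approximated by `supported_by` alone.

```python
from collections import Counter, defaultdict

# Relations counted as "present in the supertree" (question 4).
PRESENT = ("supported_by", "partial_path_of", "resolves")


def _node_ids(value):
    """Accept either a single node ID string or a list of node IDs."""
    return [value] if isinstance(value, str) else list(value)


def source_stats(annotations):
    """Per-source counts from an (assumed) annotations dict."""
    rank = {src: i for i, src in enumerate(annotations["sources"])}
    stats = defaultdict(Counter)
    conflicting = defaultdict(set)  # source ID -> distinct overruled nodes

    for props in annotations["nodes"].values():
        supporters = props.get("supported_by", {})
        if supporters:
            # Lowest index in the ranked list is the highest-ranked source;
            # sources missing from the list sort last.
            top = min(supporters, key=lambda s: rank.get(s, len(rank)))
            for src in supporters:
                key = "highest_ranked" if src == top else "supports_higher_ranked"
                stats[src][key] += 1  # counted per supertree node
        for relation in PRESENT:
            for src, value in props.get(relation, {}).items():
                stats[src]["present"] += len(_node_ids(value))
        for src, value in props.get("conflicts_with", {}).items():
            # A source node can conflict with several supertree nodes,
            # so collect distinct source node IDs.
            conflicting[src].update(_node_ids(value))

    for src, nodes in conflicting.items():
        stats[src]["conflicts"] = len(nodes)
    return {src: dict(counts) for src, counts in stats.items()}
```

With the file loaded through `json.load`, the result is one small dict per source, which could be dumped as a table or combined into some kind of impact score.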

@kcranston
Member

If you write such a script, you can put it in the bin directory of this repo. That's where we have other post-processing tools (like the script to compare two synthesis versions).
