Duplicate studies #5

Open
snacktavish opened this issue Feb 18, 2015 · 12 comments

Comments

@snacktavish
Member

How do we want to handle studies that are duplicated in the database?
For example, https://tree.opentreeoflife.org/curator/study/view/pg_2450 and https://tree.opentreeoflife.org/curator/study/view/pg_2398
In this case, both came in from phylografter.
The curator tip that two studies share the same DOI is very helpful, but it's not clear which one a curator should trust/edit/use.
Do we want to maintain both in Phylesystem or merge them?

@jar398
Member

jar398 commented Feb 18, 2015

I would think we'd like to merge them, but there's no magic - either copy
might be better than the other, in terms of number of trees, quality of OTU
mapping, quality and amount of metadata, and other factors. We could expend
a lot of effort writing an automatic merge tool but I suspect the case is
so rare that doing it by hand would be a better use of time. The 'better'
version should be picked as the one to keep, and then if the other version
has goodies missing, they should be transferred over by hand.

Any idea how many of these there are?

@josephwb
Member

Some are part of synthesis, so we want to keep that copy if only for that reason (although they are probably better curated as well).

@snacktavish
Member Author

Based on a quick grep, there are a lot more than I anticipated: 110 DOIs appear more than once.
The counts are:

Distinct DOIs    Copies of each
92               2
11               3
6                4
1                5
1                9

The DOI that appears 9 times is http://dx.doi.org/10.3732/ajb.94.11.1860.
At first glance, most of the ones I looked at are real duplicates rather than DOI problems.
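
A tally like the one above can be reproduced with a short script. This is only a sketch, not the grep that was actually run: it assumes the study IDs and their DOIs have already been pulled out into a plain dict, and the names `normalize_doi`, `group_by_doi`, and `repeat_histogram`, along with the sample IDs and DOIs, are made up for illustration.

```python
from collections import Counter, defaultdict

# Prefixes to strip so that URL-style and bare DOIs compare equal.
# This list is an assumption; extend it if other forms turn up.
DOI_PREFIXES = (
    "http://dx.doi.org/",
    "https://dx.doi.org/",
    "http://doi.org/",
    "https://doi.org/",
    "doi:",
)


def normalize_doi(doi):
    """Reduce a DOI URL or bare DOI to a lowercase bare DOI."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def group_by_doi(study_dois):
    """Map each DOI that occurs more than once to its sorted study IDs.

    study_dois: dict of study ID -> DOI string (or None/"" if missing).
    """
    groups = defaultdict(list)
    for study_id, doi in study_dois.items():
        if doi:  # studies without a DOI cannot be matched this way
            groups[normalize_doi(doi)].append(study_id)
    return {doi: sorted(ids) for doi, ids in groups.items() if len(ids) > 1}


def repeat_histogram(duplicates):
    """Count how many DOIs are shared by 2 studies, by 3 studies, etc."""
    return Counter(len(ids) for ids in duplicates.values())


if __name__ == "__main__":
    # Hypothetical input; real IDs and DOIs would come from the study files.
    sample = {
        "pg_A": "http://dx.doi.org/10.1234/Example.1",
        "pg_B": "10.1234/example.1",
        "pg_C": "doi:10.1234/example.2",
        "pg_D": None,
    }
    duplicates = group_by_doi(sample)
    for copies, n_dois in sorted(repeat_histogram(duplicates).items()):
        print(n_dois, copies)
```

Matching on the DOI alone will also group studies where a curator entered the wrong DOI, so each group still needs a human look.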

@snacktavish changed the title from "Duplicate DOIs on studies from phylographter" to "Duplicate studies" on Feb 18, 2015
@josephwb
Member

Is there a check that the DOIs are correct? I seem to recall coming across studies with warnings of identical DOIs when the studies were different (i.e. the curator entered the wrong DOI, or accepted the wrong DOI).

@snacktavish
Member Author

I have seen one like that, but the majority appear to be accidental duplicates, often without any trees associated. We have just added the ability to write an informative commit message on delete, which gives us the opportunity to point to the correct study ID in the repo for duplicates. So I guess once those PRs are merged we can just delete them by hand via the curator app, and point in the commit message to the remaining version. I think this requires a human decision, so it can't/shouldn't be automated.

@snacktavish
Member Author

Is deleting with comment only on devtree? I thought I would start deleting some duplicate studies with 'delete' tags, but there is no input of a commit message and correct study ID. Should we take special care with duplicates that are duplicated in phylografter? (I just deleted one, but perhaps shouldn't have...)

@jimallman
Member

On Mar 12, 2015, at 9:21 AM, Emily Jane McTavish notifications@github.com wrote:
Is deleting with comment only on devtree? I thought I would start deleting some duplicate studies with 'delete' tags, but there is no input of a commit message and correct study ID.

Yes, last week’s PR review had nothing major, so we decided not to deploy changes to production.

@snacktavish
Member Author

Sounds good, I'll wait on deleting for a bit. Do we think that putting the correct study ID in the delete commit is sufficient? The correct study is easily found using the DOI.

@jar398
Member

jar398 commented Mar 13, 2015

The deletion comment feature was deployed yesterday.

@snacktavish
Member Author

Cool! Do we have some kind of deletion rules/best practices? Or do we just delete duplicates and rely on the DOIs, so that anyone looking for a study will still find the remaining copy? I lean towards the latter.

@josephwb
Member

josephwb commented Mar 13, 2015

We're preferentially keeping all synth-input studies, yes?

All else being equal, filtering on curator seems like a prudent approach.

@jar398
Member

jar398 commented Oct 1, 2016

Suggestion:

The one to keep is either
(a) the one that's in synthesis, if any, or
(b) the one that's better curated (has uploaded trees, OTU mappings, ingroup, etc.)
Manually move valuable curation effort, if any, from the one(s) not being kept to the one being kept. Then delete the ones to be deleted, with a commit message.

I think in most cases this won't be too difficult, although in theory it could be pretty awful, if both copies have some curation and it's different curation.
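
The rule above could be sketched roughly as follows. This is an illustration under stated assumptions, not an existing tool: `StudyCopy`, its fields, `curation_score`, and `plan_merge` are all hypothetical names, and ranking "better curated" by tree count, then mapped OTUs, then ingroup is only one possible proxy. It picks the copy to keep and drafts the delete messages; moving curation over and doing the deletion stay manual.

```python
from dataclasses import dataclass


@dataclass
class StudyCopy:
    """Hypothetical summary of one copy of a duplicated study."""
    study_id: str
    in_synthesis: bool = False
    n_trees: int = 0
    n_mapped_otus: int = 0
    has_ingroup: bool = False


def curation_score(copy):
    """Rough proxy for curation effort: trees, then mapped OTUs, then ingroup."""
    return (copy.n_trees, copy.n_mapped_otus, int(copy.has_ingroup))


def plan_merge(copies):
    """Return (study ID to keep, delete messages, IDs needing manual review).

    (a) Prefer a copy that is in synthesis, if any.
    (b) Otherwise (or among several synthesis copies) prefer the
        better-curated one according to curation_score.
    """
    if not copies:
        raise ValueError("need at least one study copy")
    in_synth = [c for c in copies if c.in_synthesis]
    keep = max(in_synth or copies, key=curation_score)
    to_delete = [c for c in copies if c is not keep]
    # Copies with any curation may hold work worth transferring by hand first.
    needs_review = [c.study_id for c in to_delete if curation_score(c) > (0, 0, 0)]
    messages = {
        c.study_id: f"Delete duplicate study; remaining copy is {keep.study_id}"
        for c in to_delete
    }
    return keep.study_id, messages, needs_review


if __name__ == "__main__":
    # Made-up study IDs and counts, for illustration only.
    copies = [
        StudyCopy("pg_A", in_synthesis=True, n_trees=1),
        StudyCopy("pg_B", n_trees=3, n_mapped_otus=40),
        StudyCopy("pg_C"),
    ]
    keep, messages, needs_review = plan_merge(copies)
    print("keep:", keep)
    print("check before deleting:", needs_review)
    for study_id, message in messages.items():
        print(study_id, "->", message)
```

The `needs_review` list is the "pretty awful" case: any copy on it has curation of its own that someone should compare against the kept copy before deleting.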
