Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retain all source IDs on VariantContext merge + test #8752

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

lbergelson
Copy link
Member

@bbimber I've added a test to your pr #8750. Should be good to merge if tests pass.

bbimber and others added 2 commits March 25, 2024 06:35
In GATK3, when merging variants, the IDs of all the source VCFs were retained. This code path seems like it intended that, since the variantSources set is generated, but it doesnt get used for anything. This PR will use that set to set the source of the resulting merged VC.
Copy link
Contributor

@bbimber bbimber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbergelson great, thanks for adding the test

@bbimber
Copy link
Contributor

bbimber commented Mar 27, 2024

@lbergelson: this might be permission-related, but despite entering an approving review github is asking for another. I assume that's b/c I dont have write permissions in this project?

@lbergelson
Copy link
Member Author

@bbimber Hmn, yeah, think it needs someone who has direct write access. I'll get a thumb from a teammate. Thanks for looking at it and for the pr!

Copy link
Collaborator

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbergelson A question for you

@@ -1217,7 +1217,10 @@ public static VariantContext simpleMerge(final Collection<VariantContext> unsort

final String ID = rsIDs.isEmpty() ? VCFConstants.EMPTY_ID_FIELD : Utils.join(",", rsIDs);

final VariantContextBuilder builder = new VariantContextBuilder().source(name).id(ID);
// This preserves the GATK3-like behavior of reporting all sources, delimited with hyphen:
final String allSources = variantSources.isEmpty() ? name : variantSources.stream().sorted().collect(Collectors.joining("-"));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we are merging a large number (say, thousands) of VCF records? Do we still want a concatenated list of the source names in that case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good question. Apparently this is what GATK3 did. @bbimber Any objection to capping it at some reasonable length?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I dont have personal use-cases around this, but on DISCVR-seq (which ported GATK3's CombineVariants) I get a surprising amount of requests from people around the lack-of or changes to this feature.

I have no real objection to some kind of cap on length. What length would you find reasonable? I suppose that value could be parameterized, but this would require exposing an argument to simpleMerge() and maybe overloading that method.

Anyway, this feels like a valid but fringe case to me. I dont have strong opinions one way or another.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Combining hundreds or thousands of VCF records is something we actually do frequently in practice. I think a parameterized option would be overkill, but some kind of sensible hardcoded max length for the source field seems appropriate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. In practice, our source names are small, but if you're talking about 1000+ VCFs then should this value be on the order of 500 characters? 1000? 5k?

In your use case, it actually might be more valuable to have a boolean that prevents storing this value (whatever the length), since it might blow up the size of the output VCF. this might argue for overloading the method with this behavior being opt-in.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbergelson I defer to you to do something reasonable in this PR to address the effects of this change on VCF file size for large cohorts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, after thinking O/N on this I would probably argue for overloading the method with one that adds arguments for "storeAllVcfSources", and "maxSourceLength", but keep the existing signature where storeAllVcfSources=False, and maxSourceLength=-1 (which is moot since the existing behavior of taking the first source would remian). this preserves the status quo, but callers could opt-in to GATK3-like behavior. I am happy to make those changes.

@bbimber
Copy link
Contributor

bbimber commented Apr 12, 2024

Hi @lbergelson and @droazen: it looks like @lbergelson added a limit to avoid excessive sources. Does that satisfy issues remaining on this PR?

Copy link
Collaborator

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbergelson Back to you with another concern

final VariantContextBuilder builder = new VariantContextBuilder().source(name).id(ID);
// This preserves the GATK3-like behavior of reporting multiple sources, delimited with hyphen:
final String allSources = variantSources.isEmpty() ? name : variantSources.stream()
.sorted()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am concerned about doing a sort on the source names for every single merged VCF record we're writing -- with thousands or tens of thousands of samples, that would get expensive!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's reasonable - what about what I proposed a little earlier in this thread and overloading the method to make the existing default case unchanged (list the first source), but have the ability for code to opt-in to the behavior of tracking all sources? I am happy to write that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bbimber That option is looking more attractive -- if you'd like to implement it, please go ahead and tag us for review. If we don't touch the default behavior this will be much easier to get in!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants