
Combining aws-java-nio-spi-for-s3 with GATK #8672

Open
wants to merge 1 commit into base: master

Conversation

DanishIntizar

Introduction

Recently, Amazon released aws-java-nio-spi-for-s3, a tool that allows Java-based applications to read from and write to AWS without recompilation. We have since used this tool, together with a locally modified version of GATK, to communicate with AWS. Since we had written the code that enables this communication anyway, we decided to share it; perhaps it can become part of the GATK toolkit in the future.

How does it work?

The user can provide an additional parameter, '--s3', which adds the nio-spi-for-s3-2.0.0-dev-all.jar file to the Java classpath. File locations starting with 's3://' can then be provided, and the corresponding files are read from or written to AWS. When using this option, however, the AWS credentials have to be set correctly; more information on this can be found here. I have not yet implemented this for --spark, due to a lack of need and inexperience with Spark.
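A quick way to see what the classpath addition buys you: java.nio discovers filesystem providers through the ServiceLoader mechanism, so once the jar is on the classpath an 's3' scheme provider appears alongside the JDK's built-in ones. The sketch below (class and method names are our own, not part of GATK or the SPI jar) lists which schemes are installed:

```java
import java.nio.file.spi.FileSystemProvider;

public class S3ProviderCheck {
    // Returns true if a NIO FileSystemProvider for the given URI scheme
    // is discoverable on the current classpath.
    static boolean hasProvider(String scheme) {
        for (FileSystemProvider p : FileSystemProvider.installedProviders()) {
            if (p.getScheme().equalsIgnoreCase(scheme)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "file" is always installed by the JDK; "s3" appears only when
        // the nio-spi-for-s3 jar has been added to the classpath.
        System.out.println("file provider installed: " + hasProvider("file"));
        System.out.println("s3 provider installed:   " + hasProvider("s3"));
    }
}
```

Run with and without the jar on the classpath to confirm the '--s3' option is wiring things up as expected.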

Current Issues

We found some issues for which we do not yet have a solution. If this tool is to be integrated into GATK in the future, these will eventually have to be resolved.

Doesn't work for picard-based tools

First, 'aws-java-nio-spi-for-s3' does not seem to work for (most) Picard tools, since most of them use the java.io.File class, which is limited to local filesystem paths, rather than java.nio.file.Path (we think).
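The distinction can be illustrated with a small sketch (class and helper names are hypothetical): java.io.File treats an 's3://' string as an odd local path and never consults a provider, while java.nio dispatches the URI to an installed FileSystemProvider and fails loudly when none is present:

```java
import java.io.File;
import java.net.URI;
import java.nio.file.FileSystemNotFoundException;
import java.nio.file.Paths;
import java.nio.file.ProviderNotFoundException;

public class FileVsPath {
    // java.io.File knows nothing about URI schemes: it stores the string
    // as a platform path, so File-based tools can never reach S3.
    static boolean existsAsLocalFile(String s3Location) {
        return new File(s3Location).exists();
    }

    // java.nio.file.Path dispatches on the URI scheme to an installed
    // FileSystemProvider; without the SPI jar on the classpath this throws.
    static boolean resolvableViaNio(String s3Uri) {
        try {
            Paths.get(URI.create(s3Uri));
            return true;
        } catch (FileSystemNotFoundException | ProviderNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String loc = "s3://bucket/sample.bam"; // placeholder location
        System.out.println("java.io.File sees a local file: " + existsAsLocalFile(loc));
        System.out.println("java.nio can dispatch the URI:  " + resolvableViaNio(loc));
    }
}
```

With the SPI jar present the second check succeeds, but the first never will, which is why File-based Picard tools stay local-only until they are migrated to Path.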

Issues reading genome reference files from AWS

Secondly, most tools that require a reference genome (e.g. BaseRecalibrator, HaplotypeCaller) do not seem to function when provided with a reference genome file stored on AWS. The error we receive can be found underneath and is much less clear. We believe the issue lies in the interaction between the caching of the indexed reference file and 'aws-java-nio-spi-for-s3', since we verified with a custom Java program that the 'htsjdk' library works as intended when the reference genome is read from AWS.
Notably, some tools do not have this issue, such as the VQSR tools (VariantRecalibrator and ApplyVQSR).
(screenshot of the error output)
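For reference, the standalone htsjdk check mentioned above can be sketched roughly as follows. This is a hypothetical sketch, not the exact program we ran: it assumes htsjdk and the SPI jar are on the classpath, valid AWS credentials are configured, and the bucket path (with its accompanying .fai index and .dict next to it) is a placeholder:

```java
import htsjdk.samtools.reference.ReferenceSequenceFile;
import htsjdk.samtools.reference.ReferenceSequenceFileFactory;
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

public class S3ReferenceCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder S3 location; the .fai and .dict files must sit
        // next to the FASTA, exactly as for a local reference.
        Path fasta = Paths.get(URI.create("s3://my-bucket/ref/genome.fasta"));
        try (ReferenceSequenceFile ref =
                 ReferenceSequenceFileFactory.getReferenceSequenceFile(fasta)) {
            // If htsjdk handles the S3-backed Path correctly, the sequence
            // dictionary loads and contigs can be enumerated.
            System.out.println(ref.getSequenceDictionary().size() + " contigs read");
        }
    }
}
```

A check like this succeeding while the GATK tools fail is what points the suspicion at GATK's caching layer on top of htsjdk rather than at htsjdk itself.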

@DanishIntizar DanishIntizar marked this pull request as draft February 2, 2024 07:56
@DanishIntizar DanishIntizar marked this pull request as ready for review February 2, 2024 13:24
@lbergelson
Member

@DanishIntizar Hello! Thank you for this PR. It's great to see an official plugin from Amazon available. I appreciate that you took the time to make it an optional include. I think if we're going to include it, though, we might as well just add it as one of our normal dependencies. Assuming there aren't any dependency conflicts, it should (always a risky statement) be independent from everything else.

Thanks also for identifying the different issues you mentioned. It's expected that it won't work with most Picard tools, as you discovered, but we're actively in the process of updating more of them to support Paths instead of Files, so that will slowly improve.

The second issue is more worrisome. We regularly use an equivalent provider with Google to read reference files through the exact same code, so I suspect there is some sort of mismatched assumption in the way they are handling things. Maybe something strange with the Path.resolve methods or the like. (Or, in the much worse potential case, a bug in their look-ahead caching.)

I'd like to look into that before we'd merge this. Ideally we would have tests for this. Are there any public AWS paths we could read from without any secret authentication?

@DanishIntizar
Author

Hello! Unfortunately, I am unable to provide public AWS paths myself, since I tested it using our own AWS credentials. What I can do, however, is provide the reference data we used (although I believe you have enough testing data already). I could also put you in contact with the developer of this tool, if you want.
