

Genomic language model pretraining with the HealthOmics sequence store

In this repo we show how you can use AWS infrastructure (AWS HealthOmics and Amazon SageMaker) to easily and cost-effectively pre-train a genomic language model, HyenaDNA.

Solution Overview

Architecture Diagram

The architecture diagram illustrates the solution for training the HyenaDNA model using data stored in an AWS HealthOmics sequence store.

  1. Read the Data: Data is read from an external genomic data source, such as GenBank.
  2. Load the Data into the Store: The data is then loaded into an AWS HealthOmics sequence store using the data-loading SageMaker notebook (see the sketch after this list).
  3. Start Training Job: The training and deployment SageMaker notebook initiates a training job on Amazon SageMaker.
  4. Read the Data from the Sequence Store: The training job reads the data through the sequence store's Amazon S3 access point.
  5. Download Model Checkpoint: A pre-trained HyenaDNA model checkpoint is downloaded from Hugging Face.
  6. Save Trained Model: The trained model is saved once training completes.
  7. Deploy Trained Model: The trained model is then deployed with Amazon SageMaker as a real-time endpoint.
  8. Inference: Finally, the model serves inference requests through the deployed SageMaker real-time endpoint.
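
To make steps 2 and 4 concrete, here is a minimal sketch of loading data into a sequence store and retrieving its S3 access point with the boto3 HealthOmics (`omics`) client. The store name, role ARN, S3 paths, and source file type are illustrative placeholders; the load-genome-to-sequence-store.ipynb notebook is the actual implementation used in this repo.

```python
import boto3

omics = boto3.client("omics")

# Create a sequence store (skip this if you already have one).
store = omics.create_sequence_store(name="genomic-lm-pretraining")  # placeholder name
store_id = store["id"]

# Step 2: import genomic files from S3 into the store as read sets.
# The role ARN, S3 paths, and metadata are placeholders; use a source file type
# that your HealthOmics sequence store accepts for your data.
omics.start_read_set_import_job(
    sequenceStoreId=store_id,
    roleArn="arn:aws:iam::111122223333:role/HealthOmicsImportRole",  # placeholder role
    sources=[
        {
            "sourceFiles": {"source1": "s3://my-bucket/genomes/genome_001.fasta"},
            "sourceFileType": "FASTA",  # assumption; match the format the data-loading notebook produces
            "subjectId": "subject-001",
            "sampleId": "sample-001",
            "name": "genome_001",
        }
    ],
)

# Step 4: the training job reads directly from the store's S3 access point.
s3_access = omics.get_sequence_store(id=store_id).get("s3Access", {})
print(s3_access.get("s3Uri"))
```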

Installation & First Steps

  1. Clone this repo in a SageMaker notebook.
  2. If you don't already have your genomic data (FASTA files) in a sequence store, use the load-genome-to-sequence-store.ipynb notebook to load it.
  3. Use the hyenaDNA-training.ipynb notebook to initiate a training job (a simplified sketch of such a job follows below).
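
For orientation, here is a rough sketch of what launching such a training job and deploying the result can look like with the SageMaker Python SDK's HuggingFace estimator. The entry point, framework versions, instance types, checkpoint name, and hyperparameters are assumptions for illustration; hyenaDNA-training.ipynb contains the actual configuration.

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

# S3 access point URI of the HealthOmics sequence store
# (see get_sequence_store()["s3Access"]["s3Uri"] in the earlier sketch).
sequence_store_s3_uri = "s3://<your-sequence-store-access-point-uri>/"  # placeholder

# Illustrative estimator; the entry point, versions, and hyperparameters are assumptions,
# not the exact configuration used in hyenaDNA-training.ipynb.
estimator = HuggingFace(
    entry_point="train.py",          # hypothetical training script
    source_dir="scripts",            # hypothetical source directory
    role=role,
    instance_type="ml.g5.2xlarge",
    instance_count=1,
    transformers_version="4.28",
    pytorch_version="2.0",
    py_version="py310",
    hyperparameters={
        # HyenaDNA checkpoints are published on Hugging Face, e.g. under the LongSafari organization.
        "model_checkpoint": "LongSafari/hyenadna-small-32k-seqlen-hf",
        "epochs": 10,
    },
)

# Launch training with the sequence store's S3 access point as the training channel.
estimator.fit({"training": sequence_store_s3_uri})

# Deploy the trained model behind a SageMaker real-time endpoint for inference.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")
```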

Results

Plot: evaluation loss

This plot shows the evaluation loss of the HyenaDNA model over training epochs. The loss decreases sharply early in training and then plateaus, indicating that the model is converging.

Plot: evaluation perplexity

This plot shows the evaluation perplexity of the HyenaDNA model over training epochs. The rapid initial decrease followed by stabilization indicates that the model's ability to predict the data improved quickly at first and then held steady as training progressed.

Plot: training loss

This plot shows the training loss of the HyenaDNA model over training steps. The loss drops sharply at first, as is typical when the model begins to learn from the data, then decreases more gradually and stabilizes around 1.15, suggesting that the model is converging.

Plot: training perplexity

This plot shows the training perplexity of the model over training steps. The perplexity drops sharply early on, then declines gradually and stabilizes around 3.2, indicating that the model becomes steadily better at predicting the training data as training progresses. The low, stable plateau is consistent with the convergence seen in the loss curves.
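
Note that perplexity is simply the exponential of the cross-entropy loss, which is why the loss and perplexity curves tell the same story:

```python
import math

# A training-loss plateau near 1.15 corresponds to a perplexity plateau near
# exp(1.15) ≈ 3.16, consistent with the ~3.2 level seen in the plot above.
print(math.exp(1.15))  # 3.158...
```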

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
