AQ2009 benchmark dataset #292

Merged
merged 3 commits into seisbench:main
Jun 5, 2024

Conversation

mbagagli
Contributor

With this pull request, I would like to suggest the distribution of the following benchmark dataset within Seisbench:

AQ2009 – The 2009 Aquila Mw 6.1 earthquake aftershocks seismic dataset for machine learning application
https://doi.org/10.13127/AI/AQUILA2009

Documentation (with a dedicated figure) and the corresponding classes are already created.
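For context, a rough sketch of how such dataset classes are usually structured in SeisBench, following the seisbench.data.BenchmarkDataset pattern (the class body below is illustrative only, not the actual code from this PR):

import seisbench.data as sbd


class AQ2009Counts(sbd.BenchmarkDataset):
    """AQ2009 aftershock benchmark dataset, waveforms in digital counts (illustrative sketch)."""

    def __init__(self, **kwargs):
        # Placeholder citation string; the real class cites the dataset DOI
        citation = "https://doi.org/10.13127/AI/AQUILA2009"
        super().__init__(citation=citation, **kwargs)

    def _download_dataset(self, writer, **kwargs):
        # Benchmark datasets hosted on the SeisBench servers are fetched via the
        # repository lookup, so no direct download routine is sketched here.
        raise NotImplementedError("Download via the SeisBench remote repository")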

@yetinam added the dataset extension (Integration of dataset) label on May 10, 2024
@yetinam
Member

yetinam commented May 10, 2024

Hi @mbagagli,

Thanks a lot! This looks like a very thorough PR. My only remark is that I would remove the combined dataset. I think it will be rare that users want to request both ground motion units and counts within the same dataset. We have a combined dataset for INSTANCE, but there it's for combining noise and event waveforms. Could you remove the extra class?

I'll take care of moving a copy of the data to our servers. As I'm currently travelling, I'm not sure when I'll get to it. Maybe late next week but potentially only in two weeks from now.

@mbagagli
Contributor Author

Done! Thank you for the consideration.
You can download the bz2 files here (https://www.pi.ingv.it/banche-dati/aquila2009). Once expanded, the folders should be renamed following the class names I provided in the PR:

  • AQ2009Counts
  • AQ2009GM

Let me know if there's anything more to do, or if I can be of any help.
Looking forward to hearing from you.
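For illustration, a hypothetical usage example once the data is hosted and the PR is merged (class names as defined in the PR; the metadata attribute follows the standard SeisBench dataset interface):

import seisbench.data as sbd

# Hypothetical usage after the datasets are available from the SeisBench servers
data_counts = sbd.AQ2009Counts()  # waveforms in digital counts
data_gm = sbd.AQ2009GM()          # waveforms in ground motion units
print(data_counts.metadata.head())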

@yetinam
Member

yetinam commented May 22, 2024

Hi @mbagagli, I just noticed that the data does not have a predefined split into training, development, and test data. I would like to add a split because it improves the reproducibility of experiments. As your data is already published with a DOI, I'd suggest just adding it to the SeisBench version. I can do that before uploading the data. However, I'd like to discuss how to choose the split. Any recommendations from your side?

@mbagagli
Contributor Author

mbagagli commented May 23, 2024

Hello @yetinam, indeed I didn't specify any train/dev/test split, as I thought end-users would prefer to choose it freely depending on their specific task. On the other hand, I totally understand the need for a "split" column in a framework like SeisBench.

For quality checks and usage tests, I roughly split it sequentially (70% - 10% - 20% of the dataset labelled as train, dev, and test, respectively). However, given the type of dataset and the advantage of having multiple picks in the same window cut, I would instead split it based on the PICK_COUNT column, so that the pick distribution is equal across the three sub-datasets. Something like:

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset metadata (placeholder path)
df = pd.read_csv("PATH/TO/AQ2009/META")

# First split: 70% train / 30% held out, stratified on the number of picks per window
df_train, _df_temp = train_test_split(
                        df, test_size=0.3,
                        stratify=df['PICK_COUNT'], random_state=42)

# Second split: the held-out 30% becomes 20% test and 10% dev
df_test, df_dev = train_test_split(_df_temp, test_size=1/3,
                                   stratify=_df_temp['PICK_COUNT'], random_state=42)

print(f"\nSize of TRAIN subset: {len(df_train)}")
print("Distribution in TRAIN subset:")
print(df_train['PICK_COUNT'].value_counts(normalize=True))


print(f"\nSize of TEST subset: {len(df_test)}")
print("Distribution in TEST subset:")
print(df_test['PICK_COUNT'].value_counts(normalize=True))


print(f"\nSize of DEV subset: {len(df_dev)}")
print("Distribution in DEV subset:")
print(df_dev['PICK_COUNT'].value_counts(normalize=True))

# ... Then of course, extracting the index and populating the `split` column ...
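# A minimal sketch of that last step (train_test_split keeps the original
# DataFrame index, so the sub-frames' indices can be used directly):
df['split'] = 'train'
df.loc[df_dev.index, 'split'] = 'dev'
df.loc[df_test.index, 'split'] = 'test'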

What do you think? Of course, you could choose a different split proportion, also following the rule of thumb used for the other datasets.

@yetinam
Member

yetinam commented May 24, 2024

Sounds good! I'll add it to the datasets and make sure they're identical in the version with counts and the one with ground motion units.
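For reference, a hypothetical consistency check could look like the following, assuming both variants ship the standard SeisBench metadata.csv with trace_name and split columns (the paths are placeholders):

import pandas as pd

# Hypothetical check that both dataset variants assign identical splits per trace
meta_counts = pd.read_csv("AQ2009Counts/metadata.csv")
meta_gm = pd.read_csv("AQ2009GM/metadata.csv")

merged = meta_counts.merge(meta_gm, on="trace_name", suffixes=("_counts", "_gm"))
assert (merged["split_counts"] == merged["split_gm"]).all(), "Splits differ between variants"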

@yetinam
Member

yetinam commented Jun 4, 2024

Sorry, this is taking a little longer than I hoped. I've added the split; however, I noticed that the data format group is missing. Can you confirm the component order of the dataset? Is it ZNE?
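As a side note, a small sketch of how one might inspect this, assuming the usual SeisBench waveforms.hdf5 layout with a data_format group (the file path is a placeholder):

import h5py

# Inspect the data_format group of a SeisBench waveforms file
with h5py.File("AQ2009Counts/waveforms.hdf5", "r") as f:
    if "data_format" in f:
        for key, value in f["data_format"].items():
            print(key, value[()])
    else:
        print("data_format group is missing")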

@mbagagli
Contributor Author

mbagagli commented Jun 5, 2024

No problem! Sure, I confirm that the component order is ZNE.
Please make sure to remove any unnecessary nested folders you may find after unzipping, and to rename the folders according to the new classes' names.

You may also contact me privately if any problems arise.
Thank you again for the support and constant maintenance of this great framework.

@yetinam
Member

yetinam commented Jun 5, 2024

Sorry, just realized the component order was defined... I got confused by a warning message.

I've made a small modification to the PR and am currently uploading the files. It should be ready to merge later today. Thanks a lot again for the contribution.

@yetinam yetinam merged commit 266ef8d into seisbench:main Jun 5, 2024
13 checks passed