AQ2009 benchmark dataset #292

Merged
merged 3 commits into seisbench:main
Jun 5, 2024

Conversation

mbagagli
Contributor

With this pull request, I would like to suggest the distribution of the following benchmark dataset within Seisbench:

AQ2009 – The 2009 Aquila Mw 6.1 earthquake aftershocks seismic dataset for machine learning application
https://doi.org/10.13127/AI/AQUILA2009

Documentation (with a dedicated figure) and the corresponding classes are already created.
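For context, a rough sketch of how such dataset classes are usually structured in SeisBench, following the seisbench.data.BenchmarkDataset pattern (the class body below is illustrative only, not the actual code from this PR):

import seisbench.data as sbd


class AQ2009Counts(sbd.BenchmarkDataset):
    """AQ2009 aftershock benchmark dataset, waveforms in digital counts (illustrative sketch)."""

    def __init__(self, **kwargs):
        # Placeholder citation string; the real class cites the dataset DOI
        citation = "https://doi.org/10.13127/AI/AQUILA2009"
        super().__init__(citation=citation, **kwargs)

    def _download_dataset(self, writer, **kwargs):
        # Benchmark datasets hosted on the SeisBench servers are fetched via the
        # repository lookup, so no direct download routine is sketched here.
        raise NotImplementedError("Download via the SeisBench remote repository")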

@yetinam added the dataset extension (Integration of dataset) label on May 10, 2024
@yetinam
Member

yetinam commented May 10, 2024

Hi @mbagagli,

Thanks a lot! This looks like a very thorough PR. My only remark is that I would remove the combined dataset. I think it will be rare that users want to request both ground motion units and counts within the same dataset. We have a combined dataset for INSTANCE, but there it's for combining noise and event waveforms. Could you remove the extra class?

I'll take care of moving a copy of the data to our servers. As I'm currently travelling, I'm not sure when I'll get to it. Maybe late next week but potentially only in two weeks from now.

@mbagagli
Contributor Author

Done! Thank you for the consideration.
You can download the bz2 files here (https://www.pi.ingv.it/banche-dati/aquila2009). Once expanded, the folders should be renamed following the class names I provided in the PR:

  • AQ2009Counts
  • AQ2009GM

Let me know if there's anything more to do, or if I can be of any help.
Looking forward to hearing from you.
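For illustration, a hypothetical usage example once the data is hosted and the PR is merged (class names as defined in the PR; the metadata attribute follows the standard SeisBench dataset interface):

import seisbench.data as sbd

# Hypothetical usage after the datasets are available from the SeisBench servers
data_counts = sbd.AQ2009Counts()  # waveforms in digital counts
data_gm = sbd.AQ2009GM()          # waveforms in ground motion units
print(data_counts.metadata.head())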

@yetinam
Member

yetinam commented May 22, 2024

Hi @mbagagli, I just noticed that the data does not have a predefined split into training, development, and test data. I would like to add a split because it improves the reproducibility of experiments. As your data is already published with a DOI, I'd suggest just adding it to the SeisBench version. I can do that before uploading the data. However, I'd like to discuss how to choose the split. Any recommendations from your side?

@mbagagli
Contributor Author

mbagagli commented May 23, 2024

Hello @yetinam, indeed I didn't specify any train/dev/test split, as I thought end-users would prefer to choose it freely depending on their specific task. On the other hand, I totally understand the need for a "split" column in a framework like SeisBench.

For quality checks and usage tests, I roughly split it sequentially (70% - 10% - 20% of the dataset labelled as train, dev, and test, respectively). However, given the type of dataset and the advantage of having multiple picks in the same window cut, I would instead split it based on the PICK_COUNT column, so that the pick distribution is equal across the three sub-datasets. Something like:

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset metadata (placeholder path)
df = pd.read_csv("PATH/TO/AQ2009/META")

# First split: 70% train / 30% held out, stratified on the number of picks per window
df_train, _df_temp = train_test_split(
                        df, test_size=0.3,
                        stratify=df['PICK_COUNT'], random_state=42)

# Second split: the held-out 30% becomes 20% test and 10% dev
df_test, df_dev = train_test_split(_df_temp, test_size=1/3,
                                   stratify=_df_temp['PICK_COUNT'], random_state=42)

print(f"\nSize of TRAIN subset: {len(df_train)}")
print("Distribution in TRAIN subset:")
print(df_train['PICK_COUNT'].value_counts(normalize=True))


print(f"\nSize of TEST subset: {len(df_test)}")
print("Distribution in TEST subset:")
print(df_test['PICK_COUNT'].value_counts(normalize=True))


print(f"\nSize of DEV subset: {len(df_dev)}")
print("Distribution in DEV subset:")
print(df_dev['PICK_COUNT'].value_counts(normalize=True))

# ... Then of course, extracting the index and populating the `split` column ...
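# A minimal sketch of that last step (train_test_split keeps the original
# DataFrame index, so the sub-frames' indices can be used directly):
df['split'] = 'train'
df.loc[df_dev.index, 'split'] = 'dev'
df.loc[df_test.index, 'split'] = 'test'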

What do you think? Of course, you could choose a different split proportion, also following the rule of thumb used for the other datasets.

@yetinam
Member

yetinam commented May 24, 2024

Sounds good! I'll add it to the datasets and make sure they're identical in the version with counts and the one with ground motion units.
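For reference, a hypothetical consistency check could look like the following, assuming both variants ship the standard SeisBench metadata.csv with trace_name and split columns (the paths are placeholders):

import pandas as pd

# Hypothetical check that both dataset variants assign identical splits per trace
meta_counts = pd.read_csv("AQ2009Counts/metadata.csv")
meta_gm = pd.read_csv("AQ2009GM/metadata.csv")

merged = meta_counts.merge(meta_gm, on="trace_name", suffixes=("_counts", "_gm"))
assert (merged["split_counts"] == merged["split_gm"]).all(), "Splits differ between variants"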

@yetinam
Member

yetinam commented Jun 4, 2024

Sorry, this is taking a little longer than I hoped. I've added the split; however, I noticed that the data format group is missing. Can you confirm the component order of the dataset? Is it ZNE?
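As a side note, a small sketch of how one might inspect this, assuming the usual SeisBench waveforms.hdf5 layout with a data_format group (the file path is a placeholder):

import h5py

# Inspect the data_format group of a SeisBench waveforms file
with h5py.File("AQ2009Counts/waveforms.hdf5", "r") as f:
    if "data_format" in f:
        for key, value in f["data_format"].items():
            print(key, value[()])
    else:
        print("data_format group is missing")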

@mbagagli
Contributor Author

mbagagli commented Jun 5, 2024

No problem! Sure, I confirm that the component order is ZNE.
Please make sure to remove any unnecessary nested folders you may find after unzipping, and to rename the folders according to the new classes' names.

You may also contact me privately if any problems arise.
Thank you again for the support and constant maintenance of this great framework.

@yetinam
Member

yetinam commented Jun 5, 2024

Sorry, just realized the component order was defined... I got confused by a warning message.

I've made a small modification to the PR and am currently uploading the files. It should be ready to merge later today. Thanks a lot again for the contribution.

@yetinam yetinam merged commit 266ef8d into seisbench:main Jun 5, 2024
13 checks passed