AQ2009 benchmark dataset #292
…09) aftershock sequence.
Hi @mbagagli, thanks a lot! This looks like a very thorough PR. My only remark is that I would remove the combined dataset. I think it will be rare that users want to request both ground motion units and counts within the same dataset. We have a combined dataset for INSTANCE, but there it's for combining noise and event waveforms. Could you remove the extra class? I'll take care of moving a copy of the data to our servers. As I'm currently travelling, I'm not sure when I'll get to it. Maybe late next week, but potentially only two weeks from now.
Done! Thank you for the consideration.
Let me know if there's anything more to do, or if I can be of any help.
Hi @mbagagli, I just noticed that the data does not have a predefined split into training, development and test data. I would like to add a split because it improves the reproducibility of experiments. As your data is already published with a DOI, I'd suggest just adding it to the SeisBench version. I can do that before uploading the data. However, I'd like to discuss how to choose the split. Any recommendations from your side?
Hello @yetinam, indeed I didn't put any specification for train/dev/test splitting, as I thought the end user would like to choose it freely depending on the specific task. On the other hand, I totally understand the need for a "split" column in a framework like SeisBench. For quality checks and usage tests, I used to roughly split it in a sequential way (70% - 10% - 20% of the dataset).
# import pandas as pd
# df = pd.read_csv("PATH/TO/AQ2009/META")
from sklearn.model_selection import train_test_split
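# 70% train; the remaining 30% is held out for test and dev
df_train, _df_temp = train_test_split(
    df, test_size=0.3,
    stratify=df['PICK_COUNT'], random_state=42)
# Of the held-out 30%, 2/3 goes to test (20% overall) and 1/3 to dev (10% overall)
df_test, df_dev = train_test_split(
    _df_temp, test_size=1/3,
    stratify=_df_temp['PICK_COUNT'], random_state=42)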
print(f"\nSize of TRAIN subset: {len(df_train)}")
print("Distribution in TRAIN subset:")
print(df_train['PICK_COUNT'].value_counts(normalize=True))
print(f"\nSize of TEST subset: {len(df_test)}")
print("Distribution in TEST subset:")
print(df_test['PICK_COUNT'].value_counts(normalize=True))
print(f"\nSize of DEV subset: {len(df_dev)}")
print("Distribution in DEV subset:")
print(df_dev['PICK_COUNT'].value_counts(normalize=True))
# ... Then of course, extracting the index and populating the `split` column ...
What do you think? Of course, you can also choose a different split proportion, based on the rule of thumb used for the other datasets' splits.
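For illustration, the last step mentioned in the snippet (extracting the index and populating the `split` column) could look roughly like the sketch below. It is not the code that was used for the published dataset: the metadata table is a synthetic stand-in, and the label values `train`, `dev` and `test` are assumed here rather than taken from the PR.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the AQ2009 metadata (hypothetical values, not real data)
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"PICK_COUNT": rng.choice([1, 2, 3], size=1000, p=[0.5, 0.3, 0.2])}
)

# Split the row labels instead of the rows, so they can be written back later
all_idx = df.index.to_numpy()
train_idx, temp_idx = train_test_split(
    all_idx, test_size=0.3,
    stratify=df["PICK_COUNT"], random_state=42)
test_idx, dev_idx = train_test_split(
    temp_idx, test_size=1/3,
    stratify=df.loc[temp_idx, "PICK_COUNT"], random_state=42)

# Populate the `split` column (label names are an assumption)
df["split"] = "train"
df.loc[dev_idx, "split"] = "dev"
df.loc[test_idx, "split"] = "test"

print(df["split"].value_counts())
print(df.groupby("split")["PICK_COUNT"].value_counts(normalize=True))
```

Fixing `random_state` keeps the assignment reproducible, which is the point of shipping a predefined split with the dataset.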
Sounds good! I'll add it to the datasets and make sure they're identical in the version with counts and the one with ground motion units.
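One possible way to check that two metadata tables carry identical splits is sketched below. The key column `trace_name`, the helper `splits_match` and the toy tables are all hypothetical and only serve to show the idea; the actual check used for the upload is not described in this thread.

```python
import pandas as pd

# Toy metadata tables (hypothetical); `trace_name` is an assumed key column
counts_meta = pd.DataFrame({
    "trace_name": ["a", "b", "c", "d"],
    "split": ["train", "train", "dev", "test"],
})
gm_meta = pd.DataFrame({
    "trace_name": ["d", "a", "c", "b"],
    "split": ["test", "train", "dev", "train"],
})


def splits_match(left, right, key="trace_name"):
    """Return True if both tables hold the same traces with the same split."""
    merged = left[[key, "split"]].merge(
        right[[key, "split"]],
        on=key, how="outer",
        suffixes=("_left", "_right"),
        indicator=True,
    )
    # Every trace must be present in both tables ...
    same_rows = (merged["_merge"] == "both").all()
    # ... and carry the same split label in both
    same_split = (merged["split_left"] == merged["split_right"]).all()
    return bool(same_rows and same_split)


print(splits_match(counts_meta, gm_meta))
```

Joining on a key rather than comparing the columns position by position keeps the check valid even if the two tables are stored in a different row order.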
Sorry, this is taking a little longer than I hoped. I've added the split; however, I noticed that the data format group is missing. Can you confirm the component order of the dataset? Is it ZNE?
No problem! Sure, I confirm that the component order is ZNE. You may as well contact me in private if problems arise.
Sorry, I just realized the component order was already defined... I got confused by a warning message. I've made a small modification to the PR and am currently uploading the files. It should be ready to merge later today. Thanks a lot again for the contribution.
With this pull request, I would like to suggest distributing the following benchmark dataset within SeisBench:
AQ2009 – The 2009 Aquila Mw 6.1 earthquake aftershocks seismic dataset for machine learning application
https://doi.org/10.13127/AI/AQUILA2009
Documentation (with a dedicated figure) and the related classes have already been created.