
num_worker training dependency #196

Open

hnbabaei opened this issue Sep 15, 2022 · 8 comments

Comments

@hnbabaei

Hi mufeili,
I have a couple of questions that I would appreciate your help with.
- Changing the number of workers changes the number of epochs required to converge, which is not expected. Also, increasing the number of CPUs increases the training time. Any advice on why these happen?

- Could we use the graph.bin file generated previously to start training without loading the graphs from a .csv file?

Thanks.

@mufeili
Contributor

mufeili commented Sep 16, 2022

Hi, which example are you talking about?

@hnbabaei
Author

Hi, the property_prediction example with csv_data_configuration. I used the regression_train.py code.

@mufeili
Contributor

mufeili commented Sep 29, 2022

Sorry for the late reply.

> Changing the number of workers changes the number of epochs required to converge, which is not expected. Also, increasing the number of CPUs increases the training time. Any advice on why these happen?

Have you eliminated all sources of randomness? By default, regression_train.py does not do so, e.g., it does not fix the random seed. Without eliminating randomness, we cannot make a fair comparison.
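For reference, something like the following (my own sketch, not code that currently exists in regression_train.py) would fix the seeds:

```python
import random

import numpy as np
import torch
import dgl


def set_seed(seed=0):
    """Fix random seeds so that repeated runs are comparable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    dgl.seed(seed)
    # Trade some speed for deterministic cuDNN behavior on GPU.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

Note that even with fixed seeds, PyTorch DataLoader workers each get their own random state, so changing num_workers can still introduce some run-to-run variation.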

> Could we use the graph.bin file generated previously to start training without loading the graphs from a .csv file?

Yes, you can set load=True here, where the dataset is constructed in the example.
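For illustration, here is roughly what that looks like when constructing the dataset with MoleculeCSVDataset directly; the featurizers and file paths below are placeholders rather than the example's exact configuration, so adapt them to your setup:

```python
from functools import partial

import pandas as pd
from dgllife.data import MoleculeCSVDataset
from dgllife.utils import smiles_to_bigraph, CanonicalAtomFeaturizer

df = pd.read_csv('my_data.csv')  # placeholder path
dataset = MoleculeCSVDataset(
    df=df,
    smiles_to_graph=partial(smiles_to_bigraph, add_self_loop=True),
    node_featurizer=CanonicalAtomFeaturizer(),
    edge_featurizer=None,
    smiles_column='smiles',
    cache_file_path='results/graph.bin',  # the previously generated .bin file
    load=True                             # reuse cached graphs instead of re-featurizing
)
```

Note that the CSV is typically still read for the SMILES strings and labels; load=True mainly skips the expensive molecule-to-graph featurization by reusing the cached graph.bin.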

@hnbabaei
Author

hnbabaei commented Sep 30, 2022

Thanks very much for your response.

I am trying to use my own splitting for the train/test/val sets, which is based on splitting the 0 and 1 labels separately. I have a column that holds the split labels. Is there an easy way to do this?
To do this, I have currently added the following lines to classification_train.py and regression_train.py:

I added a -ttvc (--train-test-val-col) argument that indicates the column holding the train-test-val split labels:

parser.add_argument('-ttvc', '--train-test-val-col', default=None, type=str,
                    help='column for train-test-val split labels. If None, we will use '
                         'the default method in dgllife for splitting. '
                         '(default: None)')

And here is the change I made where the data gets read and the splitting is done:

if args['train_test_val_col'] is not None:
    train_set = load_dataset(args, df[df[args['train_test_val_col']] == 'train'])
    test_set = load_dataset(args, df[df[args['train_test_val_col']] == 'test'])
    val_set = load_dataset(args, df[df[args['train_test_val_col']] == 'valid'])
else:
    train_set, val_set, test_set = split_dataset(args, dataset)

Thanks

@hnbabaei
Author

I actually found the SingleTaskStratifiedSplitter class, which I think will do what I want, but I did not see it among the options for the splitting method. I will try to use it. Please let me know if you think this is the correct way to do it.
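For reference, here is roughly how I expect to call it; the argument names are taken from my reading of the dgllife docs, so please correct me if they are off for the installed version:

```python
from dgllife.utils import SingleTaskStratifiedSplitter

# dataset is assumed to be a MoleculeCSVDataset, whose dataset.labels
# is a tensor of shape (num_molecules, num_tasks)
train_set, val_set, test_set = SingleTaskStratifiedSplitter.train_val_test_split(
    dataset,
    labels=dataset.labels,
    task_id=0,          # which task/label column to stratify on
    frac_train=0.8,
    frac_val=0.1,
    frac_test=0.1,
    random_state=42
)
```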

@mufeili
Contributor

mufeili commented Oct 2, 2022

That should work. Feel free to reach out if you encounter any further issues.

@hnbabaei
Author

hnbabaei commented Oct 3, 2022

Thanks, Mufei. Just wondering if the code has ever been used for large-scale datasets (e.g., 100 million molecules). If so, what would you suggest using or changing within the code to make it scalable and memory efficient? Thanks.

@mufeili
Contributor

mufeili commented Oct 4, 2022

> Thanks, Mufei. Just wondering if the code has ever been used for large-scale datasets (e.g., 100 million molecules). If so, what would you suggest using or changing within the code to make it scalable and memory efficient? Thanks.

I have not tested the code at that scale. You will likely need to check whether you have enough memory to load the data at once, or alternatively load the data in batches. You will also need more computational resources, e.g., multi-GPU training. The example here might help.
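As a rough illustration of the batched-loading idea (placeholder paths and chunk size, not something I have tested at that scale): the CSV can be read and featurized in chunks so the full table never has to sit in memory at once.

```python
from functools import partial

import pandas as pd
from dgllife.data import MoleculeCSVDataset
from dgllife.utils import smiles_to_bigraph, CanonicalAtomFeaturizer

# Read the CSV in manageable chunks instead of all at once.
for i, chunk in enumerate(pd.read_csv('big_data.csv', chunksize=1_000_000)):
    chunk_set = MoleculeCSVDataset(
        df=chunk,
        smiles_to_graph=partial(smiles_to_bigraph, add_self_loop=True),
        node_featurizer=CanonicalAtomFeaturizer(),
        edge_featurizer=None,
        smiles_column='smiles',
        cache_file_path=f'results/graph_chunk_{i}.bin'  # one cache file per chunk
    )
    # ... train on chunk_set here, or keep the cached .bin files around
    # so later epochs can reload them with load=True
```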
