
Reproducibility issue with different splitters. Every time I rerun the splitter I get different train test scores. What can I do to reproduce the results? The differences are significant. #3897

Open
v-saini opened this issue Mar 13, 2024 · 3 comments


@v-saini

v-saini commented Mar 13, 2024

❓ Questions & Help

I am having issues reproducing my results even though I have set the seed parameter. For example, in the code below, every time I rerun the splitter and fit the model I get quite different train/test results. I would not have bothered had the difference been small: sometimes I get a 0.92 R2 score and sometimes 0.72. I have checked that the split is the same every time, but the graph objects are different. What can I do for reproducibility?

import pandas as pd
import deepchem as dc

df = pd.read_csv('file.csv')
with dc.utils.UniversalNamedTemporaryFile(mode='w') as tmpfile:
    df.to_csv(tmpfile.name)
    loader = dc.data.CSVLoader(["task1"], feature_field="smiles",
                               featurizer=dc.feat.ConvMolFeaturizer(per_atom_fragmentation=False))
    dataset = loader.create_dataset(tmpfile.name)

model = dc.models.GraphConvModel(n_tasks=1, mode='regression', dropout=0.2)

butinasplitter = dc.splits.ButinaSplitter()
train_dataset, test_dataset = butinasplitter.train_test_split(dataset, frac_train=0.80, seed=2)
model.fit(train_dataset, nb_epoch=100)
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print("Training set score:", model.evaluate(train_dataset, [metric]))
print("Test set score:", model.evaluate(test_dataset, [metric]))

@v-saini v-saini changed the title Reproducibility issue with different splitters Reproducibility issue with different splitters. Every time I rerun the splitter I get different train test scores. What can I do to reproduce the results? The differences are significant. Mar 13, 2024
@narashimha05

I can give you some suggestions; try these:
- make sure the model is trained with the same parameters every time.
- make sure the data is preprocessed the same way every time (see the sketch after this list for one way to check).
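
For the second point, here is a minimal sketch of one way to check that the loaded data is identical across runs (the dataset_fingerprint helper is just an illustration, not a DeepChem API; dataset is the object created by loader.create_dataset above):

import hashlib

def dataset_fingerprint(ds):
    # Hash the compound IDs of a DeepChem dataset; if this digest
    # changes between runs, the loaded/featurized input differs
    h = hashlib.sha256()
    for mol_id in ds.ids:
        h.update(str(mol_id).encode())
    return h.hexdigest()

# Should print the same digest on every rerun
print(dataset_fingerprint(dataset))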

@v-saini
Author

v-saini commented Mar 14, 2024

- there is no data preprocessing; the file just has the SMILES and the task.
- I did not change the parameters; the model instance is defined with default parameters. It is the code below that I rerun and get different results from.

butinasplitter = dc.splits.ButinaSplitter()
train_dataset, test_dataset = butinasplitter.train_test_split(dataset, frac_train=0.80, seed=2)
model.fit(train_dataset, nb_epoch=100)
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print("Training set score:", model.evaluate(train_dataset, [metric]))
print("Test set score:", model.evaluate(test_dataset, [metric]))

@Aditya-dom

Hey @v-saini
To improve reproducibility in your code and obtain consistent train-test splits and model results, you can follow these steps:

  1. Read the CSV file once and store it, rather than reloading it multiple times.
  2. Create the file for saving the dataset only once.
  3. Use the same seed for the random number generators, the splitter, and the model.
  4. Load the dataset from that file without creating a new one every time.
  5. Instead of using UniversalNamedTemporaryFile, use a regular file path inside a temporary directory to save the dataset and load it back from the same location. This avoids creating a new file on every run, which can lead to different results due to different file paths or permissions.

Here's how you can modify your code:

import os
import tempfile

import numpy as np
import pandas as pd
import tensorflow as tf
import deepchem as dc

# Read the CSV file once
df = pd.read_csv('file.csv')

# Create a temporary directory for the dataset and fix the seed
tmpdir = tempfile.TemporaryDirectory()
seed = 2

# Seed the random number generators directly; GraphConvModel runs on
# TensorFlow, and NumPy seeding covers the data pipeline
np.random.seed(seed)
tf.random.set_seed(seed)

# Save the dataset to a regular file path inside the temporary directory
csv_path = os.path.join(tmpdir.name, 'task1.csv')
df.to_csv(csv_path)

loader = dc.data.CSVLoader(["task1"], feature_field="smiles",
                           featurizer=dc.feat.ConvMolFeaturizer(per_atom_fragmentation=False))
dataset = loader.create_dataset(csv_path)

model = dc.models.GraphConvModel(n_tasks=1, mode='regression', dropout=0.2)

# Use the same seed for the splitter
butinasplitter = dc.splits.ButinaSplitter()
train_dataset, test_dataset = butinasplitter.train_test_split(dataset, frac_train=0.80, seed=seed)
model.fit(train_dataset, nb_epoch=100)
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)

# Print the train and test set scores
print("Training set score:", model.evaluate(train_dataset, [metric]))
print("Test set score:", model.evaluate(test_dataset, [metric]))

# Clean up the temporary directory when done
tmpdir.cleanup()

Now, every time you run your code, it should give you consistent train-test splits and model results, since a fixed seed is used for each source of randomness. Note that some TensorFlow operations can still be nondeterministic on GPU, so small run-to-run differences may remain there.
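
As a quick sanity check (a minimal sketch, assuming the dataset and seed variables from the code above), you can also verify that the split itself is deterministic by comparing compound IDs across two runs; if the assertions pass but the scores still differ, the remaining variation comes from model training rather than the splitter:

# Split the same dataset twice with the same seed
splitter = dc.splits.ButinaSplitter()
train_a, test_a = splitter.train_test_split(dataset, frac_train=0.80, seed=seed)
train_b, test_b = splitter.train_test_split(dataset, frac_train=0.80, seed=seed)

# If these pass, the splitter is deterministic, and any remaining
# score variation comes from weight initialization, dropout, or
# other nondeterminism in training
assert list(train_a.ids) == list(train_b.ids)
assert list(test_a.ids) == list(test_b.ids)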
