
Reproducibility issue with different splitters. Every time I rerun the splitter I get different train test scores. What can I do to reproduce the results? The differences are significant. #3897

Open
v-saini opened this issue Mar 13, 2024 · 3 comments


@v-saini

v-saini commented Mar 13, 2024

❓ Questions & Help

I am having issues reproducing my results even though I have set the seed parameter. For example, in the code below, every time I rerun the splitter and fit the model I get quite different train/test results. I would not have bothered had the difference been small: sometimes I get a 0.92 R2 score and sometimes 0.72. I have checked that the split is the same every time, but the graph objects are different. What can I do for reproducibility?

import pandas as pd
import deepchem as dc

df = pd.read_csv('file.csv')
with dc.utils.UniversalNamedTemporaryFile(mode='w') as tmpfile:
    df.to_csv(tmpfile.name)
    loader = dc.data.CSVLoader(["task1"], feature_field="smiles",
                               featurizer=dc.feat.ConvMolFeaturizer(per_atom_fragmentation=False))
    dataset = loader.create_dataset(tmpfile.name)

model = dc.models.GraphConvModel(n_tasks=1, mode='regression', dropout=0.2)

butinasplitter = dc.splits.ButinaSplitter()
train_dataset, test_dataset = butinasplitter.train_test_split(dataset, frac_train=0.80, seed=2)
model.fit(train_dataset, nb_epoch=100)
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print("Training set score:", model.evaluate(train_dataset, [metric]))
print("Test set score:", model.evaluate(test_dataset, [metric]))

@v-saini v-saini changed the title Reproducibility issue with different splitters Reproducibility issue with different splitters. Every time I rerun the splitter I get different train test scores. What can I do to reproduce the results? The differences are significant. Mar 13, 2024
@narashimha05

I can give you some suggestions; try these:
- make sure the model is trained with the same parameters every time.
- make sure the data is preprocessed the same way every time (see the sketch after this list for one way to check).
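
For the second point, here is a minimal sketch of one way to check that the loaded data is identical across runs (the dataset_fingerprint helper is just an illustration, not a DeepChem API; dataset is the object created by loader.create_dataset above):

import hashlib

def dataset_fingerprint(ds):
    # Hash the compound IDs of a DeepChem dataset; if this digest
    # changes between runs, the loaded/featurized input differs
    h = hashlib.sha256()
    for mol_id in ds.ids:
        h.update(str(mol_id).encode())
    return h.hexdigest()

# Should print the same digest on every rerun
print(dataset_fingerprint(dataset))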

@v-saini
Author

v-saini commented Mar 14, 2024

- there is no data preprocessing; the file just has the SMILES and the task.
- I did not change the parameters; the model instance is defined with default parameters. It is the code below that I rerun and get different results from.

butinasplitter = dc.splits.ButinaSplitter()
train_dataset, test_dataset = butinasplitter.train_test_split(dataset, frac_train=0.80, seed=2)
model.fit(train_dataset, nb_epoch=100)
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print("Training set score:", model.evaluate(train_dataset, [metric]))
print("Test set score:", model.evaluate(test_dataset, [metric]))

@Aditya-dom

Hey @v-saini
To improve reproducibility in your code and obtain consistent train-test splits and model results, you can follow these steps:

  1. Read the CSV file once and store it, rather than reloading it multiple times.
  2. Create the file for saving the dataset only once.
  3. Use the same seed for the random number generators, the splitter, and the model.
  4. Load the dataset from that file without creating a new one every time.
  5. Instead of using UniversalNamedTemporaryFile, use a regular file path inside a temporary directory to save the dataset and load it back from the same location. This avoids creating a new file on every run, which can lead to different results due to different file paths or permissions.

Here's how you can modify your code:

import os
import tempfile

import numpy as np
import pandas as pd
import tensorflow as tf
import deepchem as dc

# Read the CSV file once
df = pd.read_csv('file.csv')

# Create a temporary directory for the dataset and fix the seed
tmpdir = tempfile.TemporaryDirectory()
seed = 2

# Seed the random number generators directly; GraphConvModel runs on
# TensorFlow, and NumPy seeding covers the data pipeline
np.random.seed(seed)
tf.random.set_seed(seed)

# Save the dataset to a regular file path inside the temporary directory
csv_path = os.path.join(tmpdir.name, 'task1.csv')
df.to_csv(csv_path)

loader = dc.data.CSVLoader(["task1"], feature_field="smiles",
                           featurizer=dc.feat.ConvMolFeaturizer(per_atom_fragmentation=False))
dataset = loader.create_dataset(csv_path)

model = dc.models.GraphConvModel(n_tasks=1, mode='regression', dropout=0.2)

# Use the same seed for the splitter
butinasplitter = dc.splits.ButinaSplitter()
train_dataset, test_dataset = butinasplitter.train_test_split(dataset, frac_train=0.80, seed=seed)
model.fit(train_dataset, nb_epoch=100)
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)

# Print the train and test set scores
print("Training set score:", model.evaluate(train_dataset, [metric]))
print("Test set score:", model.evaluate(test_dataset, [metric]))

# Clean up the temporary directory when done
tmpdir.cleanup()

Now, every time you run your code, it should give you consistent train-test splits and model results, since a fixed seed is used for each source of randomness. Note that some TensorFlow operations can still be nondeterministic on GPU, so small run-to-run differences may remain there.
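
As a quick sanity check (a minimal sketch, assuming the dataset and seed variables from the code above), you can also verify that the split itself is deterministic by comparing compound IDs across two runs; if the assertions pass but the scores still differ, the remaining variation comes from model training rather than the splitter:

# Split the same dataset twice with the same seed
splitter = dc.splits.ButinaSplitter()
train_a, test_a = splitter.train_test_split(dataset, frac_train=0.80, seed=seed)
train_b, test_b = splitter.train_test_split(dataset, frac_train=0.80, seed=seed)

# If these pass, the splitter is deterministic, and any remaining
# score variation comes from weight initialization, dropout, or
# other nondeterminism in training
assert list(train_a.ids) == list(train_b.ids)
assert list(test_a.ids) == list(test_b.ids)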
