
Tensor's device mismatch #96

Open
hzcheney opened this issue Mar 15, 2022 · 3 comments
@hzcheney (Contributor)

Hi! I have found a bug during training of the CASTER model. It is caused by a torch.eye call that does not specify a device: when CUDA is available, torch.eye creates its tensor on the CPU while the whole model is on the GPU.
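The root cause can be illustrated in a couple of lines (a minimal sketch, not the project's code): torch.eye allocates on the CPU unless device= is passed explicitly.

```python
import torch

eye_cpu = torch.eye(3)                    # no device= -> allocated on the CPU
print(eye_cpu.device)                     # cpu

# Passing device= makes the tensor follow the model's device instead.
device = "cuda" if torch.cuda.is_available() else "cpu"
eye_dev = torch.eye(3, device=device)
print(eye_dev.device.type == device)      # True
```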

@cthoyt (Collaborator) commented Mar 15, 2022

Can you please give a code example that reproduces this error as well as copying the full stack trace?

@hzcheney (Contributor, issue author)

To reproduce:

  • Just run the caster_example.py file and you will get the error below.
  0%|                                                    | 0/10 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/hzcheney/DGL/chemicalx/examples/caster_example.py", line 30, in <module>
    main()
  File "/home/hzcheney/DGL/chemicalx/examples/caster_example.py", line 13, in main
    results = pipeline(
  File "/home/hzcheney/DGL/chemicalx/chemicalx/pipeline.py", line 165, in pipeline
    prediction = model(*model.unpack(batch))
  File "/home/hzcheney/miniconda3/envs/chemicalx/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hzcheney/DGL/chemicalx/chemicalx/models/caster.py", line 124, in forward
    dictionary_features_latent = self.encoder(torch.eye(self.drug_channels))
  File "/home/hzcheney/miniconda3/envs/chemicalx/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hzcheney/miniconda3/envs/chemicalx/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/hzcheney/miniconda3/envs/chemicalx/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hzcheney/miniconda3/envs/chemicalx/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/hzcheney/miniconda3/envs/chemicalx/lib/python3.8/site-packages/torch/nn/functional.py", line 1848, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_addmm)

Process finished with exit code 1

Possible solution

I have solved this bug by specifying the device in the torch.eye calls. Just change this

dict_feat_squared_inv = torch.inverse(dict_feat_squared + self.lambda3 * (torch.eye(self.drug_channels)))

dictionary_features_latent = self.encoder(torch.eye(self.drug_channels))

to

dict_feat_squared_inv = torch.inverse(dict_feat_squared + self.lambda3 * (torch.eye(self.drug_channels, device=drug_pair_features_latent.device)))
dictionary_features_latent = self.encoder(torch.eye(self.drug_channels, device=drug_pair_features.device))
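A more self-contained variant of the same idea is to derive the device from the module's own parameters, so the fix does not depend on which input tensor happens to be in scope. This is a hedged sketch, not the actual CASTER code; CasterLikeEncoder and its layer sizes are made up for illustration.

```python
import torch
from torch import nn

class CasterLikeEncoder(nn.Module):
    """Illustrative stand-in for an encoder that feeds itself an identity matrix."""

    def __init__(self, drug_channels: int):
        super().__init__()
        self.drug_channels = drug_channels
        self.encoder = nn.Linear(drug_channels, drug_channels)

    def forward(self) -> torch.Tensor:
        # Taking the device from the module's parameters keeps torch.eye
        # on the same device as the weights, whether the model is on CPU or GPU.
        device = next(self.parameters()).device
        eye = torch.eye(self.drug_channels, device=device)
        return self.encoder(eye)

model = CasterLikeEncoder(8)   # call model.to("cuda") first when CUDA is available
latent = model()
print(latent.shape)            # torch.Size([8, 8])
```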

Another similar bug

To reproduce this one, just run the mhcaddi_example.py file and you will get the error below:

  0%|                                                    | 0/10 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/hzcheney/miniconda3/envs/chemicalx/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 57, in _wrapfunc
    return bound(*args, **kwds)
  File "/home/hzcheney/miniconda3/envs/chemicalx/lib/python3.8/site-packages/torch/_tensor.py", line 680, in __array__
    return self.numpy().astype(dtype, copy=False)
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/hzcheney/DGL/chemicalx/examples/mhcaddi_example.py", line 26, in <module>
    main()
  File "/home/hzcheney/DGL/chemicalx/examples/mhcaddi_example.py", line 13, in main
    results = pipeline(
  File "/home/hzcheney/DGL/chemicalx/chemicalx/pipeline.py", line 165, in pipeline
    prediction = model(*model.unpack(batch))
  File "/home/hzcheney/miniconda3/envs/chemicalx/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hzcheney/DGL/chemicalx/chemicalx/models/mhcaddi.py", line 398, in forward
    outer_segmentation_index_left, outer_index_left, atom_left, bond_left = self._get_molecule_features(
  File "/home/hzcheney/DGL/chemicalx/chemicalx/models/mhcaddi.py", line 374, in _get_molecule_features
    outer_segmentation_index, outer_index = self.generate_outer_segmentation(
  File "/home/hzcheney/DGL/chemicalx/chemicalx/models/mhcaddi.py", line 461, in generate_outer_segmentation
    outer_segmentation_index = [
  File "/home/hzcheney/DGL/chemicalx/chemicalx/models/mhcaddi.py", line 462, in <listcomp>
    np.repeat(np.array(range(0, left_graph_size)), right_graph_size)
  File "<__array_function__ internals>", line 5, in repeat
  File "/home/hzcheney/miniconda3/envs/chemicalx/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 479, in repeat
    return _wrapfunc(a, 'repeat', repeats, axis=axis)
  File "/home/hzcheney/miniconda3/envs/chemicalx/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 66, in _wrapfunc
    return _wrapit(obj, method, *args, **kwds)
  File "/home/hzcheney/miniconda3/envs/chemicalx/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 43, in _wrapit
    result = getattr(asarray(obj), method)(*args, **kwds)
  File "/home/hzcheney/miniconda3/envs/chemicalx/lib/python3.8/site-packages/torch/_tensor.py", line 680, in __array__
    return self.numpy().astype(dtype, copy=False)
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

I think it is caused by the NumPy operations, since NumPy arrays cannot be computed on the GPU: calling np.array on a CUDA tensor triggers the implicit-conversion error above.

outer_segmentation_index = [
    np.repeat(np.array(range(0, left_graph_size)), right_graph_size)
    for left_graph_size, right_graph_size in zip(graph_sizes_left, graph_sizes_right)
]
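One possible direction (a sketch under the assumption that the graph sizes are plain integers or 0-dim tensors, not the project's actual fix) is to build the same index with pure torch ops, which stay on whatever device they are given and never touch NumPy:

```python
import torch

def generate_outer_segmentation(graph_sizes_left, graph_sizes_right, device=None):
    # torch.arange + torch.repeat_interleave reproduce
    # np.repeat(np.array(range(0, left)), right) without leaving the tensor
    # framework, so no CUDA-tensor-to-NumPy conversion is triggered.
    return [
        torch.repeat_interleave(torch.arange(int(left), device=device), int(right))
        for left, right in zip(graph_sizes_left, graph_sizes_right)
    ]

idx = generate_outer_segmentation([2], [3])
print(idx[0].tolist())  # [0, 0, 0, 1, 1, 1]
```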

@Zilu-Zhang commented Jul 8, 2022

Hi all, any updates about the numpy issue? @hzcheney @cthoyt @benedekrozemberczki
