Trouble loading CUDA support under dotnet-interactive (C#) #1146

Open
tombatron opened this issue Nov 14, 2023 · 10 comments

Comments

@tombatron

Hi there!

This may be related to #345, so please bear with me.

I'm trying to use TorchSharp with dotnet-interactive in a Jupyter notebook, and I'm encountering the following behavior:

[screenshot of the notebook output]
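
Roughly, the cell boils down to a package reference plus the CUDA check -- a minimal sketch only; the exact package name and version here are guesses:

```csharp
// Minimal repro sketch (guessed) -- reference the CUDA meta-package and check for a GPU.
#r "nuget: TorchSharp-cuda-linux, 0.101.2"

using System;
using TorchSharp;

// Expected True on a CUDA-capable machine; in this notebook it fails (see the trace further down).
Console.WriteLine(torch.cuda.is_available());
```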

Now, I am running my setup through Docker, so I wondered if perhaps I had an issue there; to check, I made a quick console application to test "connectivity" with my GPU.

[screenshot of the console application output]
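
The console test is essentially the same check, but with the packages pulled in through the project file rather than via #r -- a minimal sketch, assuming a PackageReference to TorchSharp-cuda-linux in the .csproj:

```csharp
// Program.cs -- quick GPU "connectivity" check; TorchSharp-cuda-linux is
// referenced from the .csproj rather than via #r.
using System;
using TorchSharp;

Console.WriteLine($"CUDA available: {torch.cuda.is_available()}");
```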

I'm kind of struggling to get my arms around the issue. What are some next steps I could take?

Cheers!

@NiklasGustafsson
Contributor

I've tried to reproduce this with WSL, but I'm running into a very different problem, one that doesn't even get as far as calling is_available().

@NiklasGustafsson
Contributor

It's worth trying -- and this is a total shot in the dark -- to delete everything *torch* under ~/.nuget/packages/ and then try again. I wonder if there's some sort of package confusion going on when running with .NET Interactive.

@tombatron
Author

Yeah that didn't seem to have any impact. :\

Here is a directory listing of my ~/.nuget/packages directory on the Jupyter server:

drwxr-sr-x 3 jovyan users 4096 Nov 14 15:07 google.protobuf
drwxr-sr-x 3 jovyan users 4096 Nov 14 15:07 ilgpu
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part1
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part2-fragment1
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part2-primary
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part3-fragment1
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part3-fragment2
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part3-fragment3
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part3-primary
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part4-fragment1
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part4-primary
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part5-fragment1
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part5-primary
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part6
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 libtorch-cuda-12.1-linux-x64-part7
drwxr-sr-x 3 jovyan users 4096 Nov 14 15:07 sharpziplib
drwxr-sr-x 3 jovyan users 4096 Nov 14 15:07 skiasharp
drwxr-sr-x 3 jovyan users 4096 Nov 14 15:07 skiasharp.nativeassets.macos
drwxr-sr-x 3 jovyan users 4096 Nov 14 15:07 skiasharp.nativeassets.win32
drwxr-sr-x 3 jovyan users 4096 Nov 14 15:07 system.memory
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 torchsharp
drwxr-sr-x 3 jovyan users 4096 Nov 18 14:57 torchsharp-cuda-linux

Here is the error message:

System.TypeInitializationException: The type initializer for 'TorchSharp.torch' threw an exception.
 ---> System.NotSupportedException: The libtorch-cpu-linux-x64 package version 2.1.0.1 is not restored on this system. If using F# Interactive or .NET Interactive you may need to add a reference to this package, e.g. 
    #r "nuget: libtorch-cpu-linux-x64, 2.1.0.1". Trace from LoadNativeBackend:

TorchSharp: LoadNativeBackend: Initialising native backend, useCudaBackend = False

Step 1 - First try regular load of native libtorch binaries.

    Trying to load native component torch_cpu relative to /home/jovyan/.nuget/packages/torchsharp/0.101.2/lib/net6.0/TorchSharp.dll
    Failed to load native component torch_cpu relative to /home/jovyan/.nuget/packages/torchsharp/0.101.2/lib/net6.0/TorchSharp.dll
    Trying to load native component LibTorchSharp relative to /home/jovyan/.nuget/packages/torchsharp/0.101.2/lib/net6.0/TorchSharp.dll
    Failed to load native component LibTorchSharp relative to /home/jovyan/.nuget/packages/torchsharp/0.101.2/lib/net6.0/TorchSharp.dll
    Result from regular native load of LibTorchSharp is False

Step 3 - Alternative load from consolidated directory of native binaries from nuget packages

    torchsharpLoc = /home/jovyan/.nuget/packages/torchsharp/0.101.2/lib/net6.0
    packagesDir = /home/jovyan/.nuget/packages
    torchsharpHome = /home/jovyan/.nuget/packages/torchsharp/0.101.2
    Trying dynamic load for .NET/F# Interactive by consolidating native libtorch-cpu-linux-x64-* binaries to /home/jovyan/.nuget/packages/torchsharp/0.101.2/lib/net6.0/cpu...
    Consolidating native binaries, packagesDir=/home/jovyan/.nuget/packages, packagePattern=libtorch-cpu-linux-x64, packageVersion=2.1.0.1 to target=/home/jovyan/.nuget/packages/torchsharp/0.101.2/lib/net6.0/cpu...

   at TorchSharp.torch.LoadNativeBackend(Boolean useCudaBackend, StringBuilder& trace)
   at TorchSharp.torch.InitializeDeviceType(DeviceType deviceType)
   at TorchSharp.torch.InitializeDevice(Device device)
   at TorchSharp.torch..cctor()
   --- End of inner exception stack trace ---
   at TorchSharp.torch.TryInitializeDeviceType(DeviceType deviceType)
   at TorchSharp.torch.cuda.is_available()
   at Submission#5.<<Initialize>>d__0.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.CodeAnalysis.Scripting.ScriptExecutionState.RunSubmissionsAsync[TResult](ImmutableArray`1 precedingExecutors, Func`2 currentExecutor, StrongBox`1 exceptionHolderOpt, Func`2 catchExceptionOpt, CancellationToken cancellationToken)
   at TorchSharp.torch.TryInitializeDeviceType(DeviceType deviceType)
   at TorchSharp.torch.cuda.is_available()
   at Submission#5.<<Initialize>>d__0.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.CodeAnalysis.Scripting.ScriptExecutionState.RunSubmissionsAsync[TResult](ImmutableArray`1 precedingExecutors, Func`2 currentExecutor, StrongBox`1 exceptionHolderOpt, Func`2 catchExceptionOpt, CancellationToken cancellationToken)

@tombatron
Author

tombatron commented Nov 29, 2023

A co-worker of mine (@wss-rbrennan) may have shed some light on this issue:

"The problem has more to do with nuget itself. TorchSharp used a clever way of putting together the libtorch-cuda-12.1-linux-x64 package because nuget has a max package size of 250mb. The work around combines multiple packages at build time in a project, so your project works, but interactive doesn't build the same way, so the reference fails."

Not sure if this is a problem per se, or just something to account for when using TorchSharp from within interactive mode?
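
To make that concrete: judging by the package names in the listing above, each oversized native library ships as a "primary" piece plus numbered "fragment" pieces spread across the part packages, and the build step concatenates them back into a single .so. Conceptually it amounts to something like this (a rough sketch only, not TorchSharp's actual restore/MSBuild logic):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Rough illustration only -- not TorchSharp's actual restore code.
// Reassemble one native library from its "primary" piece and its ordered "fragment" pieces.
static class NativeLibStitcher
{
    public static void Reassemble(string outputPath, string primaryPiece, IEnumerable<string> fragmentPieces)
    {
        using var output = File.Create(outputPath);
        foreach (var piece in new[] { primaryPiece }.Concat(fragmentPieces))
        {
            using var input = File.OpenRead(piece);
            input.CopyTo(output); // append each piece, in order, to rebuild the full library
        }
    }
}
```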

@NiklasGustafsson
Contributor

Thank you for the follow-up, and that's sort of what I was seeing, too. But... it used to work!

The stitching together only happens the first time, i.e. when a build finds that the stitched package is not available in the NuGet cache locally.

@tombatron
Author

tombatron commented Nov 29, 2023 via email

@NiklasGustafsson
Contributor

And it works on Windows, which has the same package stitching problem.

@NiklasGustafsson
Contributor

Do you think there is some sort of snippet that could be run to ensure proper stitching?

All I can think of is a dotnet build, but I think you already did that and it worked, so the stitching should already have been done.

@NiklasGustafsson
Contributor

Or, maybe... clear the ~/.nuget/packages cache, as well as anything under ~/.packagemanagement/nuget. Then, build your console program again, then try the .ipynb file again. Another shot in the dark...

@NiklasGustafsson
Contributor

Okay, so after a bunch of finagling, I finally get to where you are -- no blow-up when loading the backend, but is_available() returns false. It works fine when I run one of the TorchExamples on CUDA, and on Windows it works both interactively and in a console app.
