
TorchSharp memory issue #1278

Open

lintao185 opened this issue Apr 2, 2024 · 19 comments
@lintao185

m = torch.nn.Conv2d(3, 64, 7, 2, 3, bias=False).cuda()

for i in range(1000000):
    x = torch.randn(1, 3, 224, 224, dtype=torch.float).cuda()
    y = m.forward(x)

[screenshot]

var m = TorchSharp.torch.nn.Conv2d(3, 64, 7, 2, 3, bias: false).cuda();

for (int i = 0; i < 1000000; i++)
{
    var x = torch.randn(1, 3, 224, 224).@float().cuda();
    var y = m.forward(x);
}

[screenshot]
In PyTorch, GPU memory is released at the appropriate time during GPU inference. In TorchSharp, the equivalent loop leaks GPU memory unless the tensors are released manually.
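The difference comes down to lifetime management: CPython's reference counting frees a tensor's storage the moment the loop rebinds `x`, while a .NET tensor that is not explicitly disposed waits for a nondeterministic GC/finalizer pass. A minimal Python sketch of the mechanism (using a hypothetical `FakeTensor` stand-in rather than a real tensor, so only the lifetime behavior is shown):

```python
import weakref

class FakeTensor:
    """Hypothetical stand-in for a GPU tensor; only its lifetime matters here."""
    pass

refs = []
for i in range(3):
    x = FakeTensor()              # rebinding x drops the last reference to the
    refs.append(weakref.ref(x))   # previous object, so CPython frees it at once

# Every object except the most recent one was reclaimed immediately,
# without waiting for any garbage-collection pass.
dead = sum(1 for r in refs[:-1] if r() is None)
assert dead == 2
```

This immediate, deterministic release is what PyTorch relies on in the loop above; TorchSharp instead needs `using`/`Dispose` or a dispose scope to get the same effect.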

@lintao185
Author

var m = torch.nn.Conv2d(3, 64, 7, 2, 3, bias: false).cuda();
for (int i = 0; i < 1000000; i++)
{
    using var x = torch.randn(1, 3, 224, 224).@float().cuda();
    using var y = m.forward(x);
}

Still, this seems to be the only way it can be written.

@yueyinqiu
Contributor

Perhaps this page could help: https://github.com/dotnet/TorchSharp/wiki/Memory-Management

@lintao185
Author

torch.NewDisposeScope() is a relatively elegant solution, although not as elegant as PyTorch.

@lintao185
Author

I'm considering giving up, as unexpected exceptions occur when torch.NewDisposeScope calls are nested, especially when one function calls another and the called function also opens its own torch.NewDisposeScope. Objects that shouldn't be disposed are being released. It seems training AI in C# is not very realistic, which is quite frustrating.

@yueyinqiu
Contributor

yueyinqiu commented Apr 3, 2024

I'm sorry to hear that, but DisposeScope is designed to work in that case. Could you describe the issue more specifically so that we can fix it?

Oh... since you mentioned that "objects that shouldn't be disposed of are being released", I guess MoveToOuterDisposeScope could work?

@yueyinqiu
Contributor

yueyinqiu commented Apr 3, 2024

It might be because f is not disposed. Hope this could help:

using TorchSharp;

for (long i = 0; i < 10_000_000_000; i++)
{
    using (torch.NewDisposeScope())
    {
        var f = torch.randn(1, 3, 224, 224).@float().cuda();
        using (torch.NewDisposeScope())
        {
            var f3 = torch.randn(1, 3, 224, 224).@float().cuda();
            f[..] = f3;
        }
    }
}
Console.ReadKey();

By the way, are you a Chinese user? I have just created a QQ group (957204993), so perhaps we could discuss there with instant messages, which might be more convenient.

@lintao185
Author

public static class Ops
{
    public static Tensor clip_boxes(Tensor boxes, int[] shape)
    {
        using (torch.NewDisposeScope())
        {
            boxes[TensorIndex.Ellipsis, 0] = boxes[TensorIndex.Ellipsis, 0].clamp(0, shape[1]);
            boxes[TensorIndex.Ellipsis, 1] = boxes[TensorIndex.Ellipsis, 1].clamp(0, shape[0]);
            boxes[TensorIndex.Ellipsis, 2] = boxes[TensorIndex.Ellipsis, 2].clamp(0, shape[1]);
            boxes[TensorIndex.Ellipsis, 3] = boxes[TensorIndex.Ellipsis, 3].clamp(0, shape[0]);
            return boxes.MoveToOuterDisposeScope();
        }
    }

    public static Tensor scale_boxes(int[] img1_shape, Tensor boxes, int[] img0_shape, (int, int)[] ratio_pad = null!, bool padding = true, bool xywh = false)
    {
        using (torch.NewDisposeScope())
        {
            double gain;
            (double, double) pad;
            if (ratio_pad == null)
            {
                gain = Math.Min(img1_shape[0] * 1.0 / img0_shape[0], img1_shape[1] * 1.0 / img0_shape[1]);
                pad = (
                    Math.Round((img1_shape[1] - img0_shape[1] * gain) / 2 - 0.1),
                    Math.Round((img1_shape[0] - img0_shape[0] * gain) / 2 - 0.1)
                );
            }
            else
            {
                gain = ratio_pad[0].Item1;
                pad = ratio_pad[1];
            }

            if (padding)
            {
                boxes[TensorIndex.Ellipsis, 0] -= pad.Item1;
                boxes[TensorIndex.Ellipsis, 1] -= pad.Item2;
                if (!xywh)
                {
                    boxes[TensorIndex.Ellipsis, 2] -= pad.Item1;
                    boxes[TensorIndex.Ellipsis, 3] -= pad.Item2;
                }
            }

            boxes[TensorIndex.Ellipsis, ..4] /= gain;
            return clip_boxes(boxes, img0_shape).MoveToOuterDisposeScope();
        }
    }
}
public abstract class OutputData : IDisposable
{
    public abstract void Dispose();

    public abstract List<dynamic> ToList();
    public abstract OutputData MoveToOuterDisposeScope();
    ~OutputData()
    {
        Dispose();
    }
}
public class DetectPredictData : OutputData
{
    public Tensor Y { get; set; }
    public List<Tensor> X { get; set; }

    public override void Dispose()
    {
        Y?.Dispose();
        X?.ForEach(x => x.Dispose());
    }

    public override OutputData MoveToOuterDisposeScope()
    {
        Y?.MoveToOuterDisposeScope();
        X.ForEach(x => x.MoveToOuterDisposeScope());
        return this;
    }

    public override List<dynamic> ToList()
    {
        return [Y,X];
    }
}
for (int i = 0; i < 10000000; i++)
{
    using (torch.NewDisposeScope())
    {
        OutputData data = new DetectPredictData()
        {
            Y = torch.randn(3, 84, 8400).@float().cuda(),
            X = [
                torch.randn(3, 144, 80,80).@float().cuda(),
                torch.randn(3, 288, 40,40).@float().cuda(),
                torch.randn(3, 576, 20,20).@float().cuda(),
                ]
        };
        var pre = data.ToList();
        using (torch.NewDisposeScope())
        {
            Tensor f = pre[0];
            int[]? f1 = [1080, 2];
            var boxes = Ops.scale_boxes([564, 640], f[.., ..4], f1);
        }
        data.Dispose();
    }
}

[screenshot]
After a day's work, I've located the source of the memory leak. I'm now simulating it for reproduction (with about an 80% reproduction rate; there are some unexplainable phenomena, so the remaining 20% could not be reproduced). This means the fix for the simulated leak does not apply to my project, and conversely, the fix for the leak in my project does not apply to this simulation code. It's quite awkward!

@yueyinqiu
Contributor

yueyinqiu commented Apr 4, 2024

I suppose that Ops.clip_boxes and Ops.scale_boxes should not invoke MoveToOuterDisposeScope().

That's because boxes is created here:

[screenshot]

So its associated dispose scope is:

[screenshot]

After calling MoveToOuterDisposeScope once, its dispose scope becomes:

[screenshot]

And after calling it twice, there is no dispose scope for it at all. Then it leaks.

(Only the tensors/parameters created inside a dispose scope are automatically attached to it, and in-place operations will not change a tensor's dispose scope.)
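The bookkeeping described above can be modeled in a few lines of Python. This is a toy sketch, not TorchSharp's real implementation; the method names merely mirror the C# API, and `Resource` is a hypothetical stand-in for a tensor. It reproduces the leak: the second `move_to_outer_dispose_scope` call detaches the object from every scope, so nothing ever disposes it.

```python
class DisposeScope:
    """Toy model of TorchSharp's DisposeScope: objects created inside a
    scope are tracked by it and disposed when the scope exits."""
    _current = None

    def __init__(self):
        self.tracked = []
        self.outer = None

    def __enter__(self):
        self.outer = DisposeScope._current
        DisposeScope._current = self
        return self

    def __exit__(self, *exc):
        DisposeScope._current = self.outer
        for obj in self.tracked:
            obj.dispose()

class Resource:
    """Stand-in for a tensor: attaches to the scope it is created in."""
    def __init__(self):
        self.disposed = False
        self.scope = DisposeScope._current
        if self.scope is not None:
            self.scope.tracked.append(self)

    def dispose(self):
        self.disposed = True

    def move_to_outer_dispose_scope(self):
        # Moves relative to the object's own scope, one level outward.
        if self.scope is not None:
            self.scope.tracked.remove(self)
            self.scope = self.scope.outer
            if self.scope is not None:
                self.scope.tracked.append(self)
        return self

with DisposeScope():
    t = Resource()                          # attached to the outer scope
    with DisposeScope():
        t2 = Resource()                     # attached to the inner scope
        t2.move_to_outer_dispose_scope()    # inner -> outer: fine
        t2.move_to_outer_dispose_scope()    # outer -> nothing: detached
assert t.disposed                           # outer scope freed it
assert not t2.disposed                      # no scope owns t2 any more: a leak
```

This is why `clip_boxes` followed by `scale_boxes`, each calling `MoveToOuterDisposeScope()` on the same tensor, walks it out past both scopes.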

@lintao185
Author

Yes, the simulated code resolves the issue once MoveToOuterDisposeScope() is removed, but for the code where the actual memory leak occurs, I cannot handle it that way. I need to modify the code as follows, which is very confusing to me.

public static Tensor clip_boxes(Tensor boxes, int[] shape)
{
    using (torch.NewDisposeScope())
    {
        boxes[TensorIndex.Ellipsis, 0] = boxes[TensorIndex.Ellipsis, 0].clone().clamp(0, shape[1]);
        boxes[TensorIndex.Ellipsis, 1] = boxes[TensorIndex.Ellipsis, 1].clone().clamp(0, shape[0]);
        boxes[TensorIndex.Ellipsis, 2] = boxes[TensorIndex.Ellipsis, 2].clone().clamp(0, shape[1]);
        boxes[TensorIndex.Ellipsis, 3] = boxes[TensorIndex.Ellipsis, 3].clone().clamp(0, shape[0]);
        return boxes.MoveToOuterDisposeScope();
    }
}

It works just by adding .clone(), which is very bizarre.

@yueyinqiu
Contributor

yueyinqiu commented Apr 4, 2024

You could just remove that:

    public static Tensor clip_boxes(Tensor boxes, int[] shape)
    {
        using (torch.NewDisposeScope())
        {
            boxes[TensorIndex.Ellipsis, 0] = boxes[TensorIndex.Ellipsis, 0].clamp(0, shape[1]);
            boxes[TensorIndex.Ellipsis, 1] = boxes[TensorIndex.Ellipsis, 1].clamp(0, shape[0]);
            boxes[TensorIndex.Ellipsis, 2] = boxes[TensorIndex.Ellipsis, 2].clamp(0, shape[1]);
            boxes[TensorIndex.Ellipsis, 3] = boxes[TensorIndex.Ellipsis, 3].clamp(0, shape[0]);
            return boxes;
        }
    }

I suppose there is no problem with this. Are you worried about any other things?

@lintao185
Author

No, no, no: your code causes a memory leak in my project, but adding .clone() fixes it. However, the simulated code still leaks memory even with .clone() added. Please trust me, there is still an issue with torch.NewDisposeScope().

@yueyinqiu
Contributor

yueyinqiu commented Apr 4, 2024

Actually, your clone alone cannot solve this problem. You could use it on the return value instead:

    public static Tensor clip_boxes(Tensor boxes, int[] shape)
    {
        using (torch.NewDisposeScope())
        {
            boxes[TensorIndex.Ellipsis, 0] = boxes[TensorIndex.Ellipsis, 0].clamp(0, shape[1]);
            boxes[TensorIndex.Ellipsis, 1] = boxes[TensorIndex.Ellipsis, 1].clamp(0, shape[0]);
            boxes[TensorIndex.Ellipsis, 2] = boxes[TensorIndex.Ellipsis, 2].clamp(0, shape[1]);
            boxes[TensorIndex.Ellipsis, 3] = boxes[TensorIndex.Ellipsis, 3].clamp(0, shape[0]);
            return boxes.clone().MoveToOuterDisposeScope();
        }
    }

Or:

    public static Tensor clip_boxes(Tensor boxes, int[] shape)
    {
        using (torch.NewDisposeScope())
        {
            boxes = boxes.clone();
            boxes[TensorIndex.Ellipsis, 0] = boxes[TensorIndex.Ellipsis, 0].clamp(0, shape[1]);
            boxes[TensorIndex.Ellipsis, 1] = boxes[TensorIndex.Ellipsis, 1].clamp(0, shape[0]);
            boxes[TensorIndex.Ellipsis, 2] = boxes[TensorIndex.Ellipsis, 2].clamp(0, shape[1]);
            boxes[TensorIndex.Ellipsis, 3] = boxes[TensorIndex.Ellipsis, 3].clamp(0, shape[0]);
            return boxes.MoveToOuterDisposeScope();
        }
    }

But I suppose there is no reason to use clone and keep MoveToOuterDisposeScope. Could there be something else in your project causing the memory leak?

@yueyinqiu
Contributor

yueyinqiu commented Apr 4, 2024

By the way, you could track the tensor's dispose scope here:

[screenshot]

Hope this could help when debugging.

@lintao185
Author

@yueyinqiu
Contributor

Hmm, I'm really not sure about that. Is it possible to share the whole project with me?

@lintao185
Author

Sorry about that, it’s not convenient at the moment.

@yueyinqiu
Contributor

yueyinqiu commented Apr 4, 2024

My only guess is that, because of the higher memory usage (main memory, not GPU memory), the garbage collector is activated and thus the escaped tensors are released?

@lintao185
Author

Not too sure; I haven't found the exact cause yet. It's a bit odd.

@lintao185
Author

lintao185 commented Apr 4, 2024

public static Tensor clip_boxes(Tensor boxes, int[] shape)
{
    using (torch.NewDisposeScope())
    {
        boxes[TensorIndex.Ellipsis, 0].clamp_(0, shape[1]);
        boxes[TensorIndex.Ellipsis, 1].clamp_(0, shape[0]);
        boxes[TensorIndex.Ellipsis, 2].clamp_(0, shape[1]);
        boxes[TensorIndex.Ellipsis, 3].clamp_(0, shape[0]);
        return boxes;
    }
}

This can also solve the problem of memory leaks.
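This works because in-place operations like `clamp_` mutate the existing storage instead of allocating temporaries that the scope would have to track and free. A toy Python model of the difference (the `Buf` class is hypothetical, standing in for a tensor; only the allocation count matters):

```python
# Count allocations to contrast out-of-place vs in-place operations.
allocations = 0

class Buf:
    """Hypothetical tensor stand-in: each construction is one allocation."""
    def __init__(self, data):
        global allocations
        allocations += 1
        self.data = list(data)

    def clamp(self, lo, hi):   # out-of-place: a new Buf per call
        return Buf(min(max(v, lo), hi) for v in self.data)

    def clamp_(self, lo, hi):  # in-place: mutates, no allocation
        self.data = [min(max(v, lo), hi) for v in self.data]
        return self

b = Buf([-5, 3, 99])
allocations = 0                # ignore the initial allocation
for _ in range(4):
    b.clamp_(0, 10)
assert allocations == 0        # nothing new for a dispose scope to track

b2 = b
for _ in range(4):
    b2 = b2.clamp(0, 10)
assert allocations == 4        # four temporaries a scope would have to free
```

With only in-place ops inside the scope, `boxes` itself still belongs to the caller's scope, so returning it plainly, without `MoveToOuterDisposeScope()`, is safe.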
