
Extend Fake Tensor Caching to Symints #126411

Open
eellison opened this issue May 16, 2024 · 8 comments
Assignees
Labels
oncall: pt2 triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@eellison
Contributor

eellison commented May 16, 2024

🚀 The feature, motivation and pitch

We have an existing in-memory fake tensor cache which short-circuits running metas and decompositions when a particular op overload has inputs whose metadata has already been observed. Fake tensors get cache hits across multiple code paths in torch.compile, and even within the same graph. Enabling it gave a 10% compilation speedup across the huggingface dashboard.
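For illustration, here is a rough sketch of that caching idea; it is not the actual FakeTensor cache implementation, and cached_meta / run_meta are made-up names:

import torch

# Key the cache on the op overload plus the observable metadata of each
# argument, and skip re-running the meta function / decomposition on a hit.
# The real cache is more involved (bypass conditions, output validation, etc.).
_meta_cache: dict = {}

def _arg_key(a):
    # Tensor args contribute their metadata; everything else contributes its value.
    if isinstance(a, torch.Tensor):
        return ("tensor", tuple(a.shape), a.dtype, a.device, a.layout)
    return ("const", a)

def cached_meta(op, args, run_meta):
    key = (op, tuple(_arg_key(a) for a in args))
    if key not in _meta_cache:
        _meta_cache[key] = run_meta(op, args)  # compute output metadata once
    return _meta_cache[key]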

We do not currently serve cache hits for tensors with symints, but we should. If you run:

torch.ops.aten.add(FakeTensor[2, 3, s0], 2)
torch.ops.aten.add(FakeTensor[2, 3, s0], 2)

In the second invocation, where we are running the exact same op with the same inputs and symint ids, any guards that might be added must already have been added in the prior run.

According to @ezyang:

"my main fear is you can't literally cache on torch.SymInt id, since these are not interned. so you need to do a structural hash on the sympy expression itself, which has some cost"

There is an existing symint hasher which might be useful: https://github.com/pytorch/pytorch/blob/main/torch/_inductor/fx_passes/dedupe_symint_uses.py#L9
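As a minimal sketch of what keying on the sympy expression rather than the SymInt's Python id could look like (assuming v.node.expr exposes the underlying sympy expression; _symint_cache_key is a hypothetical helper, not part of the proposal):

import torch

def _symint_cache_key(v):
    # SymInts are not interned, so two SymInt objects backed by the same
    # expression (e.g. s0) can have different ids. sympy expressions hash
    # structurally, so hashing on v.node.expr makes both map to the same
    # cache key, at the cost of computing that structural hash.
    if isinstance(v, torch.SymInt):
        return ("symint", v.node.expr)
    return ("const", v)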

For this repro:

import torch

@torch.compile(backend="aot_eager", dynamic=True)
def foo(x):
    t = torch.rand([1])
    t2 = torch.rand([1])
    return x + t, x + t2


inp = torch.rand([20])
foo(inp)

We should only compute aten.add.Tensor, (FakeTensor(..., size=(s0,)), FakeTensor(..., size=(1,))), {} once; currently we do it 6 times.

cc @ezyang @msaroufim @bdhirsh @anijain2305 @chauhang @aorenste , @masnesral

Alternatives

No response

Additional context

No response

@aorenste aorenste self-assigned this May 16, 2024
@eellison eellison added triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module oncall: pt2 labels May 16, 2024
@aorenste
Contributor

I'm not sure if this would be worth it. I instrumented the existing cache and ran:

python benchmarks/dynamo/torchbench.py --performance --inference --amp --backend inductor --disable-cudagraphs --device cuda

From what I can tell, the time spent in FakeTensorMode dispatch on ops that were not cached because they contained any kind of sym expr was tiny. Of 151s spent in dispatch, only 0.003s was in calls that bypassed the cache due to containing a SymInt, SymFloat, or SymBool.
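Roughly the kind of accounting this involves, as a sketch (timed_dispatch and dispatch are hypothetical stand-ins, not the actual instrumentation):

import time
import torch

# Accumulate dispatch time, split by whether a call would bypass the cache
# because an argument is a SymInt / SymFloat / SymBool.
totals = {"sym_bypass": 0.0, "cacheable": 0.0}

def timed_dispatch(dispatch, op, args, kwargs):
    has_sym = any(isinstance(a, (torch.SymInt, torch.SymFloat, torch.SymBool))
                  for a in args)
    start = time.perf_counter()
    out = dispatch(op, args, kwargs)
    totals["sym_bypass" if has_sym else "cacheable"] += time.perf_counter() - start
    return out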

Is there a better benchmark I should use to measure the potential of this change?

@ezyang
Contributor

ezyang commented May 28, 2024

You have to run with --dynamic-shapes --dynamic-batch-size-only to actually trigger dynamic shapes in the benchmark suite

@aorenste
Contributor

You have to run with --dynamic-shapes --dynamic-batch-size-only to actually trigger dynamic shapes in the benchmark suite

I used --dynamic-batch-only because --dynamic-batch-size-only wasn't accepted.

Running:

python benchmarks/dynamo/torchbench.py --performance --inference --amp --backend inductor --disable-cudagraphs --device cuda --dynamic-shapes --dynamic-batch-only

Gives a total time of 338s (so slower), but still only 0.045s spent dispatching with Sym types. Still fairly insignificant, or I'm measuring it poorly.

@ezyang
Contributor

ezyang commented May 28, 2024

ok, well, I never claimed that you would expect a speedup here :)

@eellison
Contributor Author

eellison commented May 28, 2024

The benchmark runs with --dynamic-shapes --dynamic-batch-only, not just --dynamic-batch-only. Maybe you need both?

@aorenste
Contributor

--dynamic-shapes --dynamic-batch-size-only

I had both (scroll to the right of the given command line in the comment).

@eellison
Contributor Author

eellison commented May 28, 2024

When I run:

python benchmarks/dynamo/huggingface.py --performance --training --amp --backend aot_eager --device cuda --only BertForQuestionAnswering --print-compilation-time --dynamic-batch-only

And add

    def __del__(self):
        # Print cache statistics when this FakeTensorMode instance is destroyed.
        print(self.cache_info())

to FakeTensorMode I see:

DispatchCacheInfo(hits=19258, misses=189, bypasses={'symbolic shape': 22386, 'dynamic output shape': 1, 'CompositeImplicitAutograd': 697, 'non-fake tensor': 54, 'non-FakeTensor output': 51}, size=189)

I also see that in the above benchmark, without dynamic-batch-only, disabling the fake tensor cache causes a 5 second slowdown. It's possible you're only looking at sym_type inputs but not fake tensor inputs with symints. About half of the ops are bypassed due to symints, so I would expect a couple seconds of improvement.

@aorenste
Contributor

So it turns out that the perf measurements work a lot better when you store them with += instead of =. Once I do that, the symint stuff pops out as quite a bit more expensive (a significant percentage of the dispatch time for some of the benchmarks).
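A hypothetical illustration of the difference (made-up op name and timings):

samples = [("aten.add.Tensor", 0.004), ("aten.add.Tensor", 0.006)]

timings = {}
for op_name, elapsed in samples:
    # buggy: timings[op_name] = elapsed  -- keeps only the latest sample
    timings[op_name] = timings.get(op_name, 0.0) + elapsed  # accumulates

print(timings)  # total per op, e.g. ~0.01s for aten.add.Tensor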
