sdp::SDPBackend::flash_attention support PrivateUse1 #126392

Open · wants to merge 3 commits into base: main
Conversation

@1274085042 1274085042 commented May 16, 2024


pytorch-bot bot commented May 16, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126392

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 481a3f9 with merge base 0ff2f8b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@albanD albanD requested review from drisspg and removed request for albanD May 16, 2024 13:25
@@ -14685,6 +14685,11 @@
CPU: _scaled_dot_product_flash_attention_cpu
tags: nondeterministic_seeded

- func: _scaled_dot_product_flash_attention_overrideable(Tensor query, Tensor key, Tensor value, float dropout_p=0.0, bool is_causal=False, bool return_debug_mask=False, *, float? scale=None) -> (Tensor output, Tensor logsumexp, Tensor cum_seq_q, Tensor cum_seq_k, SymInt max_q, SymInt max_k, Tensor philox_seed, Tensor philox_offset, Tensor debug_attn_mask)
Contributor

Let's rename this to `_scaled_dot_product_fused_attention_overrideable`

Contributor Author

ok
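
For context, here is a minimal sketch of how dispatcher-side code could invoke the op declared in the hunk above, assuming the generated C++ API keeps the name and signature from this yaml entry (the rename suggested in this thread would change the name). `call_overrideable_sdpa` is an illustrative wrapper, not code from this PR:

```cpp
// Sketch only: invokes the overrideable fused-attention op from this PR's
// native_functions.yaml entry and keeps just the attention output, mirroring
// how scaled_dot_product_attention unpacks the other fused backends.
#include <ATen/ATen.h>
#include <tuple>

at::Tensor call_overrideable_sdpa(
    const at::Tensor& query,
    const at::Tensor& key,
    const at::Tensor& value,
    double dropout_p = 0.0,
    bool is_causal = false) {
  // The op returns a 9-element tuple:
  // (output, logsumexp, cum_seq_q, cum_seq_k, max_q, max_k,
  //  philox_seed, philox_offset, debug_attn_mask).
  auto outputs = at::_scaled_dot_product_flash_attention_overrideable(
      query, key, value, dropout_p, is_causal, /*return_debug_mask=*/false);
  return std::get<0>(outputs);
}
```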

@@ -673,7 +677,7 @@ Tensor scaled_dot_product_attention(
return std::get<0>(out_lse_softmax);
}
case sdp::SDPBackend::flash_attention: {
if(query_.device().type() == DeviceType::CUDA){
if(query_.device().type() == DeviceType::CUDA) {
Contributor

nit: remove

Contributor Author

ok

@1274085042 1274085042 requested a review from drisspg May 21, 2024 02:55
@mikaylagawarecki mikaylagawarecki added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label May 21, 2024
@drisspg drisspg requested a review from jainapurva May 21, 2024 23:59

drisspg commented May 22, 2024

The current structure of this op looks like:

|-- Determine backend (CUDA, CPU, HIP, PrivateUse1)
|    |
|    |-- if PrivateUse1:
|    |      |-- handle_private_use(...)
|    |-- else:
|          |-- _fused_sdp_choice_stub(...)
|
|-- switch (backend)
     |
     |-- case cudnn_attention:
     |      |-- out_lse_softmax = at::_scaled_dot_product_cudnn_attention(...)
     |
     |-- case flash_attention:
     |      |-- if CUDA:
     |      |      |-- out_lse_softmax = at::_scaled_dot_product_flash_attention(...)
     |      |-- else (CPU):
     |            |-- return at::_scaled_dot_product_flash_attention_for_cpu(...)
     |
     |-- case efficient_attention:
     |      |-- out_and_lse = at::_scaled_dot_product_efficient_attention(...)
     |
     |-- case math:
     |      |-- return at::_scaled_dot_product_attention_math(...)
     |
     |-- default:
            |-- TORCH_CHECK(false, "No viable backend found.")
            |-- return Tensor()

I spoke with Alban offline about this, and we came to the conclusion that we want this structure:

|-- Determine backend (CUDA, CPU, HIP, PrivateUse1)
|    |-- if stub_registered():
|    |      |-- _fused_sdp_choice_stub(...)
|    |-- else:
|          |-- use math as the choice
|
|-- switch (backend)
     |
     |-- case cudnn_attention:
     |      |-- out_lse_softmax = at::_scaled_dot_product_cudnn_attention(...)
     |
     |-- case flash_attention:
     |      |-- if CUDA:
     |      |      |-- out_lse_softmax = at::_scaled_dot_product_flash_attention(...)
     |      |-- else (CPU):
     |            |-- return at::_scaled_dot_product_flash_attention_for_cpu(...)
     |
     |-- case efficient_attention:
     |      |-- out_and_lse = at::_scaled_dot_product_efficient_attention(...)
     |
     |-- case overridable:
     |      |-- return at::_scaled_dot_product_attention_overridable(...)
     |
     |-- case math:
     |      |-- return at::_scaled_dot_product_attention_math(...)
     |
     |-- default:
            |-- TORCH_CHECK(false, "No viable backend found.")
            |-- return Tensor()

So what does that mean for this PR? The structure looks pretty good. I made some changes here that should enable this, so once that lands we can land your updates: #126832

The dispatching logic for the kernels will be (see the sketch below):

  • The default choice is math: if a device doesn't register a stub, it will get routed to math.
  • If a choice stub is registered, devices have the option to go to an overridable op that this PR provides. That op should do no preprocessing, but it will be run through `validate_sdpa` and have attn_mask converted from bool to float.
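
A self-contained sketch of that flow, using simplified stand-in types (`Backend`, `DeviceType`, and `choice_registry` are illustrative only, not the real `DispatchStub` machinery): a device that registers a choice function gets its pick honored, everything else defaults to math, and the switch gains an overridable case.

```cpp
// Toy model of the proposed selection logic; not the actual aten code.
#include <functional>
#include <stdexcept>
#include <unordered_map>

enum class Backend { math, flash_attention, efficient_attention, cudnn_attention, overridable };
enum class DeviceType { CPU, CUDA, PrivateUse1 };

using ChoiceFn = std::function<Backend()>;

// Stand-in for the per-device _fused_sdp_choice_stub registrations.
std::unordered_map<DeviceType, ChoiceFn>& choice_registry() {
  static std::unordered_map<DeviceType, ChoiceFn> registry;
  return registry;
}

Backend select_backend(DeviceType device) {
  auto it = choice_registry().find(device);
  // A registered stub wins; otherwise the default choice is math.
  return it != choice_registry().end() ? it->second() : Backend::math;
}

const char* run_sdpa(DeviceType device) {
  switch (select_backend(device)) {
    case Backend::overridable:
      return "overridable op";   // what a PrivateUse1 backend would hit
    case Backend::math:
      return "math fallback";
    default:
      throw std::runtime_error("No viable backend found.");
  }
}

int main() {
  // A PrivateUse1 backend registers a choice that routes to "overridable".
  choice_registry()[DeviceType::PrivateUse1] = [] { return Backend::overridable; };
  run_sdpa(DeviceType::CPU);          // no stub registered -> "math fallback"
  run_sdpa(DeviceType::PrivateUse1);  // -> "overridable op"
}
```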

pytorchmergebot pushed a commit that referenced this pull request May 25, 2024
# Summary

Adds a public method to dispatchstub to check if a fn has been registered for a device. We use this new function to clean up the dispatching logic for SDPA, as well as make the private use dispatching simpler:
#126392
Pull Request resolved: #126832
Approved by: https://github.com/ezyang, https://github.com/albanD
titaiwangms pushed a commit to titaiwangms/pytorch that referenced this pull request May 28, 2024
# Summary

Adds a public method to dispatchstub to check if a fn has been registered for a device. We use this new function to clean up the dispatching logic for SDPA, as well as make the private use dispatching simpler:
pytorch#126392
Pull Request resolved: pytorch#126832
Approved by: https://github.com/ezyang, https://github.com/albanD
@1274085042 (Contributor Author) commented:

@drisspg Could this update be landed?


drisspg commented May 29, 2024

The PR I referenced above has landed; can you rebase?

@1274085042 (Contributor Author) commented:

@pytorchmergebot rebase

@pytorchmergebot (Collaborator) commented:

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator) commented:

Successfully rebased flash_attention_overrideable onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout flash_attention_overrideable && git pull --rebase)

@1274085042 (Contributor Author) commented:

@drisspg Rebased and fixed some CI issues

@@ -680,10 +684,15 @@ Tensor scaled_dot_product_attention(
auto out_lse_softmax = at::_scaled_dot_product_flash_attention(
query_padded, key_padded, value_padded, dropout_p, is_causal, false /*return_debug_mask*/, og_scale.as_float_unchecked());
return post_process_flash_output(std::get<0>(out_lse_softmax), og_size);
}
} else if (query_.device().type() == DeviceType::PrivateUse1) {
Contributor

This doesn't look right to me.

It should now just be one more switch case entry. You will need to add the overridable backend:

```cpp
case sdp::SDPBackend::overridable:
  return std::get<0>(at::_scaled_dot_product_attention_overridable(
      ...));
```

Private-use authors would thus register a dispatch to the stub and have it return the overridable backend; by default they would be routed to the math backend.
@drisspg left a comment

Labels: open source, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Projects: None yet

Development

Successfully merging this pull request may close these issues.

Add logic about PrivateUse1 in sdp::SDPBackend::flash_attention
6 participants