CI: AMD MI300 tests fix #30797

mht-sharma · 2024-05-14T07:57:36Z

What does this PR do?

Fixes the failing tests on MI300.

HuggingFaceDocBuilderDev · 2024-05-14T08:16:53Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

amyeroberts · 2024-05-14T08:22:18Z

cc @ydshieh and @younesbelkada (as I think I remember you had handled something similar)

younesbelkada

Thanks! Overall this looks OK. I left one suggestion regarding the expected texts dict; I'll defer to @ydshieh's review for the next steps.

younesbelkada · 2024-05-14T08:24:25Z

tests/models/gemma/test_modeling_gemma.py

-                "Hi today I am going to share with you a very easy and simple recipe of <strong><em>Kaju Kat",
-            ],
-        }
+        if IS_ROCM_SYSTEM:


Is this logic needed? You could just add the key 9 to the EXPECTED_TEXTS dict here, no?

@younesbelkada @ydshieh I have been comparing the generate output on H100 and MI300 (both report a device capability of 9) and have seen deviations in the generated text. After discussing this with AMD engineers, this is expected as long as the output is sensible.

The deviation can happen because different hardware may execute the same computation differently, and non-linearities in the model tend to amplify minor numerical differences.

These deviations might not manifest across all models or prompts (as seen in this particular test). But there are a few tests in this PR where the output differs and is therefore handled separately for ROCm.
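As a toy illustration of the amplification described above (an assumption-laden sketch in plain Python, not the actual MI300 or H100 kernels): floating-point addition is not associative, so two devices that sum the same terms in a different order can produce slightly different scores. If two candidate tokens are nearly tied, that difference is enough to change a greedy pick, and every later token then depends on a different prefix. The names `greedy_pick`, `score_a`, and `score_b` are hypothetical.

```python
def greedy_pick(score_a, score_b):
    """Return "A" only if its score is strictly larger; ties go to "B"."""
    return "A" if score_a > score_b else "B"


terms = [0.1, 0.2, 0.3]

# Same three terms, two different reduction orders.
left_to_right = (terms[0] + terms[1]) + terms[2]
right_to_left = terms[0] + (terms[1] + terms[2])

# A competing token whose score sits right at the tie point.
score_b = 0.6

pick_order_1 = greedy_pick(left_to_right, score_b)
pick_order_2 = greedy_pick(right_to_left, score_b)

print(left_to_right == right_to_left)  # the two sums are not bit-identical
print(pick_order_1, pick_order_2)      # the greedy choice differs between orders
```

This is why an exact string comparison against a single expected output can fail across hardware even when both outputs are sensible.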

We have a few options to address this:

1. Since we do not have H100 tests (?), I could merge the dict, and separate them later if necessary.

2. Have if/else statements only for tests where the generated output is different.

3. Distinguish the ROCm EXPECTED output for each test, as done currently in the PR.

OK, I understand better now, thanks for explaining! In this case, I think option 1 is better, plus adding a comment explaining that we might need to change that value for H100s in the future. What do you think?

Sounds good to me, I will make the change!

Thanks a lot @mht-sharma !
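A minimal sketch of option 1 as agreed above, assuming the expected outputs are keyed by the major device capability version. The strings are placeholders rather than real model output, and `expected_texts_for` is a hypothetical helper. In an actual GPU test the key would typically come from `torch.cuda.get_device_capability()[0]`.

```python
# 8 is for A100 / A10 and 7 is for T4.
# 9 currently holds the MI300 output. H100 also reports 9, so this entry
# may need to be split if H100 tests are added and their output differs.
EXPECTED_TEXTS = {
    7: ["<placeholder: expected text on T4>"],
    8: ["<placeholder: expected text on A100 / A10>"],
    9: ["<placeholder: expected text on MI300>"],
}


def expected_texts_for(major_version, table=EXPECTED_TEXTS):
    """Look up the expected outputs for a device capability major version."""
    if major_version not in table:
        raise KeyError(f"No expected output recorded for capability {major_version}")
    return table[major_version]


print(expected_texts_for(9))
```

Keeping a single dict avoids a per-test if/else on the ROCm flag, at the cost of having to revisit key 9 if two devices with the same capability ever need different values.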

younesbelkada · 2024-05-14T08:25:01Z

tests/trainer/test_trainer_seq2seq.py

@@ -161,7 +161,6 @@ def test_return_sequences(self):
            tokenizer=tokenizer,
            data_collator=data_collator,
            compute_metrics=lambda x: {"samples": x[0].shape[0]},
-            report_to="none",


is this intended?

Yes, the argument does not exist; the error was introduced in #30266.

ok thanks !

ydshieh · 2024-05-14T15:28:42Z

will check but likely not today

ydshieh

Sorry for the delay in the review.

LGTM and thank you a lot for the efforts on AMD CI.

ydshieh · 2024-05-16T12:52:46Z

tests/models/gemma/test_modeling_gemma.py

-        }
+        if IS_ROCM_SYSTEM:
+            EXPECTED_TEXTS = {
+                9: [


Let's update those lines with

# 8 is for A100 / A10 and 7 for T4

so people know what 9 means

younesbelkada · 2024-05-16T12:58:12Z

tests/models/gemma/test_modeling_gemma.py

-                "Hi today I am going to share with you a very easy and simple recipe of <strong><em>Kaju Kat",
-            ],
-        }
+        if IS_ROCM_SYSTEM:


Can you refactor this block so that it uses the dict directly, without the if/else statement? 🙏

younesbelkada · 2024-05-16T12:58:17Z

tests/models/gemma/test_modeling_gemma.py

-                "Hi today I am going to share with you a very easy and simple recipe of <strong><em>Kaju Kat",
-            ],
-        }
+        if IS_ROCM_SYSTEM:


younesbelkada · 2024-05-16T12:58:22Z

tests/models/gemma/test_modeling_gemma.py

-                "Hi today I am going to show you how to make a very simple and easy to make a very simple and",
-            ],
-        }
+        if IS_ROCM_SYSTEM:


younesbelkada · 2024-05-16T12:58:29Z

tests/models/llama/test_modeling_llama.py

-            ],
-        }
+        if IS_ROCM_SYSTEM:
+            EXPECTED_TEXT_COMPLETION = {


younesbelkada · 2024-05-16T12:58:43Z

tests/models/mixtral/test_modeling_mixtral.py

-                torch_device
-            ),
-        }
+        if IS_ROCM_SYSTEM:


younesbelkada · 2024-05-16T12:58:49Z

tests/models/mixtral/test_modeling_mixtral.py

-            ),
-        }
+        if IS_ROCM_SYSTEM:
+            EXPECTED_LOGITS_LEFT = {


younesbelkada

Thanks! The following suggestion should make the CI happy! 🤞

younesbelkada · 2024-05-17T15:44:20Z

src/transformers/testing_utils.py

+
+    IS_ROCM_SYSTEM = torch.version.hip is not None
+    IS_CUDA_SYSTEM = torch.version.cuda is not None
+


Suggested change

+else:
+    IS_ROCM_SYSTEM = False
+    IS_CUDA_SYSTEM = False
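The point of the suggestion is that both flags must be defined on every code path, otherwise importing them fails on machines where the first branch never runs. The excerpt does not show the condition that the `else:` pairs with; the sketch below assumes it is whether torch can be imported, and wraps the logic in a hypothetical helper, `detect_gpu_backend`, instead of the module-level if/else used in the PR.

```python
def detect_gpu_backend():
    """Return (is_rocm, is_cuda), falling back to (False, False) without torch."""
    try:
        import torch
    except ImportError:
        # Assumed fallback: with no torch there is no ROCm or CUDA build to detect.
        return False, False
    # torch.version.hip is set on ROCm builds, torch.version.cuda on CUDA builds.
    return torch.version.hip is not None, torch.version.cuda is not None


IS_ROCM_SYSTEM, IS_CUDA_SYSTEM = detect_gpu_backend()
print(IS_ROCM_SYSTEM, IS_CUDA_SYSTEM)
```

With this shape, test modules can import the flags unconditionally.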

younesbelkada

Thanks so much !

younesbelkada

Thanks again !

amyeroberts

Thanks for fixing - very nicely handled ❤️

amyeroberts · 2024-05-21T08:37:55Z

tests/trainer/test_trainer.py

+                "--report_to",
+                "none",


How come we have to add this?

Since Codecarbon is not supported on ROCm, the reporting callbacks are skipped in the trainer tests. Ref: #30266
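A rough sketch of what the two added lines do to the test's command line, using the standard library's argparse as a stand-in for the library's own argument parsing. The parser, the `build_argv` helper, the output path, and the default value are all hypothetical. The only part taken from the diff is the `--report_to none` pair, which requests that no reporting integrations be attached.

```python
import argparse

# Stand-in parser for illustration only; the real tests parse into training arguments.
parser = argparse.ArgumentParser()
parser.add_argument("--output_dir", required=True)
parser.add_argument("--report_to", default="all")  # hypothetical default for this sketch


def build_argv(output_dir):
    """Build a test command line that disables reporting integrations."""
    return ["--output_dir", output_dir, "--report_to", "none"]


args = parser.parse_args(build_argv("/tmp/trainer_test"))
print(args.report_to)
```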

amyeroberts · 2024-05-21T10:47:44Z

@mht-sharma Do you have permission to merge? If not, I can merge in for you

mht-sharma · 2024-05-21T10:58:21Z

> @mht-sharma Do you have permission to merge? If not, I can merge in for you

@amyeroberts I do not have the permissions. Please merge. Thanks for the review 🤗

* add fix
* update import
* updated dicts and comments
* remove prints
* Update testing_utils.py

add fix

dfc174b

younesbelkada reviewed May 14, 2024

ydshieh self-assigned this May 14, 2024

update import

9a8032d

ydshieh approved these changes May 16, 2024

ydshieh marked this pull request as ready for review May 16, 2024 12:56

younesbelkada reviewed May 16, 2024

tests/models/llama/test_modeling_llama.py (outdated)

-            ],
-        }
+        if IS_ROCM_SYSTEM:
+            EXPECTED_TEXT_COMPLETION = {

younesbelkada · May 16, 2024

Same here

tests/models/mixtral/test_modeling_mixtral.py (outdated)

-                torch_device
-            ),
-        }
+        if IS_ROCM_SYSTEM:

younesbelkada · May 16, 2024

Same here

tests/models/mixtral/test_modeling_mixtral.py (outdated)

-            ),
-        }
+        if IS_ROCM_SYSTEM:
+            EXPECTED_LOGITS_LEFT = {

younesbelkada · May 16, 2024

Same here

mht-sharma added 3 commits May 17, 2024 09:41

Merge remote-tracking branch 'upstream/main' into mi300_ci_fix

5650534

updated dicts and comments

7da11b8

remove prints

7841c76

mht-sharma requested a review from younesbelkada May 17, 2024 14:30

younesbelkada reviewed May 17, 2024

Update testing_utils.py

b64b9d5

younesbelkada approved these changes May 17, 2024

Merge remote-tracking branch 'upstream/main' into mi300_ci_fix

c55576b

younesbelkada approved these changes May 21, 2024

younesbelkada requested a review from amyeroberts May 21, 2024 07:43

amyeroberts approved these changes May 21, 2024

amyeroberts merged commit 7a4792e into huggingface:main May 21, 2024
22 checks passed

itazap pushed a commit that referenced this pull request May 21, 2024

CI: AMD MI300 tests fix (#30797)

eb518bf

itazap pushed a commit that referenced this pull request May 21, 2024

CI: AMD MI300 tests fix (#30797)

5fabd17

itazap pushed a commit that referenced this pull request May 22, 2024

CI: AMD MI300 tests fix (#30797)

c34762a

itazap pushed a commit that referenced this pull request May 24, 2024

CI: AMD MI300 tests fix (#30797)

62fa86e
