Gemini Lake clpeak fails on Ubuntu 22.04 latest kernel #679

Open
looi opened this issue Sep 28, 2023 · 5 comments

@looi

looi commented Sep 28, 2023

System: Dell Wyse 5070 Intel Celeron J4105 (Gemini Lake)
Intel Compute Runtime 23.30.26918.9, installed following the official instructions.

What works

  • Ubuntu 20.04 kernel 5.4: Works out of the box
  • Ubuntu 22.04 kernel 5.15: Doesn't work out of the box, but works with the kernel parameters i915.enable_hangcheck=0 i915.request_timeout_ms=100000 (see "Broadwell iGPU hangs running clpeak with 5.13+ kernels", #497)
  • Without those parameters, clpeak hangs on 5.15 with the following kernel logs:
[  101.920695] Fence expiration time out i915-0000:00:02.0:clpeak[969]:454!
[  101.920754] Fence expiration time out i915-0000:00:02.0:clpeak[969]:452!
[  101.920760] Fence expiration time out i915-0000:00:02.0:clpeak[969]:450!
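For anyone reproducing this, a minimal Python sketch for checking whether a kernel command line carries the two i915 parameters quoted above. The function names and the 100000 ms threshold default are illustrative assumptions taken from this report, not part of any Intel or kernel tooling.

```python
def parse_i915_params(cmdline: str) -> dict:
    """Collect i915.* parameters from a kernel command line string."""
    params = {}
    for token in cmdline.split():
        if token.startswith("i915.") and "=" in token:
            key, value = token.split("=", 1)
            params[key[len("i915."):]] = value
    return params


def has_workaround(cmdline: str, min_timeout_ms: int = 100000) -> bool:
    """True if hangcheck is disabled and the request timeout is at least min_timeout_ms.

    The threshold is an assumption based on the value used in this report.
    """
    params = parse_i915_params(cmdline)
    hangcheck_off = params.get("enable_hangcheck") == "0"
    try:
        timeout_ok = int(params.get("request_timeout_ms", "")) >= min_timeout_ms
    except ValueError:
        # Parameter missing or not an integer.
        timeout_ok = False
    return hangcheck_off and timeout_ok
```

On a running system, the string to pass in would typically be the contents of /proc/cmdline.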

What doesn't work

  • Ubuntu 22.04 kernel 6.2
  • clpeak fails with error clFinish (-5), and the kernel logs the following (these messages were never seen with kernel 5.15):
[   70.897984] i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
[   70.898106] i915 0000:00:02.0: [drm] clpeak[910] context reset due to GPU hang
[   70.907884] i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:e757fefe, in clpeak [910]
...
[  170.596540] i915 0000:00:02.0: [drm] Resetting rcs0 for CS error
[  170.596659] i915 0000:00:02.0: [drm] clpeak[910] context reset due to GPU hang
[  170.606125] i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:ac045407, in clpeak [910]
  • Setting the parameters that fixed 5.15 (i915.enable_hangcheck=0 i915.request_timeout_ms=100000), or increasing the timeout to 200000, doesn't appear to help.
  • Vulkan compute (vkpeak) works fine, so this appears to be a problem specific to Intel Compute Runtime.
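When comparing logs across kernels, it can help to pull out just the i915 reset and hang events. A rough Python sketch follows. The regular expressions are assumptions derived only from the log lines quoted in this report, so other kernel versions may word these messages differently.

```python
import re

# Patterns modeled on the log lines quoted in this issue (an assumption,
# not an exhaustive description of i915 log formats).
RESET_RE = re.compile(r"Resetting (\w+) for (.+)$")
HANG_RE = re.compile(r"GPU HANG: ecode ([0-9a-fA-F:]+), in (\S+) \[(\d+)\]")


def summarize_i915_events(lines):
    """Return a list of tuples describing reset and hang events.

    Reset events: ("reset", engine, reason)
    Hang events:  ("hang", ecode, process, pid)
    """
    events = []
    for raw in lines:
        line = raw.strip()
        match = RESET_RE.search(line)
        if match:
            events.append(("reset", match.group(1), match.group(2)))
            continue
        match = HANG_RE.search(line)
        if match:
            events.append(("hang", match.group(1), match.group(2), int(match.group(3))))
    return events
```

Feeding it the output of dmesg, split into lines, would give a compact list that is easier to diff between a working and a failing boot.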
@Ar-Ray-code

I have the same problem on both the Celeron J4125 and the N4000.
I see the following error during GPU inference with OpenVINO 2023.1.0:

terminate called after throwing an instance of 'InferenceEngine::GeneralError'
  what(): [ GENERAL_ERROR ] Check 'false' failed at src/plugins/intel_gpu/src/plugin/program.cpp:401:
GPU program build failed!
[GPU] clWaitForEvents, error code: -14

Note that Linux kernel 5.15 does not have the problem.
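For reference, the two numeric codes seen in this thread (clFinish returning -5 in the original report, and clWaitForEvents returning -14 here) correspond to named constants in the OpenCL headers. A small Python lookup covering just those two follows; the helper name is illustrative.

```python
# Only the two codes that appear in this thread; values are from the OpenCL headers.
CL_ERROR_NAMES = {
    -5: "CL_OUT_OF_RESOURCES",
    -14: "CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST",
}


def describe_cl_error(code: int) -> str:
    """Return the symbolic name for a known code, or a generic fallback string."""
    return CL_ERROR_NAMES.get(code, f"unknown OpenCL error ({code})")
```

Both codes are generic failure indications; the mapping alone does not show whether the cause lies in the runtime or in the kernel driver.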

@nyanmisaka
Contributor

I heard about this issue a year ago, but I no longer have a GLK device.
This is a kernel regression, because Gemini Lake (GLK) only fails when using the newer kernel.

@looi @Ar-Ray-code
It would be better to file an issue with drm/intel: https://gitlab.freedesktop.org/drm/intel/-/issues/?label_name%5B%5D=Community

@looi
Author

looi commented Dec 16, 2023

I don't think this is necessarily a kernel regression because, as I stated above, Vulkan compute works fine.

Personally, I have switched to Vulkan. The performance is comparable (especially when making proper use of Vulkan subgroups), but more importantly, it seems to be much more stable on both Windows and Linux. Intel Compute Runtime / OpenCL has weird issues like this one. Vulkan also seems to work much better on non-Intel GPUs, especially NVIDIA, which refuses to support basic features like subgroups and half-precision floats in OpenCL. So I feel like Vulkan is the future and OpenCL is dying anyway.

@nyanmisaka
Contributor

What works
Ubuntu 22.04 kernel 5.15: Doesn't work out of the box, but works with i915.enable_hangcheck=0

What doesn't work
Ubuntu 22.04 kernel 6.2

Your input suggests this is a kernel regression. The only difference between the working and failing setups is the kernel version, so bisecting the commits between the two should find the culprit.

This isn't the first time I've seen an i915 regression; last time it even broke both Vulkan compute and OpenCL.
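The bisection idea mentioned above can be sketched in a few lines of Python. This only illustrates the search itself: `first_bad` and the `is_bad` callback are hypothetical names, and in practice the check would mean building and booting a kernel and running clpeak, usually driven by git bisect rather than a script like this.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad(candidates: Sequence[T], is_bad: Callable[[T], bool]) -> T:
    """Return the earliest candidate for which is_bad() is True.

    Assumes candidates are ordered oldest to newest, the first is known good,
    the last is known bad, and every candidate after the first bad one is
    also bad.
    """
    lo, hi = 0, len(candidates) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid  # breakage is at mid or earlier
        else:
            lo = mid  # breakage is after mid
    return candidates[hi]
```

Each step halves the remaining range, so even a few thousand commits between 5.15 and 6.2 would need only about a dozen test boots.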

@looi
Author

looi commented Dec 16, 2023

I agree that a kernel change broke Intel Compute Runtime. I guess whether or not it's a kernel regression is a subjective question, depending on what exactly caused the breakage. Maybe Intel Compute Runtime is making incorrect assumptions about i915 or relying on undefined behavior, in which case it would not be a kernel regression. Given that Vulkan compute still works, I think that is a likely possibility.
