Testing: Fix driver installation on Rocky Linux 9 #1710

Open
LujieDuan wants to merge 1 commit into base: master

@LujieDuan (Contributor) commented May 15, 2024

Description

This PR contains the following changes to the centos_rhel dcgm and nvml install scripts:

  1. For the package manager installation method: update the temporary fix introduced by b/330327505 (Testing: Fix GPU driver installation on SLES 15 and Rocky Linux 9 #1657). The Rocky Linux 9 image has been updated to kernel 5.14.0-362.24.1.el9_3.0.1.x86_64. Unfortunately, the NVIDIA driver still checks /lib/modules/5.14.0-362.24.1.el9_3.x86_64, so the softlink fix is still required. However, the latest driver (550) does have the kmod for 362.24.1, so we can remove the sudo yum -y module install nvidia-driver:545 and sudo yum -y install cuda-12-3 logic.
  2. For the runfile installation method on Rocky Linux 9: we manually set up a repo entry pointing to the Rocky Vault repo (the archive repo) so that we can get a previous version of the kernel-devel package. A matching kernel-devel package is needed to compile the driver. We don't need to do this for Rocky Linux 8 (for now): historically, the regular RL8 repo has kept a couple of recent versions of the package, which works with the GCE RL8 image release cycle.
  3. For the runfile installation method: update to install the latest version (550). The previous driver issue with 535 (building main against current centos stream 8 fails NVIDIA/open-gpu-kernel-modules#550) has been resolved in 550, so we can install the driver and CUDA toolkit together and remove the extra logic.
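
For reference, the softlink fix mentioned in item 1 can be sketched roughly as below. This is a minimal illustration, not the code in the install scripts: the function names expected_module_release and link_module_dir are hypothetical, and the rule of stripping a ".0.1" suffix between the dist tag and the architecture is an assumption based only on the two kernel strings quoted above.

```shell
# Sketch only: helper names and the suffix-stripping rule are assumptions,
# not the logic used by the actual dcgm/nvml install scripts.

# Map a kernel release to the module directory name the driver looks for.
# Based on the example in this PR:
#   5.14.0-362.24.1.el9_3.0.1.x86_64 -> 5.14.0-362.24.1.el9_3.x86_64
expected_module_release() {
  # Drop a ".0.1" suffix that sits between the dist tag (e.g. el9_3) and the arch.
  echo "$1" | sed -E 's/(\.el[0-9]+_[0-9]+)\.0\.1\./\1./'
}

# Create the softlink under a modules root if the expected directory is missing.
# Does nothing when the release has no ".0.1" suffix.
link_module_dir() {
  local modules_root="$1"
  local release="$2"
  local expected
  expected="$(expected_module_release "$release")"
  if [ "$expected" != "$release" ] && [ ! -e "$modules_root/$expected" ]; then
    ln -s "$modules_root/$release" "$modules_root/$expected"
  fi
}
```

On a real machine this would be called as root with something like link_module_dir /lib/modules "$(uname -r)". For a release without the extra suffix, the expected name equals the running release and no link is created.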

Related issue

b/340200529

How has this been tested?

Tested to make sure install_cuda_from_package_manager and install_cuda_from_runfile work for centos7, rl8 and rl9.

Checklist:

  • Unit tests
    • Unit tests do not apply.
    • Unit tests have been added/modified and passed for this PR.
  • Integration tests
    • Integration tests do not apply.
    • Integration tests have been added/modified and passed for this PR.
  • Documentation
    • This PR introduces no user visible changes.
    • This PR introduces user visible changes and the corresponding documentation change has been made.
  • Minor version bump
    • This PR introduces no new features.
    • This PR introduces new features, and there is a separate PR to bump the minor version since the last release already.
    • This PR bumps the version.

@LujieDuan LujieDuan marked this pull request as ready for review May 15, 2024 18:54
@LujieDuan LujieDuan requested review from a team, rafaelwestphal and braydonk and removed request for a team and rafaelwestphal May 15, 2024 19:02
@LujieDuan (Contributor, Author) commented

The RL9 image has been updated to 9.4 (5.14.0-427.16.1.el9_4.x86_64).
The nightly now passes with the new image without this PR.
I will update this PR to remove the temporary fix, since the new image no longer needs it.

@LujieDuan LujieDuan removed the request for review from braydonk May 21, 2024 16:52