Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[r8] TestStorageStratisReboot.testEncrypted fails #20373

Closed

Conversation

jelly
Copy link
Member

@jelly jelly commented Apr 25, 2024

Completely wiping the currently running root partition is too aggressive: During shutdown, recent versions of systemd-shutdown [1] re-executes itself to free all open fds on the root fs (so that it can be unmounted cleanly), which doesn't work if /usr/sbin/systemd-shutdown is gone. This leads to reboot hanging at

[  OK  ] Reached target reboot.target - System Reboot.
[!!!!!!] Failed to execute shutdown binary.

Instead, make the partition unbootable by cleaning /etc (i.e. allowed SSH keys, boot targets, etc.) Even that is redundant, as post reboot the test already checks that it booted from the encrypted partition. But let's keep the belt and suspenders.

[1] https://github.com/systemd/systemd/blob/main/src/shutdown/shutdown.c

Completely wiping the currently running root partition is too
aggressive: During shutdown, recent versions of systemd-shutdown [1]
re-executes itself to free all open fds on the root fs (so that it can
be unmounted cleanly), which doesn't work if /usr/sbin/systemd-shutdown
is gone. This leads to reboot hanging at

```
[  OK  ] Reached target reboot.target - System Reboot.
[!!!!!!] Failed to execute shutdown binary.
```

Instead, make the partition unbootable by cleaning /etc (i.e. allowed
SSH keys, boot targets, etc.) Even that is redundant, as post reboot the
test already checks that it booted from the encrypted partition. But
let's keep the belt and suspenders.

[1] https://github.com/systemd/systemd/blob/main/src/shutdown/shutdown.c
@jelly jelly requested a review from martinpitt April 25, 2024 14:44
@martinpitt martinpitt changed the title test: Make encrypt_root()'s old root fs destruction less aggressive [r8] test: Make encrypt_root()'s old root fs destruction less aggressive Apr 25, 2024
@martinpitt martinpitt added the backport apply a commit from master to a stable branch label Apr 25, 2024
martinpitt
martinpitt previously approved these changes Apr 25, 2024
Copy link
Member

@martinpitt martinpitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, this commit rings a bell! 😁

If the bots are happy, so am I. Thanks!

@jelly
Copy link
Member Author

jelly commented Apr 25, 2024

Hey, this commit rings a bell! 😁

If the bots are happy, so am I. Thanks!

They are not, so I'll have to fetch the image and debug this one I'm afraid.

@martinpitt
Copy link
Member

Ah, so it's not even a regression in the image, it fails on the current one as well

@martinpitt martinpitt dismissed their stale review April 25, 2024 15:53

not working

@jelly
Copy link
Member Author

jelly commented Apr 26, 2024

Ah, so it's not even a regression in the image, it fails on the current one as well

Hmm could it be the switch to our different CI and that is a bit slower in general?

@martinpitt
Copy link
Member

Could be -- this test is really demanding. But it should have failed before then in the 50% of cases where it was running on RHOS.. I have no idea, sorry.

@jelly
Copy link
Member Author

jelly commented Apr 26, 2024

Could be -- this test is really demanding. But it should have failed before then in the 50% of cases where it was running on RHOS.. I have no idea, sorry.

I'll run it locally and maybe debug it in a separate PR (or this one)

@jelly
Copy link
Member Author

jelly commented Apr 26, 2024

Can reproduce this locally

image

@jelly
Copy link
Member Author

jelly commented Apr 26, 2024

Thanks for the virsh console vmname tip:

The passwordless systemd agent seems to be the issue.

Give root password for maintenance
(or press Control-D to continue):
Reloading system manager configuration
Starting default target
Enter password for pool with UUID cbc931de7de944109ef865be29727b9f

@martinpitt
Copy link
Member

Indeed it seems the agent got started and stopped again, so it seems it at least was triggered.

● test-password-agent.service - Test Password Agent
   Loaded: loaded (/etc/systemd/system/test-password-agent.service; static; vendor preset: disabled)
   Active: inactive (dead) since Mon 2024-04-29 01:42:29 EDT; 2min 7s ago
  Process: 760 ExecStart=/usr/local/bin/test-password-agent (code=killed, signal=TERM)
 Main PID: 760 (code=killed, signal=TERM)

Apr 29 01:41:59 rhel-8-10-127-0-0-2-2201 systemd[1]: Started Test Password Agent.
Apr 29 01:42:29 rhel-8-10-127-0-0-2-2201 systemd[1]: Stopping Test Password Agent...
Apr 29 01:42:29 rhel-8-10-127-0-0-2-2201 systemd[1]: test-password-agent.service: Succeeded.
Apr 29 01:42:29 rhel-8-10-127-0-0-2-2201 systemd[1]: Stopped Test Password Agent.

The journal:

Apr 29 01:41:59 rhel-8-10-127-0-0-2-2201 stratisd-min[666]: [2024-04-29T05:41:59Z DEBUG libstratis::stratis::run] 2: thread started
Apr 29 01:41:59 rhel-8-10-127-0-0-2-2201 stratisd-min[666]: [2024-04-29T05:41:59Z DEBUG libstratis::engine::strat_engine::backstore::crypt::shared] Stratis LUKS2 token: {"activation_name":"stratis-1-private-d8527f4ddea6450085d341193e18e3a7-crypt","device_uuid":"d8527f4d-dea6-4500-85d3-41193e18e3a7","keyslots":[],"pool_uuid":"35d2bcc8-2fb8-451f-a010-bc4af976b328","type":"stratis"}
Apr 29 01:41:59 rhel-8-10-127-0-0-2-2201 stratisd-min[666]: [2024-04-29T05:41:59Z WARN  libstratis::engine::strat_engine::backstore::crypt::shared] Key description pool0.1 was not found in the keyring
Apr 29 01:41:59 rhel-8-10-127-0-0-2-2201 stratis-fstab-setup[757]: Error: Error("Error: The key description \"pool0.1\" is not currently set.")
Apr 29 01:41:59 rhel-8-10-127-0-0-2-2201 systemd[1]: Started Dispatch Password Requests to Console.
Apr 29 01:41:59 rhel-8-10-127-0-0-2-2201 systemd[1]: Started Test Password Agent.
Apr 29 01:42:09 rhel-8-10-127-0-0-2-2201 stratisd-min[666]: [2024-04-29T05:42:09Z DEBUG libstratis::stratis::run] 0: thread finished
Apr 29 01:42:29 rhel-8-10-127-0-0-2-2201 stratis-fstab-setup[660]: Failed to unlock pool with UUID 35d2bcc82fb8451fa010bc4af976b328 using Clevis.
Apr 29 01:42:29 rhel-8-10-127-0-0-2-2201 systemd[1]: stratis-fstab-setup@35d2bcc82fb8451fa010bc4af976b328.service: Main process exited, code=exited, status=1/FAILURE
Apr 29 01:42:29 rhel-8-10-127-0-0-2-2201 systemd[1]: stratis-fstab-setup@35d2bcc82fb8451fa010bc4af976b328.service: Failed with result 'exit-code'.
Apr 29 01:42:29 rhel-8-10-127-0-0-2-2201 systemd[1]: Failed to start Set up Stratis filesystems in /etc/fstab.

So this smells more like a problem with the stratis setup than with the agent?

@martinpitt martinpitt changed the title [r8] test: Make encrypt_root()'s old root fs destruction less aggressive [r8] TestStorageStratisReboot.testEncrypted fails May 2, 2024
@martinpitt
Copy link
Member

I now became suspicious if we really fixed that on main, or if it's an actual regression in RHEL 8.10. We haven't run main tests on RHEL 8 for a fair while, but that's what I'd like to try. image-prepare doesn't work any more due to python3-pip and pybridge, so I got a bit creative:

bots/image-customize -i cockpit-storaged rhel-8-10
TEST_OS=rhel-8-10 test/verify/check-storage-stratis TestStorageStratisReboot.testEncrypted -stv

That fails as on the rhel-8 branch, so conclusion: It's not a change in the tests.

So let's try latest main storaged on an otherwise clean rhel-8-10 image:

bots/image-customize -v -r 'rm -r /usr/share/cockpit/storaged' -u dist/storaged:/usr/share/cockpit/ rhel-8-10

but that now crashes in format_fsys_usage() with TypeError: parts.join is not a function. That reeks of the recent cockpit.js format changes, and an undeclared dependency bump? So let's copy cockpit.js too:

bots/image-customize -v -r 'rm -r /usr/share/cockpit/base1' -u dist/base1:/usr/share/cockpit/ rhel-8-10

That crashes on the overview now. So copy that as well (systemd). And voilà, it fails in exactly the same way.

Conclusion

We can't fix this with a backport. This is genuinely broken in RHEL 8.

@martinpitt
Copy link
Member

The last "green" (really in quotes) rhel-8-10 image refresh was cockpit-project/bots#6255, but there the test already failed, but got naughtied:

SKIP Known issue #2412

The image refresh a week later in cockpit-project/bots#6311 then failed, but there were zero packaging changes.

So I can only conclude that somehow this changed in our naughty matching. Running bots/test-failure-policy -o rhel-8-10 < /tmp/out against the log output from 3 weeks ago still works, but from 2 weeks ago it doesn't recognize the naughty.

The reason is quite obvious in retrospect: bots changed the module structure. In the matching case:

machine_core.exceptions.Failure: Timeout waiting for system to reboot properly

and currently:

machine.machine_core.exceptions.Failure: Timeout waiting for system to reboot properly. I'll send a PR to adjust the naughty.

martinpitt added a commit to martinpitt/bots that referenced this pull request May 10, 2024
The traceback now includes the top-level `machine` module.

Delete the pattern on RHEL 9. The bug is fixed there, and the naughty
hasn't matched any more since April for the same reason.

Closes cockpit-project/cockpit#20373
@martinpitt
Copy link
Member

I sent cockpit-project/bots#6369 to update the naughties. This PR is obsolete.

@martinpitt martinpitt closed this May 10, 2024
mvollmer pushed a commit to cockpit-project/bots that referenced this pull request May 10, 2024
The traceback now includes the top-level `machine` module.

Delete the pattern on RHEL 9. The bug is fixed there, and the naughty
hasn't matched any more since April for the same reason.

Closes cockpit-project/cockpit#20373
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport apply a commit from master to a stable branch
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

2 participants