
Mechanism for Spawners to detect and cleanup orphan resources #4544

Open
minrk opened this issue Aug 28, 2023 · 2 comments
@minrk
Member

minrk commented Aug 28, 2023

Proposed change

It would be useful to have a mechanism to locate all resources of possibly orphaned spawners and stop them.

There are 3 main sources of these orphans:

  1. plain old bugs (e.g. "return poll status after first load finish", kubespawner#742)
  2. database rollback or reset (for example: delete jupyterhub.sqlite with running servers)
  3. configuration changes (e.g. pod name template, namespace configuration). Some of these shouldn't actually cause orphans (e.g. persisting resolved name in state instead of using config for running servers), but they can in general dissociate resources from Spawners.

A specific example case that's happening today (KubeSpawner): a single-user pod is running, but does not correspond to any Spawner in the database. These pods should all be deleted. But how?

Alternative options

Leave it as an out-of-band operation for deployments to handle.

Who would use this feature?

Probably mostly kubernetes deployments, but it technically applies to every JupyterHub deployment. Large deployments with many users in general are the most likely to have orphaned resources that take effort to manage manually.

(Optional): Suggest a solution

The challenge here is that Spawner configuration is only loaded per Spawner instance, and our traitlets config approach doesn't lend itself to class methods, so there's no obvious place to put a "list_orphaned_servers".

One candidate is a list_orphaned_servers instance method, called on a Spawner that's either not given a user or given a mock user (as is done in #4442). This is probably the simplest to implement, but it adds an assumption not previously placed on a Spawner: that it can be instantiated without a 'real' user.
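For illustration, a rough sketch of what that could look like; the method name, the `known_servers` argument, and `_list_all_resources` are all hypothetical and not part of the current Spawner API:

```python
# Hypothetical sketch of the proposed instance method. Nothing here is part
# of the current Spawner interface; it only illustrates the shape of the API.
from jupyterhub.spawner import Spawner


class MySpawner(Spawner):
    async def list_orphaned_servers(self, known_servers):
        """Return resources that match no known (username, servername) pair.

        known_servers: set of (username, servername) tuples the Hub knows about.
        """
        orphans = []
        # _list_all_resources is a placeholder for backend-specific discovery,
        # e.g. listing pods in KubeSpawner's configured namespace.
        for resource in await self._list_all_resources():
            key = (resource["username"], resource["servername"])
            if key not in known_servers:
                orphans.append(resource)
        return orphans
```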

The second question is how exactly we implement the reconciliation of the list of all actually running servers with the list of servers the Hub knows about. Currently, all running servers are in memory after init_spawners, so there's an in-memory collection of username/servername combinations.

One option could be Spawner.cleanup_orphaned_resources(all_available_spawners), which would put the responsibility on the Spawner class to implement the cleanup, as sketched below. The assumption that all running servers have a Spawner in memory is going to be another challenge for the horizontal-scaling changes we hope to make someday, which is a point against this approach, but something needs to be able to compute the difference between all existing resources and all expected resources.
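A minimal sketch of that hub-side call, assuming the in-memory spawner collection described above (`cleanup_orphaned_resources` is hypothetical; `user.spawners` and `Spawner.active` follow current JupyterHub internals):

```python
# Hypothetical hub-side reconciliation. The Hub collects every
# (username, servername) pair it believes is running and hands the whole set
# to one Spawner instance, which deletes anything that exists outside it.
async def cleanup_orphans(hub_users, spawner):
    expected = {
        (user.name, server_name)
        for user in hub_users
        for server_name, sp in user.spawners.items()
        if sp.active
    }
    # cleanup_orphaned_resources does not exist yet; per this proposal, the
    # Spawner computes the difference against actually-existing resources.
    await spawner.cleanup_orphaned_resources(expected)
```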

Another, bigger change would be to add a SpawnManager-type class which holds the methods not tied to a single Spawner. But this would need to get all of the deployment configuration (namespaces, credentials, etc.). Pro: this is where e.g. KubeSpawner reflectors belong. Con: it would be a very hard transition for existing classes and configuration.
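A rough sketch of what such a class could look like; `SpawnManager` and everything on it is hypothetical:

```python
# Hypothetical SpawnManager: deployment-wide configuration plus the methods
# that aren't tied to a single Spawner instance. Purely illustrative.
from traitlets import Unicode
from traitlets.config import LoggingConfigurable


class SpawnManager(LoggingConfigurable):
    namespace = Unicode(
        "default",
        config=True,
        help="Example of deployment-wide config a manager would own",
    )

    async def list_all_resources(self):
        """Enumerate every resource this deployment may have created."""
        raise NotImplementedError

    async def delete_resource(self, resource):
        """Delete one backend resource (pod, container, unit, ...)."""
        raise NotImplementedError

    async def cleanup_orphaned_resources(self, expected):
        """Delete resources whose (username, servername) is not expected."""
        for resource in await self.list_all_resources():
            if (resource["username"], resource["servername"]) not in expected:
                await self.delete_resource(resource)
```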

Open questions:

  • there's sometimes a difference between deleted resources and stopped resources (e.g. removing a PVC). I think this should be out-of-scope, but I'm not sure.
  • Is it the hub's or the spawner's responsibility to determine what should be cleaned up? (i.e. the spawner returns all known instances and the hub calls back with those that should be shut down, or the hub tells the spawner all known instances and the spawner identifies and cleans up the difference)
  • How to handle orphans caused by a change in configuration (e.g. switching pod name template or namespace configuration), or is this out of scope?
@jabbera
Contributor

jabbera commented Sep 16, 2023

This could pretty easily be a service like the idle culler for each spawner type. We wrote a k8s cron job for it after stumbling across kubespawner#742, before we knew there was a fix.
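The cron job itself isn't included in the thread; a rough sketch of what such an out-of-band culler could look like for KubeSpawner, assuming the standard component=singleuser-server label and hub.jupyter.org/* annotations (adjust the keys and auth to your deployment):

```python
# Sketch of an out-of-band orphan culler, run e.g. as a k8s CronJob.
# Assumes a KubeSpawner deployment; label and annotation keys vary, so check
# what your pods actually carry before deleting anything.
import os

import requests
from kubernetes import client, config


def cull_orphan_pods(hub_url, namespace):
    # Ask the Hub which servers it believes are running (requires an API
    # token with permission to list users and their servers).
    token = os.environ["JUPYTERHUB_API_TOKEN"]
    resp = requests.get(
        f"{hub_url}/hub/api/users",
        headers={"Authorization": f"token {token}"},
    )
    resp.raise_for_status()
    expected = {
        (user["name"], server_name)
        for user in resp.json()
        for server_name in user.get("servers", {})
    }

    # List the single-user pods that actually exist in the namespace.
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(
        namespace, label_selector="component=singleuser-server"
    )
    for pod in pods.items:
        annotations = pod.metadata.annotations or {}
        key = (
            annotations.get("hub.jupyter.org/username"),
            annotations.get("hub.jupyter.org/servername", ""),
        )
        # Anything the Hub doesn't know about is an orphan.
        if key not in expected:
            v1.delete_namespaced_pod(pod.metadata.name, namespace)
```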

@yuvipanda
Contributor

If something like #4442 lands, this could more easily be an idle-culler-type service, since a service outside JupyterHub could get access to the spawner config!
