
Mechanism for Spawners to detect and cleanup orphan resources #4544

Open
minrk opened this issue Aug 28, 2023 · 2 comments
@minrk
Member

minrk commented Aug 28, 2023

Proposed change

It would be useful to have a mechanism to locate all resources of possibly orphaned spawners and stop them.

There are 3 main sources of these orphans:

  1. plain old bugs (e.g. "return poll status after first load finish", kubespawner#742)
  2. database rollback or reset (for example: delete jupyterhub.sqlite with running servers)
  3. configuration changes (e.g. pod name template, namespace configuration). Some of these shouldn't actually cause orphans (e.g. persisting resolved name in state instead of using config for running servers), but they can in general dissociate resources from Spawners.

A specific example case that's happening today (KubeSpawner): a single-user pod is running, but does not correspond to any Spawner in the database. These pods should all be deleted. But how?

Alternative options

Leave it as an out-of-band operation for deployments to handle.

Who would use this feature?

Probably mostly kubernetes deployments, but it technically applies to every JupyterHub deployment. Large deployments with many users in general are the most likely to have orphaned resources that take effort to manage manually.

(Optional): Suggest a solution

The challenge here is that Spawner configuration is only loaded per Spawner instance, and our traitlets config approach doesn't lend itself to class methods, so there's no obvious place to put a "list_orphaned_servers".

One candidate is a list_orphaned_servers instance method, called on a Spawner that's either not given a user or given a mock user (as is done in #4442). This is probably the simplest to implement, but it adds an assumption not previously placed on a Spawner: that it can be instantiated without a 'real' user.
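For illustration, a rough sketch of what that could look like; the method name, the `known_servers` argument, and `_list_all_resources` are all hypothetical and not part of the current Spawner API:

```python
# Hypothetical sketch of the proposed instance method. Nothing here is part
# of the current Spawner interface; it only illustrates the shape of the API.
from jupyterhub.spawner import Spawner


class MySpawner(Spawner):
    async def list_orphaned_servers(self, known_servers):
        """Return resources that match no known (username, servername) pair.

        known_servers: set of (username, servername) tuples the Hub knows about.
        """
        orphans = []
        # _list_all_resources is a placeholder for backend-specific discovery,
        # e.g. listing pods in KubeSpawner's configured namespace.
        for resource in await self._list_all_resources():
            key = (resource["username"], resource["servername"])
            if key not in known_servers:
                orphans.append(resource)
        return orphans
```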

The second question is how exactly we implement the reconciliation of the list of all actually running servers with the list of servers the Hub knows about. Currently, all running servers are in memory after init_spawners, so there's an in-memory collection of username/servername combinations.

One option could be Spawner.cleanup_orphaned_resources(all_available_spawners), which would put the responsibility on the Spawner class to implement the cleanup, as sketched below. The assumption that all running servers have a Spawner in memory is going to be another challenge for the horizontal-scaling changes we hope to make someday, which is a point against this approach, but something needs to be able to compute the difference between all existing resources and all expected resources.
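A minimal sketch of that hub-side call, assuming the in-memory spawner collection described above (`cleanup_orphaned_resources` is hypothetical; `user.spawners` and `Spawner.active` follow current JupyterHub internals):

```python
# Hypothetical hub-side reconciliation. The Hub collects every
# (username, servername) pair it believes is running and hands the whole set
# to one Spawner instance, which deletes anything that exists outside it.
async def cleanup_orphans(hub_users, spawner):
    expected = {
        (user.name, server_name)
        for user in hub_users
        for server_name, sp in user.spawners.items()
        if sp.active
    }
    # cleanup_orphaned_resources does not exist yet; per this proposal, the
    # Spawner computes the difference against actually-existing resources.
    await spawner.cleanup_orphaned_resources(expected)
```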

Another, bigger change would be to add a SpawnManager-type class which holds the methods not tied to a single Spawner. But this would need to get all of the deployment configuration (namespaces, credentials, etc.). Pro: this is where e.g. KubeSpawner reflectors belong. Con: it would be a very hard transition for existing classes and configuration.
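A rough sketch of what such a class could look like; `SpawnManager` and everything on it is hypothetical:

```python
# Hypothetical SpawnManager: deployment-wide configuration plus the methods
# that aren't tied to a single Spawner instance. Purely illustrative.
from traitlets import Unicode
from traitlets.config import LoggingConfigurable


class SpawnManager(LoggingConfigurable):
    namespace = Unicode(
        "default",
        config=True,
        help="Example of deployment-wide config a manager would own",
    )

    async def list_all_resources(self):
        """Enumerate every resource this deployment may have created."""
        raise NotImplementedError

    async def delete_resource(self, resource):
        """Delete one backend resource (pod, container, unit, ...)."""
        raise NotImplementedError

    async def cleanup_orphaned_resources(self, expected):
        """Delete resources whose (username, servername) is not expected."""
        for resource in await self.list_all_resources():
            if (resource["username"], resource["servername"]) not in expected:
                await self.delete_resource(resource)
```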

Open questions:

  • there's sometimes a difference between deleted resources and stopped resources (e.g. removing a PVC). I think this should be out-of-scope, but I'm not sure.
  • Is it the hub's or the spawner's responsibility to determine what should be cleaned up? (i.e. the spawner returns all known instances and the hub calls back with those that should be shut down, or the hub tells the spawner all known instances and the spawner identifies and cleans up the difference)
  • How to handle orphans caused by a change in configuration (e.g. switching pod name template or namespace configuration), or is this out of scope?
@jabbera
Contributor

jabbera commented Sep 16, 2023

This could pretty easily be a service like the idle culler for each spawner type. We wrote a k8s cron job for it after stumbling across kubespawner#742, before we knew there was a fix.
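The cron job itself isn't included in the thread; a rough sketch of what such an out-of-band culler could look like for KubeSpawner, assuming the standard component=singleuser-server label and hub.jupyter.org/* annotations (adjust the keys and auth to your deployment):

```python
# Sketch of an out-of-band orphan culler, run e.g. as a k8s CronJob.
# Assumes a KubeSpawner deployment; label and annotation keys vary, so check
# what your pods actually carry before deleting anything.
import os

import requests
from kubernetes import client, config


def cull_orphan_pods(hub_url, namespace):
    # Ask the Hub which servers it believes are running (requires an API
    # token with permission to list users and their servers).
    token = os.environ["JUPYTERHUB_API_TOKEN"]
    resp = requests.get(
        f"{hub_url}/hub/api/users",
        headers={"Authorization": f"token {token}"},
    )
    resp.raise_for_status()
    expected = {
        (user["name"], server_name)
        for user in resp.json()
        for server_name in user.get("servers", {})
    }

    # List the single-user pods that actually exist in the namespace.
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(
        namespace, label_selector="component=singleuser-server"
    )
    for pod in pods.items:
        annotations = pod.metadata.annotations or {}
        key = (
            annotations.get("hub.jupyter.org/username"),
            annotations.get("hub.jupyter.org/servername", ""),
        )
        # Anything the Hub doesn't know about is an orphan.
        if key not in expected:
            v1.delete_namespaced_pod(pod.metadata.name, namespace)
```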

@yuvipanda
Contributor

If something like #4442 lands, this could more easily be an idle-culler-type service, since a service outside JupyterHub could get access to the spawner config!
