
fix(resourcequotas): Update namespace-specific hard quota calculation logic #1088

Merged 1 commit on May 23, 2024

Conversation

lukasboettcher
Contributor

Intends to fix #49

… logic

Signed-off-by: Lukas Boettcher <1340215+lukasboettcher@users.noreply.github.com>

netlify bot commented May 17, 2024

Deploy Preview for capsule-documentation canceled.

🔨 Latest commit: c589894
🔍 Latest deploy log: https://app.netlify.com/sites/capsule-documentation/deploys/664789c665c5f900081158f5

Collaborator

@oliverbaehler left a comment


That's great! I have tested your fork and I don't think it breaks any other functionality. I would like @prometherion's opinion on this as well. However, it does not entirely fix the problem described in the issue.

@prometherion
Member

This looks smart and elegant; I'm wondering why we didn't think of it.

@oliverbaehler I remember Resource Quotas at the Tenant scope have been our Achilles' heel; if the proposed changes prevent the overflow, it's a huge +1 for me.

@prometherion
Member

However it does not entirely fix the problem described in the issue.

May I ask you to elaborate a bit more? I suspect we have a small time window during which the hard quota is not enforced, is that right?

@oliverbaehler
Collaborator

It does not fix it directly; at least, I am still able to overflow namespaces with two consecutive scale commands. Tenant quota:

---
apiVersion: capsule.clastix.io/v1beta2
kind: Tenant
metadata:
  name: oil
spec:
  owners:
  - name: alice
    kind: User
  resourceQuotas:
    scope: Tenant
    items:
    - hard:
        limits.cpu: "4"
        limits.memory: 4Gi
        requests.cpu: "2"
        requests.memory: 2Gi
    - hard:
        pods: "10"

oil-dev:

$ kubectl get pod -n oil-dev
NAME                               READY   STATUS    RESTARTS   AGE
nginx-deployment-7784999db-4r6j4   1/1     Running   0          10h
nginx-deployment-7784999db-b7xvr   1/1     Running   0          10h
nginx-deployment-7784999db-hf66h   1/1     Running   0          10h
nginx-deployment-7784999db-qzlrq   1/1     Running   0          10h
nginx-deployment-7784999db-scl84   1/1     Running   0          10h
nginx-deployment-7784999db-zm7ws   1/1     Running   0          10h


- apiVersion: v1
  kind: ResourceQuota
  metadata:
    annotations:
      quota.capsule.clastix.io/hard-pods: "10"
      quota.capsule.clastix.io/used-pods: "12"
    creationTimestamp: "2024-05-17T21:21:42Z"
    labels:
      capsule.clastix.io/resource-quota: "1"
      capsule.clastix.io/tenant: oil
    name: capsule-oil-1
    namespace: oil-dev
    ownerReferences:
    - apiVersion: capsule.clastix.io/v1beta2
      blockOwnerDeletion: true
      controller: true
      kind: Tenant
      name: oil
      uid: a62d02e8-561d-4a8d-a7de-3e7a78f44d3e
    resourceVersion: "3624"
    uid: 1f03034d-a918-4981-961f-08ffb279e742
  spec:
    hard:
      pods: "6"
  status:
    hard:
      pods: "6"
    used:
      pods: "6"

oil-test:

$ kubectl get pod -n oil-test
NAME                               READY   STATUS    RESTARTS   AGE
nginx-deployment-7784999db-54j74   1/1     Running   0          10h
nginx-deployment-7784999db-5l2mw   1/1     Running   0          10h
nginx-deployment-7784999db-p4zxn   1/1     Running   0          10h
nginx-deployment-7784999db-pt688   1/1     Running   0          10h
nginx-deployment-7784999db-wm4qt   1/1     Running   0          10h
nginx-deployment-7784999db-x5vc7   1/1     Running   0          10h

- apiVersion: v1
  kind: ResourceQuota
  metadata:
    annotations:
      quota.capsule.clastix.io/hard-pods: "10"
      quota.capsule.clastix.io/used-pods: "12"
    creationTimestamp: "2024-05-17T21:21:44Z"
    labels:
      capsule.clastix.io/resource-quota: "1"
      capsule.clastix.io/tenant: oil
    name: capsule-oil-1
    namespace: oil-test
    ownerReferences:
    - apiVersion: capsule.clastix.io/v1beta2
      blockOwnerDeletion: true
      controller: true
      kind: Tenant
      name: oil
      uid: a62d02e8-561d-4a8d-a7de-3e7a78f44d3e
    resourceVersion: "3625"
    uid: f79557c9-5bda-4772-8918-28b4b63b2a2f
  spec:
    hard:
      pods: "6"
  status:
    hard:
      pods: "6"
    used:
      pods: "6"

But the calculation used here is more consistent, so it's a step in the right direction.
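For illustration, here is a rough, hypothetical sketch (not the actual Capsule code) of the kind of per-namespace calculation being discussed, assuming each namespace's hard limit is derived from the tenant-wide hard minus what the other namespaces of the tenant already use:

package quota

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// namespaceHard sketches the hard limit to write into a single namespace's
// ResourceQuota: the tenant-wide hard minus the usage reported by all other
// namespaces of the tenant, floored at zero.
func namespaceHard(tenantHard corev1.ResourceList, otherUsed []corev1.ResourceList) corev1.ResourceList {
	out := corev1.ResourceList{}
	for name, hard := range tenantHard {
		remaining := hard.DeepCopy()
		for _, used := range otherUsed {
			if u, ok := used[name]; ok {
				remaining.Sub(u)
			}
		}
		if remaining.Sign() < 0 {
			remaining = *resource.NewQuantity(0, hard.Format)
		}
		out[name] = remaining
	}
	return out
}

Even with a calculation like this, the per-namespace hard values are only as fresh as the last reconciliation, which is where the race described below comes from.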

So we somehow need to cover the case where a lot of scaling or resource updates happen at the same time and our controller is too slow to update the quotas.

One way to prevent these race conditions is to have a webhook which calculates the resources requested by any object being created or updated, as you have already pointed out in earlier comments. Essentially we need this function here:

Big question mark on the performance impact if we implement it like this.

But I don't think we get around the fact that we need a ValidatingWebhook which validates all the objects; the question is what happens inside the webhook function. I was also thinking about whether we should implement something like a locking mechanism, so that resource quotas are locked while they are updated or synced.
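To make that concrete, here is a purely illustrative sketch (not Capsule's actual webhook; the usage lookup, the request decoding, and the naive per-process lock are all assumptions) of a validating handler along those lines:

package webhook

import (
	"context"
	"net/http"
	"sync"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

type quotaValidator struct {
	mu sync.Mutex // naive per-process "lock" while usage is re-evaluated

	// lookupTenantUsage and requestedBy are placeholders: a real implementation
	// would map the namespace to its Tenant, aggregate ResourceQuota usage, and
	// decode the incoming object from req.Object.
	lookupTenantUsage func(ctx context.Context, namespace string) (used, hard corev1.ResourceList, err error)
	requestedBy       func(req admission.Request) corev1.ResourceList
}

// exceedsHard reports whether used+incoming would exceed any tenant-wide hard limit.
func exceedsHard(used, incoming, hard corev1.ResourceList) bool {
	for name, limit := range hard {
		total := used[name].DeepCopy()
		if inc, ok := incoming[name]; ok {
			total.Add(inc)
		}
		if total.Cmp(limit) > 0 {
			return true
		}
	}
	return false
}

func (v *quotaValidator) Handle(ctx context.Context, req admission.Request) admission.Response {
	v.mu.Lock()
	defer v.mu.Unlock()

	used, hard, err := v.lookupTenantUsage(ctx, req.Namespace)
	if err != nil {
		return admission.Errored(http.StatusInternalServerError, err)
	}
	if exceedsHard(used, v.requestedBy(req), hard) {
		return admission.Denied("tenant-wide resource quota exceeded")
	}
	return admission.Allowed("")
}

A real implementation would still have to decide how to aggregate tenant-wide usage quickly enough on every admission, which is exactly the performance question raised above.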

@lukasboettcher
Contributor Author

I can reproduce the problem for object count quotas with quick consecutive scaling, or when creating resources in separate namespaces in a single request. For compute resource quotas this seems to hold up, but I believe that is only because of the order/timing in which these quotas are updated and evaluated, so there is definitely a race condition.

@oliverbaehler merged commit b16bcda into projectcapsule:main on May 23, 2024
23 checks passed
Successfully merging this pull request may close these issues.

Tenant Resource Quota admission controller doesn't block resources creation