
metric tekton_pipelines_controller_pipelinerun_duration_seconds suddenly stops reporting #7902

Open
gerrnot opened this issue Apr 23, 2024 · 7 comments
Labels: kind/bug

Comments

gerrnot commented Apr 23, 2024

Expected Behavior

When the lastvalue duration type is configured (see the config at the end of this post), we expect the metric tekton_pipelines_controller_pipelinerun_duration_seconds to report a value for every single PipelineRun on every single scrape, for as long as that PipelineRun exists in k8s.

Actual Behavior

The values are present in the initial scrapes but disappear over time. For example, a PipelineRun started in the morning yields metrics for several hours, but after a certain point in time it yields no more metrics (verified by checking the /metrics endpoint of the pipelines-controller, default port 9090).

A picture says more than a thousand words:
[screenshot: Prometheus graph of tekton_pipelines_controller_pipelinerun_duration_seconds showing a roughly 30-minute gap with no reporting time series]

When the metrics are visualized in Prometheus (picture above), you would believe that during the roughly 30-minute gap in the middle there was no PipelineRun in the cluster. This is not true! There were plenty; they are simply no longer contained in the metrics output.
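
For reference, checking that endpoint looks roughly like this (a minimal sketch; the namespace and deployment name assume a default Tekton installation):

# Port-forward the pipelines-controller metrics port (default 9090)
kubectl -n tekton-pipelines port-forward deployment/tekton-pipelines-controller 9090:9090 &

# Dump the duration metric from the latest scrape
curl -s http://localhost:9090/metrics \
  | grep '^tekton_pipelines_controller_pipelinerun_duration_seconds'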

Steps to Reproduce the Problem

  1. Configure metrics to use the lastvalue setting (as in the example provided at the bottom of this post).
  2. Recommended: also set up Prometheus to scrape them; this makes it easier to visualize.
  3. Produce plenty of PipelineRuns throughout the day.
  4. Do some cleanups of PipelineRuns throughout the day - but never go down to zero. We do cleanups like this, but the issue is potentially also reproducible without cleanups.
  5. Find the ends of the time series.
    If you followed step 2, you essentially just need to look at the graph in Prometheus.
    If you find a gap like in the picture above, the issue is reproduced. (Clarification: it only looks like a gap; an actual gap would mean the same time series continues later, which it does not - those are new PipelineRuns, i.e. new time series!)
    Otherwise (not using Prometheus), the procedure is: for each PipelineRun in k8s, check whether it is also part of the latest scrape (see the sketch below this list).
    If an instance is found that exists in k8s but is not in the metrics output, the problem is reproduced.
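
A minimal sketch of that per-PipelineRun cross-check (reusing the port-forward from the sketch above; the pipelinerun label name on the metric is an assumption based on the pipelinerun metrics level):

# All PipelineRuns currently present in k8s
kubectl get pipelineruns --all-namespaces --no-headers \
  -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name > prs.txt

# The duration metric from the latest scrape
curl -s http://localhost:9090/metrics \
  | grep '^tekton_pipelines_controller_pipelinerun_duration_seconds' > scrape.txt

# Any PipelineRun that exists in k8s but is missing from the scrape reproduces the issue
while read -r ns name; do
  grep -q "pipelinerun=\"$name\"" scrape.txt || echo "missing from metrics: $ns/$name"
done < prs.txt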

Additional Info

  • Kubernetes version:

    Output of kubectl version:

Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:53:42Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.10", GitCommit:"0fa26aea1d5c21516b0d96fea95a77d8d429912e", GitTreeState:"clean", BuildDate:"2024-01-17T13:38:41Z", GoVersion:"go1.20.13", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:

    Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

Client version: 0.36.0
Chains version: v0.20.0
Pipeline version: v0.56.1
Triggers version: v0.26.1
Dashboard version: v0.43.1
Operator version: v0.70.0
  • Config info:
    We used the following Tekton operator settings on the pipeline component:
    metrics.count.enable-reason: false
    metrics.pipelinerun.duration-type: lastvalue
    metrics.pipelinerun.level: pipelinerun
    metrics.taskrun.duration-type: lastvalue
    metrics.taskrun.level: taskrun
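
    For completeness, this is roughly how those values can be applied through the operator (a sketch, assuming the default TektonConfig resource named config and that the operator accepts these keys under spec.pipeline):

    kubectl patch tektonconfig config --type merge -p '{
      "spec": {
        "pipeline": {
          "metrics.count.enable-reason": false,
          "metrics.pipelinerun.duration-type": "lastvalue",
          "metrics.pipelinerun.level": "pipelinerun",
          "metrics.taskrun.duration-type": "lastvalue",
          "metrics.taskrun.level": "taskrun"
        }
      }
    }'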
    
gerrnot added the kind/bug label Apr 23, 2024

khrm commented Apr 23, 2024

For how long do we need to run the controller to reproduce? Is this with the latest package? Is it possible that the metrics endpoint goes down?


gerrnot commented Apr 24, 2024

Based on our data, this is always reproducible when looking at a controller lifetime of 12h.
Actually, the gaps should already appear after 10h, which made me wonder whether they are related to the full reconciliation loops happening at that time.
But there are also terminated time series that are not aligned with the 10h interval, so I did not mention it in the original description.

The controller pod, however, has no container restarts within the business day.

We do restart the pipelinerun controller once a day so that the 10h full reconciliation loop falls outside business hours.
But still, the reported metrics should reflect what is in k8s (the source of truth).
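
(For context, such a daily restart can be done with something along the lines of the command below; the namespace and deployment name assume a default installation.)

kubectl -n tekton-pipelines rollout restart deployment/tekton-pipelines-controller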

I could verify that other metrics have a continuous timeline, so the metrics endpoint does not appear to go down.

We are using the latest operator version at the moment (exact versions are in the original post), and I could not override the controller image to the latest controller release, e.g. via the following TektonConfig property (an independent defect):

spec:
  pipeline:
    options:
      deployments:
        tekton-pipelines-controller:
          spec:
            template:
              spec:
                containers:
                  - name: tekton-pipelines-controller
                    image: gcr.io/tekton-releases/github.com/tektoncd/pipeline/cmd/controller:v0.58.0

...so I could not easily test whether this also happens with the very latest released image version (pipelinerun controller v0.58.0).
But since our version (v0.56.1) is quite recent, there is a high chance that this is reproducible with the latest release as well.


khrm commented May 6, 2024

@gerrnot I am not able to reproduce this in our cluster.
Can you share more details on the type of cluster and environment?

[screenshot: graph of a Tekton controller metric from khrm's cluster, showing a continuous time series with no gaps]


gerrnot commented May 13, 2024

@khrm This occurs only for specific metrics (e.g. tekton_pipelines_controller_pipelinerun_duration_seconds) and specific settings, e.g. (these are the settings as the Tekton operator understands them):

metrics.pipelinerun.duration-type: lastvalue
metrics.pipelinerun.level: pipelinerun

PS: the metric that you provided is also continuous on our system.


khrm commented May 14, 2024

@gerrnot Are you using OpenShift? Or plain Kubernetes?


khrm commented May 14, 2024

I believe I have identified the problem. I need to confirm.


gerrnot commented May 16, 2024

@khrm Thanks a lot for investigating this. We use vanilla kubernetes (on-prem kubespray cluster).
