Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transcode success rate metric fixes #2684

Merged
merged 5 commits into from Jan 13, 2023
Merged

Transcode success rate metric fixes #2684

merged 5 commits into from Jan 13, 2023

Conversation

mjh1
Copy link
Contributor

@mjh1 mjh1 commented Dec 9, 2022

What does this pull request do? Explain your changes. (required)

Fix the transcode success rate metric which was not taking into account all errors and therefore mostly reporting 100% success.

Specific updates (required)

  • Previously the success rate metric was only taking into account NoOrchestrator transcode errors. We are now taking into account maximum transcode attempts reached, context cancellations and non-retryable errors.

How did you test each of these updates (required)

Due to it being difficult to reproduce these errors the testing was done locally by temporarily hard coding these errors being thrown and observing the expected reduction in the success rate.

Does this pull request close any open issues?

Fixes #2674

Checklist:

@codecov
Copy link

codecov bot commented Dec 9, 2022

Codecov Report

Merging #2684 (21d0e09) into master (a5606ca) will decrease coverage by 0.02877%.
The diff coverage is 0.00000%.

Impacted file tree graph

@@                 Coverage Diff                 @@
##              master       #2684         +/-   ##
===================================================
- Coverage   56.33265%   56.30388%   -0.02877%     
===================================================
  Files             88          88                 
  Lines          19131       19139          +8     
===================================================
- Hits           10777       10776          -1     
- Misses          7765        7771          +6     
- Partials         589         592          +3     
Impacted Files Coverage Δ
monitor/census.go 63.24921% <ø> (ø)
server/broadcast.go 77.46835% <0.00000%> (-0.61151%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a5606ca...21d0e09. Read the comment docs.

Impacted Files Coverage Δ
monitor/census.go 63.24921% <ø> (ø)
server/broadcast.go 77.46835% <0.00000%> (-0.61151%) ⬇️

@mjh1
Copy link
Contributor Author

mjh1 commented Jan 11, 2023

@yondonfu are you ok with me just testing this locally by putting artificial hard coded errors in? Reproducing them in an environment seems tricky.

@yondonfu
Copy link
Member

@yondonfu are you ok with me just testing this locally by putting artificial hard coded errors in? Reproducing them in an environment seems tricky.

That seems reasonable.

@mjh1 mjh1 marked this pull request as ready for review January 12, 2023 11:36
@mjh1 mjh1 requested review from yondonfu and leszko January 12, 2023 11:36
Copy link
Contributor

@leszko leszko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR itself looks good, but I can't connect the dots of how adding 3 new metrics fixes too high success rate metric.

From my understanding, adding 3 new metrics (what I see in these PR) will just add more data, however the old query in Grafana will display exactly the same as before. Do we plan to use a different query in Grafana now?

@mjh1
Copy link
Contributor Author

mjh1 commented Jan 12, 2023

The PR itself looks good, but I can't connect the dots of how adding 3 new metrics fixes too high success rate metric.

From my understanding, adding 3 new metrics (what I see in these PR) will just add more data, however the old query in Grafana will display exactly the same as before. Do we plan to use a different query in Grafana now?

We continue using the same grafana query because these new calls to SegmentTranscodeFailed() mean that these failure scenarios get taken into account for the success metric. This is because SegmentTranscodeFailed() results in a call to countSegmentTranscoded() with the failed argument true here: https://github.com/livepeer/go-livepeer/blob/a5606ca7/monitor/census.go#L1369
This is the function that records the data used for the success rate metric, this same function is called in the successful case with the failed argument false.

Yondon had a good explanation on the ticket too if it helps: #2674

Copy link
Member

@yondonfu yondonfu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@leszko
Copy link
Contributor

leszko commented Jan 13, 2023

The PR itself looks good, but I can't connect the dots of how adding 3 new metrics fixes too high success rate metric.
From my understanding, adding 3 new metrics (what I see in these PR) will just add more data, however the old query in Grafana will display exactly the same as before. Do we plan to use a different query in Grafana now?

We continue using the same grafana query because these new calls to SegmentTranscodeFailed() mean that these failure scenarios get taken into account for the success metric. This is because SegmentTranscodeFailed() results in a call to countSegmentTranscoded() with the failed argument true here: https://github.com/livepeer/go-livepeer/blob/a5606ca7/monitor/census.go#L1369 This is the function that records the data used for the success rate metric, this same function is called in the successful case with the failed argument false.

Yondon had a good explanation on the ticket too if it helps: #2674

Ahh, I see. Thanks for the explanation!

@mjh1 mjh1 merged commit 5e96382 into master Jan 13, 2023
@mjh1 mjh1 deleted the mh/success-metric branch January 13, 2023 10:05
@thomshutt thomshutt mentioned this pull request Feb 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

P1 - Success rate metric only calculated based on NoOrchestrator transcode errors
3 participants