P1 - Success rate metric only calculated based on NoOrchestrator transcode errors #2674

Open
yondonfu opened this issue Nov 30, 2022 · 6 comments · Fixed by #2684

Comments

yondonfu (Member) commented Nov 30, 2022

Describe the bug

The success rate Grafana graph here shows a success rate consistently >= 100%, even though we know that there have been transcode failures.

From reviewing the metrics code, it appears that the success rate metric is updated whenever there is a call to census.sendSuccess() in monitor/census.go.

I see this method being called in at least two places:

  1. In SegmentFullyTranscoded(), which is called here after B finishes downloading all results from O
  2. In segmentTranscodeFailed(), which is called in SegmentTranscodeFailed(), which is called whenever a transcode error is encountered

For 2, there is a concept of a "permanent" vs. "non-permanent" transcode error, indicated via the permanent bool passed to SegmentTranscodeFailed(). We can see non-permanent errors being recorded here. The only place where a permanent error is recorded is here, for NoOrchestrator transcode errors. This seems problematic because only permanent errors trigger a call to census.sendSuccess() here when recording a transcode error. As a result, I don't think we are properly updating the success rate metric in at least the following places (see the sketch after this list):

  • When we hit the max # of transcode attempts which prompts B to give up on a segment here
  • When we hit a non-retryable transcode error which prompts B to give up on a segment here
  • When the caller (i.e. HTTP push client) gives up on the transcode resulting in a context cancellation here
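
To make that concrete, below is a minimal sketch of the pattern described above. The struct, field, and method signatures are simplified stand-ins rather than the actual monitor/census.go code; the point is just that the metric is only fed on a success or on a permanent failure, so the non-permanent failure paths never reach it.

```go
// Simplified illustration of the reported behavior; names and signatures are
// hypothetical, not the real monitor/census.go implementation.
package monitor

// census stands in for the object behind monitor/census.go.
type census struct {
	successes int
	total     int
}

// sendSuccess stands in for census.sendSuccess(), the only place where the
// success rate metric is updated.
func (c *census) sendSuccess(ok bool) {
	c.total++
	if ok {
		c.successes++
	}
}

// SegmentFullyTranscoded: B downloaded all results from O, so count a success.
func (c *census) SegmentFullyTranscoded() {
	c.sendSuccess(true)
}

// SegmentTranscodeFailed: only *permanent* errors feed the metric, so segments
// dropped after max attempts, on non-retryable errors, or on context
// cancellation never lower the success rate.
func (c *census) SegmentTranscodeFailed(permanent bool) {
	if permanent {
		c.sendSuccess(false)
	}
}
```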

To Reproduce

We could trigger the aforementioned errors that are not currently being factored in and observe that the success rate is unaffected. A solution should demonstrate that these errors cause the success rate to drop.

Expected behavior

I expect the success rate metric to properly factor in all transcode errors that result in no renditions for a segment that is passed in.

Generally, I see at least these categories of transcode errors that should cause the success rate to drop (see the sketch after this list):

  • If B hits the max # of transcode attempts - B should give up on the segment b/c it tried enough times already
  • If B hits a non-retryable error - B should give up on the segment b/c it knows that this segment likely just cannot be transcoded
  • If B knows that the caller (i.e. HTTP push client) is no longer waiting for a result - B should give up on the segment b/c it knows no one cares about the results anymore because the transcode was too slow
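
For illustration, here is a hedged sketch of how all three give-up paths could be routed through the same failure accounting. The function and error names are hypothetical, not actual go-livepeer identifiers; a real fix would call the existing metrics code (e.g. a permanent SegmentTranscodeFailed) at these points.

```go
// Hypothetical sketch only; names do not correspond to real go-livepeer code.
package broadcast

import (
	"context"
	"errors"
)

var errMaxAttempts = errors.New("hit max transcode attempts")

// recordPermanentFailure stands in for whatever metrics call should lower the
// success rate when B gives up on a segment.
func recordPermanentFailure(reason string) { /* emit failure metric */ }

// giveUpOnSegment shows the three give-up paths listed above, each of which
// should count as a failed segment.
func giveUpOnSegment(ctx context.Context, attempts, maxAttempts int, err error, retryable bool) error {
	switch {
	case attempts >= maxAttempts:
		// 1. Too many transcode attempts.
		recordPermanentFailure("max_attempts")
		return errMaxAttempts
	case err != nil && !retryable:
		// 2. Non-retryable error: this segment likely cannot be transcoded.
		recordPermanentFailure("non_retryable")
		return err
	case ctx.Err() != nil:
		// 3. The caller (HTTP push client) stopped waiting; context cancelled.
		recordPermanentFailure("context_cancelled")
		return ctx.Err()
	}
	return nil
}
```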

thomshutt (Contributor) commented:

@mjh1 promise this is the last one, but it's quite a good one for you to take since it involves understanding some of the Broadcaster logs and how we record metrics

Quintendos changed the title from "Success rate metric only calculated based on NoOrchestrator transcode errors" to "P1 - Success rate metric only calculated based on NoOrchestrator transcode errors" on Dec 7, 2022
Quintendos commented:

Success rate metrics. Why? It improves our visibility and lets us compare better with Evan's graph. Difficult to investigate right now.

mjh1 (Contributor) commented Dec 12, 2022

@yondonfu do you think, rather than making sure we emit the metric for every failure scenario, we could introduce a metric for "transcode started" (if it doesn't exist) and then use transcode_success/transcode_start for our success rate? That way we wouldn't have to worry about potentially missing any obscure failure scenarios.

e.g. abd1334, but then I see that in some cases the SegmentTranscoded metric is fired from the different transcoder implementations, so maybe we would need to fire the transcode start metric there too.
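
A minimal sketch of that counter-based idea, assuming illustrative names (transcode_start / transcode_success) and plain counters rather than the project's actual metrics plumbing:

```go
// Illustrative counters only; a real version would go through the existing
// metrics pipeline in monitor/census.go.
package monitor

import "sync/atomic"

var (
	transcodeStart   int64 // incremented when B starts transcoding a segment
	transcodeSuccess int64 // incremented when B receives all renditions
)

// TranscodeStarted would fire once per segment, including from the alternative
// transcoder implementations that currently fire SegmentTranscoded directly.
func TranscodeStarted() { atomic.AddInt64(&transcodeStart, 1) }

// TranscodeSucceeded would fire alongside SegmentFullyTranscoded.
func TranscodeSucceeded() { atomic.AddInt64(&transcodeSuccess, 1) }

// SuccessRate is what the dashboard would effectively compute as
// transcode_success / transcode_start.
func SuccessRate() float64 {
	start := atomic.LoadInt64(&transcodeStart)
	if start == 0 {
		return 1
	}
	return float64(atomic.LoadInt64(&transcodeSuccess)) / float64(start)
}
```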

yondonfu (Member, Author) commented Jan 4, 2023

do you think, rather than making sure we emit the metric for every failure scenario, we could introduce a metric for "transcode started" (if it doesn't exist) and then use transcode_success/transcode_start for our success rate? That way we wouldn't have to worry about potentially missing any obscure failure scenarios.

@mjh1 Does that mean that if B starts multiple transcodes for segments concurrently, then until those transcodes complete we could see the success rate metric drop, since transcode_start would be incremented while B is still waiting for the transcodes to complete? That's the main concern that comes to mind, and the reason for updating the success metric only when a transcode is complete (either with a success or a legitimate failure).
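
As a rough, hypothetical illustration of that concern: if B has started 10 segment transcodes and 8 have completed successfully while 2 are still in flight, a transcode_success/transcode_start panel would momentarily read 8/10 = 80% even though nothing has failed, and it would only return to 100% once the remaining transcodes complete.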

mjh1 (Contributor) commented Jan 6, 2023

ah yep you're totally right, will forget that idea

mjh1 (Contributor) commented Jan 13, 2023

Reopening until the change is deployed.

mjh1 reopened this Jan 13, 2023