Import tasks getting stuck mid execution #19

Open
victorges opened this issue Apr 29, 2022 · 0 comments

We have some weird cases of tasks getting stuck mid-execution. These tasks keep getting
re-executed, because RabbitMQ eventually times out (after 30m) waiting for an ACK and
terminates the connection with the client, nacking all the messages it had in flight. The task is then
re-executed as if nothing had happened [1].

[1] This is a separate bug that we should also address: the task should just fail if it had already
started running before and then disappeared, which we can already tell from the metadata in the API.
It is not the root cause though, so we still need to investigate and fix the stuck tasks.
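
As a rough illustration of the guard described in [1], the sketch below refuses to start a task whose API metadata already says `phase: "running"`. The function and exception names are hypothetical, and the task is assumed to be a dict shaped like the JSON shown later in this issue; it is not the task-runner's actual code.

```python
class TaskAlreadyStarted(Exception):
    """Raised when a redelivered task was already started by an earlier attempt."""


def ensure_not_started(task: dict) -> None:
    # Assumption: the API exposes the last known phase under status.phase,
    # as in the task JSON in this issue.
    phase = task.get("status", {}).get("phase")
    if phase == "running":
        # A previous attempt started this task and then disappeared.
        # Fail fast instead of silently re-running it.
        raise TaskAlreadyStarted(f"task {task.get('id')} was already running")


def handle_delivery(task: dict, run) -> str:
    # `run` is a hypothetical callable that executes the task.
    try:
        ensure_not_started(task)
    except TaskAlreadyStarted:
        return "failed"
    run(task)
    return "completed"
```

With a guard like this, a message redelivered after the 30m ACK timeout would end in a failed task instead of another silent re-execution.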

There are no logs indicating what is wrong, but I have a slight suspicion about one of:

  • the "progress reporting" logic
  • the "piping" logic in the import task which sends a stream both to ffprobe and to the storage
  • the S3 upload client

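To make the second suspect concrete, here is a minimal sketch of the "tee" pattern, where one download stream is copied to two consumers. It is only an illustration with hypothetical names, not the import task's real piping code. Because each chunk is written to every sink in turn, the loop cannot advance if any one sink stops accepting data, which is one way such a pipeline could go silent without logging anything.

```python
import io


def tee_copy(src, sinks, chunk_size=64 * 1024):
    """Copy src to every sink chunk by chunk and return the number of bytes read.

    If any sink.write() call blocks (for example, a consumer that stopped
    reading its pipe), the whole loop blocks with it.
    """
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break  # end of stream
        for sink in sinks:
            sink.write(chunk)
        total += len(chunk)
    return total


# Usage with in-memory buffers standing in for the probe and the storage upload.
probe_input, storage_input = io.BytesIO(), io.BytesIO()
copied = tee_copy(io.BytesIO(b"x" * 200_000), [probe_input, storage_input])
```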
The first tasks where I found this error were importing large stream recordings, which take
12+ minutes to download even on a good connection due to the on-demand MP4 generation bottleneck.
That was already weird, since we have a hard timeout of 10 minutes, so the task runner
should have just failed the task instead of going silent.
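
For reference, a hard timeout that still fires when the work itself is blocked can be sketched as below. This is an assumption about the general shape of such a wrapper, with hypothetical names, and not the task runner's implementation. The work runs in a worker thread while the caller waits with a deadline, so the caller gets an error even if the worker never returns.

```python
import concurrent.futures
import threading


class TaskTimedOut(Exception):
    """Raised when the task does not finish within the hard timeout."""


def run_with_hard_timeout(fn, timeout_s):
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        # Wait for the result, but only up to the deadline.
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise TaskTimedOut(f"task exceeded hard timeout of {timeout_s}s") from None
    finally:
        # Do not wait for a possibly stuck worker thread.
        executor.shutdown(wait=False)


# Usage: a worker that blocks until released stands in for a stuck task.
release = threading.Event()


def stuck_task():
    release.wait()
    return "late"


try:
    run_with_hard_timeout(stuck_task, timeout_s=0.05)
    outcome = "finished"
except TaskTimedOut:
    outcome = "failed"
finally:
    release.set()  # let the worker thread exit
```

Note that the worker thread is not killed here; it is only abandoned. The point of the sketch is that the caller always learns about the timeout and can fail the task.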

Now I have found an even weirder case though. It is from a regular "import" task, which is not
importing a recording but just another asset that the user was testing with. This is the
task:

{
    "id": "51ea2a1e-618e-452d-a024-7c5a0ace266f",
    "type": "import",
    "params": {
        "import": {
            "url": "https://livepeercdn.com/asset/REDACTED/video"
        }
    },
    "status": {
        "phase": "running",
        "progress": 0.649,
        "updatedAt": 1651269956139
    },
    "userId": "REDACTED",
    "createdAt": 1650886712179,
    "outputAssetId": "4582de3b-ead3-4ffe-8b6d-b130f61290a1"
}
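
Assuming `createdAt` and `updatedAt` are Unix timestamps in milliseconds, the task above was created on 2022-04-25 and was still reporting progress on 2022-04-29, roughly 4.4 days later. The sketch below computes that span and flags tasks that have been "running" for too long. The helper names and the one-hour threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch_ms(ms: int) -> datetime:
    # Integer arithmetic avoids float rounding on millisecond timestamps.
    return EPOCH + timedelta(milliseconds=ms)


def running_span(task: dict) -> timedelta:
    """Time between task creation and its last status update."""
    return timedelta(milliseconds=task["status"]["updatedAt"] - task["createdAt"])


def looks_stuck(task: dict, max_span: timedelta = timedelta(hours=1)) -> bool:
    # Hypothetical rule: a task still "running" long after creation is suspect.
    return task["status"]["phase"] == "running" and running_span(task) > max_span


task = {
    "id": "51ea2a1e-618e-452d-a024-7c5a0ace266f",
    "status": {"phase": "running", "progress": 0.649, "updatedAt": 1651269956139},
    "createdAt": 1650886712179,
}
print(from_epoch_ms(task["createdAt"]), running_span(task), looks_stuck(task))
```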

The asset is around 5GB and takes less than a minute to download on a good connection, so there's
no clear reason why the task-runner is getting stuck.

victorges added a commit that referenced this issue Apr 29, 2022
We have some tasks that are being re-executed
over and over again since they get stuck in the
task-runner logic. We should fix the root cause
of those, but to avoid the problem from getting
worse we should also avoid re-running these tasks
over and over again.

This fixes that by not even starting tasks that we
find out had already been started (phase=running).

This is related to #19
victorges added a commit that referenced this issue May 3, 2022