Janitor marks runs lost before their time #2079

Open
taylordowns2000 opened this issue May 13, 2024 · 2 comments
Labels: bug (Newly identified bug)

Comments

@taylordowns2000 (Member)

The Janitor uses this query to determine which runs are lost and marks them as such:

  @doc """
  Return all runs that have been claimed by a worker before the earliest
  acceptable start time (determined by the longest acceptable run time) but are
  still incomplete. This indicates that we may have lost contact with the worker
  that was responsible for executing the run.
  """
  @spec lost(DateTime.t()) :: Ecto.Queryable.t()
  def lost(%DateTime{} = now) do
    max_run_duration_seconds =
      Application.get_env(:lightning, :max_run_duration_seconds)

    grace_period = Lightning.Config.grace_period()

    oldest_valid_claim =
      now
      |> DateTime.add(-max_run_duration_seconds, :second)
      |> DateTime.add(-grace_period, :second)

    final_states = Run.final_states()

    from(att in Run,
      where: is_nil(att.finished_at),
      where: att.state not in ^final_states,
      where: att.claimed_at < ^oldest_valid_claim
    )
  end

Now that we have dynamic max_run_duration being passed to the worker and we allow some runs to run longer than others, we need to refactor this query to account for the variable time limits.

Possible solutions:

  1. Is the allowed run duration for a run stored on the run itself as metadata? If so, we could rewrite this query to check each run "against itself", i.e., where runs.started_at + runs.metadata['max_duration'] < now() marks it as lost (rough sketch after this list).
  2. ..?
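
A rough sketch of option 1, assuming the per-run limit were persisted at creation or claim time in a hypothetical options JSONB column on runs (the column and key names are illustrative, not the current schema), keeping claimed_at as the reference point like the existing query, as a drop-in variant inside Lightning.Runs.Query:

  def lost(%DateTime{} = now) do
    grace_period = Lightning.Config.grace_period()
    final_states = Run.final_states()

    from(r in Run,
      where: is_nil(r.finished_at),
      where: r.state not in ^final_states,
      # lost if the run's own allowed window (plus grace period) has already elapsed
      where:
        fragment(
          "? + ((?->>'max_duration_seconds')::int + ?) * interval '1 second' < ?",
          r.claimed_at,
          r.options,
          ^grace_period,
          ^now
        )
    )
  end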
@taylordowns2000 taylordowns2000 added the bug Newly identified bug label May 13, 2024
@taylordowns2000 (Member, Author)

@stuartc, interesting one here. We do not store the options on the runs table. Instead, they're generated on the fly by the worker when it claims a run. This has a couple of interesting impacts:

  1. Nice: If I have 100 runs in the queue and they're failing because of timeouts, I can change the limits (new ENV if it's my deployment, upgrade plan if on a hosted deployment, etc.) and then they'll start to succeed.
  2. Naughty: If I mark disable_console_log: true as one of the options for my workflow (note this still hasn't been ported from v1, but will be coming soon) and then execute a run that gets stuck in the queue, someone else might come along and enable console.log statements. Even though it was disabled when I created the run, the worker would come along and discover (at claim time) that it's totally allowed to use console.log.
  3. A little frustrating, but neither inherently good nor bad: If there are a bunch of unfinished runs and I want to see which are actually lost, there's no way to check the runs table to distinguish runs that are genuinely lost from runs that simply have extended durations compared to the instance default. @elias-ba points out that we could rewrite this Runs.Query.lost/1 as Runs.Query.lost/2 and query per project, or at least per set of projects on the same plan (i.e., a set of projects with identical run timeout limits), and that's probably the best near-term fix (rough sketch below).
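
For reference, a minimal sketch of that lost/2 idea: scope the existing query to a set of projects that share the same timeout and pass that timeout in. The argument shape and the join path from runs to projects (via work order and workflow) are assumptions, not a confirmed API:

  @spec lost(DateTime.t(), %{
          project_ids: [Ecto.UUID.t()],
          max_run_duration_seconds: integer()
        }) :: Ecto.Queryable.t()
  def lost(%DateTime{} = now, %{
        project_ids: project_ids,
        max_run_duration_seconds: max_run_duration_seconds
      }) do
    grace_period = Lightning.Config.grace_period()

    oldest_valid_claim =
      now
      |> DateTime.add(-max_run_duration_seconds, :second)
      |> DateTime.add(-grace_period, :second)

    final_states = Run.final_states()

    from(r in Run,
      # assumed join path: run -> work_order -> workflow -> project
      join: wo in assoc(r, :work_order),
      join: wf in assoc(wo, :workflow),
      where: wf.project_id in ^project_ids,
      where: is_nil(r.finished_at),
      where: r.state not in ^final_states,
      where: r.claimed_at < ^oldest_valid_claim
    )
  end

The Janitor would then call this once per timeout group rather than once overall.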

Thoughts on this?

@taylordowns2000 (Member, Author)

Moving to "in review" to get early feedback from @elias-ba or @stuartc (both have worked on this recently) before continuing. Guys, please see #2085.
