Janitor marks runs lost before their time #2079

Open
taylordowns2000 opened this issue May 13, 2024 · 2 comments
Labels: bug (Newly identified bug)

Comments

@taylordowns2000 (Member)

The Janitor uses this query to determine which runs are lost and marks them as such:

  @doc """
  Return all runs that have been claimed by a worker before the earliest
  acceptable start time (determined by the longest acceptable run time) but are
  still incomplete. This indicates that we may have lost contact with the worker
  that was responsible for executing the run.
  """
  @spec lost(DateTime.t()) :: Ecto.Queryable.t()
  def lost(%DateTime{} = now) do
    max_run_duration_seconds =
      Application.get_env(:lightning, :max_run_duration_seconds)

    grace_period = Lightning.Config.grace_period()

    oldest_valid_claim =
      now
      |> DateTime.add(-max_run_duration_seconds, :second)
      |> DateTime.add(-grace_period, :second)

    final_states = Run.final_states()

    from(att in Run,
      where: is_nil(att.finished_at),
      where: att.state not in ^final_states,
      where: att.claimed_at < ^oldest_valid_claim
    )
  end

Now that we have dynamic max_run_duration being passed to the worker and we allow some runs to run longer than others, we need to refactor this query to account for the variable time limits.

Possible solutions:

  1. Is the allowed run duration for a run stored on the run itself as metadata? If so, we could rewrite this query to check each run "against itself", i.e., where runs.started_at + runs.metadata['max_duration'] < now() marks it as lost (rough sketch after this list).
  2. ..?
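
A rough sketch of option 1, assuming the per-run limit were persisted at creation or claim time in a hypothetical options JSONB column on runs (the column and key names are illustrative, not the current schema), keeping claimed_at as the reference point like the existing query, as a drop-in variant inside Lightning.Runs.Query:

  def lost(%DateTime{} = now) do
    grace_period = Lightning.Config.grace_period()
    final_states = Run.final_states()

    from(r in Run,
      where: is_nil(r.finished_at),
      where: r.state not in ^final_states,
      # lost if the run's own allowed window (plus grace period) has already elapsed
      where:
        fragment(
          "? + ((?->>'max_duration_seconds')::int + ?) * interval '1 second' < ?",
          r.claimed_at,
          r.options,
          ^grace_period,
          ^now
        )
    )
  end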
@taylordowns2000 taylordowns2000 added the bug Newly identified bug label May 13, 2024
@taylordowns2000 (Member, Author)

@stuartc, interesting one here. We do not store the options on the runs table. Instead, they're generated on the fly by the worker when it claims a run. This has a couple of interesting impacts:

  1. Nice: If I have 100 runs in the queue and they're failing because of timeouts, I can change the limits (new ENV if it's my deployment, upgrade plan if on a hosted deployment, etc.) and then they'll start to succeed.
  2. Naughty: If I mark disable_console_log: true as one of the options for my workflow (note this still hasn't been ported from v1, but will be coming soon) and then execute a run that gets stuck in the queue, someone else might come along and enable console.log statements. Even though it was disabled when I created the run, the worker would come along and discover (at claim time) that it's totally allowed to use console.log.
  3. A little frustrating, but neither inherently good nor bad: If there are a bunch of unfinished runs and I want to see which are actually lost, there's no way to check the runs table to distinguish runs that are genuinely lost from runs that simply have extended durations compared to the instance default. @elias-ba points out that we could rewrite this Runs.Query.lost/1 as Runs.Query.lost/2 and query per project, or at least per set of projects on the same plan (i.e., a set of projects with identical run timeout limits), and that's probably the best near-term fix (rough sketch below).
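
For reference, a minimal sketch of that lost/2 idea: scope the existing query to a set of projects that share the same timeout and pass that timeout in. The argument shape and the join path from runs to projects (via work order and workflow) are assumptions, not a confirmed API:

  @spec lost(DateTime.t(), %{
          project_ids: [Ecto.UUID.t()],
          max_run_duration_seconds: integer()
        }) :: Ecto.Queryable.t()
  def lost(%DateTime{} = now, %{
        project_ids: project_ids,
        max_run_duration_seconds: max_run_duration_seconds
      }) do
    grace_period = Lightning.Config.grace_period()

    oldest_valid_claim =
      now
      |> DateTime.add(-max_run_duration_seconds, :second)
      |> DateTime.add(-grace_period, :second)

    final_states = Run.final_states()

    from(r in Run,
      # assumed join path: run -> work_order -> workflow -> project
      join: wo in assoc(r, :work_order),
      join: wf in assoc(wo, :workflow),
      where: wf.project_id in ^project_ids,
      where: is_nil(r.finished_at),
      where: r.state not in ^final_states,
      where: r.claimed_at < ^oldest_valid_claim
    )
  end

The Janitor would then call this once per timeout group rather than once overall.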

Thoughts on this?

@taylordowns2000 (Member, Author)

Moving to "in review" to get early feedback from @elias-ba or @stuartc (both have worked on this recently) before continuing. Guys, please see #2085.
