
Idea / feature request - deferring execution of batch jobs while still writing them to the database #1319

Open
pgvsalamander opened this issue Apr 9, 2024 · 3 comments

Comments

@pgvsalamander
Contributor

I think this would be useful for queueing huge volumes of jobs. Combined with concurrency limits, it would allow the process queueing the jobs to fail and restart without duplicate jobs being enqueued. The same mechanism would also let multiple processes add jobs to the same batch while avoiding duplicates (for some hypothetical case where you have multiple datasets containing duplicates). Our specific use case is queueing the generation/delivery of large email blasts, where we don't want to send duplicate emails to the same user if the job-spawner process dies for some reason.

For implementation, I think this would be accomplished by adding a boolean `deferred` column to either the jobs or the batches table. Adding it to the jobs table would require a ton of record updates when marking a batch to run (something like `UPDATE good_jobs SET deferred = false WHERE batch_id = 'xyz'`), but it avoids complicating the jobs-to-run query with a join or subquery. Adding it to the batches table instead would make the updates much cheaper, but would require the aforementioned join/subquery. I think adding the column to the jobs table is likely the better solution here.
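To make that trade-off concrete, here's a toy in-memory sketch of the jobs-table option (plain Ruby hashes standing in for `good_jobs` rows; the `deferred` column is the proposal above, not an existing GoodJob column):

```ruby
# Toy model: each hash stands in for a good_jobs row with a proposed
# `deferred` boolean column.
jobs = [
  { id: 1, batch_id: "xyz", deferred: true },
  { id: 2, batch_id: "xyz", deferred: true },
  { id: 3, batch_id: nil,   deferred: false },
]

# The jobs-to-run query stays a simple column filter (no join/subquery):
runnable = jobs.reject { |j| j[:deferred] }
puts runnable.map { |j| j[:id] }.inspect # [3]

# ...but releasing the batch touches every row in it, analogous to:
# UPDATE good_jobs SET deferred = false WHERE batch_id = 'xyz'
jobs.each { |j| j[:deferred] = false if j[:batch_id] == "xyz" }

runnable = jobs.reject { |j| j[:deferred] }
puts runnable.map { |j| j[:id] }.inspect # [1, 2, 3]
```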

Is this something you'd be interested in adding to the project, and/or do you have any thoughts/recommendations/requests? This seems straightforward enough that I'll likely take a crack at implementing it myself.

This is potentially related to #919; the pause state could be implemented by flipping `deferred` back to true for all pending jobs.

@bensheldon
Owner

Thanks for opening this issue. I think I like the general idea of it.

Here's an idea that I think could accomplish this without hugely changing things: what if "pausing" a job was accomplished by setting the scheduled_at column to NULL?

It was waaaaay back at GoodJob v0.7.0 when GoodJob started always assigning a scheduled_at regardless of whether a job was intended to run immediately or at a future scheduled time.

So the implication would be that a job that is not "scheduled" is not intended to run. I think the only wrinkle is what happens when the job is undeferred/unpaused: we'd want to do something like:

```ruby
job.scheduled_at = serialized_params["scheduled_at"] && serialized_params["scheduled_at"] > Time.current ? serialized_params["scheduled_at"] : Time.current
```

...which is a little gnarly but probably ok.
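Extracted into a self-contained sketch (plain Ruby with a `Struct` standing in for the job record, and `Time.now` in place of Rails' `Time.current`; in the real serialized ActiveJob payload `scheduled_at` may be an ISO8601 string rather than a `Time`):

```ruby
# Stand-in for a GoodJob execution record; serialized_params mirrors the
# serialized ActiveJob payload, where "scheduled_at" is the originally
# requested run time (may be nil for run-immediately jobs).
Job = Struct.new(:scheduled_at, :serialized_params)

# Undefer/unpause: restore the original run time if it is still in the
# future, otherwise make the job runnable immediately.
def undefer(job, now: Time.now)
  original = job.serialized_params["scheduled_at"]
  job.scheduled_at = (original && original > now) ? original : now
  job
end

t = Time.now
# A job originally scheduled for the future keeps its schedule:
undefer(Job.new(nil, { "scheduled_at" => t + 3600 }), now: t)
# A job with no (or a past) schedule becomes runnable now:
undefer(Job.new(nil, {}), now: t)
```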

What do you think?

@pgvsalamander
Contributor Author

I'd translate that logic into a SQL update we could run within the database (unless we need to load all the records in Rails for the notifiers to work?), but other than that it sounds great.
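One possible shape for that update (a hypothetical sketch, not tested against GoodJob's actual schema; it assumes a `deferred` flag and that `serialized_params` is a `jsonb` column whose `scheduled_at` key holds the originally requested run time):

```sql
-- Undefer all jobs in a batch in one statement, keeping any
-- still-future original schedule, else running immediately.
UPDATE good_jobs
SET deferred = FALSE,
    scheduled_at = GREATEST(
      COALESCE((serialized_params ->> 'scheduled_at')::timestamptz, NOW()),
      NOW()
    )
WHERE batch_id = 'xyz'
  AND deferred = TRUE;
```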

@bensheldon
Owner

Thinking about this some more: you can add jobs to a batch, load the batch elsewhere, and the final callbacks won't be triggered until the batch is enqueued. E.g.:

```ruby
batch = GoodJob::Batch.new
batch.add { MyJob.perform_later }

same_batch = GoodJob::Batch.find(batch.id)
same_batch.add { OtherJob.perform_later }

same_batch.enqueue # <= now the finish callback is enabled
```

That might not totally work for your needs, but it hadn't occurred to me before that "enqueued" is a batch status that's somewhat similar.
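A toy plain-Ruby model of that behavior (not GoodJob's implementation; it just illustrates that jobs can accumulate from several places and the finish callback only becomes eligible once `enqueue` is called):

```ruby
# Toy stand-in for GoodJob::Batch: jobs can be added from multiple places,
# but the finish callback only fires once the batch has been enqueued AND
# every job in it has finished.
class ToyBatch
  attr_reader :jobs

  def initialize(&finish_callback)
    @jobs = []
    @enqueued = false
    @finish_callback = finish_callback
  end

  def add(job)
    @jobs << job
    self
  end

  def enqueue
    @enqueued = true
    maybe_finish
  end

  def complete_all!
    @jobs.map! { |j| j.merge(finished: true) }
    maybe_finish
  end

  private

  def maybe_finish
    @finish_callback.call if @enqueued && @jobs.all? { |j| j[:finished] }
  end
end

finished = false
batch = ToyBatch.new { finished = true }
batch.add(name: "MyJob").add(name: "OtherJob")
batch.complete_all!   # all jobs done, but the batch isn't enqueued yet
puts finished         # false
batch.enqueue         # now the finish callback can fire
puts finished         # true
```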
