Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Persistence] If persistence write fails contents of batch to be persisted will be lost #115

Open
MendelMonteiro opened this issue Apr 17, 2023 · 1 comment
Assignees
Labels

Comments

@MendelMonteiro
Copy link
Collaborator

In the following code when an exception is thrown the batch will still be cleared and the acks/msgs that were present will never be persisted.

I think we have two options, either leave the messages in the batch so that they are processed after the next delay or re-enqueue the entries in the batch so that they will be re-processed.

Any thoughts @ocoanet , @ltrzesniewski ?

        private void PersistAndClearBatch(List<MatcherEntry> batch)
        {
            try
            {
                PersistBatch(batch);
            }
            catch (Exception ex)
            {
                _logger.LogError(ex, "Unexpected error happened");
                _bus.Publish(new CustomProcessingFailed(GetType().FullName!, ex.ToString(), SystemDateTime.UtcNow));
            }
            finally
            {
                batch.Clear();
            }
        }
@MendelMonteiro MendelMonteiro self-assigned this Apr 23, 2023
@ocoanet
Copy link
Member

ocoanet commented Apr 26, 2023

Well, I suspect that most of the persistence errors are transient problems related to temporary server issues (high load, GC, etc.). Also the persistence CQL statements are all idempotent because they explicitly set the timestamp to a fixed value.

So retrying failed entries seems to be both safe and rational. However, we need to make sure that invalid entries will not be retried forever.

I do not have a strong opinion on the retry strategy.

  • Retrying right away instead of requeueing might be simpler. It would also make it easier to add a limit on the retry count.
  • Requeueing could be a better solution if we wanted to support long duration server unavailability (retrying regularly with a small delay between retries).

I think both strategies would be an improvement over the current behavior. I would probably go for the first one as a first step.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants