Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reject submissions with error if they cannot be written to MongoDB #837

Open
jnm opened this issue Jul 20, 2022 · 0 comments · Fixed by #840
Open

Reject submissions with error if they cannot be written to MongoDB #837

jnm opened this issue Jul 20, 2022 · 0 comments · Fixed by #840
Assignees
Labels

Comments

@jnm
Copy link
Member

jnm commented Jul 20, 2022

MongoDB (and Pymongo) now enable retryable writes by default. From https://github.com/mongodb/specifications/blob/master/source/retryable-writes/retryable-writes.rst#why-are-write-operations-only-retried-once:

The spec concerns itself with retrying write operations that encounter a retryable error (i.e. no response due to network error or a response indicating that the node is no longer a primary). A retryable error may be classified as either a transient error (e.g. dropped connection, replica set failover) or persistent outage. In the case of a transient error, the driver will mark the server as "unknown" per the SDAM spec. A subsequent retry attempt will allow the driver to rediscover the primary within the designated server selection timeout period (30 seconds by default). If server selection times out during this retry attempt, we can reasonably assume that there is a persistent outage. In the case of a persistent outage, multiple retry attempts are fruitless and would waste time. See How To Write Resilient MongoDB Applications for additional discussion on this strategy.

Given that, let's not worry about adding an additional level of retries within synchronous request processing. Instead, we should return a failure response if a Pymongo reports that a submission has not been written successfully to MongoDB.

I believe that the philosophy behind the old design, which returns success once the submission is written to PostgreSQL—regardless of what happens with MongoDB—was to maximize data safety in the face of hostile field environments. A submission that stores successfully in Postgres but not in Mongo might never reappear to the server: the phone holding it could be run over by a truck before there's an opportunity to upload again. However, the risk of data being silently hidden from essentially all exports and views until someone runs ./manage.py sync_mongo --remongo likely outweighs the risk of the original submission being destroyed before it can be uploaded again after a transient MongoDB error.

Old description

This is necessary to support writing when replica sets are used (see #830).

AutoReconnect is:

Raised when a connection to the database is lost and an attempt to auto-reconnect will be made.

In order to auto-reconnect you must handle this exception, recognizing that the operation which caused it has not necessarily succeeded. Future operations will attempt to open a new connection to the database (and will continue to raise this exception until the first successful connection is made).

(NotPrimaryError is a subclass of AutoReconnect: https://pymongo.readthedocs.io/en/stable/api/pymongo/errors.html#pymongo.errors.NotPrimaryError.)

It's apparently the responsibility of our application code to retry queries whenever Pymongo raises these errors. Currently, we don't do that:

  1. Pymongo loses its connection or finds itself connected to a server that doesn't accept writes;
  2. A submission comes in, and KoBoCAT tries to write via Pymongo;
  3. Pymongo raises exception;
  4. KoBoCAT gives up on Mongo but returns success because the submission has been stored in PostgreSQL.

We need to change (4) so that KoBoCAT retries the Pymongo operation. I don't know what the best practice is, though. How many times should we retry? Should we delay between retries, and how much can we afford to delay given we are responding a synchronous HTTP request? Should we simply fail the request, rolling back the Postgres transaction and returning a failure code to the client?

@jnm jnm added the backend label Jul 20, 2022
@jnm jnm changed the title Tolerate AutoReconnect and subclasses from Pymongo Reject submissions with error if they cannot be written to MongoDB Jul 20, 2022
@bufke bufke self-assigned this Aug 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants