MongoDB (and Pymongo) now enable retryable writes by default. From https://github.com/mongodb/specifications/blob/master/source/retryable-writes/retryable-writes.rst#why-are-write-operations-only-retried-once:

> The spec concerns itself with retrying write operations that encounter a retryable error (i.e. no response due to network error or a response indicating that the node is no longer a primary). A retryable error may be classified as either a transient error (e.g. dropped connection, replica set failover) or persistent outage. In the case of a transient error, the driver will mark the server as "unknown" per the SDAM spec. A subsequent retry attempt will allow the driver to rediscover the primary within the designated server selection timeout period (30 seconds by default). If server selection times out during this retry attempt, we can reasonably assume that there is a persistent outage. In the case of a persistent outage, multiple retry attempts are fruitless and would waste time. See How To Write Resilient MongoDB Applications for additional discussion on this strategy.
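The "retry at most once" behavior the spec describes can be sketched in plain Python. `TransientWriteError` here is a hypothetical stand-in for a retryable Pymongo error such as `AutoReconnect` (no live MongoDB connection is assumed):

```python
class TransientWriteError(Exception):
    """Stand-in for a retryable error (e.g. pymongo.errors.AutoReconnect)."""


def write_with_single_retry(operation):
    """Attempt a write; on a retryable error, retry exactly once.

    The single retry gives server selection a chance to rediscover the
    primary after a transient failover. A second failure suggests a
    persistent outage, so the error is re-raised rather than retried
    further, mirroring the spec quoted above.
    """
    try:
        return operation()
    except TransientWriteError:
        return operation()


# Simulate a transient failover: the first attempt fails, the retry succeeds.
attempts = []

def flaky_insert():
    attempts.append(1)
    if len(attempts) == 1:
        raise TransientWriteError("not primary")
    return "inserted"

print(write_with_single_retry(flaky_insert))  # → inserted
```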
Given that, let's not worry about adding an additional level of retries within synchronous request processing. Instead, we should return a failure response if Pymongo reports that a submission has not been written successfully to MongoDB.
I believe that the philosophy behind the old design, which returns success once the submission is written to PostgreSQL—regardless of what happens with MongoDB—was to maximize data safety in the face of hostile field environments. A submission that stores successfully in Postgres but not in Mongo might never reappear to the server: the phone holding it could be run over by a truck before there's an opportunity to upload again. However, the risk of data being silently hidden from essentially all exports and views until someone runs `./manage.py sync_mongo --remongo` likely outweighs the risk of the original submission being destroyed before it can be uploaded again after a transient MongoDB error.
Old description
This is necessary to support writing when replica sets are used (see #830).
From the Pymongo documentation for `AutoReconnect`:

> Raised when a connection to the database is lost and an attempt to auto-reconnect will be made.
>
> In order to auto-reconnect you must handle this exception, recognizing that the operation which caused it has not necessarily succeeded. Future operations will attempt to open a new connection to the database (and will continue to raise this exception until the first successful connection is made).
(`NotPrimaryError` is a subclass of `AutoReconnect`: https://pymongo.readthedocs.io/en/stable/api/pymongo/errors.html#pymongo.errors.NotPrimaryError.) It's apparently the responsibility of our application code to retry queries whenever Pymongo raises these errors. Currently, we don't do that:
1. Pymongo loses its connection or finds itself connected to a server that doesn't accept writes;
2. A submission comes in, and KoBoCAT tries to write via Pymongo;
3. Pymongo raises an exception;
4. KoBoCAT gives up on Mongo but returns success because the submission has been stored in PostgreSQL.
We need to change (4) so that KoBoCAT retries the Pymongo operation. I don't know what the best practice is, though. How many times should we retry? Should we delay between retries, and how much can we afford to delay given that we are responding to a synchronous HTTP request? Should we simply fail the request, rolling back the Postgres transaction and returning a failure code to the client?
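One possible shape for that last option—failing the request and rolling back the Postgres write—sketched with toy stand-ins (no Django or Pymongo is assumed here; all names are hypothetical, not KoBoCAT's actual code):

```python
class FakeTransaction:
    """Toy stand-in for a database transaction (e.g. Django's
    transaction.atomic): commits on clean exit, rolls back on exception."""

    def __init__(self):
        self.committed = False
        self.rolled_back = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self.committed = True
        else:
            self.rolled_back = True
        return False  # re-raise so the caller sees the failure


class MongoWriteFailed(Exception):
    """Stand-in for a Pymongo error after the driver's retry is spent."""


def handle_submission(save_to_postgres, save_to_mongo):
    """Return an HTTP-style status: 201 only if BOTH stores succeed."""
    txn = FakeTransaction()
    try:
        with txn:
            save_to_postgres()
            save_to_mongo()  # raising here rolls back the Postgres write
        return 201, txn
    except MongoWriteFailed:
        return 503, txn


def bad_mongo():
    raise MongoWriteFailed("no primary available")


status, txn = handle_submission(lambda: None, bad_mongo)
print(status, txn.rolled_back)  # → 503 True
```

The point of the sketch is only that the Mongo failure propagates out of the transaction block, so the client is told to resubmit instead of the data silently diverging between the two stores.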
jnm changed the title from *Tolerate AutoReconnect and subclasses from Pymongo* to *Reject submissions with error if they cannot be written to MongoDB* on Jul 20, 2022.