
Ensure robust database connectivity #694

Open
BenjaminPelletier opened this issue Feb 1, 2022 · 1 comment
Labels: dss (Relating to one of the DSS implementations) · feature (Issue would improve software) · P2 (Normal priority)

Comments

@BenjaminPelletier
Member

Historically, there may have been a problem in the DSS where a DSS instance would fail at least the first operation due to database connectivity problems, after a long period of idleness and/or after its CRDB server restarted while core-service (then grpc-backend) was already running. The exact circumstances and triggers of this possible problem are not well understood, and it may have been due to factors unrelated to the DSS design.

Around the time of this problem, a feature was added to core-service (then grpc-backend) that would interact with (ping) the database every minute to attempt to avoid the issue: if the ping failed, the core-service instance would panic and kill itself so that the Kubernetes orchestrator would restart it (turn it off and back on again) to reestablish the database connection. More recently, we switched the periodic check from pinging the database to inspecting connection statistics (#679), but it turned out (#691) that the connection count could drop to zero without actually indicating a bad database connection. So, we changed the panic to a warning message (#692), which means that the DSS no longer restarts itself when its database connection is bad.
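For reference, a minimal sketch of the kind of periodic ping-and-panic liveness check described above (function names, package name, and timeouts are illustrative assumptions, not the actual core-service code):

```go
package dbhealth

import (
	"context"
	"database/sql"
	"log"
	"time"
)

// WatchDatabase pings db once per interval and panics on failure so that an
// orchestrator such as Kubernetes restarts the process, reestablishing the
// database connection ("turn it off and back on again").
func WatchDatabase(ctx context.Context, db *sql.DB, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			pingCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
			err := db.PingContext(pingCtx)
			cancel()
			if err != nil {
				// Originally a panic; #692 softened this to a warning.
				log.Panicf("database ping failed: %v", err)
			}
		}
	}
}
```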

In the long term, this resilience should be built into the database client. If an initial attempt to interact with the database fails due to a now-invalid connection, the database client should transparently attempt to repair the connection and then retry the operation before returning to the caller. Importantly, we should also consider how system maintenance is to be performed. If CRDB nodes can be offline for short periods of time (when, e.g., upgrading to new versions) while their corresponding core-service instances are still online, then core-service should be able to fail over to other CRDB nodes when attempting to fulfill requests.
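A minimal sketch of the transparent repair-and-retry behavior suggested above, assuming a *sql.DB handle; all names here are hypothetical and the connection-error check is a placeholder, not existing DSS code:

```go
package dbretry

import (
	"context"
	"database/sql"
	"errors"
	"time"
)

// WithReconnect runs op and, if it fails in a way that suggests a broken
// connection, pings the database (forcing the pool to discard dead
// connections) and retries the operation once before returning the error
// to the caller.
func WithReconnect(ctx context.Context, db *sql.DB, op func(context.Context) error) error {
	err := op(ctx)
	if err == nil || !isConnectionError(err) {
		return err
	}
	// Attempt to repair the connection transparently.
	pingCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	if pingErr := db.PingContext(pingCtx); pingErr != nil {
		return errors.Join(err, pingErr)
	}
	// Retry the operation once now that the pool has a healthy connection.
	return op(ctx)
}

// isConnectionError is a placeholder; a real implementation would inspect
// driver-specific error codes (e.g. connection-refused or broken-pipe errors
// reported by the CockroachDB driver).
func isConnectionError(err error) bool {
	return errors.Is(err, sql.ErrConnDone)
}
```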

@BenjaminPelletier added the P2 (Normal priority) and feature (Issue would improve software) labels on Feb 1, 2022
@BenjaminPelletier
Member Author

#752 illustrates the use of haproxy to recover from the loss of a CRDB node.

@BenjaminPelletier added the dss (Relating to one of the DSS implementations) label on Sep 21, 2022