Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed signature support to handle Delta-persistent objects #271

Merged
merged 15 commits into from Apr 27, 2024

Conversation

etremel
Copy link
Contributor

@etremel etremel commented Mar 13, 2024

My original implementation of signed persistent logs did not properly handle Delta-supporting Persistent fields in a few edge cases. Specifically, if a single Derecho Replicated object contains both a non-signed Persistent field and a signed Persistent field that has DeltaSupport enabled, it's possible for PersistentRegistry::sign() to produce the warning "Logic error: Version X was returned by getMinimumVersionAfter(), but it did not exist in any field" and then a subsequent handle_verify_request to produce the warning "Verification of version X failed!" because it attempts to verify an empty signature.

This happens when makeVersion is called on the Replicated object when only the non-signed Persistent field was updated, which means the non-signed field creates a log entry but the Delta-supporting signed field just advances its current version without creating a log entry (since there is no Delta data). Then in PersistentRegistry::sign(), getMinimumVersionAfter() will return the new version that is in the non-signed log as cur_nonempty_version, but updateSignature won't sign any data for this version because it doesn't exist in the signed field's log. Making things even worse, getMinimumLatestVersion() can't be used to properly detect the "current version" in this situation, because the getLatestVersion() method only returns the latest version that is in a log entry, not the current version recorded in the meta-header file (which is what's updated when a Delta-supporting object advances its current version).

I fixed this with several changes:

  • Added a new getNextSignedVersion() method in PersistentRegistry that returns the next version that exists in some signed field, skipping all the non-signed fields
  • Used getNextSignedVersion() instead of getNextVersionAfter() to increment the loop in PersistentRegistry::sign()
  • Added a new signed_num column to the DerechoSST to record the latest version that has been signed by a replica, since this may not be equal to the latest version persisted by a replica (if that latest version exists only in non-signed fields)
  • Changed PersistenceManager's handle_verify_request to use the new signed_num field when validating signatures from other replicas, and to not attempt to validate a signature that doesn't (yet) exist in the local replica's log
  • Changed handle_verify_request to verify a signature from another replica by simply doing a byte-level equality check against the local replica's signature on the same version, instead of doing a public-key verification on the signature. This is not quite related to fixing the bug but will improve performance. It's not necessary to do a public-key verification if the local replica is using the same private key to sign the same updates and thus should have exactly the same signature on the log entry.
  • Separated Replicated's persist() method into two separate persist() and sign() methods, and moved the while loop in the former persist() method up to PersistenceManager's handle_sign_request method. This is necessary to allow PersistenceManager to learn which version was signed, separately from which version was persisted, and PersistenceManager was the only caller of the Replicated::persist() method so nothing else should be impacted by this API change.

The "logic error in PersistentRegistry" warning gets printed in Cascade
whenever an RPC update changes only a non-signed field, and not the
signed field. This seems to indicate that the sign() method is being
called on a version that does not exist in the signed field's log, and
is trying to add a signature to this nonexistent version. I'm trying to
reproduce this bug in Derecho's signed_log_test so I can fix it here.
It turns out the error only shows up when the signed persistent field is
also a DeltaSupport field, meaning it's possible for it to create a
version with no data (because the delta is empty). Regular non-Delta
signed fields always create a version with their current state even when
a non-signed field originates the makeVersion call, so mixing regular
signed and non-signed fields doesn't create a log with gaps.
As identified by the new signed_log_test, Delta-supporting signed
fields create a problem for the sign() method because they can end up
saving no data to the log when makeVersion is called. In particular, if
makeVersion is called when only non-signed fields were updated, the
non-signed fields will get a log entry at that version, but the
Delta-suporting signed field will just advance its current version
without creating a log entry (since there is no Delta data). Then when
PersistentRegistry::sign() is called, getMinimumVersionAfter() will
return the new version that is in the non-signed log as
cur_nonempty_version, but updateSignature won't sign any data for this
version because it doesn't exist in the signed field's log. This results
in the "logic error" message because no data was signed for
cur_nonempty_version.

The most direct solution is to simply write a new version of
getMinimumVersionAfter() that's just for sign(), called
getNextSignedVersion(), which only calls getNextVersionOf() on fields
that have signatures enabled. The getNextVersionOf() method already
correctly skips ahead to the next version with nonempty log data for
signed fields, and if no non-signed fields get to return their versions,
the overall minimum must be a version that actually has data in the log
of a signed field.

Fixing the same problem in handle_verify_request also involved fixing
a longstanding performance flaw. Instead of doing a public-key
verification on the signature posted to the SST by another replica, we
should just compare that signature to the signature in the local
replica's log on the same version - they should be bytewise identical
since both replicas have the same signing key. With this change made, we
should also ensure that we only do the equality check if both replicas
actually have a signature for the requested version; a version that is
in the persistent log but not in any signed fields doesn't need to be
verified.

Finally, this commit removes noexcept(true) from the PersistLog
constructor because the from_pem_private method (called to determine the
signature length) can throw an exception if the PEM file is not found.
Now that the signed_num can be different from the persisted_num
(smaller, if the current persisted_num has no signed data), it no longer
makes sense to use the single return value from Replicated::persist as
the new version that is "signed and persisted" - a single persist
request may result in a different persisted_num and signed_num.
PersistenceManager was the only part of the code calling the
Replicated::persist method, so I refactored it into separate "persist"
and "sign" methods that PersistenceManager calls in a loop, replacing
the sign-and-persist loop that used to be within Replicated::persist.
This allows PersistenceManger to learn both the latest persisted version
and the latest signed version, and update persisted_num and signed_num
separately.

While doing this, I re-arranged replicated.hpp and added some comments
to more clearly document that some of the "public" methods are in fact
only used by internal Derecho components and should not be relied on as
a stable API.

Also updated signed_log_test to correctly detect the end of the test,
now that the last persisted version might not equal the last signed
version - if the last update to MixedFieldObject is to the unsigned
field, the version resulting from that update will never be signed.
I noticed that a few internal files are still including other Derecho
headers with the "system" (angle-brackets) syntax instead of the "local"
(quotes) syntax. Library files should always include other library files
with the "local" syntax. Putting this in a separate commit since it's
unrelated to my other work on sign() and persist().
Following a discussion with Weijia where we confirmed that the
signatures system was the only part of Derecho that used
getMinimumLatestVersion, I decided to change this method to return the
current "in-memory" version of the Persistent fields as reported by the
currMetaHeader version, rather than the latest version of the Persistent
fields that corresponds to a log entry. I renamed the method to
getCurrentVersion to reflect this new meaning, and added a
getCurrentVersion method to the PersistLog API that would get the
meta-header version. The original getLatestVersion still exists and has
the same meaning, since the latest version in a log entry is still a
piece of information someone might want to know.
Since the sign() method, like the persist() method, can end up signing a
later version than the one requested (due to batching), it's possible
for a later persistence request to call sign() only to find that an
earlier persistence request already signed the latest version - in other
words, there's nothing for PersistentRegistry::sign() to do because the
latest version is already signed. In this case,
PersistentRegistry::sign() would return lastSignedVersion without
placing any data into the signature_buffer, because the main while loop
would do nothing. This would cause PersistenceManager to get an empty
buffer as the "latest signature" and then place that empty buffer in the
SST's signature field.

I added a special case to PersistentRegistry::sign() to make it return
the cached latest signature if the method is a no-op. I also changed
handle_persist_request to check if signed_version has actually advanced
before updating the SST, to avoid redundantly re-copying the latest
version's signature to the SST if it was already placed there by a
previous persistence request.

Also changed handle_verify_request to check the local node's version of
signed_num in the SST to determine if there's a local signature to check
against when verifying another node's signature. This is more reliable
than trying to retrieve the local signature and testing for failure,
because get_signature can "succeed" but return a signature of all zeroes
if the desired log entry has been created but not yet signed.
Now that the signatures are verifying correctly, I'll stop dumping the
entire hex value of each signature into the log. Even at the trace level
that's a bit much for anything other than finding this particular bug.
This seems to work and produces no warnings, which means we no longer
need the "version" parameter from PersistenceManager as even a "hint" to
the Replicated::sign() method - it can just use getCurrentVersion().
The "version" argument to the various persist() methods actually didn't
do anything because FilePersistLog's persist() ignored the argument and
always persisted the latest version. After discussion with Weijia, I
rewrote FilePersistLog::persist() to actually use the argument to limit
the range of data it persists, but also to make the argument optional
and revert to the old behavior if it is not provided. This means callers
that want the batching behavior don't have to guess at the correct
"latest" version to provide as an argument (they can just provide
std::nullopt), and callers that want only a certain version persisted
will not end up persisting a later version than they wanted.

Most of the "code changes" in FilePersistLog::persist() come from me
manually inlining the NEXT_LOG_ENTRY, NEXT_LOG_ENTRY_PERS, and NEXT_DATA
macros so that I could figure out what they were doing. The substantial
change is to replace all references to m_currMetaHeader in those macros
with shadow_header, which is a modified copy of m_currMetaHeader. If the
latest_version argument is provided, shadow_header is modified to
"point" to an earlier log entry, namely the one corresponding to
latest_version. If latest_version is not provided, shadow_header equals
m_currMetaHeader and the function should behave exactly as before.
Either way, shadow_header becomes the next m_persMetaHeader after the
data is flushed, so the m_persMetaHeader accurately reflects which log
entries were persisted.
Interleaving persistence and verification requests in a single queue
processed by a single thread makes it harder for the thread to skip
ahead and discard obsolete persistence requests when the persisted
number advances due to batching. There's also no good reason for these
requests to be handled by the same thread, since they are independent of
one another (and verification has very little to do with perisistence).
The handle_verify_request method should also return early if the
requested version has already been verified, like
handle_persist_request.
When I refactored persist() to have an optional version parameter
(instead of a required one that was ignored), I changed the instances
where it was called with "the latest version" to call it with nullopt
instead, but I missed this one.
Technically these includes are redundant since the bottom of the header
file includes the _impl file, but this makes VSCode's parser happy and
we do the same thing in group_impl.hpp.
@songweijia songweijia merged commit 4a4c41b into master Apr 27, 2024
1 check passed
@etremel etremel deleted the signed_log_bug_fix branch May 2, 2024 16:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants