-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rfc: Banning peers based on ResponseCheckTx #9675
base: main
Are you sure you want to change the base?
Conversation
* ABCI++ RFC This commit adds an RFC for ABCI++, which is a collection of three new phases of communication between the consensus engine and the application. Co-authored-by: Sunny Aggarwal <sunnya97@protonmail.ch> * Fix bugs pointed out by @liamsi * Update rfc/004-abci++.md Co-authored-by: Federico Kunze <31522760+fedekunze@users.noreply.github.com> * Fix markdown lints * Update rfc/004-abci++.md Co-authored-by: Ismail Khoffi <Ismail.Khoffi@gmail.com> * Update rfc/004-abci++.md Co-authored-by: Tess Rinearson <tess.rinearson@gmail.com> * Update rfc/004-abci++.md Co-authored-by: Tess Rinearson <tess.rinearson@gmail.com> * Add information about the rename in the context section * Bold RFC * Add example for self-authenticating vote data * More exposition of the term IPC * Update pros / negatives * Fix sentence fragment * Add desc for no-ops Co-authored-by: Sunny Aggarwal <sunnya97@protonmail.ch> Co-authored-by: Federico Kunze <31522760+fedekunze@users.noreply.github.com> Co-authored-by: Ismail Khoffi <Ismail.Khoffi@gmail.com> Co-authored-by: Tess Rinearson <tess.rinearson@gmail.com>
* add rpc spec and support outline * add json * add more routes remove unneeded ones * add rest of rpc endpoints * add jsonrpc calls * add more jsonrpc calls * fix blockchain * cleanup unused links and add links to repos * Update spec/rpc/README.md Co-authored-by: Callum Waters <cmwaters19@gmail.com> * add missing param from consensus param * Update spec/rpc/README.md Co-authored-by: Callum Waters <cmwaters19@gmail.com> * Update spec/rpc/README.md Co-authored-by: Callum Waters <cmwaters19@gmail.com> * fix cast and add doc to readme Co-authored-by: Callum Waters <cmwaters19@gmail.com> Co-authored-by: Marko Baricevic <markobaricevic@Fergalicious.local>
* Avoid quantifier alternation cycle The problematic quantifier alternation cycle arose because the definition of accountability_violation was unfolded. This commit also restructures the induction proof for clarity. * add count_lines.sh * fix typo and add forgotten complete=fo in comment Co-authored-by: Giuliano <giuliano@eic-61-11.galois.com>
Bumps [actions/stale](https://github.com/actions/stale) from 3 to 3.0.18. - [Release notes](https://github.com/actions/stale/releases) - [Commits](actions/stale@v3...v3.0.18) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [actions/stale](https://github.com/actions/stale) from 3.0.18 to 3.0.19. - [Release notes](https://github.com/actions/stale/releases) - [Commits](actions/stale@v3.0.18...v3.0.19) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* add a changelog to track changes * Update CHANGELOG.md Co-authored-by: Callum Waters <cmwaters19@gmail.com> Co-authored-by: Callum Waters <cmwaters19@gmail.com>
* clarify timestamps * changelog entry * Update spec/rpc/README.md Co-authored-by: Callum Waters <cmwaters19@gmail.com> Co-authored-by: Callum Waters <cmwaters19@gmail.com>
* rpc: add chunked genesis endpoint * fix lint * feedback * add info about error * fix lint Co-authored-by: marbar3778 <marbar3778@yahoo.com>
* add parameters to byzantine send action * make net not trusted it's not necessary since for proofs Ivy will assume that the environment does not break action preconditions * use require instead of assume it seems that assume is not checked when other isolates call! * add comment * add comment * run with random seed * make domain model extractable to C++ * substitute require for assume assumes in an action are not checked when the action is called! I.e. they place no requirement on the caller; we're just assuming that the caller is going to do the right thing. This wasn't very important here but it leade to a minor inconsistency slipping through. * make the net isolate not trusted there was no need for it * add tendermint_test.ivy contains a simple test scenario that show that the specification is no vacuuous * update comment * add comments * throw if trying to parse nset value in the repl * add comment * minor refactoring
Bumps [gaurav-nelson/github-action-markdown-link-check](https://github.com/gaurav-nelson/github-action-markdown-link-check) from 1.0.12 to 1.0.13. - [Release notes](https://github.com/gaurav-nelson/github-action-markdown-link-check/releases) - [Commits](gaurav-nelson/github-action-markdown-link-check@1.0.12...1.0.13) --- updated-dependencies: - dependency-name: gaurav-nelson/github-action-markdown-link-check dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [actions/stale](https://github.com/actions/stale) from 3.0.19 to 4. - [Release notes](https://github.com/actions/stale/releases) - [Changelog](https://github.com/actions/stale/blob/main/CHANGELOG.md) - [Commits](actions/stale@v3.0.19...v4) --- updated-dependencies: - dependency-name: actions/stale dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
mem.peerFailureMap[peerID] = { time.Now(), numFailures + 1} | ||
} | ||
``` | ||
If transactions with `ResponseCheckTx.code > 1` are not deleted from the cache and we want to ban them on the first offence, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand it, transactions with ResponseCheckTx.code > 1
should never be removed from the cache
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is something we need to agree on. If we do not expect to have many nodes sending us such transactions, we might save space by just removing it and not storing the peer info, because this peer is immediately banned and this should be a one-off occurence. That is also why the condition to remove the tx from the cache from your previous comments removes the transaction if this is the reason for failure.
If we do think this is better stored and never removed, then the condition from your comment should be as you suggested and not as it is currently written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think it'd be best to discuss this synchronously
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to summarize our in person discussion which will also lead to a clarification in the doc. The debate we are having is whether we want the never-valid transactions in the cache or not.
- Yes because this means we will never run checkTx on them again and we know it will never be valid, so we may want it to stay there forever.
- No because they are very rare, the peer is banned so they take up space in the cache when they don't occur that often
|
||
As this logic loops through the transactions in any case, we can leverage it to check whether we can ban peers. | ||
However, this approach has several downsides: | ||
- It is not timely enough. Recheck is executed after a block is committed, leaving room for a bad peer to send |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned above, my thought on this part is that I can't to see a valid example of an App that locally passed CheckTx
on a TX, and then fails at reCheckTx
time with "> 1"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we would not expect the transaction to fail it here after it passed it the first time. What I meant here is that we check whether the transaction was marked as invalid with a particular code when it was added, and if yes we ban all the peers that sent it. So we only leverage the fact that we anyways loop through all the transactions and do not apply the normal recheck logic (which applies to only valid transactions).
In my opinion this solution is a bad one, but I decided to list it here in case people feel differently.
|
||
### Impact on ABCI | ||
|
||
- Introduction of new response codes for CheckTx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe mention this is a non-breaking interface change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree, I believe that in a certain sense this is breaking. We are giving meaning to something that applications may have already given meaning too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I saw your comment above. I hadn't thought about that. I think you are right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for the great write-up! I would say that, in my mind, this is effectively bringing a mechanism of the PEX
reactor into the mempool. That reactor has logic to mark a peer as bad if it sends malformed network addresses. In that context, the bytes of the message may contain a valid proto but the values in that proto message are invalid. In that case, they do not represent a legal network address. I believe that this mechanism should just mimic the logic of the PEX
reactor but with the needed input of the application.
Since the contents of the Tx are meaningless from the standpoint of Tendermint, we have no idea if they are well formed and/or contain clearly bogus data so we must ask the application. If the application responds that the data is malformed or never valid for some other reason, we should ban the peer. This mimics what we do in the PEX
reactor: when we see garbage data we ban the peer.
I think that the case of banning a peer if it frequently sends transactions to our node that could have been valid at some point, either in the past or in the future, is much much more subtle and, without strong justification, we should likely avoid solving entirely. If we need something we could consider a rate-limit for such peers to prevent our resources from being completely tied up by peers that are not actively malicious but that are also being unhelpful.
of them might be obsolete at the time of the check; state changes upon block execution etc.). | ||
However, Tendermint users observed that | ||
there are transactions that can never have been or will never be valid. They thus propose | ||
to introduce a special response code for `CheckTx` to indicate this behaviour, and ban the peers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They thus propose
to introduce a special response code forCheckTx
to indicate this behaviour, and ban the peers...
I would also note that this may be breaking for old applications. We have never reserved response codes as meaningful to Tendermint, so giving semantic meaning to some response code, such as '44 = ban the peer' may break older applications that had used 44 for some internal reason. It seems unlikely that this is the case but it is a concern worth taking into consideration.
2. Peer is disconnected from and banned. | ||
3. Conditions for the peer to be banned. | ||
2. If `CheckTx` signals a peer should be banned, how do we know which peer(s) to ban? | ||
3. Are there possible attack scenarios by allowing this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since our p2p layer implements a cryptographic handshake procedure, we can be certain of the identity of the sending peer. This means a malicious peer could not force us to ban a different peer.
For the sake of simplicity, in the remainder of the text we will distinguish the failures due to a) as failures | ||
signaled with `ResponseCheckTx.code = 1` and the failures described in b), failures with `ResponseCheckTx.code > 1`. | ||
|
||
For a), a peer sends transactions that **repeatedly** fail CheckTx with `ResponseCheckTx.code = 1`, and is banned or disconnected from to avoid this. In this case we need to define what repeatedly means. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears to me that that's what's being suggested by the different response codes, either =1 or >1 @sergio-mena
A peer should not be banned the first time this happens, but rather if this happens a predefined number | ||
of times (`numFailures`) within a time interval (`lastFailure`). This time interval should be reset every `failureResetInterval`ms. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I'm not seeing it in the Background
section, is this behavior that we definitely need? I think that banning peers that send us transactions that absolutely could never have been valid is mostly what's being asked for. A mechanism for banning peers that often send currently invalid - but not never-valid -transactions likely cannot be defined within Tendermint so I'm wondering how much effort we should allocate to doing it?
**Note** Do we have mechanisms in place to handle cases when `broadcastTxCommit` submits | ||
failing transactions (can this be a form of attack)? | ||
|
||
### 3. Attack scenarios |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the difficulty here is that this would be an application-produced issue that would first be noticed by operators but that operators would be unable to reasonably address.
Is there a possibility of (accidentally) introducing a bug in CheckTx, which would cause (most of) the P2P to loose connectivity? How would that look like?
It seems to me that, of course this is the case. If the app always returns an erroneous 'never-valid' response code, then the p2p network will completely disconnect.
|
||
### 3. Attack scenarios | ||
|
||
While an attack by simply banning peers on faling `CheckTx` is hard to imagine, as the incentive for doing so is not clear, there are considerations with regards to the current mempool gossip implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, supposing some overlay network, as a node I can only influence my direct connections. I can force a peer to disconnect from me forever if, say, I wanted to lower the number of ways it has of getting messages to the rest of the network. However, I could have already done that by just disconnecting or by dropping its messages.
|
||
### Impact on ABCI | ||
|
||
- Introduction of new response codes for CheckTx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree, I believe that in a certain sense this is breaking. We are giving meaning to something that applications may have already given meaning too.
of them might be obsolete at the time of the check; state changes upon block execution etc.). | ||
However, Tendermint users observed that | ||
there are transactions that can never have been or will never be valid. They thus propose | ||
to introduce a special response code for `CheckTx` to indicate this behaviour, and ban the peers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I concur with what you're saying @williambanfield but there are definitely non-breaking options such as simply adding a boolean field like txCouldNeverHaveBeenValid
to ResponseCheckTx
1. What does banning a peer mean: | ||
1. A peer can be simply disconnected from. | ||
2. Peer is disconnected from and banned. | ||
3. Conditions for the peer to be banned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to discuss banning by node ID and IP vs banning by IP only?
We thus differentiate two scenarios - a) where `CheckTx` fails due to reasons already | ||
known and b) where `CheckTx` deems a transactions could never have been valid. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another way of phrasing this is stateless checks vs stateful checks. Because check tx can be called by validators across different heights i.e. different states, stateful checks are non deterministic and thus should never result in the ban of a peer, however stateless checks should be deterministic for every honest node irrespective of the height and thus should lead to a ban.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, CheckTx
should be stateful , in the sense that it compares the current transaction against the latest state. The application is the one that defines whether the failure is something that happened due to a difference in states (height etc.) or is the failure a reason for banning. THese two distinctions cover the case where the first scenario happens very often, while the other is immediate reason for banning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, CheckTx should be stateful
Not necessarily. The SDK only has one stateful check in the antehandler (that I know of) which is to check the nonce numbers everything else about it is stateless (It is essentially one large ValidateBasic
function which mostly ensures that the bytes match a recognised protobuf message).
It doesn't make sense to spend too much time checking against state because you don't know whether that's the state that the transactions will eventually execute against so you will need to do all those stateful checks again.
CheckTx should be seen as an initial gatekeeper for transactions but not the final gatekeeper. The final gatekeeper is PrepareProposal
and ProcessProposal
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the source for my comment btw: https://github.com/tendermint/tendermint/blob/main/spec/abci/abci%2B%2B_methods.md#checktx . We might need to rephrase this if you feel that's needed.
For now, we will disconnect and ban the peer regardless of the exact reason a transaction | ||
is considered to never be valid. | ||
|
||
#### **Banning in case of ResponseCheckTx.code = 1** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're kind of jumping ahead to the implementation here, but I would be for introducing a new field in ResponseCheckTx
that defaults to not banning the peer rather than modifying the semantics of the existing Code
field.
Co-authored-by: Sergio Mena <sergio@informal.systems>
Co-authored-by: Callum Waters <cmwaters19@gmail.com>
Currently, a peer can send the same valid (or invalid) transaction [multiple times](https://github.com/tendermint/tendermint/blob/ff0f98892f24aac11e46aeff2b6d2c0ad816701a/mempool/v0/clist_mempool.go#L247). Peers do not | ||
gossip transactions to peers that have sent them that same transaction. But there is no check on whether | ||
a node has alrady sent the same transaction to this peer before. There is also no check whether the transaction | ||
that is being gossiped was valid or not (assumint that invalid transactions could become valid). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is being gossiped was valid or not (assumint that invalid transactions could become valid). | |
that is being gossiped was valid or not (assuming that invalid transactions could become valid). |
|
||
**Signaling the ban to p2p** | ||
|
||
As it is the mempool **reactor** that has access to the p2p layer, not the actual mempool implementation, the peer banning function will most likely have to send the peer ID to a channel to inform the reactor of the fact that this peer should be banned. This would require adding a channel for communication into the `CListMempool` on construction, and adding a routine in the mempool ractor that waits on a response via this channel. However, the actual implementation of this is yet to be defined. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it is the mempool **reactor** that has access to the p2p layer, not the actual mempool implementation, the peer banning function will most likely have to send the peer ID to a channel to inform the reactor of the fact that this peer should be banned. This would require adding a channel for communication into the `CListMempool` on construction, and adding a routine in the mempool ractor that waits on a response via this channel. However, the actual implementation of this is yet to be defined. | |
As it is the mempool **reactor** that has access to the p2p layer, not the actual mempool implementation, the peer banning function will most likely have to send the peer ID to a channel to inform the reactor of the fact that this peer should be banned. This would require adding a channel for communication into the `CListMempool` on construction, and adding a routine in the mempool reactor that waits on a response via this channel. However, the actual implementation of this is yet to be defined. |
|
||
- Introduction of new response codes for CheckTx | ||
- Altering the specification to reflect this change | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also consider using bogus transactions (those which could never be valid) as evidence of misbehavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Appreciate how self-contained the doc is. Without much prior knowledge I can grok it!
[avoid overloading the p2p/mempool system](https://github.com/cosmos/ibc-go/issues/853#issuecomment-1032211020) | ||
(ToDo check with Adi what is this about, what are valid reasons this can happen and when should we disconnect) | ||
(relevant code [here](https://github.com/cosmos/ibc-go/blob/a0e59b8e7a2e1305b7b168962e20516ca8c98fad/modules/core/ante/ante.go#L23)) | ||
- IBC relayers: Nodes allow transactions with a wrong `minGas` transaction and gossip them, and other nodes keep rejecting them. (Problem seen with relayers) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For completeness, this is the exact config name and link to it:
Co-authored-by: Lasaro Camargos <lasaro@informal.systems>
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm very supportive for banning for txs that could never be valid
I think banning being defined as blocking the IP is fine to me personally, you should be unable to spoof other IP's due to the secret connection handshake
I do not share cited concerns around backwards compatability. This feels like something a chain who attempts the upgrade on testnet will catch
Definitely support this and echo backwards comparability concern from valardragon. I don't think this is a large concern. Any additions to checktx message are backwards compatible and can be introduced in a minor release |
This RFC captures the changes required to implement the proposals in #7918 , #2185, #6523. The gist of the document is that the mempool should be able to trigger baning of peers if they either send us transactions that fail too often, or transactions that fail due to a particular code that indicates that the application has determined a transaction to never be valid, in any state.
It closes #9546 and if accepted will be followed by implementing the proposed changes.
📖 Rendered