Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rfc: Banning peers based on ResponseCheckTx #9675

Open
wants to merge 408 commits into
base: main
Choose a base branch
from

Conversation

jmalicevic
Copy link
Contributor

This RFC captures the changes required to implement the proposals in #7918 , #2185, #6523. The gist of the document is that the mempool should be able to trigger baning of peers if they either send us transactions that fail too often, or transactions that fail due to a particular code that indicates that the application has determined a transaction to never be valid, in any state.

It closes #9546 and if accepted will be followed by implementing the proposed changes.

📖 Rendered

ValarDragon and others added 30 commits April 9, 2021 10:33
* ABCI++ RFC

This commit adds an RFC for ABCI++, which is a collection of three new phases of communication between the consensus engine and the application.

Co-authored-by: Sunny Aggarwal <sunnya97@protonmail.ch>

* Fix bugs pointed out by @liamsi

* Update rfc/004-abci++.md

Co-authored-by: Federico Kunze <31522760+fedekunze@users.noreply.github.com>

* Fix markdown lints

* Update rfc/004-abci++.md

Co-authored-by: Ismail Khoffi <Ismail.Khoffi@gmail.com>

* Update rfc/004-abci++.md

Co-authored-by: Tess Rinearson <tess.rinearson@gmail.com>

* Update rfc/004-abci++.md

Co-authored-by: Tess Rinearson <tess.rinearson@gmail.com>

* Add information about the rename in the context section

* Bold RFC

* Add example for self-authenticating vote data

* More exposition of the term IPC

* Update pros / negatives

* Fix sentence fragment

* Add desc for no-ops

Co-authored-by: Sunny Aggarwal <sunnya97@protonmail.ch>
Co-authored-by: Federico Kunze <31522760+fedekunze@users.noreply.github.com>
Co-authored-by: Ismail Khoffi <Ismail.Khoffi@gmail.com>
Co-authored-by: Tess Rinearson <tess.rinearson@gmail.com>
* add rpc spec and support outline

* add json

* add more routes remove unneeded ones

* add rest of rpc endpoints

* add jsonrpc calls

* add more jsonrpc calls

* fix blockchain

* cleanup unused links and add links to repos

* Update spec/rpc/README.md

Co-authored-by: Callum Waters <cmwaters19@gmail.com>

* add missing param from consensus param

* Update spec/rpc/README.md

Co-authored-by: Callum Waters <cmwaters19@gmail.com>

* Update spec/rpc/README.md

Co-authored-by: Callum Waters <cmwaters19@gmail.com>

* fix cast and add doc to readme

Co-authored-by: Callum Waters <cmwaters19@gmail.com>
Co-authored-by: Marko Baricevic <markobaricevic@Fergalicious.local>
* Avoid quantifier alternation cycle

The problematic quantifier alternation cycle arose because the
definition of accountability_violation was unfolded.

This commit also restructures the induction proof for clarity.

* add count_lines.sh

* fix typo and add forgotten complete=fo in comment

Co-authored-by: Giuliano <giuliano@eic-61-11.galois.com>
Bumps [actions/stale](https://github.com/actions/stale) from 3 to 3.0.18.
- [Release notes](https://github.com/actions/stale/releases)
- [Commits](actions/stale@v3...v3.0.18)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [actions/stale](https://github.com/actions/stale) from 3.0.18 to 3.0.19.
- [Release notes](https://github.com/actions/stale/releases)
- [Commits](actions/stale@v3.0.18...v3.0.19)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* add a changelog to track changes

* Update CHANGELOG.md

Co-authored-by: Callum Waters <cmwaters19@gmail.com>

Co-authored-by: Callum Waters <cmwaters19@gmail.com>
* clarify timestamps

* changelog entry

* Update spec/rpc/README.md

Co-authored-by: Callum Waters <cmwaters19@gmail.com>

Co-authored-by: Callum Waters <cmwaters19@gmail.com>
* rpc: add chunked genesis endpoint

* fix lint

* feedback

* add info about error

* fix lint

Co-authored-by: marbar3778 <marbar3778@yahoo.com>
* add parameters to byzantine send action

* make net not trusted

it's not necessary since for proofs Ivy will assume that the environment
does not break action preconditions

* use require instead of assume

it seems that assume is not checked when other isolates call!

* add comment

* add comment

* run with random seed

* make domain model extractable to C++

* substitute require for assume

assumes in an action are not checked when the action is called! I.e.
they place no requirement on the caller; we're just assuming that the
caller is going to do the right thing. This wasn't very important here
but it leade to a minor inconsistency slipping through.

* make the net isolate not trusted

there was no need for it

* add tendermint_test.ivy

contains a simple test scenario that show that the specification is no
vacuuous

* update comment

* add comments

* throw if trying to parse nset value in the repl

* add comment

* minor refactoring
Bumps [gaurav-nelson/github-action-markdown-link-check](https://github.com/gaurav-nelson/github-action-markdown-link-check) from 1.0.12 to 1.0.13.
- [Release notes](https://github.com/gaurav-nelson/github-action-markdown-link-check/releases)
- [Commits](gaurav-nelson/github-action-markdown-link-check@1.0.12...1.0.13)

---
updated-dependencies:
- dependency-name: gaurav-nelson/github-action-markdown-link-check
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [actions/stale](https://github.com/actions/stale) from 3.0.19 to 4.
- [Release notes](https://github.com/actions/stale/releases)
- [Changelog](https://github.com/actions/stale/blob/main/CHANGELOG.md)
- [Commits](actions/stale@v3.0.19...v4)

---
updated-dependencies:
- dependency-name: actions/stale
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
mem.peerFailureMap[peerID] = { time.Now(), numFailures + 1}
}
```
If transactions with `ResponseCheckTx.code > 1` are not deleted from the cache and we want to ban them on the first offence,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand it, transactions with ResponseCheckTx.code > 1 should never be removed from the cache

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is something we need to agree on. If we do not expect to have many nodes sending us such transactions, we might save space by just removing it and not storing the peer info, because this peer is immediately banned and this should be a one-off occurence. That is also why the condition to remove the tx from the cache from your previous comments removes the transaction if this is the reason for failure.

If we do think this is better stored and never removed, then the condition from your comment should be as you suggested and not as it is currently written.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think it'd be best to discuss this synchronously

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to summarize our in person discussion which will also lead to a clarification in the doc. The debate we are having is whether we want the never-valid transactions in the cache or not.

  • Yes because this means we will never run checkTx on them again and we know it will never be valid, so we may want it to stay there forever.
  • No because they are very rare, the peer is banned so they take up space in the cache when they don't occur that often


As this logic loops through the transactions in any case, we can leverage it to check whether we can ban peers.
However, this approach has several downsides:
- It is not timely enough. Recheck is executed after a block is committed, leaving room for a bad peer to send
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned above, my thought on this part is that I can't to see a valid example of an App that locally passed CheckTx on a TX, and then fails at reCheckTx time with "> 1"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we would not expect the transaction to fail it here after it passed it the first time. What I meant here is that we check whether the transaction was marked as invalid with a particular code when it was added, and if yes we ban all the peers that sent it. So we only leverage the fact that we anyways loop through all the transactions and do not apply the normal recheck logic (which applies to only valid transactions).

In my opinion this solution is a bad one, but I decided to list it here in case people feel differently.


### Impact on ABCI

- Introduction of new response codes for CheckTx
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe mention this is a non-breaking interface change

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree, I believe that in a certain sense this is breaking. We are giving meaning to something that applications may have already given meaning too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I saw your comment above. I hadn't thought about that. I think you are right

Copy link
Contributor

@williambanfield williambanfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for the great write-up! I would say that, in my mind, this is effectively bringing a mechanism of the PEX reactor into the mempool. That reactor has logic to mark a peer as bad if it sends malformed network addresses. In that context, the bytes of the message may contain a valid proto but the values in that proto message are invalid. In that case, they do not represent a legal network address. I believe that this mechanism should just mimic the logic of the PEX reactor but with the needed input of the application.

Since the contents of the Tx are meaningless from the standpoint of Tendermint, we have no idea if they are well formed and/or contain clearly bogus data so we must ask the application. If the application responds that the data is malformed or never valid for some other reason, we should ban the peer. This mimics what we do in the PEX reactor: when we see garbage data we ban the peer.

I think that the case of banning a peer if it frequently sends transactions to our node that could have been valid at some point, either in the past or in the future, is much much more subtle and, without strong justification, we should likely avoid solving entirely. If we need something we could consider a rate-limit for such peers to prevent our resources from being completely tied up by peers that are not actively malicious but that are also being unhelpful.

of them might be obsolete at the time of the check; state changes upon block execution etc.).
However, Tendermint users observed that
there are transactions that can never have been or will never be valid. They thus propose
to introduce a special response code for `CheckTx` to indicate this behaviour, and ban the peers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They thus propose
to introduce a special response code for CheckTx to indicate this behaviour, and ban the peers...

I would also note that this may be breaking for old applications. We have never reserved response codes as meaningful to Tendermint, so giving semantic meaning to some response code, such as '44 = ban the peer' may break older applications that had used 44 for some internal reason. It seems unlikely that this is the case but it is a concern worth taking into consideration.

2. Peer is disconnected from and banned.
3. Conditions for the peer to be banned.
2. If `CheckTx` signals a peer should be banned, how do we know which peer(s) to ban?
3. Are there possible attack scenarios by allowing this.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since our p2p layer implements a cryptographic handshake procedure, we can be certain of the identity of the sending peer. This means a malicious peer could not force us to ban a different peer.

For the sake of simplicity, in the remainder of the text we will distinguish the failures due to a) as failures
signaled with `ResponseCheckTx.code = 1` and the failures described in b), failures with `ResponseCheckTx.code > 1`.

For a), a peer sends transactions that **repeatedly** fail CheckTx with `ResponseCheckTx.code = 1`, and is banned or disconnected from to avoid this. In this case we need to define what repeatedly means.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears to me that that's what's being suggested by the different response codes, either =1 or >1 @sergio-mena

Comment on lines 122 to 123
A peer should not be banned the first time this happens, but rather if this happens a predefined number
of times (`numFailures`) within a time interval (`lastFailure`). This time interval should be reset every `failureResetInterval`ms.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I'm not seeing it in the Background section, is this behavior that we definitely need? I think that banning peers that send us transactions that absolutely could never have been valid is mostly what's being asked for. A mechanism for banning peers that often send currently invalid - but not never-valid -transactions likely cannot be defined within Tendermint so I'm wondering how much effort we should allocate to doing it?

**Note** Do we have mechanisms in place to handle cases when `broadcastTxCommit` submits
failing transactions (can this be a form of attack)?

### 3. Attack scenarios
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the difficulty here is that this would be an application-produced issue that would first be noticed by operators but that operators would be unable to reasonably address.

Is there a possibility of (accidentally) introducing a bug in CheckTx, which would cause (most of) the P2P to loose connectivity? How would that look like?

It seems to me that, of course this is the case. If the app always returns an erroneous 'never-valid' response code, then the p2p network will completely disconnect.


### 3. Attack scenarios

While an attack by simply banning peers on faling `CheckTx` is hard to imagine, as the incentive for doing so is not clear, there are considerations with regards to the current mempool gossip implementation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, supposing some overlay network, as a node I can only influence my direct connections. I can force a peer to disconnect from me forever if, say, I wanted to lower the number of ways it has of getting messages to the rest of the network. However, I could have already done that by just disconnecting or by dropping its messages.


### Impact on ABCI

- Introduction of new response codes for CheckTx
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree, I believe that in a certain sense this is breaking. We are giving meaning to something that applications may have already given meaning too.

of them might be obsolete at the time of the check; state changes upon block execution etc.).
However, Tendermint users observed that
there are transactions that can never have been or will never be valid. They thus propose
to introduce a special response code for `CheckTx` to indicate this behaviour, and ban the peers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I concur with what you're saying @williambanfield but there are definitely non-breaking options such as simply adding a boolean field like txCouldNeverHaveBeenValid to ResponseCheckTx

docs/rfc/rfc-026-p2p-bad-peers-checktx.md Outdated Show resolved Hide resolved
Comment on lines +84 to +87
1. What does banning a peer mean:
1. A peer can be simply disconnected from.
2. Peer is disconnected from and banned.
3. Conditions for the peer to be banned.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to discuss banning by node ID and IP vs banning by IP only?

Comment on lines 97 to 98
We thus differentiate two scenarios - a) where `CheckTx` fails due to reasons already
known and b) where `CheckTx` deems a transactions could never have been valid.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way of phrasing this is stateless checks vs stateful checks. Because check tx can be called by validators across different heights i.e. different states, stateful checks are non deterministic and thus should never result in the ban of a peer, however stateless checks should be deterministic for every honest node irrespective of the height and thus should lead to a ban.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, CheckTx should be stateful , in the sense that it compares the current transaction against the latest state. The application is the one that defines whether the failure is something that happened due to a difference in states (height etc.) or is the failure a reason for banning. THese two distinctions cover the case where the first scenario happens very often, while the other is immediate reason for banning.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, CheckTx should be stateful

Not necessarily. The SDK only has one stateful check in the antehandler (that I know of) which is to check the nonce numbers everything else about it is stateless (It is essentially one large ValidateBasic function which mostly ensures that the bytes match a recognised protobuf message).

It doesn't make sense to spend too much time checking against state because you don't know whether that's the state that the transactions will eventually execute against so you will need to do all those stateful checks again.

CheckTx should be seen as an initial gatekeeper for transactions but not the final gatekeeper. The final gatekeeper is PrepareProposal and ProcessProposal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the source for my comment btw: https://github.com/tendermint/tendermint/blob/main/spec/abci/abci%2B%2B_methods.md#checktx . We might need to rephrase this if you feel that's needed.

docs/rfc/rfc-026-p2p-bad-peers-checktx.md Outdated Show resolved Hide resolved
For now, we will disconnect and ban the peer regardless of the exact reason a transaction
is considered to never be valid.

#### **Banning in case of ResponseCheckTx.code = 1**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're kind of jumping ahead to the implementation here, but I would be for introducing a new field in ResponseCheckTx that defaults to not banning the peer rather than modifying the semantics of the existing Code field.

jmalicevic and others added 3 commits November 11, 2022 10:46
Co-authored-by: Sergio Mena <sergio@informal.systems>
Co-authored-by: Callum Waters <cmwaters19@gmail.com>
Co-authored-by: Lasaro Camargos <lasaro@informal.systems>
Currently, a peer can send the same valid (or invalid) transaction [multiple times](https://github.com/tendermint/tendermint/blob/ff0f98892f24aac11e46aeff2b6d2c0ad816701a/mempool/v0/clist_mempool.go#L247). Peers do not
gossip transactions to peers that have sent them that same transaction. But there is no check on whether
a node has alrady sent the same transaction to this peer before. There is also no check whether the transaction
that is being gossiped was valid or not (assumint that invalid transactions could become valid).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
that is being gossiped was valid or not (assumint that invalid transactions could become valid).
that is being gossiped was valid or not (assuming that invalid transactions could become valid).


**Signaling the ban to p2p**

As it is the mempool **reactor** that has access to the p2p layer, not the actual mempool implementation, the peer banning function will most likely have to send the peer ID to a channel to inform the reactor of the fact that this peer should be banned. This would require adding a channel for communication into the `CListMempool` on construction, and adding a routine in the mempool ractor that waits on a response via this channel. However, the actual implementation of this is yet to be defined.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
As it is the mempool **reactor** that has access to the p2p layer, not the actual mempool implementation, the peer banning function will most likely have to send the peer ID to a channel to inform the reactor of the fact that this peer should be banned. This would require adding a channel for communication into the `CListMempool` on construction, and adding a routine in the mempool ractor that waits on a response via this channel. However, the actual implementation of this is yet to be defined.
As it is the mempool **reactor** that has access to the p2p layer, not the actual mempool implementation, the peer banning function will most likely have to send the peer ID to a channel to inform the reactor of the fact that this peer should be banned. This would require adding a channel for communication into the `CListMempool` on construction, and adding a routine in the mempool reactor that waits on a response via this channel. However, the actual implementation of this is yet to be defined.


- Introduction of new response codes for CheckTx
- Altering the specification to reflect this change

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also consider using bogus transactions (those which could never be valid) as evidence of misbehavior?

Copy link
Contributor

@adizere adizere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate how self-contained the doc is. Without much prior knowledge I can grok it!

docs/rfc/rfc-026-p2p-bad-peers-checktx.md Outdated Show resolved Hide resolved
[avoid overloading the p2p/mempool system](https://github.com/cosmos/ibc-go/issues/853#issuecomment-1032211020)
(ToDo check with Adi what is this about, what are valid reasons this can happen and when should we disconnect)
(relevant code [here](https://github.com/cosmos/ibc-go/blob/a0e59b8e7a2e1305b7b168962e20516ca8c98fad/modules/core/ante/ante.go#L23))
- IBC relayers: Nodes allow transactions with a wrong `minGas` transaction and gossip them, and other nodes keep rejecting them. (Problem seen with relayers)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Co-authored-by: Lasaro Camargos <lasaro@informal.systems>
@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale for use by stalebot label Nov 25, 2022
@jmalicevic jmalicevic added S:wip Work in progress (prevents stalebot from automatically closing) and removed stale for use by stalebot labels Nov 28, 2022
Copy link
Contributor

@ValarDragon ValarDragon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very supportive for banning for txs that could never be valid

I think banning being defined as blocking the IP is fine to me personally, you should be unable to spoof other IP's due to the secret connection handshake

I do not share cited concerns around backwards compatability. This feels like something a chain who attempts the upgrade on testnet will catch

@tac0turtle
Copy link
Contributor

Definitely support this and echo backwards comparability concern from valardragon. I don't think this is a large concern. Any additions to checktx message are backwards compatible and can be introduced in a minor release

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C:docs Component: Documentation C:mempool Component: Mempool S:wip Work in progress (prevents stalebot from automatically closing)
Projects
Status: Ready for Review
Development

Successfully merging this pull request may close these issues.

mempool/p2p: Research implications of peer disconnect based on ResponseCheckTx