
Pebble compaction causes intermittent, but significant performance impacts #29575

riposteX opened this issue Apr 18, 2024 · 4 comments

@riposteX

System information

Geth version: v1.13.14
CL client & version: teku@24.1.3
OS & Version: Linux

Expected behaviour

Geth receives/processes blocks in a timely manner.

Actual behaviour

I run a number of geth/Teku nodes and have recently noticed an infrequent (roughly daily) pattern across all of them: geth receives a burst of about six blocks at around the same time, the oldest of which is consequently ~72s stale.

It could just be a temporary network issue, but I keep seeing the same pattern, with the same number of blocks, on multiple machines in multiple locations.

It could also be a Teku issue, but that seems unlikely given the logs below.

Steps to reproduce the behaviour

I've added some custom logging in forkchoiceUpdated():

// Block is not canonical, set head.

// Log any head block that arrives more than 8s after its timestamp.
unixNow := uint64(time.Now().Unix())
blockHeader := block.Header()
if blockHeader.Time+8 < unixNow {
	fmt.Println("FCU received late block:", blockHeader.Number, unixNow-blockHeader.Time)
}

With this code in place, yesterday I got the output:

FCU received late block: 19680123 63
FCU received late block: 19680124 52
FCU received late block: 19680125 40
FCU received late block: 19680126 28
FCU received late block: 19680127 16

and corresponding Teku logs:

01:01:51.130 INFO  - Slot Event  *** Slot: 8882707, Block:        ... empty, Justified: 277583, Finalized: 277582, Peers: 64
01:01:56.514 WARN  - Execution Client request timed out. Make sure the Execution Client is online and can respond to requests.
01:01:56.516 WARN  - Late Block Import *** Block: 719951b2cd1546b28744883f37736b44aac68a8db826a050eba30b0862a6ca17 (8882707) proposer 425691 arrival 1500ms, gossip_validation +7ms, pre-state_retrieved +4ms, processed +146ms, data_availability_checked +0ms, execution_payload_result_received +7857ms, begin_importing +0ms, completed +2ms
01:02:03.105 INFO  - Slot Event  *** Slot: 8882708, Block:        ... empty, Justified: 277583, Finalized: 277582, Peers: 63
01:02:15.065 INFO  - Slot Event  *** Slot: 8882709, Block:        ... empty, Justified: 277583, Finalized: 277582, Peers: 63
01:02:15.094 INFO  - Execution Client is responding to requests again after a previous failure
01:02:18.523 WARN  - Execution Client request timed out. Make sure the Execution Client is online and can respond to requests.
01:02:27.058 INFO  - Slot Event  *** Slot: 8882710, Block:        ... empty, Justified: 277583, Finalized: 277582, Peers: 70
01:02:39.051 INFO  - Slot Event  *** Slot: 8882711, Block:        ... empty, Justified: 277583, Finalized: 277582, Peers: 70
01:02:50.552 INFO  - Execution Client is responding to requests again after a previous failure
01:02:51.110 INFO  - Slot Event  *** Slot: 8882712, Block:        ... empty, Justified: 277583, Finalized: 277582, Peers: 69
01:03:03.347 INFO  - Slot Event  *** Slot: 8882713, Block: e9a5299cdef23abf523c27235560777a176a614341703654aec7b3eab504f99e, Justified: 277583, Finalized: 277582, Peers: 69

I'm interpreting this as something intermittently hanging geth for ~1 minute.

Some other recent incidents occurred at blocks 19678501 and 19675309. The signature is always a pileup of roughly six blocks on geth and a late-block message on Teku. Those late-block messages looked a bit different from the one above, though:

12:51:51.989 WARN  - Late Block Import *** Block: b9b41da780df72f2b0d81e7e54820cfad87cf25107ba63fde3e147606bdb76d6 (8877857) proposer 241202 arrival 972ms, gossip_validation +6ms, pre-state_retrieved +10ms, processed +223ms, execution_payload_result_received +0ms, begin_importing +3778ms, completed +0ms

and

23:34:51.509 WARN  - Late Block Import *** Block: 23b5b12c8680fe980ee2f8bf8455b43b837181b9988fcf74d5bd73130b457057 (8881072) proposer 37541 arrival 488ms, gossip_validation +7ms, pre-state_retrieved +12ms, processed +163ms, execution_payload_result_received +0ms, begin_importing +3837ms, completed +2ms

I was running an older (and less verbose) version of Teku at the time; Lucas Saldanha on the Teku Discord told me that both of these blocks were late because of blob data unavailability.

@karalabe
Member

Could you share some logs from Geth when this happens?

@riposteX
Author

I usually run with --verbosity 1, so I have no past logs. I'll try to capture another occurrence.

@riposteX
Author

riposteX commented May 7, 2024

This seems to have magically stopped on its own; I haven't seen the characteristic 5-6 block pattern for a while now. I'll keep an eye out and reopen if it returns.

@riposteX riposteX closed this as completed May 7, 2024
@riposteX
Author

I did some more digging; it seems the issue is caused by Pebble's database compaction.

I'm not sure how or why it gets triggered, but the result is ~1 minute of slow block processing.

Some recent examples where I've timed InsertBlockWithoutSetHead in newPayload (a sketch of the timing code follows the examples):

NP received heavy block: 19858977 2.95236375s
NP received heavy block: 19858978 4.224653413s
NP received heavy block: 19858979 22.798706499s
NP received heavy block: 19858980 4.120076448s

NP received heavy block: 19859242 2.621948628s
NP received heavy block: 19859243 2.995023483s
NP received heavy block: 19859244 3.207645697s
NP received heavy block: 19859245 2.334093554s

NP received heavy block: 19860885 2.695685369s
NP received heavy block: 19860886 5.435836699s
NP received heavy block: 19860887 3.502931665s
NP received heavy block: 19860888 2.930462485s
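
Roughly, the timing looks like this (a minimal sketch only: the 2s threshold and the log wording are illustrative, error handling is elided, and the exact call site and signature of InsertBlockWithoutSetHead differ between geth versions):

// In eth/catalyst/api.go, inside newPayload (sketch).
start := time.Now()
err := api.eth.BlockChain().InsertBlockWithoutSetHead(block)
if elapsed := time.Since(start); elapsed > 2*time.Second {
	fmt.Println("NP received heavy block:", block.NumberU64(), elapsed)
}
// err is then handled by the existing code that follows.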

I found the culprit by profiling during the "bad" periods and confirmed that compaction is indeed being triggered by adding logging here:

func (d *Database) onCompactionBegin(info pebble.CompactionInfo) {
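	// Illustrative log line only (a sketch, not the exact patch): the
	// method's existing metric bookkeeping in ethdb/pebble/pebble.go is
	// left unchanged, and an fmt import may need to be added. JobID and
	// Reason are fields on pebble.CompactionInfo in the vendored Pebble.
	fmt.Println("pebble compaction begin:", time.Now(), "job", info.JobID, "reason", info.Reason)
	// ... rest of the original method ...
}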

I saw some earlier optimizations around compaction (#20130); I'm not sure if there's anything more that can be done to smooth these out as well.

@riposteX riposteX reopened this May 13, 2024
@riposteX riposteX changed the title Infrequent block pileup Pebble compaction causes intermittent, but significant performance impacts May 13, 2024