
Pebble compaction causes intermittent, but significant performance impacts #29575

riposteX opened this issue Apr 18, 2024 · 4 comments

@riposteX

System information

Geth version: v1.13.14
CL client & version: teku@24.1.3
OS & Version: Linux

Expected behaviour

Geth receives/processes blocks in a timely manner.

Actual behaviour

I run a number of geth/Teku nodes and have recently noticed an infrequent (roughly daily) pattern across all of them: geth receives a burst of about six blocks at around the same time, the oldest of which is consequently ~72s stale.

It could just be a temporary network issue, but I keep seeing the same pattern, with the same number of blocks, on multiple machines in multiple locations.

It could also be a Teku issue, but that seems unlikely given the logs below.

Steps to reproduce the behaviour

I've added some custom logging in forkchoiceUpdated():

// Block is not canonical, set head.

// Log any head block that arrives more than 8s after its timestamp.
unixNow := uint64(time.Now().Unix())
blockHeader := block.Header()
if blockHeader.Time+8 < unixNow {
	fmt.Println("FCU received late block:", blockHeader.Number, unixNow-blockHeader.Time)
}

With this code in place, yesterday I got the output:

FCU received late block: 19680123 63
FCU received late block: 19680124 52
FCU received late block: 19680125 40
FCU received late block: 19680126 28
FCU received late block: 19680127 16

and corresponding Teku logs:

01:01:51.130 INFO  - Slot Event  *** Slot: 8882707, Block:        ... empty, Justified: 277583, Finalized: 277582, Peers: 64
01:01:56.514 WARN  - Execution Client request timed out. Make sure the Execution Client is online and can respond to requests.
01:01:56.516 WARN  - Late Block Import *** Block: 719951b2cd1546b28744883f37736b44aac68a8db826a050eba30b0862a6ca17 (8882707) proposer 425691 arrival 1500ms, gossip_validation +7ms, pre-state_retrieved +4ms, processed +146ms, data_availability_checked +0ms, execution_payload_result_received +7857ms, begin_importing +0ms, completed +2ms
01:02:03.105 INFO  - Slot Event  *** Slot: 8882708, Block:        ... empty, Justified: 277583, Finalized: 277582, Peers: 63
01:02:15.065 INFO  - Slot Event  *** Slot: 8882709, Block:        ... empty, Justified: 277583, Finalized: 277582, Peers: 63
01:02:15.094 INFO  - Execution Client is responding to requests again after a previous failure
01:02:18.523 WARN  - Execution Client request timed out. Make sure the Execution Client is online and can respond to requests.
01:02:27.058 INFO  - Slot Event  *** Slot: 8882710, Block:        ... empty, Justified: 277583, Finalized: 277582, Peers: 70
01:02:39.051 INFO  - Slot Event  *** Slot: 8882711, Block:        ... empty, Justified: 277583, Finalized: 277582, Peers: 70
01:02:50.552 INFO  - Execution Client is responding to requests again after a previous failure
01:02:51.110 INFO  - Slot Event  *** Slot: 8882712, Block:        ... empty, Justified: 277583, Finalized: 277582, Peers: 69
01:03:03.347 INFO  - Slot Event  *** Slot: 8882713, Block: e9a5299cdef23abf523c27235560777a176a614341703654aec7b3eab504f99e, Justified: 277583, Finalized: 277582, Peers: 69

I'm interpreting this as something intermittently hanging geth for ~1 minute.

Some other recent incidents occurred at blocks 19678501 and 19675309. The signature is always a pileup of roughly six blocks on geth and a late-block message on Teku. Those late-block messages looked a bit different from the one above, though:

12:51:51.989 WARN  - Late Block Import *** Block: b9b41da780df72f2b0d81e7e54820cfad87cf25107ba63fde3e147606bdb76d6 (8877857) proposer 241202 arrival 972ms, gossip_validation +6ms, pre-state_retrieved +10ms, processed +223ms, execution_payload_result_received +0ms, begin_importing +3778ms, completed +0ms

and

23:34:51.509 WARN  - Late Block Import *** Block: 23b5b12c8680fe980ee2f8bf8455b43b837181b9988fcf74d5bd73130b457057 (8881072) proposer 37541 arrival 488ms, gossip_validation +7ms, pre-state_retrieved +12ms, processed +163ms, execution_payload_result_received +0ms, begin_importing +3837ms, completed +2ms

I was running an older (and less verbose) version of Teku at the time; Lucas Saldanha on the Teku Discord told me that both of these blocks were late because of blob data unavailability.

@karalabe
Member

Could you share some logs from Geth when this happens?

@riposteX
Author

I usually run with --verbosity 1, so I have no past logs. I'll try to capture another occurrence.

@riposteX
Author

riposteX commented May 7, 2024

This seems to have magically stopped on its own; I haven't seen the characteristic 5-6 block pattern for a while now. I'll keep an eye out and reopen if it returns.

@riposteX riposteX closed this as completed May 7, 2024
@riposteX
Author

I did some more digging; it seems the issue is caused by Pebble's database compaction.

I'm not sure how or why it gets triggered, but the result is ~1 minute of slow block processing.

Some recent examples where I've timed InsertBlockWithoutSetHead in newPayload (a sketch of the timing code follows the examples):

NP received heavy block: 19858977 2.95236375s
NP received heavy block: 19858978 4.224653413s
NP received heavy block: 19858979 22.798706499s
NP received heavy block: 19858980 4.120076448s

NP received heavy block: 19859242 2.621948628s
NP received heavy block: 19859243 2.995023483s
NP received heavy block: 19859244 3.207645697s
NP received heavy block: 19859245 2.334093554s

NP received heavy block: 19860885 2.695685369s
NP received heavy block: 19860886 5.435836699s
NP received heavy block: 19860887 3.502931665s
NP received heavy block: 19860888 2.930462485s
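
Roughly, the timing looks like this (a minimal sketch only: the 2s threshold and the log wording are illustrative, error handling is elided, and the exact call site and signature of InsertBlockWithoutSetHead differ between geth versions):

// In eth/catalyst/api.go, inside newPayload (sketch).
start := time.Now()
err := api.eth.BlockChain().InsertBlockWithoutSetHead(block)
if elapsed := time.Since(start); elapsed > 2*time.Second {
	fmt.Println("NP received heavy block:", block.NumberU64(), elapsed)
}
// err is then handled by the existing code that follows.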

I found the culprit by profiling during the "bad" periods and confirmed that compaction is indeed being triggered by adding logging here:

func (d *Database) onCompactionBegin(info pebble.CompactionInfo) {
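	// Illustrative log line only (a sketch, not the exact patch): the
	// method's existing metric bookkeeping in ethdb/pebble/pebble.go is
	// left unchanged, and an fmt import may need to be added. JobID and
	// Reason are fields on pebble.CompactionInfo in the vendored Pebble.
	fmt.Println("pebble compaction begin:", time.Now(), "job", info.JobID, "reason", info.Reason)
	// ... rest of the original method ...
}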

I saw some earlier optimizations around compaction (#20130); I'm not sure if there's anything more that can be done to smooth these out as well.

@riposteX riposteX reopened this May 13, 2024
@riposteX riposteX changed the title Infrequent block pileup Pebble compaction causes intermittent, but significant performance impacts May 13, 2024