Currently range sync download log errors just say `error: rpc_error` which isn't helpful. The actual error is suppressed unless logged somewhere else.
Log the actual error that caused the batch download to fail as part of the log that states that the batch download failed.
Currently we track a key metric `PEERS_PER_COLUMN_SUBNET` in a getter `good_peers_on_sampling_subnets`. Another PR https://github.com/sigp/lighthouse/pull/6922 deletes that function, so we have to move the metric anyway. This PR moves that metric computation to the metrics spawned task which is refreshed every 5 seconds.
I also added a few more useful metrics. The total set and intended usage is:
- `sync_peers_per_column_subnet`: Track health of overall subnets in your node
- `sync_peers_per_custody_column_subnet`: Track health of the subnets your node needs. We should track this metric closely in our dashboards with a heatmap and bar plot
- ~~`sync_column_subnets_with_zero_peers`: Is equivalent to the Grafana query `count(sync_peers_per_column_subnet == 0) by (instance)`. We may prefer to skip it, but I believe it's the most important metric as if `sync_column_subnets_with_zero_peers > 0` your node stalls.~~
- ~~`sync_custody_column_subnets_with_zero_peers`: `count(sync_peers_per_custody_column_subnet == 0) by (instance)`~~
Currently we have very poor coverage of range sync with unit tests. With the event driven test framework we could cover much more ground and be confident when modifying the code.
Add two basic cases:
- Happy path, complete a finalized sync for 2 epochs
- Post-PeerDAS case where we start without enough custody peers and later we find enough
⚠️ If you have ideas for more test cases, please let me know! I'll write them
- Re-opened PR from https://github.com/sigp/lighthouse/pull/6869
Writing and running tests I noted that the sync RPC requests are very verbose now.
`DataColumnsByRootRequestId { id: 123, requester: Custody(CustodyId { requester: CustodyRequester(SingleLookupReqId { req_id: 121, lookup_id: 101 }) }) }`
Since this Id is logged rather often I believe there's value in
1. Making them more succinct for log verbosity
2. Make them a string that's easy to copy and work with elastic
Write custom `Display` implementations to render Ids in a more DX format
_ DataColumnsByRootRequestId with a block lookup_
```
123/Custody/121/Lookup/101
```
_DataColumnsByRangeRequestId_
```
123/122/RangeSync/0/5492900659401505034
```
- This one will be shorter after https://github.com/sigp/lighthouse/pull/6868
Also made the logs format and text consistent across all methods
- PR https://github.com/sigp/lighthouse/pull/6497 made obsolete some consistency checks inside the batch
I forgot to remove the consumers of those errors
Remove un-used batch sync error condition, which was a nested `Result<_, Result<_, E>>`
Part of
- https://github.com/sigp/lighthouse/issues/6258
To address PeerDAS sync issues we need to make individual by_range requests within a batch retriable. We should adopt the same pattern for lookup sync where each request (block/blobs/columns) is tracked individually within a "meta" request that group them all and handles retry logic.
- Building on https://github.com/sigp/lighthouse/pull/6398
second step is to add individual request accumulators for `blocks_by_range`, `blobs_by_range`, and `data_columns_by_range`. This will allow each request to progress independently and be retried separately.
Most of the logic is just piping, excuse the large diff. This PR does not change the logic of how requests are handled or retried. This will be done in a future PR changing the logic of `RangeBlockComponentsRequest`.
### Before
- Sync manager receives block with `SyncRequestId::RangeBlockAndBlobs`
- Insert block into `SyncNetworkContext::range_block_components_requests`
- (If received stream terminators of all requests)
- Return `Vec<RpcBlock>`, and insert into `range_sync`
### Now
- Sync manager receives block with `SyncRequestId::RangeBlockAndBlobs`
- Insert block into `SyncNetworkContext:: blocks_by_range_requests`
- (If received stream terminator of this request)
- Return `Vec<SignedBlock>`, and insert into `SyncNetworkContext::components_by_range_requests `
- (If received a result for all requests)
- Return `Vec<RpcBlock>`, and insert into `range_sync`
Resolve a `TODO(das)` to use KZG batch verification in `put_rpc_custody_columns`
Uses `verify_kzg_for_data_column_list_with_scoring` in all paths that send more than one column. To use batch verification and have attributability of which peer is sending a bad column.
Needs to move `verify_kzg_for_data_column_list_with_scoring` into the type's module to convert to the KZG verified type.
Addresses #6706
This PR activates PeerDAS at the Fulu fork epoch instead of `EIP_7594_FORK_EPOCH`. This means we no longer support testing PeerDAS with Deneb / Electrs, as it's now part of a hard fork.
Currently, we set the `chain_id` of range sync chains to `u64(hash(target_root, target_slot))`, which results in a long integer.
```
Jan 27 00:43:27.246 DEBG Batch downloaded, chain: 4223372036854775807, awaiting_batches: 0, batch_state: [p,E,E,E,E], blocks: 0, epoch: 0, service: range_sync
```
Instead, we can use `network_context.next_id()` as we do for all other sync items and get a unique sequential (not too big) integer as id.
```
Jan 27 00:43:27.246 DEBG Batch downloaded, chain: 4, awaiting_batches: 0, batch_state: [p,E,E,E,E], blocks: 0, epoch: 0, service: range_sync
```
Also, if a specific chain for the same target is retried later, it won't get the same ID so we can more clearly differentiate the logs associated with each attempt.
When working on unrelated changes I noted:
- An unnecessary closure left by a commit of some guy named @dapplion that can be removed
- match statements that can be simplified with the new let else syntax
- instead of mapping a result to ignore the Ok value, return
N/A
In https://github.com/sigp/lighthouse/pull/6329 we changed `max_blobs_per_block` from a preset to a config value.
We weren't using the right value based on fork in that PR. This is a follow up PR to use the fork dependent values.
In the proces, I also updated other places where we weren't using fork dependent values from the ChainSpec.
Note to reviewer: easier to go through by commit
Complements
- https://github.com/sigp/lighthouse/pull/6321
by detecting if the proposer signature is valid or not during RPC block processing. In lookup sync, if the invalid signature signature is the proposer signature, it's not deterministic on the block root. So we should only penalize the sending peer and retry. Otherwise, if it's on the body we should drop the lookup and penalize all peers that claim to have imported the block
* First pass
* Add restrictions to RuntimeVariableList api
* Use empty_uninitialized and fix warnings
* Fix some todos
* Merge branch 'unstable' into max-blobs-preset
* Fix take impl on RuntimeFixedList
* cleanup
* Fix test compilations
* Fix some more tests
* Fix test from unstable
* Merge branch 'unstable' into max-blobs-preset
* Merge remote-tracking branch 'origin/unstable' into max-blobs-preset
* Remove footgun function
* Minor simplifications
* Move from preset to config
* Fix typo
* Revert "Remove footgun function"
This reverts commit de01f923c7.
* Try fixing tests
* Thread through ChainSpec
* Fix release tests
* Move RuntimeFixedVector into module and rename
* Add test
* Remove empty RuntimeVarList awefullness
* Fix tests
* Simplify BlobSidecarListFromRoot
* Merge remote-tracking branch 'origin/unstable' into max-blobs-preset
* Bump quota to account for new target (6)
* Remove clone
* Fix issue from review
* Try to remove ugliness
* Merge branch 'unstable' into max-blobs-preset
* Fix max value
* Fix doctest
* Fix formatting
* Fix max check
* Delete hardcoded max_blobs_per_block in RPC limits
* Merge remote-tracking branch 'origin/unstable' into max-blobs-preset
* Write range sync tests in external event-driven form
* Fix remaining test
* Drop unused generics
* Merge branch 'unstable' into range-sync-tests
* Add reference to test author
* Use async await
* Fix failing test. Not sure how it was passing before without an EL.
* Better assert message in lookup sampling test
* Export status
* Merge remote-tracking branch 'sigp/unstable' into lookup-sampling-test-assert
* Drop unused
* Use slice
* Add test for ActiveSamplingRequest
* Fix the column_indexes field from the requested ones to the responded ones
* Fix clippy errors
* Move tests to tests.rs
* Fix unused import
* Fix clippy error
* Merge branch 'unstable' into fork/add-test-for-active-sampling-request
# Conflicts:
# beacon_node/network/Cargo.toml
# beacon_node/network/src/sync/sampling.rs
* Merge branch 'unstable' into fork/add-test-for-active-sampling-request
* add fork_name_enabled fn to Forkname impl
* refactor codebase to use new fork_enabled fn
* fmt
* Merge branch 'unstable' of https://github.com/sigp/lighthouse into fork-ord-impl
* small code cleanup
* resolve merge conflicts
* fix beacon chain test
* merge conflicts
* fix ef test issue
* resolve merge conflicts
* add id to rpc requests
* rename rpc request and response types for more accurate meaning
* remove unrequired build_request function
* remove unirequired Request wrapper types and unify Outbound and Inbound Request
* add RequestId to NetworkMessage::SendResponse
,NetworkMessage::SendErrorResponse to be passed to Rpc::send_response
* start splitting gossip verification
* WIP
* Gossip verify separate (#7)
* save
* save
* make ProvenancedBlock concrete
* delete into gossip verified block contents
* get rid of IntoBlobSidecar trait
* remove IntoGossipVerified trait
* get tests compiling
* don't check sidecar slashability in publish
* remove second publish closure
* drop blob bool. also prefer using message index over index of position in list
* Merge remote-tracking branch 'origin/unstable' into gossip-verify-separate
* Fix low-hanging tests
* Fix tests and clean up
* Clean up imports
* more cleanup
* Merge remote-tracking branch 'origin/unstable' into gossip-verify-separate
* Further refine behaviour and add tests
* Merge remote-tracking branch 'origin/unstable' into gossip-verify-separate
* Merge remote-tracking branch 'origin/unstable' into gossip-verify-separate
* Remove empty line
* Fix test (block is not fully imported just gossip verified)
* Merge remote-tracking branch 'origin/unstable' into gossip-verify-separate
* Update for unstable & use empty blob list
* Update comment
* Add test for duplicate block case
* Merge remote-tracking branch 'origin/unstable' into gossip-verify-separate
* Clarify unreachable case
* Fix another publish_block case
* Remove unreachable case in filter chain segment
* Revert unrelated blob optimisation
* Merge remote-tracking branch 'origin/unstable' into gossip-verify-separate
* Merge remote-tracking branch 'origin/unstable' into gossip-verify-separate
* Fix merge conflicts
* Merge remote-tracking branch 'origin/unstable' into gossip-verify-separate
* Fix some compilation issues. Impl is fucked though
* Support peerDAS
* Fix tests
* Merge remote-tracking branch 'origin/unstable' into gossip-verify-separate
* Fix conflict
* Merge remote-tracking branch 'origin/unstable' into gossip-verify-separate
* Address review comments
* Merge remote-tracking branch 'origin/unstable' into gossip-verify-separate
* Improve `get_custody_columns` validation, caching and error handling.
* Merge branch 'unstable' into get-custody-columns-error-handing
* Fix failing test and add more test.
* Fix failing test and add more test.
* Merge branch 'unstable' into get-custody-columns-error-handing
# Conflicts:
# beacon_node/lighthouse_network/src/discovery/subnet_predicate.rs
# beacon_node/lighthouse_network/src/peer_manager/peerdb.rs
# beacon_node/lighthouse_network/src/peer_manager/peerdb/peer_info.rs
# beacon_node/lighthouse_network/src/types/globals.rs
# beacon_node/network/src/service.rs
# consensus/types/src/data_column_subnet_id.rs
* Add unit test to make sure the default specs won't panic on the `compute_custody_requirement_subnets` function.
* Add condition when calling `compute_custody_subnets_from_metadata` and update logs.
* Validate `csc` when returning from enr. Remove `csc` computation on connection since we get them on metadata anyway.
* Add `peers_per_custody_subnet_count` to track peer csc and supernodes.
* Disconnect peers with invalid metadata and find other peers instead.
* Fix sampling tests.
* Merge branch 'unstable' into get-custody-columns-error-handing
* Merge branch 'unstable' into get-custody-columns-error-handing