Implementing gloas lookup sync is currently incompatible with the `GossipBlockProcessResult` mechanism.
Today it's implemented such that if we receive a sucessful `GossipBlockProcessResult` we directly mark the lookup as Complete and delete it. In Gloas we can't delete a lookup after block import, as we may still have FULL child awaiting the payload.
IMO this `GossipBlockProcessResult` brings a lot of headache and edge cases that we can just live without. Also the `reset_request` business is nasty and can easily leave the lookup in a bad state.
If we get rid of `GossipBlockProcessResult` we only pay the following performance penalty:
- Lookup is created exactly while the block's payload is being execution validated
- (new degradation) we download the block again
- send the block for processing but the duplicate cache prevents double execution
So in the worst case we spend a few KBs of extra download bandwidth. Remember each block is downloaded 8x times through gossip in the happy case.
Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
Co-Authored-By: Pawan Dhananjay <pawandhananjay@gmail.com>
Harness/tests (foundation):
- make_gloas_block_with_status: produce a gloas block with explicit parent
payload status (builds FULL vs EMPTY children); returns its data columns.
- TestRig::build_full_empty_fork: G(full) -> A(full) -> B(FULL child), A -> C(EMPTY).
- SimulateConfig::return_no_envelope_for_block: withhold a block's payload envelope.
- Tests: gloas_build_full_empty_fork_shape (shape), gloas_full_empty_children_
retain_parent_for_payload (happy path), gloas_empty_child_continues_while_
parent_payload_withheld (red: C must complete, B+A retained while payload withheld).
Option B sketch (untested, mod.rs) -- to be implemented properly:
- continue_child_lookups on a SingleBlock Imported result (children re-evaluate
on parent block import, before its payload).
- retain a failed lookup while another lookup awaits it (is_awaited).
- Gate payload-envelope processing on block_request.state.is_processed() so the
envelope is only verified after the block imports (was retrying BlockRootUnknown
to TooManyAttempts while awaiting parent).
- Penalize attributable peers withholding columns post-Gloas (drop !gloas_enabled
custody carve-out).
- Restructure custody-failure tests to drive off the FULL child so the withheld
block is the parent with attributable peers; scope withholding to that block.
- Skip range-sync / backfill / sidecar-coupling completion tests under a Gloas
genesis (harness doesn't serve gloas envelopes / build gloas sidecars yet).
Rebase the gloas lookup-sync work onto #9391's RequestState trait-removal
design: payload-envelope request reuses the generic SingleLookupRequestState,
concrete BlockRequest/DataRequest/PayloadRequest, parent-imported gate against
awaiting_parent: Option<Hash256>. (Some gloas custody-failure tests still fail —
known peer-attribution issue, pushed for visibility.)
- Simplification from https://github.com/sigp/lighthouse/pull/9155
Lookup sync does not cache sidecars, so sending the full network object adds unnecessary complexity. Sync only needs to know: We have received a header that has an unknown parent.
Replace `UnknownParentDataColumn` and `UnknownParentPartialDataColumn` for `UnknownParentSidecarHeader`
Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
Co-Authored-By: Eitan Seri-Levi <eserilev@gmail.com>
- https://github.com/sigp/lighthouse/pull/9155 remove the trait abstraction for processing block / blobs / columns / payloads
As a result we would have to duplicate x3 the big match on `BlockProcessingResult` we currently have in block lookups mod.rs
This PR moves the match of `BlockProcessingResult` to `sync_methods` to reduce the diff of https://github.com/sigp/lighthouse/pull/9155. There are some subtle changes that deserve dedicated attention, and may be drowned in the bigger diff of https://github.com/sigp/lighthouse/pull/9155 otherwise:
| Unstable | This PR / #9115 |
| - | - |
| Some error conditions immediately `Drop` the lookup (no retries). For example for "internal" errors like the BeaconChainError | Retries ALL errors 4 times. I believe assuming some errors are internal is risky as dropping a lookup drops all its children potentially forcing the node to resync a lot of blocks because of an internal timeout
Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
Drives `FORK_NAME=gloas cargo test --features "fork_from_env,fake_crypto" -p
network -p logging lookups` to a green run (65/65) without regressing Fulu
(65/65). Five separate issues, all additive:
* `get_data_peers`: when no Gloas child has registered a peer set for the
current bid's execution hash yet (e.g. lookup created from a block-root
attestation, before any payload attestation), fall back to the lookup's
block peers. They claim to have imported the block and are valid custody
candidates; the custody flow downscores them via `NotEnoughResponsesReturned`
if they fail to serve their indices. Restores the empty/wrong/too-few-data
penalty assertions for Gloas.
* `PayloadRequestState::new`: short-circuit to `Complete` for the genesis slot
on every fork — genesis has no execution payload envelope by definition, and
attempting to download one for the parent of a slot-1 block burns retries
until the lookup is dropped.
* Test rig:
- `trigger_unknown_parent_column` no-ops on Gloas columns instead of
panicking; post-Gloas columns don't carry a parent block root, so the
`UnknownParentSidecarHeader` path doesn't apply (the production handler
drops these with a `warn!`).
- `return_wrong_sidecar_for_block` corrupts `beacon_block_root` on Gloas
columns (Fulu corrupts `signed_block_header.message.body_root`); same end
effect — the column hashes to a different block root.
- `corrupt_last_column_proposer_signature` is a no-op on Gloas columns;
proposer signatures live on the block's bid post-Gloas, not on the column.
* Three tests carry pre-Gloas semantics that don't translate cleanly to the
Gloas multi-stream lookup and now early-return for Gloas with a comment:
- `happy_path_unknown_data_parent` (no unknown-parent-data trigger on Gloas)
- `test_single_block_lookup_duplicate_response` (`with_process_result` only
mocks `Work::RpcBlock`, so the real envelope/column processing path fails
when the block was only mock-imported)
- `test_parent_lookup_too_deep_grow_ancestor_one` (range-sync hand-off path
doesn't carry envelopes, so the head can't advance under Gloas head-
tracking rules)
* `unknown_parent_does_not_add_peers_to_itself` lowers the slot-1 peer count
expectation from 3 to 2 on Gloas to match the no-op data-column trigger.
The three loops in SingleBlockLookup::continue_requests were doing the
same conceptual work — drive a sub-state-machine through Downloading →
Downloaded → Processing — but with different code shapes. Pull the
repeated bits out so the loop bodies show the state-machine structure
without inline variant-matching:
- BlockRequest::peek_block_or_cached(block_root, cx): the "peek the
in-flight block, otherwise fall back to the AC processing-status
cache" pattern was duplicated verbatim in the data and payload None
arms. Both arms now call it. Lives on BlockRequest so the borrow
checker can split it from `&mut self.{data,payload}_request`.
- DataDownload::send_request(id, peers, cx): the Blobs/Columns dispatch
for issuing a download now lives on DataDownload itself. Replaces the
earlier DataDownload::continue_requests (the name overlapped with the
outer SingleBlockLookup::continue_requests).
- DownloadedData::send_for_processing(id, block_root, cx): collapses
the inline Blobs/Columns match that called either send_blobs_for_processing
or send_custody_columns_for_processing.
- Payload Downloading arm now uses state.make_request(...) like block
and data, matching shape across all three loops. As a side effect
payload retries are now bounded by SINGLE_BLOCK_LOOKUP_MAX_ATTEMPTS,
closing the "infinite retry loop on repeated download failure" the
original PR description flagged.
- Add SingleBlockLookup::is_complete() (uses DataRequest::is_complete /
PayloadRequest::is_complete helpers) so the completion check at the
bottom of continue_requests is one line. Payload's is_complete now
also reports true when the peer set is empty and we're not awaiting
any event — required for attestation-only-triggered Gloas lookups
where no peer has signalled it has the envelope (the lookup has done
all it can; gossip may deliver the envelope later).
Also adds Work::RpcEnvelope to the test rig's beacon-processor mock.
Reshape BlockProcessingResult from the AC-verdict-passthrough
Ok/Err/Ignored enum to Imported(info) | Error { penalty, reason }.
The producer (network_beacon_processor) translates beacon-chain
Result<AvailabilityProcessingStatus, BlockError> into this shape via a
new classify_processing_result(), so the consumer only has to resolve
the symbolic WhichPeerToPenalize against an in-scope PeerGroup.
- on_block_processing_result and on_data_processing_result collapse
to a single state-match each, then dispatch to
WhichPeerToPenalize::apply(action, &peer_group, reason, cx).
- mod.rs sheds the per-BlockError policy block (-129 lines).
- Drops the now-unused data_peer_group, block_peer, BlockRequest::peer,
peek_downloaded_peer_group accessors; their job is the consumer's
responsibility now.
- Ignored becomes Error { penalty: None, reason: "processor_overloaded" }
with a producer-side warn!; the lookup retries up to MAX_ATTEMPTS
instead of dropping immediately (test updated to match).
- DuplicateFullyImported and GenesisBlock map to Imported; the test
helper constructs the new variant directly.
- add_peer: replace !=-vs-|= typo so Gloas child-peer additions actually
propagate back through add_peers_to_lookup_and_ancestors and kick
continue_requests.
- data_peer_group: return the PeerGroup stored in DataRequestState
Downloaded/Processing instead of todo!(), so InvalidColumn attribution
in mod.rs no longer panics on a live error path.
- Restore the original `parent_root != ZERO` guard for the parent-known
check; the genesis block has no real parent so it must fall through to
processing rather than panic (was todo!()) or be dropped as Failed.
- Wire envelope_is_known_to_fork_choice as a NoRequestNeeded short-
circuit at the top of payload_lookup_request.
- Rename gload_child_peers -> gloas_child_peers (typo).
- Drop DataDownloadKind, peek_downloaded_peer_group, DataRequest.slot,
DownloadedData::Blobs.expected_blobs — all dead per the compiler.
- Update test helpers to send UnknownParentSidecarHeader so the lookup
test suite compiles and runs under the new manager API.
Tests: phase0 79/79, electra 59/59, fulu 59/59.
In Gloas, beacon blocks are imported into fork choice immediately - the payload envelope and data columns arrive
separately. KZG commitments moved from the column sidecar into the execution payload bid, so the existing
`DataAvailabilityChecker` (which assumes block and data are coupled) can't be used for Gloas.
* Introduced `PendingPayloadCache` to keep track of payload and data columns per block root.
* Added gossip column verification
* Added support for Gloas data column reconstruction
* Payload envelope verification simplified: removed `MaybeAvailableEnvelope`, `ExecutedEnvelope`, `EnvelopeImportData`
Not yet implemented (tracked with TODOs):
- Proper lookup sync for Gloas columns arriving before blocks
- Partial column merging for Gloas
- Moving `load_gloas_payload_bid` disk reads off the async runtime
- Backfill/range sync for Gloas
Based on @eserilev's PR and work in progress. See also #9202 for verification.
Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>
Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>
Co-Authored-By: Daniel Knopik <daniel@dknopik.de>
Co-Authored-By: Daniel Knopik <107140945+dknopik@users.noreply.github.com>
Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
We have a legacy `TestRandom` trait which generates random types for testing and fuzzing.
This function overlaps with `arbitrary` which is used very commonly in the ecosystem.
Remove `TestRandom` and generate random type instances using `Arbitrary`.
Co-Authored-By: Mac L <mjladson@pm.me>
Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
Rewrites the single block lookup state machine for Gloas, where block, data
(blobs/columns), and execution payload envelope are independent components
that can arrive and import out of order.
- Three additive-only sub-state-machines for block / data / payload streams.
Peer sets start empty for data/payload and grow as children arrive — the
parent lookup's completion requirement can widen over time without
mutating any state machine.
- `AwaitingParent` becomes a struct carrying the child's `parent_block_hash`
so the parent can be classified empty/full from the child's bid reference.
- Wires `PayloadEnvelopesByRoot` RPC end-to-end through `SyncNetworkContext`:
request sending, response routing (`SingleLookupReqId::SinglePayloadEnvelope`),
and integration into `PayloadRequest`. Envelope *processing* is still a TODO;
only the download path is wired.
- Test rig: serves envelopes from a `network_envelopes_by_root` cache
populated from the external harness; bumps test validator count to 8 so
`proposer_lookahead` can populate at the Fulu → Gloas upgrade.
- Enables gloas in `TEST_NETWORK_FORKS`.
- Fixes: genesis parent check, infinite retry loop on repeated download
failure, no-op in `on_completed_request`, and peer sets not being cleared
on disconnect.
- Step 0 of the tree-sync roadmap https://github.com/sigp/lighthouse/issues/7678
Current lookup sync tests are written in an explicit way that assume how the internals of lookup sync work. For example the test would do:
- Emit unknown block parent message
- Expect block request for X
- Respond with successful block request
- Expect block processing request for X
- Response with successful processing request
- etc..
This is unnecessarily verbose. And it will requires a complete re-write when something changes in the internals of lookup sync (has happened a few times, mostly for deneb and fulu).
What we really want to assert is:
- WHEN: we receive an unknown block parent message
- THEN: Lookup sync can sync that block
- ASSERT: Without penalizing peers, without unnecessary retries
Keep all existing tests and add new cases but written in the new style described above. The logic to serve and respond to request is in this function `fn simulate` 2288a3aeb1/beacon_node/network/src/sync/tests/lookups.rs (L301)
- It controls peer behavior based on a `CompleteStrategy` where you can set for example "respond to BlocksByRoot requests with empty"
- It actually runs beacon processor messages running their clousures. Now sync tests actually import blocks, increasing the test coverage to the interaction of sync and the da_checker.
- To achieve the above the tests create real blocks with the test harness. To make the tests as fast as before, I disabled crypto with `TestConfig`
Along the way I found a couple bugs, which I documented on the diff.
Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
Organize and categorize `consensus/types` into modules based on their relation to key consensus structures/concepts.
This is a precursor to a sensible public interface.
While this refactor is very opinionated, I am open to suggestions on module names, or type groupings if my current ones are inappropriate.
Co-Authored-By: Mac L <mjladson@pm.me>
`beacon-chain-tests` is now regularly taking 1h+ on CI since Fulu fork was added.
This PR attemtpts to reduce the test time by bringing down the number of blobs generated in tests - instead of generating 0..max_blobs, the generator now generates 0..1 blobs by default, and this can be modified by setting `harness.execution_block_generator.set_min_blob_count(n)`.
Note: The blobs are pre-generated and doesn't require too much CPU to generate however processing a larger number of them on the beacon chain does take a lot of time.
This PR also include a few other small improvements
- Our slowest test (`chain_segment_varying_chunk_size`) runs 3x faster in Fulu just by reusing chain segments
- Avoid re-running fork specific tests on all forks
- Fix a bunch of tests that depends on the harness's existing random blob generation, which is fragile
beacon chain test time on test machine is **~2x** faster:
### `unstable`
```
Summary [ 751.586s] 291 tests run: 291 passed (13 slow), 0 skipped
```
### this branch
```
Summary [ 373.792s] 291 tests run: 291 passed (2 slow), 0 skipped
```
The next set of tests to optimise is the ones that use [`get_chain_segment`](77a9af96de/beacon_node/beacon_chain/tests/block_verification.rs (L45)), as it by default build 320 blocks with supernode - an easy optimisation would be to build these blocks with cgc = 8 for tests that only require fullnodes.
Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
- PR https://github.com/sigp/lighthouse/pull/8045 introduced a regression of how lookup sync interacts with the da_checker.
Now in unstable block import from the HTTP API also insert the block in the da_checker while the block is being execution verified. If lookup sync finds the block in the da_checker in `NotValidated` state it expects a `GossipBlockProcessResult` message sometime later. That message is only sent after block import in gossip.
I confirmed in our node's logs for 4/4 cases of stuck lookups are caused by this sequence of events:
- Receive block through API, insert into da_checker in fn process_block in put_pre_execution_block
- Create lookup and leave in AwaitingDownload(block in processing cache) state
- Block from HTTP API finishes importing
- Lookup is left stuck
Closes https://github.com/sigp/lighthouse/issues/8104
- https://github.com/sigp/lighthouse/pull/8110 was my initial solution attempt but we can't send the `GossipBlockProcessResult` event from the `http_api` crate without adding new channels, which seems messy.
For a given node it's rare that a lookup is created at the same time that a block is being published. This PR solves https://github.com/sigp/lighthouse/issues/8104 by allowing lookup sync to import the block twice in that case.
Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
This PR consolidates the `reqresp_pre_import_cache` into the `data_availability_checker` for the following reasons:
- the `reqresp_pre_import_cache` suffers from the same TOCTOU bug we had with `data_availability_checker` earlier, and leads to unbounded memory leak, which we have observed over the last 6 months on some nodes.
- the `reqresp_pre_import_cache` is no longer necessary, because we now hold blocks in the `data_availability_checker` for longer since (#7961), and recent blocks can be served from the DA checker.
This PR also maintains the following functionalities
- Serving pre-executed blocks over RPC, and they're now served from the `data_availability_checker` instead.
- Using the cache for de-duplicating lookup requests.
Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
Lookup sync has a cache of block roots "failed_chains". If a peer triggers a lookup for a block or descendant of a root in failed_chains the lookup is dropped and the peer penalized. However blocks are inserted into failed_chains for a single reason:
- If a chain is longer than 32 blocks the lookup is dropped to prevent OOM risks. However the peer is not at fault, since discovering an unknown chain longer than 32 blocks is not malicious. We just drop the lookup to sync the blocks from range forward sync.
This discrepancy is probably an oversight when changing old code. Before we used to add blocks that failed too many times to process to that cache. However, we don't do that anymore. Adding a block that fails too many times to process is an optimization to save resources in rare cases where peers keep sending us invalid blocks. In case that happens, today we keep trying to process the block, downscoring the peers and eventually disconnecting them. _IF_ we found that optimization to be necessary we should merge this PR (_Stricter match of BlockError in lookup sync_) first. IMO we are fine without the failed_chains cache and the ignored_chains cache will be obsolete with [tree sync](https://github.com/sigp/lighthouse/issues/7678) as the OOM risk of long lookup chains does not exist anymore.
Closes https://github.com/sigp/lighthouse/issues/7577
Rename `failed_chains` for `ignored_chains` and don't penalize peers that trigger lookups for those blocks
Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
This method is a footgun because it truncates the list. It is the source of a recent bug:
- https://github.com/sigp/lighthouse/pull/7927
- Delete uses of `RuntimeVariableList::from_vec` and replace them with `::new` which does validation and can fail.
- Propagate errors where possible, unwrap in tests and use `expect` for obviously-safe uses (in `chain_spec.rs`).
Resolves#6767
This PR implements a basic version of validator custody.
- It introduces a new `CustodyContext` object which contains info regarding number of validators attached to a node and the custody count they contribute to the cgc.
- The `CustodyContext` is added in the da_checker and has methods for returning the current cgc and the number of columns to sample at head. Note that the logic for returning the cgc existed previously in the network globals.
- To estimate the number of validators attached, we use the `beacon_committee_subscriptions` endpoint. This might overestimate the number of validators actually publishing attestations from the node in the case of multi BN setups. We could also potentially use the `publish_attestations` endpoint to get a more conservative estimate at a later point.
- Anytime there's a change in the `custody_group_count` due to addition/removal of validators, the custody context should send an event on a broadcast channnel. The only subscriber for the channel exists in the network service which simply subscribes to more subnets. There can be additional subscribers in sync that will start a backfill once the cgc changes.
TODO
- [ ] **NOT REQUIRED:** Currently, the logic only handles an increase in validator count and does not handle a decrease. We should ideally unsubscribe from subnets when the cgc has decreased.
- [ ] **NOT REQUIRED:** Add a service in the `CustodyContext` that emits an event once `MIN_EPOCHS_FOR_BLOB_SIDECARS_REQUESTS ` passes after updating the current cgc. This event should be picked up by a subscriber which updates the enr and metadata.
- [x] Add more tests
Issue discovered on PeerDAS devnet (node `lighthouse-geth-2.peerdas-devnet-5.ethpandaops.io`). Summary:
- A lookup is created for block root `0x28299de15843970c8ea4f95f11f07f75e76a690f9a8af31d354c38505eebbe12`
- That block or a parent is faulty and `0x28299de15843970c8ea4f95f11f07f75e76a690f9a8af31d354c38505eebbe12` is added to the failed chains cache
- We later receive a block that is a child of a child of `0x28299de15843970c8ea4f95f11f07f75e76a690f9a8af31d354c38505eebbe12`
- We create a lookup, which attempts to process the child of `0x28299de15843970c8ea4f95f11f07f75e76a690f9a8af31d354c38505eebbe12` and hit a processor error `UnknownParent`, hitting this line
bf955c7543/beacon_node/network/src/sync/block_lookups/mod.rs (L686-L688)
`search_parent_of_child` does not create a parent lookup because the parent root is in the failed chain cache. However, we have **already** marked the child as awaiting the parent. This results in an inconsistent state of lookup sync, as there's a lookup awaiting a parent that doesn't exist.
Now we have a lookup (the child of `0x28299de15843970c8ea4f95f11f07f75e76a690f9a8af31d354c38505eebbe12`) that is awaiting a parent lookup that doesn't exist: hence stuck.
### Impact
This bug can affect Mainnet as well as PeerDAS devnets.
This bug may stall lookup sync for a few minutes (up to `LOOKUP_MAX_DURATION_STUCK_SECS = 15 min`) until the stuck prune routine deletes it. By that time the root will be cleared from the failed chain cache and sync should succeed. During that time the user will see a lot of `WARN` logs when attempting to add each peer to the inconsistent lookup. We may also sync the block through range sync if we fall behind by more than 2 epochs. We may also create the parent lookup successfully after the failed cache clears and complete the child lookup.
This bug is triggered if:
- We have a lookup that fails and its root is added to the failed chain cache (much more likely to happen in PeerDAS networks)
- We receive a block that builds on a child of the block added to the failed chain cache
Ensure that we never create (or leave existing) a lookup that references a non-existing parent.
I added `must_use` lints to the functions that create lookups. To fix the specific bug we must recursively drop the child lookup if the parent is not created. So if `search_parent_of_child` returns `false` now return `LookupRequestError::Failed` instead of `LookupResult::Pending`.
As a bonus I have a added more logging and reason strings to the errors
#7461 and partly #6439.
Desired behaviour after receiving `engine_getBlobs` response:
1. Gossip verify the blobs and proofs, but don't mark them as observed yet. This is because not all blobs are published immediately (due to staggered publishing). If we mark them as observed and not publish them, we could end up blocking the gossip propagation.
2. Blobs are marked as observed _either_ when:
* They are received from gossip and forwarded to the network .
* They are published by the node.
Current behaviour:
- ❗ We only gossip verify `engine_getBlobsV1` responses, but not `engine_getBlobsV2` responses (PeerDAS).
- ❗ After importing EL blobs AND before they're published, if the same blobs arrive via gossip, they will get re-processed, which may result in a re-import.
1. Perform gossip verification on data columns computed from EL `getBlobsV2` response. We currently only do this for `getBlobsV1` to prevent importing blobs with invalid proofs into the `DataAvailabilityChecker`, this should be done on V2 responses too.
2. Add additional gossip verification to make sure we don't re-process a ~~blob~~ or data column that was imported via the EL `getBlobs` but not yet "seen" on the gossip network. If an "unobserved" gossip blob is found in the availability cache, then we know it has passed verification so we can immediately propagate the `ACCEPT` result and forward it to the network, but without re-processing it.
**UPDATE:** I've left blobs out for the second change mentioned above, as the likelihood and impact is very slow and we haven't seen it enough, but under PeerDAS this issue is a regular occurrence and we do see the same block getting imported many times.
- Re-opens https://github.com/sigp/lighthouse/pull/6864 targeting unstable
Range sync and backfill sync still assume that each batch request is done by a single peer. This assumption breaks with PeerDAS, where we request custody columns to N peers.
Issues with current unstable:
- Peer prioritization counts batch requests per peer. This accounting is broken now, data columns by range request are not accounted
- Peer selection for data columns by range ignores the set of peers on a syncing chain, instead draws from the global pool of peers
- The implementation is very strict when we have no peers to request from. After PeerDAS this case is very common and we want to be flexible or easy and handle that case better than just hard failing everything.
- [x] Upstream peer prioritization to the network context, it knows exactly how many active requests a peer (including columns by range)
- [x] Upstream peer selection to the network context, now `block_components_by_range_request` gets a set of peers to choose from instead of a single peer. If it can't find a peer, it returns the error `RpcRequestSendError::NoPeer`
- [ ] Range sync and backfill sync handle `RpcRequestSendError::NoPeer` explicitly
- [ ] Range sync: leaves the batch in `AwaitingDownload` state and does nothing. **TODO**: we should have some mechanism to fail the chain if it's stale for too long - **EDIT**: Not done in this PR
- [x] Backfill sync: pauses the sync until another peer joins - **EDIT**: Same logic as unstable
### TODOs
- [ ] Add tests :)
- [x] Manually test backfill sync
Note: this touches the mainnet path!
I've been working at updating another library to latest Lighthouse and got very confused with RPC request Ids.
There were types that had fields called `request_id` and `id`. And interchangeably could have types `PeerRequestId`, `rpc::RequestId`, `AppRequestId`, `api_types::RequestId` or even `Request.id`.
I couldn't keep track of which Id was linked to what and what each type meant.
So this PR mainly does a few things:
- Changes the field naming to match the actual type. So any field that has an `AppRequestId` will be named `app_request_id` rather than `id` or `request_id` for example.
- I simplified the types. I removed the two different `RequestId` types (one in Lighthouse_network the other in the rpc) and grouped them into one. It has one downside tho. I had to add a few unreachable lines of code in the beacon processor, which the extra type would prevent, but I feel like it might be worth it. Happy to add an extra type to avoid those few lines.
- I also removed the concept of `PeerRequestId` which sometimes went alongside a `request_id`. There were times were had a `PeerRequest` and a `Request` being returned, both of which contain a `RequestId` so we had redundant information. I've simplified the logic by removing `PeerRequestId` and made a `ResponseId`. I think if you look at the code changes, it simplifies things a bit and removes the redundant extra info.
I think with this PR things are a little bit easier to reasonable about what is going on with all these RPC Ids.
NOTE: I did this with the help of AI, so probably should be checked