Implementing gloas lookup sync is currently incompatible with the `GossipBlockProcessResult` mechanism.
Today it's implemented such that if we receive a successful `GossipBlockProcessResult` we directly mark the lookup as Complete and delete it. In Gloas we can't delete a lookup after block import, as we may still have a FULL child awaiting the payload.
IMO this `GossipBlockProcessResult` brings a lot of headache and edge cases that we can just live without. Also the `reset_request` business is nasty and can easily leave the lookup in a bad state.
If we get rid of `GossipBlockProcessResult` we only pay the following performance penalty:
- Lookup is created exactly while the block's payload is being execution validated
- (new degradation) we download the block again
- send the block for processing but the duplicate cache prevents double execution
So in the worst case we spend a few KB of extra download bandwidth. Remember that each block is already downloaded 8 times through gossip in the happy case.
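A minimal sketch of why the re-sent block is not executed twice, assuming a duplicate cache keyed by block root. `SketchDuplicateCache` and its methods are hypothetical names for illustration, not Lighthouse's actual type:

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a duplicate cache: tracks block roots that are
/// currently being processed.
#[derive(Default)]
pub struct SketchDuplicateCache {
    in_flight: HashSet<[u8; 32]>,
}

impl SketchDuplicateCache {
    /// Returns `true` if the caller should process the block, `false` if the
    /// same root is already being processed (e.g. the gossip copy is still in
    /// execution validation).
    pub fn try_start(&mut self, block_root: [u8; 32]) -> bool {
        self.in_flight.insert(block_root)
    }

    /// Marks processing of `block_root` as finished.
    pub fn finish(&mut self, block_root: &[u8; 32]) {
        self.in_flight.remove(block_root);
    }
}
```

Under this assumption the lookup's copy of the block costs only the download: the second `try_start` for the same root returns `false` and no second execution is scheduled.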
Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
Co-Authored-By: Pawan Dhananjay <pawandhananjay@gmail.com>
No. But related to #9009 and #8996
- Change the `ForkContext::next_fork_digest()` to return `[u8; 4]` (returning `[0u8; 4]` for "no next fork").
- Update the initialization path and runtime fork transition path accordingly.
Added tests:
- [x] `test_next_fork_digest` — existing test passes with non-Option return type
- [x] `test_next_fork_digest_returns_zero_when_no_next_fork` — init at last BPO fork returns `[0u8; 4]`
- [x] `test_next_fork_digest_zero_after_runtime_transition_to_last_fork` — simulates `update_current_fork` to last fork, then verifies zero
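A minimal sketch of the non-`Option` return shape, assuming a heavily simplified stand-in for `ForkContext`. `SketchForkContext` and its fields are hypothetical; only the `[0u8; 4]` convention for "no next fork" comes from the change above:

```rust
/// Hypothetical, simplified stand-in for `ForkContext`: fork digests in
/// activation order plus the index of the current fork.
pub struct SketchForkContext {
    digests: Vec<[u8; 4]>,
    current: usize,
}

impl SketchForkContext {
    pub fn new(digests: Vec<[u8; 4]>, current: usize) -> Self {
        assert!(current < digests.len(), "current fork must exist");
        Self { digests, current }
    }

    /// Digest of the next scheduled fork, or `[0u8; 4]` when the current fork
    /// is the last one.
    pub fn next_fork_digest(&self) -> [u8; 4] {
        self.digests
            .get(self.current + 1)
            .copied()
            .unwrap_or([0u8; 4])
    }

    /// Runtime transition to a later fork, identified here by index.
    pub fn update_current_fork(&mut self, new_current: usize) {
        assert!(new_current < self.digests.len(), "fork must exist");
        self.current = new_current;
    }
}
```

Callers no longer need to unwrap an `Option`; they compare against the zero digest if they care about the "no next fork" case.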
Co-Authored-By: alleysira <1367108378@qq.com>
Co-Authored-By: Alleysira <56925051+Alleysira@users.noreply.github.com>
Co-Authored-By: chonghe <44791194+chong-he@users.noreply.github.com>
- block_verification test: ParentUnknown pattern needs `..` (field restored).
- Count gloas leaf-block completions in completed_lookups (were removed silently).
- Retain a parent on payload-download TooManyAttempts while a FULL child awaits its
payload (don't cascade-drop); the payload may still arrive.
- on_external_processing_result: complete the lookup on gossip import (gloas-aware),
fixing the pre-gloas regression flagged by the TODO.
- Complete lookups that become available via the da_checker during continue_requests
(no Imported processing result is emitted): detect in on_lookup_result + the
block-imported branch of on_processing_result.
- Lint: debug_assert!(true) -> false; redundant if-let Some(_) -> is_some().
N/A
Currently, `EnvelopeError` has an `ImportError` variant wrapping a `BlockError`. I feel this is extremely unintuitive because most of the envelope processing functions can simply return an `EnvelopeError` that makes sense in the function's context. It revealed further ugliness when implementing range sync in #9362.
This PR does 2 main things:
1. Removes `ImportError(BlockError)` variant
2. Adds an `EnvelopeError(EnvelopeError)` variant to `BlockError`.
I feel this is more natural: envelope errors can occur while importing a block, whereas envelope processing on its own only ever produces envelope-related errors.
The main blocker to doing this was `PayloadVerificationHandle` returning a `BlockError`. It uses a very small subset of `BlockError`, which I extracted into its own error type that can be converted into both a `BlockError` and an `EnvelopeError`.
This lets most of the pure envelope processing functions return just `EnvelopeError`, while we convert to a `BlockError` only in import paths where we need to return a consolidated `BlockError`.
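The resulting error layering can be sketched as follows. This is an illustration under assumptions: the name `PayloadVerificationError` and every variant other than `BlockError::EnvelopeError` are hypothetical and chosen only to show the `From` conversions.

```rust
/// Hypothetical name for the small error type extracted from `BlockError`.
#[derive(Debug, PartialEq)]
pub enum PayloadVerificationError {
    ExecutionInvalid,
}

/// Envelope-only errors. It no longer wraps a `BlockError`.
#[derive(Debug, PartialEq)]
pub enum EnvelopeError {
    BadSignature,
    PayloadVerification(PayloadVerificationError),
}

/// Consolidated error used on import paths.
#[derive(Debug, PartialEq)]
pub enum BlockError {
    ParentUnknown,
    PayloadVerification(PayloadVerificationError),
    EnvelopeError(EnvelopeError),
}

impl From<PayloadVerificationError> for EnvelopeError {
    fn from(e: PayloadVerificationError) -> Self {
        EnvelopeError::PayloadVerification(e)
    }
}

impl From<PayloadVerificationError> for BlockError {
    fn from(e: PayloadVerificationError) -> Self {
        BlockError::PayloadVerification(e)
    }
}

impl From<EnvelopeError> for BlockError {
    fn from(e: EnvelopeError) -> Self {
        BlockError::EnvelopeError(e)
    }
}

/// Pure envelope processing: returns only `EnvelopeError`.
pub fn verify_envelope(signature_ok: bool, payload_ok: bool) -> Result<(), EnvelopeError> {
    if !signature_ok {
        return Err(EnvelopeError::BadSignature);
    }
    if !payload_ok {
        return Err(PayloadVerificationError::ExecutionInvalid.into());
    }
    Ok(())
}

/// Import path: `?` converts the `EnvelopeError` into a `BlockError`.
pub fn import_block_with_envelope(
    parent_known: bool,
    signature_ok: bool,
    payload_ok: bool,
) -> Result<(), BlockError> {
    if !parent_known {
        return Err(BlockError::ParentUnknown);
    }
    verify_envelope(signature_ok, payload_ok)?;
    Ok(())
}
```

The conversion happens in exactly one place per import path (the `?`), so envelope-only callers never have to match on block-level variants.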
Co-Authored-By: Pawan Dhananjay <pawandhananjay@gmail.com>
Harness/tests (foundation):
- make_gloas_block_with_status: produce a gloas block with explicit parent
payload status (builds FULL vs EMPTY children); returns its data columns.
- TestRig::build_full_empty_fork: G(full) -> A(full) -> B(FULL child), A -> C(EMPTY).
- SimulateConfig::return_no_envelope_for_block: withhold a block's payload envelope.
- Tests: gloas_build_full_empty_fork_shape (shape), gloas_full_empty_children_
retain_parent_for_payload (happy path), gloas_empty_child_continues_while_
parent_payload_withheld (red: C must complete, B+A retained while payload withheld).
Option B sketch (untested, mod.rs) -- to be implemented properly:
- continue_child_lookups on a SingleBlock Imported result (children re-evaluate
on parent block import, before its payload).
- retain a failed lookup while another lookup awaits it (is_awaited).
- PeerType::PreGloas/PostGloas -> Block/GloasChild (names describe how a peer
relates to the block, not the fork).
- Add PeerType::new(parent_block_hash) and use it; search_parent_of_child now
takes peer_type: &PeerType instead of the raw parent_block_hash.
- request_batches_should_not_loop_infinitely: drop the bogus gloas skip and use
8 validators (4 was too few for a Gloas genesis -> InvalidIndicesCount).
Remove SingleBlockLookup::awaiting_parent_bid_hash (duplicated awaiting_parent
state) and derive the bid parent_block_hash from the lookup's own downloaded
block. This removes the parent_block_hash field from BlockError::ParentUnknown /
BlockProcessingResult::ParentUnknown, re-aligning them with unstable.
Adopt #9382's canonical ParentImportStatus / get_parent_import_status (drops the
duplicate is_parent_imported_status from this branch), keeping ParentUnknown's
parent_block_hash field which the lookup-sync peer donation depends on.
- Gate payload-envelope processing on block_request.state.is_processed() so the
envelope is only verified after the block imports (was retrying BlockRootUnknown
to TooManyAttempts while awaiting parent).
- Penalize attributable peers withholding columns post-Gloas (drop !gloas_enabled
custody carve-out).
- Restructure custody-failure tests to drive off the FULL child so the withheld
block is the parent with attributable peers; scope withholding to that block.
- Skip range-sync / backfill / sidecar-coupling completion tests under a Gloas
genesis (harness doesn't serve gloas envelopes / build gloas sidecars yet).
Rebase the gloas lookup-sync work onto #9391's RequestState trait-removal
design: payload-envelope request reuses the generic SingleLookupRequestState,
concrete BlockRequest/DataRequest/PayloadRequest, parent-imported gate against
awaiting_parent: Option<Hash256>. (Some gloas custody-failure tests still fail —
known peer-attribution issue, pushed for visibility.)
When debugging ePBS with columns, we noticed that columns arriving before their block don't pass gossip verification checks and are dropped. This PR ensures that columns arriving before the block are sent to the reprocess queue. Once their block arrives, they are reprocessed.
This isn't an issue pre-gloas because we don't make block root checks for fulu data columns. This allows us to gossip verify the column and send it to the DA cache before the block arrives.
I think we also need to handle this edge case for partial data columns. There's an existing TODO for that already.
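The queue-then-reprocess flow can be sketched as below. All names (`SketchReprocessQueue`, `SketchColumn`, the method names) are hypothetical, and the sketch leaves out any bounds or expiry that a production queue would presumably need:

```rust
use std::collections::HashMap;

pub type BlockRoot = [u8; 32];

/// Hypothetical minimal column: just the block root it commits to and its index.
#[derive(Debug, Clone, PartialEq)]
pub struct SketchColumn {
    pub block_root: BlockRoot,
    pub index: u64,
}

/// Columns parked until the block they reference has been seen.
#[derive(Default)]
pub struct SketchReprocessQueue {
    awaiting_block: HashMap<BlockRoot, Vec<SketchColumn>>,
}

impl SketchReprocessQueue {
    /// Returns the column if it can be gossip-verified now. Otherwise the
    /// column is queued under its block root and `None` is returned.
    pub fn on_column(&mut self, column: SketchColumn, block_known: bool) -> Option<SketchColumn> {
        if block_known {
            return Some(column);
        }
        self.awaiting_block
            .entry(column.block_root)
            .or_default()
            .push(column);
        None
    }

    /// Called when a block arrives: releases every column that was waiting on it.
    pub fn on_block_seen(&mut self, block_root: &BlockRoot) -> Vec<SketchColumn> {
        self.awaiting_block.remove(block_root).unwrap_or_default()
    }
}
```

The released columns go back through the normal gossip verification path, which can now perform its block root checks.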
Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>
- Simplification from https://github.com/sigp/lighthouse/pull/9155
Lookup sync does not cache sidecars, so sending the full network object adds unnecessary complexity. Sync only needs to know: We have received a header that has an unknown parent.
Replace `UnknownParentDataColumn` and `UnknownParentPartialDataColumn` with `UnknownParentSidecarHeader`.
Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
Co-Authored-By: Eitan Seri-Levi <eserilev@gmail.com>
- https://github.com/sigp/lighthouse/pull/9155 remove the trait abstraction for processing block / blobs / columns / payloads
As a result, we would have to duplicate the big match on `BlockProcessingResult` that we currently have in block lookups `mod.rs` three times.
This PR moves the match of `BlockProcessingResult` to `sync_methods` to reduce the diff of https://github.com/sigp/lighthouse/pull/9155. There are some subtle changes that deserve dedicated attention, and may be drowned in the bigger diff of https://github.com/sigp/lighthouse/pull/9155 otherwise:
| Unstable | This PR / #9115 |
| - | - |
| Some error conditions immediately `Drop` the lookup (no retries), for example "internal" errors like `BeaconChainError`. | Retries ALL errors 4 times. I believe assuming some errors are internal is risky, as dropping a lookup drops all its children, potentially forcing the node to resync a lot of blocks because of an internal timeout. |
Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
Reconciles unstable's #9383 (Deprecate blob lookup sync) with this PR's
rewritten lookup architecture by removing blob lookup from the new arch:
Deneb/Electra block lookups complete on the block alone (the merged
da_checker makes them available without blobs), and DataDownload::Blobs,
blob_lookup_request, SyncRequestId::SingleBlob, BlockProcessType::SingleBlob,
the process_rpc_blobs lookup cluster, and blob lookup tests are removed.
Range-sync blobs and blob serving are kept.
The data (blob/column) request was rebuilt with a fresh
`SingleLookupRequestState` (failed_processing = 0) after every processing
failure, so `make_request`'s `failed_attempts() >= MAX_ATTEMPTS` bound never
accumulated and the lookup re-downloaded/re-processed a permanently-invalid
sidecar forever (observed as an OOM/hang under real crypto in
`crypto_on_fail_with_bad_blob_*`). Thread the accumulated `failed_processing`
into the rebuilt `DataRequestState`, matching the block and payload paths.
Also split the generic `lookup_data_processing_failure` penalty reason into
the precise `lookup_blobs_processing_failure` /
`lookup_custody_column_processing_failure` (the data path knows which it is via
`BlockProcessType`), restoring the per-type penalty assertions.
Verified under the CI command (real crypto):
FORK_NAME=electra ... crypto_on_fail_with_bad_blob_* -> pass
FORK_NAME=fulu ... crypto_on_fail_with_bad_column_* -> pass
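A sketch of the retry-bound bug and its fix, under stated assumptions: `SketchRequestState` is a hypothetical reduction of `SingleLookupRequestState` to a single counter, and `MAX_ATTEMPTS = 4` is an assumed value for illustration.

```rust
/// Assumed value for illustration only.
pub const MAX_ATTEMPTS: u8 = 4;

/// Hypothetical reduction of the request state to the one counter that matters here.
#[derive(Debug, Clone, Copy, Default)]
pub struct SketchRequestState {
    failed_processing: u8,
}

impl SketchRequestState {
    pub fn with_failed_processing(failed_processing: u8) -> Self {
        Self { failed_processing }
    }

    pub fn failed_attempts(&self) -> u8 {
        self.failed_processing
    }

    /// Mirrors the `failed_attempts() >= MAX_ATTEMPTS` bound.
    pub fn can_make_request(&self) -> bool {
        self.failed_attempts() < MAX_ATTEMPTS
    }
}

/// Simulates a permanently-invalid sidecar: every processing attempt fails and
/// the state is rebuilt afterwards. Returns how many attempts were made before
/// the bound stopped the lookup, or `None` if it was still retrying after 100.
pub fn attempts_before_giving_up(carry_counter: bool) -> Option<u32> {
    let mut state = SketchRequestState::default();
    for attempts_made in 0..100u32 {
        if !state.can_make_request() {
            return Some(attempts_made);
        }
        let failed = state.failed_attempts() + 1;
        state = if carry_counter {
            // Fixed behaviour: thread the accumulated count into the rebuilt state.
            SketchRequestState::with_failed_processing(failed)
        } else {
            // Buggy behaviour: a fresh state resets the counter to 0.
            SketchRequestState::default()
        };
    }
    None
}
```

With the counter carried over, the loop stops after `MAX_ATTEMPTS` failures; with a fresh state each time, the bound is never reached.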
On Glamsterdam devnets we started seeing Lighthouse nodes unable to start with errors like:
> May 26 04:34:01.582 CRIT Failed to start beacon node reason: "Unable to load fork choice from disk: ForkChoiceError(ProtoArrayStringError(\"find_head failed: InvalidBestNode(InvalidBestNodeInfo { current_slot: Slot(23550), start_root: 0x2c70b1641c29ec46360c99f9a8512f077862cbbc603e16f4a423007d210b0c5f, justified_checkpoint: Checkpoint { epoch: Epoch(712), root: 0x2c70b1641c29ec46360c99f9a8512f077862cbbc603e16f4a423007d210b0c5f }, finalized_checkpoint: Checkpoint { epoch: Epoch(710), root: 0xede5e0b09b51bdb5445ade3398e685bd193b845e0b0ffb827f0c3fec8277ea51 }, head_root: 0x2c70b1641c29ec46360c99f9a8512f077862cbbc603e16f4a423007d210b0c5f, head_justified_checkpoint: Checkpoint { epoch: Epoch(710), root: 0xede5e0b09b51bdb5445ade3398e685bd193b845e0b0ffb827f0c3fec8277ea51 }, head_finalized_checkpoint: Checkpoint { epoch: Epoch(709), root: 0xbb243eff616ff362c52b83113e7c536d0a68cb9ca3d6a1cb1055e732219d9736 } })\"))"
This error was the result of an overly-strict sanity check, based on assumptions that are not true under extreme network conditions.
Completely remove the `InvalidBestNode` failure path: it is not compliant with the spec, and is actively harmful when triggered (it prevents Lighthouse from starting at all). The error was reachable in any situation where all leaf nodes of fork choice were ineligible to be the head. The payload invalidation tests show some examples of cases where this would happen, and the [newly-added regression test](9a5df1d982) shows a contrived case where it can happen on a Gloas network without _any_ slashings or invalid blocks. There are probably many more cases where it can happen.
We do not lose anything by removing it. The spec's implementation of `get_head` _always_ returns something (unless it crashes), and in these cases it is correct to return the starting node of the traversal: the justified checkpoint block. This is what we now do, and what the new test verifies.
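The fallback can be illustrated with a toy tree walk. This is a simplified sketch in the style of the spec's filtered `get_head`, not Lighthouse's proto-array implementation; `SketchNode`, `viable_for_head`, and the `(weight, root)` tie-break are assumptions made for the example.

```rust
/// Hypothetical fork choice node. `viable_for_head` stands in for whatever
/// eligibility check applies to leaves.
#[derive(Debug, Clone)]
pub struct SketchNode {
    pub root: u64,
    pub parent: Option<usize>,
    pub weight: u64,
    pub viable_for_head: bool,
}

fn children_of(nodes: &[SketchNode], index: usize) -> Vec<usize> {
    nodes
        .iter()
        .enumerate()
        .filter(|(_, node)| node.parent == Some(index))
        .map(|(child, _)| child)
        .collect()
}

/// A branch is kept only if it leads to at least one viable leaf.
fn leads_to_viable_leaf(nodes: &[SketchNode], index: usize) -> bool {
    let children = children_of(nodes, index);
    if children.is_empty() {
        nodes[index].viable_for_head
    } else {
        children.iter().any(|&child| leads_to_viable_leaf(nodes, child))
    }
}

/// Greedy descent from the justified node. When no child leads to a viable
/// leaf, the current node is returned, so the function always yields a head.
/// In the worst case that head is the justified block itself.
pub fn get_head(nodes: &[SketchNode], justified_index: usize) -> u64 {
    let mut head = justified_index;
    loop {
        let best_child = children_of(nodes, head)
            .into_iter()
            .filter(|&child| leads_to_viable_leaf(nodes, child))
            .max_by_key(|&child| (nodes[child].weight, nodes[child].root));
        match best_child {
            Some(child) => head = child,
            None => return nodes[head].root,
        }
    }
}
```

There is no error path in this shape: an all-ineligible tree degrades to the justified block rather than failing startup.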
I've also added some facilities to the harness for injecting attestations with fixed `payload_present` fields. @hopinheimer found himself needing something similar when messing with reorg tests, so I think these are probably useful. It might be possible to do without them by juggling the payload reveal timing in just the right way, but I think this approach is just way simpler.
Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
Drives `FORK_NAME=gloas cargo test --features "fork_from_env,fake_crypto" -p
network -p logging lookups` to a green run (65/65) without regressing Fulu
(65/65). Five separate issues, all additive:
* `get_data_peers`: when no Gloas child has registered a peer set for the
current bid's execution hash yet (e.g. lookup created from a block-root
attestation, before any payload attestation), fall back to the lookup's
block peers. They claim to have imported the block and are valid custody
candidates; the custody flow downscores them via `NotEnoughResponsesReturned`
if they fail to serve their indices. Restores the empty/wrong/too-few-data
penalty assertions for Gloas.
* `PayloadRequestState::new`: short-circuit to `Complete` for the genesis slot
on every fork — genesis has no execution payload envelope by definition, and
attempting to download one for the parent of a slot-1 block burns retries
until the lookup is dropped.
* Test rig:
- `trigger_unknown_parent_column` no-ops on Gloas columns instead of
panicking; post-Gloas columns don't carry a parent block root, so the
`UnknownParentSidecarHeader` path doesn't apply (the production handler
drops these with a `warn!`).
- `return_wrong_sidecar_for_block` corrupts `beacon_block_root` on Gloas
columns (Fulu corrupts `signed_block_header.message.body_root`); same end
effect — the column hashes to a different block root.
- `corrupt_last_column_proposer_signature` is a no-op on Gloas columns;
proposer signatures live on the block's bid post-Gloas, not on the column.
* Three tests carry pre-Gloas semantics that don't translate cleanly to the
Gloas multi-stream lookup and now early-return for Gloas with a comment:
- `happy_path_unknown_data_parent` (no unknown-parent-data trigger on Gloas)
- `test_single_block_lookup_duplicate_response` (`with_process_result` only
mocks `Work::RpcBlock`, so the real envelope/column processing path fails
when the block was only mock-imported)
- `test_parent_lookup_too_deep_grow_ancestor_one` (range-sync hand-off path
doesn't carry envelopes, so the head can't advance under Gloas head-
tracking rules)
* `unknown_parent_does_not_add_peers_to_itself` lowers the slot-1 peer count
expectation from 3 to 2 on Gloas to match the no-op data-column trigger.
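The `get_data_peers` fallback from the first bullet can be sketched as follows. The signature, the `u32` peer id, and the map layout are assumptions for illustration; only the fallback rule itself comes from the description above.

```rust
use std::collections::{HashMap, HashSet};

pub type PeerId = u32;
pub type ExecutionHash = [u8; 32];

/// Returns the peers to request data from for the current bid.
/// If no Gloas child has registered a peer set for the bid's execution hash,
/// fall back to the peers that claim to have imported the block.
pub fn get_data_peers(
    child_peers_by_hash: &HashMap<ExecutionHash, HashSet<PeerId>>,
    bid_execution_hash: &ExecutionHash,
    block_peers: &HashSet<PeerId>,
) -> HashSet<PeerId> {
    match child_peers_by_hash.get(bid_execution_hash) {
        Some(peers) => peers.clone(),
        None => block_peers.clone(),
    }
}
```

The fallback peers are only candidates: as noted above, the custody flow still downscores them if they fail to serve their indices.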
Addresses #9232 partially. This PR covers two topics only.
* #9232
Wires up networking test vectors for `gossip_proposer_slashing` and `gossip_attester_slashing` topics.
The tests also revealed minor spec non-compliance where invalid slashings were ignored rather than rejected.
- Refactor `process_gossip_proposer_slashing` and `process_gossip_attester_slashing` to return `MessageAcceptance`, so it can be verified in the tests
- Add `GossipValidation` test case, handler, and test entries
- Spec compliance fix: distinguish between internal errors and validation errors. Return `Reject` when the slashing is invalid, and only penalise peers for invalid messages
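A sketch of the classification, using a local stand-in for `MessageAcceptance`. The error enum, `GossipOutcome`, and the choice to map internal errors to `Ignore` are assumptions for illustration:

```rust
/// Local stand-in for the gossipsub acceptance result.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MessageAcceptance {
    Accept,
    Reject,
    Ignore,
}

/// Hypothetical error split: the message itself is invalid, or something
/// failed on our side while checking it.
#[derive(Debug)]
pub enum SlashingVerificationError {
    Invalid(&'static str),
    Internal(&'static str),
}

#[derive(Debug, PartialEq, Eq)]
pub struct GossipOutcome {
    pub acceptance: MessageAcceptance,
    pub penalise_peer: bool,
}

/// Maps a verification result to a gossip outcome. Only invalid messages are
/// rejected and penalised. Internal errors are assumed to be ignored without
/// blaming the sender.
pub fn classify_slashing(result: Result<(), SlashingVerificationError>) -> GossipOutcome {
    match result {
        Ok(()) => GossipOutcome {
            acceptance: MessageAcceptance::Accept,
            penalise_peer: false,
        },
        Err(SlashingVerificationError::Invalid(_)) => GossipOutcome {
            acceptance: MessageAcceptance::Reject,
            penalise_peer: true,
        },
        Err(SlashingVerificationError::Internal(_)) => GossipOutcome {
            acceptance: MessageAcceptance::Ignore,
            penalise_peer: false,
        },
    }
}
```

Returning the outcome as a value, rather than acting on it inside the handler, is what lets the test vectors assert on it directly.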
Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
See related issue: https://github.com/ethpandaops/dora/pull/713
When LH emits a `head` event, the block isn't written to disk yet. Some upstream consumers may expect the block to be queryable via the beacon API as soon as they receive a `head` event. This PR falls back to fetching the block from the early attester cache if it wasn't found in the store. This should ensure that a block is always queryable immediately after a `head` event is emitted.
Additionally, I noticed that when serving columns we always default to using the store. We already have `get_data_columns_checking_all_caches`, which tries the DA cache, then the store, and finally the early attester cache.
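The block lookup order can be sketched as below. The types are hypothetical stand-ins, and modelling the early attester cache as a single optional `(root, block)` pair is an assumption made for the example:

```rust
use std::collections::HashMap;

pub type BlockRoot = [u8; 32];

/// Hypothetical minimal block.
#[derive(Debug, Clone, PartialEq)]
pub struct SketchBlock {
    pub slot: u64,
}

/// Looks in the store first, then falls back to the early attester cache,
/// which may hold a block that has been imported but not yet written to disk.
pub fn get_block(
    store: &HashMap<BlockRoot, SketchBlock>,
    early_attester_cache: Option<&(BlockRoot, SketchBlock)>,
    block_root: &BlockRoot,
) -> Option<SketchBlock> {
    if let Some(block) = store.get(block_root) {
        return Some(block.clone());
    }
    match early_attester_cache {
        Some((cached_root, block)) if cached_root == block_root => Some(block.clone()),
        _ => None,
    }
}
```

A consumer that queries the block right after a `head` event hits the second branch until the write to disk completes.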
Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>
Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
Breakout from:
- https://github.com/sigp/lighthouse/pull/9295
We currently do not handle the verification of payload attestations on non-canonical side chains; we always attempt to use the head. The included regression test demonstrates this, and there is _also_ a fork choice compliance test in #9295 that triggers it.
This PR is a bit opinionated, but I'll explain my judgements:
- We need a way to get the PTC for an arbitrary slot from an arbitrary state. This involves potential state advances, database lookups, etc. There is some fiddly logic required to check that states are in range/etc.
- We _already have_ a cache with the exact same lifecycle as the PTCs, namely the attester shuffling cache. Therefore, we can de-duplicate a lot of the complexity by storing the PTCs for a given epoch (and decision block) in this cache.
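A sketch of the caching idea, assuming a cache keyed by epoch and decision block. `SketchShufflingId`, `EpochPtcs`, and `SketchShufflingCache` are hypothetical names and do not reflect the real attester shuffling cache API:

```rust
use std::collections::HashMap;

pub type Epoch = u64;
pub type Hash256 = [u8; 32];
pub type ValidatorIndex = u64;

/// Cache key: the epoch plus the decision block that fixes the shuffling.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct SketchShufflingId {
    pub epoch: Epoch,
    pub decision_block: Hash256,
}

/// One PTC per slot of the epoch.
#[derive(Debug, Clone, PartialEq)]
pub struct EpochPtcs {
    pub by_slot: Vec<Vec<ValidatorIndex>>,
}

#[derive(Default)]
pub struct SketchShufflingCache {
    entries: HashMap<SketchShufflingId, EpochPtcs>,
}

impl SketchShufflingCache {
    /// Returns the cached PTCs for `id`, running `compute` (which may involve
    /// state advances or database loads in practice) only on a miss.
    pub fn get_or_insert_with<F>(&mut self, id: SketchShufflingId, compute: F) -> &EpochPtcs
    where
        F: FnOnce() -> EpochPtcs,
    {
        self.entries.entry(id).or_insert_with(compute)
    }
}
```

Because the key includes the decision block, a side chain with a different decision block gets its own entry instead of silently reusing the head's PTCs.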
The other opinionated change is in the tests. The previous tests were set up kind of nicely to avoid instantiating a `BeaconChainHarness`. However, they were not using mocking, which made testing the non-canonical chain case kind of infeasible. To remedy this, I've changed them to just use a beacon chain harness and create two chains using its relatively easy-to-use methods for doing this. The running time of the tests goes from something like 2.6s for 8 tests to 3.3s for 9 tests, which is only an increase of about 0.04s/test. Negligible. Another plus to using the `BeaconChainHarness` is that it avoids a bunch of the cruft to create synthetic non-mocked beacon chain bits.
At the same time, I've made some attempt to improve modularity (and fit with the `GossipVerificationContext`) by pulling out the guts of `with_committee_cache` into a new function (`with_cached_shuffling`) that clearly shows its dependency surface.
Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>