Commit Graph

20 Commits

Author SHA1 Message Date
Lion - dapplion
d457ceeaaf Don't create child lookup if parent is faulty (#7118)
Issue discovered on PeerDAS devnet (node `lighthouse-geth-2.peerdas-devnet-5.ethpandaops.io`). Summary:

- A lookup is created for block root `0x28299de15843970c8ea4f95f11f07f75e76a690f9a8af31d354c38505eebbe12`
- That block or a parent is faulty and `0x28299de15843970c8ea4f95f11f07f75e76a690f9a8af31d354c38505eebbe12` is added to the failed chains cache
- We later receive a block that is a child of a child of `0x28299de15843970c8ea4f95f11f07f75e76a690f9a8af31d354c38505eebbe12`
- We create a lookup, which attempts to process the child of `0x28299de15843970c8ea4f95f11f07f75e76a690f9a8af31d354c38505eebbe12` and hit a processor error `UnknownParent`, hitting this line

bf955c7543/beacon_node/network/src/sync/block_lookups/mod.rs (L686-L688)

`search_parent_of_child` does not create a parent lookup because the parent root is in the failed chain cache. However, we have **already** marked the child as awaiting the parent. This results in an inconsistent state of lookup sync, as there's a lookup awaiting a parent that doesn't exist.

Now we have a lookup (the child of `0x28299de15843970c8ea4f95f11f07f75e76a690f9a8af31d354c38505eebbe12`) that is awaiting a parent lookup that doesn't exist: hence stuck.

### Impact

This bug can affect Mainnet as well as PeerDAS devnets.

This bug may stall lookup sync for a few minutes (up to `LOOKUP_MAX_DURATION_STUCK_SECS = 15 min`) until the stuck prune routine deletes it. By that time the root will be cleared from the failed chain cache and sync should succeed. During that time the user will see a lot of `WARN` logs when attempting to add each peer to the inconsistent lookup. We may also sync the block through range sync if we fall behind by more than 2 epochs. We may also create the parent lookup successfully after the failed cache clears and complete the child lookup.

This bug is triggered if:
- We have a lookup that fails and its root is added to the failed chain cache (much more likely to happen in PeerDAS networks)
- We receive a block that builds on a child of the block added to the failed chain cache


  Ensure that we never create (or leave existing) a lookup that references a non-existing parent.

I added `must_use` lints to the functions that create lookups. To fix the specific bug we must recursively drop the child lookup if the parent is not created. So if `search_parent_of_child` returns `false` now return `LookupRequestError::Failed` instead of `LookupResult::Pending`.

As a bonus I have a added more logging and reason strings to the errors
2025-06-05 08:53:43 +00:00
Jimmy Chen
e6ef644db4 Verify getBlobsV2 response and avoid reprocessing imported data columns (#7493)
#7461 and partly #6439.

Desired behaviour after receiving `engine_getBlobs` response:

1. Gossip verify the blobs and proofs, but don't mark them as observed yet. This is because not all blobs are published immediately (due to staggered publishing). If we mark them as observed and not publish them, we could end up blocking the gossip propagation.
2. Blobs are marked as observed _either_ when:
* They are received from gossip and forwarded to the network .
* They are published by the node.

Current behaviour:
-  We only gossip verify `engine_getBlobsV1` responses, but not `engine_getBlobsV2` responses (PeerDAS).
-  After importing EL blobs AND before they're published, if the same blobs arrive via gossip, they will get re-processed, which may result in a re-import.


  1. Perform gossip verification on data columns computed from EL `getBlobsV2` response. We currently only do this for `getBlobsV1` to prevent importing blobs with invalid proofs into the `DataAvailabilityChecker`, this should be done on V2 responses too.
2. Add additional gossip verification to make sure we don't re-process a ~~blob~~ or data column that was imported via the EL `getBlobs` but not yet "seen" on the gossip network. If an "unobserved" gossip blob is found in the availability cache, then we know it has passed verification so we can immediately propagate the `ACCEPT` result and forward it to the network, but without re-processing it.

**UPDATE:** I've left blobs out for the second change mentioned above, as the likelihood and impact is very slow and we haven't seen it enough, but under PeerDAS this issue is a regular occurrence and we do see the same block getting imported many times.
2025-05-26 19:55:58 +00:00
Akihito Nakano
537fc5bde8 Revive network-test logs files in CI (#7459)
https://github.com/sigp/lighthouse/issues/7187


  This PR adds a writer that implements `tracing_subscriber::fmt::MakeWriter`, which writes logs to separate files for each test.
2025-05-22 02:51:22 +00:00
SunnysidedJ
593390162f peerdas-devnet-7: update DataColumnSidecarsByRoot request to use DataColumnsByRootIdentifier (#7399)
Update DataColumnSidecarsByRoot request to use DataColumnsByRootIdentifier #7377


  As described in https://github.com/ethereum/consensus-specs/pull/4284
2025-05-12 00:20:55 +00:00
Lion - dapplion
beb0ce68bd Make range sync peer loadbalancing PeerDAS-friendly (#6922)
- Re-opens https://github.com/sigp/lighthouse/pull/6864 targeting unstable

Range sync and backfill sync still assume that each batch request is done by a single peer. This assumption breaks with PeerDAS, where we request custody columns to N peers.

Issues with current unstable:

- Peer prioritization counts batch requests per peer. This accounting is broken now, data columns by range request are not accounted
- Peer selection for data columns by range ignores the set of peers on a syncing chain, instead draws from the global pool of peers
- The implementation is very strict when we have no peers to request from. After PeerDAS this case is very common and we want to be flexible or easy and handle that case better than just hard failing everything.


  - [x] Upstream peer prioritization to the network context, it knows exactly how many active requests a peer (including columns by range)
- [x] Upstream peer selection to the network context, now `block_components_by_range_request` gets a set of peers to choose from instead of a single peer. If it can't find a peer, it returns the error `RpcRequestSendError::NoPeer`
- [ ] Range sync and backfill sync handle `RpcRequestSendError::NoPeer` explicitly
- [ ] Range sync: leaves the batch in `AwaitingDownload` state and does nothing. **TODO**: we should have some mechanism to fail the chain if it's stale for too long - **EDIT**: Not done in this PR
- [x] Backfill sync: pauses the sync until another peer joins - **EDIT**: Same logic as unstable

### TODOs

- [ ] Add tests :)
- [x] Manually test backfill sync

Note: this touches the mainnet path!
2025-05-07 02:03:07 +00:00
Mac L
0e6da0fcaf Merge branch 'release-v7.0.0' into v7-backmerge 2025-04-04 13:32:58 +11:00
Mac L
82d1674455 Rust 1.86.0 lints (#7254)
Implement lints for the new Rust compiler version 1.86.0.
2025-04-04 02:30:22 +00:00
Age Manning
d6cd049a45 RPC RequestId Cleanup (#7238)
I've been working at updating another library to latest Lighthouse and got very confused with RPC request Ids.

There were types that had fields called `request_id` and `id`. And interchangeably could have types `PeerRequestId`, `rpc::RequestId`, `AppRequestId`, `api_types::RequestId` or even `Request.id`.

I couldn't keep track of which Id was linked to what and what each type meant.

So this PR mainly does a few things:
- Changes the field naming to match the actual type. So any field that has an  `AppRequestId` will be named `app_request_id` rather than `id` or `request_id` for example.
- I simplified the types. I removed the two different `RequestId` types (one in Lighthouse_network the other in the rpc) and grouped them into one. It has one downside tho. I had to add a few unreachable lines of code in the beacon processor, which the extra type would prevent, but I feel like it might be worth it. Happy to add an extra type to avoid those few lines.
- I also removed the concept of `PeerRequestId` which sometimes went alongside a `request_id`. There were times were had a `PeerRequest` and a `Request` being returned, both of which contain a `RequestId` so we had redundant information. I've simplified the logic by removing `PeerRequestId` and made a `ResponseId`. I think if you look at the code changes, it simplifies things a bit and removes the redundant extra info.

I think with this PR things are a little bit easier to reasonable about what is going on with all these RPC Ids.

NOTE: I did this with the help of AI, so probably should be checked
2025-04-03 10:10:15 +00:00
Lion - dapplion
6f31d44343 Remove CGC from data_availability checker (#7033)
- Part of https://github.com/sigp/lighthouse/issues/6767

Validator custody makes the CGC and set of sampling columns dynamic. Right now this information is stored twice:
- in the data availability checker
- in the network globals

If that state becomes dynamic we must make sure it is in sync updating it twice, or guarding it behind a mutex. However, I noted that we don't really have to keep the CGC inside the data availability checker. All consumers can actually read it from the network globals, and we can update `make_available` to read the expected count of data columns from the block.
2025-03-26 05:19:51 +00:00
ThreeHrSleep
d60c24ef1c Integrate tracing (#6339)
Tracing Integration
- [reference](5bbf1859e9/projects/project-ideas.md (L297))


  - [x] replace slog & log with tracing throughout the codebase
- [x] implement custom crit log
- [x] make relevant changes in the formatter
- [x] replace sloggers
- [x] re-write SSE logging components

cc: @macladson @eserilev
2025-03-12 22:31:05 +00:00
Lion - dapplion
3992d6ba74 Fix misc PeerDAS todos (#6862)
Address misc PeerDAS TODOs that are not too big for a dedicated PR


  I'll justify each TODO on an inlined comment
2025-02-11 06:07:13 +00:00
Lion - dapplion
d5a03c9d86 Add more range sync tests (#6872)
Currently we have very poor coverage of range sync with unit tests. With the event driven test framework we could cover much more ground and be confident when modifying the code.


  Add two basic cases:
- Happy path, complete a finalized sync for 2 epochs
- Post-PeerDAS case where we start without enough custody peers and later we find enough

⚠️  If you have ideas for more test cases, please let me know! I'll write them
2025-02-10 07:55:22 +00:00
Jimmy Chen
70194dfc6a Implement PeerDAS Fulu fork activation (#6795)
Addresses #6706


  This PR activates PeerDAS at the Fulu fork epoch instead of `EIP_7594_FORK_EPOCH`. This means we no longer support testing PeerDAS with Deneb / Electrs, as it's now part of a hard fork.
2025-01-30 07:01:34 +00:00
Pawan Dhananjay
4a07c08c4f Fork aware max values in rpc (#6847)
N/A


  In https://github.com/sigp/lighthouse/pull/6329 we changed `max_blobs_per_block` from a preset to a config value.
We weren't using the right value based on fork in that PR. This is a follow up PR to use the fork dependent values.

In the proces, I also updated other places where we weren't using fork dependent values from the ChainSpec.

Note to reviewer: easier to go through by commit
2025-01-29 19:42:13 +00:00
Lion - dapplion
c6ebaba892 Detect invalid proposer signature on RPC block processing (#6519)
Complements
- https://github.com/sigp/lighthouse/pull/6321

by detecting if the proposer signature is valid or not during RPC block processing. In lookup sync, if the invalid signature signature is the proposer signature, it's not deterministic on the block root. So we should only penalize the sending peer and retry. Otherwise, if it's on the body we should drop the lookup and penalize all peers that claim to have imported the block
2025-01-28 19:01:26 +00:00
Jimmy Chen
e98209d118 Implement PeerDAS subnet decoupling (aka custody groups) (#6736)
* Implement PeerDAS subnet decoupling (aka custody groups).

* Merge branch 'unstable' into decouple-subnets

* Refactor feature testing for spec tests (#6737)

Squashed commit of the following:

commit 898d05ee17
Merge: ffbd25e2b 7e0cddef3
Author: Jimmy Chen <jchen.tc@gmail.com>
Date:   Tue Dec 24 14:41:19 2024 +1100

    Merge branch 'unstable' into refactor-ef-tests-features

commit ffbd25e2be
Author: Jimmy Chen <jchen.tc@gmail.com>
Date:   Tue Dec 24 14:40:38 2024 +1100

    Fix `SszStatic` tests for PeerDAS: exclude eip7594 test vectors when testing Electra types.

commit aa593cf35c
Author: Jimmy Chen <jchen.tc@gmail.com>
Date:   Fri Dec 20 12:08:54 2024 +1100

    Refactor spec testing for features and simplify usage.

* Fix build.

* Add input validation and improve arithmetic handling when calculating custody groups.

* Address review comments re code style consistency.

* Merge branch 'unstable' into decouple-subnets

# Conflicts:
#	beacon_node/beacon_chain/src/kzg_utils.rs
#	beacon_node/beacon_chain/src/observed_data_sidecars.rs
#	beacon_node/lighthouse_network/src/discovery/subnet_predicate.rs
#	common/eth2_network_config/built_in_network_configs/chiado/config.yaml
#	common/eth2_network_config/built_in_network_configs/gnosis/config.yaml
#	common/eth2_network_config/built_in_network_configs/holesky/config.yaml
#	common/eth2_network_config/built_in_network_configs/mainnet/config.yaml
#	common/eth2_network_config/built_in_network_configs/sepolia/config.yaml
#	consensus/types/src/chain_spec.rs

* Update consensus/types/src/chain_spec.rs

Co-authored-by: Lion - dapplion <35266934+dapplion@users.noreply.github.com>

* Merge remote-tracking branch 'origin/unstable' into decouple-subnets

* Update error handling.

* Address review comment.

* Merge remote-tracking branch 'origin/unstable' into decouple-subnets

# Conflicts:
#	consensus/types/src/chain_spec.rs

* Update PeerDAS spec tests to `1.5.0-beta.0` and fix failing unit tests.

* Merge remote-tracking branch 'origin/unstable' into decouple-subnets

# Conflicts:
#	beacon_node/lighthouse_network/src/peer_manager/mod.rs
2025-01-15 07:40:26 +00:00
Pawan Dhananjay
05727290fb Make max_blobs_per_block a config parameter (#6329)
* First pass

* Add restrictions to RuntimeVariableList api

* Use empty_uninitialized and fix warnings

* Fix some todos

* Merge branch 'unstable' into max-blobs-preset

* Fix take impl on RuntimeFixedList

* cleanup

* Fix test compilations

* Fix some more tests

* Fix test from unstable

* Merge branch 'unstable' into max-blobs-preset

* Merge remote-tracking branch 'origin/unstable' into max-blobs-preset

* Remove footgun function

* Minor simplifications

* Move from preset to config

* Fix typo

* Revert "Remove footgun function"

This reverts commit de01f923c7.

* Try fixing tests

* Thread through ChainSpec

* Fix release tests

* Move RuntimeFixedVector into module and rename

* Add test

* Remove empty RuntimeVarList awefullness

* Fix tests

* Simplify BlobSidecarListFromRoot

* Merge remote-tracking branch 'origin/unstable' into max-blobs-preset

* Bump quota to account for new target (6)

* Remove clone

* Fix issue from review

* Try to remove ugliness

* Merge branch 'unstable' into max-blobs-preset

* Fix max value

* Fix doctest

* Fix formatting

* Fix max check

* Delete hardcoded max_blobs_per_block in RPC limits

* Merge remote-tracking branch 'origin/unstable' into max-blobs-preset
2025-01-10 06:34:58 +00:00
Mac L
ecdf2d891f Add Fulu boilerplate (#6695)
* Add Fulu boilerplate

* Add more boilerplate

* Change fulu_time to osaka_time

* Merge branch 'unstable' into fulu-boilerplate

* Fix tests

* Merge branch 'unstable' into fulu-boilerplate

* More test fixes

* Apply suggestions

* Remove `get_payload` boilerplate

* Add lightclient fulu types and fix beacon-chain-tests

* Disable Fulu in ef-tests

* Reduce boilerplate for future forks

* Small fixes

* One more fix

* Apply suggestions

* Merge branch 'unstable' into fulu-boilerplate

* Fix lints
2025-01-10 05:25:23 +00:00
Lion - dapplion
1c5be34def Write range sync tests in external event-driven form (#6618)
* Write range sync tests in external event-driven form

* Fix remaining test

* Drop unused generics

* Merge branch 'unstable' into range-sync-tests

* Add reference to test author

* Use async await

* Fix failing test. Not sure how it was passing before without an EL.
2024-12-16 05:44:10 +00:00
Lion - dapplion
8188e036a0 Generalize sync block lookup tests (#6498)
* Generalize sync block lookup tests
2024-10-25 05:50:31 +00:00