This method is a footgun because it truncates the list. It is the source of a recent bug:
- https://github.com/sigp/lighthouse/pull/7927
- Delete uses of `RuntimeVariableList::from_vec` and replace them with `::new` which does validation and can fail.
- Propagate errors where possible, unwrap in tests and use `expect` for obviously-safe uses (in `chain_spec.rs`).
Adds the required boilerplate code for the Gloas (Glamsterdam) hard fork. This allows PRs testing Gloas-candidate features to test fork transition.
This also includes de-duplication of post-Bellatrix readiness notifiers from #6797 (credit to @dapplion)
#7815
- removes all existing spans, so some span fields that appear in logs like `service_name` may be lost.
- instruments a few key code paths in the beacon node, starting from **root spans** named below:
* Gossip block and blobs
* `process_gossip_data_column_sidecar`
* `process_gossip_blob`
* `process_gossip_block`
* Rpc block and blobs
* `process_rpc_block`
* `process_rpc_blobs`
* `process_rpc_custody_columns`
* Rpc blocks (range and backfill)
* `process_chain_segment`
* `PendingComponents` lifecycle
* `pending_components`
To test locally:
* Run Grafana and Tempo with https://github.com/sigp/lighthouse-metrics/pull/57
* Run Lighthouse BN with `--telemetry-collector-url http://localhost:4317`
Some captured traces can be found here: https://hackmd.io/@jimmygchen/r1sLOxPPeg
Removing the old spans seem to have reduced the memory usage quite a lot - i think we were using them on long running tasks and too excessively:
<img width="910" height="495" alt="image" src="https://github.com/user-attachments/assets/5208bbe4-53b2-4ead-bc71-0b782c788669" />
Closes#7467.
This PR primarily addresses [the P2P changes](https://github.com/ethereum/EIPs/pull/9840) in [fusaka-devnet-2](https://fusaka-devnet-2.ethpandaops.io/). Specifically:
* [the new `nfd` parameter added to the `ENR`](https://github.com/ethereum/EIPs/pull/9840)
* [the modified `compute_fork_digest()` changes for every BPO fork](https://github.com/ethereum/EIPs/pull/9840)
90% of this PR was absolutely hacked together as fast as possible during the Berlinterop as fast as I could while running between Glamsterdam debates. Luckily, it seems to work. But I was unable to be as careful in avoiding bugs as I usually am. I've cleaned up the things *I remember* wanting to come back and have a closer look at. But still working on this.
Progress:
* [x] get it working on `fusaka-devnet-2`
* [ ] [*optional* disconnect from peers with incorrect `nfd` at the fork boundary](https://github.com/ethereum/consensus-specs/pull/4407) - Can be addressed in a future PR if necessary
* [x] first pass clean-up
* [x] fix up all the broken tests
* [x] final self-review
* [x] more thorough review from people more familiar with affected code
Lighthouse is currently loggign a lot errors in the `RPC` behaviour whenever a response is received for a request_id that no longer exists in active_inbound_requests. This is likely due to a data race or timing issue (e.g., the peer disconnecting before the response is handled).
This PR addresses that by removing the error logging from the RPC layer. Instead, RPC::send_response now simply returns an Err, shifting the responsibility to the main service. The main service can then determine whether the peer is still connected and only log an error if the peer remains connected.
Thanks @ackintosh for helping debug!
This bug was first found and partially fixed by @VolodymyrBg in #7317 - this PR applies the same fix everywhere else.
The old logic updated the waker when it already matched the context, and did nothing when it was stale:
```rust
if waker.will_wake(cx.waker()) {
self.waker = Some(cx.waker().clone());
}
```
This is the wrong way around. We only want to update the waker if it doesn't match the current context:
```rust
if !waker.will_wake(cx.waker()) {
self.waker = Some(cx.waker().clone());
}
```
I don't think we've ever noticed any issues, but it’s a subtle bug that could lead to missed wakeups.
https://github.com/sigp/lighthouse/issues/6875
- Enabled the linter in rate-limiter and fixed errors.
- Changed the type of `Quota::max_tokens` from `u64` to `NonZeroU64` because `max_tokens` cannot be zero.
- Added a test to ensure that a large value for `tokens`, which causes an overflow, is handled properly.
closes https://github.com/sigp/lighthouse/issues/5785
The diagram below shows the differences in how the receiver (responder) behaves before and after this PR. The following sentences will detail the changes.
```mermaid
flowchart TD
subgraph "*** After ***"
Start2([START]) --> AA[Receive request]
AA --> COND1{Is there already an active request <br> with the same protocol?}
COND1 --> |Yes| CC[Send error response]
CC --> End2([END])
%% COND1 --> |No| COND2{Request is too large?}
%% COND2 --> |Yes| CC
COND1 --> |No| DD[Process request]
DD --> EE{Rate limit reached?}
EE --> |Yes| FF[Wait until tokens are regenerated]
FF --> EE
EE --> |No| GG[Send response]
GG --> End2
end
subgraph "*** Before ***"
Start([START]) --> A[Receive request]
A --> B{Rate limit reached <br> or <br> request is too large?}
B -->|Yes| C[Send error response]
C --> End([END])
B -->|No| E[Process request]
E --> F[Send response]
F --> End
end
```
### `Is there already an active request with the same protocol?`
This check is not performed in `Before`. This is taken from the PR in the consensus-spec, which proposes updates regarding rate limiting and response timeout.
https://github.com/ethereum/consensus-specs/pull/3767/files
> The requester MUST NOT make more than two concurrent requests with the same ID.
The PR mentions the requester side. In this PR, I introduced the `ActiveRequestsLimiter` for the `responder` side to restrict more than two requests from running simultaneously on the same protocol per peer. If the limiter disallows a request, the responder sends a rate-limited error and penalizes the requester.
### `Rate limit reached?` and `Wait until tokens are regenerated`
UPDATE: I moved the limiter logic to the behaviour side. https://github.com/sigp/lighthouse/pull/5923#issuecomment-2379535927
~~The rate limiter is shared between the behaviour and the handler. (`Arc<Mutex<RateLimiter>>>`) The handler checks the rate limit and queues the response if the limit is reached. The behaviour handles pruning.~~
~~I considered not sharing the rate limiter between the behaviour and the handler, and performing all of these either within the behaviour or handler. However, I decided against this for the following reasons:~~
- ~~Regarding performing everything within the behaviour: The behaviour is unable to recognize the response protocol when `RPC::send_response()` is called, especially when the response is `RPCCodedResponse::Error`. Therefore, the behaviour can't rate limit responses based on the response protocol.~~
- ~~Regarding performing everything within the handler: When multiple connections are established with a peer, there could be multiple handlers interacting with that peer. Thus, we cannot enforce rate limiting per peer solely within the handler. (Any ideas? 🤔 )~~
Resolves#6811
Rename `GOSSIP_MAX_SIZE` to `MAX_PAYLOAD_SIZE` and remove `MAX_CHUNK_SIZE` in accordance with the spec.
The spec also "clarifies" the message size limits at different levels. The rpc limits are equivalent to what we had before imo.
The gossip limits have additional checks.
I have gotten rid of the `is_bellatrix_enabled` checks that used a lower limit (1mb) pre-merge. Since all networks we run start from the merge, I don't think this will break any setups.
I've been working at updating another library to latest Lighthouse and got very confused with RPC request Ids.
There were types that had fields called `request_id` and `id`. And interchangeably could have types `PeerRequestId`, `rpc::RequestId`, `AppRequestId`, `api_types::RequestId` or even `Request.id`.
I couldn't keep track of which Id was linked to what and what each type meant.
So this PR mainly does a few things:
- Changes the field naming to match the actual type. So any field that has an `AppRequestId` will be named `app_request_id` rather than `id` or `request_id` for example.
- I simplified the types. I removed the two different `RequestId` types (one in Lighthouse_network the other in the rpc) and grouped them into one. It has one downside tho. I had to add a few unreachable lines of code in the beacon processor, which the extra type would prevent, but I feel like it might be worth it. Happy to add an extra type to avoid those few lines.
- I also removed the concept of `PeerRequestId` which sometimes went alongside a `request_id`. There were times were had a `PeerRequest` and a `Request` being returned, both of which contain a `RequestId` so we had redundant information. I've simplified the logic by removing `PeerRequestId` and made a `ResponseId`. I think if you look at the code changes, it simplifies things a bit and removes the redundant extra info.
I think with this PR things are a little bit easier to reasonable about what is going on with all these RPC Ids.
NOTE: I did this with the help of AI, so probably should be checked
NA
Bumps the `ethereum_ssz` version, along with other crates that share the dep.
Primarily, this give us bitfields which can store 128 bytes on the stack before allocating, rather than 32 bytes (https://github.com/sigp/ethereum_ssz/pull/38). The validator count has increase massively since we set it at 32 bytes, so aggregation bitfields (et al) now require a heap allocation. This new value of 128 should get us to ~2m active validators.
We don't need to store `BehaviourAction` for `ready_requests` and therefore avoid having an `unreachable!` on #6625.
Therefore this PR should be merged before it
Part of
- https://github.com/sigp/lighthouse/issues/6258
To address PeerDAS sync issues we need to make individual by_range requests within a batch retriable. We should adopt the same pattern for lookup sync where each request (block/blobs/columns) is tracked individually within a "meta" request that group them all and handles retry logic.
- Building on https://github.com/sigp/lighthouse/pull/6398
second step is to add individual request accumulators for `blocks_by_range`, `blobs_by_range`, and `data_columns_by_range`. This will allow each request to progress independently and be retried separately.
Most of the logic is just piping, excuse the large diff. This PR does not change the logic of how requests are handled or retried. This will be done in a future PR changing the logic of `RangeBlockComponentsRequest`.
### Before
- Sync manager receives block with `SyncRequestId::RangeBlockAndBlobs`
- Insert block into `SyncNetworkContext::range_block_components_requests`
- (If received stream terminators of all requests)
- Return `Vec<RpcBlock>`, and insert into `range_sync`
### Now
- Sync manager receives block with `SyncRequestId::RangeBlockAndBlobs`
- Insert block into `SyncNetworkContext:: blocks_by_range_requests`
- (If received stream terminator of this request)
- Return `Vec<SignedBlock>`, and insert into `SyncNetworkContext::components_by_range_requests `
- (If received a result for all requests)
- Return `Vec<RpcBlock>`, and insert into `range_sync`
Addresses #6706
This PR activates PeerDAS at the Fulu fork epoch instead of `EIP_7594_FORK_EPOCH`. This means we no longer support testing PeerDAS with Deneb / Electrs, as it's now part of a hard fork.
N/A
In https://github.com/sigp/lighthouse/pull/6329 we changed `max_blobs_per_block` from a preset to a config value.
We weren't using the right value based on fork in that PR. This is a follow up PR to use the fork dependent values.
In the proces, I also updated other places where we weren't using fork dependent values from the ChainSpec.
Note to reviewer: easier to go through by commit
* First pass
* Add restrictions to RuntimeVariableList api
* Use empty_uninitialized and fix warnings
* Fix some todos
* Merge branch 'unstable' into max-blobs-preset
* Fix take impl on RuntimeFixedList
* cleanup
* Fix test compilations
* Fix some more tests
* Fix test from unstable
* Merge branch 'unstable' into max-blobs-preset
* Merge remote-tracking branch 'origin/unstable' into max-blobs-preset
* Remove footgun function
* Minor simplifications
* Move from preset to config
* Fix typo
* Revert "Remove footgun function"
This reverts commit de01f923c7.
* Try fixing tests
* Thread through ChainSpec
* Fix release tests
* Move RuntimeFixedVector into module and rename
* Add test
* Remove empty RuntimeVarList awefullness
* Fix tests
* Simplify BlobSidecarListFromRoot
* Merge remote-tracking branch 'origin/unstable' into max-blobs-preset
* Bump quota to account for new target (6)
* Remove clone
* Fix issue from review
* Try to remove ugliness
* Merge branch 'unstable' into max-blobs-preset
* Fix max value
* Fix doctest
* Fix formatting
* Fix max check
* Delete hardcoded max_blobs_per_block in RPC limits
* Merge remote-tracking branch 'origin/unstable' into max-blobs-preset
* Compute blob rpc limits in static block
* Fix min size
* Use MainnetEthSpec in rpc tests
* Revert MainnetEthSpec; add another constant for blob size minimal
* add fork_name_enabled fn to Forkname impl
* refactor codebase to use new fork_enabled fn
* fmt
* Merge branch 'unstable' of https://github.com/sigp/lighthouse into fork-ord-impl
* small code cleanup
* resolve merge conflicts
* fix beacon chain test
* merge conflicts
* fix ef test issue
* resolve merge conflicts
* add id to rpc requests
* rename rpc request and response types for more accurate meaning
* remove unrequired build_request function
* remove unirequired Request wrapper types and unify Outbound and Inbound Request
* add RequestId to NetworkMessage::SendResponse
,NetworkMessage::SendErrorResponse to be passed to Rpc::send_response
* fix Rpc Ping sequence number
* bubble up Outbound Err's and Responses even if the peer disconnected
* send pings via Rpc from main network
* add comment to connected check
* Merge branch 'unstable' into fix-ping-seq-number
* update libp2p to version 0.54.0
* address review
* Merge branch 'unstable' of github.com:sigp/lighthouse into update-libp2p
* Merge branch 'update-libp2p' of github.com:sigp/lighthouse into update-libp2p