7008 Commits

Author SHA1 Message Date
Jimmy Chen
c13fb2fb46 Instrument publish_block code path (#7945)
Instrument `publish_block` code path and log dropped data columns when publishing.

Example spans (running the devnet from my laptop, so the numbers aren't great)

<img width="734" height="296" alt="image" src="https://github.com/user-attachments/assets/20620bf7-2b38-4392-aa75-9ba96d3a7f0d" />

<img width="718" height="625" alt="image" src="https://github.com/user-attachments/assets/61e1ff1c-65b5-4ad4-981a-d0fadc9829e1" />
2025-08-28 03:31:29 +00:00
Jimmy Chen
746da7ffd5 Fix doppelganger protection script (#7959)
Previously `kurtosis service inspect` gave us output like this, with each flag on a separate line:

```
CMD:
lighthouse
beacon_node
--debug-level=debug
--datadir=/data/lighthouse/beacon-data
--listen-address=0.0.0.0
--port=9000
--http
--http-address=0.0.0.0
--http-port=4000
--disable-packet-filter
--execution-endpoints=http://172.16.0.8:8551
--jwt-secrets=/jwt/jwtsecret
--suggested-fee-recipient=0x8943545177806ED17B9F23F0a21ee5948eCaa776
--disable-enr-auto-update
--enr-address=172.16.0.11
```

In the latest version this has been updated to a single line

```
CMD:
exec lighthouse beacon_node --debug-level=debug --datadir=/data/lighthouse/beacon-data --listen-address=0.0.0.0 --port=9000 --http --http-address=0.0.0.0 --http-port=4000 --disable-packet-filter --execution-endpoints=http://172.16.0.12:8551 --jwt-secrets=/jwt/jwtsecret --suggested-fee-recipient=0x8943545177806ED17B9F23F0a21ee5948eCaa776 --disable-enr-auto-update --enr-address=172.16.0.18 --enr-tcp-port=9000 --enr-udp-port=9000 --enr-quic-port=9001 --quic-port=9001 --metrics --metrics-address=0.0.0.0 --metrics-allow-origin=* --metrics-port=5054 --enable-private-discovery --testnet-dir=/network-configs --boot-nodes=enr:-N24QPYP7bj0aqoM2dXsP5hnosW27U6PTYJt1kYFhNkwIvlFQhGJ1om7f4zcHhVJwvUL7wCsVbDJbP_l-TF8X3q4pVEDh2F0dG5ldHOIAAAwAAAAAACGY2xpZW500YpMaWdodGhvdXNlhTcuMS4whGV0aDKQqFs_bWAAADj__________4JpZIJ2NIJpcISsEAAPhHF1aWOCIymJc2VjcDI1NmsxoQK_z4HQylgsOal74Jek9D_EhY0vcDX5AcLHnPD7iOeEdYhzeW5jbmV0cwCDdGNwgiMog3VkcIIjKA --target-peers=3
```

and it broke our script. This PR updates the extraction logic.
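
A minimal sketch of layout-independent flag extraction, written in Rust for illustration only; it is not the script's actual logic. The helper names `extract_flags` and `flag_value` are hypothetical. Splitting on any whitespace treats the one-flag-per-line layout and the single-line `exec ...` layout the same way:

```rust
/// Collect the `--flag` tokens from the text printed under `CMD:`.
/// Splitting on any whitespace handles both layouts: one token per line,
/// or the whole command on a single line prefixed with `exec`.
fn extract_flags(cmd_output: &str) -> Vec<String> {
    cmd_output
        .split_whitespace()
        .filter(|token| token.starts_with("--"))
        .map(str::to_string)
        .collect()
}

/// Look up the value of a `--name=value` flag, if present.
fn flag_value<'a>(flags: &'a [String], name: &str) -> Option<&'a str> {
    let prefix = format!("--{name}=");
    flags
        .iter()
        .find_map(|flag| flag.strip_prefix(prefix.as_str()))
}
```

With this shape, both outputs shown above yield the same flag list, so a lookup such as `flag_value(&flags, "http-port")` does not depend on how the command was printed.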
2025-08-28 02:48:43 +00:00
Michael Sproul
d235f2c697 Delete RuntimeVariableList::from_vec (#7930)
This method is a footgun because it truncates the list. It is the source of a recent bug:

- https://github.com/sigp/lighthouse/pull/7927


- Delete uses of `RuntimeVariableList::from_vec` and replace them with `::new`, which does validation and can fail.
- Propagate errors where possible, unwrap in tests and use `expect` for obviously-safe uses (in `chain_spec.rs`).
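
To illustrate the difference between the two constructor shapes, here is a minimal sketch using a hypothetical `BoundedList` type. It is a stand-in, not the real `RuntimeVariableList`, and the error type is an assumption:

```rust
/// Hypothetical stand-in for a runtime-bounded list; not the real type.
#[derive(Debug, PartialEq)]
struct BoundedList<T> {
    items: Vec<T>,
    max_len: usize,
}

/// Hypothetical error returned by the validating constructor.
#[derive(Debug, PartialEq)]
enum ListError {
    TooLong { len: usize, max_len: usize },
}

impl<T> BoundedList<T> {
    /// Validating constructor: rejects oversized input instead of dropping items.
    fn new(items: Vec<T>, max_len: usize) -> Result<Self, ListError> {
        if items.len() > max_len {
            return Err(ListError::TooLong { len: items.len(), max_len });
        }
        Ok(Self { items, max_len })
    }

    /// The footgun shape: silently discards everything past `max_len`.
    fn from_vec_truncating(mut items: Vec<T>, max_len: usize) -> Self {
        items.truncate(max_len);
        Self { items, max_len }
    }
}
```

The truncating constructor turns a three-item input with `max_len = 2` into a two-item list without any signal to the caller, whereas `new` returns an error that can be propagated or unwrapped explicitly.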
2025-08-27 06:52:14 +00:00
Pawan Dhananjay
ccf03e1c88 Fix data columns by range returning all columns (#7942)
N/A


In https://github.com/sigp/lighthouse/pull/7897, we seem to have modified data columns by range to return all the columns we have for the requested epoch, disregarding which columns the peer requested.
2025-08-27 05:00:34 +00:00
Barnabas Busa
2b33fe6620 Update to spec v1.6.0-alpha.5 (#7910)
- https://github.com/ethereum/consensus-specs/pull/4508
2025-08-27 03:59:21 +00:00
Jimmy Chen
8901c7417d Notify lookup after gossip data column processing resulted in an import (#7940)
When gossip data column processing completes and results in a block import, sync is currently not notified of the successful import. This is inconsistent with how blob processing and block processing both notify sync.

This fix ensures lookup sync receives block import notifications when blocks become available through gossip data column.
2025-08-27 01:32:17 +00:00
Jimmy Chen
3e78034de6 Add BEACON_PROCESSOR_WORKERS_ACTIVE_GAUGE_BY_TYPE metric (#7935)
Similar to `BEACON_PROCESSOR_WORKERS_ACTIVE_TOTAL` but this metric also records the work type.

This is useful for identifying the task when a worker is stuck due to a deadlock or some other cause, which is usually difficult to debug in production / release mode.
2025-08-26 06:46:14 +00:00
Jimmy Chen
78b4cca46b Run sync tests on CI by default. (#7929)
This PR enables some sync tests by default on CI - this will help catch breakages on sync happy paths, e.g. this would have caught the bug #7926 if lighthouse is not able to sync OR serve sync requests.

The enabled tests are genesis-sync with 120s / 300s offline time, and should cover both serving / consuming by root and by range requests.

I'm leaving the checkpoint sync tests optional, as they have an external dependency on checkpoint servers (which may cause CI instability) and may put extra load on them.
2025-08-26 02:49:50 +00:00
Mac L
e438691683 Add Gloas boilerplate (#7728)
Adds the required boilerplate code for the Gloas (Glamsterdam) hard fork. This allows PRs testing Gloas-candidate features to test fork transition.

This also includes de-duplication of post-Bellatrix readiness notifiers from #6797 (credit to @dapplion)
2025-08-26 02:49:48 +00:00
Jimmy Chen
daf1c7c3af Fix RPC blocks not getting fully KZG verified (#7927)
Fix RPC blocks not getting fully KZG verified due to incorrect list truncation.
2025-08-25 16:46:16 +00:00
Jimmy Chen
747d9118ff Fix DataColumnsByRoot request limit validation bug (#7928)
Fixes #7926

This was a bug I introduced in #7890; @pawanjay176 noticed it on some running nodes and added an RPC test to confirm it.

The culprit is this line, where I failed to fill the vec to its max size, so it doesn't calculate the max size properly, resulting in all `DataColumnsByRoot` requests exceeding the max size during validation:
d24a6d2a45/consensus/types/src/chain_spec.rs (L1984)

The PR fixes this and includes new regression tests for this fix.
2025-08-25 04:13:36 +00:00
Mac L
c41d1181d2 Use Fork variants instead of version for JsonPayload types (#7909)
With Fulu, we increment the engine API version for `get_payload` but we do not also increment `new_payload`.
In Lighthouse, we have a tight coupling between these version numbers and the Fork variants.
For example, both `get_payload_v3` and `new_payload_v3` correspond to Deneb, `v4` to Electra, etc.

However this coupling breaks with Fulu where only `get_payload_v5` is related to Fulu and `new_payload_v4` now also corresponds to Fulu (new_payload_v5 does not exist). While we can work around this, it creates a confusing situation where the versions and Fork variants are out of sync.

See the following code snippet where we are using the v4 endpoint, and yet passing a `V5` payload variant: 522bd9e9c6/beacon_node/execution_layer/src/engine_api/http.rs (L849-L870)


1. Remove `new_payload_v5` as it is unused in Fulu.
2. Rename the `JsonExecutionPayload` and `JsonGetExecutionPayloadResponse` types to use Fork variants instead of version variants. This clarifies the relationship between them.
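
The version mismatch described above can be sketched as two separate mappings from fork to engine API version. The `Fork` enum and the two functions here are illustrative only, not Lighthouse's actual types:

```rust
/// Illustrative subset of forks; not Lighthouse's real fork enum.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Fork {
    Deneb,
    Electra,
    Fulu,
}

/// `get_payload` version per fork: Fulu bumps this to v5.
fn get_payload_version(fork: Fork) -> u8 {
    match fork {
        Fork::Deneb => 3,
        Fork::Electra => 4,
        Fork::Fulu => 5,
    }
}

/// `new_payload` version per fork: Fulu reuses v4, there is no v5.
fn new_payload_version(fork: Fork) -> u8 {
    match fork {
        Fork::Deneb => 3,
        Fork::Electra | Fork::Fulu => 4,
    }
}
```

Keying the payload types on the fork rather than the version number means the two mappings can diverge without the type names becoming misleading.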
2025-08-22 09:22:41 +00:00
João Oliveira
884f30094a use DEFAULT_TARGET_PEERS for target peers everywhere (#7916)
I was going to leave this as a comment on #7877 but then noticed it had already been merged.
We have `DEFAULT_TARGET_PEERS`, which was set to 50 and only used in the `Default` impl for `peer_manager`'s `Config`, which then gets overridden by the default of `lighthouse_network::Config`.
This PR unifies everything on `DEFAULT_TARGET_PEERS`.
2025-08-22 00:24:24 +00:00
Jimmy Chen
d24a6d2a45 Prioritise StatusV2 over StatusV1 RPC protocol (#7912)
Prioritise `StatusV2` over `StatusV1` RPC protocol.

A bug discovered during devnet-4 testing and extracted from the sync fixes PR #7876.
2025-08-21 23:02:18 +00:00
João Oliveira
cee30d8ca5 Update lighthouse to the latest upstream libp2p and gossipsub (#7828) 2025-08-21 07:57:46 +00:00
Age Manning
c9ffdf7f71 Re-assess Lighthouse's peer count for Fusaka (#7877) 2025-08-21 06:12:53 +00:00
Jimmy Chen
f19d4f6af1 Implement tracing spans for data column RPC requests and responses (#7831)
#7830
2025-08-20 23:35:51 +00:00
Jimmy Chen
2d223575d6 Avoid unnecessary database lookups in data column RPC requests (#7897)
This PR is an optimisation to avoid unnecessary database lookups when a peer requests data columns that the node doesn't custody (advertised via `cgc`).

An extreme but realistic example: a full node only stores 4 custody columns by default, but it may receive a range request of 32 slots with all 128 columns. This would result in 4096 (128 * 32) database lookups, of which the node can only serve 128 (4 * 32).


- Filter data column RPC requests (`DataColumnsByRoot`, `DataColumnsByRange`) to only look up columns the node custodies
- Prevents unnecessary database queries that would always fail for non-custody columns
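
The arithmetic in the example above can be checked with a small sketch. The function names are hypothetical, and the custody set is passed in as a plain `HashSet` rather than taken from the node's real custody context:

```rust
use std::collections::HashSet;

/// Keep only the requested column indices that this node custodies.
fn columns_to_look_up(requested: &[u64], custody: &HashSet<u64>) -> Vec<u64> {
    requested
        .iter()
        .copied()
        .filter(|index| custody.contains(index))
        .collect()
}

/// Database lookups needed for a range request: one per (slot, column) pair.
fn lookup_count(slots: usize, columns: &[u64]) -> usize {
    slots * columns.len()
}
```

For a 32-slot request covering all 128 columns, `lookup_count` gives 4096 before filtering and 128 once the request is reduced to 4 custody columns.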
2025-08-20 05:08:53 +00:00
Jimmy Chen
f6859b1137 Add tempo to local testnet config and update fulu kurtosis config files (#7898)
This PR adds tempo to kurtosis config and will collect lighthouse traces on kurtosis local testnet. The traces can be viewed / queried from Grafana.

Also updated fulu kurtosis configs to use latest geth image.
2025-08-20 02:30:11 +00:00
Jimmy Chen
b4704eab4a Fulu update to spec v1.6.0-alpha.4 (#7890)
Fulu update to spec [v1.6.0-alpha.4](https://github.com/ethereum/consensus-specs/releases/tag/v1.6.0-alpha.4).
- Make `number_of_columns` a preset
- Optimise `get_custody_groups` to avoid computing if cgc = 128
- Add support for additional typenum values in type_dispatch macro
2025-08-20 02:05:04 +00:00
Jimmy Chen
95882bfa66 Add --telemetry-service-name CLI flag for OpenTelemetry service name override (#7903)
Allows users to customize the OpenTelemetry service name instead of using the hardcoded default `lighthouse`. Defaults to 'lighthouse-bn' for beacon node, 'lighthouse-vc' for validator client, or 'lighthouse' for other subcommands.

This is useful when analysing traces from multiple nodes, see Grafana screenshot below with service name overrides in Kurtosis (`ethereum-package` PR: https://github.com/ethpandaops/ethereum-package/pull/1160):

<img width="1148" height="627" alt="image" src="https://github.com/user-attachments/assets/7e875639-10f7-4756-837f-2006fa4b12e0" />
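
The defaulting rule described above can be sketched as follows. The `Subcommand` enum and `telemetry_service_name` function are hypothetical names for illustration; only the three default strings come from the description:

```rust
/// Hypothetical classification of the subcommand being run.
#[derive(Clone, Copy)]
enum Subcommand {
    BeaconNode,
    ValidatorClient,
    Other,
}

/// Resolve the service name: an explicit override wins, otherwise fall back
/// to the per-subcommand default.
fn telemetry_service_name(subcommand: Subcommand, override_name: Option<&str>) -> String {
    match override_name {
        Some(name) => name.to_string(),
        None => match subcommand {
            Subcommand::BeaconNode => "lighthouse-bn",
            Subcommand::ValidatorClient => "lighthouse-vc",
            Subcommand::Other => "lighthouse",
        }
        .to_string(),
    }
}
```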
2025-08-20 01:16:34 +00:00
Jimmy Chen
34dd1b27ae Revise data column rpc limits and queue sizes (#7887)
Revise data column rpc limits and queue sizes. Also removed some outdated TODOs for Fulu / das.
2025-08-19 03:48:08 +00:00
Daniel Knopik
1fd7ead010 Do not filter validators by status if filter is an empty list (#7884)
69d2feb12a/apis/beacon/states/validators.yaml (L128-L130) says we need to not filter if the filter is an empty list.


Add a check for `statuses.is_empty()`.
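
A minimal sketch of the rule, assuming statuses are compared as plain strings; the function name is hypothetical and the real handler works on typed statuses:

```rust
/// An empty filter list means "do not filter": every status matches.
fn matches_status_filter(status: &str, statuses: &[&str]) -> bool {
    statuses.is_empty() || statuses.contains(&status)
}
```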
2025-08-18 07:46:37 +00:00
Michael Sproul
836c39efaa Shrink persisted fork choice data (#7805)
Closes:

- https://github.com/sigp/lighthouse/issues/7760


- [x] Remove `balances_cache` from `PersistedForkChoiceStore` (~65 MB saving on mainnet)
- [x] Remove `justified_balances` from `PersistedForkChoiceStore` (~16 MB saving on mainnet)
- [x] Remove `balances` from `ProtoArray`/`SszContainer`.
- [x] Implement zstd compression for votes
- [x] Fix bug in justified state usage
- [x] Bump schema version to V28 and implement migration.
2025-08-18 06:03:28 +00:00
Michael Sproul
08234b2823 Add rustfmt config with edition 2024 (#7888)
Since we updated to edition 2024 my Vim plugin for rustfmt is formatting code incorrectly, with 2018 settings:

889b9a7515/autoload/rustfmt.vim (L74-L75)

Arguably this plugin is a bit junk, but I think it's fairly harmless to add this config.


Add `rustfmt.toml`. This is a generic config file for `rustfmt` which is probably useful for `rustfmt` integration with other editors too.

We may want to add other config to `rustfmt.toml` over time as well, I think this was discussed recently.
2025-08-18 04:32:58 +00:00
Jimmy Chen
aa8cba3741 Upgrade rust-eth-kzg to 0.8.0 (#7870)
#7864

The main breaking change in v0.8.0 is the `TrustedSetup` initialisation - it now requires a json string via `PeerDASTrustedSetup::from_json`.
2025-08-18 02:52:39 +00:00
Age Manning
9200042910 Transition network key to hex format (#7665)
#7181


Instead of storing the network key as binary data we store it as hex, allowing users to modify it via the file.

We can still read the old binary form, but we will migrate binary to hex as it will be the new standard.
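
A rough sketch of reading either format, under stated assumptions: the key is taken to be 32 bytes, an optional `0x` prefix is tolerated, and all names (`KeyFormat`, `parse_key_file`, `to_hex`) are hypothetical. The actual file handling in the PR may differ:

```rust
use std::convert::TryFrom;

/// Assumed key length in bytes (an assumption for this sketch).
const KEY_LEN: usize = 32;

#[derive(Debug, PartialEq)]
enum KeyFormat {
    Hex,
    LegacyBinary,
}

/// Encode bytes as lowercase hex.
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|byte| format!("{byte:02x}")).collect()
}

/// Decode a hex string (optionally `0x`-prefixed); `None` if it is not valid hex.
fn decode_hex(text: &str) -> Option<Vec<u8>> {
    let text = text.trim();
    let text = text.strip_prefix("0x").unwrap_or(text);
    if text.len() % 2 != 0 || !text.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    (0..text.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&text[i..i + 2], 16).ok())
        .collect()
}

/// Try the hex form first, then fall back to the legacy raw-bytes form.
fn parse_key_file(contents: &[u8]) -> Option<(KeyFormat, [u8; KEY_LEN])> {
    if let Ok(text) = std::str::from_utf8(contents) {
        if let Some(bytes) = decode_hex(text) {
            if let Ok(key) = <[u8; KEY_LEN]>::try_from(bytes) {
                return Some((KeyFormat::Hex, key));
            }
        }
    }
    <[u8; KEY_LEN]>::try_from(contents)
        .ok()
        .map(|key| (KeyFormat::LegacyBinary, key))
}
```

A caller that gets `KeyFormat::LegacyBinary` back could rewrite the file with `to_hex` to perform the migration.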
2025-08-15 07:12:19 +00:00
Michael Sproul
317dc0f56c Fix malloc_utils features (sysmalloc) (#7770)
Follow-up to:

- https://github.com/sigp/lighthouse/pull/7764

The `heaptrack` feature added in my previous PR was ineffective, because the jemalloc feature was turned on by the Linux target-specific dependency.

This PR tweaks the features such that:

- The jemalloc feature is just used to control whether jemalloc is compiled in. It is enabled on Linux by the target-specific dependency (see `lighthouse/Cargo.toml`), and completely disabled on Windows.
- If the `sysmalloc` feature is set on Linux then it overrides jemalloc when selecting an allocator, _even if_ the jemalloc feature is enabled (and the jemalloc dep was compiled).
2025-08-15 03:46:38 +00:00
Eitan Seri-Levi
90fa7c216e Fix ssz formatting for /light_client/updates beacon API endpoint (#7806)
#7759


We were incorrectly encoding the full response from `/light_client/updates` instead of only encoding the light client update.
2025-08-15 03:17:29 +00:00
Michael Sproul
42f6d7b02d Yeet env_logger into the sun (#7872)
- Remove explicit `env_logger` usage from `state_processing` tests and `lcli`.
- Set up tracing correctly for `lcli` (I've checked that we can see logs after this change).
- I didn't do anything to set up logging for the `state_processing` tests, as these are rarely run manually (they never fail). We could add `test_logger` in there on an as-needed basis.
2025-08-15 03:17:26 +00:00
antondlr
5ebb44e222 Try using sccache instead of disabling (#7873)
We temporarily can't build sccache on Windows runners, but it's still available on Linux.
This smol change lets us use it when available, instead of disabling it across the board.


The Windows runners now have a conditional check to disable (unset the `rustc-wrapper` env var) sccache in their entrypoint, just like the Linux ones have.
Also the workflows no longer fail when `sccache --show-stats` fails.
2025-08-14 00:10:13 +00:00
Age Manning
ee1b0ae2ff Allow for sync state where batch is unknown (#7391) 2025-08-13 06:00:49 +00:00
chonghe
522bd9e9c6 Update Rust Edition to 2024 (#7766)
* #7749

Thanks @dknopik and @michaelsproul for your help!
2025-08-13 03:04:31 +00:00
Michael Sproul
bd6b8b6a65 Disable sccache to fix Windows builds (#7867)
Quick fix to unblock Windows CI. I have a hammer and I'm using it.
2025-08-13 01:51:19 +00:00
antondlr
6604fd10b4 Deprecate macOS-x86 binaries (#7862)
Rust is demoting x86 for macOS: https://blog.rust-lang.org/2025/08/07/Rust-1.89.0/
This makes it unfeasible to maintain such a build going forward.


Stop publishing `x86_64-apple-darwin` binaries.
2025-08-12 07:23:31 +00:00
Jimmy Chen
4ef4bdc38b Initial Claude.md draft (#7848)
Add initial `Claude.md` draft. Feel free to comment and make suggestions.
2025-08-12 05:16:23 +00:00
Mac L
152f2bb2e4 Re-export context_deserialize_derive inside context_deserialize (#7852)
Re-export `context_deserialize_derive` inside of `context_deserialize` so they are both available from the same interface, which matches how popular crates (like `serde`) handle this.

This also nests both crates inside a new `context_deserialize` directory which will make it easier to eventually spin out into a different repo (if/when) we decide to do that (plus I prefer it aesthetically).
2025-08-12 05:16:19 +00:00
Michael Sproul
918121e313 Fix bugs in rebasing of states prior to finalization (#7849)
Attempt to fix this error reported by `beaconcha.in` on their Hoodi archive nodes:

> {"code":500,"message":"UNHANDLED_ERROR: DBError(CacheBuildError(BeaconState(MilhouseError(OutOfBoundsIterFrom { index: 1199549, len: 1060000 }))))","stacktraces":[]}


There are only a handful of places where we call `iter_from`.

This one is safe by construction (the check immediately prior ensures `self.pubkeys.len()` is not out of bounds):

cfb1f73310/beacon_node/beacon_chain/src/validator_pubkey_cache.rs (L84-L90)

This one should also be safe, and the indexes used here would not be as large as the ones in the reported error:

cfb1f73310/consensus/state_processing/src/per_epoch_processing/single_pass.rs (L365-L368)

Which leaves one remaining usage which must be the culprit:

cfb1f73310/consensus/types/src/beacon_state.rs (L2109-L2113)

This indexing relies on the invariant that `self.pubkey_cache().len() <= self.validators.len()`. We mostly maintain that invariant, except for in `rebase_caches_on` (fixed in this PR).

The other bug is that we were calling `rebase_on_finalized` for all "hot" states, which post-v7.1.0 includes states prior to the split which are required by the hdiff grid. This is how we end up calling something like `genesis_state.rebase_on(&split_state)`, which then corrupts the pubkey cache of the genesis state using the newer pubkey cache from the split state.
2025-08-12 02:19:24 +00:00
Pawan Dhananjay
80ba0b169b Backfill peer attribution (#7762)
Partly addresses https://github.com/sigp/lighthouse/issues/7744


Implement peer sync attribution for backfill sync, similar to #7733.
2025-08-12 02:11:56 +00:00
Eitan Seri-Levi
122f16776f Add metrics to track beacon processor queue times (#7808)
This PR adds a `created_timestamp` to the beacon processor send channel. When work items are sent through that channel, `try_send` forwards the work event along with the current timestamp to the beacon processor. When the work event is completed, the `Drop` impl for `SendOnDrop` tracks the time from work event creation to completion. Previously we only had data on how long a work event took to process, not on how long it sat in the queue plus how long it took to process.
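
The timing-on-drop idea can be sketched with a hypothetical guard type. `TimedWork` is not the real `SendOnDrop`, and reporting over a std `mpsc` channel stands in for recording a metric:

```rust
use std::sync::mpsc::Sender;
use std::time::{Duration, Instant};

/// Hypothetical guard: reports creation-to-completion time when dropped.
struct TimedWork {
    /// Taken when the work item is created, i.e. before it is queued.
    created: Instant,
    /// Stand-in for a metrics histogram.
    report: Sender<Duration>,
}

impl TimedWork {
    fn new(report: Sender<Duration>) -> Self {
        Self {
            created: Instant::now(),
            report,
        }
    }
}

impl Drop for TimedWork {
    fn drop(&mut self) {
        // Elapsed time covers queueing plus processing.
        // Send errors are ignored: the receiver may already be gone at shutdown.
        let _ = self.report.send(self.created.elapsed());
    }
}
```

Because the timestamp is taken at creation rather than when a worker picks the item up, the reported duration includes time spent waiting in the queue.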
2025-08-12 01:06:42 +00:00
Pawan Dhananjay
4262ad3e01 Add a flag to disable getBlobs (#7853)
N/A


Add a flag to disable get blobs. I configured the flag to disable it regardless of version because it's most likely something we use for testing anyway.
2025-08-11 23:17:00 +00:00
Jimmy Chen
40c2fd5ff4 Instrument tracing spans for block processing and import (#7816)
#7815

- removes all existing spans, so some span fields that appear in logs like `service_name` may be lost.
- instruments a few key code paths in the beacon node, starting from **root spans** named below:

* Gossip block and blobs
* `process_gossip_data_column_sidecar`
* `process_gossip_blob`
* `process_gossip_block`
* Rpc block and blobs
* `process_rpc_block`
* `process_rpc_blobs`
* `process_rpc_custody_columns`
* Rpc blocks (range and backfill)
* `process_chain_segment`
* `PendingComponents` lifecycle
* `pending_components`

To test locally:
* Run Grafana and Tempo with https://github.com/sigp/lighthouse-metrics/pull/57
* Run Lighthouse BN with `--telemetry-collector-url http://localhost:4317`

Some captured traces can be found here: https://hackmd.io/@jimmygchen/r1sLOxPPeg

Removing the old spans seems to have reduced memory usage quite a lot. I think we were using them on long-running tasks and too excessively:
<img width="910" height="495" alt="image" src="https://github.com/user-attachments/assets/5208bbe4-53b2-4ead-bc71-0b782c788669" />
2025-08-08 05:32:22 +00:00
Jimmy Chen
6dfab22267 Fix Rust 1.89 compiler warnings in slasher tests. (#7844)
As described in title, failing test here

https://github.com/sigp/lighthouse/actions/runs/16818997885/job/47646515894
2025-08-08 04:41:08 +00:00
Daniel Ramirez-Chiquillo
cafb3644e2 Fix Makefile line continuation syntax in test-release target (#7834)
#7833


Fix a typo in the `Makefile` that was causing `make test` to run `http_api` tests when they should have been ignored.
2025-08-07 08:32:52 +00:00
Jimmy Chen
3a02bdd94a Adjust DA checker cache size (#7825)
The current `OVERFLOW_LRU_CAPACITY` of `1024` seems a bit excessive now that we rarely store more than 1 `PendingComponents` (under normal network conditions). Additionally, given the blob count increases, the max size of `PendingComponents` has also increased and is expected to increase further.

This PR brings the max capacity of the cache down to `64`, which should still leave more than enough headroom while giving us better protection from the network.
2025-08-07 05:11:38 +00:00
Jimmy Chen
8bc6693dac Fix wrong columns getting processed on a CGC change (#7792)
This PR fixes a bug where wrong columns could get processed immediately after a CGC increase.

Scenario:
- The node's CGC increased due to additional validators attached to it (let's say from 10 to 11)
- The new CGC is advertised and new subnets are subscribed immediately, however the change won't be effective in the data availability check until the next epoch (See [this](ab0e8870b4/beacon_node/beacon_chain/src/validator_custody.rs (L93-L99))). The data availability checker still only requires 10 columns for the current epoch.
- During this time, data columns for the additional custody column (let's say column 11) may arrive via gossip as we're already subscribed to the topic. Such a column may be incorrectly used to satisfy the existing data availability requirement (10 columns), so that this additional column (instead of a required one) gets persisted, resulting in database inconsistency.
2025-08-07 00:45:04 +00:00
Daniel Ramirez-Chiquillo
9c972201bc Fix: RPC test failures (#7734)
Fixes #7735


Use `tracing::subscriber::set_default` to ensure that each test/thread has its own subscriber.
2025-08-06 14:59:41 +00:00
Eric Tu
c06ac81c67 Shuffling for 32 bit platforms (#7725)
- In shuffling, the `raw_pivot` (u64) is cast to a usize, which will break on 32-bit systems. Now it is reduced modulo `list_size` first, then cast to a usize.
- ruint doesn't implement shifting with u64s on 32-bit architectures. Since `prefix_bits` is a u8 and `NODE_ID_BITS` = 256, we use them as u32s instead.

See: https://docs.rs/ruint/latest/src/ruint/bits.rs.html#711
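
The first point can be illustrated with a small sketch. The function names are hypothetical, and the truncating variant uses `u32` to simulate what `usize` does on a 32-bit target:

```rust
/// Reduce in u64 first, then cast: the result is below `list_size`,
/// so it always fits in a usize. `list_size` must be non-zero.
fn pivot_index(raw_pivot: u64, list_size: usize) -> usize {
    (raw_pivot % list_size as u64) as usize
}

/// Simulates the old order of operations on a 32-bit target: casting first
/// drops the high 32 bits before the modulo is applied.
fn pivot_index_truncating_32bit(raw_pivot: u64, list_size: u32) -> u32 {
    (raw_pivot as u32) % list_size
}
```

For `raw_pivot = 2^32 + 5` and `list_size = 7`, the first function returns 2 while the truncating variant returns 5, so the two orders of operations disagree as soon as the pivot exceeds 32 bits.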
2025-08-06 02:37:07 +00:00
Michael Sproul
0dcce40ccb Fix Clippy for Rust 1.90 beta (#7826)
Fix Clippy for recently released Rust 1.90 beta. There may be more changes required when Rust 1.89 stable is released in a few days, but possibly not 🤞
2025-08-05 13:52:26 +00:00
Jimmy Chen
adf6ad70f0 Update fetch blobs metrics buckets (#7823)
While looking at metrics I noticed that `beacon_blobs_from_el_expected` and `beacon_blobs_from_el_received_total` have different buckets. This PR adds more buckets to both (to prepare for Fusaka) and makes them consistent.
2025-08-01 18:27:53 +00:00