lighthouse

mirror of https://github.com/sigp/lighthouse.git synced 2026-04-17 21:08:32 +00:00

Author	SHA1	Message	Date
Jimmy Chen	9d2f55a399	Fix data column reconstruction error (#7998 ) Addresses #7991	2025-09-04 20:17:52 +00:00
Jimmy Chen	677de70025	Fix incorrect prune test logic (#7999) I just noticed that one of the tests I added in #7915 is incorrect, after it had been running flaky for a bit. This PR fixes the scenario and ensures the outcome will always be the same.	2025-09-04 19:53:38 +00:00
Pawan Dhananjay	84ec209eba	Allow AwaitingDownload to be a valid in-between state (#7984) Extracts (3) from https://github.com/sigp/lighthouse/pull/7946. Prior to peerdas, a batch should never have been in `AwaitingDownload` state because we immediately try to move from `AwaitingDownload` to `Downloading` state by sending batches. This was always possible as long as we had peers in the `SyncingChain` in the pre-peerdas world. However, this is no longer the case, as a batch can be stuck waiting in `AwaitingDownload` state if we have no peers to request the columns from. This PR makes `AwaitingDownload` an allowable in-between state. If a batch is found to be in this state, then we attempt to send the batch instead of erroring like before. Note to reviewer: We need to make sure that this doesn't lead to a bunch of batches stuck in `AwaitingDownload` state if the chain can be progressed. Backfill already retries all batches in `AwaitingDownload` state so we just need to make `AwaitingDownload` a valid state during processing and validation. This PR explicitly adds the same logic for forward sync to download batches stuck in `AwaitingDownload`. Apart from that, we also force download of the `processing_target` when sync stops progressing. This is required in cases where `self.batches` has > `BATCH_BUFFER_SIZE` batches that are waiting to get processed but the `processing_batch` has repeatedly failed at download/processing stage. This leads to sync getting stuck and never recovering.	2025-09-04 07:39:16 +00:00
Jimmy Chen	c2a92f1a8c	Maintain peers across all data column subnets (#7915) Closes: - #7865 - #7855 Changes extracted from earlier PR #7876. This PR fixes two main things, with a few other improvements mentioned below: - Prevent Lighthouse from repeatedly sending `DataColumnByRoot` requests to an unsynced peer, causing lookup sync to get stuck - Allow Lighthouse to send discovery requests if there aren't enough synced peers in the required sampling subnets - this fixes the stuck sync scenario where there aren't enough usable peers in the sampling subnets but no discovery is attempted. - Make peer discovery queries if custody subnet peer count drops below the minimum threshold - Update peer pruning logic to prioritise uniform distribution across all data column subnets and avoid pruning sampling peers if the count is below the target threshold (2) - Check sync status when making discovery requests, to make sure we don't ignore requests if there aren't enough synced peers in the required sampling subnets - Optimise some of the `PeerDB` functions checking custody peers - Only send lookup requests to peers that are synced or advanced	2025-09-04 05:36:20 +00:00
Michael Sproul	76adedff27	Simplify length methods on BeaconBlockBody (#7989 ) Just the low-hanging fruit from: - https://github.com/sigp/lighthouse/pull/7988	2025-09-04 00:08:29 +00:00
Jimmy Chen	10e72df331	Add `tls-roots` feature to `opentelemetry_otlp` to support exporting traces over https (#7987 )	2025-09-03 08:05:09 +00:00
chonghe	a93cafee08	Implement `selections` Beacon API endpoints to support DVT middleware (#7016) #6610 - Add `beacon_committee_selections` endpoint - Test beacon committee aggregator and confirmed working - Add `sync_committee_selections` endpoint - Test sync committee aggregator and confirmed working	2025-09-03 03:50:41 +00:00
Akihito Nakano	7b5be8b1e7	Remove ttfb_timeout and resp_timeout (#7925) `TTFB_TIMEOUT` was deprecated in https://github.com/ethereum/consensus-specs/pull/3767. Remove `ttfb_timeout` from `InboundUpgrade` and other related structs. (Update) Also removed `resp_timeout` and the `NetworkParams` struct, since its fields are no longer used. https://github.com/sigp/lighthouse/pull/7925#issuecomment-3226886352	2025-09-03 02:00:15 +00:00
Pawan Dhananjay	a9db8523a2	Update tracing (#7981 ) Update tracing subscriber for cargo audit failure https://rustsec.org/advisories/RUSTSEC-2025-0055	2025-09-03 02:00:12 +00:00
Jimmy Chen	eef02afc93	Fix data availability checker race condition causing partial data columns to be served over RPC (#7961) Partially resolves #6439, a simpler alternative to #7931. The race condition occurs when RPC data columns arrive after a block has been imported and removed from the DA checker: 1. Block becomes available via gossip 2. RPC columns arrive and pass fork choice check (block hasn't been imported) 3. Block import completes (removing block from DA checker) 4. RPC data columns finish verification and get imported into DA checker This causes two issues: 1. Partial data serving: Already imported components get re-inserted, potentially causing LH to serve incomplete data 2. State cache misses: Leads to state reconstruction, holding the availability cache write lock longer and increasing race likelihood Proposed changes: 1. Never manually remove pending components from DA checker. Components are only removed via LRU eviction as finality advances. This makes sure we don't run into the issue described above. 2. Use `get` instead of `pop` when recovering the executed block; this prevents cache misses in the race condition and should reduce its likelihood. 3. Refactor DA checker to drop write lock as soon as components are added. This should also reduce the likelihood of the race condition. Trade-offs: This solution eliminates a few nasty race conditions while keeping things simple, at the cost of allowing block re-import (already existing). The increase in memory in DA checker can be partially offset by a reduction in block cache size if this really becomes an issue (as we now serve recent blocks from DA checker).	2025-09-02 07:18:23 +00:00
Jimmy Chen	979ed2557c	Remove `expect` usage in `kzg_utils` (#7957) Remove `expect` usage in `kzg_utils` to handle the case where the EL sends us an invalid proof size, instead of crashing.	2025-09-01 09:21:26 +00:00
kevaundray	9cc3c0553b	chore: small refactor of `epoch` method (#7902) Stylistic; mostly using early returns to avoid the nested logic.	2025-09-01 09:21:23 +00:00
Eitan Seri-Levi	c7492f1c27	Update to `1.6.0 alpha.6` spec (#7967 ) Upgrade `rust_eth_kzg` library to `0.9` to support the new cell index sorting tests in `recover_cells_and_kzg_proofs` https://github.com/ethereum/consensus-specs/releases https://github.com/crate-crypto/rust-eth-kzg/compare/v0.8.1...v0.9.0	2025-09-01 08:56:25 +00:00
Sam Wilson	477c534cd7	Remove dependency on target_info. (#7964 ) Remove dependency on target_info, use standard library instead.	2025-09-01 06:03:55 +00:00
Paul Etscheit	66edda2690	Impl ForkVersionDecode for beacon state (#7954 )	2025-09-01 02:22:40 +00:00
Jimmy Chen	438fb65d45	Avoid serving validator endpoints while the node is far behind syncing head (#7962) A performance issue was discovered when devnet-3 was under non-finality - some of the lighthouse nodes were "stuck" syncing because of handling proposer duties HTTP requests. These validator requests are higher priority than Status processing, and if they take a long time to process, the node won't be able to progress. What's worse, under a long period of non-finality, the proposer duties calculation function tries to do state advance for a large number of slots (see `d545ddcbc7/beacon_node/beacon_chain/src/beacon_proposer_cache.rs (L183)`), causing the node to spend all its CPU time on a task that doesn't really help, e.g. the computed duties aren't useful if the node is 20000 slots behind. To solve this issue, we use the `not_while_syncing` filter to prevent serving these requests until the node is synced. This should allow the node to focus on sync under non-finality situations.	2025-08-29 03:01:40 +00:00
Jimmy Chen	a134d43446	Use `rayon` to speed up batch KZG verification (#7921 ) Addresses #7866. Use Rayon to speed up batch KZG verification during range / backfill sync. While I was analysing the traces, I also discovered a bug that resulted in only the first 128 columns in a chain segment batch being verified. This PR fixes it, so we might actually observe slower range sync due to more cells being KZG verified. I've also updated the handling of batch KZG failure to only find the first invalid KZG column when verification fails as this gets very expensive during range/backfill sync.	2025-08-29 00:59:40 +00:00
Pawan Dhananjay	b6792d85d2	Reduce backfill batch buffer size (#7958) Currently, backfill is allowed to create up to 20 pending batches, which is unnecessarily high imo. Forward sync allows a max of 5 batches to be buffered at a time. This PR reduces the backfill batch buffer size to match forward sync. Having a high number of batches is a little annoying with peerdas because we try to create and send 20 requests (even though we are processing them in a rate-limited manner). Requests with peerdas are a lot heavier, as we distribute requests across multiple peers, leading to a lot of requests that may keep getting retried. This could take resources away from processing at head.	2025-08-28 03:31:31 +00:00
Jimmy Chen	c13fb2fb46	Instrument `publish_block` code path (#7945) Instrument `publish_block` code path and log dropped data columns when publishing. Example spans were attached as screenshots (running the devnet from my laptop, so the numbers aren't great).	2025-08-28 03:31:29 +00:00
Jimmy Chen	746da7ffd5	Fix doppelganger protection script (#7959) Previously `kurtosis service inspect` gave us output like this, with flags on separate lines: ``` CMD: lighthouse beacon_node --debug-level=debug --datadir=/data/lighthouse/beacon-data --listen-address=0.0.0.0 --port=9000 --http --http-address=0.0.0.0 --http-port=4000 --disable-packet-filter --execution-endpoints=http://172.16.0.8:8551 --jwt-secrets=/jwt/jwtsecret --suggested-fee-recipient=0x8943545177806ED17B9F23F0a21ee5948eCaa776 --disable-enr-auto-update --enr-address=172.16.0.11 ``` In the latest version this has been updated to a single line: ``` CMD: exec lighthouse beacon_node --debug-level=debug --datadir=/data/lighthouse/beacon-data --listen-address=0.0.0.0 --port=9000 --http --http-address=0.0.0.0 --http-port=4000 --disable-packet-filter --execution-endpoints=http://172.16.0.12:8551 --jwt-secrets=/jwt/jwtsecret --suggested-fee-recipient=0x8943545177806ED17B9F23F0a21ee5948eCaa776 --disable-enr-auto-update --enr-address=172.16.0.18 --enr-tcp-port=9000 --enr-udp-port=9000 --enr-quic-port=9001 --quic-port=9001 --metrics --metrics-address=0.0.0.0 --metrics-allow-origin=* --metrics-port=5054 --enable-private-discovery --testnet-dir=/network-configs --boot-nodes=enr:-N24QPYP7bj0aqoM2dXsP5hnosW27U6PTYJt1kYFhNkwIvlFQhGJ1om7f4zcHhVJwvUL7wCsVbDJbP_l-TF8X3q4pVEDh2F0dG5ldHOIAAAwAAAAAACGY2xpZW500YpMaWdodGhvdXNlhTcuMS4whGV0aDKQqFs_bWAAADj__________4JpZIJ2NIJpcISsEAAPhHF1aWOCIymJc2VjcDI1NmsxoQK_z4HQylgsOal74Jek9D_EhY0vcDX5AcLHnPD7iOeEdYhzeW5jbmV0cwCDdGNwgiMog3VkcIIjKA --target-peers=3 ``` and it broke our script. This PR updates the extraction logic.	2025-08-28 02:48:43 +00:00
Michael Sproul	d235f2c697	Delete `RuntimeVariableList::from_vec` (#7930 ) This method is a footgun because it truncates the list. It is the source of a recent bug: - https://github.com/sigp/lighthouse/pull/7927 - Delete uses of `RuntimeVariableList::from_vec` and replace them with `::new` which does validation and can fail. - Propagate errors where possible, unwrap in tests and use `expect` for obviously-safe uses (in `chain_spec.rs`).	2025-08-27 06:52:14 +00:00
Pawan Dhananjay	ccf03e1c88	Fix data columns by range returning all columns (#7942) In https://github.com/sigp/lighthouse/pull/7897, we seem to have modified data columns by range to return all the columns we have for the requested epoch, disregarding which columns the peer requested.	2025-08-27 05:00:34 +00:00
Barnabas Busa	2b33fe6620	Update to spec v1.6.0-alpha.5 (#7910 ) - https://github.com/ethereum/consensus-specs/pull/4508	2025-08-27 03:59:21 +00:00
Jimmy Chen	8901c7417d	Notify lookup after gossip data column processing resulted in an import (#7940 ) When gossip data column processing completes and results in a block import, sync is currently not notified of the successful import. This is inconsistent with how blob processing and block processing both notify sync. This fix ensures lookup sync receives block import notifications when blocks become available through gossip data column.	2025-08-27 01:32:17 +00:00
Jimmy Chen	3e78034de6	Add `BEACON_PROCESSOR_WORKERS_ACTIVE_GAUGE_BY_TYPE` metric (#7935 ) Similar to `BEACON_PROCESSOR_WORKERS_ACTIVE_TOTAL` but this metric also records the work type. This is useful in identifying the task when a worker is stuck due to a deadlock or something else, and usually difficult to debug in production / release mode.	2025-08-26 06:46:14 +00:00
Jimmy Chen	78b4cca46b	Run sync tests on CI by default. (#7929) This PR enables some sync tests by default on CI - this will help catch breakages on sync happy paths, e.g. this would have caught the bug #7926 if lighthouse is not able to sync OR serve sync requests. The enabled tests are genesis-sync with 120s / 300s offline time, and should cover both serving / consuming by root and by range requests. I'm leaving the checkpoint sync tests optional, as they have external dependencies on checkpoint servers (which may cause CI instability) and may put extra load on them.	2025-08-26 02:49:50 +00:00
Mac L	e438691683	Add Gloas boilerplate (#7728 ) Adds the required boilerplate code for the Gloas (Glamsterdam) hard fork. This allows PRs testing Gloas-candidate features to test fork transition. This also includes de-duplication of post-Bellatrix readiness notifiers from #6797 (credit to @dapplion)	2025-08-26 02:49:48 +00:00
Jimmy Chen	daf1c7c3af	Fix RPC blocks not getting fully KZG verified (#7927 ) Fix RPC blocks not getting fully KZG verified due to incorrect list truncation.	2025-08-25 16:46:16 +00:00
Jimmy Chen	747d9118ff	Fix `DataColumnsByRoot` request limit validation bug (#7928) Fixes #7926. This was a bug I introduced in #7890; @pawanjay176 noticed it on some running nodes and added an RPC test to confirm it. The culprit is this line, where I failed to fill the vec to its max size, so it doesn't calculate the max size properly, resulting in all `DataColumnByRoot` requests exceeding the max size during validation: `d24a6d2a45/consensus/types/src/chain_spec.rs (L1984)` The PR fixes this and includes new regression tests for this fix.	2025-08-25 04:13:36 +00:00
Mac L	c41d1181d2	Use Fork variants instead of version for JsonPayload types (#7909 ) With Fulu, we increment the engine API version for `get_payload` but we do not also increment `new_payload`. In Lighthouse, we have a tight coupling between these version numbers and the Fork variants. For example, both `get_payload_v3` and `new_payload_v3` correspond to Deneb, `v4` to Electra, etc. However this coupling breaks with Fulu where only `get_payload_v5` is related to Fulu and `new_payload_v4` now also corresponds to Fulu (new_payload_v5 does not exist). While we can work around this, it creates a confusing situation where the versions and Fork variants are out of sync. See the following code snippet where we are using the v4 endpoint, and yet passing a `V5` payload variant: `522bd9e9c6/beacon_node/execution_layer/src/engine_api/http.rs (L849-L870)` 1. Remove `new_payload_v5` as it is unused in Fulu. 2. Rename the `JsonExecutionPayload` and `JsonGetExecutionPayloadResponse` types to use Fork variants instead of version variants. This clarifies the relationship between them.	2025-08-22 09:22:41 +00:00
João Oliveira	884f30094a	use DEFAULT_TARGET_PEERS for target peers everywhere (#7916) Was going to leave this as a comment on #7877 but then noticed it had already been merged. We have `DEFAULT_TARGET_PEERS`, which was set to 50 and only used in the `Default` impl for `peer_manager`'s `Config`, which then gets overridden by the default of `lighthouse_network::Config`. This PR unifies everything on `DEFAULT_TARGET_PEERS`.	2025-08-22 00:24:24 +00:00
Jimmy Chen	d24a6d2a45	Prioritise `StatusV2` over `StatusV1` RPC protocol (#7912 ) Prioritise `StatusV2` over `StatusV1` RPC protocol. A bug discovered during devnet-4 testing and extracted from the sync fixes PR #7876.	2025-08-21 23:02:18 +00:00
João Oliveira	cee30d8ca5	Update lighthouse to the latest upstream libp2p and gossipsub (#7828 )	2025-08-21 07:57:46 +00:00
Age Manning	c9ffdf7f71	Re-assess Lighthouse's peer count for Fusaka (#7877 )	2025-08-21 06:12:53 +00:00
Jimmy Chen	f19d4f6af1	Implement tracing spans for data column RPC requests and responses (#7831) #7830	2025-08-20 23:35:51 +00:00
Jimmy Chen	2d223575d6	Avoid unnecessary database lookups in data column RPC requests (#7897) This PR is an optimisation to avoid unnecessary database lookups when a peer requests data columns that the node doesn't custody (advertised via `cgc`). An extreme but realistic example: a full node only stores 4 custody columns by default, but it may receive a range request of 32 slots with all 128 columns, and this would result in 4096 database lookups, of which the node is only able to serve 128 (4 * 32). - Filter data column RPC requests (`DataColumnsByRoot`, `DataColumnsByRange`) to only look up columns the node custodies - Prevents unnecessary database queries that would always fail for non-custody columns	2025-08-20 05:08:53 +00:00
Jimmy Chen	f6859b1137	Add tempo to local testnet config and update fulu kurtosis config files (#7898 ) This PR adds tempo to kurtosis config and will collect lighthouse traces on kurtosis local testnet. The traces can be viewed / queried from Grafana. Also updated fulu kurtosis configs to use latest geth image.	2025-08-20 02:30:11 +00:00
Jimmy Chen	b4704eab4a	Fulu update to spec v1.6.0-alpha.4 (#7890 ) Fulu update to spec [v1.6.0-alpha.4](https://github.com/ethereum/consensus-specs/releases/tag/v1.6.0-alpha.4). - Make `number_of_columns` a preset - Optimise `get_custody_groups` to avoid computing if cgc = 128 - Add support for additional typenum values in type_dispatch macro	2025-08-20 02:05:04 +00:00
Jimmy Chen	95882bfa66	Add `--telemetry-service-name` CLI flag for OpenTelemetry service name override (#7903) Allows users to customize the OpenTelemetry service name instead of using the hardcoded default `lighthouse`. Defaults to 'lighthouse-bn' for beacon node, 'lighthouse-vc' for validator client, or 'lighthouse' for other subcommands. This is useful when analysing traces from multiple nodes, for example with service name overrides in Kurtosis (`ethereum-package` PR: https://github.com/ethpandaops/ethereum-package/pull/1160).	2025-08-20 01:16:34 +00:00
Jimmy Chen	34dd1b27ae	Revise data column rpc limits and queue sizes (#7887 ) Revise data column rpc limits and queue sizes. Also removed some outdated TODOs for Fulu / das.	2025-08-19 03:48:08 +00:00
Daniel Knopik	1fd7ead010	Do not filter validators by status if filter is an empty list (#7884 ) `69d2feb12a/apis/beacon/states/validators.yaml (L128-L130)` says we need to not filter if the filter is an empty list. Add a check for `statuses.is_empty()`.	2025-08-18 07:46:37 +00:00
Michael Sproul	836c39efaa	Shrink persisted fork choice data (#7805 ) Closes: - https://github.com/sigp/lighthouse/issues/7760 - [x] Remove `balances_cache` from `PersistedForkChoiceStore` (~65 MB saving on mainnet) - [x] Remove `justified_balances` from `PersistedForkChoiceStore` (~16 MB saving on mainnet) - [x] Remove `balances` from `ProtoArray`/`SszContainer`. - [x] Implement zstd compression for votes - [x] Fix bug in justified state usage - [x] Bump schema version to V28 and implement migration.	2025-08-18 06:03:28 +00:00
Michael Sproul	08234b2823	Add rustfmt config with edition 2024 (#7888 ) Since we updated to edition 2024 my Vim plugin for rustfmt is formatting code incorrectly, with 2018 settings: `889b9a7515/autoload/rustfmt.vim (L74-L75)` Arguably this plugin is a bit junk, but I think it's fairly harmless to add this config. Add `rustfmt.toml`. This is a generic config file for `rustfmt` which is probably useful for `rustfmt` integration with other editors too. We may want to add other config to `rustfmt.toml` over time as well, I think this was discussed recently.	2025-08-18 04:32:58 +00:00
Jimmy Chen	aa8cba3741	Upgrade rust-eth-kzg to `0.8.0` (#7870 ) #7864 The main breaking change in v0.8.0 is the `TrustedSetup` initialisation - it now requires a json string via `PeerDASTrustedSetup::from_json`.	2025-08-18 02:52:39 +00:00
Age Manning	9200042910	Transition network key to hex format (#7665 ) #7181 Instead of storing the network key as binary data we store it as hex, allowing users to modify it via the file. We can read old-binary forms, however we will migrate binary to hex as it will be the new standard.	2025-08-15 07:12:19 +00:00
Michael Sproul	317dc0f56c	Fix malloc_utils features (sysmalloc) (#7770 ) Follow-up to: - https://github.com/sigp/lighthouse/pull/7764 The `heaptrack` feature added in my previous PR was ineffective, because the jemalloc feature was turned on by the Linux target-specific dependency. This PR tweaks the features such that: - The jemalloc feature is just used to control whether jemalloc is compiled in. It is enabled on Linux by the target-specific dependency (see `lighthouse/Cargo.toml`), and completely disabled on Windows. - If the `sysmalloc` feature is set on Linux then it overrides jemalloc when selecting an allocator, _even if_ the jemalloc feature is enabled (and the jemalloc dep was compiled).	2025-08-15 03:46:38 +00:00
Eitan Seri-Levi	90fa7c216e	Fix ssz formatting for `/light_client/updates` beacon API endpoint (#7806 ) #7759 We were incorrectly encoding the full response from `/light_client/updates` instead of only encoding the light client update	2025-08-15 03:17:29 +00:00
Michael Sproul	42f6d7b02d	Yeet env_logger into the sun (#7872 ) - Remove explicit `env_logger` usage from `state_processing` tests and `lcli`. - Set up tracing correctly for `lcli` (I've checked that we can see logs after this change). - I didn't do anything to set up logging for the `state_processing` tests, as these are rarely run manually (they never fail). We could add `test_logger` in there on an as-needed basis.	2025-08-15 03:17:26 +00:00
antondlr	5ebb44e222	Try using sccache instead of disabling (#7873 ) We temporarily can't build sccache on windows runners, but it's still available on linux. this smol change lets us use it when available, instead of disabling across the board. The Windows runners now have a conditional check to disable (unset the `rustc-wrapper` env var) sccache in their entrypoint, just like the Linux ones have. Also the workflows no longer fail when `sccache --show-stats` fails.	2025-08-14 00:10:13 +00:00
Age Manning	ee1b0ae2ff	Allow for sync state where batch is unknown (#7391 )	2025-08-13 06:00:49 +00:00
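
Commit d235f2c697 above describes `RuntimeVariableList::from_vec` as a footgun because it truncates the list, and replaces it with `::new`, which validates and can fail. The difference can be sketched as follows. This is a minimal illustration using a hypothetical `BoundedList` type; the type, its method names, and `ListError` are assumptions made for the example and are not Lighthouse's actual API.

```rust
/// Hypothetical list with a runtime maximum length (illustration only,
/// not Lighthouse's `RuntimeVariableList`).
#[derive(Debug, PartialEq)]
pub struct BoundedList<T> {
    items: Vec<T>,
    max_len: usize,
}

/// Hypothetical error returned when the input exceeds the maximum length.
#[derive(Debug, PartialEq)]
pub enum ListError {
    TooLong { len: usize, max_len: usize },
}

impl<T> BoundedList<T> {
    /// Truncating constructor: silently drops every item past `max_len`.
    /// The caller gets no signal that data was lost.
    pub fn from_vec_truncating(mut items: Vec<T>, max_len: usize) -> Self {
        items.truncate(max_len);
        Self { items, max_len }
    }

    /// Validating constructor: rejects oversized input instead of truncating.
    pub fn new(items: Vec<T>, max_len: usize) -> Result<Self, ListError> {
        if items.len() > max_len {
            return Err(ListError::TooLong { len: items.len(), max_len });
        }
        Ok(Self { items, max_len })
    }

    /// Number of items actually stored.
    pub fn len(&self) -> usize {
        self.items.len()
    }

    /// True if no items are stored.
    pub fn is_empty(&self) -> bool {
        self.items.is_empty()
    }

    /// The maximum length this list was constructed with.
    pub fn max_len(&self) -> usize {
        self.max_len
    }
}
```

With a maximum length of 2, `from_vec_truncating(vec![1, 2, 3, 4], 2)` quietly keeps two items, whereas `new(vec![1, 2, 3, 4], 2)` returns an error that the caller has to handle or propagate.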

7026 Commits
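
The arithmetic in commit 2d223575d6 (32 slots with all 128 columns giving 4096 lookups, of which only 4 * 32 = 128 can succeed) can be checked with a small sketch. The function names and the use of a `HashSet<u64>` for the custody set are assumptions for illustration; they do not reflect how Lighthouse implements the filter.

```rust
use std::collections::HashSet;

/// Keep only the requested column indices that this node custodies,
/// preserving the order of the request. (Hypothetical helper.)
pub fn filter_to_custody(requested: &[u64], custody: &HashSet<u64>) -> Vec<u64> {
    requested
        .iter()
        .copied()
        .filter(|column| custody.contains(column))
        .collect()
}

/// Number of (slot, column) database lookups needed to serve a range
/// request covering `slots` slots and the given columns. (Hypothetical helper.)
pub fn lookup_count(slots: u64, columns: &[u64]) -> u64 {
    slots * columns.len() as u64
}
```

For a request of columns 0..128 over 32 slots against a node that custodies 4 columns, `lookup_count` drops from 4096 before filtering to 128 after it.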