1. In the batch retry logic, we were failing to set the batch state to `AwaitingDownload` before attempting a retry. This PR sets it to `AwaitingDownload` before the retry and sets it back to `Downloading` if the retry succeeded in sending out a request.
2. Remove all peer scoring logic from retrying and rely on just deprioritizing the failed peer. I finally concede the point to @dapplion 😄
3. Changes `block_components_by_range_request` to accept `block_peers` and `column_peers`. This is to ensure that we use the full synced peerset for requesting columns in order to avoid splitting the column peers among multiple head chains. During forward sync, we want the block peers to be the peers from the syncing chain and column peers to be all synced peers from the peerdb.
Also, fixes a typo and calls `attempt_send_awaiting_download_batches` from more places
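
A minimal sketch of the state handling described in point 1, assuming a much simplified batch type. `Batch`, `BatchState`, `SendError` and `retry` here are hypothetical stand-ins for illustration, not the actual Lighthouse types:

```rust
/// Simplified stand-in for a range sync batch state (hypothetical).
#[derive(Debug, Clone, PartialEq, Eq)]
enum BatchState {
    AwaitingDownload,
    Downloading { request_id: u32 },
}

/// Hypothetical send failure: no request left the node.
#[derive(Debug, PartialEq, Eq)]
enum SendError {
    NoPeer,
}

struct Batch {
    state: BatchState,
}

impl Batch {
    /// Reset to `AwaitingDownload` first, then move to `Downloading` only if
    /// `send` actually managed to dispatch a request.
    fn retry(
        &mut self,
        send: impl FnOnce() -> Result<u32, SendError>,
    ) -> Result<(), SendError> {
        self.state = BatchState::AwaitingDownload;
        let request_id = send()?;
        self.state = BatchState::Downloading { request_id };
        Ok(())
    }
}
```

If the send fails, the batch stays in `AwaitingDownload`, so a later pass over awaiting batches can pick it up again.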
Co-Authored-By: Pawan Dhananjay <pawandhananjay@gmail.com>
I was looking into some long `PendingComponents` spans and noticed the block event wasn't added to the span, so it wasn't possible to see when the block was added from the trace view. This PR fixes this.
<img width="637" height="430" alt="image" src="https://github.com/user-attachments/assets/65040b1c-11e7-43ac-951b-bdfb34b665fb" />
Additionally, I've noticed a lot of noise and confusion in sync logs due to the initial `peer_id` being included as part of the syncing chain span, causing all logs under the span to have that `peer_id`, which may not be accurate for some sync logs. I've removed `peer_id` from the `SyncingChain` span, and also cleaned up a bunch of spans to use `%` (display) for slots and epochs to make logs easier to read.
Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
This PR fixes a bug where wrong columns could get processed immediately after a CGC increase.
Scenario:
- The node's CGC increased due to additional validators attached to it (let's say from 10 to 11)
- The new CGC is advertised and new subnets are subscribed immediately, however the change won't be effective in the data availability check until the next epoch (See [this](ab0e8870b4/beacon_node/beacon_chain/src/validator_custody.rs (L93-L99))). The data availability checker still only requires 10 columns for the current epoch.
- During this time, data columns for the additional custody column (let's say column 11) may arrive via gossip as we're already subscribed to the topic, and they may be incorrectly used to satisfy the existing data availability requirement (10 columns), resulting in this additional column (instead of a required one) getting persisted and leaving the database inconsistent.
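
To make the scenario concrete, here is a rough sketch of one way to gate columns on the custody set that is effective for a given epoch. `CustodySchedule` and its methods are hypothetical and only illustrate the idea; they are not the code in this PR:

```rust
use std::collections::BTreeSet;

/// Hypothetical record of a pending CGC change.
struct CustodySchedule {
    /// Columns required before the change takes effect.
    current: BTreeSet<u64>,
    /// Columns required from `effective_epoch` onwards.
    next: BTreeSet<u64>,
    effective_epoch: u64,
}

impl CustodySchedule {
    /// Custody columns the availability check requires at `epoch`.
    fn required_columns(&self, epoch: u64) -> &BTreeSet<u64> {
        if epoch >= self.effective_epoch {
            &self.next
        } else {
            &self.current
        }
    }

    /// A gossip column should only count towards availability if it is
    /// required for the epoch it belongs to.
    fn counts_towards_availability(&self, column_index: u64, epoch: u64) -> bool {
        self.required_columns(epoch).contains(&column_index)
    }
}
```

With `current` holding 10 columns and `next` holding 11, the extra column is ignored by the availability check until `effective_epoch`, even though it already arrives via gossip.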
Which issue # does this PR address?
Closes #7604
Improvements to range sync including:
1. Contain column requests only to peers that are part of the SyncingChain
2. Attribute the fault to the correct peer and downscore them if they don't return the data columns for the request
3. Improve sync performance by retrying only the failed columns from other peers instead of failing the entire batch
4. Uses the `earliest_available_slot` to make requests to peers that claim to have the epoch. Note: if no `earliest_available_slot` info is available, fall back to the previous logic, i.e. assume the peer has everything backfilled up to the WS checkpoint/DA boundary
Tested this on fusaka-devnet-2 with a full node and supernode and the recovering logic seems to work well.
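
A small sketch of the peer filtering in point 4, under the assumption that a peer either advertises an earliest available slot or not. `PeerInfo`, `peers_for_batch` and `fallback_boundary_slot` are hypothetical names used only for illustration:

```rust
/// Hypothetical view of what a peer has advertised.
struct PeerInfo {
    id: u32,
    /// `None` when the peer has not shared an earliest available slot.
    earliest_available_slot: Option<u64>,
}

/// Returns the ids of peers that claim to have data from `batch_start_slot`.
/// Peers without the field are assumed to have everything back to
/// `fallback_boundary_slot` (standing in for the WS checkpoint/DA boundary).
fn peers_for_batch(
    peers: &[PeerInfo],
    batch_start_slot: u64,
    fallback_boundary_slot: u64,
) -> Vec<u32> {
    peers
        .iter()
        .filter(|p| {
            p.earliest_available_slot
                .unwrap_or(fallback_boundary_slot)
                <= batch_start_slot
        })
        .map(|p| p.id)
        .collect()
}
```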
Also tested this a little on mainnet.
Need to do more testing and possibly add some unit tests.
- Re-opens https://github.com/sigp/lighthouse/pull/6864 targeting unstable
Range sync and backfill sync still assume that each batch request is done by a single peer. This assumption breaks with PeerDAS, where we request custody columns from N peers.
Issues with current unstable:
- Peer prioritization counts batch requests per peer. This accounting is broken now, as data columns by range requests are not accounted for
- Peer selection for data columns by range ignores the set of peers on a syncing chain, and instead draws from the global pool of peers
- The implementation is very strict when we have no peers to request from. After PeerDAS this case is very common, and we want to be flexible and handle it better than just hard failing everything.
- [x] Upstream peer prioritization to the network context, it knows exactly how many active requests a peer has (including columns by range)
- [x] Upstream peer selection to the network context, now `block_components_by_range_request` gets a set of peers to choose from instead of a single peer. If it can't find a peer, it returns the error `RpcRequestSendError::NoPeer`
- [ ] Range sync and backfill sync handle `RpcRequestSendError::NoPeer` explicitly
- [ ] Range sync: leaves the batch in `AwaitingDownload` state and does nothing. **TODO**: we should have some mechanism to fail the chain if it's stale for too long - **EDIT**: Not done in this PR
- [x] Backfill sync: pauses the sync until another peer joins - **EDIT**: Same logic as unstable
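
As a rough illustration of the peer selection described above, assuming selection by lowest number of active requests. The `RpcRequestSendError` enum below is a local stand-in with only the `NoPeer` variant, and `select_peer` is a hypothetical helper, not the network context API:

```rust
use std::collections::HashMap;

/// Local stand-in for the error type named above (only `NoPeer` modelled).
#[derive(Debug, PartialEq, Eq)]
enum RpcRequestSendError {
    NoPeer,
}

/// Pick the candidate with the fewest active requests. Ties are broken by the
/// lower peer id so the result is deterministic in this sketch.
fn select_peer(
    candidates: &[u32],
    active_requests: &HashMap<u32, usize>,
) -> Result<u32, RpcRequestSendError> {
    candidates
        .iter()
        .copied()
        .min_by_key(|peer| (active_requests.get(peer).copied().unwrap_or(0), *peer))
        .ok_or(RpcRequestSendError::NoPeer)
}
```

Returning `NoPeer` instead of failing hard lets the caller decide what to do, e.g. leave a batch in `AwaitingDownload`.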
### TODOs
- [ ] Add tests :)
- [x] Manually test backfill sync
Note: this touches the mainnet path!
- PR https://github.com/sigp/lighthouse/pull/6497 made obsolete some consistency checks inside the batch
I forgot to remove the consumers of those errors
Remove the unused batch sync error condition, which was a nested `Result<_, Result<_, E>>`
Part of
- https://github.com/sigp/lighthouse/issues/6258
To address PeerDAS sync issues we need to make individual by_range requests within a batch retriable. We should adopt the same pattern as lookup sync, where each request (block/blobs/columns) is tracked individually within a "meta" request that groups them all and handles retry logic.
- Building on https://github.com/sigp/lighthouse/pull/6398
The second step is to add individual request accumulators for `blocks_by_range`, `blobs_by_range`, and `data_columns_by_range`. This will allow each request to progress independently and be retried separately.
Most of the logic is just piping, excuse the large diff. This PR does not change the logic of how requests are handled or retried. This will be done in a future PR changing the logic of `RangeBlockComponentsRequest`.
### Before
- Sync manager receives block with `SyncRequestId::RangeBlockAndBlobs`
- Insert block into `SyncNetworkContext::range_block_components_requests`
- (If received stream terminators of all requests)
- Return `Vec<RpcBlock>`, and insert into `range_sync`
### Now
- Sync manager receives block with `SyncRequestId::RangeBlockAndBlobs`
- Insert block into `SyncNetworkContext::blocks_by_range_requests`
- (If received stream terminator of this request)
- Return `Vec<SignedBlock>`, and insert into `SyncNetworkContext::components_by_range_requests`
- (If received a result for all requests)
- Return `Vec<RpcBlock>`, and insert into `range_sync`
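
The "Now" flow can be sketched roughly as below. `RequestAccumulator` and `ComponentsByRange` are hypothetical simplifications (blocks and columns are plain `u64`s here), meant only to show each request finishing independently before the grouped result is produced:

```rust
/// Hypothetical per-request accumulator: collects items until the stream
/// terminator arrives.
struct RequestAccumulator<T> {
    items: Vec<T>,
    finished: bool,
}

impl<T> RequestAccumulator<T> {
    fn new() -> Self {
        Self { items: Vec::new(), finished: false }
    }

    /// `Some(item)` is a response chunk, `None` is the stream terminator.
    fn on_response(&mut self, item: Option<T>) {
        match item {
            Some(i) => self.items.push(i),
            None => self.finished = true,
        }
    }
}

/// Hypothetical "meta" request grouping the individual by_range requests.
struct ComponentsByRange {
    blocks: RequestAccumulator<u64>,
    columns: RequestAccumulator<u64>,
}

impl ComponentsByRange {
    /// Returns the grouped result only once every child request has finished.
    fn try_finish(&self) -> Option<(Vec<u64>, Vec<u64>)> {
        if self.blocks.finished && self.columns.finished {
            Some((self.blocks.items.clone(), self.columns.items.clone()))
        } else {
            None
        }
    }
}
```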
* 1D PeerDAS prototype: Data format and Distribution (#5050)
* Build and publish column sidecars. Add stubs for gossip.
* Add blob column subnets
* Add `BlobColumnSubnetId` and initial compute subnet logic.
* Subscribe to blob column subnets.
* Introduce `BLOB_COLUMN_SUBNET_COUNT` based on DAS configuration parameter changes.
* Fix column sidecar type to use `VariableList` for data.
* Fix lint errors.
* Update types and naming to latest consensus-spec #3574.
* Fix test and some cleanups.
* Merge branch 'unstable' into das
* Merge branch 'unstable' into das
* Merge branch 'unstable' into das
# Conflicts:
# consensus/types/src/chain_spec.rs
* Add `DataColumnSidecarsByRoot` req/resp protocol (#5196)
* Add stub for `DataColumnsByRoot`
* Add basic implementation of serving RPC data column from DA checker.
* Store data columns in early attester cache and blobs db.
* Apply suggestions from code review
Co-authored-by: Eitan Seri-Levi <eserilev@gmail.com>
Co-authored-by: Jacob Kaufmann <jacobkaufmann18@gmail.com>
* Fix build.
* Store `DataColumnInfo` in database and various cleanups.
* Update `DataColumnSidecar` ssz max size and remove panic code.
---------
Co-authored-by: Eitan Seri-Levi <eserilev@gmail.com>
Co-authored-by: Jacob Kaufmann <jacobkaufmann18@gmail.com>
* feat: add DAS KZG in data col construction (#5210)
* feat: add DAS KZG in data col construction
* refactor data col sidecar construction
* refactor: add data cols to GossipVerifiedBlockContents
* Disable windows tests for `das` branch. (c-kzg doesn't build on windows)
* Formatting and lint changes only.
* refactor: remove iters in construction of data cols
* Update vec capacity and error handling.
* Add `data_column_sidecar_computation_seconds` metric.
---------
Co-authored-by: Jimmy Chen <jchen.tc@gmail.com>
* Merge branch 'unstable' into das
# Conflicts:
# .github/workflows/test-suite.yml
# beacon_node/lighthouse_network/src/types/topics.rs
* fix: update data col subnet count from 64 to 32 (#5413)
* feat: add peerdas custody field to ENR (#5409)
* feat: add peerdas custody field to ENR
* add hash prefix step in subnet computation
* refactor test and fix possible u64 overflow
* default to min custody value if not present in ENR
* Merge branch 'unstable' into das
* Merge branch 'unstable' into das-unstable-merge-0415
# Conflicts:
# Cargo.lock
# beacon_node/beacon_chain/src/data_availability_checker.rs
# beacon_node/beacon_chain/src/data_availability_checker/availability_view.rs
# beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs
# beacon_node/beacon_chain/src/data_availability_checker/processing_cache.rs
# beacon_node/lighthouse_network/src/rpc/methods.rs
# beacon_node/network/src/network_beacon_processor/mod.rs
# beacon_node/network/src/sync/block_lookups/tests.rs
# crypto/kzg/Cargo.toml
* Merge remote-tracking branch 'sigp/unstable' into das
* Merge remote-tracking branch 'sigp/unstable' into das
* Fix merge conflicts.
* Send custody data column to `DataAvailabilityChecker` for determining block importability (#5570)
* Only import custody data columns after publishing a block.
* Add `subscribe-all-data-column-subnets` and pass custody column count to `availability_cache`.
* Add custody requirement checks to `availability_cache`.
* Fix config not being passed to DAChecker and add more logging.
* Introduce `peer_das_epoch` and make blobs and columns mutually exclusive.
* Add DA filter for PeerDAS.
* Fix data availability check and use test_logger in tests.
* Fix subscribe to all data column subnets not working correctly.
* Fix tests.
* Only publish column sidecars if PeerDAS is activated. Add `PEER_DAS_EPOCH` chain spec serialization.
* Remove unused data column index in `OverflowKey`.
* Fix column sidecars incorrectly produced when there are no blobs.
* Re-instate index to `OverflowKey::DataColumn` and downgrade noisy debug log to `trace`.
* DAS sampling on sync (#5616)
* Data availability sampling on sync
* Address @jimmygchen review
* Trigger sampling
* Address some review comments and only send `SamplingBlock` sync message after PEER_DAS_EPOCH.
---------
Co-authored-by: Jimmy Chen <jchen.tc@gmail.com>
* Merge branch 'unstable' into das
# Conflicts:
# Cargo.lock
# Cargo.toml
# beacon_node/beacon_chain/src/block_verification.rs
# beacon_node/http_api/src/publish_blocks.rs
# beacon_node/lighthouse_network/src/rpc/codec/ssz_snappy.rs
# beacon_node/lighthouse_network/src/rpc/protocol.rs
# beacon_node/lighthouse_network/src/types/pubsub.rs
# beacon_node/network/src/sync/block_lookups/single_block_lookup.rs
# beacon_node/store/src/hot_cold_store.rs
# consensus/types/src/beacon_state.rs
# consensus/types/src/chain_spec.rs
# consensus/types/src/eth_spec.rs
* Merge branch 'unstable' into das
* Re-process early sampling requests (#5569)
* Re-process early sampling requests
# Conflicts:
# beacon_node/beacon_processor/src/work_reprocessing_queue.rs
# beacon_node/lighthouse_network/src/rpc/methods.rs
# beacon_node/network/src/network_beacon_processor/rpc_methods.rs
* Update beacon_node/beacon_processor/src/work_reprocessing_queue.rs
Co-authored-by: Jimmy Chen <jchen.tc@gmail.com>
* Add missing var
* Beta compiler fixes and small typo fixes.
* Remove duplicate method.
---------
Co-authored-by: Jimmy Chen <jchen.tc@gmail.com>
* Merge remote-tracking branch 'sigp/unstable' into das
* Fix merge conflict.
* Add data columns by root to currently supported protocol list (#5678)
* Add data columns by root to currently supported protocol list.
* Add missing data column by roots handling.
* Merge branch 'unstable' into das
# Conflicts:
# Cargo.lock
# Cargo.toml
# beacon_node/network/src/sync/block_lookups/tests.rs
# beacon_node/network/src/sync/manager.rs
* Fix simulator tests on `das` branch (#5731)
* Bump genesis delay in sim tests as KZG setup takes longer for DAS.
* Fix incorrect YAML spacing.
* DataColumnByRange boilerplate (#5353)
* add boilerplate
* fmt
* PeerDAS custody lookup sync (#5684)
* Implement custody sync
* Lint
* Fix tests
* Fix rebase issue
* Add data column kzg verification and update `c-kzg`. (#5701)
* Add data column kzg verification and update `c-kzg`.
* Fix incorrect `Cell` size.
* Add kzg verification on rpc blocks.
* Add kzg verification on rpc data columns.
* Rename `PEER_DAS_EPOCH` to `EIP7594_FORK_EPOCH` for client interop. (#5750)
* Fetch custody columns in range sync (#5747)
* Fetch custody columns in range sync
* Clean up todos
* Remove `BlobSidecar` construction and publish after PeerDAS activated (#5759)
* Avoid building and publishing blob sidecars after PeerDAS.
* Ignore gossip blobs with a slot greater than peer das activation epoch.
* Only attempt to verify blob count and import blobs before PeerDAS.
* #5684 review comments (#5748)
* #5684 review comments.
* Doc and message update only.
* Fix incorrect condition when constructing `RpcBlock` with `DataColumn`s
* Make sampling tests deterministic (#5775)
* PeerDAS spec tests (#5772)
* Add get_custody_columns spec tests.
* Add kzg merkle proof spec tests.
* Add SSZ spec tests.
* Add remaining KZG tests
* Load KZG only once per process, exclude electra tests and add missing SSZ tests.
* Fix lint and missing changes.
* Ignore macOS generated file.
* Merge remote branch 'sigp/unstable' into das
* Merge remote tracking branch 'origin/unstable' into das
* Implement unconditional reconstruction for supernodes (#5781)
* Implement unconditional reconstruction for supernodes
* Move code into KzgVerifiedCustodyDataColumn
* Remove expect
* Add test
* Thanks justin
* Add withhold attack mode for interop (#5788)
* Add withhold attack mode
* Update readme
* Drop added readmes
* Undo styling changes
* Add column gossip verification and handle unknown parent block (#5783)
* Add column gossip verification and handle missing parent for columns.
* Review PR
* Fix rebase issue
* more lint issues :)
---------
Co-authored-by: dapplion <35266934+dapplion@users.noreply.github.com>
* Trigger sampling on sync events (#5776)
* Trigger sampling on sync events
* Update beacon_chain.rs
* Fix tests
* Fix tests
* PeerDAS parameter changes for devnet-0 (#5779)
* Update PeerDAS parameters to latest values.
* Lint fix
* Fix lint.
* Update hardcoded subnet count to 64 (#5791)
* Fix incorrect columns per subnet and config cleanup (#5792)
* Tidy up PeerDAS preset and config values.
* Fix broken config
* Fix DAS branch CI (#5793)
* Fix invalid syntax.
* Update cli doc. Ignore get_custody_columns test temporarily.
* Fix failing test and add verify inclusion test.
* Undo accidentally removed code.
* Only attempt reconstruct columns once. (#5794)
* Re-enable precompute table for peerdas kzg (#5795)
* Merge branch 'unstable' into das
* Update subscription filter. (#5797)
* Remove penalty for duplicate columns (expected due to reconstruction) (#5798)
* Revert DAS config for interop testing. Optimise get_custody_columns function. (#5799)
* Don't perform reconstruction for proposer node as it already has all the columns. (#5806)
* Multithread compute_cells_and_proofs (#5805)
* Multi-thread reconstruct data columns
* Multi-thread path for block production
* Merge branch 'unstable' into das
# Conflicts:
# .github/workflows/test-suite.yml
# beacon_node/network/src/sync/block_lookups/mod.rs
# beacon_node/network/src/sync/block_lookups/single_block_lookup.rs
# beacon_node/network/src/sync/network_context.rs
* Fix CI errors.
* Move PeerDAS type-level config to configurable `ChainSpec` (#5828)
* Move PeerDAS type level config to `ChainSpec`.
* Fix tests
* Misc custody lookup improvements (#5821)
* Improve custody requests
* Type DataColumnsByRootRequestId
* Prioritize peers and load balance
* Update tests
* Address PR review
* Merge branch 'unstable' into das
* Rename deploy_block in network config (`das` branch) (#5852)
* Rename deploy_block.txt to deposit_contract_block.txt
* fmt
---------
Co-authored-by: Pawan Dhananjay <pawandhananjay@gmail.com>
* Merge branch 'unstable' into das
* Fix CI and merge issues.
* Merge branch 'unstable' into das
# Conflicts:
# beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs
# lcli/src/main.rs
* Store data columns individually in store and caches (#5890)
* Store data columns individually in store and caches
* Implement data column pruning
* Merge branch 'unstable' into das
# Conflicts:
# Cargo.lock
* Update reconstruction benches to newer criterion version. (#5949)
* Merge branch 'unstable' into das
# Conflicts:
# .github/workflows/test-suite.yml
* chore: add `recover_cells_and_compute_proofs` method (#5938)
* chore: add recover_cells_and_compute_proofs method
* Introduce type alias `CellsAndKzgProofs` to address type complexity.
---------
Co-authored-by: Jimmy Chen <jchen.tc@gmail.com>
* Update `csc` format in ENR and spec tests for devnet-1 (#5966)
* Update `csc` format in ENR.
* Add spec tests for `recover_cells_and_kzg_proofs`.
* Add tests for ENR.
* Fix failing tests.
* Add protection against invalid csc value in ENR.
* Fix lint
* Fix csc encoding and decoding (#5997)
* Fix data column rpc request not being sent due to incorrect limits set. (#6000)
* Fix incorrect inbound request count causing rate limiting. (#6025)
* Merge branch 'stable' into das
# Conflicts:
# beacon_node/network/src/sync/block_lookups/tests.rs
# beacon_node/network/src/sync/block_sidecar_coupling.rs
# beacon_node/network/src/sync/manager.rs
# beacon_node/network/src/sync/network_context.rs
# beacon_node/network/src/sync/network_context/requests.rs
* Merge remote-tracking branch 'unstable' into das
* Add kurtosis config for DAS testing (#5968)
* Add kurtosis config for DAS testing.
* Fix invalid yaml file
* Update network parameter files.
* chore: add rust PeerdasKZG crypto library for peerdas functionality and rollback c-kzg dependency to 4844 version (#5941)
* chore: add recover_cells_and_compute_proofs method
* chore: add rust peerdas crypto library
* chore: integrate peerdaskzg rust library into kzg crate
* chore(multi):
- update `ssz_cell_to_crypto_cell`
- update conversion from the crypto cell type to a Vec<u8>. Since the Rust library defines them as references to an array, the conversion is simply `to_vec`
* chore(multi):
- update rest of code to handle the new crypto `Cell` type
- update test case code to no longer use the Box type
* chore: cleanup of superfluous conversions
* chore: revert c-kzg dependency back to v1
* chore: move dependency into correct order
* chore: update rust dependency
- This version includes a new method `PeerDasContext::with_num_threads`
* chore: remove Default initialization of PeerDasContext and explicitly set the parameters in `new_from_trusted_setup`
* chore: cleanup exports
* chore: commit updated cargo.lock
* Update Cargo.toml
Co-authored-by: Jimmy Chen <jchen.tc@gmail.com>
* chore: rename dependency
* chore: update peerdas lib
- sets the blst version to 0.3 so that it matches whatever lighthouse is using. Although 0.3.12 is latest, lighthouse is pinned to 0.3.3
* chore: fix clippy lifetime
- Rust doesn't allow you to elide the lifetime on type aliases
* chore: cargo clippy fix
* chore: cargo fmt
* chore: update lib to add redundant checks (these will be removed in consensus-specs PR 3819)
* chore: update dependency to ignore proofs
* chore: update peerdas lib to latest
* update lib
* chore: remove empty proof parameter
---------
Co-authored-by: Jimmy Chen <jchen.tc@gmail.com>
* Update PeerDAS interop testnet config (#6069)
* Update interop testnet config.
* Fix typo and remove target peers
* Avoid retrying same sampling peer that previously failed. (#6084)
* Various fixes to custody range sync (#6004)
* Only start requesting batches when there are good peers across all custody columns to avoid spamming block requests.
* Add custody peer check before mutating `BatchInfo` to avoid inconsistent state.
* Add check to cover a case where batch is not processed while waiting for custody peers to become available.
* Fix lint and logic error
* Fix `good_peers_on_subnet` always returning false for `DataColumnSubnet`.
* Add test for `get_custody_peers_for_column`
* Revert epoch parameter refactor.
* Fall back to default custody requirement if peer ENR is not present.
* Add metrics and update code comment.
* Add more debug logs.
* Use subscribed peers on subnet before MetaDataV3 is implemented. Remove peer_id matching when injecting error because multiple peers are used for range requests. Use randomized custodial peer to avoid repeatedly sending requests to failing peers. Batch by range request where possible.
* Remove unused code and update docs.
* Add comment
* chore: update peerdas-kzg library (#6118)
* chore: update peerDAS lib
* chore: update library
* chore: update library to version that include "init context" benchmarks and optional validation checks
* chore: (can remove) -- Add benchmarks for init context
* Prevent continuous searches for low-peer networks (#6162)
* Merge branch 'unstable' into das
* Fix merge conflicts
* Add cli flag to enable sampling and disable by default. (#6209)
* chore: Use reference to an array representing a blob instead of an owned KzgBlob (#6179)
* add KzgBlobRef type
* modify code to use KzgBlobRef
* clippy
* Remove Deneb blob related changes to maintain compatibility with `c-kzg-4844`.
---------
Co-authored-by: Jimmy Chen <jchen.tc@gmail.com>
* Store computed custody subnets in PeerDB and fix custody lookup test (#6218)
* Fix failing custody lookup tests.
* Store custody subnets in PeerDB, fix custody lookup test and refactor some methods.
* Merge branch 'unstable' into das
# Conflicts:
# beacon_node/beacon_chain/src/beacon_chain.rs
# beacon_node/beacon_chain/src/block_verification_types.rs
# beacon_node/beacon_chain/src/builder.rs
# beacon_node/beacon_chain/src/data_availability_checker.rs
# beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs
# beacon_node/beacon_chain/src/data_column_verification.rs
# beacon_node/beacon_chain/src/early_attester_cache.rs
# beacon_node/beacon_chain/src/historical_blocks.rs
# beacon_node/beacon_chain/tests/store_tests.rs
# beacon_node/lighthouse_network/src/discovery/enr.rs
# beacon_node/network/src/service.rs
# beacon_node/src/cli.rs
# beacon_node/store/src/hot_cold_store.rs
# beacon_node/store/src/lib.rs
# lcli/src/generate_bootnode_enr.rs
* Fix CI failures after merge.
* Batch sampling requests by peer (#6256)
* Batch sampling requests by peer
* Fix clippy errors
* Fix tests
* Add column_index to error message for ease of tracing
* Remove outdated comment
* Fix range sync never evaluating request as finished, causing it to get stuck. (#6276)
* Merge branch 'unstable' into das-0821-merge
# Conflicts:
# Cargo.lock
# Cargo.toml
# beacon_node/beacon_chain/src/beacon_chain.rs
# beacon_node/beacon_chain/src/data_availability_checker.rs
# beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs
# beacon_node/beacon_chain/src/data_column_verification.rs
# beacon_node/beacon_chain/src/kzg_utils.rs
# beacon_node/beacon_chain/src/metrics.rs
# beacon_node/beacon_processor/src/lib.rs
# beacon_node/lighthouse_network/src/rpc/codec/ssz_snappy.rs
# beacon_node/lighthouse_network/src/rpc/config.rs
# beacon_node/lighthouse_network/src/rpc/methods.rs
# beacon_node/lighthouse_network/src/rpc/outbound.rs
# beacon_node/lighthouse_network/src/rpc/rate_limiter.rs
# beacon_node/lighthouse_network/src/service/api_types.rs
# beacon_node/lighthouse_network/src/types/globals.rs
# beacon_node/network/src/network_beacon_processor/mod.rs
# beacon_node/network/src/network_beacon_processor/rpc_methods.rs
# beacon_node/network/src/network_beacon_processor/sync_methods.rs
# beacon_node/network/src/sync/block_lookups/common.rs
# beacon_node/network/src/sync/block_lookups/mod.rs
# beacon_node/network/src/sync/block_lookups/single_block_lookup.rs
# beacon_node/network/src/sync/block_lookups/tests.rs
# beacon_node/network/src/sync/manager.rs
# beacon_node/network/src/sync/network_context.rs
# consensus/types/src/data_column_sidecar.rs
# crypto/kzg/Cargo.toml
# crypto/kzg/benches/benchmark.rs
# crypto/kzg/src/lib.rs
* Fix custody tests and load PeerDAS KZG instead.
* Fix ef tests and bench compilation.
* Fix failing sampling test.
* Merge pull request #6287 from jimmygchen/das-0821-merge
Merge `unstable` into `das` 20240821
* Remove get_block_import_status
* Merge branch 'unstable' into das
* Re-enable Windows release tests.
* Address some review comments.
* Address more review comments and cleanups.
* Comment out peer DAS KZG EF tests for now
* Address more review comments and fix build.
* Merge branch 'das' of github.com:sigp/lighthouse into das
* Unignore Electra tests
* Fix metric name
* Address some of Pawan's review comments
* Merge remote-tracking branch 'origin/unstable' into das
* Update PeerDAS network parameters for peerdas-devnet-2 (#6290)
* update subnet count & custody req
* das network params
* update ef tests
---------
Co-authored-by: Jimmy Chen <jchen.tc@gmail.com>
* rename 4844 to deneb
* rename 4844 to deneb
* move excess data gas field
* get EF tests working
* fix ef tests lint
* fix the blob identifier ef test
* fix accessed files ef test script
* get beacon chain tests passing
* add a rt is_blob_batch
* use the mixed type everywhere
* glue
* more glue
* minor fixes
* fix range tests
* filling in the gaps
* moore filling in the gaps
## Issue Addressed
Solves #3390
So after checking some logs @pawanjay176 got, we conclude that this happened because we blacklisted a chain after trying it "too much". Now here, in all occurrences it seems that "too much" means we got too many download failures. This happened very slowly, exactly because the batch is allowed to stay alive for very long times after not counting penalties when the ee is offline. The error here then was not that the batch failed because of offline ee errors, but that we blacklisted a chain because of download errors, which we can't pin on the chain but on the peer. This PR fixes that.
## Proposed Changes
Adds a missing piece of logic so that if a chain fails for errors that can't be attributed to an objectively bad behavior from the peer, it is not blacklisted. The issue at hand occurred when new peers arrived claiming a head that had been wrongfully blacklisted, even if the original peers participating in the chain were not penalized.
Another notable change is that we need to consider a batch invalid if it processed correctly but its next non empty batch fails processing. Now since a batch can fail processing in non empty ways, there is no need to mark as invalid previous batches.
Improves some logging as well.
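
A minimal sketch of the blacklisting rule, assuming failures are classified by whether they can be pinned on the chain itself. `ChainFailure` and `should_blacklist` are hypothetical names for illustration and do not mirror the real types:

```rust
/// Hypothetical classification of why a syncing chain was dropped.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChainFailure {
    /// Blocks were downloaded but failed validation: the chain itself is bad.
    InvalidBatch,
    /// Too many download failures: a peer problem, not a chain problem.
    TooManyDownloadFailures,
    /// Processing could not complete because the execution engine was offline.
    ExecutionEngineOffline,
}

/// Only blacklist when the failure can be attributed to the chain.
fn should_blacklist(failure: ChainFailure) -> bool {
    matches!(failure, ChainFailure::InvalidBatch)
}
```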
## Additional Info
We should do this regardless of pausing sync on ee offline/unsynced state. This is because I think it's almost impossible to ensure a processing result will reach in a predictable order with a synced notification from the ee. Doing this handles what I think are inevitable data races when we actually pause sync
This also fixes a return that reports which batch failed and caused us some confusion checking the logs
## Overview
This rather extensive PR achieves two primary goals:
1. Uses the finalized/justified checkpoints of fork choice (FC), rather than that of the head state.
2. Refactors fork choice, block production and block processing to `async` functions.
Additionally, it achieves:
- Concurrent forkchoice updates to the EL and cache pruning after a new head is selected.
- Concurrent "block packing" (attestations, etc) and execution payload retrieval during block production.
- Concurrent per-block-processing and execution payload verification during block processing.
- The `Arc`-ification of `SignedBeaconBlock` during block processing (it's never mutated, so why not?):
- I had to do this to deal with sending blocks into spawned tasks.
- Previously we were cloning the beacon block at least 2 times during each block processing, these clones are either removed or turned into cheaper `Arc` clones.
- We were also `Box`-ing and un-`Box`-ing beacon blocks as they moved throughout the networking crate. This is not a big deal, but it's nice to avoid shifting things between the stack and heap.
- Avoids cloning *all the blocks* in *every chain segment* during sync.
- It also has the potential to clean up our code where we need to pass an *owned* block around so we can send it back in the case of an error (I didn't do much of this, my PR is already big enough 😅)
- The `BeaconChain::HeadSafetyStatus` struct was removed. It was an old relic from prior merge specs.
For motivation for this change, see https://github.com/sigp/lighthouse/pull/3244#issuecomment-1160963273
## Changes to `canonical_head` and `fork_choice`
Previously, the `BeaconChain` had two separate fields:
```
canonical_head: RwLock<Snapshot>,
fork_choice: RwLock<BeaconForkChoice>
```
Now, we have grouped these values under a single struct:
```
canonical_head: CanonicalHead {
    cached_head: RwLock<Arc<Snapshot>>,
    fork_choice: RwLock<BeaconForkChoice>
}
```
Apart from ergonomics, the only *actual* change here is wrapping the canonical head snapshot in an `Arc`. This means that we no longer need to hold the `cached_head` (`canonical_head`, in old terms) lock when we want to pull some values from it. This was done to avoid deadlock risks by preventing functions from acquiring (and holding) the `cached_head` and `fork_choice` locks simultaneously.
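
To illustrate why the `Arc` helps, here is a simplified sketch using `std::sync::RwLock`. The `Snapshot` fields and the `set_head` helper are assumptions made for the example, and the real code may use a different lock type:

```rust
use std::sync::{Arc, RwLock};

/// Simplified stand-in for the head snapshot (the field is hypothetical).
struct Snapshot {
    head_slot: u64,
}

struct CanonicalHead {
    cached_head: RwLock<Arc<Snapshot>>,
}

impl CanonicalHead {
    /// Clone the `Arc` and release the read lock straight away, so callers
    /// never hold the lock while they inspect the snapshot.
    fn cached_head(&self) -> Arc<Snapshot> {
        Arc::clone(&self.cached_head.read().unwrap())
    }

    /// Swap in a new snapshot; existing readers keep their old `Arc`.
    fn set_head(&self, snapshot: Snapshot) {
        *self.cached_head.write().unwrap() = Arc::new(snapshot);
    }
}
```

A caller that took a snapshot before `set_head` keeps reading consistent values from the old one, and no lock is held across other calls such as fork choice.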
## Breaking Changes
### The `state` (root) field in the `finalized_checkpoint` SSE event
Consider the scenario where epoch `n` is just finalized, but `start_slot(n)` is skipped. There are two state roots we might use in the `finalized_checkpoint` SSE event:
1. The state root of the finalized block, which is `get_block(finalized_checkpoint.root).state_root`.
2. The state root at slot of `start_slot(n)`, which would be the state from (1), but "skipped forward" through any skip slots.
Previously, Lighthouse would choose (2). However, we can see that when [Teku generates that event](de2b2801c8/data/beaconrestapi/src/main/java/tech/pegasys/teku/beaconrestapi/handlers/v1/events/EventSubscriptionManager.java (L171-L182)) it uses [`getStateRootFromBlockRoot`](de2b2801c8/data/provider/src/main/java/tech/pegasys/teku/api/ChainDataProvider.java (L336-L341)) which uses (1).
I have switched Lighthouse from (2) to (1). I think it's a somewhat arbitrary choice between the two, where (1) is easier to compute and is consistent with Teku.
## Notes for Reviewers
I've renamed `BeaconChain::fork_choice` to `BeaconChain::recompute_head`. Doing this helped ensure I broke all previous uses of fork choice and I also find it more descriptive. It describes an action and can't be confused with trying to get a reference to the `ForkChoice` struct.
I've changed the ordering of SSE events when a block is received. It used to be `[block, finalized, head]` and now it's `[block, head, finalized]`. It was easier this way and I don't think we were making any promises about SSE event ordering so it's not "breaking".
I've made it so fork choice will run when it's first constructed. I did this because I wanted to have a cached version of the last call to `get_head`. Ensuring `get_head` has been run *at least once* means that the cached values don't need to be wrapped in an `Option`. This was fairly simple, it just involved passing a `slot` to the constructor so it knows *when* it's being run. When loading a fork choice from the store and a slot clock isn't handy I've just used the `slot` that was saved in the `fork_choice_store`. That seems like it would be a faithful representation of the slot when we saved it.
I added the `genesis_time: u64` to the `BeaconChain`. It's small, constant and nice to have around.
Since we're using FC for the fin/just checkpoints, we no longer get the `0x00..00` roots at genesis. You can see I had to remove a work-around in `ef-tests` here: b56be3bc2. I can't find any reason why this would be an issue, if anything I think it'll be better since the genesis-alias has caught us out a few times (0x00..00 isn't actually a real root). Edit: I did find a case where the `network` expected the 0x00..00 alias and patched it here: 3f26ac3e2.
You'll notice a lot of changes in tests. Generally, tests should be functionally equivalent. Here are the things creating the most diff-noise in tests:
- Changing tests to be async `tokio::test` tests.
- Adding `.await` to fork choice, block processing and block production functions.
- Refactor of the `canonical_head` "API" provided by the `BeaconChain`. E.g., `chain.canonical_head.cached_head()` instead of `chain.canonical_head.read()`.
- Wrapping `SignedBeaconBlock` in an `Arc`.
- In the `beacon_chain/tests/block_verification`, we can't use the `lazy_static` `CHAIN_SEGMENT` variable anymore since it's generated with an async function. We just generate it in each test, not so efficient but hopefully insignificant.
I had to disable `rayon` concurrent tests in the `fork_choice` tests. This is because the use of `rayon` and `block_on` was causing a panic.
Co-authored-by: Mac L <mjladson@pm.me>
## Issue Addressed
Deprecates the step parameter in the blocks by range request
## Proposed Changes
- Modifies the BlocksByRangeRequest type to remove the step parameter and everywhere we took it into account before
- Adds a new type to still handle coding and decoding of requests that use the parameter
## Additional Info
I went with a deprecation over the type itself so that requests received outside `lighthouse_network` don't even need to deal with this parameter. After the deprecation period just removing the Old blocks by range request should be straightforward
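A rough sketch of the two-type approach follows. The struct layouts and the conversion behaviour are assumptions made for illustration (in particular, ignoring an inbound `step` and always sending `step = 1`); they are not copied from the actual `lighthouse_network` types.

```rust
/// Assumed shape of the wire-level request that still carries `step`.
#[derive(Debug, Clone, PartialEq)]
pub struct OldBlocksByRangeRequest {
    pub start_slot: u64,
    pub count: u64,
    pub step: u64,
}

/// Assumed shape of the request seen by the rest of the node: no `step`.
#[derive(Debug, Clone, PartialEq)]
pub struct BlocksByRangeRequest {
    pub start_slot: u64,
    pub count: u64,
}

impl From<OldBlocksByRangeRequest> for BlocksByRangeRequest {
    fn from(old: OldBlocksByRangeRequest) -> Self {
        // Assumption: an inbound `step` is dropped at the decoding boundary.
        Self { start_slot: old.start_slot, count: old.count }
    }
}

impl From<BlocksByRangeRequest> for OldBlocksByRangeRequest {
    fn from(req: BlocksByRangeRequest) -> Self {
        // Assumption: outbound requests always encode a step of 1.
        Self { start_slot: req.start_slot, count: req.count, step: 1 }
    }
}
```

With this split, only the encoding and decoding layer ever sees the deprecated field, so removing it later means deleting one type and two conversions.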
## Issue Addressed
Currently we count a failed attempt for a syncing chain even if the peer is not at fault. This makes us do more work if the chain fails, and heavily penalizes peers, when we can simply retry. Inspired by a proposal I made to #3094
## Proposed Changes
If a batch fails but the peer is not at fault, do not count the attempt
Also removes some annoying logs
## Additional Info
We still get a counter on ignored attempts.. just in case
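As a minimal sketch of the counting rule described above: failures where the peer is at fault count toward the limit, while the others only bump a separate "ignored" counter. `BatchAttempts`, `FailureOutcome` and the limit of 5 are hypothetical and chosen only for the example.

```rust
/// Assumed limit of peer-at-fault failures before the chain is given up on.
pub const MAX_FAILED_ATTEMPTS: u8 = 5;

#[derive(Debug, PartialEq)]
pub enum FailureOutcome {
    Retry,
    ChainFailed,
}

/// Hypothetical per-batch attempt bookkeeping.
#[derive(Debug, Default)]
pub struct BatchAttempts {
    failed: u8,
    ignored: u8,
}

impl BatchAttempts {
    /// Records a failure and decides whether the batch may be retried.
    pub fn on_failure(&mut self, peer_at_fault: bool) -> FailureOutcome {
        if peer_at_fault {
            self.failed = self.failed.saturating_add(1);
        } else {
            // Kept "just in case", but never counted against the chain.
            self.ignored = self.ignored.saturating_add(1);
        }
        if self.failed >= MAX_FAILED_ATTEMPTS {
            FailureOutcome::ChainFailed
        } else {
            FailureOutcome::Retry
        }
    }

    pub fn failed(&self) -> u8 {
        self.failed
    }

    pub fn ignored(&self) -> u8 {
        self.ignored
    }
}
```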
## Proposed Changes
Initially the idea was to remove hashing of blocks in backfill sync. After considering it more, we conclude that we need to do it in both (forward and backfill) anyway. But since we forgot why we were doing it in the first place, this PR documents this logic.
Future us should find it useful
Co-authored-by: Divma <26765164+divagant-martian@users.noreply.github.com>
## Proposed Changes
Allocate less memory in sync by hashing the `SignedBeaconBlock`s in a batch directly, rather than going via SSZ bytes.
Credit to @paulhauner for finding this source of temporary allocations.
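To illustrate the general technique (not the actual change), the sketch below feeds a slice of block-like structs straight into a hasher instead of first serializing them into a temporary byte buffer. `BlockSketch` and `hash_batch` are hypothetical, and `std`'s `DefaultHasher` stands in for whatever hasher the sync code really uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for `SignedBeaconBlock`.
#[derive(Hash)]
pub struct BlockSketch {
    pub slot: u64,
    pub parent_root: [u8; 32],
    pub body: Vec<u8>,
}

/// Hashes the batch by walking the structs directly.
/// No intermediate `Vec<u8>` holding an encoding of the whole batch is built.
pub fn hash_batch(blocks: &[BlockSketch]) -> u64 {
    let mut hasher = DefaultHasher::new();
    blocks.hash(&mut hasher);
    hasher.finish()
}
```

Deriving `Hash` lets each field write itself into the hasher incrementally, which is where the saving in temporary allocations comes from.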
## Description
The `eth2_libp2p` crate was originally named and designed to incorporate a simple libp2p integration into lighthouse. Since its origins the crate's purpose has expanded dramatically. It now houses a lot more sophistication that is specific to lighthouse and no longer just a libp2p integration.
As of this writing it currently houses the following high-level lighthouse-specific logic:
- Lighthouse's implementation of the eth2 RPC protocol and specific encodings/decodings
- Integration and handling of ENRs with respect to libp2p and eth2
- Lighthouse's discovery logic, its integration with discv5 and logic about searching and handling peers.
- Lighthouse's peer manager - This is a large module handling various aspects of Lighthouse's network, such as peer scoring, handling pings and metadata, connection maintenance and recording, etc.
- Lighthouse's peer database - This is a collection of information stored for each individual peer which is specific to lighthouse. We store connection state, sync state, last seen ips and scores etc. The data stored for each peer is designed for various elements of the lighthouse code base such as syncing and the http api.
- Gossipsub scoring - This stores a collection of gossipsub 1.1 scoring mechanisms that are continuously analysed and updated based on the ethereum 2 networks and how Lighthouse performs on these networks.
- Lighthouse specific types for managing gossipsub topics, sync status and ENR fields
- Lighthouse's network HTTP API metrics - A collection of metrics for lighthouse network monitoring
- Lighthouse's custom configuration of all networking protocols, RPC, gossipsub, discovery, identify and libp2p.
Therefore it makes sense to rename the crate to be more akin to its current purposes, simply that it manages the majority of Lighthouse's network stack. This PR renames this crate to `lighthouse_network`
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Issue Addressed
Closes #1891
Closes #1784
## Proposed Changes
Implement checkpoint sync for Lighthouse, enabling it to start from a weak subjectivity checkpoint.
## Additional Info
- [x] Return unavailable status for out-of-range blocks requested by peers (#2561)
- [x] Implement sync daemon for fetching historical blocks (#2561)
- [x] Verify chain hashes (either in `historical_blocks.rs` or the calling module)
- [x] Consistency check for initial block + state
- [x] Fetch the initial state and block from a beacon node HTTP endpoint
- [x] Don't crash fetching beacon states by slot from the API
- [x] Background service for state reconstruction, triggered by CLI flag or API call.
Considered out of scope for this PR:
- Drop the requirement to provide the `--checkpoint-block` (this would require some pretty heavy refactoring of block verification)
Co-authored-by: Diva M <divma@protonmail.com>
## Issue Addressed
N/A
## Proposed Changes
- Removing a bunch of unnecessary references
- Updated `Error::VariantError` to `Error::Variant`
- There were additional enum variant lints that I ignored, because I thought our variant names were fine
- removed `MonitoredValidator`'s `pubkey` field, because I couldn't find it used anywhere. It looks like we just use the string version of the pubkey (the `id` field) if there is no index
## Additional Info
Co-authored-by: realbigsean <seananderson33@gmail.com>
This is a little bit of a tip-of-the-iceberg PR. It houses a lot of code changes in the libp2p dependency.
This needs a bit of thorough testing before merging.
The primary code changes are:
- General libp2p dependency update
- Gossipsub refactor to shift compression into gossipsub providing performance improvements and improved API for handling compression
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Issue Addressed
Following slog's documentation, this should help a bit with string allocations. I let it run for two days and mem usage is lower. This is of course anecdotal, but shouldn't harm anyway
## Proposed Changes
remove `String` creation in logs when possible
## Issue Addressed
Sync edge case when we get an empty optimistic batch that passes validation and is inside the download buffer. Eventually the chain would reach the batch and treat it as an ugly state.
## Proposed Changes
- Handle the edge case advancing the chain's target + code clarification
- Some largish changes for readability + ergonomics since rust has try ops
- Better handling of bad batch and chain states
## Issue Addressed
#1614 and a couple of sync-stalling problems, the most important of which is a cyclic dependency between the sync manager and the peer manager
## Issue Addressed
Downgrade inconsistent chain segment states from `panic` to `crit`. I don't love this solution but since range can always bounce back from any of those, we don't panic.
Co-authored-by: Age Manning <Age@AgeManning.com>
## Issue Addressed
chain state inconsistencies
## Proposed Changes
- a batch can be fake-failed by Range if it needs to move a peer to another chain. The peer will still send blocks/ errors / produce timeouts for those requests, so check when we get a response from the RPC that the request id matches, instead of only the peer, since a re-request can be directed to the same peer.
- if an optimistic batch succeeds, store the attempt to avoid trying it again when quickly switching chains. Also, use it only if ahead of our current target, instead of the segment's start epoch
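The request id check from the first point can be sketched as follows. All names here (`PeerId`, `RequestId`, `BatchState`, `is_expected_response`) are simplified, hypothetical stand-ins for the real types.

```rust
/// Simplified stand-in for a libp2p peer id.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PeerId(pub u64);

pub type RequestId = u32;

#[derive(Debug, PartialEq)]
pub enum BatchState {
    AwaitingDownload,
    Downloading { peer: PeerId, request_id: RequestId },
}

/// Accepts a response only if it answers the request currently in flight.
/// Matching on the peer alone is not enough: a fake-failed batch may have
/// been re-requested from the same peer under a new request id.
pub fn is_expected_response(state: &BatchState, peer: PeerId, request_id: RequestId) -> bool {
    matches!(
        state,
        BatchState::Downloading { peer: p, request_id: r } if *p == peer && *r == request_id
    )
}
```

A stale block, error or timeout from the earlier request then fails the check and can be dropped without touching the batch.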
## Issue Addressed
In principle.. closes #1551 but in general these are improvements for performance, maintainability and readability. The logic for the optimistic sync is actually simple
## Proposed Changes
There are miscellaneous things here:
- Remove unnecessary `BatchProcessResult::Partial` to simplify the batch validation logic
- Make batches a state machine. This is done to ensure batch state transitions respect our logic (this was previously done by moving batches between `Vec`s) and to ease the cognitive load of the `SyncingChain` struct
- Move most batch-related logic to the batch
- Remove `PendingBatches` in favor of a map of peers to their batches. This is to avoid duplicating peers inside the chain (peer_pool and pending_batches)
- Add `must_use` decoration to the `ProcessingResult` so that chains that request to be removed are handled accordingly. This also means that chains are now removed in more places than before to account for unhandled cases
- Store batches in a sorted map (`BTreeMap`). Access is not O(1), but since the number of _active_ batches is bounded this should be fast, and it saves performing hashing ops. Batches are indexed by the epoch they start at. The map is sorted to easily handle chain advancements (range logic)
- Produce the chain Id from the identifying fields: target root and target slot. This, to guarantee there can't be duplicated chains and be able to consistently search chains by either Id or checkpoint
- Fix chain_id not being present in all chain loggers
- Handle mega-edge case where the processor's work queue is full and the batch can't be sent. In this case the chain would lose the blocks, remain in a "syncing" state and waiting for a result that won't arrive, effectively stalling sync.
- When a batch imports blocks or the chain starts syncing with a local finalized epoch greater than the chain's start epoch, the chain is advanced instead of reset. This is to avoid losing download progress and validate batches faster. This also means that the old `start_epoch` now means "current first unvalidated batch", so it represents more accurately the progress of the chain.
- Batch status peers from the same chain to reduce Arc access.
- A couple of cases where the retry counters for a batch were not updated/checked are now handled via the batch state machine. Basically now if we forget to do it, we will know.
- Do not send back the blocks from the processor to the batch. Instead register the attempt before sending the blocks (does not count as failed)
- When re-requesting a batch, try to avoid not only the last failed peer, but all previous failed peers.
- Optimize requesting batches ahead in the buffer by shuffling idle peers just once (this is just addressing a couple of old TODOs in the code)
- In chain_collection, store chains by their id in a map
- Include a mapping from request_ids to (chain, batch) that requested the batch to avoid the double O(n) search on block responses
- Other stuff:
  - impl `slog::KV` for batches
  - impl `slog::KV` for syncing chains
  - PSA: when logging, we can use `%thing` if `thing` implements `Display`. Same for `?` and `Debug`
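Two of the points above, batches as a state machine and batches kept in a `BTreeMap` keyed by start epoch, can be sketched together. This is a reduced illustration under assumptions: the states, the `WrongState` error, the attempt limit and `advance_chain` are hypothetical and far simpler than the real batch logic.

```rust
use std::collections::BTreeMap;

pub type Epoch = u64;

/// Assumed download retry limit.
pub const MAX_DOWNLOAD_ATTEMPTS: u8 = 5;

#[derive(Debug, PartialEq)]
pub enum BatchState {
    AwaitingDownload,
    Downloading,
    AwaitingProcessing,
    Failed,
}

/// Returned when a transition is requested from a state that does not allow it.
#[derive(Debug, PartialEq)]
pub struct WrongState;

#[derive(Debug)]
pub struct Batch {
    state: BatchState,
    failed_downloads: u8,
}

impl Batch {
    pub fn new() -> Self {
        Self { state: BatchState::AwaitingDownload, failed_downloads: 0 }
    }

    pub fn state(&self) -> &BatchState {
        &self.state
    }

    pub fn start_downloading(&mut self) -> Result<(), WrongState> {
        match self.state {
            BatchState::AwaitingDownload => {
                self.state = BatchState::Downloading;
                Ok(())
            }
            _ => Err(WrongState),
        }
    }

    /// The retry counter is updated inside the transition, so it cannot be forgotten.
    pub fn download_failed(&mut self) -> Result<(), WrongState> {
        match self.state {
            BatchState::Downloading => {
                self.failed_downloads += 1;
                self.state = if self.failed_downloads >= MAX_DOWNLOAD_ATTEMPTS {
                    BatchState::Failed
                } else {
                    BatchState::AwaitingDownload
                };
                Ok(())
            }
            _ => Err(WrongState),
        }
    }

    pub fn download_completed(&mut self) -> Result<(), WrongState> {
        match self.state {
            BatchState::Downloading => {
                self.state = BatchState::AwaitingProcessing;
                Ok(())
            }
            _ => Err(WrongState),
        }
    }
}

/// Advances the chain: drops every batch starting before `validated_up_to`
/// and returns how many were dropped. The sorted keys make this one split.
pub fn advance_chain(batches: &mut BTreeMap<Epoch, Batch>, validated_up_to: Epoch) -> usize {
    let remaining = batches.split_off(&validated_up_to);
    let removed = batches.len();
    *batches = remaining;
    removed
}
```

An invalid transition surfaces as an `Err` instead of silently leaving a batch in the wrong collection, which is the "if we forget to do it, we will know" property.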
### Optimistic syncing:
First try the batch that contains the current head. If the batch imports any block, advance the chain. Otherwise, if this optimistic batch is inside the current processing window, leave it there for future use; if not, drop it. The tolerance for this batch is the same for downloading, but just once for processing
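The decision described above boils down to a small function. The sketch below uses hypothetical names (`OptimisticOutcome`, `after_optimistic_attempt`) and ignores the download and processing tolerances.

```rust
#[derive(Debug, PartialEq)]
pub enum OptimisticOutcome {
    /// The batch imported at least one block: move the chain forward.
    AdvanceChain,
    /// Nothing imported, but the batch sits inside the processing window.
    KeepForLater,
    /// Nothing imported and the batch is outside the window.
    Drop,
}

/// Decides what to do with the optimistic batch after one processing attempt.
pub fn after_optimistic_attempt(
    imported_any_block: bool,
    inside_processing_window: bool,
) -> OptimisticOutcome {
    if imported_any_block {
        OptimisticOutcome::AdvanceChain
    } else if inside_processing_window {
        OptimisticOutcome::KeepForLater
    } else {
        OptimisticOutcome::Drop
    }
}
```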
Co-authored-by: Age Manning <Age@AgeManning.com>
The changes are somewhat simple but should solve two issues:
- When quickly changing between chains once and a second time back again, batchIds would collide and cause havoc.
- If we got an out of range response from a peer, sync would remain in syncing but without advancing
Changes:
- remove the batch id. Identify each batch (inside a chain) by its starting epoch. Target epochs for downloading and processing now advance by EPOCHS_PER_BATCH
- for the same reason, move the "to_be_downloaded_id" to be an epoch
- remove a sneaky line that dropped an out of range batch without downloading it
- bonus: put the chain_id in the log given to the chain. This is why explicitly logging the chain_id is removed
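A small sketch of identifying batches by their starting epoch follows. `ChainTargets` and its methods are hypothetical, and the value of `EPOCHS_PER_BATCH` is assumed for the example.

```rust
pub type Epoch = u64;

/// Assumed batch size in epochs.
pub const EPOCHS_PER_BATCH: u64 = 2;

/// Hypothetical per-chain download and processing targets.
pub struct ChainTargets {
    pub to_be_downloaded: Epoch,
    pub processing_target: Epoch,
}

impl ChainTargets {
    pub fn new(start_epoch: Epoch) -> Self {
        Self { to_be_downloaded: start_epoch, processing_target: start_epoch }
    }

    /// Returns the id (the start epoch) of the next batch to request.
    pub fn next_batch_to_download(&mut self) -> Epoch {
        let id = self.to_be_downloaded;
        self.to_be_downloaded += EPOCHS_PER_BATCH;
        id
    }

    /// Moves the processing target forward by one batch.
    pub fn batch_processed(&mut self) {
        self.processing_target += EPOCHS_PER_BATCH;
    }
}
```

Since the id is derived from the epoch rather than from a running counter, switching away from a chain and back again yields the same ids for the same ranges.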
## Issue Addressed
Recurring sync loop and invalid batch downloading
## Proposed Changes
Shifts the batches to include the first slot of each epoch. This ensures the finalized is always downloaded once a chain has completed syncing.
Also add in logic to prevent re-dialing disconnected peers. Non-performant peers get disconnected during sync, this prevents re-connection to these during sync.
## Additional Info
N/A
* Update `milagro_bls` to new release (#1183)
* Update milagro_bls to new release
Signed-off-by: Kirk Baird <baird.k@outlook.com>
* Tidy up fake cryptos
Signed-off-by: Kirk Baird <baird.k@outlook.com>
* move SecretHash to bls and put plaintext back
Signed-off-by: Kirk Baird <baird.k@outlook.com>
* Update v0.12.0 to v0.12.1
* Use ssz types for Request and error types
* Fix errors
* Constrain BlocksByRangeRequest count to MAX_REQUEST_BLOCKS
* Fix issues after rebasing
* Address review comments
Co-authored-by: Kirk Baird <baird.k@outlook.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
Co-authored-by: Age Manning <Age@AgeManning.com>