## Issue Addressed
currently we count a failed attempt for a syncing chain even if the peer is not at fault. This makes us do more work if the chain fails, and heavily penalize peers, when we can simply retry. Inspired by a proposal I made to #3094
## Proposed Changes
If a batch fails but the peer is not at fault, do not count the attempt
Also removes some annoying logs
## Additional Info
We still get a counter on ignored attempts.. just in case
## Proposed Changes
Initially the idea was to remove hashing of blocks in backfill sync. After considering it more, we conclude that we need to do it in both (forward and backfill) anyway. But since we forgot why we were doing it in the first place, this PR documents this logic.
Future us should find it useful
Co-authored-by: Divma <26765164+divagant-martian@users.noreply.github.com>
## Proposed Changes
Allocate less memory in sync by hashing the `SignedBeaconBlock`s in a batch directly, rather than going via SSZ bytes.
Credit to @paulhauner for finding this source of temporary allocations.
## Description
The `eth2_libp2p` crate was originally named and designed to incorporate a simple libp2p integration into lighthouse. Since its origins the crates purpose has expanded dramatically. It now houses a lot more sophistication that is specific to lighthouse and no longer just a libp2p integration.
As of this writing it currently houses the following high-level lighthouse-specific logic:
- Lighthouse's implementation of the eth2 RPC protocol and specific encodings/decodings
- Integration and handling of ENRs with respect to libp2p and eth2
- Lighthouse's discovery logic, its integration with discv5 and logic about searching and handling peers.
- Lighthouse's peer manager - This is a large module handling various aspects of Lighthouse's network, such as peer scoring, handling pings and metadata, connection maintenance and recording, etc.
- Lighthouse's peer database - This is a collection of information stored for each individual peer which is specific to lighthouse. We store connection state, sync state, last seen ips and scores etc. The data stored for each peer is designed for various elements of the lighthouse code base such as syncing and the http api.
- Gossipsub scoring - This stores a collection of gossipsub 1.1 scoring mechanisms that are continuously analyssed and updated based on the ethereum 2 networks and how Lighthouse performs on these networks.
- Lighthouse specific types for managing gossipsub topics, sync status and ENR fields
- Lighthouse's network HTTP API metrics - A collection of metrics for lighthouse network monitoring
- Lighthouse's custom configuration of all networking protocols, RPC, gossipsub, discovery, identify and libp2p.
Therefore it makes sense to rename the crate to be more akin to its current purposes, simply that it manages the majority of Lighthouse's network stack. This PR renames this crate to `lighthouse_network`
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Issue Addressed
Closes#1891Closes#1784
## Proposed Changes
Implement checkpoint sync for Lighthouse, enabling it to start from a weak subjectivity checkpoint.
## Additional Info
- [x] Return unavailable status for out-of-range blocks requested by peers (#2561)
- [x] Implement sync daemon for fetching historical blocks (#2561)
- [x] Verify chain hashes (either in `historical_blocks.rs` or the calling module)
- [x] Consistency check for initial block + state
- [x] Fetch the initial state and block from a beacon node HTTP endpoint
- [x] Don't crash fetching beacon states by slot from the API
- [x] Background service for state reconstruction, triggered by CLI flag or API call.
Considered out of scope for this PR:
- Drop the requirement to provide the `--checkpoint-block` (this would require some pretty heavy refactoring of block verification)
Co-authored-by: Diva M <divma@protonmail.com>
## Issue Addressed
N/A
## Proposed Changes
- Removing a bunch of unnecessary references
- Updated `Error::VariantError` to `Error::Variant`
- There were additional enum variant lints that I ignored, because I thought our variant names were fine
- removed `MonitoredValidator`'s `pubkey` field, because I couldn't find it used anywhere. It looks like we just use the string version of the pubkey (the `id` field) if there is no index
## Additional Info
Co-authored-by: realbigsean <seananderson33@gmail.com>
This is a little bit of a tip-of-the-iceberg PR. It houses a lot of code changes in the libp2p dependency.
This needs a bit of thorough testing before merging.
The primary code changes are:
- General libp2p dependency update
- Gossipsub refactor to shift compression into gossipsub providing performance improvements and improved API for handling compression
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Issue Addressed
Following slog's documentation, this should help a bit with string allocations. I left it run for two days and mem usage is lower. This is of course anecdotal, but shouldn't harm anyway
## Proposed Changes
remove `String` creation in logs when possible
## Issue Addressed
Sync edge case when we get an empty optimistic batch that passes validation and is inside the download buffer. Eventually the chain would reach the batch and treat it as an ugly state.
## Proposed Changes
- Handle the edge case advancing the chain's target + code clarification
- Some largey changes for readability + ergonomics since rust has try ops
- Better handling of bad batch and chain states
## Issue Addressed
#1614 and a couple of sync-stalling problems, the most important is a cyclic dependency between the sync manager and the peer manager
## Issue Addressed
Downgrade inconsistent chain segment states from `panic` to `crit`. I don't love this solution but since range can always bounce back from any of those, we don't panic.
Co-authored-by: Age Manning <Age@AgeManning.com>
## Issue Addressed
chain state inconsistencies
## Proposed Changes
- a batch can be fake-failed by Range if it needs to move a peer to another chain. The peer will still send blocks/ errors / produce timeouts for those requests, so check when we get a response from the RPC that the request id matches, instead of only the peer, since a re-request can be directed to the same peer.
- if an optimistic batch succeeds, store the attempt to avoid trying it again when quickly switching chains. Also, use it only if ahead of our current target, instead of the segment's start epoch
## Issue Addressed
In principle.. closes#1551 but in general are improvements for performance, maintainability and readability. The logic for the optimistic sync in actually simple
## Proposed Changes
There are miscellaneous things here:
- Remove unnecessary `BatchProcessResult::Partial` to simplify the batch validation logic
- Make batches a state machine. This is done to ensure batch state transitions respect our logic (this was previously done by moving batches between `Vec`s) and to ease the cognitive load of the `SyncingChain` struct
- Move most batch-related logic to the batch
- Remove `PendingBatches` in favor of a map of peers to their batches. This is to avoid duplicating peers inside the chain (peer_pool and pending_batches)
- Add `must_use` decoration to the `ProcessingResult` so that chains that request to be removed are handled accordingly. This also means that chains are now removed in more places than before to account for unhandled cases
- Store batches in a sorted map (`BTreeMap`) access is not O(1) but since the number of _active_ batches is bounded this should be fast, and saves performing hashing ops. Batches are indexed by the epoch they start. Sorted, to easily handle chain advancements (range logic)
- Produce the chain Id from the identifying fields: target root and target slot. This, to guarantee there can't be duplicated chains and be able to consistently search chains by either Id or checkpoint
- Fix chain_id not being present in all chain loggers
- Handle mega-edge case where the processor's work queue is full and the batch can't be sent. In this case the chain would lose the blocks, remain in a "syncing" state and waiting for a result that won't arrive, effectively stalling sync.
- When a batch imports blocks or the chain starts syncing with a local finalized epoch greater that the chain's start epoch, the chain is advanced instead of reset. This is to avoid losing download progress and validate batches faster. This also means that the old `start_epoch` now means "current first unvalidated batch", so it represents more accurately the progress of the chain.
- Batch status peers from the same chain to reduce Arc access.
- Handle a couple of cases where the retry counters for a batch were not updated/checked are now handled via the batch state machine. Basically now if we forget to do it, we will know.
- Do not send back the blocks from the processor to the batch. Instead register the attempt before sending the blocks (does not count as failed)
- When re-requesting a batch, try to avoid not only the last failed peer, but all previous failed peers.
- Optimize requesting batches ahead in the buffer by shuffling idle peers just once (this is just addressing a couple of old TODOs in the code)
- In chain_collection, store chains by their id in a map
- Include a mapping from request_ids to (chain, batch) that requested the batch to avoid the double O(n) search on block responses
- Other stuff:
- impl `slog::KV` for batches
- impl `slog::KV` for syncing chains
- PSA: when logging, we can use `%thing` if `thing` implements `Display`. Same for `?` and `Debug`
### Optimistic syncing:
Try first the batch that contains the current head, if the batch imports any block, advance the chain. If not, if this optimistic batch is inside the current processing window leave it there for future use, if not drop it. The tolerance for this block is the same for downloading, but just once for processing
Co-authored-by: Age Manning <Age@AgeManning.com>
The changes are somewhat simple but should solve two issues:
- When quickly changing between chains once and a second time back again, batchIds would collide and cause havoc.
- If we got an out of range response from a peer, sync would remain in syncing but without advancing
Changes:
- remove the batch id. Identify each batch (inside a chain) by its starting epoch. Target epochs for downloading and processing now advance by EPOCHS_PER_BATCH
- for the same reason, move the "to_be_downloaded_id" to be an epoch
- remove a sneaky line that dropped an out of range batch without downloading it
- bonus: put the chain_id in the log given to the chain. This is why explicitly logging the chain_id is removed
## Issue Addressed
Recurring sync loop and invalid batch downloading
## Proposed Changes
Shifts the batches to include the first slot of each epoch. This ensures the finalized is always downloaded once a chain has completed syncing.
Also add in logic to prevent re-dialing disconnected peers. Non-performant peers get disconnected during sync, this prevents re-connection to these during sync.
## Additional Info
N/A
* Update `milagro_bls` to new release (#1183)
* Update milagro_bls to new release
Signed-off-by: Kirk Baird <baird.k@outlook.com>
* Tidy up fake cryptos
Signed-off-by: Kirk Baird <baird.k@outlook.com>
* move SecretHash to bls and put plaintext back
Signed-off-by: Kirk Baird <baird.k@outlook.com>
* Update v0.12.0 to v0.12.1
* Use ssz types for Request and error types
* Fix errors
* Constrain BlocksByRangeRequest count to MAX_REQUEST_BLOCKS
* Fix issues after rebasing
* Address review comments
Co-authored-by: Kirk Baird <baird.k@outlook.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
Co-authored-by: Age Manning <Age@AgeManning.com>
* wip: mwake the request id optional
* make the request_id optional
* cleanup
* address clippy lints inside rpc
* WIP: Separate sent RPC events from received ones
* WIP: Separate sent RPC events from received ones
* cleanup
* Separate request ids from substream ids
* Make RPC's message handling independent of RequestIds
* Change behaviour RPC events to be more outside-crate friendly
* Propage changes across the network + router + processor
* Propage changes across the network + router + processor
* fmt
* "tiny" refactor
* more tiny refactors
* fmt eth2-libp2p
* wip: propagating changes
* wip: propagating changes
* cleaning up
* more cleanup
* fmt
* tests HOT fix
Co-authored-by: Age Manning <Age@AgeManning.com>
* Start updating types
* WIP
* Signature hacking
* Existing EF tests passing with fake_crypto
* Updates
* Delete outdated API spec
* The refactor continues
* It compiles
* WIP test fixes
* All release tests passing bar genesis state parsing
* Update and test YamlConfig
* Update to spec v0.10 compatible BLS
* Updates to BLS EF tests
* Add EF test for AggregateVerify
And delete unused hash2curve tests for uncompressed points
* Update EF tests to v0.10.1
* Use optional block root correctly in block proc
* Use genesis fork in deposit domain. All tests pass
* Cargo fmt
* Fast aggregate verify test
* Update REST API docs
* Cargo fmt
* Fix unused import
* Bump spec tags to v0.10.1
* Add `seconds_per_eth1_block` to chainspec
* Update to timestamp based eth1 voting scheme
* Return None from `get_votes_to_consider` if block cache is empty
* Handle overflows in `is_candidate_block`
* Revert to failing tests
* Fix eth1 data sets test
* Choose default vote according to spec
* Fix collect_valid_votes tests
* Fix `get_votes_to_consider` to choose all eligible blocks
* Uncomment winning_vote tests
* Add comments; remove unused code
* Reduce seconds_per_eth1_block for simulation
* Addressed review comments
* Add test for default vote case
* Fix logs
* Remove unused functions
* Meter default eth1 votes
* Fix comments
* Address review comments; remove unused dependency
* Disable/delete two outdated tests
* Bump eth1 default vote warn to error
* Delete outdated eth1 test
Co-authored-by: Pawan Dhananjay <pawandhananjay@gmail.com>