## Issue Addressed
N/A
## Proposed Changes
- Removing a bunch of unnecessary references
- Updated `Error::VariantError` to `Error::Variant`
- There were additional enum variant lints that I ignored, because I thought our variant names were fine
- removed `MonitoredValidator`'s `pubkey` field, because I couldn't find it used anywhere. It looks like we just use the string version of the pubkey (the `id` field) if there is no index
## Additional Info
Co-authored-by: realbigsean <seananderson33@gmail.com>
* Adjust beacon node timeouts for validator client HTTP requests (#2352)
Resolves#2313
Provide `BeaconNodeHttpClient` with a dedicated `Timeouts` struct.
This will allow granular adjustment of the timeout duration for different calls made from the VC to the BN. These can either be a constant value, or as a ratio of the slot duration.
Improve timeout performance by using these adjusted timeout duration's only whenever a fallback endpoint is available.
Add a CLI flag called `use-long-timeouts` to revert to the old behavior.
Additionally set the default `BeaconNodeHttpClient` timeouts to the be the slot duration of the network, rather than a constant 12 seconds. This will allow it to adjust to different network specifications.
Co-authored-by: Paul Hauner <paul@paulhauner.com>
* Use read_recursive locks in database (#2417)
Closes#2245
Replace all calls to `RwLock::read` in the `store` crate with `RwLock::read_recursive`.
* Unfortunately we can't run the deadlock detector on CI because it's pinned to an old Rust 1.51.0 nightly which cannot compile Lighthouse (one of our deps uses `ptr::addr_of!` which is too new). A fun side-project at some point might be to update the deadlock detector.
* The reason I think we haven't seen this deadlock (at all?) in practice is that _writes_ to the database's split point are quite infrequent, and a concurrent write is required to trigger the deadlock. The split point is only written when finalization advances, which is once per epoch (every ~6 minutes), and state reads are also quite sporadic. Perhaps we've just been incredibly lucky, or there's something about the timing of state reads vs database migration that protects us.
* I wrote a few small programs to demo the deadlock, and the effectiveness of the `read_recursive` fix: https://github.com/michaelsproul/relock_deadlock_mvp
* [The docs for `read_recursive`](https://docs.rs/lock_api/0.4.2/lock_api/struct.RwLock.html#method.read_recursive) warn of starvation for writers. I think in order for starvation to occur the database would have to be spammed with so many state reads that it's unable to ever clear them all and find time for a write, in which case migration of states to the freezer would cease. If an attack could be performed to trigger this starvation then it would likely trigger a deadlock in the current code, and I think ceasing migration is preferable to deadlocking in this extreme situation. In practice neither should occur due to protection from spammy peers at the network layer. Nevertheless, it would be prudent to run this change on the testnet nodes to check that it doesn't cause accidental starvation.
* Return more detail when invalid data is found in the DB during startup (#2445)
- Resolves#2444
Adds some more detail to the error message returned when the `BeaconChainBuilder` is unable to access or decode block/state objects during startup.
NA
* Use hardware acceleration for SHA256 (#2426)
Modify the SHA256 implementation in `eth2_hashing` so that it switches between `ring` and `sha2` to take advantage of [x86_64 SHA extensions](https://en.wikipedia.org/wiki/Intel_SHA_extensions). The extensions are available on modern Intel and AMD CPUs, and seem to provide a considerable speed-up: on my Ryzen 5950X it dropped state tree hashing times by about 30% from 35ms to 25ms (on Prater).
The extensions became available in the `sha2` crate [last year](https://www.reddit.com/r/rust/comments/hf2vcx/ann_rustcryptos_sha1_and_sha2_now_support/), and are not available in Ring, which uses a [pure Rust implementation of sha2](https://github.com/briansmith/ring/blob/main/src/digest/sha2.rs). Ring is faster on CPUs that lack the extensions so I've implemented a runtime switch to use `sha2` only when the extensions are available. The runtime switching seems to impose a miniscule penalty (see the benchmarks linked below).
* Start a release checklist (#2270)
NA
Add a checklist to the release draft created by CI. I know @michaelsproul was also working on this and I suspect @realbigsean also might have useful input.
NA
* Serious banning
* fmt
Co-authored-by: Mac L <mjladson@pm.me>
Co-authored-by: Paul Hauner <paul@paulhauner.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Issue Addressed
`make lint` failing on rust 1.53.0.
## Proposed Changes
1.53.0 updates
## Additional Info
I haven't figure out why yet, we were now hitting the recursion limit in a few crates. So I had to add `#![recursion_limit = "256"]` in a few places
Co-authored-by: realbigsean <seananderson33@gmail.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Issue Addressed
#2325
## Proposed Changes
This pull request changes the behavior of the Peer Manager by including a minimum outbound-only peers requirement. The peer manager will continue querying for peers if this outbound-only target number hasn't been met. Additionally, when peers are being removed, an outbound-only peer will not be disconnected if doing so brings us below the minimum.
## Additional Info
Unit test for heartbeat function tests that disconnection behavior is correct. Continual querying for peers if outbound-only hasn't been met is not directly tested, but indirectly through unit testing of the helper function that counts the number of outbound-only peers.
EDIT: Am concerned about the behavior of ```update_peer_scores```. If we have connected to a peer with a score below the disconnection threshold (-20), then its connection status will remain connected, while its score state will change to disconnected.
```rust
let previous_state = info.score_state();
// Update scores
info.score_update();
Self::handle_score_transitions(
previous_state,
peer_id,
info,
&mut to_ban_peers,
&mut to_unban_peers,
&mut self.events,
&self.log,
);
```
```previous_state``` will be set to Disconnected, and then because ```handle_score_transitions``` only changes connection status for a peer if the state changed, the peer remains connected. Then in the heartbeat code, because we only disconnect healthy peers if we have too many peers, these peers don't get disconnected. I'm not sure realistically how often this scenario would occur, but it might be better to adjust the logic to account for scenarios where the score state implies a connection status different from the current connection status.
Co-authored-by: Kevin Lu <kevlu93@gmail.com>
The ordering of adding new peers to the peerdb and deciding when to dial them was not considered in a previous update.
This adds the condition that if a peer is not in the peer-db then it is an acceptable peer to dial.
This makes #2374 obsolete.
## Issue Addressed
#2107
## Proposed Change
The peer manager will mark peers as disconnected in the discv5 DHT when they disconnect or dial fails
## Additional Info
Rationale for this particular change is explained in my comment on #2107
## Issue Addressed
NA
## Proposed Changes
Rust 1.50 has landed 🎉
The shiny new `clippy` peers down upon us mere mortals with disgust. Brutish peasants wrapping our `usize`s in superfluous `Option`s... tsk tsk.
I've performed the goat sacrifice and corrected our evil ways in this PR. Tonight we shall pray that Github Actions bestows the almighty green tick upon us.
## Additional Info
NA
Co-authored-by: realbigsean <seananderson33@gmail.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Issue Addressed
resolves#2129resolves#2099
addresses some of #1712
unblocks #2076
unblocks #2153
## Proposed Changes
- Updates all the dependencies mentioned in #2129, except for web3. They haven't merged their tokio 1.0 update because they are waiting on some dependencies of their own. Since we only use web3 in tests, I think updating it in a separate issue is fine. If they are able to merge soon though, I can update in this PR.
- Updates `tokio_util` to 0.6.2 and `bytes` to 1.0.1.
- We haven't made a discv5 release since merging tokio 1.0 updates so I'm using a commit rather than release atm. **Edit:** I think we should merge an update of `tokio_util` to 0.6.2 into discv5 before this release because it has panic fixes in `DelayQueue` --> PR in discv5: https://github.com/sigp/discv5/pull/58
## Additional Info
tokio 1.0 changes that required some changes in lighthouse:
- `interval.next().await.is_some()` -> `interval.tick().await`
- `sleep` future is now `!Unpin` -> https://github.com/tokio-rs/tokio/issues/3028
- `try_recv` has been temporarily removed from `mpsc` -> https://github.com/tokio-rs/tokio/issues/3350
- stream features have moved to `tokio-stream` and `broadcast::Receiver::into_stream()` has been temporarily removed -> `https://github.com/tokio-rs/tokio/issues/2870
- I've copied over the `BroadcastStream` wrapper from this PR, but can update to use `tokio-stream` once it's merged https://github.com/tokio-rs/tokio/pull/3384
Co-authored-by: realbigsean <seananderson33@gmail.com>
## Issue Addressed
NA
## Proposed Changes
As per #2100, uses derives from the sturm library to implement AsRef<str> and AsStaticRef to easily get str values from enums without creating new Strings. Furthermore unifies all attestation error counter into one IntCounterVec vector.
These works are originally by @blacktemplar, I've just created this PR so I can resolve some merge conflicts.
## Additional Info
NA
Co-authored-by: blacktemplar <blacktemplar@a1.net>
## Issue Addressed
`test_dht_persistence` failing
## Proposed Changes
Bind `NetworkService::start` to an underscore prefixed variable rather than `_`. `_` was causing it to be dropped immediately
This was failing 5/100 times before this update, but I haven't been able to get it to fail after updating it
Co-authored-by: realbigsean <seananderson33@gmail.com>
## Issue Addressed
Fixes#2146
## Proposed Changes
Change ping timeout errors to return `LowToleranceErrors` so that we disconnect faster on internet failures/changes.
This is a little bit of a tip-of-the-iceberg PR. It houses a lot of code changes in the libp2p dependency.
This needs a bit of thorough testing before merging.
The primary code changes are:
- General libp2p dependency update
- Gossipsub refactor to shift compression into gossipsub providing performance improvements and improved API for handling compression
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Issue Addressed
N/A
## Proposed Changes
Fixes multiple issues related to discovering of subnet peers.
1. Subnet discovery retries after yielding no results
2. Metadata updates if peer send older metadata
3. peerdb stores the peer subscriptions from gossipsub
## Issue Addressed
Related to #1891, The error is not in the spec yet (see ethereum/eth2.0-specs#2131)
## Proposed Changes
Implement the proposed error, banning peers that send it
## Additional Info
NA
## Issue Addressed
#2046
## Proposed Changes
The log was originally intended to verify the correct logic and ordering of events when scoring peers. The queued tasks can be structured in such a way that peers can be banned after they are disconnected. Therefore the error log is now downgraded to debug log.
## Issue Addressed
NA
## Proposed Changes
This was mostly done to find the reason why LH was dropping peers from Nimbus. It proved to be useful so I think it's worth it. But there is also some functional stuff here
- Add metrics for rpc errors per client, error type and direction
- Add metrics for downscoring events per source type, client and penalty type
- Add metrics for gossip validation results per client for non-accepted messages
- Make the RPC handler return errors and requests/responses in the order we see them
- Allow a small burst for the Ping rate limit, from 1 every 5 seconds to 2 every 10 seconds
- Send rate limiting errors with a particular code and use that same code to identify them. I picked something different to 128 since that is most likely what other clients are using for their own errors
- Remove some unused code in the `PeerAction` and the rpc handler
- Remove the unused variant `RateLimited`. tTis was never produced directly, since the only way to get the request's protocol is via de handler. The handler upon receiving from LH a response with an error (rate limited in this case) emits this event with the missing info (It was always like this, just pointing out that we do downscore rate limiting errors regardless of the change)
Metrics for Nimbus looked like this:
Downscoring events: `increase(libp2p_peer_actions_per_client{client="Nimbus"}[5m])`

RPC Errors: `increase(libp2p_rpc_errors_per_client{client="Nimbus"}[5m])`

Unaccepted gossip message: `increase(gossipsub_unaccepted_messages_per_client{client="Nimbus"}[5m])`

## Issue Addressed
Following slog's documentation, this should help a bit with string allocations. I left it run for two days and mem usage is lower. This is of course anecdotal, but shouldn't harm anyway
## Proposed Changes
remove `String` creation in logs when possible
## Issue Addressed
NA
## Proposed Changes
This was causing:
```
Nov 28 21:56:08.154 ERRO slog-async: logger dropped messages due to channel overflow, count: 44, service: libp2p
```
## Additional Info
NA
## Description
This PR updates Lighthouse to tokio 0.3. It includes a number of dependency updates and some structural changes as to how we create and spawn tasks.
This also brings with it a number of various improvements:
- Discv5 update
- Libp2p update
- Fix for recompilation issues
- Improved UPnP port mapping handling
- Futures dependency update
- Log downgrade to traces for rejecting peers when we've reached our max
Co-authored-by: blacktemplar <blacktemplar@a1.net>
## Issue Addressed
- Add metrics to keep track of peer counts by sync type
- Add metric to keep track of the number of syncing chains in range
## Proposed Changes
Plugin to the network metrics update interval and update too the counts for peers wrt to their sync status with us
## Additional Info
For the peer counts
- By the way it is implemented the numbers won't always match to the total peer count in the `libp2p` metric.
- Updating the gauge with every change is messy because it requires to be updated on connection (in the `eth2_libp2p` crate, while metrics are defined in the `network` crate) on Goodbye sent (for an `IrrelevantPeer`) either in the `beacon_processor` or the `peer_manager`, and on disconnection. Since this is not a critical metric I think counting once every second is enough. If you think more accuracy is needed we can do it too, but it would be harder to maintain)
ATM those look like this

## Issue Addressed
NA
## Proposed Changes
Removes most of the temporary string initializations in network metrics and replaces them by directly using `&str`. This further improves on PR https://github.com/sigp/lighthouse/pull/1895.
For the subnet id handling the current approach uses a build script to create a static map. This has the disadvantage that the build script hardcodes the number of subnets. If we want to use more than 64 subnets we need to adjust this in the build script.
## Additional Info
We still have some string initializations for the enum `PeerKind`. To also replace that by `&str` I created a PR in the libp2p dependency: https://github.com/sigp/rust-libp2p/pull/91. Either we wait with merging until this dependency PR is merged (and all conflicts with the newest libp2p version are resolved) or we just merge as is and I will create another PR when the dependency is ready.
This PR adds a number of improvements:
- Downgrade a warning log when we ignore blocks for gossipsub processing
- Revert a a correction to improve logging of peer score changes
- Shift syncing DB reads off the core-executor allowing parallel processing of large sync messages
- Correct the timeout logic of RPC chunk sends, giving more time before timing out RPC outbound messages.
## Issue Addressed
- RPC Errors were being logged twice: first in the peer manager and then again in the router, so leave just the peer manager's one
- The "reduce peer count" warn message gets thrown to the user for every missed chunk, so instead print it when the request times out and also do not include there info that is not relevant to the user
- The processor didn't have the service tag so add it
- Impl `KV` for status message
- Do not downscore peers if we are the ones that timed out
Other small improvements
## Issue Addressed
#1856
## Proposed Changes
- For clarity, the router's processor now only decides if a peer is compatible and it disconnects it or sends it to sync accordingly. No logic here regarding how useful is the peer.
- Update peer_sync_info's rules
- Add an `IrrelevantPeer` sync status to account for incompatible peers (maybe this should be "IncompatiblePeer" now that I think about it?) this state is update upon receiving an internal goodbye in the peer manager
- Misc code cleanups
- Reduce the need to create `StatusMessage`s (and thus, `Arc` accesses )
- Add missing calls to update the global sync state
The overall effect should be:
- More peers recognized as Behind, and less as Unknown
- Peers identified as incompatible
## Issue Addressed
- Asymmetric pings - Currently with symmetric ping intervals, lighthouse nodes race each other to ping often ending in simultaneous ping connections. This shifts the ping interval to be asymmetric based on inbound/outbound connections
- Correct inbound/outbound peer-db registering - It appears we were accounting inbound as outbound and vice versa in the peerdb, this has been corrected
- Improved logging
There is likely more to come - I'll leave this open as we investigate further testnets
## Issue Addressed
NA
## Proposed Changes
Correctly handles peer state transitions on gossipsub changes + refactors handling of peer state transitions into one function used for lighthouse score changes and gossipsub score changes.
Co-authored-by: Age Manning <Age@AgeManning.com>
## Issue Addressed
#1606
## Proposed Changes
Uses dynamic gossipsub scoring parameters depending on the number of active validators as specified in https://gist.github.com/blacktemplar/5c1862cb3f0e32a1a7fb0b25e79e6e2c.
## Additional Info
Although the parameters got tested on Medalla, extensive testing using simulations on larger networks is still to be done and we expect that we need to change the parameters, although this might only affect constants within the dynamic parameter framework.
## Issue Addressed
NA
## Proposed Changes
Lets the networking behavior ignore messages of peers that are not connected. Furthermore, old peers are not removed from the peerdb based on score anymore but based on the disconnection time.
## Issue Addressed
Fixes head syncing
## Proposed Changes
- Get back to statusing peers after removing chain segments and making the peer manager deal with status according to the Sync status, preventing an old known deadlock
- Also a bug where a chain would get removed if the optimistic batch succeeds being empty
## Additional Info
Tested on Medalla and looking good
## Issue Addressed
NA
## Proposed Changes
Currently a banned peer will remain banned indefinitely as long as update is called on the score struct regularly. This fixes this bug and the score decay starts after `BANNED_BEFORE_DECAY` seconds after banning.
## Issue Addressed
Further testing revealed another edge case where we attempt to unban a peer that can be in a disconnected start. Although this causes no real issue, it does log an error to the user.
This PR adds a check to prevent this edge case and prevents the error being logged to the user.
This PR prevents a user-facing error.
It prevents optimistically unbanning a peer and instead checks the state of the peer before requesting the peers state to be unbanned.
## Issue Addressed
Potentially resolves#1647 and sync stalls.
## Proposed Changes
The handling of the state of banned peers was inadequate for the complex peerdb data structure. We store a limited number of disconnected and banned peers in the db. We were not tracking intermediate "disconnecting" states and the in some circumstances we were updating the peer state without informing the peerdb. This lead to a number of inconsistencies in the peer state.
Further, the peer manager could ban a peer changing a peer's state from being connected to banned. In this circumstance, if the peer then disconnected, we didn't inform the application layer, which lead to applications like sync not being informed of a peers disconnection. This could lead to sync stalling and having to require a lighthouse restart.
Improved handling for peer states and interactions with the peerdb is made in this PR.
## Issue Addressed
`node` endpoints in #1434
## Proposed Changes
Implement these:
```
/eth/v1/node/health
/eth/v1/node/peers/{peer_id}
/eth/v1/node/peers
```
- Add an `Option<Enr>` to `PeerInfo`
- Finish implementation of `/eth/v1/node/identity`
## Additional Info
- should update the `peers` endpoints when #1764 is resolved
Co-authored-by: realbigsean <seananderson33@gmail.com>
check for advanced peers and the state of the chain wrt the clock slot to decide if a chain is or not synced /transitioning to a head sync. Also a fix that prevented getting the right state while syncing heads
## Issue Addressed
#1614 and a couple of sync-stalling problems, the most important is a cyclic dependency between the sync manager and the peer manager
## Issue Addressed
NA
## Proposed Changes
Adds a direction field to `PeerConnectionStatus` that can be accessed by calling `is_outgoing` which will return `true` iff the peer is connected and the first connection was an outgoing one.