## Issue Addressed
Upgrade libp2p to v0.52
## Proposed Changes
- **Workflows**: remove installation of `protoc`
- **Book**: remove installation of `protoc`
- **`Dockerfile`s and `cross`**: remove custom base `Dockerfile` for cross since it's no longer needed. Remove `protoc` from remaining `Dockerfiles`s
- **Upgrade `discv5` to `v0.3.1`:** we have some cool stuff in there: no longer needs `protoc` and faster ip updates on cold start
- **Upgrade `prometheus` to `0.21.0`**, now it no longer needs encoding checks
- **things that look like refactors:** bunch of api types were renamed and need to be accessed in a different (clearer) way
- **Lighthouse network**
- connection limits is now a behaviour
- banned peers no longer exist on the swarm level, but at the behaviour level
- `connection_event_buffer_size` now is handled per connection with a buffer size of 4
- `mplex` is deprecated and was removed
- rpc handler now logs the peer to which it belongs
## Additional Info
Tried to keep as much behaviour unchanged as possible. However, there is a great deal of improvements we can do _after_ this upgrade:
- Smart connection limits: Connection limits have been checked only based on numbers, we can now use information about the incoming peer to decide if we want it
- More powerful peer management: Dial attempts from other behaviours can be rejected early
- Incoming connections can be rejected early
- Banning can be returned exclusively to the peer management: We should not get connections to banned peers anymore making use of this
- TCP Nat updates: We might be able to take advantage of confirmed external addresses to check out tcp ports/ips
Co-authored-by: Age Manning <Age@AgeManning.com>
Co-authored-by: Akihito Nakano <sora.akatsuki@gmail.com>
## Issue Addressed
#4150
## Proposed Changes
Maintain trusted peers in the pruning logic. ~~In principle the changes here are not necessary as a trusted peer has a max score (100) and all other peers can have at most 0 (because we don't implement positive scores). This means that we should never prune trusted peers unless we have more trusted peers than the target peer count.~~
This change shifts this logic to explicitly never prune trusted peers which I expect is the intuitive behaviour.
~~I suspect the issue in #4150 arises when a trusted peer disconnects from us for one reason or another and then we remove that peer from our peerdb as it becomes stale. When it re-connects at some large time later, it is no longer a trusted peer.~~
Currently we do disconnect trusted peers, and this PR corrects this to maintain trusted peers in the pruning logic.
As suggested in #4150 we maintain trusted peers in the db and thus we remember them even if they disconnect from us.
It is possible that when we go to ban a peer, there is already an unbanned message in the queue. It could lead to the case that we ban and immediately unban a peer leaving us in a state where a should-be banned peer is unbanned.
If this banned peer connects to us in this faulty state, we currently do not attempt to re-ban it. This PR does correct this also, so if we do see this error, it will now self-correct (although we shouldn't see the error in the first place).
I have also incremented the severity of not supporting protocols as I see peers ultimately get banned in a few steps and it seems to make sense to just ban them outright, rather than have them linger.
There is a race condition which occurs when multiple discovery queries return at almost the exact same time and they independently contain a useful peer we would like to connect to.
The condition can occur that we can add the same peer to the dial queue, before we get a chance to process the queue.
This ends up displaying an error to the user:
```
ERRO Dialing an already dialing peer
```
Although this error is harmless it's not ideal.
There are two solutions to resolving this:
1. As we decide to dial the peer, we change the state in the peer-db to dialing (before we add it to the queue) which would prevent other requests from adding to the queue.
2. We prevent duplicates in the dial queue
This PR has opted for 2. because 1. will complicate the code in that we are changing states in non-intuitive places. Although this technically adds a very slight performance cost, its probably a cleaner solution as we can keep the state-changing logic in one place.
## Issue Addressed
Cleans up all the remnants of 4844 in capella. This makes sure when 4844 is reviewed there is nothing we are missing because it got included here
## Proposed Changes
drop a bomb on every 4844 thing
## Additional Info
Merge process I did (locally) is as follows:
- squash merge to produce one commit
- in new branch off unstable with the squashed commit create a `git revert HEAD` commit
- merge that new branch onto 4844 with `--strategy ours`
- compare local 4844 to remote 4844 and make sure the diff is empty
- enjoy
Co-authored-by: Paul Hauner <paul@paulhauner.com>
On heavily crowded networks, we are seeing many attempted connections to our node every second.
Often these connections come from peers that have just been disconnected. This can be for a number of reasons including:
- We have deemed them to be not as useful as other peers
- They have performed poorly
- They have dropped the connection with us
- The connection was spontaneously lost
- They were randomly removed because we have too many peers
In all of these cases, if we have reached or exceeded our target peer limit, there is no desire to accept new connections immediately after the disconnect from these peers. In fact, it often costs us resources to handle the established connections and defeats some of the logic of dropping them in the first place.
This PR adds a timeout, that prevents recently disconnected peers from reconnecting to us.
Technically we implement a ban at the swarm layer to prevent immediate re connections for at least 10 minutes. I decided to keep this light, and use a time-based LRUCache which only gets updated during the peer manager heartbeat to prevent added stress of polling a delay map for what could be a large number of peers.
This cache is bounded in time. An extra space bound could be added should people consider this a risk.
Co-authored-by: Diva M <divma@protonmail.com>
I've needed to do this work in order to do some episub testing.
This version of libp2p has not yet been released, so this is left as a draft for when we wish to update.
Co-authored-by: Diva M <divma@protonmail.com>
## Issue Addressed
Partially addresses #3651
## Proposed Changes
Adds server-side support for light_client_bootstrap_v1 topic
## Additional Info
This PR, creates each time a bootstrap without using cache, I do not know how necessary a cache is in this case as this topic is not supposed to be called frequently and IMHO we can just prevent abuse by using the limiter, but let me know what you think or if there is any caveat to this, or if it is necessary only for the sake of good practice.
Co-authored-by: Pawan Dhananjay <pawandhananjay@gmail.com>
## Issue Addressed
We currently subscribe to attestation subnets as soon as the subscription arrives (one epoch in advance), this makes it so that subscriptions for future slots are scheduled instead of done immediately.
## Proposed Changes
- Schedule subscriptions to subnets for future slots.
- Finish removing hashmap_delay, in favor of [delay_map](https://github.com/AgeManning/delay_map). This was the only remaining service to do this.
- Subscriptions for past slots are rejected, before we would subscribe for one slot.
- Add a new test for subscriptions that are not consecutive.
## Additional Info
This is also an effort in making the code easier to understand
## Issue Addressed
Resolves#3351
## Proposed Changes
Returns a `ResourceUnavailable` rpc error if we are unable to serve full payloads to blocks by root and range requests because the execution layer is not synced.
## Additional Info
This PR also changes the penalties such that a `ResourceUnavailable` error is only penalized if it is an outgoing request. If we are syncing and aren't getting full block responses, then we don't have use for the peer. However, this might not be true for the incoming request case. We let the peer decide in this case if we are still useful or if we should be banned.
cc @divagant-martian please let me know if i'm missing something here.
## Issue Addressed
Which issue # does this PR address?
## Proposed Changes
Please list or describe the changes introduced by this PR.
## Additional Info
Please provide any additional information. For example, future considerations
or information useful for reviewers.
## Issue Addressed
We still ping peers that are considered in a disconnecting state
## Proposed Changes
Do not ping peers once we decide they are disconnecting
Upgrade logs about ignored rpc messages
## Additional Info
--
## Issue Addressed
N/A
## Proposed Changes
Fix the upper bound for blocks by root responses to be equal to the max merge block size instead of altair.
Further make the rpc response limits fork aware.
## Proposed Changes
Add a `lighthouse db` command with three initial subcommands:
- `lighthouse db version`: print the database schema version.
- `lighthouse db migrate --to N`: manually upgrade (or downgrade!) the database to a different version.
- `lighthouse db inspect --column C`: log the key and size in bytes of every value in a given `DBColumn`.
This PR lays the groundwork for other changes, namely:
- Mark's fast-deposit sync (https://github.com/sigp/lighthouse/pull/2915), for which I think we should implement a database downgrade (from v9 to v8).
- My `tree-states` work, which already implements a downgrade (v10 to v8).
- Standalone purge commands like `lighthouse db purge-dht` per https://github.com/sigp/lighthouse/issues/2824.
## Additional Info
I updated the `strum` crate to 0.24.0, which necessitated some changes in the network code to remove calls to deprecated methods.
Thanks to @winksaville for the motivation, and implementation work that I used as a source of inspiration (https://github.com/sigp/lighthouse/pull/2685).
## Issue Addressed
I noticed in some logs some excess and unecessary discovery queries. What was happening was we were pruning our peers down to our outbound target and having some disconnect. When we are below this threshold we try to find more peers (even if we are at our peer limit). The request becomes futile because we have no more peer slots.
This PR corrects this issue and advances the pruning mechanism to favour subnet peers.
An overview the new logic added is:
- We prune peers down to a target outbound peer count which is higher than the minimum outbound peer count.
- We only search for more peers if there is room to do so, and we are below the minimum outbound peer count not the target. So this gives us some buffer for peers to disconnect. The buffer is currently 10%
The modified pruning logic is documented in the code but for reference it should do the following:
- Prune peers with bad scores first
- If we need to prune more peers, then prune peers that are subscribed to a long-lived subnet
- If we still need to prune peers, the prune peers that we have a higher density of on any given subnet which should drive for uniform peers across all subnets.
This will need a bit of testing as it modifies some significant peer management behaviours in lighthouse.
There is a pretty significant tradeoff between bandwidth and speed of gossipsub messages.
We can reduce our bandwidth usage considerably at the cost of minimally delaying gossipsub messages. The impact of delaying messages has not been analyzed thoroughly yet, however this PR in conjunction with some gossipsub updates show considerable bandwidth reduction.
This PR allows the user to set a CLI value (`network-load`) which is an integer in the range of 1 of 5 depending on their bandwidth appetite. 1 represents the least bandwidth but slowest message recieving and 5 represents the most bandwidth and fastest received message time.
For low-bandwidth users it is likely to be more efficient to use a lower value. The default is set to 3, which currently represents a reduced bandwidth usage compared to previous version of this PR. The previous lighthouse versions are equivalent to setting the `network-load` CLI to 4.
This PR is awaiting a few gossipsub updates before we can get it into lighthouse.
## Issue Addressed
#2841
## Proposed Changes
Not counting dialing peers while deciding if we have reached the target peers in case of outbound peers.
## Additional Info
Checked this running in nodes and bandwidth looks normal, peer count looks normal too
## Issue Addressed
N/A
## Proposed Changes
1. Don't disconnect peer from dht on connection limit errors
2. Bump up `PRIORITY_PEER_EXCESS` to allow for dialing upto 60 peers by default.
Co-authored-by: Diva M <divma@protonmail.com>
## Issue Addressed
Part of a bigger effort to make the network globals read only. This moves all writes to the `PeerDB` to the `eth2_libp2p` crate. Limiting writes to the peer manager is a slightly more complicated issue for a next PR, to keep things reviewable.
## Proposed Changes
- Make the peers field in the globals a private field.
- Allow mutable access to the peers field to `eth2_libp2p` for now.
- Add a new network message to update the sync state.
Co-authored-by: Age Manning <Age@AgeManning.com>
RPC Responses are for some reason not removing their timeout when they are completing.
As an example:
```
Nov 09 01:18:20.256 DEBG Received BlocksByRange Request step: 1, start_slot: 728465, count: 64, peer_id: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw
Nov 09 01:18:20.263 DEBG Received BlocksByRange Request step: 1, start_slot: 728593, count: 64, peer_id: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw
Nov 09 01:18:20.483 DEBG BlocksByRange Response sent returned: 63, requested: 64, current_slot: 2466389, start_slot: 728465, msg: Failed to return all requested blocks, peer: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw
Nov 09 01:18:20.500 DEBG BlocksByRange Response sent returned: 64, requested: 64, current_slot: 2466389, start_slot: 728593, peer: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw
Nov 09 01:18:21.068 DEBG Received BlocksByRange Request step: 1, start_slot: 728529, count: 64, peer_id: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw
Nov 09 01:18:21.272 DEBG BlocksByRange Response sent returned: 63, requested: 64, current_slot: 2466389, start_slot: 728529, msg: Failed to return all requested blocks, peer: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw
Nov 09 01:18:23.434 DEBG Received BlocksByRange Request step: 1, start_slot: 728657, count: 64, peer_id: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw
Nov 09 01:18:23.665 DEBG BlocksByRange Response sent returned: 64, requested: 64, current_slot: 2466390, start_slot: 728657, peer: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw
Nov 09 01:18:25.851 DEBG Received BlocksByRange Request step: 1, start_slot: 728337, count: 64, peer_id: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw
Nov 09 01:18:25.851 DEBG Received BlocksByRange Request step: 1, start_slot: 728401, count: 64, peer_id: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw
Nov 09 01:18:26.094 DEBG BlocksByRange Response sent returned: 62, requested: 64, current_slot: 2466390, start_slot: 728401, msg: Failed to return all requested blocks, peer: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw
Nov 09 01:18:26.100 DEBG BlocksByRange Response sent returned: 63, requested: 64, current_slot: 2466390, start_slot: 728337, msg: Failed to return all requested blocks, peer: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw
Nov 09 01:18:31.070 DEBG RPC Error direction: Incoming, score: 0, peer_id: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw, client: Prysm: version: a80b1c252a9b4773493b41999769bf3134ac373f, os_version: unknown, err: Stream Timeout, protocol: beacon_blocks_by_range, service: libp2p
Nov 09 01:18:31.070 WARN Timed out to a peer's request. Likely insufficient resources, reduce peer count, service: libp2p
Nov 09 01:18:31.085 DEBG RPC Error direction: Incoming, score: 0, peer_id: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw, client: Prysm: version: a80b1c252a9b4773493b41999769bf3134ac373f, os_version: unknown, err: Stream Timeout, protocol: beacon_blocks_by_range, service: libp2p
Nov 09 01:18:31.085 WARN Timed out to a peer's request. Likely insufficient resources, reduce peer count, service: libp2p
Nov 09 01:18:31.459 DEBG RPC Error direction: Incoming, score: 0, peer_id: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw, client: Prysm: version: a80b1c252a9b4773493b41999769bf3134ac373f, os_version: unknown, err: Stream Timeout, protocol: beacon_blocks_by_range, service: libp2p
Nov 09 01:18:31.459 WARN Timed out to a peer's request. Likely insufficient resources, reduce peer count, service: libp2p
Nov 09 01:18:34.129 DEBG RPC Error direction: Incoming, score: 0, peer_id: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw, client: Prysm: version: a80b1c252a9b4773493b41999769bf3134ac373f, os_version: unknown, err: Stream Timeout, protocol: beacon_blocks_by_range, service: libp2p
Nov 09 01:18:34.130 WARN Timed out to a peer's request. Likely insufficient resources, reduce peer count, service: libp2p
Nov 09 01:18:35.686 DEBG Peer Manager disconnecting peer reason: Too many peers, peer_id: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw, service: libp2p
```
This PR is to investigate and correct the issue.
~~My current thoughts are that for some reason we are not closing the streams correctly, or fast enough, or the executor is not registering the closes and waking up.~~ - Pretty sure this is not the case, see message below for a more accurate reason.
~~I've currently added a timeout to stream closures in an attempt to force streams to close and the future to always complete.~~ I removed this
This simply moves some functions that were "swarm notifications" to a network behaviour implementation.
Notes
------
- We could disconnect from the peer manager but we would lose the rpc shutdown message
- We still notify from the swarm since this is the most reliable way to get some events. Ugly but best for now
- Events need to be pushed with "add event" to wake the waker
Co-authored-by: Divma <26765164+divagant-martian@users.noreply.github.com>