This was found in the wild on a node that was reconstructing states and had a ~350 epoch gap
between finalization and the split slot. When the finalization migration ran it deleted some of
the hot state summaries that were still being iterated by `load_hot_state`, causing state storage
to fail and the database to become corrupt (due to the atomicity bug from `unstable`).
This commit additonally adds states created from diffs to the cache. This should help avoid
iterating back so far but is strictly a performance improvement and *not* a correctness fix.
Squashed commit of the following:
commit 974b3359f8
Merge: ac205b7ba480309fb9
Author: Michael Sproul <michael@sigmaprime.io>
Date: Wed Jan 18 10:01:26 2023 +1100
Merge remote-tracking branch 'origin/unstable' into jemalloc
commit 480309fb96
Author: aliask <aliask@gmail.com>
Date: Tue Jan 17 05:13:49 2023 +0000
Fix some dead links in markdown files (#3885)
## Issue Addressed
No issue has been raised for these broken links.
## Proposed Changes
Update links with the new URLs for the same document.
## Additional Info
~The link for the [Lighthouse Development Updates](https://eepurl.com/dh9Lvb/) mailing list is also broken, but I can't find the correct link.~
Co-authored-by: Paul Hauner <paul@paulhauner.com>
commit b4d9fc03ee
Author: GeemoCandama <geemo@tutanota.com>
Date: Tue Jan 17 05:13:48 2023 +0000
add logging for starting request and receiving block (#3858)
## Issue Addressed
#3853
## Proposed Changes
Added `INFO` level logs for requesting and receiving the unsigned block.
## Additional Info
Logging for successfully publishing the signed block is already there. And seemingly there is a log for when "We realize we are going to produce a block" in the `start_update_service`: `info!(log, "Block production service started");
`. Is there anywhere else you'd like to see logging around this event?
Co-authored-by: GeemoCandama <104614073+GeemoCandama@users.noreply.github.com>
commit 9a970ce3a2
Author: David Theodore <prodigalsonsolutions@gmail.com>
Date: Tue Jan 17 05:13:47 2023 +0000
add better err reporting UnableToOpenVotingKeystore (#3781)
## Issue Addressed
#3780
## Proposed Changes
Add error reporting that notifies the node operator that the `voting_keystore_path` in their `validator_definitions.yml` file may be incorrect.
## Additional Info
There is more info in issue #3780
Co-authored-by: Paul Hauner <paul@paulhauner.com>
commit ac205b7bab
Merge: 93457d85bbf533c8e4
Author: Michael Sproul <michael@sigmaprime.io>
Date: Fri Nov 25 16:32:33 2022 +1100
Merge remote-tracking branch 'origin/unstable' into jemalloc
commit 93457d85b7
Author: Michael Sproul <michael@sigmaprime.io>
Date: Wed Nov 9 11:53:59 2022 +1100
Fix cargo-udeps
commit 6c42aef1b5
Author: Michael Sproul <micsproul@gmail.com>
Date: Tue Nov 8 19:12:19 2022 +1100
Fixups
commit f14b87bb88
Author: Michael Sproul <michael@sigmaprime.io>
Date: Tue Nov 8 16:28:16 2022 +1100
Update docs
commit 5005dc3b65
Author: Michael Sproul <michael@sigmaprime.io>
Date: Tue Nov 8 16:22:42 2022 +1100
Fix lcli
commit a082ba5904
Author: Michael Sproul <michael@sigmaprime.io>
Date: Tue Nov 8 16:17:10 2022 +1100
Remove check-consensus
commit 81441e9cea
Author: Michael Sproul <micsproul@gmail.com>
Date: Tue Nov 8 15:28:11 2022 +1100
Disable jemalloc on Windows
commit 41eac5d0c1
Author: Michael Sproul <micsproul@gmail.com>
Date: Tue Nov 8 13:46:17 2022 +1100
Compatibility with macOS
commit 69ecba7876
Author: Michael Sproul <michael@sigmaprime.io>
Date: Mon Nov 7 18:48:31 2022 +1100
Add jemalloc support
## Proposed Changes
Decouple the stdout and logfile formats by adding the `--logfile-format` CLI flag.
This behaves identically to the existing `--log-format` flag, but instead will only affect the logs written to the logfile.
The `--log-format` flag will no longer have any effect on the contents of the logfile.
## Additional Info
This avoids being a breaking change by causing `logfile-format` to default to the value of `--log-format` if it is not provided.
This means that users who were previously relying on being able to use a JSON formatted logfile will be able to continue to use `--log-format JSON`.
Users who want to use JSON on stdout and default logs in the logfile, will need to pass the following flags: `--log-format JSON --logfile-format DEFAULT`
## Issue Addressed
Issue #3112
## Proposed Changes
Add `Filter::recover` to the GET chain to handle rejections specifically as 404 NOT FOUND
## Additional Info
Making a request to `http://localhost:5052/not_real` now returns the following:
```
{
"code": 404,
"message": "NOT_FOUND",
"stacktraces": []
}
```
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Proposed Changes
Update all dependencies to new semver-compatible releases with `cargo update`. Importantly this patches a Tokio vuln: https://rustsec.org/advisories/RUSTSEC-2023-0001. I don't think we were affected by the vuln because it only applies to named pipes on Windows, but it's still good hygiene to patch.
## Issue Addressed
NA
## Proposed Changes
Myself and others (#3678) have observed that when running with lots of validators (e.g., 1000s) the cardinality is too much for Prometheus. I've seen Prometheus instances just grind to a halt when we turn the validator monitor on for our testnet validators (we have 10,000s of Goerli validators). Additionally, the debug log volume can get very high with one log per validator, per attestation.
To address this, the `bn --validator-monitor-individual-tracking-threshold <INTEGER>` flag has been added to *disable* per-validator (i.e., non-aggregated) metrics/logging once the validator monitor exceeds the threshold of validators. The default value is `64`, which is a finger-to-the-wind value. I don't actually know the value at which Prometheus starts to become overwhelmed, but I've seen it work with ~64 validators and I've seen it *not* work with 1000s of validators. A default of `64` seems like it will result in a breaking change to users who are running millions of dollars worth of validators whilst resulting in a no-op for low-validator-count users. I'm open to changing this number, though.
Additionally, this PR starts collecting aggregated Prometheus metrics (e.g., total count of head hits across all validators), so that high-validator-count validators still have some interesting metrics. We already had logging for aggregated values, so nothing has been added there.
I've opted to make this a breaking change since it can be rather damaging to your Prometheus instance to accidentally enable the validator monitor with large numbers of validators. I've crashed a Prometheus instance myself and had a report from another user who's done the same thing.
## Additional Info
NA
## Breaking Changes Note
A new label has been added to the validator monitor Prometheus metrics: `total`. This label tracks the aggregated metrics of all validators in the validator monitor (as opposed to each validator being tracking individually using its pubkey as the label).
Additionally, a new flag has been added to the Beacon Node: `--validator-monitor-individual-tracking-threshold`. The default value is `64`, which means that when the validator monitor is tracking more than 64 validators then it will stop tracking per-validator metrics and only track the `all_validators` metric. It will also stop logging per-validator logs and only emit aggregated logs (the exception being that exit and slashing logs are always emitted).
These changes were introduced in #3728 to address issues with untenable Prometheus cardinality and log volume when using the validator monitor with high validator counts (e.g., 1000s of validators). Users with less than 65 validators will see no change in behavior (apart from the added `all_validators` metric). Users with more than 65 validators who wish to maintain the previous behavior can set something like `--validator-monitor-individual-tracking-threshold 999999`.
## Issue Addressed
Recent discussions with other client devs about optimistic sync have revealed a conceptual issue with the optimisation implemented in #3738. In designing that feature I failed to consider that the execution node checks the `blockHash` of the execution payload before responding with `SYNCING`, and that omitting this check entirely results in a degradation of the full node's validation. A node omitting the `blockHash` checks could be tricked by a supermajority of validators into following an invalid chain, something which is ordinarily impossible.
## Proposed Changes
I've added verification of the `payload.block_hash` in Lighthouse. In case of failure we log a warning and fall back to verifying the payload with the execution client.
I've used our existing dependency on `ethers_core` for RLP support, and a new dependency on Parity's `triehash` crate for the Merkle patricia trie. Although the `triehash` crate is currently unmaintained it seems like our best option at the moment (it is also used by Reth, and requires vastly less boilerplate than Parity's generic `trie-root` library).
Block hash verification is pretty quick, about 500us per block on my machine (mainnet).
The optimistic finalized sync feature can be disabled using `--disable-optimistic-finalized-sync` which forces full verification with the EL.
## Additional Info
This PR also introduces a new dependency on our [`metastruct`](https://github.com/sigp/metastruct) library, which was perfectly suited to the RLP serialization method. There will likely be changes as `metastruct` grows, but I think this is a good way to start dogfooding it.
I took inspiration from some Parity and Reth code while writing this, and have preserved the relevant license headers on the files containing code that was copied and modified.
I've needed to do this work in order to do some episub testing.
This version of libp2p has not yet been released, so this is left as a draft for when we wish to update.
Co-authored-by: Diva M <divma@protonmail.com>
Our custom RPC implementation is lagging from the libp2p v50 version.
We are going to need to change a bunch of function names and would be nice to have consistent ordering of function names inside the handlers.
This is a precursor to the libp2p upgrade to minimize merge conflicts in function ordering.
The notion of "phases" doesn't exist anymore in the Ethereum roadmap. Also fix dead link to roadmap.
Co-authored-by: Michael Sproul <micsproul@gmail.com>
## Issue Addressed
While testing withdrawals with @ethDreamer we noticed lighthouse is sending empty batches when an error occurs. As LH peer receiving this, we would consider this a low tolerance action because the peer is claiming the batch is right and is empty.
## Proposed Changes
If any kind of error occurs, send a error response instead
## Additional Info
Right now we don't handle such thing as a partial batch with an error. If an error is received, the whole batch is discarded. Because of this it makes little sense to send partial batches that end with an error, so it's better to do the proposed solution instead of sending empty batches.
## Proposed Changes
Update the Gnosis chain bootnodes. The current list of Gnosis bootnodes were abandoned at some point before the Gnosis merge and are now failing to bootstrap peers. There's a workaround list of bootnodes here: https://docs.gnosischain.com/updates/20221208-temporary-bootnodes
The list from this PR represents the long-term bootnodes run by the Gnosis team. We will also try to set up SigP bootnodes for Gnosis chain at some point.
## Proposed Changes
With proposer boosting implemented (#2822) we have an opportunity to re-org out late blocks.
This PR adds three flags to the BN to control this behaviour:
* `--disable-proposer-reorgs`: turn aggressive re-orging off (it's on by default).
* `--proposer-reorg-threshold N`: attempt to orphan blocks with less than N% of the committee vote. If this parameter isn't set then N defaults to 20% when the feature is enabled.
* `--proposer-reorg-epochs-since-finalization N`: only attempt to re-org late blocks when the number of epochs since finalization is less than or equal to N. The default is 2 epochs, meaning re-orgs will only be attempted when the chain is finalizing optimally.
For safety Lighthouse will only attempt a re-org under very specific conditions:
1. The block being proposed is 1 slot after the canonical head, and the canonical head is 1 slot after its parent. i.e. at slot `n + 1` rather than building on the block from slot `n` we build on the block from slot `n - 1`.
2. The current canonical head received less than N% of the committee vote. N should be set depending on the proposer boost fraction itself, the fraction of the network that is believed to be applying it, and the size of the largest entity that could be hoarding votes.
3. The current canonical head arrived after the attestation deadline from our perspective. This condition was only added to support suppression of forkchoiceUpdated messages, but makes intuitive sense.
4. The block is being proposed in the first 2 seconds of the slot. This gives it time to propagate and receive the proposer boost.
## Additional Info
For the initial idea and background, see: https://github.com/ethereum/consensus-specs/pull/2353#issuecomment-950238004
There is also a specification for this feature here: https://github.com/ethereum/consensus-specs/pull/3034
Co-authored-by: Michael Sproul <micsproul@gmail.com>
Co-authored-by: pawan <pawandhananjay@gmail.com>
## Issue Addressed
NA
## Proposed Changes
This is a *potentially* contentious change, but I find it annoying that the validator monitor logs `WARN` and `ERRO` for imperfect attestations. Perfect attestation performance is unachievable (don't believe those photo-shopped beauty magazines!) since missed and poorly-packed blocks by other validators will reduce your performance.
When the validator monitor is on with 10s or more validators, I find the logs are washed out with ERROs that are not worth investigating. I suspect that users who really want to know if validators are missing attestations can do so by matching the content of the log, rather than the log level.
I'm open to feedback about this, especially from anyone who is relying on the current log levels.
## Additional Info
NA
## Breaking Changes Notes
The validator monitor will no longer emit `WARN` and `ERRO` logs for sub-optimal attestation performance. The logs will now be emitted at `INFO` level. This change was introduced to avoid cluttering the `WARN` and `ERRO` logs with alerts that are frequently triggered by the actions of other network participants (e.g., a missed block) and require no action from the user.
## Issue Addressed
Implementing the light_client_gossip topics but I'm not there yet.
Which issue # does this PR address?
Partially #3651
## Proposed Changes
Add light client gossip topics.
Please list or describe the changes introduced by this PR.
I'm going to Implement light_client_finality_update and light_client_optimistic_update gossip topics. Currently I've attempted the former and I'm seeking feedback.
## Additional Info
I've only implemented the light_client_finality_update topic because I wanted to make sure I was on the correct path. Also checking that the gossiped LightClientFinalityUpdate is the same as the locally constructed one is not implemented because caching the updates will make this much easier. Could someone give me some feedback on this please?
Please provide any additional information. For example, future considerations
or information useful for reviewers.
Co-authored-by: GeemoCandama <104614073+GeemoCandama@users.noreply.github.com>
## Issue Addressed
NA
## Proposed Changes
In #3725 I introduced a `CRIT` log for unrevealed payloads, against @michaelsproul's [advice](https://github.com/sigp/lighthouse/pull/3725#discussion_r1034142113). After being woken up in the middle of the night by a block that was not revealed to the BN but *was* revealed to the network, I have capitulated. This PR implements @michaelsproul's suggestion and reduces the severity to `ERRO`.
Additionally, I have dropped a `CRIT` to an `ERRO` for when a block is published late. The block in question was indeed published late on the network, however now that we have builders that can slow down block production I don't think the error is "actionable" enough to warrant a `CRIT` for the user.
## Additional Info
NA
## Issue Addressed
#3766
## Proposed Changes
Adds an endpoint to get the graffiti that will be used for the next block proposal for each validator.
## Usage
```bash
curl -H "Authorization: Bearer api-token" http://localhost:9095/lighthouse/ui/graffiti | jq
```
```json
{
"data": {
"0x81283b7a20e1ca460ebd9bbd77005d557370cabb1f9a44f530c4c4c66230f675f8df8b4c2818851aa7d77a80ca5a4a5e": "mr f was here",
"0xa3a32b0f8b4ddb83f1a0a853d81dd725dfe577d4f4c3db8ece52ce2b026eca84815c1a7e8e92a4de3d755733bf7e4a9b": "mr v was here",
"0x872c61b4a7f8510ec809e5b023f5fdda2105d024c470ddbbeca4bc74e8280af0d178d749853e8f6a841083ac1b4db98f": null
}
}
```
## Additional Info
This will only return graffiti that the validator client knows about.
That is from these 3 sources:
1. Graffiti File
2. validator_definitions.yml
3. The `--graffiti` flag on the VC
If the graffiti is set on the BN, it will not be returned. This may warrant an additional endpoint on the BN side which can be used in the event the endpoint returns `null`.
## Proposed Changes
Adds docs for the following endpoints:
- `/lighthouse/analysis/attestation_performance`
- `/lighthouse/analysis/block_packing_efficiency`
## Issue Addressed
#3724
## Proposed Changes
Exposes certain `validator_monitor` as an endpoint on the HTTP API. Will only return metrics for validators which are actively being monitored.
### Usage
```bash
curl -X GET "http://localhost:5052/lighthouse/ui/validator_metrics" -H "accept: application/json" | jq
```
```json
{
"data": {
"validators": {
"12345": {
"attestation_hits": 10,
"attestation_misses": 0,
"attestation_hit_percentage": 100,
"attestation_head_hits": 10,
"attestation_head_misses": 0,
"attestation_head_hit_percentage": 100,
"attestation_target_hits": 5,
"attestation_target_misses": 5,
"attestation_target_hit_percentage": 50
}
}
}
}
```
## Additional Info
Based on #3756 which should be merged first.
## Proposed Changes
Now that the Gnosis merge is scheduled, all users should have upgraded beyond Lighthouse v3.0.0. Accordingly we can delete schema migrations for versions prior to v3.0.0.
## Additional Info
I also deleted the state cache stuff I added in #3714 as it turned out to be useless for the light client proofs due to the one-slot offset.
## Issue Addressed
#3724
## Proposed Changes
Adds an endpoint to quickly count the number of occurances of each status in the validator set.
## Usage
```bash
curl -X GET "http://localhost:5052/lighthouse/ui/validator_count" -H "accept: application/json" | jq
```
```json
{
"data": {
"active_ongoing":479508,
"active_exiting":0,
"active_slashed":0,
"pending_initialized":28,
"pending_queued":0,
"withdrawal_possible":933,
"withdrawal_done":0,
"exited_unslashed":0,
"exited_slashed":3
}
}
```
## Issue Addressed
Closes https://github.com/sigp/lighthouse/issues/2327
## Proposed Changes
This is an extension of some ideas I implemented while working on `tree-states`:
- Cache the indexed attestations from blocks in the `ConsensusContext`. Previously we were re-computing them 3-4 times over.
- Clean up `import_block` by splitting each part into `import_block_XXX`.
- Move some stuff off hot paths, specifically:
- Relocate non-essential tasks that were running between receiving the payload verification status and priming the early attester cache. These tasks are moved after the cache priming:
- Attestation observation
- Validator monitor updates
- Slasher updates
- Updating the shuffling cache
- Fork choice attestation observation now happens at the end of block verification in parallel with payload verification (this seems to save 5-10ms).
- Payload verification now happens _before_ advancing the pre-state and writing it to disk! States were previously being written eagerly and adding ~20-30ms in front of verifying the execution payload. State catchup also sometimes takes ~500ms if we get a cache miss and need to rebuild the tree hash cache.
The remaining task that's taking substantial time (~20ms) is importing the block to fork choice. I _think_ this is because of pull-tips, and we should be able to optimise it out with a clever total active balance cache in the state (which would be computed in parallel with payload verification). I've decided to leave that for future work though. For now it can be observed via the new `beacon_block_processing_post_exec_pre_attestable_seconds` metric.
Co-authored-by: Michael Sproul <micsproul@gmail.com>
## Issue Addressed
our bootnodes as of now support only ipv4. this makes it so that they support ipv6
## Proposed Changes
- Adds code necessary to update the bootnodes to run on dual stack nodes and therefore contact and store ipv6 nodes.
- Adds some metrics about connectivity type of stored peers. It might have been nice to see some metrics over the sessions but that feels out of scope right now.
## Additional Info
- some code quality improvements sneaked in since the changes seemed small
- I think it depends on the OS, but enabling mapped addresses on an ipv6 node without dual stack support enabled could fail silently, making these nodes effectively ipv6 only. In the future I'll probably change this to use two sockets, which should fail loudly
## Issue Addressed
#3704
## Proposed Changes
Adds is_syncing_finalized: bool parameter for block verification functions. Sets the payload_verification_status to Optimistic if is_syncing_finalized is true. Uses SyncState in NetworkGlobals in BeaconProcessor to retrieve the syncing status.
## Additional Info
I could implement FinalizedSignatureVerifiedBlock if you think it would be nicer.