lighthouse

mirror of https://github.com/sigp/lighthouse.git synced 2026-04-21 06:48:27 +00:00

Author	SHA1	Message	Date
Michael Sproul	acd49d988d	Implement database temp states to reduce memory usage (#1798 ) ## Issue Addressed Closes #800 Closes #1713 ## Proposed Changes Implement the temporary state storage algorithm described in #800. Specifically: * Add `DBColumn::BeaconStateTemporary`, for storing 0-length temporary marker values. * Store intermediate states immediately as they are created, marked temporary. Delete the temporary flag if the block is processed successfully. * Add a garbage collection process to delete leftover temporary states on start-up. * Bump the database schema version to 2 so that a DB with temporary states can't accidentally be used with older versions of the software. The auto-migration is a no-op, but puts in place some infra that we can use for future migrations (e.g. #1784) ## Additional Info There are two known race conditions, one potentially causing permanent faults (hopefully rare), and the other insignificant. ### Race 1: Permanent state marked temporary EDIT: this has been fixed by the addition of a lock around the relevant critical section There are 2 threads that are trying to store 2 different blocks that share some intermediate states (e.g. they both skip some slots from the current head). Consider this sequence of events: 1. Thread 1 checks if state `s` already exists, and seeing that it doesn't, prepares an atomic commit of `(s, s_temporary_flag)`. 2. Thread 2 does the same, but also gets as far as committing the state txn, finishing the processing of its block, and _deleting_ the temporary flag. 3. Thread 1 is (finally) scheduled again, and marks `s` as temporary with its transaction. 4. a) The process is killed, or thread 1's block fails verification and the temp flag is not deleted. This is a permanent failure! Any attempt to load state `s` will fail... hope it isn't on the main chain! Alternatively (4b) happens... b) Thread 1 finishes, and re-deletes the temporary flag. In this case the failure is transient, state `s` will disappear temporarily, but will come back once thread 1 finishes running. I _hope_ that steps 1-3 only happen very rarely, and 4a even more rarely. It's hard to know This once again begs the question of why we're using LevelDB (#483), when it clearly doesn't care about atomicity! A ham-fisted fix would be to wrap the hot and cold DBs in locks, which would bring us closer to how other DBs handle read-write transactions. E.g. [LMDB only allows one R/W transaction at a time](https://docs.rs/lmdb/0.8.0/lmdb/struct.Environment.html#method.begin_rw_txn). ### Race 2: Temporary state returned from `get_state` I don't think this race really matters, but in `load_hot_state`, if another thread stores a state between when we call `load_state_temporary_flag` and when we call `load_hot_state_summary`, then we could end up returning that state even though it's only a temporary state. I can't think of any case where this would be relevant, and I suspect if it did come up, it would be safe/recoverable (having data is safer than _not_ having data). This could be fixed by using a LevelDB read snapshot, but that would require substantial changes to how we read all our values, so I don't think it's worth it right now.	2020-10-23 01:27:51 +00:00
blacktemplar	23a8f31f83	Fix clippy warnings (#1385 ) ## Issue Addressed NA ## Proposed Changes Fixes most clippy warnings and ignores the rest of them, see issue #1388.	2020-07-23 14:18:00 +00:00
Adam Szkoda	c7f47af9fb	Harden the freezing procedure against failures (#1323 ) * Enable logging in tests * Migrate states to the freezer atomically	2020-07-03 09:47:31 +10:00
Adam Szkoda	536728b975	Write new blocks and states to the database atomically (#1285 ) * Mostly atomic put_state() * Reduce number of vec allocations * Make crucial db operations atomic * Save restore points * Remove StateBatch * Merge two HotColdDB impls * Further reduce allocations * Review feedback * Silence clippy warning	2020-07-01 12:45:57 +10:00
Adam Szkoda	9db0c28051	Make key value storage abstractions more accurate (#1267 ) * Layer do_atomically() abstractions properly * Reduce allocs and DRY get_key_for_col() * Parameterize HotColdDB with hot and cold item stores * -impl Store for MemoryStore * Replace Store uses with HotColdDB * Ditch Store trait * cargo fmt * Style fix * Readd missing dep that broke the build	2020-06-16 11:34:04 +10:00
Adam Szkoda	91cb14ac41	Clean up database abstractions (#1200 ) * Remove redundant method * Pull out a method out of a struct * More precise db access abstractions * Move fake trait method out of it * cargo fmt * Fix compilation error after refactoring * Move another fake method out the Store trait * Get rid of superfluous method * Fix refactoring bug * Rename: SimpleStoreItem -> StoreItem * Get rid of the confusing DiskStore type alias * Get rid of SimpleDiskStore type alias * Correction: A method took both self and a ref to Self	2020-06-01 08:13:49 +10:00
Adam Szkoda	59ead67f76	Race condition fix + Reliability improvements around forks pruning (#1132 ) * Improve error handling in block iteration * Introduce atomic DB operations * Fix race condition An invariant was violated: For every block hash in head_tracker, that block is accessible from the store.	2020-05-16 13:23:32 +10:00
Paul Hauner	2fb6b7c793	Add no-copy block processing cache (#863 ) * Add state cache, remove store cache * Only build the head committee cache * Fix compile error * Fix compile error from merge * Rename state_cache -> checkpoint_cache * Rename Checkpoint -> Snapshot * Tidy, add comments * Tidy up find_head function * Change some checkpoint -> snapshot * Add tests * Expose max_len * Remove dead code * Tidy * Fix bug	2020-04-06 10:53:33 +10:00
Michael Sproul	e0b9fa599f	Add LRU cache to database (#837 ) * Add LRU caches to store * Improvements to LRU caches * Take state by value in `Store::put_state` * Store blocks by value, configurable cache sizes * Use a StateBatch to efficiently store skip states * Fix store tests * Add CloneConfig test, remove unused metrics * Use Mutexes instead of RwLocks for LRU caches	2020-02-10 11:30:21 +11:00
Paul Hauner	10a134792b	Testnet2 (#685 ) * Apply clippy lints to beacon node * Remove unnecessary logging and correct formatting * Initial bones of load-balanced range-sync * Port bump meshsup tests * Further structure and network handling logic added * Basic structure, ignoring error handling * Correct max peers delay bug * Clean up and re-write message processor and sync manager * Restructure directory, correct type issues * Fix compiler issues * Completed first testing of new sync * Correct merge issues * Clean up warnings * Push attestation processed log down to dbg * Add state enc/dec benches * Correct math error, downgraded logs * Add example for flamegraph * Use `PublicKeyBytes` for `Validator` * Ripple PublicKeyBytes change through codebase * Add RPC error handling and improved syncing code * Add benches, optimizations to store BeaconState * Store BeaconState in StorageContainer too * Optimize StorageContainer with std::mem magic * Add libp2p stream error handling and dropping of invalid peers * Lower logs * Update lcli to parse spec at boot, remove pycli * Fix issues when starting with mainnet spec * Set default spec to mainnet * Fix lcli --spec param * Add discovery tweak * Ensure ETH1_FOLLOW_DISTANCE is in YamlConfig * Set testnet ETH1_FOLLOW_DISTANCE to 16 * Fix rest_api tests * Set testnet min validator count * Update with new testnet dir * Remove some dbg, println * Add timeout when notifier waits for libp2p lock * Add validator count CLI flag to lcli contract deploy * Extend genesis delay time * Correct libp2p service locking * Update testnet dir * Add basic block/state caching on beacon chain * Add decimals display to notifier sync speed * Try merge in change to reduce fork choice calls * Remove fork choice from process block * Minor log fix * Check successes > 0 * Adds checkpoint cache * Stop storing the tree hash cache in the db * Handles peer disconnects for sync * Fix failing beacon chain tests * Change eth2_testnet_config tests to Mainnet * Add logs downgrade discovery log * Remove dedunant beacon state write * Fix re-org warnings * Use caching get methods in fork choice * Fix mistake in prev commit * Use caching state getting in state_by_slot * Add state.cacheless_clone * Less fork choice (#679) * Try merge in change to reduce fork choice calls * Remove fork choice from process block * Minor log fix * Check successes > 0 * Fix failing beacon chain tests * Fix re-org warnings * Fix mistake in prev commit * Attempt to improve attestation processing times * Introduce HeadInfo struct * Used cache tree hash for block processing * Use cached tree hash for block production too * Range sync refactor - Introduces `ChainCollection` - Correct Disconnect node handling - Removes duplicate code * Add more logging for DB * Various bug fixes * Remove unnecessary logs * Maintain syncing state in the transition from finalied to head * Improved disconnect handling * Add `Speedo` struct * Fix bugs in speedo * Fix bug in speedo * Fix rounding bug in speedo * Move code around, reduce speedo observation count * Adds forwards block interator * Fix inf NaN * Add first draft of validator onboarding * Update docs * Add documentation link to main README * Continue docs development * Update book readme * Update docs * Allow vc to run without testnet subcommand * Small change to onboarding docs * Tidy CLI help messages * Update docs * Add check to val client see if beacon node is synced * Attempt to fix NaN bug * Fix compile bug * Add notifier service to validator client * Re-order onboarding steps * Update deposit contract address * Update testnet dir * Add note about public eth1 node * Fix installation link * Set default eth1 endpoint to sigp * Fix broken test * Try fix eth1 cache locking * Be more specific about eth1 endpoint * Increase gas limit for deposit * Fix default deposit amount * Fix re-org log	2019-12-09 23:14:13 +11:00
Michael Sproul	bd1b61a5b1	Forwards block root iterators (#672 ) * Implement forwards block root iterators * Clean up errors and docs	2019-12-06 18:52:11 +11:00
Michael Sproul	bf2eeae3f2	Implement freezer database (#508 ) * Implement freezer database for state vectors * Improve BeaconState safe accessors And fix a bug in the compact committees accessor. * Banish dodgy type bounds back to gRPC * Clean up * Switch to exclusive end points in chunked vec * Cleaning up and start of tests * Randao fix, more tests * Fix unsightly hack * Resolve test FIXMEs * Config file support * More clean-ups, migrator beginnings * Finish migrator, integrate into BeaconChain * Fixups * Fix store tests * Fix BeaconChain tests * Fix LMD GHOST tests * Address review comments, delete 'static bounds * Cargo format * Address review comments * Fix LMD ghost tests * Update to spec v0.9.0 * Update to v0.9.1 * Bump spec tags for v0.9.1 * Formatting, fix CI failures * Resolve accidental KeyPair merge conflict * Document new BeaconState functions * Fix incorrect cache drops in `advance_caches` * Update fork choice for v0.9.1 * Clean up some FIXMEs * Fix a few docs/logs * Update for new builder paradigm, spec changes * Freezer DB integration into BeaconNode * Cleaning up * This works, clean it up * Cleanups * Fix and improve store tests * Refine store test * Delete unused beacon_chain_builder.rs * Fix CLI * Store state at split slot in hot database * Make fork choice lookup fast again * Store freezer DB split slot in the database * Handle potential div by 0 in chunked_vector * Exclude committee caches from freezer DB * Remove FIXME about long-running test	2019-11-27 10:54:46 +11:00
Paul Hauner	c4ced3e0d2	Fix block processing blowup, upgrade metrics (#500 ) * Renamed fork_choice::process_attestation_from_block * Processing attestation in fork choice * Retrieving state from store and checking signature * Looser check on beacon state validity. * Cleaned up get_attestation_state * Expanded fork choice api to provide latest validator message. * Checking if the an attestation contains a latest message * Correct process_attestation error handling. * Copy paste error in comment fixed. * Tidy ancestor iterators * Getting attestation slot via helper method * Refactored attestation creation in test utils * Revert "Refactored attestation creation in test utils" This reverts commit 4d277fe4239a7194758b18fb5c00dfe0b8231306. * Integration tests for free attestation processing * Implicit conflicts resolved. * formatting * Do first pass on Grants code * Add another attestation processing test * Tidy attestation processing * Remove old code fragment * Add non-compiling half finished changes * Simplify, fix bugs, add tests for chain iters * Remove attestation processing from op pool * Fix bug with fork choice, tidy * Fix overly restrictive check in fork choice. * Ensure committee cache is build during attn proc * Ignore unknown blocks at fork choice * Various minor fixes * Make fork choice write lock in to read lock * Remove unused method * Tidy comments * Fix attestation prod. target roots change * Fix compile error in store iters * Reject any attestation prior to finalization * Begin metrics refactor * Move beacon_chain to new metrics structure. * Make metrics not panic if already defined * Use global prometheus gather at rest api * Unify common metric fns into a crate * Add heavy metering to block processing * Remove hypen from prometheus metric name * Add more beacon chain metrics * Add beacon chain persistence metric * Prune op pool on finalization * Add extra prom beacon chain metrics * Prefix BeaconChain metrics with "beacon_" * Add more store metrics * Add basic metrics to libp2p * Add metrics to HTTP server * Remove old `http_server` crate * Update metrics names to be more like standard * Fix broken beacon chain metrics, add slot clock metrics * Add lighthouse_metrics gather fn * Remove http args * Fix wrong state given to op pool prune * Make prom metric names more consistent * Add more metrics, tidy existing metrics * Fix store block read metrics * Tidy attestation metrics * Fix minor PR comments * Allow travis failures on beta (see desc) There's a non-backward compatible change in `cargo fmt`. Stable and beta do not agree. * Tidy `lighthouse_metrics` docs * Fix typo	2019-08-19 21:02:34 +10:00
Paul Hauner	fd6766c268	Tidy beacon node runtime code	2019-06-08 09:46:04 -04:00
Paul Hauner	c840b76cac	Tidy `store` crate, add comments	2019-05-21 18:49:24 +10:00
Paul Hauner	3bcf5ba706	Rename `db` crate to `store`	2019-05-21 18:20:23 +10:00

16 Commits