From 29474f98937dfaecdcb1f71ff91491839c4e9e05 Mon Sep 17 00:00:00 2001
From: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>
Date: Wed, 25 May 2022 05:47:21 +0200
Subject: [PATCH] Document benchmarking CLI (#11246)

* Decrease default repeats

Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>

* Add benchmarking READMEs

Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>

* Update docs

Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>

* Update docs

Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>

* Update README

Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>

* Review fixes

Co-authored-by: Shawn Tabrizi <shawntabrizi@gmail.com>

Co-authored-by: parity-processbot <>
Co-authored-by: Shawn Tabrizi <shawntabrizi@gmail.com>
---
 substrate/frame/benchmarking/README.md        |  14 +-
 .../utils/frame/benchmarking-cli/README.md    |  47 +++++-
 .../benchmarking-cli/src/block/README.md      | 118 +++++++++++++++
 .../benchmarking-cli/src/machine/README.md    |  71 +++++++++
 .../benchmarking-cli/src/overhead/README.md   | 136 ++++++++++++++++++
 .../benchmarking-cli/src/overhead/bench.rs    |   4 +-
 .../benchmarking-cli/src/pallet/README.md     |   3 +
 .../benchmarking-cli/src/shared/README.md     |  15 ++
 .../benchmarking-cli/src/storage/README.md    | 105 ++++++++++++++
 9 files changed, 505 insertions(+), 8 deletions(-)
 create mode 100644 substrate/utils/frame/benchmarking-cli/src/block/README.md
 create mode 100644 substrate/utils/frame/benchmarking-cli/src/machine/README.md
 create mode 100644 substrate/utils/frame/benchmarking-cli/src/overhead/README.md
 create mode 100644 substrate/utils/frame/benchmarking-cli/src/pallet/README.md
 create mode 100644 substrate/utils/frame/benchmarking-cli/src/shared/README.md
 create mode 100644 substrate/utils/frame/benchmarking-cli/src/storage/README.md

diff --git a/substrate/frame/benchmarking/README.md b/substrate/frame/benchmarking/README.md
index 38c683cb8db..f0fe05cc140 100644
--- a/substrate/frame/benchmarking/README.md
+++ b/substrate/frame/benchmarking/README.md
@@ -43,7 +43,7 @@ The benchmarking framework comes with the following tools:
 * [A set of macros](./src/lib.rs) (`benchmarks!`, `add_benchmark!`, etc...) to make it easy to
   write, test, and add runtime benchmarks.
 * [A set of linear regression analysis functions](./src/analysis.rs) for processing benchmark data.
-* [A CLI extension](../../utils/frame/benchmarking-cli/) to make it easy to execute benchmarks on your
+* [A CLI extension](../../utils/frame/benchmarking-cli/README.md) to make it easy to execute benchmarks on your
   node.
 
 The end-to-end benchmarking pipeline is disabled by default when compiling a node. If you want to
@@ -150,9 +150,13 @@ feature flag:
 
 ```bash
 cd bin/node/cli
-cargo build --release --features runtime-benchmarks
+cargo build --profile=production --features runtime-benchmarks
 ```
 
+The production profile applies various compiler optimizations.  
+These optimizations slow down the compilation process *a lot*.  
+If you are just testing things out and don't need final numbers, don't include `--profile=production`.
+
 ## Running Benchmarks
 
 Finally, once you have a node binary with benchmarks enabled, you need to execute your various
@@ -161,13 +165,13 @@ benchmarks.
 You can get a list of the available benchmarks by running:
 
 ```bash
-./target/release/substrate benchmark --chain dev --pallet "*" --extrinsic "*" --repeat 0
+./target/production/substrate benchmark pallet --chain dev --pallet "*" --extrinsic "*" --repeat 0
 ```
 
 Then you can run a benchmark like so:
 
 ```bash
-./target/release/substrate benchmark \
+./target/production/substrate benchmark pallet \
     --chain dev \                  # Configurable Chain Spec
     --execution=wasm \             # Always test with Wasm
     --wasm-execution=compiled \    # Always used `wasm-time`
@@ -200,7 +204,7 @@ used for joining all the arguments passed to the CLI.
 To get a full list of available options when running benchmarks, run:
 
 ```bash
-./target/release/substrate benchmark --help
+./target/production/substrate benchmark --help
 ```
 
 License: Apache-2.0
diff --git a/substrate/utils/frame/benchmarking-cli/README.md b/substrate/utils/frame/benchmarking-cli/README.md
index 9718db58b37..e6a48b61fd2 100644
--- a/substrate/utils/frame/benchmarking-cli/README.md
+++ b/substrate/utils/frame/benchmarking-cli/README.md
@@ -1 +1,46 @@
-License: Apache-2.0
\ No newline at end of file
+# The Benchmarking CLI
+
+This crate contains commands to benchmark various aspects of Substrate and the hardware.  
+All commands are exposed by the Substrate node but can be exposed by any Substrate client.  
+The goal is to have a comprehensive suite of benchmarks that covers all aspects of Substrate and the hardware that it is running on.
+
+Invoking the root benchmark command prints a help menu:  
+```sh
+$ cargo run --profile=production -- benchmark
+
+Sub-commands concerned with benchmarking.
+
+USAGE:
+    substrate benchmark <SUBCOMMAND>
+
+OPTIONS:
+    -h, --help       Print help information
+    -V, --version    Print version information
+
+SUBCOMMANDS:
+    block       Benchmark the execution time of historic blocks
+    machine     Command to benchmark the hardware.
+    overhead    Benchmark the execution overhead per-block and per-extrinsic
+    pallet      Benchmark the extrinsic weight of FRAME Pallets
+    storage     Benchmark the storage speed of a chain snapshot
+```
+
+All examples use the `production` profile for correctness, which makes compilation *very* slow; for testing you can use `--release`.  
+For the final results the `production` profile and reference hardware should be used, otherwise the results are not comparable.
+
+The sub-commands are explained in depth here:  
+- [block] Compare the weight of a historic block to its actual resource usage
+- [machine] Gauges the speed of the hardware
+- [overhead] Creates weight files for the *Block*- and *Extrinsic*-base weights
+- [pallet] Creates weight files for a Pallet
+- [storage] Creates weight files for *Read* and *Write* storage operations
+
+License: Apache-2.0
+
+<!-- LINKS -->
+
+[pallet]: ../../../frame/benchmarking/README.md
+[machine]: src/machine/README.md
+[storage]: src/storage/README.md
+[overhead]: src/overhead/README.md
+[block]: src/block/README.md
diff --git a/substrate/utils/frame/benchmarking-cli/src/block/README.md b/substrate/utils/frame/benchmarking-cli/src/block/README.md
new file mode 100644
index 00000000000..7e99f0df9d4
--- /dev/null
+++ b/substrate/utils/frame/benchmarking-cli/src/block/README.md
@@ -0,0 +1,118 @@
+# The `benchmark block` command
+
+The whole benchmarking process in Substrate aims to predict the resource usage of an unexecuted block.  
+This command measures how accurate this prediction was by executing a block and comparing the predicted weight to its actual resource usage.  
+It can be used to measure the accuracy of the pallet benchmarking.
+
+The following sections explain the procedure once for Polkadot and once for Substrate.  
+
+## Polkadot # 1
+<sup>(Also works for Kusama, Westend and Rococo)</sup>
+
+
+Suppose you either have a synced Polkadot node or downloaded a snapshot from [Polkachu].  
+This example uses a pruned ParityDB snapshot from the 2022-4-19 with the last block being 9939462.  
+For pruned snapshots you need to know the number of the last block (to be improved [here]).    
+Pruned snapshots normally store the last 256 blocks; archive nodes can use any block range.  
+
+In this example we will benchmark just the last 10 blocks:  
+```sh
+cargo run --profile=production -- benchmark block --from 9939453 --to 9939462 --db paritydb
+```
+
+Output:
+```pre
+Block 9939453 with     2 tx used   4.57% of its weight (    26,458,801 of    579,047,053 ns)    
+Block 9939454 with     3 tx used   4.80% of its weight (    28,335,826 of    590,414,831 ns)    
+Block 9939455 with     2 tx used   4.76% of its weight (    27,889,567 of    586,484,595 ns)    
+Block 9939456 with     2 tx used   4.65% of its weight (    27,101,306 of    582,789,723 ns)    
+Block 9939457 with     2 tx used   4.62% of its weight (    26,908,882 of    582,789,723 ns)    
+Block 9939458 with     2 tx used   4.78% of its weight (    28,211,440 of    590,179,467 ns)    
+Block 9939459 with     4 tx used   4.78% of its weight (    27,866,077 of    583,260,451 ns)    
+Block 9939460 with     3 tx used   4.72% of its weight (    27,845,836 of    590,462,629 ns)    
+Block 9939461 with     2 tx used   4.58% of its weight (    26,685,119 of    582,789,723 ns)    
+Block 9939462 with     2 tx used   4.60% of its weight (    26,840,938 of    583,697,101 ns)    
+```
+
+### Output Interpretation
+
+<sup>(Only results from reference hardware are relevant)</sup>
+
+Each block is executed multiple times and the results are averaged.  
+The percent number is the interesting part and indicates how much weight was used as compared to how much was predicted.  
+The closer to 100% this is without exceeding 100%, the better.  
+If it exceeds 100%, the block is marked with "**OVER WEIGHT!**" to make it easier to spot. This is bad, since it means that the benchmarking under-estimated the weight.  
+This would mean that an honest validator would possibly not be able to keep up with importing blocks since users did not pay for enough weight.  
+If that happens the validator could lag behind the chain and get slashed for missing deadlines.  
+It is therefore important to investigate any overweight blocks.  
+
+In this example you can see an unexpected result: less than 5% of the weight was used!  
+The measured blocks can be executed much faster than predicted.  
+This means that the benchmarking process massively over-estimated the execution time.  
+Since the estimates are off by so much, this is tracked as an issue in [polkadot#5192].  
+
+The ideal range for these results would be 85-100%.
+
+## Polkadot # 2
+
+Let's take a more interesting example where the blocks use more of their predicted weight.  
+Every day when validators pay out rewards, the blocks are nearly full.  
+Using an archive node here is the easiest.  
+
+The Polkadot blocks TODO-TODO for example contain large batch transactions for staking payout.  
+
+```sh
+cargo run --profile=production -- benchmark block --from TODO --to TODO --db paritydb
+```
+
+```pre
+TODO
+```
+
+## Substrate
+
+It is also possible to try the procedure in Substrate, although it's a bit boring.  
+
+First you need to create some blocks with either a local or dev chain.  
+This example will use the standard development spec.  
+Pick a non-existent directory where the chain data will be stored, e.g. `/tmp/dev`.
+```sh
+cargo run --profile=production -- --dev -d /tmp/dev
+```
+You should see after some seconds that it started to produce blocks:  
+```pre
+…
+✨ Imported #1 (0x801d…9189)
+…
+```
+You can now kill the node with `Ctrl+C`. Then measure how long it takes to execute these blocks:  
+```sh
+cargo run --profile=production -- benchmark block --from 1 --to 1 --dev -d /tmp/dev --pruning archive
+```
+This will benchmark the first block. If you killed the node at a later point, you can measure multiple blocks.
+```pre
+Block 1 with     1 tx used  72.04% of its weight (     4,945,664 of      6,864,702 ns)
+```
+
+In this example the block used ~72% of its weight.  
+The benchmarking therefore over-estimated the effort to execute the block.  
+Since this block is empty, it's not very interesting.
+
+## Arguments
+
+- `--from` Number of the first block to measure (inclusive).
+- `--to` Number of the last block to measure (inclusive).
+- `--repeat` How often each block should be measured.
+- [`--db`]
+- [`--pruning`]
+
+License: Apache-2.0
+
+<!-- LINKS -->
+
+[Polkachu]: https://polkachu.com/snapshots
+[here]: https://github.com/paritytech/substrate/issues/11141
+[polkadot#5192]: https://github.com/paritytech/polkadot/issues/5192
+
+[`--db`]: ../shared/README.md#arguments
+[`--pruning`]: ../shared/README.md#arguments
diff --git a/substrate/utils/frame/benchmarking-cli/src/machine/README.md b/substrate/utils/frame/benchmarking-cli/src/machine/README.md
new file mode 100644
index 00000000000..f22a8ea54b8
--- /dev/null
+++ b/substrate/utils/frame/benchmarking-cli/src/machine/README.md
@@ -0,0 +1,71 @@
+# The `benchmark machine` command
+
+Different Substrate chains can have different hardware requirements.  
+It is therefore important to be able to quickly gauge if a piece of hardware fits a chain's requirements.  
+The `benchmark machine` command achieves this by measuring key metrics and making them comparable.  
+
+Invoking the command looks like this:  
+```sh
+cargo run --profile=production -- benchmark machine --dev
+```
+
+## Output
+
+The output on reference hardware:  
+
+```pre
++----------+----------------+---------------+--------------+-------------------+
+| Category | Function       | Score         | Minimum      | Result            |
++----------+----------------+---------------+--------------+-------------------+
+| CPU      | BLAKE2-256     | 1023.00 MiB/s | 1.00 GiB/s   | ✅ Pass ( 99.4 %) |
++----------+----------------+---------------+--------------+-------------------+
+| CPU      | SR25519-Verify | 665.13 KiB/s  | 666.00 KiB/s | ✅ Pass ( 99.9 %) |
++----------+----------------+---------------+--------------+-------------------+
+| Memory   | Copy           | 14.39 GiB/s   | 14.32 GiB/s  | ✅ Pass (100.4 %) |
++----------+----------------+---------------+--------------+-------------------+
+| Disk     | Seq Write      | 457.00 MiB/s  | 450.00 MiB/s | ✅ Pass (101.6 %) |
++----------+----------------+---------------+--------------+-------------------+
+| Disk     | Rnd Write      | 190.00 MiB/s  | 200.00 MiB/s | ✅ Pass ( 95.0 %) |
++----------+----------------+---------------+--------------+-------------------+
+```
+
+The *score* is the average result of each benchmark. It always adheres to "higher is better".  
+
+The *category* indicates which part of the hardware was benchmarked:  
+- **CPU** Processor intensive task
+- **Memory** RAM intensive task
+- **Disk** Hard drive intensive task
+
+The *function* is the concrete benchmark that was run:  
+- **BLAKE2-256** The throughput of the [Blake2-256] cryptographic hashing function with 32 KiB input. The [blake2_256 function] is used in many places in Substrate. The throughput of a hash function strongly depends on the input size, therefore we settled to use a fixed input size for comparable results.
+- **SR25519-Verify** Sr25519 is a Schnorr signature scheme based on [Curve25519]. Signature verification is used by Substrate when verifying extrinsics and blocks.
+- **Copy** The throughput of copying memory from one place in the RAM to another.
+- **Seq Write** The throughput of writing data to the storage location sequentially. It is important that the same disk is used that will later-on be used to store the chain data.
+- **Rnd Write** The throughput of writing data to the storage location in a random order. This is normally much slower than the sequential write.
+
+The *score* needs to reach the *minimum* in order to pass the benchmark. The required *minimum* can be reduced with the `--tolerance` flag.
+
+The *result* indicates whether a specific benchmark was passed by the machine or not. The percent number is the score relative to the *minimum* that is needed. The `--tolerance` flag is taken into account for this decision. For example, a benchmark that passes with 95% because the *tolerance* was set to 10% would look like this: `✅ Pass ( 95.0 %)`.
+
+## Interpretation
+
+Ideally all results show a `Pass` and the program exits with code 0. Currently some of the benchmarks can fail even on reference hardware; they are still being improved to make them more deterministic.  
+Make sure to run nothing else on the machine when benchmarking it.  
+You can re-run them multiple times to get more reliable results.
+
+## Arguments
+
+- `--tolerance` A percent number to reduce the *minimum* requirement. This should be used to ignore outliers of the benchmarks. The default value is 10%.
+- `--verify-duration` How long the verification benchmark should run.
+- `--disk-duration` How long the *read* and *write* benchmarks should run each.
+- `--allow-fail` Always exit the program with code 0.
+- `--chain` / `--dev` Specify the chain config to use. This will be used to compare the results with the requirements of the chain (WIP).
+- [`--base-path`]
+
+License: Apache-2.0
+
+<!-- LINKS -->
+[Blake2-256]: https://www.blake2.net/
+[blake2_256 function]: https://crates.parity.io/sp_core/hashing/fn.blake2_256.html
+[Curve25519]: https://en.wikipedia.org/wiki/Curve25519
+[`--base-path`]: ../shared/README.md#arguments
diff --git a/substrate/utils/frame/benchmarking-cli/src/overhead/README.md b/substrate/utils/frame/benchmarking-cli/src/overhead/README.md
new file mode 100644
index 00000000000..6f41e881d05
--- /dev/null
+++ b/substrate/utils/frame/benchmarking-cli/src/overhead/README.md
@@ -0,0 +1,136 @@
+# The `benchmark overhead` command
+
+Each time an extrinsic or a block is executed, a fixed weight is charged as "execution overhead".  
+This is necessary since the weight that is calculated by the pallet benchmarks does not include this overhead.  
+The exact overhead can vary per Substrate chain and needs to be calculated per chain.  
+This command calculates the exact values of these overhead weights for any Substrate chain that supports it.
+
+## How does it work?
+
+The benchmark consists of two parts; the [`BlockExecutionWeight`] and the [`ExtrinsicBaseWeight`].  
+Both are executed sequentially when invoking the command.
+
+## BlockExecutionWeight
+
+The block execution weight is defined as the weight that it takes to execute an *empty block*.  
+It is measured by constructing an empty block and measuring its execution time.  
+The result is written to a `block_weights.rs` file which is created from a template.  
+The file will contain the concrete weight value and various statistics about the measurements. For example:  
+```rust
+/// Time to execute an empty block.
+/// Calculated by multiplying the *Average* with `1` and adding `0`.
+///
+/// Stats [NS]:
+///   Min, Max: 3_508_416, 3_680_498
+///   Average:  3_532_484
+///   Median:   3_522_111
+///   Std-Dev:  27070.23
+///
+/// Percentiles [NS]:
+///   99th: 3_631_863
+///   95th: 3_595_674
+///   75th: 3_526_435
+pub const BlockExecutionWeight: Weight = 3_532_484 * WEIGHT_PER_NANOS;
+```
+
+In this example it takes 3.5 ms to execute an empty block. That means that it always takes at least 3.5 ms to execute *any* block.  
+This constant weight is therefore added to each block to ensure that Substrate budgets enough time to execute it.
+
+## ExtrinsicBaseWeight
+
+The extrinsic base weight is defined as the weight that it takes to execute an *empty* extrinsic.  
+An *empty* extrinsic is also called a *NO-OP*. It does nothing and is the equivalent of the empty block from above.  
+The benchmark now constructs a block which is filled with only NO-OP extrinsics.
+This block is then executed many times and the weights are measured.  
+The result is divided by the number of extrinsics in that block and the results are written to `extrinsic_weights.rs`.  
+
+The relevant section in the output file looks like this:  
+```rust
+/// Time to execute a NO-OP extrinsic, for example `System::remark`.
+/// Calculated by multiplying the *Average* with `1` and adding `0`.
+///
+/// Stats [NS]:
+///   Min, Max: 67_561, 69_855
+///   Average:  67_745
+///   Median:   67_701
+///   Std-Dev:  264.68
+///
+/// Percentiles [NS]:
+///   99th: 68_758
+///   95th: 67_843
+///   75th: 67_749
+pub const ExtrinsicBaseWeight: Weight = 67_745 * WEIGHT_PER_NANOS;
+```
+
+In this example it takes 67.7 µs to execute a NO-OP extrinsic. That means that it always takes at least 67.7 µs to execute *any* extrinsic.  
+This constant weight is therefore added to each extrinsic to ensure that Substrate budgets enough time to execute it.
+
+## Invocation
+
+The base command looks like this (for debugging you can use `--release`):
+```sh
+cargo run --profile=production -- benchmark overhead --dev
+```
+
+Output:
+```pre
+# BlockExecutionWeight
+Running 10 warmups...
+Executing block 100 times    
+Per-block execution overhead [ns]:
+Total: 353248430
+Min: 3508416, Max: 3680498
+Average: 3532484, Median: 3522111, Stddev: 27070.23
+Percentiles 99th, 95th, 75th: 3631863, 3595674, 3526435    
+Writing weights to "block_weights.rs"
+
+# Setup
+Building block, this takes some time...    
+Extrinsics per block: 12000
+
+# ExtrinsicBaseWeight
+Running 10 warmups...
+Executing block 100 times    
+Per-extrinsic execution overhead [ns]:
+Total: 6774590
+Min: 67561, Max: 69855
+Average: 67745, Median: 67701, Stddev: 264.68
+Percentiles 99th, 95th, 75th: 68758, 67843, 67749    
+Writing weights to "extrinsic_weights.rs"
+```
+
+The complete command for Polkadot looks like this:  
+```sh
+cargo run --profile=production -- benchmark overhead --chain=polkadot-dev --execution=wasm --wasm-execution=compiled --weight-path=runtime/polkadot/constants/src/weights/
+```
+
+This will overwrite the [block_weights.rs](https://github.com/paritytech/polkadot/blob/c254e5975711a6497af256f6831e9a6c752d28f5/runtime/polkadot/constants/src/weights/block_weights.rs) and [extrinsic_weights.rs](https://github.com/paritytech/polkadot/blob/c254e5975711a6497af256f6831e9a6c752d28f5/runtime/polkadot/constants/src/weights/extrinsic_weights.rs) files in the Polkadot runtime directory. 
+You can try the same for *Rococo* and see that the results differ slightly.  
+👉 It is paramount to use `--profile=production`, `--execution=wasm` and `--wasm-execution=compiled` as the results are otherwise useless.
+
+## Output Interpretation
+
+Lower is better. The less weight the execution overhead needs, the better.  
+Since the overhead weight is charged per extrinsic and per block, a larger weight results in fewer extrinsics per block.  
+Minimizing this is important to have a large transaction throughput.
+
+## Arguments
+
+- `--chain` / `--dev` Set the chain specification. 
+- `--weight-path` Set the output directory or file to write the weights to.  
+- `--repeat` Set the repetitions of both benchmarks.
+- `--warmup` Set the rounds of warmup before measuring.
+- `--execution` Should be set to `wasm` for correct results.
+- `--wasm-execution` Should be set to `compiled` for correct results.
+- [`--mul`](../shared/README.md#arguments)
+- [`--add`](../shared/README.md#arguments)
+- [`--metric`](../shared/README.md#arguments)
+- [`--weight-path`](../shared/README.md#arguments)
+
+License: Apache-2.0
+
+<!-- LINKS -->
+[`ExtrinsicBaseWeight`]: https://github.com/paritytech/substrate/blob/580ebae17fa30082604f1c9720f6f4a1cfe95b50/frame/support/src/weights/extrinsic_weights.rs#L26
+[`BlockExecutionWeight`]: https://github.com/paritytech/substrate/blob/580ebae17fa30082604f1c9720f6f4a1cfe95b50/frame/support/src/weights/block_weights.rs#L26
+
+[System::Remark]: https://github.com/paritytech/substrate/blob/580ebae17fa30082604f1c9720f6f4a1cfe95b50/frame/system/src/lib.rs#L382
diff --git a/substrate/utils/frame/benchmarking-cli/src/overhead/bench.rs b/substrate/utils/frame/benchmarking-cli/src/overhead/bench.rs
index 68f3f6597b4..be7dac24021 100644
--- a/substrate/utils/frame/benchmarking-cli/src/overhead/bench.rs
+++ b/substrate/utils/frame/benchmarking-cli/src/overhead/bench.rs
@@ -43,11 +43,11 @@ use crate::shared::Stats;
 #[derive(Debug, Default, Serialize, Clone, PartialEq, Args)]
 pub struct BenchmarkParams {
 	/// Rounds of warmups before measuring.
-	#[clap(long, default_value = "100")]
+	#[clap(long, default_value = "10")]
 	pub warmup: u32,
 
 	/// How many times the benchmark should be repeated.
-	#[clap(long, default_value = "1000")]
+	#[clap(long, default_value = "100")]
 	pub repeat: u32,
 
 	/// Maximal number of extrinsics that should be put into a block.
diff --git a/substrate/utils/frame/benchmarking-cli/src/pallet/README.md b/substrate/utils/frame/benchmarking-cli/src/pallet/README.md
new file mode 100644
index 00000000000..72845652de6
--- /dev/null
+++ b/substrate/utils/frame/benchmarking-cli/src/pallet/README.md
@@ -0,0 +1,3 @@
+The pallet command is explained in [frame/benchmarking](../../../../../frame/benchmarking/README.md).
+
+License: Apache-2.0
diff --git a/substrate/utils/frame/benchmarking-cli/src/shared/README.md b/substrate/utils/frame/benchmarking-cli/src/shared/README.md
new file mode 100644
index 00000000000..2a3719b8549
--- /dev/null
+++ b/substrate/utils/frame/benchmarking-cli/src/shared/README.md
@@ -0,0 +1,15 @@
+# Shared code
+
+Contains code that is shared among multiple sub-commands.
+
+## Arguments
+
+- `--mul` Multiply the result with a factor. Can be used to manually adjust for future chain growth.
+- `--add` Add a value to the result. Can be used to manually offset the results.
+- `--metric` Set the metric to use for calculating the final weight from the raw data. Defaults to `average`.
+- `--weight-path` Set the file or directory to write the weight files to.
+- `--db` The database backend to use. This depends on your snapshot.
+- `--pruning` Set the pruning mode of the node. Some benchmarks require you to set this to `archive`.
+- `--base-path` The location on the disk that should be used for the benchmarks. You can try this on different disks or even on a mounted RAM-disk. It is important to use the same location that will later-on be used to store the chain data to get the correct results.
+
+License: Apache-2.0
diff --git a/substrate/utils/frame/benchmarking-cli/src/storage/README.md b/substrate/utils/frame/benchmarking-cli/src/storage/README.md
new file mode 100644
index 00000000000..820785f7ea2
--- /dev/null
+++ b/substrate/utils/frame/benchmarking-cli/src/storage/README.md
@@ -0,0 +1,105 @@
+# The `benchmark storage` command
+
+The cost of storage operations in a Substrate chain depends on the current chain state.  
+It is therefore important to regularly update these weights as the chain grows.  
+This sub-command measures the cost of storage operations for a concrete snapshot.  
+
+For the Substrate node it looks like this (for debugging you can use `--release`):  
+```sh
+cargo run --profile=production -- benchmark storage --dev --state-version=1
+```
+
+Running the command on Substrate itself is not very meaningful, since the genesis state of the `--dev` chain spec is used.  
+
+The output for the Polkadot client with a recent chain snapshot will give you a better impression. A recent snapshot can be downloaded from [Polkachu].  
+Then run (remove the `--db=paritydb` if you have a RocksDB snapshot):
+```sh
+cargo run --profile=production -- benchmark storage --dev --state-version=0 --db=paritydb --weight-path runtime/polkadot/constants/src/weights
+```
+
+This takes a while since it reads and writes all keys from the snapshot:
+```pre
+# The 'read' benchmark
+Preparing keys from block BlockId::Number(9939462)    
+Reading 1379083 keys    
+Time summary [ns]:
+Total: 19668919930
+Min: 6450, Max: 1217259
+Average: 14262, Median: 14190, Stddev: 3035.79
+Percentiles 99th, 95th, 75th: 18270, 16190, 14819
+Value size summary:
+Total: 265702275
+Min: 1, Max: 1381859
+Average: 192, Median: 80, Stddev: 3427.53
+Percentiles 99th, 95th, 75th: 3368, 383, 80    
+
+# The 'write' benchmark
+Preparing keys from block BlockId::Number(9939462)    
+Writing 1379083 keys    
+Time summary [ns]:
+Total: 98393809781
+Min: 12969, Max: 13282577
+Average: 71347, Median: 69499, Stddev: 25145.27
+Percentiles 99th, 95th, 75th: 135839, 106129, 79239
+Value size summary:
+Total: 265702275
+Min: 1, Max: 1381859
+Average: 192, Median: 80, Stddev: 3427.53
+Percentiles 99th, 95th, 75th: 3368, 383, 80
+
+Writing weights to "paritydb_weights.rs"
+```
+You will see that the [paritydb_weights.rs] file was modified and now contains new weights. 
+The exact command for Polkadot can be seen at the top of the file.  
+This uses the most recent block from your snapshot which is printed at the top.  
+The value size summary tells us that the pruned Polkadot chain state is ~253 MiB in size.  
+Reading a value takes on average (in this example) 14.3 µs and writing one takes 71.3 µs.  
+The interesting part in the generated weight file tells us the weight constants and some statistics about the measurements:
+```rust
+/// Time to read one storage item.
+/// Calculated by multiplying the *Average* of all values with `1.1` and adding `0`.
+///
+/// Stats [NS]:
+///   Min, Max: 4_611, 1_217_259
+///   Average:  14_262
+///   Median:   14_190
+///   Std-Dev:  3035.79
+///
+/// Percentiles [NS]:
+///   99th: 18_270
+///   95th: 16_190
+///   75th: 14_819
+read: 14_262 * constants::WEIGHT_PER_NANOS,
+
+/// Time to write one storage item.
+/// Calculated by multiplying the *Average* of all values with `1.1` and adding `0`.
+///
+/// Stats [NS]:
+///   Min, Max: 12_969, 13_282_577
+///   Average:  71_347
+///   Median:   69_499
+///   Std-Dev:  25145.27
+///
+/// Percentiles [NS]:
+///   99th: 135_839
+///   95th: 106_129
+///   75th: 79_239
+write: 71_347 * constants::WEIGHT_PER_NANOS,
+```
+
+## Arguments
+
+- `--db` Specify which database backend to use. This greatly influences the results.
+- `--state-version` Set the version of the state encoding that this snapshot uses. Should be set to `1` for Substrate `--dev` and `0` for Polkadot et al. Using the wrong version can corrupt the snapshot.
+- [`--mul`](../shared/README.md#arguments)
+- [`--add`](../shared/README.md#arguments)
+- [`--metric`](../shared/README.md#arguments)
+- [`--weight-path`](../shared/README.md#arguments)
+- `--json-read-path` Write the raw 'read' results to this file or directory.
+- `--json-write-path` Write the raw 'write' results to this file or directory.
+
+License: Apache-2.0
+
+<!-- LINKS -->
+[Polkachu]: https://polkachu.com/snapshots
+[paritydb_weights.rs]: https://github.com/paritytech/polkadot/blob/c254e5975711a6497af256f6831e9a6c752d28f5/runtime/polkadot/constants/src/weights/paritydb_weights.rs#L60
-- 
GitLab