Commits · cc1e6ac301ea88e3cb3253a84e4c6aa28f2d8f87 · parity / Mirrored projects / polkadot-sdk

Mar 25, 2024

[subsystem-benchmarks] Fix availability-write regression tests (#3698) · cc1e6ac3

Andrei Eres authored Mar 25, 2024

Adds availability-write regression tests.
The results for the `availability-distribution` subsystem are volatile,
so I had to reduce the precision of the test.

cc1e6ac3

Mar 13, 2024
- Add subsystems regression tests to CI (#3527) · 256546b7
  Andrei Eres authored Mar 13, 2024
```
Fixes https://github.com/paritytech/polkadot-sdk/issues/3530
```
  256546b7
Mar 11, 2024

subsystem-bench: adjust test config to Kusama (#3583) · 05381afc

Andrei Eres authored Mar 11, 2024

Fixes https://github.com/paritytech/polkadot-sdk/issues/3528

```rust
latency:
    mean_latency_ms = 30 // common sense
    std_dev = 2.0 // common sense
n_validators = 300 // max number of validators, from chain config
n_cores = 60 // 300/5
max_validators_per_core = 5 // default
min_pov_size = 5120 // max
max_pov_size = 5120 // max
peer_bandwidth = 44040192 // from the Parity's kusama validators
bandwidth = 44040192 // from the Parity's kusama validators
connectivity = 90 // we need to be connected to 90-95% of peers
```

05381afc

Mar 08, 2024

fix some typos (#3587) · ea458d0b

cuinix authored Mar 09, 2024



Signed-off-by: cuinix <[email protected]>
Co-authored-by: Bastian Köcher <[email protected]>

ea458d0b

Mar 01, 2024

subsystem-bench: add regression tests for availability read and write (#3311) · f0e589d7

Andrei Eres authored Mar 01, 2024



### What's been done
- `subsystem-bench` has been split into two parts: a cli benchmark
runner and a library.
- The cli runner is quite simple. It just allows us to run `.yaml` based
test sequences. Now it should only be used to run benchmarks during
development.
- The library is used in the cli runner and in regression tests. Some
code is changed to make the library independent of the runner.
- Added first regression tests for availability read and write that
replicate existing test sequences.

### How we run regression tests
- Regression tests are simply rust integration tests without the
harnesses.
- They should only be compiled under the `subsystem-benchmarks` feature
to prevent them from running with other tests.
- This doesn't work when running tests with `nextest` in CI, so
additional filters have been added to the `nextest` runs.
- Each benchmark run takes a different time in the beginning, so we
"warm up" the tests until their CPU usage differs by only 1%.
- After the warm-up, we run the benchmarks a few more times and compare
the average with the exception using a precision.

### What is still wrong?
- I haven't managed to set up approval voting tests. The spread of their
results is too large and can't be narrowed down in a reasonable amount
of time in the warm-up phase.
- The tests start an unconfigurable prometheus endpoint inside, which
causes errors because they use the same 9999 port. I disable it with a
flag, but I think it's better to extract the endpoint launching outside
the test, as we already do with `valgrind` and `pyroscope`. But we still
use `prometheus` inside the tests.

### Future work
* https://github.com/paritytech/polkadot-sdk/issues/3528
* https://github.com/paritytech/polkadot-sdk/issues/3529
* https://github.com/paritytech/polkadot-sdk/issues/3530
* https://github.com/paritytech/polkadot-sdk/issues/3531

---------

Co-authored-by: Alexander Samusev <[email protected]>

f0e589d7

Feb 12, 2024
- subsystem-bench: polish imports (#3262) · cbd68467
  Andrei Eres authored Feb 12, 2024
  
  cbd68467
Feb 08, 2024

subsystem-bench: run cli benchmarks only using config files (#3239) · 07f85929

Andrei Eres authored Feb 08, 2024

This PR removes the configuration of subsystem benchmarks via CLI
arguments. After this, we only keep configurations only in yaml files.
It removes unnecessary code duplication

07f85929

Feb 06, 2024

subsystem-bench: Prepare CI output (#3158) · 9e6298e7

Andrei Eres authored Feb 06, 2024



1. Benchmark results are collected in a single struct.
2. The output of the results is prettified.
3. The result struct used to save the output as a yaml and store it in
artifacts in a CI job.

```
$ cargo run -p polkadot-subsystem-bench --release -- test-sequence --path polkadot/node/subsystem-bench/examples/availability_read.yaml | tee output.txt
$ cat output.txt

polkadot/node/subsystem-bench/examples/availability_read.yaml #1

Network usage, KiB                     total   per block
Received from peers               510796.000  170265.333
Sent to peers                        221.000      73.667

CPU usage, s                           total   per block
availability-recovery                 38.671      12.890
Test environment                       0.255       0.085


polkadot/node/subsystem-bench/examples/availability_read.yaml #2

Network usage, KiB                     total   per block
Received from peers               413633.000  137877.667
Sent to peers                        353.000     117.667

CPU usage, s                           total   per block
availability-recovery                 52.630      17.543
Test environment                       0.271       0.090


polkadot/node/subsystem-bench/examples/availability_read.yaml #3

Network usage, KiB                     total   per block
Received from peers               424379.000  141459.667
Sent to peers                        703.000     234.333

CPU usage, s                           total   per block
availability-recovery                 51.128      17.043
Test environment                       0.502       0.167

```

```
$ cargo run -p polkadot-subsystem-bench --release -- --ci test-sequence --path polkadot/node/subsystem-bench/examples/availability_read.yaml | tee output.txt
$ cat output.txt
- benchmark_name: 'polkadot/node/subsystem-bench/examples/availability_read.yaml #1'
  network:
  - resource: Received from peers
    total: 509011.0
    per_block: 169670.33333333334
  - resource: Sent to peers
    total: 220.0
    per_block: 73.33333333333333
  cpu:
  - resource: availability-recovery
    total: 31.845848445
    per_block: 10.615282815
  - resource: Test environment
    total: 0.23582828799999941
    per_block: 0.07860942933333313

- benchmark_name: 'polkadot/node/subsystem-bench/examples/availability_read.yaml #2'
  network:
  - resource: Received from peers
    total: 411738.0
    per_block: 137246.0
  - resource: Sent to peers
    total: 351.0
    per_block: 117.0
  cpu:
  - resource: availability-recovery
    total: 18.93596025099999
    per_block: 6.31198675033333
  - resource: Test environment
    total: 0.2541994199999979
    per_block: 0.0847331399999993

- benchmark_name: 'polkadot/node/subsystem-bench/examples/availability_read.yaml #3'
  network:
  - resource: Received from peers
    total: 424548.0
    per_block: 141516.0
  - resource: Sent to peers
    total: 703.0
    per_block: 234.33333333333334
  cpu:
  - resource: availability-recovery
    total: 16.54178526900001
    per_block: 5.513928423000003
  - resource: Test environment
    total: 0.43960946299999537
    per_block: 0.14653648766666513
```

---------

Co-authored-by: Andrei Sandu <[email protected]>

9e6298e7

Feb 05, 2024

Introduce approval-voting/distribution benchmark (#2621) · f9f88688

Alexandru Gheorghe authored Feb 05, 2024



## Summary
Built on top of the tooling and ideas introduced in
https://github.com/paritytech/polkadot-sdk/pull/2528, this PR introduces
a synthetic benchmark for measuring and assessing the performance
characteristics of the approval-voting and approval-distribution
subsystems.

Currently this allows, us to simulate the behaviours of these systems
based on the following dimensions:
```
TestConfiguration:
# Test 1
- objective: !ApprovalsTest
    last_considered_tranche: 89
    min_coalesce: 1
    max_coalesce: 6
    enable_assignments_v2: true
    send_till_tranche: 60
    stop_when_approved: false
    coalesce_tranche_diff: 12
    workdir_prefix: "/tmp"
    num_no_shows_per_candidate: 0
    approval_distribution_expected_tof: 6.0
    approval_distribution_cpu_ms: 3.0
    approval_voting_cpu_ms: 4.30
  n_validators: 500
  n_cores: 100
  n_included_candidates: 100
  min_pov_size: 1120
  max_pov_size: 5120
  peer_bandwidth: 524288000000
  bandwidth: 524288000000
  latency:
    min_latency:
      secs: 0
      nanos: 1000000
    max_latency:
      secs: 0
      nanos: 100000000
  error: 0
  num_blocks: 10
```

## The approach
1. We build a real overseer with the real implementations for
approval-voting and approval-distribution subsystems.
2. For a given network size, for each validator we pre-computed all
potential assignments and approvals it would send, because this a
computation heavy operation this will be cached on a file on disk and be
re-used if the generation parameters don't change.
3. The messages will be sent accordingly to the configured parameters
and those are split into 3 main benchmarking scenarios.

## Benchmarking scenarios

### Best case scenario *approvals_throughput_best_case.yaml*
It send to the approval-distribution only the minimum required tranche
to gathered the needed_approvals, so that a candidate is approved.

### Behaviour in the presence of no-shows *approvals_no_shows.yaml*
It sends the tranche needed to approve a candidate when we have a
maximum of *num_no_shows_per_candidate* tranches with no-shows for each
candidate.

### Maximum throughput *approvals_throughput.yaml*
It sends all the tranches for each block and measures the used CPU and
necessary network bandwidth. by the approval-voting and
approval-distribution subsystem.

## How to run it
```
cargo run -p polkadot-subsystem-bench --release -- test-sequence --path polkadot/node/subsystem-bench/examples/approvals_throughput.yaml
```

## Evaluating performance
### Use the real subsystems metrics
If you follow the steps in
https://github.com/paritytech/polkadot-sdk/tree/master/polkadot/node/subsystem-bench#install-grafana
for installing locally prometheus and grafana, all real metrics for the
`approval-distribution`, `approval-voting` and overseer are available.
E.g:
<img width="2149" alt="Screenshot 2023-12-05 at 11 07 46"
src="https://github.com/paritytech/polkadot-sdk/assets/49718502/cb8ae2dd-178b-4922-bfa4-dc37e572ed38">

<img width="2551" alt="Screenshot 2023-12-05 at 11 09 42"
src="https://github.com/paritytech/polkadot-sdk/assets/49718502/8b4542ba-88b9-46f9-9b70-cc345366081b">

<img width="2154" alt="Screenshot 2023-12-05 at 11 10 15"
src="https://github.com/paritytech/polkadot-sdk/assets/49718502/b8874d8d-632e-443a-9840-14ad8e90c54f">

<img width="2535" alt="Screenshot 2023-12-05 at 11 10 52"
src="https://github.com/paritytech/polkadot-sdk/assets/49718502/779a439f-fd18-4985-bb80-85d5afad78e2">

### Profile with pyroscope
1. Setup pyroscope following the steps in
https://github.com/paritytech/polkadot-sdk/tree/master/polkadot/node/subsystem-bench#install-pyroscope,
then run any of the benchmark scenario with `--profile` as the
arguments.
2. Open the pyroscope dashboard in grafana, e.g:
<img width="2544" alt="Screenshot 2024-01-09 at 17 09 58"
src="https://github.com/paritytech/polkadot-sdk/assets/49718502/58f50c99-a910-4d20-951a-8b16639303d9">



### Useful  logs
1. Network bandwidth requirements:
```
Payload bytes received from peers: 503993 KiB total, 50399 KiB/block
Payload bytes sent to peers: 629971 KiB total, 62997 KiB/block
```

2. Cpu usage by the approval-distribution/approval-voting subsystems.
```
approval-distribution CPU usage 84.061s
approval-distribution CPU usage per block 8.406s
approval-voting CPU usage 96.532s
approval-voting CPU usage per block 9.653s
```

3. Time passed until a given block is approved
```
 Chain selection approved  after 3500 ms hash=0x0101010101010101010101010101010101010101010101010101010101010101
Chain selection approved  after 4500 ms hash=0x0202020202020202020202020202020202020202020202020202020202020202
```

### Using benchmark to quantify improvements from
https://github.com/paritytech/polkadot-sdk/pull/1178 +
https://github.com/paritytech/polkadot-sdk/pull/1191

Using a versi-node we compare the scenarios where all new optimisations
are disabled with a scenarios where tranche0 assignments are sent in a
single message and a conservative simulation where the coalescing of
approvals gives us just 50% reduction in the number of messages we send.

Overall, what we see is a speedup of around 30-40% in the time it takes
to process the necessary messages and a 30-40% reduction in the
necessary bandwidth.

#### Best case scenario comparison(minimum required tranches sent).
Unoptimised
```
    Number of blocks: 10
    Payload bytes received from peers: 53289 KiB total, 5328 KiB/block
    Payload bytes sent to peers: 52489 KiB total, 5248 KiB/block
    approval-distribution CPU usage 6.732s
    approval-distribution CPU usage per block 0.673s
    approval-voting CPU usage 9.523s
    approval-voting CPU usage per block 0.952s
```

vs Optimisation enabled
```
   Number of blocks: 10
   Payload bytes received from peers: 32141 KiB total, 3214 KiB/block
   Payload bytes sent to peers: 37314 KiB total, 3731 KiB/block
   approval-distribution CPU usage 4.658s
   approval-distribution CPU usage per block 0.466s
   approval-voting CPU usage 6.236s
   approval-voting CPU usage per block 0.624s
```

#### Worst case all tranches sent, very unlikely happens when sharding
breaks.

Unoptimised
```
   Number of blocks: 10
   Payload bytes received from peers: 746393 KiB total, 74639 KiB/block
   Payload bytes sent to peers: 729151 KiB total, 72915 KiB/block
   approval-distribution CPU usage 118.681s
   approval-distribution CPU usage per block 11.868s
   approval-voting CPU usage 124.118s
   approval-voting CPU usage per block 12.412s
```

vs optimised
```
    Number of blocks: 10
    Payload bytes received from peers: 503993 KiB total, 50399 KiB/block
    Payload bytes sent to peers: 629971 KiB total, 62997 KiB/block
    approval-distribution CPU usage 84.061s
    approval-distribution CPU usage per block 8.406s
    approval-voting CPU usage 96.532s
    approval-voting CPU usage per block 9.653s
```


## TODOs
[x] Polish implementation.
[x] Use what we have so far to evaluate
https://github.com/paritytech/polkadot-sdk/pull/1191 before merging.
[x] List of features and additional dimensions we want to use for
benchmarking.
[x] Run benchmark on hardware similar with versi and kusama nodes.
[ ] Add benchmark to be run in CI for catching regression in
performance.
[ ] Rebase on latest changes for network emulation.

---------

Signed-off-by: Andrei Sandu <[email protected]>
Signed-off-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: Andrei Sandu <[email protected]>
Co-authored-by: Andrei Sandu <[email protected]>

f9f88688

Jan 30, 2024

[ci] Update rust in ci image (1.75 and 2024-01-22) (#3016) · 5b7f24fc

Alexander Samusev authored Jan 31, 2024



cc https://github.com/paritytech/ci_cd/issues/926

---------

Signed-off-by: Oliver Tale-Yazdi <[email protected]>
Co-authored-by: command-bot <>
Co-authored-by: Oliver Tale-Yazdi <[email protected]>
Co-authored-by: Bastian Köcher <[email protected]>
Co-authored-by: Liam Aharon <[email protected]>
Co-authored-by: Dónal Murray <[email protected]>
Co-authored-by: Vladimir Istyufeev <[email protected]>

5b7f24fc

Jan 25, 2024

Add subsystem benchmarks for `availability-distribution` and... · 47e46d17

Andrei Sandu authored Jan 25, 2024


Add subsystem benchmarks for `availability-distribution` and `biftield-distribution` (availability write) (#2970)

Introduce a new test objective : `DataAvailabilityWrite`.

The new benchmark measures the network and cpu usage of
`availability-distribution`, `biftield-distribution` and
`availability-store` subsystems from the perspective of a validator node
during the process when candidates are made available.

Additionally I refactored the networking emulation to support bandwidth
acounting and limits of incoming and outgoing requests.

Screenshot of succesful run


<img width="1293" alt="Screenshot 2024-01-17 at 19 17 44"
src="https://github.com/paritytech/polkadot-sdk/assets/54316454/fde11280-e25b-4dc3-9dc9-d4b9752f9b7a">

---------

Signed-off-by: Andrei Sandu <[email protected]>

47e46d17

Jan 16, 2024

subsystem-bench: cache misses profiling (#2893) · ec7bfae0

Andrei Eres authored Jan 16, 2024

## Why we need it
To provide another level of understanding to why polkadot's subsystems
may perform slower than expected. Cache misses occur when processing
large amounts of data, such as during availability recovery.

## Why Cachegrind
Cachegrind has many drawbacks: it is slow, it uses its own cache
simulation, which is very basic. But unlike `perf`, which is a great
tool, Cachegrind can run in a virtual machine. This means we can easily
run it in remote installations and even use it in CI/CD to catch
possible regressions.

Why Cachegrind and not Callgrind, another part of Valgrind? It is simply
empirically proven that profiling runs faster with Cachegrind.

## First results
First results have been obtained while testing of the approach. Here is
an example.

```
$ target/testnet/subsystem-bench --n-cores 10 --cache-misses data-availability-read
$ cat cachegrind_report.txt
I refs:        64,622,081,485
I1  misses:         3,018,168
LLi misses:           437,654
I1  miss rate:           0.00%
LLi miss rate:           0.00%

D refs:        12,161,833,115  (9,868,356,364 rd   + 2,293,476,751 wr)
D1  misses:       167,940,701  (   71,060,073 rd   +    96,880,628 wr)
LLd misses:        33,550,018  (   16,685,853 rd   +    16,864,165 wr)
D1  miss rate:            1.4% (          0.7%     +           4.2%  )
LLd miss rate:            0.3% (          0.2%     +           0.7%  )

LL refs:          170,958,869  (   74,078,241 rd   +    96,880,628 wr)
LL misses:         33,987,672  (   17,123,507 rd   +    16,864,165 wr)
LL miss rate:             0.0% (          0.0%     +           0.7%  )
```

The CLI output shows that 1.4% of the L1 data cache missed, which is not
so bad, given that the last-level cache had that data most of the time
missing only 0.3%. Instruction data of the L1 has 0.00% misses of the
time. Looking at an output file with `cg_annotate` shows that most of
the misses occur during reed-solomon, which is expected.

ec7bfae0

Jan 10, 2024

add fallback request for req-response protocols (#2771) · f2a750ee

Alin Dima authored Jan 10, 2024

Previously, it was only possible to retry the same request on a
different protocol name that had the exact same binary payloads.

Introduce a way of trying a different request on a different protocol if
the first one fails with Unsupported protocol.

This helps with adding new req-response versions in polkadot while
preserving compatibility with unupgraded nodes.

The way req-response protocols were bumped previously was that they were
bundled with some other notifications protocol upgrade, like for async
backing (but that is more complicated, especially if the feature does
not require any changes to a notifications protocol). Will be needed for
implementing https://github.com/polkadot-fellows/RFCs/pull/47

TODO:
- [x]  add tests
- [x] add guidance docs in polkadot about req-response protocol
versioning

f2a750ee

Jan 05, 2024
- Fix clippy warnings (#2861) · cea7024d
  Tsvetomir Dimitrov authored Jan 05, 2024
```
Fix some issues reported by clippy
```
  cea7024d
Dec 19, 2023

subsystem benchmarks: add cpu profiling (#2734) · 526c81b1

Andrei Eres authored Dec 19, 2023



Ready-to-merge version of
https://github.com/paritytech/polkadot-sdk/pull/2601

- Added optional CPU profiling
- Updated instructions how to set up Prometheus, Pyroscope and Graphana
- Added a flamegraph dashboard
<img width="1470" alt="image"
src="https://github.com/paritytech/polkadot-sdk/assets/27277055/c8f3b33d-3c01-4ec0-ac34-72d52325b6e6">

---------

Co-authored-by: ordian <[email protected]>

526c81b1

Dec 14, 2023

Introduce subsystem benchmarking tool (#2528) · 8a6e9ef1

Andrei Sandu authored Dec 14, 2023

This tool makes it easy to run parachain consensus stress/performance
testing on your development machine or in CI.

## Motivation
The parachain consensus node implementation spans across many modules
which we call subsystems. Each subsystem is responsible for a small part
of logic of the parachain consensus pipeline, but in general the most
load and performance issues are localized in just a few core subsystems
like `availability-recovery`, `approval-voting` or
`dispute-coordinator`. In the absence of such a tool, we would run large
test nets to load/stress test these parts of the system. Setting up and
making sense of the amount of data produced by such a large test is very
expensive, hard to orchestrate and is a huge development time sink.

## PR contents
- CLI tool 
- Data Availability Read test
- reusable mockups and components needed so far
- Documentation on how to get started

### Data Availability Read test

An overseer is built with using a real `availability-recovery` susbsytem
instance while dependent subsystems like `av-store`, `network-bridge`
and `runtime-api` are mocked. The network bridge will emulate all the
network peers and their answering to requests.

The test is going to be run for a number of blocks. For each block it
will generate send a “RecoverAvailableData” request for an arbitrary
number of candidates. We wait for the subsystem to respond to all
requests before moving to the next block.
At the same time we collect the usual subsystem metrics and task CPU
metrics and show some nice progress reports while running.

### Here is how the CLI looks like:

```
[2023-11-28T13:06:27Z INFO  subsystem_bench::core::display] n_validators = 1000, n_cores = 20, pov_size = 5120 - 5120, error = 3, latency = Some(PeerLatency { min_latency: 1ms, max_latency: 100ms })
[2023-11-28T13:06:27Z INFO  subsystem-bench::availability] Generating template candidate index=0 pov_size=5242880
[2023-11-28T13:06:27Z INFO  subsystem-bench::availability] Created test environment.
[2023-11-28T13:06:27Z INFO  subsystem-bench::availability] Pre-generating 60 candidates.
[2023-11-28T13:06:30Z INFO  subsystem-bench::core] Initializing network emulation for 1000 peers.
[2023-11-28T13:06:30Z INFO  subsystem-bench::availability] Current block 1/3
[2023-11-28T13:06:30Z INFO  substrate_prometheus_endpoint] 〽

️ Prometheus exporter started at 127.0.0.1:9999
[2023-11-28T13:06:30Z INFO  subsystem_bench::availability] 20 recoveries pending
[2023-11-28T13:06:37Z INFO  subsystem_bench::availability] Block time 6262ms
[2023-11-28T13:06:37Z INFO  subsystem-bench::availability] Sleeping till end of block (0ms)
[2023-11-28T13:06:37Z INFO  subsystem-bench::availability] Current block 2/3
[2023-11-28T13:06:37Z INFO  subsystem_bench::availability] 20 recoveries pending
[2023-11-28T13:06:43Z INFO  subsystem_bench::availability] Block time 6369ms
[2023-11-28T13:06:43Z INFO  subsystem-bench::availability] Sleeping till end of block (0ms)
[2023-11-28T13:06:43Z INFO  subsystem-bench::availability] Current block 3/3
[2023-11-28T13:06:43Z INFO  subsystem_bench::availability] 20 recoveries pending
[2023-11-28T13:06:49Z INFO  subsystem_bench::availability] Block time 6194ms
[2023-11-28T13:06:49Z INFO  subsystem-bench::availability] Sleeping till end of block (0ms)
[2023-11-28T13:06:49Z INFO  subsystem_bench::availability] All blocks processed in 18829ms
[2023-11-28T13:06:49Z INFO  subsystem_bench::availability] Throughput: 102400 KiB/block
[2023-11-28T13:06:49Z INFO  subsystem_bench::availability] Block time: 6276 ms
[2023-11-28T13:06:49Z INFO  subsystem_bench::availability] 
    
    Total received from network: 415 MiB
    Total sent to network: 724 KiB
    Total subsystem CPU usage 24.00s
    CPU usage per block 8.00s
    Total test environment CPU usage 0.15s
    CPU usage per block 0.05s
```

### Prometheus/Grafana stack in action
<img width="1246" alt="Screenshot 2023-11-28 at 15 11 10"
src="https://github.com/paritytech/polkadot-sdk/assets/54316454/eaa47422-4a5e-4a3a-aaef-14ca644c1574">
<img width="1246" alt="Screenshot 2023-11-28 at 15 12 01"
src="https://github.com/paritytech/polkadot-sdk/assets/54316454/237329d6-1710-4c27-8f67-5fb11d7f66ea">
<img width="1246" alt="Screenshot 2023-11-28 at 15 12 38"
src="https://github.com/paritytech/polkadot-sdk/assets/54316454/a07119e8-c9f1-4810-a1b3-f1b7b01cf357">

---------

Signed-off-by: Andrei Sandu <[email protected]>

8a6e9ef1