1. Feb 05, 2024
    • Introduce approval-voting/distribution benchmark (#2621) · f9f88688
      Alexandru Gheorghe authored
      ## Summary
      Built on top of the tooling and ideas introduced in
      https://github.com/paritytech/polkadot-sdk/pull/2528, this PR introduces
      a synthetic benchmark for measuring and assessing the performance
      characteristics of the approval-voting and approval-distribution
      subsystems.
      Currently, this allows us to simulate the behaviour of these systems
      based on the following dimensions:
      # Test 1
      - objective: !ApprovalsTest
          last_considered_tranche: 89
          min_coalesce: 1
          max_coalesce: 6
          enable_assignments_v2: true
          send_till_tranche: 60
          stop_when_approved: false
          coalesce_tranche_diff: 12
          workdir_prefix: "/tmp"
          num_no_shows_per_candidate: 0
          approval_distribution_expected_tof: 6.0
          approval_distribution_cpu_ms: 3.0
          approval_voting_cpu_ms: 4.30
        n_validators: 500
        n_cores: 100
        n_included_candidates: 100
        min_pov_size: 1120
        max_pov_size: 5120
        peer_bandwidth: 524288000000
        bandwidth: 524288000000
            secs: 0
            nanos: 1000000
            secs: 0
            nanos: 100000000
        error: 0
        num_blocks: 10
      ## The approach
      1. We build a real overseer with the real implementations for
      approval-voting and approval-distribution subsystems.
      2. For a given network size, we pre-compute for each validator all the
      potential assignments and approvals it would send. Because this is a
      computation-heavy operation, the result is cached in a file on disk and
      re-used as long as the generation parameters don't change.
      3. The messages are sent according to the configured parameters,
      and these are split into 3 main benchmarking scenarios.
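The caching idea in step 2 can be sketched as follows. This is only an illustration, not the benchmark's actual implementation: the file naming scheme, the use of `pickle`, and the helper names `cache_path` and `load_or_generate` are all assumptions made for this example.

```python
import hashlib
import json
import pickle
from pathlib import Path


def cache_path(workdir: Path, params: dict) -> Path:
    """Derive a file name from a stable hash of the generation parameters."""
    # sort_keys makes the hash independent of the key order in `params`.
    encoded = json.dumps(params, sort_keys=True).encode()
    digest = hashlib.sha256(encoded).hexdigest()[:16]
    return workdir / f"approvals-messages-{digest}.bin"  # hypothetical name


def load_or_generate(workdir: Path, params: dict, generate):
    """Return cached messages if present, otherwise generate and store them."""
    path = cache_path(workdir, params)
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    messages = generate(params)  # the computation-heavy step
    with path.open("wb") as f:
        pickle.dump(messages, f)
    return messages
```

Keying the file name on a hash of the parameters means that changing any parameter produces a different file, so a stale cache is never picked up by accident.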
      ## Benchmarking scenarios
      ### Best case scenario *approvals_throughput_best_case.yaml*
      It sends to approval-distribution only the minimum tranches required
      to gather the needed_approvals, so that a candidate is approved.
      ### Behaviour in the presence of no-shows *approvals_no_shows.yaml*
      It sends the tranches needed to approve a candidate when we have a
      maximum of *num_no_shows_per_candidate* tranches with no-shows for each
      candidate.
      ### Maximum throughput *approvals_throughput.yaml*
      It sends all the tranches for each block and measures the CPU used and
      the network bandwidth required by the approval-voting and
      approval-distribution subsystems.
      ## How to run it
      cargo run -p polkadot-subsystem-bench --release -- test-sequence --path polkadot/node/subsystem-bench/examples/approvals_throughput.yaml
      ## Evaluating performance
      ### Use the real subsystems metrics
      If you follow the steps in
      for installing prometheus and grafana locally, all real metrics for the
      `approval-distribution`, `approval-voting` and overseer are available.
      ### Profile with pyroscope
      1. Set up pyroscope following the steps in
      then run any of the benchmark scenarios with `--profile` as the
      2. Open the pyroscope dashboard in grafana.
      ### Useful logs
      1. Network bandwidth requirements:
      Payload bytes received from peers: 503993 KiB total, 50399 KiB/block
      Payload bytes sent to peers: 629971 KiB total, 62997 KiB/block
      2. CPU usage by the approval-distribution/approval-voting subsystems:
      approval-distribution CPU usage 84.061s
      approval-distribution CPU usage per block 8.406s
      approval-voting CPU usage 96.532s
      approval-voting CPU usage per block 9.653s
      3. Time passed until a given block is approved
      Chain selection approved  after 3500 ms hash=0x0101010101010101010101010101010101010101010101010101010101010101
      Chain selection approved  after 4500 ms hash=0x0202020202020202020202020202020202020202020202020202020202020202
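For comparing runs, the log lines above can be collected into a dictionary. The sketch below is a hypothetical helper that is not part of the benchmark; the regular expressions are written against the exact line shapes quoted above and would need adjusting if the log format differs.

```python
import re

PAYLOAD_RE = re.compile(
    r"Payload bytes (received from|sent to) peers: (\d+) KiB total, (\d+) KiB/block"
)
CPU_RE = re.compile(r"(approval-[a-z]+) CPU usage( per block)? ([0-9.]+)s")
APPROVED_RE = re.compile(
    r"Chain selection approved\s+after (\d+) ms hash=(0x[0-9a-f]+)"
)


def parse_bench_log(text: str) -> dict:
    """Collect bandwidth, CPU and approval-time figures from benchmark output."""
    result = {"payload_kib": {}, "cpu_s": {}, "approved_after_ms": {}}
    for line in text.splitlines():
        m = PAYLOAD_RE.search(line)
        if m:
            direction = "received" if m.group(1) == "received from" else "sent"
            result["payload_kib"][direction] = {
                "total": int(m.group(2)),
                "per_block": int(m.group(3)),
            }
            continue
        m = CPU_RE.search(line)
        if m:
            # Distinguish the total figure from the per-block figure.
            key = m.group(1) + (" per block" if m.group(2) else "")
            result["cpu_s"][key] = float(m.group(3))
            continue
        m = APPROVED_RE.search(line)
        if m:
            result["approved_after_ms"][m.group(2)] = int(m.group(1))
    return result
```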
      ### Using benchmark to quantify improvements from
      https://github.com/paritytech/polkadot-sdk/pull/1178 +
      Using a versi-node, we compare the scenario where all new optimisations
      are disabled with a scenario where tranche0 assignments are sent in a
      single message and, as a conservative simulation, the coalescing of
      approvals gives us just a 50% reduction in the number of messages we send.
      Overall, what we see is a speedup of around 30-40% in the time it takes
      to process the necessary messages and a 30-40% reduction in the
      necessary bandwidth.
      #### Best case scenario comparison (minimum required tranches sent)
          Number of blocks: 10
          Payload bytes received from peers: 53289 KiB total, 5328 KiB/block
          Payload bytes sent to peers: 52489 KiB total, 5248 KiB/block
          approval-distribution CPU usage 6.732s
          approval-distribution CPU usage per block 0.673s
          approval-voting CPU usage 9.523s
          approval-voting CPU usage per block 0.952s
      vs Optimisation enabled
         Number of blocks: 10
         Payload bytes received from peers: 32141 KiB total, 3214 KiB/block
         Payload bytes sent to peers: 37314 KiB total, 3731 KiB/block
         approval-distribution CPU usage 4.658s
         approval-distribution CPU usage per block 0.466s
         approval-voting CPU usage 6.236s
         approval-voting CPU usage per block 0.624s
      #### Worst case all tranches sent, very unlikely happens when sharding
         Number of blocks: 10
         Payload bytes received from peers: 746393 KiB total, 74639 KiB/block
         Payload bytes sent to peers: 729151 KiB total, 72915 KiB/block
         approval-distribution CPU usage 118.681s
         approval-distribution CPU usage per block 11.868s
         approval-voting CPU usage 124.118s
         approval-voting CPU usage per block 12.412s
      vs optimised
          Number of blocks: 10
          Payload bytes received from peers: 503993 KiB total, 50399 KiB/block
          Payload bytes sent to peers: 629971 KiB total, 62997 KiB/block
          approval-distribution CPU usage 84.061s
          approval-distribution CPU usage per block 8.406s
          approval-voting CPU usage 96.532s
          approval-voting CPU usage per block 9.653s
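The relative savings can be derived directly from the totals above. The sketch below only does the arithmetic on the numbers copied from these logs; the helper names `reduction_pct` and `summarise` are made up for this example.

```python
def reduction_pct(before: float, after: float) -> float:
    """Relative reduction of `after` compared to `before`, in percent."""
    return (before - after) / before * 100.0


# (optimisations disabled, optimisations enabled), totals copied from the logs.
BEST_CASE = {
    "received KiB": (53289, 32141),
    "sent KiB": (52489, 37314),
    "approval-distribution CPU s": (6.732, 4.658),
    "approval-voting CPU s": (9.523, 6.236),
}
WORST_CASE = {
    "received KiB": (746393, 503993),
    "sent KiB": (729151, 629971),
    "approval-distribution CPU s": (118.681, 84.061),
    "approval-voting CPU s": (124.118, 96.532),
}


def summarise(results: dict) -> dict:
    """Map each metric name to its percentage reduction, rounded to 0.1."""
    return {
        name: round(reduction_pct(before, after), 1)
        for name, (before, after) in results.items()
    }


if __name__ == "__main__":
    for label, results in (("best case", BEST_CASE), ("worst case", WORST_CASE)):
        for name, pct in summarise(results).items():
            print(f"{label:10} {name:28} {pct}% lower")
```

Running it prints one line per metric, which makes it easy to see which figures fall inside the quoted range and which do not.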
      ## TODOs
      [x] Polish implementation.
      [x] Use what we have so far to evaluate
       before merging.
      [x] List of features and additional dimensions we want to use for
      [x] Run benchmark on hardware similar to versi and kusama nodes.
      [ ] Add benchmark to be run in CI for catching regression in
      [ ] Rebase on latest changes for network emulation.
      Signed-off-by: Andrei Sandu <[email protected]>
      Signed-off-by: Alexandru Gheorghe <[email protected]>
      Co-authored-by: Andrei Sandu <[email protected]>
      Co-authored-by: Andrei Sandu <[email protected]>
  2. Feb 02, 2024
  3. Jan 31, 2024
  4. Jan 30, 2024
  5. Jan 29, 2024
  6. Jan 26, 2024
  7. Jan 25, 2024
  8. Jan 24, 2024
  9. Jan 23, 2024
    • approval-distribution: aggresion must target unfinalized chain rather than unapproved chain (#2988) · b4dfad83
      Andrei Sandu authored
      Found the issue while investigating the recent finality stall on Westend
      after upgrading to 1.6.0. Approval distribution aggression is supposed
      to trade off bandwidth and re-send assignments/approvals until enough
      approvals are received by at least 2/3 of validators. This is supposed
      to be a catch-all mechanism when network connectivity goes south or many
      validators reboot at the same time.
      This fix ensures that we always resend approvals starting with the first
      unfinalized block even in the case when it appears approved from the
      node's perspective.
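The difference between the two starting points can be illustrated with a toy model. This is not the approval-distribution code: the `Block` type and both function names are hypothetical, and the chain is reduced to a list of blocks above the last finalized one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    number: int
    approved: bool  # approved from this node's local perspective


def first_unapproved(unfinalized: List[Block]) -> Optional[int]:
    """Old starting point: the oldest block this node has not seen approved."""
    for block in unfinalized:
        if not block.approved:
            return block.number
    return None


def first_unfinalized(unfinalized: List[Block]) -> Optional[int]:
    """Fixed starting point: the oldest block that is not finalized yet."""
    return unfinalized[0].number if unfinalized else None


# Blocks above the last finalized block, oldest first.
chain = [Block(101, True), Block(102, True), Block(103, False)]
```

In this example `first_unapproved(chain)` skips blocks 101 and 102 because the local node already considers them approved, while `first_unfinalized(chain)` still starts at 101, so peers that are missing those approvals get them re-sent.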
      - [x] Versi test
      Signed-off-by: Andrei Sandu <[email protected]>
    • rpc: backpressured RPC server (bump jsonrpsee 0.20) (#1313) · e16ef086
      Niklas Adolfsson authored
      This is a rather big change in jsonrpsee, the major things in this bump
      - Server backpressure (the subscription impls are modified to deal with
      - Allow custom error types / return types (remove jsonrpsee::core::Error
      and jsonrpsee::core::CallError)
      - Bug fixes (graceful shutdown in particular not used by substrate
         - Less dependencies for the clients in particular
         - Return type requires Clone in method call responses
         - Moved to tokio channels
         - Async subscription API (not used in this PR)
      Major changes in this PR:
      - The subscriptions are now bounded, and if a subscription can't keep up
      with the server it is dropped
      - CLI: add parameter to configure the jsonrpc server bounded message
      buffer (default is 64)
      - Add our own subscription helper to deal with the unbounded streams in
      The most important thing to review in this PR is the added helper
      functions in `substrate/client/rpc/src/utils.rs`; the rest is pretty
      much a chore.
      Regarding the "bounded buffer limit", it may cause the server to handle
      the JSON-RPC calls slower than before.
      The message size limit is bounded by "--rpc-response-size", thus by
      default 10MB * 64 = 640MB, but the subscription message size is not
      covered by this limit and could be capped as well.
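The "bounded and dropped when it can't keep up" behaviour can be pictured with a small model. This is a Python illustration only and does not reflect the jsonrpsee API (which, per the notes above, uses tokio channels); the class `BoundedSubscription` and the function `worst_case_buffer_bytes` are hypothetical.

```python
import queue


class BoundedSubscription:
    """Toy subscription with a bounded buffer; closed once the buffer overflows."""

    def __init__(self, capacity: int = 64):  # 64 is the default quoted above
        self._buffer = queue.Queue(maxsize=capacity)
        self.closed = False

    def try_send(self, message) -> bool:
        """Queue a message for the client, or drop the subscription if full."""
        if self.closed:
            return False
        try:
            self._buffer.put_nowait(message)
            return True
        except queue.Full:
            # The consumer could not keep up with the producer.
            self.closed = True
            return False

    def recv(self):
        """Take the oldest buffered message (raises queue.Empty if none)."""
        return self._buffer.get_nowait()


def worst_case_buffer_bytes(max_response_bytes: int, capacity: int) -> int:
    """Upper bound on buffered bytes: every slot holds a maximum-size message."""
    return max_response_bytes * capacity
```

With a 10MB response limit and 64 slots, `worst_case_buffer_bytes` gives the 640MB figure mentioned above.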
      Hopefully the last release prior to 1.0, sorry in advance for a big PR
      Previous attempt: https://github.com/paritytech/substrate/pull/13992
      Resolves https://github.com/paritytech/polkadot-sdk/issues/748, resolves
  10. Jan 22, 2024
  11. Jan 21, 2024
  12. Jan 19, 2024
  13. Jan 18, 2024
  14. Jan 17, 2024
  15. Jan 16, 2024
    • subsystem-bench: cache misses profiling (#2893) · ec7bfae0
      Andrei Eres authored
      ## Why we need it
      To provide another level of understanding of why polkadot's subsystems
      may perform slower than expected. Cache misses occur when processing
      large amounts of data, such as during availability recovery.
      ## Why Cachegrind
      Cachegrind has many drawbacks: it is slow and it uses its own cache
      simulation, which is very basic. But unlike `perf`, which is a great
      tool, Cachegrind can run in a virtual machine. This means we can easily
      run it in remote installations and even use it in CI/CD to catch
      possible regressions.
      Why Cachegrind and not Callgrind, another part of Valgrind? It is simply
      empirically proven that profiling runs faster with Cachegrind.
      ## First results
      First results have been obtained while testing the approach. Here is
      an example.
      $ target/testnet/subsystem-bench --n-cores 10 --cache-misses data-availability-read
      $ cat cachegrind_report.txt
      I refs:        64,622,081,485
      I1  misses:         3,018,168
      LLi misses:           437,654
      I1  miss rate:           0.00%
      LLi miss rate:           0.00%
      D refs:        12,161,833,115  (9,868,356,364 rd   + 2,293,476,751 wr)
      D1  misses:       167,940,701  (   71,060,073 rd   +    96,880,628 wr)
      LLd misses:        33,550,018  (   16,685,853 rd   +    16,864,165 wr)
      D1  miss rate:            1.4% (          0.7%     +           4.2%  )
      LLd miss rate:            0.3% (          0.2%     +           0.7%  )
      LL refs:          170,958,869  (   74,078,241 rd   +    96,880,628 wr)
      LL misses:         33,987,672  (   17,123,507 rd   +    16,864,165 wr)
      LL miss rate:             0.0% (          0.0%     +           0.7%  )
      The CLI output shows a 1.4% miss rate for the L1 data cache, which is
      not so bad, given that the last-level cache had that data most of the
      time, missing only 0.3%. The L1 instruction cache has a 0.00% miss
      rate. Looking at an output file with `cg_annotate` shows that most of
      the misses occur during reed-solomon, which is expected.
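The rates quoted above follow from the raw counters in the report as misses divided by references. A minimal sketch of that arithmetic, using the numbers from the report (the helper names are made up for this example):

```python
def parse_count(text: str) -> int:
    """Convert a Cachegrind counter such as '12,161,833,115' to an int."""
    return int(text.replace(",", ""))


def miss_rate_pct(misses: int, refs: int) -> float:
    """Miss rate in percent: misses divided by total references."""
    return misses / refs * 100.0


# Counters copied from the report above.
I_REFS = parse_count("64,622,081,485")
I1_MISSES = parse_count("3,018,168")
D_REFS = parse_count("12,161,833,115")
D1_MISSES = parse_count("167,940,701")
LLD_MISSES = parse_count("33,550,018")

if __name__ == "__main__":
    print(f"I1  miss rate: {miss_rate_pct(I1_MISSES, I_REFS):.2f}%")
    print(f"D1  miss rate: {miss_rate_pct(D1_MISSES, D_REFS):.1f}%")
    print(f"LLd miss rate: {miss_rate_pct(LLD_MISSES, D_REFS):.1f}%")
```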
  16. Jan 15, 2024
  17. Jan 12, 2024
  18. Jan 10, 2024
  19. Jan 09, 2024