1. Apr 03, 2024
  2. Apr 02, 2024
  3. Apr 01, 2024
  4. Mar 31, 2024
  5. Mar 27, 2024
  6. Mar 26, 2024
    • Andrei Eres's avatar
      [subsystem-benchmarks] Save results to json (#3829) · fd79b3b0
      Andrei Eres authored
      Here we add the ability to save subsystem benchmark results in JSON
      format to display them as graphs
      
      To draw graphs, CI team will use
      [github-action-benchmark](https://github.com/benchmark-action/github-action-benchmark).
      Since we are using custom benchmarks, we need to prepare [a specific
      data
      type](https://github.com/benchmark-action/github-action-benchmark?tab=readme-ov-file#examples):
      ```
      [
          {
              "name": "CPU Load",
              "unit": "Percent",
              "value": 50
          }
      ]
      ```
      
      Then we'll get graphs like this: 
      
      ![example](https://raw.githubusercontent.com/rhysd/ss/master/github-action-benchmark/main.png)
      
      [A live page with
      graphs](https://benchmark-action.github.io/github-action-benchmark/dev/bench/
      
      )
      
      ---------
      
      Co-authored-by: default avatarordian <[email protected]>
      fd79b3b0
    • Dcompoze's avatar
      Fix spelling mistakes across the whole repository (#3808) · 002d9260
      Dcompoze authored
      **Update:** Pushed additional changes based on the review comments.
      
      **This pull request fixes various spelling mistakes in this
      repository.**
      
      Most of the changes are contained in the first **3** commits:
      
      - `Fix spelling mistakes in comments and docs`
      
      - `Fix spelling mistakes in test names`
      
      - `Fix spelling mistakes in error messages, panic messages, logs and
      tracing`
      
      Other source code spelling mistakes are separated into individual
      commits for easier reviewing:
      
      - `Fix the spelling of 'authority'`
      
      - `Fix the spelling of 'REASONABLE_HEADERS_IN_JUSTIFICATION_ANCESTRY'`
      
      - `Fix the spelling of 'prev_enqueud_messages'`
      
      - `Fix the spelling of 'endpoint'`
      
      - `Fix the spelling of 'children'`
      
      - `Fix the spelling of 'PenpalSiblingSovereignAccount'`
      
      - `Fix the spelling of 'PenpalSudoAccount'`
      
      - `Fix the spelling of 'insufficient'`
      
      - `Fix the spelling of 'PalletXcmExtrinsicsBenchmark'`
      
      - `Fix the spelling of 'subtracted'`
      
      - `Fix the spelling of 'CandidatePendingAvailability'`
      
      - `Fix the spelling of 'exclusive'`
      
      - `Fix the spelling of 'until'`
      
      - `Fix the spelling of 'discriminator'`
      
      - `Fix the spelling of 'nonexistent'`
      
      - `Fix the spelling of 'subsystem'`
      
      - `Fix the spelling of 'indices'`
      
      - `Fix the spelling of 'committed'`
      
      - `Fix the spelling of 'topology'`
      
      - `Fix the spelling of 'response'`
      
      - `Fix the spelling of 'beneficiary'`
      
      - `Fix the spelling of 'formatted'`
      
      - `Fix the spelling of 'UNKNOWN_PROOF_REQUEST'`
      
      - `Fix the spelling of 'succeeded'`
      
      - `Fix the spelling of 'reopened'`
      
      - `Fix the spelling of 'proposer'`
      
      - `Fix the spelling of 'InstantiationNonce'`
      
      - `Fix the spelling of 'depositor'`
      
      - `Fix the spelling of 'expiration'`
      
      - `Fix the spelling of 'phantom'`
      
      - `Fix the spelling of 'AggregatedKeyValue'`
      
      - `Fix the spelling of 'randomness'`
      
      - `Fix the spelling of 'defendant'`
      
      - `Fix the spelling of 'AquaticMammal'`
      
      - `Fix the spelling of 'transactions'`
      
      - `Fix the spelling of 'PassingTracingSubscriber'`
      
      - `Fix the spelling of 'TxSignaturePayload'`
      
      - `Fix the spelling of 'versioning'`
      
      - `Fix the spelling of 'descendant'`
      
      - `Fix the spelling of 'overridden'`
      
      - `Fix the spelling of 'network'`
      
      Let me know if this structure is adequate.
      
      **Note:** The usage of the words `Merkle`, `Merkelize`, `Merklization`,
      `Merkelization`, `Merkleization`, is somewhat inconsistent but I left it
      as it is.
      
      ~~**Note:** In some places the term `Receival` is used to refer to
      message reception, IMO `Reception` is the correct word here, but I left
      it as it is.~~
      
      ~~**Note:** In some places the term `Overlayed` is used instead of the
      more acceptable version `Overlaid` but I also left it as it is.~~
      
      ~~**Note:** In some places the term `Applyable` is used instead of the
      correct version `Applicable` but I also left it as it is.~~
      
      **Note:** Some usage of British vs American english e.g. `judgement` vs
      `judgment`, `initialise` vs `initialize`, `optimise` vs `optimize` etc.
      are both present in different places, but I suppose that's
      understandable given the number of contributors.
      
      ~~**Note:** There is a spelling mistake in `.github/CODEOWNERS` but it
      triggers errors in CI when I make changes to it, so I left it as it
      is.~~
      002d9260
  7. Mar 25, 2024
  8. Mar 19, 2024
  9. Mar 15, 2024
    • ordian's avatar
      collator protocol changes for elastic scaling (validator side) (#3302) · 02e1a7f4
      ordian authored
      Fixes #3128.
      
      This introduces a new variant for the collation response from the
      collator that includes the parent head data. For now, collators won't
      send this new variant. We'll need to change the collator side of the
      collator protocol to detect all the cores assigned to a para and send
      the parent head data in the case when it's more than 1 core.
      
      - [x] validate approach
      - [x] check head data hash
      02e1a7f4
  10. Mar 13, 2024
  11. Mar 11, 2024
    • Andrei Eres's avatar
      subsystem-bench: adjust test config to Kusama (#3583) · 05381afc
      Andrei Eres authored
      Fixes https://github.com/paritytech/polkadot-sdk/issues/3528
      
      ```rust
      latency:
          mean_latency_ms = 30 // common sense
          std_dev = 2.0 // common sense
      n_validators = 300 // max number of validators, from chain config
      n_cores = 60 // 300/5
      max_validators_per_core = 5 // default
      min_pov_size = 5120 // max
      max_pov_size = 5120 // max
      peer_bandwidth = 44040192 // from the Parity's kusama validators
      bandwidth = 44040192 // from the Parity's kusama validators
      connectivity = 90 // we need to be connected to 90-95% of peers
      ```
      05381afc
  12. Mar 08, 2024
    • Alexandru Gheorghe's avatar
      collator-protocol: Always stay connected to validators in backing group (#3544) · 6f3caac0
      Alexandru Gheorghe authored
      Looking at rococo-asset-hub
      https://github.com/paritytech/polkadot-sdk/issues/3519 there seems to be
      a lot of instances where collator did not advertise their collations,
      while there are multiple problems there, one of it is that we are
      connecting and disconnecting to our assigned validators every block,
      because on reconnect_timeout every 4s we call connect_to_validators and
      that will produce 0 validators when all went well, so set_reseverd_peers
      called from validator discovery will disconnect all our peers.
      More details here:
      https://github.com/paritytech/polkadot-sdk/issues/3519#issuecomment-1972667343
      
      Now, this shouldn't be a problem, but it stacks with an existing bug in
      our network stack where if disconnect from a peer the peer might not
      notice it, so it won't detect the reconnect either and it won't send us
      the necessary view updates, so we won't advertise the collation to it
      more details here:
      
      https://github.com/paritytech/polkadot-sdk/issues/3519#issuecomment-1972958276
      
      
      
      To avoid hitting this condition that often, let's keep the peers in the
      reserved set for the entire duration we are allocated to a backing
      group. Backing group sizes(1 rococo, 3 kusama, 5 polkadot) are really
      small, so this shouldn't lead to that many connections. Additionally,
      the validators would disconnect us any way if we don't advertise
      anything for 4 blocks.
      
      ## TODO
      - [x] More testing.
      - [x] Confirm on rococo that this is improving the situation. (It
      doesn't but just because other things are going wrong there).
      
      ---------
      
      Signed-off-by: default avatarAlexandru Gheorghe <[email protected]>
      6f3caac0
  13. Mar 01, 2024
    • Andrei Eres's avatar
      subsystem-bench: add regression tests for availability read and write (#3311) · f0e589d7
      Andrei Eres authored
      ### What's been done
      - `subsystem-bench` has been split into two parts: a cli benchmark
      runner and a library.
      - The cli runner is quite simple. It just allows us to run `.yaml` based
      test sequences. Now it should only be used to run benchmarks during
      development.
      - The library is used in the cli runner and in regression tests. Some
      code is changed to make the library independent of the runner.
      - Added first regression tests for availability read and write that
      replicate existing test sequences.
      
      ### How we run regression tests
      - Regression tests are simply rust integration tests without the
      harnesses.
      - They should only be compiled under the `subsystem-benchmarks` feature
      to prevent them from running with other tests.
      - This doesn't work when running tests with `nextest` in CI, so
      additional filters have been added to the `nextest` runs.
      - Each benchmark run takes a different time in the beginning, so we
      "warm up" the tests until their CPU usage differs by only 1%.
      - After the warm-up, we run the benchmarks a few more times and compare
      the average with the exception using a precision.
      
      ### What is still wrong?
      - I haven't managed to set up approval voting tests. The spread of their
      results is too large and can't be narrowed down in a reasonable amount
      of time in the warm-up phase.
      - The tests start an unconfigurable prometheus endpoint inside, which
      causes errors because they use the same 9999 port. I disable it with a
      flag, but I think it's better to extract the endpoint launching outside
      the test, as we already do with `valgrind` and `pyroscope`. But we still
      use `prometheus` inside the tests.
      
      ### Future work
      * https://github.com/paritytech/polkadot-sdk/issues/3528
      * https://github.com/paritytech/polkadot-sdk/issues/3529
      * https://github.com/paritytech/polkadot-sdk/issues/3530
      * https://github.com/paritytech/polkadot-sdk/issues/3531
      
      
      
      ---------
      
      Co-authored-by: default avatarAlexander Samusev <[email protected]>
      f0e589d7
  14. Feb 26, 2024
  15. Feb 20, 2024
    • Oliver Tale-Yazdi's avatar
      Lift dependencies to the workspace (Part 2/x) (#3366) · e89d0fca
      Oliver Tale-Yazdi authored
      
      
      Lifting some more dependencies to the workspace. Just using the
      most-often updated ones for now.
      It can be reproduced locally.
      
      ```sh
      # First you can check if there would be semver incompatible bumps (looks good in this case):
      $ zepter transpose dependency lift-to-workspace --ignore-errors syn quote thiserror "regex:^serde.*"
      
      # Then apply the changes:
      $ zepter transpose dependency lift-to-workspace --version-resolver=highest syn quote thiserror "regex:^serde.*" --fix
      
      # And format the changes:
      $ taplo format --config .config/taplo.toml
      ```
      
      ---------
      
      Signed-off-by: default avatarOliver Tale-Yazdi <[email protected]>
      e89d0fca
  16. Feb 19, 2024
    • Alexandru Gheorghe's avatar
      gossip-support: add unittests for update authorities (#3258) · 197c6cf9
      Alexandru Gheorghe authored
      
      
      ~The previous fix was actually incomplete because we update the
      authorties only on the situation where we decided to reconnect because
      we had a low connectivity issue. Now the problem is that
      update_authority_ids use the list of connected peers, so on restart that
      does contain anything, so calling immediately after
      issue_connection_request won't detect all authorities, so we need to
      also check every block as the comment said, but that did not match the
      code.~
      
      Actually the fix was correct the flow is follow if more than 1/3 of the
      authorities can not be resolved we set last_failure and call
      `ConnectToResolvedValidators`.
      
      We will call UpdateAuthorities for all the authorities already connected
      and for which we already know the address and for the ones that will
      connect later on `PeerConnected` will have the AuthorityId field set,
      because it is already known, so approval-distribution will update its
      cache topology.
      
      ---------
      
      Signed-off-by: default avatarAlexandru Gheorghe <[email protected]>
      Co-authored-by: default avatarAlexander Samusev <[email protected]>
      197c6cf9
  17. Feb 12, 2024
  18. Jan 31, 2024
  19. Jan 30, 2024
  20. Jan 29, 2024
  21. Jan 26, 2024
  22. Jan 25, 2024
  23. Jan 24, 2024
  24. Jan 23, 2024
    • Andrei Sandu's avatar
      approval-distribution: aggresion must target unfinalized chain rather than unapproved chain (#2988) · b4dfad83
      Andrei Sandu authored
      
      
      Found the issue while investigating the recent finality stall on Westend
      after upgrading to 1.6.0. Approval distribution aggression is supposed
      to trade off bandwidth and re-send assignemnts/approvals until enough
      approvals are be received by at least 2/3 validators. This is supposed
      to be a catch all mechanism when network connectivity goes south or many
      validators reboot at the same time.
      
      This fix ensures that we always resend approvals starting with the first
      unfinalized block even in the case when it appears approved from the
      node's perspective.
      
      TODO:
      - [x] Versi test
      
      ---------
      
      Signed-off-by: default avatarAndrei Sandu <[email protected]>
      b4dfad83
  25. Jan 22, 2024
  26. Jan 18, 2024
  27. Jan 12, 2024
  28. Jan 10, 2024
  29. Dec 19, 2023
  30. Dec 18, 2023
  31. Dec 14, 2023
    • Andrei Sandu's avatar
      Introduce subsystem benchmarking tool (#2528) · 8a6e9ef1
      Andrei Sandu authored
      This tool makes it easy to run parachain consensus stress/performance
      testing on your development machine or in CI.
      
      ## Motivation
      The parachain consensus node implementation spans across many modules
      which we call subsystems. Each subsystem is responsible for a small part
      of logic of the parachain consensus pipeline, but in general the most
      load and performance issues are localized in just a few core subsystems
      like `availability-recovery`, `approval-voting` or
      `dispute-coordinator`. In the absence of such a tool, we would run large
      test nets to load/stress test these parts of the system. Setting up and
      making sense of the amount of data produced by such a large test is very
      expensive, hard to orchestrate and is a huge development time sink.
      
      ## PR contents
      - CLI tool 
      - Data Availability Read test
      - reusable mockups and components needed so far
      - Documentation on how to get started
      
      ### Data Availability Read test
      
      An overseer is built with using a real `availability-recovery` susbsytem
      instance while dependent subsystems like `av-store`, `network-bridge`
      and `runtime-api` are mocked. The network bridge will emulate all the
      network peers and their answering to requests.
      
      The test is going to be run for a number of blocks. For each block it
      will generate send a “RecoverAvailableData” request for an arbitrary
      number of candidates. We wait for the subsystem to respond to all
      requests before moving to the next block.
      At the same time we collect the usual subsystem metrics and task CPU
      metrics and show some nice progress reports while running.
      
      ### Here is how the CLI looks like:
      
      ```
      [2023-11-28T13:06:27Z INFO  subsystem_bench::core::display] n_validators = 1000, n_cores = 20, pov_size = 5120 - 5120, error = 3, latency = Some(PeerLatency { min_latency: 1ms, max_latency: 100ms })
      [2023-11-28T13:06:27Z INFO  subsystem-bench::availability] Generating template candidate index=0 pov_size=5242880
      [2023-11-28T13:06:27Z INFO  subsystem-bench::availability] Created test environment.
      [2023-11-28T13:06:27Z INFO  subsystem-bench::availability] Pre-generating 60 candidates.
      [2023-11-28T13:06:30Z INFO  subsystem-bench::core] Initializing network emulation for 1000 peers.
      [2023-11-28T13:06:30Z INFO  subsystem-bench::availability] Current block 1/3
      [2023-11-28T13:06:30Z INFO  substrate_prometheus_endpoint] ️ Prometheus exporter started at 127.0.0.1:9999
      [2023-11-28T13:06:30Z INFO  subsystem_bench::availability] 20 recoveries pending
      [2023-11-28T13:06:37Z INFO  subsystem_bench::availability] Block time 6262ms
      [2023-11-28T13:06:37Z INFO  subsystem-bench::availability] Sleeping till end of block (0ms)
      [2023-11-28T13:06:37Z INFO  subsystem-bench::availability] Current block 2/3
      [2023-11-28T13:06:37Z INFO  subsystem_bench::availability] 20 recoveries pending
      [2023-11-28T13:06:43Z INFO  subsystem_bench::availability] Block time 6369ms
      [2023-11-28T13:06:43Z INFO  subsystem-bench::availability] Sleeping till end of block (0ms)
      [2023-11-28T13:06:43Z INFO  subsystem-bench::availability] Current block 3/3
      [2023-11-28T13:06:43Z INFO  subsystem_bench::availability] 20 recoveries pending
      [2023-11-28T13:06:49Z INFO  subsystem_bench::availability] Block time 6194ms
      [2023-11-28T13:06:49Z INFO  subsystem-bench::availability] Sleeping till end of block (0ms)
      [2023-11-28T13:06:49Z INFO  subsystem_bench::availability] All blocks processed in 18829ms
      [2023-11-28T13:06:49Z INFO  subsystem_bench::availability] Throughput: 102400 KiB/block
      [2023-11-28T13:06:49Z INFO  subsystem_bench::availability] Block time: 6276 ms
      [2023-11-28T13:06:49Z INFO  subsystem_bench::availability] 
          
          Total received from network: 415 MiB
          Total sent to network: 724 KiB
          Total subsystem CPU usage 24.00s
          CPU usage per block 8.00s
          Total test environment CPU usage 0.15s
          CPU usage per block 0.05s
      ```
      
      ### Prometheus/Grafana stack in action
      <img width="1246" alt="Screenshot 2023-11-28 at 15 11 10"
      src="https://github.com/paritytech/polkadot-sdk/assets/54316454/eaa47422-4a5e-4a3a-aaef-14ca644c1574">
      <img width="1246" alt="Screenshot 2023-11-28 at 15 12 01"
      src="https://github.com/paritytech/polkadot-sdk/assets/54316454/237329d6-1710-4c27-8f67-5fb11d7f66ea">
      <img width="1246" alt="Screenshot 2023-11-28 at 15 12 38"
      src="https://github.com/paritytech/polkadot-sdk/assets/54316454/a07119e8-c9f1-4810-a1b3-f1b7b01cf357
      
      ">
      
      ---------
      
      Signed-off-by: default avatarAndrei Sandu <[email protected]>
      8a6e9ef1