  1. Dec 11, 2024
      Make approval-distribution aggression a bit more robust and less spammy (#6696) · 79375e7c
      Alexandru Gheorghe authored
      
      After finality started lagging on Kusama around `2024-11-25 15:55:40`,
      nodes started being overloaded with messages and some restarted with
      ```
      Subsystem approval-distribution-subsystem appears unresponsive when sending a message of type polkadot_node_subsystem_types::messages::ApprovalDistributionMessage. origin=polkadot_service::relay_chain_selection::SelectRelayChainInner<sc_client_db::Backend<sp_runtime::generic::block::Block<sp_runtime::generic::header::Header<u32, sp_runtime::traits::BlakeTwo256>, sp_runtime::OpaqueExtrinsic>>, polkadot_overseer::Handle>
      ```
      
      I think this happened because our aggression, in its current form, is way
      too spammy and creates problems in situations where we have already
      constructed blocks with a load of candidates to check, which is what
      happened around `#25933682`, before and after. However, aggression does
      help in the nightmare scenario where the network is segmented and sparsely
      connected, so I tend to think we shouldn't completely remove it.
      
      The current configuration is:
      ```
      l1_threshold: Some(16),
      l2_threshold: Some(28),
      resend_unfinalized_period: Some(8),
      ```
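
As a rough illustration of how these thresholds could map the age of the oldest unfinalized block to an aggression level, here is a minimal sketch. The names `AggressionLevel`, `current` and `level_for`, and the use of `>=` for the comparison, are assumptions made for this example and are not taken from the subsystem's code.

```rust
/// Illustrative mirror of the three settings quoted above.
#[derive(Debug, Clone, Copy)]
struct AggressionConfig {
    l1_threshold: Option<u32>,
    l2_threshold: Option<u32>,
    resend_unfinalized_period: Option<u32>,
}

/// Hypothetical enum naming the levels discussed in this description.
#[derive(Debug, PartialEq, Eq)]
enum AggressionLevel {
    None,
    L1,
    L2,
}

impl AggressionConfig {
    /// The configuration quoted above (before this change).
    fn current() -> Self {
        Self {
            l1_threshold: Some(16),
            l2_threshold: Some(28),
            resend_unfinalized_period: Some(8),
        }
    }

    /// Map the age (in blocks) of the oldest unfinalized block to a level.
    /// A threshold of `None` disables that level. The `>=` is an assumption.
    fn level_for(&self, age: u32) -> AggressionLevel {
        if self.l2_threshold.map_or(false, |t| age >= t) {
            AggressionLevel::L2
        } else if self.l1_threshold.map_or(false, |t| age >= t) {
            AggressionLevel::L1
        } else {
            AggressionLevel::None
        }
    }
}
```

Under these assumptions, an age of 16 to 27 blocks yields L1 and anything from 28 blocks onwards yields L2.
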
      The way aggression works right now:
      1. After L1 is triggered, all nodes send all messages they created to all
      the other nodes, and also resend all messages they would have already
      sent according to the topology.
      2. Because of `resend_unfinalized_period`, for each block all messages
      from step 1 are resent every 8 blocks. For example, say we have blocks
      1 to 24 unfinalized: at block 25, all messages for blocks 1 and 9 will
      be resent, and then at block 26, all messages for blocks 2 and 10
      will be resent. This becomes worse as more blocks are created if backing
      backpressure has not kicked in yet. In total, this logic makes each
      node receive 3 * total_number_of_messages_per_block.
      3. L2 aggression is way too spammy. When L2 aggression is enabled, all
      nodes send all messages of a block on GridXY, which means that every
      message is sent and received by a node at least 2*sqrt(num_validators)
      times. On Kusama that would be 66 * NUM_MESSAGES_AT_FIRST_UNFINALIZED_BLOCK,
      so even with a reasonable number of messages like 10K, which you can have
      if you escalated because of no-shows, you end up sending and receiving
      ~660k messages at once. I think that's what makes
      approval-distribution appear unresponsive on some nodes.
      4. Duplicate messages are received by the nodes, which in turn mark the
      sending node as banned, which may create more no-shows.
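
The arithmetic in steps 2 and 3 can be checked with a small sketch. The resend rule below (a block is resent when its age has reached `l1_threshold` and is a multiple of the period) is one reading that reproduces the example in step 2. It is an assumption, not the subsystem's actual code, and the function names are hypothetical.

```rust
use std::ops::RangeInclusive;

/// Blocks whose messages would be resent at height `current`, assuming a block
/// qualifies once its age is at least `l1_threshold` and a multiple of `period`.
fn blocks_to_resend(
    unfinalized: RangeInclusive<u32>,
    current: u32,
    l1_threshold: u32,
    period: u32,
) -> Vec<u32> {
    unfinalized
        .filter(|b| {
            let age = current.saturating_sub(*b);
            // Guard against a zero period to avoid a modulo-by-zero panic.
            period > 0 && age >= l1_threshold && age % period == 0
        })
        .collect()
}

/// Size of the L2 burst: every message of the first unfinalized block is
/// multiplied by the grid fan-out (66 on Kusama, per the description above).
fn l2_burst(fanout: u64, messages_at_first_unfinalized: u64) -> u64 {
    fanout * messages_at_first_unfinalized
}
```

With blocks 1 to 24 unfinalized, `blocks_to_resend(1..=24, 25, 16, 8)` selects blocks 1 and 9, and the same call at height 26 selects blocks 2 and 10. `l2_burst(66, 10_000)` gives the 660k figure from step 3.
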
      
      ## Proposed improvements:
      1. Make L2 trigger way later, at 64 blocks instead of 28. This should be
      literally the last resort; until then we should try to let the
      approval-voting escalation mechanism do its thing and cover the
      no-shows.
      2. On L1 aggression, don't send messages for blocks too far from
      first_unfinalized; there is no point in sending the messages for block
      20 if block 1 is still unfinalized.
      3. On L1 aggression, send messages, then back off for 3 *
      resend_unfinalized_period to give time for everyone to clear up their
      queues.
      4. If aggression is enabled, accept duplicate messages from validators
      and don't punish them by reducing their reputation, which may
      create no-shows.
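
The proposals above could be sketched roughly as follows. This is an illustration under stated assumptions and not the code in this change: `L1Resender`, `max_distance`, `blocks_to_send` and `should_penalize_duplicate` are hypothetical names, and the description does not say how far is "too far", so `max_distance` is left as a parameter.

```rust
/// Proposal 1: L2 only triggers after 64 unfinalized blocks.
const PROPOSED_L2_THRESHOLD: u32 = 64;

/// Hypothetical helper tracking L1 resends (proposals 2 and 3).
struct L1Resender {
    resend_unfinalized_period: u32,
    /// Hypothetical bound on how far past `first_unfinalized` we still resend.
    max_distance: u32,
    /// Height at which we last resent, if any.
    last_resend_at: Option<u32>,
}

impl L1Resender {
    /// Blocks to resend at height `current`, or an empty list while backing off.
    fn blocks_to_send(
        &mut self,
        current: u32,
        first_unfinalized: u32,
        last_unfinalized: u32,
    ) -> Vec<u32> {
        // Proposal 3: after a resend, stay quiet for 3 * resend_unfinalized_period.
        let backoff = 3 * self.resend_unfinalized_period;
        if let Some(last) = self.last_resend_at {
            if current.saturating_sub(last) < backoff {
                return Vec::new();
            }
        }
        self.last_resend_at = Some(current);

        // Proposal 2: skip blocks too far from the first unfinalized one.
        let upper = last_unfinalized.min(first_unfinalized.saturating_add(self.max_distance));
        (first_unfinalized..=upper).collect()
    }
}

/// Proposal 4: duplicates are expected while aggression is on, so they are
/// accepted without a reputation penalty.
fn should_penalize_duplicate(aggression_enabled: bool) -> bool {
    !aggression_enabled
}
```

With a period of 8 and a `max_distance` of 4, a resend at block 25 covers blocks 1 to 5 only, and the next resend is not allowed before block 49.
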
      
      ---------
      
      Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
      Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
      (cherry picked from commit 85dd228d)
      [stable2409] Backport #6729 (#6829) · a034a702
      paritytech-cmd-bot-polkadot-sdk[bot] authored
      
      Backport #6729 into `stable2409` from alexggh.
      
      See the
      [documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md)
      on how to use this bot.
      
      <!--
        # To be used by other automation, do not modify:
        original-pr-number: #${pull_number}
      -->
      
      ---------
      
      Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
      Co-authored-by: Alexandru Gheorghe <49718502+alexggh@users.noreply.github.com>
      Co-authored-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
      [stable2409] Backport #6781 (#6823) · b4bcdf2c
      paritytech-cmd-bot-polkadot-sdk[bot] authored
      
      Backport #6781 into `stable2409` from bkontur.
      
      See the
      [documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md)
      on how to use this bot.
      
      <!--
        # To be used by other automation, do not modify:
        original-pr-number: #${pull_number}
      -->
      
      ---------
      
      Co-authored-by: Branislav Kontur <bkontur@gmail.com>