network/strategy: Backoff and ban overloaded peers to avoid submitting the same request multiple times (#5029)
This PR avoids submitting the same block or state request multiple times
to the same slow peer.
Previously, we kept resubmitting the same request to the same slow peer,
which resulted in reputation bans on the slow peer's side.
Furthermore, the strategy repeatedly selected the same slow peer to
submit queries to, even when a better candidate may have existed.
Instead, in this PR we:
- introduce `DisconnectedPeers`, an LRU with a capacity of 512 peers that
only tracks the state of peers that disconnected with a request in flight
- when `DisconnectedPeers` detects that a peer disconnected with a
request in flight, the peer is backed off
  - on the first disconnection: 60 seconds
  - on the second disconnection: 120 seconds
- on the third disconnection the peer is banned, and it remains
banned until the peerstore decays its reputation
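
The backoff and ban rules above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the `DisconnectedPeerState` field names are taken from the debug log further down, while the method names, the constants, the `String` peer id, and the plain `HashMap` with oldest-entry eviction (standing in for the LRU) are assumptions made for the example. It also assumes the backoff grows linearly, which matches the 60 and 120 second steps listed above.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Maximum number of tracked peers (the 512 capacity mentioned above).
const MAX_TRACKED_PEERS: usize = 512;
/// Assumed backoff step: 60 s after the first disconnect, 120 s after the second.
const BACKOFF_STEP: Duration = Duration::from_secs(60);
/// The third disconnect with a request in flight leads to a ban.
const BAN_THRESHOLD: u32 = 3;

/// Field names follow the debug log in this description.
#[derive(Debug, Clone, Copy)]
struct DisconnectedPeerState {
    num_disconnects: u32,
    last_disconnect: Instant,
}

impl DisconnectedPeerState {
    /// Backoff duration for the current number of disconnects.
    fn backoff(&self) -> Duration {
        BACKOFF_STEP * self.num_disconnects
    }

    /// True while the peer should not be picked for new requests.
    fn is_backed_off(&self, now: Instant) -> bool {
        now.saturating_duration_since(self.last_disconnect) < self.backoff()
    }
}

/// Hypothetical stand-in for the tracker: a `HashMap` keyed by a `String`
/// peer id instead of an LRU keyed by a real `PeerId`.
#[derive(Default)]
struct DisconnectedPeers {
    peers: HashMap<String, DisconnectedPeerState>,
}

impl DisconnectedPeers {
    /// Records a disconnect that happened with a request in flight.
    /// Returns `true` when the peer should be banned.
    fn on_disconnect(&mut self, peer: &str, now: Instant) -> bool {
        let num_disconnects = self.peers.get(peer).map_or(0, |s| s.num_disconnects) + 1;
        if num_disconnects >= BAN_THRESHOLD {
            // The ban is handled via reputation, so the entry is no longer needed.
            self.peers.remove(peer);
            return true;
        }
        if !self.peers.contains_key(peer) && self.peers.len() >= MAX_TRACKED_PEERS {
            // Crude approximation of LRU eviction: drop the oldest entry.
            let oldest = self
                .peers
                .iter()
                .min_by_key(|(_, state)| state.last_disconnect)
                .map(|(id, _)| id.clone());
            if let Some(id) = oldest {
                self.peers.remove(&id);
            }
        }
        let state = DisconnectedPeerState { num_disconnects, last_disconnect: now };
        self.peers.insert(peer.to_owned(), state);
        false
    }

    /// True when the peer is unknown or its backoff has elapsed.
    fn is_available(&self, peer: &str, now: Instant) -> bool {
        self.peers.get(peer).map_or(true, |s| !s.is_backed_off(now))
    }
}
```

In this sketch a caller would invoke `on_disconnect` only for peers that had a request in flight, report the peer to the peerstore when it returns `true`, and consult `is_available` before selecting a peer for a new block or state request.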
This PR relieves the pressure on overloaded nodes that cannot process
requests in due time.
If a peer is still slow after the backoffs, it is banned.
Theoretically, submitting the same request multiple times can still
happen when:
- (a) we back off and ban the peer
- (b) the network does not discover other peers (this may also be the
case on a test net)
- (c) the peer gets reconnected after the reputation decay and is still
slow to respond
Aims to improve:
- https://github.com/paritytech/polkadot-sdk/issues/4924
- https://github.com/paritytech/polkadot-sdk/issues/531
Next Steps:
- Investigate the network after this is deployed, possibly bumping the
keep-alive timeout or seeing if there's something else misbehaving
This PR builds on top of:
- https://github.com/paritytech/polkadot-sdk/pull/4987
### Testing Done
- Added a couple of unit tests where a test harness was already in place
- Local testnet
```text
13:13:25.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Added first time peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD
13:14:39.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 2, last_disconnect: Instant { tv_sec: 93355, tv_nsec: 942016062 } }, should ban: false
13:16:49.107 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 3, last_disconnect: Instant { tv_sec: 93485, tv_nsec: 947551051 } }, should ban: true
13:16:49.108 WARN tokio-runtime-worker peerset: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648 to -2147483648. Reason: Slow peer after backoffs. Banned, disconnecting.
```
cc @paritytech/networking
---------
Signed-off-by: Alexandru Vasile <[email protected]>