Commit 4d2f7932, authored by Iulian Barbu, committed by GitHub

cumulus/client: added external rpc connection retry logic (#5515)



# Description

Adds retry logic that makes the RPC relay chain interface more reliable
when a collator connects to external RPC servers.

Closes #5514 
Closes #4278

The final solution is still being debated in #5514, so what this PR
addresses might change (e.g. #4278 might require a more advanced approach).

## Integration

Users who start collators should see little or no difference, since the
retry logic applies only when the collators fail to connect to the RPC
servers. In practice I assume the RPC servers are already live before the
collators start, in which case the issue is not visible.

## Review Notes

The added logic retries the connection to the RPC servers (there can be
several). It lives in the cumulus/client/relay-chain-rpc-interface module
and is most relevant to the RPC client logic (`ClientManager`). The retry
logic is not configurable: it tries to connect to the RPC servers 5 times,
with an exponential backoff between iterations that starts at a 1 second
wait and ends at 16 seconds. The same logic applies when an existing
connection to an RPC server is dropped. A `ReconnectingWebsocketWorker`
already ensures there is connectivity to at least one RPC node; the retry
logic strengthens this by going through the list of RPC servers up to 5
times.
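As a rough illustration of the behaviour described above, here is a minimal, std-only sketch. It is not the code from this PR. The names `MAX_ITERATIONS`, `INITIAL_BACKOFF`, `backoff_for` and `connect_with_retry` are hypothetical. The connection attempt and the sleep are passed in as closures so the sketch stays synchronous, whereas the real implementation is async. It also assumes a wait follows every pass, including the last one.

```rust
use std::time::Duration;

// Hypothetical constants mirroring the numbers in the description above.
const MAX_ITERATIONS: u32 = 5;
const INITIAL_BACKOFF: Duration = Duration::from_secs(1);

/// Wait applied after pass `iteration` (0-based): 1s, 2s, 4s, 8s, 16s.
fn backoff_for(iteration: u32) -> Duration {
    INITIAL_BACKOFF * 2u32.pow(iteration)
}

/// Goes through the whole URL list once per pass, for at most
/// `MAX_ITERATIONS` passes, waiting with exponential backoff between passes.
/// Returns the index of the first URL that accepted the connection together
/// with the client, or `None` if every pass failed.
fn connect_with_retry<C, E, F, S>(
    urls: &[&str],
    mut try_connect: F,
    mut sleep: S,
) -> Option<(usize, C)>
where
    F: FnMut(&str) -> Result<C, E>,
    S: FnMut(Duration),
{
    for iteration in 0..MAX_ITERATIONS {
        for (index, url) in urls.iter().enumerate() {
            if let Ok(client) = try_connect(url) {
                return Some((index, client));
            }
        }
        // Assumption: a wait follows every failed pass, including the last.
        sleep(backoff_for(iteration));
    }
    None
}
```

Under these assumptions, a caller would pass `std::thread::sleep` (or an async equivalent in real code) as the `sleep` argument. If no server ever answers, the waits add up to 1 + 2 + 4 + 8 + 16 = 31 seconds.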

## Testing

- This was tested manually by starting zombienet natively based on
[0006-rpc_collator_builds_blocks.toml](https://github.com/paritytech/polkadot-sdk/blob/master/cumulus/zombienet/tests/0006-rpc_collator_builds_blocks.toml)
and observing that collators no longer fail:

```bash
zombienet -l text --dir zbn-run -f --provider native spawn polkadot-sdk/cumulus/zombienet/tests/0006-rpc_collator_builds_blocks.toml
```

- Added a unit test that exercises the retry logic for a client
connecting to a server that comes online after 10 seconds. The retry logic
can wait for as long as 30 seconds, but that seemed too long for a unit
test. I am conscious of CI time if this test runs there, and I am happy to
hear suggestions. I am also not sure whether it runs in CI; I haven't
figured that out yet. The test could be considered an integration test,
but it exercises crate-internal implementation rather than the public API.

Example collator logs after the change:
```
2024-08-29 14:28:11.730  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=0 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:12.737  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=1 index=0 url="ws://127.0.0.1:43617/"
2024-08-29 14:28:12.739  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=1 index=1 url="ws://127.0.0.1:37965/"
2024-08-29 14:28:12.755  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=1 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:14.758  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=2 index=0 url="ws://127.0.0.1:43617/"
2024-08-29 14:28:14.759  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=2 index=1 url="ws://127.0.0.1:37965/"
2024-08-29 14:28:14.760  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=2 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:18.766  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=3 index=0 url="ws://127.0.0.1:43617/"
2024-08-29 14:28:18.768  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=3 index=1 url="ws://127.0.0.1:37965/"
2024-08-29 14:28:18.768  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=3 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:26.770  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=4 index=0 url="ws://127.0.0.1:43617/"
```

---------

Signed-off-by: Iulian Barbu <[email protected]>
Co-authored-by: Sebastian Kunert <[email protected]>
parent 325df541