• Liam Aharon's avatar
    remote-ext: fix state download stall on slow connections and reduce memory usage (#1295) · 55f35442
    Liam Aharon authored
    Original PR https://github.com/paritytech/substrate/pull/14746
    
    ---
    
    ## Fixing stall
    
    ### Introduction
    I experienced an apparent stall downloading state from
    `https://rococo-try-runtime-node.parity-chains.parity.io:443` which was
    having networking difficulties only responding to my JSONRPC requests
    with 50-200KB/s of bandwidth.
    
    This PR fixes the issue causing the stall, and generally improves
    performance remote-ext when it downloads state by greatly reducing the
    chances of a timeout occuring.
    
    ### Description
    Introduces a new `REQUEST_DURATION_TARGET` constant and modifies
    `get_storage_data_dynamic_batch_size` to
    
    - Increase or decrease the batch size of the next request depending on
    whether the elapsed time of the last request was gt or lt the target
    - Reset the batch size to 1 if the request times out
    
    This fixes an issue on slow connections that can otherwise cause
    multiple timeouts and a stalled download when:
    
    1. The batch size increases rapidly as remote-ext downloads keys with
    small associated storage values
    2. remote-ext tries to process a large series of subsequent keys all
    with extremely large associated storage values (Rococo has a series of
    keys 1-5MB large)
    3. The huge storage values download for 5 minutes until the request
    times out
    4. The partially downloaded keys are thrown out and remote-ext tries
    again with a smaller batch size, but the batch size is still far too
    large and takes 5 minutes to be reduced again
    5. The download will be essentially stalled for many hours while the
    above step cycles
    
    
    After this PR, the request size will
    
    - Not grow as large to begin with, as it is regulated downwards as the
    request duration exceeds the target
    - Drop immediately to 1 if the request times out. A timeout indicates
    the keys next in line to download have extremely large storage values
    compared to previously downloaded keys, and we need to reset the batch
    size to figure out what our new ideal batch size is. By not resetting
    down to 1, we risk the next request timing out again.
    
    ## Reducing memory
    
    As suggested by @bkchr, I adjusted `get_storage_data_dynamic_batch_size`
    from being recursive to a loop which allows removing a bunch of clones
    that were chewing through a lot of memory. I noticed actually it was
    using up to 50GB swap previously when downloading Polkadot keys on a
    slow connection, because it needed to recurse and clone a lot.
    
    After this change it uses only ~1.5GB memory.
    55f35442