    PVF: Vote invalid on panics in execution thread (after a retry) (#7155) · 82e4dbcc
    Marcin S. authored
    * PVF: Remove `rayon` and some uses of `tokio`
    
    1. We were using `rayon` to spawn a superfluous thread to do execution, so it was removed.
    
    2. We were using `rayon` to set a threadpool-specific thread stack size, and AFAIK we couldn't do that with `tokio` (it's possible [per-runtime](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.thread_stack_size) but not per-thread). Since we want to remove `tokio` from the workers [anyway](https://github.com/paritytech/polkadot/issues/7117), I changed it to spawn threads with the `std::thread` API instead of `tokio`.[^1]
    
    [^1]: NOTE: This PR does not totally remove the `tokio` dependency just yet.
    
    3. Since the `std::thread` API is not async, we could no longer `select!` on the threads as futures, so the `select!` was changed to a naive loop.
    
    4. The order of thread selection was flipped to make (3) sound (see note in code).
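
Points (2) and (3) can be illustrated with a minimal sketch. It assumes the naive loop polls `JoinHandle::is_finished`; the names `WORKER_STACK_SIZE`, `Finished` and `run_with_timeout`, the stack size, and the polling interval are hypothetical, and the order of the checks here is only illustrative, not necessarily the order chosen in the PR.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical per-thread stack size (an assumption for this sketch).
const WORKER_STACK_SIZE: usize = 2 * 1024 * 1024;

/// Which thread was observed to finish first.
#[derive(Debug, PartialEq)]
pub enum Finished {
    Job(u64),
    Timeout,
}

/// Spawns a job thread with an explicit stack size plus a timer thread,
/// then polls both in a plain loop instead of `select!`-ing on futures.
pub fn run_with_timeout(job_ms: u64, timeout_ms: u64) -> Finished {
    let job = thread::Builder::new()
        .stack_size(WORKER_STACK_SIZE) // per-thread, unlike tokio's per-runtime setting
        .spawn(move || {
            thread::sleep(Duration::from_millis(job_ms)); // stand-in for real work
            42u64
        })
        .expect("failed to spawn job thread");

    let timer = thread::Builder::new()
        .spawn(move || thread::sleep(Duration::from_millis(timeout_ms)))
        .expect("failed to spawn timer thread");

    loop {
        if job.is_finished() {
            return Finished::Job(job.join().expect("job thread panicked"));
        }
        if timer.is_finished() {
            // The job thread is left detached in this sketch.
            return Finished::Timeout;
        }
        thread::sleep(Duration::from_millis(1));
    }
}
```

Because both handles are checked on every iteration, the order of the two `if` branches decides which outcome wins when both threads finish within the same polling interval.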
    
    I left some TODOs related to panics, which I'm going to address soon as part of https://github.com/paritytech/polkadot/issues/7045.
    
    * PVF: Vote invalid on panics in execution thread (after a retry)
    
    Also make sure we kill the worker process on panic errors and internal errors to
    potentially clear any error states independent of the candidate.
    
    * Address a couple of TODOs
    
    Addresses a couple of follow-up TODOs from
    https://github.com/paritytech/polkadot/pull/7153.
    
    * Add some documentation to implementer's guide
    
    * Fix compile error
    
    * Fix compile errors
    
    * Fix compile error
    
    * Update roadmap/implementers-guide/src/node/utility/candidate-validation.md
    
    Co-authored-by: Andrei Sandu <[email protected]>
    
    * Address comments + couple other changes (see message)
    
    - Measure the CPU time in the prepare thread, so the observed time is not
      affected by any delays in joining on the thread.
    
    - Measure the full CPU time in the execute thread.
    
    * Implement proper thread synchronization
    
    Use condvars, i.e. `Arc::new((Mutex::new(true), Condvar::new()))`, as per the std
    docs.
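
For reference, the pattern from the std docs looks roughly like the sketch below. It assumes that `true` means "still pending"; the function name `wait_for_worker` and the placeholder computation are hypothetical.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Blocks on a condvar until the worker signals completion, then joins it.
pub fn wait_for_worker() -> u32 {
    // `true` is taken to mean "still pending" (an assumption of this sketch).
    let pair = Arc::new((Mutex::new(true), Condvar::new()));
    let worker_pair = Arc::clone(&pair);

    let worker = thread::spawn(move || {
        let result = 6 * 7; // stand-in for the real job
        let (lock, cvar) = &*worker_pair;
        *lock.lock().unwrap() = false; // mark as done
        cvar.notify_one();
        result
    });

    let (lock, cvar) = &*pair;
    // `wait_while` re-checks the flag, so spurious wakeups are handled.
    let guard = cvar.wait_while(lock.lock().unwrap(), |pending| *pending).unwrap();
    drop(guard);

    worker.join().unwrap()
}
```

The flag is set under the mutex before notifying, so the waiter cannot miss the signal even if the worker finishes before the wait starts.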
    
    Considered also using a condvar to signal the CPU thread to end, in place of an
    mpsc channel. This was not done because `Condvar::wait_timeout_while` is
    documented as being imprecise, and `mpsc::Receiver::recv_timeout` is not
    documented as such. Also, we would need a separate condvar, to avoid this case:
    the worker thread finishes its job, notifies the condvar, the CPU thread returns
    first, and we join on it and not the worker thread. So it was simpler to leave
    this part as is.
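
A minimal sketch of the channel-based signalling that was kept, assuming the monitor thread blocks in `recv_timeout` and treats a message (or a dropped sender) as "job finished". A wall-clock timeout stands in for the CPU-time check here, and `MonitorExit` and `monitor` are hypothetical names.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Why the monitor thread returned.
#[derive(Debug, PartialEq)]
pub enum MonitorExit {
    JobFinished,
    TimedOut,
}

/// Runs a stand-in job on the current thread while a monitor thread waits
/// for either the "finished" signal or the timeout.
pub fn monitor(job_ms: u64, timeout_ms: u64) -> MonitorExit {
    let (finished_tx, finished_rx) = mpsc::channel::<()>();

    let monitor = thread::spawn(move || {
        match finished_rx.recv_timeout(Duration::from_millis(timeout_ms)) {
            // Signalled explicitly, or the sender went away: stop monitoring.
            Ok(()) | Err(RecvTimeoutError::Disconnected) => MonitorExit::JobFinished,
            Err(RecvTimeoutError::Timeout) => MonitorExit::TimedOut,
        }
    });

    thread::sleep(Duration::from_millis(job_ms)); // stand-in for the real job
    // The monitor may already have timed out and dropped the receiver.
    let _ = finished_tx.send(());

    monitor.join().unwrap()
}
```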
    
    * Catch panics in threads so we always notify condvar
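
One way to guarantee the notification, sketched under the assumption that the job is wrapped in `std::panic::catch_unwind` inside the thread; `run_and_notify` and the `String` error type are hypothetical.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Runs `job` on a thread and always notifies the condvar, even if `job` panics.
pub fn run_and_notify<F>(job: F) -> Result<u32, String>
where
    F: FnOnce() -> u32 + Send + 'static,
{
    let pair = Arc::new((Mutex::new(true), Condvar::new()));
    let worker_pair = Arc::clone(&pair);

    let worker = thread::spawn(move || {
        // Catch the panic so the code below still runs.
        let outcome = panic::catch_unwind(AssertUnwindSafe(job));
        let (lock, cvar) = &*worker_pair;
        *lock.lock().unwrap() = false;
        cvar.notify_one();
        outcome.map_err(|_| "job panicked".to_string())
    });

    let (lock, cvar) = &*pair;
    // Without the catch above, a panicking job would leave this wait stuck forever.
    drop(cvar.wait_while(lock.lock().unwrap(), |pending| *pending).unwrap());

    worker.join().expect("worker thread itself should not panic")
}
```

With this shape, `run_and_notify(|| panic!("boom"))` returns an `Err` instead of hanging the waiter.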
    
    * Use `WaitOutcome` enum instead of bool condition variable
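
A sketch of why an enum is more informative than a bool here: the waiter can tell which thread triggered the wakeup. Only the name `WaitOutcome` comes from this commit; the variant names, the `notify` helper and `race` are assumptions for illustration.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Variant names are assumed for this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum WaitOutcome {
    Finished,
    TimedOut,
    Pending,
}

type Cond = Arc<(Mutex<WaitOutcome>, Condvar)>;

/// Records `outcome` only if nobody has reported yet, then wakes the waiter.
fn notify(cond: &Cond, outcome: WaitOutcome) {
    let (lock, cvar) = &**cond;
    let mut state = lock.lock().unwrap();
    if *state == WaitOutcome::Pending {
        *state = outcome;
    }
    cvar.notify_all();
}

/// Races a stand-in job against a timer and reports who notified first.
pub fn race(job_ms: u64, timeout_ms: u64) -> WaitOutcome {
    let cond: Cond = Arc::new((Mutex::new(WaitOutcome::Pending), Condvar::new()));

    let job_cond = Arc::clone(&cond);
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(job_ms)); // stand-in for the job
        notify(&job_cond, WaitOutcome::Finished);
    });

    let timer_cond = Arc::clone(&cond);
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(timeout_ms));
        notify(&timer_cond, WaitOutcome::TimedOut);
    });

    let (lock, cvar) = &*cond;
    let state = cvar
        .wait_while(lock.lock().unwrap(), |s| *s == WaitOutcome::Pending)
        .unwrap();
    *state
}
```

Since the first notifier wins, the caller knows which thread to join on, which addresses the ambiguity described for a shared bool condvar.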
    
    * Fix retry timeouts to depend on exec timeout kind
    
    * Address review comments
    
    * Make the API for condvars in workers nicer
    
    * Add a doc
    
    * Use condvar for memory stats thread
    
    * Small refactor
    
    * Enumerate internal validation errors in an enum
    
    * Fix comment
    
    * Add a log
    
    * Fix test
    
    * Update variant naming
    
    * Address a missed TODO
    
    ---------
    
    Co-authored-by: Andrei Sandu <[email protected]>