Fix order of resending messages after restart (#6729)
The way we build the messages we need to send to approval-distribution can result in a situation where is we have multiple assignments covered by a coalesced approval, the messages are sent in this order: ASSIGNMENT1, APPROVAL, ASSIGNMENT2, because we iterate over each candidate and add to the queue of messages both the assignment and the approval for that candidate, and when the approval reaches the approval-distribution subsystem it won't be imported and gossiped because one of the assignment for it is not known. So in a network where a lot of nodes are restarting at the same time we could end up in a situation where a set of the nodes correctly received the assignments and approvals before the restart and approve their blocks and don't trigger their assignments. The other set of nodes should receive the assignments and approvals after the restart, but because the approvals never get broacasted anymore because of this bug, the only way they could approve is if other nodes start broadcasting their assignments. I think this bug contribute to the reason the network did not recovered on `25-11-25 15:55:40` after the restarts. Tested this scenario with a `zombienet` where `nodes` are finalising blocks because of aggression and all nodes are restarted at once and confirmed the network lags and doesn't recover before and it does after the fix --------- Signed-off-by: Alexandru Gheorghe <[email protected]> (cherry picked from commit 65a4e5ee)
Please register or sign in to comment