Closed Bug 1547084 Opened 6 months ago Closed 5 months ago

Support many replaying children


(Core :: Web Replay, enhancement, P5)




Tracking Status
firefox68 --- fixed


(Reporter: bhackett, Assigned: bhackett)


(Blocks 1 open bug)



(9 files)

Right now a middleman process supports at most four child processes at once: an optional recording process, two replaying processes for restoring old program states, and an optional replaying process which searches for logpoint hits. When integrated with the cloud (or on a powerful local machine) we could support many more replaying processes, which would make it much faster to restore program states at particular points or search the recording for logpoint hits.

As part of this, the middleman should also be able to survive replaying children that crash or become unresponsive. Right now the middleman crashes immediately if any of its children crash. We used to start up a new replaying process to recover after a crash, but because of the limited number of children the UI could stay unresponsive for a long time after a crash, and this feature was removed. Having more replaying children makes it more likely that one will crash, but also means that it should be faster to recover the UI with a child at a nearby point after a crash occurs.

Type: defect → enhancement
Priority: -- → P5
Attached patch: patch

This is a big patch that makes the architectural changes necessary to allow efficient use of many replaying children. The main problem with the existing approach is that it is focused on running back and forth through the recording in search of breakpoint hits. With many children, this process is inefficient and hard to coordinate.

The new strategy introduced by this patch reorients things around scans of the recording. As the recording is being made we have different children scanning different parts of the recording. The scans determine the set of execution points where each script location is hit. Scanning is slower than simple replaying, but by spreading the load between any number of replaying processes we'll be able to keep the scan data up to date even while the recording continues to grow. The scan data is used as the basis for most debugger features. When the user wants to run forward or backward through the interior of the recording, we don't reexecute that code (as we have to do now), but just look at the scan data to figure out the execution point where the next breakpoint hit will occur. Once that point is determined, we warp a replaying process there and pause. Similarly, logpoints are handled by looking at the scan data to find all the execution points where the logpoint should hit, then sending different background replaying processes to those points. The entire recording does not need to be reexecuted, as it does now.
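
To make the lookup concrete, here is a minimal sketch of how scan data could answer "where is the next breakpoint hit?" without reexecuting anything. This is not the patch's code: the `ScanIndex` class, its method names, and the use of plain integers for execution points are all assumptions made for illustration, and the real control logic lives in JS and C++.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class ScanIndex:
    """Hypothetical index: script location -> sorted execution points hit.

    Execution points are modeled as plain integers (an assumption; the real
    representation is richer).
    """

    def __init__(self):
        self._hits = defaultdict(list)

    def add_scan_result(self, location, points):
        # Different children scan different regions of the recording, so
        # results may arrive out of order; merge and keep the list sorted.
        merged = set(self._hits[location])
        merged.update(points)
        self._hits[location] = sorted(merged)

    def next_hit(self, breakpoints, current):
        """Earliest hit strictly after `current` for any breakpoint, or None."""
        candidates = []
        for location in breakpoints:
            hits = self._hits.get(location, [])
            i = bisect_right(hits, current)
            if i < len(hits):
                candidates.append(hits[i])
        return min(candidates, default=None)

    def previous_hit(self, breakpoints, current):
        """Latest hit strictly before `current` for any breakpoint, or None."""
        candidates = []
        for location in breakpoints:
            hits = self._hits.get(location, [])
            i = bisect_left(hits, current)
            if i > 0:
                candidates.append(hits[i - 1])
        return max(candidates, default=None)

    def logpoint_assignments(self, location, num_children):
        """Split a logpoint's hits round-robin across background children."""
        hits = self._hits.get(location, [])
        return [hits[i::num_children] for i in range(num_children)]


# Example: two scanned regions report hits for the same location.
index = ScanIndex()
index.add_scan_result("app.js:10", [120, 40])
index.add_scan_result("app.js:25", [75])
index.add_scan_result("app.js:10", [300])
target = index.next_hit(["app.js:10", "app.js:25"], 40)
```

In this sketch, resuming forward or backward is a binary search per breakpoint followed by a min or max, after which a replaying process would be warped to `target`. Logpoints use the same data: each background child gets its own slice of hit points to visit.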

The end result of this is a system that can efficiently operate with many children in parallel, while still keeping things simple and elegant. This patch is a net loss of more than 2000 lines of C++ (most of it the navigation logic used when searching for breakpoint hits) and a net gain of about 350 lines of JS.

Performance right now will not be great, especially when stepping. This patch works well enough to pass tests but hasn't been optimized at all. There will be a fair number of additional changes to get performance where it should be, per the dependencies of bug 1547081. Also, this patch doesn't take care of crash recovery, which will need to be done in another bug.

Assignee: nobody → bhackett1024

The goal of this bug is to make Web Replay's architecture more performant.

One happy side effect is that the architecture is also significantly less intrusive on the platform (in terms of the number of messages sent, child processes, and hash tables). This helps us achieve our goal of reducing the platform footprint.

Pushed by
Part 1 - Remove recordReplayDirective interface and uses, r=mccr8.
Part 2 - Remove reverseStepIn and reverseStepOut logic, r=loganfsmyth.
Part 3 - C++ changes and removal for new control logic, r=loganfsmyth.
Part 4 - Switch to a search focused control logic architecture, r=loganfsmyth.
Part 5 - Debugger changes for new control logic, r=loganfsmyth.
Part 6 - Console changes for new control logic, r=nchevobbe.
Part 7 - Test changes for new control logic, r=loganfsmyth.
Regressions: 1552420