Closed Bug 1445219 Opened 7 years ago Closed 7 years ago

Web Replay: Restore replaying process after crashes and hangs

Categories

(Core :: General, defect)

defect
Not set
normal

Tracking

()

RESOLVED INVALID
Tracking Status
firefox61 --- affected

People

(Reporter: bhackett1024, Assigned: bhackett1024)

Attachments

(8 files)

There are several intermittent crashes and hangs I've seen while using Web Replay. While of course it would be nice if these never happened, and while I will try to track down their sources, it would also be nice if these didn't have a big impact on people using the tool. The middleman process knows where the replaying process is and what it's doing, so it would be cool if the middleman could detect that the replaying process has crashed or hung, then spin up a new replaying process and direct it to where the last one was. This would all happen transparently to the user, other than the time spent waiting while the new process gets back to that point. (Eventually this recovery could be exposed in the UI, and there is already a similar need to show the user progress when rewinding, but that will be left for the future.)
Remove the existing last-ditch restore infrastructure. This was used when a mismatch with the recording was detected while replaying: we would rewind to the last snapshot and try again, hopefully avoiding the problem. It wasn't enabled, though, and had rotted, and the approach in this bug is considerably more robust against the problems that crop up in practice.
Assignee: nobody → bhackett1024
Previously GeckoChildProcessHost used environment variables to determine what type of recording/replaying/middleman process to spawn. Not only is this really gross, it also makes it hard to have a middleman spawn multiple processes of different kinds at different times. This patch makes the spawned process type an explicit parameter to the functions that spawn processes.
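To illustrate the shape of this change, here is a minimal sketch of passing the process type as an explicit parameter instead of reading it from the environment. All names here (ProcessKind, LaunchRequest, FlagFor, BuildChildArgs, and the command-line flags) are hypothetical and are not the actual GeckoChildProcessHost API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical names for illustration; not the real GeckoChildProcessHost API.
enum class ProcessKind { Recording, Replaying, Middleman };

struct LaunchRequest {
  ProcessKind kind;           // What the caller wants spawned.
  std::string recordingFile;  // Recording to write or read; may be empty.
};

// Map a process kind to a made-up command-line flag.
inline const char* FlagFor(ProcessKind kind) {
  switch (kind) {
    case ProcessKind::Recording: return "--recording";
    case ProcessKind::Replaying: return "--replaying";
    case ProcessKind::Middleman: return "--middleman";
  }
  return "";
}

// Build the child's arguments from the explicit request. Nothing is read
// from the environment, so one parent can issue different requests at
// different times.
inline std::vector<std::string> BuildChildArgs(const LaunchRequest& request) {
  std::vector<std::string> args;
  args.push_back(FlagFor(request.kind));
  if (!request.recordingFile.empty()) {
    args.push_back(request.recordingFile);
  }
  return args;
}
```

Because the kind travels with each request, a middleman could ask for a second replaying process later without mutating any process-wide state first.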
This patch changes breakpoints to be set via dedicated IPC messages between the middleman and recording/replaying processes, rather than folding them into the DebuggerRequest JSON messages sent for other debugger actions. This is nice because recovery of a crashed process needs to treat changes to breakpoints differently from other messages, and it also allows some nice cleanups and simplifications in the code for managing debugger requests (breakpoint operations already had some special casing going on).
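One way to picture why separate breakpoint messages help with recovery: if the middleman tracks the currently installed breakpoints on its own, it can resend exactly those to a replacement process. The sketch below assumes that design; the patch description does not spell out the mechanism, and SetBreakpointMessage, BreakpointState, and the position strings are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical message type; the real IPC messages are not shown in this bug.
struct SetBreakpointMessage {
  uint32_t id;
  std::string position;  // e.g. "script.js:20"; the format is an assumption.
};

// Middleman-side record of which breakpoints are currently installed.
class BreakpointState {
 public:
  void Set(uint32_t id, const std::string& position) { mActive[id] = position; }
  void Clear(uint32_t id) { mActive.erase(id); }

  // Messages a fresh replaying process would need in order to end up with
  // the same breakpoints as the process it replaces.
  std::vector<SetBreakpointMessage> MessagesForNewProcess() const {
    std::vector<SetBreakpointMessage> messages;
    for (const auto& entry : mActive) {
      messages.push_back({entry.first, entry.second});
    }
    return messages;
  }

 private:
  std::map<uint32_t, std::string> mActive;
};
```

Only the net state is resent here, so a breakpoint that was set and later cleared never reaches the new process.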
Fix a source of unhandled recording divergences which was causing unexpected (though not incorrect) behavior while testing.
We need to be able to provoke crashes in the replaying process in order to have automated testing for child process recovery.
Here are the main changes to the infrastructure needed so that the middleman process can spin up a new replaying process and take it to the same place where the last replaying process crashed. If the second replaying process crashes before getting to that point, then we report the error to the user and give up; otherwise the second process seamlessly takes over for future debugging.
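The recovery policy described above can be sketched as a small state tracker: the first failure triggers a replacement, and a failure while the replacement is still catching up means giving up. RecoveryAction and RecoveryTracker are hypothetical names, and this is only an illustration of the policy, not the code in the patch.

```cpp
#include <cassert>

// Hypothetical names; a sketch of the policy, not the patch itself.
enum class RecoveryAction { SpawnReplacement, ReportErrorAndGiveUp };

class RecoveryTracker {
 public:
  // Called when the replaying process crashes or hangs.
  RecoveryAction OnChildFailure() {
    if (mRecovering) {
      // The replacement failed before reaching the target point.
      return RecoveryAction::ReportErrorAndGiveUp;
    }
    mRecovering = true;
    return RecoveryAction::SpawnReplacement;
  }

  // Called when the replacement reaches the point where the previous
  // process was; from here on it takes over for debugging.
  void OnReachedTarget() { mRecovering = false; }

  bool IsRecovering() const { return mRecovering; }

 private:
  bool mRecovering = false;
};
```

In this sketch a process that has fully taken over is treated like the original, so a later failure gets its own single recovery attempt. Whether the real implementation allows that is not stated in this bug.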
Attached patch: Part 7 - Tests
This patch fixes an issue I saw where if a process is resumed, but hangs, we don't try to detect the hang until after trying to interact with the process in some way (pressing the pause button, etc.). Now, we start monitoring for a hang as soon as the process is resumed.
https://hg.mozilla.org/projects/ash/rev/51a0586397b6fa9297c78e35fb7107c6e634f0c0
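A rough sketch of the hang-monitoring behavior: the timer is armed at resume time rather than at the next user interaction. HangMonitor, its method names, and the timeout are all assumptions for illustration; the caller supplies timestamps so the logic can be exercised without sleeping.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical class; the real monitoring code is in the linked revision.
class HangMonitor {
 public:
  using Clock = std::chrono::steady_clock;

  explicit HangMonitor(std::chrono::milliseconds timeout) : mTimeout(timeout) {}

  // Start watching as soon as the child is resumed.
  void OnResume(Clock::time_point now) {
    mLastActivity = now;
    mMonitoring = true;
  }

  // Any message from the child counts as a sign of life.
  void OnMessageFromChild(Clock::time_point now) { mLastActivity = now; }

  // A paused child is idle by design, so stop watching it.
  void OnPaused() { mMonitoring = false; }

  // True if the child has been silent for longer than the timeout
  // while it was supposed to be running.
  bool IsHung(Clock::time_point now) const {
    return mMonitoring && (now - mLastActivity) > mTimeout;
  }

 private:
  std::chrono::milliseconds mTimeout;
  Clock::time_point mLastActivity{};
  bool mMonitoring = false;
};
```

With this shape, a process that is resumed and then goes silent is flagged once the timeout elapses, even if the user never touches the pause button.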
Closing this bug; all the changes here will be reviewed in separate bugs dependent on bug 1422587.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INVALID