Closed Bug 1445219 Opened 7 years ago Closed 7 years ago

Web Replay: Restore replaying process after crashes and hangs

Tracking

()

Status:

RESOLVED INVALID

Tracking Flags:

Tracking

Status

firefox61

---

affected

People

(Reporter: bhackett1024, Assigned: bhackett1024)

References

Details

Attachments

(8 files)

Part 1 - Remove last ditch restore infrastructure. (7 years ago, Brian Hackett [Laid off!], 18.91 KB, patch)
Part 2 - Explicitly indicate type of record/replay child to spawn in GeckoChildProcessHost. (7 years ago, Brian Hackett [Laid off!], 11.54 KB, patch)
Part 3 - Use special IPC messages to set or clear breakpoints. (7 years ago, Brian Hackett [Laid off!], 31.10 KB, patch)
Part 4 - Avoid recording divergences when querying the system time. (7 years ago, Brian Hackett [Laid off!], 1.48 KB, patch)
Part 5 - Add testing function for provoking crashes in replaying processes. (7 years ago, Brian Hackett [Laid off!], 7.21 KB, patch)
Part 6 - Add infrastructure for recovering crashed/hung replaying processes. (7 years ago, Brian Hackett [Laid off!], 30.80 KB, patch)
Part 7 - Tests. (7 years ago, Brian Hackett [Laid off!], 5.04 KB, patch)
Part 8 - Always watch for hangs when resuming a replaying process. (7 years ago, Brian Hackett [Laid off!], 3.61 KB, patch)

Brian Hackett [Laid off!]

Assignee

Description

•

7 years ago

There are several intermittent crashes and hangs I've seen while using Web Replay. While of course it would be nice if these never happened, and while I will try to track down their sources, it would also be nice if they didn't have a big impact on people using the tool. The middleman process knows where the replaying process is and what it's doing, so it could detect that the replaying process has crashed or hung, spin up a new replaying process, and direct it to where the last one was. This would all happen transparently to the user, other than the time spent waiting while the new process gets back to that point. (Eventually this recovery could be exposed in the UI, and there is already a similar need to show the user progress when rewinding, but that will be left for the future.)

Brian Hackett [Laid off!]

Assignee

Comment 1

•

7 years ago

Attached patch: Part 1 - Remove last ditch restore infrastructure.

Remove the existing last-ditch restore infrastructure. This was used when a mismatch with the recording was detected while replaying: we would rewind to the last snapshot and try again, hopefully avoiding the problem. It wasn't enabled though and had rotted, and the approach in this bug is considerably more robust against the problems that crop up in practice.

Assignee: nobody → bhackett1024

Brian Hackett [Laid off!]

Assignee

Comment 2

•

7 years ago

Attached patch: Part 2 - Explicitly indicate type of record/replay child to spawn in GeckoChildProcessHost.

Previously GeckoChildProcessHost used environment variables to determine what type of recording/replaying/middleman process to spawn. Not only is this really gross, it also makes it hard to have a middleman spawn multiple processes of different kinds at different times. This patch makes the spawned process type an explicit parameter to the functions that spawn processes.
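
To illustrate the shape of this change, here is a minimal sketch of passing the child kind as an explicit argument instead of reading it from the environment. It is not the actual GeckoChildProcessHost interface: `ChildKind`, `KindName`, `BuildChildArgs`, and the `--child-kind` / `--recording-file` flags are all names assumed for the example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// All names and flags below are hypothetical, for illustration only.
enum class ChildKind { Normal, Middleman, Recording, Replaying };

const char* KindName(ChildKind kind) {
  switch (kind) {
    case ChildKind::Normal:    return "normal";
    case ChildKind::Middleman: return "middleman";
    case ChildKind::Recording: return "recording";
    case ChildKind::Replaying: return "replaying";
  }
  return "normal";  // Not reached for valid enum values.
}

// Build the extra command-line arguments for a child. The kind is an explicit
// parameter, so one caller can spawn different kinds of children at different
// times without touching process-wide state such as environment variables.
std::vector<std::string> BuildChildArgs(ChildKind kind,
                                        const std::string& recordingFile) {
  // A recording or replaying child needs a recording file to work with.
  bool needsFile =
      kind == ChildKind::Recording || kind == ChildKind::Replaying;
  assert(!needsFile || !recordingFile.empty());

  std::vector<std::string> args;
  if (kind != ChildKind::Normal) {
    args.push_back("--child-kind");
    args.push_back(KindName(kind));
  }
  if (needsFile) {
    args.push_back("--recording-file");
    args.push_back(recordingFile);
  }
  return args;
}
```

With this shape, a middleman could call `BuildChildArgs(ChildKind::Replaying, path)` once for the original child and again later for a replacement, which is awkward to do when the kind lives in an environment variable shared by the whole process.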

Brian Hackett [Laid off!]

Assignee

Comment 3

•

7 years ago

Attached patch: Part 3 - Use special IPC messages to set or clear breakpoints.

This patch changes breakpoints to be set and cleared via dedicated IPC messages between the middleman and recording/replaying processes, rather than folding those operations into the DebuggerRequest JSON messages sent for other debugger actions. This is nice because recovery of a crashed process needs to treat changes to breakpoints differently from other messages, and it also allows some nice cleanups and simplifications in the code for managing debugger requests (breakpoint operations already had some special casing going on).
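
A rough sketch of why separate message types help is below. It assumes (this is not stated in the patch description) that breakpoint state must be reinstalled in a replacement process while one-off debugger requests need not be remembered. The types `SetBreakpoint`, `ClearBreakpoint`, `DebuggerRequest`, and `BreakpointLog` are hypothetical and do not mirror the real IPC definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <variant>
#include <vector>

// Hypothetical message types; the real messages are defined by the patch.
struct SetBreakpoint { size_t id; std::string location; };
struct ClearBreakpoint { size_t id; };
struct DebuggerRequest { std::string json; };
using Message = std::variant<SetBreakpoint, ClearBreakpoint, DebuggerRequest>;

// Tracks which breakpoints are currently installed in the child, based on the
// messages the middleman has sent to it.
class BreakpointLog {
 public:
  void Observe(const Message& msg) {
    if (const auto* set = std::get_if<SetBreakpoint>(&msg)) {
      mActive[set->id] = set->location;
    } else if (const auto* clear = std::get_if<ClearBreakpoint>(&msg)) {
      mActive.erase(clear->id);
    }
    // DebuggerRequest messages are deliberately not remembered here.
  }

  // Messages needed to bring a fresh process to the same breakpoint state.
  std::vector<Message> MessagesForNewProcess() const {
    std::vector<Message> out;
    for (const auto& [id, location] : mActive) {
      out.push_back(SetBreakpoint{id, location});
    }
    return out;
  }

 private:
  std::map<size_t, std::string> mActive;
};
```

Because breakpoint operations are their own message types, the bookkeeping does not have to parse request JSON to find out whether a message changed breakpoint state.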

Brian Hackett [Laid off!]

Assignee

Comment 4

•

7 years ago

Attached patch: Part 4 - Avoid recording divergences when querying the system time.

Fix a source of unhandled recording divergences which was causing unexpected (though not incorrect) behavior while testing.

Brian Hackett [Laid off!]

Assignee

Comment 5

•

7 years ago

Attached patch: Part 5 - Add testing function for provoking crashes in replaying processes.

We need to be able to provoke crashes in the replaying process in order to have automated testing for child process recovery.

Brian Hackett [Laid off!]

Assignee

Comment 6

•

7 years ago

Attached patch: Part 6 - Add infrastructure for recovering crashed/hung replaying processes.

Here are the main changes to the infrastructure needed so that the middleman process can spin up a new replaying process and take it to the same place where the last replaying process crashed. If the second replaying process crashes before getting to that point, we report the error to the user and give up; otherwise the second process seamlessly takes over for future debugging.
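
The retry policy described above (one replacement process, then give up) can be sketched as follows. This is only an outline under stated assumptions: `ExecutionPoint`, `RunResult`, `SpawnAndRunTo`, and `RecoverReplayingProcess` are hypothetical names, and the callback stands in for whatever the middleman does to spawn a process and run it forward.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical stand-in for a position in the recording.
using ExecutionPoint = size_t;

enum class RunResult { Reached, Crashed };
enum class RecoveryOutcome { Recovered, GaveUp };

// Hypothetical callback: spawn a fresh replaying process and run it forward to
// the given point, reporting whether it got there or crashed on the way.
using SpawnAndRunTo = std::function<RunResult(ExecutionPoint)>;

// Try exactly one replacement process. If it also fails before reaching the
// point where the first process died, record an error message and give up.
RecoveryOutcome RecoverReplayingProcess(ExecutionPoint lastKnownPoint,
                                        const SpawnAndRunTo& spawnAndRunTo,
                                        std::string* errorOut) {
  assert(spawnAndRunTo != nullptr);
  if (spawnAndRunTo(lastKnownPoint) == RunResult::Reached) {
    return RecoveryOutcome::Recovered;
  }
  if (errorOut) {
    *errorOut = "Replaying process crashed again during recovery";
  }
  return RecoveryOutcome::GaveUp;
}
```

Capping recovery at one attempt avoids looping forever on a crash that reproduces deterministically at the same point in the recording.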

Brian Hackett [Laid off!]

Assignee

Comment 7

•

7 years ago

Attached patch: Part 7 - Tests.

Brian Hackett [Laid off!]

Assignee

Comment 8

•

7 years ago

https://hg.mozilla.org/projects/ash/rev/04eb48072173246aa5802d27095b86378bdb799f
https://hg.mozilla.org/projects/ash/rev/be70121487bdec055dc650cb237a35b875caf28a
https://hg.mozilla.org/projects/ash/rev/c6c63a1763eaf086c0f7b6e0823603b35daeed60
https://hg.mozilla.org/projects/ash/rev/51460e01e0a618bafed775ba6d1a3ffce878fb92
https://hg.mozilla.org/projects/ash/rev/7c0d410e49fc155155b918042eb4c0159382a451
https://hg.mozilla.org/projects/ash/rev/01729d661d3e448ab8cc2a06db67ef7519a73f64
https://hg.mozilla.org/projects/ash/rev/50280c3f8e3b36d066195c19498aebcf0c833a16

Brian Hackett [Laid off!]

Assignee

Comment 9

•

7 years ago

Attached patch: Part 8 - Always watch for hangs when resuming a replaying process.

This patch fixes an issue I saw where, if a process is resumed but then hangs, we don't try to detect the hang until something interacts with the process (pressing the pause button, etc.). Now we start monitoring for a hang as soon as the process is resumed.

https://hg.mozilla.org/projects/ash/rev/51a0586397b6fa9297c78e35fb7107c6e634f0c0
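
As a sketch of the behavior change, the monitor below starts its timer at the moment of resuming rather than at the next user interaction. `HangMonitor`, its method names, and the idea that any message from the child counts as activity are assumptions made for the example, not the patch's actual design.

```cpp
#include <cassert>
#include <chrono>
#include <optional>

// Hypothetical hang monitor that lives in the middleman process.
class HangMonitor {
 public:
  using Clock = std::chrono::steady_clock;

  explicit HangMonitor(Clock::duration timeout) : mTimeout(timeout) {
    assert(timeout > Clock::duration::zero());
  }

  // Called when the child is told to resume: start watching immediately.
  void OnResume(Clock::time_point now) { mLastActivity = now; }

  // Called when the child sends any message while it is being watched.
  void OnChildMessage(Clock::time_point now) {
    if (mLastActivity) {
      mLastActivity = now;
    }
  }

  // Called once the child has paused again: stop watching.
  void OnPaused() { mLastActivity.reset(); }

  // Polled periodically, for example from a timer in the middleman.
  bool IsHung(Clock::time_point now) const {
    return mLastActivity && now - *mLastActivity > mTimeout;
  }

 private:
  Clock::duration mTimeout;
  std::optional<Clock::time_point> mLastActivity;
};
```

Passing the current time in as an argument keeps the class easy to test without sleeping; a real caller would pass `HangMonitor::Clock::now()`.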

Brian Hackett [Laid off!]

Assignee

Comment 10

•

7 years ago

Closing this bug; all the changes here will be reviewed in separate bugs dependent on bug 1422587.

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → INVALID

Categories

(Core :: General, defect)


Security

(public)
