Closed Bug 1422587 Opened 6 years ago Closed 5 years ago
Web Replay v2
A few weeks ago I restarted work on Web Replay (bug 1207696). I don't have a good explanation for why the initial attempt to land this ran out of steam, but I believe in this technology and want to keep developing it. So far I've been focusing on a couple major architectural changes: 1. Allow a recording process to transition into a replaying process. This will greatly simplify the workflow users need to follow to replay a page: just record the debugging session, and at any time execution can be rewound to replay earlier behaviors. After a process is first rewound it won't be able to record again by e.g. fast forwarding to the end of the recording, but maybe this could be supported someday. 2. Overhaul graphics handling to greatly reduce the amount of interaction between record/replay code and graphics code. The new model is to put a compositor in the recording/replaying process which renders data into a buffer, whose contents are relayed to the parent process' compositor by some simple layer messages in the middleman process. This avoids shuttling graphics IPDL messages through the middleman, which is tricky because these messages are a moving target and because their semantics are unclear when rewinding. Additionally, I've torn out the special DrawTargets used when recording/replaying, which were more trouble than they were worth. The attached patch makes some progress on these: recorded processes mostly work in the presence of a middleman, including using their own compositor and being able to show graphics in the parent process.
This patch makes some fixes so that recording works better, and some more intricate changes to allow snapshots to be taken while recording. This seems to be working pretty well. The remaining steps in the rearchitecting are to get replaying/rewinding working again (the main issue is to deal with the new graphics changes) and then to allow transitioning recording processes into replaying processes and rewinding them.
Attachment #8933986 - Attachment is obsolete: true
This patch fixes things so that replaying works again, including showing graphics. Graphics turned out to be pretty simple --- on the workloads I've looked at, when using the Skia backend the native CoreGraphics APIs are mainly used for rendering fonts, and the rendered pixel data can be included directly in the recording without bloating it very much.
Attachment #8937049 - Attachment is obsolete: true
I updated the UI in the new debugger client and with this patch plus those changes (I'll try to land those in the client later this month) can again rewind, reverse/forward step, eval(), etc. in a replaying process.
Attachment #8939339 - Attachment is obsolete: true
This patch completes the rearchitecting described in comment 0. The debugger can be attached to a recording process, paused, then rewound to replay the process' earlier behavior. I'll be rebasing next and will start testing on more complicated web pages.
Attachment #8941411 - Attachment is obsolete: true
Rebased and updated WIP that works on more complicated websites. I've been able to record/replay npr.org and the facebook.com newsfeed. Up next are testing/bugfixing and some optimizations.
Attachment #8942715 - Attachment is obsolete: true
I've been working on fixing bugs, making some optimizations to the snapshot process, and fit-and-finish improvements (like showing alerts instead of dumping problems to stderr). There are still a few more things to fix but I'm hoping to release something (make a branch that people can push to, and a .dmg to download) by the end of this month.
Attachment #8946154 - Attachment is obsolete: true
I've started using the ash project repo (https://hg.mozilla.org/projects/ash) for Web Replay development. Tests/builds aren't working yet (bug 1439731) but after that is fixed this should be a suitable place for automated testing, providing builds, and allowing other folks to contribute before things are ready to merge to trunk. I've attached the rolled up initial patch I pushed. I'll be attaching smaller patches shortly for the individual csets this includes, besides the stuff from bug 1207696. A few of these are a little rough but most are ready for review, though getting them reviewed is not a priority for me right now. I'll file other bugs blocking this for future changes in the repository, and other bugs with finalized patches from this one when it is time for them to be reviewed (trying to avoid the madhouse that bug 1207696 turned into).
Attachment #8951572 - Attachment is obsolete: true
Disable the content process sandbox when recording/replaying, so the process can access recording and temporary snapshot files.
Disable stylo in recording/replaying processes. This is a bummer but on more complicated web sites I've seen some differences between recording/replay in the order in which layout things are processed. This is probably due to a hashtable being iterated in a different order or something but I haven't had time to determine what the problem is.
This atomic value is accessed during GC.
malloc's behavior is allowed to vary between recording/replaying, so atomic values it uses should not be recorded.
Avoiding recording this atomic makes it easier to see what is going on when assertions fail.
The Gecko profiler may be active during GC, so atomics it uses shouldn't be recorded.
I think these atomics can be accessed during GC.
When recording times related to script parsing, use the actual time rather than recording/replaying the time. Script parsing should be happening on the main thread in a deterministic way (so that scripts are created in a specific order, which the debugger relies on) so I'm not 100% sure this change is necessary.
Avoid using getenv in a few places in the GC and TI that might run non-deterministically.
The next patch makes some changes to the rayon-core crate, so until those changes are upstreamed use the local in-tree version of the crate.
rayon-core (the rust crate managing the thread pool for stylo and other things) uses atomics for coordination between threads, which are not recorded. This patch wraps those atomics in locks so that their accesses are correctly ordered between recording and replay. This patch is rather hacky and a placeholder until there is a better way for interfacing between rust code and the record/replay system.
This patch includes refcount changes to Runnable and its subclasses in the recording. I added this because I saw a subclass where (a) the destructor performed recorded operations, and (b) the thread where the last reference was released could vary across runs. Recording the refcount is only necessary on runnables where both of these are true, so this patch might be a little heavy handed but seems ok for now.
Add some IDL so that JS code can insert debugging assertions to check for consistency between recording and replay. This is called as 'recordReplayAssert(text)', similar to 'dump(text)'.
Some other patches create platform mutexes that do not need to be recorded (e.g. JS mutexes), so specify whether these mutexes should be recorded in the constructor.
Add a template argument to JS hashtables which specifies whether to preserve iteration order across recording/replaying. This preservation is done by ensuring a deterministic hash number for each item, which itself is done by setting every hash number in the table to zero when recording/replaying. Since this makes accesses on the table take linear time, and since only a couple JS hashtables actually need to preserve iteration order, the template argument was added to control whether to do the preservation.
Name tables in the frontend affect the order in which variables are defined in generated scripts, which can affect the recording by calling lookup hooks in different orders. This patch preserves iteration order in these tables.
Make sure that the child recording/replaying process of the middleman process is spawned in the same way as the middleman process itself. Without this a different executable is chosen I think and things do not work.
Instead of using a special IPDL protocol for communication between the middleman and replaying process (as in bug 1207696), IPC between these processes now uses a special bidirectional channel. This change is done because middleman processes are now used when recording as well (so the recording process can transition into a replaying process) and we will get terribly confused if live IPC is done over a single IPDL channel where some of it is included in the recording and some isn't. This patch includes the logic for this channel, definitions of messages passed between the processes, and all the logic for managing state and communicating between these processes.
This patch has a couple changes to IPC protocol code so that middleman processes can send IPDL messages to the recording/replaying process and so that we don't get confused by seeing unexpected pids when IPDL messages are forwarded through the middleman between the UI process and recording/replaying process.
The ContentProcess instance in the middleman process needs some changes to use special message channels for forwarding IPDL messages between processes. This patch has some changes to ContentProcess and PContent itself that also avoid sending messages that are redundant with messages sent by the recording process. The UI process will get confused if certain messages are sent to it by both the middleman and recording processes.
The middleman needs to make deep copies of IPDL messages in order to send them to another process, which this patch adds support for.
The UI process will get confused if audio IPC connections are opened to it from both the middleman and recording processes.
The middleman will get confused if printing IPC connections are opened to it from both the middleman and recording processes.
Async messages sent over IPC via the frame message manager sometimes need to go to the middleman process (if they are related to the debugger server code which runs there), and otherwise should go to the recording process.
Add IDL and IPDL glue to allow JS code in the UI process to instruct the recording process to save the recording to a file.
Add IPC so that fatal errors in the recording/replaying process (including crashes, but also divergence between recording and replay and some other stuff) are reported to the user. Currently this just throws up an alert, though I'm sure in the future it will be more refined.
The recording/replaying process creates its own compositor which renders to a buffer that is sent to the middleman process after each paint. This patch has changes to graphics code to make this possible, and to make sure the middleman does not try to do any rendering of its own.
Using a separate compositor in the recording/replaying process means that only a single layer is sent to the UI process containing all rendered data for the tab. This seems to make the APZ (Async Pan/Zoom) code think the page is like a static canvas and scrolling does not work correctly. This patch disables APZ in recording/replaying processes so that scrolling works correctly (albeit not as smoothly as with APZ).
Add preferences for record/replay settings.
Add menu items for recording, saving the recording, and replaying.
Server side devtools changes, on top of those covered by bug 1207696.
Some client side devtools changes to code that is (I think) hosted in m-c rather than on github.
Normally the debugger tries itself to fetch the contents of source files (.html, .js, etc.) referenced in a page before displaying them. This doesn't work right when running in a middleman process, but more importantly when replaying there is no guarantee the current contents of those source files match up with what they were when recording. This patch adds an API and DOM users of that API to supply JS with the contents of source files it comes across, which are relayed to the debugger JS code via the debugger C++ object.
Assorted crufty fixes and hacks to get things working. These will need to be revisited before landing and either removed or tidied up into proper patches.
Various record/replay assertions for checking consistent behavior between recording and replaying. There are some intermittent issues I haven't tracked down which some of these are related to, but most of these were added to debug problems that have since been fixed. They aren't hurting anything other than adding clutter, and since they help with detecting inconsistencies earlier it seems good to keep them around for now.
Various fixes to the previous patches to get things to compile on the tryserver, mostly to satisfy the clang static analyses.
All the code from dependent bugs has landed, excepting tests, and Web Replay should be in tomorrow's nightly (Mac only for now). Features are turned on via the devtools.recordreplay.enabled pref (default to false for now) and accessed via the 'Tools -> Web Developer' menu. Things might not be super stable yet.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
(In reply to Brian Hackett (:bhackett) from comment #46) > All the code from dependent bugs has landed, excepting tests, and Web Replay > should be in tomorrow's nightly (Mac only for now). Features are turned on > via the devtools.recordreplay.enabled pref (default to false for now) and > accessed via the 'Tools -> Web Developer' menu. Things might not be super > stable yet. From code coverage notifications, we noticed the code was not tested. Do you have WIP patches for adding tests?
(In reply to Marco Castelluccio [:marco] from comment #47) > From code coverage notifications, we noticed the code was not tested. Do you > have WIP patches for adding tests? See bug 1465491, which has some patches in progress. (This code is also OSX and Nightly only right now.)
You need to log in before you can comment on or make changes to this bug.