Closed Bug 1465294 Opened 6 years ago Closed 6 years ago

Web Replay: Recording/replaying/middleman behavior tweaks

Categories

(Core Graveyard :: Web Replay, enhancement)

enhancement
Not set
normal

Tracking

(firefox62 affected)

RESOLVED FIXED
Tracking Status
firefox62 --- affected

People

(Reporter: bhackett1024, Assigned: bhackett1024)

References

Details

(Whiteboard: leave-open)

Attachments

(15 files, 4 obsolete files)

791 bytes, patch
smaug
: review+
Details | Diff | Splinter Review
1001 bytes, patch
smaug
: review+
Details | Diff | Splinter Review
1.86 KB, patch
smaug
: review+
Details | Diff | Splinter Review
2.70 KB, patch
smaug
: review+
Details | Diff | Splinter Review
805 bytes, patch
mrbkap
: review+
Details | Diff | Splinter Review
877 bytes, patch
kinetik
: review+
Details | Diff | Splinter Review
730 bytes, patch
gsvelto
: review+
Details | Diff | Splinter Review
894 bytes, patch
bugzilla
: review+
Details | Diff | Splinter Review
1.88 KB, patch
mccr8
: review+
Details | Diff | Splinter Review
903 bytes, patch
froydnj
: review+
Details | Diff | Splinter Review
887 bytes, patch
mconley
: review+
Details | Diff | Splinter Review
902 bytes, patch
mconley
: review+
Details | Diff | Splinter Review
1.98 KB, patch
smaug
: review+
Details | Diff | Splinter Review
666 bytes, patch
mayhemer
: review+
Details | Diff | Splinter Review
789 bytes, patch
xidorn
: review+
Details | Diff | Splinter Review
This bug has assorted behavior tweaks in recording/replaying and middleman processes, mostly to avoid crashes.  A particular issue of note is dealing with IPDL messages that may be sent by both recording and middleman processes.  The UI process receives IPDL from the recording process and middleman process on the same message channel, and if it gets two messages over a channel when it only expects one then it will get confused or crash, so we only send the message from one process type.
The GetEventTargetFor call here is returning null in middleman processes.
Assignee: nobody → bhackett1024
Attachment #8981723 - Flags: review?(bugs)
TraceBlackJS is called at non-deterministic points and isn't supposed to be able to perform any recorded events, but calling ProcessGlobal::Get() can do so.  This patch avoids this issue by just not recording any events that happen within this call, but this is a placeholder for a more permanent fix.
Attachment #8981724 - Flags: review?(bugs)
Frame messages sent to a middleman process will be received both there and in its recording process child.  This patch ignores most of these messages when they are being received in a middleman process, except for debugger messages (devtools server code runs in the middleman) and session store flushes (needed to shutdown properly).
Attachment #8981725 - Flags: review?(bugs)
The recording and middleman processes can't both create CPOW managers.
Attachment #8981726 - Flags: review?(bugs)
The recording and middleman processes can't both send a first idle message, but since the UI process expects to receive one and there might not be a recording process, send this message from the middleman.
Attachment #8981727 - Flags: review?(mrbkap)
The recording and middleman processes can't both create an audio IPC connection.
Attachment #8981728 - Flags: review?(padenot)
If the middleman process sends PHal messages then sometimes they end up going to an actor that has already been deleted.
Attachment #8981730 - Flags: review?(gsvelto)
The crash reporter is not supported yet in recording or replaying processes.
Attachment #8981731 - Flags: review?(continuation)
Queries from the debugger in the middleman can cause a replaying process to try to do things it didn't while recording.  While we can deal with this problem in general, the debugger ends up with an unhelpful response to the query ('the operation failed' or somesuch).  This patch does some error handling so that we can provide better answers to queries that tend to hit this issue, like getting all the properties of a window.
Attachment #8981732 - Flags: review?(continuation)
The recording and middleman processes can't both open a Necko connection with the UI process.
Attachment #8981733 - Flags: review?(honzab.moz)
Recording/replaying processes directly access the recording file they are using, so shouldn't be sandboxed.
Attachment #8981734 - Flags: review?(gpascutto)
The background hang manager is not created in middleman processes (the IPDL messages which trigger its creation are only received in the recording process) so background hang threads should not be created either.
Attachment #8981735 - Flags: review?(nfroyd)
The recording and middleman processes can't open a printing proxy connection with the UI process.
Attachment #8981736 - Flags: review?(mconley)
Stylo requires modifications to upstream packages (rayon-core, crossbeam-deque) to record/replay some atomics those packages use for communicating between threads (see bug 1448607 comments 6 and 7).  It seems best for now to disable the stylo thread pool when recording/replaying, but in order to do that there need to be servo bindings so that global_style_data.rs can decide whether to disable the thread pool.

Also, from the changelog it looks like I'm supposed to go upstream somewhere to modify global_style_data.rs itself.  Should I land a version of this patch where the bindings are no-ops and return false, respectively, then fix things in that upstream repository?
Attachment #8981737 - Flags: review?(xidorn+moz)
The interrupt callback is invoked at non-deterministic points when recording/replaying, so painting should not occur there.
Attachment #8981738 - Flags: review?(mconley)
In order to support workers while recording/replaying, the interrupt callback needs to not perform recorded events and cycle collections need to happen at consistent places.  (There were some other changes under the hood required to support web workers, mainly making some previously main-thread-only APIs multithreaded.)
Attachment #8981740 - Flags: review?(bugs)
Comment on attachment 8981737 [details] [diff] [review]
Part 14 - Add record/replay servo bindings.

Review of attachment 8981737 [details] [diff] [review]:
-----------------------------------------------------------------

If what you currently want is to disable parallelism when recording / replaying, it's probably easier to just add a check in ServoStyleSet::ShouldTraverseInParallel[1] rather than adding a binding to do it in the Servo side.

It's not clear to me what change do you want to do in the Servo side, but any code in /servo directory can be changed in the same way as other code (in the same patch, land together). They are not considered "upstream". Files inside /third_party directory, however, are upstream libraries. Those libraries shouldn't be able to use these binding functions anyway.


[1] https://searchfox.org/mozilla-central/rev/5a744713370ec47969595e369fd5125f123e6d24/layout/style/ServoStyleSet.cpp#1513-1526
Attachment #8981737 - Flags: review?(xidorn+moz)
Comment on attachment 8981730 [details] [diff] [review]
Part 7 - Don't send PHal message from middleman processes.

Review of attachment 8981730 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM
Attachment #8981730 - Flags: review?(gsvelto) → review+
Comment on attachment 8981733 [details] [diff] [review]
Part 10 - Don't create Necko children in middleman processes.

Review of attachment 8981733 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/ipc/NeckoChild.cpp
@@ +57,5 @@
>    MOZ_ASSERT(IsNeckoChild(), "InitNeckoChild called by non-child!");
>  
> +  // Necko children are not created in middleman processes, to avoid
> +  // conflicting with the child created in the middleman's recording process.
> +  if (!gNeckoChild && !recordreplay::IsMiddleman()) {

I'm not sure what this code has to do.  There are places like [1] that do dereference of gNeckoChild right after call to this method expecting gNeckoChild be up.

If you want to prevent some code on child processes to run, you need to be more specific how this works and what you expect.

[1] https://searchfox.org/mozilla-central/rev/5a744713370ec47969595e369fd5125f123e6d24/netwerk/cookie/CookieServiceChild.cpp#85
comment 19
Flags: needinfo?(bhackett1024)
Comment on attachment 8981728 [details] [diff] [review]
Part 6 - Don't initialize audio IPC connection in middleman process.

:kinetik should have a look at this to make sure, but it looks correct.
Attachment #8981728 - Flags: review?(padenot) → review?(kinetik)
Attached patch global_style_data change (obsolete) — Splinter Review
Here is the change I'm hoping to land which uses the new servo bindings.  I wasn't sure if I should ask for this to be reviewed here, because the changelog for this file is a bunch of 'servo: Merge #N' entries which made it look like development was happening in another repository.

I tried updating ServoStyleSet::ShouldTraverseInParallel, but this change is not sufficient by itself, I think because the rayon thread pool is still being created.  Even if these rayon threads are never given more than one task at a time they are still synchronizing with each other in a way we can't record/replay without instrumentation for their atomics.
Flags: needinfo?(bhackett1024)
Attachment #8981939 - Flags: feedback?(xidorn+moz)
(In reply to Honza Bambas (:mayhemer) from comment #19)
> Comment on attachment 8981733 [details] [diff] [review]
> Part 10 - Don't create Necko children in middleman processes.
> 
> Review of attachment 8981733 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: netwerk/ipc/NeckoChild.cpp
> @@ +57,5 @@
> >    MOZ_ASSERT(IsNeckoChild(), "InitNeckoChild called by non-child!");
> >  
> > +  // Necko children are not created in middleman processes, to avoid
> > +  // conflicting with the child created in the middleman's recording process.
> > +  if (!gNeckoChild && !recordreplay::IsMiddleman()) {
> 
> I'm not sure what this code has to do.  There are places like [1] that do
> dereference of gNeckoChild right after call to this method expecting
> gNeckoChild be up.
> 
> If you want to prevent some code on child processes to run, you need to be
> more specific how this works and what you expect.
> 
> [1]
> https://searchfox.org/mozilla-central/rev/
> 5a744713370ec47969595e369fd5125f123e6d24/netwerk/cookie/CookieServiceChild.
> cpp#85

Middleman processes are content processes that do very little by themselves.  They don't load documents, don't run scripts except for devtools server code, and shouldn't need to (for example) initialize the cookie service.  They still go through normal initialization paths for content processes, though, and InitNeckoChild is always called (I think through nsHTMLDNSPrefetch::Initialize).

Would it be better to change IsNeckoChild() to return false for middleman processes?
For folks like me who aren't really familiar with how WebReplay works, or what the middleman process is, I found this document helpful: https://developer.mozilla.org/en-US/docs/WebReplay
Comment on attachment 8981736 [details] [diff] [review]
Part 13 - Don't create printing proxy child in middleman processes.

Review of attachment 8981736 [details] [diff] [review]:
-----------------------------------------------------------------

Seems okay. Out of curiosity, as new IPC Actors / protocols are added, how do we make decisions on which ones should be disabled like this for the middleman process? Is there a way we could have that decision making centralized somewhere in the future, to avoid needing to maintain all of these checks everywhere?
Attachment #8981736 - Flags: review?(mconley) → review+
Comment on attachment 8981738 [details] [diff] [review]
Part 15 - Don't paint from interrupt callback when recording/replaying.

Review of attachment 8981738 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/ipc/ProcessHangMonitor.cpp
@@ +337,5 @@
>    }
>  
> +  // Don't paint from the interrupt callback when recording or replaying, as
> +  // the interrupt callback is triggered non-deterministically.
> +  if (forcePaint && !recordreplay::IsRecordingOrReplaying()) {

This looks like this needs a rebase - this code has changed a little bit, but the general idea makes sense. Thanks!
Attachment #8981738 - Flags: review?(mconley) → review+
Comment on attachment 8981731 [details] [diff] [review]
Part 8 - Don't enable crash reporter in recording/replaying processes.

Review of attachment 8981731 [details] [diff] [review]:
-----------------------------------------------------------------

This looks reasonable to me, but Aaron is more familiar with crash reporting IPC, I think, so maybe he can weigh in here.
Attachment #8981731 - Flags: review?(continuation) → review?(aklotz)
Attachment #8981732 - Flags: review?(continuation) → review+
Comment on attachment 8981728 [details] [diff] [review]
Part 6 - Don't initialize audio IPC connection in middleman process.

Review of attachment 8981728 [details] [diff] [review]:
-----------------------------------------------------------------

It seems like it'd be clearer to check IsMiddleman() in InitLibrary() where InitAudioIPCConnection() is called from.
Attachment #8981728 - Flags: review?(kinetik) → review+
For part 16 (which does records when we run a CC on a worker to ensure that reply is okay), is there some other code that handles that for the main thread? I'd have thought that you'd want to handle that inside of the CC rather than separately at each entry point.
(In reply to Mike Conley (:mconley) (:⚙️) (Catching up on needinfos / reviews) from comment #25)
> Comment on attachment 8981736 [details] [diff] [review]
> Part 13 - Don't create printing proxy child in middleman processes.
> 
> Review of attachment 8981736 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Seems okay. Out of curiosity, as new IPC Actors / protocols are added, how
> do we make decisions on which ones should be disabled like this for the
> middleman process? Is there a way we could have that decision making
> centralized somewhere in the future, to avoid needing to maintain all of
> these checks everywhere?

Possibly.  The ideal strategy would have the middleman only communicate with the UI process using actors for the bare minimum of protocols it requires (I believe this is limited to PContent, PBrowser, PCompositorManager, PCompositorBridge, and PLayerTransaction).  It might be possible to use a message channel for e.g. PContent that forwards messages from this minimum set of protocols to the ContentParent in the UI process, and forwards other messages to an in-process ContentParent that serves as an endpoint for whatever actors the middleman process wants to create, but never sends anything of its own.  We already have a message channel that does similar things (see bug 1465287 part 3) but I'll have to do some experiments to see if this is an appropriate / viable approach.
(In reply to Andrew McCreight [:mccr8] from comment #29)
> For part 16 (which does records when we run a CC on a worker to ensure that
> reply is okay), is there some other code that handles that for the main
> thread? I'd have thought that you'd want to handle that inside of the CC
> rather than separately at each entry point.

From what I've seen the CC runs at deterministic points on the main thread, i.e. not in immediate response to a GC, though that might be because we already have a fair amount of instrumentation in parts 5c and 7 of bug 1207696 to make GC related activity more deterministic.

FWIW pretty soon I want to look into making the points where GCs occur deterministic (the set of collected things would remain non-deterministic).  This would allow removing a lot of this instrumentation and would allow incremental GCs and CCs to happen while recording/replaying.  Most importantly, this would eliminate a class of issues related to recorded events happening while tracing roots, e.g. what part 2 of this bug is covering for.
Comment on attachment 8981939 [details] [diff] [review]
global_style_data change

Review of attachment 8981939 [details] [diff] [review]:
-----------------------------------------------------------------

I guess this is fine, but I have a question: it seems to me that a recording / replying process must be recording / replying from the very beginning when the process starts, is it correct? If so, I suspect you should be able to set environment variable for the process when you are trying to start it? In that case, you should be able to just set "STYLO_THREADS=1" to suppress thread pool, which would reuse the existing mechanism without adding a new binding.

If a process doesn't know it's recording / replying from the very beginning, there is a chance that the pool has already been built, so this approach is not going to work either.
Attachment #8981939 - Flags: feedback?(xidorn+moz)
Comment on attachment 8981733 [details] [diff] [review]
Part 10 - Don't create Necko children in middleman processes.

Review of attachment 8981733 [details] [diff] [review]:
-----------------------------------------------------------------

If the middleman process is not doing anything, then I think better maybe to go with making [1] return false.

[1] https://searchfox.org/mozilla-central/rev/5a744713370ec47969595e369fd5125f123e6d24/netwerk/ipc/NeckoCommon.h#87-97
Attachment #8981733 - Flags: review?(honzab.moz) → review-
Comment on attachment 8981734 [details] [diff] [review]
Part 11 - Disable sandbox in recording/replaying processes.

Review of attachment 8981734 [details] [diff] [review]:
-----------------------------------------------------------------

So, uh, what's the security story here?

Is this a developer only thing that is disabled in releases? Is this something exposed in devtools? Will this be handling data coming from other content processes? What exactly is the data being passed to that is written out?
Attachment #8981734 - Flags: review?(gpascutto)
Comment on attachment 8981723 [details] [diff] [review]
Part 1 - Always use the chrome tab group in middleman processes.

I guess this is fine. Or one could set mTabGroup to ChromeTabGroup when TabChild is created.
Attachment #8981723 - Flags: review?(bugs) → review+
Comment on attachment 8981724 [details] [diff] [review]
Part 2 - Allow calling ProcessGlobal::Get during GC.

rs+ I guess. Assuming AutoPassThroughThreadEvents isn't doing anything heavy.
Attachment #8981724 - Flags: review?(bugs) → review+
Comment on attachment 8981725 [details] [diff] [review]
Part 3 - Handle certain frame messages in middleman processes.


>+// When recording or replaying, return whether a message should be received in
>+// the middleman process instead of the recording/replaying process.
>+static bool
>+DirectMessageToMiddleman(const nsAString& aMessage)
>+{
>+  nsCString cmsg = NS_ConvertUTF16toUTF8(aMessage);
>+  const char* cstr = cmsg.get();
>+  return strncmp(cstr, "debug:", 6) == 0
>+      || strcmp(cstr, "SessionStore:flush") == 0;
>+}

return StringBeginsWith(aMessage, NS_LITERAL_STRING("debug:") ||
       aMessage.EqualsLiteral("SessionStore:flush");

Why is sessionstore interesting case here?

I assume you'll want to bypass some more message, and then this should be rearchitected a bit, but ok hack  for now.
Attachment #8981725 - Flags: review?(bugs) → review+
Attachment #8981740 - Flags: review?(bugs) → review+
Attachment #8981726 - Flags: review?(bugs) → review+
(In reply to Gian-Carlo Pascutto [:gcp] from comment #34)
> Comment on attachment 8981734 [details] [diff] [review]
> Part 11 - Disable sandbox in recording/replaying processes.
> 
> Review of attachment 8981734 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> So, uh, what's the security story here?
> 
> Is this a developer only thing that is disabled in releases? Is this
> something exposed in devtools? Will this be handling data coming from other
> content processes? What exactly is the data being passed to that is written
> out?

This is a developer only feature that is only enabled in Mac nightlies.  Devtools can interact with recording/replaying processes, but not other content processes (no IPC is performed with these processes that could not be performed with any other content process).  The data being written to disk is information about library calls the content process has performed that can be used later to replay a very similar (not quite identical) execution.  For more details see https://developer.mozilla.org/en-US/docs/WebReplay.
(In reply to Olli Pettay [:smaug] from comment #37)
> Comment on attachment 8981725 [details] [diff] [review]
> Part 3 - Handle certain frame messages in middleman processes.
> 
> 
> >+// When recording or replaying, return whether a message should be received in
> >+// the middleman process instead of the recording/replaying process.
> >+static bool
> >+DirectMessageToMiddleman(const nsAString& aMessage)
> >+{
> >+  nsCString cmsg = NS_ConvertUTF16toUTF8(aMessage);
> >+  const char* cstr = cmsg.get();
> >+  return strncmp(cstr, "debug:", 6) == 0
> >+      || strcmp(cstr, "SessionStore:flush") == 0;
> >+}
> 
> return StringBeginsWith(aMessage, NS_LITERAL_STRING("debug:") ||
>        aMessage.EqualsLiteral("SessionStore:flush");
> 
> Why is sessionstore interesting case here?
> 
> I assume you'll want to bypass some more message, and then this should be
> rearchitected a bit, but ok hack  for now.

The session store message was necessary for tab shutdown to complete normally.  I added comments for these to DirectMessageToMiddleman().
Attachment #8981733 - Attachment is obsolete: true
Attachment #8982261 - Flags: review?(honzab.moz)
This patch sets the STYLO_THREADS environment variable when we initialize record/replay state, which is near the start of the main routine.
Attachment #8981737 - Attachment is obsolete: true
Attachment #8981939 - Attachment is obsolete: true
Attachment #8982262 - Flags: review?(xidorn+moz)
Comment on attachment 8981735 [details] [diff] [review]
Part 12 - Don't create background hang thread in middleman processes.

Review of attachment 8981735 [details] [diff] [review]:
-----------------------------------------------------------------

Makes sense to me!
Attachment #8981735 - Flags: review?(nfroyd) → review+
Attachment #8981731 - Flags: review?(aklotz) → review+
Attachment #8981727 - Flags: review?(mrbkap) → review+
Attachment #8982262 - Flags: review?(xidorn+moz) → review+
>This is a developer only feature that is only enabled in Mac nightlies.

If it's enabled in builds we are sending out to "consumers" rather than ones you build yourself then I think we'd need to be very careful here.

>Devtools can interact with recording/replaying processes, but not other content processes (no IPC is performed with these processes that could not be performed with any other content process).

Is the recording or playback process exposed to data that is coming from untrusted sources? (Why is it a content process?)
(In reply to Gian-Carlo Pascutto [:gcp] from comment #43)
> Is the recording or playback process exposed to data that is coming from
> untrusted sources? (Why is it a content process?)

Yes, it is exposed to the same sources of data that a normal content process would be.  It is able to do the same things as any other content process, except that it records its behaviors for later replaying.

I'm going to experiment with enabling the sandbox as normal for recording/replaying processes, as this issue was also raised in bug 1465287 comment 17.
Attachment #8982261 - Flags: review?(honzab.moz) → review+
Comment on attachment 8981734 [details] [diff] [review]
Part 11 - Disable sandbox in recording/replaying processes.

With the changes in bug 1466877 this patch is no longer necessary: the sandbox is now enabled in recording/replaying and middleman processes.
Attachment #8981734 - Attachment is obsolete: true
Component: General → Web Replay
Pushed by bhackett@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a0ee9e803ff2
Part 14 - Use a single thread in stylo when recording/replaying, r=xidorn.
Whiteboard: leave-open
Pushed by bhackett@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3fbe20aba2e8
Part 1 - Always use the chrome tab group in middleman processes, r=smaug.
https://hg.mozilla.org/integration/mozilla-inbound/rev/9124ba27f946
Part 2 - Allow calling ProcessGlobal::Get during GC, r=smaug.
https://hg.mozilla.org/integration/mozilla-inbound/rev/9415c8950951
Part 3 - Handle certain frame messages in middleman processes, r=smaug.
https://hg.mozilla.org/integration/mozilla-inbound/rev/1d25793b823c
Part 4 - Don't construct CPOW managers in middleman processes, r=smaug.
https://hg.mozilla.org/integration/mozilla-inbound/rev/8bf1f931c2c0
Part 5 - Only send first idle message from the middleman process, r=mrbkap.
https://hg.mozilla.org/integration/mozilla-inbound/rev/d852c5eb2dd8
Part 6 - Don't initialize audio IPC connection in middleman process, r=kinetik.
https://hg.mozilla.org/integration/mozilla-inbound/rev/54ef0b4722bc
Part 7 - Don't send PHal message from middleman processes, r=gsvelto.
https://hg.mozilla.org/integration/mozilla-inbound/rev/4d1f89b98ab9
Part 8 - Don't enable crash reporter in recording/replaying processes, r=aklotz.
https://hg.mozilla.org/integration/mozilla-inbound/rev/0fe978adaddc
Part 11 - Don't create Necko children in middleman processes, r=mayhemer.
https://hg.mozilla.org/integration/mozilla-inbound/rev/555311d0ffc4
Part 12 - Don't create background hang thread in middleman processes, r=froydnj.
https://hg.mozilla.org/integration/mozilla-inbound/rev/37322330c373
Part 13 - Don't create printing proxy child in middleman processes, r=mconley.
https://hg.mozilla.org/integration/mozilla-inbound/rev/3899be82452e
Part 15 - Don't paint from interrupt callback when recording/replaying, r=mconley.
https://hg.mozilla.org/integration/mozilla-inbound/rev/278522a7668f
Part 16 - Support web workers when recording/replaying, r=smaug.
Pushed by bhackett@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/7d8fba279aef
Part 9 - Avoid triggering unhandled recording divergences in XPConnect, r=mccr8.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: