Closed Bug 1131375 Opened 5 years ago Closed 5 years ago

[e10s] Sync messages cause async messages to be processed out of order when dispatched from the in content child process

Categories

(Core :: DOM: Content Processes, defect)

x86
macOS
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla40
Tracking Status
e10s m6+ ---
firefox40 --- fixed

People

(Reporter: mossop, Assigned: billm, NeedInfo)

Details

Attachments

(2 files, 6 obsolete files)

Attached patch testcaseSplinter Review
When a frame script is loaded in the main process (for a tab that isn't loaded remotely) sending a sync messages causes async messages to be processed out of order.

STR:

In a framescript for a frame loaded in the main process:

1. Send some async messages through the frame message manager.
2. Send some async messages through the process message manager.
3. Send a sync message through the process message manager.

In the parent we would expect to see the messages arrive in the order sent. Instead sending a sync message through the process message manager flushes all async messages from the process message manager immediately (https://dxr.mozilla.org/mozilla-central/source/dom/base/nsFrameMessageManager.cpp#1834) so we see the messages in the order 2, 3, 1.

The same is true if we swap the frame and process message managers (https://dxr.mozilla.org/mozilla-central/source/dom/base/nsInProcessTabChildGlobal.cpp#38).
Assignee: nobody → wmccloskey
Attached patch message-order (obsolete) — Splinter Review
Thanks Dave. Always a pleasure to fix these bugs with tests attached :-).

This patch adds a singleton class that holds all the messages queued by the <browser> message manager as well as the process message manager. That way everything can be flushed at once.
Attachment #8563104 - Flags: review?(bugs)
Attached patch script-load-order (obsolete) — Splinter Review
I also noticed a problem where frame scripts that are loaded after a message is sent can be run before the handler for the message. I think we should fix this too. I just made frame script loading always go through the event loop like other messages.
Attachment #8563107 - Flags: review?(bugs)
Attached patch frame-script-order-test (obsolete) — Splinter Review
Here's a test for the frame script ordering issue.
Attachment #8563108 - Flags: review?(dtownsend)
Attached patch script-load-order v2 (obsolete) — Splinter Review
Found a bug on try.
Attachment #8563107 - Attachment is obsolete: true
Attachment #8563107 - Flags: review?(bugs)
Attachment #8563169 - Flags: review?(mconley)
Hey billm - curious to know if you actually meant to retarget this review request to me. nsInProcessTabChildGlobal.cpp is something I'm only tangentially familiar with (at best) - smaug was probably the better choice.

Or was there something particular in there you wanted me to gnaw on?
Flags: needinfo?(wmccloskey)
Attachment #8563108 - Flags: review?(dtownsend) → review+
Comment on attachment 8563169 [details] [diff] [review]
script-load-order v2

Sorry, yeah, this was meant for smaug.
Flags: needinfo?(wmccloskey)
Attachment #8563169 - Flags: review?(mconley) → review?(bugs)
Comment on attachment 8563169 [details] [diff] [review]
script-load-order v2

Let's wait on these until I have a green try run. This code is a little trickier than I thought.
Attachment #8563169 - Flags: review?(bugs)
Attached patch message-order v2 (obsolete) — Splinter Review
This one is green after fixing a leak. I'll have to debug the frame script loading patch some more.
Attachment #8563104 - Attachment is obsolete: true
Attachment #8564455 - Flags: review?(bugs)
Comment on attachment 8564455 [details] [diff] [review]
message-order v2

>+SameProcessMessageQueue::~SameProcessMessageQueue()
>+{
>+  MOZ_ASSERT(mQueue.IsEmpty());
It is not clear to me why mQueue couldn't be non-empty here.


>+SameProcessMessageQueue::Push(Runnable* aRunnable)
>+{
>+  mQueue.AppendElement(aRunnable);
>+  NS_DispatchToCurrentThread(aRunnable);
>+}
>+
>+void
>+SameProcessMessageQueue::Flush()
>+{
>+  nsTArray<nsRefPtr<Runnable>> queue;
>+  mQueue.SwapElements(queue);
>+  for (size_t i = 0; i < queue.Length(); i++) {
>+    nsRefPtr<Runnable> runnable = queue[i];
Why you need nsRefPtr here?
nsTArray should guarantee the object stays alive

>+  class Runnable : public nsIRunnable {
nit, { goes to its own line.

>+  public:
>+    explicit Runnable(nsIRunnable* aRunnable);
>+
>+    NS_DECL_ISUPPORTS
>+    NS_DECL_NSIRUNNABLE
>+
>+  private:
>+    virtual ~Runnable() {}
>+
>+    bool mDispatched;
>+    nsCOMPtr<nsIRunnable> mRunnable;
>+  };
So this setup is a bit odd. Wrapping other runnables into this one.
Why not just make the relevant runnables classes to inherit this new Runnable class
instead of the current nsRunnable. Less objects, less malloc'ing and free'ing. 
You may want to change Run() in those other classes to HandleMessage or some such, and check
mDispatched before calling that method.
Attachment #8564455 - Flags: review?(bugs) → review-
Attached patch message-order v3 (obsolete) — Splinter Review
Attachment #8564455 - Attachment is obsolete: true
Attachment #8565190 - Flags: review?(bugs)
Comment on attachment 8565190 [details] [diff] [review]
message-order v3

(at some point we may want to implement the queue as linked list)
Attachment #8565190 - Flags: review?(bugs) → review+
And Gaia python integration test failures (every suite was hitting failures like the one below).
https://treeherder.mozilla.org/logviewer.html#?job_id=6734899&repo=mozilla-inbound
Attached patch message-order v4Splinter Review
Fixes the windows build failure.
Attachment #8565190 - Attachment is obsolete: true
Attachment #8569545 - Flags: review+
Attachment #8563169 - Attachment is obsolete: true
Attachment #8563108 - Attachment is obsolete: true
Fabrice, could you maybe help out here? These patches pass all tests except for the Gaia UI tests. All they do is to ensure that messages sent using process message managers are ordered correctly with respect to messages sent using frame message managers in single-process mode.

Here's a try run that shows the failures:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4993f1df9833

I followed all the instructions for building a b2g Linux desktop build and running the UI tests. I just can't get the problem to reproduce. And yet it seems to happen all the time on tinderbox.

I have no idea how to debug this otherwise since the tests don't seem to generate any logs.

I would very much appreciate any help. I've spent hours on this and I'm on the verge of abandoning this bug.
Flags: needinfo?(fabrice)
I can't reproduce locally either :(

James & Gareth, who could help debug that?
Flags: needinfo?(jlal)
Flags: needinfo?(gaye)
Flags: needinfo?(fabrice)
Perhaps Geo or Davehunt?
Flags: needinfo?(jlal)
Flags: needinfo?(gmealer)
Flags: needinfo?(gaye)
Flags: needinfo?(dave.hunt)
I've noticed similar crashes in today's test results, so it may not be related to this patch. See bug 1137653.
Any update here Dave? I'm having trouble figuring out what's going on in bug 1137653 and how it relates to this one.
Flags: needinfo?(dave.hunt)
I think the frequency of the crashes has reduced. Does try still reproduce this?
Flags: needinfo?(dave.hunt)
Dave, could you please take another look at this? I looked through the different logs and nothing pops out. These tests fail consistently on try but so far no one has been able to reproduce the problem.
Flags: needinfo?(dave.hunt)
Looking at the latest try results, we appear to be timing out whilst executing an async script to launch an app. Here we're waiting for the 'appopen' event to fire, see https://github.com/mozilla-b2g/gaia/blob/864e60fd182528de9e3eba1d8153667b2caeba6f/tests/atoms/gaia_apps.js#L284 is it possible that with this patch the event is not being fired? Have you tried reproducing the failure with the same environment as buildbot?
Flags: needinfo?(dave.hunt)
https://hg.mozilla.org/mozilla-central/rev/d90378f90513
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
I'm still seeing async messages processed out of order in nightly.  I don't know if it is being caused by sync messages or something else.  I'm listening to the messages in a javascript component and nowhere else.  I'm sending several async messages from a stream listener in a frame script, passing the stream listener callbacks up to the component.  Obviously the opnStartRequest/OnDataAvaialable/OnStopRequest happen in the correct order in content, although very rapidly, but the corresponding async messages are received in random order in the component.
You need to log in before you can comment on or make changes to this bug.