data written to pipe/socketpair get lost in b2g emulator

RESOLVED FIXED

Status

defect
RESOLVED FIXED
5 years ago
3 years ago

People

(Reporter: bagder, Assigned: bagder)

Tracking

unspecified
x86_64
Linux

Firefox Tracking Flags

(firefox48 fixed)

Details

Attachments

(1 attachment, 1 obsolete attachment)

Assignee

Description

5 years ago
In bug 1008091 I've been working on code that starts a new thread on FxOS (and Linux) and when shutting it down, it indicates death to the child thread using a pipe - the parent thread writes to it and the child thread poll()s it.

When running this code in the b2g emulator (in the try-runs), at seemingly random intervals (but still frequent enough to be a problem) data that is written to the pipe doesn't arrive in the other end and thus it doesn't wake up poll() in the waiting thread. The result is a thread that doesn't terminate when it is supposed to. I've also tried to replace the pipe with a socketpair, but with seemlingly the same end result.

This problem gets visible best when running xpcshell tests in the emulator. It is very rare that all tests manage to run without at least one of them getting hung on this problem and the test gets killed due to timeout (after 300 seconds).

In bug 1008091 I intend to proceed and implement a work-around that detects when it runs in the emulator and avoids this situation only then, but it is awkward and ugly. The work-around writes to a shared variable and it makes the poll() timeout every NN milliseconds to check that variable.

To repeat this problem, remove the variable-trick in nsNotifyAddrListener_Linux.cpp and run the xpcshell tests in the emulator, or before that patch has landed an earlier version of the patch that doesn't use the variable trick can be used.
I'm wondering if this is related to the other emulator flakiness we've been seeing for ages.
I had a look into this and I think I have an idea of what's going on.

Here's part of nsNotifyAddrListener::Init():

...
348: rv = NS_NewNamedThread("Link Monitor", getter_AddRefs(mThread), this);
     NS_ENSURE_SUCCESS(rv, rv);

 #ifdef MOZ_NUWA_PROCESS
     nsCOMPtr<nsIRunnable> runner = new NuwaMarkLinkMonitorThreadRunner();
     mThread->Dispatch(runner, NS_DISPATCH_NORMAL);
 #endif

356: if (-1 == pipe(mShutdownPipe)) {
         return NS_ERROR_FAILURE;
     }
...

And here's part of nsNotifyAddrListener::Run():

...
     struct pollfd fds[2];
260: fds[0].fd = mShutdownPipe[0];
...
     while (!shutdown) {
279:     int rc = EINTR_RETRY(poll(fds, 2, pollTimeout));
...

Note that mShutdownPipe isn't actually set up (on line 356) until after the thread is created (on line 348). If the thread runs before that call to pipe, line 260 will store -1 in fds[0].fd. Then, the call to poll on line 279 will be waiting on some nonexistent fd -1. If pollTimeout is -1, that call will never return, leading to hangs.
Assignee

Comment 4

4 years ago
I can only but agree and say that it seems obvious now that you point it out. And probably the timing was so that this only hit us on the b2g emulator.
Assignee

Comment 5

4 years ago
Just moving the setting up the pipe should do it then. You agree? I'll take it on a try-run adventure too and see how it plays.
Attachment #8667862 - Flags: feedback?(dkeeler)
Comment on attachment 8667862 [details] [diff] [review]
0001-bug-1112499-set-up-shutdown-pipe-before-new-thread-s.patch

Review of attachment 8667862 [details] [diff] [review]:
-----------------------------------------------------------------

This should do it, yes. Do you think it's still necessary to have the shutdown indicator variable/B2G workaround?
Attachment #8667862 - Flags: feedback?(dkeeler) → feedback+
Assignee

Comment 9

4 years ago
This second try-run was a test that didn't use the shutdown variable and then we're back to getting lots of mochitest timeouts on the B2G ICS emulator. It would indicate that it doesn't receive the shutdown signal appropriately.

Clearly there's something still lurking in there.
Daniel, can you move forward with landing this? Moving the call to pipe is at least more correct than what's going on now (without this fix, on my linux box xpcshell tests will periodically hang for 5 minutes until they're killed and then (successfully) re-run, which hurts productivity). If there's more to investigate, we can do that in a follow-up bug.
Flags: needinfo?(daniel)
Assignee

Comment 12

3 years ago
Okidoki, thanks for bringing this back to my attention - it had fallen through some cracks and ended up in a dusty corner of mine. Here's a squashed and rebased version of the patch. A try-run is in progress as I write this: https://treeherder.mozilla.org/#/jobs?repo=try&revision=dd9136e8c152

I'll PTO the next week so if you feel confident, feel free to move for commit. Otherwise I'll make sure it happens when I get back.

I also realized I didn't get an r+ from anyone on this. Are you okay with it?
Attachment #8667862 - Attachment is obsolete: true
Flags: needinfo?(daniel)
Attachment #8724257 - Flags: review?(dkeeler)
Comment on attachment 8724257 [details] [diff] [review]
v2-0001-bug-1112499-set-up-shutdown-pipe-before-new-threa.patch

Review of attachment 8724257 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM, but I suppose a necko peer should sign off.
Attachment #8724257 - Flags: review?(dkeeler) → review?(mcmanus)
(Also, thanks for the quick response here.)
Attachment #8724257 - Flags: review?(mcmanus) → review+
Assignee

Updated

3 years ago
Assignee: nobody → daniel
Keywords: checkin-needed

Comment 16

3 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/c12df2e6a862
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.