Closed Bug 1112499 Opened 10 years ago Closed 8 years ago

data written to pipe/socketpair get lost in b2g emulator


(Firefox OS Graveyard :: Emulator, defect)

Not set


(firefox48 fixed)

Tracking Status
firefox48 --- fixed


(Reporter: bagder, Assigned: bagder)




(1 file, 1 obsolete file)

In bug 1008091 I've been working on code that starts a new thread on FxOS (and Linux) and when shutting it down, it indicates death to the child thread using a pipe - the parent thread writes to it and the child thread poll()s it.

When running this code in the b2g emulator (in the try-runs), at seemingly random intervals (but still frequent enough to be a problem) data that is written to the pipe doesn't arrive in the other end and thus it doesn't wake up poll() in the waiting thread. The result is a thread that doesn't terminate when it is supposed to. I've also tried to replace the pipe with a socketpair, but with seemlingly the same end result.

This problem gets visible best when running xpcshell tests in the emulator. It is very rare that all tests manage to run without at least one of them getting hung on this problem and the test gets killed due to timeout (after 300 seconds).

In bug 1008091 I intend to proceed and implement a work-around that detects when it runs in the emulator and avoids this situation only then, but it is awkward and ugly. The work-around writes to a shared variable and it makes the poll() timeout every NN milliseconds to check that variable.

To repeat this problem, remove the variable-trick in nsNotifyAddrListener_Linux.cpp and run the xpcshell tests in the emulator, or before that patch has landed an earlier version of the patch that doesn't use the variable trick can be used.
I'm wondering if this is related to the other emulator flakiness we've been seeing for ages.
I had a look into this and I think I have an idea of what's going on.

Here's part of nsNotifyAddrListener::Init():

348: rv = NS_NewNamedThread("Link Monitor", getter_AddRefs(mThread), this);
     NS_ENSURE_SUCCESS(rv, rv);

     nsCOMPtr<nsIRunnable> runner = new NuwaMarkLinkMonitorThreadRunner();
     mThread->Dispatch(runner, NS_DISPATCH_NORMAL);

356: if (-1 == pipe(mShutdownPipe)) {
         return NS_ERROR_FAILURE;

And here's part of nsNotifyAddrListener::Run():

     struct pollfd fds[2];
260: fds[0].fd = mShutdownPipe[0];
     while (!shutdown) {
279:     int rc = EINTR_RETRY(poll(fds, 2, pollTimeout));

Note that mShutdownPipe isn't actually set up (on line 356) until after the thread is created (on line 348). If the thread runs before that call to pipe, line 260 will store -1 in fds[0].fd. Then, the call to poll on line 279 will be waiting on some nonexistent fd -1. If pollTimeout is -1, that call will never return, leading to hangs.
I can only but agree and say that it seems obvious now that you point it out. And probably the timing was so that this only hit us on the b2g emulator.
Just moving the setting up the pipe should do it then. You agree? I'll take it on a try-run adventure too and see how it plays.
Attachment #8667862 - Flags: feedback?(dkeeler)
Comment on attachment 8667862 [details] [diff] [review]

Review of attachment 8667862 [details] [diff] [review]:

This should do it, yes. Do you think it's still necessary to have the shutdown indicator variable/B2G workaround?
Attachment #8667862 - Flags: feedback?(dkeeler) → feedback+
This second try-run was a test that didn't use the shutdown variable and then we're back to getting lots of mochitest timeouts on the B2G ICS emulator. It would indicate that it doesn't receive the shutdown signal appropriately.

Clearly there's something still lurking in there.
Daniel, can you move forward with landing this? Moving the call to pipe is at least more correct than what's going on now (without this fix, on my linux box xpcshell tests will periodically hang for 5 minutes until they're killed and then (successfully) re-run, which hurts productivity). If there's more to investigate, we can do that in a follow-up bug.
Flags: needinfo?(daniel)
Okidoki, thanks for bringing this back to my attention - it had fallen through some cracks and ended up in a dusty corner of mine. Here's a squashed and rebased version of the patch. A try-run is in progress as I write this:

I'll PTO the next week so if you feel confident, feel free to move for commit. Otherwise I'll make sure it happens when I get back.

I also realized I didn't get an r+ from anyone on this. Are you okay with it?
Attachment #8667862 - Attachment is obsolete: true
Flags: needinfo?(daniel)
Attachment #8724257 - Flags: review?(dkeeler)
Comment on attachment 8724257 [details] [diff] [review]

Review of attachment 8724257 [details] [diff] [review]:

LGTM, but I suppose a necko peer should sign off.
Attachment #8724257 - Flags: review?(dkeeler) → review?(mcmanus)
(Also, thanks for the quick response here.)
Attachment #8724257 - Flags: review?(mcmanus) → review+
Assignee: nobody → daniel
Keywords: checkin-needed
Closed: 8 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.