Intermittent failure to receive sandbox broker responses (SandboxBrokerTest.MultiThreadOpen, SandboxBrokerTest.MultiThreadStat, or “Sandbox: Unexpected EOF” by itself near non-shutdown crashes)

RESOLVED FIXED in Firefox 63

Status

Priority: P1
Severity: normal
Status: RESOLVED FIXED
Reported: 3 years ago
Last modified: 4 days ago

People

(Reporter: KWierso, Assigned: jld)

Tracking

(Blocks: 1 bug, {intermittent-failure})

Version: 43 Branch
Target Milestone: mozilla63
Keywords: intermittent-failure
Points: ---

Firefox Tracking Flags

(firefox48 wontfix, firefox49 wontfix, firefox50 wontfix, firefox51 wontfix, firefox63 fixed)

Details

(Whiteboard: sb+)


Attachments

(2 attachments)

(Assignee)

Comment 1

3 years ago
-5 means this:

> #define EIO              5      /* I/O error */

But that's not from the actual open(); this is also in the test log:

> Sandbox: Unexpected EOF, op 0 flags 00 path /dev/zero

That's from the broker client.  The parent is supposed to write a SandboxBrokerCommon::Response and attach a file descriptor; the child got EOF instead.  I need to look at this a little more closely.
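For illustration, here is a rough sketch (not the actual SandboxBrokerCommon code; the Response layout and names are made up) of how the client side of this kind of broker receives a status struct plus a file descriptor over an AF_UNIX socket, and where a zero-length recvmsg() turns into the "Unexpected EOF" message above:

#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

struct Response {     /* hypothetical stand-in for SandboxBrokerCommon::Response */
  int mError;         /* 0 on success, otherwise -errno (e.g. -EIO) */
};

/* Returns 0 and fills *resp / *fd_out on success, -1 on EOF or error. */
static int recv_response(int sockfd, struct Response *resp, int *fd_out)
{
  char cmsgbuf[CMSG_SPACE(sizeof(int))];
  struct iovec iov = { .iov_base = resp, .iov_len = sizeof(*resp) };
  struct msghdr msg = {
    .msg_iov = &iov, .msg_iovlen = 1,
    .msg_control = cmsgbuf, .msg_controllen = sizeof(cmsgbuf),
  };

  ssize_t n = recvmsg(sockfd, &msg, 0);
  if (n <= 0) {
    /* n == 0 is the "Unexpected EOF" case: the peer's end of the
     * response socket was closed and no data was delivered. */
    return -1;
  }

  *fd_out = -1;
  for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c != NULL; c = CMSG_NXTHDR(&msg, c)) {
    if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS) {
      memcpy(fd_out, CMSG_DATA(c), sizeof(int));   /* the opened file descriptor */
    }
  }
  return 0;
}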
Component: Security → Security: Process Sandboxing
(Assignee)

Comment 2

3 years ago
To clarify: for the gtests this is in multiple threads in the same process; the “child” role is taken by the main thread and the “parent” is the broker thread.

I've heard about implementations of AF_UNIX sockets having bugs when a socket is closed and there's outstanding data not yet read by the peer, but I think that was specific to OS X.  Linux seems to handle that fine in the general case; I tried adding some nanosleep() calls to adjust the timing but they didn't reproduce this.

This code isn't being used for real on desktop yet, but it (or something equivalent) will be needed eventually; it is used on B2G, but B2G is now Tier 3.  So this test could be disabled if the failures become too common and there's no solution in sight, but that would invite bit rot.

This particular test invokes the broker client in a tight loop on multiple threads concurrently, to try to catch thread-unsafety in the client or broker bugs that would mix up different requests' responses somehow.  It's almost but not quite entirely implausible that there's an uncommon race condition in the Linux kernel this is being run on (still a heavily patched 3.2.x, I think?), which went undetected because normal workloads don't stress the SCM_RIGHTS code like that.
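The shape of the test is roughly the following simplified sketch, with a plain open() standing in for the brokered call; thread count here is illustrative, and the real test (in security/sandbox/linux/gtest/TestBroker.cpp) loops 10,000 times per thread:

#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

#define NUM_THREADS 8      /* illustrative */
#define ITERATIONS 10000   /* the real test loops 10,000 times per thread */

static void *worker(void *arg)
{
  (void)arg;
  for (int i = 0; i < ITERATIONS; ++i) {
    /* In the real test this call goes through the broker client, which
     * sends a request to the broker thread and waits for the response. */
    int fd = open("/dev/zero", O_RDONLY);
    assert(fd >= 0);       /* the intermittent failure: the brokered call returns -EIO */
    close(fd);
  }
  return NULL;
}

int main(void)
{
  pthread_t threads[NUM_THREADS];
  for (int i = 0; i < NUM_THREADS; ++i)
    pthread_create(&threads[i], NULL, worker, NULL);
  for (int i = 0; i < NUM_THREADS; ++i)
    pthread_join(threads[i], NULL);
  return 0;
}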
(Assignee)

Updated

3 years ago
Duplicate of this bug: 1246436
(Assignee)

Updated

3 years ago
Summary: Intermittent SandboxBrokerTest.MultiThreadOpen | Expected: (zerofd) >= (0), actual: -5 vs 0 | test completed (time: 1198ms) → Intermittent SandboxBrokerTest.MultiThreadOpen or SandboxBrokerTest.MultiThreadStat | Expected: (zerofd) >= (0), actual: -5 vs 0 | test completed (time: 1198ms)

Updated

2 years ago
Blocks: 1257239
Whiteboard: sb+
6 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 5
* fx-team: 1

Platform breakdown:
* linux64: 6

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2016-06-20&endday=2016-06-26&tree=all
5 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 2
* mozilla-central: 2
* mozilla-beta: 1

Platform breakdown:
* linux64: 5

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2016-08-01&endday=2016-08-07&tree=all
(Assignee)

Updated

2 years ago
Duplicate of this bug: 1295421
(Assignee)

Updated

2 years ago
Duplicate of this bug: 1296200

Comment 8

2 years ago
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
11 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* autoland: 5
* mozilla-inbound: 2
* fx-team: 2
* mozilla-central: 1
* mozilla-aurora: 1

Platform breakdown:
* linux64: 11

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2016-08-29&endday=2016-09-04&tree=all
8 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* autoland: 5
* mozilla-inbound: 2
* fx-team: 1

Platform breakdown:
* linux64: 8

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2016-09-05&endday=2016-09-11&tree=all
10 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 4
* autoland: 4
* try: 1
* fx-team: 1

Platform breakdown:
* linux64: 10

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2016-09-12&endday=2016-09-18&tree=all
status-firefox48: --- → wontfix
status-firefox49: --- → wontfix
status-firefox50: --- → affected
status-firefox51: --- → affected
6 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 4
* mozilla-aurora: 1
* autoland: 1

Platform breakdown:
* linux64: 6

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2016-09-19&endday=2016-09-25&tree=all
(Assignee)

Comment 13

2 years ago
(In reply to Jed Davis [:jld] {⏰UTC-7} from comment #2)
> This code isn't being used for real on desktop yet, but it (or something
> equivalent) will be needed eventually

“Eventually” is now: we've been using this code on Nightly since bug 1289718 landed.

It would really help if this bug could be reproduced under rr.  I tried running the tests locally under rr chaos mode, but no luck.

Updated

2 years ago
Blocks: 1317084
(Assignee)

Updated

2 years ago
Summary: Intermittent SandboxBrokerTest.MultiThreadOpen or SandboxBrokerTest.MultiThreadStat | Expected: (zerofd) >= (0), actual: -5 vs 0 | test completed (time: 1198ms) → Intermittent SandboxBrokerTest.MultiThreadOpen or SandboxBrokerTest.MultiThreadStat | Expected: (zerofd) >= (0) or (nullfd) >= (0), actual: -5 vs 0 | test completed (time: 1198ms)
(Assignee)

Updated

2 years ago
Duplicate of this bug: 1317084
1 failure in 822 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* autoland: 1

Platform breakdown:
* linux32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2017-07-17&endday=2017-07-23&tree=all
1 failure in 888 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* autoland: 1

Platform breakdown:
* linux64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2017-07-31&endday=2017-08-06&tree=all
1 failure in 939 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* autoland: 1

Platform breakdown:
* linux32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2017-08-28&endday=2017-09-03&tree=all
1 failure in 1032 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* try: 1

Platform breakdown:
* linux64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2017-09-11&endday=2017-09-17&tree=all
1 failure in 792 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 1

Platform breakdown:
* linux64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2017-11-27&endday=2017-12-03&tree=all
2 failures in 590 pushes (0.003 failures/push) were associated with this bug in the last 7 days.    

Repository breakdown:
* try: 2

Platform breakdown:
* linux64-qr: 1
* linux64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2017-12-18&endday=2017-12-24&tree=all
1 failure in 147 pushes (0.007 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* autoland: 1

Platform breakdown:
* linux64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2017-12-25&endday=2017-12-31&tree=all
(Assignee)

Updated

8 months ago
See Also: → bug 1421201
(Assignee)

Comment 22

8 months ago
This seems to be turning up “for real” now that we're using the broker on desktop: see bug 1420475 and bug 1421201.  I added a log message in bug 1422198 for the case where sendmsg() fails, but that log never fires.  So the message is sent but never arrives.

I believe we're using much newer kernels for CI now (new enough to include unprivileged user namespaces and seccomp thread sync, at least), so the last part of comment #2 is probably not it.

I suppose it could be a kernel “feature”, like if it's not guaranteed that the sendmsg->close->recvmsg ordering won't drop the message but in practice that happens only in some odd edge case, but I don't know how much sense that makes.  (That *is* a problem with TCP, if the data is buffered locally but not transmitted yet — see http://blog.csdn.net/CPP_CHEN/article/details/29864509 — but there's no reason that would apply to AF_UNIX.  In any case, Chromium is doing the same thing we are, although they may not be using this kind of brokering as much as we do.)

In another bug I was speculating about a double-close or use-after-close bug somewhere, but I'm having trouble coming up with a plausible scenario where we wouldn't at least sometimes see actual errors, not just the message disappearing.

If the “missing” message was getting written to the wrong request's response socket, *after* the correct response was sent (these tests, if I recall correctly from when I wrote them years ago, try to check for concurrent requests getting their responses swapped), then that would show up like this.  But I don't have a plausible mechanism for how a bug would get us into that state without sometimes causing other failures instead.  This at least would be relatively easy to test: do a second recvmsg in the client and assert that it gets EOF.

I'm really just guessing at this point.
See Also: → bug 1420475
Whiteboard: sb+ → sb?
(Assignee)

Comment 23

8 months ago
One thing I notice in the 4 instances we still have logs for is the loop counters:

Loop 10000/10000 @ /builds/worker/workspace/build/src/security/sandbox/linux/gtest/TestBroker.cpp:419
Loop 674/10000 @ /home/worker/workspace/build/src/security/sandbox/linux/gtest/TestBroker.cpp:408
Loop 9397/10000 @ /builds/worker/workspace/build/src/security/sandbox/linux/gtest/TestBroker.cpp:420
Loop 42/10000 @ /builds/worker/workspace/build/src/security/sandbox/linux/gtest/TestBroker.cpp:419

They're all near the beginning or end of the loop (so maybe when other threads are starting/exiting?), and one of them is the last of 10k iterations, which is kind of suspect just by itself.


(Also, the canonical version of that article about TCP seems to be https://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable )
2 failures in 462 pushes (0.004 failures/push) were associated with this bug in the last 7 days.    

Repository breakdown:
* mozilla-central: 1
* autoland: 1

Platform breakdown:
* linux64-qr: 1
* linux64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2018-01-01&endday=2018-01-07&tree=all
(Assignee)

Comment 25

7 months ago
Something that's worth trying: burn a few CPU-hours trying to reproduce this locally under rr.  If it replays, then I can try to see what's going on with the file descriptors.  Or, if the replay diverges, that would suggest a kernel bug.
Flags: needinfo?(jld)
2 failures in 788 pushes (0.003 failures/push) were associated with this bug in the last 7 days.    

Repository breakdown:
* try: 1
* mozilla-inbound: 1

Platform breakdown:
* linux64: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2018-01-08&endday=2018-01-14&tree=all
(Assignee)

Updated

7 months ago
Duplicate of this bug: 1433135
(Assignee)

Comment 28

7 months ago
I did run this a few dozen times in rr (with and without chaos mode), but it turns out to be essentially a worst case for recording overhead because it's almost all context switches and nontrivial syscalls; see https://github.com/mozilla/rr/issues/2159

And I haven't reproduced it yet, but it's possible that letting it run for longer might help….
(Assignee)

Comment 29

7 months ago
I've run this about 1,000 times under rr now, and the bug hasn't happened.  I also tried running it 1,000 times locally *not* under rr, and… the bug also didn't reproduce.  So this may not be a useful approach.
Flags: needinfo?(jld)

Updated

7 months ago
Whiteboard: sb? → sb+
1 failure in 735 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 1

Platform breakdown:
* linux32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2018-01-29&endday=2018-02-04&tree=all
2 failures in 702 pushes (0.003 failures/push) were associated with this bug in the last 7 days.    

Repository breakdown:
* try: 1
* mozilla-inbound: 1

Platform breakdown:
* linux64-qr: 1
* linux32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2018-02-05&endday=2018-02-11&tree=all
1 failure in 685 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 1

Platform breakdown:
* linux64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2018-02-12&endday=2018-02-18&tree=all
1 failure in 831 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-central: 1

Platform breakdown:
* linux64-qr: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2018-02-26&endday=2018-03-04&tree=all
1 failure in 747 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-central: 1

Platform breakdown:
* linux64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2018-03-05&endday=2018-03-11&tree=all
2 failures in 814 pushes (0.002 failures/push) were associated with this bug in the last 7 days.    

Repository breakdown:
* mozilla-inbound: 2

Platform breakdown:
* linux64: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2018-03-12&endday=2018-03-18&tree=all
1 failure in 822 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* autoland: 1

Platform breakdown:
* linux64: 1

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1243108&startday=2018-03-20&endday=2018-03-26&tree=trunk
(Assignee)

Updated

4 months ago
See Also: → bug 1442989
(Assignee)

Updated

4 months ago
See Also: → bug 1439140
1 failure in 792 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 1

Platform breakdown:
* linux32: 1

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1243108&startday=2018-04-09&endday=2018-04-15&tree=trunk
2 failures in 782 pushes (0.003 failures/push) were associated with this bug in the last 7 days.    

Repository breakdown:
* autoland: 2

Platform breakdown:
* linux32: 2

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1243108&startday=2018-04-16&endday=2018-04-22&tree=trunk
(Assignee)

Comment 39

4 months ago
Unfortunately, this doesn't seem to have been a double-close caused by bug 1278361, because we're still seeing it.  (This isn't a big surprise — see earlier comments that the observables here don't make sense for that kind of bug — but it sure would have been convenient if it had mysteriously been fixed.  Sigh.)

Also, bug 1456022 seems to be another manifestation of this.
(Assignee)

Updated

4 months ago
See Also: → bug 1456022
(Assignee)

Comment 40

4 months ago
I can reproduce this locally, given an hour or two: I increased the loop count in the tests by 10x (to amortize startup/shutdown/scripting overhead) and ran mach gtest in a shell loop, with an opt build, and got 1 hit in ~400 runs; a debug build also worked but took a lot longer (1 in ~1800).  Note that this roughly corresponds to 4,000 opt runs of the normal test.

So, things that can be done:

* Send one or more fake responses with a fictitious error code (or otherwise marked) after the real response, so the client can detect the case where the response was lost.  (A sketch of this follows the list.)  This assumes that whatever is happening is a more-or-less independent chance per message, not some property of the response socket itself.

* Send the real response multiple times, to work around lost responses if that's what's going on.  (I hope we won't have to do that.)

* If I can be sufficiently sure that an unexpected EOF means the request was never received (see above), add a retry loop in the client.  (Retrying operations that may or may not have already happened is… delicate, and I don't want to go there if I don't have to; see also NFS.)

* Add a variant of the multithread test where the clients alternate opening a per-thread temp file with O_CREAT | O_EXCL and unlinking it — either a dropped request or a duplicated one will cause an error.  (mkdir + rmdir would also work.)
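A minimal sketch of the first option in that list, server side (the Response layout and the sentinel value are made up; the real reply also carries an fd via SCM_RIGHTS, omitted here for brevity):

#include <sys/socket.h>

struct Response { int mError; };   /* hypothetical, as in the earlier sketch */
#define SENTINEL_ERROR (-9999)     /* arbitrary "should never be observed" code */

static void send_reply_with_sentinel(int respfd, const struct Response *real)
{
  /* The real reply; the client normally reads exactly this one message. */
  (void)send(respfd, real, sizeof(*real), MSG_NOSIGNAL);

  /* A marked dummy follows.  If the client ever *receives* the sentinel,
   * the real response was dropped in transit rather than never sent. */
  struct Response fake = { .mError = SENTINEL_ERROR };
  (void)send(respfd, &fake, sizeof(fake), MSG_NOSIGNAL);
}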
(Assignee)

Comment 41

4 months ago
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #40)
> * Send one or more fake responses with a fictitious error code (or otherwise
> marked) after the real response, so the client can detect the case where the
> response was lost.

I did this and ran it overnight: 4 hits in ~1600 runs, all of them apparently a lost request rather than a lost response.

> * Add a variant of the multithread test where the clients alternate opening
> a per-thread temp file with O_CREAT | O_EXCL and unlinking it — either a
> dropped request or a duplicated one will cause an error.  (mkdir + rmdir
> would also work.)

I did this (both creat+unlink and mkdir+rmdir) and ran it all afternoon and evening, and I got one error after 1388 runs (~2.78e9 broker calls) — a lost rmdir that succeeded when reissued.  Changing the operations seems to affect the error rate; this may be significant.

But that test closed the file before the unlink.  I modified it to close the file last (so that it's the broker client thread and not the server that causes the inode to be freed), and it's failed twice in only 160 iterations (80e6 calls): one a lost unlink that succeeded on retry, and one a lost open that did *not* — it failed with EEXIST.  And that's *with* the change to add a second response.  So now this is making even less sense than it did before.
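For concreteness, the creat+unlink+close variant looks roughly like this as a standalone loop, with plain syscalls standing in for the brokered ones (in the real test both the open and the unlink go through the broker client, so a dropped or duplicated request surfaces as an unexpected error):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
  char path[64];
  snprintf(path, sizeof(path), "/tmp/brokertest.%d", (int)getpid());

  for (int i = 0; i < 10000; ++i) {
    /* O_EXCL: a duplicated create in the brokered version shows up as EEXIST. */
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* A dropped or duplicated unlink shows up as an error here. */
    if (unlink(path) != 0) { perror("unlink"); return 1; }

    /* Close last: in the real test this makes the broker client thread,
     * not the broker, the one that frees the inode, which is the variant
     * that reproduced the failure far more often. */
    close(fd);
  }
  return 0;
}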


Another possibility here is writing a self-contained C program that does the same socket operations as the broker test and run *that* for a few hours, to try to rule out everything else in Gecko and maybe point toward a kernel bug.
(Assignee)

Comment 42

4 months ago
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #41)
> So now this is making even less sense than it did before.

It turns out I messed up the retry logic a little.  The open was retried, but the unlink (and rmdir from the earlier test) weren't.  So this means the requests *are* being received, but the responses are lost, even when multiple responses are sent.  This suggests that something is happening to the socket that's passed to the parent (but not to the non-socket fds passed back for the open test).

The creat+unlink+close test continues to be much more effective than the creat+close+unlink or mkdir+rmdir: 11 failures in 3041 runs (one per ~2.76e8 calls, or ~10x the rate).  4 were failed opens that failed with EEXIST when retried, and 7 were failed unlinks that did observably unlink.
1 failure in 780 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-beta: 1

Platform breakdown:
* linux64: 1

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1243108&startday=2018-04-23&endday=2018-04-29&tree=trunk
(Assignee)

Comment 44

4 months ago
New findings:

* Extra assertions (for short reads/writes, close() failing) don't fire, but do seem to reduce the repro rate, even though it's a handful of arithmetic ops and a predictable branch or two in a path that includes a nontrivial syscall, so it shouldn't be throwing off timing all that much.  Needs more data.

* Setting the response socket nonblocking and spamming extra responses until there's an error doesn't help; none of them are received.  This probably wasn't too surprising given previous results.

* The response socket's inode number in the server after it sends the reply is the same as what the client sees immediately after the socketpair() call.  (I modified the broker requests to include the inode number.)  I haven't yet tested the client end of the socket the same way; I could, but I'm doubtful.

* It doesn't appear to be a soundness bug in the Linux kernel's socket GC.  (Because sockets can be held in queued SCM_RIGHTS messages in sockets, there can be reference cycles; the kernel has a cycle collector that breaks cycles by, if I understand correctly, removing and destroying the socket buffer.)  This was tested by using ftrace to log calls to kfree_skb; a call from unix_gc can be seen when running a test program that creates a one-socket cycle, but not when the gtests fail.  (kfree_skb doesn't appear to be called for the normal socket destruction caused by the broker, so the log is low-traffic enough to read manually — about 19 calls per test run — and doesn't appear to have lost messages due to buffer overrun).

* I also tried doing a second recvmsg on both sides and asserting that they got EOF, to try to detect messages possibly being sent to the wrong socket somehow.  To make that not deadlock, I also added shutdown(fd, SHUT_WR) calls.  With that build, the bug hasn't reproduced once in 6.0e9 broker calls.

If that last result holds up, and if I can troubleshoot it down to a smaller part of the change… I don't *like* making changes without understanding why, but I could.
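For reference, the client side of that last experiment looks roughly like this (a sketch; the shutdown() is what lets the symmetric EOF check on the other side complete without deadlock):

#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Called after the expected response has already been read from respfd. */
static void assert_no_stray_messages(int respfd)
{
  char buf[64];

  /* Announce "no more writes from this end" so the peer's matching
   * EOF check can finish instead of both sides waiting on each other. */
  shutdown(respfd, SHUT_WR);

  /* EOF (0) is the only acceptable outcome; any data here would mean some
   * other request's reply was misdirected onto this response socket. */
  ssize_t n = recv(respfd, buf, sizeof(buf), 0);
  assert(n == 0);
  (void)n;   /* keep unused-variable warnings quiet if assertions are compiled out */

  close(respfd);
}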
(Assignee)

Updated

4 months ago
Duplicate of this bug: 1456022
(Assignee)

Comment 46

4 months ago
Created attachment 8972766 [details]
Simplified standalone test case

I wrote a small C program to try to reproduce this without the overhead of Gecko.  It works *very* well — it takes about 60 runs (60 million RPCs) per failure on average, but with a lot of variance (which is expected for an exponential distribution).  This is partially due to giving each client a separate dup() of the client fd; without that, it took 448 runs before the first failure.  That may be due to lock contention on the fd vs. just on the underlying socket changing the timing.

It doesn't seem to reproduce at all (not observed in 3000 runs) if confined to a single core.  It also didn't reproduce in 136 runs under rr chaos mode (where it's 60x slower).

Adding a SHUT_WR on the response socket after sending and before closing reduces but doesn't eliminate it: the first failure was after 300 runs.

So I added a sched_yield in that spot, and it lasted 1171 runs.  This suggests experimenting with ensuring that the last close on the sending end of the response socket happens-after the receive — sending a dup of it back to the client in a second message, or doing a blocking read on it (expecting EOF) before closing.
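The attachment itself isn't included here, but a simplified single-process sketch of the round trip it hammers looks like this: the client thread passes one end of a fresh response socketpair to the server thread via SCM_RIGHTS, the server replies on it and immediately closes it, and the failure mode is the final recv() returning EOF instead of the one-byte reply.  (Error handling is minimal, and the per-client dup() detail mentioned above is omitted.)

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

#define ITERATIONS 1000000   /* one "run" in the terms used above */

static int request_sock[2];  /* [0] = client end, [1] = server end */

static void send_fd(int sock, int fd, char byte)
{
  char cbuf[CMSG_SPACE(sizeof(int))];
  struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
  struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                        .msg_control = cbuf, .msg_controllen = sizeof(cbuf) };
  struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
  c->cmsg_level = SOL_SOCKET;
  c->cmsg_type = SCM_RIGHTS;
  c->cmsg_len = CMSG_LEN(sizeof(int));
  memcpy(CMSG_DATA(c), &fd, sizeof(int));
  if (sendmsg(sock, &msg, 0) != 1) { perror("sendmsg"); exit(1); }
}

static int recv_fd(int sock)
{
  char byte, cbuf[CMSG_SPACE(sizeof(int))];
  struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
  struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                        .msg_control = cbuf, .msg_controllen = sizeof(cbuf) };
  if (recvmsg(sock, &msg, 0) != 1) { perror("recvmsg"); exit(1); }
  int fd = -1;
  struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
  if (c && c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS)
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
  return fd;
}

static void *server(void *arg)
{
  (void)arg;
  for (int i = 0; i < ITERATIONS; ++i) {
    int resp = recv_fd(request_sock[1]);   /* take a request */
    char ok = 1;
    send(resp, &ok, 1, MSG_NOSIGNAL);      /* send the one-byte reply... */
    close(resp);                           /* ...and close right away (the racy part) */
  }
  return NULL;
}

int main(void)
{
  socketpair(AF_UNIX, SOCK_SEQPACKET, 0, request_sock);
  pthread_t t;
  pthread_create(&t, NULL, server, NULL);

  for (int i = 0; i < ITERATIONS; ++i) {
    int resp[2];
    socketpair(AF_UNIX, SOCK_SEQPACKET, 0, resp);
    send_fd(request_sock[0], resp[1], 0);  /* the "request", carrying our reply end */
    close(resp[1]);                        /* the server thread holds its own reference now */

    char reply;
    ssize_t n = recv(resp[0], &reply, 1, 0);
    if (n == 0) { printf("lost reply at iteration %d\n", i); exit(1); }
    if (n < 0) { perror("recv"); exit(1); }
    close(resp[0]);
  }
  pthread_join(t, NULL);
  return 0;
}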
Assignee: nobody → jld
(Assignee)

Comment 47

4 months ago
I tested with a change to make the server wait for EOF from the client, and that seems to avoid the bug (no failures in 4 billion RPCs).  It also doubles the wall time, although it might not affect latency in practice if we're not getting multithreaded back-to-back requests.

The approach of sending the socket back to the client (in a message that will either be dropped on client close, or fail to be sent if the client closes first) also holds up: it ran for 14.4 billion RPCs without failing.  That change actually *decreases* wall time slightly for some reason, although total CPU time increased somewhat.

For comparison, I also tried running the unmodified test case on FreeBSD; it ran for 13.8 billion RPCs with no failures.
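A server-side sketch of the second approach (sending the response socket back to the client); names are illustrative and this is not the actual Gecko patch:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* After the real reply, queue one more message on the response socket that
 * carries a reference to that very socket via SCM_RIGHTS.  The in-flight
 * reference keeps the socket alive past our close(), so the peer cannot
 * observe shutdown until the client has either read the queued message or
 * closed its own end (in which case the extra sendmsg simply fails and is
 * ignored). */
static void send_reply_and_close(int respfd, const void *reply, size_t len)
{
  (void)send(respfd, reply, len, MSG_NOSIGNAL);   /* the real response */

  char byte = 0;
  char cbuf[CMSG_SPACE(sizeof(int))];
  struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
  struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                        .msg_control = cbuf, .msg_controllen = sizeof(cbuf) };
  struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
  c->cmsg_level = SOL_SOCKET;
  c->cmsg_type = SCM_RIGHTS;
  c->cmsg_len = CMSG_LEN(sizeof(int));
  memcpy(CMSG_DATA(c), &respfd, sizeof(int));
  (void)sendmsg(respfd, &msg, MSG_NOSIGNAL);      /* EPIPE here just means the client already left */

  close(respfd);   /* our descriptor goes away, but the queued SCM_RIGHTS copy does not */
}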
(Assignee)

Comment 48

4 months ago
As for *why* this is happening, I think I finally have an explanation:

1. (server) recvmsg calls __skb_try_recv_datagram, sees no pending data, and prepares to sleep
2. (client) sendmsg calls skb_queue_tail to append the packet to the peer's receive queue
3. (client) close → fput → unix_release_sock sets the peer's sk_shutdown field to the equivalent of SHUT_RDWR
4. (server) recvmsg calls __skb_wait_for_more_packets, which reads sk_shutdown and indicates EOF

Corresponding source lines:

[1] https://github.com/torvalds/linux/blob/06dd3dfeea60e2a6457a6aedf97afc8e6d2ba497/net/unix/af_unix.c#L2098
[2] https://github.com/torvalds/linux/blob/06dd3dfeea60e2a6457a6aedf97afc8e6d2ba497/net/unix/af_unix.c#L1797
[3] https://github.com/torvalds/linux/blob/06dd3dfeea60e2a6457a6aedf97afc8e6d2ba497/net/unix/af_unix.c#L540
[4] https://github.com/torvalds/linux/blob/06dd3dfeea60e2a6457a6aedf97afc8e6d2ba497/net/core/datagram.c#L102
(Assignee)

Updated

3 months ago
See Also: → bug 1459553
(Assignee)

Updated

3 months ago
Priority: P3 → P1
Summary: Intermittent SandboxBrokerTest.MultiThreadOpen or SandboxBrokerTest.MultiThreadStat | Expected: (zerofd) >= (0) or (nullfd) >= (0), actual: -5 vs 0 | test completed (time: 1198ms) → Intermittent failure to receive sandbox broker responses (SandboxBrokerTest.MultiThreadOpen, SandboxBrokerTest.MultiThreadStat, or “Sandbox: Unexpected EOF” by itself near non-shutdown crashes)
(Assignee)

Updated

3 months ago
See Also: bug 1421201
Duplicate of this bug: 1421201
(Assignee)

Updated

3 months ago
See Also: bug 1420475
Duplicate of this bug: 1420475
(Assignee)

Updated

3 months ago
See Also: bug 1442989
(Assignee)

Updated

3 months ago
Duplicate of this bug: 1442989
(Assignee)

Updated

3 months ago
See Also: bug 1439140
Duplicate of this bug: 1439140
(Assignee)

Updated

3 months ago
See Also: bug 1459553
Duplicate of this bug: 1459553
(Assignee)

Updated

3 months ago
status-firefox50: affected → wontfix
status-firefox51: affected → wontfix
See Also: bug 1317084, bug 1456022
(Assignee)

Updated

3 months ago
No longer blocks: 1317084
1 failure in 660 pushes (0.002 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 1

Platform breakdown:
* linux64: 1

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1243108&startday=2018-05-07&endday=2018-05-13&tree=trunk
1 failure in 639 pushes (0.002 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-release: 1

Platform breakdown:
* linux32: 1

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1243108&startday=2018-05-14&endday=2018-05-20&tree=trunk
1 failure in 521 pushes (0.002 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-esr60: 1

Platform breakdown:
* linux64: 1

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1243108&startday=2018-06-18&endday=2018-06-24&tree=trunk
(Assignee)

Updated

2 months ago
Duplicate of this bug: 1471528
1 failure in 718 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 1

Platform breakdown:
* linux32: 1

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1243108&startday=2018-06-25&endday=2018-07-01&tree=trunk
2 failures in 622 pushes (0.003 failures/push) were associated with this bug in the last 7 days.    

Repository breakdown:
* mozilla-inbound: 2

Platform breakdown:
* osx-10-10: 1
* linux64: 1

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1243108&startday=2018-07-02&endday=2018-07-08&tree=trunk
(Assignee)

Updated

a month ago
Blocks: 1447867
Comment hidden (mozreview-request)
(Assignee)

Comment 61

a month ago
Back when I posted comment #48, I was planning to report the bug upstream first and seek input on what workaround we should use, but then other things intervened.  Instead, what I'm going to do now is land the workaround technique that's been reasonably well verified with the minimal test case, so that things stop breaking, and if later it's decided that there's a better way then we can always change it.

Comment 62

26 days ago
mozreview-review
Comment on attachment 8993568 [details]
Bug 1243108 - Work around Linux kernel bug causing broker responses to be dropped.

https://reviewboard.mozilla.org/r/258280/#review265980
Attachment #8993568 - Flags: review?(gpascutto) → review+

Comment 63

25 days ago
Pushed by jedavis@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ecae69f91831
Work around Linux kernel bug causing broker responses to be dropped. r=gcp
https://hg.mozilla.org/mozilla-central/rev/ecae69f91831
Status: NEW → RESOLVED
Last Resolved: 25 days ago
status-firefox63: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
(Assignee)

Updated

12 days ago
Duplicate of this bug: 1481345
1 failure in 2579 pushes (0.0 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 1

Platform breakdown:
* linux64: 1

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1243108&startday=2018-08-06&endday=2018-08-12&tree=all
New occurrence:

https://treeherder.mozilla.org/logviewer.html#?job_id=193734658&repo=mozilla-beta
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 68

4 days ago
The patch wasn't uplifted to Beta, so it's expected to still see failures there.  It's 3 weeks before release, and this has been broken since 54, so it probably wouldn't be accepted for uplift.
Status: REOPENED → RESOLVED
Last Resolved: 25 days ago → 4 days ago
Resolution: --- → FIXED