Intermittent failure to receive sandbox broker responses (SandboxBrokerTest.MultiThreadOpen, SandboxBrokerTest.MultiThreadStat, or “Sandbox: Unexpected EOF” by itself near non-shutdown crashes)

RESOLVED FIXED in Firefox 63

Status

Priority: P1
Severity: normal
Status: RESOLVED FIXED
Reported: 3 years ago
Last modified: 4 days ago

People

(Reporter: KWierso, Assigned: jld)

Tracking

(Blocks: 1 bug, {intermittent-failure})

Version: 43 Branch
Target Milestone: mozilla63
Keywords: intermittent-failure
Points: ---

Firefox Tracking Flags

(firefox48 wontfix, firefox49 wontfix, firefox50 wontfix, firefox51 wontfix, firefox63 fixed)

Details

(Whiteboard: sb+)


Attachments

(2 attachments)

(Assignee)

Comment 1

3 years ago
-5 means this:

> #define EIO              5      /* I/O error */

But that's not from the actual open(); this is also in the test log:

> Sandbox: Unexpected EOF, op 0 flags 00 path /dev/zero

That's from the broker client.  The parent is supposed to write a SandboxBrokerCommon::Response and attach a file descriptor; the child got EOF instead.  I need to look at this a little more closely.
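For illustration, here is a rough sketch (not the actual SandboxBrokerCommon code; the Response layout and names are made up) of how the client side of this kind of broker receives a status struct plus a file descriptor over an AF_UNIX socket, and where a zero-length recvmsg() turns into the "Unexpected EOF" message above:

#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

struct Response {     /* hypothetical stand-in for SandboxBrokerCommon::Response */
  int mError;         /* 0 on success, otherwise -errno (e.g. -EIO) */
};

/* Returns 0 and fills *resp / *fd_out on success, -1 on EOF or error. */
static int recv_response(int sockfd, struct Response *resp, int *fd_out)
{
  char cmsgbuf[CMSG_SPACE(sizeof(int))];
  struct iovec iov = { .iov_base = resp, .iov_len = sizeof(*resp) };
  struct msghdr msg = {
    .msg_iov = &iov, .msg_iovlen = 1,
    .msg_control = cmsgbuf, .msg_controllen = sizeof(cmsgbuf),
  };

  ssize_t n = recvmsg(sockfd, &msg, 0);
  if (n <= 0) {
    /* n == 0 is the "Unexpected EOF" case: the peer's end of the
     * response socket was closed and no data was delivered. */
    return -1;
  }

  *fd_out = -1;
  for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c != NULL; c = CMSG_NXTHDR(&msg, c)) {
    if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS) {
      memcpy(fd_out, CMSG_DATA(c), sizeof(int));   /* the opened file descriptor */
    }
  }
  return 0;
}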
Component: Security → Security: Process Sandboxing
(Assignee)

Comment 2

3 years ago
To clarify: for the gtests this is in multiple threads in the same process; the “child” role is taken by the main thread and the “parent” is the broker thread.

I've heard about implementations of AF_UNIX sockets having bugs when a socket is closed and there's outstanding data not yet read by the peer, but I think that was specific to OS X.  Linux seems to handle that fine in the general case; I tried adding some nanosleep() calls to adjust the timing but they didn't reproduce this.

This code isn't being used for real on desktop yet, but it (or something equivalent) will be needed eventually; it is used on B2G, but B2G is now Tier 3.  So this test could be disabled if the failures become too common and there's no solution in sight, but that would invite bit rot.

This particular test invokes the broker client in a tight loop on multiple threads concurrently, to try to catch thread-unsafety in the client or broker bugs that would mix up different requests' responses somehow.  It's almost but not quite entirely implausible that there's an uncommon race condition in the Linux kernel this is being run on (still a heavily patched 3.2.x, I think?), which went undetected because normal workloads don't stress the SCM_RIGHTS code like that.
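The shape of the test is roughly the following simplified sketch, with a plain open() standing in for the brokered call; thread count here is illustrative, and the real test (in security/sandbox/linux/gtest/TestBroker.cpp) loops 10,000 times per thread:

#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

#define NUM_THREADS 8      /* illustrative */
#define ITERATIONS 10000   /* the real test loops 10,000 times per thread */

static void *worker(void *arg)
{
  (void)arg;
  for (int i = 0; i < ITERATIONS; ++i) {
    /* In the real test this call goes through the broker client, which
     * sends a request to the broker thread and waits for the response. */
    int fd = open("/dev/zero", O_RDONLY);
    assert(fd >= 0);       /* the intermittent failure: the brokered call returns -EIO */
    close(fd);
  }
  return NULL;
}

int main(void)
{
  pthread_t threads[NUM_THREADS];
  for (int i = 0; i < NUM_THREADS; ++i)
    pthread_create(&threads[i], NULL, worker, NULL);
  for (int i = 0; i < NUM_THREADS; ++i)
    pthread_join(threads[i], NULL);
  return 0;
}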
(Assignee)

Updated

3 years ago
Duplicate of this bug: 1246436
(Assignee)

Updated

3 years ago
Summary: Intermittent SandboxBrokerTest.MultiThreadOpen | Expected: (zerofd) >= (0), actual: -5 vs 0 | test completed (time: 1198ms) → Intermittent SandboxBrokerTest.MultiThreadOpen or SandboxBrokerTest.MultiThreadStat | Expected: (zerofd) >= (0), actual: -5 vs 0 | test completed (time: 1198ms)

Updated

2 years ago
Blocks: 1257239
Whiteboard: sb+
6 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 5
* fx-team: 1

Platform breakdown:
* linux64: 6

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2016-06-20&endday=2016-06-26&tree=all
5 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 2
* mozilla-central: 2
* mozilla-beta: 1

Platform breakdown:
* linux64: 5

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2016-08-01&endday=2016-08-07&tree=all
(Assignee)

Updated

2 years ago
Duplicate of this bug: 1295421
(Assignee)

Updated

2 years ago
Duplicate of this bug: 1296200

Comment 8

2 years ago
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
11 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* autoland: 5
* mozilla-inbound: 2
* fx-team: 2
* mozilla-central: 1
* mozilla-aurora: 1

Platform breakdown:
* linux64: 11

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2016-08-29&endday=2016-09-04&tree=all
8 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* autoland: 5
* mozilla-inbound: 2
* fx-team: 1

Platform breakdown:
* linux64: 8

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2016-09-05&endday=2016-09-11&tree=all
10 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 4
* autoland: 4
* try: 1
* fx-team: 1

Platform breakdown:
* linux64: 10

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2016-09-12&endday=2016-09-18&tree=all
status-firefox48: --- → wontfix
status-firefox49: --- → wontfix
status-firefox50: --- → affected
status-firefox51: --- → affected
6 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 4
* mozilla-aurora: 1
* autoland: 1

Platform breakdown:
* linux64: 6

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2016-09-19&endday=2016-09-25&tree=all
(Assignee)

Comment 13

2 years ago
(In reply to Jed Davis [:jld] {⏰UTC-7} from comment #2)
> This code isn't being used for real on desktop yet, but it (or something
> equivalent) will be needed eventually

“Eventually” is now: we've been using this code on Nightly since bug 1289718 landed.

It would really help if this bug could be reproduced under rr.  I tried running the tests locally under rr chaos mode, but no luck.

Updated

2 years ago
Blocks: 1317084
(Assignee)

Updated

2 years ago
Summary: Intermittent SandboxBrokerTest.MultiThreadOpen or SandboxBrokerTest.MultiThreadStat | Expected: (zerofd) >= (0), actual: -5 vs 0 | test completed (time: 1198ms) → Intermittent SandboxBrokerTest.MultiThreadOpen or SandboxBrokerTest.MultiThreadStat | Expected: (zerofd) >= (0) or (nullfd) >= (0), actual: -5 vs 0 | test completed (time: 1198ms)
(Assignee)

Updated

2 years ago
Duplicate of this bug: 1317084
1 failure in 822 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* autoland: 1

Platform breakdown:
* linux32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2017-07-17&endday=2017-07-23&tree=all
1 failure in 888 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* autoland: 1

Platform breakdown:
* linux64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2017-07-31&endday=2017-08-06&tree=all
1 failure in 939 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* autoland: 1

Platform breakdown:
* linux32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2017-08-28&endday=2017-09-03&tree=all
1 failure in 1032 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* try: 1

Platform breakdown:
* linux64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2017-09-11&endday=2017-09-17&tree=all
1 failure in 792 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 1

Platform breakdown:
* linux64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2017-11-27&endday=2017-12-03&tree=all
2 failures in 590 pushes (0.003 failures/push) were associated with this bug in the last 7 days.    

Repository breakdown:
* try: 2

Platform breakdown:
* linux64-qr: 1
* linux64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2017-12-18&endday=2017-12-24&tree=all
1 failure in 147 pushes (0.007 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* autoland: 1

Platform breakdown:
* linux64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2017-12-25&endday=2017-12-31&tree=all
(Assignee)

Updated

8 months ago
See Also: → bug 1421201
(Assignee)

Comment 22

8 months ago
This seems to be turning up “for real” now that we're using the broker on desktop: see bug 1420475 and bug 1421201.  I added a log message in bug 1422198 for the case where sendmsg() fails, but that log never fires.  So the message is sent but never arrives.

I believe we're using much newer kernels for CI now (new enough to include unprivileged user namespaces and seccomp thread sync, at least), so the last part of comment #2 is probably not it.

I suppose it could be a kernel “feature”, like if it's not guaranteed that the sendmsg->close->recvmsg ordering won't drop the message but in practice that happens only in some odd edge case, but I don't know how much sense that makes.  (That *is* a problem with TCP, if the data is buffered locally but not transmitted yet — see http://blog.csdn.net/CPP_CHEN/article/details/29864509 — but there's no reason that would apply to AF_UNIX.  In any case, Chromium is doing the same thing we are, although they may not be using this kind of brokering as much as we do.)

In another bug I was speculating about a double-close or use-after-close bug somewhere, but I'm having trouble coming up with a plausible scenario where we wouldn't at least sometimes see actual errors, not just the message disappearing.

If the “missing” message was getting written to the wrong request's response socket, *after* the correct response was sent (these tests, if I recall correctly from when I wrote them years ago, try to check for concurrent requests getting their responses swapped), then that would show up like this.  But I don't have a plausible mechanism for how a bug would get us into that state without sometimes causing other failures instead.  This at least would be relatively easy to test: do a second recvmsg in the client and assert that it gets EOF.

I'm really just guessing at this point.
See Also: → bug 1420475
Whiteboard: sb+ → sb?
(Assignee)

Comment 23

8 months ago
One thing I notice in the 4 instances we still have logs for is the loop counters:

Loop 10000/10000 @ /builds/worker/workspace/build/src/security/sandbox/linux/gtest/TestBroker.cpp:419
Loop 674/10000 @ /home/worker/workspace/build/src/security/sandbox/linux/gtest/TestBroker.cpp:408
Loop 9397/10000 @ /builds/worker/workspace/build/src/security/sandbox/linux/gtest/TestBroker.cpp:420
Loop 42/10000 @ /builds/worker/workspace/build/src/security/sandbox/linux/gtest/TestBroker.cpp:419

They're all near the beginning or end of the loop (so maybe when other threads are starting/exiting?), and one of them is the last of 10k iterations, which is kind of suspect just by itself.


(Also, the canonical version of that article about TCP seems to be https://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable )
2 failures in 462 pushes (0.004 failures/push) were associated with this bug in the last 7 days.    

Repository breakdown:
* mozilla-central: 1
* autoland: 1

Platform breakdown:
* linux64-qr: 1
* linux64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2018-01-01&endday=2018-01-07&tree=all
(Assignee)

Comment 25

7 months ago
Something that's worth trying: burn a few CPU-hours trying to reproduce this locally under rr.  If it replays, then I can try to see what's going on with the file descriptors.  Or, if the replay diverges, that would suggest a kernel bug.
Flags: needinfo?(jld)
2 failures in 788 pushes (0.003 failures/push) were associated with this bug in the last 7 days.    

Repository breakdown:
* try: 1
* mozilla-inbound: 1

Platform breakdown:
* linux64: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2018-01-08&endday=2018-01-14&tree=all
(Assignee)

Updated

7 months ago
Duplicate of this bug: 1433135
(Assignee)

Comment 28

7 months ago
I did run this a few dozen times in rr (with and without chaos mode), but it turns out to be essentially a worst case for recording overhead because it's almost all context switches and nontrivial syscalls; see https://github.com/mozilla/rr/issues/2159

And I haven't reproduced it yet, but it's possible that letting it run for longer might help….
(Assignee)

Comment 29

7 months ago
I've run this about 1,000 times under rr now, and the bug hasn't happened.  I also tried running it 1,000 times locally *not* under rr, and… the bug also didn't reproduce.  So this may not be a useful approach.
Flags: needinfo?(jld)

Updated

7 months ago
Whiteboard: sb? → sb+
1 failure in 735 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 1

Platform breakdown:
* linux32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2018-01-29&endday=2018-02-04&tree=all
2 failures in 702 pushes (0.003 failures/push) were associated with this bug in the last 7 days.    

Repository breakdown:
* try: 1
* mozilla-inbound: 1

Platform breakdown:
* linux64-qr: 1
* linux32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2018-02-05&endday=2018-02-11&tree=all
1 failure in 685 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 1

Platform breakdown:
* linux64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2018-02-12&endday=2018-02-18&tree=all
1 failure in 831 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-central: 1

Platform breakdown:
* linux64-qr: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2018-02-26&endday=2018-03-04&tree=all
1 failure in 747 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-central: 1

Platform breakdown:
* linux64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2018-03-05&endday=2018-03-11&tree=all
2 failures in 814 pushes (0.002 failures/push) were associated with this bug in the last 7 days.    

Repository breakdown:
* mozilla-inbound: 2

Platform breakdown:
* linux64: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1243108&startday=2018-03-12&endday=2018-03-18&tree=all
1 failure in 822 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* autoland: 1

Platform breakdown:
* linux64: 1

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1243108&startday=2018-03-20&endday=2018-03-26&tree=trunk
(Assignee)

Updated

4 months ago
See Also: → bug 1442989
(Assignee)

Updated

4 months ago
See Also: → bug 1439140
1 failure in 792 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 1

Platform breakdown:
* linux32: 1

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1243108&startday=2018-04-09&endday=2018-04-15&tree=trunk
2 failures in 782 pushes (0.003 failures/push) were associated with this bug in the last 7 days.    

Repository breakdown:
* autoland: 2

Platform breakdown:
* linux32: 2

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1243108&startday=2018-04-16&endday=2018-04-22&tree=trunk
(Assignee)

Comment 39

4 months ago
Unfortunately, this doesn't seem to have been a double-close caused by bug 1278361, because we're still seeing it.  (This isn't a big surprise — see earlier comments that the observables here don't make sense for that kind of bug — but it sure would have been convenient if it had mysteriously been fixed.  Sigh.)

Also, bug 1456022 seems to be another manifestation of this.
(Assignee)

Updated

4 months ago
See Also: → bug 1456022
(Assignee)

Comment 40

4 months ago
I can reproduce this locally, given an hour or two: I increased the loop count in the tests by 10x (to amortize startup/shutdown/scripting overhead) and ran mach gtest in a shell loop, with an opt build, and got 1 hit in ~400 runs; a debug build also worked but took a lot longer (1 in ~1800).  Note that this roughly corresponds to 4,000 opt runs of the normal test.

So, things that can be done:

* Send one or more fake responses with a fictitious error code (or otherwise marked) after the real response, so the client can detect the case where the response was lost.  (A sketch of this follows the list.)  This assumes that whatever is happening is a more-or-less independent chance per message, not some property of the response socket itself.

* Send the real response multiple times, to work around lost responses if that's what's going on.  (I hope we won't have to do that.)

* If I can be sufficiently sure that an unexpected EOF means the request was never received (see above), add a retry loop in the client.  (Retrying operations that may or may not have already happened is… delicate, and I don't want to go there if I don't have to; see also NFS.)

* Add a variant of the multithread test where the clients alternate opening a per-thread temp file with O_CREAT | O_EXCL and unlinking it — either a dropped request or a duplicated one will cause an error.  (mkdir + rmdir would also work.)
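A minimal sketch of the first option in that list, server side (the Response layout and the sentinel value are made up; the real reply also carries an fd via SCM_RIGHTS, omitted here for brevity):

#include <sys/socket.h>

struct Response { int mError; };   /* hypothetical, as in the earlier sketch */
#define SENTINEL_ERROR (-9999)     /* arbitrary "should never be observed" code */

static void send_reply_with_sentinel(int respfd, const struct Response *real)
{
  /* The real reply; the client normally reads exactly this one message. */
  (void)send(respfd, real, sizeof(*real), MSG_NOSIGNAL);

  /* A marked dummy follows.  If the client ever *receives* the sentinel,
   * the real response was dropped in transit rather than never sent. */
  struct Response fake = { .mError = SENTINEL_ERROR };
  (void)send(respfd, &fake, sizeof(fake), MSG_NOSIGNAL);
}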
(Assignee)

Comment 41

4 months ago
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #40)
> * Send one or more fake responses with a fictitious error code (or otherwise
> marked) after the real response, so the client can detect the case where the
> response was lost.

I did this and ran it overnight: 4 hits in ~1600 runs, all of them apparently a lost request rather than a lost response.

> * Add a variant of the multithread test where the clients alternate opening
> a per-thread temp file with O_CREAT | O_EXCL and unlinking it — either a
> dropped request or a duplicated one will cause an error.  (mkdir + rmdir
> would also work.)

I did this (both creat+unlink and mkdir+rmdir) and ran it all afternoon and evening, and I got one error after 1388 runs (~2.78e9 broker calls) — a lost rmdir that succeeded when reissued.  Changing the operations seems to affect the error rate; this may be significant.

But that test closed the file before the unlink.  I modified it to close the file last (so that it's the broker client thread and not the server that causes the inode to be freed), and it's failed twice in only 160 iterations (80e6 calls): one a lost unlink that succeeded on retry, and one a lost open that did *not* — it failed with EEXIST.  And that's *with* the change to add a second response.  So now this is making even less sense than it did before.
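For concreteness, the creat+unlink+close variant looks roughly like this as a standalone loop, with plain syscalls standing in for the brokered ones (in the real test both the open and the unlink go through the broker client, so a dropped or duplicated request surfaces as an unexpected error):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
  char path[64];
  snprintf(path, sizeof(path), "/tmp/brokertest.%d", (int)getpid());

  for (int i = 0; i < 10000; ++i) {
    /* O_EXCL: a duplicated create in the brokered version shows up as EEXIST. */
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* A dropped or duplicated unlink shows up as an error here. */
    if (unlink(path) != 0) { perror("unlink"); return 1; }

    /* Close last: in the real test this makes the broker client thread,
     * not the broker, the one that frees the inode, which is the variant
     * that reproduced the failure far more often. */
    close(fd);
  }
  return 0;
}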


Another possibility here is writing a self-contained C program that does the same socket operations as the broker test and run *that* for a few hours, to try to rule out everything else in Gecko and maybe point toward a kernel bug.
(Assignee)

Comment 42

4 months ago
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #41)
> So now this is making even less sense than it did before.

It turns out I messed up the retry logic a little.  The open was retried, but the unlink (and rmdir from the earlier test) weren't.  So this means the requests *are* being received, but the responses are lost, even when multiple responses are sent.  This suggests that something is happening to the socket that's passed to the parent (but not to the non-socket fds passed back for the open test).

The creat+unlink+close test continues to be much more effective than the creat+close+unlink or mkdir+rmdir: 11 failures in 3041 runs (one per ~2.76e8 calls, or ~10x the rate).  4 were failed opens that failed with EEXIST when retried, and 7 were failed unlinks that did observably unlink.
1 failure in 780 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-beta: 1

Platform breakdown:
* linux64: 1

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1243108&startday=2018-04-23&endday=2018-04-29&tree=trunk
(Assignee)

Comment 44

4 months ago
New findings:

* Extra assertions (for short reads/writes, close() failing) don't fire, but do seem to reduce the repro rate, even though it's a handful of arithmetic ops and a predictable branch or two in a path that includes a nontrivial syscall, so it shouldn't be throwing off timing all that much.  Needs more data.

* Setting the response socket nonblocking and spamming extra responses until there's an error doesn't help; none of them are received.  This probably wasn't too surprising given previous results.

* The response socket's inode number in the server after it sends the reply is the same as what the client sees immediately after the socketpair() call.  (I modified the broker requests to include the inode number.)  I haven't yet tested the client end of the socket the same way; I could, but I'm doubtful.

* It doesn't appear to be a soundness bug in the Linux kernel's socket GC.  (Because sockets can be held in queued SCM_RIGHTS messages in sockets, there can be reference cycles; the kernel has a cycle collector that breaks cycles by, if I understand correctly, removing and destroying the socket buffer.)  This was tested by using ftrace to log calls to kfree_skb; a call from unix_gc can be seen when running a test program that creates a one-socket cycle, but not when the gtests fail.  (kfree_skb doesn't appear to be called for the normal socket destruction caused by the broker, so the log is low-traffic enough to read manually — about 19 calls per test run — and doesn't appear to have lost messages due to buffer overrun).

* I also tried doing a second recvmsg on both sides and asserting that they got EOF, to try to detect messages possibly being sent to the wrong socket somehow.  To make that not deadlock, I also added shutdown(fd, SHUT_WR) calls.  With that build, the bug hasn't reproduced once in 6.0e9 broker calls.

If that last result holds up, and if I can troubleshoot it down to a smaller part of the change… I don't *like* making changes without understanding why, but I could.
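For reference, the client side of that last experiment looks roughly like this (a sketch; the shutdown() is what lets the symmetric EOF check on the other side complete without deadlock):

#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Called after the expected response has already been read from respfd. */
static void assert_no_stray_messages(int respfd)
{
  char buf[64];

  /* Announce "no more writes from this end" so the peer's matching
   * EOF check can finish instead of both sides waiting on each other. */
  shutdown(respfd, SHUT_WR);

  /* EOF (0) is the only acceptable outcome; any data here would mean some
   * other request's reply was misdirected onto this response socket. */
  ssize_t n = recv(respfd, buf, sizeof(buf), 0);
  assert(n == 0);
  (void)n;   /* keep unused-variable warnings quiet if assertions are compiled out */

  close(respfd);
}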
(Assignee)

Updated

4 months ago
Duplicate of this bug: 1456022
(Assignee)

Comment 46

4 months ago
Created attachment 8972766 [details]
Simplified standalone test case

I wrote a small C program to try to reproduce this without the overhead of Gecko.  It works *very* well — it takes about 60 runs (60 million RPCs) per failure on average, but with a lot of variance (which is expected for an exponential distribution).  This is partially due to giving each client a separate dup() of the client fd; without that, it took 448 runs before the first failure.  That may be due to lock contention on the fd vs. just on the underlying socket changing the timing.

It doesn't seem to reproduce at all (not observed in 3000 runs) if confined to a single core.  It also didn't reproduce in 136 runs under rr chaos mode (where it's 60x slower).

Adding a SHUT_WR on the response socket after sending and before closing reduces but doesn't eliminate it: the first failure was after 300 runs.

So I added a sched_yield in that spot, and it lasted 1171 runs.  This suggests experimenting with ensuring that the last close on the sending end of the response socket happens-after the receive — sending a dup of it back to the client in a second message, or doing a blocking read on it (expecting EOF) before closing.
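The attachment itself isn't included here, but a simplified single-process sketch of the round trip it hammers looks like this: the client thread passes one end of a fresh response socketpair to the server thread via SCM_RIGHTS, the server replies on it and immediately closes it, and the failure mode is the final recv() returning EOF instead of the one-byte reply.  (Error handling is minimal, and the per-client dup() detail mentioned above is omitted.)

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

#define ITERATIONS 1000000   /* one "run" in the terms used above */

static int request_sock[2];  /* [0] = client end, [1] = server end */

static void send_fd(int sock, int fd, char byte)
{
  char cbuf[CMSG_SPACE(sizeof(int))];
  struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
  struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                        .msg_control = cbuf, .msg_controllen = sizeof(cbuf) };
  struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
  c->cmsg_level = SOL_SOCKET;
  c->cmsg_type = SCM_RIGHTS;
  c->cmsg_len = CMSG_LEN(sizeof(int));
  memcpy(CMSG_DATA(c), &fd, sizeof(int));
  if (sendmsg(sock, &msg, 0) != 1) { perror("sendmsg"); exit(1); }
}

static int recv_fd(int sock)
{
  char byte, cbuf[CMSG_SPACE(sizeof(int))];
  struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
  struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                        .msg_control = cbuf, .msg_controllen = sizeof(cbuf) };
  if (recvmsg(sock, &msg, 0) != 1) { perror("recvmsg"); exit(1); }
  int fd = -1;
  struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
  if (c && c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS)
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
  return fd;
}

static void *server(void *arg)
{
  (void)arg;
  for (int i = 0; i < ITERATIONS; ++i) {
    int resp = recv_fd(request_sock[1]);   /* take a request */
    char ok = 1;
    send(resp, &ok, 1, MSG_NOSIGNAL);      /* send the one-byte reply... */
    close(resp);                           /* ...and close right away (the racy part) */
  }
  return NULL;
}

int main(void)
{
  socketpair(AF_UNIX, SOCK_SEQPACKET, 0, request_sock);
  pthread_t t;
  pthread_create(&t, NULL, server, NULL);

  for (int i = 0; i < ITERATIONS; ++i) {
    int resp[2];
    socketpair(AF_UNIX, SOCK_SEQPACKET, 0, resp);
    send_fd(request_sock[0], resp[1], 0);  /* the "request", carrying our reply end */
    close(resp[1]);                        /* the server thread holds its own reference now */

    char reply;
    ssize_t n = recv(resp[0], &reply, 1, 0);
    if (n == 0) { printf("lost reply at iteration %d\n", i); exit(1); }
    if (n < 0) { perror("recv"); exit(1); }
    close(resp[0]);
  }
  pthread_join(t, NULL);
  return 0;
}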
Assignee: nobody → jld
(Assignee)

Comment 47

4 months ago
I tested with a change to make the server wait for EOF from the client, and that seems to avoid the bug (no failures in 4 billion RPCs).  It also doubles the wall time, although it might not affect latency in practice if we're not getting multithreaded back-to-back requests.

The approach of sending the socket back to the client (in a message that will either be dropped on client close, or fail to be sent if the client closes first) also holds up: it ran for 14.4 billion RPCs without failing.  That change actually *decreases* wall time slightly for some reason, although total CPU time increased somewhat.

For comparison, I also tried running the unmodified test case on FreeBSD; it ran for 13.8 billion RPCs with no failures.
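A server-side sketch of the second approach (sending the response socket back to the client); names are illustrative and this is not the actual Gecko patch:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* After the real reply, queue one more message on the response socket that
 * carries a reference to that very socket via SCM_RIGHTS.  The in-flight
 * reference keeps the socket alive past our close(), so the peer cannot
 * observe shutdown until the client has either read the queued message or
 * closed its own end (in which case the extra sendmsg simply fails and is
 * ignored). */
static void send_reply_and_close(int respfd, const void *reply, size_t len)
{
  (void)send(respfd, reply, len, MSG_NOSIGNAL);   /* the real response */

  char byte = 0;
  char cbuf[CMSG_SPACE(sizeof(int))];
  struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
  struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                        .msg_control = cbuf, .msg_controllen = sizeof(cbuf) };
  struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
  c->cmsg_level = SOL_SOCKET;
  c->cmsg_type = SCM_RIGHTS;
  c->cmsg_len = CMSG_LEN(sizeof(int));
  memcpy(CMSG_DATA(c), &respfd, sizeof(int));
  (void)sendmsg(respfd, &msg, MSG_NOSIGNAL);      /* EPIPE here just means the client already left */

  close(respfd);   /* our descriptor goes away, but the queued SCM_RIGHTS copy does not */
}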
(Assignee)

Comment 48

4 months ago
As for *why* this is happening, I think I finally have an explanation:

1. (server) recvmsg calls __skb_try_recv_datagram, sees no pending data, and prepares to sleep
2. (client) sendmsg calls skb_queue_tail to append the packet to the peer's receive queue
3. (client) close → fput → unix_release_sock sets the peer's sk_shutdown field to the equivalent of SHUT_RDWR
4. (server) recvmsg calls __skb_wait_for_more_packets, which reads sk_shutdown and indicates EOF

Corresponding source lines:

[1] https://github.com/torvalds/linux/blob/06dd3dfeea60e2a6457a6aedf97afc8e6d2ba497/net/unix/af_unix.c#L2098
[2] https://github.com/torvalds/linux/blob/06dd3dfeea60e2a6457a6aedf97afc8e6d2ba497/net/unix/af_unix.c#L1797
[3] https://github.com/torvalds/linux/blob/06dd3dfeea60e2a6457a6aedf97afc8e6d2ba497/net/unix/af_unix.c#L540
[4] https://github.com/torvalds/linux/blob/06dd3dfeea60e2a6457a6aedf97afc8e6d2ba497/net/core/datagram.c#L102
(Assignee)

Updated

3 months ago
See Also: → bug 1459553
(Assignee)

Updated

3 months ago
Priority: P3 → P1
Summary: Intermittent SandboxBrokerTest.MultiThreadOpen or SandboxBrokerTest.MultiThreadStat | Expected: (zerofd) >= (0) or (nullfd) >= (0), actual: -5 vs 0 | test completed (time: 1198ms) → Intermittent failure to receive sandbox broker responses (SandboxBrokerTest.MultiThreadOpen, SandboxBrokerTest.MultiThreadStat, or “Sandbox: Unexpected EOF” by itself near non-shutdown crashes)
(Assignee)

Updated

3 months ago
See Also: bug 1421201
Duplicate of this bug: 1421201
(Assignee)

Updated

3 months ago
See Also: bug 1420475
Duplicate of this bug: 1420475
(Assignee)

Updated

3 months ago
See Also: bug 1442989
(Assignee)

Updated

3 months ago
Duplicate of this bug: 1442989
(Assignee)

Updated

3 months ago
See Also: bug 1439140
Duplicate of this bug: 1439140
(Assignee)

Updated

3 months ago
See Also: bug 1459553
Duplicate of this bug: 1459553
(Assignee)

Updated

3 months ago
status-firefox50: affected → wontfix
status-firefox51: affected → wontfix
See Also: bug 1317084, bug 1456022
(Assignee)

Updated

3 months ago
No longer blocks: 1317084
1 failure in 660 pushes (0.002 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 1

Platform breakdown:
* linux64: 1

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1243108&startday=2018-05-07&endday=2018-05-13&tree=trunk
1 failure in 639 pushes (0.002 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-release: 1

Platform breakdown:
* linux32: 1

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1243108&startday=2018-05-14&endday=2018-05-20&tree=trunk
1 failure in 521 pushes (0.002 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-esr60: 1

Platform breakdown:
* linux64: 1

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1243108&startday=2018-06-18&endday=2018-06-24&tree=trunk
(Assignee)

Updated

2 months ago
Duplicate of this bug: 1471528
1 failure in 718 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 1

Platform breakdown:
* linux32: 1

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1243108&startday=2018-06-25&endday=2018-07-01&tree=trunk
2 failures in 622 pushes (0.003 failures/push) were associated with this bug in the last 7 days.    

Repository breakdown:
* mozilla-inbound: 2

Platform breakdown:
* osx-10-10: 1
* linux64: 1

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1243108&startday=2018-07-02&endday=2018-07-08&tree=trunk
(Assignee)

Updated

a month ago
Blocks: 1447867
Comment hidden (mozreview-request)
(Assignee)

Comment 61

a month ago
Back when I posted comment #48, I was planning to report the bug upstream first and seek input on what workaround we should use, but then other things intervened.  Instead, what I'm going to do now is land the workaround technique that's been reasonably well verified with the minimal test case, so that things stop breaking, and if later it's decided that there's a better way then we can always change it.

Comment 62

26 days ago
mozreview-review
Comment on attachment 8993568 [details]
Bug 1243108 - Work around Linux kernel bug causing broker responses to be dropped.

https://reviewboard.mozilla.org/r/258280/#review265980
Attachment #8993568 - Flags: review?(gpascutto) → review+

Comment 63

25 days ago
Pushed by jedavis@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ecae69f91831
Work around Linux kernel bug causing broker responses to be dropped. r=gcp
https://hg.mozilla.org/mozilla-central/rev/ecae69f91831
Status: NEW → RESOLVED
Last Resolved: 25 days ago
status-firefox63: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
(Assignee)

Updated

12 days ago
Duplicate of this bug: 1481345
1 failure in 2579 pushes (0.0 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 1

Platform breakdown:
* linux64: 1

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1243108&startday=2018-08-06&endday=2018-08-12&tree=all
New occurrence:

https://treeherder.mozilla.org/logviewer.html#?job_id=193734658&repo=mozilla-beta
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 68

4 days ago
The patch wasn't uplifted to Beta, so it's expected to still see failures there.  It's 3 weeks before release, and this has been broken since 54, so it probably wouldn't be accepted for uplift.
Status: REOPENED → RESOLVED
Last Resolved: 25 days ago → 4 days ago
Resolution: --- → FIXED