661158 - Crash on mobile SMP devices while sending PRNetAddr over IPDL [@ Pickle::ReadBytes]

Josh Matthews [:jdm]

Reporter

Description

•

13 years ago

Seen in https://crash-stats.mozilla.com/report/index/64a04257-3024-46b0-8270-1d55a2110530 .

u408661

Comment 1

•

13 years ago

From crash-stats, it looks like we're sending info about a unix domain socket in an HttpChannel (unless the line number in NeckoMessageUtils.h is wrong, and it doesn't appear to be from the call chain). I'm not sure why we would be doing that. A "quick" fix would be to just remove the case for domain sockets (which we originally thought was kind of useless anyway, but kept for completeness), but it seems now like that might cause some other kind of problem. CC'ing Jason to see if he has any ideas on how we might have gotten into this state in the first place.

Jason Duell

Comment 2

•

13 years ago

Attached patch Initialize child channel prnetaddr variables correctly (obsolete) — Details — Splinter Review

I've found one error, and fixed it in this patch--we forgot to move init code of prnetaddrs from nsHttpChannel to the base class. That would explain getting garbage/segfaults from calling Child::GetRemoteAddress, etc before RecvOnStartRequest. But that's not what we're seeing here. We seem to be seeing nsHttpChannel providing PRNetAddrs with PR_AF_LOCAL set, and that seems, well, impossible--we init the value to PF_AF_UNSPEC in the constructor, and the only time we change it is to set it from the HTTP transaction--which doesn't use PF_AF_LOCAL. Weird. I've stuck in code to blow up the browser on the parent side when/if things go awry like this (we're already screwed at that point if it's happening--the child will wind up aborting--and we might as well try to get more info from the user if possible).

Attachment #538313 - Flags: review?(josh)

Josh Matthews [:jdm]

Reporter

Updated

•

13 years ago

Attachment #538313 - Flags: review?(josh) → review+

Jason Duell

Comment 3

•

13 years ago

Comment on attachment 538313 [details] [diff] [review] Initialize child channel prnetaddr variables correctly moved this patch to a new bug 663229, since it doesn't really fix the issue here (or so I suspect).

Attachment #538313 - Attachment is obsolete: true

Mark Finkle (:mfinkle) (use needinfo?)

Comment 4

•

13 years ago

This is the top crasher for Fx6 (mobile), but doesn't even appear in the Fx5 (mobile) crasher list.

Scoobidiver (away)

Comment 5

•

13 years ago

It is #2 top browser crasher in Fennec 6.0.

tracking-fennec: --- → ?

Naoki Hirata :nhirata (please use needinfo instead of cc)

Updated

•

13 years ago

Crash Signature: [@ Pickle::ReadBytes]

Doug Turner (:dougt)

Updated

•

13 years ago

Assignee: nobody → josh

tracking-fennec: ? → 6+

Josh Matthews [:jdm]

Reporter

Comment 6

•

13 years ago

Here's an oddity - the patch in bug 663229 should have caused no more Pickle::ReadByte crashes after June 10th when it landed, but https://crash-stats.mozilla.com/report/index/0137c437-ca4f-46ed-8100-df9e42110714 is a crash from a July 14th build that shows the same error.

Mark Finkle (:mfinkle) (use needinfo?)

Comment 7

•

13 years ago

Do we have any thoughts getting a fix for Fx6?

Josh Matthews [:jdm]

Reporter

Comment 8

•

13 years ago

Jason, the only reasonable idea I've had recently is thinking about whether there could be a race when modifying the PRNetAddr values versus sending them. I know that the socket transport says that the self address is finished being modified by the time OnStartRequest is called, but what about the peer address?

Brad Lassey [:blassey] (use needinfo?)

Updated

•

13 years ago

tracking-fennec: 6+ → -

Sheila Mooney

Updated

•

13 years ago

Whiteboard: [mobile-crash]

Jason Duell

Comment 9

•

13 years ago

We set the channel's mPeerAddress in nsHttpChannel::OnTransportStatus, which runs before OnStart/Data/etc, so there should be no issue with the value not being set yet.

Jason Duell

Comment 10

•

13 years ago

To be precise: mPeerAddress is set during either of the status events STATUS_CONNECTED_TO || STATUS_WAITING_FOR. I wonder if there could be an issue with loading from cache...

Jason Duell

Comment 11

•

13 years ago

JDM: do you have any time to look at this? I'm running very low on spare cycles before the next code fork. cjones: any debugging ideas would be swell... This is apparently causing 13% of all mobile crashes at the moment :( My theory in comment 10 is wrong--what I said in comment 2 is right--there's no logical way we ever set the PRNetAddr type to PR_AF_LOCAL. It now seems most likely to me that there's an intermittent IPDL serialization error here. Something in either the send or the recv of PHttpChannel::OnStartRequest is going wrong. Nothing else really makes sense, given that we explicitly do a RUNTIMEABORT if we see PR_AF_LOCAL going out, and yet we're seeing it getting received on the child side (at least if the crashreporter stack trace can be trusted) The "juiciest" culprit here is perhaps here: http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/HttpChannelParent.cpp#430 we create a temporary nsHttpResponseHead() to send over IPDL if there isn't one in the parent (I forget when this can happen, but it's possible though rare). I'm wondering if that might somehow allocate something that doesn't serialize well over IPDL. For instance, it contains an nsHttpHeaderArray that will have an empty nsTArray<nsEntry>. But cjones tells me empty nsTArray's ought to serialize ok. So there may be no problem there. We don't use nsAutoTArray (which is not memmove-friendly, and hence not IPDL-friendly). There are lots of other fields that get sent in OnStartRequest, but they all resolve to basic types--I don't see anything that ought to fail randomly. Notes: - We crash trying to deserialize a PRNetAddr union type (PR_AF_LOCAL: i.e. Unix domain socket: resolves to "1"): so it's presumably reading some "garbage" that both tells the deserialization code that we've got a PR_AF_LOCAL socket, when we don't. Then we try to read 104 bytes and hit a bus error (I think: the stack trace actually seems to die on deref-ing a chromium iterator, which doesn't make much sense). PRNetAddr are actually long enough (112 bytes), but presumably we're already badly screwed up at this point that things go awry for some reason. Here's a 9.0 instance of the crash--the only recent one AFIACT: https://crash-stats.mozilla.com/report/index/19dc4def-2fb1-45e7-9667-e1a8d2110915

Jason Duell

Comment 12

•

13 years ago

Meh. I tweaked the HTTP stack to always use an empty, on-the-stack nsHttpResponseHead: but it doesn't cause the IPDL error. I also tried Clear()-ing the RequestHeaderTuples argument to SendOnStartRequest, and that doesn't break anything either. Everything else that's sent in SendOnStartRequest is ultimately a basic Mozilla type (PRBool, PRUint32, nsCString), so I'm at a loss as to how any there can be any deserialization errors. Perhaps there's some HTTP header values that IPDL doesn't deserialize correctly? Seems unlikely. Other ideas: I thought of moving the PRNetAddrs to the start of the OnStartRequest msg. That will probably make this problem "go away" in that whatever serialization error that happens will blow up some other code somewhere. Not sure if it's likely to be useful, but it might be worth a shot. The downside is that we'd now have a different crash signature, with not a lot of frequency in the short term (we only have one 9.0 branch crash instance on crash reports). So this is one part fumbly debugging live on the 'net, three parts ostrich algorithm. We could also try to remove the PR_AF_LOCAL code entirely, and let the ParamTraits return 'false', which should kill the child process, perhaps with some useful info? (but cjones tells me he thinks it wouldn't result in a crash report, so this might just hide the problem from us, even more badly than in the previous "fix"). Just about out of ideas, at least for now. I've got to move on to Websockets stuff. I'm again suspecting some sort of IPDL bug.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 13

•

13 years ago

All the crash reports linked from here are for Galaxy S II builds. I wasn't able to find more reports on crash-stats, which is surprising for a topcrasher, but whatever. Anyway, the Galaxy S II has a dual-core CPU. Any threading bugs will be exacerbated there, in particular because ARM has a more relaxed memory model than x86. (In reply to Jason Duell (:jduell) from comment #9) > We set the channel's mPeerAddress in nsHttpChannel::OnTransportStatus, which > runs before OnStart/Data/etc, so there should be no issue with the value not > being set yet. (In reply to Jason Duell (:jduell) from comment #10) > To be precise: mPeerAddress is set during either of the status events > STATUS_CONNECTED_TO || STATUS_WAITING_FOR. I wonder if there could be an > issue with loading from cache... How is mPeerAddress read/written? A memory ordering or cache problem might be possible. I wouldn't rule out a low-level IPC bug, but we send a *lot* of messages across a lot of platforms, so I would be surprised to only be seeing it in PHttpChannel, on android. Stranger things have happened though. Also, if [@ Pickle::ReadBytes ] is the #1 crash *signature*, but there's a wide variety of code below that on the stack, then an IPC bug looks more likely.

Jason Duell

Comment 14

•

13 years ago

> I wasn't able to find more reports on crash-stats Make sure you select product=fennec: http://tinyurl.com/3obufw8 I assume you can tell a Galaxy system from "I9100" being in the OS Version string. Looking through the crash reports, I do see these are 90+% of the cases, but there are others here and there 2.6.35.7-RAGEmod-T30 (HTC Desire: I looked it up: single CPU) 2.6.35.11-ninphetamine-2.1.3+ #26 SMP PREEMPT Linux 2.6.35.13-g3422292 #1 Linux 2.6.35.7-SC02COMKG5-CL382988 #2 Linux 2.6.35.7-I9100TDUKF4-CL331615 #2 Linux 2.6.35.7-M250K.EG26.1902-CL374700 #2 2.6.35.7-SPH-D710.EG30-CL435734 #2 I don't know the market demographics, but it does seem like we're getting a higher proportion of errors on multicore. However, the stack traces for all threads are virtually identical, which makes an IPDL race condition seem less likely, assuming the IPDL thread's stack trace below is benign (i.e not dispatching something when it shouldn't): 0 libc.so syscall 1 @0x41f3cfff 2 libxul.so epoll_dispatch /libevent/epoll.c:210 3 libxul.so event_base_loop /libevent/event.c:515 > How is mPeerAddress read/written? The only time it's modified in the parent (other than the HTTP channel constructor) is during OnTransport STATUS_CONNECTED_TO or STATUS_WAITING_FOR events, which are quite a bit sooner than OnStartRequest, and span main thread event loops and probably lots ot time and machine instructions. Also, we're not seeing the bad value getting written from the parent. On the child it's set only in the constructor, and when we receive the OnStartRequest IPDL msg. So there's no necko races in setting it--it's either a ParamTraits serialization issue, or an IPDL race AFAICT.

Naoki Hirata :nhirata (please use needinfo instead of cc)

Updated

•

13 years ago

Keywords: crash, topcrash

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 15

•

13 years ago

(In reply to Jason Duell (:jduell) from comment #14) > > I wasn't able to find more reports on crash-stats > > Make sure you select product=fennec: > > http://tinyurl.com/3obufw8 Aha. Thanks. > I assume you can tell a Galaxy system from "I9100" being in the OS Version > string. Looking through the crash reports, I do see these are 90+% of the > cases, but there are others here and there > > 2.6.35.7-RAGEmod-T30 (HTC Desire: I looked it up: single CPU) This is a mod kernel for the Galaxy S II. > 2.6.35.11-ninphetamine-2.1.3+ #26 SMP PREEMPT Another Galaxy S II mod. (SMP => multiple cores.) > Linux 2.6.35.13-g3422292 #1 This is the HTC Sensation, which runs a dual-core Snapdragon. Basically the same chipset as in the alternate version of the Galaxy S II. > Linux 2.6.35.7-SC02COMKG5-CL382988 #2 > Linux 2.6.35.7-I9100TDUKF4-CL331615 #2 > Linux 2.6.35.7-M250K.EG26.1902-CL374700 #2 > 2.6.35.7-SPH-D710.EG30-CL435734 #2 These are all Galaxy S II. > > I don't know the market demographics, but it does seem like we're getting a > higher proportion of errors on multicore.> Looks like (100-\epsilon)% AFAICT. > However, the stack traces for all > threads are virtually identical, which makes an IPDL race condition seem > less likely, assuming the IPDL thread's stack trace below is benign (i.e > not dispatching something when it shouldn't): > > 0 libc.so syscall > 1 @0x41f3cfff > 2 libxul.so epoll_dispatch /libevent/epoll.c:210 > 3 libxul.so event_base_loop /libevent/event.c:515 > Yeah, this stack is fine, but note that there's some latency between a thread crashing and it being snapshotted by breakpad. Since this thread didn't crash, it might have returned to the event loop by the time breakpad SIGSTOPs it. This doesn't make me suspect a race condition in the IPC code any less or any more. > > > How is mPeerAddress read/written? > > The only time it's modified in the parent (other than the HTTP channel > constructor) is during OnTransport STATUS_CONNECTED_TO or STATUS_WAITING_FOR > events, which are quite a bit sooner than OnStartRequest, and span main > thread event loops and probably lots ot time and machine instructions. I don't follow this exactly, but is there something sequencing the write of mPeerAddress and the main thread reading it? For example, is it true that something like this *must* happen --- Necko thread: write mPeerAddress [post a task T to main thread] main thread: [process task T] read mPeerAddress If so, then this is fine, because posting a task requires some operations that end up being equivalent to a memory barrier. The only other possibilities I can think of for mPeerAddress would be something akin to use-after-free, but I would expect that to manifest as other crashes too. Is mPeerAddress a reference to a field in another object, by any chance, an object that die concurrent with this serialization? > Also, we're not seeing the bad value getting written from the parent. On > the child it's set only in the constructor, and when we receive the > OnStartRequest IPDL msg. So there's no necko races in setting it--it's > either a ParamTraits serialization issue, or an IPDL race AFAICT. It looks really unlikely to me (i.e. impossible) for the PRNetAddr serializer to be writing a bad value. HOWEVER, it *is* possible that one of the serializers for the params preceding |selfAddr| are doing something wrong. For example, if they write more bytes to the Message than they read back, then they'll leave the buffer in a bad state. That could result in PRNetAddress reading back a value it didn't write. It's definitely worth giving those serializers a once-over (nsHttpResponseHead or RequestHeaderTuples seems most suspect). I still think an IPC bug is fairly unlikely, but stranger things have happened. Reading bytes from the socket on the IPC-IO thread is strongly sequenced with them being processed on the main thread. This only happening with PHttpChannel:OnStartRequest seems pretty suspicious too. If it were an improbable race condition, I would expect it to happen with OnDataAvailable or something higher-traffic like that.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 16

•

13 years ago

(In reply to Chris Jones [:cjones] [:warhammer] from comment #15) > It's definitely worth > giving those serializers a once-over (nsHttpResponseHead or > RequestHeaderTuples seems most suspect). Sorry, and also checking for possible data races there. nsHttpResponseHead sounds suspicious to me, just by the name of it. helgrind might be able to catch bugs in that on desktop.

Jason Duell

Comment 17

•

13 years ago

re: ordering of mPeerAddress write/read. Yes, it's basically this: Necko thread: write mPeerAddress [post a task T to main thread for OnTransportStatus] main thread: [process task T] read mPeerAddress So no race there (and as mentioned, we RUNTIMEABORT if we see the bad value being written from the parent: so it's definitely corruption/serialization on the child). I already eyeballed nsHttpResponseHead fairly thoroughly. I'll try to find some cycles to learn helgrind and run it, but probably not before the FF9 branch cutoff. And it'll depend on the fennec emulator also showing the bug on a multicore linux desktop (seems possible but of course not guaranteed).

Summary: Crash while sending PRNetAddr over IPDL [@ Pickle::ReadBytes] → Crash on mobile SMP devices while sending PRNetAddr over IPDL [@ Pickle::ReadBytes]

Jason Duell

Comment 18

•

13 years ago

dougt/mfinkle, If you have developers with multicore phones, it would be very useful to have someone try to catch this bug in gdb. We don't have a clear test case for making it happen, but if this is causing 13% of all crashes on mobile AND it's only happening on multicore, it should be crashing multicore a fair amount. Also someone could try to troll for it on desktop linux (possibly with helgrind): I don't have time for that right now.

Doug Turner (:dougt)

Comment 19

•

13 years ago

I attempted to use Helgrind on desktop, but never got very far. Lots of warning/errors before the Fennec window came up (before the child process starts). Valgrind eventually crashed.

Josh Matthews [:jdm]

Reporter

Comment 20

•

13 years ago

I shall not mislead people that I am working on this.

Assignee: josh → nobody

Robert Kaiser

Comment 21

•

13 years ago

(In reply to Jason Duell (:jduell) from comment #18) > We don't have a clear test case for making it happen, but if this is causing > 13% of all crashes on mobile Now up to 20% of all crashes on Mobile Firefox 7 over the last three days.

Scoobidiver (away)

Comment 22

•

13 years ago

It's #1 top crasher in 8.0 and represents 19% of all crashes.

Jason Duell

Comment 23

•

13 years ago

My understanding (mfinkle told me!) is that mobile is not longer using e10s, so the codepath that causes this to get hit is not in any production use going forward. So it's not a high priority. Please do correct me if I'm wrong--we shouldn't be seeing this for native android mobile, only XUL.

Josh Matthews [:jdm]

Reporter

Comment 24

•

13 years ago

Comment 23 is correct.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 25

•

13 years ago

Erm, fennec 9 and 10 will be multi-process (AFAIK) and this is the #1 crasher ... B2G is (will be) also. Was anyone ever able to even reproduce this?

Robert Kaiser

Comment 26

•

13 years ago

The focus for the moment is off e10s for new development of Fennec, but as comment #25 says, we are still shipping two releases with it, and then B2G comes as another, and even if it's delayed, e10s for desktop will probably come at some point. So we should not dismiss the crash too easily.

Josh Matthews [:jdm]

Reporter

Comment 27

•

13 years ago

I don't believe we ever had a report with reliable STR (or even unreliable ones).

Matt Brubeck (:mbrubeck)

Comment 28

•

13 years ago

For Android tablets, we are also still shipping XUL/e10s fennec as Firefox 11 and quite possibly more releases.

Robert Kaiser

Comment 29

•

13 years ago

(In reply to Josh Matthews [:jdm] from comment #27) > I don't believe we ever had a report with reliable STR (or even unreliable > ones). That's the unfortunate side of things. But doesn't the summary tell already what the circumstances are that lead to the crash - or am I just not understanding the issue due to not knowing any code?

Naoki Hirata :nhirata (please use needinfo instead of cc)

Comment 30

•

13 years ago

I've tried to reproduce this before but have not succeeded. I may need to try to get a wifi + 3G device that has dual-core? some sample URLs from Socorro: http://www.holidaycheck.de/ http://ecodiario.eleconomista.es/politica/noticias/3611830/12/11/la-casa-del-rey-aconsejo-a-urdangarin-que-dejara-noos-y-cambiara-de-actividad.html http://www.google.com/m?hl=fr&gl=fr&client=ms-android-samsung&source=android-unknown&action=devloc&q=galaxy+note http://www.google.com/m?hl=es&gl=es&client=ms-android-samsung&source=android-launcher-widget&action=devloc&q=liga+bbva+clasificacion http://m.laola1.at/login.php http://www.jappy.de/ about:home http://bell.mobi/web/bell/home http://www.google.com/m?q=fusball+livticker&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:de:official http://www.google.se/ http://www.google.com/m?hl=it&gl=it&client=ms-android-samsung&source=android-com.gau.go.launcherex&action=devloc&q=sabina+began http://www.google.com/

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 31

•

13 years ago

A random sample of crash reports I took a while back were pretty much all Galaxy S IIs. That says two things - dual-core chipset - Gingerbread release of android (2.3.x) The fact that we don't seem to be seeing this on tablet chipsets, which are also all dual core, might suggest that we're hitting a bug in bionic 2.3.x that's been since fixed. If so ... oy vey.

Scoobidiver (away)

Updated

•

13 years ago

OS: Linux → Android

Hardware: x86 → ARM

Jason Duell

Comment 32

•

13 years ago

FYI: I seem to be seeing similar symptoms (with the same function) in bug 664163 comment 10. For now I'm punting on fixing it, but if we want to revisit this when e10s is important again, it looks like we can repro that bug on our linux buildbots, unlike this one.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 33

•

13 years ago

It's important now. B2G is building around it. Please to be fixing! :)

Scoobidiver (away)

Updated

•

12 years ago

Keywords: topcrash

Patrick McManus [:mcmanus]

Updated

•

9 years ago

Group: network-core-security

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → DUPLICATE

Daniel Veditz [:dveditz]

Updated

•

8 years ago

Group: network-core-security