From crash-stats, it looks like we're sending info about a unix domain socket in an HttpChannel (unless the line number in NeckoMessageUtils.h is wrong, and it doesn't appear to be from the call chain). I'm not sure why we would be doing that. A "quick" fix would be to just remove the case for domain sockets (which we originally thought was kind of useless anyway, but kept for completeness), but it seems now like that might cause some other kind of problem. CC'ing Jason to see if he has any ideas on how we might have gotten into this state in the first place.
Created attachment 538313 [details] [diff] [review] Initialize child channel prnetaddr variables correctly I've found one error, and fixed it in this patch--we forgot to move init code of prnetaddrs from nsHttpChannel to the base class. That would explain getting garbage/segfaults from calling Child::GetRemoteAddress, etc before RecvOnStartRequest. But that's not what we're seeing here. We seem to be seeing nsHttpChannel providing PRNetAddrs with PR_AF_LOCAL set, and that seems, well, impossible--we init the value to PF_AF_UNSPEC in the constructor, and the only time we change it is to set it from the HTTP transaction--which doesn't use PF_AF_LOCAL. Weird. I've stuck in code to blow up the browser on the parent side when/if things go awry like this (we're already screwed at that point if it's happening--the child will wind up aborting--and we might as well try to get more info from the user if possible).
Comment on attachment 538313 [details] [diff] [review] Initialize child channel prnetaddr variables correctly moved this patch to a new bug 663229, since it doesn't really fix the issue here (or so I suspect).
This is the top crasher for Fx6 (mobile), but doesn't even appear in the Fx5 (mobile) crasher list.
It is #2 top browser crasher in Fennec 6.0.
Here's an oddity - the patch in bug 663229 should have caused no more Pickle::ReadByte crashes after June 10th when it landed, but https://crash-stats.mozilla.com/report/index/0137c437-ca4f-46ed-8100-df9e42110714 is a crash from a July 14th build that shows the same error.
Do we have any thoughts getting a fix for Fx6?
Jason, the only reasonable idea I've had recently is thinking about whether there could be a race when modifying the PRNetAddr values versus sending them. I know that the socket transport says that the self address is finished being modified by the time OnStartRequest is called, but what about the peer address?
We set the channel's mPeerAddress in nsHttpChannel::OnTransportStatus, which runs before OnStart/Data/etc, so there should be no issue with the value not being set yet.
To be precise: mPeerAddress is set during either of the status events STATUS_CONNECTED_TO || STATUS_WAITING_FOR. I wonder if there could be an issue with loading from cache...
JDM: do you have any time to look at this? I'm running very low on spare cycles before the next code fork. cjones: any debugging ideas would be swell... This is apparently causing 13% of all mobile crashes at the moment :( My theory in comment 10 is wrong--what I said in comment 2 is right--there's no logical way we ever set the PRNetAddr type to PR_AF_LOCAL. It now seems most likely to me that there's an intermittent IPDL serialization error here. Something in either the send or the recv of PHttpChannel::OnStartRequest is going wrong. Nothing else really makes sense, given that we explicitly do a RUNTIMEABORT if we see PR_AF_LOCAL going out, and yet we're seeing it getting received on the child side (at least if the crashreporter stack trace can be trusted) The "juiciest" culprit here is perhaps here: http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/HttpChannelParent.cpp#430 we create a temporary nsHttpResponseHead() to send over IPDL if there isn't one in the parent (I forget when this can happen, but it's possible though rare). I'm wondering if that might somehow allocate something that doesn't serialize well over IPDL. For instance, it contains an nsHttpHeaderArray that will have an empty nsTArray<nsEntry>. But cjones tells me empty nsTArray's ought to serialize ok. So there may be no problem there. We don't use nsAutoTArray (which is not memmove-friendly, and hence not IPDL-friendly). There are lots of other fields that get sent in OnStartRequest, but they all resolve to basic types--I don't see anything that ought to fail randomly. Notes: - We crash trying to deserialize a PRNetAddr union type (PR_AF_LOCAL: i.e. Unix domain socket: resolves to "1"): so it's presumably reading some "garbage" that both tells the deserialization code that we've got a PR_AF_LOCAL socket, when we don't. Then we try to read 104 bytes and hit a bus error (I think: the stack trace actually seems to die on deref-ing a chromium iterator, which doesn't make much sense). PRNetAddr are actually long enough (112 bytes), but presumably we're already badly screwed up at this point that things go awry for some reason. Here's a 9.0 instance of the crash--the only recent one AFIACT: https://crash-stats.mozilla.com/report/index/19dc4def-2fb1-45e7-9667-e1a8d2110915
Meh. I tweaked the HTTP stack to always use an empty, on-the-stack nsHttpResponseHead: but it doesn't cause the IPDL error. I also tried Clear()-ing the RequestHeaderTuples argument to SendOnStartRequest, and that doesn't break anything either. Everything else that's sent in SendOnStartRequest is ultimately a basic Mozilla type (PRBool, PRUint32, nsCString), so I'm at a loss as to how any there can be any deserialization errors. Perhaps there's some HTTP header values that IPDL doesn't deserialize correctly? Seems unlikely. Other ideas: I thought of moving the PRNetAddrs to the start of the OnStartRequest msg. That will probably make this problem "go away" in that whatever serialization error that happens will blow up some other code somewhere. Not sure if it's likely to be useful, but it might be worth a shot. The downside is that we'd now have a different crash signature, with not a lot of frequency in the short term (we only have one 9.0 branch crash instance on crash reports). So this is one part fumbly debugging live on the 'net, three parts ostrich algorithm. We could also try to remove the PR_AF_LOCAL code entirely, and let the ParamTraits return 'false', which should kill the child process, perhaps with some useful info? (but cjones tells me he thinks it wouldn't result in a crash report, so this might just hide the problem from us, even more badly than in the previous "fix"). Just about out of ideas, at least for now. I've got to move on to Websockets stuff. I'm again suspecting some sort of IPDL bug.
All the crash reports linked from here are for Galaxy S II builds. I wasn't able to find more reports on crash-stats, which is surprising for a topcrasher, but whatever. Anyway, the Galaxy S II has a dual-core CPU. Any threading bugs will be exacerbated there, in particular because ARM has a more relaxed memory model than x86. (In reply to Jason Duell (:jduell) from comment #9) > We set the channel's mPeerAddress in nsHttpChannel::OnTransportStatus, which > runs before OnStart/Data/etc, so there should be no issue with the value not > being set yet. (In reply to Jason Duell (:jduell) from comment #10) > To be precise: mPeerAddress is set during either of the status events > STATUS_CONNECTED_TO || STATUS_WAITING_FOR. I wonder if there could be an > issue with loading from cache... How is mPeerAddress read/written? A memory ordering or cache problem might be possible. I wouldn't rule out a low-level IPC bug, but we send a *lot* of messages across a lot of platforms, so I would be surprised to only be seeing it in PHttpChannel, on android. Stranger things have happened though. Also, if [@ Pickle::ReadBytes ] is the #1 crash *signature*, but there's a wide variety of code below that on the stack, then an IPC bug looks more likely.
> I wasn't able to find more reports on crash-stats Make sure you select product=fennec: http://tinyurl.com/3obufw8 I assume you can tell a Galaxy system from "I9100" being in the OS Version string. Looking through the crash reports, I do see these are 90+% of the cases, but there are others here and there 220.127.116.11-RAGEmod-T30 (HTC Desire: I looked it up: single CPU) 18.104.22.168-ninphetamine-2.1.3+ #26 SMP PREEMPT Linux 22.214.171.124-g3422292 #1 Linux 126.96.36.199-SC02COMKG5-CL382988 #2 Linux 188.8.131.52-I9100TDUKF4-CL331615 #2 Linux 184.108.40.206-M250K.EG26.1902-CL374700 #2 220.127.116.11-SPH-D710.EG30-CL435734 #2 I don't know the market demographics, but it does seem like we're getting a higher proportion of errors on multicore. However, the stack traces for all threads are virtually identical, which makes an IPDL race condition seem less likely, assuming the IPDL thread's stack trace below is benign (i.e not dispatching something when it shouldn't): 0 libc.so syscall 1 @0x41f3cfff 2 libxul.so epoll_dispatch /libevent/epoll.c:210 3 libxul.so event_base_loop /libevent/event.c:515 > How is mPeerAddress read/written? The only time it's modified in the parent (other than the HTTP channel constructor) is during OnTransport STATUS_CONNECTED_TO or STATUS_WAITING_FOR events, which are quite a bit sooner than OnStartRequest, and span main thread event loops and probably lots ot time and machine instructions. Also, we're not seeing the bad value getting written from the parent. On the child it's set only in the constructor, and when we receive the OnStartRequest IPDL msg. So there's no necko races in setting it--it's either a ParamTraits serialization issue, or an IPDL race AFAICT.
(In reply to Jason Duell (:jduell) from comment #14) > > I wasn't able to find more reports on crash-stats > > Make sure you select product=fennec: > > http://tinyurl.com/3obufw8 Aha. Thanks. > I assume you can tell a Galaxy system from "I9100" being in the OS Version > string. Looking through the crash reports, I do see these are 90+% of the > cases, but there are others here and there > > 18.104.22.168-RAGEmod-T30 (HTC Desire: I looked it up: single CPU) This is a mod kernel for the Galaxy S II. > 22.214.171.124-ninphetamine-2.1.3+ #26 SMP PREEMPT Another Galaxy S II mod. (SMP => multiple cores.) > Linux 126.96.36.199-g3422292 #1 This is the HTC Sensation, which runs a dual-core Snapdragon. Basically the same chipset as in the alternate version of the Galaxy S II. > Linux 188.8.131.52-SC02COMKG5-CL382988 #2 > Linux 184.108.40.206-I9100TDUKF4-CL331615 #2 > Linux 220.127.116.11-M250K.EG26.1902-CL374700 #2 > 18.104.22.168-SPH-D710.EG30-CL435734 #2 These are all Galaxy S II. > > I don't know the market demographics, but it does seem like we're getting a > higher proportion of errors on multicore.> Looks like (100-\epsilon)% AFAICT. > However, the stack traces for all > threads are virtually identical, which makes an IPDL race condition seem > less likely, assuming the IPDL thread's stack trace below is benign (i.e > not dispatching something when it shouldn't): > > 0 libc.so syscall > 1 @0x41f3cfff > 2 libxul.so epoll_dispatch /libevent/epoll.c:210 > 3 libxul.so event_base_loop /libevent/event.c:515 > Yeah, this stack is fine, but note that there's some latency between a thread crashing and it being snapshotted by breakpad. Since this thread didn't crash, it might have returned to the event loop by the time breakpad SIGSTOPs it. This doesn't make me suspect a race condition in the IPC code any less or any more. > > > How is mPeerAddress read/written? > > The only time it's modified in the parent (other than the HTTP channel > constructor) is during OnTransport STATUS_CONNECTED_TO or STATUS_WAITING_FOR > events, which are quite a bit sooner than OnStartRequest, and span main > thread event loops and probably lots ot time and machine instructions. I don't follow this exactly, but is there something sequencing the write of mPeerAddress and the main thread reading it? For example, is it true that something like this *must* happen --- Necko thread: write mPeerAddress [post a task T to main thread] main thread: [process task T] read mPeerAddress If so, then this is fine, because posting a task requires some operations that end up being equivalent to a memory barrier. The only other possibilities I can think of for mPeerAddress would be something akin to use-after-free, but I would expect that to manifest as other crashes too. Is mPeerAddress a reference to a field in another object, by any chance, an object that die concurrent with this serialization? > Also, we're not seeing the bad value getting written from the parent. On > the child it's set only in the constructor, and when we receive the > OnStartRequest IPDL msg. So there's no necko races in setting it--it's > either a ParamTraits serialization issue, or an IPDL race AFAICT. It looks really unlikely to me (i.e. impossible) for the PRNetAddr serializer to be writing a bad value. HOWEVER, it *is* possible that one of the serializers for the params preceding |selfAddr| are doing something wrong. For example, if they write more bytes to the Message than they read back, then they'll leave the buffer in a bad state. That could result in PRNetAddress reading back a value it didn't write. It's definitely worth giving those serializers a once-over (nsHttpResponseHead or RequestHeaderTuples seems most suspect). I still think an IPC bug is fairly unlikely, but stranger things have happened. Reading bytes from the socket on the IPC-IO thread is strongly sequenced with them being processed on the main thread. This only happening with PHttpChannel:OnStartRequest seems pretty suspicious too. If it were an improbable race condition, I would expect it to happen with OnDataAvailable or something higher-traffic like that.
(In reply to Chris Jones [:cjones] [:warhammer] from comment #15) > It's definitely worth > giving those serializers a once-over (nsHttpResponseHead or > RequestHeaderTuples seems most suspect). Sorry, and also checking for possible data races there. nsHttpResponseHead sounds suspicious to me, just by the name of it. helgrind might be able to catch bugs in that on desktop.
re: ordering of mPeerAddress write/read. Yes, it's basically this: Necko thread: write mPeerAddress [post a task T to main thread for OnTransportStatus] main thread: [process task T] read mPeerAddress So no race there (and as mentioned, we RUNTIMEABORT if we see the bad value being written from the parent: so it's definitely corruption/serialization on the child). I already eyeballed nsHttpResponseHead fairly thoroughly. I'll try to find some cycles to learn helgrind and run it, but probably not before the FF9 branch cutoff. And it'll depend on the fennec emulator also showing the bug on a multicore linux desktop (seems possible but of course not guaranteed).
dougt/mfinkle, If you have developers with multicore phones, it would be very useful to have someone try to catch this bug in gdb. We don't have a clear test case for making it happen, but if this is causing 13% of all crashes on mobile AND it's only happening on multicore, it should be crashing multicore a fair amount. Also someone could try to troll for it on desktop linux (possibly with helgrind): I don't have time for that right now.
I attempted to use Helgrind on desktop, but never got very far. Lots of warning/errors before the Fennec window came up (before the child process starts). Valgrind eventually crashed.
I shall not mislead people that I am working on this.
(In reply to Jason Duell (:jduell) from comment #18) > We don't have a clear test case for making it happen, but if this is causing > 13% of all crashes on mobile Now up to 20% of all crashes on Mobile Firefox 7 over the last three days.
It's #1 top crasher in 8.0 and represents 19% of all crashes.
My understanding (mfinkle told me!) is that mobile is not longer using e10s, so the codepath that causes this to get hit is not in any production use going forward. So it's not a high priority. Please do correct me if I'm wrong--we shouldn't be seeing this for native android mobile, only XUL.
Comment 23 is correct.
Erm, fennec 9 and 10 will be multi-process (AFAIK) and this is the #1 crasher ... B2G is (will be) also. Was anyone ever able to even reproduce this?
The focus for the moment is off e10s for new development of Fennec, but as comment #25 says, we are still shipping two releases with it, and then B2G comes as another, and even if it's delayed, e10s for desktop will probably come at some point. So we should not dismiss the crash too easily.
I don't believe we ever had a report with reliable STR (or even unreliable ones).
For Android tablets, we are also still shipping XUL/e10s fennec as Firefox 11 and quite possibly more releases.
(In reply to Josh Matthews [:jdm] from comment #27) > I don't believe we ever had a report with reliable STR (or even unreliable > ones). That's the unfortunate side of things. But doesn't the summary tell already what the circumstances are that lead to the crash - or am I just not understanding the issue due to not knowing any code?
I've tried to reproduce this before but have not succeeded. I may need to try to get a wifi + 3G device that has dual-core? some sample URLs from Socorro: http://www.holidaycheck.de/ http://ecodiario.eleconomista.es/politica/noticias/3611830/12/11/la-casa-del-rey-aconsejo-a-urdangarin-que-dejara-noos-y-cambiara-de-actividad.html http://www.google.com/m?hl=fr&gl=fr&client=ms-android-samsung&source=android-unknown&action=devloc&q=galaxy+note http://www.google.com/m?hl=es&gl=es&client=ms-android-samsung&source=android-launcher-widget&action=devloc&q=liga+bbva+clasificacion http://m.laola1.at/login.php http://www.jappy.de/ about:home http://bell.mobi/web/bell/home http://www.google.com/m?q=fusball+livticker&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:de:official http://www.google.se/ http://www.google.com/m?hl=it&gl=it&client=ms-android-samsung&source=android-com.gau.go.launcherex&action=devloc&q=sabina+began http://www.google.com/
A random sample of crash reports I took a while back were pretty much all Galaxy S IIs. That says two things - dual-core chipset - Gingerbread release of android (2.3.x) The fact that we don't seem to be seeing this on tablet chipsets, which are also all dual core, might suggest that we're hitting a bug in bionic 2.3.x that's been since fixed. If so ... oy vey.
FYI: I seem to be seeing similar symptoms (with the same function) in bug 664163 comment 10. For now I'm punting on fixing it, but if we want to revisit this when e10s is important again, it looks like we can repro that bug on our linux buildbots, unlike this one.
It's important now. B2G is building around it. Please to be fixing! :)