Closed
Bug 1161717
Opened 9 years ago
Closed 9 years ago
[Page Thumbnails] Startup crash for localized Firefox builds [@ mozilla::net::PNeckoChild::SendPHttpChannelConstructor]
Categories
(Core :: Networking: HTTP, defect)
Tracking
()
People
(Reporter: whimboo, Unassigned)
References
Details
(Keywords: crash, Whiteboard: [mozmill][qa-automation-blocked])
Crash Data
Attachments
(3 files)
While investigating the test results from our Mozmill tests from the last 4 weeks I have seen that we have a new top crasher on OS X 10.8. It seems to crash the build each time during startup between two tests. It started on April 24th (20150424064750):

report: bp-9ff2f47e-5b93-4b16-a5e6-c7a572150505

Crash Reason: EXC_BAD_ACCESS / KERN_INVALID_ADDRESS
Crash Address: 0x0

First ten stack frames:

0 libmozalloc.dylib mozalloc_abort(char const*) memory/mozalloc/mozalloc_abort.cpp
1 XUL Abort xpcom/base/nsDebugImpl.cpp
2 XUL NS_DebugBreak xpcom/base/nsDebugImpl.cpp
3 XUL mozilla::net::PNeckoChild::SendPHttpChannelConstructor(mozilla::net::PHttpChannelChild*, mozilla::dom::PBrowserOrId const&, IPC::SerializedLoadContext const&, mozilla::net::HttpChannelCreationArgs const&) obj-firefox/x86_64/ipc/ipdl/PNeckoChild.cpp
4 XUL mozilla::net::HttpChannelChild::ContinueAsyncOpen() netwerk/protocol/http/HttpChannelChild.cpp
5 XUL mozilla::net::HttpChannelChild::AsyncOpen(nsIStreamListener*, nsISupports*) netwerk/protocol/http/HttpChannelChild.cpp
6 XUL nsScriptLoader::StartLoad(nsScriptLoadRequest*, nsAString_internal const&, bool) dom/base/nsScriptLoader.cpp
7 XUL nsScriptLoader::ProcessScriptElement(nsIScriptElement*) dom/base/nsScriptLoader.cpp
8 XUL nsScriptElement::MaybeProcessScript() dom/base/nsScriptElement.cpp
9 XUL mozilla::dom::HTMLScriptElement::BindToTree(nsIDocument*, nsIContent*, nsIContent*, bool) dom/html/HTMLScriptElement.cpp
10 XUL nsINode::doInsertChildAt(nsIContent*, unsigned int, bool, nsAttrAndChildArray&) dom/base/nsINode.cpp
Reporter | ||
Comment 1•9 years ago
|
||
It might be a regression between 38b6 and 38b7. Pushlog: https://hg.mozilla.org/releases/mozilla-release/pushloghtml?fromchange=c68a6293bb0d&tochange=504ec068cc33 I don't see anything obvious here.
Comment 2•9 years ago
|
||
That's e10s code asserting failure in a non-e10s (38) build. Since this is between two tests, could the prefs be in a weird state?
Reporter | ||
Comment 3•9 years ago
|
||
For Mozmill tests we always have e10s turned off because this framework doesn't support it. So not sure why this code is getting run in that case. https://github.com/mozilla/mozmill/blob/master/mozmill/mozmill/__init__.py#L125
Reporter | ||
Comment 5•9 years ago
|
||
The same also applies to all the Linux boxes we own in SCL3. Sadly I'm not able to submit any of those reports via the crash reporter. But ted helped me and gave me details for the dmp files:

Thread 0 (crashed)
0 libmozalloc.so!mozalloc_abort(char const*) [mozalloc_abort.cpp:504ec068cc33 : 37 + 0x0]
    rbx = 0x00007fa52ebad868 r12 = 0x00007fa52ebad868 r13 = 0x0000000000000000 r14 = 0x00007fa53202bdb4
    r15 = 0x0000000000000139 rip = 0x00007fa5344fbfc9 rsp = 0x00007fff8eac6b70 rbp = 0x0000000000000003
    Found by: given as instruction pointer in context
1 libxul.so!NS_DebugBreak [nsDebugImpl.cpp:504ec068cc33 : 469 + 0x7]
    rbx = 0x00007fff8eac6bc0 r12 = 0x00007fa52ebad868 r13 = 0x0000000000000000 r14 = 0x00007fa53202bdb4
    r15 = 0x0000000000000139 rip = 0x00007fa53039ee85 rsp = 0x00007fff8eac6b80 rbp = 0x0000000000000003
    Found by: call frame info
2 libxul.so!mozilla::net::PNeckoChild::SendPHttpChannelConstructor(mozilla::net::PHttpChannelChild*, mozilla::dom::PBrowserOrId const&, IPC::SerializedLoadContext const&, mozilla::net::HttpChannelCreationArgs const&) [PNeckoChild.cpp:504ec068cc33 : 313 + 0x4]
    rbx = 0x0000000000000000 r12 = 0x00007fff8eac7290 r13 = 0x00007fff8eac70a0 r14 = 0x00007fff8eac7090
    r15 = 0x00007fa5339c1a50 rip = 0x00007fa530615f73 rsp = 0x00007fff8eac6ff0 rbp = 0x00007fa5197c71c0
    Found by: call frame info
3 libxul.so!mozilla::net::HttpChannelChild::ContinueAsyncOpen() [HttpChannelChild.cpp:504ec068cc33 : 1644 + 0x43]
    rbx = 0x00007fa51765f000 r12 = 0x00007fa51d424800 r13 = 0x00007fa5339b2990 r14 = 0x00000000cd140014
    r15 = 0x00007fa5339c1a50 rip = 0x00007fa5304e9424 rsp = 0x00007fff8eac7040 rbp = 0x0000000000000000
    Found by: call frame info
OS: Mac OS X → All
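For anyone who wants to compare many such dumps, the frame lines in comment 5 can be pulled apart with a small script. This is only a sketch: the regular expression is inferred from the four frames quoted in that comment (`index module!function [file:revision : line + offset]`), and `parse_frames` is a hypothetical helper name, not part of any Mozilla tool.

```python
import re

# Pattern inferred from the frame lines quoted in comment 5; other
# stackwalk output variants (e.g. frames without symbols) are not handled.
FRAME_RE = re.compile(
    r"(?P<index>\d+) (?P<module>[\w.]+)!(?P<function>.+?) "
    r"\[(?P<file>[^:\]]+):(?P<rev>\w+) : (?P<line>\d+) \+ (?P<offset>0x[0-9a-f]+)\]"
)


def parse_frames(dump_text):
    """Return a list of dicts, one per symbolicated frame found in dump_text."""
    frames = []
    for match in FRAME_RE.finditer(dump_text):
        frame = match.groupdict()
        frame["index"] = int(frame["index"])
        frame["line"] = int(frame["line"])
        frames.append(frame)
    return frames


sample = (
    "Thread 0 (crashed)\n"
    "0 libmozalloc.so!mozalloc_abort(char const*) [mozalloc_abort.cpp:504ec068cc33 : 37 + 0x0]\n"
    "1 libxul.so!NS_DebugBreak [nsDebugImpl.cpp:504ec068cc33 : 469 + 0x7]\n"
)

for frame in parse_frames(sample):
    print(frame["index"], frame["module"], frame["function"], frame["file"], frame["line"])
```

Grouping the parsed frames by `function` across dumps would make it easy to confirm that the Linux and OS X reports share the same signature.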
Reporter | ||
Comment 6•9 years ago
|
||
Marking as qa-automation-blocked because we have too many instances of those crashes. Right now it's kinda hard to nail down the problem because this crash does not always happen, but only about once every 5 full test runs.
Hardware: x86_64 → All
Whiteboard: [mozmill] → [mozmill][qa-automation-blocked]
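To put the "once every 5 or so full test runs" rate from comment 6 in perspective, here is a rough estimate of how many full runs are needed before the crash is likely to show up at least once. It assumes each run crashes independently with probability 0.2, which is a simplification of the observed rate and not a measured figure; `runs_needed` and `chance_of_crash` are hypothetical helper names.

```python
import math


def chance_of_crash(p_per_run, runs):
    """Probability of seeing at least one crash in `runs` independent runs."""
    return 1.0 - (1.0 - p_per_run) ** runs


def runs_needed(p_per_run, confidence):
    """Smallest number of independent runs giving at least `confidence`
    probability of one or more crashes."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_per_run))


p = 0.2  # assumption: roughly one crash per five full runs
n = runs_needed(p, 0.95)
print(n, round(chance_of_crash(p, n), 3))
```

Under these assumptions a clean result from only a handful of runs says little; a candidate fix would need on the order of a dozen or more crash-free runs before it looks convincing.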
Reporter | ||
Comment 7•9 years ago
|
||
Actually I can see some of those lines in our add-on related Mozmill tests for Firefox 38.0 RC:
> [Child 27528] ###!!! ABORT: constructor for actor failed: file ./PNeckoChild.cpp, line 313
It's the same file as for the crash, so I wonder if that is related.
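A quick way to check other test logs for the same abort is to scan them for this line format. The sketch below is based only on the single line quoted in comment 7; the pattern and the `find_aborts` helper are assumptions made for illustration and may need adjusting for other abort messages.

```python
import re

# Pattern inferred from the one ABORT line quoted in comment 7.
ABORT_RE = re.compile(
    r"\[(?P<proc>\w+) (?P<pid>\d+)\] ###!!! ABORT: (?P<message>.*?): "
    r"file (?P<file>[^,\s]+), line (?P<line>\d+)"
)


def find_aborts(log_text):
    """Return one dict per ABORT line found in log_text."""
    aborts = []
    for match in ABORT_RE.finditer(log_text):
        entry = match.groupdict()
        entry["pid"] = int(entry["pid"])
        entry["line"] = int(entry["line"])
        aborts.append(entry)
    return aborts


sample = (
    "[Child 27528] ###!!! ABORT: constructor for actor failed: "
    "file ./PNeckoChild.cpp, line 313"
)

for abort in find_aborts(sample):
    print(abort["proc"], abort["pid"], abort["file"], abort["line"], abort["message"])
```

The `proc` field is the interesting part here: a `Child` prefix in the log of a run that is supposed to have e10s disabled is exactly the oddity discussed in the later comments.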
Reporter | ||
Comment 8•9 years ago
|
||
This crash is the most frequent crash for our automated Mozmill test runs with Firefox 38. In detail it means that really each locale except en-US is crashing with this stack. This happens for 38.0 build 3 and for today's 38.0.5b1.

http://mozmill-release.blargon7.com/#/functional/failure?app=Firefox&branch=38&platform=Linux&from=2015-05-10&to=2015-05-11&test=%2FtestToolbar%2FtestHomeButton.js&func=testHomeButton.js

We collect crashes at the end of the testrun, so I cannot say precisely where the crash is occurring. But one scenario I have already found can be triggered by running the following two tests in sequence:

* testMenu_quitApplication (http://hg.mozilla.org/qa/mozmill-tests/file/mozilla-release/firefox/tests/functional/restartTests/testMenu_quitApplication/test1.js)
* testChangeTheme (http://hg.mozilla.org/qa/mozmill-tests/file/mozilla-release/firefox/tests/functional/testAddons/testChangeTheme.js)
Summary: Startup crash in mozilla::net::PNeckoChild::SendPHttpChannelConstructor → Startup crash for localized Firefox builds [@ mozilla::net::PNeckoChild::SendPHttpChannelConstructor]
Version: 39 Branch → 38 Branch
Reporter | ||
Comment 9•9 years ago
|
||
The link to our dashboard should have been the following to also include the OS X crashes and the 38.0 build 3 results: http://mozmill-release.blargon7.com/#/functional/failure?app=Firefox&branch=38&platform=All&from=2015-05-8&to=2015-05-11&test=%2FtestToolbar%2FtestHomeButton.js&func=testHomeButton.js
Reporter | ||
Comment 10•9 years ago
|
||
And typo again :( http://mozmill-release.blargon7.com/#/functional/failure?app=Firefox&branch=38&platform=All&from=2015-05-08&to=2015-05-11&test=%2FtestToolbar%2FtestHomeButton.js&func=testHomeButton.js
Reporter | ||
Comment 11•9 years ago
|
||
The tests which trigger the crash are doing the following:

* Quit Firefox via the menu's quit entry
* Close all tabs
* Open 'addons/install.html?addon=themes/plain.jar' via the local httpd.js webserver (which is identical to http://mozqa.com/data/firefox/addons/install.html?addon=themes/plain.jar)
* Click the link to install the theme and proceed through the install dialog
* Open the Add-on Manager, select the theme pane, and restart Firefox

During that restart Firefox crashes directly during start-up.
Reporter | ||
Comment 12•9 years ago
|
||
Log file with some Javascript warnings/errors which might help to investigate this problem.
Reporter | ||
Comment 14•9 years ago
|
||
Sure! I totally forgot about that. Here it is.
Flags: needinfo?(hskupin)
Reporter | ||
Comment 15•9 years ago
|
||
Reporter | ||
Comment 16•9 years ago
|
||
After further testing, this problem seems to exist only when about:newtab is used for newly opened tabs. The crash is gone when I make use of about:blank in the Mozmill test. We have already had a couple of problems with this page in the past, mostly with background thumbnailing. Given the situation on this bug, I feel that this could be closely related.
Reporter | ||
Comment 17•9 years ago
|
||
Interestingly this is not happening on Windows, maybe because the content sandbox is disabled via bug 1158849?
Flags: needinfo?(bobowen.code)
Comment 18•9 years ago
|
||
Looking at the log, I can only see that after the restart it is using e10s, as if the pref had changed. And there is a log entry for a child process, so the e10s pref must have changed.
Comment 19•9 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #17)
> Interestingly this is not happening on Windows, maybe because the content
> sandbox is disabled via bug 1158849?

As far as I can tell, the content process (and therefore the thumbnail process) is only sandboxed on Nightly for both OSX and Windows now. So, I don't think so. The thumbnail process was being sandboxed in branches other than Nightly by mistake on Windows, which is what bug 1158849 fixed.
Flags: needinfo?(bobowen.code)
Reporter | ||
Comment 20•9 years ago
|
||
(In reply to Dragana Damjanovic [:dragana] from comment #18)
> Looking at the log, I can only see that after restart is using e10s as if
> the pref is changed.
> And there is a log for child process so e10s pref had changed.

The pref you are referring to here should be browser.tabs.remote.autostart, right? And in this case it should be true?
Reporter | ||
Comment 21•9 years ago
|
||
If that is the pref, it is still set to false after restart. Just tested.
Comment 22•9 years ago
|
||
about:blank is taking a slightly different path, which is why it is not crashing. Why it is not crashing on Windows I do not know at this moment.

The pref that I was talking about is browser.tabs.remote.autostart.

The place where the necko code decides whether to use a child channel or not is: http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpHandler.cpp#1840

I do not see much from the log.
Reporter | ||
Comment 23•9 years ago
|
||
(In reply to Dragana Damjanovic [:dragana] from comment #22)
> about:blank is taking a bit different path that is why it is not crashing.

Right, because it doesn't trigger the background thumbnail process. I simply used that for now to stop the crash in our tests of Firefox 38 builds.

> Why it is not crashing on windows at this moment i do not know.
>
> The pref that i was talking about is browser.tabs.remote.autostart.
>
> The place where necko code decides to call child or not child is:
> http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/
> nsHttpHandler.cpp#1840

Also I would like to say that e10s on Aurora is not enabled by default. So are those child processes created because of an asynchronous background process?
Reporter | ||
Comment 24•9 years ago
|
||
I found bug 726347 which added a pref for disabling the background thumbnail process. I disabled that while leaving the newtab page active, and indeed it stops crashing Firefox.
Flags: needinfo?(ttaubert)
Summary: Startup crash for localized Firefox builds [@ mozilla::net::PNeckoChild::SendPHttpChannelConstructor] → [Page Thumbnails] Startup crash for localized Firefox builds [@ mozilla::net::PNeckoChild::SendPHttpChannelConstructor]
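The workaround from comment 24 could be applied to a test profile roughly as follows. This is a sketch with loud assumptions: comment 24 does not name the pref added by bug 726347, so `browser.pagethumbnails.capturing_disabled` below is an assumed name that must be verified against that bug before use; `browser.tabs.remote.autostart` is the e10s pref named in comments 20 to 22. The `user_js_lines` helper is hypothetical and simply emits `user_pref(...)` lines in the format used by a profile's user.js file.

```python
import json


def user_js_lines(prefs):
    """Render a dict of prefs as user.js lines, sorted by pref name.

    json.dumps gives the JavaScript literal forms needed here:
    quoted strings, true/false for booleans, and plain integers.
    """
    return [
        "user_pref({}, {});".format(json.dumps(name), json.dumps(value))
        for name, value in sorted(prefs.items())
    ]


workaround_prefs = {
    # e10s pref discussed in comments 20-22; Mozmill runs keep it off.
    "browser.tabs.remote.autostart": False,
    # ASSUMPTION: pref name not stated in this bug; check bug 726347.
    "browser.pagethumbnails.capturing_disabled": True,
}

print("\n".join(user_js_lines(workaround_prefs)))
```

Keeping the new tab page active while only the thumbnail pref is flipped mirrors the experiment in comment 24, which isolates background thumbnailing from the about:newtab page itself.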
Reporter | ||
Comment 25•9 years ago
|
||
Just for information: our machines where this crash is occurring are located in SCL3 behind a proxy. Maybe that has an influence here. But I'm not sure why it's only happening for localized builds and not en-US, and not on Windows. It's fun, but I do not have time to investigate it further. If it turns out to be really important I can have another look, but for now I will continue with other important things.
Comment 26•9 years ago
|
||
Sorry, no time to investigate this in the near future. Drew has been working mostly on the background thumbnailer though. You might also want to loop in a few e10s folks maybe that know about HTTP channel impl.
Flags: needinfo?(ttaubert)
Comment 27•9 years ago
|
||
Sounds like we've got steps to reproduce. We should do that and attach a debugger, and see what happens in the code in comment 22 (i.e. where we create HTTP channels). We create an e10s child channel if IsNeckoChild() returns true. But if that's happening, something really weird is going on with XRE_GetProcessType() and we should ask :bent how it could ever return the wrong answer. Note that IsNeckoChild() caches its result, so if there's a race/window where XRE_GetProcessType() returns the wrong answer, we'd keep it forever. But really, processes shouldn't change type (?). I'm fine with getting rid of the cached result if it "fixes" things here.
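The caching concern in comment 27 can be illustrated with a toy model. This is not the Gecko code: `ProcessInfo`, `CachedNeckoCheck` and `uncached_is_necko_child` are hypothetical Python stand-ins for XRE_GetProcessType() and IsNeckoChild(), and the "wrong answer during an early window" is simulated by hand.

```python
PROCESS_DEFAULT = "default"  # stands in for the parent (chrome) process type
PROCESS_CONTENT = "content"  # stands in for the child (content) process type


class ProcessInfo:
    """Toy stand-in for whatever the process-type query consults."""

    def __init__(self, process_type):
        self.process_type = process_type


class CachedNeckoCheck:
    """Computes the answer once and returns the cached value afterwards."""

    def __init__(self, info):
        self._info = info
        self._cached = None

    def is_necko_child(self):
        if self._cached is None:
            self._cached = self._info.process_type == PROCESS_CONTENT
        return self._cached


def uncached_is_necko_child(info):
    """Re-reads the process type on every call."""
    return info.process_type == PROCESS_CONTENT


# Simulate a window in which the process type is reported incorrectly.
info = ProcessInfo(PROCESS_CONTENT)
check = CachedNeckoCheck(info)
first_answer = check.is_necko_child()  # wrong value gets cached here

# The report is later corrected, but the cached check never notices.
info.process_type = PROCESS_DEFAULT
print(first_answer, check.is_necko_child(), uncached_is_necko_child(info))
```

In this model the cached check keeps claiming to be a child forever, while the uncached variant recovers. That is the trade-off behind the suggestion to drop the cache: it would hide the symptom without explaining why the wrong type was ever reported.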
Flags: needinfo?(jduell.mcbugs)
Flags: needinfo?(honzab.moz)
Reporter | ||
Comment 28•9 years ago
|
||
I assume someone would have to build an l10n debug build of a most recent Firefox 38.0. I don't have the capacity to look that deep into it. But I would be happy to run a debugger if I get such a build e.g. via try.
Reporter | ||
Comment 29•9 years ago
|
||
This is still a top crasher for our Mozmill tests. We get hundreds of reports for each release. :(
Comment 30•9 years ago
|
||
OK, tracking for 39.
status-firefox39:
--- → affected
tracking-firefox39:
--- → +
Reporter | ||
Comment 31•9 years ago
|
||
[Tracking Requested - why for this release]: It would be good if we can get this fixed for 38ESR given that we will have more releases here. I have to say that we have not seen this crash for 39 builds yet.
status-firefox-esr38:
--- → affected
tracking-firefox-esr38:
--- → ?
Comment 32•9 years ago
|
||
I have tried to reproduce it locally, but I couldn't. I have even used a proxy, but it made no difference. From a chat with :whimboo: he will start replacing Mozmill tests with Marionette, so we will see in a week. I will try to figure out how to make the build from comment #28.
Flags: needinfo?(hskupin)
Reporter | ||
Comment 33•9 years ago
|
||
Due to a lot of failures in the Mozmill tests and no one who could fix or at least analyze the problems, we decided to shut down most of them about 2 weeks ago. Since then we no longer see this crash. I feel it's not worth the time to dig more into this crash given that all reports so far came from our testing machines. I would close it as incomplete for now, with the option to reopen if it comes back with the Marionette tests.
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(hskupin)
Resolution: --- → INCOMPLETE
Comment 34•9 years ago
|
||
Dropping tracking for this since the crash sounds very specific to particular test machines and no one is working on it. We aren't seeing this in crash-stats.
Updated•2 years ago
|