Closed Bug 1161717 Opened 9 years ago Closed 9 years ago

[Page Thumbnails] Startup crash for localized Firefox builds [@ mozilla::net::PNeckoChild::SendPHttpChannelConstructor]

Categories

(Core :: Networking: HTTP, defect)

38 Branch
defect
Not set
critical

Tracking

()

RESOLVED INCOMPLETE
Tracking Status
firefox38 --- wontfix
firefox39 - wontfix
firefox-esr38 - wontfix

People

(Reporter: whimboo, Unassigned)

References

Details

(Keywords: crash, Whiteboard: [mozmill][qa-automation-blocked])

Crash Data

Attachments

(3 files)

While investigating the test results from our Mozmill tests from the last 4 weeks I have seen that we have a new top crasher on OS X 10.8. It seems to crash the build each time during startup between two tests. It started on April 24th (20150424064750):

report: bp-9ff2f47e-5b93-4b16-a5e6-c7a572150505.

Crash Reason 	EXC_BAD_ACCESS / KERN_INVALID_ADDRESS
Crash Address 	0x0

First ten stack frames:

0 	libmozalloc.dylib 	mozalloc_abort(char const*) 	memory/mozalloc/mozalloc_abort.cpp
1 	XUL 	Abort 	xpcom/base/nsDebugImpl.cpp
2 	XUL 	NS_DebugBreak 	xpcom/base/nsDebugImpl.cpp
3 	XUL 	mozilla::net::PNeckoChild::SendPHttpChannelConstructor(mozilla::net::PHttpChannelChild*, mozilla::dom::PBrowserOrId const&, IPC::SerializedLoadContext const&, mozilla::net::HttpChannelCreationArgs const&) 	obj-firefox/x86_64/ipc/ipdl/PNeckoChild.cpp
4 	XUL 	mozilla::net::HttpChannelChild::ContinueAsyncOpen() 	netwerk/protocol/http/HttpChannelChild.cpp
5 	XUL 	mozilla::net::HttpChannelChild::AsyncOpen(nsIStreamListener*, nsISupports*) 	netwerk/protocol/http/HttpChannelChild.cpp
6 	XUL 	nsScriptLoader::StartLoad(nsScriptLoadRequest*, nsAString_internal const&, bool) 	dom/base/nsScriptLoader.cpp
7 	XUL 	nsScriptLoader::ProcessScriptElement(nsIScriptElement*) 	dom/base/nsScriptLoader.cpp
8 	XUL 	nsScriptElement::MaybeProcessScript() 	dom/base/nsScriptElement.cpp
9 	XUL 	mozilla::dom::HTMLScriptElement::BindToTree(nsIDocument*, nsIContent*, nsIContent*, bool) 	dom/html/HTMLScriptElement.cpp
10 	XUL 	nsINode::doInsertChildAt(nsIContent*, unsigned int, bool, nsAttrAndChildArray&) 	dom/base/nsINode.cpp
It might be a regression between 38b6 and 38b7.

Pushlog:
https://hg.mozilla.org/releases/mozilla-release/pushloghtml?fromchange=c68a6293bb0d&tochange=504ec068cc33

I don't see anything obvious here.
That's e10s code asserting failure in a non-e10s (38) build. Since this happens between two tests, could the prefs be in a weird state?
For Mozmill tests we always have e10s turned off because this framework doesn't support it. So not sure why this code is getting run in that case.

https://github.com/mozilla/mozmill/blob/master/mozmill/mozmill/__init__.py#L125
Any ideas?
Flags: needinfo?(jduell.mcbugs)
Flags: needinfo?(honzab.moz)
The same also applies to all the Linux boxes we own in SCL3. Sadly I'm not able to submit any of those reports via the crash reporter. But ted helped me and gave me details for the dmp files:

Thread 0 (crashed)
 0  libmozalloc.so!mozalloc_abort(char const*) [mozalloc_abort.cpp:504ec068cc33 : 37 + 0x0]
    rbx = 0x00007fa52ebad868   r12 = 0x00007fa52ebad868
    r13 = 0x0000000000000000   r14 = 0x00007fa53202bdb4
    r15 = 0x0000000000000139   rip = 0x00007fa5344fbfc9
    rsp = 0x00007fff8eac6b70   rbp = 0x0000000000000003
    Found by: given as instruction pointer in context
 1  libxul.so!NS_DebugBreak [nsDebugImpl.cpp:504ec068cc33 : 469 + 0x7]
    rbx = 0x00007fff8eac6bc0   r12 = 0x00007fa52ebad868
    r13 = 0x0000000000000000   r14 = 0x00007fa53202bdb4
    r15 = 0x0000000000000139   rip = 0x00007fa53039ee85
    rsp = 0x00007fff8eac6b80   rbp = 0x0000000000000003
    Found by: call frame info
 2  libxul.so!mozilla::net::PNeckoChild::SendPHttpChannelConstructor(mozilla::net::PHttpChannelChild*, mozilla::dom::PBrowserOrId const&, IPC::SerializedLoadContext const&, mozilla::net::HttpChannelCreationArgs const&) [PNeckoChild.cpp:504ec068cc33 : 313 + 0x4]
    rbx = 0x0000000000000000   r12 = 0x00007fff8eac7290
    r13 = 0x00007fff8eac70a0   r14 = 0x00007fff8eac7090
    r15 = 0x00007fa5339c1a50   rip = 0x00007fa530615f73
    rsp = 0x00007fff8eac6ff0   rbp = 0x00007fa5197c71c0
    Found by: call frame info
 3  libxul.so!mozilla::net::HttpChannelChild::ContinueAsyncOpen() [HttpChannelChild.cpp:504ec068cc33 : 1644 + 0x43]
    rbx = 0x00007fa51765f000   r12 = 0x00007fa51d424800
    r13 = 0x00007fa5339b2990   r14 = 0x00000000cd140014
    r15 = 0x00007fa5339c1a50   rip = 0x00007fa5304e9424
    rsp = 0x00007fff8eac7040   rbp = 0x0000000000000000
    Found by: call frame info
OS: Mac OS X → All
Marking as qa-automation-blocked because we have too many instances of those crashes. Right now it's kind of hard to nail down the problem because this crash does not always happen, only about once in every 5 or so full test runs.
Hardware: x86_64 → All
Whiteboard: [mozmill] → [mozmill][qa-automation-blocked]
Actually I can see some of those lines in our add-on related Mozmill tests for Firefox 38.0 RC:

> [Child 27528] ###!!! ABORT: constructor for actor failed: file ./PNeckoChild.cpp, line 313

It's the same file as for the crash, so I wonder if that is related.
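
To check how often this abort line shows up across test logs, a small scan along the following lines could be used. This is only a sketch: the regular expression is derived from the single line quoted above, so other builds may format the message differently, and the function name `find_actor_aborts` is hypothetical.

```python
import re

# Pattern modeled on the one abort line quoted in this bug;
# it is an assumption that other abort lines share this layout.
ABORT_RE = re.compile(
    r"\[Child (?P<pid>\d+)\] ###!!! ABORT: (?P<message>.+?): "
    r"file (?P<file>\S+), line (?P<line>\d+)"
)


def find_actor_aborts(log_text):
    """Return (pid, message, file, line) for every abort line found in log_text."""
    hits = []
    for match in ABORT_RE.finditer(log_text):
        hits.append((
            int(match.group("pid")),
            match.group("message"),
            match.group("file"),
            int(match.group("line")),
        ))
    return hits


if __name__ == "__main__":
    sample = (
        "[Child 27528] ###!!! ABORT: constructor for actor failed: "
        "file ./PNeckoChild.cpp, line 313"
    )
    print(find_actor_aborts(sample))
```

Grouping the hits by file and line would show whether all aborts in a test run come from the same generated IPDL source location as the crash.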
Blocks: 1150242
No longer blocks: 1150242
Blocks: 1163181
This is the most frequent crash in our automated Mozmill test runs for Firefox 38. Specifically, every locale except en-US crashes with this stack. This happens for 38.0 build 3 and for today's 38.0.5b1.

http://mozmill-release.blargon7.com/#/functional/failure?app=Firefox&branch=38&platform=Linux&from=2015-05-10&to=2015-05-11&test=%2FtestToolbar%2FtestHomeButton.js&func=testHomeButton.js

We collect crashes at the end of the testrun, so I cannot precisely say where exactly the crash is occurring. But one scenario I have already found can be triggered by running the following two tests in sequence:

* testMenu_quitApplication (http://hg.mozilla.org/qa/mozmill-tests/file/mozilla-release/firefox/tests/functional/restartTests/testMenu_quitApplication/test1.js)
* testChangeTheme (http://hg.mozilla.org/qa/mozmill-tests/file/mozilla-release/firefox/tests/functional/testAddons/testChangeTheme.js)
Summary: Startup crash in mozilla::net::PNeckoChild::SendPHttpChannelConstructor → Startup crash for localized Firefox builds [@ mozilla::net::PNeckoChild::SendPHttpChannelConstructor]
Version: 39 Branch → 38 Branch
The tests which trigger the crash do the following:

* Quit Firefox via the menu's quit entry
* Close all tabs
* Open 'addons/install.html?addon=themes/plain.jar' via the local httpd.js webserver (which is identical to http://mozqa.com/data/firefox/addons/install.html?addon=themes/plain.jar)
* Click the link to install the theme and proceed through the install dialog
* Open the Add-on Manager, select the theme pane, and restart Firefox

During that restart Firefox crashes directly during start-up.
Attached file log.txt
Log file with some Javascript warnings/errors which might help to investigate this problem.
Can you make an HTTP log?
Thanks
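
For reference, an HTTP log for a Firefox 38 era build is typically produced by setting the NSPR logging environment variables before launching the browser. The sketch below assumes the commonly documented module list for HTTP logging; the helper names `http_logging_env` and `run_with_http_log` are hypothetical, and the bug does not state how the attached log was actually created.

```python
import os
import subprocess

# Module list commonly recommended for Necko HTTP logging (an assumption here,
# not taken from this bug).
HTTP_LOG_MODULES = (
    "timestamp,nsHttp:5,nsSocketTransport:5,nsStreamPump:5,nsHostResolver:5"
)


def http_logging_env(log_path, base_env=None):
    """Return a copy of the environment with NSPR HTTP logging enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["NSPR_LOG_MODULES"] = HTTP_LOG_MODULES
    env["NSPR_LOG_FILE"] = log_path
    return env


def run_with_http_log(firefox_binary, log_path, extra_args=()):
    """Start Firefox with HTTP logging enabled and wait for it to exit."""
    command = [firefox_binary] + list(extra_args)
    return subprocess.call(command, env=http_logging_env(log_path))
```

Because the crash happens during a restart, the variables would have to be set for the process that launches Firefox so that the restarted instance inherits them.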
Flags: needinfo?(hskupin)
Attached file http_log.txt
Sure! I totally forgot about that. Here it is.
Flags: needinfo?(hskupin)
Further testing shows that this problem seems to exist only when about:newtab is used for newly opened tabs. The crash is gone when I make use of about:blank in the Mozmill test.

We have already had a couple of problems with this page in the past, mostly with background thumbnailing. Given the situation in this bug, I feel that this could be closely related.
Interestingly this is not happening on Windows, maybe because the content sandbox is disabled via bug 1158849?
Flags: needinfo?(bobowen.code)
Looking at the log, I can only see that after the restart it is using e10s, as if the pref had changed.
And there is a log for the child process, so the e10s pref must have changed.
(In reply to Henrik Skupin (:whimboo) from comment #17)
> Interestingly this is not happening on Windows, maybe because the content
> sandbox is disabled via bug 1158849?

As far as I can tell, the content process (and therefore the thumbnail process) is only sandboxed on Nightly for both OSX and Windows now.
So, I don't think so.

The thumbnail process was being sandboxed in branches other than Nightly by mistake on Windows, which is what bug 1158849 fixed.
Flags: needinfo?(bobowen.code)
(In reply to Dragana Damjanovic [:dragana] from comment #18)
> Looking at the log, I can only see that after the restart it is using e10s,
> as if the pref had changed.
> And there is a log for the child process, so the e10s pref must have changed.

The pref you are referring to here should be browser.tabs.remote.autostart, right? And in this case it should be true?
If that is the pref, it is still set to false after restart. Just tested.
about:blank takes a slightly different path, which is why it is not crashing.

Why it is not crashing on Windows at this moment, I do not know.

The pref that I was talking about is browser.tabs.remote.autostart.

The place where the necko code decides whether to create a child channel or not is:
http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpHandler.cpp#1840

I do not see much from the log.
(In reply to Dragana Damjanovic [:dragana] from comment #22)
> about:blank takes a slightly different path, which is why it is not crashing.

Right, because it doesn't trigger the background thumbnail process. I simply used that for now to stop the crash in our tests of Firefox 38 builds.

> Why it is not crashing on Windows at this moment, I do not know.
> 
> The pref that I was talking about is browser.tabs.remote.autostart.
> 
> The place where the necko code decides whether to create a child channel or
> not is:
> http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpHandler.cpp#1840

I would also like to say that e10s is not enabled by default on Aurora. So are those child processes created because of an asynchronous background process?
I found bug 726347, which added a pref for disabling the background thumbnail process. I disabled background thumbnailing via that pref while leaving the newtab page active, and indeed Firefox stops crashing.
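
A workaround like this can be applied to a test profile by writing the prefs into its user.js file. The following is a minimal sketch under stated assumptions: browser.tabs.remote.autostart is the pref named in this bug, but the thumbnail pref name is not given here, so `browser.pagethumbnails.capturing_disabled` is an assumption that should be checked against bug 726347. The helper names are hypothetical.

```python
import json
import os

PREFS = {
    # Named in this bug; Mozmill keeps e10s turned off.
    "browser.tabs.remote.autostart": False,
    # ASSUMPTION: the pref name is not stated in this bug; verify before use.
    "browser.pagethumbnails.capturing_disabled": True,
}


def format_user_pref(name, value):
    """Format one user.js line; json.dumps yields JavaScript-compatible literals."""
    return "user_pref({}, {});".format(json.dumps(name), json.dumps(value))


def write_user_js(profile_dir, prefs):
    """Write the given prefs to user.js inside profile_dir and return its path."""
    path = os.path.join(profile_dir, "user.js")
    with open(path, "w") as handle:
        for name in sorted(prefs):
            handle.write(format_user_pref(name, prefs[name]) + "\n")
    return path
```

Firefox reads user.js at startup, so prefs written this way are also in effect for the restart during which the crash occurs.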
Flags: needinfo?(ttaubert)
Summary: Startup crash for localized Firefox builds [@ mozilla::net::PNeckoChild::SendPHttpChannelConstructor] → [Page Thumbnails] Startup crash for localized Firefox builds [@ mozilla::net::PNeckoChild::SendPHttpChannelConstructor]
Just for information: the machines where this crash is occurring are located in SCL3 behind a proxy. Maybe that has an influence here. But I'm not sure why it's only happening for localized builds and not en-US, and not on Windows. It's fun, but I do not have time to investigate it further. If it turns out to be really important I can have a further look, but for now I will continue with other important things.
Sorry, no time to investigate this in the near future. Drew has been working mostly on the background thumbnailer though. You might also want to loop in a few e10s folks who know about the HTTP channel implementation.
Flags: needinfo?(ttaubert)
Sounds like we've got steps to reproduce.  We should do that and attach a debugger, and see what happens in the code in comment 22 (i.e. where we create HTTP channels).  We create an e10s child channel if IsNeckoChild() returns true.  But if that's happening, something really weird is going on with XRE_GetProcessType() and we should ask :bent how it could ever return the wrong answer.

Note that IsNeckoChild() caches its result, so if there's a race/window where XRE_GetProcessType() returns the wrong answer, we'd keep it forever. But really, processes shouldn't change type (?).  I'm fine with getting rid of the cached result if it "fixes" things here.
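
The caching concern can be illustrated with a toy model. This is not the Gecko code: `get_process_type` merely stands in for XRE_GetProcessType(), `is_necko_child_cached` for a cached IsNeckoChild(), and the wrong answer during an early window is purely hypothetical.

```python
class ProcessTypeModel(object):
    """Toy model of a process-type check whose first result is cached forever."""

    def __init__(self):
        # What the stand-in for XRE_GetProcessType() currently reports.
        self.reported_type = "default"
        # None means the cached answer has not been computed yet.
        self._cached_is_child = None

    def get_process_type(self):
        return self.reported_type

    def is_necko_child_cached(self):
        # The first call latches the answer; later calls never re-query.
        if self._cached_is_child is None:
            self._cached_is_child = self.get_process_type() == "content"
        return self._cached_is_child

    def is_necko_child_uncached(self):
        # Re-queries the process type on every call.
        return self.get_process_type() == "content"


if __name__ == "__main__":
    model = ProcessTypeModel()
    model.reported_type = "content"   # hypothetical wrong answer in an early window
    model.is_necko_child_cached()     # the wrong answer is latched here
    model.reported_type = "default"   # the reported type is correct from now on
    print(model.is_necko_child_cached())    # stale cached value
    print(model.is_necko_child_uncached())  # fresh value
```

In this model the cached check keeps reporting a child process after the underlying answer has been corrected, which is the failure mode that dropping the cache would rule out.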
Flags: needinfo?(jduell.mcbugs)
Flags: needinfo?(honzab.moz)
I assume someone would have to build an l10n debug build of the most recent Firefox 38.0. I don't have the capacity to look that deep into it. But I would be happy to run a debugger if I get such a build, e.g. via try.
This is still a top crasher for our Mozmill tests. We get hundreds of reports for each release. :(
OK, tracking for 39.
[Tracking Requested - why for this release]:
It would be good if we can get this fixed for 38ESR given that we will have more releases here. I have to say that we have not seen this crash for 39 builds yet.
I have tried to reproduce it locally, but I couldn't. I even used a proxy, but it made no difference.

From a chat with :whimboo - he will start replacing Mozmill tests with Marionette, so we will see in a week.

I will try to figure out how to make the build from comment #28.
Flags: needinfo?(hskupin)
Due to a lot of failures for Mozmill tests and no one who could fix or at least analyze the problems, we decided to shut down most of them about 2 weeks ago. Since then we no longer see this crash.

I feel it's not worth the time to dig more into this crash given that all reports so far came from our testing machines. I would close it as incomplete for now, with the option to reopen if it comes back with the Marionette tests.
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(hskupin)
Resolution: --- → INCOMPLETE
Dropping tracking for this since the crash sounds very specific to particular test machines and no one is working on it. We aren't seeing this in crash-stats.