Closed Bug 1150515 Opened 9 years ago Closed 9 years ago

crash in mozalloc_abort(char const* const) | NS_DebugBreak | mozilla::ipc::GeckoChildProcessHost::OpenPrivilegedHandle(unsigned long)

Categories

(Core :: IPC, defect)

40 Branch
x86
Windows XP
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla40
Tracking Status
e10s - ---
firefox37 --- wontfix
firefox38 --- wontfix
firefox38.0.5 --- wontfix
firefox39 --- fixed
firefox40 --- fixed
firefox-esr38 --- wontfix

People

(Reporter: adalucinet, Assigned: bobowen)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-5ae13675-ef03-4692-b435-7a82d2150402.
=============================================================

Startup crash encountered as soon as latest Nightly 40.0a1 is launched with a clean profile on Windows XP.

More reports:
https://crash-stats.mozilla.com/signature/?signature=mozalloc_abort%28char+const*+const%29+|+NS_DebugBreak+|+mozilla%3A%3Aipc%3A%3AGeckoChildProcessHost%3A%3AOpenPrivilegedHandle%28unsigned+long%29&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&page=1
No bug on latest Nightly 40.0a1 launched with a new profile on Windows 8.1
I get this everytime I open a new tab on win2k3.
For example, report bp-6875ac26-f7c6-4877-94df-d8cee2150408
Does this only happen with e10s enabled or also without it?
This happens with and without e10s enabled.
After some tests, I see that only the "New tab" is affected.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #3)
> Does this only happen with e10s enabled or also without it?

For me, this takes down the entire process on startup on XP, only with e10s enabled. Can't use e10s there because of it, and while it's the default I basically can't use Nightly there. I'm assuming this is a regression and we should get a window for it to figure out what broke this. Alexandra, can you get us one?
Component: General → IPC
Flags: needinfo?(alexandra.lucinet)
OS: Windows NT → Windows XP
Product: Firefox → Core
(In reply to :Gijs Kruitbosch from comment #5)
> (In reply to Robert Kaiser (:kairo@mozilla.com) from comment #3)
> > Does this only happen with e10s enabled or also without it?
> 
> For me, this takes down the entire process on startup on XP, only with e10s
> enabled. Can't use e10s there because of it, and while it's the default I
> basically can't use Nightly there. I'm assuming this is a regression and we
> should get a window for it to figure out what broke this. Alexandra, can you
> get us one?

Successfully narrowed down a *manual* regression range:
Last good Nightly: 2015-04-01
First bad Nightly: 2015-04-02
Pushlog: http://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2015-04-01&enddate=2015-04-02

Please note that with builds before this start up crash (including the one from 2015-04-01), I hit the shutdown crash bug 1138520;
Flags: needinfo?(alexandra.lucinet)
Considering the crash stack, I suspect bug 1119878. Bob, is that plausible and/or can you take a look at this?
Flags: needinfo?(bobowen.code)
(In reply to :Gijs Kruitbosch from comment #7)
> Considering the crash stack, I suspect bug 1119878. Bob, is that plausible
> and/or can you take a look at this?

It certainly is plausible, as part of that bug fix I fixed a bug in Open(Privileged)ProcessHandle.
Before, it apparently never failed (always returned true) because it was checking for INVALID_HANDLE_VALUE instead of NULL for the return value form OpenProcess when that call fails.

We'd tried to fix this before, but got a load of try failures, which indicated that the bug was possibly masking other bugs.
I believed (clearly mistakenly) that my patch had sorted these problems, so I fixed the OpenProcessHandle bug as well.

So, I can undo that part of my patch although I'll need to check that any new usages of OpenProcessHandle won't break.

One thing I don't understand is why this problem wasn't caught in our automated tests.
Also, as I say above this is probably just masking a different bug, but at least one that possibly doesn't cause a start-up crash. :-)
Flags: needinfo?(bobowen.code)
(In reply to Bob Owen (:bobowen) from comment #8)
> (In reply to :Gijs Kruitbosch from comment #7)
> > Considering the crash stack, I suspect bug 1119878. Bob, is that plausible
> > and/or can you take a look at this?
> 
> It certainly is plausible, as part of that bug fix I fixed a bug in
> Open(Privileged)ProcessHandle.
> Before, it apparently never failed (always returned true) because it was
> checking for INVALID_HANDLE_VALUE instead of NULL for the return value form
> OpenProcess when that call fails.
> 
> We'd tried to fix this before, but got a load of try failures, which
> indicated that the bug was possibly masking other bugs.
> I believed (clearly mistakenly) that my patch had sorted these problems, so
> I fixed the OpenProcessHandle bug as well.
> 
> So, I can undo that part of my patch although I'll need to check that any
> new usages of OpenProcessHandle won't break.
> 
> One thing I don't understand is why this problem wasn't caught in our
> automated tests.
> Also, as I say above this is probably just masking a different bug, but at
> least one that possibly doesn't cause a start-up crash. :-)

I am not an expert, but it seems like all the crashes in socorro are from WinXP 64-bit. We might be running 32-bit on infra for the tests? Is that plausible at all? Alexandra, am I right in thinking you're running 64-bit XP ? ( https://msdn.microsoft.com/en-us/library/windows/desktop/ms724832%28v=vs.85%29.aspx and the socorro aggregations suggest that all the 5.2 version numbers imply this is 64-bit XP)
(In reply to :Gijs Kruitbosch from comment #9)
> (In reply to Bob Owen (:bobowen) from comment #8)
 
> I am not an expert, but it seems like all the crashes in socorro are from
> WinXP 64-bit. We might be running 32-bit on infra for the tests? Is that
> plausible at all? Alexandra, am I right in thinking you're running 64-bit XP
> ? (
> https://msdn.microsoft.com/en-us/library/windows/desktop/ms724832%28v=vs.
> 85%29.aspx and the socorro aggregations suggest that all the 5.2 version
> numbers imply this is 64-bit XP)

Yes, WinXP 32-bit seems to be working fine and it crashes immediately on WinXP 64-bit.

When I try with a Nightly build from 1st Apr, it appears to start OK but you get no content as the child process has failed to start.
The main firefox process also fails to shut down cleanly.

Alexandra - can you check to see if your build from 1st Apr was really working properly.

So, it looks like this is an existing bug that was partially being hidden by the OpenProcessHandle bug.

I suspect that this is a similar problem to bug 1110760, where the sandboxed process wasn't starting on Vista 64-bit.
If so, then e10s must have been broken on WinXP 64-bit since Dec last year.

I'll look into it, whether Alexandra confirms that it is a new problem or not.
Assignee: nobody → bobowen.code
Status: NEW → ASSIGNED
Flags: needinfo?(alexandra.lucinet)
(In reply to Bob Owen (:bobowen) from comment #10)
> Alexandra - can you check to see if your build from 1st Apr was really
> working properly.

As mentioned in comment 6, with Nightly from 2015-04-01, under Windows XP 64 bit, I hit the shutdown crash bug 1138520; also, some rendering issues are present.
Flags: needinfo?(alexandra.lucinet)
Mike - wow_helper.exe (used for the Windows sandbox) doesn't seem to be compatible with Windows XP 64-bit.
This is the thing you got building to fix the sandbox for Vista 64-bit (bug 1110760).

If I try to just run the exe from a command prompt I get a "wow_helper.exe is not a valid Win32 application." error.
This is a bit odd given that it is a 64-bit exe.
It also gives an "Access is denied." error in the command prompt, although I can't see that error when I run procmon.

If I swap it for the Chrome version of wow_helper.exe it starts fine.

Any ideas?
Flags: needinfo?(mh+mozilla)
This is how it's linked:
c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src/obj-firefox/_virtualenv/Scripts/python.exe c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src/config/expandlibs_exec.py --uselist -- link -NOLOGO -OUT:wow_helper.exe -PDB:wow_helper.pdb -SUBSYSTEM:WINDOWS,5.01 -LARGEADDRESSAWARE -NXCOMPAT -RELEASE -DYNAMICBASE -DEBUG -DEBUGTYPE:CV -DEBUG -OPT:REF  service64_resolver.obj target_code.obj wow_helper.obj ./module.res    kernel32.lib user32.lib gdi32.lib winmm.lib wsock32.lib advapi32.lib secur32.lib netapi32.lib

There's a -SUBSYSTEM:WINDOWS,5.01 there so it should work. If it doesn't then I don't know, and I don't have access to a XP 64-bits.

David, any idea?
Flags: needinfo?(mh+mozilla) → needinfo?(dmajor)
03:50:06     INFO -  LINK : warning LNK4010: invalid subsystem version number 5.01; default subsystem version assumed

On a Win64 build you have to give a value of at least 5.02, otherwise it defaults to 6.00. (Don't worry, 5.02 is fine for XP64 and Server 2003) https://msdn.microsoft.com/en-us/library/fcc1zstk.aspx
Flags: needinfo?(dmajor)
Although, once you make that change, we shouldn't call the resulting binary from XP32. So I'd put that behind a version check if it's not already.
(In reply to David Major [:dmajor] from comment #14)
> 03:50:06     INFO -  LINK : warning LNK4010: invalid subsystem version
> number 5.01; default subsystem version assumed
> 
> On a Win64 build you have to give a value of at least 5.02, otherwise it
> defaults to 6.00. (Don't worry, 5.02 is fine for XP64 and Server 2003)
> https://msdn.microsoft.com/en-us/library/fcc1zstk.aspx

Thanks David, I had just found the exact same thing. :-)
Looks like a lot of people suggest -OSVERSION:5.1 as well, but I'm not sure that's necessary.
(In reply to David Major [:dmajor] from comment #15)
> Although, once you make that change, we shouldn't call the resulting binary
> from XP32. So I'd put that behind a version check if it's not already.

Yes, the chromium sandbox code makes sure we are running under WOW64 before using wow_helper.
Yeah I don't think OSVERSION is necessary. We never used it on the other binaries, at least.
... and the Windows sandbox lives on WinXP 64-bit \o/

I had to remove all the config flags from my mozconfig to get it to run properly.
I seemed to be getting the original problem from bug 1110760, where the functions weren't in the right order.

I checked it against Vista 64-bit as well.

Here's a try push that just builds, to see if the debug build from there works:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a21651e00341

Could well have been something else in my mozconfig that upsets the delicate balance of this binary.

(In reply to David Major [:dmajor] from comment #18)
> Yeah I don't think OSVERSION is necessary. We never used it on the other
> binaries, at least.

It looks like the OSVERSION gets defaulted to the same as the subsystem, so we're fine without it.
I'm not sure if there is a better way of doing this, but this seems to work.

The debug build from the try push did work, so it must be something else in my mozconfig that screws up the function ordering.
Attachment #8599480 - Flags: review?(mh+mozilla)
Comment on attachment 8599480 [details] [diff] [review]
Set the subsystem to WINDOWS,5.02 for wow_helper so that it runs on WinXP 64-bit.

Review of attachment 8599480 [details] [diff] [review]:
-----------------------------------------------------------------

::: security/sandbox/win/wow_helper/Makefile.in
@@ +41,5 @@
>  # LNK1246: '/SAFESEH' not compatible with 'x64' target machine
>  LDFLAGS := $(filter-out -SAFESEH,$(LDFLAGS))
> +
> +# On a x64 builds we need to specify a subsystem of at least 5.02, otherwise it
> +# gets rejected and uses the default one, which is later than WinXP 64-bit.

# When targetting x64, we need to specify a subsystem of at least 5.02, because the 5.01 value
# we inherit from the x86 parts is silently ignored, making the linker default to 6.00 (Vista)

@@ +42,5 @@
>  LDFLAGS := $(filter-out -SAFESEH,$(LDFLAGS))
> +
> +# On a x64 builds we need to specify a subsystem of at least 5.02, otherwise it
> +# gets rejected and uses the default one, which is later than WinXP 64-bit.
> +WIN32_GUI_EXE_LDFLAGS=-SUBSYSTEM:WINDOWS,5.02

Gotta love when wrong values are ignored instead of throwing an error.
Attachment #8599480 - Flags: review?(mh+mozilla) → review+
https://hg.mozilla.org/mozilla-central/rev/d0703ff693a1
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
Comment on attachment 8599480 [details] [diff] [review]
Set the subsystem to WINDOWS,5.02 for wow_helper so that it runs on WinXP 64-bit.

Maire - just to let you know that GMP should start working on WinXP 64-bit with this patch. Also to see if you agree that it's not worth trying to get this into Fx38.

Approval Request Comment
[Feature/regressing bug #]:
The sandbox hasn't been working on WinXP 64-bit since it was released for the GMP process, so Bug 985252.

[User impact if declined]:
GMP processes will continue to fail to start on WinXP 64-bit.
Only asking for Aurora here as we are quite late in the release cycle.
My understanding is that the general usage of GMP for OpenH264 has been fairly low and EME is not being released for WinXP, so it may not be worth trying to get this into the next release.

[Describe test coverage new/current, TreeHerder]:
GMP processes have mochitests, but they don't run on WinXP 64-bit.

I have manual tested e10s (content process) and GMP OpenH264 (using http://mozilla.github.io/webrtc-landing/pc_test.html) on a WinXP 64-bit VM.

[Risks and why]:
Low (maybe medium) - currently the sandboxed process fails to start completely on WinXP 64-bit. It may be that this is hiding other problems on the platform for GMP though.

[String/UUID change made/needed]:
None
Flags: needinfo?(mreavy)
Attachment #8599480 - Flags: approval-mozilla-aurora?
Thanks for the ping, Bob.  I think it's too risky to try to get this into Fx38 this late in the cycle.  I think it needs more bake time.  I can see possible edge conditions popping up once this goes to a larger audience, and I think we'll need at least a few weeks (which the Fx39 Aurora/Beta cycles should give us) to shake them out and feel confident that we're "good" for Release.
Flags: needinfo?(mreavy)
Comment on attachment 8599480 [details] [diff] [review]
Set the subsystem to WINDOWS,5.02 for wow_helper so that it runs on WinXP 64-bit.

Approved for uplift to aurora. Sounds potentially risky, so let's keep an eye on the crash rate after this uplifts.
Attachment #8599480 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
I don't reproduce it anymore with the latest Nightly, thanks.
Flags: needinfo?(spetreolle)
Sylvain Petreolle, thanks for testing!

Apparently, various crashes started to show up in Socorro: https://crash-stats.mozilla.com/signature/?signature=mozalloc_abort(char+const*+const)+%7C+NS_DebugBreak+%7C+mozilla%3A%3Aipc%3A%3AGeckoChildProcessHost%3A%3AOpenPrivilegedHandle(unsigned+long)&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&page=1 and, as far as I can tell, the crashing thread is the same.

Any ideas? Thanks in advance!
Flags: needinfo?(bobowen.code)
In case it helps: Almost all of the remaining crashes are on native-32-bit XPSP3.
(In reply to Alexandra Lucinet, QA Mentor [:adalucinet] from comment #30)
> Sylvain Petreolle, thanks for testing!
> 
> Apparently, various crashes started to show up in Socorro:
> https://crash-stats.mozilla.com/signature/
> ?signature=mozalloc_abort(char+const*+const)+%7C+NS_DebugBreak+%7C+mozilla%3A
> %3Aipc%3A%3AGeckoChildProcessHost%3A%3AOpenPrivilegedHandle(unsigned+long)&_c
> olumns=date&_columns=product&_columns=version&_columns=build_id&_columns=plat
> form&_columns=reason&_columns=address&page=1 and, as far as I can tell, the
> crashing thread is the same.
> 
> Any ideas? Thanks in advance!

All I can really say is that this crash is a symptom of the sandboxed child process failing to start for some reason.

This crash itself should be fixed by bug 1146874, but that won't fix the problems with the process start-up, which could potentially be down to more than one reason.

I haven't had a chance to look through many of the crashes, but maybe we are hitting more problems with Anti-Virus and AV like things.

I'll leave the needinfo and try and look into the crashes in more detail, when I can.
Also, maybe I can look into some ways of getting better information about why the child process isn't starting or being set up correctly.
(In reply to Alexandra Lucinet, QA Mentor [:adalucinet] from comment #30)
> Sylvain Petreolle, thanks for testing!
> 
> Apparently, various crashes started to show up in Socorro:
> https://crash-stats.mozilla.com/signature/
> ?signature=mozalloc_abort(char+const*+const)+%7C+NS_DebugBreak+%7C+mozilla%3A
> %3Aipc%3A%3AGeckoChildProcessHost%3A%3AOpenPrivilegedHandle(unsigned+long)&_c
> olumns=date&_columns=product&_columns=version&_columns=build_id&_columns=plat
> form&_columns=reason&_columns=address&page=1 and, as far as I can tell, the
> crashing thread is the same.
> 
> Any ideas? Thanks in advance!

The current crashes from this search are all on Aurora, so shouldn't be down to the sandbox, as that is only enabled on Nightly (for e10s).

So, this must be down to the child process not starting for a different reason.

All of the module lists include privman64.dll which apparently is from some corporate security tool:
http://www.beyondtrust.com/Solutions/ManagingWindowsDesktops/

The crashes are in two clusters when you look at the date, so might be just be one person.

This will stop crashing once bug 1146874 lands, but it may just mean that e10s doesn't work anyway if a different content process doesn't get started.
Flags: needinfo?(bobowen.code)
You need to log in before you can comment on or make changes to this bug.