Closed Bug 1271890 Opened 7 years ago Closed 6 years ago

Crash in base::win::PEImage::GetProcAddress

Categories

(Core :: Security: Process Sandboxing, defect, P3)

x86_64
Windows
defect

Tracking

()

RESOLVED FIXED
mozilla51
Tracking Status
firefox48 --- wontfix
firefox49 --- wontfix
firefox-esr45 --- wontfix
firefox50 --- wontfix
firefox51 --- fixed

People

(Reporter: kanru, Unassigned)

References

Details

(Keywords: crash, Whiteboard: sb+)

Crash Data

This bug was filed from the Socorro interface and is 
report bp-aa4b0171-4be9-4129-a9e9-e07002160510.
=============================================================

This is currently #14 crashes on Nightly 20160509040557
18 crashes since May 04, the earliest crash is found in 46.0a1

https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&platform=Windows&signature=base%3A%3Awin%3A%3APEImage%3A%3AGetProcAddress&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&page=1#reports

stack:
base::win::PEImage::GetProcAddress(char const*)
sandbox::ServiceResolverThunk::ResolveTarget(void const*, char const*, void**)
sandbox::InterceptionManager::PatchClientFunctions(sandbox::DllInterceptionData*, unsigned __int64, sandbox::DllInterceptionData*)
sandbox::InterceptionManager::PatchNtdll(bool)
sandbox::PolicyBase::SetupAllInterceptions(sandbox::TargetProcess*)
sandbox::PolicyBase::AddTarget(sandbox::TargetProcess*)
sandbox::BrokerServicesBase::SpawnTarget(wchar_t const*, wchar_t const*, sandbox::TargetPolicy*, _PROCESS_INFORMATION*)
mozilla::SandboxBroker::LaunchApp(wchar_t const*, wchar_t const*, bool, void**)
mozilla::ipc::GeckoChildProcessHost::PerformAsyncLaunchInternal(std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, base::ProcessArchitecture)
mozilla::ipc::GeckoChildProcessHost::PerformAsyncLaunch(std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, base::ProcessArchitecture)
mozilla::ipc::GeckoChildProcessHost::RunPerformAsyncLaunch(std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, base::ProcessArchitecture)

Bugzilla search found that this also happens intermittently on automation, probably something we want to address.
Flags: needinfo?(bobowen.code)
Yes, this is possibly because more people are trying out the 64-bit version, as it only seems to happen there.

Unfortunately this is in Chromium sandbox code and it's a part they don't use.

Once we start using firefox.exe for the content process it might help.
Although we wouldn't be able to get rid of this different code path use until we switch all child processes to firefox.exe.

I'll take another look at some point (leaving ni?) to see if I can work out why we're getting this.
From what I remember before it looked like plugin-container.exe was getting unloaded while we were trying to look up addresses for patching in the child process.
OS: Windows 10 → Windows
Hardware: Unspecified → x86_64
Whiteboard: sb?
See Also: → 1275813
Hey Kanru, is there anyone on your team that might have the time to poke at this?
Flags: needinfo?(bobowen.code) → needinfo?(kchen)
(In reply to Jim Mathies [:jimm] from comment #2)
> Hey Kanru, is there anyone on your team that might have the time to poke at
> this?

I can take a look. However I've only skimmed through the sandboxing code so it might take a while.

(In reply to Bob Owen (:bobowen) from comment #1)
> Yes, this is possibly because more people are trying out the 64-bit version,
> as it only seems to happen there.
> 
> Unfortunately this is in Chromium sandbox code and it's a part they don't
> use.

There is a big warning at the beginning of pe_image.cc saying that the code is not tested on x64. The newer version in chromium removed the warning and didn't change the code very much. Not sure if it's used in production but it was tested.
 
> Once we start using firefox.exe for the content process it might help.
> Although we wouldn't be able to get rid of this different code path use
> until we switch all child processes to firefox.exe.

I don't know yet how using firefox.exe for the content process would help here but I assume it's because we don't have to patch the child process when using firefox.exe.

> I'll take another look at some point (leaving ni?) to see if I can work out
> why we're getting this.
> From what I remember before it looked like plugin-container.exe was getting
> unloaded while we were trying to look up addresses for patching in the child
> process.

Do you mean FreeLibrary is called somewhere before GetProcAddress? MSDN says FreeLibrary is refcounted so the crash could only happen if FreeLibrary is called in multiple threads and on a handle returned by GetModuleHandle.

Maybe we could added a GetModuleHandleEx guard before GetProcAddress and assert that handle is valid. But soon or later we will hit the invalid handle in other places.
Flags: needinfo?(kchen)
(In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #3)

> I can take a look. However I've only skimmed through the sandboxing code so
> it might take a while.

Thanks.

> (In reply to Bob Owen (:bobowen) from comment #1)
> > Yes, this is possibly because more people are trying out the 64-bit version,
> > as it only seems to happen there.
> > 
> > Unfortunately this is in Chromium sandbox code and it's a part they don't
> > use.
> 
> There is a big warning at the beginning of pe_image.cc saying that the code
> is not tested on x64. The newer version in chromium removed the warning and
> didn't change the code very much. Not sure if it's used in production but it
> was tested.

Yes, I noticed that when I looked at this previously for bug 1121028, but then the problem seemed to go away.
After searching yesterday it looks like there are quite a few related intermittent oranges.
Chrome doesn't use the SANDBOX_EXPORTS parts of the code, because they use the same EXE for the child.

> > Once we start using firefox.exe for the content process it might help.
> > Although we wouldn't be able to get rid of this different code path use
> > until we switch all child processes to firefox.exe.
> 
> I don't know yet how using firefox.exe for the content process would help
> here but I assume it's because we don't have to patch the child process when
> using firefox.exe.

We still have to patch the child process, but it doesn't have to load the child EXE to work out where it needs to patch.
 
> > I'll take another look at some point (leaving ni?) to see if I can work out
> > why we're getting this.
> > From what I remember before it looked like plugin-container.exe was getting
> > unloaded while we were trying to look up addresses for patching in the child
> > process.
> 
> Do you mean FreeLibrary is called somewhere before GetProcAddress? MSDN says
> FreeLibrary is refcounted so the crash could only happen if FreeLibrary is
> called in multiple threads and on a handle returned by GetModuleHandle.
> 
> Maybe we could added a GetModuleHandleEx guard before GetProcAddress and
> assert that handle is valid. But soon or later we will hit the invalid
> handle in other places.

From having another quick look yesterday, I'm less convinced that this might be some problem with the EXE getting unloaded.
It only calls FreeLibrary after all the patching is done and the address is outside of where it should be anyway I think.
I was thinking of landing a patch to keep the child EXE loaded anyway, so we don't have to keep loading and unloading it every time we start a sandboxed child process.
I have a patch, but I'll do that in a different bug.

My plan for this bug was to look at several dumps to see if I could see any correlation, maybe in the particular interception that is being applied.
I also noticed that this hasn't happened on Nightly since build ID 20160520030251.
The only thing I can think that might have affected this are my patches to remove the static CRT linking and link the sandbox into firefox.exe (bug 1035125), but I'm pretty sure that landed a couple of days earlier, so that doesn't quite correlate.
(In reply to Bob Owen (:bobowen) from comment #4)
> > > I'll take another look at some point (leaving ni?) to see if I can work out
> > > why we're getting this.
> > > From what I remember before it looked like plugin-container.exe was getting
> > > unloaded while we were trying to look up addresses for patching in the child
> > > process.
> > 
> > Do you mean FreeLibrary is called somewhere before GetProcAddress? MSDN says
> > FreeLibrary is refcounted so the crash could only happen if FreeLibrary is
> > called in multiple threads and on a handle returned by GetModuleHandle.
> > 
> > Maybe we could added a GetModuleHandleEx guard before GetProcAddress and
> > assert that handle is valid. But soon or later we will hit the invalid
> > handle in other places.
> 
> From having another quick look yesterday, I'm less convinced that this might
> be some problem with the EXE getting unloaded.
> It only calls FreeLibrary after all the patching is done and the address is
> outside of where it should be anyway I think.
> I was thinking of landing a patch to keep the child EXE loaded anyway, so we
> don't have to keep loading and unloading it every time we start a sandboxed
> child process.
> I have a patch, but I'll do that in a different bug.

Could you link to the bug? Maybe your patch can also fix this bug.
 
> My plan for this bug was to look at several dumps to see if I could see any
> correlation, maybe in the particular interception that is being applied.
> I also noticed that this hasn't happened on Nightly since build ID
> 20160520030251.
> The only thing I can think that might have affected this are my patches to
> remove the static CRT linking and link the sandbox into firefox.exe (bug
> 1035125), but I'm pretty sure that landed a couple of days earlier, so that
> doesn't quite correlate.

I saw new reports from 5/25, 5/27 and 5/29 so it's still occurring.

I'll continue to find correlations but at a lower priority since sandboxing is currently only enabled on Nightly.
Flags: needinfo?(bobowen.code)
(In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #5)
> (In reply to Bob Owen (:bobowen) from comment #4)

> Could you link to the bug? Maybe your patch can also fix this bug.

I'll get that filed and upload the patch.

> I saw new reports from 5/25, 5/27 and 5/29 so it's still occurring.
> 
> I'll continue to find correlations but at a lower priority since sandboxing
> is currently only enabled on Nightly.

Yes, just a temporary lull it seems.

Sandboxing is enabled for GMP and NPAPI (64-bit) on release, it is only content process sandboxing that is Nightly only.
Flags: needinfo?(bobowen.code)
See Also: → 1276961
Flags: needinfo?(jmathies)
Whiteboard: sb? → sbwc1
Flags: needinfo?(jmathies)
Priority: -- → P3
Whiteboard: sbwc1 → sb+
Crash volume for signature 'base::win::PEImage::GetProcAddress':
 - nightly (version 51): 10 crashes from 2016-08-01.
 - aurora  (version 50): 10 crashes from 2016-08-01.
 - beta    (version 49): 48 crashes from 2016-08-02.
 - release (version 48): 48 crashes from 2016-07-25.
 - esr     (version 45): 26 crashes from 2016-05-02.

Crash volume on the last weeks (Week N is from 08-22 to 08-28):
            W. N-1  W. N-2  W. N-3
 - nightly       2       2       2
 - aurora        6       1       1
 - beta         11      19      11
 - release      16      13       8
 - esr           5       0       1

Affected platform: Windows

Crash rank on the last 7 days:
           Browser     Content   Plugin
 - nightly #173
 - aurora  #191
 - beta    #1167
 - release #1067
 - esr     #897
See Also: → 1307327
As I said in bug 1307327 ... we've not seen this at all in Aurora or Nightly since August, which was not long before I landed the latest updates to the Chromium sandbox code in Fx51 (bug 1287426).
I can't spot a specific change that might have fixed this, but then we never really got to the bottom of why it was happening anyway.

Fingers crossed ...
We've not seen this in any of the 51 Betas either and it was there in nearly all the 50s.
So I think we can safely say this is resolved now, with bug 1287426 being the only likely candidate for the fix.
Status: NEW → RESOLVED
Closed: 6 years ago
Depends on: 1287426
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.