Closed Bug 741540 Opened 12 years ago Closed 12 years ago

Thread-unsafe nsWindowsDllInterceptor::AddHook is used in AvailableMemoryTracker after other threads have been initialized (?)

Categories

(Core :: XPConnect, defect)

x86
Windows 7
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla14
Tracking Status
firefox11 - wontfix
firefox12 - wontfix
firefox13 - affected
firefox14 + fixed

People

(Reporter: marcia, Assigned: justin.lebar+bug)

Details

Attachments

(1 file, 1 obsolete file)

Trusteer Rapport passed this info on to us re: a new bug they are seeing in Firefox 11. KaiRo thought it would be useful to file a new bug to track this. 

http://mxr.mozilla.org/mozilla-release/source/xpcom/base/AvailableMemoryTracker.cpp#458, which was added in https://hg.mozilla.org/releases/mozilla-release/rev/8348c44ba1ca is something that might be useful according to KaiRo. Adding Justin LeBar to the bug.

Email From Trusteer:
While analyzing logs from our users in the field, we noticed several
crashes in FF 11.0.
After inspecting them , it looks like a bug that was introduced in FF 11.
Here is a technical explanation to pass on to the relevant people on
your side:
The actual bug is in nsWindowDllInterceptor.h. in the AddHook function.
The crashes we saw are in your MapViewOfFileHook function.
In FF 11 you use a new mechanism called AvailableMemoryTracker. In
AvailableMemoryTracker.cpp (line 458) you set this mechanism up by
hooking MapViewOfFile.
The implementation of AddHook (nsWindowsDllInterceptor.h) hooks the
relevant function and only after that assigns the tramp value to the
origFunc variable. This means that if between the call to
CreateTrampoline (in line 161) and the actual assignment (line 167) , a
code calls MapViewOfFile, a crash will happen because the
MapViewOfFileHook code (AvailableMemoryTracker.cpp line 255) will call
sMapViewOfFileOrig which is NULL (or points at garbage) at this point.
I think the analysis in comment 0 is correct; there's a race condition here.  It's worse now that we're hooking into VirtualAlloc, since lots of code can end up in there.
Assignee: nobody → justin.lebar+bug
Status: NEW → ASSIGNED
Summary: New bug in Firefox 11 related to AvailableMemoryTracker? → Race condition in nsWindowsDllInterceptor::AddHook
Marcia, can they be cc'ed on the bug? I'd like to know if they saw this bug in the wild, and whether this is something that we should fast-track to fix for FF12...
We can fix the race condition identified in comment 0 easily enough, but it wouldn't fix the larger problem, which is that we don't atomically modify VirtualAlloc itself.  So there'd still be a race condition here.

We can fix this on our end by initializing the AvailableMemoryTracker before we think any other threads have started, but I'd like to know more about Trusteer's situation: Is he starting some threads very early in Gecko startup?
Attached patch Patch v1 (obsolete) — Splinter Review
Do you think this is early enough?
Attachment #611594 - Flags: review?(benjamin)
Summary: Race condition in nsWindowsDllInterceptor::AddHook → Thread-unsafe nsWindowsDllInterceptor::AddHook is used in AvailableMemoryTracker after other threads have been initialized (?)
Regarding Benjamin's question : yes, we saw this happen in the wild.
I also looked at the proposed fix  and the fix actually doesn't solve the bug in the AddHook function, but instead tries to init the mechanism in an early enough place where it is assumed that there's no chance of other threads calling to the hooked functions.
I see several problems here :
1.	 I think it's a very fragile assumption from your perspective. Maybe one day some firefox programmer will create a thread before this init ? Maybe the AddHook function will be called in some other unrelated part of firefox in the future by someone who is not aware of this issue ?
2.	This fix assumes that only firefox code runs at this point. This easily breaks in the presence of almost any security software (almost all of them inject some code to processes).

I think that maybe you don't want rely on the above assumptions and I also think you can very easily improve the patching code (although, it is far from being bulletproof) :
1.	Actually assigning the tramp value to the output variable before doing the instructions rewrite is trivial.
2.	At least in the 32bit case, where your patch is 5 bytes,  you can use an atomic replace by using the "lock cmpxchg8b" instruction (of course, with the relevant loop surrounding it).
3.	Regardless of whether you do (1-2), you need to call FlushInstructionCache after you perform your instructions replacement. 

It's important to note that the above atomicity is not good enough, because you will have a problem with code that runs inside the function you try to patch while you do the instructions replacement , for example : 
1.	Assume that the first instruction is 2 bytes. First instruction was fetched and is being executed. The IP is incremented by 2 bytes.
2.	You replace the 5 bytes (atomically, but it's irrelevant)
3.	The cpu fetches the second instruction which is garbage (the middle of your 5 bytes jump). 

Because you patch very standard functions in windows, they all contain the usual hotpatch placeholder ("mov edi,edi") and the 5 nop bytes before the functions. You can easily solve all of the patching problems (at least in 32bit) by using this fact.
If you prefer not to use the hotpatch technique, there are more complex techniques for performing a function patch  , which try to avoid any race problems. We can further discuss these solutions in a separate thread in case you wish to.
> 1.	 I think it's a very fragile assumption from your perspective.

Welcome to Gecko.  :)  This is all happening very early in the lifecycle of the application; it's unlikely someone is going to start a thread before this point.

> 2.	This fix assumes that only firefox code runs at this point. This easily breaks in the 
> presence of almost any security software (almost all of them inject some code to processes).

Yes, this is the main problem with the current approach.

> I also looked at the proposed fix  and the fix actually doesn't solve the bug in the AddHook 
> function, but instead tries to init the mechanism in an early enough place where it is assumed that 
> there's no chance of other threads calling to the hooked functions.

Correct.  I didn't want to "fix" AddHook, because we can't actually make it thread-safe on all platforms.  In particular, we can't atomically write the code on 64-bit.  I'd rather not wallpaper over this unless we have no other option.

> Regardless of whether you do (1-2), you need to call FlushInstructionCache after you perform your 
> instructions replacement. 

What's the danger of not calling this?  The icache won't get updated, and our patch won't immediately take effect.  That's OK...  If on the other hand we could get into a situation where one of our writes takes effect without the other one taking effect, that would be quite bad.

> It's important to note that the above atomicity is not good enough [snip]

Sounds like even more of a reason not to try to be atomic here.

> Because you patch very standard functions in windows, they all contain the usual hotpatch 
> placeholder ("mov edi,edi") and the 5 nop bytes before the functions. You can easily solve all of 
> the patching problems (at least in 32bit) by using this fact.

Okay, so I think I understand how this would work.  To be explicit:

  * We write our long jump into the bytes above VirtualAlloc.
  * Then we modify the beginning of VirtualAlloc to contain a short-jump backwards 5 bytes, to the beginning of our long jump.

But doesn't this have the same race as you identified above?  The short-jump backwards is more than one byte long, so if we're overwriting a one-byte instruction, control could flow into the middle of the short-jump, and we're screwed.

It seems like the big problem with the current approach is code which injects itself into Firefox.  I'm not sure what to do about that, short of disabling the low-memory detector (which wouldn't be the end of the world).
> Welcome to Gecko.  :)  This is all happening very early in the lifecycle of
> the application; it's unlikely someone is going to start a thread before
> this point.

Hi Justin. I agree with you regarding the AvailableMemoryTracker and the fact that you init it very early. But my point is (and correct me if I'm wrong as I'm not familiar with these aspects of your code base and code reuse) that the AddHook function is a general function that it's available for anyone to use. If this is the case, then what prevents a programmer (which is not you) in the future from calling AddHook on some function (not related to the AvailableMemoryTracker) in some other part of firefox where other threads already started running ? how will he know of the dangers of using this function when other threads are already running ? 

> But doesn't this have the same race as you identified above?  The short-jump
> backwards is more than one byte long, so if we're overwriting a one-byte
> instruction, control could flow into the middle of the short-jump, and we're
> screwed.

no. the "hotpatch" solution doesn't suffer from this problem. This is the reason microsoft put this "mov edi,edi"  instruction at the beginning of functions and the nop space before it. it's for the purpose of giving them the ability to apply patches safely in runtime.
yes, the instruction is more than 1 byte long, but the important part here is that in the hotpatch solution you doesn't break instruction boundaries and you replace a full instruction with another full instruction of the same size and you do it atomically because it's 2 bytes long. This is why they used a "mov edi,edi" instruction (which is basically a disguised 2 bytes "no operation" instruction) and not 2 consecutive nops. using 2 consecutive nops would have indeed suffer from the same problem you raised. I think this may help to explain this : http://blogs.msdn.com/b/oldnewthing/archive/2011/09/21/10214405.aspx
> If this is the case, then what prevents a programmer (which is not you) in the future from calling 
> AddHook on some function (not related to the AvailableMemoryTracker) in some other part of firefox 
> where other threads already started running ? how will he know of the dangers of using this function 
> when other threads are already running ? 

A big scary comment?  Code review?  I don't pretend that this is ideal -- I'd rather make the function thread-safe, if we can.  But there are lots and lots of places where we rely on comments plus review to prevent footguns.

> http://blogs.msdn.com/b/oldnewthing/archive/2011/09/21/10214405.aspx

Thanks for the link.  My mistake was thinking that the nops were *before* the function, rather than at its beginning.  This could work...

On the other hand, I looked at the telemetry for the AvailableMemoryTracker, and we almost never hit it.  (Or maybe we hit it more often, but then OOM anyway, so we don't get telemetry.)  So I question how useful it is.
I think we should take this change as-is -- it only improves the situation.

I'll look at seeing whether I can make AddHook thread-safe, in a separate bug.  If not, or if we can only make it thread-safe under certain circumstances (e.g., 32-bit only), perhaps we can bail if it's not thread-safe and a second thread is running.
I filed bug 742491 on making AddHook thread-safe.  If you don't mind, let's move the discussion there.
Attachment #611594 - Flags: review?(benjamin) → review+
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #2)
> Marcia, can they be cc'ed on the bug? I'd like to know if they saw this bug
> in the wild, and whether this is something that we should fast-track to fix
> for FF12...

I'm also obviously interested in finding out the magnitude of this issue and whether we need to take a fix for FF12.

Ilan - can you give us an idea of the volume of crashes you're seeing on your side? If you have a specific signature that we can search for in our crash system, it would be very helpful. Thanks!
What crash signature are we searching for. Ilan can you attach a report to the bug?
He says in comment 0:

> The crashes we saw are in your MapViewOfFileHook function.
Try run for 3104cbf1b88f is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=3104cbf1b88f
Results (out of 171 total builds):
    exception: 3
    success: 135
    warnings: 32
    failure: 1
Builds (or logs if builds failed) available at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jlebar@mozilla.com-3104cbf1b88f
Based on the red and purple (??) on try, I guess I moved this call too early.  Which, in retrospect, 

I want to insert the hooks as early as possible, but I'll change things so the hooks do nothing until a later point in time, when we say it's safe.
> Which, in retrospect,

...I should have expected, since we post events to the main thread's event loop and whatnot.
Here's my first attempt at making AddHook thread-safe, at least in the limited win32 case:

  https://tbpl.mozilla.org/?tree=Try&rev=c132e62bb097
Assignee: justin.lebar+bug → nobody
Component: XPCOM → XSLT
QA Contact: xpcom → xslt
Component: XSLT → XPCOM
QA Contact: xslt → xpcom
Assignee: nobody → justin.lebar+bug
Component: XPCOM → XSLT
> Here's my first attempt at making AddHook thread-safe, at least in the limited win32 case:

Sigh, that was a complete bugzilla fail.  This is for bug 742491, of course.
Assignee: justin.lebar+bug → nobody
Component: XSLT → XPConnect
QA Contact: xpcom → xpconnect
Assignee: nobody → justin.lebar+bug
Just because they listed that function, does it mean it's at the top of the stack? It really helps us to get an actual report. 

I searched for crash signatures containing - MapViewOfFileHook. I found this signature...`anonymous namespace''::MapViewOfFileHook(void*, unsigned long, unsigned long, unsigned long, unsigned long). We have 4 of them across all versions of FF in the past 4 weeks and 3 of them are dups. One is in 12 and the 3 dups are in 11.

I also found this signature - WindowsDllInterceptor::CreateTrampoline(void*, int). There are 5 crashes total in the past week, 4 on 11 and 1 on 12.

I scanned some of the modules in these reports to look for something that would confirm they are running Trusteer. It would be something like rapportutil.dll right? I couldn't find it. So maybe these aren't the right signatures? I got the idea that the volume would be higher.
the top of the stack in the crashes we saw is :

036df8ec              0215e20d             0x0
036df90c              00d751bd            xul!`anonymous namespace'::MapViewOfFileHook(void * aFileMappingObject = 0x000004d4, unsigned long aDesiredAccess = 1, unsigned long aFileOffsetHigh = 0, unsigned long aFileOffsetLow = 0, unsigned long aNumBytesToMap = 0)+0x19

I don't have now the exact data about the volume of this crash (I will be able to give you this later on), but I think it was more than the number you mentioned. On one day this week (April 2nd), we received 5 such crash dump reports. all of them with the above stack. all of them were users of firefox 11.
Attached patch Patch v2Splinter Review
This looks much better on try.

Before this patch, we used to check prefs and not install the hooks if all the
low-memory tracking was disabled.

Now we have to install the hooks unconditionally, because we're installing them
before the pref service is initialized.  (I actually didn't check that the pref
service was uninitialized at this point, but even if it happens to work, I
don't want to rely on that.)

That's unfortunate, but I don't see an easy way around it.
Attachment #612635 - Flags: review?(benjamin)
Try run for 30026b3fa880 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=30026b3fa880
Results (out of 50 total builds):
    success: 43
    warnings: 7
Builds (or logs if builds failed) available at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jlebar@mozilla.com-30026b3fa880
(In reply to ilan.fraiman from comment #20)
> the top of the stack in the crashes we saw is :
> 
> 036df8ec              0215e20d             0x0
> 036df90c              00d751bd            xul!`anonymous
> namespace'::MapViewOfFileHook(void * aFileMappingObject = 0x000004d4,
> unsigned long aDesiredAccess = 1, unsigned long aFileOffsetHigh = 0,
> unsigned long aFileOffsetLow = 0, unsigned long aNumBytesToMap = 0)+0x19

This would lead to this signature, quoted by Sheila above:
`anonymous namespace''::MapViewOfFileHook(void*, unsigned long, unsigned long, unsigned long, unsigned long)
Here's a link to reports with that one that we saw in the last 4 weeks: https://crash-stats.mozilla.com/report/list?range_value=4&range_unit=weeks&signature=%60anonymous%20namespace%27%27%3A%3AMapViewOfFileHook%28void%2A%2C%20unsigned%20long%2C%20unsigned%20long%2C%20unsigned%20long%2C%20unsigned%20long%29
There seems to be only 5 of those in that timespan.
Attachment #611594 - Attachment is obsolete: true
we see on our side , on a daily basis,  ~0.05 crashes per 100 users from this type of crash.
Ilan, how are you seeing these crashes? Is your software intercepting/blocking the normal crash-reporting mechanism within Firefox?
Attachment #612635 - Flags: review?(benjamin) → review+
https://hg.mozilla.org/mozilla-central/rev/ed3b761936be
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla14
(In reply to ilan.fraiman from comment #24)
> we see on our side , on a daily basis,  ~0.05 crashes per 100 users from
> this type of crash.

Since the user population and crash rate is low enough that this isn't absolutely critical, we don't expect to uplift a fix prior to FF12's release. I'll let Justin or Benjamin nominate for Aurora 13 approval if they feel this is risk is fairly low risk change.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: