Closed Bug 988789 Opened 10 years ago Closed 10 years ago

Assertion failure: memcmp(reinterpret_cast<void*>(instr), cache_page->cachedData(offset), SimInstruction::kInstrSize) == 0, at jit/arm/Simulator-arm.cpp:1058

Categories

(Core :: JavaScript Engine, defect)

ARM
Linux
defect
Not set
major

Tracking

()

VERIFIED FIXED
mozilla32
Tracking Status
firefox29 --- unaffected
firefox30 --- verified
firefox31 --- verified
firefox32 --- fixed
firefox-esr24 --- unaffected
b2g-v1.2 --- unaffected
b2g-v1.3 --- unaffected
b2g-v1.3T --- unaffected
b2g-v1.4 --- fixed
b2g-v2.0 --- fixed

People

(Reporter: decoder, Assigned: dougc)

References

Details

(Keywords: assertion, sec-high, testcase)

Attachments

(5 files, 6 obsolete files)

72.85 KB, patch
dougc
: review+
abillings
: sec-approval+
Details | Diff | Splinter Review
74.53 KB, patch
dougc
: review+
Details | Diff | Splinter Review
1.99 KB, patch
dougc
: review+
Details | Diff | Splinter Review
61.87 KB, patch
dougc
: review+
Details | Diff | Splinter Review
72.09 KB, patch
dougc
: review+
Details | Diff | Splinter Review
The following testcase asserts on mozilla-central revision df75761307b6 (x86 ARM simulator build, run with --fuzzing-safe --arm-sim-icache-checks --ion-eager):


function printBugNumber (num) {}
function enterFunc (funcName) {
   try  {  }  catch(ex)  {  }
}
var lfcode = new Array();
lfcode.push("\
function foo() {\
var BUGNUMBER = 436700;\
  enterFunc ('test');\
  printBugNumber(BUGNUMBER);\
  gczeal(2);\
} foo();\
");
lfcode.push("\
for (var i=0; (function() {\
  'use asm';\
  function _main() {\
    var $1=0, $2=0, $3=0, $4=0, $5=0, $6=0, $7=0, $8=0, $9=0, $10=0, label=0;\
        $8 = $4 + 1 | 0;\
        return $8|0;\
  }\
  return _main;\
})()(); i++) {};\
");
while (true) {
    var file = lfcode.shift(); if (file == undefined) { break; }
    if (file == "evaluate") {} else {
        loadFile(file)
    }
}
function loadFile(lfVarx) {
	unescape();
        new Function(lfVarx)(); 
        if (!isNaN()) {
	  a = unescape();
        }
}
The instruction cache for the cloned asm.js code is not being flushed when there is no heap buffer.  There is a flush in initHeap, but it's not being called, and if the coalescing were to work in future then it would be best to define the entire buffer early.

The coalescing does not work, even if an AutoFlushCache is defined because updateTop depends on MaybeGetIonContext() which is returning false in this path.  This might have a performance impact on large asm.js modules.
Attachment #8401062 - Flags: feedback?(luke)
Hiding the updateTop() in AsmJSModule::initHeap() was perhaps a bad choice.  What if instead we have an explicit AsmJSModule::syncExecutableMemory() that does the updateTop call and we put this at the end of DynamicallyLinkModule(), so it is always called, clone or not.
(In reply to Luke Wagner [:luke] from comment #2)
> Hiding the updateTop() in AsmJSModule::initHeap() was perhaps a bad choice. 
> What if instead we have an explicit AsmJSModule::syncExecutableMemory() that
> does the updateTop call and we put this at the end of
> DynamicallyLinkModule(), so it is always called, clone or not.

Yes, it would be best to flush the entire code buffer when done.  But restoring the code state and then re-patching causes flushing and there seems to have been some logic to defining the entire region early - the ARM backend cache flush update method had been coalescing the flushing, but it now only does so when within the same page.  This coalescing should work for the non-clone path, but I should double check.

So it would be good to flush when done, but it is still useful to suppress earlier redundant flushing in the same region.
Doug, thanks for fixing this. Good to see the simulator's icache flushing checks are useful :)
(In reply to Douglas Crosher [:dougc] from comment #3)
> But restoring the code state and then re-patching causes flushing and there
> seems to have been some logic to defining the entire region early

I don't really understand this sentence.

> This coalescing should work for
> the non-clone path, but I should double check.

IIUC, there will be a whole-module flush from MacroAssembler::finish() and then one more in initHeap() after the patching.  What I was proposing was moving the initHeap() flush to the end of DynamicallyLinkModule().  Since DynamicallyLinkModule must happen before running the code, if we wanted to avoid redundant flushing, we'd want to remove the flush in MacroAssembler:finish().
Douglas, are Luke's comments enough to move forward here?  I'm going to assign it to you because you have a patch in progress.  This sounds bad, so I'm going to mark it sec-high.
Assignee: nobody → dtc-moz
Keywords: sec-high
(In reply to Andrew McCreight [:mccr8] from comment #6)
> Douglas, are Luke's comments enough to move forward here?  I'm going to
> assign it to you because you have a patch in progress.  This sounds bad, so
> I'm going to mark it sec-high.

Yes.  I have started to look into this, and it needs to be fixed otherwise cloned asm.js modules on the ARM would not execute reliably.  I do not concur entirely with Luke's suggestions, but will continue the discussion.  The current patch appears to fix this issue and I believe it could be landed as a temporary fix, however the code paths are convoluted and the current patch does not avoid redundant flushing.  Shall follow up.
This patch moves the AutoFlushCache pointer into thread local storage.  This avoids the need to have a reference to a JSContext.  Could use some feedback on how appropriate this solution is?

There is some cache flushing occurring in Exclusive Contexts, in AsmJS.cpp:CheckModule() and AsmJSModule.cpp:staticallyLink(), and the JSContext is an option here.  These paths currently generate a lot of small cache flush system calls, and with the patch these are now merged reducing the number of system calls.

The bug is fixed by calling updateTop() with the full code object in AsmJSModule::clone().  This now occurs within an AutoFlushCache context which delays the actual cache flush and merges all the small flush requests until the end of LinkAsmJS().

The cache is now flushed at the end of each AutoCacheFlush context.

The logic in AutoFlushCache::update() has been modified to avoid flushing if the new request is within the existing scope.  This is the common use pattern and it works well to merge the flushing.

The logical in AutoFlushCache::update() has also been modified to flush the existing noted region if a page is jumped, rather then keeping the noted region and directly flushing the new area.  Common usage flushes in an advancing sequence so it is better to try and pick up merging from the head.

The IONFLAGS=cacheflush output appears to indicate that the AutoFlushCache merging is working well with the patch.
Attachment #8401062 - Attachment is obsolete: true
Attachment #8401062 - Flags: feedback?(luke)
Attachment #8406840 - Flags: feedback?(luke)
Comment on attachment 8406840 [details] [diff] [review]
Flush the instruction cache when cloning a module.

Review of attachment 8406840 [details] [diff] [review]:
-----------------------------------------------------------------

Nice general cleanup; way nicer this way!  Since you are touching pretty much all of the uses of AutoFlushCache anyway, what do you think about renaming it to AutoFlushICache?  I recall finding "Cache" ambiguous and confusing at first.

One question below and a few nits:

::: js/src/jit/AsmJSLink.cpp
@@ +652,5 @@
>  // Implements the semantics of an asm.js module function that has been successfully validated.
>  static bool
>  LinkAsmJS(JSContext *cx, unsigned argc, JS::Value *vp)
>  {
> +    AutoFlushCache afc("LinkAsmJS");

Sorry if you already tried to explain this but, it seems to me that, since we have to link any asm.js module before executing its code, we could simply have, right before the final return of LinkAsmJS:

AutoFlushCache afc("LinkAsmJS");
AutoFlushCache::updateTop(code_, codeBytes_);

The only concern is if there are random updates that happen during compilation or linking that, because there is no live AutoFlushCache, each trigger immediate flushes.  For that, perhaps we could use AutoFlushInhibitor?

::: js/src/jit/Ion.cpp
@@ +2930,4 @@
>      used_(false)
>  {
> +    PerThreadData *pt = TlsPerThreadData.get();
> +    AutoFlushCache *afc = pt->PerThreadData::autoFlushCache();

Are these explicit PerThreadData:: qualifiers needed?  If they aren't required, I'd remove them.

::: js/src/jit/IonCaches.cpp
@@ +1693,5 @@
>  
>      GetPropertyIC &cache = ion->getCache(cacheIndex).toGetProperty();
>      RootedPropertyName name(cx, cache.name());
>  
> +    AutoFlushCache afc ("GetPropertyCache");

Pre-existing: can you remove the space after 'afc'?

@@ +2748,5 @@
>  bool
>  SetPropertyIC::update(JSContext *cx, size_t cacheIndex, HandleObject obj,
>                        HandleValue value)
>  {
> +    AutoFlushCache afc ("SetPropertyCache");

same
Comment on attachment 8406840 [details] [diff] [review]
Flush the instruction cache when cloning a module.

Clearing while waiting for answer.
Attachment #8406840 - Flags: feedback?(luke)
I'm still hitting this, what's the status here? :)
Flags: needinfo?(dtc-moz)
(In reply to Christian Holler (:decoder) from comment #11)
> I'm still hitting this, what's the status here? :)

It's my top priority.
Flags: needinfo?(dtc-moz)
This is a rewrite of the instruction cache flushing, and a significant simplification.  It avoids redundant flushing for asm.js code, avoiding flushing until the asm.js code is dynamically linked, and it fixes flushing of the asm.js code when there is no buffer heap.

The common use case is merging cache flushes when preparing a code object.  In this case the entire range of the code object is being flushed and as the code is patched smaller redundant flushes could occur.  The design allows an AutoFlushICache dynamic thread local context to be declared in which the range of the code object can be set which defers flush until the end of this dynamic context.  The redundant flushing within this code range is also deferred avoiding redundant flushing.  Flushing outside this code range is not affected and proceeds immediately.

In some cases flushing is not necessary, such as when compiling an asm.js module which is flushed again when dynamic linking, and also in error paths that abandon the code.  Flushing within the set code range can be inhibited within the AutoFlushICache dynamic context by setting an inhibit flag.

The rewrite does not attempt to merge cache flushing across code objects and this significantly reduces the complexity and I believe makes the code much more maintainable.  It also avoids a reported issue in the Linux kernel in which only the first vma in a range is actually flushed - assuming a code object uses a single vma then this should not be a problem.

The compiler can be re-entered while within an AutoFlushICache dynamic context, and previously these paths needed to be identified and protected with an AutoFlushInhibitor dynamic context - to run the pending flushing before the code was executed.   The rewrite avoids this complexity by limiting cache flush merging to a single defined code range and is safe based on the assumption that code being assembled or patched is not executed before the exit of the respective AutoFlushICache dynamic context.

Although the simplification reduces some merging opportunities, these were in practice the exception.  

The merging can be explored by setting the IONFLAGS=cacheflush environment variable. The output '=' character represents a merged flush, and the '-' character an inhibited flush.  When running asm.js code you will see a large block of inhibited flushing when the code is compiled, and a large block of merged flushing when the code is dynamically linked.   A flush outside an AutoFlushICache context is represented by the '#' character, and an flush outside the set code range is represented by the '*' character and these do not appear in high frequency.
Attachment #8406840 - Attachment is obsolete: true
Attachment #8413484 - Flags: review?(luke)
Comment on attachment 8413484 [details] [diff] [review]
Rework the instruct cache flushing.

Should check on the MIPS backend.
Attachment #8413484 - Flags: review?(luke)
(In reply to Douglas Crosher [:dougc] from comment #14)
> Should check on the MIPS backend.

Don't worry about MIPS, it's not a platform we officially support so we don't have to spend any time on fixing it and it shouldn't hold up important work for other archs.

Nice work refactoring this btw!
* Took another pass over the mips backend.

* Disabled the flushing code except for ARM and MIPS code.
Attachment #8413484 - Attachment is obsolete: true
Attachment #8413679 - Flags: review?(luke)
Group: javascript-core-security
Comment on attachment 8413679 [details] [diff] [review]
Rework the instruction cache flushing.

Review of attachment 8413679 [details] [diff] [review]:
-----------------------------------------------------------------

Great!  Thanks for all the consideration.

::: js/src/jit/AsmJS.cpp
@@ +7023,5 @@
>          return m.fail(nullptr, "top-level export (return) must be the last statement");
>  
>      ScopedJSDeletePtr<AsmJSModule> module;
> +    // The instruction cache is flushed when dynamically linking, so can inhibit now.
> +    AutoFlushICache afc("CheckModule", /* inhibit= */ true);

Can you move the afc decl and the comment to before the module decl and put \n before and after?

::: js/src/jit/AsmJSLink.cpp
@@ +357,5 @@
>  }
>  
> +// Note this function can re-enter JS and is called in an AutoFlushICache context,
> +// but merges only occur within the code object being linked so flushing of other
> +// code objects will occur immediately and are not delayed by the current context.

While this is true, it doesn't seem to warrant a comment because, iiuc, this is just the normal way AutoFlushICache works.

@@ +764,5 @@
>      // When a module is linked, it is dynamically specialized to the given
>      // arguments (buffer, ffis). Thus, if the module is linked again (it is just
>      // a function so it can be called multiple times), we need to clone a new
>      // module.
> +    AutoFlushICache afc("LinkAsmJS");

Can you move this AutoFlushICache up above the comment, with \n before and after and add some comment to the effect of:

All ICache flushing of the module being linked has been inhibited under the assumption that the module is flushed after dynamic linking (when the last code mutation occurs).  Thus, enter an AutoFlushICache context for the entire module now.  The module range is set below.

@@ +771,5 @@
> +            return false;
> +    } else {
> +        // The entire code object is flushed when dynamically linking.
> +        // Flushing would have been avoid up to this point.
> +        moduleObj->module().setAutoFlushICacheRange();

With the above comment, I would change this comment to:

CloneModule already calls setAutoFlushICacheRange internally before patching the cloned module, so avoid calling twice.

Alternatively, we could avoid exposing this detail by changing the assertion in setRange to JS_ASSERT(!start_ || (start_ == start && stop_ == stop).  That way the setAutoFlushICacheRange() can be taken out of the branch (moved below the module = moduleObj->module() below), without needing any comment.  Up to you.

@@ +782,5 @@
>      // the given heap (ArrayBuffer) and a new global data segment (the closure
>      // state shared by the inner asm.js functions).
>      if (!DynamicallyLinkModule(cx, args, module)) {
> +        // Abandon any pending instruction cache flushing.
> +        AutoFlushICache::setInhibit();

Is this necessary for correctness?  I don't really see how and, assuming it's not, then it doesn't really seem valuable to optimize this path by inhibiting flushing.

::: js/src/jit/AsmJSModule.cpp
@@ +932,5 @@
>      memcpy(out.code_, code_, pod.codeBytes_);
>  
> +    // We already know the exact extent of areas that need to be patched, just make sure we
> +    // flush all of them at once.
> +    AutoFlushICache::setRange(uintptr_t(out.code_), pod.codeBytes_);

Can this instead be a call to setAutoFlushICacheRange() (with the comment moved to setAutoFlushICacheRange)?  Can you also move the setAutoFlushICacheRange() to right before restoreToInitialState() which, iiuc, is the only reason this needs to be inside AsmJSModule::clone.

@@ +1364,5 @@
>      if (!module)
>          return false;
> +    // No need to flush the instruction cache now, it will be flushed
> +    // when dynamically linking.
> +    AutoFlushICache afc("LookupAsmJSModuleInCache", /* inhibit= */ true);

It seems like you could put the afc decl after the module->deserialize() call (nothing is inhibited during deserialization), followed immediately by a call to module->setAutoFlushICacheRange().  This would be a slight readability win since it would link the two together.  (Could you also put a \n before and after this new block of code?)

::: js/src/jit/Ion.cpp
@@ +2924,5 @@
> +// Flush the instruction cache.
> +//
> +// If called within a dynamic AutoFlushICache context and if the range is already pending
> +// flushing for this AutoFlushICache context then the request is ignored with the
> +// understanding that it will be flushed an the exit from the AutoFlushICache context.

s/an the/on/

@@ +2932,5 @@
> +// immediately rather than attempting to merge them.
> +//
> +// For efficiency it is expected that all large ranges will be flushed within an
> +// AutoFlushICache, so check.  If this assertion is hit then it does not necessarily
> +// indicated a progam fault but it might indicate a lost opportunity to merge cache

s/indicated/indicate/
Attachment #8413679 - Flags: review?(luke) → review+
Thank you for the quick review and the feedback.  Most of the suggestions have been applied.  This patch addresses the feedback and rebases the patch, and there is a small fix for the MIPS backend which is reported to be working with this patch.

I have seen some intermittent jit-test failures on an ARMv6 device that might be cache issues. I'll re-run some tests just to be sure.

Carrying forward r+.
Attachment #8413679 - Attachment is obsolete: true
Attachment #8415920 - Flags: review+
Rebase.  Carrying forward r+.

Sorry for the delay. I was running some background testing across a range of ARM hardware.

Only issues were on one ARMv6 device which was giving a few intermittent jit-test failures. Tried many cache flushing variations, e.g. page-at-a-time, and it did not solve the failures. Retested on a new RPi-B which is also ARMv6 and it had no jit-test failures, so assuming the failures were due to un-patched errata etc.

Try run: https://tbpl.mozilla.org/?tree=Try&rev=67826247576b
Attachment #8415920 - Attachment is obsolete: true
Attachment #8418443 - Flags: review+
Getting some crashes on ARM builds in the try runs.  Here's another try run:
https://tbpl.mozilla.org/?tree=Try&rev=d0b80c3bbff8

At least some crashes appear to be due to Bug 1001236 and there is a patch pending, so I'll re-try once this has landed.
(In reply to Douglas Crosher [:dougc] from comment #20)
> Getting some crashes on ARM builds in the try runs.  Here's another try run:
> https://tbpl.mozilla.org/?tree=Try&rev=d0b80c3bbff8
> 
> At least some crashes appear to be due to Bug 1001236 and there is a patch
> pending, so I'll re-try once this has landed.

Sorry, it's bug 924622 not 1001236.
Rebase. Carry forward r+.
Attachment #8418443 - Attachment is obsolete: true
Attachment #8422894 - Flags: review+
Would have been best to land this when there was solid green on the try builds but bug 924622 has not advanced. The crashes that bug 924622 addresses appear to be distinct so I suggest landing the cache changes now.
Keywords: checkin-needed
sec-high affecting multiple branches = needs sec-approval :)
Keywords: checkin-needed
Sorry, forgot to request sec-approval. Looking into it. It affects beta and aurora and is a big patch that will need separate patches and testing on each branch.
Blocks: 973725
Comment on attachment 8422894 [details] [diff] [review]
Rework the instruction cache flushing.

[Security approval request comment]
How easily could an exploit be constructed based on the patch?

Guessing it would be hard to exploit.


Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?

No, the fix is rather hidden because the patch reworks the large code.


Which older supported branches are affected by this flaw?

Beta and Aurora appear to be affected.


If not all supported branches, which bug introduced the flaw?

It appears to have been introduced in bug 973725.


Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?

Backports have not yet been written.  The patch is large and needs to be right or the code will fail randomly. A lot has changed and separate patches will be needed for each branch and each will need some testing. Not good.


How likely is this patch to cause regressions; how much testing does it need?

If it's not perfect then it will cause random crashes. The ARM simulator can detect some cache flush issues, and detected this one, so this will help. This is not a patch I would have liked to uplift.
Attachment #8422894 - Flags: sec-approval?
Comment on attachment 8422894 [details] [diff] [review]
Rework the instruction cache flushing.

sec-approval+ for trunk.

If we need to get this on Aurora and Beta, we should get it in soon after it goes into trunk. You should talk to gkw about Arm Simulator testing.
Attachment #8422894 - Flags: sec-approval? → sec-approval+
(In reply to Al Billings [:abillings] from comment #27)
> If we need to get this on Aurora and Beta, we should get it in soon after it
> goes into trunk. You should talk to gkw about Arm Simulator testing.

Once it goes into trunk, decoder and I cover ARM simulator via x86 builds, and I also have my native odroid ARM board running to test it, so I'd figure we have enough coverage as soon as it lands.
Keywords: checkin-needed
Sorry the prior patch its failing on inbound.

This patch updates the recently added irregexp code to work with the new instruction cache flushing.  Carrying forward r+.

Also noticed that irregexp was missing perfspewer support which was also added, a very minor change.

https://tbpl.mozilla.org/?tree=Try&rev=dbf92f9cba69
Attachment #8424323 - Flags: review+
Carrying forward r+.
Attachment #8424325 - Flags: review+
Removing checking-needed because the pending patches have landed on inbound. Beta and Aurora patches are still being tested and will need separate patches.
Keywords: checkin-needed
Here's the patch rebased to the beta branch.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 

It appears to have been introduced in bug 973725.


User impact if declined: 

Cloned asm.js modules will likely crash on ARM hardware.


Testing completed: 

Passes the jit-tests locally with cache checking enabled. Two try runs, some errors but it is not clear that this patch caused them:
https://tbpl.mozilla.org/?tree=Try&rev=2c9d836408a9
https://tbpl.mozilla.org/?tree=Try&rev=11db9668dce5


Risk to taking this patch (and alternatives if risky): 

There is a moderate risk, but it has received some testing.


String or UUID changes made by this patch: n/a
Attachment #8424561 - Flags: review+
Attachment #8424561 - Flags: approval-mozilla-beta?
Here's the patch rebased to the Aurora branch.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 

It appears to have been introduced in bug 973725.


User impact if declined: 

Cloned asm.js modules will likely crash on ARM hardware.


Testing completed: 

Passes the jit-tests locally with cache checking enabled. Two try runs, some errors but it is not clear that this patch caused them:
https://tbpl.mozilla.org/?tree=Try&rev=2a61d14fc5d6
https://tbpl.mozilla.org/?tree=Try&rev=d5732a37550c

Risk to taking this patch (and alternatives if risky): 

There is a moderate risk, but it has received some testing.


String or UUID changes made by this patch: n/a
Attachment #8424564 - Flags: review+
Attachment #8424564 - Flags: approval-mozilla-aurora?
Attachment #8424564 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8424561 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Group: javascript-core-security
Hi decoder, can you verify in Fx30? Thanks.
Flags: needinfo?(choller)
Status: RESOLVED → VERIFIED
JSBugMon: This bug has been automatically verified fixed.
Verified fixed also manually on Fx30.
Flags: needinfo?(choller)
JSBugMon: This bug has been automatically verified fixed on Fx31
Depends on: 1013056
Group: core-security
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: