Open Bug 810718 Opened 12 years ago Updated 2 years ago

Enforce pldhash RECURSION_LEVEL checks dynamically: abort on mutating a hash during iteration

Categories

(Core :: XPCOM, enhancement)

enhancement

Tracking

()

People

(Reporter: jruderman, Unassigned)

References

Details

(Keywords: sec-want)

Attachments

(2 files, 2 obsolete files)

A search for "ALL su,desc:RECURSION_LEVEL" gives me 57 bugs.  And quite a few of them have been identified as security bugs -- recent examples include bug 786111 and bug 803093.  (Are they all security bugs??)

> https://bugzilla.mozilla.org/buglist.cgi?bugidtype=include;order=bugs.bug_id;bug_id=803093,803093,802982,802982,786111,786111,784253,784253,779829,779829,779828,779828,779471,779471,764122,764122,755290,755290,733014,733014,731191,731191,714420,714420,705961,705961,700984,700984,697657,697657,696333,696333,681170,681170,669112,669112,668989,668989,660440,660440,651615,651615,647938,647938,647490,647490,646395,646395,637214,637214,634578,634578,631772,631772,629861,629861,627985,627985,624268,624268,615149,615149,614480,614480,613564,613564,605953,605953,605457,605457,590494,590494,580790,580790,528518,528518,525161,525161,517363,517363,501934,501934,478336,478336,475976,475976,472909,472909,469004,469004,463289,463289,456272,456272,449233,449233,439475,439475,437325,437325,421671,421671,415192,415192,398466,398466,398083,398083,378866,378866,368528,368528,334180,334180

How expensive would it be to turn RECURSION_LEVEL failures into aborts in non-debug builds?  I'm thinking that hash tables are already fairly expensive so it wouldn't be terrible.

(In Rust you can enforce "no mutation during iteration" statically most of the time, but I'm guessing that's a lost cause for pldhash in C++.)

My fuzzer is currently ignoring these assertions because of bug 705961.  But even with that fixed I wouldn't want to rely on fuzzing to catch everything.
Try it on nightly, but make it configurable? Make sure that we can find these in crash-stats easily by adding a comment/annotation. I bet things will start showing up and we'll have a better sense of the size of the problem.

There are two ways to hit this assertion: one is to use the hash from multiple threads. The other is to actually recursively modify during iteration on the main thread. I'm sure both will show up.
Configurable how?  I think pldhash is too low-level to rely on the prefs service.

The failure path can't call AppendAppNotesToCrashReport, because that function can't be called off the main thread in content processes (!?).  So I guess the failure path should call a MOZ_NEVER_INLINE function, whose name will appear at the top of crash reports.  (See also bug 788069.)

By the way, these crashes will probably be easier to find in crash-stats than the crashes caused by *not* having dynamic checks!

Since these hash tables can be used from multiple threads, ++RECURSION_LEVEL should become PR_ATOMIC_INCREMENT(RECUSION_LEVEL).
> Configurable how?

If the goal is to turn it on only for Nightly, or only for non-release builds, we can just have a build-time pref...
NS_RUNTIMEABORT will annotate correctly (when it can), so let's just use that.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #4)
> NS_RUNTIMEABORT will annotate correctly (when it can), so let's just use
> that.

Is there some alternative?

We're trying to use pldhash inside newdmd (bug 717853).  For weird reasons, newdmd has to live outside libxul.  But we want a hashtable (and STL's unordered_map is C++11-only), so we're re-compiling pldhash.cpp for newdmd.

If pldhash.cpp calls NS_RUNTIMEABORT, we have to re-compile nsDebugImpl.cpp (for NS_DebugBreak), and that causes us to pull other stuff in...it's a total mess.

In fact, we'd like to remove the NS_RUNTIMEABORT calls from nsTHashtable as well, for the same reason.

I guess MOZ_CRASH doesn't work for you, because it doesn't annotate properly?
Gah. You could of course add another set of ifdefs specifically for the DMD case and use NS_RUNTIMEABORT when available and MOZ_CRASH when not. We'd only be compiling pldhash.cpp *5* times per build then! ;-)

But also, what platforms are we trying to build DMD on which don't have an available unordered_map?
> But also, what platforms are we trying to build DMD on which don't have an available 
> unordered_map?

Huh, I did not expect that to work on bionic, but it seems to.  I'll ask in the bug if we can use that.  There's something to be said for using the thing we're familiar with, though...
FWIW I think we're going to use unordered_map or the JS hashtables, so feel free to add whatever to pldhash.
Depends on: 714453
Update:
- DMD ended up using the JS hash tables, so that's not an issue any more.
- Bug 815467 will cause |recursionLevel| to be present in non-debug builds (at no space cost!).

So this bug should be easy to do now.  bsmedberg, should I change lines like this:

    MOZ_ASSERT(RECURSION_LEVEL_SAFE_TO_FINISH(table));

to this:

    if (!RECURSION_LEVEL_SAFE_TO_FINISH(table)) {
      NS_RUNTIMEABORT("recursion level error in pldhash");
    }

?  (If so, I'll probably wrap that up in a macro for conciseness.)

(And Jesse's point about using PR_ATOMIC_INCREMENT() is a good one.)
Depends on: 815467
Flags: needinfo?(benjamin)
QA Contact: n.nethercote
>     MOZ_ASSERT(RECURSION_LEVEL_SAFE_TO_FINISH(table));
> 
> to this:
> 
>     if (!RECURSION_LEVEL_SAFE_TO_FINISH(table)) {
>       NS_RUNTIMEABORT("recursion level error in pldhash");
>     }

Yes.

> (And Jesse's point about using PR_ATOMIC_INCREMENT() is a good one.)

I'm not sure what that means, actually. Since the hashtables themselves aren't threadsafe, any access needs to be lock-protected, and that should be enough synchronization that we don't also need to use atomics here.
Flags: needinfo?(benjamin)
Assignee: nobody → n.nethercote
Status: NEW → ASSIGNED
Just some minor comment tweaks.
Attachment #825740 - Flags: review?(benjamin)
Attachment #825640 - Attachment is obsolete: true
Attachment #825640 - Flags: review?(benjamin)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #10)
> > (And Jesse's point about using PR_ATOMIC_INCREMENT() is a good one.)
> 
> I'm not sure what that means, actually. Since the hashtables themselves
> aren't threadsafe, any access needs to be lock-protected, and that should be
> enough synchronization that we don't also need to use atomics here.

Are there cases where multiple threads are reading, but none are writing, that might not currently be protected by locks?
I cannot think of a case like that. It would basically have to be an initialized table that is never touched subsequently.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #14)
> I cannot think of a case like that. It would basically have to be an
> initialized table that is never touched subsequently.

Never touched?  What if it's filled in one thread and then read again in multiple threads?  E.g. some kind of atoms table.
As far as I know we don't have any init-only hashtables which are then frozen and read from multiple threads. The XPT and atom tables are all runtime-modifiable.
(In reply to Nicholas Nethercote [:njn] from comment #15)
> (In reply to Benjamin Smedberg  [:bsmedberg] from comment #14)
> > I cannot think of a case like that. It would basically have to be an
> > initialized table that is never touched subsequently.
> 
> Never touched?  What if it's filled in one thread and then read again in
> multiple threads?  E.g. some kind of atoms table.

http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/Telemetry.cpp#959

is one such case.
Comment on attachment 825740 [details] [diff] [review]
Enforce pldhash RECURSION_LEVEL checks in all builds and abort on failure.

Given my hypothetical example and froydnj's actual example, I will update my patch to use atomic arithmetic.
Attachment #825740 - Flags: review?(benjamin)
This is just like the previous patch, but |recursionLevel| is now Atomic<>.

Because Atomic<uint16_t> isn't possible, I had to make it Atomic<uint32_t>,
which increases sizeof(PLDHashTable).  If this patch gets r+, I will write a
follow-up patch that reduces it by shrinking |entrySize| to 16 bits.
Attachment #829004 - Flags: review?(benjamin)
Attachment #825740 - Attachment is obsolete: true
Attachment #829004 - Flags: review?(benjamin) → review+
Attempting to create a pldhash with an entrySize greater than 64KB is
sufficiently unlikely that I decided to just crash instead of returning
failure.
Attachment #832005 - Flags: review?(benjamin)
Attachment #832005 - Flags: review?(benjamin) → review+
Alas...

Regression: Mozilla-Inbound - Number of Constructors - CentOS release 5 (Final) - 3.17% increase
------------------------------------------------------------------------------------------------
    Previous: avg 126.000 stddev 0.000 of 12 runs up to revision 8b676ce79d10
    New     : avg 130.000 stddev 0.000 of 12 runs since revision db7cbf61b001
    Change  : +4.000 (3.17% / z=0.000)
    Graph   : http://mzl.la/17EidKt

I think I changed PLDHashTable from a POD to a non-POD type by adding the Atomic<> member.  And we have several global PDLHashTable variables.  I'm not sure how to fix this.
Backed out:
https://hg.mozilla.org/integration/mozilla-inbound/rev/18d9216b6677


At the time of writing, it has caused the following additional performance regressions:

Regression: Mozilla-Inbound - Number of Constructors - CentOS (x86_64) release 5 (Final) - 3.17% increase
---------------------------------------------------------------------------------------------------------
    Previous: avg 126.000 stddev 0.000 of 12 runs up to revision 8b676ce79d10
    New     : avg 130.000 stddev 0.000 of 12 runs since revision db7cbf61b001
    Change  : +4.000 (3.17% / z=0.000)
    Graph   : http://mzl.la/17Eks0f

Regression: Mozilla-Inbound-Non-PGO - a11y Row Major MozAfterPaint - Ubuntu HW 12.04 x64 - 2.7% increase
--------------------------------------------------------------------------------------------------------
    Previous: avg 287.208 stddev 2.624 of 12 runs up to revision 8b676ce79d10
    New     : avg 294.958 stddev 1.602 of 12 runs since revision db7cbf61b001
    Change  : +7.750 (2.7% / z=2.954)
    Graph   : http://mzl.la/17EmAFe

Regression: Mozilla-Inbound - Tp5 Optimized - MacOSX 10.8 - 7.49% increase
--------------------------------------------------------------------------
    Previous: avg 232.476 stddev 5.161 of 12 runs up to revision 8b676ce79d10
    New     : avg 249.885 stddev 5.640 of 12 runs since revision db7cbf61b001
    Change  : +17.409 (7.49% / z=3.373)
    Graph   : http://mzl.la/18HArFQ

Regression: Mozilla-Inbound - a11y Row Major MozAfterPaint - MacOSX 10.8 - 19% increase
---------------------------------------------------------------------------------------
    Previous: avg 198.375 stddev 13.046 of 12 runs up to revision 8b676ce79d10
    New     : avg 236.125 stddev 6.314 of 12 runs since revision db7cbf61b001
    Change  : +37.750 (19% / z=2.894)
    Graph   : http://mzl.la/18HAxgC

Regression: Mozilla-Inbound - a11y Row Major MozAfterPaint - MacOSX 10.7 - 6.83% increase
-----------------------------------------------------------------------------------------
    Previous: avg 323.917 stddev 2.193 of 12 runs up to revision 8b676ce79d10
    New     : avg 346.042 stddev 1.544 of 12 runs since revision e24e7d00b56d
    Change  : +22.125 (6.83% / z=10.087)
    Graph   : http://mzl.la/18HCA4d
(In reply to Nicholas Nethercote [:njn] from comment #22)
> Alas...
> 
> Regression: Mozilla-Inbound - Number of Constructors - CentOS release 5
> (Final) - 3.17% increase
> -----------------------------------------------------------------------------
> -------------------
>     Previous: avg 126.000 stddev 0.000 of 12 runs up to revision 8b676ce79d10
>     New     : avg 130.000 stddev 0.000 of 12 runs since revision db7cbf61b001
>     Change  : +4.000 (3.17% / z=0.000)
>     Graph   : http://mzl.la/17EidKt
> 
> I think I changed PLDHashTable from a POD to a non-POD type by adding the
> Atomic<> member.  And we have several global PDLHashTable variables.  I'm
> not sure how to fix this.

Atomic<> globals should be fine because of the constructors being constexpr.  It wouldn't suprise me if you could fix this by just adding a constexpr default ctor to pldhash (compiler generated ones aren't constexpr, and for some compilers that actually matters...)
Just for completeness:

Regression: Mozilla-Inbound - Tp5 Optimized - MacOSX 10.6 (rev4) - 2.59% increase
---------------------------------------------------------------------------------
    Previous: avg 311.291 stddev 1.784 of 12 runs up to revision 8b676ce79d10
    New     : avg 319.352 stddev 0.853 of 12 runs since revision 78fd5daacc01
    Change  : +8.062 (2.59% / z=4.518)
    Graph   : http://mzl.la/182te8M

Regression: Mozilla-Inbound - SVG, Opacity Row Major - MacOSX 10.6 (rev4) - 2.47% increase
------------------------------------------------------------------------------------------
    Previous: avg 416.000 stddev 2.836 of 12 runs up to revision 8b676ce79d10
    New     : avg 426.292 stddev 3.071 of 12 runs since revision 78fd5daacc01
    Change  : +10.292 (2.47% / z=3.628)
    Graph   : http://mzl.la/17EJaxE

Regression: Mozilla-Inbound - a11y Row Major MozAfterPaint - MacOSX 10.6 (rev4) - 6.82% increase
------------------------------------------------------------------------------------------------
    Previous: avg 325.083 stddev 1.917 of 12 runs up to revision 8b676ce79d10
    New     : avg 347.250 stddev 2.624 of 12 runs since revision 78fd5daacc01
    Change  : +22.167 (6.82% / z=11.564)
    Graph   : http://mzl.la/17EHbJC
Happily, I got nine "Improvement" emails after backing out, which matches the nine "Regression" emails from the original landing.

I don't see how to proceed further here.  Even if the constructors issue can be worked out (as per comment 24), there are still the other regressions, which presumably(?) came from the extra recursionLevel updates.
Assignee: n.nethercote → nobody
No assignee, updating the status.
Status: ASSIGNED → NEW
No assignee, updating the status.
No assignee, updating the status.
No assignee, updating the status.
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.