Closed Bug 955888 Opened 6 years ago Closed 6 years ago

Add a "chaos mode" to Gecko

Categories

(Core :: XPCOM, enhancement)

enhancement
Not set

Tracking

()

RESOLVED FIXED
mozilla30

People

(Reporter: roc, Assigned: roc)

Details

(Keywords: sec-want)

Attachments

(8 files, 1 obsolete file)

1.99 KB, patch
Waldo
: review+
Details | Diff | Splinter Review
1.74 KB, patch
benjamin
: review+
Details | Diff | Splinter Review
4.18 KB, patch
benjamin
: review+
Details | Diff | Splinter Review
2.46 KB, patch
mcmanus
: review+
Details | Diff | Splinter Review
2.33 KB, patch
mcmanus
: review+
Details | Diff | Splinter Review
1.85 KB, patch
mcmanus
: review+
Details | Diff | Splinter Review
2.69 KB, patch
jorendorff
: review+
Details | Diff | Splinter Review
7.54 KB, patch
bzbarsky
: review+
Details | Diff | Splinter Review
I'm experimenting with adding a global mode switch to Gecko that enables injection of aggressive nondeterminism. In various places we have code that makes decisions deterministically, where it would also be correct (but not necessarily performant) to choose nondeterministically among a range of options; by choosing among those options using a PRNG we can force our tests to explore more of the state space.

I have a patch stack that does the following:
-- set random priorities on threads
-- sched_yield just before dispatching an event
-- randomize priority of sockets in socket poll list (nsSocketTransportService::AddToPollList)
-- randomize insertion in nsHttpConnectionMgr's InsertTransactionSorted
-- randomize amount read when reading from socket in nsHttpConnection
-- randomize test order in reftests
-- randomize test order in mochitests
-- randomize PLDHash enumeration start position
-- add random variation to timers

These patches together trigger a decent number of test failures:
https://tbpl.mozilla.org/?tree=Try&rev=d50dfe799e1f
Some of these seem to be known intermittent oranges, but many are not. I need to do some more work to increase confidence that these patches are correct and are only exposing, not causing, incorrect behavior. It would also be interesting to characterize the effects of each patch.

The downside of chaos mode is that the resulting failures are difficult to reproduce and debug. So the broader vision is to use record-and-replay debugging tools to debug failures induced by chaos mode. Hopefully chaos mode will also benefit record and replay tools, by making it easier to explore interesting application state space in the perturbed environment of those tools. (Alternatively we could try to instrument ChaosMode to return a reproducible stream of random numbers for debugging purposes, but I suspect that approach won't work well due to other sources of nondeterminism messing with the ordering of PRNG calls.)
That sounds vaguely similar to some OOM testing decoder has done with the JS engine, except the choice point is (obviously) allocation.
Yes, it's related to fault injection. However, I'm trying to restrict my changes to those that shouldn't affect observable behavior. OOM faults, for example, maybe shouldn't cause a browser crash but necessarily cause tests to fail.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #0)
> -- randomize test order in reftests

FWIW, the run-reftests-in-parallel patches also seem to trigger unusual reftest failures on several platforms; it wouldn't be surprising if tests somehow depend on test ordering.
(In reply to comment #3)
> (In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #0)
> > -- randomize test order in reftests
> 
> FWIW, the run-reftests-in-parallel patches also seem to trigger unusual reftest
> failures on several platforms; it wouldn't be surprising if tests somehow
> depend on test ordering.

Yes, and even more so with mochitests.
Can we add randomness into GC and CC?

Add randomness into cache lookups?
(I'm primarily thinking of various layout-related caches/cached values,
like the "word cache", "text run cache", "frame offset cache",
"cached newline offset", "nsXULPrototypeCache", "font cache",
"font group cache", "gradient cache", "gfx.canvas.skiagl.dynamic-cache",
"harfbuzz script cache", "CmapCache", "gfx-d2d-surface-cache",
"userfontcache", "SkGlyphCache", "SkTypefaceCache", "SkImageCache",
"nsFontCache in nsDeviceContext.cpp", and various Cairo caches...)
I feel a bit uncomfortable with having this bug public -- I think we should
at least file the patches in a hidden bug and let our fuzz team test them
for a while before we make them public.  We have to assume our adversaries
are watching what we do in bugs like this...  and the Try server too...
Keywords: sec-want
Marked core security based on comment 6. Let's try to see if we find some critical stuff with this and then open the bug again once we are confident enough that nothing will blow up immediately :)
Group: core-security
Comment 6 is private: false
There are two major sources of non-determinacy for the GC and CC: scheduling, and the order in which it does things (traverses, marks, unlinks, etc.).  I think the former should be covered fairly well by fuzzing, as there are chrome-only functions to explicitly run them.  One example that I haven't done anything about is bug 715684.
(In reply to Mats Palmgren (:mats) from comment #5)
> Add randomness into cache lookups?
> (I'm primarily thinking of various layout-related caches/cached values,
> like the "word cache", "text run cache", "frame offset cache",
> "cached newline offset", "nsXULPrototypeCache", "font cache",
> "font group cache", "gradient cache", "gfx.canvas.skiagl.dynamic-cache",
> "harfbuzz script cache", "CmapCache", "gfx-d2d-surface-cache",
> "userfontcache", "SkGlyphCache", "SkTypefaceCache", "SkImageCache",
> "nsFontCache in nsDeviceContext.cpp", and various Cairo caches...)

What sort of randomness? Like randomly failing cache lookups? We could try that I suppose.
Yeah, either pretend the thing we're looking for isn't cached and take the
non-cached path, or just clear the whole cache occasionally, whichever is
more convenient.  The former is probably better for non-determinism.

I noticed that some of these caches are cleared on memory pressure,
so generating fake memory pressure events might be a tool to consider
as well.
> -- randomize amount read when reading from socket in nsHttpConnection

Do you expect this to "dominate" the nondeterminism in html parser chunking and interruptible reflow?  Or maybe you would want to instrument those separately?

These also come to mind

 -- random sched_yield or nanosleep before Lock() / Signal() calls (simulate contention)
 -- random delays in IPC message processing (when cross-thread/process becomes interesting)
 -- randomly forcing sync reflow
A lot of these could be fuzzer knobs, similar to the js engine's "TestingFunctions".  By giving control to the DOM fuzzer rather than an RNG, we can make reduced testcases for many bugs without even needing a record-and-replay setup.

> -- set random priorities on threads

For fuzzing I would prefer a testing function to change the priority of a named thread.  Most of our threads have names (e.g. "JS GC Helper") which show up in crash reports.  Hopefully we can look up threads by name.  Another possibility would be to have the fuzzer harness set environment variables like PRIORITY_JS_GC_HELPER that are automatically used when named threads are created.

When there are plenty of cores available, I'm not sure priorities have much effect.  It might be more powerful to pause or starve a thread for a second.

> -- sched_yield just before dispatching an event

This could be a boolean setting, or a "yield 3 events from now", perhaps.

> -- add random variation to timers

Is it valid to change the order in which timers fire?  For example, if I run {
  setTimeout(f, 50);
  setTimeout(g, 50);
}, are both orderings normally possible and valid?
Delaying kind of stuff could be useful.  One external reporter filed a bunch of fuzz bugs last year that had repro instructions like "Open the test case in 10 copies of Firefox at once", which I assume really just cause things to load slower and thus you'd get slightly different behavior.
(In reply to Chris Jones [:cjones] mostly inactive; ni?/f?/r? if you need me from comment #11)
> > -- randomize amount read when reading from socket in nsHttpConnection
> 
> Do you expect this to "dominate" the nondeterminism in html parser chunking
> and interruptible reflow?  Or maybe you would want to instrument those
> separately?

I hope randomizing socket read amounts would exercise both of those.

> These also come to mind
> 
>  -- random sched_yield or nanosleep before Lock() / Signal() calls (simulate
> contention)

Yes although that could get a lot more expensive.

>  -- random delays in IPC message processing (when cross-thread/process
> becomes interesting)
>  -- randomly forcing sync reflow

Good ideas :-)

(In reply to Jesse Ruderman from comment #12)
> A lot of these could be fuzzer knobs, similar to the js engine's
> "TestingFunctions".  By giving control to the DOM fuzzer rather than an RNG,
> we can make reduced testcases for many bugs without even needing a
> record-and-replay setup.

That's an interesting idea but I'm not sure how to expose API for all these knobs.

> > -- set random priorities on threads
> 
> For fuzzing I would prefer a testing function to change the priority of a
> named thread.  Most of our threads have names (e.g. "JS GC Helper") which
> show up in crash reports.  Hopefully we can look up threads by name. 
> Another possibility would be to have the fuzzer harness set environment
> variables like PRIORITY_JS_GC_HELPER that are automatically used when named
> threads are created.
> 
> When there are plenty of cores available, I'm not sure priorities have much
> effect.  It might be more powerful to pause or starve a thread for a second.

That's a good point. One way to address it would be to also set random thread affinities (e.g. pin each thread to one of N particular cores). http://dxr.mozilla.org/mozilla-central/source/nsprpub/pr/include/private/pprthred.h#88

> > -- add random variation to timers
> 
> Is it valid to change the order in which timers fire?  For example, if I run
> {
>   setTimeout(f, 50);
>   setTimeout(g, 50);
> }, are both orderings normally possible and valid?

No, but my patches don't do that. All timers are multiplexed onto a single thread which sleeps until the next timer needs to fire, and I mess with the timing of that sleep.
 -- discard decoded image data more aggressively
I'm finding these patches quite useful for reproducing intermittent bugs. I'd like to land them so everyone can benefit.

Finding a bug with chaos mode enabled does not easily lead to an exploit against a normal build. You'd actually have to figure out what the underlying bug is and modify your testcase significantly to trigger it in a normal build. Furthermore, since there's no replay mechanism built into these patches, it's quite hard to figure out what's going on because there's so much variation from run to run. So I don't think we should worry too much about opening this up.
I don't currently have a way to enable ChaosMode dynamically. Hooking up the initial value in a thread-safe way seems rather complex. For now, this has to be enabled by patching ChaosMode.h.
Attachment #8386030 - Flags: review?(jwalden+bmo)
Comment on attachment 8386043 [details] [diff] [review]
Part 8: In chaos mode, vary timer duration (while respecting order of firing)

>+            microseconds *= ChaosMode::randomUint32LessThan(ArrayLength(sFractions));

The RHS should be sFractions[ChaosMode::randomUint32LessThan(ArrayLength(sFractions))], I would think.

r=me with that fixed.
Attachment #8386043 - Flags: review?(bzbarsky) → review+
Comment on attachment 8386043 [details] [diff] [review]
Part 8: In chaos mode, vary timer duration (while respecting order of firing)

Though actually, wait.  I don't think this patch does what you want.  In particular, it will change when the timer thread wakes up, but not when timers should run.  The latter only happens if TimeStamp::Now() >= mTimers[0]->mTimeout, no?
Attachment #8386043 - Flags: review+ → review-
Attachment #8386033 - Flags: review?(benjamin) → review+
Attachment #8386035 - Flags: review?(benjamin) → review+
Comment on attachment 8386040 [details] [diff] [review]
Part 7: In chaos mode, start hashtable iteration at a random entry

Review of attachment 8386040 [details] [diff] [review]:
-----------------------------------------------------------------

What an intriguing stack of patches.
Attachment #8386040 - Flags: review?(jorendorff) → review+
Attachment #8386039 - Flags: review?(jduell.mcbugs) → review?(mcmanus)
Attachment #8386038 - Flags: review?(jduell.mcbugs) → review?(mcmanus)
Attachment #8386037 - Flags: review?(jduell.mcbugs) → review?(mcmanus)
Comment on attachment 8386039 [details] [diff] [review]
Part 6: In chaos mode, when reading from an HTTP connection, sometimes read less than the full amount of available data

Review of attachment 8386039 [details] [diff] [review]:
-----------------------------------------------------------------

this is good noodle warping stuff.

imho its also exactly the kind of thing we should be working in the open.
Attachment #8386039 - Flags: review?(mcmanus) → review+
Attachment #8386038 - Flags: review?(mcmanus) → review+
Comment on attachment 8386037 [details] [diff] [review]
Part 4: In chaos mode, put new sockets at a random position in the poll list

Review of attachment 8386037 [details] [diff] [review]:
-----------------------------------------------------------------

I think there is probably some (mostly theoretical) livelock that this can cause - essentially because some sockets are no longer guaranteed to be read in the presence of contention. The chance of actually seeing that happen is very small and given the context of chaos mode I think its fine to proceed.. just documenting it here.
Attachment #8386037 - Flags: review?(mcmanus) → review+
Severity: normal → enhancement
OS: Linux → All
Hardware: x86_64 → All
Summary: Investigate adding a "chaos mode" to Gecko → Add a "chaos mode" to Gecko
Group: core-security
Comment on attachment 8386030 [details] [diff] [review]
Part 1: Add ChaosMode.h base functionality

Review of attachment 8386030 [details] [diff] [review]:
-----------------------------------------------------------------

I've thought of doing something like this myself at some point.  Always felt like I was being pulled into too much feature-work or assigned-work to actually take the time to write anything, tho.  :-(  Glad somebody's doing it.

::: mfbt/ChaosMode.h
@@ +6,5 @@
> +
> +#ifndef mozilla_ChaosMode_h
> +#define mozilla_ChaosMode_h
> +
> +#include "mozilla/Types.h"

Just use <stdint.h> directly, please.

@@ +18,5 @@
> + * choices is encouraged to make random and extreme choices, to test more
> + * code paths and uncover bugs.
> + */
> +class ChaosMode
> +{

Indent the whole class body two more spaces, please.

::: mfbt/moz.build
@@ +19,5 @@
>      'BloomFilter.h',
>      'Casting.h',
>      'Char16.h',
>      'CheckedInt.h',
> +    'ChaosMode.h',

This probably doesn't build because it's not alphabetical.
Attachment #8386030 - Flags: review?(jwalden+bmo) → review+
Attached patch part 8 v2Splinter Review
Attachment #8386043 - Attachment is obsolete: true
Attachment #8387125 - Flags: review?(bzbarsky)
I'm going to address comments and land this ASAP if no-one strongly objects.
Comment on attachment 8387125 [details] [diff] [review]
part 8 v2

r=me
Attachment #8387125 - Flags: review?(bzbarsky) → review+
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #16)
> Finding a bug with chaos mode enabled does not easily lead to an exploit
> against a normal build. You'd actually have to figure out what the
> underlying bug is and modify your testcase significantly to trigger it in a
> normal build. Furthermore, since there's no replay mechanism built into
> these patches, it's quite hard to figure out what's going on because there's
> so much variation from run to run. So I don't think we should worry too much
> about opening this up.

I'm not sure if I understand how these patches help on _reproducing_ an intermittent failure without a way to replay the chaos mode...
A solution to that problem is imminent :-).
Comment on attachment 8386030 [details] [diff] [review]
Part 1: Add ChaosMode.h base functionality

Review of attachment 8386030 [details] [diff] [review]:
-----------------------------------------------------------------

::: mfbt/ChaosMode.h
@@ +22,5 @@
> +{
> +public:
> +  static bool isActive()
> +  {
> +    // XXX flip this to true to active chaos mode

s/active/activate/
You need to log in before you can comment on or make changes to this bug.