Closed
Bug 955888
Opened 11 years ago
Closed 11 years ago
Add a "chaos mode" to Gecko
Categories
(Core :: XPCOM, enhancement)
Core
XPCOM
Tracking
()
RESOLVED
FIXED
mozilla30
People
(Reporter: roc, Assigned: roc)
References
(Blocks 1 open bug)
Details
(Keywords: sec-want)
Attachments
(8 files, 1 obsolete file)
1.99 KB,
patch
|
Waldo
:
review+
|
Details | Diff | Splinter Review |
1.74 KB,
patch
|
benjamin
:
review+
|
Details | Diff | Splinter Review |
4.18 KB,
patch
|
benjamin
:
review+
|
Details | Diff | Splinter Review |
2.46 KB,
patch
|
mcmanus
:
review+
|
Details | Diff | Splinter Review |
2.33 KB,
patch
|
mcmanus
:
review+
|
Details | Diff | Splinter Review |
1.85 KB,
patch
|
mcmanus
:
review+
|
Details | Diff | Splinter Review |
2.69 KB,
patch
|
jorendorff
:
review+
|
Details | Diff | Splinter Review |
7.54 KB,
patch
|
bzbarsky
:
review+
|
Details | Diff | Splinter Review |
I'm experimenting with adding a global mode switch to Gecko that enables injection of aggressive nondeterminism. In various places we have code that makes decisions deterministically, where it would also be correct (but not necessarily performant) to choose nondeterministically among a range of options; by choosing among those options using a PRNG we can force our tests to explore more of the state space.
I have a patch stack that does the following:
-- set random priorities on threads
-- sched_yield just before dispatching an event
-- randomize priority of sockets in socket poll list (nsSocketTransportService::AddToPollList)
-- randomize insertion in nsHttpConnectionMgr's InsertTransactionSorted
-- randomize amount read when reading from socket in nsHttpConnection
-- randomize test order in reftests
-- randomize test order in mochitests
-- randomize PLDHash enumeration start position
-- add random variation to timers
These patches together trigger a decent number of test failures:
https://tbpl.mozilla.org/?tree=Try&rev=d50dfe799e1f
Some of these seem to be known intermittent oranges, but many are not. I need to do some more work to increase confidence that these patches are correct and are only exposing, not causing, incorrect behavior. It would also be interesting to characterize the effects of each patch.
The downside of chaos mode is that the resulting failures are difficult to reproduce and debug. So the broader vision is to use record-and-replay debugging tools to debug failures induced by chaos mode. Hopefully chaos mode will also benefit record and replay tools, by making it easier to explore interesting application state space in the perturbed environment of those tools. (Alternatively we could try to instrument ChaosMode to return a reproducible stream of random numbers for debugging purposes, but I suspect that approach won't work well due to other sources of nondeterminism messing with the ordering of PRNG calls.)
Comment 1•11 years ago
|
||
That sounds vaguely similar to some OOM testing decoder has done with the JS engine, except the choice point is (obviously) allocation.
Assignee | ||
Comment 2•11 years ago
|
||
Yes, it's related to fault injection. However, I'm trying to restrict my changes to those that shouldn't affect observable behavior. OOM faults, for example, maybe shouldn't cause a browser crash but necessarily cause tests to fail.
Comment 3•11 years ago
|
||
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #0)
> -- randomize test order in reftests
FWIW, the run-reftests-in-parallel patches also seem to trigger unusual reftest failures on several platforms; it wouldn't be surprising if tests somehow depend on test ordering.
Comment 4•11 years ago
|
||
(In reply to comment #3)
> (In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #0)
> > -- randomize test order in reftests
>
> FWIW, the run-reftests-in-parallel patches also seem to trigger unusual reftest
> failures on several platforms; it wouldn't be surprising if tests somehow
> depend on test ordering.
Yes, and even more so with mochitests.
Comment 5•11 years ago
|
||
Can we add randomness into GC and CC?
Add randomness into cache lookups?
(I'm primarily thinking of various layout-related caches/cached values,
like the "word cache", "text run cache", "frame offset cache",
"cached newline offset", "nsXULPrototypeCache", "font cache",
"font group cache", "gradient cache", "gfx.canvas.skiagl.dynamic-cache",
"harfbuzz script cache", "CmapCache", "gfx-d2d-surface-cache",
"userfontcache", "SkGlyphCache", "SkTypefaceCache", "SkImageCache",
"nsFontCache in nsDeviceContext.cpp", and various Cairo caches...)
Comment 6•11 years ago
|
||
I feel a bit uncomfortable with having this bug public -- I think we should
at least file the patches in a hidden bug and let our fuzz team test them
for a while before we make them public. We have to assume our adversaries
are watching what we do in bugs like this... and the Try server too...
Keywords: sec-want
Comment 7•11 years ago
|
||
Marked core security based on comment 6. Let's try to see if we find some critical stuff with this and then open the bug again once we are confident enough that nothing will blow up immediately :)
Group: core-security
Comment 8•11 years ago
|
||
There are two major sources of non-determinacy for the GC and CC: scheduling, and the order in which it does things (traverses, marks, unlinks, etc.). I think the former should be covered fairly well by fuzzing, as there are chrome-only functions to explicitly run them. One example that I haven't done anything about is bug 715684.
Assignee | ||
Comment 9•11 years ago
|
||
(In reply to Mats Palmgren (:mats) from comment #5)
> Add randomness into cache lookups?
> (I'm primarily thinking of various layout-related caches/cached values,
> like the "word cache", "text run cache", "frame offset cache",
> "cached newline offset", "nsXULPrototypeCache", "font cache",
> "font group cache", "gradient cache", "gfx.canvas.skiagl.dynamic-cache",
> "harfbuzz script cache", "CmapCache", "gfx-d2d-surface-cache",
> "userfontcache", "SkGlyphCache", "SkTypefaceCache", "SkImageCache",
> "nsFontCache in nsDeviceContext.cpp", and various Cairo caches...)
What sort of randomness? Like randomly failing cache lookups? We could try that I suppose.
Comment 10•11 years ago
|
||
Yeah, either pretend the thing we're looking for isn't cached and take the
non-cached path, or just clear the whole cache occasionally, whichever is
more convenient. The former is probably better for non-determinism.
I noticed that some of these caches are cleared on memory pressure,
so generating fake memory pressure events might be a tool to consider
as well.
> -- randomize amount read when reading from socket in nsHttpConnection
Do you expect this to "dominate" the nondeterminism in html parser chunking and interruptible reflow? Or maybe you would want to instrument those separately?
These also come to mind
-- random sched_yield or nanosleep before Lock() / Signal() calls (simulate contention)
-- random delays in IPC message processing (when cross-thread/process becomes interesting)
-- randomly forcing sync reflow
Comment 12•11 years ago
|
||
A lot of these could be fuzzer knobs, similar to the js engine's "TestingFunctions". By giving control to the DOM fuzzer rather than an RNG, we can make reduced testcases for many bugs without even needing a record-and-replay setup.
> -- set random priorities on threads
For fuzzing I would prefer a testing function to change the priority of a named thread. Most of our threads have names (e.g. "JS GC Helper") which show up in crash reports. Hopefully we can look up threads by name. Another possibility would be to have the fuzzer harness set environment variables like PRIORITY_JS_GC_HELPER that are automatically used when named threads are created.
When there are plenty of cores available, I'm not sure priorities have much effect. It might be more powerful to pause or starve a thread for a second.
> -- sched_yield just before dispatching an event
This could be a boolean setting, or a "yield 3 events from now", perhaps.
> -- add random variation to timers
Is it valid to change the order in which timers fire? For example, if I run {
setTimeout(f, 50);
setTimeout(g, 50);
}, are both orderings normally possible and valid?
Comment 13•11 years ago
|
||
Delaying kind of stuff could be useful. One external reporter filed a bunch of fuzz bugs last year that had repro instructions like "Open the test case in 10 copies of Firefox at once", which I assume really just cause things to load slower and thus you'd get slightly different behavior.
Assignee | ||
Comment 14•11 years ago
|
||
(In reply to Chris Jones [:cjones] mostly inactive; ni?/f?/r? if you need me from comment #11)
> > -- randomize amount read when reading from socket in nsHttpConnection
>
> Do you expect this to "dominate" the nondeterminism in html parser chunking
> and interruptible reflow? Or maybe you would want to instrument those
> separately?
I hope randomizing socket read amounts would exercise both of those.
> These also come to mind
>
> -- random sched_yield or nanosleep before Lock() / Signal() calls (simulate
> contention)
Yes although that could get a lot more expensive.
> -- random delays in IPC message processing (when cross-thread/process
> becomes interesting)
> -- randomly forcing sync reflow
Good ideas :-)
(In reply to Jesse Ruderman from comment #12)
> A lot of these could be fuzzer knobs, similar to the js engine's
> "TestingFunctions". By giving control to the DOM fuzzer rather than an RNG,
> we can make reduced testcases for many bugs without even needing a
> record-and-replay setup.
That's an interesting idea but I'm not sure how to expose API for all these knobs.
> > -- set random priorities on threads
>
> For fuzzing I would prefer a testing function to change the priority of a
> named thread. Most of our threads have names (e.g. "JS GC Helper") which
> show up in crash reports. Hopefully we can look up threads by name.
> Another possibility would be to have the fuzzer harness set environment
> variables like PRIORITY_JS_GC_HELPER that are automatically used when named
> threads are created.
>
> When there are plenty of cores available, I'm not sure priorities have much
> effect. It might be more powerful to pause or starve a thread for a second.
That's a good point. One way to address it would be to also set random thread affinities (e.g. pin each thread to one of N particular cores). http://dxr.mozilla.org/mozilla-central/source/nsprpub/pr/include/private/pprthred.h#88
> > -- add random variation to timers
>
> Is it valid to change the order in which timers fire? For example, if I run
> {
> setTimeout(f, 50);
> setTimeout(g, 50);
> }, are both orderings normally possible and valid?
No, but my patches don't do that. All timers are multiplexed onto a single thread which sleeps until the next timer needs to fire, and I mess with the timing of that sleep.
Comment 15•11 years ago
|
||
-- discard decoded image data more aggressively
Assignee | ||
Comment 16•11 years ago
|
||
I'm finding these patches quite useful for reproducing intermittent bugs. I'd like to land them so everyone can benefit.
Finding a bug with chaos mode enabled does not easily lead to an exploit against a normal build. You'd actually have to figure out what the underlying bug is and modify your testcase significantly to trigger it in a normal build. Furthermore, since there's no replay mechanism built into these patches, it's quite hard to figure out what's going on because there's so much variation from run to run. So I don't think we should worry too much about opening this up.
Assignee | ||
Comment 17•11 years ago
|
||
I don't currently have a way to enable ChaosMode dynamically. Hooking up the initial value in a thread-safe way seems rather complex. For now, this has to be enabled by patching ChaosMode.h.
Attachment #8386030 -
Flags: review?(jwalden+bmo)
Assignee | ||
Comment 18•11 years ago
|
||
Attachment #8386033 -
Flags: review?(benjamin)
Assignee | ||
Comment 19•11 years ago
|
||
Attachment #8386035 -
Flags: review?(benjamin)
Assignee | ||
Comment 20•11 years ago
|
||
Attachment #8386037 -
Flags: review?(jduell.mcbugs)
Assignee | ||
Comment 21•11 years ago
|
||
Attachment #8386038 -
Flags: review?(jduell.mcbugs)
Assignee | ||
Comment 22•11 years ago
|
||
Attachment #8386039 -
Flags: review?(jduell.mcbugs)
Assignee | ||
Comment 23•11 years ago
|
||
Attachment #8386040 -
Flags: review?(jorendorff)
Assignee | ||
Comment 24•11 years ago
|
||
Attachment #8386043 -
Flags: review?(bzbarsky)
Comment 25•11 years ago
|
||
Comment on attachment 8386043 [details] [diff] [review]
Part 8: In chaos mode, vary timer duration (while respecting order of firing)
>+ microseconds *= ChaosMode::randomUint32LessThan(ArrayLength(sFractions));
The RHS should be sFractions[ChaosMode::randomUint32LessThan(ArrayLength(sFractions))], I would think.
r=me with that fixed.
Attachment #8386043 -
Flags: review?(bzbarsky) → review+
Comment 26•11 years ago
|
||
Comment on attachment 8386043 [details] [diff] [review]
Part 8: In chaos mode, vary timer duration (while respecting order of firing)
Though actually, wait. I don't think this patch does what you want. In particular, it will change when the timer thread wakes up, but not when timers should run. The latter only happens if TimeStamp::Now() >= mTimers[0]->mTimeout, no?
Attachment #8386043 -
Flags: review+ → review-
Updated•11 years ago
|
Attachment #8386033 -
Flags: review?(benjamin) → review+
Updated•11 years ago
|
Attachment #8386035 -
Flags: review?(benjamin) → review+
Comment 27•11 years ago
|
||
Comment on attachment 8386040 [details] [diff] [review]
Part 7: In chaos mode, start hashtable iteration at a random entry
Review of attachment 8386040 [details] [diff] [review]:
-----------------------------------------------------------------
What an intriguing stack of patches.
Attachment #8386040 -
Flags: review?(jorendorff) → review+
Updated•11 years ago
|
Attachment #8386039 -
Flags: review?(jduell.mcbugs) → review?(mcmanus)
Updated•11 years ago
|
Attachment #8386038 -
Flags: review?(jduell.mcbugs) → review?(mcmanus)
Updated•11 years ago
|
Attachment #8386037 -
Flags: review?(jduell.mcbugs) → review?(mcmanus)
Comment 28•11 years ago
|
||
Comment on attachment 8386039 [details] [diff] [review]
Part 6: In chaos mode, when reading from an HTTP connection, sometimes read less than the full amount of available data
Review of attachment 8386039 [details] [diff] [review]:
-----------------------------------------------------------------
this is good noodle warping stuff.
imho its also exactly the kind of thing we should be working in the open.
Attachment #8386039 -
Flags: review?(mcmanus) → review+
Updated•11 years ago
|
Attachment #8386038 -
Flags: review?(mcmanus) → review+
Comment 29•11 years ago
|
||
Comment on attachment 8386037 [details] [diff] [review]
Part 4: In chaos mode, put new sockets at a random position in the poll list
Review of attachment 8386037 [details] [diff] [review]:
-----------------------------------------------------------------
I think there is probably some (mostly theoretical) livelock that this can cause - essentially because some sockets are no longer guaranteed to be read in the presence of contention. The chance of actually seeing that happen is very small and given the context of chaos mode I think its fine to proceed.. just documenting it here.
Attachment #8386037 -
Flags: review?(mcmanus) → review+
Updated•11 years ago
|
Severity: normal → enhancement
OS: Linux → All
Hardware: x86_64 → All
Summary: Investigate adding a "chaos mode" to Gecko → Add a "chaos mode" to Gecko
Updated•11 years ago
|
Group: core-security
Comment 30•11 years ago
|
||
Comment on attachment 8386030 [details] [diff] [review]
Part 1: Add ChaosMode.h base functionality
Review of attachment 8386030 [details] [diff] [review]:
-----------------------------------------------------------------
I've thought of doing something like this myself at some point. Always felt like I was being pulled into too much feature-work or assigned-work to actually take the time to write anything, tho. :-( Glad somebody's doing it.
::: mfbt/ChaosMode.h
@@ +6,5 @@
> +
> +#ifndef mozilla_ChaosMode_h
> +#define mozilla_ChaosMode_h
> +
> +#include "mozilla/Types.h"
Just use <stdint.h> directly, please.
@@ +18,5 @@
> + * choices is encouraged to make random and extreme choices, to test more
> + * code paths and uncover bugs.
> + */
> +class ChaosMode
> +{
Indent the whole class body two more spaces, please.
::: mfbt/moz.build
@@ +19,5 @@
> 'BloomFilter.h',
> 'Casting.h',
> 'Char16.h',
> 'CheckedInt.h',
> + 'ChaosMode.h',
This probably doesn't build because it's not alphabetical.
Attachment #8386030 -
Flags: review?(jwalden+bmo) → review+
Assignee | ||
Comment 31•11 years ago
|
||
Attachment #8386043 -
Attachment is obsolete: true
Attachment #8387125 -
Flags: review?(bzbarsky)
Assignee | ||
Comment 32•11 years ago
|
||
I'm going to address comments and land this ASAP if no-one strongly objects.
Comment 33•11 years ago
|
||
Comment on attachment 8387125 [details] [diff] [review]
part 8 v2
r=me
Attachment #8387125 -
Flags: review?(bzbarsky) → review+
Comment 34•11 years ago
|
||
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #16)
> Finding a bug with chaos mode enabled does not easily lead to an exploit
> against a normal build. You'd actually have to figure out what the
> underlying bug is and modify your testcase significantly to trigger it in a
> normal build. Furthermore, since there's no replay mechanism built into
> these patches, it's quite hard to figure out what's going on because there's
> so much variation from run to run. So I don't think we should worry too much
> about opening this up.
I'm not sure if I understand how these patches help on _reproducing_ an intermittent failure without a way to replay the chaos mode...
Assignee | ||
Comment 35•11 years ago
|
||
A solution to that problem is imminent :-).
Comment 36•11 years ago
|
||
Comment on attachment 8386030 [details] [diff] [review]
Part 1: Add ChaosMode.h base functionality
Review of attachment 8386030 [details] [diff] [review]:
-----------------------------------------------------------------
::: mfbt/ChaosMode.h
@@ +22,5 @@
> +{
> +public:
> + static bool isActive()
> + {
> + // XXX flip this to true to active chaos mode
s/active/activate/
Assignee | ||
Comment 37•11 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/f656d2fd4264
https://hg.mozilla.org/integration/mozilla-inbound/rev/4d971e595980
https://hg.mozilla.org/integration/mozilla-inbound/rev/f741f55301ef
https://hg.mozilla.org/integration/mozilla-inbound/rev/7693a49a8210
https://hg.mozilla.org/integration/mozilla-inbound/rev/3eb91b629358
https://hg.mozilla.org/integration/mozilla-inbound/rev/0b2ad936b5a9
https://hg.mozilla.org/integration/mozilla-inbound/rev/ecf3d4fbfa71
https://hg.mozilla.org/integration/mozilla-inbound/rev/c7d86e2ddfea
Comment 38•11 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/f656d2fd4264
https://hg.mozilla.org/mozilla-central/rev/4d971e595980
https://hg.mozilla.org/mozilla-central/rev/f741f55301ef
https://hg.mozilla.org/mozilla-central/rev/7693a49a8210
https://hg.mozilla.org/mozilla-central/rev/3eb91b629358
https://hg.mozilla.org/mozilla-central/rev/0b2ad936b5a9
https://hg.mozilla.org/mozilla-central/rev/ecf3d4fbfa71
https://hg.mozilla.org/mozilla-central/rev/c7d86e2ddfea
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla30
You need to log in
before you can comment on or make changes to this bug.
Description
•