Closed Bug 955888 Opened 12 years ago Closed 11 years ago

Add a "chaos mode" to Gecko

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla30

People

(Reporter: roc, Assigned: roc)

References

(Blocks 1 open bug)

Details

(Keywords: sec-want)

Attachments

(8 files, 1 obsolete file)

Part 1: Add ChaosMode.h base functionality 11 years ago Robert O'Callahan (:roc) (email my personal email if necessary) 1.99 KB, patch	Waldo : review+	Details \| Diff \| Splinter Review
Part 2: In chaos mode, sometimes yield just before dispatching an XPCOM event 11 years ago Robert O'Callahan (:roc) (email my personal email if necessary) 1.74 KB, patch	benjamin : review+	Details \| Diff \| Splinter Review
Part 3: In chaos mode, give threads randomly chosen priorities, and pin some threads to CPU 0 so they compete for CPU 11 years ago Robert O'Callahan (:roc) (email my personal email if necessary) 4.18 KB, patch	benjamin : review+	Details \| Diff \| Splinter Review
Part 4: In chaos mode, put new sockets at a random position in the poll list 11 years ago Robert O'Callahan (:roc) (email my personal email if necessary) 2.46 KB, patch	mcmanus : review+	Details \| Diff \| Splinter Review
Part 5: In chaos mode, insert an HTTP transaction into the pending queue at a random position among transactions with the same priority 11 years ago Robert O'Callahan (:roc) (email my personal email if necessary) 2.33 KB, patch	mcmanus : review+	Details \| Diff \| Splinter Review
Part 6: In chaos mode, when reading from an HTTP connection, sometimes read less than the full amount of available data 11 years ago Robert O'Callahan (:roc) (email my personal email if necessary) 1.85 KB, patch	mcmanus : review+	Details \| Diff \| Splinter Review
Part 7: In chaos mode, start hashtable iteration at a random entry 11 years ago Robert O'Callahan (:roc) (email my personal email if necessary) 2.69 KB, patch	jorendorff : review+	Details \| Diff \| Splinter Review
Part 8: In chaos mode, vary timer duration (while respecting order of firing) 11 years ago Robert O'Callahan (:roc) (email my personal email if necessary) 2.99 KB, patch	bzbarsky : review-	Details \| Diff \| Splinter Review
part 8 v2 11 years ago Robert O'Callahan (:roc) (email my personal email if necessary) 7.54 KB, patch	bzbarsky : review+	Details \| Diff \| Splinter Review

Robert O'Callahan (:roc) (email my personal email if necessary)

Assignee

Description

•

12 years ago

I'm experimenting with adding a global mode switch to Gecko that enables injection of aggressive nondeterminism. In various places we have code that makes decisions deterministically, where it would also be correct (but not necessarily performant) to choose nondeterministically among a range of options; by choosing among those options using a PRNG we can force our tests to explore more of the state space. I have a patch stack that does the following: -- set random priorities on threads -- sched_yield just before dispatching an event -- randomize priority of sockets in socket poll list (nsSocketTransportService::AddToPollList) -- randomize insertion in nsHttpConnectionMgr's InsertTransactionSorted -- randomize amount read when reading from socket in nsHttpConnection -- randomize test order in reftests -- randomize test order in mochitests -- randomize PLDHash enumeration start position -- add random variation to timers These patches together trigger a decent number of test failures: https://tbpl.mozilla.org/?tree=Try&rev=d50dfe799e1f Some of these seem to be known intermittent oranges, but many are not. I need to do some more work to increase confidence that these patches are correct and are only exposing, not causing, incorrect behavior. It would also be interesting to characterize the effects of each patch. The downside of chaos mode is that the resulting failures are difficult to reproduce and debug. So the broader vision is to use record-and-replay debugging tools to debug failures induced by chaos mode. Hopefully chaos mode will also benefit record and replay tools, by making it easier to explore interesting application state space in the perturbed environment of those tools. (Alternatively we could try to instrument ChaosMode to return a reproducible stream of random numbers for debugging purposes, but I suspect that approach won't work well due to other sources of nondeterminism messing with the ordering of PRNG calls.)

Andrew McCreight [:mccr8]

Comment 1

•

12 years ago

That sounds vaguely similar to some OOM testing decoder has done with the JS engine, except the choice point is (obviously) allocation.

Robert O'Callahan (:roc) (email my personal email if necessary)

Assignee

Comment 2

•

12 years ago

Yes, it's related to fault injection. However, I'm trying to restrict my changes to those that shouldn't affect observable behavior. OOM faults, for example, maybe shouldn't cause a browser crash but necessarily cause tests to fail.

Nathan Froyd [:froydnj]

Comment 3

•

12 years ago

(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #0) > -- randomize test order in reftests FWIW, the run-reftests-in-parallel patches also seem to trigger unusual reftest failures on several platforms; it wouldn't be surprising if tests somehow depend on test ordering.

(no longer active)

Comment 4

•

12 years ago

(In reply to comment #3) > (In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #0) > > -- randomize test order in reftests > > FWIW, the run-reftests-in-parallel patches also seem to trigger unusual reftest > failures on several platforms; it wouldn't be surprising if tests somehow > depend on test ordering. Yes, and even more so with mochitests.

Mats Palmgren (inactive)

Comment 5

•

12 years ago

Can we add randomness into GC and CC? Add randomness into cache lookups? (I'm primarily thinking of various layout-related caches/cached values, like the "word cache", "text run cache", "frame offset cache", "cached newline offset", "nsXULPrototypeCache", "font cache", "font group cache", "gradient cache", "gfx.canvas.skiagl.dynamic-cache", "harfbuzz script cache", "CmapCache", "gfx-d2d-surface-cache", "userfontcache", "SkGlyphCache", "SkTypefaceCache", "SkImageCache", "nsFontCache in nsDeviceContext.cpp", and various Cairo caches...)

Mats Palmgren (inactive)

Comment 6

•

12 years ago

I feel a bit uncomfortable with having this bug public -- I think we should at least file the patches in a hidden bug and let our fuzz team test them for a while before we make them public. We have to assume our adversaries are watching what we do in bugs like this... and the Try server too...

Keywords: sec-want

Christian Holler (:decoder)

Comment 7

•

12 years ago

Marked core security based on comment 6. Let's try to see if we find some critical stuff with this and then open the bug again once we are confident enough that nothing will blow up immediately :)

Group: core-security

Andrew McCreight [:mccr8]

Updated

•

12 years ago

Comment 6 is private: false

Andrew McCreight [:mccr8]

Comment 8

•

12 years ago

There are two major sources of non-determinacy for the GC and CC: scheduling, and the order in which it does things (traverses, marks, unlinks, etc.). I think the former should be covered fairly well by fuzzing, as there are chrome-only functions to explicitly run them. One example that I haven't done anything about is bug 715684.

Robert O'Callahan (:roc) (email my personal email if necessary)

Assignee

Comment 9

•

12 years ago

(In reply to Mats Palmgren (:mats) from comment #5) > Add randomness into cache lookups? > (I'm primarily thinking of various layout-related caches/cached values, > like the "word cache", "text run cache", "frame offset cache", > "cached newline offset", "nsXULPrototypeCache", "font cache", > "font group cache", "gradient cache", "gfx.canvas.skiagl.dynamic-cache", > "harfbuzz script cache", "CmapCache", "gfx-d2d-surface-cache", > "userfontcache", "SkGlyphCache", "SkTypefaceCache", "SkImageCache", > "nsFontCache in nsDeviceContext.cpp", and various Cairo caches...) What sort of randomness? Like randomly failing cache lookups? We could try that I suppose.

Mats Palmgren (inactive)

Comment 10

•

12 years ago

Yeah, either pretend the thing we're looking for isn't cached and take the non-cached path, or just clear the whole cache occasionally, whichever is more convenient. The former is probably better for non-determinism. I noticed that some of these caches are cleared on memory pressure, so generating fake memory pressure events might be a tool to consider as well.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 11

•

12 years ago

> -- randomize amount read when reading from socket in nsHttpConnection Do you expect this to "dominate" the nondeterminism in html parser chunking and interruptible reflow? Or maybe you would want to instrument those separately? These also come to mind -- random sched_yield or nanosleep before Lock() / Signal() calls (simulate contention) -- random delays in IPC message processing (when cross-thread/process becomes interesting) -- randomly forcing sync reflow

Jesse Ruderman

Comment 12

•

12 years ago

A lot of these could be fuzzer knobs, similar to the js engine's "TestingFunctions". By giving control to the DOM fuzzer rather than an RNG, we can make reduced testcases for many bugs without even needing a record-and-replay setup. > -- set random priorities on threads For fuzzing I would prefer a testing function to change the priority of a named thread. Most of our threads have names (e.g. "JS GC Helper") which show up in crash reports. Hopefully we can look up threads by name. Another possibility would be to have the fuzzer harness set environment variables like PRIORITY_JS_GC_HELPER that are automatically used when named threads are created. When there are plenty of cores available, I'm not sure priorities have much effect. It might be more powerful to pause or starve a thread for a second. > -- sched_yield just before dispatching an event This could be a boolean setting, or a "yield 3 events from now", perhaps. > -- add random variation to timers Is it valid to change the order in which timers fire? For example, if I run { setTimeout(f, 50); setTimeout(g, 50); }, are both orderings normally possible and valid?

Andrew McCreight [:mccr8]

Comment 13

•

12 years ago

Delaying kind of stuff could be useful. One external reporter filed a bunch of fuzz bugs last year that had repro instructions like "Open the test case in 10 copies of Firefox at once", which I assume really just cause things to load slower and thus you'd get slightly different behavior.

Robert O'Callahan (:roc) (email my personal email if necessary)

Assignee

Comment 14

•

12 years ago

(In reply to Chris Jones [:cjones] mostly inactive; ni?/f?/r? if you need me from comment #11) > > -- randomize amount read when reading from socket in nsHttpConnection > > Do you expect this to "dominate" the nondeterminism in html parser chunking > and interruptible reflow? Or maybe you would want to instrument those > separately? I hope randomizing socket read amounts would exercise both of those. > These also come to mind > > -- random sched_yield or nanosleep before Lock() / Signal() calls (simulate > contention) Yes although that could get a lot more expensive. > -- random delays in IPC message processing (when cross-thread/process > becomes interesting) > -- randomly forcing sync reflow Good ideas :-) (In reply to Jesse Ruderman from comment #12) > A lot of these could be fuzzer knobs, similar to the js engine's > "TestingFunctions". By giving control to the DOM fuzzer rather than an RNG, > we can make reduced testcases for many bugs without even needing a > record-and-replay setup. That's an interesting idea but I'm not sure how to expose API for all these knobs. > > -- set random priorities on threads > > For fuzzing I would prefer a testing function to change the priority of a > named thread. Most of our threads have names (e.g. "JS GC Helper") which > show up in crash reports. Hopefully we can look up threads by name. > Another possibility would be to have the fuzzer harness set environment > variables like PRIORITY_JS_GC_HELPER that are automatically used when named > threads are created. > > When there are plenty of cores available, I'm not sure priorities have much > effect. It might be more powerful to pause or starve a thread for a second. That's a good point. One way to address it would be to also set random thread affinities (e.g. pin each thread to one of N particular cores). http://dxr.mozilla.org/mozilla-central/source/nsprpub/pr/include/private/pprthred.h#88 > > -- add random variation to timers > > Is it valid to change the order in which timers fire? For example, if I run > { > setTimeout(f, 50); > setTimeout(g, 50); > }, are both orderings normally possible and valid? No, but my patches don't do that. All timers are multiplexed onto a single thread which sleeps until the next timer needs to fire, and I mess with the timing of that sleep.

Mats Palmgren (inactive)

Comment 15

•

11 years ago

-- discard decoded image data more aggressively

Robert O'Callahan (:roc) (email my personal email if necessary)

Assignee

Comment 16

•

11 years ago

I'm finding these patches quite useful for reproducing intermittent bugs. I'd like to land them so everyone can benefit. Finding a bug with chaos mode enabled does not easily lead to an exploit against a normal build. You'd actually have to figure out what the underlying bug is and modify your testcase significantly to trigger it in a normal build. Furthermore, since there's no replay mechanism built into these patches, it's quite hard to figure out what's going on because there's so much variation from run to run. So I don't think we should worry too much about opening this up.

Robert O'Callahan (:roc) (email my personal email if necessary)

Assignee

Comment 17

•

11 years ago

Attached patch Part 1: Add ChaosMode.h base functionality — Details — Splinter Review

I don't currently have a way to enable ChaosMode dynamically. Hooking up the initial value in a thread-safe way seems rather complex. For now, this has to be enabled by patching ChaosMode.h.

Attachment #8386030 - Flags: review?(jwalden+bmo)

Robert O'Callahan (:roc) (email my personal email if necessary)

Assignee

Comment 18

•

11 years ago

Attached patch Part 2: In chaos mode, sometimes yield just before dispatching an XPCOM event — Details — Splinter Review

Attachment #8386033 - Flags: review?(benjamin)

Robert O'Callahan (:roc) (email my personal email if necessary)

Assignee

Comment 19

•

11 years ago

Attached patch Part 3: In chaos mode, give threads randomly chosen priorities, and pin some threads to CPU 0 so they compete for CPU — Details — Splinter Review

Attachment #8386035 - Flags: review?(benjamin)

Robert O'Callahan (:roc) (email my personal email if necessary)

Assignee

Comment 20

•

11 years ago

Attached patch Part 4: In chaos mode, put new sockets at a random position in the poll list — Details — Splinter Review

Attachment #8386037 - Flags: review?(jduell.mcbugs)

Robert O'Callahan (:roc) (email my personal email if necessary)

Assignee

Comment 21

•

11 years ago

Attached patch Part 5: In chaos mode, insert an HTTP transaction into the pending queue at a random position among transactions with the same priority — Details — Splinter Review

Attachment #8386038 - Flags: review?(jduell.mcbugs)

Robert O'Callahan (:roc) (email my personal email if necessary)

Assignee

Comment 22

•

11 years ago

Attached patch Part 6: In chaos mode, when reading from an HTTP connection, sometimes read less than the full amount of available data — Details — Splinter Review

Attachment #8386039 - Flags: review?(jduell.mcbugs)

Robert O'Callahan (:roc) (email my personal email if necessary)

Assignee

Comment 23

•

11 years ago

Attached patch Part 7: In chaos mode, start hashtable iteration at a random entry — Details — Splinter Review

Attachment #8386040 - Flags: review?(jorendorff)

Robert O'Callahan (:roc) (email my personal email if necessary)

Assignee

Comment 24

•

11 years ago

Attached patch Part 8: In chaos mode, vary timer duration (while respecting order of firing) (obsolete) — Details — Splinter Review

Attachment #8386043 - Flags: review?(bzbarsky)

Boris Zbarsky [:bzbarsky]

Comment 25

•

11 years ago

Comment on attachment 8386043 [details] [diff] [review] Part 8: In chaos mode, vary timer duration (while respecting order of firing) >+ microseconds *= ChaosMode::randomUint32LessThan(ArrayLength(sFractions)); The RHS should be sFractions[ChaosMode::randomUint32LessThan(ArrayLength(sFractions))], I would think. r=me with that fixed.

Attachment #8386043 - Flags: review?(bzbarsky) → review+

Boris Zbarsky [:bzbarsky]

Comment 26

•

11 years ago

Comment on attachment 8386043 [details] [diff] [review] Part 8: In chaos mode, vary timer duration (while respecting order of firing) Though actually, wait. I don't think this patch does what you want. In particular, it will change when the timer thread wakes up, but not when timers should run. The latter only happens if TimeStamp::Now() >= mTimers[0]->mTimeout, no?

Attachment #8386043 - Flags: review+ → review-

Benjamin Smedberg

Updated

•

11 years ago

Attachment #8386033 - Flags: review?(benjamin) → review+

Benjamin Smedberg

Updated

•

11 years ago

Attachment #8386035 - Flags: review?(benjamin) → review+

Jason Orendorff [:jorendorff]

Comment 27

•

11 years ago

Comment on attachment 8386040 [details] [diff] [review] Part 7: In chaos mode, start hashtable iteration at a random entry Review of attachment 8386040 [details] [diff] [review]: ----------------------------------------------------------------- What an intriguing stack of patches.

Attachment #8386040 - Flags: review?(jorendorff) → review+

Jason Duell

Updated

•

11 years ago

Attachment #8386039 - Flags: review?(jduell.mcbugs) → review?(mcmanus)

Jason Duell

Updated

•

11 years ago

Attachment #8386038 - Flags: review?(jduell.mcbugs) → review?(mcmanus)

Jason Duell

Updated

•

11 years ago

Attachment #8386037 - Flags: review?(jduell.mcbugs) → review?(mcmanus)

Patrick McManus [:mcmanus]

Comment 28

•

11 years ago

Comment on attachment 8386039 [details] [diff] [review] Part 6: In chaos mode, when reading from an HTTP connection, sometimes read less than the full amount of available data Review of attachment 8386039 [details] [diff] [review]: ----------------------------------------------------------------- this is good noodle warping stuff. imho its also exactly the kind of thing we should be working in the open.

Attachment #8386039 - Flags: review?(mcmanus) → review+

Patrick McManus [:mcmanus]

Updated

•

11 years ago

Attachment #8386038 - Flags: review?(mcmanus) → review+

Patrick McManus [:mcmanus]

Comment 29

•

11 years ago

Comment on attachment 8386037 [details] [diff] [review] Part 4: In chaos mode, put new sockets at a random position in the poll list Review of attachment 8386037 [details] [diff] [review]: ----------------------------------------------------------------- I think there is probably some (mostly theoretical) livelock that this can cause - essentially because some sockets are no longer guaranteed to be read in the presence of contention. The chance of actually seeing that happen is very small and given the context of chaos mode I think its fine to proceed.. just documenting it here.

Attachment #8386037 - Flags: review?(mcmanus) → review+

Jesse Ruderman

Updated

•

11 years ago

Severity: normal → enhancement

OS: Linux → All

Hardware: x86_64 → All

Summary: Investigate adding a "chaos mode" to Gecko → Add a "chaos mode" to Gecko

Jesse Ruderman

Updated

•

11 years ago

Group: core-security

Jeff Walden [:Waldo]

Comment 30

•

11 years ago

Comment on attachment 8386030 [details] [diff] [review] Part 1: Add ChaosMode.h base functionality Review of attachment 8386030 [details] [diff] [review]: ----------------------------------------------------------------- I've thought of doing something like this myself at some point. Always felt like I was being pulled into too much feature-work or assigned-work to actually take the time to write anything, tho. :-( Glad somebody's doing it. ::: mfbt/ChaosMode.h @@ +6,5 @@ > + > +#ifndef mozilla_ChaosMode_h > +#define mozilla_ChaosMode_h > + > +#include "mozilla/Types.h" Just use <stdint.h> directly, please. @@ +18,5 @@ > + * choices is encouraged to make random and extreme choices, to test more > + * code paths and uncover bugs. > + */ > +class ChaosMode > +{ Indent the whole class body two more spaces, please. ::: mfbt/moz.build @@ +19,5 @@ > 'BloomFilter.h', > 'Casting.h', > 'Char16.h', > 'CheckedInt.h', > + 'ChaosMode.h', This probably doesn't build because it's not alphabetical.

Attachment #8386030 - Flags: review?(jwalden+bmo) → review+

Robert O'Callahan (:roc) (email my personal email if necessary)

Assignee

Comment 31

•

11 years ago

Attached patch part 8 v2 — Details — Splinter Review

Attachment #8386043 - Attachment is obsolete: true

Attachment #8387125 - Flags: review?(bzbarsky)

Robert O'Callahan (:roc) (email my personal email if necessary)

Assignee

Comment 32

•

11 years ago

I'm going to address comments and land this ASAP if no-one strongly objects.

Boris Zbarsky [:bzbarsky]

Comment 33

•

11 years ago

Comment on attachment 8387125 [details] [diff] [review] part 8 v2 r=me

Attachment #8387125 - Flags: review?(bzbarsky) → review+

(no longer active)

Comment 34

•

11 years ago

(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #16) > Finding a bug with chaos mode enabled does not easily lead to an exploit > against a normal build. You'd actually have to figure out what the > underlying bug is and modify your testcase significantly to trigger it in a > normal build. Furthermore, since there's no replay mechanism built into > these patches, it's quite hard to figure out what's going on because there's > so much variation from run to run. So I don't think we should worry too much > about opening this up. I'm not sure if I understand how these patches help on _reproducing_ an intermittent failure without a way to replay the chaos mode...

Robert O'Callahan (:roc) (email my personal email if necessary)

Assignee

Comment 35

•

11 years ago

A solution to that problem is imminent :-).

Nicholas Nethercote [inactive]

Comment 36

•

11 years ago

Comment on attachment 8386030 [details] [diff] [review] Part 1: Add ChaosMode.h base functionality Review of attachment 8386030 [details] [diff] [review]: ----------------------------------------------------------------- ::: mfbt/ChaosMode.h @@ +22,5 @@ > +{ > +public: > + static bool isActive() > + { > + // XXX flip this to true to active chaos mode s/active/activate/

Robert O'Callahan (:roc) (email my personal email if necessary)

Assignee

Comment 37

•

11 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/f656d2fd4264 https://hg.mozilla.org/integration/mozilla-inbound/rev/4d971e595980 https://hg.mozilla.org/integration/mozilla-inbound/rev/f741f55301ef https://hg.mozilla.org/integration/mozilla-inbound/rev/7693a49a8210 https://hg.mozilla.org/integration/mozilla-inbound/rev/3eb91b629358 https://hg.mozilla.org/integration/mozilla-inbound/rev/0b2ad936b5a9 https://hg.mozilla.org/integration/mozilla-inbound/rev/ecf3d4fbfa71 https://hg.mozilla.org/integration/mozilla-inbound/rev/c7d86e2ddfea

Phil Ringnalda (:philor)

Comment 38

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/f656d2fd4264 https://hg.mozilla.org/mozilla-central/rev/4d971e595980 https://hg.mozilla.org/mozilla-central/rev/f741f55301ef https://hg.mozilla.org/mozilla-central/rev/7693a49a8210 https://hg.mozilla.org/mozilla-central/rev/3eb91b629358 https://hg.mozilla.org/mozilla-central/rev/0b2ad936b5a9 https://hg.mozilla.org/mozilla-central/rev/ecf3d4fbfa71 https://hg.mozilla.org/mozilla-central/rev/c7d86e2ddfea

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla30

Nika Layzell [:nika] (ni? for response)

Updated

•

3 years ago

Blocks: 1783711

You need to log in before you can comment on or make changes to this bug.