Allow running individual tests under chaos mode

RESOLVED FIXED in Firefox 41

Status

()

enhancement
RESOLVED FIXED
4 years ago
4 years ago

People

(Reporter: kats, Assigned: kats)

Tracking

Trunk
mozilla41
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox41 fixed)

Details

Attachments

(3 attachments, 2 obsolete attachments)

Posted patch WIP (obsolete) — Splinter Review
Bug 955888 added a "chaos mode" switch in Gecko which injects "aggressive nondeterminism". This is useful for flushing out things like race conditions and other bugs, but in order to use it you have to patch mfbt/ChaosMode.h and do a build. Therefore in practice it is quite underused.

What I would like to do is make it possible to enable chaos mode at runtime, similar to a live pref. This would allow us to enable it on a per-test basis. Going forward I think anybody who writes a new test should make sure it is chaos-mode-safe (at least on Linux where we have rr), and reviewers should ask that new tests land with chaos mode enabled. Hopefully this will get us better coverage and stem the tide of new intermittent failures. As people discover chaos-failures in their new tests they might also fix pre-existing bugs in the code.

It doesn't seem easy to actually turn chaos mode into a pref because some of the callsites that use it probably run before the preference service is even initialized. That might be ok, I'm not sure. Making the mfbt code keep a static "chaos mode state" doesn't seem to compile, I think mfbt doesn't allow static data or something. For now I just hacked together a proof-of-concept to gauge interest in this, hopefully somebody can suggest a better way to do this.
Attachment #8604890 - Flags: feedback?(roc)
Attachment #8604890 - Flags: feedback?(nfroyd)
Comment on attachment 8604890 [details] [diff] [review]
WIP

Review of attachment 8604890 [details] [diff] [review]:
-----------------------------------------------------------------

This seems reasonable, though I'm unsure I have understood the motivation.  See comments for reftest.js.

I think you want to mark things as MFBT_{DATA,API} to get the visibility that you want:

http://mxr.mozilla.org/mozilla-central/source/mfbt/JSONWriter.h#118
http://mxr.mozilla.org/mozilla-central/source/mfbt/double-conversion/double-conversion.h#133

That should work better than stuffing things in nsCRTGlue.cpp/nsDebug.h.

::: dom/interfaces/base/nsIDOMWindowUtils.idl
@@ +1815,5 @@
>     */
>    attribute boolean serviceWorkersTestingEnabled;
> +
> +  void enterChaosMode();
> +  void leaveChaosMode();

Could use a comment explaining these declarations.

::: layout/tools/reftest/reftest.js
@@ +966,5 @@
>                  fuzzy_max_pixels = Number(m[3]);
>                }
> +            } else if (item == "chaos-mode") {
> +                cond = false;
> +                chaosMode = true;

Is the intent here that developers can annotate individual reftests (the ones they write, and ideally/eventually all reftests, at which point we should just have it on all the time) as chaos-mode, and then test them?  I guess the advantage is that they don't have to recompile to verify whether their test is chaos-mode safe, and we're reducing the barrier to manual testing?

I wonder how hard it would be to do the same thing for mochitests, if we can transfer enough information from the test manifests into the test harness, so we don't have to write a bunch of javascript for each chaos-mode-enabled test.
Attachment #8604890 - Flags: feedback?(nfroyd) → feedback+
(In reply to Nathan Froyd [:froydnj] [:nfroyd] from comment #1)
> I think you want to mark things as MFBT_{DATA,API} to get the visibility
> that you want:

Thanks, that sounds like what I was looking for. I'll give that a shot.

> Could use a comment explaining these declarations.

Yeah this was just a quick hacky thing I threw together. It'll probably look different if the MFBT_DATA thing works.

> ::: layout/tools/reftest/reftest.js
> @@ +966,5 @@
> >                  fuzzy_max_pixels = Number(m[3]);
> >                }
> > +            } else if (item == "chaos-mode") {
> > +                cond = false;
> > +                chaosMode = true;
> 
> Is the intent here that developers can annotate individual reftests (the
> ones they write, and ideally/eventually all reftests, at which point we
> should just have it on all the time) as chaos-mode, and then test them?  I
> guess the advantage is that they don't have to recompile to verify whether
> their test is chaos-mode safe, and we're reducing the barrier to manual
> testing?

Yes, exactly. The goal for now is to land all new reftests with chaos-mode in the reftest.list file, so they are less prone to intermittent failures. We can pick away at pre-existing tests once we stem the tide of new failures.

> I wonder how hard it would be to do the same thing for mochitests, if we can
> transfer enough information from the test manifests into the test harness,
> so we don't have to write a bunch of javascript for each chaos-mode-enabled
> test.

I would definitely want to expose this to mochitests as well. It should be easy enough with the DOMWindowUtils API in place. We might be able to put something in the mochitest.ini file, or (simpler) something similar to SpecialPowers.pushPrefEnv, but like SpecialPowers.runThisTestInChaosMode().
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #2)
> (In reply to Nathan Froyd [:froydnj] [:nfroyd] from comment #1)
> > Is the intent here that developers can annotate individual reftests (the
> > ones they write, and ideally/eventually all reftests, at which point we
> > should just have it on all the time) as chaos-mode, and then test them?  I
> > guess the advantage is that they don't have to recompile to verify whether
> > their test is chaos-mode safe, and we're reducing the barrier to manual
> > testing?
> 
> Yes, exactly. The goal for now is to land all new reftests with chaos-mode
> in the reftest.list file, so they are less prone to intermittent failures.
> We can pick away at pre-existing tests once we stem the tide of new failures.

OK, cool.  Can you file a bug for verifying that pushes touching reftest.list files come with the chaos-mode annotation on additions, and rejecting the push otherwise?  If it detected re-enablings, that'd be a bonus.  This shouldn't be terribly hard to do, since we do something similar for the UUID bump required for IDL files...

> > I wonder how hard it would be to do the same thing for mochitests, if we can
> > transfer enough information from the test manifests into the test harness,
> > so we don't have to write a bunch of javascript for each chaos-mode-enabled
> > test.
> 
> I would definitely want to expose this to mochitests as well. It should be
> easy enough with the DOMWindowUtils API in place. We might be able to put
> something in the mochitest.ini file, or (simpler) something similar to
> SpecialPowers.pushPrefEnv, but like SpecialPowers.runThisTestInChaosMode().

I guess for new tests, whether we go the annotation route or the JS route doesn't matter much, but for old tests, I would think that using the annotations in manifests would be easier.
Attachment #8604890 - Attachment is obsolete: true
Attachment #8604890 - Flags: feedback?(roc)
Attachment #8606285 - Attachment is obsolete: true
Attachment #8606331 - Flags: review?(roc)
Attachment #8606331 - Flags: review?(nfroyd)
Assuming this goes well I'll file a follow-up for a commit hook that prevents new reftests from landing without chaos-mode.
Attachment #8606331 - Attachment description: Part 1 - Allow running individual reftests in chaos mode → Part 1 - Allow running individual tests in chaos mode
Comment on attachment 8606331 [details] [diff] [review]
Part 1 - Allow running individual reftests in chaos mode

Review of attachment 8606331 [details] [diff] [review]:
-----------------------------------------------------------------

::: layout/tools/reftest/reftest.js
@@ +966,5 @@
>                  fuzzy_max_pixels = Number(m[3]);
>                }
> +            } else if (item == "chaos-mode") {
> +                cond = false;
> +                chaosMode = true;

I won't get to a real review of this until Monday, but I was wondering about chaos-mode failures on non-Linux systems (and no test failures observed on Linux systems, so debugging via rr is not feasible).  Is such a scenario likely?  Should we restrict chaos-mode on reftests to Linux?
Attachment #8606331 - Attachment description: Part 1 - Allow running individual tests in chaos mode → Part 1 - Allow running individual reftests in chaos mode
ni? for comment 8
Flags: needinfo?(bugmail.mozilla)
(In reply to Nathan Froyd [:froydnj] [:nfroyd] from comment #8)
> I won't get to a real review of this until Monday, but I was wondering about
> chaos-mode failures on non-Linux systems (and no test failures observed on
> Linux systems, so debugging via rr is not feasible).  Is such a scenario
> likely?  Should we restrict chaos-mode on reftests to Linux?

It's certainly possible, although I'm not sure about "likely". However I don't see that as a reason to restrict chaos-mode on reftests to Linux.

It's certainly possible to debug intermittent failures without rr, as long as it can be triggered reliably. If chaos-mode triggers it reliably, then I would expect test authors to debug it prior to landing their new test. If it fails only very rarely then maybe we should allow landing the test without chaos-mode - this is a bit of a judgement call depending on the value of the test and the frequency of the intermittent failure. Presumably any commit hook will allow overriding (a la CLOSED TREE or IGNORE IDL) and that mechanism can be used if necessary.

Overall though I think it's worth keeping chaos-mode on all platforms rather than restrict it to Linux.
Flags: needinfo?(bugmail.mozilla)
Attachment #8606334 - Flags: review?(botond) → review+
Comment on attachment 8606331 [details] [diff] [review]
Part 1 - Allow running individual reftests in chaos mode

Review of attachment 8606331 [details] [diff] [review]:
-----------------------------------------------------------------

The code seems OK. However, I have a couple of concerns:
1) What evidence do we have that chaos mode is actually useful at finding important bugs? I actually don't have any :-). Has anyone found that turning on chaos mode helps them reproduce an important bug?
2) Without rr I think debugging chaos-mode-induced failures will be pretty hard. I fear we might make intermittent tests fail more often --- but not often enough to make debugging tractable --- which maybe isn't a win.
Attachment #8606331 - Flags: review?(roc) → review+
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #11)
> The code seems OK. However, I have a couple of concerns:
> 1) What evidence do we have that chaos mode is actually useful at finding
> important bugs? I actually don't have any :-). Has anyone found that turning
> on chaos mode helps them reproduce an important bug?

I don't either, but I think it's largely because chaos mode isn't part of anybody's workflow right now. We have no data on the usefulness of chaos mode at all, and so cannot conclude anything about it. And I don't see that changing anytime soon unless we make it easier to use. So maybe think of this as an experiment to see if chaos mode can provide value.

> 2) Without rr I think debugging chaos-mode-induced failures will be pretty
> hard. I fear we might make intermittent tests fail more often --- but not
> often enough to make debugging tractable --- which maybe isn't a win.

This might be the case. Again, I don't think we can know this until we try it and see what happens. If it turns out this is not a good idea we can always back it out. For now we can keep it optional so that people can use it if they think it's a good idea and if we find that it provides value by flushing out bugs sooner we can make it mandatory later.
Try push [1] with these patches shows some build failures with building webapprt-stub on windows and mac. The mac log points to the gChaosModeCounter I added. I'll have to dig to find out how to fix this but if anybody knows off the top of their head do tell.

06:54:06     INFO -  /builds/slave/try-m64-0000000000000000000000/build/src/obj-firefox/i386/webapprt/mac/tmpRNzyPH.list:
06:54:06     INFO -      webapprt.o
06:54:06     INFO -  Undefined symbols for architecture i386:
06:54:06     INFO -    "mozilla::detail::gChaosModeCounter", referenced from:
06:54:06     INFO -        PL_DHashTableEnumerate(PLDHashTable*, PLDHashOperator (*)(PLDHashTable*, PLDHashEntryHdr*, unsigned int, void*), void*) in libxpcomglue.a(pldhash.o)
06:54:06     INFO -        PLDHashTable::Iterator::Iterator(PLDHashTable const*) in libxpcomglue.a(pldhash.o)
06:54:06     INFO -        PLDHashTable::Iterator::Iterator(PLDHashTable const*) in libxpcomglue.a(pldhash.o)
06:54:06     INFO -  ld: symbol(s) not found for architecture i386
06:54:06     INFO -  clang: error: linker command failed with exit code 1 (use -v to see invocation)


[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=1b0400439327
[2]
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #13)
> Try push [1] with these patches shows some build failures with building
> webapprt-stub on windows and mac. The mac log points to the
> gChaosModeCounter I added. I'll have to dig to find out how to fix this but
> if anybody knows off the top of their head do tell.

I believe Eric has traversed this sorry road already and may have some suggestions.
Flags: needinfo?(erahm)
Comment on attachment 8606331 [details] [diff] [review]
Part 1 - Allow running individual reftests in chaos mode

Review of attachment 8606331 [details] [diff] [review]:
-----------------------------------------------------------------

I think the concept is worth it as an experiment.  If it doesn't seem to help, we can back out and think harder. :)

::: mfbt/ChaosMode.h
@@ +48,5 @@
>  
>  public:
>    static bool isActive(ChaosFeature aFeature)
>    {
> +    if (detail::gChaosModeCounter > 0) {

Does linking work any better if you out-of-line the functions accessing gChaosModeCounter into ChaosMode.cpp?  (Probably need to convert them to MFBT_API in the process.)
Attachment #8606331 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd [:froydnj] [:nfroyd] from comment #14)
> (In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #13)
> > Try push [1] with these patches shows some build failures with building
> > webapprt-stub on windows and mac. The mac log points to the
> > gChaosModeCounter I added. I'll have to dig to find out how to fix this but
> > if anybody knows off the top of their head do tell.
> 
> I believe Eric has traversed this sorry road already and may have some
> suggestions.

This is probably bug 1160285, the fix, in theory, is to add mfbt as a lib on mac and create a static mfbt lib on windows and link against it.
Flags: needinfo?(erahm)
Depends on: 1160285
Nathan, any objections to moving all the ChaosMode stuff from mfbt to xpcom/base like glandium suggested at [1]? (I haven't tried this yet to see if it works).

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1160285#c10
Flags: needinfo?(nfroyd)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #17)
> Nathan, any objections to moving all the ChaosMode stuff from mfbt to
> xpcom/base like glandium suggested at [1]? (I haven't tried this yet to see
> if it works).
> 
> [1] https://bugzilla.mozilla.org/show_bug.cgi?id=1160285#c10

I have no objections.  It might be nice if we kept in mfbt so JS can use it, but in the absence of use thus far, and making it easier to experiment with ChaosMode to see if we can flush out bugs and whatnot, moving it is preferable.  We can deal with moving it back when the time comes.
Flags: needinfo?(nfroyd)
Posted file build.log
Hm, moving it into xpcom/base doesn't seem to work either. There's a lot of little test executables all over that link against specific libraries but not libxul that fail to build, it looks like. I'm attaching a sample build failure log from my local build on OS X.
Depends on: 1169433
https://hg.mozilla.org/mozilla-central/rev/34066854f645
https://hg.mozilla.org/mozilla-central/rev/89ac61464a45
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla41
Depends on: 1174011
So this is weird. Yesterday I landed patches in bug 1172789, which introduced the use of PLDHashTable::Iterator to some new places. It turns out that PLDHashTable::Iterator had a bug that could cause a mod-by-zero to occur when chaos mode was enabled.

This bug is manifesting on mozilla-central as an M-oth perma-fail in browser/devtools/webide/test/sidebars/test_telemetry.html, which is not one of the chaos-mode-enabled tests, AFAICT. But it's not manifesting on mozilla-inbound.

Fortunately I just fixed the mod-by-zero in bug 1174046, so the perma-fail will go away soon.

However, it seems like (a) Chaos Mode is being activated on mozilla-central but not on mozilla-inbound, and (b) it's being activated on more tests than expected.

kats, any ideas?
Flags: needinfo?(bugmail.mozilla)
> So this is weird. 

philor worked it out:

- I landed the patch from bug 1172789 on mozilla-inbound; it introduced a latent defect that would trigger when certain tests ran in chaos mode.

- Bug 1159043 landed a patch on fx-team that turned on chaos mode for test_telemetry.html.

- Those two patches first came into contact with each other on mozilla-central, causing M-oth to fail there.

- The fx-team patch later got merged to mozilla-inbound, causing M-oth to fail there too. And so on to other repos.
Flags: needinfo?(bugmail.mozilla)
You need to log in before you can comment on or make changes to this bug.