Closed Bug 1344991 Opened 5 years ago Closed 5 years ago

Continue reftest testing after a crash

Categories

(Testing :: Reftest, defect)

Version 3
defect
Not set
normal

Tracking

(firefox55 fixed)

RESOLVED FIXED
mozilla55
Tracking Status
firefox55 --- fixed

People

(Reporter: shinglyu, Assigned: shinglyu)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(1 file)

Reftest currently stops after a crash. This is a problem when we make a breaking change (e.g. Stylo) and has many crashes. In this case we would like it to continue running the rest of the tests, so we can get a whole picture of all the test results, instead of diabling the first crash and re-run.

This was technically difficult because the python part of the reftest harness just spin up a firefox instance, then the xul and jsm file inside the browser do the actually loading and collecting test results. For the e10s case, we might be able to recover from a content window crash and resume the test. For non-e10s case, I guess we have to parse the log (in the python harness) and record which test cases we've ran before, then restart firefox, but this time start from the rest of the test case we haven't ran before.
I have a prototype for this already. If a reftest chunk crashes, we can pass the name of the crashed test back to the python harness. The python harness then restart a test run, but this time also pass the name of the crashed test to it. When the reftest.jsm file received a name of a crashed test, it skips the test itself and all the tests before it (because we already ran it in the previous run). The log of the resumed round will be attached to the original log, so we can have a clear picture of the overall result.

My patch is still very hacky, I'll try to make it more robust and submit a patch for review.
Assignee: nobody → slyu
Blocks: 1345718
Comment on attachment 8845282 [details]
Bug 1344991 - Continue reftest after a crash

Hi Dbaron, do you mind giving me some feedback on the patch? 

It now runs on treeherder: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e17c666e61cbb048fc261715ea19386d2a7e82b9

(Please ignore the failures, those are Stylo-vs-gecko failures)

If you look at Rs16's live.log, you can find a line: 

  REFTEST INFO | Resuming crashed run from test 171

Which indicates we recovered from the crashed test #171
Attachment #8845282 - Flags: feedback?(dbaron)
Comment on attachment 8845282 [details]
Bug 1344991 - Continue reftest after a crash

https://reviewboard.mozilla.org/r/118446/#review120670

This looks pretty good overall, though I'm probably not the best reviewer for the python code, so I didn't look at it very closely.

A few minor comments on the JS bits follow, though:

::: layout/tools/reftest/reftest.jsm:389
(Diff revision 1)
>      } catch(e) {}
>  
> +    try {
> +        gSkipUntil = prefs.getCharPref("reftest.skipUntil");
> +    } catch(e) {
> +        gSkipUntil = undefined; // We can't use null, some test case are named null

I don't see why "named null" should matter here, given use of !== (or typeof).

::: layout/tools/reftest/reftest.jsm:544
(Diff revision 1)
>  
>              logger.info("Running chunk " + gThisChunk + " out of " + gTotalChunks + " chunks.  " +
>                  "tests " + (start+1) + "-" + end + "/" + gURLs.length);
>          }
>  
> +        // Skip until the previously crashed tests

"until the previously crashed tests" -> "_through_ the previously crashed _test_"

That is:
 - test singular rather than tests plural
 - through rather than until to indicate that the previously-crashed test is skipped (rather than being the first test run)

::: layout/tools/reftest/reftest.jsm:546
(Diff revision 1)
>                  "tests " + (start+1) + "-" + end + "/" + gURLs.length);
>          }
>  
> +        // Skip until the previously crashed tests
> +        // We have to do this after chunking so we don't break the numbers
> +        var newGURLs = []

the "g" prefix is to indicate that the variable is global.  So just call this variable newURLs.


Also, it seems like this whole chunk of code could be conditioned on gSkipUntil !== undefined rather than just having two tests in the middle.  That would both be simpler and faster.

::: layout/tools/reftest/reftest.jsm:566
(Diff revision 1)
> +            logger.info("Resuming crashed run from test " + (gURLs.length - newGURLs.length + 1));
> +        }
> +
> +        gURLs = newGURLs;
> +
>          if (gShuffle) {

Note that this feature won't work correctly with gShuffle.  There should probably be an error message if both are used.
Comment on attachment 8845282 [details]
Bug 1344991 - Continue reftest after a crash

MozReview doesn't understand feedback?, though!
Attachment #8845282 - Flags: feedback?(dbaron) → feedback+
Comment on attachment 8845282 [details]
Bug 1344991 - Continue reftest after a crash

I've fixed the feedback comments. I now use gURLs.indexOf() to find the crashed test instead of using a for loop.
Attachment #8845282 - Flags: review?(jmaher)
Attachment #8845282 - Flags: review?(dbaron)
Comment on attachment 8845282 [details]
Bug 1344991 - Continue reftest after a crash

https://reviewboard.mozilla.org/r/118446/#review122454

a few things, nothing major.

::: layout/tools/reftest/reftest.jsm:126
(Diff revision 3)
>  var gFailedOpaqueLayer = false;
>  var gFailedOpaqueLayerMessages = [];
>  var gFailedAssignedLayer = false;
>  var gFailedAssignedLayerMessages = [];
>  
> +var gSkipUntil = null;

I would prefer using a name more like 'gStartAt', but that is personal preference, if others like gSkipUntil I don't have a strong preference.

::: layout/tools/reftest/reftest.jsm:546
(Diff revision 3)
>                  "tests " + (start+1) + "-" + end + "/" + gURLs.length);
>          }
>  
> +        // Skip through previously crashed test
> +        // We have to do this after chunking so we don't break the numbers
> +        if (gSkipUntil !== undefined && !gShuffle){

nit: add a space before the opening { to match the style of the file.  Also on the line below.

::: layout/tools/reftest/reftest.jsm:550
(Diff revision 3)
> +        // We have to do this after chunking so we don't break the numbers
> +        if (gSkipUntil !== undefined && !gShuffle){
> +            var crash_idx = gURLs.map(function(url){
> +                                        return url['url1']['spec']
> +                                     }).indexOf(gSkipUntil)
> +            gURLs = gURLs.slice(crash_idx + 1);

in the case we cannot find the value defined at gSkipUntil, we will return the original value for gURLs, will that cause us to loop until the job times out?

::: layout/tools/reftest/runreftest.py:661
(Diff revision 3)
>                                         symbolsPath, test=self.lastTestSeen)
>  
>          runner.cleanup()
>          if not status and crashed:
>              status = 1
> -        return status
> +        return status, self.lastTestSeen

if all goes well lastTestSeen == 'Main app process exited normally' (line 653 relative to this patch), won't this affect things?

::: layout/tools/reftest/runreftest.py:692
(Diff revision 3)
> -            browserEnv = self.buildBrowserEnv(options, profileDir)
> +                browserEnv = self.buildBrowserEnv(options, profileDir)
>  
> -            self.log.info("Running with e10s: {}".format(options.e10s))
> +                self.log.info("Running with e10s: {}".format(options.e10s))
> -            status = self.runApp(profile,
> +                # Make runApp return the last seen test
> +                status, skipUntil = self.runApp(profile,
> -                                 binary=options.app,
> +                                    binary=options.app,

note, the formatting here won't be the same as before, is there a reason you didn't add another 11 spaces for the options here?

::: layout/tools/reftest/runreftest.py:708
(Diff revision 3)
> -                                     leak_thresholds=options.leakThresholds,
> +                                        leak_thresholds=options.leakThresholds,
> -                                     stack_fixer=get_stack_fixer_function(options.utilityPath,
> +                                        stack_fixer=get_stack_fixer_function(options.utilityPath,
> -                                                                          options.symbolsPath))
> +                                                                            options.symbolsPath))
> +                self.cleanup(profileDir)
> +                if options.shuffle:
> +                    self.log.error("Can not resume from a crash with --shuffle enabled. Please consider disabling --shuffle")

this message is always produced if using shuffle?  I don't know who uses shuffle, but this would be confusing to see.  In this case it assumes we are always running in skipUntil mode.

::: layout/tools/reftest/runreftest.py:713
(Diff revision 3)
> +                    self.log.error("Can not resume from a crash with --shuffle enabled. Please consider disabling --shuffle")
> +                    break
> +                if skipUntil == prevSkipUntil:
> +                    # If the test stuck on the same test, or there the crashed
> +                    # test appeared more then once, stop
> +                    self.log.error("Force stop becasuse we keep running into test {}".format(skipUntil))

nit: small typo, use 'because'
Attachment #8845282 - Flags: review?(jmaher) → review-
Comment on attachment 8845282 [details]
Bug 1344991 - Continue reftest after a crash

https://reviewboard.mozilla.org/r/118446/#review122646

I think this is mostly good, but I'd like to have another look once you address the error message issue (combining with gShuffle), so marking as review- for now.

::: layout/tools/reftest/reftest.jsm:126
(Diff revision 3)
>  var gFailedOpaqueLayer = false;
>  var gFailedOpaqueLayerMessages = [];
>  var gFailedAssignedLayer = false;
>  var gFailedAssignedLayerMessages = [];
>  
> +var gSkipUntil = null;

I'm fine with either name.

::: layout/tools/reftest/reftest.jsm:126
(Diff revision 3)
>  var gFailedOpaqueLayer = false;
>  var gFailedOpaqueLayerMessages = [];
>  var gFailedAssignedLayer = false;
>  var gFailedAssignedLayerMessages = [];
>  
> +var gSkipUntil = null;

Probably better to initialize to undefined here if the other initialization (in the catch) is also to undefined.  (Or just not initialize at all, since that should be undefined.)

::: layout/tools/reftest/reftest.jsm:546
(Diff revision 3)
>                  "tests " + (start+1) + "-" + end + "/" + gURLs.length);
>          }
>  
> +        // Skip through previously crashed test
> +        // We have to do this after chunking so we don't break the numbers
> +        if (gSkipUntil !== undefined && !gShuffle){

I think there should be an error message when combined with gShuffle; it's better to give an error than to go off and do an unexpected thing.
Attachment #8845282 - Flags: review?(dbaron) → review-
(In reply to David Baron :dbaron: ⌚️UTC-7 from comment #10)
> Comment on attachment 8845282 [details]
> Bug 1344991 - Continue reftest after a crash
> 
> https://reviewboard.mozilla.org/r/118446/#review122646
> 
> I think this is mostly good, but I'd like to have another look once you
> address the error message issue (combining with gShuffle), so marking as
> review- for now.
> 
> ::: layout/tools/reftest/reftest.jsm:126
> (Diff revision 3)
> >  var gFailedOpaqueLayer = false;
> >  var gFailedOpaqueLayerMessages = [];
> >  var gFailedAssignedLayer = false;
> >  var gFailedAssignedLayerMessages = [];
> >  
> > +var gSkipUntil = null;
> 
> I'm fine with either name.
> 
> ::: layout/tools/reftest/reftest.jsm:126
> (Diff revision 3)
> >  var gFailedOpaqueLayer = false;
> >  var gFailedOpaqueLayerMessages = [];
> >  var gFailedAssignedLayer = false;
> >  var gFailedAssignedLayerMessages = [];
> >  
> > +var gSkipUntil = null;
> 
> Probably better to initialize to undefined here if the other initialization
> (in the catch) is also to undefined.  (Or just not initialize at all, since
> that should be undefined.)
> 
Sorry for missing that, I'll initialize it as undefined just to be clear.
> ::: layout/tools/reftest/reftest.jsm:546
> (Diff revision 3)
> >                  "tests " + (start+1) + "-" + end + "/" + gURLs.length);
> >          }
> >  
> > +        // Skip through previously crashed test
> > +        // We have to do this after chunking so we don't break the numbers
> > +        if (gSkipUntil !== undefined && !gShuffle){
> 
> I think there should be an error message when combined with gShuffle; it's
> better to give an error than to go off and do an unexpected thing.
I already checked that in the python side so it will not resume when gShuffle is on. But I guess it doesn't hurt to add another warning here.
(In reply to Shing Lyu [:shinglyu] from comment #11)
> > I think there should be an error message when combined with gShuffle; it's
> > better to give an error than to go off and do an unexpected thing.
> I already checked that in the python side so it will not resume when
> gShuffle is on. But I guess it doesn't hurt to add another warning here.

It should be an error -- reftest should refuse to run.  That will make people notice the error, which a warning won't.
(In reply to Joel Maher ( :jmaher) from comment #9)
> Comment on attachment 8845282 [details]
> Bug 1344991 - Continue reftest after a crash
> 
> https://reviewboard.mozilla.org/r/118446/#review122454
> 
> a few things, nothing major.
> 
> ::: layout/tools/reftest/reftest.jsm:126
> (Diff revision 3)
> >  var gFailedOpaqueLayer = false;
> >  var gFailedOpaqueLayerMessages = [];
> >  var gFailedAssignedLayer = false;
> >  var gFailedAssignedLayerMessages = [];
> >  
> > +var gSkipUntil = null;
> 
> I would prefer using a name more like 'gStartAt', but that is personal
> preference, if others like gSkipUntil I don't have a strong preference.
> 

Technically it should be 'gStartAfter', because we'll skip the test it points to and start from the next one right after it.

(nit fixed)

> ::: layout/tools/reftest/reftest.jsm:550
> (Diff revision 3)
> > +        // We have to do this after chunking so we don't break the numbers
> > +        if (gSkipUntil !== undefined && !gShuffle){
> > +            var crash_idx = gURLs.map(function(url){
> > +                                        return url['url1']['spec']
> > +                                     }).indexOf(gSkipUntil)
> > +            gURLs = gURLs.slice(crash_idx + 1);
> 
> in the case we cannot find the value defined at gSkipUntil, we will return
> the original value for gURLs, will that cause us to loop until the job times
> out?
> 

I added a check here, thanks. 

> ::: layout/tools/reftest/runreftest.py:661
> (Diff revision 3)
> >                                         symbolsPath, test=self.lastTestSeen)
> >  
> >          runner.cleanup()
> >          if not status and crashed:
> >              status = 1
> > -        return status
> > +        return status, self.lastTestSeen
> 
> if all goes well lastTestSeen == 'Main app process exited normally' (line
> 653 relative to this patch), won't this affect things?
> 
In this case the status should be 0, so we won't run again. In the worst case if we did re-run the test, we'll throw an exception because of the previous change.

> ::: layout/tools/reftest/runreftest.py:692
> (Diff revision 3)
> > -            browserEnv = self.buildBrowserEnv(options, profileDir)
> > +                browserEnv = self.buildBrowserEnv(options, profileDir)
> >  
> > -            self.log.info("Running with e10s: {}".format(options.e10s))
> > +                self.log.info("Running with e10s: {}".format(options.e10s))
> > -            status = self.runApp(profile,
> > +                # Make runApp return the last seen test
> > +                status, skipUntil = self.runApp(profile,
> > -                                 binary=options.app,
> > +                                    binary=options.app,
> 
> note, the formatting here won't be the same as before, is there a reason you
> didn't add another 11 spaces for the options here?
> 

Sorry, just my vim plugin trying to be smart :p

> ::: layout/tools/reftest/runreftest.py:708
> (Diff revision 3)
> > -                                     leak_thresholds=options.leakThresholds,
> > +                                        leak_thresholds=options.leakThresholds,
> > -                                     stack_fixer=get_stack_fixer_function(options.utilityPath,
> > +                                        stack_fixer=get_stack_fixer_function(options.utilityPath,
> > -                                                                          options.symbolsPath))
> > +                                                                            options.symbolsPath))
> > +                self.cleanup(profileDir)
> > +                if options.shuffle:
> > +                    self.log.error("Can not resume from a crash with --shuffle enabled. Please consider disabling --shuffle")
> 
> this message is always produced if using shuffle?  I don't know who uses
> shuffle, but this would be confusing to see.  In this case it assumes we are
> always running in skipUntil mode.
> 

I'll double checked if we are trying to re-run here.

> nit: small typo, use 'because'
fixed
Comment on attachment 8845282 [details]
Bug 1344991 - Continue reftest after a crash

I've addressed the review comments above, thank you. :)
Attachment #8845282 - Flags: review?(jmaher)
Attachment #8845282 - Flags: review?(dbaron)
Comment on attachment 8845282 [details]
Bug 1344991 - Continue reftest after a crash

https://reviewboard.mozilla.org/r/118446/#review122930

only small details here- typically I would do r+ w/nits, giving this new method a try.

::: layout/tools/reftest/runreftest.py:671
(Diff revisions 3 - 5)
>              debuggerInfo = mozdebug.get_debugger_info(options.debugger, options.debuggerArgs,
>                                                        options.debuggerInteractive)
>  
>          profileDir = None
> -        skipUntil = None  # When the previous run crashed, we skip the tests we ran before
> +        startAfter = None  # When the previous run crashed, we skip the tests we ran before
>          prevSkipUntil = None

nit: s/prevSkipUntil/prevStartAfter/

::: layout/tools/reftest/runreftest.py:691
(Diff revision 5)
> -            # browser environment
> +                # browser environment
> -            browserEnv = self.buildBrowserEnv(options, profileDir)
> +                browserEnv = self.buildBrowserEnv(options, profileDir)
>  
> -            self.log.info("Running with e10s: {}".format(options.e10s))
> +                self.log.info("Running with e10s: {}".format(options.e10s))
> -            status = self.runApp(profile,
> +                status, startAfter = self.runApp(profile,
> -                                 binary=options.app,
> +                                     binary=options.app,

these only look indented 4 spaces, previously they aligned to the same starting point as profile, so 12 more spaces would be needed here.
Attachment #8845282 - Flags: review?(jmaher) → review-
Comment on attachment 8845282 [details]
Bug 1344991 - Continue reftest after a crash

https://reviewboard.mozilla.org/r/118446/#review123004

r=dbaron with the following changes (although again, I didn't look much at the python part; trusting jmaher to do that)

::: layout/tools/reftest/reftest.jsm:546
(Diff revision 5)
> +                logger.error("Can't resume from a crashed test when " +
> +                             "--shuffle is enabled, continue by shuffling " +
> +                             "all the tests");

This should be followed by:
  DoneTests();
  return;
so that it's actually an *error*.

::: layout/tools/reftest/reftest.jsm:556
(Diff revision 5)
> +        } else if (gStartAfter !== undefined) {
> +            // Skip through previously crashed test
> +            // We have to do this after chunking so we don't break the numbers
> +            var crash_idx = gURLs.map(function(url) {
> +                return url['url1']['spec']
> +            }).indexOf(gStartAfter)

please put a ; here rather than relying on semicolon insertion.

(Speaking of which... this file should really be used in a way, like strict mode, where we can't have semicolon insertion!  Probably not this patch, though.)
Attachment #8845282 - Flags: review?(dbaron) → review+
Just fixed all the nits. 

I also pushed a try [1] using a stylo build, because there are more crashes.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=223d5057a93e140700e24d7817d32277a8d4b397
Attachment #8845282 - Flags: review?(jmaher)
Attachment #8845282 - Flags: review?(dbaron)
Comment on attachment 8845282 [details]
Bug 1344991 - Continue reftest after a crash

https://reviewboard.mozilla.org/r/118446/#review128330

r=dbaron on the JS parts
Attachment #8845282 - Flags: review?(dbaron) → review+
Comment on attachment 8845282 [details]
Bug 1344991 - Continue reftest after a crash

https://reviewboard.mozilla.org/r/118446/#review127178

thanks for addressing the nits!
Attachment #8845282 - Flags: review?(jmaher) → review+
Pushed by slyu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/3e6775cee4f7
Continue reftest after a crash r=dbaron,jmaher
Comment on attachment 8845282 [details]
Bug 1344991 - Continue reftest after a crash

Sorry I forgot to change the RemoteReftest used by Android tests. Because I'm not comfortable changing the behavior of RemoteReftest yet, I left a warning there and didn't actually change the behavior of RemoteReftest. Now desktop reftests will continue after a crash, but Android's reftest will remain the same. I'll open a followup bug to fix the Android one later.

(Because this patch didn't change any JS code. I'll reuse dbaron's r+ on the JS side)
Flags: needinfo?(slyu)
Attachment #8845282 - Flags: review?(jmaher)
Comment on attachment 8845282 [details]
Bug 1344991 - Continue reftest after a crash

why is reviewboard r+ and bugzilla is not :(
Attachment #8845282 - Flags: review?(jmaher) → review+
Pushed by slyu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6bbf9c0c5dac
Continue reftest after a crash r=dbaron,jmaher
https://hg.mozilla.org/mozilla-central/rev/6bbf9c0c5dac
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
I see the return type of RefTest.runApp() has changed, but not RemoteReftest.runApp() - might that cause a problem?
Flags: needinfo?(slyu)
Depends on: 1356002
I believe the issue is that you have changed runreftest.py runApp to return a 2-tuple, but in remotereftest.py runApp you call the runreftest.py version of runApp and assign the result to a single value, then return only that value.

remotereftest.py should probably look like 

    def runApp(self, profile, binary, cmdargs, env,
               timeout=None, debuggerInfo=None,
               symbolsPath=None, options=None,
               valgrindPath=None, valgrindArgs=None, valgrindSuppFiles=None):
        return self.automation.runApp(None, env,
                                        binary,
                                        profile.profile,
                                        cmdargs,
                                        utilityPath=options.utilityPath,
                                        xrePath=options.xrePath,
                                        debuggerInfo=debuggerInfo,
                                        symbolsPath=symbolsPath,
                                        timeout=timeout)

Also, as I used in the try run in bug 1356002, you should use a try message of the form

try: -b do -p android-api-15 -u autophone-crashtest-webrtc,autophone-reftest-ogg-video,autophone-reftest-webm-video -t none 

to test this in Autophone. This regression is with the reftest versions only and your try run https://treeherder.mozilla.org/#/jobs?repo=try&revision=e5d7bd509dd56375dee27485540ab389f1300ee2&filter-searchStr=autophone&group_state=expanded is testing mochitests and performance tests which are not affected by this regression.
(In reply to Shing Lyu [:shinglyu] from comment #35)
> Testing:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=e788dfb215cc08b65943d83fa0883e92e40da154

The previous one doesn't work, this one is better:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b29cfdcf39a5f00af638b1a26f9a42776d28f63b
Backout by cbook@mozilla.com:
https://hg.mozilla.org/mozilla-central/rev/48958ca4267a
Backed out changeset 6bbf9c0c5dac for causing bug 1356002
backed out from m-c in https://hg.mozilla.org/mozilla-central/rev/48958ca4267adc86ea57f3701645d5e0222d792e for causing bug 1356002
Status: RESOLVED → REOPENED
Flags: needinfo?(slyu)
Resolution: FIXED → ---
After a one-line fix, it passes on all platforms including the Android autophone tests: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f59d5bd8e1c084cbc6e0446947e41c89c2c4d175

Will land this tomorrow when I'm at the office.
Flags: needinfo?(slyu)
Pushed by slyu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/cff015caf58e
Continue reftest after a crash r=dbaron,jmaher
https://hg.mozilla.org/mozilla-central/rev/cff015caf58e
Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED
Depends on: 1359288
Depends on: 1372981
Depends on: 1374456
Depends on: 1401035
You need to log in before you can comment on or make changes to this bug.