Closed Bug 1392391 Opened 7 years ago Closed 7 years ago

Parse reftest manifests before loading reftest.jsm

Categories

(Testing :: Reftest, enhancement)

enhancement
Not set
normal

Tracking

(firefox60 fixed)

RESOLVED FIXED
mozilla60
Tracking Status
firefox60 --- fixed

People

(Reporter: ahal, Assigned: ahal)

References

(Depends on 1 open bug)

Details

Attachments

(6 files)

The fact that we parse the reftest manifests from reftest.jsm is causing a lot of complexity in the harness. There is a ton of global state to keep track of things like chunks, restarting after crashes, repeating, shuffling, etc. This is also making it very difficult to enable runByManifest mode in bug 1353461. I propose we do manifest parsing before we run tests. I don't want to write a new manifest parser in python, maybe we can use xpcshell or marionette to parse the manifests. This is likely going to be a large and complicated refactor. For that reason I think we should block on having some sort of selftests for the python bits of reftest, this is bug 1392390. There are also a few structured log problems in bug 1373745. I'd like to clean those up before attempting this as well.
I think the first step here is to simply move the manifest parsing into a new JSM. Even this is a bit tricky due to all of the global state that will need to be shared, it'll likely also involve creating a "globals.jsm" that can be shared between reftest.jsm and manifest.jsm. Since this in itself is a large change I'll tackle it in a dependent bug.
Depends on: 1405349
Assignee: nobody → ahalberstadt
Status: NEW → ASSIGNED
The attached patches mostly work, except on android. I was hoping to use --headless mode when initially parsing the manifest, but unfortunately that causes incorrect graphics values in the sandbox resulting in test failures. Parsing the manifest in non-headless mode takes a little longer and is more confusing to the end user, but there aren't really any good alternatives. As for Android, I'll have to debug further. I might end up just leaving it alone and continue to parse manifests in the same browser instance that the tests run in. The downside there is we don't get to clean up all the chunking/repeat/shuffle logic out of reftest.jsm.
I had an idea. These patches pass information back and forth by saving to files, but what if we used marionette and event listeners? Instead we could: A) Start the browser B) Python sends list of manifests C) JS parses manifests and sends back list of tests D) Python does chunking etc, sends a partial list of tests E) JS runs tests and shutsdown F) Python starts next instance G) Python sends next list of tests H) Go to E and repeat. This feels cleaner and would give the python harness more fine grained control (no longer need to control flow with prefs), while also solving the extra-browser-window-during-manifest-parsing problem. I'll have to spend some time looking into this to see if it's possible or not.
After briefly exploring comment 6, it'll be scope bloat for run-by-manifest. I do think we might want to consider an approach like that at some point, but we'd almost have to rewrite reftest to accomplish it.
This looks like it can land (previous try run was good, but waiting on a new one after minor changes). However I never did figure out the Android issue, so this patch leaves Android alone. I'll file a follow-up to tackle that. Also to re-iterate, this is just about accessing the test objects in python. The actual "runByManifest" algorithm will be implemented in bug 1353461.
Comment on attachment 8947619 [details] Bug 1392391 - [reftest] Pre-parse the manifests in a separate Firefox instance, https://reviewboard.mozilla.org/r/217310/#review223132 ::: layout/tools/reftest/remotereftest.py:137 (Diff revision 2) > c.read() > > rtncode = self._process.poll() > if (rtncode is None): > self._process.terminate() > - except: > + except Exception: Can you comment on why these except: clauses are changing? (Just in mozreview/bugzilla -- probably no need for a code comment.) Is there a change in behavior for KeyboardInterrupt?
Comment on attachment 8942264 [details] Bug 1392391 - [reftest] Factor out code for creating nsIUri objects to isolated function, https://reviewboard.mozilla.org/r/212548/#review223214 I like most of the changes, but would like to consider something similar to what I have outlined below. ::: layout/tools/reftest/manifest.jsm:712 (Diff revision 3) > + if (test.url2) { > + test.url2 = g.ioService.newURI(test.url2, null, manifestURL); > + secMan.checkLoadURIWithPrincipal(principal, test.url2, > + CI.nsIScriptSecurityManager.DISALLOW_SCRIPT); > + } > + } I see a lot of if (test.url2) statements in here- could we not: let files = [test.url1]; if (test.url2) files.push(test.url2); then: for (var file in files) { if (test.runHttp) { ServeFiles(...) } } else { newURI(...) } }
Attachment #8942264 - Flags: review?(jmaher) → review-
Comment on attachment 8942266 [details] Bug 1392391 - [reftest] Parse manifests from a new ReadTests method, https://reviewboard.mozilla.org/r/212552/#review223216 I would like to ensure the getCharPref is in a try/catch ::: layout/tools/reftest/reftest.jsm:356 (Diff revision 3) > - g.browser.removeEventListener("focus", StartTests, true); > + g.browser.removeEventListener("focus", ReadTests, true); > } > > - var manifests; > + g.urls = []; > + var prefs = Components.classes["@mozilla.org/preferences-service;1"]. > + getService(Components.interfaces.nsIPrefBranch); previously this was in a try/catch statement in starttests() and was indicated as "optional" ::: layout/tools/reftest/reftest.jsm:359 (Diff revision 3) > - var manifests; > + g.urls = []; > + var prefs = Components.classes["@mozilla.org/preferences-service;1"]. > + getService(Components.interfaces.nsIPrefBranch); > + > + // Parse reftest manifests > + let manifests = JSON.parse(prefs.getCharPref("reftest.manifests")); this used to be in a try/catch- is there a reason it is not anymore?
Attachment #8942266 - Flags: review?(jmaher) → review-
Comment on attachment 8947619 [details] Bug 1392391 - [reftest] Pre-parse the manifests in a separate Firefox instance, https://reviewboard.mozilla.org/r/217310/#review223290 a few things here- overall this looks like a great refactor. ::: layout/tools/reftest/output.py:151 (Diff revision 2) > > if isinstance(data, dict) and 'action' in data: > + if data['action'] == 'results': > + for k, v in data['results'].items(): > + self.results[k] += v > + else: I don't understand what we are doing with self.results ::: layout/tools/reftest/reftest.jsm:359 (Diff revision 2) > g.urls = []; > var prefs = Components.classes["@mozilla.org/preferences-service;1"]. > getService(Components.interfaces.nsIPrefBranch); > > + let manifests = prefs.getCharPref("reftest.manifests", null); > + let dumpTests = prefs.getCharPref("reftest.manifests.dumpTests", null); dumpTests is a new control preference, It only works if reftest.manifests is specified and will call DoneTests()- ideally some reference to why this is here would help a lot. ::: layout/tools/reftest/reftest.jsm:374 (Diff revision 2) > + let decoder = new TextDecoder(); > + g.urls = JSON.parse(decoder.decode(array)).map(CreateUrls); > + StartTests(); > + }); > + return; > + } else if (manifests) { there is a chance that we could do nothing here as we have if/elseif and no final else clause. Previously we would attempt to work with reftest.manifests and we would find errors there if they existed, now we would do nothing and maybe timeout? ::: layout/tools/reftest/remotereftest.py:137 (Diff revision 2) > c.read() > > rtncode = self._process.poll() > if (rtncode is None): > self._process.terminate() > - except: > + except Exception: I noticed this a couple times, is this for flake8? Maybe we should call out the exceptions we expect? ::: layout/tools/reftest/runreftest.py:276 (Diff revision 2) > > :param options: Object containing command line options > - :param manifests: Dictionary of the form {manifest_path: [filters]} > + :param tests: List of test objects to run > :param server: Server name to use for http tests > :param profile_to_clone: Path to a profile to use as the basis for the > test profile can you document prefs for: * manifests * startAfter (this wasn't added by you) * prefs ::: layout/tools/reftest/runreftest.py:849 (Diff revision 2) > + self.log.suite_start(ids, name=options.suite) > + > startAfter = None # When the previous run crashed, we skip the tests we ran before > prevStartAfter = None > for i in itertools.count(): > - try: > + status, startAfter, results = self.runApp( why did we remove the try/except here? I see this as being useful if there is an odd failure? is the idea that any failure will surface itself naturally?
Attachment #8947619 - Flags: review?(jmaher) → review-
Comment on attachment 8947620 [details] Bug 1392391 - [reftest] Perform chunking with manifestparser, https://reviewboard.mozilla.org/r/217312/#review223296 this looks great
Attachment #8947620 - Flags: review?(jmaher) → review+
Comment on attachment 8947619 [details] Bug 1392391 - [reftest] Pre-parse the manifests in a separate Firefox instance, https://reviewboard.mozilla.org/r/217310/#review223132 > Can you comment on why these except: clauses are changing? (Just in mozreview/bugzilla -- probably no need for a code comment.) > > Is there a change in behavior for KeyboardInterrupt? This is due to bug 1434430. Basically the mozreview bot is using flake8 3.5.0, whereas |mach lint| is using 3.3.0. The new version of flake8 introduced the ``no-blank-except`` rule. This issue will be fixed in the other bug, but I pre-emptively also fixed those instances here to avoid reviewbot spam. This will change the behaviour of KeyboardInterrupt. A blank ``except`` will catch it, but ``except Exception`` will not. I did this on purpose because in these instances, it looked like the fact that we were catching ``KeyboardInterrupt`` was a bug. If you see an instance where we're *supposed* to catch it, we can change it to catch ``BaseException``. Please let me know if you find anything that requires that.
Comment on attachment 8947619 [details] Bug 1392391 - [reftest] Pre-parse the manifests in a separate Firefox instance, https://reviewboard.mozilla.org/r/217310/#review223290 > I don't understand what we are doing with self.results This maps up to the line: logger._logData({'results': g.testResults}); in reftest.jsm. Previsouly, we logged a whole bunch of extra metadata along with the ``suiteEnd`` message. But since this is now being logged from python, we no longer have access to that metadata. Using this custom ``results`` log action was the method I concocted to transfer that metadata from JS to python. The python side now also needs to aggregate all the metadata from the various Firefox instances (which will be) spawned by run-by-manifest. So ``self.results`` is storing and combining the metadata until it's needed when we call ``suite_end``. > there is a chance that we could do nothing here as we have if/elseif and no final else clause. Previously we would attempt to work with reftest.manifests and we would find errors there if they existed, now we would do nothing and maybe timeout? Good catch. I checked for the case that both ``reftest.manifests`` and ``reftest.tests`` were defined, but also need to check that at least one of them is. > I noticed this a couple times, is this for flake8? Maybe we should call out the exceptions we expect? See my comment to gbrown above. Ideally we would, but I don't want to guess at what we expect, so think it's safer to just leave the behaviour as is. > why did we remove the try/except here? I see this as being useful if there is an odd failure? is the idea that any failure will surface itself naturally? It's actually not a try/except, it was a try/finally. In the finally we were calling ``self.cleanup(profileDir)``. I moved this logic into the ``runApp`` function which is lower down the stack, so we don't need to cleanup the profile here anymore.
Comment on attachment 8942264 [details] Bug 1392391 - [reftest] Factor out code for creating nsIUri objects to isolated function, https://reviewboard.mozilla.org/r/212548/#review223396
Attachment #8942264 - Flags: review?(jmaher) → review+
Comment on attachment 8942266 [details] Bug 1392391 - [reftest] Parse manifests from a new ReadTests method, https://reviewboard.mozilla.org/r/212552/#review223398
Attachment #8942266 - Flags: review?(jmaher) → review+
Comment on attachment 8947619 [details] Bug 1392391 - [reftest] Pre-parse the manifests in a separate Firefox instance, https://reviewboard.mozilla.org/r/217310/#review223400 thanks
Attachment #8947619 - Flags: review?(jmaher) → review+
Comment on attachment 8942265 [details] Bug 1392391 - [mozlog] Forward extra data from 'suite_end' action, https://reviewboard.mozilla.org/r/212550/#review223750
Attachment #8942265 - Flags: review?(james) → review+
I think we're good to land. Here's a more recent try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f3367a55dbabad7e408014e860886ab7974f8eb2 I think those Android jsreftest failures were from that expired certificate that had the trees closed most of the day. At least all the retriggers came back green.
Pushed by ahalberstadt@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b9e9374af9a1 [mozlog] Forward extra data from 'suite_end' action, r=jgraham https://hg.mozilla.org/integration/autoland/rev/846d50ea8403 [reftest] Factor out code for creating nsIUri objects to isolated function, r=jmaher https://hg.mozilla.org/integration/autoland/rev/6bd3abc55ea8 [reftest] Parse manifests from a new ReadTests method, r=jmaher https://hg.mozilla.org/integration/autoland/rev/406806a088d5 [reftest] Pre-parse the manifests in a separate Firefox instance, r=jmaher https://hg.mozilla.org/integration/autoland/rev/50df56a0cebf [reftest] Perform chunking with manifestparser, r=jmaher
For the record, it looks like this has increased the frequency of bug 1416125. I believe the other bug is a race condition that happens somewhere in the manifest parsing code. Seeing as my changes here refactored the manifest parsing code, it must have changed timing enough to increase the frequency. I added some debug statements to investigate, but now it doesn't seem to be easily reproducible anymore. I'll continue investigation in the other bug.
See Also: → 1416125
Backed out 5 changesets (bug 1392391) for frequently failing jsreftests on Android. a=backout Backout link: https://hg.mozilla.org/mozilla-central/rev/e1954b02d9e39bdb7c1f17aa95ca9cad5d5c14ae Push with failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=50df56a0cebf11a8e05bf1d49c678521fee522c6 Log: https://treeherder.mozilla.org/logviewer.html#?job_id=160637300&repo=autoland&lineNumber=1579 [task 2018-02-06T18:15:15.073Z] 18:15:15 INFO - REFTEST TEST-INFO | screentopng: exit 0 [task 2018-02-06T18:15:18.864Z] 18:15:18 INFO - org.mozilla.fennec_aurora still alive after SIGABRT: waiting... [task 2018-02-06T18:15:23.986Z] 18:15:23 WARNING - TEST-UNEXPECTED-FAIL | remoteautomation.py | application timed out after 670 seconds with no output [task 2018-02-06T18:15:23.987Z] 18:15:23 INFO - INFO | automation.py | Application ran for: 0:14:11.415088 [task 2018-02-06T18:15:23.987Z] 18:15:23 INFO - INFO | zombiecheck | Reading PID log: /tmp/tmpf3keoYpidlog
Flags: needinfo?(ahalberstadt)
Target Milestone: mozilla60 → ---
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This ended up being a heisenbug (adding debug statements reduced the frequency of this failure back down to previous levels). Here is a push with/without the debug logs: https://treeherder.mozilla.org/#/jobs?repo=try&author=ahalberstadt@mozilla.com&tochange=31e1c34238f24cc96145b348b83d4e2f2cd5b7b0&fromchange=791c16ddd68e69e97b86375ac3df732891ce7106 Since: A) This is previously existing B) I can't reproduce locally C) I can't debug on try I don't know how to go about looking for the root cause of this issue. Therefore I'm going to land this change with the above debug patch (those log lines are useful to have on their own anyway). I know this isn't great engineering practice, but I don't see any other alternatives besides putting this bug on hold indefinitely. I'll add a comment/detailed commit message explaining this as well.
Flags: needinfo?(ahalberstadt)
Comment on attachment 8949083 [details] Bug 1392391 - [reftest] Add debug statements to clarify which path in ReadTests is being used, https://reviewboard.mozilla.org/r/218484/#review224272 ::: layout/tools/reftest/reftest.jsm:393 (Diff revision 1) > } else if (manifests) { > // Parse reftest manifests > + // XXX There is a race condition in the manifest parsing code which > + // sometimes shows up on Android jsreftests (bug 1416125). It seems > + // adding/removing log statements can change its frequency. > + logger.debug("Reading " + manifests.length + " manifests"); I do wonder if accessing manifests.length (i.e. an attribute of manifests) is actually causing this to be fixed- possibly we haven't fully loaded this into memory and are hitting a race condition where we parse it, but it is not available.
Attachment #8949083 - Flags: review?(jmaher) → review+
Comment on attachment 8949083 [details] Bug 1392391 - [reftest] Add debug statements to clarify which path in ReadTests is being used, https://reviewboard.mozilla.org/r/218484/#review224272 > I do wonder if accessing manifests.length (i.e. an attribute of manifests) is actually causing this to be fixed- possibly we haven't fully loaded this into memory and are hitting a race condition where we parse it, but it is not available. No, I had another try push earlier on that didn't print the length, and that also fixes it. Plus the intermittent still exists with this patch, it's just way less frequent (like it used to be).
Pushed by ahalberstadt@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/51f67d98f39c [mozlog] Forward extra data from 'suite_end' action, r=jgraham https://hg.mozilla.org/integration/autoland/rev/6530b0f9d637 [reftest] Factor out code for creating nsIUri objects to isolated function, r=jmaher https://hg.mozilla.org/integration/autoland/rev/56096de8d028 [reftest] Parse manifests from a new ReadTests method, r=jmaher https://hg.mozilla.org/integration/autoland/rev/37db56af5f70 [reftest] Pre-parse the manifests in a separate Firefox instance, r=jmaher https://hg.mozilla.org/integration/autoland/rev/2ae7d8e3c00e [reftest] Perform chunking with manifestparser, r=jmaher https://hg.mozilla.org/integration/autoland/rev/aadf794aaefe [reftest] Add debug statements to clarify which path in ReadTests is being used, r=jmaher
These changes make debugging reftests (on OSX) even more unpleasant than they were previously. With MOZ_DEBUG_CHILD_PROCESS (the only way to debug the content process with lldb as far as I know), there are now 8 processes launched, with the last one being the one that actually runs the reftests. Each launch pauses the parent, so I have to attach to all of them individually to resume them (in lots of different Terminal tabs), and keep switching back to the firefox process as it waits for focus. Even when not debugging, the extra instance launch takes quite a bit of time in a debug build, since it opens 3 windows. Is there anything we can do to simplify this? Trying to debug reftests takes forever now.
We could implicitly turn off run-by-manifest mode when --debugger is specified. The downside there is that tests might not run exactly the same as they do in CI (though this would be mitigated if you manually passed in the manifest/directory that contained the test you want to debug). I wonder how mochitest does this. It's been using run-by-manifest for years and no one seems to have complaints. I'll file a follow-up.
Depends on: 1448000
Depends on: 1513577
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: