Closed Bug 1405428 Opened 4 years ago Closed 3 years ago

Run test-verify on all supported tests to find and eliminate existing failures


(Testing :: General, defect, P3)



(firefox63 fixed)

Tracking Status
firefox63 --- fixed


(Reporter: gbrown, Assigned: jmaher)


(Depends on 1 open bug, Blocks 1 open bug)



(5 files, 3 obsolete files)

Follow-up on concern raised in bug 1405338.

I'm not sure how feasible this is, but I should have a look at it.
I would imagine we would need to run tests in smaller chunks to make this happen. If we had a mode where this could happen fairly easily, plus a way to parse logs or get results, this could be a tractable solution.
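As a rough illustration of the "parse logs or get results" idea (this is a hypothetical sketch, not an existing tool; only the TEST-UNEXPECTED-FAIL marker is the real Mozilla log convention), a log-scanning pass could collect the set of tests that failed verification:

```python
import re

# Scan a raw test log for per-test failures. "TEST-UNEXPECTED-FAIL" is the
# standard failure-line prefix in Mozilla test logs; the helper itself is
# an illustrative sketch, not an existing harness function.
FAIL_RE = re.compile(r"TEST-UNEXPECTED-FAIL\s*\|\s*([^|]+?)\s*\|")

def failing_tests(log_text):
    """Return the sorted, de-duplicated list of test paths that failed."""
    return sorted({m.group(1) for m in FAIL_RE.finditer(log_text)})

log = """\
TEST-UNEXPECTED-FAIL | dom/tests/test_a.html | leaked window
TEST-OK | dom/tests/test_b.html | took 120ms
TEST-UNEXPECTED-FAIL | dom/tests/test_a.html | assertion count mismatch
"""
print(failing_tests(log))  # ['dom/tests/test_a.html']
```

A list like this could then be diffed against the previous run to decide which failures are pre-existing versus newly introduced.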

A few thoughts here:
1) Get the list of failing tests and annotate manifests so we know what to expect as 'ignore TV' -- that is easier to query and reduce.
2) Go suite by suite (xpcshell, then devtools, etc.) and get all tests green, maybe 1 suite/quarter.
3) Possibly enable a single simple rule at first and make that rule tier-1 while we work on greening up the rest.

I think overall we need to have a starting point of what we expect as passing and failing.
Whiteboard: [PI:January]
Priority: -- → P3
Moving this to March since it's a P3 - :gbrown, if you prefer it to be on your February list then go ahead and change the whiteboard tag thanks :)
Flags: needinfo?(gbrown)
Whiteboard: [PI:January] → [PI:February]
I don't have any specific plans for working on this in the near future.
Flags: needinfo?(gbrown)
Whiteboard: [PI:February] → [PI:March]
Whiteboard: [PI:March]
We can now mark tests that fail test-verify as skip-if = 'verify' -- that might be useful here (but should be used with caution).
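In a manifestparser manifest, such an annotation might look like this (an illustrative entry, not taken from an actual manifest):

```ini
[test_example.js]
# Known test-verify failure; skipped only in verify mode, still runs normally.
skip-if = verify
```

The caution is warranted because the annotation silently exempts the test from verification until someone removes it.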
We have sometimes thought of this bug as the major impediment to running TV as tier 1, but that is a misconception. Even if we identify all the current TV failures, new failures may be introduced by product changes: since those tests are not modified, TV will not run against them...until the test is modified. Then we are back to the place where a TV failure does not always indicate a fault in the push.
good point, I hadn't thought through to the end of the solution like you did.  How could we measure how often this happens in our current tier-2 state?  If we could do that, it would help us understand the risk and possible solutions.  Possibly having TV-backfill(<testname>) implemented in Treeherder would give us the ability to find the offending cause, or we fall back to |skip-if = verify|.
if we can easily run the failing test on the previous push, that would help us determine whether this is test-related or product-related, and we could find the root cause if we backfill far enough :)
I filed bug 1465117 to do the work regarding running test-verify on a previous revision.
Depends on: 1465117
Assignee: nobody → jmaher
Attachment #8984506 - Flags: review?(gbrown)
Comment on attachment 8984506 [details] [diff] [review]
annotate mochitest files for linux and some osx/windows

Review of attachment 8984506 [details] [diff] [review]:

I spot-checked some changes -- certainly not an exhaustive review -- and what I looked at looked fine - bravo!

One minor curiosity...Most TV failure bugs that I have seen have been TV failures on all/most platforms: If it fails on Linux, I expect to see failures on Windows/osx/Android, and on opt/debug/pgo - usually. Many of your annotations are quite specific: verify && debug && (os == linux). And many seem to be debug-only failures. Is that right? Do you have any insight into why?

And one big concern: I'm not *sure* I see the point of this bug (which I filed!). Let's briefly pause to consider whether this is a good idea. On the negative side, with these annotations we won't verify any of these tests, and probably no one will ever investigate why they fail verification. Also it makes the manifests more crowded/less readable. On the positive side, I suppose we increase confidence in TV failures and reduce TV noise: We eliminate most of today's TV failures, so when we subsequently see a TV failure, it's more unexpected and more likely to be related to recent changes. But failures still don't positively indicate a failure caused by the failing push - comment 5. I'd like to hear your thoughts on this trade-off and any other issues you see.

But neither concern need hold up check-in: If you are confident this is a good idea, let's proceed before this bit-rots!
Attachment #8984506 - Flags: review?(gbrown) → review+
high rate of debug failures: leaks/assertions are only detected on debug builds
specific platforms: some tests are platform-specific or don't run on a platform for other reasons
** I still need to go over windows/osx with a fine-tooth comb, so some of these might become more generic in due time

this is a good point about clutter and about the overall purpose, and it did give me pause to reconsider.  I still think we should move forward; these annotations will not be too hard to remove if we change our mind, and someone -- contributors, for example -- could chip away at these failures over time.

I bet half of these would be easy to fix.  Then there is the challenge: what if one of these tests starts failing intermittently at a high rate and we run TV to see how stable it is...but it doesn't run because of the annotation, so we cannot test it.  Possibly we should not honor skip-if when running from a MOZHARNESS_TEST_PATHS env var.
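The MOZHARNESS_TEST_PATHS idea could be sketched like this (a hypothetical harness-side check, not existing harness code -- the function and its callback parameter are invented for illustration):

```python
import os

def should_skip(test, skip_expr_matches):
    """Honor `skip-if = verify` annotations only when the run was NOT
    requested explicitly via MOZHARNESS_TEST_PATHS. If a developer asked
    for specific paths, run the test anyway so its stability can be
    measured. `skip_expr_matches` stands in for manifest evaluation."""
    if os.environ.get("MOZHARNESS_TEST_PATHS"):
        return False  # explicit request: ignore the verify annotation
    return skip_expr_matches(test)

os.environ.pop("MOZHARNESS_TEST_PATHS", None)
print(should_skip("test_a.js", lambda t: True))  # True: annotation honored

os.environ["MOZHARNESS_TEST_PATHS"] = "dom/tests"
print(should_skip("test_a.js", lambda t: True))  # False: runs anyway
```

This keeps the tree quiet by default while still letting a targeted TV run exercise an annotated test.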
Pushed by
skip-if = verify on mochitests which do not pass test-verify. r=gbrown
we still have jsreftest/crashtest/reftest, xpcshell, and web-platform-tests
Keywords: leave-open
xpcshell was much easier than mochitest
Attachment #8985112 - Flags: review?(gbrown)
Comment on attachment 8985112 [details] [diff] [review]
annotate xpcshell files

Review of attachment 8985112 [details] [diff] [review]:

::: toolkit/modules/tests/xpcshell/xpcshell.ini
@@ +52,5 @@
>  skip-if = toolkit == 'android'
>  [test_Services.js]
>  skip-if = toolkit == 'android'
>  [test_sqlite.js]
> +skip-if = toolkit == 'android' || (verify && !debug && os == 'win')

I'm not sure if it is important, but you have been inconsistent in using () around os == ... in complex expressions. Consider:

skip-if = toolkit == 'android' || (verify && !debug && (os == 'win'))
Attachment #8985112 - Flags: review?(gbrown) → review+
thanks for calling that out -- one case is my script, which uses (os == ...) so we can chain multiple os clauses; the other is fine-tuning by hand when I sanity-check the patch and find a couple of other cases to annotate.
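The script's chaining behavior might look roughly like this (a hypothetical reconstruction of the annotation script, not the actual code; the function name and parameters are invented):

```python
def build_skip_if(existing, oses, debug=None):
    """Build a manifestparser skip-if clause. Each os comparison gets its
    own parentheses so additional OS clauses can be OR'd in later without
    ambiguity, matching the (os == ...) style discussed in review."""
    os_clause = " || ".join("(os == '%s')" % o for o in oses)
    if len(oses) > 1:
        os_clause = "(%s)" % os_clause
    parts = ["verify"]
    if debug is True:
        parts.append("debug")
    elif debug is False:
        parts.append("!debug")
    parts.append(os_clause)
    clause = "(%s)" % " && ".join(parts)
    # Append to any hand-written condition already in the manifest.
    return clause if not existing else "%s || %s" % (existing, clause)

print(build_skip_if("toolkit == 'android'", ["win"], debug=False))
# toolkit == 'android' || (verify && !debug && (os == 'win'))
```

The single-os output matches the fully parenthesized form suggested in the review above.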
Pushed by
skip-if = verify on xpcshell tests which do not pass test-verify. r=gbrown
yes, these do crash!
Attachment #8985236 - Flags: review?(gbrown)
Comment on attachment 8985236 [details] [diff] [review]
annotate crashtests that fail on test-verify

Review of attachment 8985236 [details] [diff] [review]:

Almost entirely linux/debug issues here - how odd!
Attachment #8985236 - Flags: review?(gbrown) → review+
almost all the crashtest issues are filed as bug 1468949.
this time using reftest annotations (not manifestparser).  As a bonus I have data from reftest to add :)
Attachment #8985236 - Attachment is obsolete: true
Flags: needinfo?(jmaher)
Attachment #8986166 - Flags: review?(gbrown)
Attachment #8986166 - Flags: review?(gbrown) → review+
Pushed by
annotate crashtests and reftests which fail in test-verify mode. r=gbrown
after this patch only jsreftests remain, then we are "done".  That will get us to 98%, and there are still:
* tests which landed between running tests and landing annotations
* tests which landed/changed/failed prior to tier-1
* tests which have duplicate names
* tests which were accidentally overlooked in the generation of test names (I suspect a small volume)
Attachment #8986209 - Flags: review?(gbrown)
Comment on attachment 8986209 [details] [diff] [review]
annotate web-platform-tests that fail on test-verify

Review of attachment 8986209 [details] [diff] [review]:

"fails to repeat itself" might be misleading: Verify failures might be intermittent failures, chaos-only failures, relies on earlier test / cannot run by itself, or cannot repeat failures. (I assume you haven't investigated all of these!) s/fails to repeat itself/fails --verify mode/ if that's not too much trouble?

I am far from an expert on wpt manifests, but these look okay to me.

Please post some try links here, for the record.

::: testing/web-platform/meta/webrtc/RTCDTMFSender-ontonechange.https.html.ini
@@ -1,2 @@
>  [RTCDTMFSender-ontonechange.https.html]
> -  disabled: if debug and not e10s and (os == "win") and (version == "6.1.7601") and (processor == "x86") and (bits == 32):

Did you intend to remove the original condition?
Attachment #8986209 - Flags: review?(gbrown) → review+
Pushed by
annotate web-platform-tests which fail to pass test-verify. r=gbrown
relanding the reftest/web-platform-tests annotations now that bug 1469583 has merged around
Flags: needinfo?(jmaher)
Pushed by
annotate web-platform-tests which fail to pass test-verify. r=gbrown
Backout by
Backed out 1 changesets for wpt failures. CLOSED TREE
this was backed out because it failed some regular wpt tests (not test-verify).  I had removed a block of annotations in a few files, believing the test was a CRASH all the time; I was wrong -- it is not all the time (I think only on debug), so I restored those annotations as they were originally.


try run:
Attachment #8986209 - Attachment is obsolete: true
Attachment #8987578 - Flags: review?(gbrown)
Comment on attachment 8987580 [details] [diff] [review]
fix the jsreftest paths in s/test/tests/

Review of attachment 8987580 [details] [diff] [review]:

::: taskcluster/taskgraph/util/
@@ +59,5 @@
>      except CalledProcessError:
>          return 0
> +    # TODO: move this to a static function so we only load it once
> +    # Hack a method for inserting a file to a try commit that we can run specific tests

Did you mean to include this?

@@ +101,5 @@
>                  if not gpu:
>                      test_count += 1
> +    print("JMAHER: total changed_files: %s, and test_count: %s" % (len(changed_files), test_count))

Accidental inclusion?
Attachment #8987578 - Flags: review?(gbrown) → review+
and this is the real patch, sorry for the confusion.
Attachment #8987580 - Attachment is obsolete: true
Attachment #8987580 - Flags: review?(gbrown)
Attachment #8987595 - Flags: review?(gbrown)
Attachment #8987595 - Flags: review?(gbrown) → review+
Pushed by
annotate web-platform-tests which fail to pass test-verify. r=gbrown
fix paths for detecting jsreftests. r=gbrown
once these are successfully on trunk, we can call this done
Keywords: leave-open
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
Regressions: 1577197