Closed Bug 1845221 Opened 2 years ago Closed 2 years ago

Frequent The mochitest suite: mochitest-a11y ran with return status: FAILURE (use this bug if you see "exit status 2" for a mochitest-a11y log)

Categories

(Testing :: Mochitest, defect)

defect

Tracking

(firefox-esr102 unaffected, firefox-esr115 unaffected, firefox115 unaffected, firefox116 unaffected, firefox117 fixed, firefox118 fixed)

RESOLVED FIXED
118 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox-esr115 --- unaffected
firefox115 --- unaffected
firefox116 --- unaffected
firefox117 --- fixed
firefox118 --- fixed

People

(Reporter: SerbanS, Assigned: nlapre)

References

(Regression)

Details

(Keywords: intermittent-failure, regression, Whiteboard: [stockwell disable-recommended])

Attachments

(1 file, 2 obsolete files)

The failure line does not appear in failure summary from treeherder.

Hello Nathan! Could you please take a look at this? Range of backfills and retriggers are pointing to your latest changeset in Bug 854796.

Thank you!

Flags: needinfo?(nlapre)
Regressions: 854796

Looking at this, but I'm struggling to figure out what the actual failure is, since I can't find it in the logs. Is it a timeout? Maybe I'm just not searching correctly, but I can't find a place in the logs where there's a failure message.

Flags: needinfo?(nlapre)
Regressed by: 854796
No longer regressions: 854796

Set release status flags based on info from the regressing bug 854796

I think the failure is in accessible/tests/mochitest/actions/, but I cannot figure out what it is.

Yeah, me neither; I can't seem to find any TEST-FAIL, despite it reporting one failure in that suite. A hint seems to be that actions/test_link.html takes the longest (at around 1.5 seconds), but that doesn't seem particularly excessive to me. I also don't know what the time for that test file was before. My patch did adjust how we deal with links, and touched that file. I'm having no luck getting this to fail locally, even with --verify.

a reported failure without a TEST-FAIL is usually something outside of the harness, like a shutdown timeout, a leak, a shutdown crash, an assertion outside of a test.

I didn't see anything in the logs. This is frustrating that we cannot see the failure- maybe a bit more looking and I or someone else will see the failure.

hmm, a try push doesn't reproduce this:
https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=60601afff062b14865823839e6d12eedb7c0bb06&selectedTaskRun=HZ9lKzHhRRGJ4THrrSKYRw.0

possibly that is related to an artifact build, I will try a few other pushes

I needed to retrigger. I bisected this down a bit, I can get this down to 2 tests. I have 2 combinations that work:

1 (try push):

 [test_keys.xhtml]
 [test_link.html]

2 (try push):

 [test_link.html]
 [test_media.html]

but not (try push):

 [test_keys.html]
 [test_media.html]

I don't see anything in the log to indicate what is wrong, but given the root cause of this and the narrowed down tests, maybe something can be figured out.

Okay, thank you for the help. The cause is not clear to me even after investigation. I think I will try to re-write the seemingly-problematic tests in our newer accessibility testing framework so that I can remove the test file that seems to be causing problems.

Set release status flags based on info from the regressing bug 854796

Duplicate of this bug: 1847153

FWIW, there are dozens of instances of this failure that are being mis-starred to grab-bag-bug 1538599 because the log parser was seeing "exit status 2" and suggesting bug 1538599 as a result. I'm extending the title of this bug in the hopes that treeherder will pick that up and start suggesting this bug instead.

Summary: Frequent The mochitest suite: mochitest-a11y ran with return status: FAILURE → Frequent The mochitest suite: mochitest-a11y ran with return status: FAILURE (use this bug if you see "exit status 2" for a mochitest-a11y log)

One idea for a way forward here... Maybe it'd be worth activating the "full log" for mochitest-a11y-* tests (temporarily at least), to hopefully gather some more data? (That might even make the test-failure line show up in the log, which would be a clue about what's going on and why it's not showing up in the current "more concise" logs.)

In other words: it might be worth trying to effectively call SimpleTest.requestCompleteLog()[1] for every single test in this test-suite. (But probably not actually do that -- just change the default behavior for mochitest-a11y-* or something like that, temporarily.)

(I don't know precisely how to do that, but I'm pretty sure it should be possible.)

[1] https://firefox-source-docs.mozilla.org/testing/mochitest-plain/faq.html#how-can-i-get-the-full-log-output-for-my-test-in-automation-for-debugging

We have good reason to believe the problematic test is accessible/tests/mochitest/actions/test_link.html. At a minimum, we could at least try enabling the full log for that test. But that might not be so useful because I suspect this is somehow impacting other tests via side effects or similar. Otherwise, we should just be seeing a failure in one specific test.

Joel, is there a way to enable the full log for an entire directory or do we need to land a patch to enable it for every test in the actions directory?

Flags: needinfo?(jmaher)
Depends on: 1847161

(In reply to James Teh [:Jamie] from comment #22)

We have good reason to believe the problematic test is accessible/tests/mochitest/actions/test_link.html. At a minimum, we could at least try enabling the full log for that test.

I posted a patch in bug 1847161 to do that for that test and the other two that were mentioned in comment 9. That should be sufficient, based on what jmaher noted there.

Let's see what that yields.

Flags: needinfo?(jmaher)

thanks :dholbert, I need to add some harness debugging as well to determine where the error might be coming from

With the full logs for these tests: comparing one "bad" log vs. one "good" log for the same job type (from the previous treeherder cycle, so testing essentially same code/conditions), I'm not seeing any meaningful differences between the two logs.

"bad": https://treeherder.mozilla.org/logviewer?job_id=424980707&repo=autoland
"good": https://treeherder.mozilla.org/logviewer?job_id=424977939&repo=autoland

After stripping out timestamps and some background noise to make the logs more-usefully-diffable, I'm not seeing any meaningful differences when comparing the "good" vs. "bad" log side-by-side, focusing on the relevant portion which is the portion leading up to this in the bad log:

INFO - Passed:  382
INFO - Failed:  1
INFO - Todo:    1

(The "good" log instead has Failed: 0 but the Passed/Todo values are the same, 382 & 1.)

The portion of the log above that, up to the previous Passed/Failed/Todo results section, looks essentially the same to me between good & bad logs.

So: this is looking a lot like it might be a harness bug of some sort (or an extremely weird test bug that somehow produces an increment in the test failure count without logging anything at all).

Thanks Daniel. Looking narrowly at what my changes actually did that could have caused this, my best guess is that we're dealing with a timing issue of some sort having to do with accessible recreation. I was getting timeouts (with useful error messages) when running the patch through Try, though I ended up fixing those by rewriting the problematic tests as browser tests. The changes I made cause recreation of accessibles if an anchor element gains/loses an href (in the absence of a click listener), or gains/loses a click listener (in the absence of an href). If we have neither, the spec requires us to treat those accessibles as generics, so we recreate them as HyperTextAccessibles. The timing issues that I was seeing on Try before landing my initial patch appeared to be related to markup of the form <a onclick=""></a>. I fixed the issues by adding an href attribute to those anchor elements in the test so that the accessible wouldn't be recreated (the href would force it to be a link from the beginning), and rewriting the pieces of the test that expected no href as a browser test, operating under the theory that the "onclick" attribute was being processed later and forcing recreation of the accessible. The browser tests didn't have any problem with that, and Try was happy.

This is all to say: I don't know what about this confluence of factors causes these mystery failures and I certainly don't understand what the error is now, but my leading suspect is - waves hands vaguely - something to do with accessible recreation, and I think the way forward is to rewrite test_link.html as a browser test, which I'm working on now.

This revision removes test_link.html since it has been failing mysteriously in
CI. It's unclear why this is happening, and no amount of logging or local
testing has revealed the issue. This revision migrates the tests to the browser
tests, lumping it in with similar tests in browser_caching_actions.js.

Assignee: nobody → nlapre
Status: NEW → ASSIGNED

FWIW I added some logging to try to get to the bottom of who/what is incrementing the failure count:
https://treeherder.mozilla.org/jobs?repo=try&revision=d69ef28d9bcc9a42108f91f506fc53e75d70b9b2

In at least some cases, the issue seems to be from some test logging a result after SimpleTest.finish() is called:
https://treeherder.mozilla.org/logviewer?job_id=425024689&repo=try&lineNumber=3774-3813

(i.e. we're hitting this line which increments the test failure count:
https://searchfox.org/mozilla-central/rev/b4dadcc4c45a2e381a4866d28ebf3a901decccc1/testing/mochitest/tests/SimpleTest/TestRunner.js#854

TestRunner.updateUI([{ result: false }]);

)

This is probably test_link.html since this happens during that test.

Not every failure triggers my logging for this situation, but some of them (e.g. the log linked in this comment) do seem to be doing this. I know we used to (and probably still intend to?) print a log message when this happens, but apparently that's not happening right now (aside from the logging that I'm adding here).

nlapre, it's probably worth looking into how/why that test is still registering more test-results after SimpleTest.finish() is called (maybe it's listening for events and calling is(), ok() , etc. in those event-handlers; and some events come in and trigger those handlers after the test calls SimpleTest.finish()?)

(I don't know to-what-extent your posted rewrite patch would already address/not-address that issue.)

Huh, interesting! Thanks for that. My knowledge of the fine inner workings of the accessibility event queue tests is limited, but I'm surprised that it manages to keep listening for things even after processing the entire event queue. It's unexpected to me that the failure message doesn't get through, but that your logging does.

(I don't know to-what-extent your posted rewrite patch would already address/not-address that issue.)

I think it would fully address those issues, by way of deleting the test file entirely :) More seriously, I suspect that the introduction of async/await in the browser test rewrite of these tests will render these issues obsolete, since the listeners are better-encapsulated.

Pushed by nlapre@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/1d8ec2085c98 Remove failing mochitest, migrate tests to browser tests,r=eeejay
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 118 Branch

The patch landed in nightly and beta is affected.
:nlapre, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox117 to wontfix.

For more information, please visit BugBot documentation.

Flags: needinfo?(nlapre)
Flags: needinfo?(nlapre)
Depends on: 1848205

(In reply to Daniel Holbert [:dholbert] from comment #30)

In at least some cases, the issue seems to be from some test logging a result after SimpleTest.finish() is called:
https://treeherder.mozilla.org/logviewer?job_id=425024689&repo=try&lineNumber=3774-3813

It's actually a bit more subtle than this. See bug 1848205 to see what was going on. It seems that the issue was that this test was navigating and thereby getting a second callback to the unload handler that checks for results being logged after finish is called; and in the second callback, some data was missing and so the test harness's logic falls over and fails to log, despite incrementing the failure count.

This patch doesn't affect behavior; it's just splitting one compound check into
two nested ones, and adjusting the indentation.

I'm doing this because the next patch will adjust the inner logic, and
splitting it up this way lets me do that in a more targeted way.

As described in the bug, this can sometimes happen when mochitests navigate,
e.g. due to synthesizing a click on a link to the same document.

Depends on D185943

Sorry, copypaste error; I posted those^ patches on the wrong bug. They're meant for bug 1848205.

Comment on attachment 9348452 [details]
Bug 1845221 part 1: Split up two-part condition. r?jmaher

Revision D185943 was moved to bug 1848205. Setting attachment 9348452 [details] to obsolete.

Attachment #9348452 - Attachment is obsolete: true

Comment on attachment 9348453 [details]
Bug 1845221 part 2: Log an error when mochitests hit unload handler with missing data. r?jmaher

Revision D185944 was moved to bug 1848205. Setting attachment 9348453 [details] to obsolete.

Attachment #9348453 - Attachment is obsolete: true

(In reply to Intermittent Failures Robot from comment #44)

57 failures in 3341 pushes (0.017 failures/push) were associated with this bug in the last 7 days.

Note, we haven't actually had any failures reported on trunk after the fix landed in comment 36, so don't let comment 44 alarm you. :)

Ah, but we did have 29 failures on beta. We should uplift Nathan's patch to beta to avoid spurious test failures there (which may be confusing/difficult-to-star per comment 19)

Great idea, thanks Daniel. Ryan, could you uplift this patch to beta? If we don't, we'll see spurious test failures causing noise, as Daniel mentioned. It should apply cleanly.

Flags: needinfo?(ryanvm)
Flags: needinfo?(ryanvm)

Comment on attachment 9347489 [details]
Bug 1845221: Remove failing mochitest, migrate tests to browser tests,r?eeejay,jamie

Beta/Release Uplift Approval Request

  • User impact if declined: Spurious test failures which might be confusing and difficult to attribute.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Only impacts tests. Ports a test from one framework to another.
  • String changes made/needed:
  • Is Android affected?: No
Attachment #9347489 - Flags: approval-mozilla-beta?

(I think this technically doesn't need approval since it's test-only, but that is a good way to be sure it gets noticed. :) )

Sorry, I didn't realise you and Nathan had talked about this not needing to go through the normal uplift process. I wasn't aware of the approval exemption for test only patches. My bad.

Comment on attachment 9347489 [details]
Bug 1845221: Remove failing mochitest, migrate tests to browser tests,r?eeejay,jamie

Indeed, no approval needed since it's test-only, but I'll rubberstamp it anyway :-)

Attachment #9347489 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

(In reply to James Teh [:Jamie] from comment #50)

Sorry, I didn't realise you and Nathan had talked about this not needing to go through the normal uplift process. I wasn't aware of the approval exemption for test only patches. My bad.

No worries, thanks anyway! (I'm not sure where/if the exception is documented -- maybe RyanVM knows? -- but my recollection of the guideline is: "if it's not affecting the bits that users receive in our shipped binaries [e.g. changes to internal documentation or tests], then it doesn't need uplift approval". This is in the spirit of release drivers wanting to mitigate risks of user-facing regressions from late-breaking changes. Clearly, changes to documentation and tests carry essentially-zero user-facing risk of introducing a regression or behavior-change.)

Thanks for the uplift, RyanVM!

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: