Closed Bug 1081529 Opened 5 years ago Closed 4 years ago

Un-hide Marionette(Mnw) tests on B2G when they meet visibility standards

Categories

(Tree Management Graveyard :: Visibility Requests, defect)

ARM
Gonk (Firefox OS)
defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: RyanVM, Unassigned)

Basically perma-fail and unowned. Both Mn and Mnw have been hidden on trunk. Aurora34/b2g34 likely to follow.
John, I'm hiding these on Try but am leaving Gaia Try alone for now. Let me know if you'd prefer me to follow suit there as well.
Flags: needinfo?(jhford)
OS: Windows 8.1 → Gonk (Firefox OS)
Summary: Un-hide Marionette tests on B2G when they meet visibility standards → Un-hide Marionette(Mnw) tests on B2G when they meet visibility standards
Splitting this into Mn/Mnw even though they might share the same root cause.
Flags: needinfo?(jhford)
FYI, I ran a Try push off recent b2g-inbound, and Mn is currently permafailing in test_click_scrolling.py (lots of bug 1078177, plus another failure). Mnw is currently sitting at almost exactly a 10% failure rate (98/1000 runs were orange/red).

Example Mn failure log:
https://treeherder.mozilla.org/ui/logviewer.html#?job_id=3034902&repo=try

There appear to be some common oranges on Mnw (bug 1025284, bug 1078276, bug 1025289, bug 1029296, and bug 1020930 to name a few), but there's still a pretty long tail of one-off failures too. If someone wants to go through them, here's a link to get you started:
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=f1c72c1a3203&searchQuery=webapi (you'll need to click the visibility toggle button in the upper-right to actually see anything)

Overall, I'd say things aren't as bleak as they were when this bug was first filed (10% is lower than I expected for Mnw, TBH), but timeouts are still the death of us in webapi. While some tests appear to be more susceptible to failure than others, my expectation is that disabling commonly-failing tests will just move the failures to other ones instead. There still appears to be an underlying issue with the harness and/or emulator environment that contributes to these problems.
Component: Marionette → Visibility Requests
Product: Testing → Tree Management
Version: unspecified → ---
Hi Ryan, MNW is really important for b2g webapi. Do you think we could un-hide MNW first? :)
I know there are still some oranges in MNW [1]. We have already landed fixes for two bugs to improve it: bug 1143596 for test_getthreads.js and bug 1143628 for test_massive_incoming_delete.js. Let's keep improving it.

[1] https://treeherder.mozilla.org/#/jobs?repo=b2g-inbound&exclusion_profile=false&filter-searchStr=MNW
Flags: needinfo?(ryanvm)
What's the current failure rate? Looks like we're still well north of 5% based on the link you gave?
https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#Low_intermittent_failure_rate

Sorry for not including a link to the policy in this bug previously. Anyway, that page should give you a good feel for what needs to be done for Mnw to be unhidden again.
Flags: needinfo?(ryanvm)
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #5)
> What's the current failure rate? Looks like we're still well north of 5%
> based on the link you gave?
> https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#Low_intermittent_failure_rate
> 
> Sorry for not including a link to the policy in this bug previously. Anyway,
> that page should give you a good feel for what needs to be done for Mnw to
> be unhidden again.

I see, thanks for this information.
Depends on: 1143596, 1143628
Depends on: 1151726
Depends on: 1152272
Duplicate of this bug: 916362
Depends on: 1155022
Depends on: 1154215, 1153709
Depends on: 1159622
Depends on: 1162422
Depends on: 1162407
Depends on: 1203425
Hi Edgar,
Are you still working on this?
Thanks!
Flags: needinfo?(echen)
(In reply to Josh Cheng [:josh] from comment #8)
> Hi Edgar,
> Are you still working on this?
> Thanks!

Yes, I am still working on this.
Quick update on the current status: we found an emulator adb hang issue (bug 1207039) that contributes to the random timeout problems (bug 1153709, bug 1154215). I believe fixing bug 1207039 will improve the Marionette tests considerably.
Depends on: 1213785
Depends on: 1214537
Depends on: 1223676
Depends on: 1223028
No longer depends on: 1214537
Hi Ryan, we have fixed a bunch of bugs, which has improved the stability of MNW a lot. The most important fix is for the random timeout issue (bug 1153709, bug 1154215), so if any test is not stable enough, you can now just disable it without moving the failures to other tests.

MNW is now at a 5/129 (~4%) failure rate (https://treeherder.mozilla.org/#/jobs?repo=try&revision=769d200ca8ca&exclusion_profile=false&group_state=expanded). Could MNW be unhidden again based on the current status?

Thank you.
Flags: needinfo?(ryanvm)
Nice work! A few things:

1) Can you please make sure those remaining failures get filed so that they're starrable once Mnw is made visible again?

2) Has Mnw only been greened up on trunk/master or is the intent to get it unhidden on the release branches as well?

3) Can you please respond to a few items from the Job Visibility Page [1] checklist (with links where applicable) to verify that we're not missing anything?
* Has an active owner
* Has sufficient documentation
* Must avoid patterns known to cause non deterministic failures

Thanks!

[1] https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy
Flags: needinfo?(ryanvm) → needinfo?(echen)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #12)
> Nice work! A few things:
> 
> 1) Can you please make sure those remaining failures get filed so that
> they're starrable once Mnw is made visible again?

Done: bug 1224986, bug 1224990 and bug 1224992.

> 
> 2) Has Mnw only been greened up on trunk/master or is the intent to get it
> unhidden on the release branches as well?

Only on master. Most of the fixes are not landed in release branches.
(In reply to Ryan VanderMeulen [:RyanVM] from comment #12) 
> 3) Can you please respond to a few items from the Job Visibility Page [1]
> checklist (with links where applicable) to verify that we're not missing
> anything?
> * Has an active owner

Ken Chang would be the active owner.

> * Has sufficient documentation
> * Must avoid patterns known to cause non deterministic failures
 
Here is the link I found from MDN: https://developer.mozilla.org/en-US/docs/Mozilla/QA/Marionette/Marionette_JavaScript_Tests
Flags: needinfo?(echen)
Tomcat/Wes, please look this information (and recent trunk results) over and decide if Mnw is ready for unhiding or not.
Flags: needinfo?(wkocher)
Flags: needinfo?(cbook)
Mnw seems to have failed nine times out of the most recent 90ish runs on inbound, so it's still got about a 10% failure rate. However, most of those are all timing out in the same test: https://treeherder.mozilla.org/logviewer.html#?job_id=18854065&repo=mozilla-inbound

If you can disable test_mobile_operator_names_plmnlist.js and if that doesn't move the timeout to some other test, the failure rate would be around 2%, which would be fine for unhiding, imo.
Flags: needinfo?(wkocher) → needinfo?(echen)
Flags: needinfo?(cbook)
(In reply to Wes Kocher (:KWierso) from comment #16)
> Mnw seems to have failed nine times out of the most recent 90ish runs on
> inbound, so it's still got about a 10% failure rate. However, most of those
> are all timing out in the same test:
> https://treeherder.mozilla.org/logviewer.html#?job_id=18854065&repo=mozilla-inbound
> 
> If you can disable test_mobile_operator_names_plmnlist.js and if that
> doesn't move the timeout to some other test, the failure rate would be
> around 2%, which would be fine for unhiding, imo.

The failure rate of test_mobile_operator_names_plmnlist.js is higher than I thought. I filed bug 1234746 for the timeout and have disabled the test in the meantime.
Flags: needinfo?(echen)
(In reply to Wes Kocher (:KWierso) from comment #16)
> Mnw seems to have failed nine times out of the most recent 90ish runs on
> inbound, so it's still got about a 10% failure rate. However, most of those
> are all timing out in the same test:
> https://treeherder.mozilla.org/logviewer.html#?job_id=18854065&repo=mozilla-inbound
> 
> If you can disable test_mobile_operator_names_plmnlist.js and if that
> doesn't move the timeout to some other test, the failure rate would be
> around 2%, which would be fine for unhiding, imo.

I have disabled test_mobile_operator_names_plmnlist.js. Is MNW ready to be unhidden?
https://treeherder.allizom.org/#/jobs?repo=b2g-inbound&exclusion_profile=false&filter-searchStr=mnw&group_state=expanded&fromchange=8ad77c0ff487

Thank you.
Flags: needinfo?(wkocher)
Looks much better, thanks! I retriggered a bunch. Out of 325 runs, there were 11 failures, around 3%. Unhidden.
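
For reference, the failure rates quoted in this bug can be re-derived with a few lines of Python. This is only a sketch: the 5% ceiling is an assumption inferred from the "well north of 5%" remark earlier in the bug (the Job Visibility Policy page is the authoritative source), and the function name, constant, and sample labels are hypothetical, chosen just for illustration.

```python
def failure_rate(failures: int, runs: int) -> float:
    """Return the fraction of runs that failed (orange/red)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    return failures / runs


# Assumption: 5% ceiling, inferred from the discussion in this bug,
# not read from the policy page itself.
ASSUMED_MAX_RATE = 0.05

# (label, failures, runs) as quoted in the comments of this bug.
SAMPLES = [
    ("Try push, initial measurement", 98, 1000),
    ("Try push after the timeout fixes", 5, 129),
    ("Retriggers just before unhiding", 11, 325),
]

for label, failures, runs in SAMPLES:
    rate = failure_rate(failures, runs)
    verdict = "within" if rate <= ASSUMED_MAX_RATE else "above"
    print(f"{label}: {failures}/{runs} = {rate:.1%} ({verdict} the assumed limit)")
```

Only the first sample lands above the assumed limit, which matches the order of events in the bug: Mnw stayed hidden at roughly 10% and was unhidden once repeated runs came in under 5%.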
Status: NEW → RESOLVED
Closed: 4 years ago
Flags: needinfo?(wkocher)
Resolution: --- → FIXED
Product: Tree Management → Tree Management Graveyard