Bug 960072 & also anything outstanding on https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy will need to be fixed before we unhide.
Thanks for filing this Ed. Are you guys aware of any outstanding issues that would prevent turning this back on besides the current failure rate?
Looking now - the log output will cause false positives, eg:
https://tbpl.mozilla.org/php/getParsedLog.php?id=43499492&tree=Mozilla-Inbound

23:20:20 INFO - *~*~* Results *~*~*
23:20:20 INFO - passed: 271
23:20:20 INFO - failed: 6
23:20:20 INFO - todo: 43
23:20:20 INFO - 271 passing (1h)
23:20:20 INFO - 43 pending
23:20:20 INFO - 6 failing
23:20:20 INFO - 1) Vertical - Bookmark Uninstall removal of bookmark:
23:20:20 INFO - AssertionError: bookmark was removed

(Similar to bug 1006511, bug 1017559 and others)

I'd also like to see evidence/confirmation that:
* Crashes are handled with stacks output in a TBPL compatible format
* Hangs are handled with failure messages output in a TBPL compatible format
* No external resources are used as part of the job (eg github during setup) and during the test run itself (eg by ensuring that MOZ_DISABLE_NONLOCAL_CONNECTIONS is defined).

See these two for reference:
https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#6.29_Outputs_failures_in_a_TBPL-starrable_format
https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#8.29_Must_avoid_patterns_known_to_cause_non_deterministic_failures
Gareth - can you comment on output for Crashes and Hangs? I know you did some work on getting the TBPL reporting fixed up, but I'm not sure of the current state of things. Can we simulate one and check?

(In reply to Ed Morley (Away 12th-20th July) from comment #3)
> * No external resources are used as part of the job (eg github during setup)
> and during the test run itself (eg by ensuring that
> MOZ_DISABLE_NONLOCAL_CONNECTIONS is defined).

This should be the case today, but we will try to verify.
(In reply to Ed Morley [:edmorley] from comment #3)
> Looking now - the log output will cause false positives, eg:
> https://tbpl.mozilla.org/php/getParsedLog.php?id=43499492&tree=Mozilla-Inbound
>
> 23:20:20 INFO - *~*~* Results *~*~*
> 23:20:20 INFO - passed: 271
> 23:20:20 INFO - failed: 6
> 23:20:20 INFO - todo: 43
> 23:20:20 INFO - 271 passing (1h)
> 23:20:20 INFO - 43 pending
> 23:20:20 INFO - 6 failing
> 23:20:20 INFO - 1) Vertical - Bookmark Uninstall removal of bookmark:
> 23:20:20 INFO - AssertionError: bookmark was removed
>
> (Similar to bug 1006511, bug 1017559 and others)
>
> I'd also like to see evidence/confirmation that:
> * Crashes are handled with stacks output in a TBPL compatible format

B2G crashes are still not handled gracefully iirc?
Gij tests have been relatively stable for us recently. There have been a few intermittent marionette issues which we've tracked down and fixed. The remaining one is bug 1091680 which is relatively rare. Ed - I'd like to see if we'd be good to re-enable test visibility for Gij on inbound and central trees. Anything else you'd like to see us do before re-enabling these?  https://tbpl.mozilla.org/?tree=B2g-Inbound&showall=1&jobname=b2g_ubuntu64_vm%20b2g-inbound%20opt%20test%20gaia-js-integration
Deferring to Ryan for this (I'm now working on treeherder rather than actively sheriffing, and so less impacted by decisions like these).
Flags: needinfo?(emorley) → needinfo?(ryanvm)
Kevin, can you please answer the bullet points from sections 1 and 2 in the job visibility policy?
https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy

1.1 Has an active owner
1.2 Usable job logs
1.3 Low intermittent failure rate
1.4 Must avoid patterns known to cause non deterministic failures
1.5 Has sufficient documentation
2.1 Breakage is expected to be followed by tree closure or backout
2.2 Runs on mozilla-central and all trees that merge into it
2.3 Scheduled on every push
2.4 Easily run on try server

Thanks!
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #8)
> 1.1 Has an active owner

I will own this, along with other active gaia contributors. I am more than happy to put my name down for this.

> 1.3 Low intermittent failure rate

The last issue is what we believe is a test harness issue, which we are currently looking into. See here for an example of this issue:
https://tbpl.mozilla.org/php/getParsedLog.php?id=51761817&tree=B2g-Inbound#error0

I think the rate is low enough to be 'sheriffable', but I would be curious to hear your opinion on whether it's acceptable to re-enable with this problem. We have a weekly meeting where we sync on this specific issue (and other harness issues).

> 1.2 Usable job logs
> 1.4 Must avoid patterns known to cause non deterministic failures
> 1.5 Has sufficient documentation
> 2.1 Breakage is expected to be followed by tree closure or backout
> 2.2 Runs on mozilla-central and all trees that merge into it
> 2.3 Scheduled on every push
> 2.4 Easily run on try server

I believe the answer to all of these is 'yes'.
I filed https://bugzilla.mozilla.org/show_bug.cgi?id=1093706 so that we properly report b2g crashes etc and stop any false positives.
So to give you an example of the current state of things, https://treeherder.mozilla.org/ui/#/jobs?repo=gaia-try&revision=e7408ee6534f is a try run of this morning's master: 5 out of 93 runs failed. We have been sheriffing Gij ourselves, and the level of stability has been fairly good, although there are periods of instability (the recent ftp failures, https://bugzilla.mozilla.org/show_bug.cgi?id=920153, etc).

The current failures all show similar symptoms and are being investigated in https://bugzilla.mozilla.org/show_bug.cgi?id=1093799: first so that they give more helpful output, then to figure out what's causing the crashes. They are at around 5%, and we have 2 or 3 people actively working on getting them fixed.

Fixing the last intermittents is the main priority right now. If it were possible, I would still like us to go visible now, as it will help us keep the tests more stable and help surface anything else the sheriffs need in order to keep this sheriffable.

As with Kevin, I will put my hand up as owner, and if there is anything else that can be done to get these tests visible, I will be happy to work on it.
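As a quick check of the "around 5%" figure, the run counts quoted in this bug can be turned into a rate. This is only a minimal sketch: `failure_rate` is a hypothetical helper written for illustration, not part of any harness, and the counts are the ones reported in the comments here.

```python
def failure_rate(failed: int, total: int) -> float:
    """Return the fraction of runs that failed (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be a positive number of runs")
    if not 0 <= failed <= total:
        raise ValueError("failed must be between 0 and total")
    return failed / total


# Counts quoted in this bug: 5 failures out of 93 Gij runs on gaia-try.
rate = failure_rate(5, 93)
print(f"{rate:.1%} of runs failed")
```

For 5 failures in 93 runs this comes out at roughly 5.4%, which is consistent with the "around 5%" quoted above.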
(In reply to Dale Harvey (:daleharvey) from comment #11)
> As with Kevin I will put my hand in as owner and if there is anything else
> that can be done to get these tests visible then will be happy to work on it.

Hey, does bug 1091680 need to be resolved before unhiding, or not?
(repeating from IRC)

Yeah, I am working full time on getting that issue resolved. With it still open we are hovering around 5% intermittent, and I (+ Kevin) are asking if it would be OK to have the tests visible now because:
1. It's going to be easier to fix the last (hard) issue if we don't get bitten by upstream reds.
2. I expect there will still be teething issues when the tests do start being sheriffed, so it would be nice to get the process started.
TEST-UNEXPECTED-FAIL | null | Messages as share target Share via Messages Activity close button Should return to Thread panel if in Participants panel

Doesn't look like an acceptable failure message to me. That "null" should be the name of the test so that failures are properly suggestable in TBPL/Treeherder.

Also, was comment 3 ever addressed?
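To make the expected shape concrete, here is a minimal sketch of building such a failure line with the test name in the middle field. `format_failure` and the test path in the usage line are hypothetical, made up for illustration; this is not the Gij harness code.

```python
def format_failure(test_name: str, message: str,
                   status: str = "TEST-UNEXPECTED-FAIL") -> str:
    """Build a pipe-delimited failure line: STATUS | test name | message.

    Hypothetical helper: refuses to emit a line without a usable test name,
    since a missing name is what produced "null" in the middle field.
    """
    name = (test_name or "").strip()
    if not name or name == "null":
        raise ValueError("a failure line needs the name of the failing test")
    return f"{status} | {name} | {message.strip()}"


# The path below is an invented example, not a real file in gaia.
print(format_failure("apps/example/test/marionette/example_test.js",
                     "Should return to Thread panel if in Participants panel"))
```

Raising on a missing name is a design choice for the sketch: it surfaces the reporter bug early instead of writing a line that bug suggestions cannot match.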
Taking a look at the tokens of the test output now; we should definitely have that fixed before being visible. Apologies, I was confused by the format and thought ours was OK.

I think NON_LOCAL_CONNECTIONS is going to be a problem. We have redone a bunch of tests to make sure they don't depend on externally loaded resources; however, we pull our node_modules, at the least, from github (we also pull a b2g / xulrunner build, but I believe those count as local).
The non-local connections restriction only affects connections made via gecko, not anything a harness does outside of that.
Ah OK, that's a relief. We should have got rid of any of that; I am doing a test run now and will disable or fix any tests that break the rule, then add that flag to the test harness.
So we have fixed and enabled the MOZ_DISABLE_NONLOCAL_CONNECTIONS flag in the gaia integration tests (https://bugzilla.mozilla.org/show_bug.cgi?id=1030045)

> TEST-UNEXPECTED-FAIL | null | Messages as share target Share via Messages Activity close button Should return to Thread panel if in Participants panel

Has been fixed.

And the output from crashes / hangs is in a compatible format as far as I understand from https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#6.29_Outputs_failures_in_a_TBPL-starrable_format

https://treeherder.mozilla.org/ui/#/jobs?repo=gaia-try&revision=3bcabc1ca9d1 is a run with 4 failures out of 100 Gij runs. We are tracking down the last of the issues, mostly intermittent harness issues, but I believe this is now stable enough to be visible (and being visible should help it be more stable; we had a tree closure and various intermittents introduced due to upstream changes this week).

Ryan is away so needinfo'ing Ed, Carsten: can we have these tests visible now?
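For reference, a minimal sketch of how a harness might pass that flag to the process under test. The function names are hypothetical and the value "1" is an assumption; the policy text quoted in this bug only requires that MOZ_DISABLE_NONLOCAL_CONNECTIONS is defined. This is not the actual change from bug 1030045.

```python
import os
import subprocess
import sys


def build_test_env(base_env=None):
    """Return a copy of the environment with the flag defined.

    Hypothetical helper; "1" is an assumed value, the policy only asks
    that the variable is defined.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["MOZ_DISABLE_NONLOCAL_CONNECTIONS"] = "1"
    return env


def run_with_nonlocal_disabled(cmd):
    """Run `cmd` (a list of arguments) with the flag set in its environment."""
    return subprocess.run(cmd, env=build_test_env(), check=False)


if __name__ == "__main__":
    # Stand-in child process that just echoes the variable back.
    run_with_nonlocal_disabled(
        [sys.executable, "-c",
         "import os; print(os.environ['MOZ_DISABLE_NONLOCAL_CONNECTIONS'])"])
```

As noted earlier in this bug, the flag only affects connections made via gecko, so setup steps that run outside of it (such as fetching node_modules) are not covered by it.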
Looks fine now at a quick glance - thank you :-) Unhidden everywhere apart from b2g32 and b2g30, since the jobs are failing there.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Awesome, thanks a lot.

I had one last question: I'm going to send an announcement to the list that we need to keep on top of these tests and make sure they stay in a sheriffable state. Are there any recommendations for what developers should do when they see intermittent failures, how to file bugs to make it easier, etc?

Thanks
The sheriffs will inevitably handle much of the intermittent bug filing for say b2g-inbound, but for other people looking there, or else on the gaia-try repo/their try results, I'd just say they need to be extra careful to check failures - ie:

(a) If there is a bug suggestion, don't assume it's correct / that the failure must be intermittent (the patch could have caused a different failure mode in the same file)
(b) If there isn't a bug suggestion, don't just assume it's an intermittent, try retriggering to see for sure. It may be perma-failing.
(c) Even if the retrigger of a failure comes back green, it still could indicate a new high-frequency intermittent introduced by the patch - so common sense prevails: if the failure is in a test related to the code being changed, perform 10-20 retriggers to see failure rate, and ideally fix or file before landing.

Also, if filing bugs for intermittent failures, it's important to add keyword "intermittent-failure" so the bug suggestion query finds them. Other than that, they must just include in their summary the text/test name between the pipe symbols of the failure line, or in the case of failures that aren't in that format, the full failure line (mozharness timestamp prefix can be removed).

Hope that helps :-)
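The summary rule above can be sketched as a small helper. Everything here is an assumption made for illustration: `bug_summary_text` is a hypothetical name, and the regular expression only guesses at the mozharness prefix from the log lines quoted earlier in this bug ("23:20:20 INFO - ").

```python
import re

# Assumed shape of the mozharness prefix, e.g. "23:20:20     INFO - ".
TIMESTAMP_PREFIX = re.compile(r"^\d{2}:\d{2}:\d{2}\s+\w+\s+-\s+")

# Keyword the bug suggestion query looks for, per the comment above.
INTERMITTENT_KEYWORD = "intermittent-failure"


def bug_summary_text(log_line: str) -> str:
    """Return the text a bug summary should contain for a failure line.

    For "STATUS | test name | message" lines, that is the text between
    the pipe symbols; otherwise the full line minus the timestamp prefix.
    """
    line = TIMESTAMP_PREFIX.sub("", log_line.strip())
    parts = [part.strip() for part in line.split("|")]
    if len(parts) >= 3:
        return parts[1]
    return line


# Non-pipe failure line taken from the log excerpt earlier in this bug.
print(bug_summary_text("23:20:20 INFO - AssertionError: bookmark was removed"))
```

For a pipe-delimited line, the helper returns only the middle field, which is why a "null" test name makes a failure impossible to match against filed bugs.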
Assignee: nobody → emorley
Can we confirm that the relevant patches have been backported to b2g34 before undoing they're?
Thanks autocorrect. "unhiding there"
(In reply to [Away 18-Nov to 23-Nov] Ryan VanderMeulen [:RyanVM UTC-5] from comment #23)
> Can we confirm that the relevant patches have been backported to b2g34
> before undoing they're?

Sorry, comment 20 should read b2g32 and b2g34; b2g30 wasn't in the exclusion list to start with.