Closed Bug 689625 Opened 14 years ago Closed 13 years ago

please send an email when a mobile talos suite has not reported numbers in a 24 hour period

Categories

(Release Engineering :: General, defect, P2)

ARM
Android
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmaher, Assigned: catlee)

Details

(Whiteboard: [talos][android_tier_1])

This would be: tp4m, tsvg, ts, tpan, tzoom, tdhtml, tsspider.

We had a scenario where tp4m was not reporting results for a full week and not everybody was aware of this.
FTR, here is a URL showing the discontinuity since Sep. 19th: http://graphs-new.mozilla.org/graph.html#tests=[[85,63,20],[84,63,20],[87,63,20],[85,11,20],[87,11,20]]&sel=none&displayrange=30&datatype=delta

I believe this is a WONTFIX. Developers did not care to figure out why the Android Tegra 250 mozilla-inbound talos remote-tp4m was constantly orange/red. The problem was being reported per check-in. Developers did not stop working but kept on going without figuring it out. The red/orange jobs were showing up constantly on tbpl and being ignored. It only got attention when a regression email was noticed for the N900s ("Talos Regression :( Tp4 increase 54% on Nokia n900 mobile").
Android Talos is not as stable as we'd like it to be, but having even, say, 50% of jobs fail is a much better situation than having 100% of jobs fail. I think marking this as WONTFIX would be a terrible mistake. I'd really like to know when we lose all performance data, even on a day when I'm not watching the tree because I'm not checking anything in.
Not a WONTFIX. Until the Android infra becomes bulletproof, and everyone assumes an Android orange is not a false positive, we can't take that attitude. We are not there yet. We all watch for dev.tree-management emails. We would have seen this earlier if Maemo talos was on m-i, but it isn't.
How would developers have noticed? Pretty sure I've starred every single tp4m failure (that someone else didn't star first) since I got back yesterday, with a bug which existed prior to September 19th. The significant thing is that it never succeeds - that rather than getting one success every three or four attempts like most Android Talos suites do, it never gets one - not that it has started permanently failing with some new failure mode that was ignored. I could go back to what I did in August, retriggering every single Android failure, despite the way that got us into multi-hour backlogs, but even that doesn't really make it clear that a suite has gotten into this state, because eventually a retrigger winds up getting coalesced with a later job.
Philor, so what do you suggest? We blindly move along, unaware that there is a bug in our infrastructure causing every run to fail?
Why would you think I'm suggesting that? comment 0 said "send a mail", comment 1 said "there's no need for that, if someone looked at the failures they would have seen it" and if I hadn't been midaired I would have been comment 2 saying "I looked at every single failure, and still had no idea that this was the case."
(In reply to Phil Ringnalda (:philor) from comment #6)
> Why would you think I'm suggesting that?
>
> comment 0 said "send a mail", comment 1 said "there's no need for that, if
> someone looked at the failures they would have seen it" and if I hadn't been
> midaired I would have been comment 2 saying "I looked at every single
> failure, and still had no idea that this was the case."

Sorry, didn't get that from what you wrote.
Let me step back and elaborate on why I said WONTFIX.

I care about Android and mobile becoming tier 1 (I actually thought we were there already) as much as any mobile dev, but we need developers (besides mobile devs) to care about this too. Before I jump into all my reasoning I want to point out that pretty much anyone can write a tool to notice that graph posts are missing (IIUC). Please feel free to correct me if I misunderstand how graph works.

Devs were landing on orange/red, not caring and expecting someone else to take care of it (aka philor), or at least that is the impression I get from who files infra issues and who comments on those bugs. I believe that every orange/red should be looked at by the pushers, who should talk with buildduty. Development has not stopped in the many instances that Android goes red/orange for days (even though mobile/ateam/releng had brought it to green), and the bugs that philor files do not get attention from releng/a-team (even though many times they have been code issues). This is more of a cultural/social problem than releng/ateam not picking up the bugs fast enough to determine that a bug was indeed browser/fennec based (from the outside that is how it looks, but please correct me if I am misconceiving it).

We have many branches for people to hack on while mozilla-inbound could have been closed for Android bustages. Letting development continue eventually lets code issues reach mozilla-central and all other branches. We socially/culturally allow this to happen and it doesn't benefit Android or anyone. We say we are all mobile and want to treat Android as tier 1, but non-mobile devs do not care enough (or at least that is the impression I get).

I have another general solution that could allow anyone to see more clearly when we have an infra issue. If there was a sideways view of tbpl we could see infra issues (the table fills from the left, where c1 is the newest push):

             c1 | c2 | c3 | ... | cM
  builder1 |  G |  G |  G | ... | G
  builder2 |  R |  R |  G | ... | G
  ...
  > (load more changesets)
  builderN |  - |  G |  G | ... | G

This would clearly show that after c2 something changed that has made the builder go red ever since. Right now tbpl does not let you notice this so clearly. If tbpl allowed you to switch from the original view to this per-builder view, we could be more deterministic about issues like this.

In other words, I would love to catch up on a call (rather than discussing this in bug comments forever) to determine how to prevent this situation from happening again. AFAICT a tool could be written, and it does not necessarily have to be written by releng. I am willing to work through this issue and determine the right way/solution. I would be happy to see this bug happen as a safeguard, but the social/cultural disconnect seems to me to be the major concern.
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #8)
> Devs were landing on orange/red, not caring and expecting someone else
> to take care of it (aka philor), or at least that is the impression I
> get from who files infra issues and who comments on those bugs. I
> believe that every orange/red should be looked at by the pushers, who
> should talk with buildduty.

We very intentionally created mozilla-inbound because we did not want to have our developers context-switch and turn into build and test engineers for four or five hours every single time they pushed something. Nobody who pushed to mozilla-inbound over the last week did anything at all wrong by pushing while this was happening, so please stop saying that they did. Nor did any of the volunteers who star mozilla-inbound or the volunteers who merge mozilla-inbound do anything wrong.

This suite failed in a variety of ways, when it happened to get run (looking for instances today, that was things like "two of ten pushes" or "three of seven pushes"). I completely failed to see that it was still happening as of last night when I got back, because having a few Android Talos runs fail in well-known ways is not the least bit surprising. Had I been around all last week, I absolutely, without question, would not have seen that it was happening.

If we have to wait on a complete ground-up rewrite of tbpl before we have a tool to notice this sort of thing, or we have to wait for some other team to take over the Talos regression-spotting-and-emailing script so that we can tell who to refile this bug against, then that leaves only one way forward in the meantime. I'm back to my August program of retriggering every single Android failure, so please order more Tegras.

A lot more Tegras.

As long as the current order has taken to arrive, probably ordering 600 this next time would be a good idea.
(In reply to Phil Ringnalda (:philor) from comment #9)
> I'm back to my August program of retriggering every single Android failure,
> so please order more Tegras.
>
> A lot more Tegras.
>
> As long as the current order has taken to arrive, probably ordering 600 this
> next time would be a good idea.

I just brought online 36 new tegras and I have learned that the other 160 are being delivered soon. We are making the plans necessary so that when the 160 are imaged and ready the infra will be ready and we can bring them online. It may not be 600, but it's quite a bit better than the 80 we have been living with.
Yeah, it's possible the 200 will prove to be enough to keep up with my retriggering - I'm watching fewer trees really closely than I was in August, and had I not been on a retriggering spree, I would have gotten a remote-tp4m in https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=5456273e5ab5 on only the third try. Did we actually change something to cause a few remote-tp4ms to finish today, or was the week-long string really just pure coincidence?
(In reply to Phil Ringnalda (:philor) from comment #11)
> Yeah, it's possible the 200 will prove to be enough to keep up with my
> retriggering - I'm watching fewer trees really closely than I was in August,
> and had I not been on a retriggering spree, I would have gotten a
> remote-tp4m in https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=5456273e5ab5
> on only the third try.
>
> Did we actually change something to cause a few remote-tp4ms to finish
> today, or was the week-long string really just pure coincidence?

Joel would be the person to ask. I know he and I have been working on small changes to how talos works (his side) and how the tegra infra handles errors (my side), and those changes have been slowly working thru the codebase.
We fixed a pageloader bug (rolled out this morning) that prevented the browser from hitting the 'quit' call, which ended up putting us in an error-handling loop instead of the publish-the-results phase.
I asked Catlee whether this belonged in the talos regression script; he thinks it belongs by itself. So we need a new script that parses graphs data for time-since-last-data.
Whiteboard: [talos][android_tier_1]
(In reply to Aki Sasaki [:aki] from comment #14)
> I asked Catlee whether this belonged in the talos regression script; he
> thinks it belongs by itself.
> So we need a new script that parses graphs data for time-since-last-data.

Where does the talos regression script live in VCS so that a motivated person can start hacking on this?
Priority: -- → P3
(In reply to Chris Cooper [:coop] from comment #15)
> (In reply to Aki Sasaki [:aki] from comment #14)
> > I asked Catlee whether this belonged in the talos regression script; he
> > thinks it belongs by itself.
> > So we need a new script that parses graphs data for time-since-last-data.
>
> Where does the talos regression script live in VCS so that a motivated
> person can start hacking on this?

In here: http://hg.mozilla.org/graphs/file/default/server/analysis
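For anyone who wants a starting point, here is a rough sketch of what a time-since-last-data check could look like. This is only an illustration, not the real script: the graph server endpoint, the JSON keys, the test/branch/platform IDs, and the mail addresses below are all placeholders/assumptions, so it would need to be adapted to the actual graphs API.

# Rough sketch only: endpoint, JSON keys, suite IDs and mail settings are
# assumptions, not the real graph server API or configuration.
import json
import smtplib
import time
import urllib.request
from email.mime.text import MIMEText

GRAPH_API = ("http://graphs.mozilla.org/api/test/runs"
             "?id={test}&branchid={branch}&platformid={platform}")
MAX_AGE = 24 * 60 * 60  # alert when a suite has no data point in 24 hours

# (suite name, test id, branch id, platform id) -- hypothetical IDs
SUITES = [
    ("remote-tp4m", 85, 63, 20),
    ("remote-tsvg", 84, 63, 20),
]

def last_timestamp(test_id, branch_id, platform_id):
    """Return the newest data-point timestamp for one suite, or None."""
    url = GRAPH_API.format(test=test_id, branch=branch_id, platform=platform_id)
    data = json.load(urllib.request.urlopen(url))
    # Assume each run looks like [testrun_id, timestamp, value].
    runs = data.get("test_runs", [])
    return max((run[1] for run in runs), default=None)

def main():
    now = time.time()
    stale = []
    for name, test_id, branch_id, platform_id in SUITES:
        newest = last_timestamp(test_id, branch_id, platform_id)
        if newest is None or now - newest > MAX_AGE:
            age = "never" if newest is None else "%.1f hours ago" % ((now - newest) / 3600)
            stale.append("%s last reported: %s" % (name, age))
    if stale:
        msg = MIMEText("\n".join(stale))
        msg["Subject"] = "talos suites with no results in the last 24 hours"
        msg["From"] = "talos-alerts@example.com"       # placeholder
        msg["To"] = "dev-tree-management@example.com"  # placeholder
        with smtplib.SMTP("localhost") as server:
            server.send_message(msg)

if __name__ == "__main__":
    main()

The real version would presumably live alongside the analysis scripts linked above and read its suite list from a config file rather than hard-coding it.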
In tbpl, would it help if the T's were spelled out? E.g., instead of

  Android opt  B 1 2 3 4 5 6 7 8 b-c C J1 J2 R1 R2 T T T T T T T T T

show something like

  Android opt  B 1 2 3 4 5 6 7 8 b-c C J1 J2 R1 R2 Tp4m Ts Tdhtml ...

(or some other identifiable way). This would not be a perfect solution to the problem, but it would possibly visually help people notice if/when a specific suite had issues for a length of time. This could be instead of, or in addition to, the email solution.
I could see that being very helpful for identifying patterns. I'd still like to see this email notification implemented as well though.
Identifying talos suites in tbpl is bug 685053.
Something more general would have also caught bug 693686 :'(
Assignee: nobody → bear
Tossing into the triage pool - with the other android tier 1 bugs I have and the amount of how-the-F-do-I-start-this research this would take, probably one of the others on the team can do this faster than I can. If not, toss it back to me.
Assignee: bear → nobody
Priority: P3 → --
Whiteboard: [talos][android_tier_1] → [talos][android_tier_1][triagefollowup]
Assignee: nobody → catlee
Priority: -- → P2
Whiteboard: [talos][android_tier_1][triagefollowup] → [talos][android_tier_1]
Who wants emails for these to start with? I'm not comfortable reporting to a newsgroup until this has been running for a few weeks.
you can send them to me until we get the kinks worked out
(In reply to Brad Lassey [:blassey] from comment #23)
> you can send them to me until we get the kinks worked out

blassey: any kinks left to iron out, or are we done here?
Per blassey in the mobile/QA/RelEng mtg this morning: he thinks it's been fine so far and it is now OK to direct to the newsgroups whenever you are comfortable doing that. (No need to set up a separate mailing list and find members after all.)
I've been running this since December, and so far all I've got are false alarms. I'd like to fix the false alarms before publishing these alerts more widely.
No other alarms so far, false or otherwise. I've enabled reporting to dev.tree-management.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Chris, I'm reopening because Mark just told me that sunspider stopped reporting a while back and apparently no email was sent to dev.tree-management.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Brad Lassey [:blassey] from comment #28)
> Chris, I'm reopening because Mark just told me that sunspider stopped
> reporting a while back and apparently no email was sent to
> dev.tree-management.

(In reply to Mark Finkle (:mfinkle) from comment #29)
> http://graphs.mozilla.org/graph.html#tests=[[26,11,23],[26,11,20]]&sel=none&displayrange=30&datatype=running

That was caused by Bug 767224, and per that bug entirely intentional. If we need to turn it back on, please file a new bug and we'll do so. This is not a failure of this bug.
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering