Closed
Bug 689625
Opened 14 years ago
Closed 13 years ago
please send an email when a mobile talos suite has not reported numbers in a 24 hour period
Categories
(Release Engineering :: General, defect, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jmaher, Assigned: catlee)
Details
(Whiteboard: [talos][android_tier_1])
The suites this would apply to:
tp4m
tsvg
ts
tpan
tzoom
tdhtml
tsspider
We had a scenario where tp4m did not report results for a full week, and not everybody was aware of it.
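A minimal sketch of the requested check, assuming the newest-datapoint timestamp for each suite has already been pulled from the graph server (the fetching itself is out of scope here; only the suite names above come from this bug):

from datetime import datetime, timedelta

# Mobile talos suites listed above.
SUITES = ["tp4m", "tsvg", "ts", "tpan", "tzoom", "tdhtml", "tsspider"]
STALE_AFTER = timedelta(hours=24)

def stale_suites(last_report, now=None):
    """Return the suites whose newest datapoint is older than 24 hours,
    or missing entirely.  last_report maps suite name -> datetime of the
    newest graph-server datapoint for that suite."""
    now = now or datetime.utcnow()
    return [s for s in SUITES
            if last_report.get(s) is None or now - last_report[s] > STALE_AFTER]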
Comment 1•14 years ago
For the record, here is a URL showing the discontinuity since Sep. 19th:
http://graphs-new.mozilla.org/graph.html#tests=[[85,63,20],[84,63,20],[87,63,20],[85,11,20],[87,11,20]]&sel=none&displayrange=30&datatype=delta
I believe this is a WONTFIX. Developers did not care to figure out why the Android Tegra 250 mozilla-inbound talos remote-tp4m job was constantly orange/red, even though the problem was being reported on every check-in.
Developers did not stop working; they kept going without figuring it out.
The red/orange jobs were showing up constantly on tbpl and being ignored.
It only got attention when a regression email was noticed for the N900s ("Talos Regression :( Tp4 increase 54% on Nokia n900 mobile")
Comment 2•14 years ago
Android Talos is not as stable as we'd like it to be, but having even, say, 50% of jobs fail is a much better situation than having 100% of them fail. I think marking this WONTFIX would be a terrible mistake. I'd really like to know when we lose all performance data, even on a day when I'm not watching the tree because I'm not checking anything in.
Comment 3•14 years ago
Not a WONTFIX. Until the Android infra becomes bulletproof and everyone treats an Android orange as a real failure rather than a false positive, we can't take that attitude.
We are not there yet. We all watch for dev.tree-management emails. We would have seen this earlier if Maemo talos were on m-i, but it isn't.
Comment 4•14 years ago
How would developers have noticed? I'm pretty sure I've starred every single tp4m failure (that someone else didn't star first) since I got back yesterday, with a bug that existed prior to September 19th. The significant thing is that it never succeeds: rather than getting one success every three or four attempts like most Android Talos suites do, it never gets one. It's not that it started permanently failing with some new failure mode that was ignored.
I could go back to what I did in August, retriggering every single Android failure, despite the way that got us into multi-hour backlogs, but even that doesn't really make it clear that a suite has gotten into this state, because eventually a retrigger winds up getting coalesced with a later job.
Comment 5•14 years ago
Philor, so what do you suggest? We blindly move along, unaware that there is a bug in our infrastructure causing every run to fail?
Comment 6•14 years ago
Why would you think I'm suggesting that?
comment 0 said "send a mail", comment 1 said "there's no need for that, if someone looked at the failures they would have seen it" and if I hadn't been midaired I would have been comment 2 saying "I looked at every single failure, and still had no idea that this was the case."
Comment 7•14 years ago
(In reply to Phil Ringnalda (:philor) from comment #6)
> Why would you think I'm suggesting that?
>
> comment 0 said "send a mail", comment 1 said "there's no need for that, if
> someone looked at the failures they would have seen it" and if I hadn't been
> midaired I would have been comment 2 saying "I looked at every single
> failure, and still had no idea that this was the case."
Sorry, I didn't get that from what you wrote.
Comment 8•14 years ago
Let me step back and elaborate on why I said WONTFIX. I care about Android and mobile becoming tier 1 (I actually thought we were there already) as much as any mobile dev, but we need developers beyond the mobile devs to care about this too.
Before I jump into all my reasoning, I want to point out that pretty much anyone can write a tool to notice that graph posts are missing (IIUC). Please feel free to correct me if I misunderstand how graphs works.
Devs were landing on orange/red, not caring, and expecting someone else to take care of it (aka philor), or at least that is the impression I get from who files infra issues and who comments on those bugs. I believe every orange/red should be looked at by the pushers, who should talk with buildduty.
Development has not stopped in the many instances where Android went red/orange for days (even after mobile/ateam/releng had brought it back to green), and the bugs that philor files do not get attention from releng/a-team (even though many of them have turned out to be code issues). This is more of a cultural/social problem than releng/ateam not picking up the bugs fast enough to determine that a bug was indeed browser/fennec based (from the outside that is how it looks, but please correct me if I am misreading it).
We have many branches for people to hack on while mozilla-inbound could have been closed for the Android bustages. Letting development continue eventually lets code issues reach mozilla-central and all the other branches.
We socially/culturally allow this to happen, and it doesn't benefit Android or anyone. We say we are all mobile and want to treat Android as tier 1, but non-mobile devs do not care enough (or at least that is the impression I get).
I have another, more general suggestion that could let anyone see more clearly when we have an infra issue.
If there were a sideways view of tbpl, we could spot infra issues (the table fills from the left, where c1 is the newest push):
c1 | c2 | c3 | ... | cM
builder1 | G | G | G | ... | G
builder2 | R | R | G | ... | G
... > (load more changesets)
builderN | - | G | G | ... | G
This would clearly show that something changed starting with c2 that has made the builder go red ever since. Right now tbpl does not let you notice that so clearly.
If tbpl allowed you to switch from the original view to this per-builder view, we could be more deterministic about issues like this.
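A rough sketch of how that per-builder pivot could be computed from (builder, changeset, result) data; the input format here is made up for illustration, and the real data would have to come from tbpl or its backend:

def per_builder_rows(results, changesets):
    """results: dict keyed by (builder, changeset) -> 'G'/'O'/'R'.
    changesets: changeset ids, newest first (c1, c2, ...).
    Returns builder -> list of results aligned with changesets,
    with '-' where the builder did not run."""
    builders = sorted({b for b, _ in results})
    return {b: [results.get((b, c), "-") for c in changesets] for b in builders}

# Example: builder2 went red at c2 and has stayed red since.
results = {
    ("builder1", "c1"): "G", ("builder1", "c2"): "G", ("builder1", "c3"): "G",
    ("builder2", "c1"): "R", ("builder2", "c2"): "R", ("builder2", "c3"): "G",
}
for builder, row in sorted(per_builder_rows(results, ["c1", "c2", "c3"]).items()):
    print(builder, "|", " | ".join(row))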
In other words, I would love to catch up on a call (rather than discussing this in bug comments forever) to determine how to prevent this situation from happening again. AFAICT such a tool could be written, and it does not necessarily have to be written by releng.
I am willing to work through this issue and determine the right way forward. I would be happy to see what this bug asks for implemented as a safeguard, but the social/cultural disconnect seems to me the bigger concern.
Comment 9•14 years ago
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #8)
> Devs were landing on orange/red, not caring and expecting someone else to
> care of it (aka philor) or at least that is the impression I get from who
> files infra issues and who comments on those bugs. I believe that every
> orange/red should be looked by pushers and talk with the buildduty.
We very intentionally created mozilla-inbound because we did not want to have our developers context-switch and turn into build and test engineers for four or five hours every single time they pushed something. Nobody who pushed to mozilla-inbound over the last week did anything at all wrong by pushing while this was happening, so please stop saying that they did.
Nor did any of the volunteers who star mozilla-inbound and the volunteers who merge mozilla-inbound do anything wrong. This suite failed in a variety of ways, when it happened to get run (looking for instances today, that was things like "two of ten pushes" or "three of seven pushes"). I completely failed to see that it was still happening as of last night when I got back, because having a few Android Talos runs fail in well-known ways is not the least bit surprising. Had I been around all last week, I absolutely, without question, would not have seen that it was happening.
If we have to wait on a complete ground-up rewrite of tbpl before we have a tool to notice this sort of thing, or we have to wait for some other team to take over the Talos regression spotting and emailing-about script so that we can tell who to refile this bug against, then that leaves only one way forward in the meantime.
I'm back to my August program of retriggering every single Android failure, so please order more Tegras.
A lot more Tegras.
As long as the current order has taken to arrive, probably ordering 600 this next time would be a good idea.
Comment 10•14 years ago
(In reply to Phil Ringnalda (:philor) from comment #9)
> I'm back to my August program of retriggering every single Android failure,
> so please order more Tegras.
>
> A lot more Tegras.
>
> As long as the current order has taken to arrive, probably ordering 600 this
> next time would be a good idea.
I just brought online 36 new tegras, and I have learned that the other 160 are being delivered soon. We are making the plans necessary so that when the 160 are imaged and ready, the infra will be ready and we can bring them online.
It may not be 600, but it's quite a bit better than the 80 we have been living with.
Comment 11•14 years ago
Yeah, it's possible the 200 will prove to be enough to keep up with my retriggering - I'm watching fewer trees really closely than I was in August, and had I not been on a retriggering spree, I would have gotten a remote-tp4m in https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=5456273e5ab5 on only the third try.
Did we actually change something to cause a few remote-tp4ms to finish today, or was the week-long string really just pure coincidence?
Comment 12•14 years ago
(In reply to Phil Ringnalda (:philor) from comment #11)
> Yeah, it's possible the 200 will prove to be enough to keep up with my
> retriggering - I'm watching fewer trees really closely than I was in August,
> and had I not been on a retriggering spree, I would have gotten a
> remote-tp4m in
> https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=5456273e5ab5 on only the
> third try.
>
> Did we actually change something to cause a few remote-tp4ms to finish
> today, or was the week-long string really just pure coincidence?
Joel would be the person to ask. I know he and I have been working on small changes to how talos works (his side) and how the tegra infra handles errors (my side), and those changes have been slowly working through the codebase.
Reporter
Comment 13•14 years ago
We fixed a pageloader bug (rolled out this morning) that prevented the browser from hitting the 'quit' call, which ended up putting us in an error-handling loop instead of the publish-the-results phase.
Comment 14•14 years ago
I asked Catlee whether this belonged in the talos regression script; he thinks it belongs by itself.
So we need a new script that parses graphs data for time-since-last-data.
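A rough sketch of what such a script might look like. The graph-server endpoint, the JSON shape, and the mail settings below are assumptions made for illustration only; the real API would need to be checked against the graphs code (see comment 16):

import json
import smtplib
import time
import urllib.request
from email.mime.text import MIMEText

# Assumed endpoint and response shape -- check the actual graph server code.
GRAPH_API = "http://graphs.mozilla.org/api/test/runs?id=%d&branchid=%d&platformid=%d"
MAX_AGE = 24 * 60 * 60  # 24 hours, in seconds

def newest_timestamp(test_id, branch_id, platform_id):
    """Return the unix timestamp of the newest datapoint in one
    test/branch/platform series, or None if the series is empty."""
    with urllib.request.urlopen(GRAPH_API % (test_id, branch_id, platform_id)) as resp:
        data = json.load(resp)
    # Assumed shape: {"test_runs": [[run_id, ..., timestamp, ...], ...]}
    runs = data.get("test_runs", [])
    return max(run[2] for run in runs) if runs else None

def alert_if_stale(series, mail_to, mail_from="talos-alerts@example.com"):
    """series: iterable of (name, test_id, branch_id, platform_id) tuples."""
    stale = []
    for name, test_id, branch_id, platform_id in series:
        ts = newest_timestamp(test_id, branch_id, platform_id)
        if ts is None or time.time() - ts > MAX_AGE:
            stale.append(name)
    if stale:
        msg = MIMEText("No talos results in the last 24 hours for: " + ", ".join(stale))
        msg["Subject"] = "[talos] suites not reporting"
        msg["From"] = mail_from
        msg["To"] = mail_to
        smtplib.SMTP("localhost").sendmail(mail_from, [mail_to], msg.as_string())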
Updated•14 years ago
Whiteboard: [talos][android_tier_1]
Comment 15•14 years ago
(In reply to Aki Sasaki [:aki] from comment #14)
> I asked Catlee whether this belonged in the talos regression script; he
> thinks it belongs by itself.
> So we need a new script that parses graphs data for time-since-last-data.
Where does the talos regression script live in VCS so that a motivated person can start hacking on this?
Priority: -- → P3
Comment 16•14 years ago
(In reply to Chris Cooper [:coop] from comment #15)
> (In reply to Aki Sasaki [:aki] from comment #14)
> > I asked Catlee whether this belonged in the talos regression script; he
> > thinks it belongs by itself.
> > So we need a new script that parses graphs data for time-since-last-data.
>
> Where does the talos regression script live in VCS so that a motivated
> person can start hacking on this?
In here:
http://hg.mozilla.org/graphs/file/default/server/analysis
Comment 17•14 years ago
In tbpl, would it help if the T's were spelled out?
e.g., instead of
Android opt B 1 2 3 4 5 6 7 8 b-c C J1 J2 R1 R2 T T T T T T T T T
Android opt B 1 2 3 4 5 6 7 8 b-c C J1 J2 R1 R2 Tp4m Ts Tdhtml ...
(or some other identifiable way).
This would not be a perfect solution to the problem, but it could visually help people notice when a specific suite has had issues for a length of time.
This could be instead of, or in addition to, the email solution.
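A tiny sketch of the kind of mapping tbpl would need for that; the builder-name substrings and labels here are guesses for illustration, and the real mapping would live in tbpl's own configuration:

# Hypothetical map from remote talos suite names to spelled-out labels.
TALOS_LABELS = {
    "remote-tp4m": "Tp4m",
    "remote-ts": "Ts",
    "remote-tdhtml": "Tdhtml",
    "remote-tsvg": "Tsvg",
    "remote-tsspider": "Tsspider",
    "remote-tpan": "Tpan",
    "remote-tzoom": "Tzoom",
}

def talos_label(buildername):
    """Return a spelled-out label for a talos job instead of a bare 'T'."""
    # Check longer names first so "remote-ts" does not shadow "remote-tsvg".
    for suite in sorted(TALOS_LABELS, key=len, reverse=True):
        if suite in buildername:
            return TALOS_LABELS[suite]
    return "T"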
Comment 18•14 years ago
I could see that being very helpful for identifying patterns. I'd still like to see this email notification implemented as well though.
Comment 19•14 years ago
Identifying talos suites in tbpl is bug 685053.
Comment 20•14 years ago
Something more general would have also caught bug 693686 :'(
Updated•14 years ago
Assignee: nobody → bear
Comment 21•14 years ago
Tossing this into the triage pool - with the other android tier 1 bugs I have and the amount of how-the-F-do-I-start-this research this would take, one of the others on the team can probably do it faster than I can.
If not, toss it back to me.
Assignee: bear → nobody
Priority: P3 → --
Whiteboard: [talos][android_tier_1] → [talos][android_tier_1][triagefollowup]
Assignee
Updated•14 years ago
Assignee: nobody → catlee
Priority: -- → P2
Updated•14 years ago
Whiteboard: [talos][android_tier_1][triagefollowup] → [talos][android_tier_1]
Assignee
Comment 22•14 years ago
Who wants emails for these to start with? I'm not comfortable reporting to a newsgroup until this has been running for a few weeks.
Comment 23•14 years ago
You can send them to me until we get the kinks worked out.
Comment 24•14 years ago
(In reply to Brad Lassey [:blassey] from comment #23)
> you can send them to me until we get the kinks worked out
blassey: any kinks left to iron out, or are we done here?
Comment 25•14 years ago
Per blassey in the mobile/QA/RelEng meeting this morning: he thinks it's been fine so far and it's now OK to direct these to the newsgroups whenever you are comfortable doing that (no need to set up a separate mailing list and find members after all).
Assignee
Comment 26•14 years ago
I've been running this since December, and so far all I've gotten are false alarms. I'd like to fix the false alarms before publishing these alerts more widely.
Assignee
Comment 27•13 years ago
No other alarms so far, false or otherwise.
I've enabled reporting to dev.tree-management.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Comment 28•13 years ago
Chris, I'm reopening because Mark just told me that sunspider stopped reporting a while back and apparently no email was sent to dev.tree-management.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 29•13 years ago
http://graphs.mozilla.org/graph.html#tests=[[26,11,23],[26,11,20]]&sel=none&displayrange=30&datatype=running
Comment 30•13 years ago
(In reply to Brad Lassey [:blassey] from comment #28)
> Chris, I'm reopening because Mark just told me that sunspider stopped
> reporting a while back and apparently no email was sent to
> dev.tree-management.
(In reply to Mark Finkle (:mfinkle) from comment #29)
> http://graphs.mozilla.org/graph.html#tests=[[26,11,23],[26,11,
> 20]]&sel=none&displayrange=30&datatype=running
That was caused by Bug 767224 and, per that bug, was entirely intentional.
If we need to turn it back on, please file a new bug and we'll do so. This is not a failure of this bug's fix.
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
Updated•12 years ago
Product: mozilla.org → Release Engineering