Closed Bug 877779 Opened 7 years ago Closed 7 years ago
Talos Regression tp4m_nochrome 14% on Android 2.2, May 29
I cannot find a dev-tree-management alert for this regression, but :mfinkle pointed it out.
Before, mozilla-inbound, Android 2.2: 577.7 Δ -3 (-0.6%) acc549b97117 May 29, 2013 03:06
After: 660.7 Δ 83 (14.4%) 0d52ae944c00 May 29, 2013 05:25
Regression range appears to be http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=acc549b97117&tochange=0d52ae944c00
There may be a very subtle bump at the same time on Android 4.0...or I may be imagining that.
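For reference, the 14.4% figure follows directly from the before and after values quoted above. A minimal sketch of that arithmetic (the helper name `percent_change` is made up for illustration and is not part of Talos or the graph server):

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Values quoted in the comment above (mozilla-inbound, Android 2.2).
before = 577.7  # acc549b97117
after = 660.7   # 0d52ae944c00

delta = after - before
print(f"delta = {delta:.1f}, change = {percent_change(before, after):.1f}%")
```

Higher tp4m numbers mean slower page loads, so a positive change here is a regression.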
Patrick - Bug 790388 shows up in the regression range. Any input one way or the other?
tp-star doesn't execute any of the code changed by 790388...
Let's figure out which of these patches caused this with try pushes.
Assignee: nobody → gbrown
tracking-fennec: ? → 24+
Try run for acc549b97117: https://tbpl.mozilla.org/?tree=Try&rev=59de3f76728d 702.70 612.55 659.55 620.00 635.05 811.05 653.10 542.05 686.95 650.80 689.05 642.20 642.95 635.30 620.80 658.20 634.20 661.10 743.55 660.20 618.80 674.05 => average of 657.01
Try run for 0d52ae944c00 at https://tbpl.mozilla.org/?tree=Try&rev=dd55e3eb870b: 670.95 631.85 687.05 626.7 662.4 710.75 683.6 735.4 628.55 652.05 648.05 609.45 621.05 638.6 628.7 741.95 618.6 659.55 651.45 658.75 731.95 => average of 661.56
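One rough way to read the two try runs is to ask how far a given value sits from the first run's average, measured in standard errors of that average. The sketch below does this for the acc549b97117 samples listed above. The second run's average (661.56) and the pre-regression value (577.7) are taken as reported in this bug rather than recomputed. The helper names are hypothetical, and this is only an illustration, not the analysis anyone in this bug actually ran.

```python
from math import sqrt
from statistics import mean, stdev

# tp4m samples from the try run for acc549b97117 (copied from the comment above).
run_acc549 = [
    702.70, 612.55, 659.55, 620.00, 635.05, 811.05, 653.10, 542.05,
    686.95, 650.80, 689.05, 642.20, 642.95, 635.30, 620.80, 658.20,
    634.20, 661.10, 743.55, 660.20, 618.80, 674.05,
]

# Figures as reported in this bug, not recomputed here.
REPORTED_AVG_0D52 = 661.56   # try run for 0d52ae944c00
PRE_REGRESSION = 577.7       # mozilla-inbound value before the regression

avg = mean(run_acc549)
# Standard error of the mean: sample standard deviation / sqrt(n).
sem = stdev(run_acc549) / sqrt(len(run_acc549))


def distance_in_sem(value: float) -> float:
    """How many standard errors `value` lies from this run's average."""
    return abs(value - avg) / sem


print(f"average = {avg:.2f}, standard error = {sem:.2f}")
print(f"other try run:  {distance_in_sem(REPORTED_AVG_0D52):.1f} standard errors away")
print(f"pre-regression: {distance_in_sem(PRE_REGRESSION):.1f} standard errors away")
```

Under these assumptions the two try runs are indistinguishable from each other, while both sit well above the pre-regression value, which is consistent with the try pushes all landing in the regressed state.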
As seen in Comments 4 and 5, all my try pushes give me the regressed tp4m results -- none give the non-regressed results. I went back to mozilla-inbound and retriggered tp4m from several pushes before the regression range: see https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=Android&rev=4187c565aec5. The first Android 2.2 tp4m result here, from May 28, was 581.00. Re-triggers of this job, run today, produced 655.45, 690.45, and 645.60. This reminds me of bug 864637.
This regression is visible on mozilla-inbound, mozilla-central, mozilla-aurora, and mozilla-beta, all on May 29. Re-triggers of jobs on try and mozilla-inbound from before the regression give the regressed results. The same thing happened - small regression in Android 2.2 tp4m for no known reason - in bug 855044 on March 26. The problem magically un-regressed on March 27. The same thing happened in bug 864637 on April 18. The problem magically un-regressed on May 9.
needinfo to XioNoX (I didn't realize he was in Paris, so my IRC ping wasn't useful):
<Callek> XioNoX: so #ateam is chasing a talos regression affecting mobile only, that crosses branches, and presents even for retriggered tests prior to the inflection point. On the 29th. The likely factors (that we thought of so far) are ruled out....
XioNoX: (bug for regression is 877779) -- I noticed https://bugzilla.mozilla.org/show_bug.cgi?id=877126 which was work you did relating to network on the 29th, ganglia graphs for bm-remote* seem to indicate a rather large network drop on that time (e.g. http://ganglia3.build.mtv1.mozilla.com/ganglia/?r=month&cs=&ce=&c=RelEngMTV1&h=bm-remote-talos-webhost-01.build.mtv1.mozilla.com&tab=m&vn=&mc=2&z=small&metric_group=ALLGROUPS -- notice network load)
The tegras hit this host from same DC (mtv1) while the pandas which also saw a (smaller) blip [smaller could just be from newer hardware] access it from a scl1->mtv blip.
My underlying question: is what we are experiencing *anything* that could be fallout from Bug 877126?
0:12 <@XioNoX> Callek: yeah, wanted to ask you for more details actually, the circuit flaps you mentioned only made routers to use their redundant link, nothing else
20:12 <@XioNoX> if you could give me source/dest IPs I could check if something is wrong
Feel free to also ping other people from Netops that are closer to your timezone if needed.
Maybe hwine knows about network changes that happened on May 29th.
:jmaher noted that the tp4m logs for the regressed cases have *very* large times for the page m.yahoo.co.jp/www.yahoo.co.jp/index.html -- typically close to 1000 seconds. There is no sign of this in the logs for non-regressed runs. :jmaher also reportedly found some external accesses in this page, but eliminating those did not eliminate the 1000 second page load time.
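A page like this can be spotted mechanically by comparing each page's load time against the median across all pages. The sketch below assumes the per-page times have already been parsed into a dictionary. The function name, the threshold factor, and every number in `sample` are hypothetical and for illustration only. Only the m.yahoo.co.jp page name and the rough "close to 1000" magnitude come from the comment above, and the other page names are placeholders.

```python
from statistics import median


def flag_slow_pages(page_times: dict[str, float], factor: float = 10.0) -> list[str]:
    """Return pages whose load time exceeds `factor` times the median time."""
    typical = median(page_times.values())
    return sorted(page for page, t in page_times.items() if t > factor * typical)


# Hypothetical per-page times; these are NOT values from the real tp4m logs.
sample = {
    "m.yahoo.co.jp/www.yahoo.co.jp/index.html": 1000.0,
    "page-a.example/index.html": 1.2,
    "page-b.example/index.html": 0.9,
    "page-c.example/index.html": 1.5,
}

for page in flag_slow_pages(sample):
    print("suspiciously slow:", page)
```

Using the median rather than the mean as the baseline keeps a single extreme page from hiding itself by inflating the average.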
Please see comment 10.
(In reply to Armen Zambrano G. [:armenzg] (back in July 7th) from comment #10)
> Maybe hwine knows about network changes that happened on May 29th.
I don't know -- that would be a question for netops. Ravi?
Flags: needinfo?(hwine) → needinfo?(ravi)
Actually, there is no need to look for network changes: we found the problem was due to a network access request (to the live internet) which wasn't resolving anymore.
Logs show reasonable (non-timeout) values for m.yahoo.co.jp, and tp4m has returned to pre-regression levels, beginning June 19 -- the most recent talos.zip update. Can this be resolved, or is there more to do?
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED