Closed
Bug 716256
Opened 14 years ago
Closed 14 years ago
Frequent talos failures on multiple trees starting ~7am 2012-01-07 MVT
Categories
(Testing :: Talos, defect)
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 716326
People
(Reporter: emorley, Unassigned)
Details
Talos tp has been failing much more frequently since ~7am 2012-01-07 MVT.
edmorley: jmaher: any idea what could have broken things at 7am MVT today?
jmaher: edmorley: no idea; I suspect there was either a hiccup on the network
jmaher: or maybe some issues with disks on certain slaves
edmorley: jmaher: multiple machines, trees and for several hours
jmaher: edmorley: hmm
edmorley: https://tbpl.mozilla.org/?jobname=talos%20tp and https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=talos%20tp (plus obviously press down arrow x a few)
jmaher: odd that we are getting crashes and timeouts
jmaher: edmorley: so this is the first instance I see on inbound: https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=talos%20tp&rev=757a48403154
jmaher: and that is on osx 64 bit
jmaher: edmorley: but the frequency of that is pretty high; wonder if the responsiveness stuff is causing it to crash; that has been live for almost 2 months, but we had troubles initially with osx crashes
jmaher: edmorley: it all sounds like a slave issue
jmaher: or a drive that serves the files got corrupted
Examples:
* https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=talos%20tp&rev=879883efec3c
* https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=talos%20tp&rev=1a4ef8ec3f5a
* https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=talos%20tp&rev=757a48403154
* https://tbpl.mozilla.org/?jobname=talos%20tp&rev=ff517d1a0c4a
* https://tbpl.mozilla.org/?jobname=talos%20tp&rev=5a446202be5f
Comment 1 • Reporter • 14 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=8397254&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=8402015&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=8402005&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=8401913&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=8401429&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=8401417&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=8402027&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=8401840&tree=Firefox
https://tbpl.mozilla.org/php/getParsedLog.php?id=8400830&tree=Firefox
https://tbpl.mozilla.org/php/getParsedLog.php?id=8401912&tree=Firefox
Comment 2 • Reporter • 14 years ago
Not just tp it would seem, also dromaeo and talos chrome_mac:
https://tbpl.mozilla.org/php/getParsedLog.php?id=8399968&tree=Firefox
https://tbpl.mozilla.org/php/getParsedLog.php?id=8400988&tree=Mozilla-Inbound
Summary: Frequent talos tp failures on multiple trees starting ~7am 2012-01-07 MVT → Frequent talos failures on multiple trees starting ~7am 2012-01-07 MVT
Comment 3 • Reporter • 14 years ago
Comment 4 • Reporter • 14 years ago
Comment 5 • Reporter • 14 years ago
Comment 6 • Reporter • 14 years ago
Comment 7 • 14 years ago
I only see two things in there: bug 716326, where something broke on talos-r4-snow-007 and it chewed up 30 jobs, and a big cluster of bug 664371, where we hang after the log line "loaded http://localhost/page_load_test/tp5/goo.ne.jp/goo.ne.jp/index.html (next: http://localhost/page_load_test/tp5/alipay.com/www.alipay.com/index.html)", which from what alice told me long ago means that we have actually finished loading the goo.ne.jp page and are loading the alipay.com page.
Far and away the two most likely explanations are that something snuck through in the process of turning it into a local-only page, and it hits the network for something which intermittently fails in a nasty way, or, like the classic zombocom episode, it intermittently gets busted DNS and a DNS lookup from something like prefetching gets stuck and hangs the browser. Either way, it's going to be someone with access to the pageset who determines what's actually happening, but it should have considerably higher priority than it has ever gotten. It's entirely reasonable to expect that a perf test which intermittently hits clusters of hangs also intermittently hits clusters of slows, and if I landed something that regressed tp, I'd just sit and argue and delay for as long as possible, waiting to see whether my "regression" just disappeared when either the buildfarm DNS server or alipay's DNS or the buildfarm network or alipay's servers got fixed. That sort of utter lack of faith in a test is... not ideal.
Are there any that I missed noticing, which are both not talos-r4-snow-007 and are not a tp5 run which hung after the "(next: http://localhost/page_load_test/tp5/alipay.com/www.alipay.com/index.html)" line?
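The two-bucket triage described above could be sketched roughly as follows. This is only an illustration, not how the logs were actually sorted: the function names are made up, and it assumes the slave name appears somewhere in the raw log text and that a tp5 hang shows up as the "(next: ...alipay.com...)" line being the last progress line in the log.

```python
import re

# Matches the tp5 progress line quoted above:
#   loaded <url> (next: <url>)
LOADED_RE = re.compile(r"loaded (\S+) \(next: (\S+)\)")

BAD_SLAVE = "talos-r4-snow-007"      # slave named in bug 716326
ALIPAY_MARKER = "/tp5/alipay.com/"   # page being loaded in the bug 664371 hangs


def last_progress(log_text):
    """Return (loaded_url, next_url) from the last progress line, or None."""
    matches = LOADED_RE.findall(log_text)
    return matches[-1] if matches else None


def classify(log_text):
    """Bucket one failure log (hypothetical helper, assumptions as stated above)."""
    # Assumption: the slave hostname is present in the log text.
    if BAD_SLAVE in log_text:
        return "bug 716326"
    # Assumption: a hang means no further progress lines were written,
    # so the last one names alipay as the next page.
    progress = last_progress(log_text)
    if progress is not None and ALIPAY_MARKER in progress[1]:
        return "bug 664371"
    return "unclassified"
```

Anything that comes back as "unclassified" would be a log that is neither from talos-r4-snow-007 nor a tp5 run stuck on the alipay page, which is exactly the set being asked about.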
Comment 8 • Reporter • 14 years ago
(In reply to Phil Ringnalda (:philor) from comment #7)
> Are there any that I missed noticing, which are both not talos-r4-snow-007
> and are not a tp5 run which hung after the "(next:
> http://localhost/page_load_test/tp5/alipay.com/www.alipay.com/index.html)"
> line?
I don't believe so; your analysis looks correct. I think this was just a case of more caffeine being required on my part: I accidentally included the alipay logs, which then confused my results in comment 0. Thanks for making sense of it :-)
Going to dupe to bug 716326.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → DUPLICATE