Closed Bug 711725 Opened 9 years ago Closed 6 years ago

Tegras and Pandas disconnect with "remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion." or similar at any given step

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P3)

ARM
Android

Tracking

(firefox16 affected, firefox17 affected, firefox18 affected, firefox19 affected)

RESOLVED WORKSFORME
Tracking Status
firefox16 --- affected
firefox17 --- affected
firefox18 --- affected
firefox19 --- affected

People

(Reporter: philor, Unassigned)

Details

(Keywords: intermittent-failure, Whiteboard: [purple][android_tier_1])

Starting whichever night it was last week that we both got the buildfarm network fixed up, and added a bunch more foopies, we've been getting clumps of Android tests that successfully complete the run, but then disconnect ("Connection to the other side was lost in a non-clean fashion.") during the reboot device step, like

https://tbpl.mozilla.org/php/getParsedLog.php?id=8000862&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=8000842&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=8000855&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=8000865&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=8000850&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=8000861&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=8000894&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=8000834&tree=Mozilla-Inbound

Time of day (3:12 - 3:14 for those)? Everything attached to one foopy that's running at a particular time? Dunno.
https://tbpl.mozilla.org/php/getParsedLog.php?id=8037092&tree=Mozilla-Aurora

Could just be that the 3:12 clumps are easier to spot as a clump, since there's probably less running on fewer trees then.
Or it could be that we've always had some disconnects during rebooting, which might be what I used to star with bits from stories explaining death to children ("Sometimes, honey, when a Tegra gets really really tired, it needs to rest for a very long time..."), and now that the clumps are making us notice this flavor of failure, those are sticking out more.
Related to a comment in bug 713047: dustin thinks it may be the reboot step timing out, or being slower than the buildslave thinks it should be. Anywho, read that bug Monday and figure it out, bear.
Assignee: nobody → bear
https://tbpl.mozilla.org/php/getParsedLog.php?id=8188464&tree=Mozilla-Aurora

Should be interesting, having old tegras moved over, to see whether they start exploding in new ways in new places in the run.
https://tbpl.mozilla.org/php/getParsedLog.php?id=8202304&tree=Firefox
Summary: Intermittent clumps of Tegras disconnecting during the reboot step after successful runs → Intermittent clumps of Tegras attached to bm-19 and bm-20 disconnecting like a honey badger
Per discussions with philor in #build, the original burst of these may have been due to bring-up issues with the new buildmasters and the move of tegras. However, these tegras continue to display a new failure signature.

Several pieces moving at the same time here, need some data analysis to see if there is any correlation:
 - bm19 & bm20 originally brought up with tegras connected to new foopies on 2011-12-20 (bug 704597) (new foopies are faster hardware running os 10.7 (Lion))
 - all older tegra/foopies moved to bm19 & bm20 on 2011-12-31 (bug 713170) (old bm was always bordering on swapping, new still have plenty of headroom)
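
A minimal sketch of the kind of tally that could show whether the failures correlate with a foopy, a buildmaster, or a time of day. The record layout and every value in `failures` are hypothetical placeholders, not data pulled from the logs; real records would have to be extracted from the tbpl logs first.

```python
from collections import Counter

# Hypothetical failure records: (tegra name, foopy, buildmaster, hour of day).
# These values are placeholders, not data taken from the logs.
failures = [
    ("tegra-101", "foopy-a", "bm19", 3),
    ("tegra-102", "foopy-a", "bm19", 3),
    ("tegra-201", "foopy-b", "bm20", 3),
    ("tegra-202", "foopy-b", "bm20", 11),
]


def tally(records, field_index):
    """Count failures per distinct value of one field."""
    return Counter(record[field_index] for record in records)


by_foopy = tally(failures, 1)
by_master = tally(failures, 2)
by_hour = tally(failures, 3)

# Print the most frequent values first for each dimension.
for label, counts in (("foopy", by_foopy), ("master", by_master), ("hour", by_hour)):
    print(label, counts.most_common())
```

Raw counts only mean something once they are divided by how many tegras each foopy or master carries, and by how many jobs ran in each hour, since a foopy with more tegras will naturally show more failures.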

Example of "new signature":
07:53 < philor> hwine: https://tbpl.mozilla.org/php/getParsedLog.php?id=8441984&tree=Mozilla-Inbound is traditional purple, it was just sitting there and then there's a "process killed by signal 15"
From that log:
========= Started Cleanup Device failed (results: 2, elapsed: 19 secs) ==========
python /builds/sut_tools/cleanup.py 10.250.50.87
 in dir /builds/tegra-177/test/. (timeout 1200 secs)
 watching logfiles {}
 argv: ['python', '/builds/sut_tools/cleanup.py', '10.250.50.87']
 environment:
  PATH=/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
  PWD=/builds/tegra-177/test
  SUT_IP=10.250.50.87
  SUT_NAME=tegra-177
  __CF_USER_TEXT_ENCODING=0x1F5:0:0
 closing stdin
 using PTY: False
process killed by signal 15
program finished with exit code -1
elapsedTime=19.277863
======== Finished Cleanup Device failed (results: 2, elapsed: 19 secs) ========

========= Started  (results: not started, elapsed: not started) ==========
======== Finished  (results: not started, elapsed: not started) ========

========= Started  (results: not started, elapsed: not started) ==========
======== Finished  (results: not started, elapsed: not started) ========

========= Started  (results: not started, elapsed: not started) ==========
======== Finished  (results: not started, elapsed: not started) ========

========= Started  (results: not started, elapsed: not started) ==========
======== Finished  (results: not started, elapsed: not started) ========

========= Started  (results: not started, elapsed: not started) ==========
======== Finished  (results: not started, elapsed: not started) ========

========= Started Reboot Device interrupted (results: 4, elapsed: 2 secs) ==========

remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
======== Finished Reboot Device interrupted (results: 4, elapsed: 2 secs) ========
Summary: Intermittent clumps of Tegras attached to bm-19 and bm-20 disconnecting like a honey badger → Tegras attached to bm-19 and bm-20 exhibiting a never-seen-before failure signature
Sorry, I misled you - that "traditional purple" is the well-known failure, that's bug 660480 (hmm, or is it? I don't remember whether they've always come with a disconnect in the reboot step, because I didn't have any need to look below the "process killed by signal 15").

The non-traditional, new purple is https://tbpl.mozilla.org/php/getParsedLog.php?id=8442422&tree=Mozilla-Inbound, where the whole run went perfectly up until it's running `python /builds/sut_tools/reboot.py 10.250.49.70`, the reboot step, and somewhere in the middle of that, like that one where it got caught while dumping a logcat line, there's a sudden

remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
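
To sort logs into the two flavors mechanically, something like the following sketch might do. It assumes the step headers and failure lines look like the excerpts quoted in this bug; `classify` and the labels "traditional" and "new" are names made up here, not part of buildbot or tbpl.

```python
import re

# Matches buildbot step headers as they appear in the excerpts in this bug, e.g.
# "========= Started Reboot Device interrupted (results: 4, elapsed: 2 secs) =========="
# The captured name includes the trailing status word ("failed", "interrupted").
STEP_HEADER = re.compile(r"^=+ Started (?P<name>.*?) ?\(results: (?P<results>[^,]+),")


def classify(log_text):
    """Return (signature, step) for one log.

    signature is "traditional" if a "process killed by signal 15" line
    appears anywhere, "new" if only a ConnectionLost appears, else "other".
    step is the header name of the step in which ConnectionLost was first
    seen, or None if it was never seen.
    """
    current_step = None
    killed = False
    lost_in = None
    for line in log_text.splitlines():
        match = STEP_HEADER.match(line)
        if match:
            current_step = match.group("name").strip()
        elif "process killed by signal 15" in line:
            killed = True
        elif "twisted.internet.error.ConnectionLost" in line and lost_in is None:
            lost_in = current_step
    if killed:
        return "traditional", lost_in
    if lost_in is not None:
        return "new", lost_in
    return "other", None
```

Recording the step alongside the signature makes it possible to count how often the disconnect lands in the reboot step versus anywhere else.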

There are three possibilities for comment 14 (at least):

* that ain't it, because of things like https://tbpl.mozilla.org/php/getParsedLog.php?id=8212921&tree=Mozilla-Aurora where the disconnect was in the clobber build tools step
* there are actually two (or more) new failure modes, one that's due to timing in the reboot step and one or more others that explain sudden disconnects in other steps
* disconnecting in random steps isn't really new, I just blew it off before, and then tried to lump it in with this
Or https://tbpl.mozilla.org/php/getParsedLog.php?id=8457729&tree=Mozilla-Beta where the disconnect is in the tiresome `rm -rfv build` step, from which I would forcibly disconnect if I was a foopy.
I don't really pay attention to individual failures, because there are just too many of them, but it could well be that rather than what master or what foopy, it's a particular set of tegras (well, which can also mean a particular foopy) - feels like as I glance at the number, it's always over 250.
And given the timing of them this morning, one of the things I lump under this bug, the lost connection during the reboot step, may be associated with reconfigs: there was a large batch right around the 11:40 reconfig.
removing from my queue so this can be triaged
Assignee: bear → nobody
Priority: -- → P3
https://tbpl.mozilla.org/php/getParsedLog.php?id=9897894&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=9897922&tree=Mozilla-Inbound

("Tegras sometimes disconnect prematurely during the reboot step, singly or in small or large groups, and sometimes philor notices it because they didn't also have another failure" would probably be a more accurate summary.)
Unless I'm mistaken, *all* production tegras are attached to bm-19 or bm-20 atm.
Shortening the summary.
Summary: Tegras attached to bm-19 and bm-20 exhibiting a never-seen-before failure signature → Tegras exhibiting a never-seen-before failure signature
https://tbpl.mozilla.org/php/getParsedLog.php?id=10128958&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10129019&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10128963&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10128668&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10129006&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10128386&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10128176&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10129131&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10129081&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10129811&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10129012&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10128960&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10129015&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10128959&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10129795&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10129079&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10129102&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10129132&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10129083&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10128999&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10128476&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10129046&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10129016&tree=Mozilla-Inbound
In general, we lose connection.

Hopefully, distributing them among three masters might help with this (bug 734393).
Depends on: 734393
Summary: Tegras exhibiting a never-seen-before failure signature → Tegras disconnect with "Connection to the other side was lost in a non-clean fashion." or similar at any given step
There are actually quite a few things in here, and I think one of them, the one I notice the most, is actually "the reboot step takes way too long, and there is something that releng does which causes a large number of tegras to disconnect, and since the reboot step takes way too long, a large percentage of the disconnecting tegras are doing so in the reboot step."
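
The arithmetic behind that guess, as a rough sketch: if whatever causes the disconnect hits at a moment unrelated to what the tegra is doing, each step's share of the disconnects is just its share of the job's wall-clock time. The step names and durations below are hypothetical placeholders, not measurements from any log.

```python
# Hypothetical step durations in seconds; placeholders, not measured values.
step_seconds = {
    "cleanup": 20,
    "run tests": 600,
    "reboot": 300,
}


def share_of_disconnects(durations):
    """Expected fraction of disconnects per step, assuming the disconnect
    moment is independent of the step and uniform over the whole job."""
    total = sum(durations.values())
    return {step: seconds / total for step, seconds in durations.items()}


shares = share_of_disconnects(step_seconds)
for step, share in shares.items():
    print(f"{step}: {share:.1%}")
```

If the observed share of disconnects in the reboot step is well above its share of the run time, then something about the reboot step itself would be implicated rather than just its length.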
https://tbpl.mozilla.org/php/getParsedLog.php?id=10244506&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10246035&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10244877&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10245057&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10243784&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10244866&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10243778&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10244796&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10243786&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10243768&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10244869&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10244513&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10243704&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10244482&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10244547&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10245078&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10243852&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10243830&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10243968&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10243915&tree=Mozilla-Inbound