Closed Bug 681855 Opened 13 years ago Closed 11 years ago

Frequent Tegra "Cleanup Device exception" or "Configure Device exception" from "Remote Device Error: devRoot from devicemanager [None] is not correct"

Categories

(Release Engineering :: General, defect, P3)

ARM
Android
defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: philor, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [android_tier_1_mozharness][marked as RETRY])

Attachments

(2 files)

It's all Greek to me, so no idea whether this is releng or ateam.

http://tbpl.allizom.org/php/getParsedLog.php?id=6113410
Android Tegra 250 mozilla-inbound opt test jsreftest-2 on 2011-08-24 19:57:22 PDT

========= Started Cleanup Device exception (results: 4, elapsed: 30 secs) ==========
python /builds/sut_tools/cleanup.py 10.250.49.79 in dir /builds/tegra-091/test/. (timeout 1200 secs)
...
Warning proxy.flg found during cleanup
Connecting to: 10.250.49.79
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
devroot None
/builds/tegra-091/test/../error.flg
Remote Device Error: devRoot from devicemanager [None] is not correct
program finished with exit code 1
elapsedTime=30.138345
======== Finished Cleanup Device exception (results: 4, elapsed: 30 secs) ========
Nope, still can't be done. Between this and bug 660480, if I try to star each individual Android failure, by the time I finish two pushes, the retriggers on the first have started to fail plus there's another new push also starting to fail. Turn this one and that into automatic retries, and maybe I could look at the remaining failures as individuals instead of a mass of "infrastructure; retriggered" (and for bonus fun, I might actually recognize regressions from pushes before I've blindly retriggered the job three or four times).
Priority: -- → P3
Whiteboard: [orange] → [orange][android_tier_1]
I read through a bunch of the logs and matched them up with clientproxy log items, and I think I have found that when devicemanager loses contact with sutagent after the install has happened, the step fails *but* proxy.flg is still active. When buildbot then reboots the build, clientproxy doesn't behave properly.
Attachment #555864 - Flags: review?(aki)
Attachment #555864 - Flags: review?(aki) → review+
Assignee: nobody → bear
https://tbpl.mozilla.org/php/getParsedLog.php?id=6393015&tree=Firefox

========= Started Configure Device exception (results: 4, elapsed: 24 secs) ==========
python /builds/sut_tools/config.py 10.250.49.36 jsreftest in dir /builds/tegra-049/test/. (timeout 1200 secs)
watching logfiles {}
argv: ['python', '/builds/sut_tools/config.py', '10.250.49.36', 'jsreftest']
environment:
PATH=/opt/local/bin:/opt/local/sbin:/opt/local/Library/Frameworks/Python.framework/Versions/2.6/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
PWD=/builds/tegra-049/test
SUT_IP=10.250.49.36
SUT_NAME=tegra-049
__CF_USER_TEXT_ENCODING=0x1F6:0:0
closing stdin
using PTY: False
connecting to: 10.250.49.36
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
devroot None
/builds/tegra-049/test/../error.flg
Remote Device Error: devRoot from devicemanager [None] is not correct - exiting
process killed by signal 15
program finished with exit code -1
elapsedTime=24.209561
======== Finished Configure Device exception (results: 4, elapsed: 24 secs) ========
Summary: Frequent Tegra "Cleanup Device exception" from "Remote Device Error: devRoot from devicemanager [None] is not correct" → Frequent Tegra "Cleanup Device exception" or "Configure Device exception" from "Remote Device Error: devRoot from devicemanager [None] is not correct"
https://tbpl.mozilla.org/php/getParsedLog.php?id=6412335&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6412687&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6412692&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6413344&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6413349&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6413346&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6410641&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6410649&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6410517&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6405855&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6410513&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6405553&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6410505&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6410521&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6410518&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6406607&tree=Mozilla-Beta https://tbpl.mozilla.org/php/getParsedLog.php?id=6413842&tree=Mozilla-Beta https://tbpl.mozilla.org/php/getParsedLog.php?id=6413965&tree=Mozilla-Beta https://tbpl.mozilla.org/php/getParsedLog.php?id=6406414&tree=Mozilla-Beta https://tbpl.mozilla.org/php/getParsedLog.php?id=6406419&tree=Mozilla-Beta https://tbpl.mozilla.org/php/getParsedLog.php?id=6406410&tree=Mozilla-Beta https://tbpl.mozilla.org/php/getParsedLog.php?id=6406411&tree=Mozilla-Beta https://tbpl.mozilla.org/php/getParsedLog.php?id=6414209&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6414100&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6413980&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6413970&tree=Mozilla-Aurora 
https://tbpl.mozilla.org/php/getParsedLog.php?id=6413969&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6414104&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6414102&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6413834&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6413831&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6413722&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6413707&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6413704&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6413718&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6414809&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6414800&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6414804&tree=Mozilla-Inbound
Blocks: 692662
Attached patch: RETRY
We already do this manually for 98% of the failures, so the only difference should be that I don't rage while I have to star them, and bhearsum doesn't have to think of new blasphemies or anatomically-unlikely things to say between "that" and "philor" every time he gets bugspam from this bug.
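For readers unfamiliar with the approach, here is a minimal sketch of what "turn this failure into an automatic retry" can look like. It is not the attached patch: the function name `classify_log`, the `RETRY_PATTERNS` list, and the result-code constants are assumptions made for illustration (the constants follow the usual Buildbot ordering, in which the "results: 4" seen in the logs above would be EXCEPTION). Only the error string itself is taken from this bug.

```python
import re

# Assumed result codes, following the usual Buildbot ordering.
# "results: 4" in the logs above would then correspond to EXCEPTION.
SUCCESS, WARNINGS, FAILURE, SKIPPED, EXCEPTION, RETRY = range(6)

# Error string copied from the logs in this bug; the list itself is hypothetical.
RETRY_PATTERNS = [
    re.compile(
        r"Remote Device Error: devRoot from devicemanager \[None\] is not correct"
    ),
]


def classify_log(log_text, default_result):
    """Return RETRY if the log shows a known infrastructure error.

    Otherwise return the result the step would have had anyway.
    """
    for pattern in RETRY_PATTERNS:
        if pattern.search(log_text):
            return RETRY
    return default_result
```

With something like this in place, a step that dies because the device never answered is rescheduled instead of showing up as an exception that has to be starred by hand.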
Attachment #569276 - Flags: review?(bear)
https://tbpl.mozilla.org/php/getParsedLog.php?id=7014718&tree=Mozilla-Inbound And believe me, I'm not anywhere *near* that flexible.
Attachment #569276 - Flags: review?(bear) → review+
This made it to production today.
Depends on: 690311
Whiteboard: [orange][android_tier_1] → [orange][android_tier_1_mozharness]
Ambitiously resolving this as WORKSFORME, given no reported occurrences in 3 months.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WORKSFORME
https://tbpl.mozilla.org/php/getParsedLog.php?id=8375698&tree=Mozilla-Inbound It could be FIXED, if it actually turns out that the best way to deal with whatever state causes this is "spend 6 minutes getting ready, find this is hosed, RETRY," but it isn't gone, just wallpapered.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
(In reply to Phil Ringnalda (:philor) from comment #382)
> It could be FIXED, if it actually turns out that the best way to deal with
> whatever state causes this is "spend 6 minutes getting ready, find this is
> hosed, RETRY," but it isn't gone, just wallpapered.

My apologies, philor. I didn't realize that we had essentially paved over this problem rather than fixing it. I'll comment in bug 690311, which seems to be the great white hope.
removing from my queue so it can be triaged
Assignee: bear → nobody
At risk of getting restarred - we sort of forgot in bug 660480 that adding a log_eval_func on the cleanup buildstep would mean that the global one where this was setting retry wouldn't be used anymore, so these'll be red again for a while.
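The override described here can be sketched roughly as follows. This is a simplified model, not the real buildbot configuration: `evaluate_step`, `global_eval`, `cleanup_eval`, and `cleanup_eval_with_retry` are hypothetical names, and the result codes are assumed to follow the usual Buildbot ordering. It only shows why a step-specific evaluator silently drops the retry behaviour, and one way to chain the two.

```python
# Assumed result codes, following the usual Buildbot ordering.
SUCCESS, WARNINGS, FAILURE, SKIPPED, EXCEPTION, RETRY = range(6)

DEVROOT_ERROR = (
    "Remote Device Error: devRoot from devicemanager [None] is not correct"
)


def global_eval(log_text, exit_code):
    """Hypothetical global evaluator: this is where RETRY gets set."""
    if DEVROOT_ERROR in log_text:
        return RETRY
    return SUCCESS if exit_code == 0 else FAILURE


def cleanup_eval(log_text, exit_code):
    """Hypothetical step-specific evaluator that knows nothing about RETRY."""
    return SUCCESS if exit_code == 0 else EXCEPTION


def evaluate_step(log_text, exit_code, log_eval_func=None):
    """A step-specific evaluator, if given, replaces the global one entirely."""
    evaluator = log_eval_func or global_eval
    return evaluator(log_text, exit_code)


def cleanup_eval_with_retry(log_text, exit_code):
    """One possible fix: consult the global retry check first."""
    if global_eval(log_text, exit_code) == RETRY:
        return RETRY
    return cleanup_eval(log_text, exit_code)
```

In this model, passing `cleanup_eval` to the cleanup step makes the devRoot failure come back as an exception, while `cleanup_eval_with_retry` keeps both behaviours.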
We put off (and then put off and then put off ...) a bunch of work on avoiding these failures because mozharness was supposed to save us. I'm currently pushing to do more of the initial cleanup work on the foopys prior to even marking the tegra as ready to accept jobs. Not sure if there's a bug on file for this yet. If there isn't, I'll file one. This should hopefully avoid some of these cleanup exceptions and also make the actual test cycle on the tegras a little more compact, at the expense of slightly longer reboot cycles. If the tegras that *do* start running jobs are more likely to succeed, that's probably a win though.
Whiteboard: [orange][android_tier_1_mozharness] → [orange][android_tier_1_mozharness][marked as RETRY]
Whiteboard: [orange][android_tier_1_mozharness][marked as RETRY] → [android_tier_1_mozharness][marked as RETRY]
Product: mozilla.org → Release Engineering
Status: REOPENED → RESOLVED
Closed: 13 years ago → 11 years ago
Resolution: --- → WORKSFORME