Something has broken in tegra recovery

RESOLVED FIXED

Status

Infrastructure & Operations
CIDuty
--
major
RESOLVED FIXED
6 years ago
3 months ago

People

(Reporter: philor, Unassigned)

(Reporter)

Description

6 years ago
Last week, we had three tegra recovery bugs: bug 806950, bug 807663 and bug 807965.

Every single tegra involved in those now has its tracking bug reopened because it is broken: failing more than 50% of the time, and doing so in suspicious ways, like timing out in reftests and mochitests and failing to even initialize the browser in talos.

Particularly telling are tegra-093 and tegra-182, because from looking at buildapi/recent, there's no evidence I can see that they had a problem at all, but then they got recovered, and have turned into broken tegras.

The one before those was bug 802655, which did three tegras, two of which I persuaded Callek were bad hardware and should be scrapped, the third of which I regretted having not included in that tar-brushing a day later.

The one before that was bug 792692, which only did two of the tegras in the 300s, which are all awful and thus hard to tell about, but one of the two, tegra-336, seems to have been restarted on November 1st and to be running okay.

So, what could have changed between 2012-09-21, the last time we did a successful reimage, and now?
(Reporter)

Comment 1

6 years ago
tegra-064 got reimaged in bug 807963 rather than a tegra-recovery bug, but it's broken just the same.
Blocks: 778833

Updated

6 years ago
Blocks: 808468

Updated

6 years ago
Depends on: 808474

Comment 2

6 years ago
>So, what could have changed between 2012-09-21, the last time we did a successful reimage, and now?

>tegra-064 got reimaged in bug 807963 rather than a tegra-recovery bug, but it's broken just the same.

:philor, is there a way to check if the latest image was used in tegra-064? I know there are several images on the imaging netbook and want to confirm the latest image has been used since 9-21.
(Reporter)

Comment 3

6 years ago
s/philor/Callek/, since I'm a volunteer who looks at logs after test jobs finish, not a releng employee with access to anything.
Flags: needinfo?(bugspam.Callek)
(Reporter)

Comment 4

6 years ago
tegra-057 got a reimage in bug 807962, and is also busted.
Blocks: 778920

Comment 5

6 years ago
Hi,

I just reimaged tegra-057 and tegra-064 with what is supposed to be the correct image.  Is there a way you can run tests on them to confirm they're working as normal?

Thanks,
Van

Comment 6

6 years ago
(In reply to Van Le [:van] from comment #5)
> I just reimaged tegra-057 and tegra-064 with what is supposed to be the
> correct image.  Is there a way you can run tests on them to confirm they're
> working as normal?

Apparently we crossed streams, and I never took down 057 first -- but no big worry there, it will pick up a new job soon. I've just started up 064 as well.

*leaving* their problem tracking bugs open for now
Flags: needinfo?(bugspam.Callek)
(Reporter)

Comment 7

6 years ago
The fact that it's difficult to say whether or not 057 and 064 got another "bad image" doesn't bode well for that email thread about verifying that tegras are in good shape before putting them back in service.

064 still has a busted sd card, whether because it got one bad one replaced by another, or it has a busted slot, or something less imaginable - every other test run was failing by not being able to write to the card. The ones that did run... there were only four, only one failed, but in a suspicious way.

057 is probably busted in the bad-image way - it's done 15 green runs and 12 non-green, which is a bit higher than the average success rate for broken ones, but well below the average for unbroken ones.

But if the eventual post-image verification process takes two days to be sure a tegra is healthy, that's not going to be very handy.
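
To put rough numbers on why it takes so long to be sure, here is a minimal sketch that computes 057's success rate from the counts above and a Wilson score interval around it. The function name and the choice of a 95% interval are my own assumptions for illustration, not part of any existing tegra tooling.

```python
import math

def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if total == 0:
        raise ValueError("need at least one run")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

# Counts for tegra-057 as reported in this comment.
green, non_green = 15, 12
total = green + non_green
rate = green / total
low, high = wilson_interval(green, total)
print(f"success rate {rate:.1%}, 95% interval {low:.1%} to {high:.1%}")
```

With only 27 runs the interval spans several tens of percentage points, which fits the worry above: a handful of jobs is not enough to separate a merely unlucky tegra from a broken one.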
(Reporter)

Comment 8

6 years ago
057 is busted in the bad-image way, but it's ugly that it takes this long to be sure.

Updated

6 years ago
Blocks: 813012

Updated

6 years ago
No longer blocks: 813012

Comment 9

6 years ago
From what I can tell, the system is working as intended, and nothing unexpected changed. This bug is closeable.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED

Updated

6 years ago
No longer blocks: 808468
Depends on: 808468
(Assignee)

Updated

5 years ago
Product: mozilla.org → Release Engineering

Updated

3 months ago
Product: Release Engineering → Infrastructure & Operations