Closed Bug 1429595 Opened 2 years ago Closed Last year

Recent slowdown in buildbot-based Linux tests on ESR52

Categories

(Release Engineering :: General, enhancement)

enhancement
Not set

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: RyanVM, Unassigned)

References

Details

(Whiteboard: [stockwell infra])

Given what a low churn branch ESR52 is and the fact that these instances are still m1.medium AFAIK, I have a bad feeling this is tied to whatever Spectre/Meltdown mitigations are landing on AWS' end :(

The link below shows the issue pretty clearly: many test suites are becoming more orange-prone for reasons related to execution time, and even green suites are showing much longer runtimes than they used to. The Linux32 debug runs are the most obvious, but other non-tc prefixed jobs also show similar regressions.

https://treeherder.mozilla.org/#/jobs?repo=mozilla-esr52&filter-searchStr=linux%20debug%20test&group_state=expanded&fromchange=5b7d93f245ee9bd2d77d857b7828210452e280e8

I don't know what options we have here, realistically. My recollection is that we're stuck on m1.medium for performance reasons or something? Given that this mainly affects Linux32 tests, can we maybe consider turning them off? We have to support ESR52 until August at this point, so I don't think that leaving them in this failing state and trying to run out the clock is a viable choice.
I read somewhere that the Spectre/Meltdown mitigation patches have a bigger performance impact on paravirtual instance types than on HVM types. I can't remember where, but I'm pretty sure it was authored by an AWS engineer, so the information was credible.

I looked in the AWS web console and we're using paravirtual instance types for at least the gecko-t-linux-medium worker type. Sure enough, that's the worker type being used for the linux32 debug tasks that I clicked on :/

The good news is that it appears we no longer use the gecko-t-linux-medium worker type in mozilla-central. So our use of paravirtual may die off when we stop running CI for ESR52.
The issue is that that point is over 7 months away :(
I'd consider moving those tasks to a non-paravirtual instance and then mass disabling any tests that fail. We've moved on to different instance types in central. I'm optimistic >95% of the tests "just work" with a different instance type. I just don't know how many other changes were made to support the different instance type. Hopefully not many. You should be able to replace "gecko-t-linux-medium" and push to Try to get a feel for things.
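A minimal sketch of the search-and-replace step suggested above, assuming the worker type appears as a literal string in the in-tree task configuration files. The function name, the file suffixes, and the replacement worker type are all hypothetical; the thread does not name the worker type to switch to, so it is left as a placeholder argument.

```python
from pathlib import Path

# Worker type named in this bug as running on paravirtual instances.
OLD_WORKER_TYPE = "gecko-t-linux-medium"


def replace_worker_type(root, new_worker_type,
                        old_worker_type=OLD_WORKER_TYPE,
                        suffixes=(".yml", ".yaml", ".py")):
    """Replace old_worker_type with new_worker_type in files under root.

    Only files whose suffix is in `suffixes` are considered (an assumption
    about where the task configuration lives). Returns the sorted list of
    paths that were rewritten.
    """
    changed = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            # Skip files that are not valid UTF-8 text.
            continue
        if old_worker_type in text:
            path.write_text(text.replace(old_worker_type, new_worker_type),
                            encoding="utf-8")
            changed.append(path)
    return changed
```

Something like `replace_worker_type("taskcluster", "REPLACEMENT-WORKER-TYPE")` (both the directory and the replacement name are placeholders) would rewrite the matching files in a local checkout; the result could then be pushed to Try to see which tests still fail on the new instance type.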
Linking to bug that switched us off m1.medium.
Depends on: 1411334
And a few more.
Depends on: 1281241, 1361476
Whiteboard: [stockwell disable-recommended] → [stockwell infra]
I've been trying to retrigger these until they get green (or, at least as common, until we get another push and I move on to retriggering there), but now I'm done. If treeherder hadn't removed its UI for hiding jobs, they would be hidden. Just shut them off, since nobody's going to fix them.
Component: Infrastructure: AWS → General
Product: Infrastructure & Operations → Release Engineering
QA Contact: cshields
(In reply to Phil Ringnalda (:philor) from comment #15)
> I've been trying to retrigger these until they get green (or, at least as
> common, until we get another push and I move on to retriggering there), but
> now I'm done. If treeherder hadn't removed its UI for hiding jobs, they
> would be hidden. Just shut them off, since nobody's going to fix them.

Yes, I think it's important to be explicit here. We are functionally running out the clock here rather than investing engineering effort to fix tests that will go away in 6 months.
(In reply to Phil Ringnalda (:philor) from comment #15)
> I've been trying to retrigger these until they get green (or, at least as
> common, until we get another push and I move on to retriggering there), but
> now I'm done. If treeherder hadn't removed its UI for hiding jobs, they
> would be hidden. Just shut them off, since nobody's going to fix them.

@buildduty - perhaps this is something you could look into too
buildbot & ESR52 are EOL
Status: NEW → RESOLVED
Closed: Last year
QA Contact: catlee
Resolution: --- → WONTFIX