Bug 1408389 (Closed) · Opened 2 years ago · Closed 2 years ago

When trying to run tests on m3.large (instead of m1.medium), I get many blue jobs in Treeherder

Categories

(Taskcluster :: General, defect)


Tracking

(Not tracked)

RESOLVED FIXED
mozilla58

People

(Reporter: jmaher, Assigned: jmaher)

References

Details

Attachments

(1 file)

https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&fromchange=f8545b82c78b04af34da0e6d895b48153584a3cc&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=usercancel&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable&filter-resultStatus=retry&selectedJob=136795928

I don't know why we get blue jobs; there is no log or other metadata. In order to switch away from m1.medium we need mostly green jobs. I assume this is a system-level error at Amazon where the machine gets yanked. I find it odd that it occurs on specific job types, which suggests it is related to the tests being run, although the lack of logs leaves me confused.
Are these the same tests that we couldn't get to run on anything but m1.mediums before?  If I recall, those were failing (orange) not ?? (blue).  I think the rough consensus was that they were concurrency-related tests and failed on a multi-CPU instance type (which just about everything but m1.medium is).

If this is the same, let's find and link to that bug for context.

Either way, we should be able to dig up some logging for those instances.
These are the same tests we tried to run on m3.large in the past and identified as too flaky or permanently failing. There were 5 test jobs still using legacy; 3 of them are OK to move, but the last 2 test suites are where I get a lot of the blue jobs.

I am doing a quick pass on many of the other failures to hunt down what I see in the logs. If there are other explanations for the blue jobs, that would be good to know as well. I found bug 1281241 (which this bug blocks) as a reference for previous work done to get off the m1.mediums.
I'll pull the logs for those instances (in a bit...)
Flags: needinfo?(dustin)
Looking at one of the machines, things start crashing (including the worker) because the machine is out of memory:

Oct 13 06:50:01 docker-worker.aws-provisioner.us-east-1e.ami-98a16ee2.m3-large.i-09d72485b13b76e53 docker-worker: Uncaught Exception! Attempting to report to Sentry and crash.
Oct 13 06:50:01 docker-worker.aws-provisioner.us-east-1e.ami-98a16ee2.m3-large.i-09d72485b13b76e53 docker-worker: Error: spawn ENOMEM
Oct 13 06:50:01 docker-worker.aws-provisioner.us-east-1e.ami-98a16ee2.m3-large.i-09d72485b13b76e53 docker-worker:     at exports._errnoException (util.js:1026:11)
Oct 13 06:50:01 docker-worker.aws-provisioner.us-east-1e.ami-98a16ee2.m3-large.i-09d72485b13b76e53 docker-worker:     at ChildProcess.spawn (internal/child_process.js:313:11)
Oct 13 06:50:01 docker-worker.aws-provisioner.us-east-1e.ami-98a16ee2.m3-large.i-09d72485b13b76e53 docker-worker:     at exports.spawn (child_process.js:380:9)
Oct 13 06:50:01 docker-worker.aws-provisioner.us-east-1e.ami-98a16ee2.m3-large.i-09d72485b13b76e53 docker-worker:     at Object.exports.execFile (child_process.js:143:15)
Oct 13 06:50:01 docker-worker.aws-provisioner.us-east-1e.ami-98a16ee2.m3-large.i-09d72485b13b76e53 docker-worker:     at exports.exec (child_process.js:103:18)
Oct 13 06:50:01 docker-worker.aws-provisioner.us-east-1e.ami-98a16ee2.m3-large.i-09d72485b13b76e53 docker-worker:     at Object.check (/home/ubuntu/docker_worker/node_modules/diskspace/diskspace.js:56:3)
Oct 13 06:50:01 docker-worker.aws-provisioner.us-east-1e.ami-98a16ee2.m3-large.i-09d72485b13b76e53 docker-worker:     at exports.default (/home/ubuntu/docker_worker/src/lib/stats/host_metrics.js:43:13)
Oct 13 06:50:01 docker-worker.aws-provisioner.us-east-1e.ami-98a16ee2.m3-large.i-09d72485b13b76e53 docker-worker:     at ontimeout (timers.js:365:14)
Oct 13 06:50:01 docker-worker.aws-provisioner.us-east-1e.ami-98a16ee2.m3-large.i-09d72485b13b76e53 docker-worker:     at tryOnTimeout (timers.js:237:5)
Oct 13 06:50:01 docker-worker.aws-provisioner.us-east-1e.ami-98a16ee2.m3-large.i-09d72485b13b76e53 docker-worker:     at Timer.listOnTimeout (timers.js:207:5)
This is great info!  I need to look at the passing ones and see what the memory usage is.
Flags: needinfo?(dustin)
We fixed a damp test; now we need to run damp somewhere other than legacy. On the default instance type (m3.large) we run out of memory! For 7.5 GB of memory, that isn't good. But thanks to the data in this bug, I moved to xlarge and it works great:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d4f6786669723bccabf73c864cf3e9342792d9c6
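The attachment itself isn't reproduced on this page. Schematically, in-tree test task definitions of this era selected the EC2 worker size with an `instance-size` key, so the change would look roughly like this (illustrative sketch only, not the actual patch):

```yaml
# Illustrative sketch — not the actual attachment 8920556.
# Moves the damp suite (and the asan jobs) off the legacy workers
# onto xlarge instances.
damp:
    instance-size: xlarge  # runs out of memory on default (m3.large)
```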
Assignee: nobody → jmaher
Status: NEW → ASSIGNED
Attachment #8920556 - Flags: review?(gbrown)
Comment on attachment 8920556 [details] [diff] [review]
run damp/asan tests on xlarge instead of legacy

Review of attachment 8920556 [details] [diff] [review]:
-----------------------------------------------------------------

I suggest clarifying the comment, maybe, "runs out of memory on default/m3.medium"
Attachment #8920556 - Flags: review?(gbrown) → review+
s/m3.medium/m3.large/
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #6)
> Created attachment 8920556 [details] [diff] [review]
> run damp/asan tests on xlarge instead of legacy
> 
> we fixed a damp test, now we need to run damp not on legacy.  Doing the
> default instance type (m3.large), we run out of memory!  for 7.5GB of
> memory, that isn't good- but thanks to the data in this bug, I moved to
> xlarge and it works great:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=d4f6786669723bccabf73c864cf3e9342792d9c6

Interesting that we run out of memory on m3.large but not m1.medium.  m1.medium has half the memory of an m3.large.
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/11d443e7b098
run devtools on asan and xlarge. r=gbrown
m1.medium is single-core and m3.large is multi-core; I suspect we are chewing up much more memory per process/thread than we would on m1.medium.
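One way that resolves the puzzle above, with purely illustrative numbers (the RAM figures are the public EC2 specs; the per-process sizes are assumptions): if each test process's footprint itself grows with core count (thread pools, caches sized by available cores), total usage grows superlinearly with cores, and a 2-core/7.5 GB m3.large can hit OOM where a 1-core/3.75 GB m1.medium does not.

```javascript
// Illustrative model (hypothetical numbers): the harness runs one test
// process per core, and each process's footprint grows with core count
// because thread pools and caches are sized by available cores.
function totalMemoryMB(cores, basePerProcMB, perCoreOverheadMB) {
  const perProcessMB = basePerProcMB + cores * perCoreOverheadMB;
  return cores * perProcessMB;
}

// m1.medium: 1 core, 3750 MB RAM -> 1 * (2500 + 800)     = 3300 MB, fits
const m1medium = totalMemoryMB(1, 2500, 800);
// m3.large:  2 cores, 7500 MB RAM -> 2 * (2500 + 2*800)  = 8200 MB, OOM
const m3large = totalMemoryMB(2, 2500, 800);
```

Under this (assumed) scaling, doubling both cores and RAM is not enough headroom, which matches what was observed on try.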
https://hg.mozilla.org/mozilla-central/rev/11d443e7b098
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58