Many retries on Linux SM tsan builds that sometimes end up in exceptions - no log

RESOLVED FIXED in Firefox 66

Status

enhancement
RESOLVED FIXED
Opened: 7 months ago
Closed: 7 months ago

People

(Reporter: CosminS, Assigned: sfink)

Tracking

Version: unspecified
Target Milestone: mozilla66

Firefox Tracking Flags

(firefox66 fixed)

Attachments

(3 attachments)

Given that it's just one occurrence, I don't think this is anything to worry about atm.

Flags: needinfo?(nfroyd)

It actually happens regularly if you look further down on treeherder. Exceptions with no logs suggest out of memory. Presumably, bug 1516575 would help get logs. Those jobs run on gecko-3-b-linux; maybe the workers they got didn't have enough memory. At least, the jobs that do end up green ran on instances that have a lot of memory (m4.4xlarge, m5d.4xlarge), but sadly there's no record of their resource usage, unlike other builds.

So, the worker type is apparently configured with [cm].4xlarge instances. The m* ones have 64GB of memory, and the c* ones have 32GB. I also confirmed that all the recent failing jobs (where it's still possible to figure out which instance type they ran on) were on c* instances.

Switching to gecko-3-b-linux-xlarge would give instances with at least 72GB of memory. But that raises the question: do we expect those jobs to require more than 32GB?

Flags: needinfo?(sphink)

https://clang.llvm.org/docs/ThreadSanitizer.html says tsan should have a 5x-10x memory overhead. At 10x, running on 32GB should be similar to running with 3GB, which I would expect to be more than enough.
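
Spelled out (my own back-of-the-envelope math; only the 5x-10x figure comes from the clang docs):

# Headroom check under tsan's documented 5x-10x memory overhead.
# The mapping to these instance sizes is my own rough math.
for total_gb in (32, 64):          # c*.4xlarge vs m*.4xlarge
    for overhead in (5, 10):
        print("%dGB at %dx overhead ~ effectively %.1fGB"
              % (total_gb, overhead, total_gb / float(overhead)))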

Perhaps it's due to running tests with lots of concurrent processes? I think it would be worth trying to drop that down and see what happens.

Flags: needinfo?(sphink)
Pushed by sfink@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e44a152c9f9f
Run tests with 2 concurrent processes instead of 8, r=glandium

Switching to -j2 appears to have slowed it down so much that it times out (without OOMing the whole box).

I did pushes with -j2 and -j5 while running under /usr/bin/time -v to see the memory usage. That showed that (1) the job is using a ton of memory, and (2) the -j setting has a large effect on running time but no effect on the maximum RSS size as measured by time -v.

I will have more results later, but when running on an xlarge instance (to avoid OOMing during the testing), I have

-j2 takes 135 minutes, peak RSS is 15.9GB
-j5 takes 66 minutes, peak RSS is 15.9GB

The test times vary quite a bit, so this shouldn't be taken as absolute truth, but I have a bunch of other pushes from before I got time -v working that show a large speedup from increasing -j values.

I think what this means is that time -v is showing a per-process value, so really it's just that the biggest JS test process that runs uses 15.9GB. Presumably, -j2 would then burn up to 2 x 15.9GB, while -j5 would go up to 5 x 15.9GB -- though it depends on the distribution of sizes across the different test processes, so I guess that doesn't mean all that much. I need something logging total system memory usage over time.
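
Something like this would do (a hypothetical sketch, assuming Linux's /proc/meminfo; not part of any patch here):

import time

# Hypothetical system-wide memory logger: sample /proc/meminfo (values
# are in kB) every few seconds and print how much of the box is in use.
def meminfo():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.split()[0])
    return info

while True:
    m = meminfo()
    used_gb = (m["MemTotal"] - m["MemAvailable"]) / 1048576.0
    print("%s %.1fGB used of %.1fGB"
          % (time.strftime("%H:%M:%S"), used_gb, m["MemTotal"] / 1048576.0))
    time.sleep(5)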

At least it validates that memory usage during tests can be massive. So:

(In reply to Mike Hommey [:glandium] from comment #4)

So, the worker type is apparently configured with [cm].4xlarge instances. The m* ones have 64GB of memory, and the c* ones have 32GB. I also confirmed that all the recent failing jobs (where it's still possible to figure out which instance type they ran on) were on c* instances.

Switching to gecko-3-b-linux-xlarge would give instances with at least 72GB of memory. But that raises the question: do we expect those jobs to require more than 32GB?

Apparently, yes. :-(

This is kind of a ridiculous amount of memory. I'm guessing that the usage is very uneven across tests, and there's probably a very small set of tests that require tons of memory. If I can identify those, then I can skip them for the tsan run.

I did a little poking around and it looks like on Linux this is where jittests actually wait for processes to complete:
https://searchfox.org/mozilla-central/rev/c21d6620d384dfb13ede6054015da05a6353b899/js/src/tests/lib/tasks_unix.py#167

You could swap out os.waitpid there for os.wait4 to get resource usage per-process:
https://docs.python.org/2/library/os.html#os.wait4

pid, status, rusage = os.wait4(0, os.WNOHANG)
# rusage.ru_maxrss is the child's peak RSS (in kilobytes on Linux), per
# https://docs.python.org/2/library/resource.html#resource.getrusage
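
As a standalone illustration of the same idea (hypothetical; the real change would go in that wait loop in tasks_unix.py):

import os

# Fork a child, reap it with os.wait4, and report its peak RSS.
# On Linux, ru_maxrss is in kilobytes.
pid = os.fork()
if pid == 0:
    os.execv("/bin/sleep", ["sleep", "10"])
wpid, status, rusage = os.wait4(pid, 0)
print("pid %d exited with status %d; peak RSS %.1fMB"
      % (wpid, status, rusage.ru_maxrss / 1024.0))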

I just prefixed each JS invocation with

/usr/bin/time -v -a -o $MOZ_UPLOAD_DIR/test_times.txt

instead. I'm doing a test run now with the jit-tests that use the most memory removed. I'll upload the file giving the memory sizes. I closed the terminal where I generated it, but the command was something like

perl -ne 'print "$1 " if m!^\s*Command being timed: ".*?([^/\s]*)"!; print "$1\n" if /Maximum resident.*?(\d+)/' /tmp/test_times.txt | sort -k2 -n
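
For reference, here's a rough Python equivalent of that extraction (my sketch; it assumes GNU time -v's "Command being timed:" / "Maximum resident set size" line format):

import re

# Pair each 'Command being timed' line with the following 'Maximum
# resident set size' line, then sort by peak RSS (reported in kbytes).
results = []
command = None
with open("/tmp/test_times.txt") as f:
    for line in f:
        m = re.match(r'\s*Command being timed: ".*?([^/\s]*)"', line)
        if m:
            command = m.group(1)
            continue
        m = re.search(r"Maximum resident.*?(\d+)", line)
        if m and command is not None:
            results.append((command, int(m.group(1))))
            command = None

for command, kbytes in sorted(results, key=lambda r: r[1]):
    print("%s %d" % (command, kbytes))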
Flags: needinfo?(sphink)

(Oh, and sure enough, the largest one was 15.9GB.)

Ooh, it's done! And in only 18 minutes, which is dramatically faster.

https://treeherder.mozilla.org/#/jobs?repo=try&author=sfink%40mozilla.com&fromchange=e38502952110ab3931767d17c61006bc6788d2ca&selectedJob=222090684

This was on an xlarge instance, so on the regular workers it probably won't be quite that good.

Attachment #9036470 - Attachment description: Bug 1519263 - Run tests with 2 concurrent processes instead of 8, r?glandium → Bug 1519263 - Skip tsan tests that consume too much memory
I'll leave this instrumentation patch attached here for future reference. It does not need to land.
Assignee: nobody → sphink
Status: NEW → ASSIGNED
Pushed by sfink@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7133682529a0
Skip tsan tests that consume too much memory r=jonco
Status: ASSIGNED → RESOLVED
Closed: 7 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla66