Many retries on Linux SM tsan builds that sometimes end up in exceptions - no log
Categories
(Firefox Build System :: General, enhancement)
Tracking
(firefox66 fixed)
| Release | Tracking | Status |
|---|---|---|
| firefox66 | --- | fixed |
People
(Reporter: CosminS, Assigned: sfink)
Details
Attachments
(3 files)
No logs from what I could find.
Comment 2•7 years ago
Given that it's just one occurrence, I don't think this is anything to worry about atm.
Comment 3•7 years ago
It actually happens regularly if you look further down on treeherder. Exceptions with no logs suggest the jobs ran out of memory. Presumably, bug 1516575 would help get logs. Those jobs run on gecko-3-b-linux; maybe the workers they got didn't have enough memory. At least, the jobs that do end up green ran on instances that have a lot of memory (m4.4xlarge, m5d.4xlarge), but sadly, unlike for other builds, there's no record of their resource usage.
Comment 4•7 years ago
So, the worker type is apparently configured with [cm]*.4xlarge instances. The m* ones have 64GB of memory, and the c* ones have 32GB. I also confirmed that all the recent failing ones (where it's still possible to figure out what the instance types were) were on c* instances.
Switching to gecko-3-b-linux-xlarge would give instances with at least 72GB of memory. But this raises the question: do we expect those jobs to require more than 32GB?
| Assignee | ||
Comment 5•7 years ago
https://clang.llvm.org/docs/ThreadSanitizer.html says tsan should have a 5x-10x memory overhead. At 10x, running a tsan build on 32GB should be similar to running a regular build with about 3GB, which I would expect to be more than enough.
Perhaps it's due to running tests with lots of concurrent processes? I think it would be worth trying to drop that down and see what happens.
| Assignee | ||
Comment 6•7 years ago
Comment 8•7 years ago
Backed out for spidermonkey bustages.
Backout link: https://hg.mozilla.org/integration/autoland/rev/89bf8ea5967c52f2e9f1bcc174dfdd81d0062143
Push link: https://hg.mozilla.org/integration/autoland/rev/e44a152c9f9f2859751dd43dc6ac057f643b70d0
Log link: https://treeherder.mozilla.org/logviewer.html#?job_id=221864839&repo=autoland
| Assignee | ||
Comment 9•7 years ago
Switching to -j2 appears to have slowed it down so much that it times out (without OOMing the whole box).
I did pushes with -j2 and -j5 while running under /usr/bin/time -v to see the memory usage. That showed that (1) the job is using a ton of memory, and (2) the -j setting has a large effect on running time but no effect on the maximum RSS as measured by time -v.
I will have more results later, but when running on an xlarge instance (to avoid OOMing during the testing), I have:
-j2 takes 135 minutes, peak RSS is 15.9GB
-j5 takes 66 minutes, peak RSS is 15.9GB
The test times vary quite a bit so this shouldn't be taken as absolute truth, but I have a bunch of other pushes from before I got the time -v working that show a large speedup from increasing -j values.
I think what this means is that time -v is showing a per-process value, so really it's just that the biggest JS test process that runs uses 15.9GB. Presumably, -j2 would then be burning up to 15.9x2 GB, while -j5 would go up to 15.9x5 GB -- though it depends on the distribution of sizes across different test processes, so I guess that doesn't mean all that much. I need something logging total system memory usage over time.
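For the system-wide number, one option is to sample /proc/meminfo in the background while the tests run. This is only a rough sketch, not anything from the tree: the function names are made up, and it assumes a Linux kernel new enough to report MemAvailable.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value.

    Most values are in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def used_kb(fields):
    """Memory in use, estimated as MemTotal minus MemAvailable (kB)."""
    return fields["MemTotal"] - fields["MemAvailable"]


def log_memory(path="/proc/meminfo", interval=5.0, samples=3):
    """Print a timestamped used-memory sample every `interval` seconds."""
    for _ in range(samples):
        with open(path) as f:
            fields = parse_meminfo(f.read())
        print("%d used_kb=%d" % (time.time(), used_kb(fields)))
        time.sleep(interval)
```

Running log_memory in a separate process alongside the test harness would give the total-usage-over-time curve that time -v can't, at the cost of also counting whatever else is running on the box.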
At least it validates that memory usage during tests can be massive. So:
(In reply to Mike Hommey [:glandium] from comment #4)
So, the worker type is apparently configured with [cm]*.4xlarge instances. The m* ones have 64GB of memory, and the c* ones have 32GB. I also confirmed that all the recent failing ones (where it's still possible to figure out what the instance types were) were on c* instances.
Switching to gecko-3-b-linux-xlarge would give instances with at least 72GB of memory. But this raises the question: do we expect those jobs to require more than 32GB?
Apparently, yes. :-(
This is a ridiculous amount of memory. I'm guessing that the usage is very uneven across tests, and there's probably a very small set of tests that require tons of memory. If I can identify those, then I can skip them for the tsan run.
Comment 10•7 years ago
I did a little poking around and it looks like on Linux this is where jittests actually wait for processes to complete:
https://searchfox.org/mozilla-central/rev/c21d6620d384dfb13ede6054015da05a6353b899/js/src/tests/lib/tasks_unix.py#167
You could swap out os.waitpid there for os.wait4 to get resource usage per-process:
https://docs.python.org/2/library/os.html#os.wait4
pid, status, rusage = os.wait4(0, os.WNOHANG)
# rusage.ru_maxrss is peak RSS per https://docs.python.org/2/library/resource.html#resource.getrusage
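To show the idea outside the harness, here is a standalone sketch of the same os.wait4 approach. It is not the tasks_unix.py code; run_and_measure is a hypothetical helper, and it waits on one specific child rather than polling with WNOHANG.

```python
import os


def run_and_measure(argv):
    """Run argv as a child process; return (exit_code, peak_rss_kb).

    Hypothetical helper for illustration only.
    """
    # Start the child without waiting for it; argv[0] is looked up on PATH.
    pid = os.spawnvp(os.P_NOWAIT, argv[0], argv)
    # Block until that child exits; rusage covers only this child.
    _, status, rusage = os.wait4(pid, 0)
    # On Linux, ru_maxrss is reported in kilobytes (macOS reports bytes).
    exit_code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return exit_code, rusage.ru_maxrss
```

Something like run_and_measure(["js", "-f", "test.js"]) would then return the peak RSS for that one test process, which is the same per-process number that time -v reports.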
| Assignee | ||
Comment 11•7 years ago
I just prefixed each JS invocation with
/usr/bin/time -v -o $MOZ_UPLOAD_DIR/test_times.txt
instead. I'm doing a test run now with the jit-tests using the most memory removed. I'll upload the file giving memory sizes. I closed the terminal where I generated it, but it was something like
perl -ne 'print "$1 " if m!^\s*Command being timed: ".*?([^/\s]*)"!; print "$1\n" if /Maximum resident.*?(\d+)/' /tmp/test_times.txt | sort -k2 -n
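For anyone who would rather not decode the perl, a rough Python equivalent follows. It assumes the file holds GNU time -v records back to back, and the function name is made up for illustration.

```python
import re

# Last path component of the last argument of the timed command,
# mirroring the capture in the perl one-liner.
CMD_RE = re.compile(r'^\s*Command being timed: ".*?([^/\s]*)"')
# First run of digits after "Maximum resident" (kbytes in GNU time output).
RSS_RE = re.compile(r'Maximum resident.*?(\d+)')


def peak_rss_by_test(text):
    """Return (test_name, max_rss_kb) pairs sorted by ascending RSS."""
    results = []
    name = None
    for line in text.splitlines():
        m = CMD_RE.search(line)
        if m:
            name = m.group(1)
            continue
        m = RSS_RE.search(line)
        if m and name is not None:
            results.append((name, int(m.group(1))))
            name = None
    return sorted(results, key=lambda pair: pair[1])
```

The biggest memory users end up at the bottom of the list, same as with sort -k2 -n.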
| Assignee | ||
Comment 12•7 years ago
| Assignee | ||
Comment 13•7 years ago
(Oh, and sure enough, the largest one was 15.9GB.)
Ooh, it's done! And in only 18 minutes, which is dramatically faster.
This was on an xlarge, so it probably won't be quite that good on the regular instances.
Updated•7 years ago
| Assignee | ||
Comment 14•7 years ago
| Assignee | ||
Updated•7 years ago
Comment 15•7 years ago
Comment 16•7 years ago
| bugherder | ||