Strings.huge_capacity started failing recently when run from a linux64 test machine

RESOLVED WONTFIX

Status

Testing
General
RESOLVED WONTFIX
3 years ago
a year ago

People

(Reporter: Chris Manchester (mostly offline July 16-20), Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

MozReview Requests

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(1 attachment)

Full log: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/cmanchester@mozilla.com-9cfdd7a56b31/try-linux64-debug/try_ubuntu64_vm-debug_test-gtest-bm52-tests1-linux64-build88.txt.gz

15:50:18  WARNING -  TEST-UNEXPECTED-FAIL | Strings.huge_capacity | Value of: l.SetCapacity(nsString::size_type(-1)/8 + 1, fallible)
15:50:18     INFO -    Actual: false
15:50:18     INFO -  Expected: true @ /builds/slave/try-l64-d-00000000000000000000/build/src/xpcom/tests/gtest/TestStrings.cpp:902
15:50:18  WARNING -  TEST-UNEXPECTED-FAIL | Strings.huge_capacity | Value of: l.SetCapacity(nsString::size_type(-1)/8 + 2, fallible)
15:50:18     INFO -    Actual: false
15:50:18     INFO -  Expected: true @ /builds/slave/try-l64-d-00000000000000000000/build/src/xpcom/tests/gtest/TestStrings.cpp:903
15:50:21  WARNING -  TEST-UNEXPECTED-FAIL | Strings.huge_capacity | Value of: m.SetCapacity(nsString::size_type(-1)/8 + 1001, fallible)
15:50:21     INFO -    Actual: false
15:50:21     INFO -  Expected: true @ /builds/slave/try-l64-d-00000000000000000000/build/src/xpcom/tests/gtest/TestStrings.cpp:907
15:50:32  WARNING -  TEST-UNEXPECTED-FAIL | Strings.huge_capacity | test completed (time: 31637ms)

I pushed this to try a few weeks ago and everything looked ok. I can't find a recent change to the test, I'm not sure what's going on here.
The opt version of the test passes.
Depends on: 1204153
This is caused by jemalloc 4 (try runs before[1] and after[2] bug 762449 landed). Moving from a builder to a tester in this case means going from 7 to 3.5 GiB of memory, we're running out of memory on the smaller instance with jemalloc 4.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=8ad70f11daf6&exclusion_profile=false
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=bdbda81d6226&exclusion_profile=false
Blocks: 762449
Short of a better idea here I think we need to disable the parts of this test that are failing.
Created attachment 8661359 [details]
MozReview Request: Bug 1204148 - Disable several assertions in TestStrings.cpp that are failing with jemalloc 4 on test machines.

Bug 1204148 - Disable several assertions in TestStrings.cpp that are failing with jemalloc 4 on test machines.
Attachment #8661359 - Flags: review?(nfroyd)
Comment on attachment 8661359 [details]
MozReview Request: Bug 1204148 - Disable several assertions in TestStrings.cpp that are failing with jemalloc 4 on test machines.

https://reviewboard.mozilla.org/r/19347/#review17303

I want to talk with mike to understand what's going on here before this lands, if it lands at all.

::: xpcom/tests/gtest/TestStrings.cpp:903
(Diff revision 1)
>      EXPECT_TRUE(l.SetCapacity(nsString::size_type(-1)/8, fallible));
>      EXPECT_TRUE(l.SetCapacity(nsString::size_type(-1)/8 + 1, fallible));
>      EXPECT_TRUE(l.SetCapacity(nsString::size_type(-1)/8 + 2, fallible));

I guess jemalloc4 is hanging on to large allocations, or something, in hopes that they might get reused soon?  The allocations just prior to this function would occupy virtually the entire address space if they were kept around, so maybe that's the problem...

Is there a tuning knob for this?

I don't feel completely comfortable landing this without understanding more of what jemalloc is doing behind the scenes.
Attachment #8661359 - Flags: review?(nfroyd)
Mike, questions to you from comment 5.  Is jemalloc4 doing funny things with large allocations that we need to tune somehow?
Flags: needinfo?(mh+mozilla)
jemalloc4 is not holding onto that memory. It unmaps it as soon as free() is called for it. Now, one thing that the test ends up doing is to alloc a 1GB buffer and realloc it to 2GB. I'm not sure if mozjemalloc was able to do the realloc by extending the already mapped region, but jemalloc4 clearly fails to do that (and it's not unexpected, since address space is mostly given out by the kernel with decreasing addresses, so it's very likely that mapped memory is already followed by some other mappings such that extending is not possible). That means realloc does a 1GB copy. This may have some influence here.

Note those tests were added back in bug 615147.
Flags: needinfo?(mh+mozilla)
Another thing that factors in is that debug builds have poisoning of allocations enabled. But mozjemalloc had that too.
So, I guess the difference that makes it fail with jemalloc4 and not mozjemalloc is this, when running the Strings.huge_capacity test only:

with jemalloc4:
	Maximum resident set size (kbytes): 3484340

with mozjemalloc:
	Maximum resident set size (kbytes): 3223164
And the 256M difference comes from how huge allocations are treated in jemalloc4. With mozjemalloc, allocating 1GiB + 1 byte mapped 1025 MiB. With jemalloc4, it's 1280 MiB. Which are, in both cases, poisoned on debug builds, so there's an additional 255MiB being poisoned with jemalloc4. We're also kind of lucky that this works on the same machines with mozjemalloc. If some more processes were running on the machine, it would likely fail the same.

Now the question is what's the least worst way to work around this without breaking what the test is about (which the currently attached patch would)
Another option would be to request that GTests run on larger instances (there appears to be some precedent for this).

Partially disabling this test doesn't seem particularly risky so long as it's temporary, as it's only disabling on linux debug builds.
Note that jemalloc 4 is going to be temporarily disabled on mozilla-central for a few days until the aurora merge to help regression tracking since jemalloc 4 is not set to ride the trains. Please do not consider a success running this test on the test slaves as a green light to enable the test job broadly.
I'm going to see how hard it is to get these on larger instances (bug 1205785), that seems like the safe way forward.
We got instances with more memory for these.
No longer blocks: 992983
Per bug 1363992, jemalloc 4 related bugs are now irrelevant.
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.