Cleanup and refactoring of GC system memory allocation functions

RESOLVED FIXED in Firefox 66

Status

()

enhancement
P3
normal
RESOLVED FIXED
9 months ago
20 days ago

People

(Reporter: ehoogeveen, Assigned: ehoogeveen)

Tracking

Trunk
mozilla66
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox-esr60 wontfix, firefox64 wontfix, firefox65 wontfix, firefox66 fixed)

Details

Attachments

(3 attachments, 7 obsolete attachments)

45.71 KB, patch
ehoogeveen
: review+
Details | Diff | Splinter Review
28.87 KB, patch
ehoogeveen
: review+
Details | Diff | Splinter Review
25.85 KB, patch
ehoogeveen
: review+
Details | Diff | Splinter Review
Background/motivation:

In the days of yore when the GC's system memory allocation functions were first imported from jemalloc, it was decided to split them by platform as they didn't share a lot of code. That changed when I added the chunk alignment algorithm and the last ditch allocator, which have a lot of things in common across all platforms.

When thinking about solutions to bug 1441473 I envisaged adding a fairly large amount of new infrastructure, and doing that separately for each platform would have created a big mess, so I decided to try fusing the different implementations together again. To do that I decided to clean up the existing implementations first and started finding lots of minor inconsistencies. After much tinkering and scope creep I pretty much rewrote gc/Memory.cpp, and here we are.

Implementation notes:

Many of the changes I made are pretty minor, but I tried to be deliberate and consistent.
1) Perhaps most contentiously, I removed explicit support for Windows Mobile / UWP and Solaris. These added additional OS-specific branches which I wanted to avoid.
1a) We're pretty limited on what we can do using Windows Mobile / UWP and as far as I know we don't really support them anyway. Windows Mobile / UWP support _aligned_malloc, but if they ever start using 48-bit pointers there isn't much we can do about it.
1b) As for Solaris, I don't think we use anything too Linux or OSX-specific so it might just work, or work with minor modifications. I don't think we need to treat it as a separate platform.
2) I tried to encapsulate OS-specific implementation details as much as possible. The constraints on VirtualAlloc/VirtualFree and mmap/munmap are ultimately too different for a complete fusion, but we can at least give them a common higher level API.
3) The JS engine works with 47-bit pointers and this is something I wanted to enforce. Right now I don't know of any cases where Windows hands out addresses above the 47-bit range, but that might change. In particular AllocateMappedContent did not go through MapMemory or MapMemoryAt on Windows, which could become a problem in the future.
4) I was pretty liberal in my usage of release assertions. These functions should all be out-of-line code so code size isn't really an issue, and they all invoke various system calls which should dwarf the time it takes to perform simple checks. Crashing early may not help much, but you never know what strange behavior we might trigger once memory corruption is in play.
5) I tweaked a lot of comments and made them consistently use // inside function bodies and /**/ outside.
6) I switched a lot of variables over from p/addr/address to region and size to length. I'm not attached to these names but we had an inconsistent mix before (admittedly |p| and |size| were the most common so perhaps I should have stuck with those).
7) AllocateMappedContent's alignment parameter wasn't very useful as it was limited by the system page size in practice. I changed this by making both the Windows and Linux/OSX versions use MapAlignedPages, then explicitly creating the mapping in the aligned region. This makes AllocateMappedContent more of a first class citizen. Doing this also uncovered bug 1502562.
8) I switched our use of madvise to use MADV_FREE on OSX, consistent with jemalloc. We used to redefine MADV_DONTNEED to MADV_FREE on OSX, but this was lost without a comment all the way back in bug 720439. MADV_FREE is also supported on recent Linux kernels (4.5+); according to [1] it's faster when there isn't too much memory pressure but slows down as memory pressure goes up. This might be worth investigating further.
9) I defined an explicit address growth direction for each non-Windows platform, with a comment explaining their origin. I considered just hardcoding a growth direction for each platform and using tryserver to confirm, but that felt like a step too far. I also wanted to apply the same logic to Windows, but the additional concern of racing with other threads made this far too messy.

[1] https://lwn.net/Articles/591214/
See the notes in comment #0 for details. I apologize for the large patch - I tried to split it up, but I couldn't think of a very good way to do so and then I kept making more changes and it got too entangled. The diff is surprisingly readable if only because it's so much of a rewrite, but looking at the patched file in full might be more helpful.
Attachment #9020614 - Flags: review?(sphink)
Posted file Patched gc/Memory.cpp (obsolete) —
The patched gc/Memory.cpp in case it's easier to review than the diff.
Great, thanks for doing this.
Comment on attachment 9020614 [details] [diff] [review]
Clean up and refactor GC system memory allocation functions.

Review of attachment 9020614 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/Memory.cpp
@@ +42,5 @@
> + * would make its application failure prone and complex. Tests indicate that
> + * VirtualAlloc always hands out regions of memory in increasing order.
> + */
> +#if defined(XP_DARWIN) || !defined(XP_WIN) && (defined(__ia64__) || defined(__aarch64__) ||\
> +    (defined(__sparc__) && defined(__arch64__) && (defined(__NetBSD__) || defined(__linux__))))

Please parenthesize for precedence -- I'm not sure what associativity to expect of

    #if A || !B && C

@@ +77,5 @@
> +};
> +#endif
> +
> +/*
> + * Data from OOM crashes shows there may be up to 24 chunksized but unusable

*chunk-sized

@@ +121,3 @@
>      }
>  }
> +template <Commit commit, PageAccess prot>

Why template parameters instead of just passing them to the function? Especially with prot, which is passed through. This is a static function, so I would expect the difference to be entirely optimized away.

Then again, I guess this better matches the caller, so I'm ok if you want to leave it as-is. It probably won't make a code size difference either way.

@@ +192,5 @@
> +    const uintptr_t end   = UINT64_C(0x0000800000000000);
> +    const uintptr_t step  = ChunkSize;
> +    uintptr_t desired;
> +    void* region = nullptr;
> +    for (desired = start; !region && desired + length <= end; desired += step) {

Huh. I guess if the first allocation fails, we could start just past the last attempted allocation (wrapping around). But that's more complexity, so I guess we should wait until we need it?

@@ +308,3 @@
>   * MapAlignedPagesSlow. In this case, try harder to find an alignable chunk
>   * by temporarily holding onto the unaligned parts of each chunk until the
>   * allocator gives us a chunk that either is, or can be aligned.

Is there a bug number you could put in the comment here to provide more information?

@@ +320,3 @@
>      }
>      for (; attempt < MaxLastDitchAttempts; ++attempt) {
> +        GetNewChunk(&region, tempMaps + attempt, length, alignment);

I kinda prefer &tempMaps[attempt], but maybe that's just me.

@@ +346,5 @@
>   * unaligned chunk, then reallocate the unaligned part to block off the
>   * old address and force the allocator to give us a new one.
>   */
>  static void
> +GetNewChunk(void** aRegion, void** aRetainedRegion, size_t length, size_t alignment)

There must be a better name for this function; I found it very confusing whenever it was called in the above code. RealignChunk? RemapAligned? AdjustAlignment? SwapForAlignedRegion?

@@ +356,3 @@
>      do {
> +        size_t offset = OffsetFromAligned(region, alignment);
> +        if (offset == 0) {

Maybe comment that this also catches the OOM case? (eg the following code sets both retainedRegion and region to nullptr).

@@ +373,5 @@
> +/*
> + * mmap calls don't have to be matched with calls to munmap, so we can unmap
> + * just the pages we don't need. However, as we don't know a priori if addresses
> + * are handed out in increasing or decreasing order, we have to try both
> + * directions (depending on the environment, one will always fail).

Er, wait. I'm not following. Why do we have to do all this craziness in the first place? Is this just to keep allocation requests as small as possible, so they keep working near OOM? Because it seems a lot simpler to get an aligned mapping by mapping (length + alignment - 1) bytes and then unmapping the unused portions.

Except that iiuc, mmap will always page-align anyway, so you'd want to do that in page-sized values instead of the -1.

@@ +382,5 @@
> +    void* regionStart = *aRegion;
> +    MOZ_ASSERT(regionStart && OffsetFromAligned(regionStart, alignment) != 0);
> +
> +    bool addressesGrowUpward = growthDirection > 0;
> +    bool directionUncertain = -8 < growthDirection && growthDirection <= 8;

Would this be simpler as

    int directions[] = [ growthDirection > 0, growthDirection <= 0 ];
    if (abs(growthDirection) >= 8) {
        directions[1] = -1;
    }
    for (size_t i = 0; i < 2; ++i) {
        if (directions[i] == -1) {
            break;
        } else if (directions[i] == int(true)) {
            ...
        } else {
            ...
        }
    }

or similar?

@@ +449,5 @@
>          return true;
>      }
> +#if defined(XP_WIN)
> +    return VirtualAlloc(region, length, MEM_RESET, DWORD(PageAccess::ReadWrite)) == region;
> +#elif defined(XP_DARWIN)

I guess making this happen on appropriate linux versions is a separate bug.
Ok, I read through the new version. It's my first time seeing much of this code, and it's all scary stuff, but it more or less makes sense to me except for the question I asked. Basically r=me once you explain why we do the growth direction stuff with mmap.
Flags: needinfo?(emanuel.hoogeveen)
Thanks for the quick review and the comments! To clarify, the only functional change here is in AllocateMappedContent now using MapAlignedPages to ease the alignment restrictions. All the other things - the 3-tier aligned allocation algorithm and the growth direction stuff - I implemented years ago in bug 1005849.

Back then we were having a lot of issues on 32-bit systems with OOMs where inspection showed that there were still several (often dozens of) alignable 1MiB chunks available. We were missing these because mmap/VirtualAlloc was giving us a 1MiB chunk that *couldn't* be aligned, and our fallback of allocating a |length + alignment - pagesize| region was failing because there legitimately weren't any of those left.

So I came up with last ditch allocation pass that allocates just the requested amount of memory, tries to align it, then holds onto it to prevent mmap/VirtualAlloc from handing it out again. We do this in a loop for up to 32 regions (32 was a guess on my part but turned out to be reasonable) before giving up and failing the allocation. Obviously this is slow but it prevents us from crashing before all the address space has truly been used up.

This worked pretty well - the GC-related signatures for those OOMs pretty much disappeared as they all shifted over to jemalloc. I tried for a while to integrate the same logic into jemalloc, which did lead to some simplifications that I integrated on the GC side in bug 1118481, but unfortunately the jemalloc side never landed. I might try again after this bug lands since our fork is in a better place now (and undoing the per-platform split should make it much more similar to how jemalloc operates).

So to summarize the algorithm:

MapAlignedPages:
  1) Request a region of the desired size. If it's aligned we're done.
  2) Try to align the region. If we can, we're done; otherwise unmap it.
MapAlignedPagesSlow:
  3) Request a region of size |length + alignment - pagesize|.
     If this succeeds, we can align the desired region inside it and we're done.
MapAlignedPagesLastDitch:
  4) Request a region of the desired size and try to align it. If we can align
     the region we're done. Otherwise, hold onto it so we can't get it again.
  5) Repeat step 4 (up to 32 times) until we get a region that we can align.
  6) Unmap all the regions we held onto and return what we got.

(In reply to Steve Fink [:sfink] [:s:] from comment #4)
> > +#if defined(XP_DARWIN) || !defined(XP_WIN) && (defined(__ia64__) || defined(__aarch64__) ||\
> > +    (defined(__sparc__) && defined(__arch64__) && (defined(__NetBSD__) || defined(__linux__))))
> 
> Please parenthesize for precedence -- I'm not sure what associativity to
> expect of
> 
>     #if A || !B && C

Will do - I guess I didn't want to add even more parentheses to that mess.

> > + * Data from OOM crashes shows there may be up to 24 chunksized but unusable
> 
> *chunk-sized

Done.

> > +template <Commit commit, PageAccess prot>
> 
> Why template parameters instead of just passing them to the function?
> Especially with prot, which is passed through. This is a static function, so
> I would expect the difference to be entirely optimized away.
> 
> Then again, I guess this better matches the caller, so I'm ok if you want to
> leave it as-is. It probably won't make a code size difference either way.

This is mostly an artifact of trying to abstract away the OS-specific differences - for a while I was trying to thread the file mapping stuff through to MapInternal, and having two separate sources of default arguments was useful.

I could argue that using template parameters makes it obvious at a glance that the passed values are known at compile time, but I don't think that's a very strong advantage. I'll leave it as-is if you're okay with it though.
 
> > +    for (desired = start; !region && desired + length <= end; desired += step) {
> 
> Huh. I guess if the first allocation fails, we could start just past the
> last attempted allocation (wrapping around). But that's more complexity, so
> I guess we should wait until we need it?

This whole loop is awful and I hope to get rid of it soon. But if we just change the step size to the length (rounded up to the alignment), we could jump right over an available region that happened to start just one alignment-increment away.

> >   * allocator gives us a chunk that either is, or can be aligned.
> 
> Is there a bug number you could put in the comment here to provide more
> information?

See above; I could link to bug 1005849 but I prefer to make things obvious from the comments where possible. Do you think there's more context I should add? Maybe an overview at the top of the file would help.

> > +        GetNewChunk(&region, tempMaps + attempt, length, alignment);
> 
> I kinda prefer &tempMaps[attempt], but maybe that's just me.

I always get a bit antsy about the operator precedence there - I always have to convince myself it means &(tempMaps[attempt]) and not (&tempMaps)[attempt].

> >  static void
> > +GetNewChunk(void** aRegion, void** aRetainedRegion, size_t length, size_t alignment)
> 
> There must be a better name for this function; I found it very confusing
> whenever it was called in the above code. RealignChunk? RemapAligned?
> AdjustAlignment? SwapForAlignedRegion?

Yeah, that name is bad. The problem is that the function is sort of multi-functional - it tries to align the chunk it's given, but if that fails it returns a new (possibly aligned) chunk and retains the old one. It's nice because we can use it both in the first pass (MapAlignedPages) and in the third pass (MapAlignedPagesLastDitch), with the same API across all operating systems - but it's hard to describe its function concisely. I'll think about this a bit more.

> > +        size_t offset = OffsetFromAligned(region, alignment);
> > +        if (offset == 0) {
> 
> Maybe comment that this also catches the OOM case? (eg the following code
> sets both retainedRegion and region to nullptr).

Will do.

> > + * mmap calls don't have to be matched with calls to munmap, so we can unmap
> > + * just the pages we don't need. However, as we don't know a priori if addresses
> > + * are handed out in increasing or decreasing order, we have to try both
> > + * directions (depending on the environment, one will always fail).
> 
> Er, wait. I'm not following. Why do we have to do all this craziness in the
> first place? Is this just to keep allocation requests as small as possible,
> so they keep working near OOM? Because it seems a lot simpler to get an
> aligned mapping by mapping (length + alignment - 1) bytes and then unmapping
> the unused portions.

Right, see above: We already do the |length + alignment - pagesize| thing in MapAlignedPagesSlow, this logic was added in case there aren't any of those regions left. We also use it in MapAlignedPages because we can and it often works (IIRC there was around a 50% success rate for aligning the first chunk we get).

It might also stave off running out of |length + alignment - pagesize| regions a little longer, and since both jemalloc and the GC allocate 1MiB-aligned chunks we hope that unaligned chunks are pretty rare. If we manage to use an unaligned chunk that the OS gives us, the next 10 chunks might all be aligned (e.g. imagine a pattern of 1MiB aligned holes in the address space with one 1.5MiB free region before it that happens to be unaligned).

> > +    bool addressesGrowUpward = growthDirection > 0;
> > +    bool directionUncertain = -8 < growthDirection && growthDirection <= 8;
> 
> Would this be simpler as
> 
>     int directions[] = [ growthDirection > 0, growthDirection <= 0 ];
>     if (abs(growthDirection) >= 8) {
>         directions[1] = -1;
>     }
>     for (size_t i = 0; i < 2; ++i) {
>         if (directions[i] == -1) {
>             break;
>         } else if (directions[i] == int(true)) {
>             ...
>         } else {
>             ...
>         }
>     }
> 
> or similar?

Hmm, I prefer my version but I'm biased. I think your suggestion is a bit more efficient but mine is slightly more intuitive? I can play with this a bit more. Really I wanted to combine both cases into 1 block but I couldn't figure out a way to do it intuitively. One way that might work is to use negative numbers for the negative growth direction, but you'd have to negate them again before using them and you'd run into 32-bit overflow...

> > +    return VirtualAlloc(region, length, MEM_RESET, DWORD(PageAccess::ReadWrite)) == region;
> > +#elif defined(XP_DARWIN)
> 
> I guess making this happen on appropriate linux versions is a separate bug.

Yeah, it would be easy to do but it would be nice to get some performance comparisons and see whether it affects memory measurements. I just changed this back to match jemalloc since the MADV_FREE was removed without comment.
Flags: needinfo?(emanuel.hoogeveen)
Priority: -- → P3
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #6)
> > > +    for (desired = start; !region && desired + length <= end; desired += step) {
> > 
> > Huh. I guess if the first allocation fails, we could start just past the
> > last attempted allocation (wrapping around). But that's more complexity, so
> > I guess we should wait until we need it?
> 
> This whole loop is awful and I hope to get rid of it soon. But if we just
> change the step size to the length (rounded up to the alignment), we could
> jump right over an available region that happened to start just one
> alignment-increment away.

That's not what I meant. I meant to try to allocate first as a special case, and if it works we're done. Otherwise, do exactly what you're doing, but instead of starting at 'start' and stepping forward by 'step' until you get to 'end', pick up where you left off the last time the special initial allocation failed and wrap back around before giving up completely. Essentially, make 'desired' persist across calls. (Leave 'step' the same.)

But it doesn't matter, if you're rewriting this to do something more intelligent.

> > >   * allocator gives us a chunk that either is, or can be aligned.
> > 
> > Is there a bug number you could put in the comment here to provide more
> > information?
> 
> See above; I could link to bug 1005849 but I prefer to make things obvious
> from the comments where possible. Do you think there's more context I should
> add? Maybe an overview at the top of the file would help.

Sorry, I shouldn't have said "to provide more information". I meant to provide anecdotal evidence that this situation occurs enough to be worth implementing all this stuff. Skimming that bug, it seems like there's no good single place to point to for that evidence. So unless you happen to know of something, it's fine to leave it out.

> > > +    bool addressesGrowUpward = growthDirection > 0;
> > > +    bool directionUncertain = -8 < growthDirection && growthDirection <= 8;
> > 
> > Would this be simpler as
> > 
> >     int directions[] = [ growthDirection > 0, growthDirection <= 0 ];
> >     if (abs(growthDirection) >= 8) {
> >         directions[1] = -1;
> >     }
> >     for (size_t i = 0; i < 2; ++i) {
> >         if (directions[i] == -1) {
> >             break;
> >         } else if (directions[i] == int(true)) {
> >             ...
> >         } else {
> >             ...
> >         }
> >     }
> > 
> > or similar?
> 
> Hmm, I prefer my version but I'm biased. I think your suggestion is a bit
> more efficient but mine is slightly more intuitive? I can play with this a
> bit more. Really I wanted to combine both cases into 1 block but I couldn't
> figure out a way to do it intuitively. One way that might work is to use
> negative numbers for the negative growth direction, but you'd have to negate
> them again before using them and you'd run into 32-bit overflow...

Nah, that's too much cleverness. I wasn't going for efficiency, only clarity. Your version is:

    bool addressesGrowUpward = growthDirection > 0;
    bool directionUncertain = -8 < growthDirection && growthDirection <= 8;
    for (i = 0; i < 2; ++i) {
      if (addressesGrowUpward) {
         ...
      } else {
         ...
      }
      if (!directionUncertain) {
         break;
      }
      addressesGrowUpward = !addressesGrowUpward;
    }

which is a lot of bookkeeping scattered around, so it took me a while to trace through. Rereading mine, it doesn't quite achieve what I wanted. I suppose there's always:

    const char* attempts;
    if (growthDirection < -8) {
        attempts = "<";
    } else if (growthDirection > 8) {
        attempts = ">";
    } else if (growthDirection < 0) {
        attempts = "<>";
    } else {
        attempts = "><";
    }
    for (i = 0; i < strlen(attempts); ++i) {
        if (attempts[i] == '>') { 
            ....
        } else {
            ....
        }
    }

but that's kinda weird and verbose. Maybe

    int8_t attempts[2];
    attempts[0] = growthDirection < 0 ? -1 : 1;
    attempts[1] = abs(growthDirection) >= 8 ? 0 : -attempts[0];
    for (i = 0; i < 2 && attempts[i]; ++i) { ... }

Meh. What you have is fine.
Attachment #9020614 - Flags: review?(sphink) → review+
Blocks: 1502562
No longer depends on: 1502562
Bah, sorry about the further delay here. Let's try to get this in before the soft freeze. Carrying forward r+.

I decided to stick with the original formulation for the growth direction stuff for now; I think I can see where you're coming from, but I like that in my version the choice of what to do is more explicit. I guess it's in the eye of the beholder.

I changed "GetNewChunk" to "TryToAlignChunk" - the latter doesn't fully describe everything the function tries to do, but at least it's less meaningless than the former.
Attachment #9020614 - Attachment is obsolete: true
Attachment #9029229 - Flags: review+
https://treeherder.mozilla.org/#/jobs?repo=try&revision=49d580761e9002b088b32181241b3a5cd21403bf

A fair amount of orange, but it all looks unrelated. One interesting GC crash in Linux x64 asan bc8 that doesn't seem to have an associated bug, but I don't think this patch could have caused it.
Keywords: checkin-needed
Pushed by csabou@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/2b3af7022014
Clean up and refactor GC system memory allocation functions. r=sfink
Keywords: checkin-needed
Backed out changeset for raptor Android timeout errors.

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&group_state=expanded&searchStr=android%2C8.0%2Cpixel2%2Caarch64%2Copt%2Craptor%2Cperformance%2Ctests%2Con%2Cfirefox%2Cwith%2Ce10s%2Ctest-android-hw-p2-8-0-android-aarch64%2Fopt-raptor-speedometer-geckoview-e10s%2Crap-e10s%28sp%29&fromchange=2b3af7022014b0d4a10a2c8812da9d2735f75a9b&tochange=098c543f74c68ee61b4de82dae21870b1eb8fa12&selectedJob=215226260

Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=215226260&repo=mozilla-inbound&lineNumber=989

Backout link: https://hg.mozilla.org/integration/mozilla-inbound/rev/098c543f74c68ee61b4de82dae21870b1eb8fa12

01:04:59     INFO -  adb launch_application: am start -W -n org.mozilla.geckoview_example/org.mozilla.geckoview_example.GeckoViewActivity -a android.intent.action.Main --ez use_multiprocess True --es args "-profile /sdcard/raptor-profile --es env0 LOG_VERBOSE=1 --es env1 R_LOG_LEVEL=6" -d about:blank
01:09:59     INFO -  raptor-main Exception launching org.mozilla.geckoview_example
01:09:59    ERROR -  Traceback (most recent call last):
01:09:59     INFO -    File "/builds/worker/workspace/build/tests/raptor/raptor/raptor.py", line 562, in <module>
01:09:59     INFO -      main()
01:09:59     INFO -    File "/builds/worker/workspace/build/tests/raptor/raptor/raptor.py", line 532, in main
01:09:59     INFO -      raptor.run_test(next_test, timeout=int(next_test['page_timeout']))
01:09:59     INFO -    File "/builds/worker/workspace/build/tests/raptor/raptor/raptor.py", line 264, in run_test
01:09:59     INFO -      fail_if_running=False)
01:09:59     INFO -    File "/builds/worker/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb_android.py", line 580, in launch_activity
01:09:59     INFO -      timeout=timeout)
01:09:59     INFO -    File "/builds/worker/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb_android.py", line 481, in launch_application
01:09:59     INFO -      self.shell_output(cmd, timeout=timeout)
01:09:59     INFO -    File "/builds/worker/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 1378, in shell_output
01:09:59     INFO -      raise ADBTimeoutError("%s" % adb_process)
01:09:59     INFO -  mozdevice.adb.ADBTimeoutError: args: adb -s HT85S1A02570 wait-for-device shell am start -W -n org.mozilla.geckoview_example/org.mozilla.geckoview_example.GeckoViewActivity -a android.intent.action.Main --ez use_multiprocess True --es args "-profile /sdcard/raptor-profile --es env0 LOG_VERBOSE=1 --es env1 R_LOG_LEVEL=6" -d about:blank; echo rc=$?, exitcode: None, stdout: Starting: Intent { act=android.intent.action.Main dat=about:blank cmp=org.mozilla.geckoview_example/.GeckoViewActivity (has extras) }
01:09:59    ERROR - Return code: 1
01:09:59  WARNING - setting return code to 1
01:09:59     INFO - Killing logcat pid 475.
01:09:59 CRITICAL - PERFHERDER_DATA was seen 0 times, expected 1.
01:09:59     INFO - copying raptor results to upload dir:
01:09:59     INFO - /builds/worker/workspace/build/blobber_upload_dir/perfherder-data.json
01:09:59     INFO - copying raptor results from /builds/worker/workspace/build/raptor.json to /builds/worker/workspace/build/blobber_upload_dir/perfherder-data.json
01:09:59 CRITICAL - Error copying results /builds/worker/workspace/build/raptor.json to upload dir /builds/worker/workspace/build/blobber_upload_dir/perfherder-data.json
01:09:59     INFO - [Errno 2] No such file or directory: u'/builds/worker/workspace/build/raptor.json'
01:09:59     INFO - Running post-action listener: _package_coverage_data
01:09:59     INFO - Running post-action listener: _resource_record_post_action
01:09:59     INFO - Running post-action listener: process_java_coverage_data
01:09:59     INFO - Running post-action listener: stop_device
01:09:59     INFO - Killing logcat pid 475.
01:09:59     INFO - [mozharness: 2018-12-03 01:09:59.576788Z] Finished run-tests step (success)
Flags: needinfo?(emanuel.hoogeveen)
Oof, what a thing to get backed out for. Just knowing there's a timeout is not very useful. I wonder if there's a way to coax more details from this job..
Flags: needinfo?(emanuel.hoogeveen)
This changes allocation on 64-bit platforms to try randomly chosen aligned regions instead, removing the allocation loop. WIP because the one test that we run in production on ARM64 still fails (I don't know why).
Attachment #9029229 - Attachment description: v2 - Clean up and refactor GC system memory allocation functions. → Part 1 v2: Clean up and refactor GC system memory allocation functions.
Hi Bob, I need some help figuring out what's going wrong here. With both patches applied, everything is green on try [1] except for test-android-hw-p2-8-0-android-aarch64/opt-raptor-speedometer-geckoview-e10s [2]. It says it's timing out, but part 2 should have made allocation much, much faster on ARM64, so my best guess is that the framework is failing somehow. The jit-tests in the -u all push [1] also don't seem to indicate any problems (they're orange, but that's expected - work to fix them is ongoing).

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=bf7789bdf8e0dc636ec48777a9a7dd88c2268c5c
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=b0df7caea042cbd269265e119404bf653a170fa3
Flags: needinfo?(bob)
Logcat shows the gecko process for the child tab crashed via a SEGV

12-07 13:54:08.056  1180  1571 I ActivityManager: Start proc 5529:org.mozilla.geckoview_example:tab/u0a120 for service org.mozilla.geckoview_example/org.mozilla.gecko.process.GeckoServiceChildProcess$tab

...

12-07 13:55:48.217   729   729 I Zygote  : Process 5529 exited due to signal (11)
12-07 13:55:48.217  5509  5545 I Gecko   : [Parent 5509, Gecko_IOThread] WARNING: pipe error (102): Connection reset by peer: file /builds/worker/workspace/build/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 349
12-07 13:55:48.218  5509  5525 I Gecko   : 
12-07 13:55:48.218  5509  5525 I Gecko   : ###!!! [Parent][MessageChannel] Error: (msgtype=0x1E0086,name=PBrowser::Msg_Destroy) Channel error: cannot send/recv
12-07 13:55:48.218  5509  5525 I Gecko   : 
12-07 13:55:48.219  5509  5546 I GeckoProcessManager: Binder died for tab
12-07 13:55:48.219  1180  1571 I ActivityManager: Process org.mozilla.geckoview_example:tab (pid 5529) has died: vis  TOP 
12-07 13:55:48.220  5509  5525 D GeckoViewProgress: observe: topic=oop-frameloader-crashed
12-07 13:55:48.220  5509  5525 D GeckoViewContent: observe: oop-frameloader-crashed

I retriggered it several times to see how reproducible it is. snorp: Thoughts?
Flags: needinfo?(bob) → needinfo?(snorp)
A segmentation fault? Interesting, maybe I shouldn't have dismissed the jit-test orange so easily. I'll have a closer look, thanks for the assistance!
Yeah it's crashing, but why aren't we seeing the dump? Do raptor tests not fetch it?
Flags: needinfo?(snorp) → needinfo?(rwood)
FWIW, the jit-test logs only give the JS stack, which isn't very helpful on its own.
Blocks: 1441473
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #17)
> Yeah it's crashing, but why aren't we seeing the dump? Do raptor tests not
> fetch it?

We do for Firefox desktop but not for geckoview [1]. We use MozRunner for desktop which supports checking for crashes/dumps. For geckoview we use MozDevice's ADBDevice to control the app, and I see no support there for checking for crashes/getting dumps [2].

How can we check for crashes and grab dumps for geckoview? :gbrown, is there a way we can do that now (have you implemented that somewhere)? If so I'd like to implement that in Raptor too, thanks.

[1] https://searchfox.org/mozilla-central/rev/dc3585e58b427c3a8a7c01712fe54ebe118a3ce2/testing/raptor/raptor/raptor.py#352

[2] https://searchfox.org/mozilla-central/source/testing/mozbase/mozdevice/mozdevice/adb_android.py
Flags: needinfo?(rwood) → needinfo?(gbrown)
See Also: → 1512983
Thanks :snorp, filed Bug 1512983 to have Raptor grab crash dumps on geckoview, will follow up there.
Flags: needinfo?(gbrown)
Comment on attachment 9029229 [details] [diff] [review]
Part 1 v2: Clean up and refactor GC system memory allocation functions.

Review of attachment 9029229 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/Memory.cpp
@@ +58,5 @@
> +                           (defined(__NetBSD__) || defined(__linux__)))))
> +static mozilla::Atomic<int, mozilla::Relaxed,
> +                       mozilla::recordreplay::Behavior::DontPreserve>
> +    growthDirection(1);
> +#elif defined(XP_LINUX)

Why rename XP_UNIX -> XP_LINUX only to break Tier3 like BSDs/Solaris?

In file included from /tmp/objdir/js/src/gc/Unified_cpp_js_src_gc1.cpp:2:
js/src/gc/Memory.cpp:478:30: error: use of undeclared identifier 'growthDirection'
  bool addressesGrowUpward = growthDirection > 0;
                             ^
js/src/gc/Memory.cpp:479:34: error: use of undeclared identifier 'growthDirection'
  bool directionUncertain = -8 < growthDirection && growthDirection <= 8;
                                 ^
js/src/gc/Memory.cpp:479:53: error: use of undeclared identifier 'growthDirection'
  bool directionUncertain = -8 < growthDirection && growthDirection <= 8;
                                                    ^
js/src/gc/Memory.cpp:491:13: error: use of undeclared identifier 'growthDirection'
          ++growthDirection;
            ^
js/src/gc/Memory.cpp:503:13: error: use of undeclared identifier 'growthDirection'
          --growthDirection;
            ^
5 errors generated.
As an update, I spent some time figuring out how to do ARM64 builds. I think I got it working, but unfortunately I get an INSTALL_FAILED_NO_MATCHING_ABIS error trying to install it on my phone, so I think I'm out of luck as far as local testing is concerned.

> Why rename XP_UNIX -> XP_LINUX only to break Tier3 like BSDs/Solaris?

Sorry, that was not my intent. I was merely using it to distinguish from OSX (XP_DARWIN), but it should be fine as long as I always check for OSX first. I'll fix this.
s/XP_LINUX/XP_UNIX/, carrying forward r+.
Attachment #9020615 - Attachment is obsolete: true
Attachment #9029229 - Attachment is obsolete: true
Attachment #9031721 - Flags: review+
Updated for the change to part 1.
Attachment #9030052 - Attachment is obsolete: true
With the progress in bug 1512983, I've started to investigate using printf debugging on Try and I think I'm getting close to understanding the problem.

Apparently AArch64 supports two different translation granules (4kiB and 64kiB), which set the page size and the size of the translation tables. In 4kiB mode, pages are 4kiB (12 bits) and each translation table can store 9 bits of the offset. In 64kiB mode, pages are 64kiB (16 bits) and each translation table can store 13 bits of the offset. If all (levels of) translation tables are used, both modes support up to 48-bit addresses.

AArch64 on Linux appears to use 3 (levels of) translation tables in 4kiB mode for 12+9+9+9 = 39-bit addresses, and 2 (levels of) translation tables in 64kiB mode for 16+13+13 = 42-bit addresses.

I don't know how to check for this at runtime, though simply assuming 39-bit addresses if the page size is 4kiB and assuming 42-bit addresses if the page size is 64kiB should work (#if defined(__aarch64__)). I'd love to do something more clever and future proof though.

Sources:
https://www.arm.com/files/downloads/ARMv8_Architecture.pdf
https://events.static.linuxfound.org/images/stories/pdf/lcna_co2012_marinas.pdf
I would be really interested to know what Windows for ARM64 does here. Do they use 4kiB pages and 39-bit addresses like Linux? Do they use 4kiB pages but all 4 translation tables for 48-bit addresses? I think [1] is meant to be the usual source of truth but it doesn't mention ARM at all.

[1] https://docs.microsoft.com/en-us/windows/desktop/Memory/memory-limits-for-windows-releases
Ah, in theory we should be able to get this information from GetSystemInfo() on Windows, specifically the lpMinimumApplicationAddress and lpMaximumApplicationAddress fields [1].

[1] https://docs.microsoft.com/en-us/windows/desktop/api/sysinfoapi/ns-sysinfoapi-_system_info
On my x64 system lpMinimumApplicationAddress == 0x10000 (2^16, matching the allocation granularity) and lpMaximumApplicationAddress == 0x7ffffffeffff (2^47 - 2^16 - 1), so these constants should be pretty easy to use.
It seems Arm64 also supports 16k pages (presumably storing 11 bits per translation table), increasing the number of possibilities even further [1]. But how do I check what the user mode address space limit actually *is* from a running process? There doesn't seem to be any standard way. Hell I checked programs like Wine and they just hardcode a 47-bit limit when they emulate Windows' GetSystemInfo().

I don't want to do the most conservative thing and just assume 39-bit addresses whenever Arm64 is involved, but I might need help from a Linux expert here.

[1] https://github.com/torvalds/linux/blob/master/arch/arm64/Kconfig
Well, I'm not super proud of this but it's the best I could come up with given what I was able to find online (which is to say, not a whole lot). Since there seems to be no standard API we can query to discover the address range at runtime, I decided to just do a binary search for it. It seems to work well and overhead at startup is a one-time ~45 microseconds on Android AArch64, which I think is acceptable.

The rest of this patch is unchanged from the WIP. The idea is simple: Since we have a gigantic amount of address space to work with, why not just randomly choose where to allocate? The odds of finding an empty spot should be very good, even with relatively large (e.g. 4GiB) allocations and even if many regions are reserved.

Since we're in charge of the address range ourselves, this solves the 48-bit address issue, and in addition it gives us the strongest possible address space randomization. It's also ultimately a lot simpler than what we do on 32-bit platforms.

There are a few complications. As mentioned, the actual address range depends on the architecture and kernel configuration, which causes problems if we just assume that every 47-bit number is valid. If we're generating 47-bit numbers, the probability of getting one in AArch64 Linux's default 39-bit range is not very high, explaining the problems on Android.

In addition I decided to still attempt to align chunks if the allocator gives us something in a valid range. I wasn't sure whether to bother, as the odds of overlapping with another allocation should be small, but given that the logic already existed for 32-bit I figured I might as well.

I extended TryToAlignChunk() a little to make it return a boolean indicating success or failure, made it deallocate the leftover region on success or OOM, and added a template parameter so that we don't uselessly get a new chunk on platforms other than Windows. I hope this didn't make the function even more confusing (the changes are fairly small).
Attachment #9031722 - Attachment is obsolete: true
Attachment #9033137 - Flags: review?(sphink)
Given the 39-bit address range on Linux AArch64 (well, it can be as low as 36 bits with 16kiB pages but let's hope not a lot of people are using that), I wonder if we should take care to not allocate regions in all of it. By allocating randomly we could accidentally allocate a single page in every 4GiB region, for instance. Perhaps we should reserve regions with the highest bit set for wasm (not sure whether wasm itself uses MapAlignedPages, I'd have to check).

Bug 1517412 made me think of this, though the failures there are happening without these patches.
As noted in bug 1517412, we use 6GB reservations for wasm on 64-bit systems, so the situation is even grimmer than it looks.  On ARM64 we could however move this down toward 4GB because immediates in instructions are smaller on ARM64 than on x64, this is bug 1442544.

With a 36-bit effective address space we probably cannot use the mmap trick for bounds checks at all, as there's too great a risk of running out of memory in reasonable content and/or without full process separation.  It'll be very attractive to detect this situation at run-time, so that we can choose a per-process strategy for how to handle wasm.  (/me wonders about whether compiled wasm can be transmitted cross-process with, say, some kind of message broadcast, and how that will interact with per-process policies.)
Comment on attachment 9033137 [details] [diff] [review]
Part 2 v2.0: Allocate at randomly chosen aligned addresses on 64-bit platforms.

Review of attachment 9033137 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for the delay. I didn't sift through this all that thoroughly, but it looks about right modulo the concerns you and lth are discussing later in this bug (which I would not have thought of.)

::: js/src/gc/Memory.cpp
@@ +200,5 @@
> +
> +#if defined(JS_64BIT)
> +
> +/* Returns the base 2 logarithm of the given value, rounded down. */
> +static inline uint32_t LogBase2(uint64_t v) {

I think this is mozilla::FloorLog2 from mfbt's MathAlgorithms.h.

@@ +225,5 @@
> +  do {
> +    mozilla::Maybe<uint64_t> result;
> +    do {
> +      result = mozilla::RandomUint64();
> +    } while (!result);

Oh, wow. Hopefully consuming entropy here won't get us into trouble anywhere.

@@ +239,5 @@
> + * kernel configuration. For example, AArch64 on Linux uses addresses with
> + * 39 significant bits by default, but can be configured to use addresses with
> + * 48 significant bits by enabling a 4th translation table. Unfortunately,
> + * there appears to be no standard way to query the limit at runtime
> + * (with the notable exception of GetSystemInfo() on Windows).

I would probably say "(Windows exposes this via GetSystemInfo())", since you're talking about Linux here so Windows isn't really an exception.

@@ +408,5 @@
> +
> +  // Try to allocate in random aligned locations.
> +  void* region = nullptr;
> +  for (size_t i = 1; i <= 1000; ++i) {
> +    if (i % 10 != 0) {

i % 16 maybe? (Or `if (i & 0xf)`.) It seems weird to throw a division in here, even though I'm sure it's swamped by other operations.

@@ +762,5 @@
>                                   MAP_PRIVATE | MAP_FIXED, fd, alignedOffset));
>    MOZ_RELEASE_ASSERT(map != MAP_FAILED);
>  #endif
>  
> +#if defined(DEBUG)

Looks like this can go back to #ifdef, unless there's some other reason for this change?
Attachment #9033137 - Flags: review?(sphink) → review+
Thanks for the review! No worries about the lag and happy new year :) I'll fix the comments and make a part 3 to address the concern in comment 31.

We should definitely fall back to the old method of allocating on systems we've already measured as using less-than-48-bit addresses, and I think it's a good idea to reserve say half of the address space for very large allocations (we could reserve more, but the more we reserve the weaker our address space randomization gets). Unfortunately this scattershot approach to allocating will probably make the situation in bug 1517412 somewhat worse, but I'll try to limit the damage.

(In reply to Steve Fink [:sfink] [:s:] from comment #33)
> > +  do {
> > +    mozilla::Maybe<uint64_t> result;
> > +    do {
> > +      result = mozilla::RandomUint64();
> > +    } while (!result);
> 
> Oh, wow. Hopefully consuming entropy here won't get us into trouble anywhere.

I hope not! We could do something like occasionally seeding a XorShift128PlusRNG, but that requires adding a Mutex and the addresses would be far more predictable. Thankfully allocating chunks isn't something you can do *that* frequently without also running out of physical memory, so it's probably okay.

> Looks like this can go back to #ifdef, unless there's some other reason for
> this change?

Oh, yeah a lot of them can probably be #ifdefs. I did it mostly for symmetry with the // !defined(x) comments on large conditional blocks, but I think I only left a couple of those since most blocks are small.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #34)
> We should definitely fall back to the old method of allocating on systems
> we've already measured as using less-than-48-bit addresses,

Er, to clarify, I mean falling back on OOM (since OOMing trying to reserve a 6GiB region on a platform with 39-bit addresses is not inconceivable).

Data point: Fedora Core 28 Server (!) configured for the SoftIron Overdrive 1000:

CONFIG_ARM64_PAGE_SHIFT=12
CONFIG_ARM64_CONT_SHIFT=4
CONFIG_ARM64_4K_PAGES=y
# CONFIG_ARM64_16K_PAGES is not set
# CONFIG_ARM64_64K_PAGES is not set
# CONFIG_ARM64_VA_BITS_39 is not set
CONFIG_ARM64_VA_BITS_48=y
CONFIG_ARM64_VA_BITS=48
CONFIG_ARM64_PA_BITS_48=y
CONFIG_ARM64_PA_BITS=48

Since this is Server it is not surprising that it is configured with 48-bit virtual addresses, or at least that's what it looks like to me. (Fedora has no aarch64 support for Workstation from what I can tell.)

Carrying forward r+. This addresses all the comments except the one about #ifdef; I left that one for part 3.

Attachment #9033137 - Attachment is obsolete: true
Attachment #9036263 - Flags: review+

I was torn on whether to make this its own part, but I think there's enough new here that you should at least have a look.

This patch disables the new allocator on platforms that don't have a very large address range. On these platforms the new allocator's scattershot approach causes problems when mixing very large allocations/reservations (like WASM's) with smaller allocations, and we're better off letting the OS tell us where to allocate.

On platforms where we do have a large enough address range to use the new allocator, I split the address range in half: the lower half is reserved for smaller allocations and the upper half is reserved for huge (>= 1GiB) allocations. This should keep regular chunk allocations from stepping on WASM's toes and vice versa.

Finally, if there's no chance of getting an invalid address we now fall back to overallocating (MapAlignedPagesSlow) before we give up in MapAlignedPagesRandom, and if the allocation is a huge reservation we return nullptr instead of crashing.

I also adjusted the tests and exposed some new helper functions. Since we now enable the older allocator again on some 64-bit platforms (but without the bad allocation loop) we can test it again, and WASM might be interested in knowing the system's address range.

Attachment #9036264 - Flags: review?(sphink)
Comment on attachment 9036264 [details] [diff] [review]
Part 3: Tune new allocator to play nice with WASM.

Review of attachment 9036264 [details] [diff] [review]:
-----------------------------------------------------------------

I wonder if there's a good place to expose this. getBuildConfiguration() jumps to mind, except that it is definitely *not* the build configuration. Oh well, maybe wait until someone has a use for it. (It'd be good to have it in profiler metadata...)
Attachment #9036264 - Flags: review?(sphink) → review+

(In reply to Steve Fink [:sfink] [:s:] from comment #39)

I wonder if there's a good place to expose this. getBuildConfiguration()

Sorry, s/this/some of these discovered values like address bits/.

GetOSConfiguration()? GetKernelConfiguration()? Well, right now it's still only a few values. That reminds me though, maybe it would be good to expose the allocation granularity (it doesn't hurt to pass an alignment smaller than the allocation granularity to MapAlignedPages(), but I could imagine something like a vector implementation wanting to know what the smallest sensible heap allocation is - assuming malloc isn't good enough).

Oh, and thanks for the review! Carrying forward r+, just added a big comment above MapAlignedPagesRandom() to explain the algorithm.

Build only try run to confirm I didn't break anything while splitting off part 3: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9f7d8cab8c8847a2bc5ec11660a0933ef448f1ad

(I did a full run earlier and it was green modulo some m-c bustage)

Attachment #9036264 - Attachment is obsolete: true
Attachment #9036426 - Flags: review+

Looks good. Fingers crossed!

Keywords: checkin-needed

Pushed by nbeleuzu@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/32aab5bf983a
Part 1: Clean up and refactor GC system memory allocation functions. r=sfink
https://hg.mozilla.org/integration/mozilla-inbound/rev/ac29aabfda36
Part 2: Allocate at randomly chosen aligned addresses on 64-bit platforms. r=sfink
https://hg.mozilla.org/integration/mozilla-inbound/rev/e5d3da4bdf58
Part 3: Tune new allocator to play nice with WASM. r=sfink

Keywords: checkin-needed
Status: ASSIGNED → RESOLVED
Closed: 6 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla66

65=wontfix because I assume we do not want to uplift GC changes to 65 Beta when there are only two weeks remaining in the release cycle.

Depends on: 1520343

(In reply to Chris Peterson [:cpeterson] from comment #46)

65=wontfix because I assume we do not want to uplift GC changes to 65 Beta when there are only two weeks remaining in the release cycle.

Yes, this is a pretty big change that I wouldn't land during the soft freeze, much less uplift.

Depends on: 1520496
Depends on: 1521713
Depends on: 1522294
Depends on: 1520783
Depends on: 1526300
Regressions: 1562437
You need to log in before you can comment on or make changes to this bug.