Closed Bug 1035396 Opened 6 years ago Closed 6 years ago

USS reporting takes too long

Categories

(Core :: Gecko Profiler, defect)

All
Linux
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla34

People

(Reporter: janx, Assigned: janx)

References

Details

(Keywords: perf)

Attachments

(4 files, 5 obsolete files)

The current Linux implementation of Unique Set Size reporting uses `fgets` to parse `/system/self/smaps`, which causes a `read` system call for every character of that file. This is too CPU intensive and takes too long.

Optimizing this function by `fread`ing the virtual `smaps` file into a buffer should reduce the number of system calls, and lower the CPU load on Firefox OS devices and/or allow us to poll USS with a higher frequency.
Attached patch strstr (obsolete) — Splinter Review
Attachment #8451997 - Flags: review?(nicolas.b.pierron)
Attached patch 4-letter matching (obsolete) — Splinter Review
Attachment #8451998 - Flags: review?(nicolas.b.pierron)
Attached image perf comparison graphs
One performance graph for each USS reporting method:

1. `fgets` is the current way to do it
2. `strstr` uses strstr
3. `nbp` is the 4-letter matching trick suggested by Nicolas

Each graph shows the number of milliseconds it takes to compute USS, per Firefox OS app process:
1. Callscreen is actually System (Callscreen is an in-process app).
2. Homescreen is the foreground app.
3. Keyboard is an app running in the background.

The `fgets` method takes between 30 and 100ms (more for background apps because they are niced and system calls are expensive).

The `strstr` method takes between 15 and 50ms (only 7ms for background apps).

The `nbp` method (4-letter matching) takes between 15 and 40ms (only 7ms for background apps). This seems to be the fastest and steadiest approach, just a little faster than `strstr` and showing lower "nice-spikes" in background apps (25ms vs 75ms).
Screen dims just before :35 and is is locked after :55.
(Note: horizontal scale changes because that run was 3 min long, where strstr run was only 1.5 min).

Screen dims before :35 and locks at :40. We see less spikes than for strstr, and apps (esp. Homescreen) are able to do more runs at 7ms, which seems to be the minimum.
Attachment #8452026 - Attachment description: strstr perf longer run → strstr perf for 1.5 minute
Attachment #8452033 - Attachment description: 4-letter match very long run → 4-letter perf for 3 minutes
Blocks: 989263
Comment on attachment 8451997 [details] [diff] [review]
strstr

Review of attachment 8451997 [details] [diff] [review]:
-----------------------------------------------------------------

These comments apply to both patches.

::: xpcom/base/nsMemoryReporterManager.cpp
@@ +45,5 @@
>  
>  #if defined(XP_LINUX)
>  
> +#include <string.h>
> +#include <stdlib.h>

In C++ we are suppose to use cstring and cstdlib instead of string. and stdlib.h.

@@ +47,5 @@
>  
> +#include <string.h>
> +#include <stdlib.h>
> +
> +#define PAGE_SIZE 4096

I would prefer if this is a static const within the function body.
No need to name it pageSize, just name it readSize.

@@ +102,5 @@
> +      if (ptr >= end) {
> +        break;
> +      }
> +      amount += atoi(ptr + sizeof("Private_xxxxx:"));
> +      ptr++;

potential improvement:
  if we read Private_Clean & Private_Dirty at this point, then we can skip the large majority of the non-matching part (~432 chars).  This might also have some nasty side-effect on the cache prefecting.  Note, that you might have to also test if ptr is still below end, such as we can read one line, and increase the size of "len" such as it contains both Private_Clean and Private_Dirty in the worst case scenario.
Attachment #8451997 - Flags: review?(nicolas.b.pierron)
Attachment #8451997 - Flags: review?(n.nethercote)
Attachment #8451997 - Flags: review+
Comment on attachment 8451998 [details] [diff] [review]
4-letter matching

Review of attachment 8451998 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/base/nsMemoryReporterManager.cpp
@@ +89,5 @@
> +
> +  static const uint32_t priv = *(uint32_t*)"Priv";
> +  static const uint32_t riva = *(uint32_t*)"riva";
> +  static const uint32_t ivat = *(uint32_t*)"ivat";
> +  static const uint32_t vate = *(uint32_t*)"vate";

can you try with constants instead of using a read with a cast:
  'P' << 24 | 'r' << 16 | 'i' << 8 | 'v';
  'r' << 24 | 'i' << 16 | 'v' << 8 | 'a';
  'i' << 24 | 'v' << 16 | 'a' << 8 | 't';
  'v' << 24 | 'a' << 16 | 't' << 8 | 'e';

This might help to let the compiler inline these values in the conditions.
Have you verified in the assembly that these constants are inline in the conditions and not loaded at every loop cycle.

@@ +110,5 @@
> +        offset = sizeof("rivate_xxxxx:");
> +      } else if (word == ivat) {
> +        offset = sizeof("ivate_xxxxx:");
> +      } else if (word == vate) {
> +        offset = sizeof("vate_xxxxx:");

Considering that we are mostly looking for performances on ARM it might be good to test as follow, in order to make only one load of a constant and ensure that other constants are quickly computed from the first one (which would be only one instruction instead of 2 for loading a 32 bit constant).

  uint32_t pr = 'P' << 8 | 'r';
  uint32_t priv = pr << 16 | ('i' << 8 | 'v');
  if (word == priv) {
    ...
    goto addAmount:
  }

  uint32_t riva = priv << 8 | 'a';
  if (word == riva) {
    ...
    goto addAmount:
  }

  uint32_t ivat = riva << 8 | 't';
  if (word == ivat) {
    ...
    goto addAmount:
  }

  uint32_t vate = ivat << 8 | 'e';
  if (word == vate) {
    ...
    goto addAmount:
  }

  continue;

addAmount:
  ...

Note that this is not needed if these constant can still live in registers, and even less needed with the skipping strategy described in my previous comment.
Attachment #8451998 - Flags: review?(nicolas.b.pierron)
Attachment #8451998 - Flags: review?(n.nethercote)
Attachment #8451998 - Flags: review+
Thanks a lot for reviewing the patches, even though most of the ideas are from you :)

I addressed your two nits (see below). For the other performance suggestions, I'll have to make my graph tool calculate averages to precisely measure their impact. Let's wait for Nicholas' opinion on these changes first.

(In reply to Nicolas B. Pierron [:nbp] from comment #6)
> > +#include <string.h>
> > +#include <stdlib.h>
> 
> In C++ we are supposed to use cstring and cstdlib instead of string.h and
> stdlib.h.

Fixed.

> > +#define PAGE_SIZE 4096
> 
> I would prefer if this is a static const within the function body.
> No need to name it pageSize, just name it readSize.

I actually went for a `static const uint32_t page = sysconf(_SC_PAGESIZE)` (thanks to :padenot for the tip) https://github.com/jankeromnes/gecko-dev/blob/ussnbp/xpcom/base/nsMemoryReporterManager.cpp#L92
Comment on attachment 8451998 [details] [diff] [review]
4-letter matching

Review of attachment 8451998 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/base/nsMemoryReporterManager.cpp
@@ +89,5 @@
> +
> +  static const uint32_t priv = *(uint32_t*)"Priv";
> +  static const uint32_t riva = *(uint32_t*)"riva";
> +  static const uint32_t ivat = *(uint32_t*)"ivat";
> +  static const uint32_t vate = *(uint32_t*)"vate";

The approach with the shifts would need to be ifdef'ed for big-endian vs. little-endian systems.

I suspect that the pointer casts in the current patch will break on strict-alignment systems.  A union of char[4] and uint32_t would work.

But I just checked whether the load is optimized out, and: for both the pointer cast and the union, GCC 4.9.0 on x86_64 doesn't, but Clang 3.2 does.  GCC does seem willing to hoist the loads out of the loop, at least if there are enough spare registers.
> In C++ we are suppose to use cstring and cstdlib instead of string. and
> stdlib.h.

http://stackoverflow.com/questions/8380805/difference-between-string-h-and-cstring suggests otherwise. 
Also, the mozilla codebase includes string.h 209 times in .cpp files and includes cstring only 27 times.
Comment on attachment 8451997 [details] [diff] [review]
strstr

Review of attachment 8451997 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/base/nsMemoryReporterManager.cpp
@@ +85,5 @@
>      return NS_ERROR_UNEXPECTED;
>    }
>  
>    int64_t amount = 0;
> +  static const uint32_t len = 32;

It took me quite some time to realize what |len| is for. AIUI, it's to account for the possibility of a Private_{Dirty,Clean} line being spread across two fread() calls -- you're preserving |len| chars from each previous fread(). Please add a comment about this.

Please also add a comment explaining the syntax of the lines being looked for -- with the old code it was obvious, but with the new code it is not.

@@ +90,5 @@
> +  char buffer[len + PAGE_SIZE + 1];
> +
> +  for (int i = 0; i < len; i++) {
> +    buffer[i] = ' ';
> +  }

Use memset() instead.

@@ +101,5 @@
> +    while (ptr = strstr(ptr, "Private")) {
> +      if (ptr >= end) {
> +        break;
> +      }
> +      amount += atoi(ptr + sizeof("Private_xxxxx:"));

Space after the colon?
Attachment #8451997 - Flags: review?(n.nethercote) → review+
Comment on attachment 8451998 [details] [diff] [review]
4-letter matching

Review of attachment 8451998 [details] [diff] [review]:
-----------------------------------------------------------------

I don't like this version. Too complicated for a marginal improvement.
Attachment #8451998 - Flags: review?(n.nethercote) → review-
Thanks Nicholas, Nicolas and Jed for your input!

(In reply to Nicholas Nethercote [:njn] from comment #12)
> I don't like this version. Too complicated for a marginal improvement.

Agreed, the solution is very cool but adds a bit of code complexity, so let's go with strstr.

(In reply to Nicholas Nethercote [:njn] from comment #11)
> > +  static const uint32_t len = 32;
> 
> It took me quite some time to realize what |len| is for. AIUI, it's to
> account for the possibility of a Private_{Dirty,Clean} line being spread
> across two fread() calls -- you're preserving |len| chars from each previous
> fread(). Please add a comment about this.

You're right, that code wasn't very explicit. I renamed the variable to `carryOver` and added a comment to explain its purpose.

> Please also add a comment explaining the syntax of the lines being looked
> for -- with the old code it was obvious, but with the new code it is not.

Done.

> > +  for (int i = 0; i < len; i++) {
> > +    buffer[i] = ' ';
> > +  }
> 
> Use memset() instead.

Done.

> > +      amount += atoi(ptr + sizeof("Private_xxxxx:"));
> 
> Space after the colon?

Not needed.

(In reply to Nicolas B. Pierron [:nbp] from comment #6)
> potential improvement:
>   if we read Private_Clean & Private_Dirty at this point, then we can skip
> the large majority of the non-matching part (~432 chars).  This might also
> have some nasty side-effect on the cache prefecting.  Note, that you might
> have to also test if ptr is still below end, such as we can read one line,
> and increase the size of "len" such as it contains both Private_Clean and
> Private_Dirty in the worst case scenario.

This sounds interesting, but adds a lot of code and only saves 2 or 3ms, so I decided to stay with the original code.
Attachment #8451997 - Attachment is obsolete: true
Comment on attachment 8454108 [details] [diff] [review]
Make Linux USS reporting 2x faster.

Carrying over Nic{,h}olas' r+.
Attachment #8454108 - Flags: review+
Comment on attachment 8454108 [details] [diff] [review]
Make Linux USS reporting 2x faster.

Review of attachment 8454108 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, thanks!

::: xpcom/base/nsMemoryReporterManager.cpp
@@ +84,5 @@
>  
> +  // We carry over the end of the buffer to make sure we don't accidentally
> +  // break on an interesting line.
> +  static const uint32_t carryOver = 32;
> +  static const uint32_t pageSize = sysconf(_SC_PAGESIZE);

I'd just use 4096 and rename it |readSize|. There's no fundamental connection with the page size here.
Adressed nit, carried over.

(In reply to Nicholas Nethercote [:njn] from comment #16)
> I'd just use 4096 and rename it |readSize|. There's no fundamental
> connection with the page size here.

I was assuming that reading a file page by page was somehow faster, but I guess I was wrong. Renamed and changed to 4096.
Attachment #8454108 - Attachment is obsolete: true
Attachment #8454399 - Flags: review+
Unrelated orange, let's go ahead with "Make Linux USS reporting 2x faster".
Status: NEW → ASSIGNED
Keywords: checkin-needed
Backed out for asan failures:
https://tbpl.mozilla.org/php/getParsedLog.php?id=43610880&tree=Mozilla-Inbound
07:20:17     INFO -  1572 INFO TEST-START | /tests/image/test/mochitest/test_bug601470.html
07:20:18     INFO -  =================================================================
07:20:18     INFO -  ==1781==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7fff8ecc7e85 at pc 0x460ee3 bp 0x7fff8ecc6dd0 sp 0x7fff8ecc6da8
07:20:18     INFO -  READ of size 1 at 0x7fff8ecc7e85 thread T0
07:20:18     INFO -      #0 0x460ee2 in __interceptor_atoi /builds/slave/moz-toolchain/src/llvm/projects/compiler-rt/lib/asan/asan_interceptors.cc:595
07:20:19     INFO -      #1 0x7f201f1dd97e in GetProcSelfSmapsPrivate /builds/slave/m-in-l64-asan-0000000000000000/build/xpcom/base/nsMemoryReporterManager.cpp:102
07:20:19     INFO -      #2 0x7f201f1dd97e in ResidentUniqueDistinguishedAmount(long*) /builds/slave/m-in-l64-asan-0000000000000000/build/xpcom/base/nsMemoryReporterManager.cpp:142
07:20:19     INFO -      #3 0x7f201f1e712f in ResidentUniqueReporter::CollectReports(nsIMemoryReporterCallback*, nsISupports*, bool) /builds/slave/m-in-l64-asan-0000000000000000/build/xpcom/base/nsMemoryReporterManager.cpp:156
07:20:19     INFO -      #4 0x7f201f1db9bc in nsMemoryReporterManager::GetReportsForThisProcessExtended(nsIMemoryReporterCallback*, nsISupports*, bool, _IO_FILE*) /builds/slave/m-in-l64-asan-0000000000000000/build/xpcom/base/nsMemoryReporterManager.cpp:1214
07:20:19     INFO -      #5 0x7f201f2e2371 in NS_InvokeByIndex /builds/slave/m-in-l64-asan-0000000000000000/build/xpcom/reflect/xptcall/md/unix/xptcinvoke_x86_64_unix.cpp:162
07:20:19     INFO -      #6 0x7f202279c207 in Invoke /builds/slave/m-in-l64-asan-0000000000000000/build/js/xpconnect/src/XPCWrappedNative.cpp:2395
07:20:19     INFO -      #7 0x7f202279c207 in CallMethodHelper /builds/slave/m-in-l64-asan-0000000000000000/build/js/xpconnect/src/XPCWrappedNative.cpp:1736
07:20:19     INFO -      #8 0x7f202279c207 in XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode) /builds/slave/m-in-l64-asan-0000000000000000/build/js/xpconnect/src/XPCWrappedNative.cpp:1703
07:20:19     INFO -      #9 0x7f20227a20ce in XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*) /builds/slave/m-in-l64-asan-0000000000000000/build/js/xpconnect/src/XPCWrappedNativeJSOps.cpp:1272
07:20:19     INFO -      #10 0x7f20279a64c3 in CallJSNative /builds/slave/m-in-l64-asan-0000000000000000/build/js/src/jscntxtinlines.h:230
...
...
07:20:19     INFO -  Address 0x7fff8ecc7e85 is located in stack of thread T0 at offset 4165 in frame
07:20:19     INFO -      #0 0x7f201f1dd78f in ResidentUniqueDistinguishedAmount(long*) /builds/slave/m-in-l64-asan-0000000000000000/build/xpcom/base/nsMemoryReporterManager.cpp:141
07:20:19     INFO -    This frame has 1 object(s):
07:20:19     INFO -      [32, 4161) 'buffer.i' <== Memory access at offset 4165 overflows this variable
07:20:19     INFO -  HINT: this may be a false positive if your program uses some custom stack unwind mechanism or swapcontext
07:20:19     INFO -        (longjmp and C++ exceptions *are* supported)
07:20:19     INFO -  SUMMARY: AddressSanitizer: stack-buffer-overflow /builds/slave/moz-toolchain/src/llvm/projects/compiler-rt/lib/asan/asan_interceptors.cc:595 __interceptor_atoi
07:20:19     INFO -  Shadow bytes around the buggy address:
07:20:19     INFO -    0x100071d90f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
07:20:19     INFO -    0x100071d90f90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
07:20:19     INFO -    0x100071d90fa0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
07:20:19     INFO -    0x100071d90fb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
07:20:19     INFO -    0x100071d90fc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
07:20:19     INFO -  =>0x100071d90fd0:[01]f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 f3
07:20:19     INFO -    0x100071d90fe0: f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 f3
07:20:19     INFO -    0x100071d90ff0: f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00
07:20:19     INFO -    0x100071d91000: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f2 f2 f2
07:20:19     INFO -    0x100071d91010: 00 00 f2 f2 00 00 f3 f3 00 00 00 00 00 00 00 00
07:20:19     INFO -    0x100071d91020: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
07:20:19     INFO -  Shadow byte legend (one shadow byte represents 8 application bytes):
07:20:19     INFO -    Addressable:           00
07:20:19     INFO -    Partially addressable: 01 02 03 04 05 06 07
07:20:19     INFO -    Heap left redzone:       fa
07:20:19     INFO -    Heap right redzone:      fb
07:20:19     INFO -    Freed heap region:       fd
07:20:19     INFO -    Stack left redzone:      f1
07:20:19     INFO -    Stack mid redzone:       f2
07:20:19     INFO -    Stack right redzone:     f3
07:20:19     INFO -    Stack partial redzone:   f4
07:20:19     INFO -    Stack after return:      f5
07:20:19     INFO -    Stack use after scope:   f8
07:20:19     INFO -    Global redzone:          f9
07:20:19     INFO -    Global init order:       f6
07:20:19     INFO -    Poisoned by user:        f7
07:20:19     INFO -    Contiguous container OOB:fc
07:20:19     INFO -    ASan internal:           fe
07:20:19     INFO -  ==1781==ABORTING
07:20:20     INFO -  TEST-INFO | Main app process: killed by SIGHUP
07:20:20  WARNING -  TEST-UNEXPECTED-FAIL | /tests/image/test/mochitest/test_bug601470.html | application terminated with exit code 1


remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/9396c2c95b00
Comment on attachment 8454399 [details] [diff] [review]
Make Linux USS reporting 2x faster.

Review of attachment 8454399 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/base/nsMemoryReporterManager.cpp
@@ +99,5 @@
> +    // We are looking for lines like "Private_{Clean,Dirty}: 4 kB".
> +    while (ptr = strstr(ptr, "Private")) {
> +      ptr += sizeof("Private_Xxxxx:");
> +      amount += atoi(ptr);
> +      if (ptr >= end) {

This check should be the first thing to do after the condition of the while.
Doing it here is clearly the cause of the ASAN failure.
Attachment #8454399 - Attachment is obsolete: true
Comment on attachment 8458549 [details] [diff] [review]
Make Linux USS reporting 2x faster.

Nicolas, please have a look again.

Try: https://tbpl.mozilla.org/?tree=Try&rev=8c83d5cf8703
Attachment #8458549 - Flags: review?(nicolas.b.pierron)
Comment on attachment 8458549 [details] [diff] [review]
Make Linux USS reporting 2x faster.

Review of attachment 8458549 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/base/nsMemoryReporterManager.cpp
@@ +83,5 @@
>    }
>  
> +  // We carry over the end of the buffer to make sure we don't accidentally
> +  // break on an interesting line.
> +  static const uint32_t carryOver = 32;

comment-nit:
  We carry over the end of the buffer to the beginning to make sure we only interpret complete lines.

@@ +84,5 @@
>  
> +  // We carry over the end of the buffer to make sure we don't accidentally
> +  // break on an interesting line.
> +  static const uint32_t carryOver = 32;
> +  static const uint32_t pageSize = sysconf(_SC_PAGESIZE);

nit: see comment 16 & comment 17

@@ +90,3 @@
>    int64_t amount = 0;
> +  char buffer[carryOver + pageSize + 1];
> +  memset(buffer, ' ', carryOver);

comment-nit:
  Fill the beginning of the buffer with spaces, as a sentinel for the first iteration.

@@ +106,2 @@
>      }
> +    if (bytes < pageSize) {

nit:
  // We do not expect any match within the end of the buffer.
  MOZ_ASSERT(!ptr); // or MOZ_ASSERT(!strstr(end, "Private"))

@@ +106,5 @@
>      }
> +    if (bytes < pageSize) {
> +      break;
> +    }
> +    memcpy(buffer, end, carryOver);

comment-nit:
  Carry the end of the buffer to the beginning.
Attachment #8458549 - Flags: review?(nicolas.b.pierron) → review+
Nits, unnecessary re-try: https://tbpl.mozilla.org/?tree=Try&rev=3a8255064ae9
Attachment #8451998 - Attachment is obsolete: true
Attachment #8458549 - Attachment is obsolete: true
Comment on attachment 8459487 [details] [diff] [review]
Make Linux USS reporting 2x faster.

(carried over)
Attachment #8459487 - Flags: review+
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/0a2f3cd3c814
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla34
You need to log in before you can comment on or make changes to this bug.