[MTBF][Memory Report] Memory report pulling causing processes to exit

Status: RESOLVED FIXED
Component: Toolkit :: about:memory
Reporter: ypwalter (Unassigned)
Blocks: 1 bug
Firefox Tracking Flags: (Not tracked)
Whiteboard: [MemShrink:P1]
Opened: 3 years ago; Last resolved: 2 years ago

Description (Reporter, 3 years ago)

When pulling memory reports on B2G, child processes exit frequently. Is there a way to ease this pressure?

Got 0/10 files.
Warning: Child 4544 exited during memory reporting
10:49:00 
Warning: Child 6920 exited during memory reporting
10:49:00 
Warning: Child 4927 exited during memory reporting
10:49:00 
Warning: Child 4470 exited during memory reporting
10:49:00 
Warning: Child 4809 exited during memory reporting
10:49:01 
Got 0/5 files.
Warning: Child 7286 exited during memory reporting
10:49:03 
Got 0/4 files.
Got 1/1 files.
Updated (Reporter, 3 years ago)
Blocks: 990888
Comment 1 (Jed)

It would be good to confirm that the processes are exiting because of the memory pressure of memory reporting, rather than for some other reason.

There is definitely room for improvement here — PMemoryReportRequest sends the child's entire memory report as a single array rather than streaming it.  This is a known issue and probably not too difficult to fix; it just hadn't been reported as being enough of a problem in practice to give it high priority.
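
The single-array transfer described above can be sketched as follows (a hypothetical Python illustration, not the actual C++/IPDL `PMemoryReportRequest` code; all names here are illustrative): sending everything at once forces the child to hold the entire serialized payload in memory, while streaming in chunks bounds the peak overhead.

```python
# Hypothetical sketch of batch vs. streamed report transfer; the real
# PMemoryReportRequest protocol is C++/IPDL, and these names are made up.

def send_as_single_array(reports):
    """Materialize every serialized report, then send one big message.
    Peak memory in the child is proportional to the total report size."""
    return [[str(r) for r in reports]]  # one message containing everything

def send_streamed(reports, chunk_size=64):
    """Yield reports in small batches; peak memory is roughly one chunk."""
    chunk = []
    for r in reports:
        chunk.append(str(r))
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk
```

Either way the parent ends up with the same reports; the difference is only the child's transient footprint while sending.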

Beyond that, if there's more per-child-process overhead that's not as easy to deal with, it might make sense to consider collecting child process reports serially (or at least limiting the concurrency to the physical CPU count) to keep the peak overhead down.
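
The concurrency-limiting idea could look roughly like this (a hypothetical Python sketch, not Gecko code; `collect_one` and the overall shape are assumptions): query at most N children at a time instead of all of them at once.

```python
# Hypothetical sketch: collect per-child memory reports with bounded
# concurrency, so the transient reporting overhead is not paid for every
# child process simultaneously. Not actual Gecko code.
import os
from concurrent.futures import ThreadPoolExecutor

def collect_reports(children, collect_one, max_parallel=None):
    """Run collect_one(child) for each child, at most max_parallel at a
    time (defaulting to the CPU count); max_parallel=1 is fully serial."""
    if max_parallel is None:
        max_parallel = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(collect_one, children))
```

Setting `max_parallel=1` gives the fully serial behavior; the physical CPU count is a middle ground between peak overhead and wall-clock time.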
Whiteboard: [MemShrink]
In particular, it would be nice to verify that the children aren't crashing, like in bug 1125490.

Updated (3 years ago)
Depends on: 1149085
No longer depends on: 1149085
Depends on: 1149085

Updated (3 years ago)
Whiteboard: [MemShrink] → [MEMSHRINK:P2]

Updated (3 years ago)
Whiteboard: [MEMSHRINK:P2] → [MemShrink:P1]

Comment 3 (Eric Rahm [:erahm])

After discussing this in #memshrink we bumped it to P1; we should make our best effort not to increase memory consumption while performing memory testing (regardless of whether it causes OOMs). I've definitely seen OOMs on low-end devices in the past that were pretty clearly triggered by generating a memory report.

There are several potential ways we can address this as noted by Jed in comment 1, as well as a few more:
- Stream the memory reports from child to parent
- Do not perform memory reports in parallel
- Go back to writing the individual reports to files, merging them (or not) in the parent process
- Reduce the overall size of memory reports
  - The reports are currently rather verbose, with thousands of copies of descriptions; perhaps add an option to omit those on B2G
  - If we serialize to a file instead, we might want to look at making the JSON format more terse, or switch to a binary format
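
The description-deduplication idea in the last bullet could be sketched like this (hypothetical; the real format is the about:memory JSON, and these field names are simplified): store each distinct description string once and reference it by index from every report entry.

```python
# Hypothetical sketch of shrinking memory reports by interning the
# heavily repeated description strings; field names are simplified.
def dedupe_descriptions(reports):
    descriptions, index = [], {}   # string table and string -> id map
    slim = []
    for r in reports:
        desc = r["description"]
        if desc not in index:
            index[desc] = len(descriptions)
            descriptions.append(desc)
        slim.append({"path": r["path"], "amount": r["amount"],
                     "desc_id": index[desc]})
    return {"descriptions": descriptions, "reports": slim}
```

Since most reporters emit the same description for many paths, the string table stays small while each entry carries only a small integer.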
Depends on: 1151597
Filed bug 1151597 for getting rid of the big array in PMemoryReportRequest.
(In reply to Eric Rahm [:erahm] from comment #3)
> I've definitely seen OOMs on low-end devices in the past that
> were pretty clearly triggered by generating a memory report.

I assume those devices were using swap in the form of zRAM.  Testing on my Flame suggests that that's responsible for a lot of this: reporting a process's memory causes a certain amount of swap-in, and reporting every process's memory at once (when free/evictable memory is already scarce) causes a few to be OOM-killed — and the resulting state has about the same amount of free and cached memory as before, but a lot more free swap.  If I disable zRAM (and adjust the amount of RAM correspondingly), it's very hard to reproduce this even without any other changes.

Which brings us to:

> - Do not perform memory reports in parallel

I haven't implemented this yet, but I suspect it will be a much bigger win than bug 1151597 (which I have implemented — empirically it doesn't seem to help much, compared to switching from zRAM to real RAM).

>   - The reports currently are rather verbose with thousands of copies of
> descriptions, perhaps add an option to omit those on b2g
>   - If we serialize to a file instead we might want to look at making the
> json format more terse / switch to a binary format

I'm not so sure this is significant, at least for B2G — even with 10 child processes, the uncompressed JSON is only 3-4 MiB.

Comment 6 (3 years ago)

After 22 hours of MTBF runs we saw the messages below, which caused MTBF to exit with an exception.

Got 0/9 files.
Warning: Child 19600 exited during memory reporting
Warning: Child 19505 exited during memory reporting
Warning: Child 19690 exited during memory reporting
Warning: Child 17508 exited during memory reporting
Warning: Child 19903 exited during memory reporting
Got 0/4 files.
Warning: Child 18404 exited during memory reporting
Warning: Child 20227 exited during memory reporting
Warning: Child 2132 exited during memory reporting
Warning: Child 1295 exited during memory reporting
Got 0/0 files.
Build was aborted
Archiving artifacts
Command adb shell 'echo -n "gc log" > "/data/local/debug_info_trigger"; echo -n "|$?"' failed with error code 143
Terminated
Pulled files into /var/jenkins/workspace/flamekk.v2.2.moztwlab01.319.mtbf_op@7/label/moztwlab-01/output_164/about-memory43.
Failed to retrieve memory reports
Pulling GC/CC logs...
Crash in get-about-memory
Depends on: 1154053
More observations:

1. MinimizeMemoryUsage actually increases memory usage, substantially, when called on a mostly swapped-out process — enough that trying to minimize the parent process can kill several children (i.e., serialization won't help here).  But it's not used by default.

2. Verbose GC/CC logs, which are the default for get_about_memory.py, seem to have the same concurrency problem as memory reports, and at a roughly similar magnitude.  Abbreviated GC/CC logs seem to not be a problem.

Updated (3 years ago)
OS: Linux → Gonk (Firefox OS)
Hardware: x86_64 → ARM
All blocking bugs have been fixed.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED