Closed Bug 1779138 Opened 3 years ago Closed 3 years ago

Large 23 second CPU spikes in nsMemoryReporterManager::ResidentUnique(unsigned int)

Categories

(Core :: XPCOM, defect, P1)


Tracking


RESOLVED FIXED
106 Branch
Performance Impact ?
Tracking Status
firefox-esr91 --- unaffected
firefox-esr102 --- unaffected
firefox103 --- unaffected
firefox104 --- unaffected
firefox105 + fixed
firefox106 --- fixed

People

(Reporter: jrmuizel, Assigned: pbone)

References

(Regression)

Details

(Keywords: regression, regressionwindow-wanted)

Attachments

(2 files, 1 obsolete file)

I'm getting these about 50 seconds apart.

Here's a stack:

19.94 s  100.0%	0 s	 	mozilla::detail::RunnableFunction<mozilla::MemoryTelemetry::GatherTotalMemory()::$_3>::Run()
19.94 s  100.0%	0 s	 	 nsMemoryReporterManager::ResidentUnique(unsigned int)
19.94 s   99.9%	11.00 ms	 	  ResidentUniqueDistinguishedAmount(long long*, unsigned int) (.llvm.6731204511831033690)
19.91 s   99.8%	15.00 ms	 	   mach_vm_region
19.85 s   99.5%	17.00 ms	 	    mach_msg
19.84 s   99.4%	19.84 s	 	     mach_msg_trap

Inside the kernel the stack looks like:

18.75 s   64.0%	6.00 ms	 	             mach_vm_region
18.70 s   63.8%	15.00 ms	 	              mach_msg
18.69 s   63.8%	57.00 ms	 	               mach_msg_trap
18.56 s   63.3%	0 s	 	                mach_call_munger64
18.50 s   63.2%	8.00 ms	 	                 mach_msg_overwrite_trap
18.22 s   62.2%	6.00 ms	 	                  ipc_kmsg_send
18.15 s   62.0%	7.00 ms	 	                   ipc_kobject_server
18.06 s   61.7%	3.00 ms	 	                    0xffffff8000382360
18.01 s   61.5%	5.00 ms	 	                     vm_map_region
17.78 s   60.7%	3.31 s	 	                      vm_map_region_walk
6.88 s   23.5%	6.88 s	 	                       IORWLockUnlock
6.68 s   22.8%	6.68 s	 	                       IORWLockWrite

I still see this even after restarting Firefox.

:kmag, this one's a blast from the past: MemoryTelemetry is causing some grief.

Flags: needinfo?(kmaglione+bmo)

It's interesting. I don't remember us having problems with the macOS memory telemetry. Maybe we should add telemetry for how long the reporters take so we catch regressions. I thought we used to have them. There may be machine-specific issues.

We should maybe disable the reporters on machines where they take too long, but there's no telling how that would skew the data they collect.

Flags: needinfo?(kmaglione+bmo)

Unfortunately, our team does not have the domain knowledge to fix this problem, and does not own this data or its collection/reporting code. Going as far back as bug 829439, the relevant component seems to be Core :: XPCOM.

In general, our team is responsible for the data collection tools (e.g. Legacy Telemetry, Glean SDKs) but not for the code using our tools to collect their data.

Component: Telemetry → XPCOM
Product: Toolkit → Core

Paul, is this related to stuff you've worked on recently?

Flags: needinfo?(pbone)
See Also: → 1781266

In addition to being slow and using significant power, it seems to slow down everything happening on other threads (especially the main thread). Example profiles: https://share.firefox.dev/3zrjwfp https://share.firefox.dev/3oMyY0Y
See how the profiler is sampling slowly until the CPU activity on the StreamTrans thread finishes, and how the main thread is getting very little CPU time. I'm guessing whatever we are doing with the memory also slows down memory accesses for other threads of the process.

Performance Impact: --- → ?

I'm guessing enumerating memory mappings is more expensive on the M1 kernel, for some reason. We should maybe disable this probe on macOS on ARM until we figure it out.

I was seeing this on x86-64

It seems like we should probably get some telemetry on how long GatherTotalMemory takes

The slowness of ResidentUniqueDistinguishedAmount had also been pointed out in bug 1665318 for about:processes, and I'm pretty sure that was before we had M1s. I fixed it there by switching to a call to task_info(selectedTask, TASK_VM_INFO, (task_info_t)&task_vm_info, &count); See discussion in https://phabricator.services.mozilla.com/D125729#inline-692250, which mentions this API "was introduced in OS X 10.11 (and our minimum supported version is 10.12, so it's fine!)".
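
For illustration, here is a minimal sketch of the task_info(TASK_VM_INFO) call being discussed; the helper name, error handling, and types here are placeholders, not the actual patch.

#include <cstdint>
#include <mach/mach.h>

// Sketch only: one task_info() call returns the task's phys_footprint,
// instead of walking every mapping with mach_vm_region().
static bool GetPhysFootprint(task_t aTask, int64_t* aAmount) {
  task_vm_info_data_t info;
  mach_msg_type_number_t count = TASK_VM_INFO_COUNT;
  kern_return_t kr = task_info(aTask, TASK_VM_INFO,
                               reinterpret_cast<task_info_t>(&info), &count);
  if (kr != KERN_SUCCESS) {
    return false;
  }
  // phys_footprint was introduced in OS X 10.11, per the review discussion
  // linked above.
  *aAmount = static_cast<int64_t>(info.phys_footprint);
  return true;
}

Since this is a single call per process rather than one call per mapping, its cost should not grow with the number of mapped segments.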

(In reply to Florian Quèze [:florian] from comment #12)

The slowness of ResidentUniqueDistinguishedAmount had also been pointed out in bug 1665318 for about:processes, and I'm pretty sure that was before we had M1s. I fixed it there by switching to a call to task_info(selectedTask, TASK_VM_INFO, (task_info_t)&task_vm_info, &count); See discussion in https://phabricator.services.mozilla.com/D125729#inline-692250, which mentions this API "was introduced in OS X 10.11 (and our minimum supported version is 10.12, so it's fine!)".

I'd probably be OK with that change if you want to write a patch. I don't have the time right now, and I don't have a Mac build machine to test on at the moment, so it would require long round-trip try builds to test.

(In reply to Kris Maglione [:kmag] from comment #13)

(In reply to Florian Quèze [:florian] from comment #12)

The slowness of ResidentUniqueDistinguishedAmount had also been pointed out in bug 1665318 for about:processes, and I'm pretty sure that was before we had M1s. I fixed it there by switching to a call to task_info(selectedTask, TASK_VM_INFO, (task_info_t)&task_vm_info, &count); See discussion in https://phabricator.services.mozilla.com/D125729#inline-692250, which mentions this API "was introduced in OS X 10.11 (and our minimum supported version is 10.12, so it's fine!)".

I'd probably be OK with that change if you want to write a patch.

I wrote the patch I had in mind, which replaces the current implementation with what I use in about:processes. When looking at about:memory, the resident-unique value is now higher than resident for the parent process, which seems incorrect, e.g.:
363.95 MB ── resident
374.95 MB ── resident-peak
403.02 MB ── resident-unique
I wonder if the OS could be attributing the shared memory to the parent process. The resident and resident-unique values look reasonable for child processes.

I think this is better for Telemetry (collecting a value that matches what users see in Activity Monitor makes sense), but for about:memory I'm hesitant. How much do we care about this value really being the resident-unique amount of memory?

Assignee: nobody → florian
Status: NEW → ASSIGNED

I'm still trying to understand this. The documentation is bad to nonexistent.

I don't like the "resident-unique" value in about:memory being higher than the "resident" value. It will be confusing to most people reading the output. We could have a separate "Fast" version of the function for telemetry like we have for the "resident" number.

The "phys_footprint" value does appear to include some kernel memory, like the page table, from what I can tell. But that probably isn't 40MB. It also accounts for compressed memory, but I don't know what that accounting means exactly because, again, the documentation doesn't seem to exist.

And I can't figure out whether or how it accounts for shared memory...

I landed bug 1743781 recently, so this is almost certainly a regression from that: the new kernel APIs I'm using there are slower and also hold a lock that creates contention for other threads.

I am worried that Florian's patch counts shared memory, but as kmag has said the documentation is very much lacking.

I like kmag's proposal of using phys_footprint as our "fast" method and using that in telemetry, while keeping the existing slow path for about:memory. But I'd also like to suggest that we add both to about:memory. Counting memory is tricky and depends on what you're trying to find out. Rather than labeling entries "resident-unique" and "resident-unique-fast", I'd like to name them after the APIs we gather them from, so that when you're reading a report you can remember "oh yes, that's reported by the X API". So I suggest: resident-unique (using the method in ResidentUniqueDistinguishedAmount, which I think is as "authoritative" as it gets) and resident-phys-footprint (for the phys_footprint field returned by task_info, which is also what telemetry would use).

Flags: needinfo?(pbone)
Regressed by: 1743781

Set release status flags based on info from the regressing bug 1743781

[Tracking Requested - why for this release]: Potentially serious responsiveness regression. Workaround would be "disable telemetry" (I think?) which is not something we want to recommend.

Severity: -- → S2
Priority: -- → P1

Hi :jrmuizel, can you reproduce this reliably? Can you verify that it is caused by bug 1743781? I'd like to suggest backing that out, but I don't want to waste time if it's not the culprit. Thanks.

Flags: needinfo?(jmuizelaar)

(In reply to Paul Bone [:pbone] [PTO until 8/8/22] from comment #20)

Can you verify that it is caused by bug 1743781?

We can verify from the profiles in comment 8, as they contain line information (double-click the "ResidentUniqueDistinguishedAmount" line in the call tree or flame graph to open the inline source view panel at the bottom of the profiler front-end).

In https://share.firefox.dev/3zrjwfp, on the StreamTrans #5779 thread, there are 1723 samples in ResidentUniqueDistinguishedAmount calling mach_vm_region: 5 of them are on the existing call, and 1718 (99.7%) are on the new call in the SM_SHARED case added by bug 1743781. On the StreamTrans #5795 thread, there are 1605 samples in mach_vm_region: 50 in the old call and 1556 in the new one (about 97%).

In https://share.firefox.dev/3oMyY0Y there are 144 samples in mach_vm_region, all of them in the new call.

I think that's enough to confidently say 1743781 is the culprit.

(In reply to Paul Bone [:pbone] [PTO until 8/8/22] from comment #17)

I landed bug 1743781 recently, so this is almost certainly a regression from that: the new kernel APIs I'm using there are slower and also hold a lock that creates contention for other threads.

I can see why that would make it slower, but I'm surprised that it would cause that much of a CPU/power usage spike unless we're talking about spinlocks. But I think maybe just the extra system call overhead for every SM_SHARED segment would be enough to explain it.

But I'd also like to suggest that we add both to about:memory.

I was actually thinking that too, mainly because we see a lot of confusion where people look at processes in about:memory and none of the numbers shown match the size shown in the system process manager. We should probably try to do something similar on Windows, though I don't know the appropriate API in that case. On Linux, there isn't really a standard process manager, so there isn't an obvious number to choose. Maybe whatever the GNOME process manager shows.

That doesn't have to happen in this bug, though.
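
To make the per-segment overhead mentioned above concrete, here is an illustrative sketch; it is not the code from bug 1743781, only the assumed shape of the cost: one mach_vm_region call per mapping, plus an extra query for each SM_SHARED mapping.

#include <mach/mach.h>
#include <mach/mach_vm.h>
#include <mach/vm_region.h>

// Illustrative only. Each call is a mach_msg round-trip into the kernel.
static void WalkRegions(task_t aTask) {
  mach_vm_address_t addr = 0;
  for (;;) {
    mach_vm_size_t size = 0;
    vm_region_top_info_data_t top;
    mach_msg_type_number_t count = VM_REGION_TOP_INFO_COUNT;
    mach_port_t objName = MACH_PORT_NULL;
    // One round-trip per mapping.
    if (mach_vm_region(aTask, &addr, &size, VM_REGION_TOP_INFO,
                       reinterpret_cast<vm_region_info_t>(&top), &count,
                       &objName) != KERN_SUCCESS) {
      break;  // Walked past the last mapping.
    }
    if (top.share_mode == SM_SHARED) {
      // Hypothetical extra query for shared mappings -- another round-trip,
      // roughly doubling the syscall count for processes with many shared
      // segments.
      vm_region_extended_info_data_t ext;
      mach_msg_type_number_t extCount = VM_REGION_EXTENDED_INFO_COUNT;
      mach_vm_address_t extAddr = addr;
      mach_vm_size_t extSize = 0;
      mach_vm_region(aTask, &extAddr, &extSize, VM_REGION_EXTENDED_INFO,
                     reinterpret_cast<vm_region_info_t>(&ext), &extCount,
                     &objName);
    }
    addr += size;  // Advance to the next region.
  }
}

Each of those calls spends time under the task's VM map lock in the kernel, which lines up with the IORWLock time under vm_map_region_walk in the kernel stack at the top of this bug.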

(In reply to Florian Quèze [:florian] from comment #21)

(In reply to Paul Bone [:pbone] [PTO until 8/8/22] from comment #20)

Can you verify that it is caused by bug 1743781?

We can verify from the profiles in comment 8, as they contain line information (double-click the "ResidentUniqueDistinguishedAmount" line in the call tree or flame graph to open the inline source view panel at the bottom of the profiler front-end).

In https://share.firefox.dev/3zrjwfp, on the StreamTrans #5779 thread, there are 1723 samples in ResidentUniqueDistinguishedAmount calling mach_vm_region: 5 of them are on the existing call, and 1718 (99.7%) are on the new call in the SM_SHARED case added by bug 1743781. On the StreamTrans #5795 thread, there are 1605 samples in mach_vm_region: 50 in the old call and 1556 in the new one (about 97%).

In https://share.firefox.dev/3oMyY0Y there are 144 samples in mach_vm_region, all of them in the new call.

I think that's enough to confidently say 1743781 is the culprit.

Oof, yeah, pretty obvious. Thanks.

Flags: needinfo?(jmuizelaar)

Hi Dianna,

Can you please back out Bug 1743781 in Beta only, to remove this regression from Beta (FF 104)? We'll try to get a real fix for Nightly (FF 105).

[Tracking Requested - why for this release]: Let's track this in 105 too as we are for 104. It's a big enough problem regardless of what version it's in.

Flags: needinfo?(dsmith)

(In reply to Paul Bone [:pbone] [PTO until 8/8/22] from comment #24)

Hi Dianna,

Can you please back out Bug 1743781 in Beta only, to remove this regression from Beta (FF 104)? We'll try to get a real fix for Nightly (FF 105).

[Tracking Requested - why for this release]: Let's track this in 105 too as we are for 104. It's a big enough problem regardless of what version it's in.

OK, it looks like I have to back out bug 1777220 as well, since it was a regression from bug 1743781.

Flags: needinfo?(dsmith)

Removing tracking as this no longer affects 104 after the backout of bug 1743781 and bug 1777220 from Beta.

I might still work on this a bit tomorrow, but I don't think I will have time to finish the fix, get reviews, and land it within a day, and I'll be on PTO for the following two weeks. So Paul and/or Kris, feel free to take over and push the WIP patch I had across the finish line.

(In reply to Florian Quèze [:florian] from comment #27)

I might still work on this a bit tomorrow, but I don't think I will have time to finish the fix, get reviews, and land it within a day, and I'll be on PTO for the following two weeks. So Paul and/or Kris, feel free to take over and push the WIP patch I had across the finish line.

No worries. I'd be happy to take it; I'll look next week.

QA Whiteboard: [qa-regression-triage]
Assignee: florian → pbone

Set release status flags based on info from the regressing bug 1743781

This patch disables calculating resident unique memory for telemetry, but
retains it for about:memory. Telemetry has other memory measurements,
including a different fast path.

See Also: → 1786852
See Also: → 1786860
See Also: → 1786864
Pushed by pbone@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/6cde786bcd09 Don't query resident unique memory on MacOS r=mccr8
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 106 Branch

I've been adding telemetry to prevent this from regressing in the future (bug 1786864) and noticed that these calls are usually fast. However, the browser session I was testing was only a minute or two old, and perhaps these calls slow down as the process continues to run. Or something else about the affected people's Macs slows them down. Certainly the effect is real; it's just interesting that I didn't see it myself, so it requires "something else" to show up.

MEMORY_COLLECTION_TIME
51 samples, average = 41.8, sum = 2,133

14 | 0 0%
19 |## 3 6%
26 |### 4 8%
36 |######################### 40 78%
49 |## 3 6%
67 |# 1 2%
92 | 0 0%

Comment on attachment 9288254 [details]
Bug 1779138 - use TASK_VM_INFO for the resident unique memory value on Mac, r=kmag.

Revision D153569 was moved to bug 1786860. Setting attachment 9288254 [details] to obsolete.

Attachment #9288254 - Attachment is obsolete: true

Comment on attachment 9291317 [details]
Bug 1779138 - Don't query resident unique memory on MacOS r=mccr8

Beta/Release Uplift Approval Request

  • User impact if declined: A performance regression affecting MacOS users would be released.
  • Is this code covered by automated tests?: Unknown
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This patch disables gathering MEMORY_UNIQUE telemetry on MacOS. There's enough other memory telemetry that we can do without this.
  • String changes made/needed: None
  • Is Android affected?: No
Attachment #9291317 - Flags: approval-mozilla-beta?

Comment on attachment 9291317 [details]
Bug 1779138 - Don't query resident unique memory on MacOS r=mccr8

Approved for 105.0b5.

Attachment #9291317 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

(In reply to Paul Bone [:pbone] from comment #34)

I've been adding telemetry to prevent this from regressing in the future (bug 1786864) and noticed that these calls are usually fast. However, the browser session I was testing was only a minute or two old, and perhaps these calls slow down as the process continues to run. Or something else about the affected people's Macs slows them down.

I was under the impression that the time it took to get the resident unique memory scaled with the amount of memory Firefox used. So on a recent Firefox process it was always fast, and got worse after multiple days of use.

The profiles in comment 28 (https://share.firefox.dev/3zrjwfp) had 2 different code paths triggering nsMemoryReporterManager::ResidentUnique. One in mozilla::MemoryTelemetry::GatherReports() (I think this is fixed by the patch) and one in mozilla::MemoryTelemetry::GatherTotalMemory() (I think it's https://searchfox.org/mozilla-central/rev/380fc5571b039fd453b45bbb64ed13146fe9b066/xpcom/base/MemoryTelemetry.cpp#375 and I'm not sure if it is fixed). Paul, could you please verify if we need to do something more for the second code path?

Flags: needinfo?(pbone)

(In reply to Florian Quèze [:florian] from comment #40)

(In reply to Paul Bone [:pbone] from comment #34)

I've been adding telemetry to prevent this from regressing in the future (bug 1786864) and noticed that these calls are usually fast. However, the browser session I was testing was only a minute or two old, and perhaps these calls slow down as the process continues to run. Or something else about the affected people's Macs slows them down.

I was under the impression that the time it took to get the resident unique memory scaled with the amount of memory Firefox used. So on a recent Firefox process it was always fast, and got worse after multiple days of use.

The profiles in comment 28 (https://share.firefox.dev/3zrjwfp) had 2 different code paths triggering nsMemoryReporterManager::ResidentUnique. One in mozilla::MemoryTelemetry::GatherReports() (I think this is fixed by the patch) and one in mozilla::MemoryTelemetry::GatherTotalMemory() (I think it's https://searchfox.org/mozilla-central/rev/380fc5571b039fd453b45bbb64ed13146fe9b066/xpcom/base/MemoryTelemetry.cpp#375 and I'm not sure if it is fixed). Paul, could you please verify if we need to do something more for the second code path?

Oh, you're right. I'm going to disable this on the second code path, which also means disabling MEMORY_TOTAL (or making it inaccurate). I'll then follow up with a patch, Nightly-only at first and then riding the trains as usual, that uses the task_info API you suggested to provide MEMORY_TOTAL.

Re-opening this bug to add another patch.

Status: RESOLVED → REOPENED
Flags: needinfo?(pbone)
Resolution: FIXED → ---
Status: REOPENED → ASSIGNED
Target Milestone: 106 Branch → ---

In order to remove uses of ResidentUnique on MacOS, this patch disables
GatherTotalMemory in its entirety (rather than providing inaccurate
telemetry). I will follow up with a patch that uses task_info's
phys_footprint to re-enable it.

Pushed by pbone@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/49f976dc75d5 Disable GatherTotalMemory on MacOS r=florian

Backed out for causing Xpcshell failures on test_TelemetrySession.js. CLOSED TREE
Backout link
Push with failures
Link to failure log
Failure line :
TEST-UNEXPECTED-FAIL | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | xpcshell return code: 0
TEST-UNEXPECTED-FAIL | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | test_saveLoadPing - [test_saveLoadPing : 423] false == true

Flags: needinfo?(pbone)
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → ASSIGNED
Pushed by pbone@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/ff5568c32338 Disable GatherTotalMemory on MacOS r=florian
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 106 Branch

Please nominate the new patch for Beta approval when you're comfortable doing so.

Comment on attachment 9292394 [details]
Bug 1779138 - Disable GatherTotalMemory on MacOS r=florian

Beta/Release Uplift Approval Request

  • User impact if declined: As with the other patch, a performance regression would remain in Beta; this patch fixes it.
  • Is this code covered by automated tests?: Unknown
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The patch disables the MEMORY_TOTAL telemetry for MacOS. That telemetry is used, but a follow-up will restore it for Nightly and then let it ride the trains.
  • String changes made/needed: None
  • Is Android affected?: No
Flags: needinfo?(pbone)
Attachment #9292394 - Flags: approval-mozilla-beta?

I'd normally like to wait a couple of days before requesting Beta uplift but I'll be on PTO next week. You're welcome to use your judgement and put this in a later beta version.

Comment on attachment 9292394 [details]
Bug 1779138 - Disable GatherTotalMemory on MacOS r=florian

Approved for 105.0b7.

Attachment #9292394 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

(In reply to Florian Quèze [:florian] from comment #40)

I was under the impression that the time it took to get the resident unique memory scaled with the amount of memory Firefox used. So on a recent Firefox process it was always fast, and got worse after multiple days of use.

It's more that it scales with the number of segments that are mapped. If you have a few huge allocations (e.g., for WebAssembly) it's just as fast as a few small allocations. But, in essence, yes.
