Closed Bug 1567018 Opened 5 years ago Closed 2 years ago

[DoS/Hang] Firefox still hangs on wake up from OS X Power Nap (even with AppNap disabled via 1551990)

Categories

(Core :: SQLite and Embedded Database Bindings, defect, P1)

Unspecified
macOS
defect

Tracking

()

RESOLVED WORKSFORME
Performance Impact low
Tracking Status
firefox70 --- affected

People

(Reporter: haik, Unassigned)

References

Details

(Keywords: hang, perf:responsiveness)

Attachments

(1 file)

+++ This bug was initially created as a clone of Bug #1551990 +++

Bug 1551990 was speculatively fixed by disabling AppNap. However, some users still report Firefox hanging on macOS after the computer is awakened from sleep.

See bug 1551990 comment 32 and later for more details.

Blocks: 1551990
No longer depends on: 1551990
See Also: → 1567427

I'm still investigating what's happening here. Assuming the problem is still that the parent process is throttled via App Nap while child processes continue to run and trigger network requests, I'm going to experiment with using an "activity"[1] to indicate to the OS that the parent process shouldn't be throttled.

  1. https://developer.apple.com/library/archive/documentation/Performance/Conceptual/power_efficiency_guidelines_osx/PrioritizeWorkAtTheAppLevel.html#//apple_ref/doc/uid/TP40013929-CH36-SW1
Assignee: nobody → haftandilian

(In reply to Markus Stange [:mstange] from comment #32)

I'm still encountering this bug every now and then, on Nightly. I'm on 10.12. Here's a profile with lots of network requests from the content process which were waiting for the parent process for 11.1 hours: https://perfht.ml/2NTLZWK

Reposting Markus' profile from bug 1551990 that was recorded after 1551990 was fixed.

I think this problem may be due to network request handling during the suspended/sleep state.

There was an assumption that, during Power Nap, network requests were being throttled because the parent process only was being put into App Nap. With the fix for bug 1551990, the parent process should not be App Nap'd by the OS. From Activity Monitor, we can see that the parent process is not going into the App Nap state.

@mayhemer, from the networking side, does the profile on comment 3 indicate networking issues? I don't think we do any special handling of network connections when we enter the sleep state. I have seen some comments indicating that sleep mode can cause sockets to hang and I was wondering if we need to do any special handling in Firefox. Chromium has ERR_NETWORK_IO_SUSPENDED and seems to close connections on entering sleep and when connections are started while in sleep mode.

Flags: needinfo?(honzab.moz)

Can't elaborate here as I don't have an OSX machine to check this on. I'll try to find someone from the Necko team to look at this.

Flags: needinfo?(honzab.moz)

Kershaw promised to look at this.

(In reply to Haik Aftandilian [:haik] from comment #4)

I think this problem may be due to network request handling during the suspended/sleep state.

There was an assumption that, during Power Nap, network requests were being throttled because the parent process only was being put into App Nap. With the fix for bug 1551990, the parent process should not be App Nap'd by the OS. From Activity Monitor, we can see that the parent process is not going into the App Nap state.

@mayhemer, from the networking side, does the profile on comment 3 indicate networking issues? I don't think we do any special handling of network connections when we enter the sleep state. I have seen some comments indicating that sleep mode can cause sockets to hang and I was wondering if we need to do any special handling in Firefox. Chromium has ERR_NETWORK_IO_SUSPENDED and seems to close connections on entering sleep and when connections are started while in sleep mode.

It seems that the profile shows there are networking issues. I've seen some http requests were failed, but I can't tell the reason from the profile.
Is there a chance to get the http log when this happens again?

Flags: needinfo?(haftandilian)

I don't think this will be useful because the high CPU usage problem didn't occur after waking the computer up. I haven't been able to reproduce the hang. Here's a log file collected while sleeping, but the timestamps don't seem to make sense. I'll keep trying.

Put computer to sleep around midnight on August 14th.
Accidentally woke up computer around 1:30 AM August 15th.
Woke computer up 7:15 AM August.
High CPU usage problem did not occur.
Log file is 277M after decompression:
https://www.dropbox.com/sh/8oaxcq9uqsuctj3/AADRBmNxPJSfyAIHabAkmFR_a?dl=0

@brymaster or @mstange, if you're able to reproduce this, could you try collecting a networking log (see comment 7)? We want to start the log before putting the computer to sleep for a long time.

Just as a reminder, to reproduce this problem we need "Enable Power Nap" to be turned on in system preferences and the computer should be put to sleep while plugged into power. Power Nap wakeups are more frequent when the system is plugged in and this might be Power Nap related.

Flags: needinfo?(mstange)
Flags: needinfo?(haftandilian)
Flags: needinfo?(brymaster)

(In reply to Haik Aftandilian [:haik] from comment #8)

I don't think this will be useful because the high CPU usage problem didn't occur after waking the computer up. I haven't been able to reproduce the hang. Here's a log file collected while sleeping, but the timestamps don't seem to make sense. I'll keep trying.

Put computer to sleep around midnight on August 14th.
Accidentally woke up computer around 1:30 AM August 15th.
Woke computer up 7:15 AM August.
High CPU usage problem did not occur.
Log file is 277M after decompression:
https://www.dropbox.com/sh/8oaxcq9uqsuctj3/AADRBmNxPJSfyAIHabAkmFR_a?dl=0

It's really hard to tell what happened from the log. I can only tell there were lots of http requests, most of them were successful and some were failed.
It seems that networking works fine while the computer is asleep.

Can I get the log that is captured after waking up the computer? I'd like to see the reason why network requests are failure.
Thanks.

Flags: needinfo?(haftandilian)

This is still happening. Firefox 69.0, macOS 10.14.6. Enabled plugins are h264ify, Tab Center Redux, uBlock Origin, View Image.

Steps:

  • Went to about:networking > Logging > Start Logging.
  • I closed the lid of my MacBook Pro on Sunday 11pm while plugged in and at 100%.
  • Opened the lid on Monday 8am and noticed Activity Monitor showing the main Firefox process using 100%+ CPU.
  • I then stopped the network logging from overnight, saved the log and then started a new network log -- while still at 100%+ CPU.

Overnight network log (14.7 MB uncompressed).

Morning network log (47.4 MB uncompressed).

Yet another profile capture of questionable value

Flags: needinfo?(brymaster)

(In reply to brymaster from comment #10)

This is still happening. Firefox 69.0, macOS 10.14.6. Enabled plugins are h264ify, Tab Center Redux, uBlock Origin, View Image.

Steps:

  • Went to about:networking > Logging > Start Logging.
  • I closed the lid of my MacBook Pro on Sunday 11pm while plugged in and at 100%.
  • Opened the lid on Monday 8am and noticed Activity Monitor showing the main Firefox process using 100%+ CPU.
  • I then stopped the network logging from overnight, saved the log and then started a new network log -- while still at 100%+ CPU.

Overnight network log (14.7 MB uncompressed).

We are missing child process logs and profile.

I don't see anything significantly perf-related in the net log, more or less quiet over the night.

Morning network log (47.4 MB uncompressed).

As well here

Yet another profile capture of questionable value

I don't see much here either. Nothing from child processes.

Sampling of the process eating the CPU via the Activity Monitor as a complementary info would have been good.

(In reply to Honza Bambas (:mayhemer) from comment #11)

Sampling of the process eating the CPU via the Activity Monitor as a complementary info would have been good.

I've provided plenty of those in the previous Bug #1551990 -- if I cannot capture useful diagnostics with your own tools then I really don't know what else to tell you.

(In reply to brymaster from comment #12)

(In reply to Honza Bambas (:mayhemer) from comment #11)

Sampling of the process eating the CPU via the Activity Monitor as a complementary info would have been good.

I've provided plenty of those in the previous Bug #1551990 -- if I cannot capture useful diagnostics with your own tools then I really don't know what else to tell you.

Thanks a lot for providing this data!

Question is if this is exactly the same problem or if we shifted it to something else with fixes in that bug - i.e. we have another thing to fix.

The net logs say it's not a networking problem. There is no flood of networking requests as before, so this is something else.

It's weird the profiler didn't capture anything from child processes. Maybe tweak the settings, but Markus would provide a better advise on that. Note that not all threads are monitored by the profiler by default.

If there is one process and presumably one thread in it eating CPU, the sample would really be usefull, narrowing down where to look or what this is.

Thanks.

Happening again today. Did not immediately start using high CPU when awakened from sleep. Now the main Firefox process is stuck at 100% CPU and I cannot add/remove/edit bookmarks. Closing Firefox will result in a crash after a long period of time.

Activity Monitor screenshot: https://i.imgur.com/nLa1mbS.png

Process sample #1: https://pastebin.com/Q7ccB8Vx

Process sample #2: https://pastebin.com/P2VTThhB

Process sample #3: https://pastebin.com/JDydTCAA

A Time Profiler trace from Xcode's Instruments https://pastebin.com/QH0gyHKg

Network log: https://privatebin.net/?ff24fd08f0f1988f#8ZTs1f47puHDiPGvVrTM8AN2umhiZdg4gKgSJLCH9QsP

More profile captures: https://perfht.ml/2O5CeTW, https://perfht.ml/2O5Abzb, https://perfht.ml/2O5LpE0, https://perfht.ml/2O73III, https://perfht.ml/2O6to8l

Thanks to brymaster for help with debugging this. Removing my needinfo. I haven't posted new profiles or traces because I haven't been able to reproduce this issue.

Flags: needinfo?(haftandilian)

Thank for all the information, brymaster.
However, I can't find any networking issue from the logs and profiles.

Vicky, could you find someone to take a look at those profiles? Thanks,

Flags: needinfo?(vchin)

(In reply to brymaster from comment #16)

Didn't have to close the lid this time. Firefox just got into this state somehow.

Samples:
https://pastebin.com/47gEZhND

https://pastebin.com/KsBEi1Ua

in both samples, only thing that is visible is some IO activity on mozStorage #2 thread, but nothing really significant, I don't thing that would be the cause of a 100% load of the process 15455.

what exact symptoms did you see and how exactly did you choose this (and only this) process for sampling? was the Firefox UI unresponsive? was it the same as in comment 10?

can you sample also content processes? it would be good to choose a process that actually causes a CPU load for sampling, the parent process you have chosen really was doing nothing.

(In reply to Honza Bambas (:mayhemer) from comment #18)

in both samples, only thing that is visible is some IO activity on mozStorage #2 thread, but nothing really significant, I don't thing that would be the cause of a 100% load of the process 15455.

what exact symptoms did you see and how exactly did you choose this (and only this) process for sampling? was the Firefox UI unresponsive? was it the same as in comment 10?

can you sample also content processes? it would be good to choose a process that actually causes a CPU load for sampling, the parent process you have chosen really was doing nothing.

In EVERY instance, I chose the main Firefox process because its using 100%+ CPU. Based on Activity Monitor, the Content Processes were doing nothing. Symptoms are main process high CPU, excess heat on the bottom of the machine and high RPM with the fans. The Firefox UI is responsive EXCEPT for the bookmarks UI. Attempting to add a new bookmark while Firefox is in this state will fail (i.e. the bookmark will not add to the list and no star appears for it in the address bar). Closing Firefox in this high CPU state results in a crash that looks like: https://pastebin.com/E6yfTxvC

I'm finding it hard to believe and frustrating that in all three tickets now with all samples/logs I've provided while the main process is using 100%+ CPU, no useful information can be gathered. If Xcode time profiles, process samples, gecko profile captures and network logs are not enough information then SOMEONE needs to explain what tools I should be using to help debug this.

Yes, next time I will also sample the Content Processes.

Is the crash visible in about:crashes in Firefox? if so, can you provide the id/link to it?

See Also: → 1580117

Bug 1580117 is also a parent process high CPU utilization macOS bug. I can reproduce 1580117 quite easily, but the steps don't involve macOS sleep(). I think we should treat these as separate bugs, but wanted to note the similarity.

Flags: needinfo?(vchin)
Whiteboard: [qf]
Whiteboard: [qf] → [qf:p1:responsiveness]

Firefox 72.0.2
macOS 10.14.6

Plugged my machine in to charge last night, minimized my two Firefox windows (1 normal, 1 "private browsing"), closed the lid, when I woke up this morning and opened the lid Firefox was once again stuck using over 100% CPU. This happens so often that I don't bother reporting it. When in this state and I close Firefox, it also results in the same crash I've reported previously. I recently created a new Firefox profile to help rule out any issues there; problem persisting.

Process sample:
https://hastebin.com/raw/ohucorakoh

Screenshot:
https://i.imgur.com/LfFFgwr.png

Assuming I'm experiencing the same bug, here is a workaround that work for me:

In the OSX Users & Groups settings select "Login Options" and then tick the check-box "Show fast user switching...". This adds an item to the OSX menu bar to switch to the login window.

Now, whenever my Firefox gets into this dreaded locked up state, I choose "Login Window..." from that menu and log back in right again. Afterwards Firefox is back live and kicking ;-)

:gsvelto pointed out this might be related to the idle timer, specifically the idle-daily timer Waking up after a long sleep might trigger many idle timer events to run at the same time and cause high parent process CPU overhead. Needs more investigation.

See Also: → 1591214

I believe my issue was being stuck in a SQLITE_BUSY loop. Reading through issues https://bugzilla.mozilla.org/show_bug.cgi?id=1326309 and https://bugzilla.mozilla.org/show_bug.cgi?id=1583927 led me to this conclusion after I created a new Firefox profile, copied some profile data (bookmarks via importing a JSON backup, favicons.sqlite, key4.db, logins.json), and yet still experienced the problem. During high CPU, I ran the "Verify Integrity" check for Places Database in about:support and noticed:

Unable to check favicons.sqlite integrity: Error: Could not open connection ...

So I closed the browser and deleted favicons.sqlite and now I have NOT had this problem since. Hopefully this helps others including Mozilla devs. There should be some logging in about:performance, about:support and the Firefox Profiler addon so we can tell when we're actually stuck in a SQLITE_BUSY loop and possibly more info as to why.

(In reply to brymaster from comment #26)

During high CPU, I ran the "Verify Integrity" check for Places Database in about:support and noticed:

Unable to check favicons.sqlite integrity: Error: Could not open connection ...

So I closed the browser and deleted favicons.sqlite and now I have NOT had this problem since. Hopefully this helps others including Mozilla devs. There should be some logging in about:performance, about:support and the Firefox Profiler addon so we can tell when we're actually stuck in a SQLITE_BUSY loop and possibly more info as to why.

Thanks for the update! In the two macOS Activity Monitor samples you provided in comment 16, both had a MozStorage thread with some sqlite activity (as pointed out by Honza on comment 18). Those don't appear in the Firefox profiler profiles because MozStorage threads are not included by default. Adding "MozStorage" to the list of threads sampled in the profiler settings should be enough to include them. At the time, we didn't think they were the problem.

Both the samples show the busy MozStorage thread in PR_Sleep() about 20% of the time. And on bug 1583927 comment 9, with the profile provided (https://perfht.ml/2OpTbYX which includes MozStorage threads) the user reported having 100% busy CPU and in that case the busy MozStorage thread is in PR_Sleep() ~40% of the time. Therefore it seems probable that the MozStorage threads from your Activity Monitor samples are causing the high CPU load.

So I think what you have found is consistent with the Activity Monitor samples you provided. If you could reproduce this issue again and collect a profile that includes MozStorage threads (by adding MozStorage to the list of threads in the profiler settings), we could be more confident. Bug 1583927 might be a dupe of this bug.

@Doug, does any of the information here shed more light about what might be going on with the storage thread and sqlite? The shutdown crash report https://crash-stats.mozilla.org/report/index/996a7f7c-df42-469d-8684-5a9260190925 (from comment 21) shows an sqlite stack. Could it be that the thread is doing idle-timer tasks built up while the machine was suspended? On bug 1583927, I don't see any mention of suspend/resume unlike what's reported here.

Flags: needinfo?(dothayer)
See Also: → 1583927
See Also: → 1422855
Assignee: haftandilian → nobody
Component: Widget: Cocoa → Storage
Product: Core → Toolkit
See Also: → 1643491
See Also: → 1643793
Depends on: 1672791

I've filed a more specific bug as bug 1672791 with some analysis.

Flags: needinfo?(mstange.moz)

Has anyone seen this recently? I have not, but I've been using a different machine.

Downgrading priority until we get more reports.

Whiteboard: [qf:p1:responsiveness] → [qf:p3:responsiveness]

I haven't seen this in a long time although it used to happen to me regularly before.

Performance Impact: --- → P3
Whiteboard: [qf:p3:responsiveness]

In the process of migrating remaining bugs to the new severity system, the severity for this bug cannot be automatically determined. Please retriage this bug using the new severity system.

Severity: critical → --

Tentatively closing this bug since there haven't been any more reports recently. Please reopen if you are still able to reproduce.

Status: NEW → RESOLVED
Closed: 2 years ago
Flags: needinfo?(dothayer)
Resolution: --- → WORKSFORME
Product: Toolkit → Core
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: