Closed Bug 1481304 Opened 6 years ago Closed 6 years ago

All Mac content processes "Not Responding" in Activity Monitor (but seem to work fine)

Categories

(Core :: DOM: Content Processes, defect)

Unspecified
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla63
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- unaffected
firefox61 --- unaffected
firefox62 --- unaffected
firefox63 + fixed

People

(Reporter: dveditz, Assigned: Alex_Gaynor)

References

Details

(Keywords: regression)

Attachments

(1 file)

On my system all content processes are listed as "Not Responding" in Activity Monitor on Mac, although they appear to work fine (I'm entering this bug in one). I first noticed this morning in a build from Friday or Saturday, and it persisted after upgrading and restarting twice (first the pending update and then to today's build).

On IRC Mossop confirmed he saw it too.
[Tracking Requested - why for this release]:

Alex, I bisected this regression to the following pushlog, which points to your fix for native events bug 1426100:

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=47aef65bc40dae041cc2c8635fc2594d6ef8aaa4&tochange=edfca01b39ab8bde74f1de73e04fa52acc6a9382
Blocks: 1426100
Component: General → DOM: Content Processes
Flags: needinfo?(agaynor)
Keywords: regression
Product: Firefox → Core
Target Milestone: --- → mozilla63
This seems to have implications for shutting down content processes, too. Every time I start Firefox, I get a message about unsent crash reports that matches the number of content processes that existed in my previous session.
They'll be reopened, so there's no security benefit, but this causes Activity
Monitor to not report the processes as 'not responding'.
Assignee: nobody → agaynor
Flags: needinfo?(agaynor)
Comment on attachment 8998188 [details]
Bug 1481304 - shutdown connections to the windowserver in content processes; r?haik

Haik Aftandilian [:haik] has approved the revision.

https://phabricator.services.mozilla.com/D2855
Attachment #8998188 - Flags: review+
Keywords: checkin-needed
Pushed by archaeopteryx@coole-files.de:
https://hg.mozilla.org/integration/autoland/rev/8171af14faf2
shutdown connections to the windowserver in content processes; r=haik
Keywords: checkin-needed
Backed out changeset 8171af14faf2 (Bug 1481304) for failure at devtools/client/debugger/new/test/mochitest/browser_dbg-toggling-tools.js

Backout:
https://hg.mozilla.org/integration/autoland/rev/aa72fd1d53db7e6ef515cf847a55471d58b6b5f8

Push link:
https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=8171af14faf2015a0a7b7d1c8346a0b5d65d6829&filter-searchStr=dt4&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-classifiedState=unclassified

Failure log:
https://treeherder.mozilla.org/logviewer.html#?job_id=192858889&repo=autoland&lineNumber=8497

13:43:28     INFO - TEST-START | devtools/client/debugger/new/test/mochitest/browser_dbg_rr_breakpoints-01.js
13:43:28     INFO - GECKO(1881) | MIDDLEMAN 1902 /var/folders/sj/5tdfp2md3435zt681kh36bjh00000x/T/Recording557062
13:43:28     INFO - GECKO(1881) | RECORDING 1903 /var/folders/sj/5tdfp2md3435zt681kh36bjh00000x/T/Recording557062
13:43:28     INFO - GECKO(1881) | [Child 1902, Main Thread] WARNING: FileDescriptorSet destroyed with unconsumed descriptors: file /builds/worker/workspace/build/src/ipc/chromium/src/chrome/common/file_descriptor_set_posix.cc, line 22
13:43:28     INFO - GECKO(1881) | [Child 1902, Main Thread] WARNING: FileDescriptorSet destroyed with unconsumed descriptors: file /builds/worker/workspace/build/src/ipc/chromium/src/chrome/common/file_descriptor_set_posix.cc, line 22
13:43:28     INFO - GECKO(1881) | [Child 1902, Main Thread] WARNING: FileDescriptorSet destroyed with unconsumed descriptors: file /builds/worker/workspace/build/src/ipc/chromium/src/chrome/common/file_descriptor_set_posix.cc, line 22
13:43:29     INFO - GECKO(1881) | [Child 1902, Main Thread] WARNING: FileDescriptorSet destroyed with unconsumed descriptors: file /builds/worker/workspace/build/src/ipc/chromium/src/chrome/common/file_descriptor_set_posix.cc, line 22
13:43:29     INFO - GECKO(1881) | [Child 1902, Main Thread] WARNING: FileDescriptorSet destroyed with unconsumed descriptors: file /builds/worker/workspace/build/src/ipc/chromium/src/chrome/common/file_descriptor_set_posix.cc, line 22
13:43:29     INFO - GECKO(1881) | [Child 1902, Unnamed thread 116d33410] WARNING: FileDescriptorSet destroyed with unconsumed descriptors: file /builds/worker/workspace/build/src/ipc/chromium/src/chrome/common/file_descriptor_set_posix.cc, line 22
13:43:29     INFO - GECKO(1881) | JavaScript error: resource://gre/modules/WebProgressChild.jsm, line 106: NS_ERROR_UNEXPECTED:
13:43:29     INFO - GECKO(1881) | JavaScript error: resource://gre/modules/WebProgressChild.jsm, line 106: NS_ERROR_UNEXPECTED:
13:43:29     INFO - GECKO(1881) | JavaScript error: resource://gre/modules/WebProgressChild.jsm, line 106: NS_ERROR_UNEXPECTED:
13:43:29     INFO - GECKO(1881) | JavaScript error: resource://gre/modules/WebProgressChild.jsm, line 106: NS_ERROR_UNEXPECTED:
13:43:29     INFO - GECKO(1881) | [Child 1902, Unnamed thread 116d33410] WARNING: FileDescriptorSet destroyed with unconsumed descriptors: file /builds/worker/workspace/build/src/ipc/chromium/src/chrome/common/file_descriptor_set_posix.cc, line 22
13:43:29     INFO - GECKO(1881) | [Child 1902, Unnamed thread 116d33410] WARNING: FileDescriptorSet destroyed with unconsumed descriptors: file /builds/worker/workspace/build/src/ipc/chromium/src/chrome/common/file_descriptor_set_posix.cc, line 22
13:43:29     INFO - GECKO(1881) | REPLAYING 1904 /var/folders/sj/5tdfp2md3435zt681kh36bjh00000x/T/Recording557062
13:43:29     INFO - GECKO(1881) | REPLAYING 1905 /var/folders/sj/5tdfp2md3435zt681kh36bjh00000x/T/Recording557062
13:43:35     INFO - GECKO(1881) | REPLAYING 1906 /var/folders/sj/5tdfp2md3435zt681kh36bjh00000x/T/Recording557062
13:43:40     INFO - GECKO(1881) | REPLAYING 1907 /var/folders/sj/5tdfp2md3435zt681kh36bjh00000x/T/Recording557062
13:43:45     INFO - GECKO(1881) | REPLAYING 1908 /var/folders/sj/5tdfp2md3435zt681kh36bjh00000x/T/Recording557062
13:43:50     INFO - GECKO(1881) | REPLAYING 1909 /var/folders/sj/5tdfp2md3435zt681kh36bjh00000x/T/Recording557062
13:43:55     INFO - GECKO(1881) | REPLAYING 1910 /var/folders/sj/5tdfp2md3435zt681kh36bjh00000x/T/Recording557062
14:00:35     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/Users/cltbld/tasks/task_1533760302/build/venv/bin/python', '-u', '/Users/cltbld/tasks/task_1533760302/build/tests/mochitest/runtests.py', '--total-chunks', '8', '--this-chunk', '4', '--appname=/Users/cltbld/tasks/task_1533760302/build/application/Firefox Nightly.app/Contents/MacOS/firefox', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=https://queue.taskcluster.net/v1/task/ErPDaitWQqOOL7OndsZU3Q/artifacts/public/build/target.crashreporter-symbols.zip', '--certificate-path=tests/certs', '--quiet', '--log-raw=/Users/cltbld/tasks/task_1533760302/build/blobber_upload_dir/mochitest-devtools-chrome-chunked_raw.log', '--log-errorsummary=/Users/cltbld/tasks/task_1533760302/build/blobber_upload_dir/mochitest-devtools-chrome-chunked_errorsummary.log', '--screenshot-on-fail', '--cleanup-crashes', '--marionette-startup-timeout=180', '--sandbox-read-whitelist=/Users/cltbld/tasks/task_1533760302/build', '--log-raw=-', '--flavor=browser', '--subsuite=devtools', '--chunk-by-runtime']
14:00:35    ERROR - timed out after 1000 seconds of no output
14:00:35    ERROR - Return code: -15
14:00:35    ERROR - No suite end message was emitted by this harness.
14:00:35     INFO - TinderboxPrint: mochitest-mochitest-devtools-chrome-chunked<br/>81/0/0
14:00:35    ERROR - # TBPL FAILURE #
14:00:35  WARNING - setting return code to 2
14:00:35    ERROR - The mochitest suite: mochitest-devtools-chrome-chunked ran with return status: FAILURE
14:00:35     INFO - Running post-action listener: _package_coverage_data
14:00:35     INFO - Running post-action listener: _resource_record_post_action
14:00:35     INFO - Running post-action listener: process_java_coverage_data
14:00:35     INFO - [mozharness: 2018-08-08 21:00:35.634330Z] Finished run-tests step (success)
14:00:35     INFO - Running post-run listener: _resource_record_post_run
Flags: needinfo?(agaynor)
Taking a look.

FYI you pasted the failure log for browser_dbg_rr_breakpoints-01.js, when you meant to post browser_dbg-toggling-tools.js. (This threw me for a loop for a few minutes)
Flags: needinfo?(agaynor)
Brian, browser_dbg_rr_breakpoints-01.js seems to be consistently failing under the macOS opt builds, but not debug, and it doesn't reproduce locally either. I can't imagine why this change would break this test -- if this change broke anything, I'd expect it to be "spawning all content processes, all the time". Is there anything that I ought to know in trying to track this down?

The fact that it fails under opt, but not debug, makes me wonder if there isn't a race condition in the test that this patch triggers with some tiny timing difference.
Flags: needinfo?(bhackett1024)
(In reply to Alex Gaynor [:Alex_Gaynor] from comment #8)
> Brian, browser_dbg_rr_breakpoints-01.js seems to be consistently failing
> under the macOS opt builds, but not debug, and it doesn't reproduce locally
> either. I can't imagine why this change would break this test -- if this
> change broke anything, I'd expect it to be "spawning all content processes,
> all the time". Is there anything that I ought to know in trying to track
> this down?
> 
> The fact that it fails under opt, but not debug, makes me wonder if there
> isn't a race condition in the test that this patch triggers with some tiny
> timing difference.

This test only runs on opt builds (there are some issues on debug builds at the moment), which I think explains why you're not seeing failures on debug builds.

I tried running the test locally on an opt build with the patch applied, and do see a failure.  The problem is that the patch is adding a call to CGSShutdownServerConnections() that runs when recording or replaying a tab.  The record/replay system redirects calls to system APIs like this so that we can avoid interacting with the system and get consistent behavior while replaying.  We don't have a redirection for CGSShutdownServerConnections; I'll add one soon, but until then it might be best to workaround this problem by wrapping the call to CGSShutdownServerConnections in an 'if (!recordreplay::IsRecordingOrReplaying())' test.
Flags: needinfo?(bhackett1024)
Ahhhh! Thank you, will take a shot at that.

For the future, is there a good way for me to have discovered that myself? This is pretty subtly and it seems like you recognized it right away :-)
(In reply to Alex Gaynor [:Alex_Gaynor] from comment #10)
> Ahhhh! Thank you, will take a shot at that.
> 
> For the future, is there a good way for me to have discovered that myself?
> This is pretty subtly and it seems like you recognized it right away :-)

After bug 1481009 lands, failures like this should produce stack dumps in treeherder logs, which will show that we crashed inside the call to CGSShutdownServerConnections.  I'll also experiment with printing some explanatory text whenever we crash while replaying and see if that shows up in the logs as well.
Sounds great, that would have been a huge help. Thanks!
(In reply to Brian Hackett (:bhackett) from comment #9)
> I'll add one soon, but until then it might be best to workaround this
> problem by wrapping the call to CGSShutdownServerConnections in an 'if
> (!recordreplay::IsRecordingOrReplaying())' test.

Brian, is there an exisitng explanation of when/how we need to use recordreplay::IsRecordingOrReplaying()?
Flags: needinfo?(bhackett1024)
Depends on: 1482668
(In reply to Haik Aftandilian [:haik] from comment #13)
> (In reply to Brian Hackett (:bhackett) from comment #9)
> > I'll add one soon, but until then it might be best to workaround this
> > problem by wrapping the call to CGSShutdownServerConnections in an 'if
> > (!recordreplay::IsRecordingOrReplaying())' test.
> 
> Brian, is there an exisitng explanation of when/how we need to use
> recordreplay::IsRecordingOrReplaying()?

Hi, I just added an appendix 'Impacts on Gecko Development' to the web replay MDN page (https://developer.mozilla.org/en-US/docs/Mozilla/Projects/WebReplay) which describes how to address the ways in which Web Replay can be broken by other Gecko changes.

I also just filed bug 1482668 to add a redirection for CGSShutdownServerConnections.  When that is fixed there won't be a need for a workaround here, and the patch should be fine as is.
Flags: needinfo?(bhackett1024)
Keywords: checkin-needed
Pushed by btara@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/114f05d1f4a4
shutdown connections to the windowserver in content processes; r=haik
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/114f05d1f4a4
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
I'm still seeing this on the latest Nightly (64.0a1 (2018-10-13) (64-bit)). After I start the browser, everything seems fine. However, after a few minutes one after another the content processes become "Not responding" according to Activity Monitor. The browser seems to work fine though.

Activity monitor: https://i.imgur.com/tXEIdaF.png
A sample of one the "Not responding" processes: https://gist.github.com/jiripospisil/c8312c73f4173064ada2c48d4ffdf68a
about:support: https://gist.github.com/jiripospisil/324015fab7333abd2a81b82742a990a9
(In reply to jiri.pospisil from comment #17)
> I'm still seeing this on the latest Nightly (64.0a1 (2018-10-13) (64-bit)).
> After I start the browser, everything seems fine. However, after a few
> minutes one after another the content processes become "Not responding"
> according to Activity Monitor. The browser seems to work fine though.
> 
> Activity monitor: https://i.imgur.com/tXEIdaF.png
> A sample of one the "Not responding" processes:
> https://gist.github.com/jiripospisil/c8312c73f4173064ada2c48d4ffdf68a
> about:support:
> https://gist.github.com/jiripospisil/324015fab7333abd2a81b82742a990a9

Thanks for the report. Would you mind filing a new bug for this? This appears to be a regression introduced by my fix for bug 1431441 with the same symptoms as this bug.
(In reply to Haik Aftandilian [:haik] from comment #18)
> (In reply to jiri.pospisil from comment #17)
> > I'm still seeing this on the latest Nightly (64.0a1 (2018-10-13) (64-bit)).
> > After I start the browser, everything seems fine. However, after a few
> > minutes one after another the content processes become "Not responding"
> > according to Activity Monitor. The browser seems to work fine though.
> > 
> > Activity monitor: https://i.imgur.com/tXEIdaF.png
> > A sample of one the "Not responding" processes:
> > https://gist.github.com/jiripospisil/c8312c73f4173064ada2c48d4ffdf68a
> > about:support:
> > https://gist.github.com/jiripospisil/324015fab7333abd2a81b82742a990a9
> 
> Thanks for the report. Would you mind filing a new bug for this? This
> appears to be a regression introduced by my fix for bug 1431441 with the
> same symptoms as this bug.

Sure, here you go https://bugzilla.mozilla.org/show_bug.cgi?id=1498845
Thank you for the report. This should be resolved by the backout of bug 1431441:

https://hg.mozilla.org/integration/autoland/rev/556f31a5e2230d5d06c23b5316d2231a74713251

And then I'll close this bug. The fix should be available in Nightly within the next 24 hours.
See Also: → 1501121
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: