Closed Bug 1647096 Opened 4 years ago Closed 4 years ago

GeckoProfiler.Markers gtest crashes after changing tests to use socket process

Categories

(Core :: Networking, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla80
Tracking Status
firefox-esr68 --- unaffected
firefox-esr78 --- wontfix
firefox77 --- unaffected
firefox78 --- wontfix
firefox79 --- fixed
firefox80 --- fixed

People

(Reporter: brennan.brisad, Assigned: mozbugz)

References

(Regression)

Details

(Keywords: regression, Whiteboard: [necko-triaged])

Attachments

(1 obsolete file)

The GeckoProfiler.Markers gtest have started crashing on my Linux machine. I git-bisected the issue and found that https://hg.mozilla.org/mozilla-central/rev/e251c2af1161147109a3874f33e356b51bdb45d1 (of Bug 1602832) is the commit which broke it.

If I run the test,

$ ./mach gtest GeckoProfiler.Markers
<snip>
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from GeckoProfiler
[ RUN      ] GeckoProfiler.Markers

it stays there with no continued progress.

This is what it looks if I do a process listing:

$ ps -ef f
UID        PID  PPID  C STIME TTY      STAT   TIME CMD
<snip>
michael  18578  6027  2 20:47 pts/1    S+     0:00  |   |   \_ python2.7 ./mach gtest GeckoProfiler.Markers
michael  19244 18578 13 20:47 pts/1    Sl+    0:02  |   |       \_ /home/michael/gecko/objdir-frontend/dist/bin/firefox -uni
michael  19246 19244  0 20:47 pts/1    Z+     0:00  |   |           \_ [firefox] <defunct>
<snip>

I don't know if it is related to this issue, but I thought it might be worth mentioning that a process listing on my computer also shows that my Nightly browser which I have been running for many days have gathered quite a few defunct processes:

michael  17643 17640  0 May31 ?        Sl   201:12  \_ /home/michael/.local/nightly/firefox-bin
michael  17668 17643  0 May31 ?        Z      1:40      \_ [GPU Process] <defunct>
michael  17708 17643  0 May31 ?        Z      0:00      \_ [Socket Process] <defunct>
michael  17725 17643  0 May31 ?        Z      0:28      \_ [Privileged Cont] <defunct>
michael  17778 17643  0 May31 ?        Z      4:02      \_ [Web Content] <defunct>
michael  17783 17643  0 May31 ?        Z      2:36      \_ [Web Content] <defunct>
michael  17787 17643  0 May31 ?        Z      2:37      \_ [Web Content] <defunct>
michael  17856 17643  0 May31 ?        Z      1:40      \_ [WebExtensions] <defunct>
michael  18083 17643  0 May31 ?        Z      0:00      \_ [RDD Process] <defunct>
michael  24474 17643  0 May31 ?        Z      0:22      \_ [Web Content] <defunct>
michael   9214 17643  0 Jun01 ?        Z      0:01      \_ [Web Content] <defunct>
michael   9249 17643  0 Jun01 ?        Z      0:00      \_ [Web Content] <defunct>
michael   9291 17643  0 Jun01 ?        Z      0:00      \_ [pingsender] <defunct>
michael   9295 17643  0 Jun01 ?        Z      0:00      \_ [firefox-bin] <defunct>
michael   9320 17643  0 Jun01 ?        Z      8:07      \_ [GPU Process] <defunct>
michael   9359 17643  0 Jun01 ?        Z      0:00      \_ [Socket Process] <defunct>
michael   9431 17643  0 Jun01 ?        Z      5:12      \_ [Web Content] <defunct>
michael   9469 17643  0 Jun01 ?        Z      8:10      \_ [Web Content] <defunct>
michael   9472 17643  0 Jun01 ?        Z      4:43      \_ [Web Content] <defunct>
michael   9483 17643  0 Jun01 ?        Z      6:17      \_ [Web Content] <defunct>
michael   9534 17643  0 Jun01 ?        Z      1:22      \_ [Privileged Cont] <defunct>
michael   9602 17643  0 Jun01 ?        Z      4:17      \_ [WebExtensions] <defunct>
michael   9737 17643  0 Jun01 ?        Z      0:00      \_ [RDD Process] <defunct>
michael   9971 17643  0 Jun01 ?        Z      1:27      \_ [Web Content] <defunct>
michael  12119 17643  0 Jun01 ?        Z      1:54      \_ [Web Content] <defunct>
michael  15988 17643  0 Jun01 ?        Z      1:21      \_ [Web Content] <defunct>
michael  25947 17643  0 Jun03 ?        Z      0:10      \_ [Web Content] <defunct>
michael  27935 17643  0 Jun04 ?        Z      0:00      \_ [pingsender] <defunct>
michael  27939 17643  0 Jun04 ?        Z      0:00      \_ [firefox-bin] <defunct>
michael  27964 17643  0 Jun04 ?        Z      0:17      \_ [GPU Process] <defunct>
michael  28001 17643  0 Jun04 ?        Z      0:00      \_ [Socket Process] <defunct>
michael  28067 17643  0 Jun04 ?        Z      0:06      \_ [Web Content] <defunct>
michael  28105 17643  0 Jun04 ?        Z      0:51      \_ [Web Content] <defunct>
michael  28108 17643  0 Jun04 ?        Z      0:36      \_ [Web Content] <defunct>
michael  28120 17643  0 Jun04 ?        Z      0:54      \_ [Web Content] <defunct>
michael  28158 17643  0 Jun04 ?        Z      0:03      \_ [Privileged Cont] <defunct>
michael  28239 17643  0 Jun04 ?        Z      0:17      \_ [WebExtensions] <defunct>
michael  28285 17643  0 Jun04 ?        Z      0:00      \_ [Web Content] <defunct>
michael  28406 17643  0 Jun04 ?        Z      0:00      \_ [RDD Process] <defunct>
michael  29108 17643  0 Jun04 ?        Z      0:00      \_ [pingsender] <defunct>
michael  29113 17643  0 Jun04 ?        Z      0:00      \_ [firefox-bin] <defunct>
michael  29138 17643  0 Jun04 ?        Sl    82:33      \_ /home/michael/.local/nightly/firefox-bin -contentproc -parentBuildID 20200603214922 -prefsLen 1 -prefMapSize 226332 -appdir /home/michael/.local/nightly
michael  29177 17643  0 Jun04 ?        Sl     0:03      \_ /home/michael/.local/nightly/firefox-bin -contentproc -parentBuildID 20200603214922 -prefsLen 90 -prefMapSize 226332 -appdir /home/michael/.local/nightl
michael  29271 17643  0 Jun04 ?        Sl   136:15      \_ /home/michael/.local/nightly/firefox-bin -contentproc -childID 2 -isForBrowser -prefsLen 695 -prefMapSize 226332 -parentBuildID 20200603214922 -appdir /
michael  29273 17643  0 Jun04 ?        Sl    51:13      \_ /home/michael/.local/nightly/firefox-bin -contentproc -childID 3 -isForBrowser -prefsLen 695 -prefMapSize 226332 -parentBuildID 20200603214922 -appdir /
michael  29286 17643  0 Jun04 ?        Sl    58:22      \_ /home/michael/.local/nightly/firefox-bin -contentproc -childID 4 -isForBrowser -prefsLen 695 -prefMapSize 226332 -parentBuildID 20200603214922 -appdir /
michael  29332 17643  0 Jun04 ?        Sl     9:58      \_ /home/michael/.local/nightly/firefox-bin -contentproc -childID 5 -isForBrowser -prefsLen 789 -prefMapSize 226332 -parentBuildID 20200603214922 -appdir /
michael  29401 17643  0 Jun04 ?        Sl    32:40      \_ /home/michael/.local/nightly/firefox-bin -contentproc -childID 6 -isForBrowser -prefsLen 6915 -prefMapSize 226332 -parentBuildID 20200603214922 -appdir 
michael  29459 17643  0 Jun04 ?        Sl    54:45      \_ /home/michael/.local/nightly/firefox-bin -contentproc -childID 7 -isForBrowser -prefsLen 7805 -prefMapSize 226332 -parentBuildID 20200603214922 -appdir 
michael  29632 17643  0 Jun04 ?        Sl     0:13      \_ /home/michael/.local/nightly/firefox-bin -contentproc -parentBuildID 20200603214922 -prefsLen 7939 -prefMapSize 226332 -appdir /home/michael/.local/nigh
michael  29787 17643  0 Jun04 ?        Sl    17:43      \_ /home/michael/.local/nightly/firefox-bin -contentproc -childID 8 -isForBrowser -prefsLen 8361 -prefMapSize 226332 -parentBuildID 20200603214922 -appdir 
michael  30188 17643  0 Jun04 ?        Sl    35:27      \_ /home/michael/.local/nightly/firefox-bin -contentproc -childID 9 -isForBrowser -prefsLen 8361 -prefMapSize 226332 -parentBuildID 20200603214922 -appdir 
michael   1473 17643  0 Jun05 ?        Sl    34:08      \_ /home/michael/.local/nightly/firefox-bin -contentproc -childID 12 -isForBrowser -prefsLen 10360 -prefMapSize 226332 -parentBuildID 20200603214922 -appdi
michael   2171 17643  0 Jun05 ?        Sl    15:27      \_ /home/michael/.local/nightly/firefox-bin -contentproc -childID 13 -isForBrowser -prefsLen 10561 -prefMapSize 226332 -parentBuildID 20200603214922 -appdi

Kershaw, please have a look at this bug.

Severity: -- → S3
Flags: needinfo?(kershaw)
Priority: -- → P2
Whiteboard: [necko-triaged]

Set release status flags based on info from the regressing bug 1602832

I think this is a dead lock problem.
See the log below, http handler is created in locked_profiler_stream_json_for_this_process, where gPSMutex is already locked and profiler_register_thread() is called since we tried to launch socket process in http handler. In profiler_register_thread, we tried to lock gPSMutex again and cause this dead lock.

[(null) 38769: Main Thread]: I/prof [38769] profiler_start
[(null) 38769: Main Thread]: I/prof [38769] locked_profiler_start
[(null) 38769: Main Thread]: I/prof [38769] - capacity  = 8388608
[(null) 38769: Main Thread]: I/prof [38769] - duration  = -1.00
[(null) 38769: Main Thread]: I/prof [38769] - interval = 1.00
[(null) 38769: Main Thread]: I/prof [38769] - browsing context ID = 0
[(null) 38769: Main Thread]: I/prof [38769] - feature  = stackwalk
[(null) 38769: Main Thread]: I/prof [38769] - threads  = GeckoMain
[(null) 38769: Unnamed thread 0x110140bd0]: D/prof [38769] profiler_register_thread(AsyncLogger)
[(null) 38769: Main Thread]: D/prof [38769] profiler_add_sampled_counter(malloc)
[(null) 38769: Main Thread]: I/prof [38769] profiler_stream_json_for_this_process
[(null) 38769: Main Thread]: I/prof [38769] locked_profiler_stream_json_for_this_process
[(null) 38769: Main Thread]: D/nsHttp nsHttpAuthCache::nsHttpAuthCache 0x1101b4858
[(null) 38769: Main Thread]: D/nsHttp nsHttpAuthCache::nsHttpAuthCache 0x1101b4888
[(null) 38769: Main Thread]: D/nsHttp Creating nsHttpHandler [this=0x1101b4800].
[(null) 38769: Main Thread]: D/nsHttp nsHttpHandler::SetFastOpenOSSupport version 19.5.0
[(null) 38769: Main Thread]: D/nsHttp nsHttpHandler::SetFastOpenOSSupport  supported.
[(null) 38769: Main Thread]: D/nsHttp nsHttpHandler::Init
[(null) 38769: IPC Launch #1]: D/prof [38769] profiler_register_thread(IPC Launch #1)
Assignee: nobody → kershaw
Flags: needinfo?(kershaw)

nsHttpHandler::Init launches socket process, which causes deadlock in trying profiler_register_thread if it's already locked.

I found this deadlock locally (on Windows, so it's not a Linux/intel-only bug), and I've just worked on a fix in bug 1649056 before I saw this bug here! 😅
My solution is pretty much the same (do things before locking the profiler mutex), but covers other potentially-dangerous things and a bit more, so I'd prefer we keep mine.

I'll make this bug here dependent on bug 1649056, and you can verify it's fixed afterwards...

Depends on: 1649056
OS: Linux → All
Hardware: x86_64 → All
Attachment #9159846 - Attachment is obsolete: true

Fixed by bug 1649056. Please reopen if you still have problems with this test.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED

Is bug 1649056 something we should consider uplifting?

Assignee: kershaw → gsquelart
Flags: needinfo?(gsquelart)
Target Milestone: --- → mozilla80

(In reply to Ryan VanderMeulen [:RyanVM] from comment #7)

Is bug 1649056 something we should consider uplifting?

Thank you for the suggestion. Requested in bug 1649056 comment 8.

Flags: needinfo?(gsquelart)

The patch landed in nightly and beta is affected.
:gerald, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(gsquelart)

Landed in beta, see bug 1649056 comment 10.

Flags: needinfo?(gsquelart)
Has Regression Range: --- → yes
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: