Closed Bug 1681842 Opened 10 months ago Closed 8 months ago

Parent process crashes on arm32 aren't detected, zombie app stays behind

Categories

(Toolkit :: Crash Reporting, defect)

ARM
Android
defect

Tracking

()

RESOLVED FIXED
87 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox83 --- unaffected
firefox84 + wontfix
firefox85 + wontfix
firefox86 + fixed
firefox87 --- fixed

People

(Reporter: mstange, Assigned: gsvelto)

References

(Regression)

Details

(Keywords: regression)

Attachments

(3 files, 1 obsolete file)

Steps to reproduce:

  1. In GeckoView example, navigate to about:crashparent.

Expected results:
Either the entire GVE app should crash, or a warning should be shown that allows restarting of Gecko.

Actual results:
The GVE app stays alive but is in a zombie state. No interaction with websites is possible any more. The app UI can still be interacted with.

I'm currently looking for a regression range.

Keywords: regression
13:48.59 INFO: Last good revision: 7cb103daaa8d249916f8e7c95a380c20434a9d41
13:48.59 INFO: First bad revision: ba8ce25fe969a8997591a33e060bed86bbcb9c1f
13:48.59 INFO: Pushlog:

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=7cb103daaa8d249916f8e7c95a380c20434a9d41&tochange=ba8ce25fe969a8997591a33e060bed86bbcb9c1f

Flags: needinfo?(gsvelto)
Regressed by: 1649132

[Tracking Requested - why for this release]: Some crashes are handled incorrectly on Android and cause the app to go into a zombie state. Users have to restart the app manually. No crash report is generated.

Has Regression Range: --- → yes
Has STR: --- → yes

While testing this I noticed something odd: even without the patch applied if I browse to about:crashparent the GeckoView example app doesn't seem to die, or maybe it's restarted automatically? It closes when the main process crashes, but if I look in the app tray it's still there, and clicking on it it seems to start again and I can interact with it. This needs a deeper investigation.

Flags: needinfo?(gsvelto)

(In reply to Gabriele Svelto [:gsvelto] from comment #4)

This needs a deeper investigation.

Is that something you can take on, or the geckoview team?

Flags: needinfo?(gsvelto)
Flags: needinfo?(amoya)

It would be very helpful if the GeckoView team could give this a second spin. As per comment 4 I couldn't really repro with the original STR. The app seems to close but if I look at the app tray it's still there and it appears to restart when I click on it. Also I get the same behavior w/ and w/o the regressing patch applied.

Flags: needinfo?(gsvelto)
Flags: needinfo?(snorp)

Hi David, can you help redirect this to someone on the GV team to take a look?

Flags: needinfo?(snorp) → needinfo?(dbolter)

Agi can you help (someone) move this along? (See comment 4)

Flags: needinfo?(dbolter) → needinfo?(agi)

Gabriele, is the concern about development builds? or is this a problem for users? If the concern is for user builds, could you try if this is reproducible in Fenix? GVE isn't really released anywhere, and the behavior is different in Fenix (I don't see the behavior in Fenix Nightly FWIW)

This likely happens because we don't have a crash reporter by default in GVE (Bug 1666372) so the crash gets logged to the console and then the default behavior on Android is to restart the crashed process.

Having the app on the tray is also expected, apps never really "die" from the user perspective, and having it on the tray is desirable so that the user can go back to what they were doing after a crash.

Also, I just tested on GVE with the tip of m-c and I don't see the behavior in the OP, the app disappears without any message (expected, we don't install a crash reporter for GVE) and then re-opening it works as inteded (i.e., the app restarts).

Markus, do you still see the hang?

In any case, we should track this only if we can reproduce on Fenix, GVE is just a debugging tool and shouldn't block anything.

Flags: needinfo?(mstange.moz)
Flags: needinfo?(gsvelto)
Flags: needinfo?(agi)
Flags: needinfo?(amoya)

(In reply to Agi Sferro | :agi | ni? for questions | ⏰ PST | he/him from comment #9)

If the concern is for user builds, could you try if this is reproducible in Fenix?

This is reproducible in Fenix for me, on arm32. That's where I found it - I only checked GVE to track down whether it was a GeckoView problem or a Fenix problem.

Also, I just tested on GVE with the tip of m-c and I don't see the behavior in the OP, the app disappears without any message (expected, we don't install a crash reporter for GVE) and then re-opening it works as inteded (i.e., the app restarts).

Were you testing on arm32 or arm64?

Markus, do you still see the hang?

Let me check.

Yes, the hang still happens on today's Fenix Nightly, on a Moto G5.

Flags: needinfo?(mstange.moz)
Flags: needinfo?(agi)

In response to Comment 6:

While testing this I noticed something odd: even without the patch applied if I browse to about:crashparent the GeckoView example app doesn't seem to die, or maybe it's restarted automatically? It closes when the main process crashes, but if I look in the app tray it's still there, and clicking on it it seems to start again and I can interact with it. This needs a deeper investigation.

This is normal, the app tray is not representative of what apps are running, is more a "history" of what apps the user ran, swiping the app away doesn't necessarily kill the app, either, for example.

(reply to Markus Stange [:mstange] from comment #10)

Were you testing on arm32 or arm64?

Ah the arm32 is indeed key to reproduce! Thanks for the clarification Markus.

Gabriele, I believe that's why you couldn't reproduce, you need to run an arm32 build (ac_add_options --target=arm in mozconfig), likely on an actual device, to reproduce this.

Sending to Crash Reporting since this only reproduce on one CPU arch (so unlikely to be a GV problem) and the regressing commit was landed there.

Feel free to drop on matrix#geckoview to this discuss this further or for help on how to run/build GeckoView.

Component: General → Crash Reporting
Flags: needinfo?(agi)
Product: GeckoView → Toolkit
OS: All → Android
Hardware: Unspecified → ARM
Summary: Parent process crashes aren't detected, zombie app stays behind → Parent process crashes on arm32 aren't detected, zombie app stays behind

(In reply to Agi Sferro | :agi | ni? for questions | ⏰ PST | he/him from comment #13)

(reply to Markus Stange [:mstange] from comment #10)

Were you testing on arm32 or arm64?

Ah the arm32 is indeed key to reproduce! Thanks for the clarification Markus.

Gabriele, I believe that's why you couldn't reproduce, you need to run an arm32 build (ac_add_options --target=arm in mozconfig), likely on an actual device, to reproduce this.

Thanks to both, this is really nasty news. We already have a 32-bit ARM-specific issue in crash reporting, bug 1666383 and it appears to be caused by a compiler bug (or feature?). I am afraid the issue here might be similar given it's specificity. Sadly this isn't even the first time it happens, we already had to roll some workarounds in the exception handler specifically for Android a while back :-/

Flags: needinfo?(gsvelto)

I just tried w/ GeckoView on an oldish ARM phone (Android 7.1.2) and I could make it crash just fine, here's the crash report:

https://crash-stats.mozilla.org/report/index/85ff5c25-b864-48b4-afc2-033180210104

The report suffers from bug 1666383 but the application died just as I expected it to. I'll try on a more modern phone and see what happens there.

FWIW I could reproduce on my S7 (which is aarch64) with Android 8.0.0.

No luck on my side. I tried Fenix nightly on my Fairphone 3 too but the crash reporting flow works fine. I wonder what could be the difference? Could you try repeating the STR and attaching the output of:

adb logcat -s "Gecko,GeckoCrashReporter,GeckoSession,google-breakpad,mozac/CrashReporter:*"

This is what I get:

01-05 14:36:39.903 30765 30800 I Gecko   : -*- nsDNSServiceDiscovery.js : nsDNSServiceDiscovery
01-05 14:36:40.886 30765 30800 I GeckoSession: zerdatime 101387632 - chrome startup finished
01-05 14:36:40.931 30765 30765 I GeckoSession: handleMessage GeckoView:PageStart uri=
01-05 14:36:40.939 30765 30765 I GeckoSession: handleMessage GeckoView:PageStop uri=null
01-05 14:36:41.053 30765 30765 I GeckoSession: handleMessage GeckoView:PageStart uri=
01-05 14:36:42.127 30765 30765 W GeckoSession: No history entries found.
01-05 14:36:42.331 30765 30765 I GeckoSession: handleMessage GeckoView:PageStop uri=null
01-05 14:36:44.044 30765 30800 I GeckoSession: zerdatime 101390791 - chrome startup finished
01-05 14:36:44.162 30765 30765 I GeckoSession: handleMessage GeckoView:PageStart uri=
01-05 14:36:44.162 30765 30765 I GeckoSession: handleMessage GeckoView:PageStop uri=null
01-05 14:38:15.537 30765 30800 W google-breakpad: ExceptionHandler::GenerateDump cloned child 
01-05 14:38:15.537 30765 30800 W google-breakpad: 31194
01-05 14:38:15.537 30765 30800 W google-breakpad: 
01-05 14:38:15.537 30765 30800 W google-breakpad: ExceptionHandler::SendContinueSignalToChild sent continue signal to child
01-05 14:38:15.538 31194 30800 W google-breakpad: ExceptionHandler::WaitForContinueSignal waiting for continue signal...
01-05 14:38:15.543 30765 30765 I GeckoSession: handleMessage GeckoView:PageStart uri=

There's something definitely wrong there. It seems that the crashed process is not woken up when minidump generation is done. I'll try to repro again on my side.

Gabriele would you mind assigning this bug to yourself at least for now if this is something you're looking into?

Flags: needinfo?(gsvelto)

Sure

Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Flags: needinfo?(gsvelto)

I think we're seeing one of these cases in the wild: https://github.com/mozilla-mobile/fenix/issues/17005 (note the reporter cannot install arm64 APKs so the OS is running arm 32bit)

Alright, I can finally repro reliably on my device. Looking into this today.

I've inspected the crash reporting flow and there's something odd going on: the CrashManager can't find the minidump file even though it is present. However the error it throws is handled and the rest of the code executes as expected. On the other hand I just noticed this error in the log I had missed until now:

01-20 11:04:05.941 22097 22116 E GeckoConsole: [JavaScript Error: "TypeError: can't access property "sendAsyncMessage", this._manager.messageManager is null" {file: "chrome://geckoview/content/geckoview.js" line: 497}]
01-20 11:04:05.941 22097 22116 E GeckoConsole: _updateContentModuleState@chrome://geckoview/content/geckoview.js:497:5
01-20 11:04:05.941 22097 22116 E GeckoConsole: set enabled@chrome://geckoview/content/geckoview.js:477:10
01-20 11:04:05.941 22097 22116 E GeckoConsole: init/</<@chrome://geckoview/content/geckoview.js:107:9
01-20 11:04:05.941 22097 22116 E GeckoConsole: forEach@chrome://geckoview/content/geckoview.js:138:19
01-20 11:04:05.941 22097 22116 E GeckoConsole: init/<@chrome://geckoview/content/geckoview.js:106:12

I'm not familiar with this code but it doesn't look good and might explain why the app is unresponsive afterwards.

The more I look into this the less this seems like a problem with the crash reporting code. The error in the crash manager is rather straightforward: the code is trying to invoke the minidump-analyzer program to extract the stack trace from the crash and add it to the crash ping, but the program isn't there. Which makes sense because we never packaged it for Android. Back in the days of Fennec we added its functionality to mozglue and used it from there, see the patch in bug 1307153. It seems to me that this doesn't happen in Fenix. I poked into the android-components repo and I can't see the crash handler trying to extract the stack trace so this is a regression compared to Fenix and I'll file a separate bug for it.

As for the hang here it seems to me that what is happening is that after a crash the unload event is received by the GeckoViewModule corresponding to the current window which in turn tries to disable all its "modules" (I'm not sure what they are exactly). When disabled one of these modules is thus trying to send a GeckoView:UpdateModuleState message, but the message manager is gone and from what I can tell that corresponds with the browser object of the current window.

In other words after the crash the browser object for the current window is gone but we're still trying to use its message manager to relay messages.

My guess is that this is caused by some kind of race. I don't understand how bug 1649132 might trigger it however, because the changes in that bug only affect main process' crashes, not content process ones. However I did land some changes in that timeframe which fixed races when dealing with content process crashes, namely bug 1666383 and bug 1624467, and those did affect the order of events happening between the main process and a crashed content one.

Agi, I'm not super-familiar with GeckoView's front-end structure, is the info above enough to figure out what's going on?

Flags: needinfo?(agi)

Hey Gabriele! I think the messageManager error is a red herring. I see the bug even without that line.

Also I don't understand how you would get anything running from geckoview.js, that file runs in the main process, so after calling about:crashparent it shouldn't be running at all. Do you have a full log? I can look at it.

One way to reliably see that error message is to trigger a content process crash about:crashcontent, in that case the window is closed (I'm assuming by Gecko) and we try to update the state in the content process and we fail to do so, which is sort-of expected, although we should check if the messageManager is alive so we don't throw like that.

But this case is for parent process crashes.

Flags: needinfo?(agi) → needinfo?(gsvelto)

To be clear here, the crash is in the parent process, so the app should close, stuff like this

As for the hang here it seems to me that what is happening is that after a crash the unload event is received by the GeckoViewModule corresponding to the current window which in turn tries to disable all its "modules" (I'm not sure what they are exactly). When disabled one of these modules is thus trying to send a GeckoView:UpdateModuleState message, but the message manager is gone and from what I can tell that corresponds with the browser object of the current window.

In other words after the crash the browser object for the current window is gone but we're still trying to use its message manager to relay messages.

is just not possible, the parent process is crashing so we can't run anything in it.

Duh, somehow I went back and read the STR in comment 0 this morning and in my mind it read about:crashcontent. I'll re-test tomorrow.

That being said the behavior for content process crashes is also odd but not catastrophic. The page just stays there, unresponsive, until you point the URL at something else.

T(In reply to Gabriele Svelto [:gsvelto] from comment #28)

That being said the behavior for content process crashes is also odd but not catastrophic. The page just stays there, unresponsive, until you point the URL at something else.

That's just because GVE doesn't implement onCrash properly, if you try in Fenix it should show you a "sad tab" that asks you if you want to restore the tab or close it.

Testing again and no luck. about:crashparent works just fine on all of my phones both in GeckoView and Fenix. So we'll debug this the printf() way. I'll attach a patch with a lot of logging to figure out what's going on and ask to repro for me.

Agi can you reproduce the issue on your phone with this patch applied then attach the output of logcat? The output should look like this:

01-21 10:49:52.533 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() called, here we go
01-21 10:49:52.533 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() opening /proc/meminfo
01-21 10:49:52.533 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() reading /proc/meminfo
01-21 10:49:52.534 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() parsing /proc/meminfo
01-21 10:49:52.534 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() writing MemTotal with value 866865152
01-21 10:49:52.534 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() writing MemFree with value 13053952
01-21 10:49:52.534 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() writing SwapFree with value 267350016
01-21 10:49:52.534 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() parsed /proc/meminfo
01-21 10:49:52.534 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() writing AvailablePageFile with value 0
01-21 10:49:52.534 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() writing TotalPageFile with value 1286291456
01-21 10:49:52.534 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() is done
01-21 10:49:52.540 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() called, here we go
01-21 10:49:52.540 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() opening /proc/meminfo
01-21 10:49:52.540 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() reading /proc/meminfo
01-21 10:49:52.540 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() parsing /proc/meminfo
01-21 10:49:52.540 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() writing MemTotal with value 866865152
01-21 10:49:52.540 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() writing MemFree with value 12632064
01-21 10:49:52.540 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() writing SwapFree with value 267354112
01-21 10:49:52.540 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() parsed /proc/meminfo
01-21 10:49:52.540 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() writing AvailablePageFile with value 0
01-21 10:49:52.540 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() writing TotalPageFile with value 1286291456
01-21 10:49:52.540 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() is done

Could you also attach the contents of /proc/meminfo? I hadn't thought about that before but we're parsing that file, so if the format is somehow different on certain phones maybe the parsing logic gets confused and fails or ends up in a loop.

Flags: needinfo?(gsvelto) → needinfo?(agi)

Man now I can't reproduce with a local build anymore :/ (I can reproduce with a nightly from taskcluster though, not sure why) I'll poke things around.

Flags: needinfo?(agi)

Oh, it just occurred to me that, when I reproduced it, I was using an artifact build, maybe I can't reproduce anymore because local builds don't have the crash reporter, let's see if ac_add_options --enable-crashreporter helps.

(In reply to Agi Sferro | :agi | ni? for questions | ⏰ PST | he/him from comment #33)

Oh, it just occurred to me that, when I reproduced it, I was using an artifact build, maybe I can't reproduce anymore because local builds don't have the crash reporter, let's see if ac_add_options --enable-crashreporter helps.

Nope, now I actually get the crash reporter too :/

I'll push this to try, let's see if I can reproduce with that build.

Can't reproduce with a try build either. I wonder why. Maybe something related to MOZILLA_OFFICIAL?

Agi, Markus, if you happen to reproduce this by chance can you just get me the contents of /proc/meminfo from your phones? I suspect the parsing logic might be the culprit here.

$ cat /proc/meminfo
MemTotal:        3811596 kB
MemFree:          171480 kB
MemAvailable:    1991060 kB
Buffers:          158360 kB
Cached:          1797588 kB
SwapCached:        37136 kB
Active:          1592936 kB
Inactive:        1042672 kB
Active(anon):     342808 kB
Inactive(anon):   339108 kB
Active(file):    1250128 kB
Inactive(file):   703564 kB
Unevictable:         128 kB
Mlocked:             128 kB
RbinTotal:        335872 kB
RbinAlloced:           0 kB
RbinPool:              0 kB
RbinFree:              0 kB
SwapTotal:       2097148 kB
SwapFree:        1669316 kB
Dirty:               144 kB
Writeback:             0 kB
AnonPages:        673924 kB
Mapped:           498148 kB
Shmem:              2256 kB
Slab:             226888 kB
SReclaimable:      89152 kB
SUnreclaim:       137736 kB
KernelStack:       39600 kB
PageTables:        61008 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     4002944 kB
Committed_AS:   117788076 kB
VmallocTotal:   258998208 kB
VmallocUsed:      608376 kB
VmallocChunk:   258296804 kB
Flags: needinfo?(agi) → needinfo?(gsvelto)

I just remembered I wanted to try a MOZILLA_OFICIAL build, I'm doing it now.

Yahoo some progress! Building a shippable artifact from try I can finally reproduce, so at least we got something: https://treeherder.mozilla.org/jobs?repo=try&revision=8450e16d1ff3166274d37e2735c843e58ee5d5d2

Gabriele, feel free to submit a try build with whatever debug info you need an ni me and I can give you the output.

Meanwhile, with your patch, this is what I get:

02-02 13:49:48.913 16493 16493 I GeckoViewActivity: Starting to load page at about:crashparent
02-02 13:49:48.913 16493 16493 I GeckoViewActivity: zerdatime 1130678239 - page load start
02-02 13:49:48.922 16493 16493 W IInputConnectionWrapper: getCursorCapsMode on inactive InputConnection
02-02 13:49:50.056 16493 16493 I Choreographer: Skipped 67 frames!  The application may be doing too much work on its main thread.
02-02 13:49:50.056 16493 16493 W IInputConnectionWrapper: getTextAfterCursor on inactive InputConnection
02-02 13:49:50.065  2628  2628 I SKBD    : SamsungKeypad [IMI] onStartInput - caller pid : 16493, caller uid : 10613
02-02 13:49:50.097 16493 16510 D Gecko   : *****  WriteAnnotationsForMainProcessCrash()

I have tested the parsing code more extensively and it's not part of the problem because it works correctly with the data in comment 39. Since there seems to be a difference between different devices one thing that comes to mind is SELinux. I wonder if Samsung devices have a different setup and maybe that's blocking our attempt at reading /proc/meminfo? I'll change the error reporting slightly and see if we can do better.

Flags: needinfo?(gsvelto)

Agi I slightly changed the logic that reads /proc/meminfo so that it does not assume that the file can be read entirely in only one syscall. Can you try it out? I've launched a try run here.

Also, if the issue still happens could you provide me the full contents of logcat? The best would be to run adb logcat -c before reproducing and then the full contents of adb logcat.

Flags: needinfo?(agi)
Attached file log.txt

I can still reproduce. This is the full logcat from uninstalling geckoview_example, to reproducing the problem.

Flags: needinfo?(agi)
Flags: needinfo?(gsvelto)

I don't know if it helps, but I attached a debugger when the app hangs and this is the stacktrace for the Gecko thread when it's stuck

  thread #13, name = 'Gecko'
    frame #0: 0xe6ffe178 libc.so`syscall + 28
    frame #1: 0xe702e618 libc.so`__pthread_mutex_lock_with_timeout(pthread_mutex_internal_t*, bool, timespec const*) + 172
    frame #2: 0xc1139d74 libxul.so`google_breakpad::ExceptionHandler::SignalHandler(sig=<unavailable>, info=<unavailable>, uc=<unavailable>) at exception_handler.cc:360
    frame #3: 0xc1141478 libxul.so`nsProfileLock::FatalSignalHandler(signo=<unavailable>, info=<unavailable>, context=<unavailable>) at nsProfileLock.cpp:173
    frame #4: 0xafe363ec app_process32
    frame #5: 0xe6ffde1c libc.so
    frame #6: 0xc113615c libxul.so`CrashReporter::AnnotateOOMAllocationSize(size=<unavailable>) at nsExceptionHandler.cpp:0

Oh so further messing with the debugger, it appears that we get a SIGSEGV here:

sys_open(char const*, int, int) 0x00000000c137a15c
CrashReporter::AnnotateMemoryStatus(CrashReporter::AnnotationWriter&) 0x00000000c137cbd4
CrashReporter::WriteSynthesizedAnnotations(CrashReporter::AnnotationWriter&) 0x00000000c137cbbe
CrashReporter::WriteAnnotationsForMainProcessCrash(CrashReporter::PlatformWriter&, mozilla::phc::AddrInfo const*, long) 0x00000000c137a71c
CrashReporter::WriteCrashEventFile(long, char const*, mozilla::phc::AddrInfo const*, google_breakpad::MinidumpDescriptor const&) 0x00000000c137a374
CrashReporter::MinidumpCallback(google_breakpad::MinidumpDescriptor const&, void*, mozilla::phc::AddrInfo const*, bool) 0x00000000c137a2ce
google_breakpad::ExceptionHandler::GenerateDump(google_breakpad::ExceptionHandler::CrashContext*, mozilla::phc::AddrInfo const*) 0x00000000c137eae6
google_breakpad::ExceptionHandler::HandleSignal(int, siginfo*, void*) 0x00000000c137dfa0
google_breakpad::ExceptionHandler::SignalHandler(int, siginfo*, void*) 0x00000000c137de3e
nsProfileLock::FatalSignalHandler(int, siginfo*, void*) 0x00000000c1385478
WasmTrapHandler(int, siginfo*, void*) 0x00000000c1696626
<unknown> 0x00000000afe363ec
<unknown> 0x00000000e6ffde1c
CrashChannel::OpenContentStream(bool, nsIInputStream**, nsIChannel**) 0x00000000c125b0b6
nsBaseChannel::BeginPumpingData() 0x00000000c2180666
nsBaseChannel::AsyncOpen(nsIStreamListener*) 0x00000000c2180606
mozilla::net::DocumentLoadListener::Open(nsDocShellLoadState*, mozilla::net::LoadInfo*, unsigned int, unsigned int, mozilla::Maybe<unsigned long long> const&, mozilla::TimeStamp const&, nsDOMNavigationTiming*, mozilla::Maybe<mozilla::dom::ClientInfo>&&, bool, int, nsresult*) 0x00000000c1f161a2
mozilla::net::DocumentLoadListener::OpenDocument(nsDocShellLoadState*, unsigned int, mozilla::Maybe<unsigned long long> const&, mozilla::TimeStamp const&, nsDOMNavigationTiming*, mozilla::Maybe<mozilla::dom::ClientInfo>&&, mozilla::Maybe<bool>, mozilla::Maybe<bool>, int, nsresult*) 0x00000000c21de65c
mozilla::net::DocumentChannelParent::Init(mozilla::dom::CanonicalBrowsingContext*, mozilla::net::DocumentChannelCreationArgs const&) 0x00000000c1f16f20
mozilla::net::NeckoParent::RecvPDocumentChannelConstructor(mozilla::net::PDocumentChannelParent*, mozilla::dom::MaybeDiscarded<mozilla::dom::BrowsingContext> const&, mozilla::net::DocumentChannelCreationArgs const&) 0x00000000c1f16e48
mozilla::net::PNeckoParent::OnMessageReceived(IPC::Message const&) 0x00000000c22166f6
mozilla::dom::PContentParent::OnMessageReceived(IPC::Message const&) 0x00000000c1f22abc
mozilla::ipc::MessageChannel::DispatchAsyncMessage(mozilla::ipc::ActorLifecycleProxy*, IPC::Message const&) 0x00000000c1f1ea38
mozilla::ipc::MessageChannel::DispatchMessage(IPC::Message&&) 0x00000000c1f1e53c
mozilla::ipc::MessageChannel::MessageTask::Run() 0x00000000c1f1e8f2
mozilla::RunnableTask::Run() 0x00000000c1eefbe4
mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) 0x00000000c1eef22c
mozilla::TaskController::ExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) 0x00000000c1eeec6a
mozilla::TaskController::ProcessPendingMTTask(bool) 0x00000000c1eeed1e
mozilla::TaskController::InitializeInternal()::$_3::operator()() const 0x00000000c1ef030c
mozilla::detail::RunnableFunction<mozilla::TaskController::InitializeInternal()::$_3>::Run() 0x00000000c1ef0300
nsThread::ProcessNextEvent(bool, bool*) 0x00000000c1ef2d76
NS_ProcessNextEvent(nsIThread*, bool) 0x00000000c1ef38dc
mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) 0x00000000c1f1ee58
MessageLoop::RunInternal() 0x00000000c21e7c6c
MessageLoop::RunHandler() 0x00000000c21e7c64
MessageLoop::Run() 0x00000000c21e7c64
nsBaseAppShell::Run() 0x00000000c0c6ff44
nsAppStartup::Run() 0x00000000c1349166
XREMain::XRE_mainRun() 0x00000000c23f60a0
XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) 0x00000000c23f5d0e
XRE_main(int, char**, mozilla::BootstrapConfig const&) 0x00000000c138b43a
::GeckoStart(JNIEnv *, char **, int, const mozilla::StaticXREAppData &) 0x00000000c138c490
::Java_org_mozilla_gecko_mozglue_GeckoLoader_nativeRun(JNIEnv *, jclass, jobjectArray, int, int, int, int, int) 0x00000000c4f12936
<unknown> 0x00000000c5c703c2

so maybe the issue is that we're crashing while handling a crash causing a deadlock?

Indeed! We don't support recursive crash handling so we're deadlocking. I wonder however how could that happen? The only call to open() we're making is here and there's no way that can segfault.

Unless... we're allocating the buffer object on the stack before the call, and we're in the alternative signal stack because this is a main process crash so we're handling it from within the crashed process. I remember that the alt sig stack is 16KiB so maybe we've blown past it and it's causing a segfault when pushing stuff on the stack later? If my hypothesis is right cutting down BUFFER_SYZE_BYTES to say 4KiB should solve the problem. If that doesn't help then I'm afraid the only thing we can do is disable this particular code path on Android until we get full out-of-process crash reporting.

Flags: needinfo?(gsvelto)

I remember that the alt sig stack is 16KiB so maybe we've blown past it and it's causing a segfault when pushing stuff on the stack later? If my hypothesis is right cutting down BUFFER_SYZE_BYTES to say 4KiB should solve the problem

Sent a try build with this change here https://treeherder.mozilla.org/#/jobs?repo=try&revision=38b17cc856fc9b17abdf8c90819ab281e798741b

(In reply to Agi Sferro | :agi | ni? for questions | ⏰ PST | he/him from comment #49)

I remember that the alt sig stack is 16KiB so maybe we've blown past it and it's causing a segfault when pushing stuff on the stack later? If my hypothesis is right cutting down BUFFER_SYZE_BYTES to say 4KiB should solve the problem

Sent a try build with this change here https://treeherder.mozilla.org/#/jobs?repo=try&revision=38b17cc856fc9b17abdf8c90819ab281e798741b

This seems to work! :) Gabriele do you want to submit a patch for that?

Flags: needinfo?(gsvelto)

Yes, patch coming. Thinking back this was an oversight during review and yet another reason to go with full OOP crash generation ASAP. Doing stuff in an exception handler is hard and error-prone.

Flags: needinfo?(gsvelto)

When the main process crashes /proc/meminfo is being read from within an
exception handler with an alternate signal stack of only 16KiB. The buffer we
used was 10KB which might cause overflows on certain devices. When it happened
a new exception would be triggered causing a deadlock in the exception handler
which does not support recursive exception handling.

Attachment #9198425 - Attachment is obsolete: true
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6c32d769ff9a
Reduce the size of the bufer used for parsing /proc/meminfo and handle error conditions more gracefully r=KrisWright
Status: ASSIGNED → RESOLVED
Closed: 8 months ago
Resolution: --- → FIXED
Target Milestone: --- → 87 Branch

The patch landed in nightly and beta is affected.
:gsvelto, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(gsvelto)

Comment on attachment 9202818 [details]
Bug 1681842 - Reduce the size of the bufer used for parsing /proc/meminfo and handle error conditions more gracefully r=KrisWright

Beta/Release Uplift Approval Request

  • User impact if declined: The browser may remain in an unresponsive state after a main process crash, preventing the user from restarting it
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This patch reduces the size of a buffer which was causing a stack overflow
  • String changes made/needed: none
Flags: needinfo?(gsvelto)
Attachment #9202818 - Flags: approval-mozilla-beta?
Attachment #9202818 - Flags: approval-mozilla-beta? → approval-mozilla-release?

Comment on attachment 9202818 [details]
Bug 1681842 - Reduce the size of the bufer used for parsing /proc/meminfo and handle error conditions more gracefully r=KrisWright

Taking for Fenix 86, Thanks

Attachment #9202818 - Flags: approval-mozilla-release? → approval-mozilla-release+
You need to log in before you can comment on or make changes to this bug.