1681842 - Parent process crashes on arm32 aren't detected, zombie app stays behind

Markus Stange [:mstange]

Reporter

Description

•

5 years ago

•

Edited

Steps to reproduce:

In GeckoView example, navigate to about:crashparent.

Expected results:
Either the entire GVE app should crash, or a warning should be shown that allows restarting of Gecko.

Actual results:
The GVE app stays alive but is in a zombie state. No interaction with websites is possible any more. The app UI can still be interacted with.

Markus Stange [:mstange]

Reporter

Comment 1

•

5 years ago

I'm currently looking for a regression range.

Keywords: regression

Markus Stange [:mstange]

Reporter

Comment 2

•

5 years ago

•

Edited

13:48.59 INFO: Last good revision: 7cb103daaa8d249916f8e7c95a380c20434a9d41
13:48.59 INFO: First bad revision: ba8ce25fe969a8997591a33e060bed86bbcb9c1f
13:48.59 INFO: Pushlog:

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=7cb103daaa8d249916f8e7c95a380c20434a9d41&tochange=ba8ce25fe969a8997591a33e060bed86bbcb9c1f

Flags: needinfo?(gsvelto)

Regressed by: 1649132

Markus Stange [:mstange]

Reporter

Comment 3

•

5 years ago

•

Edited

[Tracking Requested - why for this release]: Some crashes are handled incorrectly on Android and cause the app to go into a zombie state. Users have to restart the app manually. No crash report is generated.

Has Regression Range: --- → yes

Has STR: --- → yes

status-firefox83: --- → unaffected

status-firefox84: --- → affected

status-firefox85: --- → affected

status-firefox-esr78: --- → unaffected

tracking-firefox84: --- → ?

tracking-firefox85: --- → ?

Gabriele Svelto [:gsvelto]

Assignee

Comment 4

•

5 years ago

While testing this I noticed something odd: even without the patch applied if I browse to about:crashparent the GeckoView example app doesn't seem to die, or maybe it's restarted automatically? It closes when the main process crashes, but if I look in the app tray it's still there, and clicking on it it seems to start again and I can interact with it. This needs a deeper investigation.

Flags: needinfo?(gsvelto)

Julien Cristau [:jcristau]

Updated

•

5 years ago

tracking-firefox85: ? → +

Ryan VanderMeulen [:RyanVM]

Updated

•

5 years ago

status-firefox86: --- → affected

tracking-firefox84: ? → +

tracking-firefox86: --- → +

Julien Cristau [:jcristau]

Comment 5

•

5 years ago

(In reply to Gabriele Svelto [:gsvelto] from comment #4)

This needs a deeper investigation.

Is that something you can take on, or the geckoview team?

Flags: needinfo?(gsvelto)

Flags: needinfo?(amoya)

Gabriele Svelto [:gsvelto]

Assignee

Comment 6

•

5 years ago

It would be very helpful if the GeckoView team could give this a second spin. As per comment 4 I couldn't really repro with the original STR. The app seems to close but if I look at the app tray it's still there and it appears to restart when I click on it. Also I get the same behavior w/ and w/o the regressing patch applied.

Flags: needinfo?(gsvelto)

Julien Cristau [:jcristau]

Updated

•

5 years ago

Flags: needinfo?(snorp)

Ryan VanderMeulen [:RyanVM]

Comment 7

•

5 years ago

Hi David, can you help redirect this to someone on the GV team to take a look?

Flags: needinfo?(snorp) → needinfo?(dbolter)

David Bolter [:davidb] (NeedInfo me for attention)

Comment 8

•

5 years ago

Agi can you help (someone) move this along? (See comment 4)

Flags: needinfo?(dbolter) → needinfo?(agi)

[ex-Mozilla] Agi Sferro | :agi

Comment 9

•

5 years ago

Gabriele, is the concern about development builds? or is this a problem for users? If the concern is for user builds, could you try if this is reproducible in Fenix? GVE isn't really released anywhere, and the behavior is different in Fenix (I don't see the behavior in Fenix Nightly FWIW)

This likely happens because we don't have a crash reporter by default in GVE (Bug 1666372) so the crash gets logged to the console and then the default behavior on Android is to restart the crashed process.

Having the app on the tray is also expected, apps never really "die" from the user perspective, and having it on the tray is desirable so that the user can go back to what they were doing after a crash.

Also, I just tested on GVE with the tip of m-c and I don't see the behavior in the OP, the app disappears without any message (expected, we don't install a crash reporter for GVE) and then re-opening it works as inteded (i.e., the app restarts).

Markus, do you still see the hang?

In any case, we should track this only if we can reproduce on Fenix, GVE is just a debugging tool and shouldn't block anything.

Flags: needinfo?(mstange.moz)

Flags: needinfo?(gsvelto)

Flags: needinfo?(agi)

[ex-Mozilla] Agi Sferro | :agi

Updated

•

5 years ago

Flags: needinfo?(amoya)

Markus Stange [:mstange]

Reporter

Comment 10

•

5 years ago

(In reply to Agi Sferro | :agi | ni? for questions | ⏰ PST | he/him from comment #9)

If the concern is for user builds, could you try if this is reproducible in Fenix?

This is reproducible in Fenix for me, on arm32. That's where I found it - I only checked GVE to track down whether it was a GeckoView problem or a Fenix problem.

Also, I just tested on GVE with the tip of m-c and I don't see the behavior in the OP, the app disappears without any message (expected, we don't install a crash reporter for GVE) and then re-opening it works as inteded (i.e., the app restarts).

Were you testing on arm32 or arm64?

Markus, do you still see the hang?

Let me check.

Markus Stange [:mstange]

Reporter

Comment 11

•

5 years ago

Yes, the hang still happens on today's Fenix Nightly, on a Moto G5.

Flags: needinfo?(mstange.moz)

Markus Stange [:mstange]

Reporter

Updated

•

5 years ago

Flags: needinfo?(agi)

[ex-Mozilla] Agi Sferro | :agi

Comment 12

•

5 years ago

In response to Comment 6:

While testing this I noticed something odd: even without the patch applied if I browse to about:crashparent the GeckoView example app doesn't seem to die, or maybe it's restarted automatically? It closes when the main process crashes, but if I look in the app tray it's still there, and clicking on it it seems to start again and I can interact with it. This needs a deeper investigation.

This is normal, the app tray is not representative of what apps are running, is more a "history" of what apps the user ran, swiping the app away doesn't necessarily kill the app, either, for example.

[ex-Mozilla] Agi Sferro | :agi

Comment 13

•

5 years ago

•

Edited

(reply to Markus Stange [:mstange] from comment #10)

Were you testing on arm32 or arm64?

Ah the arm32 is indeed key to reproduce! Thanks for the clarification Markus.

Gabriele, I believe that's why you couldn't reproduce, you need to run an arm32 build (ac_add_options --target=arm in mozconfig), likely on an actual device, to reproduce this.

Sending to Crash Reporting since this only reproduce on one CPU arch (so unlikely to be a GV problem) and the regressing commit was landed there.

Feel free to drop on matrix#geckoview to this discuss this further or for help on how to run/build GeckoView.

Component: General → Crash Reporting

Flags: needinfo?(agi)

Product: GeckoView → Toolkit

Markus Stange [:mstange]

Reporter

Updated

•

5 years ago

OS: All → Android

Hardware: Unspecified → ARM

Summary: Parent process crashes aren't detected, zombie app stays behind → Parent process crashes on arm32 aren't detected, zombie app stays behind

Gabriele Svelto [:gsvelto]

Assignee

Comment 14

•

5 years ago

(In reply to Agi Sferro | :agi | ni? for questions | ⏰ PST | he/him from comment #13)

(reply to Markus Stange [:mstange] from comment #10)

Were you testing on arm32 or arm64?

Ah the arm32 is indeed key to reproduce! Thanks for the clarification Markus.

Gabriele, I believe that's why you couldn't reproduce, you need to run an arm32 build (ac_add_options --target=arm in mozconfig), likely on an actual device, to reproduce this.

Thanks to both, this is really nasty news. We already have a 32-bit ARM-specific issue in crash reporting, bug 1666383 and it appears to be caused by a compiler bug (or feature?). I am afraid the issue here might be similar given it's specificity. Sadly this isn't even the first time it happens, we already had to roll some workarounds in the exception handler specifically for Android a while back :-/

Flags: needinfo?(gsvelto)

Gabriele Svelto [:gsvelto]

Assignee

Comment 15

•

5 years ago

I just tried w/ GeckoView on an oldish ARM phone (Android 7.1.2) and I could make it crash just fine, here's the crash report:

https://crash-stats.mozilla.org/report/index/85ff5c25-b864-48b4-afc2-033180210104

The report suffers from bug 1666383 but the application died just as I expected it to. I'll try on a more modern phone and see what happens there.

[ex-Mozilla] Agi Sferro | :agi

Comment 16

•

5 years ago

FWIW I could reproduce on my S7 (which is aarch64) with Android 8.0.0.

Gabriele Svelto [:gsvelto]

Assignee

Comment 17

•

5 years ago

No luck on my side. I tried Fenix nightly on my Fairphone 3 too but the crash reporting flow works fine. I wonder what could be the difference? Could you try repeating the STR and attaching the output of:

adb logcat -s "Gecko,GeckoCrashReporter,GeckoSession,google-breakpad,mozac/CrashReporter:*"

[ex-Mozilla] Agi Sferro | :agi

Comment 18

•

5 years ago

This is what I get:

01-05 14:36:39.903 30765 30800 I Gecko   : -*- nsDNSServiceDiscovery.js : nsDNSServiceDiscovery
01-05 14:36:40.886 30765 30800 I GeckoSession: zerdatime 101387632 - chrome startup finished
01-05 14:36:40.931 30765 30765 I GeckoSession: handleMessage GeckoView:PageStart uri=
01-05 14:36:40.939 30765 30765 I GeckoSession: handleMessage GeckoView:PageStop uri=null
01-05 14:36:41.053 30765 30765 I GeckoSession: handleMessage GeckoView:PageStart uri=
01-05 14:36:42.127 30765 30765 W GeckoSession: No history entries found.
01-05 14:36:42.331 30765 30765 I GeckoSession: handleMessage GeckoView:PageStop uri=null
01-05 14:36:44.044 30765 30800 I GeckoSession: zerdatime 101390791 - chrome startup finished
01-05 14:36:44.162 30765 30765 I GeckoSession: handleMessage GeckoView:PageStart uri=
01-05 14:36:44.162 30765 30765 I GeckoSession: handleMessage GeckoView:PageStop uri=null
01-05 14:38:15.537 30765 30800 W google-breakpad: ExceptionHandler::GenerateDump cloned child 
01-05 14:38:15.537 30765 30800 W google-breakpad: 31194
01-05 14:38:15.537 30765 30800 W google-breakpad: 
01-05 14:38:15.537 30765 30800 W google-breakpad: ExceptionHandler::SendContinueSignalToChild sent continue signal to child
01-05 14:38:15.538 31194 30800 W google-breakpad: ExceptionHandler::WaitForContinueSignal waiting for continue signal...
01-05 14:38:15.543 30765 30765 I GeckoSession: handleMessage GeckoView:PageStart uri=

Gabriele Svelto [:gsvelto]

Assignee

Comment 19

•

5 years ago

There's something definitely wrong there. It seems that the crashed process is not woken up when minidump generation is done. I'll try to repro again on my side.

Julien Cristau [:jcristau]

Updated

•

5 years ago

status-firefox84: affected → wontfix

Julien Cristau [:jcristau]

Comment 20

•

5 years ago

Gabriele would you mind assigning this bug to yourself at least for now if this is something you're looking into?

Flags: needinfo?(gsvelto)

Gabriele Svelto [:gsvelto]

Assignee

Comment 21

•

5 years ago

Sure

Assignee: nobody → gsvelto

Status: NEW → ASSIGNED

Flags: needinfo?(gsvelto)

[ex-Mozilla] Agi Sferro | :agi

Comment 22

•

5 years ago

I think we're seeing one of these cases in the wild: https://github.com/mozilla-mobile/fenix/issues/17005 (note the reporter cannot install arm64 APKs so the OS is running arm 32bit)

Gabriele Svelto [:gsvelto]

Assignee

Comment 23

•

5 years ago

Alright, I can finally repro reliably on my device. Looking into this today.

Gabriele Svelto [:gsvelto]

Assignee

Comment 24

•

5 years ago

I've inspected the crash reporting flow and there's something odd going on: the CrashManager can't find the minidump file even though it is present. However the error it throws is handled and the rest of the code executes as expected. On the other hand I just noticed this error in the log I had missed until now:

01-20 11:04:05.941 22097 22116 E GeckoConsole: [JavaScript Error: "TypeError: can't access property "sendAsyncMessage", this._manager.messageManager is null" {file: "chrome://geckoview/content/geckoview.js" line: 497}]
01-20 11:04:05.941 22097 22116 E GeckoConsole: _updateContentModuleState@chrome://geckoview/content/geckoview.js:497:5
01-20 11:04:05.941 22097 22116 E GeckoConsole: set enabled@chrome://geckoview/content/geckoview.js:477:10
01-20 11:04:05.941 22097 22116 E GeckoConsole: init/</<@chrome://geckoview/content/geckoview.js:107:9
01-20 11:04:05.941 22097 22116 E GeckoConsole: forEach@chrome://geckoview/content/geckoview.js:138:19
01-20 11:04:05.941 22097 22116 E GeckoConsole: init/<@chrome://geckoview/content/geckoview.js:106:12

I'm not familiar with this code but it doesn't look good and might explain why the app is unresponsive afterwards.

Gabriele Svelto [:gsvelto]

Assignee

Comment 25

•

5 years ago

The more I look into this the less this seems like a problem with the crash reporting code. The error in the crash manager is rather straightforward: the code is trying to invoke the minidump-analyzer program to extract the stack trace from the crash and add it to the crash ping, but the program isn't there. Which makes sense because we never packaged it for Android. Back in the days of Fennec we added its functionality to mozglue and used it from there, see the patch in bug 1307153. It seems to me that this doesn't happen in Fenix. I poked into the android-components repo and I can't see the crash handler trying to extract the stack trace so this is a regression compared to Fenix and I'll file a separate bug for it.

As for the hang here it seems to me that what is happening is that after a crash the unload event is received by the GeckoViewModule corresponding to the current window which in turn tries to disable all its "modules" (I'm not sure what they are exactly). When disabled one of these modules is thus trying to send a GeckoView:UpdateModuleState message, but the message manager is gone and from what I can tell that corresponds with the browser object of the current window.

In other words after the crash the browser object for the current window is gone but we're still trying to use its message manager to relay messages.

My guess is that this is caused by some kind of race. I don't understand how bug 1649132 might trigger it however, because the changes in that bug only affect main process' crashes, not content process ones. However I did land some changes in that timeframe which fixed races when dealing with content process crashes, namely bug 1666383 and bug 1624467, and those did affect the order of events happening between the main process and a crashed content one.

Agi, I'm not super-familiar with GeckoView's front-end structure, is the info above enough to figure out what's going on?

Flags: needinfo?(agi)

[ex-Mozilla] Agi Sferro | :agi

Comment 26

•

5 years ago

Hey Gabriele! I think the messageManager error is a red herring. I see the bug even without that line.

Also I don't understand how you would get anything running from geckoview.js, that file runs in the main process, so after calling about:crashparent it shouldn't be running at all. Do you have a full log? I can look at it.

One way to reliably see that error message is to trigger a content process crash about:crashcontent, in that case the window is closed (I'm assuming by Gecko) and we try to update the state in the content process and we fail to do so, which is sort-of expected, although we should check if the messageManager is alive so we don't throw like that.

But this case is for parent process crashes.

Flags: needinfo?(agi) → needinfo?(gsvelto)

[ex-Mozilla] Agi Sferro | :agi

Comment 27

•

5 years ago

To be clear here, the crash is in the parent process, so the app should close, stuff like this

As for the hang here it seems to me that what is happening is that after a crash the unload event is received by the GeckoViewModule corresponding to the current window which in turn tries to disable all its "modules" (I'm not sure what they are exactly). When disabled one of these modules is thus trying to send a GeckoView:UpdateModuleState message, but the message manager is gone and from what I can tell that corresponds with the browser object of the current window.

In other words after the crash the browser object for the current window is gone but we're still trying to use its message manager to relay messages.

is just not possible, the parent process is crashing so we can't run anything in it.

Gabriele Svelto [:gsvelto]

Assignee

Comment 28

•

5 years ago

Duh, somehow I went back and read the STR in comment 0 this morning and in my mind it read about:crashcontent. I'll re-test tomorrow.

That being said the behavior for content process crashes is also odd but not catastrophic. The page just stays there, unresponsive, until you point the URL at something else.

[ex-Mozilla] Agi Sferro | :agi

Comment 29

•

5 years ago

T(In reply to Gabriele Svelto [:gsvelto] from comment #28)

That being said the behavior for content process crashes is also odd but not catastrophic. The page just stays there, unresponsive, until you point the URL at something else.

That's just because GVE doesn't implement onCrash properly, if you try in Fenix it should show you a "sad tab" that asks you if you want to restore the tab or close it.

Gabriele Svelto [:gsvelto]

Assignee

Comment 30

•

5 years ago

Testing again and no luck. about:crashparent works just fine on all of my phones both in GeckoView and Fenix. So we'll debug this the printf() way. I'll attach a patch with a lot of logging to figure out what's going on and ask to repro for me.

Gabriele Svelto [:gsvelto]

Assignee

Comment 31

•

5 years ago

Attached file [PATCH] Android /proc/meminfo diagnostic (obsolete) — Details

Agi can you reproduce the issue on your phone with this patch applied then attach the output of logcat? The output should look like this:

01-21 10:49:52.533 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() called, here we go
01-21 10:49:52.533 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() opening /proc/meminfo
01-21 10:49:52.533 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() reading /proc/meminfo
01-21 10:49:52.534 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() parsing /proc/meminfo
01-21 10:49:52.534 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() writing MemTotal with value 866865152
01-21 10:49:52.534 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() writing MemFree with value 13053952
01-21 10:49:52.534 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() writing SwapFree with value 267350016
01-21 10:49:52.534 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() parsed /proc/meminfo
01-21 10:49:52.534 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() writing AvailablePageFile with value 0
01-21 10:49:52.534 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() writing TotalPageFile with value 1286291456
01-21 10:49:52.534 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() is done
01-21 10:49:52.540 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() called, here we go
01-21 10:49:52.540 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() opening /proc/meminfo
01-21 10:49:52.540 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() reading /proc/meminfo
01-21 10:49:52.540 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() parsing /proc/meminfo
01-21 10:49:52.540 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() writing MemTotal with value 866865152
01-21 10:49:52.540 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() writing MemFree with value 12632064
01-21 10:49:52.540 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() writing SwapFree with value 267354112
01-21 10:49:52.540 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() parsed /proc/meminfo
01-21 10:49:52.540 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() writing AvailablePageFile with value 0
01-21 10:49:52.540 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() writing TotalPageFile with value 1286291456
01-21 10:49:52.540 21395 21414 D Gecko   : ***** AnnotateMemoryStatus() is done

Could you also attach the contents of /proc/meminfo? I hadn't thought about that before but we're parsing that file, so if the format is somehow different on certain phones maybe the parsing logic gets confused and fails or ends up in a loop.

Flags: needinfo?(gsvelto) → needinfo?(agi)

[ex-Mozilla] Agi Sferro | :agi

Comment 32

•

5 years ago

Man now I can't reproduce with a local build anymore :/ (I can reproduce with a nightly from taskcluster though, not sure why) I'll poke things around.

Flags: needinfo?(agi)

[ex-Mozilla] Agi Sferro | :agi

Comment 33

•

5 years ago

•

Edited

Oh, it just occurred to me that, when I reproduced it, I was using an artifact build, maybe I can't reproduce anymore because local builds don't have the crash reporter, let's see if ac_add_options --enable-crashreporter helps.

[ex-Mozilla] Agi Sferro | :agi

Comment 34

•

5 years ago

(In reply to Agi Sferro | :agi | ni? for questions | ⏰ PST | he/him from comment #33)

Oh, it just occurred to me that, when I reproduced it, I was using an artifact build, maybe I can't reproduce anymore because local builds don't have the crash reporter, let's see if ac_add_options --enable-crashreporter helps.

Nope, now I actually get the crash reporter too :/

[ex-Mozilla] Agi Sferro | :agi

Comment 35

•

5 years ago

I'll push this to try, let's see if I can reproduce with that build.

[ex-Mozilla] Agi Sferro | :agi

Comment 36

•

5 years ago

https://treeherder.mozilla.org/jobs?repo=try&revision=db6feff6f9e24a88badcb0b4a8b1d231bc0115fb

[ex-Mozilla] Agi Sferro | :agi

Comment 37

•

5 years ago

Can't reproduce with a try build either. I wonder why. Maybe something related to MOZILLA_OFFICIAL?

Gabriele Svelto [:gsvelto]

Assignee

Comment 38

•

5 years ago

Agi, Markus, if you happen to reproduce this by chance can you just get me the contents of /proc/meminfo from your phones? I suspect the parsing logic might be the culprit here.

[ex-Mozilla] Agi Sferro | :agi

Updated

•

5 years ago

Flags: needinfo?(agi)

[ex-Mozilla] Agi Sferro | :agi

Comment 39

•

5 years ago

$ cat /proc/meminfo
MemTotal:        3811596 kB
MemFree:          171480 kB
MemAvailable:    1991060 kB
Buffers:          158360 kB
Cached:          1797588 kB
SwapCached:        37136 kB
Active:          1592936 kB
Inactive:        1042672 kB
Active(anon):     342808 kB
Inactive(anon):   339108 kB
Active(file):    1250128 kB
Inactive(file):   703564 kB
Unevictable:         128 kB
Mlocked:             128 kB
RbinTotal:        335872 kB
RbinAlloced:           0 kB
RbinPool:              0 kB
RbinFree:              0 kB
SwapTotal:       2097148 kB
SwapFree:        1669316 kB
Dirty:               144 kB
Writeback:             0 kB
AnonPages:        673924 kB
Mapped:           498148 kB
Shmem:              2256 kB
Slab:             226888 kB
SReclaimable:      89152 kB
SUnreclaim:       137736 kB
KernelStack:       39600 kB
PageTables:        61008 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     4002944 kB
Committed_AS:   117788076 kB
VmallocTotal:   258998208 kB
VmallocUsed:      608376 kB
VmallocChunk:   258296804 kB

Flags: needinfo?(agi) → needinfo?(gsvelto)

[ex-Mozilla] Agi Sferro | :agi

Comment 40

•

5 years ago

Attached file meminfo for Samsung S7 — Details

[ex-Mozilla] Agi Sferro | :agi

Comment 41

•

5 years ago

I just remembered I wanted to try a MOZILLA_OFICIAL build, I'm doing it now.

Dave Hunt [:davehunt] [he/him] ⌚BST

Updated

•

5 years ago

status-firefox87: --- → affected

[ex-Mozilla] Agi Sferro | :agi

Comment 42

•

5 years ago

Yahoo some progress! Building a shippable artifact from try I can finally reproduce, so at least we got something: https://treeherder.mozilla.org/jobs?repo=try&revision=8450e16d1ff3166274d37e2735c843e58ee5d5d2

Gabriele, feel free to submit a try build with whatever debug info you need an ni me and I can give you the output.

Meanwhile, with your patch, this is what I get:

02-02 13:49:48.913 16493 16493 I GeckoViewActivity: Starting to load page at about:crashparent
02-02 13:49:48.913 16493 16493 I GeckoViewActivity: zerdatime 1130678239 - page load start
02-02 13:49:48.922 16493 16493 W IInputConnectionWrapper: getCursorCapsMode on inactive InputConnection
02-02 13:49:50.056 16493 16493 I Choreographer: Skipped 67 frames!  The application may be doing too much work on its main thread.
02-02 13:49:50.056 16493 16493 W IInputConnectionWrapper: getTextAfterCursor on inactive InputConnection
02-02 13:49:50.065  2628  2628 I SKBD    : SamsungKeypad [IMI] onStartInput - caller pid : 16493, caller uid : 10613
02-02 13:49:50.097 16493 16510 D Gecko   : *****  WriteAnnotationsForMainProcessCrash()

Gabriele Svelto [:gsvelto]

Assignee

Comment 43

•

5 years ago

I have tested the parsing code more extensively and it's not part of the problem because it works correctly with the data in comment 39. Since there seems to be a difference between different devices one thing that comes to mind is SELinux. I wonder if Samsung devices have a different setup and maybe that's blocking our attempt at reading /proc/meminfo? I'll change the error reporting slightly and see if we can do better.

Flags: needinfo?(gsvelto)

Gabriele Svelto [:gsvelto]

Assignee

Comment 44

•

5 years ago

Agi I slightly changed the logic that reads /proc/meminfo so that it does not assume that the file can be read entirely in only one syscall. Can you try it out? I've launched a try run here.

Also, if the issue still happens could you provide me the full contents of logcat? The best would be to run adb logcat -c before reproducing and then the full contents of adb logcat.

Flags: needinfo?(agi)

[ex-Mozilla] Agi Sferro | :agi

Comment 45

•

5 years ago

Attached file log.txt — Details

I can still reproduce. This is the full logcat from uninstalling geckoview_example, to reproducing the problem.

Flags: needinfo?(agi)

[ex-Mozilla] Agi Sferro | :agi

Updated

•

5 years ago

Flags: needinfo?(gsvelto)

[ex-Mozilla] Agi Sferro | :agi

Comment 46

•

5 years ago

•

Edited

I don't know if it helps, but I attached a debugger when the app hangs and this is the stacktrace for the Gecko thread when it's stuck

  thread #13, name = 'Gecko'
    frame #0: 0xe6ffe178 libc.so`syscall + 28
    frame #1: 0xe702e618 libc.so`__pthread_mutex_lock_with_timeout(pthread_mutex_internal_t*, bool, timespec const*) + 172
    frame #2: 0xc1139d74 libxul.so`google_breakpad::ExceptionHandler::SignalHandler(sig=<unavailable>, info=<unavailable>, uc=<unavailable>) at exception_handler.cc:360
    frame #3: 0xc1141478 libxul.so`nsProfileLock::FatalSignalHandler(signo=<unavailable>, info=<unavailable>, context=<unavailable>) at nsProfileLock.cpp:173
    frame #4: 0xafe363ec app_process32
    frame #5: 0xe6ffde1c libc.so
    frame #6: 0xc113615c libxul.so`CrashReporter::AnnotateOOMAllocationSize(size=<unavailable>) at nsExceptionHandler.cpp:0

[ex-Mozilla] Agi Sferro | :agi

Comment 47

•

5 years ago

Oh so further messing with the debugger, it appears that we get a SIGSEGV here:

sys_open(char const*, int, int) 0x00000000c137a15c
CrashReporter::AnnotateMemoryStatus(CrashReporter::AnnotationWriter&) 0x00000000c137cbd4
CrashReporter::WriteSynthesizedAnnotations(CrashReporter::AnnotationWriter&) 0x00000000c137cbbe
CrashReporter::WriteAnnotationsForMainProcessCrash(CrashReporter::PlatformWriter&, mozilla::phc::AddrInfo const*, long) 0x00000000c137a71c
CrashReporter::WriteCrashEventFile(long, char const*, mozilla::phc::AddrInfo const*, google_breakpad::MinidumpDescriptor const&) 0x00000000c137a374
CrashReporter::MinidumpCallback(google_breakpad::MinidumpDescriptor const&, void*, mozilla::phc::AddrInfo const*, bool) 0x00000000c137a2ce
google_breakpad::ExceptionHandler::GenerateDump(google_breakpad::ExceptionHandler::CrashContext*, mozilla::phc::AddrInfo const*) 0x00000000c137eae6
google_breakpad::ExceptionHandler::HandleSignal(int, siginfo*, void*) 0x00000000c137dfa0
google_breakpad::ExceptionHandler::SignalHandler(int, siginfo*, void*) 0x00000000c137de3e
nsProfileLock::FatalSignalHandler(int, siginfo*, void*) 0x00000000c1385478
WasmTrapHandler(int, siginfo*, void*) 0x00000000c1696626
<unknown> 0x00000000afe363ec
<unknown> 0x00000000e6ffde1c
CrashChannel::OpenContentStream(bool, nsIInputStream**, nsIChannel**) 0x00000000c125b0b6
nsBaseChannel::BeginPumpingData() 0x00000000c2180666
nsBaseChannel::AsyncOpen(nsIStreamListener*) 0x00000000c2180606
mozilla::net::DocumentLoadListener::Open(nsDocShellLoadState*, mozilla::net::LoadInfo*, unsigned int, unsigned int, mozilla::Maybe<unsigned long long> const&, mozilla::TimeStamp const&, nsDOMNavigationTiming*, mozilla::Maybe<mozilla::dom::ClientInfo>&&, bool, int, nsresult*) 0x00000000c1f161a2
mozilla::net::DocumentLoadListener::OpenDocument(nsDocShellLoadState*, unsigned int, mozilla::Maybe<unsigned long long> const&, mozilla::TimeStamp const&, nsDOMNavigationTiming*, mozilla::Maybe<mozilla::dom::ClientInfo>&&, mozilla::Maybe<bool>, mozilla::Maybe<bool>, int, nsresult*) 0x00000000c21de65c
mozilla::net::DocumentChannelParent::Init(mozilla::dom::CanonicalBrowsingContext*, mozilla::net::DocumentChannelCreationArgs const&) 0x00000000c1f16f20
mozilla::net::NeckoParent::RecvPDocumentChannelConstructor(mozilla::net::PDocumentChannelParent*, mozilla::dom::MaybeDiscarded<mozilla::dom::BrowsingContext> const&, mozilla::net::DocumentChannelCreationArgs const&) 0x00000000c1f16e48
mozilla::net::PNeckoParent::OnMessageReceived(IPC::Message const&) 0x00000000c22166f6
mozilla::dom::PContentParent::OnMessageReceived(IPC::Message const&) 0x00000000c1f22abc
mozilla::ipc::MessageChannel::DispatchAsyncMessage(mozilla::ipc::ActorLifecycleProxy*, IPC::Message const&) 0x00000000c1f1ea38
mozilla::ipc::MessageChannel::DispatchMessage(IPC::Message&&) 0x00000000c1f1e53c
mozilla::ipc::MessageChannel::MessageTask::Run() 0x00000000c1f1e8f2
mozilla::RunnableTask::Run() 0x00000000c1eefbe4
mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) 0x00000000c1eef22c
mozilla::TaskController::ExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) 0x00000000c1eeec6a
mozilla::TaskController::ProcessPendingMTTask(bool) 0x00000000c1eeed1e
mozilla::TaskController::InitializeInternal()::$_3::operator()() const 0x00000000c1ef030c
mozilla::detail::RunnableFunction<mozilla::TaskController::InitializeInternal()::$_3>::Run() 0x00000000c1ef0300
nsThread::ProcessNextEvent(bool, bool*) 0x00000000c1ef2d76
NS_ProcessNextEvent(nsIThread*, bool) 0x00000000c1ef38dc
mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) 0x00000000c1f1ee58
MessageLoop::RunInternal() 0x00000000c21e7c6c
MessageLoop::RunHandler() 0x00000000c21e7c64
MessageLoop::Run() 0x00000000c21e7c64
nsBaseAppShell::Run() 0x00000000c0c6ff44
nsAppStartup::Run() 0x00000000c1349166
XREMain::XRE_mainRun() 0x00000000c23f60a0
XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) 0x00000000c23f5d0e
XRE_main(int, char**, mozilla::BootstrapConfig const&) 0x00000000c138b43a
::GeckoStart(JNIEnv *, char **, int, const mozilla::StaticXREAppData &) 0x00000000c138c490
::Java_org_mozilla_gecko_mozglue_GeckoLoader_nativeRun(JNIEnv *, jclass, jobjectArray, int, int, int, int, int) 0x00000000c4f12936
<unknown> 0x00000000c5c703c2

so maybe the issue is that we're crashing while handling a crash causing a deadlock?

Gabriele Svelto [:gsvelto]

Assignee

Comment 48

•

5 years ago

Indeed! We don't support recursive crash handling so we're deadlocking. I wonder however how could that happen? The only call to open() we're making is here and there's no way that can segfault.

Unless... we're allocating the buffer object on the stack before the call, and we're in the alternative signal stack because this is a main process crash so we're handling it from within the crashed process. I remember that the alt sig stack is 16KiB so maybe we've blown past it and it's causing a segfault when pushing stuff on the stack later? If my hypothesis is right cutting down BUFFER_SYZE_BYTES to say 4KiB should solve the problem. If that doesn't help then I'm afraid the only thing we can do is disable this particular code path on Android until we get full out-of-process crash reporting.

Flags: needinfo?(gsvelto)

[ex-Mozilla] Agi Sferro | :agi

Comment 49

•

5 years ago

I remember that the alt sig stack is 16KiB so maybe we've blown past it and it's causing a segfault when pushing stuff on the stack later? If my hypothesis is right cutting down BUFFER_SYZE_BYTES to say 4KiB should solve the problem

Sent a try build with this change here https://treeherder.mozilla.org/#/jobs?repo=try&revision=38b17cc856fc9b17abdf8c90819ab281e798741b

[ex-Mozilla] Agi Sferro | :agi

Comment 50

•

5 years ago

(In reply to Agi Sferro | :agi | ni? for questions | ⏰ PST | he/him from comment #49)

I remember that the alt sig stack is 16KiB so maybe we've blown past it and it's causing a segfault when pushing stuff on the stack later? If my hypothesis is right cutting down BUFFER_SYZE_BYTES to say 4KiB should solve the problem

Sent a try build with this change here https://treeherder.mozilla.org/#/jobs?repo=try&revision=38b17cc856fc9b17abdf8c90819ab281e798741b

This seems to work! :) Gabriele do you want to submit a patch for that?

Flags: needinfo?(gsvelto)

Gabriele Svelto [:gsvelto]

Assignee

Comment 51

•

5 years ago

Yes, patch coming. Thinking back this was an oversight during review and yet another reason to go with full OOP crash generation ASAP. Doing stuff in an exception handler is hard and error-prone.

Flags: needinfo?(gsvelto)

Gabriele Svelto [:gsvelto]

Assignee

Comment 52

•

5 years ago

Attached file Bug 1681842 - Reduce the size of the bufer used for parsing /proc/meminfo and handle error conditions more gracefully r=KrisWright — Details

When the main process crashes /proc/meminfo is being read from within an
exception handler with an alternate signal stack of only 16KiB. The buffer we
used was 10KB which might cause overflows on certain devices. When it happened
a new exception would be triggered causing a deadlock in the exception handler
which does not support recursive exception handling.

Gabriele Svelto [:gsvelto]

Assignee

Updated

•

5 years ago

Attachment #9198425 - Attachment is obsolete: true

Julien Cristau [:jcristau]

Updated

•

5 years ago

status-firefox85: affected → wontfix

Pulsebot

Comment 53

•

5 years ago

Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/6c32d769ff9a Reduce the size of the bufer used for parsing /proc/meminfo and handle error conditions more gracefully r=KrisWright

Cosmin Sabou [:CosminS]

Comment 54

•

5 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/6c32d769ff9a

Status: ASSIGNED → RESOLVED

Closed: 5 years ago

status-firefox87: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → 87 Branch

BugBot [:suhaib / :marco/ :calixte]

Comment 55

•

5 years ago

The patch landed in nightly and beta is affected.
:gsvelto, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(gsvelto)

Gabriele Svelto [:gsvelto]

Assignee

Comment 56

•

5 years ago

Comment on attachment 9202818 [details]
Bug 1681842 - Reduce the size of the bufer used for parsing /proc/meminfo and handle error conditions more gracefully r=KrisWright

Beta/Release Uplift Approval Request

User impact if declined: The browser may remain in an unresponsive state after a main process crash, preventing the user from restarting it
Is this code covered by automated tests?: No
Has the fix been verified in Nightly?: Yes
Needs manual test from QE?: No
If yes, steps to reproduce:
List of other uplifts needed: None
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): This patch reduces the size of a buffer which was causing a stack overflow
String changes made/needed: none

Flags: needinfo?(gsvelto)

Attachment #9202818 - Flags: approval-mozilla-beta?

Julien Cristau [:jcristau]

Updated

•

5 years ago

Attachment #9202818 - Flags: approval-mozilla-beta? → approval-mozilla-release?

Pascal Chevrel:pascalc

Comment 57

•

5 years ago

Comment on attachment 9202818 [details]
Bug 1681842 - Reduce the size of the bufer used for parsing /proc/meminfo and handle error conditions more gracefully r=KrisWright

Taking for Fenix 86, Thanks

Attachment #9202818 - Flags: approval-mozilla-release? → approval-mozilla-release+

Ryan VanderMeulen [:RyanVM]

Comment 58

•

5 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-release/rev/a76ee22722f8

status-firefox86: affected → fixed

[PATCH] Android /proc/meminfo diagnostic 5 years ago Gabriele Svelto [:gsvelto] 6.68 KB, text/plain		Details
meminfo for Samsung S7 5 years ago [ex-Mozilla] Agi Sferro \| :agi 1.04 KB, application/octet-stream		Details
log.txt 5 years ago [ex-Mozilla] Agi Sferro \| :agi 750.44 KB, text/plain		Details
Bug 1681842 - Reduce the size of the bufer used for parsing /proc/meminfo and handle error conditions more gracefully r=KrisWright 5 years ago Gabriele Svelto [:gsvelto] 48 bytes, text/x-phabricator-request	pascalc : approval-mozilla-release+	Details \| Review