Closed Bug 1450105 Opened 6 years ago Closed 5 years ago

[meta-ish] Unexplained crashes in release: 59.*

Categories

(Firefox for Android Graveyard :: General, defect, P3)

All
Android
defect

Tracking

(firefox59+ wontfix, firefox60- wontfix)

RESOLVED FIXED
Tracking Status
firefox59 + wontfix
firefox60 - wontfix

People

(Reporter: mcomella, Assigned: sdaswani)

References

Details

There appear to be a series of unexplainable top crashers in 59.* (release). It started with bug 1444549: a crash occurred despite the files not getting changed in months and the code appearing to work. Then I saw bug 1449695, which appears to be in a similar situation of reading resources and crashing when they aren't in the expected format. There's also bug 1449177 and bug 1449683, which are other impossible crashes but for a ClassCastException.

We haven't been changing the code much so I think something else might be the problem, hence this bug.
Here's what we know:
- bug 1449177, bug 1449683, and bug 1449695 only crash in 59.0.2.
- bug 1444549 only crashed in 59.0 and 59.0.1 (but we fixed it for 59.0.2 by removing the unused check).
- All these crashes deal with XML in some way (inflating layouts or accessing resources)
- They all affect API 24+ (Android N+)
- Affects a wide range of phones
- Most crashes occur at < 1m of uptime (e.g. the first time these views/resources are inflated?)

Here's how we're speculating:
- nalexander noticed one of the stacktraces [1] shows, "Xposed" which is an app that modifies the Android system in order for users to make customizations. This was only noticeable in *one* crash but perhaps it's related.
- Snorp has mentioned there has been APK corruption that has been hitting at random [2] I don't have anymore details.
- I wonder if someone has forked our code base, changed the code, but has been using our crash reporter?
- In 58, we landed the build-with-gradle-only changes. In 59, we've made additional changes, including moving to the Android gradle plugin v3.0. However, nothing stands out to Nick as a problem that'd cause this (here's our IRC logs [3]).

NI Susheel: FYI that there are crashes that shouldn't be crashing on release.

[1]: https://crash-stats.mozilla.com/report/index/0673a189-5a2b-43dc-b9c7-4402c0180328
[2]: https://bugzilla.mozilla.org/show_bug.cgi?id=1444549#c10
[3]: https://mozilla.logbot.info/mobile/20180329#c14538693
Flags: needinfo?(sdaswani)
More what we know:
- bug 1449695 uses an unconventional XML format, `<item type="dimen"` rather than `<dimen` and so does bug 1449683, which uses, `<view class="org.mozilla...BrowserSearch$HomeSearchListView`, probably because you can't declare inner classes using the standard notation.
oneplus devices seem to be over-represented in those particular crash signatures - they account for 30% of these reports while overall this manufacturer is just showing up in ~2% of all fennec crashes
I looked into a bunch of these.  It looks like we're getting garbage back from the system from `context.obtainStyledResources(...)` in a bunch of places.  I can't explain that, although there have been issues with that function and unsorted attrs in the past (https://stackoverflow.com/questions/19034597/get-multiple-style-attributes-with-obtainstyledattributes) that shouldn't be impacting us, but could be.
(In reply to [:philipp] from comment #3)
> oneplus devices seem to be over-represented in those particular crash
> signatures - they account for 30% of these reports while overall this
> manufacturer is just showing up in ~2% of all fennec crashes

A gentle counter -- I also see a lot of Pixel and Pixel 2 devices, which are pretty mainstream.  (Right?)

Two stabs in the dark:

1) maybe we need to be on the aapt2 path in order to work on newer devices
2) maybe we need to finally target a newer SDK in order to work on newer devices
Ritu, I am NI'ing you to put this on your radar. I don't know if this is a 'platform' or 'front end' issue, but I wanted to get your opinion on the urgency of this. Also feel free to NI someone else if you are not the right person (and sorry!).
Flags: needinfo?(sdaswani) → needinfo?(rkothari)
I can track this for 59 - for now, looks like the right people are investigating.
Thanks for the NI Susheel. FYI to our Fennec stability triagers. Is this noticeable in 60 as well or just a 59 trend?
Flags: needinfo?(rkothari)
Flags: needinfo?(mozillamarcia.knous)
Flags: needinfo?(madperson)
Flags: needinfo?(kbrosnan)
(In reply to Ritu Kothari (:ritu) from comment #8)
> Thanks for the NI Susheel. FYI to our Fennec stability triagers. Is this
> noticeable in 60 as well or just a 59 trend?

As far as I can see it only affects 59.
Flags: needinfo?(mozillamarcia.knous)
Someone needs to be responsible for this unless we deliberately decide it's low priority: assigning Susheel. Please find an active developer to solve this or determine it doesn't need one. Thanks!
Assignee: nobody → sdaswani
I'm waiting to hear from the Fennec stability trigger folks Mike.
the signature from bug 1449569 is also present in 60.0b and 61.0a1.

in case bug 1444549 belongs to the same crash pattern, it wasn't observable during the 59 beta cycle at all and just started appearing once we pushed the build to the release audience (perhaps we just done't have enough affected devices on the beta channel)
Flags: needinfo?(madperson)
I'm going to see if the Softvision devs have any luck reproducing this or have any ideas about causes.

Vlad, can one of you three take a look into this ASAP and let us know if you find any issues?
Flags: needinfo?(vlad.baicu)
We've noticed a similarity among these bugs which points to wrong references from R.java.

bug 1449177 - pageIconLayout findViewById finds a FrameLayout and crashes with ClassCastException, where the FrameLayout retrieved is the upper view in hierarchy - https://dxr.mozilla.org/mozilla-central/source/mobile/android/app/src/main/res/layout/activity_stream_webpage_item_row.xml#15

bug 1449683 - mList ClassCastException, should be HomeListView but finds ViewStub, where ViewStub is the previous View in the xml - https://dxr.mozilla.org/mozilla-release/source/mobile/android/app/src/photon/res/layout/browser_search.xml#12

I suspect the same happens for the other 2 bugs whenever they are attempting to access resources.

It's as if the references are offset due to some reason. This points to some aapt resource mapping issue. We're looking further into this, any ideas as to what might be causing this ? 

Also, aapt2 is disabled in gradle.properties for Fennec, maybe we could try enabling it and run some tests with it ?
Flags: needinfo?(vlad.baicu) → needinfo?(michael.l.comella)
(In reply to Vlad Baicu from comment #14)
> We've noticed a similarity among these bugs which points to wrong references
> from R.java.
> 
> bug 1449177 - pageIconLayout findViewById finds a FrameLayout and crashes
> with ClassCastException, where the FrameLayout retrieved is the upper view
> in hierarchy -
> https://dxr.mozilla.org/mozilla-central/source/mobile/android/app/src/main/
> res/layout/activity_stream_webpage_item_row.xml#15
> 
> bug 1449683 - mList ClassCastException, should be HomeListView but finds
> ViewStub, where ViewStub is the previous View in the xml -
> https://dxr.mozilla.org/mozilla-release/source/mobile/android/app/src/photon/
> res/layout/browser_search.xml#12

Ooh, thanks -- this is really interesting.

> I suspect the same happens for the other 2 bugs whenever they are attempting
> to access resources.
> 
> It's as if the references are offset due to some reason. This points to some
> aapt resource mapping issue. We're looking further into this, any ideas as
> to what might be causing this ? 

I dug really deep into the tickets that specifically had to do with `obtainStyledAttributes`, and conclude that the Android system call is returning garbage.  If there is an offset issue, for any reason, that could explain it.  One of those tickets has a fascinating special value that might let us form a theory about the offset -- it's https://bugzilla.mozilla.org/show_bug.cgi?id=1444549.

> Also, aapt2 is disabled in gradle.properties for Fennec, maybe we could try
> enabling it and run some tests with it ?

I think we should do this, yes, although it will take a very long time to get any changes like this up to release.  I'll file and look into fixing the tests that are broken.

For the record, when I've been testing these things, I've been using apktool to disassemble shipping APKs and then inverting the map back to the relevant source code in mozilla-release by hand, so resource corruption should be visible.
Nick, Vlad - just want to see if there are next steps here and who is taking them?
Flags: needinfo?(vlad.baicu)
Flags: needinfo?(nalexander)
In my opinion it still not clear what exactly is causing this issue. Following along the lines of the discussion, besides further investigation I can propose to: 

(In reply to Nick Alexander :nalexander from comment #15)
> I dug really deep into the tickets that specifically had to do with
> `obtainStyledAttributes`, and conclude that the Android system call is
> returning garbage.  If there is an offset issue, for any reason, that could
> explain it.

- according to these findings maybe we can file a bug to the android issue tracker and have the google guys take a look at this issue as well ? Even if it's not on their end maybe they encountered something similar and can share some knowledge in order to speed up the process 

> I think we should do this, yes, although it will take a very long time to
> get any changes like this up to release.  I'll file and look into fixing the
> tests that are broken.

- upgrade to aapt2 - I have managed to enable it in gradle and no build exception gets thrown, but according to the docs we should check for the behavior changes https://developer.android.com/studio/build/gradle-plugin-3-0-0-migration.html#aapt2 and run tests
Flags: needinfo?(vlad.baicu)
(In reply to Vlad Baicu from comment #17)
> In my opinion it still not clear what exactly is causing this issue.
> Following along the lines of the discussion, besides further investigation I
> can propose to: 
> 
> (In reply to Nick Alexander :nalexander from comment #15)
> > I dug really deep into the tickets that specifically had to do with
> > `obtainStyledAttributes`, and conclude that the Android system call is
> > returning garbage.  If there is an offset issue, for any reason, that could
> > explain it.
> 
> - according to these findings maybe we can file a bug to the android issue
> tracker and have the google guys take a look at this issue as well ? Even if
> it's not on their end maybe they encountered something similar and can share
> some knowledge in order to speed up the process 
> 
> > I think we should do this, yes, although it will take a very long time to
> > get any changes like this up to release.  I'll file and look into fixing the
> > tests that are broken.
> 
> - upgrade to aapt2 - I have managed to enable it in gradle and no build
> exception gets thrown, but according to the docs we should check for the
> behavior changes
> https://developer.android.com/studio/build/gradle-plugin-3-0-0-migration.
> html#aapt2 and run tests

I think filing a bug in the Google tracker might be a good idea. Is someone going to file one?
(In reply to Marcia Knous [:marcia - needinfo? me] from comment #18)
> (In reply to Vlad Baicu from comment #17)
> > In my opinion it still not clear what exactly is causing this issue.
> > Following along the lines of the discussion, besides further investigation I
> > can propose to: 
> > 
> > (In reply to Nick Alexander :nalexander from comment #15)
> > > I dug really deep into the tickets that specifically had to do with
> > > `obtainStyledAttributes`, and conclude that the Android system call is
> > > returning garbage.  If there is an offset issue, for any reason, that could
> > > explain it.
> > 
> > - according to these findings maybe we can file a bug to the android issue
> > tracker and have the google guys take a look at this issue as well ? Even if
> > it's not on their end maybe they encountered something similar and can share
> > some knowledge in order to speed up the process 
> > 
> > > I think we should do this, yes, although it will take a very long time to
> > > get any changes like this up to release.  I'll file and look into fixing the
> > > tests that are broken.
> > 
> > - upgrade to aapt2 - I have managed to enable it in gradle and no build
> > exception gets thrown, but according to the docs we should check for the
> > behavior changes
> > https://developer.android.com/studio/build/gradle-plugin-3-0-0-migration.
> > html#aapt2 and run tests
> 
> I think filing a bug in the Google tracker might be a good idea. Is someone
> going to file one?

Vlad, can you please file one and post it here?
Flags: needinfo?(vlad.baicu)
> It's as if the references are offset due to some reason. This points to some aapt resource mapping issue.
> We're looking further into this, any ideas as to what might be causing this ? 

Unfortunately, I do not know.

> Also, aapt2 is disabled in gradle.properties for Fennec, maybe we could try enabling it and run some tests with it ?

nalexander addressed this.
Flags: needinfo?(michael.l.comella)
(In reply to Michael Comella (:mcomella) [needinfo or I won't see it] from comment #20)
> > It's as if the references are offset due to some reason. This points to some aapt resource mapping issue.
> > We're looking further into this, any ideas as to what might be causing this ? 
> 
> Unfortunately, I do not know.

I have spent all morning trying to divine what offsets or shifts we could be seeing.  My pet theory is that we're seeing an upgrade issue, where some Android cache is stale, and we're getting resources values or indices from an earlier APK version.  However, trying to do some numerology on the 58.0.2 and 59.0 APKs doesn't suggest anything.

For those following along at home, I've been really trying to dig into https://bugzilla.mozilla.org/show_bug.cgi?id=1444549, because that special value is just so juicy.  However, it makes no sense; the special value @2131427964 means that we're trying to parse a TYPE_REFERENCE resource as an integer.  That value is 0x7f0b027c, which corresponds to (for 58.0.2) in R$style.smali

.field public static final Widget_TopSitesThumbnailView:I = 0x7f0b027c

and (for 59.0) in R$style.smali

.field public static final Widget_TwoLinePageRow:I = 0x7f0b027c

I just can't understand how we could be getting a TypedArray back with a valid resource ID in slot 0 and a _reference_ to a style value in slot 1.  Now, TYPE_REFERENCE is itself 1, so if we had some offset-by-1 bug was mangling the internal resource data on these devices, that could explain something; but if that was true, we'd see massive corruption all over the place, not isolated incidents like this.

> > Also, aapt2 is disabled in gradle.properties for Fennec, maybe we could try enabling it and run some tests with it ?
> 
> nalexander addressed this.

Well, I can try to address this.
Flags: needinfo?(nalexander)
I have filed the bug on Google's issue tracker

https://issuetracker.google.com/issues/77608213
Flags: needinfo?(vlad.baicu)
I have a theory that we are seeing some sort of Android upgrade bug,
where the code and the resources are mismatched across upgrades,
leading to the errors that we see.  I have witnessed invalidation
issues of this sort in the past on older Android versions; in
particular, launchers and settings Apps haven't handled resource
changes gracefully when a package is upgraded while the management App
is running.  I have a few pieces of evidence supporting this theory:

1) the APKs appear to be well-formed.  I have been working more or
   less entirely from decompiled APKs, using both apktool and
   androguard.  These tools extract resources and code from APKs in
   similar ways, but are two entirely different codebases.  They agree
   on what is in the APKs and reveal no mismatches or errors.

2) the crashes "move around" based on the differences between version
   N and version N+1, in ways that I think are explained by this
   theory.  That is, a new version (that impacts resources) addresses
   one crash but introduces another, and I think the data explains
   these when we see them.  In particular, the crashes appear to be
   both off-by-one _and_ off-by-two errors, and the off-by-two
   situation is explained by the same mechanism as the off-by-one
   situation.

3) By manually adjusting certain resource values in the way that the
   theory posits is happening, I can replicate the details of
   particular crashes locally.  In particular, the theory predicts
   exactly what special value we should see in the circumstance that
   we see a special value, and the theory agrees with the special
   value that we are seeing in the wild.

There are four relevant versions of Firefox for Android:
- 58.0.2
- 59.0
- 59.0.1
- 59.0.2

The resource sets change significantly between 58.0.2 and 59.0; they
_do not change_ between 59.0 and 59.0.1 (the hash of resources.arsc in
the APK files is identical); and they change very slightly between
59.0.1 and 59.0.2 (with the landing of
https://bugzilla.mozilla.org/show_bug.cgi?id=1444549).

There are four crashes that seem relevant:
- https://bugzilla.mozilla.org/show_bug.cgi?id=1444549

java.lang.NumberFormatException: For input string: "@2131427964"
	at java.lang.Integer.parseInt(Integer.java:608)
	at com.android.internal.util.XmlUtils.convertValueToInt(XmlUtils.java:133)
	at android.content.res.TypedArray.getInt(TypedArray.java:375)
	at org.mozilla.gecko.widget.IconTabWidget.<init>(IconTabWidget.java:33)

This is the crash that is most interesting.  It's the "original"
issue, in that it bridges 58.0.2 and 59.0, and it has a fascinating
"special value", namely "@2131427964".  That means that the `getInt`
call is trying to convert a TYPE_REFERENCE resource into a integer;
for reference 2131427964 in hex is 0x7f0b027c.

The offending code is at
https://hg.mozilla.org/releases/mozilla-release/file/FIREFOX_59_0_RELEASE/mobile/android/base/java/org/mozilla/gecko/widget/IconTabWidget.java#l33
and the relevant lines are

        TypedArray a = context.obtainStyledAttributes(attrs, R.styleable.IconTabWidget);
        mButtonLayoutId = a.getResourceId(R.styleable.IconTabWidget_android_layout, 0);
        mIsIcon = (a.getInt(R.styleable.IconTabWidget_display, 0x00) == 0x00);

What we see is that the TypedArray is coming back from Android as some
array of the form [?, (TYPE_REFERENCE, 0x7f0b027c)].  More on that shortly.

There are also three new crashes in 59.0.2 (i.e., that will be visible
at least when upgrading from 59.0/59.0.1 to 59.0.2):

- https://bugzilla.mozilla.org/show_bug.cgi?id=1449695

java.lang.UnsupportedOperationException: Can't convert value at index 2 to dimension: type=0x3
	at android.content.res.TypedArray.getDimensionPixelSize(TypedArray.java:730)
	at org.mozilla.gecko.home.TabMenuStripLayout.<init>(TabMenuStripLayout.java:65)
	at org.mozilla.gecko.home.TabMenuStrip.<init>(TabMenuStrip.java:59)

This crash has a "somewhat interesting" special value, or at least a
restriction on the range of values that we can see.

The offending code is at
https://hg.mozilla.org/releases/mozilla-release/file/FIREFOX_59_0_2_RELEASE/mobile/android/base/java/org/mozilla/gecko/home/TabMenuStripLayout.java#l65
and the relevant lines are

        final TypedArray a = context.obtainStyledAttributes(attrs, R.styleable.TabMenuStrip);
        final int stripResId = a.getResourceId(R.styleable.TabMenuStrip_strip, -1);

        titlebarFill = a.getBoolean(R.styleable.TabMenuStrip_titlebarFill, false);
        tabContentStart = a.getDimensionPixelSize(R.styleable.TabMenuStrip_tabsMarginLeft, 0);

Again, we see a TypedArray coming back from Android as some array of
the form [?, ?, (TYPE_STRING, ?), ...] (0x3 is TYPE_STRING).

- https://bugzilla.mozilla.org/show_bug.cgi?id=1449683

java.lang.ClassCastException: android.view.ViewStub cannot be cast to org.mozilla.gecko.home.HomeListView
	at org.mozilla.gecko.home.BrowserSearch.onCreateView(BrowserSearch.java:303)

This crash also has a mild restriction on the resource mismatch that
we might see.

The offending code is at
https://hg.mozilla.org/releases/mozilla-release/file/FIREFOX_59_0_2_RELEASE/mobile/android/base/java/org/mozilla/gecko/home/BrowserSearch.java#l303
and the relevant lines are

        mView = (LinearLayout) inflater.inflate(R.layout.browser_search, container, false);
        mList = (HomeListView) mView.findViewById(R.id.home_list_view);

What we witness is that the `findViewById(R.id.home_list_view)`
produces a `ViewStub`.

- https://bugzilla.mozilla.org/show_bug.cgi?id=1449177

java.lang.ClassCastException: android.widget.FrameLayout cannot be cast to org.mozilla.gecko.activitystream.homepanel.stream.StreamOverridablePageIconLayout
	at org.mozilla.gecko.activitystream.homepanel.stream.WebpageItemRow.<init>(WebpageItemRow.java:50)

The offending code is at
https://hg.mozilla.org/releases/mozilla-release/file/FIREFOX_59_0_2_RELEASE/mobile/android/base/java/org/mozilla/gecko/activitystream/homepanel/stream/WebpageItemRow.java#l50
and the relevant line is

        pageIconLayout = (StreamOverridablePageIconLayout) itemView.findViewById(R.id.page_icon);

Again, what we witness is that the `findViewById(R.id.page_icon)
produces a `FrameLayout`.

Now, some details from the APK decompilations:

R.styleable.IconTabWidget
[[file:~/Mozilla/apks/fennec-58.0.2.en-US.android-arm/smali/org/mozilla/gecko/R$styleable.smali::.line%209980][58.0.2]]
    .line 9980
    :array_19
    .array-data 4
        0x10100f2
        0x7f010100
    .end array-data
[[file:~/Mozilla/apks/fennec-59.0.multi.android-arm/smali/org/mozilla/gecko/R$styleable.smali::.line%209893][59.0]]
    .line 9893
    :array_18
    .array-data 4
        0x10100f2
        0x7f0100fd
    .end array-data

R.styleable.IconTabWidget_display
[[file:~/Mozilla/apks/fennec-58.0.2.en-US.android-arm/smali/org/mozilla/gecko/R$styleable.smali::.field%20public%20static%20final%20IconTabWidget_display:I%20=%200x1][58.0.2]]
.field public static final IconTabWidget_display:I = 0x1

[[file:~/Mozilla/apks/fennec-59.0.multi.android-arm/smali/org/mozilla/gecko/R$styleable.smali::.field%20public%20static%20final%20IconTabWidget_display:I%20=%200x1][59.0]]
.field public static final IconTabWidget_display:I = 0x1

Suppose that the 58.0.2 resources were somehow being used instead of
the 59.0 resources.  Then

        TypedArray a = context.obtainStyledAttributes(attrs, R.styleable.IconTabWidget);

would fetch `int[] {0x10100f2, 0x7f0100fd}` from the 58.0.2
resources.  The 0x10100f2 is Android-internal and constant across
Firefox versions, but what does 0x7f0100fd end up in the 58.0.2
resources?

[[file:~/Mozilla/apks/fennec-58.0.2.multi.android-arm/smali/org/mozilla/gecko/R$attr.smali::.field%20public%20static%20final%20topSitesThumbnailViewStyle:I%20=%200x7f0100fd][58.0.2]]
.field public static final topSitesThumbnailViewStyle:I = 0x7f0100fd

[[file:~/Mozilla/apks/fennec-58.0.2.multi.android-arm/res/values/styles.xml::<item%20name="topSitesThumbnailViewStyle">@style/Widget.TopSitesThumbnailView</item>][58.0.2]]
        <item name="topSitesThumbnailViewStyle">@style/Widget.TopSitesThumbnailView</item>

[[file:~/Mozilla/apks/fennec-58.0.2.multi.android-arm/res/values/public.xml::<public%20type="style"%20name="Widget.TopSitesThumbnailView"%20id="0x7f0b027c"%20/>][58.0.2]]
    <public type="style" name="Widget.TopSitesThumbnailView" id="0x7f0b027c" />

And what do we see?  A reference to a 0x7f0b027c, which is _exactly_
our special value!  Somehow, some way we're seeing an off-by-two error
that yields exactly the error that we witnessed in the wild when
upgrading 58.0.2 (and possibly 58.0.1 and 58.0, I haven't checked) to 59.0.

Let's look at the next one,
https://bugzilla.mozilla.org/show_bug.cgi?id=1449695.

R.styleable.TabMenuStrip
[[file:~/Mozilla/apks/fennec-59.0.1.multi.android-arm/smali/org/mozilla/gecko/R$styleable.smali::.line%2012150][59.0/59.0.1]]
    .line 12150
    :array_30
    .array-data 4
        0x7f010151
        0x7f010152
        0x7f010153
        0x7f010154
        0x7f010155
        0x7f010156
    .end array-data

[[file:~/Mozilla/apks/fennec-59.0.2.multi.android-arm/smali/org/mozilla/gecko/R$styleable.smali::.line%2012120][59.0.2]]
    .line 12120
    :array_2f
    .array-data 4
        0x7f010150
        0x7f010151
        0x7f010152
        0x7f010153
        0x7f010154
        0x7f010155
    .end array-data

Observe that R.styleable.TabMenuStrip_* are constant across all
versions, and in particular that
[[file:~/Mozilla/apks/fennec-59.0.2.multi.android-arm/smali/org/mozilla/gecko/R$styleable.smali::.field%20public%20static%20final%20TabMenuStrip_tabsMarginLeft:I%20=%200x2][59.0.2]]
.field public static final TabMenuStrip_tabsMarginLeft:I = 0x2

Now, what if we follow 0x7f010152 (the second index from 59.0.2) in
the resources from 59.0/59.0.1?

[[file:~/Mozilla/apks/fennec-59.0.1.multi.android-arm/smali/org/mozilla/gecko/R$attr.smali::.field%20public%20static%20final%20stripColor:I%20=%200x7f010152][59.0.1]]
.field public static final stripColor:I = 0x7f010152

[[file:~/Mozilla/apks/fennec-59.0.1.multi.android-arm/res/values/attrs.xml::<attr%20name="stripColor"%20format="color"%20/>][59.0.1]]
    <attr name="stripColor" format="color" />

and in particular, we find that

<org.mozilla.gecko.home.TabMenuStrip android:layout_gravity="top" android:background="@color/about_page_header_grey" android:layout_width="fill_parent" android:layout_height="@dimen/tabs_strip_height" gecko:strip="@drawable/home_tab_menu_strip" gecko:stripColor="@color/tab_menu_strip_color" gecko:tabsMarginLeft="@dimen/tab_strip_content_start" gecko:activeTextColor="@color/placeholder_grey" gecko:inactiveTextColor="@color/tab_text_color" />

has `gecko:stripColor="@color/tab_menu_strip_color"`.  I think that is
actually represented internally as a TYPE_STRING resource!  XXX say
more about reproducing.

Let's look at the next one,
https://bugzilla.mozilla.org/show_bug.cgi?id=1449683.

java.lang.ClassCastException: android.view.ViewStub cannot be cast to org.mozilla.gecko.home.HomeListView
	at org.mozilla.gecko.home.BrowserSearch.onCreateView(BrowserSearch.java:303)

R.id.home_list_view
[[file:~/Mozilla/apks/fennec-59.0.1.multi.android-arm/smali/org/mozilla/gecko/R$id.smali::.field%20public%20static%20final%20home_list_view:I%20=%200x7f0f00b1][59.0/59.0.1]]
.field public static final home_list_view:I = 0x7f0f00b1

[[file:~/Mozilla/apks/fennec-59.0.2.multi.android-arm/smali/org/mozilla/gecko/R$id.smali::.field%20public%20static%20final%20home_list_view:I%20=%200x7f0f00b0][59.0.2]]
.field public static final home_list_view:I = 0x7f0f00b0

Suppose that we looked for 0x7f0f00b0 (59.0.2's ID) in 59.0.1's
resource set; we would find

[[file:~/Mozilla/apks/fennec-59.0.1.multi.android-arm/smali/org/mozilla/gecko/R$id.smali::.field%20public%20static%20final%20suggestions_opt_in_prompt:I%20=%200x7f0f00b0][59.0/59.0.1]]
.field public static final suggestions_opt_in_prompt:I = 0x7f0f00b0

What do we find?

[[file:~/Mozilla/apks/fennec-59.0.1.multi.android-arm/res/layout/browser_search.xml::<ViewStub%20android:id="@id/suggestions_opt_in_prompt"%20android:layout="@layout/home_suggestion_prompt"%20android:layout_width="fill_parent"%20android:layout_height="wrap_content"%20/>][59.0/59.0.1]]
    <ViewStub android:id="@id/suggestions_opt_in_prompt" android:layout="@layout/home_suggestion_prompt" android:layout_width="fill_parent" android:layout_height="wrap_content" />

There's our ViewStub.

Finally, https://bugzilla.mozilla.org/show_bug.cgi?id=1449177.

java.lang.ClassCastException: android.widget.FrameLayout cannot be cast to org.mozilla.gecko.activitystream.homepanel.stream.StreamOverridablePageIconLayout
	at org.mozilla.gecko.activitystream.homepanel.stream.WebpageItemRow.<init>(WebpageItemRow.java:50)

        pageIconLayout = (StreamOverridablePageIconLayout) itemView.findViewById(R.id.page_icon);

[[file:~/Mozilla/apks/fennec-59.0.1.multi.android-arm/smali/org/mozilla/gecko/R$id.smali::.field%20public%20static%20final%20page_icon:I%20=%200x7f0f0099][59.0/59.0.1]]
.field public static final page_icon:I = 0x7f0f0099

[[file:~/Mozilla/apks/fennec-59.0.2.multi.android-arm/smali/org/mozilla/gecko/R$id.smali::.field%20public%20static%20final%20page_icon:I%20=%200x7f0f0098][59.0.2]]
.field public static final page_icon:I = 0x7f0f0098

Do the same thing, looking for 0x7f0f0098 in 59.0.1's resource set; we find

[[file:~/Mozilla/apks/fennec-59.0.1.multi.android-arm/smali/org/mozilla/gecko/R$id.smali::.field%20public%20static%20final%20icon_wrapper:I%20=%200x7f0f0098][59.0/59.0.1]]
.field public static final icon_wrapper:I = 0x7f0f0098

and

[[file:~/Mozilla/apks/fennec-59.0.1.multi.android-arm/res/layout/activity_stream_webpage_item_row.xml::<FrameLayout%20android:id="@id/icon_wrapper"%20android:layout_width="wrap_content"%20android:layout_height="wrap_content">][59.0/59.0.1]]
    <FrameLayout android:id="@id/icon_wrapper" android:layout_width="wrap_content" android:layout_height="wrap_content">
        <org.mozilla.gecko.activitystream.homepanel.stream.StreamOverridablePageIconLayout android:id="@id/page_icon" android:layout_width="@dimen/favicon_bg" android:layout_height="@dimen/favicon_bg" android:layout_marginLeft="@dimen/activity_stream_base_margin" android:layout_marginTop="@dimen/activity_stream_base_margin" android:layout_marginBottom="@dimen/activity_stream_base_margin" />
    </FrameLayout>

There's our FrameLayout!

I have _absolutely no idea how to proceed_ in the face of this theory.  mcomella, Vlad: what do you think?
Flags: needinfo?(vlad.baicu)
Flags: needinfo?(michael.l.comella)
In the previous comment, I wrote it using the linking feature in Emacs org-mode and haven't tried to make it link properly for others (i.e., as a real webpage).  In addition, I see I accidentally referenced some en-US packages rather than multi packages; the changes don't alter the analysis or conclusions.
Nick thanks for the deep analysis - I still have to digest it, but is there any chance that if the user grabs an update from GP after the crashing starts, then the crashing will stop? And can you reason how often this crash happens for affected users? Is it every time they try to launch the app? What happens if they delete and re-install the app?
Flags: needinfo?(nalexander)
(In reply to :sdaswani from comment #25)
> Nick thanks for the deep analysis - I still have to digest it, but is there
> any chance that if the user grabs an update from GP after the crashing
> starts, then the crashing will stop?

My expectation is that no, a further update will only shift the crashing around.  We might want to push a very special 59.0.3 to test this theory in a slightly more controlled environment.

 And can you reason how often this crash
> happens for affected users?

This I cannot say, and I think we need to work with release management to understand the ctual numbers.  The total number of crashes seems tiny -- perhaps 1500 for one of the tickets in a 5% roll out.  If my understanding is correct, this might be very rare in the wild, meaning that the set of affected users is small.

> Is it every time they try to launch the app?

An affected user should see this crash every time they launch the app.  If my theory is correct, it's possible that rebooting their device (and some other, difficult to divine set of actions) could clear whatever caches need to be invalidated and then things return to normal.  We have no reason to think that is the case, other than that with similar situations (e.g., issues with settings Apps, cache invalidation failures in the Account manager) that is exactly what happens.

> What happens if they delete and re-install the app?

My expectation is that a clean re-install will completely address the problem.  If it doesn't, my theory is 100% wrong!
Flags: needinfo?(nalexander)
Thanks Nick!
(In reply to :sdaswani from comment #27)
> Thanks Nick!

We need to understand the population this is impacting.  That means we need somebody from release engineering to explain what the status of the gradual rollout is, and somebody who is skilled aggregating crash reports to quantify how prevalent this is.  That is, I need to know what stages the rollout proceeded through, what the size of the populations were at those stages, how frequently the crashes occurred during each stage, and how concentrated the crashes were per affected device.  sdaswani, can you help with finding these people and this information?:

In the interim, I will determine what, if any resource changes occurred between 58.0 through to 58.0.2 -- my theory are there were none, since I know of no crashes fitting the pattern that would indicate such changes occurred.

I will try to understand if we witness this type of crash in the Beta population.  My working theory is that these crashes are very rare, and the Beta population is just too small.  But I might be incorrect.

Finally, I will try to think of a minimal test of the theory that we might be able to deploy as a 59.0.3.
Flags: needinfo?(sdaswani)
(In reply to Nick Alexander :nalexander from comment #28)
> (In reply to :sdaswani from comment #27)
> > Thanks Nick!
> 
> We need to understand the population this is impacting.  That means we need
> somebody from release engineering to explain what the status of the gradual
> rollout is, and somebody who is skilled aggregating crash reports to
> quantify how prevalent this is.  That is, I need to know what stages the
> rollout proceeded through, what the size of the populations were at those
> stages, how frequently the crashes occurred during each stage, and how
> concentrated the crashes were per affected device.  sdaswani, can you help
> with finding these people and this information?:

I think Sylvestre is a good first guess to answer these questions.
Flags: needinfo?(sdaswani) → needinfo?(sledru)
(In reply to Nick Alexander :nalexander from comment #28)
> (In reply to :sdaswani from comment #27)
> > Thanks Nick!
> 
> We need to understand the population this is impacting.  That means we need
> somebody from release engineering to explain what the status of the gradual
> rollout is, and somebody who is skilled aggregating crash reports to
> quantify how prevalent this is.  That is, I need to know what stages the
> rollout proceeded through, what the size of the populations were at those
> stages, how frequently the crashes occurred during each stage, and how
> concentrated the crashes were per affected device.  sdaswani, can you help
> with finding these people and this information?:
> 
> In the interim, I will determine what, if any resource changes occurred
> between 58.0 through to 58.0.2 -- my theory are there were none, since I
> know of no crashes fitting the pattern that would indicate such changes
> occurred.

My expectation is correct, there are no resources changes between 58.0 and 58.0.2:

⋊> ~/M/apks sha1sum fennec-58*/*arsc                                                                                                                                                                                              11:54:11
8e955776f08881ff3ba8cf347c4d6ca583f74726  fennec-58.0.1.multi.android-arm/resources.arsc
8e955776f08881ff3ba8cf347c4d6ca583f74726  fennec-58.0.2.multi.android-arm/resources.arsc
8e955776f08881ff3ba8cf347c4d6ca583f74726  fennec-58.0.multi.android-arm/resources.arsc

> I will try to understand if we witness this type of crash in the Beta
> population.  My working theory is that these crashes are very rare, and the
> Beta population is just too small.  But I might be incorrect.

I tried to dig into the beta population, and I can make no headway.  I see nothing that looks like the crashes we're seeing in 59.0{.1,.2}.  I don't understand why 59.0b* are not canaries in this coalmine, and must conclude that this is infrequent and we only see it in the release population.

> Finally, I will try to think of a minimal test of the theory that we might
> be able to deploy as a 59.0.3.
Android 8 for the Samsung devices did not publicly exist until March 15th or so. Samsung and carriers are still in the process of rolling out this update. We released 59.0 (release version) March 13th.
Flags: needinfo?(kbrosnan)
Yeah it has the look of an android update issue, last I checked mid last week this was specifically visible for Samsung devices.  I don't think it's necessarily an issue between Firefox 58 and 59, or 59.x and later versions. That doesn't mean it isn't also a Firefox issue, just that that's where we are seeing the big spike because that's the most common version of Firefox out in the wild. Glad we have a bug on file with Google.
Oh -- I meant to comment on bug 1450793 (the really high volume spike that is more visible on the play store).  Do we know if that is related, yet?
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #33)
> Oh -- I meant to comment on bug 1450793 (the really high volume spike that
> is more visible on the play store).  Do we know if that is related, yet?

From what I can see from the bugs I filed and that are noted in Comment 1, this issue is not related to the Samsung S8 - the device list in some of the crashes doesn't even include any Samsung devices (one example is Bug 1449695)
Calixte can help with that.
Flags: needinfo?(sledru) → needinfo?(cdenizet)
I did reach out to a contact from Google, but he indicated that the person he spoke with said there is likely little help they can offer unless we can get a good set of STR for these crashes.
Marcia, has anyone had any luck with STRs in any of these related tickets?
Flags: needinfo?(mozillamarcia.knous)
(In reply to :sdaswani from comment #37)
> Marcia, has anyone had any luck with STRs in any of these related tickets?

I don't think we have explicitly asked SV to try to reproduce. We can definitely comb the crash reports for comments and devices that most commonly suffer from the crash and have them look into it.
Flags: needinfo?(mozillamarcia.knous)
Bogdan, can someone from Softvision QA spend a few hours trying to reproduce this crash? Do you need devices first?
Flags: needinfo?(bogdan.surd)
(In reply to :sdaswani from comment #39)
> Bogdan, can someone from Softvision QA spend a few hours trying to reproduce
> this crash? Do you need devices first?

My theory suggests that you would see this by installing 58.0* via Google Play, and then getting upgraded to 59.0* (again via Google Play).  It's not clear if the Google Play Store is involved or not.  It's not clear to me that side-loading via ADB would trigger the same resource/code mismatch/invalidation issues that we're seeing in the wild.

So: I would love to see testing on this, and gently suggest exercising those upgrade paths.
I had a discussion with Calixte, who did all the work that I summarize here.

1) 59.0.2 has rolled out to 100% of our population.  (I did not know this!  I thought these tickets were blocking full roll-out.)

2) We have never seen more than 500 crashes on a single day for any of these tickets.  Given the size of our release population this is a truly insignificant number of crashes.

3) It's possible that we're not seeing the complete picture for these crashes, because we're seeing truly significant crash volumes for other tickets (https://bugzilla.mozilla.org/show_bug.cgi?id=1450793) but only in Google Play, not in Socorro.

4) We have no reason to connect these tickets and the truly significant crashes we are seeing in 59.0.2.

Given the tiny volume of these crashes and the unrelated significant crashes, we're not going to try to push this much further.

Liz, is that acceptable to you?

Aside: in fact, the amount of time I spent investigating these crashes was not worth the investment.  Next time I will try to quantify the crash volume before spending a minute thinking about the causes.
Flags: needinfo?(lhenry)
Nick, that's fine with me, I tracked this for 59 to keep an eye on the issue in case it became a real problem.  This issue was not blocking release rollout. I know that can be frustrating when we (relman) interrupt your regular work with some release-related emergency. Sorry about that. And thanks for the investigation. You are right that we should keep the user impact firmly in mind before spending a lot of energy on an issue that is already out there on release. 

The thing I'm actually worried about here is bug 1450793.
Flags: needinfo?(lhenry)
> Aside: in fact, the amount of time I spent investigating these crashes was not worth the investment.
> Next time I will try to quantify the crash volume before spending a minute thinking about the causes.

Thanks for the thorough investigation though, Nick! This sounds like a sound plan for the future that I should also follow. :)

Vlad, Google posted in your issue: https://issuetracker.google.com/issues/77608213 Can you follow-up with them?
Flags: needinfo?(michael.l.comella)
(In reply to :sdaswani from comment #39)
> Bogdan, can someone from Softvision QA spend a few hours trying to reproduce
> this crash? Do you need devices first?

We would need the specified device in order to have any chance at reproducing this specific crash since US and EU models of the same device have different hardware in them. When I tested bug 1450793 seeing as my Samsung model was different being a EU version of the device I could not reproduce the mentioned crash in any of the builds.

I am willing to investigate this further if needed but I don't know how we would go about installing the build from Google Play as mentioned in Comment 40, as far as I know there is no easy solution to this. From what I know using ADB to upgrade the build is basically the same process as GP does it.

Also taking Comment 41 is it still worth trying to investigate this?
Flags: needinfo?(bogdan.surd) → needinfo?(sdaswani)
Flags: needinfo?(cdenizet)
Given what we know regarding volume, I don't think it is worth trying to reproduce these crashes. As Liz said we should focus our efforts on Bug 1450793. To be clear, the crashes noted in this bug were affecting more than just Samsung devices - they are startup crashes when users updated to 59.0.2 (at least the 3 that I filed that are listed in Comment 1).
OK Bogdan, looks like you shouldn't work on this, and I'm taking Vlad off it too.
Flags: needinfo?(vlad.baicu)
Flags: needinfo?(sdaswani)
FYI Google wants STRs to but we don't have any so I'm not sure Vlad can respond satisfactorily. Marcia any advice on how we should proceed?
Flags: needinfo?(mozillamarcia.knous)
(In reply to :sdaswani from comment #47)
> FYI Google wants STRs to but we don't have any so I'm not sure Vlad can
> respond satisfactorily. Marcia any advice on how we should proceed?

We probably should close the issue out for now, since we don't have the information the Google Developers are seeking.
Flags: needinfo?(mozillamarcia.knous)
Vlad, please close the issue. Thanks Marcia!
Flags: needinfo?(vlad.baicu)
I don't think the issues on their tracker have a close option unless it requires a higher access in order to do that (e.g. someone from Google). I think we can either leave it as it is or mention that we currently don't have any clear STRs.
Flags: needinfo?(vlad.baicu)
(In reply to Vlad Baicu from comment #50)
> I don't think the issues on their tracker have a close option unless it
> requires a higher access in order to do that (e.g. someone from Google). I
> think we can either leave it as it is or mention that we currently don't
> have any clear STRs.

Sure mention we don’t have STRs and they can handle it.
With only a week to go before the Firefox 60 release, it's too late to fix this for 59.
Moving to p3 because no activity for at least 24 weeks.
See https://github.com/mozilla/bug-handling/blob/master/policy/triage-bugzilla.md#how-do-you-triage for more information
Priority: P1 → P3

I think this can be closed now, right?

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.