Closed Bug 1555734 Opened 8 months ago Closed 8 months ago

The browser crashes after installing the Lockwise add-on on Firefox 66

Categories

(Toolkit :: Telemetry, defect, P1)

66 Branch
x86_64
All
defect

Tracking

()

RESOLVED FIXED
mozilla69
Tracking Status
firefox-esr60 --- unaffected
firefox67 --- wontfix
firefox67.0.1 blocking wontfix
firefox68 --- fixed
firefox69 --- fixed

People

(Reporter: cosmin.muntean, Assigned: chutten)

References

()

Details

(Keywords: crash, Whiteboard: [fixed by bug 1555798])

Crash Data

Attachments

(1 file)

Attached image Lockwise on 66.gif

[Notes]:

  • If the Lockwise add-on was never installed on Firefox versions 66, the browser crashes after the first install attempt of the add-on. After the browser is restarted the Lockwise add-on works as expected and if it is removed and reinstalled the browser no longer crashes.
  • The issue is also logged in Lockwise Github repository. Please see the following issues:

https://github.com/mozilla-lockwise/lockwise-addon/issues/53
https://github.com/mozilla-lockwise/lockwise-addon/issues/290

[Affected Versions]:

  • Lockwise 2.2.2-alpha
  • Firefox 66.x

[Affected Platforms]

  • All Windows
  • All Mac
  • All Linux

[Perquisites]

  • Have the latest Firefox 66 version installed.
  • Have a new clean profile.

[Steps to reproduce]:

  1. Open the Firefox 66 version with the profile from prerequisites.
  2. Navigate to "https://lockwise.firefox.com/" website.
  3. Click the "Install for Firefox" button.
  4. Observe the behavrior.

[Expected results]:

  • A doorhnager which informs the users that the add-on is not compatible with Firefox 66 is displayed.
    OR
  • The add-on is successfully installed and the browser doesn't crash.

[Actual results]:

  • The add-on is installing then the browser crashes.

[Additional Notes]:

  • That stack shows it is crashing the browsing when trying to record a telemetry event.
  • The issue is not reproducible on Firefox 64, 63 or older versions.
  • The issue is not reproducible on Firefox 67, 68 and 69.
Summary: The browser crashes after the installing the Lockwise add-on on Firefox 66 → The browser crashes after installing the Lockwise add-on on Firefox 66
Crash Signature: https://crash-stats.mozilla.org/report/index/bda83196-6069-4471-ad78-59bae0190530 → `anonymous namespace'::ScalarBase::ScalarBase(const struct `anonymous namespace'::BaseScalarInfo& const)

@Georg, The Lockbox/Lockwise extension was causing the full Firefox browser to crash when recording a telemetry event. Can you look in to this?

Flags: needinfo?(gfritzsche)

I also ran into this bug when installing the extension on Nightly 69; sounds like I may be the only one who has seen it, though.

Any chance you can look at this today? I believe many folks in Europe are OOO

Flags: needinfo?(chutten)

That stack shows it is crashing the browsing when trying to record a telemetry event.

To re-iterate as I understand it: this bug is still in "fact finding mode" but initial suspicion is this is a telemetry-caused crash happening before the browser can check/confirm the extension can (not) be installed.

Next steps to help validate:

  • :jaws requested Georg help look at this
  • :_6a6a8 is also requesting help from :chutten due to EU holiday
Assignee: nobody → jhirsch

(georg's on PTO until Monday)

Works fine for me on Nightly 69.

On Lockwise install it records the dynamically-registered event lockboxv1#startup#webextension which is summarized to a Scalar via Event Summary

Which then crashes with an access violation while trying to instantiate the Scalar. Weird. I'm not sure how that could happen.

Flags: needinfo?(chutten)

Tania, here's another issue where we could use extra testing today (on trailhead ie 67.0.1)

Flags: needinfo?(tmaity)

Sounds like we have at least one report of this happening on 69.

Maybe this was a telemetry related issue in 66, fixed in 67. It seems worth specifically trying to reproduce in 67.0.1.

Considering the cost of doing another point release, a better workaround for Trailhead specifically might be to update the lockwise download page to not allow 66 and older versions to install the addon.

Krupa, could you please assign someone for this one?

Flags: needinfo?(tmaity) → needinfo?(kraj)

Considering the cost of doing another point release, a better workaround for Trailhead specifically might be to update the lockwise download page to not allow 66 and older versions to install the addon.

Agreed. Thanks, Jared.

:m_and_m has volunteered to add this check onto the website, tracking here:
https://github.com/mozilla-lockwise/mozilla-lockwise.github.io/issues/124

(In reply to Tania Maity (:tmaity) from comment #11)

Krupa, could you please assign someone for this one?

Peter has agreed to help out. We will check if it affects 67.0.5 and 69

Flags: needinfo?(kraj) → needinfo?(pdehaan)
Flags: needinfo?(gfritzsche)

Not sure where the best place for updates is between this and the GitHub issues is, but I've been posting a few crash logs and updates in https://github.com/mozilla-lockwise/lockwise-addon/issues/290#issuecomment-497433504

So far (on macOS):

  • 65.0.2 (no crash)
  • 66.0.1 (crash)
  • 66.0.4 (crash)
  • 66.0.5 (crash)
  • 67.0 (no crash)
  • 68.0a1 (old nightly, no crash)
  • 69.0a1 (latest nightly, no crash)
Flags: needinfo?(pdehaan)

(In reply to Marnie Pasciuto-Wood [:marnie] from comment #3)

I also ran into this bug when installing the extension on Nightly 69; sounds like I may be the only one who has seen it, though.

Marnie, are you able to get a crash stack from Nightly 69? You may see it if you visit about:crashes

Flags: needinfo?(mpasciutowood)

Crashed in 66.0.1 [1] w/ the following Browser Console log:

1559243676851 addons.xpi-utils WARN Add-on lockbox@mozilla.com is not compatible with application version.
1559243677039 addons.xpi-utils WARN Add-on lockbox@mozilla.com is not compatible with application version.

Do we have a list of older browsers you want me to try verifying on? Or should I just keep testing 66.0.2-66.0.3?


[1] https://ftp.mozilla.org/pub/firefox/releases/66.0.1/mac/en-US/

With help from :jaws, here's the link to my crash that happened as soon as I installed the extension. It was fine after a reboot.
https://crash-stats.mozilla.org/report/index/bp-e43b6513-3baa-48b7-87f2-94c0e0190528

Flags: needinfo?(mpasciutowood)

I just triggered the crash again.
STR:
Was in Nightly 69.0a1 (unsure of date)
Uninstalled existing Lockwise extension.
Updated Nightly to 69.0a1 (2019-05-30)
Went to lockwise.firefox.com
Clicked button on website to add extension.
Got through first "accept" dialog.
Clicked ok/accept/whatever on second dialog (I think it was the one about sending/sharing data)
Browser crashed.
Upon restart, extension is NOT installed.

https://crash-stats.mozilla.org/report/index/bp-2dd5995a-ae55-4056-8359-7f1070190530

We have a theory of the bug that this might be a regression from bug 1451813

Technical Details

In the bug's patch we store a reference to a BaseScalarInfo inside each KeyedScalar so that we can access things like its name and id. This works fine if the BaseScalarInfo doesn't move. For Scalars registered at compile-time this is no problem. But Lockwise uses "Dynamic Scalars" which are registered at runtime. This causes the nsTArray gDynamicScalarInfo to shuffle its storage which causes the ref that the dynamic_event_counts KeyedScalar was holding to become invalid.

The next time we try to use the dynamic_event_counts KeyedScalar as a part of Event Summary, things go wrong. We try to allocate a new Scalar, but since its ScalarInfo has been compromised, the scalar type is wrong. So we don't allocate anything and return nullptr. We dutifully store this nullptr in storage and retrieve it later when trying to accumulate to the scalar, which results in us trying to access a member at address 0x23.

In debug builds this fails at the step "the scalar type is wrong", still much later than the actual problem.

I am at a loss as to why this only happens in 66.0.5. It should happen in any build where there's dynamic scalars. The oddities I've noticed that might be build-specific are small things like the presence of the sandbox.no_job dynamic scalar. If it wasn't there, maybe the nsTArray wouldn't reallocate and the ref would remain uncompromised.

Next steps

I will be working on a patch to fix this code fault in KeyedScalar. I have filed bug 1555798 for that effort and will get it into 69 and ask for uplift to at least Beta 68.

It might be worth shipping a version of Lockwise without Telemetry data collections to builds containing bug 1451813 to ensure we don't run afoul of this again.

To my knowledge Firefox 67 could be affected by this, so this does matter for Trailhead. Why it doesn't crash is a mystery, but I'll take the good luck as it comes. I don't know which whiteboard/tracking flags we might want for this bug or bug 1555798 and leave that to those in the Cc to set.

I don't know of any examples of this crash occurring in 67 release, which is odd, if it's still a legit bug.

:pdehaan, would you mind trying to repro on 67, or maybe try to find a regression range in nightly?

Flags: needinfo?(pdehaan)

Unable to repro crash in 67.0 release build. Will try earlier 67 beta builds (which seem to range from 67.0b3 to 67.0b19)

Version Build ID Status
Firefox 67.0 20190516215225 no crash
Firefox 67.0b19 20190509185224 no crash
Firefox 67.0b16 20190502232159 no crash
Firefox 67.0b13 20190422163745 no crash
Firefox 67.0b9 20190408123043 no crash
Firefox 67.0b3 20190318154932 no crash
Flags: needinfo?(pdehaan)
Crash Signature: `anonymous namespace'::ScalarBase::ScalarBase(const struct `anonymous namespace'::BaseScalarInfo& const) → [@ `anonymous namespace'::ScalarBase::ScalarBase(const struct `anonymous namespace'::BaseScalarInfo& const)] [@ (anonymous namespace)::ScalarBase::ScalarBase((anonymous namespace)::BaseScalarInfo const&)] [@ (anonymous namespace)::internal_ScalarAlloca…
Keywords: crash

Today I did more investigation and I think I might be on to what it is causing this crash. Here is what I have found:

I have tested this issue on multiple Firefox versions: 63, 64, 65, 66, 67, 68, 69 to check if it is a Firefox regression.

  • I have managed to reproduce the issue only on Firefox 66.x release, but the interesting part is that it is not reproducible on the Nightly 66.x or Beta 66.x builds.

Considering this I have searched for differences between Firefox 66 release and Firefox 66 Nightly channel.

  • The main difference that I have observed is that there is a normandy rollout (pref-rollout-block-autoplay-release-66-1535667) which targets only Firefox 66 release. This rollout sends an event in “about:telemetry#events-tab” > dynamic (see this screenshot), where Lockwise also sends the events.

In order to check if the rollout helps cause the crash, I have forced the profile to block any normandy recipes.

  • After opening the profile I managed to install the Lockwsie add-on without any crashes.
  • It seems that on Firefox versions lower than 67, the Normandy rollouts are sending some events which may be in conflict with the Lockwise ones and causes the browser to crash.

In order to verify this issue on all Firefox versions, I have used a rollout recipe on the Normandy stage server that targets all the Firefox versions and channels from 61 to 69.

  • Using this I have managed to find out that only Firefox 65 and 66 are affected. Considering this I have used mozregression tool in order to find out what change introduced this issue and what change fixed the issue.
  • Here are the results:

When the bug was introduced:

  • Last good revision: 5dabc51178c6ad0a4a83ca9c22387c49d645553f
  • First bad revision: 91c9a4e6d7a859f4578111b3c3b63f6af49c8e7a
  • Pushlog: link.
  • From this pushlog looks like Bug 1498165 introduced the issue.

When the bug was fixed:

  • First good revision: 48d98388c29cfa5d0e4dd92acfb3327f160fc774
  • Last bad revision: a1dd47617412e8932b94fb3d57f370feeb4486ed
  • Pushlog: link.
  • From this pushlog looks like Bug 1443560 fixed the issue on Firefox 67.

However, probably the issue is also reproducible on Firefox 67 and above if another study or add-on is installed which can be in conflict with the telemetry pings sent by Lockwise add-on. Which might be the case that Marnie is encountering on her end.

Please let me know if you have any questions.

So, as long as we never push an experiment or system addon with this scalar issue to 66.0.5, no one should hit this crash?
In theory we could avoid doing that, but in practice we have a bunch of experiments that go out even beyond the life of a release cycle (we're still running experiments on 66, for example, during the 67 release, because there's still a substantial population of users there.) It would make it tricky to target 66.0 and 66.0.2 and beyond. How tricky is setting up those rules? and how confident are we that we can maintain that? If we aren't then I agree we should go with shipping Lockwise with telemetry disabled for 66.0.5.

Flags: needinfo?(jhirsch)
Flags: needinfo?(elancaster)

(In reply to Liz Henry (:lizzard) (use needinfo) from comment #25)

So, as long as we never push an experiment or system addon with this scalar issue to 66.0.5, no one should hit this crash?
In theory we could avoid doing that, but in practice we have a bunch of experiments that go out even beyond the life of a release cycle (we're still running experiments on 66, for example, during the 67 release, because there's still a substantial population of users there.) It would make it tricky to target 66.0 and 66.0.2 and beyond. How tricky is setting up those rules? and how confident are we that we can maintain that? If we aren't then I agree we should go with shipping Lockwise with telemetry disabled for 66.0.5.

For what it's worth, Lockwise officially supports Firefox versions 67 and above, and not 66 or earlier. This issue was discovered via Lockwise because of some bugs in our support range declarations while testing that.

Unfortunately, it looks like the monitor system addon also uses the API that can trigger the crash:

https://searchfox.org/mozilla-central/source/browser/extensions/fxmonitor/privileged/api.js#164

I'm not sure if anyone has observed these crashes while testing monitor, as the conditions to actually trigger a crash seem rare.

Maybe it's an acceptable risk to take, since monitor testing hasn't turned up tons of these crashes?

And, now that I think about it, disabling the use of this API in monitor (a system addon) would require a respin anyway...

:chutten, thoughts on the risk here? if fxmonitor hasn't triggered the crashes up till now, presumably it's fine?

Flags: needinfo?(jhirsch) → needinfo?(chutten)

Unfortunately I can't say for sure what the risk is here.

The specific sequence of steps to cause the issue appears to be an interleaving of recording to dynamic events and registering dynamic scalars. If you record a dynamic event, register enough dynamic scalars to cause nsTArray to realloc, then record another dynamic event, you'll trigger the fault and Firefox will crash.

In practice this seems to be a rather rare occurrence, but we can't rule it out for anything that uses either of those APIs so long as, in total across Firefox and its experiments and addons, both APIs are being used.

This means the risk is low, but unknown without being able to guarantee the rest of the state of Firefox.

Flags: needinfo?(chutten)

Here's another crash, again after uninstalling the Lockwise extension, updating Nightly, then reinstalling:
https://crash-stats.mozilla.org/report/index/bp-4bb32015-9dd0-4ee7-86d7-ef0200190531

Aha! It turns out :marnie's crash-prone nightly profile has the old pricewise addon installed, which dynamically registers quite a few events:

https://github.com/mozilla/price-tracker/blob/c04cb829d981b6fa49a6935bf5b81f0704d3ac6f/src/telemetry/events.js

Thus increasing the likelihood of one of them landing after a realloc, boosting the crash probability.

I've just updated Nightly to the 2019-06-01 release with the patch, and I've installed the updated Lockwise extension that removed telemetry. Nightly no longer crashes for me! Thanks everyone!

Flags: needinfo?(elancaster)
Priority: -- → P1

Reassigning, as this definitely seems to be chutten's bug ^_^

Assignee: jhirsch → chutten

Given :marnie's comment 32, I'd say this is done. I was leaving this one for the investigation and in case we needed another bug to hang Lockwise-side work onto. Looks like those cases are taken care of.

Status: NEW → RESOLVED
Closed: 8 months ago
Resolution: --- → FIXED
Whiteboard: [fixed by bug 1555798]
Target Milestone: --- → mozilla69
You need to log in before you can comment on or make changes to this bug.