Closed Bug 1651133 (spurious-about-restartrequired) Opened 4 years ago Closed 3 years ago

Verify that we only show about:restartrequired in the presence of a build ID mismatch

Categories

(Core :: IPC, task)

task

Tracking

()

RESOLVED FIXED
91 Branch
Tracking Status
firefox91 --- fixed

People

(Reporter: mossop, Assigned: gerard-majax)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(3 files)

We've been seeing the odd report where about:restartrequired is shown when it seems like there should not have been a background application update (bug 1640304 is an example).

Looking over the logic I am wondering if a build ID mismatch is the only case that might lead us to show this UI. The flow seems to be:

  1. MessageChannel in the parent process defaults the flag mBuildIDsConfirmedMatch to false.
  2. The content process early in startup compares its build ID to that passed from the parent (via a command line argument), if they match it sends a message up to the parent saying so, if not the content process quits.
  3. When the parent receives the message it sets mBuildIDsConfirmedMatch to true.

x. If the content process crashes and mBuildIDsConfirmedMatch is false then we show about:restartrequired.

I may be missing other details but based on that it seems like there is a possibility for the content process to crash before sending the message and then the parent assuming the crash was because of a build ID mismatch.

One way this can currently happen on Linux if is things go sufficiently wrong when initializing GTK, for example by failing to connect to the display server; I ran into this today, so I can confirm that it is possible to get about:restartrequired in a case that should have been considered a normal tab crash. Bug 1635451 will avoid that specific situation as a side-effect of not initializing GTK in most content processes, and that's platform-specific code so it doesn't apply to Mac/Windows/Android, but there are a few other things we do this early that could theoretically crash.

This is a little delicate, because if the build IDs don't match, the child process can't use normal IPC, and it's possible that we could someday have incompatibilities that result in a crash before the build ID check (e.g., serialized prefs). But maybe the parent process could re-read the build ID from the install directory (it seems to be in platform.ini) after observing the crash; if it was a build mismatch, it should see the new ID at that point.

See Also: → 1366808

I decided to throw some code together while I was thinking about this. Tested by modifying ContentChild to not send the BuildIDsMatch message and using kill -11; that causes about:restartrequired without this patch and a normal tab crash page with it.

This could use a little cleanup, and ideally a test case.

Assignee: nobody → lissyx+mozillians
Status: NEW → ASSIGNED

As much as I can tell, there's no testing at all on sending the oop-browser-buildid-mismatch event. There's testing coverage that when it is sent, we do display about:restartrequired though: https://searchfox.org/mozilla-central/rev/443f87caa5fadba920b0382e12874693d6c6133a/browser/base/content/test/tabcrashed/browser_shownRestartRequired.js#66-114

I'm still trying to look through ipc tests if we can find a way to cover that.

So, I have hacked something based on this patch and :jld's comment #2, and it somehow worked, however the current implementation is rather unstable / flakky and thus unreliable. It would still seem to indicate this fix is working as intended.

It seems, however, I got some success using application.ini but then failures. Changing its values does not seems to always reflect?

(In reply to Alexandre LISSY :gerard-majax from comment #5)

It seems, however, I got some success using application.ini but then failures. Changing its values does not seems to always reflect?

Seems like it was bad luck and the correct source of truth is https://searchfox.org/mozilla-central/source/__GENERATED__/toolkit/library/buildid.cpp#1

(In reply to Alexandre LISSY :gerard-majax from comment #6)

(In reply to Alexandre LISSY :gerard-majax from comment #5)

It seems, however, I got some success using application.ini but then failures. Changing its values does not seems to always reflect?

Seems like it was bad luck and the correct source of truth is https://searchfox.org/mozilla-central/source/__GENERATED__/toolkit/library/buildid.cpp#1

Which is actually showing how much focused I was and completely missed application.ini VS platform.ini

So, I have been able to build something on top of :jld's patch and now the last question, since it seems to be running quite stable locally is whether we can rely on platform.ini like that?

Flags: needinfo?(ksteuber)

I'm not sure that I'm the right person to answer your question. I don't really know anything in particular about platform.ini.

I tried taking a look at the test, and it looks like you are writing to platform.ini to try to force an about:restartrequired situation? I don't really know enough about how that page gets shown to know if that is a reasonable way to achieve this.

Flags: needinfo?(ksteuber)

(In reply to Kirk Steuber (he/him) [:bytesized] from comment #10)

I'm not sure that I'm the right person to answer your question. I don't really know anything in particular about platform.ini.

I tried taking a look at the test, and it looks like you are writing to platform.ini to try to force an about:restartrequired situation? I don't really know enough about how that page gets shown to know if that is a reasonable way to achieve this.

Thanks, I'll try to find someone who might know better the situation around platform.ini, if you have any name to suggest, I'm interested.

From your answer, it might be that I was not clear enough though: this bug is explicitely about leveraging platform.ini infos to be able to send a more accurate about:restartrequired, so we are really interested in ensuring that reading from the file is a good thing to do.

(In reply to Kirk Steuber (he/him) [:bytesized] from comment #10)

I'm not sure that I'm the right person to answer your question. I don't really know anything in particular about platform.ini.

I tried taking a look at the test, and it looks like you are writing to platform.ini to try to force an about:restartrequired situation? I don't really know enough about how that page gets shown to know if that is a reasonable way to achieve this.

Copying here the answer from :nalexander on matrix: platform.ini is reliable as long as we understand its limitations:

 so platform.ini tells you something about the state of the application on disk, but it doesn't necessarily tell you what the version of libxul actually is.

We might also want to use some other source of truth because they are trying to get rid of those ini files.

Also, I was unable to get the test to run correctly on Windows: GPU Process is hanging on some resources not free'd. As much as I could analyze, there is some CompositorThreadHolder reference being held.

From refcount logging, I could get some data that shows the imbalance:

50:   AddRef 1  CompositorThreadHolder::Start          ==> Release 5 CompositorThreadHolder::Release
60:   AddRef 2  VsyncBridgeParent::VsyncBridgeParent   ==> Release 9 PVsyncBridgeParent::OnChannelClose

91:   AddRef 3  CompositorManagerParent::Create        ==> Release 18 CompositorManagerParent::DeferredDestroy
121:  AddRef 4  ImageBridgeParent::ImageBridgeParent   ==> Release 16 ImageBridgeParent::DeferredDestroy
152:  AddRef 5  VRManagerParent::CreateForGPUProcess   
182:  AddRef 6  VRManagerParent::OnChannelConnected    ==> Release 17 VRManagerParent::ActorDealloc

212:  AddRef 6  CompositorManagerParent::Create        ==> Release 16 CompositorManagerParent::DeferredDestroy
242:  AddRef 7  ImageBridgeParent::ImageBridgeParent   ==> Release 15 ImageBridgeParent::DeferredDestroy
272:  AddRef 8  VideoBridgeParent::VideoBridgeParent   ==> Release 10 VideoBridgeParent::ActorDealloc
292:  AddRef 9  VRManagerParent::OnChannelConnected    ==> Release 17 VRManagerParent::ActorDealloc

307:  AddRef 10 CompositorManagerParent::Create        ==> Release 18 CompositorManagerParent::DeferredDestroy
337:  AddRef 11 ImageBridgeParent::ImageBridgeParent   ==> Release 17 ImageBridgeParent::DeferredDestroy
368:  AddRef 12 VRManagerParent::OnChannelConnected    ==> Release 22 VRManagerParent::ActorDealloc
383:  AddRef 13 VideoBridgeParent::VideoBridgeParent   ==> Release 5 VideoBridgeParent::ActorDealloc

414:  AddRef 14 CompositorManagerParent::Create        ==> Release 16 CompositorManagerParent::DeferredDestroy
444:  AddRef 15 ImageBridgeParent::ImageBridgeParent   ==> Release 15 ImageBridgeParent::DeferredDestroy
475:  AddRef 16 VRManagerParent::OnChannelConnected    ==> Release 21 VRManagerParent::ActorDealloc

490:  AddRef 17 CompositorManagerParent::Create        ==> Release 14 CompositorManagerParent::DeferredDestroy
520:  AddRef 18 ImageBridgeParent::ImageBridgeParent   ==> Release 13 ImageBridgeParent::DeferredDestroy
551:  AddRef 19 VRManagerParent::OnChannelConnected    ==> Release 20 VRManagerParent::ActorDealloc

616:  AddRef 17 CompositorManagerParent::Create        ==> Release 12 CompositorManagerParent::DeferredDestroy
646:  AddRef 18 ImageBridgeParent::ImageBridgeParent   ==> Release 11 ImageBridgeParent::DeferredDestroy

727:  AddRef 16 CompositorManagerParent::Create        ==> Release 7 CompositorManagerParent::DeferredDestroy
757:  AddRef 17 ImageBridgeParent::ImageBridgeParent   ==> Release 6 ImageBridgeParent::DeferredDestroy

788:  AddRef 18 CompositorManagerParent::Create
818:  AddRef 19 ImageBridgeParent::ImageBridgeParent   ==> Release 4 ImageBridgeParent::DeferredDestroy
849:  AddRef 20 VRManagerParent::OnChannelConnected    ==> Release 19 VRManagerParent::ActorDealloc

864:  AddRef 21 CompositorManagerParent::Create
894:  AddRef 22 ImageBridgeParent::ImageBridgeParent   ==> Release 3 ImageBridgeParent::DeferredDestroy
925:  AddRef 23 VRManagerParent::OnChannelConnected    ==> Release 8 VRManagerParent::ActorDealloc

1283: AddRef 6  CompositorThreadHolder::Shutdown       ==> Release 5 CompositorThreadHolder::Shutdown

1370: Release 2 RunnableFunction<CompositorThread.cpp:131:7>

And we end up with two non released CompositorThreadHolder references.

Looking at the problem from another angle, I could get RemoteDecoderManagerParent to prove it was registered two times more than it was deregistered.

Attachment #9222779 - Attachment description: WIP: Bug 1651133 - Double-check the build ID → Bug 1651133 - Double-check the build ID to avoid spurious about:restartrequired r?jld,glandium

So:

  • platform.ini is not to be used because update team is working to get rid of it
  • we can't leverage PR_LoadLibrary to load on-disk libxul and read its symbol: PR_LoadLibrary will see it is already loaded and do not dlopen()
  • there is code in the profiler / crashreporter to read symbols, but it is only meant for reading in memory, not on disk
  • mozglue/linker would require work to be used outside of mozglue, and to support macOS and Windows shared objects

I think we have exhausted the alternatives on the parent side.

We might be able to improve the situation but still have some (very rare?) edge cases if we try and make use if process exit code: we make sure we check parentBuildId in the child as early as possible (when we parse the command line maybe?) and we exit with a specific exit code:

So, I've got some better view on https://bugzilla.mozilla.org/show_bug.cgi?id=1712145 but we have nothing to distinguish "good" VS "bad" about:restartrequired ; do you think we should proceed with some extra telemetry info to gather data and see, actually, how much wrong about:restartrequired we really have VS legit ones?

And considering comment #14, such a temporary experiment would rather rely on platform.ini (I know), because fixing the way we get on-disk platform build ID is IMHO too much work for the moment, until we can justify there is actually a real problem with false positive buildid mismatches.

Flags: needinfo?(mh+mozilla)
Flags: needinfo?(jld)

From discussion earlier this morning on Matrix, it seems we can go forward to collect more data and see if a more complicated fix is really justified.

Flags: needinfo?(mh+mozilla)
Flags: needinfo?(jld)

Request for data collection review form

All questions are mandatory. You must receive review from a data steward peer on your responses to these questions before shipping new data collection.

  1. What questions will you answer with this data?

"Do we have actually false-positive of buildID mismatches in the wild"

  1. Why does Mozilla need to answer these questions? Are there benefits for users? Do we need this information to address product or business requirements? Some example responses:

about:restartrequired when there was no actual change to libxul or any update are getting in the way of users and might hide other issues

  1. What alternative methods did you consider to answer these questions? Why were they not sufficient?

Having a look at existing telemetry reveals something is happening but the current data available is not enough because we have no way yet to distinguish a genuine buildid mismatch from something else

  1. Can current instrumentation answer these questions?

Unfortunately, no.

  1. List all proposed measurements and indicate the category of data collection for each measurement, using the Firefox data collection categories found on the Mozilla wiki.

Note that the data steward reviewing your request will characterize your data collection based on the highest (and most sensitive) category.

Measurement Description: Frequency of false-positive buildID mismatches
Data Collection Category: Category 1
Tracking Bug #: Bug 1651133
  1. Please provide a link to the documentation for this data collection which describes the ultimate data set in a public, complete, and accurate way.

This collection is documented in its definitions files Histograms.json, Scalars.yaml, and/or Events.yaml and in the Probe Dictionary at https://probes.telemetry.mozilla.org.

  1. How long will this data be collected? Choose one of the following:

Collecting should only be needed for a few releases to get enough data to make a decision.

  1. What populations will you measure?
  • Which release channels?

Release at least, but we are interested in all

  • Which countries?

All

  • Which locales?

All

  • Any other filters? Please describe in detail below.

None

  1. If this data collection is default on, what is the opt-out mechanism for users?

Default opt-out mechanism for opt-out telemetry probes.

  1. Please provide a general description of how you will analyze this data.

This telemetry probe maintains the ability to detect the frequency of real buildID mismatches between parent and child processes versions versus false-positive that would arise because of a very early crash.

  1. Where do you intend to share the results of your analysis?

Internally, to know if the amount of work required to perform some changes can be justified by an actual problem users are facing.

  1. Is there a third-party tool (i.e. not Telemetry) that you are proposing to use for this data collection? If so:

No

Flags: needinfo?(chutten)

In future please attach Data Review Requests as files and use the data-review? flag since it fits Data Stewards' workflows better. I've gone ahead and done that here.

Flags: needinfo?(chutten)
Attachment #9226208 - Flags: data-review?(chutten)

Comment on attachment 9226208 [details]
data collection review request

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

No. This collection will expire in Firefox 95.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does the data collection use a third-party collection tool?

No.


Result: datareview+

Attachment #9226208 - Flags: data-review?(chutten) → data-review+
Pushed by alissy@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/dd7070937b88
Double-check the build ID to avoid spurious about:restartrequired r=jld
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 91 Branch
Blocks: 1716774
Blocks: 1723098

So-far-so-good....
Have upgraded to v91.0 and now to v91.0.1 and have not as-yet been nagged with the 'updated in background, restart required' screen.

Well... the bug seems to be fixed and that message is now only being displayed when it actually SHOULD be.
It just popped-up a few min ago due the fact that I did indeed update firefox via slackpkg while it was running
therefore, the copy in memory was the previous version and the copy on HDD is the updated version.

Mon Aug 23 20:05:36 UTC 2021
xap/mozilla-firefox-91.0.1-x86_64-2.txz: Rebuilt.
Rebuilt with: --with-unsigned-addon-scopes=app,system --allow-addon-sideload

See Also: → 1730045
Regressions: 1732999
See Also: → 1754658
See Also: → 1653430
See Also: 1653430
Depends on: 1772219
Alias: spurious-about-restartrequired
Depends on: 1751861
See Also: → 1777404
See Also: → 1867289
Blocks: 1876590
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: