Verify that we only show about:restartrequired in the presence of a build ID mismatch
Categories
(Core :: IPC, task)
Tracking
()
Tracking | Status | |
---|---|---|
firefox91 | --- | fixed |
People
(Reporter: mossop, Assigned: gerard-majax)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
Attachments
(3 files)
3.00 KB,
text/plain
|
Details | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
2.94 KB,
text/plain
|
chutten
:
data-review+
|
Details |
We've been seeing the odd report where about:restartrequired is shown when it seems like there should not have been a background application update (bug 1640304 is an example).
Looking over the logic I am wondering if a build ID mismatch is the only case that might lead us to show this UI. The flow seems to be:
- MessageChannel in the parent process defaults the flag mBuildIDsConfirmedMatch to false.
- The content process early in startup compares its build ID to that passed from the parent (via a command line argument), if they match it sends a message up to the parent saying so, if not the content process quits.
- When the parent receives the message it sets mBuildIDsConfirmedMatch to true.
x. If the content process crashes and mBuildIDsConfirmedMatch is false then we show about:restartrequired.
I may be missing other details but based on that it seems like there is a possibility for the content process to crash before sending the message and then the parent assuming the crash was because of a build ID mismatch.
Comment 1•3 years ago
|
||
One way this can currently happen on Linux if is things go sufficiently wrong when initializing GTK, for example by failing to connect to the display server; I ran into this today, so I can confirm that it is possible to get about:restartrequired
in a case that should have been considered a normal tab crash. Bug 1635451 will avoid that specific situation as a side-effect of not initializing GTK in most content processes, and that's platform-specific code so it doesn't apply to Mac/Windows/Android, but there are a few other things we do this early that could theoretically crash.
This is a little delicate, because if the build IDs don't match, the child process can't use normal IPC, and it's possible that we could someday have incompatibilities that result in a crash before the build ID check (e.g., serialized prefs). But maybe the parent process could re-read the build ID from the install directory (it seems to be in platform.ini
) after observing the crash; if it was a build mismatch, it should see the new ID at that point.
Comment 2•3 years ago
|
||
I decided to throw some code together while I was thinking about this. Tested by modifying ContentChild
to not send the BuildIDsMatch
message and using kill -11
; that causes about:restartrequired
without this patch and a normal tab crash page with it.
This could use a little cleanup, and ideally a test case.
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Comment 3•3 years ago
|
||
As much as I can tell, there's no testing at all on sending the oop-browser-buildid-mismatch
event. There's testing coverage that when it is sent, we do display about:restartrequired
though: https://searchfox.org/mozilla-central/rev/443f87caa5fadba920b0382e12874693d6c6133a/browser/base/content/test/tabcrashed/browser_shownRestartRequired.js#66-114
I'm still trying to look through ipc tests if we can find a way to cover that.
Assignee | ||
Comment 4•3 years ago
|
||
So, I have hacked something based on this patch and :jld
's comment #2, and it somehow worked, however the current implementation is rather unstable / flakky and thus unreliable. It would still seem to indicate this fix is working as intended.
Assignee | ||
Comment 5•3 years ago
|
||
It seems, however, I got some success using application.ini
but then failures. Changing its values does not seems to always reflect?
Assignee | ||
Comment 6•3 years ago
|
||
(In reply to Alexandre LISSY :gerard-majax from comment #5)
It seems, however, I got some success using
application.ini
but then failures. Changing its values does not seems to always reflect?
Seems like it was bad luck and the correct source of truth is https://searchfox.org/mozilla-central/source/__GENERATED__/toolkit/library/buildid.cpp#1
Assignee | ||
Comment 7•3 years ago
|
||
(In reply to Alexandre LISSY :gerard-majax from comment #6)
(In reply to Alexandre LISSY :gerard-majax from comment #5)
It seems, however, I got some success using
application.ini
but then failures. Changing its values does not seems to always reflect?Seems like it was bad luck and the correct source of truth is https://searchfox.org/mozilla-central/source/__GENERATED__/toolkit/library/buildid.cpp#1
Which is actually showing how much focused I was and completely missed application.ini
VS platform.ini
Assignee | ||
Comment 8•3 years ago
|
||
Assignee | ||
Comment 9•3 years ago
|
||
So, I have been able to build something on top of :jld
's patch and now the last question, since it seems to be running quite stable locally is whether we can rely on platform.ini
like that?
Comment 10•3 years ago
|
||
I'm not sure that I'm the right person to answer your question. I don't really know anything in particular about platform.ini
.
I tried taking a look at the test, and it looks like you are writing to platform.ini
to try to force an about:restartrequired
situation? I don't really know enough about how that page gets shown to know if that is a reasonable way to achieve this.
Assignee | ||
Comment 11•3 years ago
|
||
(In reply to Kirk Steuber (he/him) [:bytesized] from comment #10)
I'm not sure that I'm the right person to answer your question. I don't really know anything in particular about
platform.ini
.I tried taking a look at the test, and it looks like you are writing to
platform.ini
to try to force anabout:restartrequired
situation? I don't really know enough about how that page gets shown to know if that is a reasonable way to achieve this.
Thanks, I'll try to find someone who might know better the situation around platform.ini
, if you have any name to suggest, I'm interested.
From your answer, it might be that I was not clear enough though: this bug is explicitely about leveraging platform.ini
infos to be able to send a more accurate about:restartrequired
, so we are really interested in ensuring that reading from the file is a good thing to do.
Assignee | ||
Comment 12•3 years ago
|
||
(In reply to Kirk Steuber (he/him) [:bytesized] from comment #10)
I'm not sure that I'm the right person to answer your question. I don't really know anything in particular about
platform.ini
.I tried taking a look at the test, and it looks like you are writing to
platform.ini
to try to force anabout:restartrequired
situation? I don't really know enough about how that page gets shown to know if that is a reasonable way to achieve this.
Copying here the answer from :nalexander
on matrix: platform.ini
is reliable as long as we understand its limitations:
so platform.ini tells you something about the state of the application on disk, but it doesn't necessarily tell you what the version of libxul actually is.
We might also want to use some other source of truth because they are trying to get rid of those ini
files.
Assignee | ||
Comment 13•3 years ago
|
||
Also, I was unable to get the test to run correctly on Windows: GPU Process is hanging on some resources not free'd. As much as I could analyze, there is some CompositorThreadHolder
reference being held.
From refcount logging, I could get some data that shows the imbalance:
50: AddRef 1 CompositorThreadHolder::Start ==> Release 5 CompositorThreadHolder::Release
60: AddRef 2 VsyncBridgeParent::VsyncBridgeParent ==> Release 9 PVsyncBridgeParent::OnChannelClose
91: AddRef 3 CompositorManagerParent::Create ==> Release 18 CompositorManagerParent::DeferredDestroy
121: AddRef 4 ImageBridgeParent::ImageBridgeParent ==> Release 16 ImageBridgeParent::DeferredDestroy
152: AddRef 5 VRManagerParent::CreateForGPUProcess
182: AddRef 6 VRManagerParent::OnChannelConnected ==> Release 17 VRManagerParent::ActorDealloc
212: AddRef 6 CompositorManagerParent::Create ==> Release 16 CompositorManagerParent::DeferredDestroy
242: AddRef 7 ImageBridgeParent::ImageBridgeParent ==> Release 15 ImageBridgeParent::DeferredDestroy
272: AddRef 8 VideoBridgeParent::VideoBridgeParent ==> Release 10 VideoBridgeParent::ActorDealloc
292: AddRef 9 VRManagerParent::OnChannelConnected ==> Release 17 VRManagerParent::ActorDealloc
307: AddRef 10 CompositorManagerParent::Create ==> Release 18 CompositorManagerParent::DeferredDestroy
337: AddRef 11 ImageBridgeParent::ImageBridgeParent ==> Release 17 ImageBridgeParent::DeferredDestroy
368: AddRef 12 VRManagerParent::OnChannelConnected ==> Release 22 VRManagerParent::ActorDealloc
383: AddRef 13 VideoBridgeParent::VideoBridgeParent ==> Release 5 VideoBridgeParent::ActorDealloc
414: AddRef 14 CompositorManagerParent::Create ==> Release 16 CompositorManagerParent::DeferredDestroy
444: AddRef 15 ImageBridgeParent::ImageBridgeParent ==> Release 15 ImageBridgeParent::DeferredDestroy
475: AddRef 16 VRManagerParent::OnChannelConnected ==> Release 21 VRManagerParent::ActorDealloc
490: AddRef 17 CompositorManagerParent::Create ==> Release 14 CompositorManagerParent::DeferredDestroy
520: AddRef 18 ImageBridgeParent::ImageBridgeParent ==> Release 13 ImageBridgeParent::DeferredDestroy
551: AddRef 19 VRManagerParent::OnChannelConnected ==> Release 20 VRManagerParent::ActorDealloc
616: AddRef 17 CompositorManagerParent::Create ==> Release 12 CompositorManagerParent::DeferredDestroy
646: AddRef 18 ImageBridgeParent::ImageBridgeParent ==> Release 11 ImageBridgeParent::DeferredDestroy
727: AddRef 16 CompositorManagerParent::Create ==> Release 7 CompositorManagerParent::DeferredDestroy
757: AddRef 17 ImageBridgeParent::ImageBridgeParent ==> Release 6 ImageBridgeParent::DeferredDestroy
788: AddRef 18 CompositorManagerParent::Create
818: AddRef 19 ImageBridgeParent::ImageBridgeParent ==> Release 4 ImageBridgeParent::DeferredDestroy
849: AddRef 20 VRManagerParent::OnChannelConnected ==> Release 19 VRManagerParent::ActorDealloc
864: AddRef 21 CompositorManagerParent::Create
894: AddRef 22 ImageBridgeParent::ImageBridgeParent ==> Release 3 ImageBridgeParent::DeferredDestroy
925: AddRef 23 VRManagerParent::OnChannelConnected ==> Release 8 VRManagerParent::ActorDealloc
1283: AddRef 6 CompositorThreadHolder::Shutdown ==> Release 5 CompositorThreadHolder::Shutdown
1370: Release 2 RunnableFunction<CompositorThread.cpp:131:7>
And we end up with two non released CompositorThreadHolder
references.
Looking at the problem from another angle, I could get RemoteDecoderManagerParent
to prove it was registered two times more than it was deregistered.
Updated•3 years ago
|
Assignee | ||
Comment 14•3 years ago
|
||
So:
platform.ini
is not to be used because update team is working to get rid of it- we can't leverage
PR_LoadLibrary
to load on-disklibxul
and read its symbol:PR_LoadLibrary
will see it is already loaded and do notdlopen()
- there is code in the profiler / crashreporter to read symbols, but it is only meant for reading in memory, not on disk
mozglue/linker
would require work to be used outside ofmozglue
, and to support macOS and Windows shared objects
I think we have exhausted the alternatives on the parent side.
We might be able to improve the situation but still have some (very rare?) edge cases if we try and make use if process exit code: we make sure we check parentBuildId
in the child as early as possible (when we parse the command line maybe?) and we exit with a specific exit code:
- we could still have some false negatives
- we currently cannot get the exit code because of https://bugzilla.mozilla.org/show_bug.cgi?id=1658072
Assignee | ||
Comment 15•3 years ago
|
||
So, I've got some better view on https://bugzilla.mozilla.org/show_bug.cgi?id=1712145 but we have nothing to distinguish "good" VS "bad" about:restartrequired
; do you think we should proceed with some extra telemetry info to gather data and see, actually, how much wrong about:restartrequired
we really have VS legit ones?
And considering comment #14, such a temporary experiment would rather rely on platform.ini
(I know), because fixing the way we get on-disk platform build ID is IMHO too much work for the moment, until we can justify there is actually a real problem with false positive buildid mismatches.
Assignee | ||
Comment 16•3 years ago
|
||
From discussion earlier this morning on Matrix, it seems we can go forward to collect more data and see if a more complicated fix is really justified.
Assignee | ||
Comment 17•3 years ago
•
|
||
Request for data collection review form
All questions are mandatory. You must receive review from a data steward peer on your responses to these questions before shipping new data collection.
- What questions will you answer with this data?
"Do we have actually false-positive of buildID mismatches in the wild"
- Why does Mozilla need to answer these questions? Are there benefits for users? Do we need this information to address product or business requirements? Some example responses:
about:restartrequired
when there was no actual change to libxul
or any update are getting in the way of users and might hide other issues
- What alternative methods did you consider to answer these questions? Why were they not sufficient?
Having a look at existing telemetry reveals something is happening but the current data available is not enough because we have no way yet to distinguish a genuine buildid mismatch from something else
- Can current instrumentation answer these questions?
Unfortunately, no.
- List all proposed measurements and indicate the category of data collection for each measurement, using the Firefox data collection categories found on the Mozilla wiki.
Note that the data steward reviewing your request will characterize your data collection based on the highest (and most sensitive) category.
Measurement Description: Frequency of false-positive buildID mismatches
Data Collection Category: Category 1
Tracking Bug #: Bug 1651133
- Please provide a link to the documentation for this data collection which describes the ultimate data set in a public, complete, and accurate way.
This collection is documented in its definitions files Histograms.json, Scalars.yaml, and/or Events.yaml and in the Probe Dictionary at https://probes.telemetry.mozilla.org.
- How long will this data be collected? Choose one of the following:
Collecting should only be needed for a few releases to get enough data to make a decision.
- What populations will you measure?
- Which release channels?
Release at least, but we are interested in all
- Which countries?
All
- Which locales?
All
- Any other filters? Please describe in detail below.
None
- If this data collection is default on, what is the opt-out mechanism for users?
Default opt-out mechanism for opt-out telemetry probes.
- Please provide a general description of how you will analyze this data.
This telemetry probe maintains the ability to detect the frequency of real buildID mismatches between parent and child processes versions versus false-positive that would arise because of a very early crash.
- Where do you intend to share the results of your analysis?
Internally, to know if the amount of work required to perform some changes can be justified by an actual problem users are facing.
- Is there a third-party tool (i.e. not Telemetry) that you are proposing to use for this data collection? If so:
No
Comment 18•3 years ago
|
||
In future please attach Data Review Requests as files and use the data-review?
flag since it fits Data Stewards' workflows better. I've gone ahead and done that here.
Comment 19•3 years ago
|
||
Comment on attachment 9226208 [details]
data collection review request
DATA COLLECTION REVIEW RESPONSE:
Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?
Yes.
Is there a control mechanism that allows the user to turn the data collection on and off?
Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.
If the request is for permanent data collection, is there someone who will monitor the data over time?
No. This collection will expire in Firefox 95.
Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?
Category 1, Technical.
Is the data collection request for default-on or default-off?
Default on for all channels.
Does the instrumentation include the addition of any new identifiers?
No.
Is the data collection covered by the existing Firefox privacy notice?
Yes.
Does the data collection use a third-party collection tool?
No.
Result: datareview+
Comment 20•3 years ago
|
||
Pushed by alissy@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/dd7070937b88 Double-check the build ID to avoid spurious about:restartrequired r=jld
Comment 21•3 years ago
|
||
bugherder |
Comment 23•3 years ago
|
||
So-far-so-good....
Have upgraded to v91.0 and now to v91.0.1 and have not as-yet been nagged with the 'updated in background, restart required' screen.
Comment 24•3 years ago
|
||
Well... the bug seems to be fixed and that message is now only being displayed when it actually SHOULD be.
It just popped-up a few min ago due the fact that I did indeed update firefox via slackpkg while it was running
therefore, the copy in memory was the previous version and the copy on HDD is the updated version.
Mon Aug 23 20:05:36 UTC 2021
xap/mozilla-firefox-91.0.1-x86_64-2.txz: Rebuilt.
Rebuilt with: --with-unsigned-addon-scopes=app,system --allow-addon-sideload
Assignee | ||
Updated•2 years ago
|
Description
•