[meta] Yellow Screen of Death
Categories: Toolkit :: Startup and Profile System, defect
People: Reporter: zbraniecki, Unassigned
References: Depends on 9 open bugs, Blocks 1 open bug
Details: Keywords: meta
This is a tracking meta bug on issues related to YSOD and the effort to mitigate it.
Comment 1•4 years ago
I think Toolkit :: Startup* is perhaps a better component for this. The XML parser itself should fail if bogus data is given to it.
Comment 2•4 years ago (Reporter)
In bug 1692231 the reporter provided us with STR that reproduce, locally, the about:preferences YSOD, one of the main sources of YSODs!
That STR is not likely to be what causes YSODs in the wild, as it requires rapid page reloads. Those reloads cause the network layer to cancel the request, but the (empty) data is still fed to the parser for "last token" consumption, and consuming that empty last token causes the YSOD.
In bug 1692231 comment 4 I attached a stack trace of that behavior.
What that STR does give us, though, is a new avenue for reasoning about the potential causes of YSODs in the wild, so we may finally be able to estimate the full severity.
Preliminary analysis:
In that stack trace, we can see that nsInputStreamPump::OnInputStreamReady fires nsInputStreamPump::OnStateStop as a result of STATE_STOP being set. That results in nsParser::OnStopRequest being called, which in turn calls nsParser::ResumeParse, and that calls nsExpatDriver::ConsumeToken with the final token, which is empty, and that fires the YSOD.
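To make that failure mode concrete, here is a small standalone sketch against libexpat (the library nsExpatDriver wraps). This is illustrative only, not Gecko code; the helper, the labels and the sample documents are made up. Feeding an empty buffer as the final chunk yields "no element found", and a reference to an entity that should have come from a (now empty) DTD yields an undefined-entity error, matching the two YSOD classes discussed in this bug.

#include <expat.h>
#include <cstdio>
#include <cstring>

// Parse `doc` as a single, final chunk, the way the last token is handed to
// the expat driver when the request stops, and print the parser verdict.
static void parse_and_report(const char* label, const char* doc) {
    XML_Parser p = XML_ParserCreate(nullptr);
    XML_Status st = XML_Parse(p, doc, (int)std::strlen(doc), /*isFinal=*/1);
    if (st == XML_STATUS_ERROR) {
        std::printf("%s: XML error: %s\n",
                    label, XML_ErrorString(XML_GetErrorCode(p)));
    } else {
        std::printf("%s: parsed fine\n", label);
    }
    XML_ParserFree(p);
}

int main() {
    // Zero-byte document: produces "no element found", the most common YSOD.
    parse_and_report("empty document", "");
    // An entity that was supposed to be declared by an external DTD; if that
    // DTD load came back empty, the reference is undefined and parsing fails.
    parse_and_report("missing DTD entity", "<window>&someDtdEntity;</window>");
    return 0;
}

Built with something like g++ ysod_demo.cpp -lexpat, it prints the same error strings users see on a YSOD page.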
STATE_STOP-driven hypothesis
A new hypothesis emerges: setting STATE_STOP for any reason will likely result in a YSOD if timed correctly. It is also a plausible explanation for the DTD-driven YSODs (cases where the XHTML loads correctly but the associated DTD is empty, which results in a missing entity reference, and that triggers the YSOD; this kind of YSOD is the second most common after the empty document).
Scope implications
If that is the case, we would have to evaluate all uses of nsInputStreamPump, as any resource loaded via it may be susceptible to a timing problem if STATE_STOP were to be triggered during the load.
We know that this may cause a YSOD in XHTML/XML. We know it may explain the known DTD YSODs.
We should evaluate whether a timed STATE_STOP may also be causing zero-byte loads of JavaScript, Properties, Fluent and other resources required for the UI to function.
Hypothesis validation
Installing a probe that fires a telemetry event on every STATE_STOP and records which setter was used would allow us to validate that event as a predictor for YSOD, and to identify which setters are involved in the cases that end in a YSOD.
I'm concerned that this would generate a lot of traffic, though, so I think we could start with a smaller probe. Since nightly does not fully overlap with beta/release, but is close enough for this purpose, I think the probe could be limited to the nightly channel and could collect STATE_STOP events only on the following three documents:
- browser.xhtml
- preferences.xhtml
- browser.dtd
Once we gain insight into which setters of that state are causing it, we could narrow the probe down to only those setters and open it up to other documents to see what else may be affected.
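As a rough illustration of the proposed narrowing (not actual Gecko telemetry code; the RecordStateStopEvent helper, the setter labels and the channel flag are hypothetical), the filtering could look something like this:

#include <array>
#include <cstdio>
#include <string>

// Hypothetical stand-in for whatever telemetry call the real probe would use.
static void RecordStateStopEvent(const std::string& url, const std::string& setter) {
    std::printf("STATE_STOP probe: url=%s setter=%s\n", url.c_str(), setter.c_str());
}

static bool EndsWith(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Sketch: called wherever STATE_STOP gets set; `setter` names the code path
// that set it, so later analysis can narrow the probe to the offending setters.
static void MaybeRecordStateStop(const std::string& url, const std::string& setter,
                                 bool isNightlyChannel) {
    if (!isNightlyChannel) {
        return;  // limit volume: nightly channel only, as proposed above
    }
    static const std::array<const char*, 3> kDocs = {
        "browser.xhtml", "preferences.xhtml", "browser.dtd"};
    for (const char* doc : kDocs) {
        if (EndsWith(url, doc)) {
            RecordStateStopEvent(url, setter);
            return;
        }
    }
}

int main() {
    // Example invocations with hypothetical setter labels.
    MaybeRecordStateStop("chrome://browser/content/browser.xhtml", "OnStopRequest", true);
    MaybeRecordStateStop("chrome://global/skin/global.css", "Cancel", true);  // filtered out
    return 0;
}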
Comment 3•4 years ago (Reporter)
I evaluated what Fluent does, and it seems that synchronous Fluent does not use nsInputStreamPump (it uses nsLocalFile::OpenNSPRFileDesc and URLPreloader), while asynchronous Fluent does use nsInputStreamPump.
Fluent will not fail if fed a zero-byte string, neither on parsing nor on applying translations. I want to explore this further, but it seems to indicate that Fluent is not only not reporting the issue, it also does not fail catastrophically in case of such a timing problem.
Of course that's of little consolation, since if an FTL file comes in empty we will just display a UI without strings, but at least if FTL is the only thing affected, Fluent itself will not make the situation worse.
From the DTD/XHTML telemetry we know that it is not the case that all resource loads fail within a single UI load, and that it is possible for the XHTML to load properly while the DTD fails. That indicates that the problem is most likely accrued: the more resources are loaded to display the UI, the higher the chance that one of them will fail.
If the failing resource is a Fluent one, those strings will be missing (but other Fluent resources loaded for the same UI load may work!) and it will not break the UI.
Same with CSS, I believe.
But if it is any of XHTML, DTD, Properties or JavaScript, we can expect a fully broken UI, since Properties error out when missing, JS breaks, and DTD/XHTML cause a YSOD.
Now the question is whether Properties/JS use async I/O.
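Just to put a rough number on the "accrued" intuition above (the per-load failure probability and the resource counts here are made up, purely illustrative):

#include <cmath>
#include <cstdio>
#include <initializer_list>

int main() {
    const double p = 1e-5;  // hypothetical probability that a single resource load fails
    for (int n : {10, 100, 500}) {  // hypothetical number of resources per UI load
        // P(at least one of n independent loads fails) = 1 - (1 - p)^n,
        // which grows with the number of resources the UI pulls in.
        std::printf("n = %3d  ->  P(>=1 failure) = %.5f\n", n, 1.0 - std::pow(1.0 - p, n));
    }
    return 0;
}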
Comment 4•4 years ago (Reporter)
The good news is that the new probe that landed in bug 1693146 returned results! That's awesome!
The bad news is that it returned 1.5 million events on nightly in one day. That's a lot.
That's a huge body of new data to analyze for potential problems, many of which may be unrelated to YSOD!
We can use that day, 2021-02-18, to analyze correlations and data across the broadest spectrum (the volume and some popular URLs may be worth investigating!), but for our use case I think we should narrow down the probe. I'll do that in bug 1693708.
Comment 5•4 years ago (Reporter)
I filed bug 1693711 to separately analyze all the data collected today.
Comment 6•4 years ago (Reporter)
:nalexander was able to reproduce the YSOD on Windows, Nightly, browser.xhtml with the UNDEFINED_ENTITY error type (indicating a load error on the DTD).
He said he received an update, and the browser then started loading with a YSOD.
The update may be a red herring, because we know that YSODs happen consistently on release outside of the release cadence, but since the problem is likely multi-causal, maybe the update mechanism can cause it?
Or maybe the update mechanism is related to DTD errors, but not XHTML errors?
When you have a moment, can you document your experience? It's the first capture of an engineer experiencing it, so it would be great to have it written down for us when we try to reason about where to put the next probe :)
Comment 7•4 years ago
> When you have a moment, can you document your experience? It's the first capture of an engineer experiencing it, so it would be great to have it written down for us when we try to reason about where to put the next probe :)
I am on my Windows 10 laptop. I have had an update pending for some time that has been asking for elevation. (In fact, since Monday, Feb 15th.) On Friday, Feb 19th at 12:16 pm, I elevated to allow it to complete. It did so, I have the logs, and update.status reports success. When Firefox started, I got the undefined entity error in the screenshots.
I immediately contacted Zibi and started archiving things. I captured:
- GreD
- ProfD (in fact, Profiles in its entirety)
- UpdRootD (in fact, updates in its entirety)
- profiles.ini and installs.ini
I verified that this reproduced a number of times on Friday before leaving it until Monday. On Monday, I started isolating causes.
First, I verified that the complete MAR was in fact what was on disk. Using mar.exe, unxz, and shasum, I concluded that the GreD looks as it should.
Second, I started playing with the profiles. Running firefox.exe --ProfileManager, creating a new ysod profile, and running that (either via the Profile Manager or via --profile ...) did not reproduce the issue. But running the default profile, either without arguments or with --profile ..., continued to reproduce the YSOD.
Then I renamed the default profile and renamed the new, almost empty, ysod profile to the default profile name. This continued to reproduce the YSOD!
I then deleted the default profile directory and made a new, empty directory with the default profile name. This let Firefox start... and sadly, it appears to have interrupted my ability to reproduce. When I move either the ysod profile or the default profile back into the default location, Firefox starts up. Further, unzipping my archived failing profile no longer fails.
This suggests that there's some property of the profile directory -- maybe ownership, permissions, or locks -- that is impacting the issue.
Comment 8•4 years ago
Further: I tried to reproduce the update to see if that was the issue. Specifically, I found the earlier build ID and fetched a complete MAR for that build ID. Then I unpacked it using mar -x ..., and then unxz-ed the individual files. I did not put the random extra files I had in the GreD back in place, since they do not seem relevant. The nature of the bug meant that I had an update directory and state very close to what was needed to apply the update again. I unpacked the bad updates directory and modified update.status to be pending-service instead of succeeded. I unpacked the bad profile into place. I then ran Firefox and could not reproduce :(
I realized while typing that this procedure would not downgrade the version of the Mozilla Maintenance Service to whatever was in place before the original bad update. This may be relevant, because I don't think I should be prompted to elevate at all! So it seems that there is something funny with the MMS; I definitely see some "error 43" in the update logs, which is basically "service didn't work". I will try to investigate that a little further but it's very difficult to get much information out of that system.
Comment 9•4 years ago
I'm now back on my Windows 10 laptop, so I can include some additional details.
- Based on the updates.xml file, I determined that the build ID from before the update was 20210211092822.
- Based on running with --profile ... and inspecting about:support, I determined that the build ID after the update was 20210215162656.
- Based on timestamps and various log entries, I determined that the update was downloaded on Monday, Feb 15th, but not applied until Friday, Feb 19th.
- After the update, while the problem reproduced, I confirmed that the shasums in c:/Program Files/Firefox Nightly all agreed with the shasums of the complete MAR file.
- The contents of active-update.xml after the failure were something like the following. Note the error code 43, which means something like "the MMS failed".

<?xml version="1.0"?>
<updates xmlns="http://www.mozilla.org/2005/app-update">
  <update xmlns="http://www.mozilla.org/2005/app-update"
          appVersion="87.0a1" buildID="20210215162656" channel="nightly"
          detailsURL="https://www.mozilla.org/en-CA/firefox/nightly/notes/"
          displayVersion="87.0a1" installDate="1613450417599"
          isCompleteUpdate="true" name="Nightly 87.0a1"
          previousAppVersion="87.0a1" promptWaitTime="43200"
          serviceURL="https://aus5.mozilla.org/update/6/Firefox/87.0a1/20210211092822/WINNT_x86_64-msvc-x64/en-CA/nightly/Windows_NT%2010.0.0.0.18363.1379%20(x64)/ISET:SSE4_2,MEM:16279/default/default/update.xml"
          type="minor" statusText="Install Pending" platformVersion="87.0a1">
    <patch size="62052389" type="complete"
           URL="https://archive.mozilla.org/pub/firefox/nightly/2021/02/2021-02-15-16-26-56-mozilla-central-l10n/firefox-87.0a1.en-CA.win64.complete.mar"
           errorCode="43" selected="true" state="pending-service"
           hashFunction="sha512"
           hashValue="82e00e551a426408e6d521ddd5a5763645bf7c365f2a04ca1fe47b93c518ab860adb6453e036dfe6ca8339ff902443b2aa7076450b4c4dfb0736383ce93a9f27"
           bitsId="{7A8B59A4-47A4-4F4A-BA05-EE95350D7C2C}" bitsResult="0"/>
  </update>
</updates>
Now, things that I hoped would help us reproduce:
- The contents of c:/Program Files/Firefox Nightly are zipped here. There's an updated directory which I can't explain yet -- perhaps the remnants of the failed staging that shows up in the logs? Perhaps I installed in this location at some point and it's just hanging around?
- The contents of c:/ProgramData/Mozilla/updates are zipped here. Be aware that I have many, many versions of Firefox and that most of this is not relevant. The relevant path hash is 6F193CCC56814779. (Edit: just the relevant update directory is zipped here.)
- The profile is sensitive so I've archived it but won't link to it publicly. In fact, I captured the contents of c:/Users/nalexander/AppData/Roaming/Mozilla/Firefox/Profiles, but I don't think the other profiles are relevant.
- I did not capture profiles.ini and installs.ini early enough. A slightly later capture, after monkeying with a new ysod profile and (I think) toggling the "start automatically" option in the --ProfileManager, is zipped here.
Comment 10•4 years ago
:agashlin suggested I also capture the contents of c:/Program Files (x86)/Mozilla Maintenance Service, so I've done that, well after the fact, and it's archived here.
Comment 11•4 years ago (Reporter)
For anyone interested, bug 1695247 is a new meta bug, a counterpart to this one, focused on the network-level analysis.
Comment 12•4 years ago (Reporter)
We now have a report of YSOD in Fenix - https://www.reddit.com/r/firefox/comments/mg2ydx/firefox_for_android_suddenly_stopped_working/
Comment 13•4 years ago (Reporter)
The Fenix YSOD bug is bug 1668842.
Comment 14•3 years ago
Is there anything we can ask the reporter to help diagnose this?
Comment 15•3 years ago
I can confirm that the issue still exists.
Comment hidden (off-topic)
Comment hidden (off-topic)
Comment hidden (off-topic)
Comment 19•2 years ago (Reporter)
> Is there anything we can ask the reporter to help diagnose this?
Not that I know of. We need to diagnose it on our own. Mozilla employees have access to detailed observational telemetry that shows the state of the network and parser subsystems around the YSOD event, plus the event itself.
Everything we've seen so far, at least until I stopped working on this, indicates subtle cache corruption between subsystems, maybe in nsIJAR? I don't think any in-the-wild user affected by it can really provide us with additional details from a release build.
To me, the main value of collecting in-the-wild observations is ruling out potential narrowing assumptions, such as that the YSOD happens only on one platform (it does not), that it happens only on Nightly (it does not), or that it affects only langpacks or repacks (it does not).
Comment 20•2 years ago