Bug 1675823 (Open) - Opened 4 years ago, updated 2 months ago

[meta] Yellow Screen of Death

Categories

(Toolkit :: Startup and Profile System, defect)


People

(Reporter: zbraniecki, Unassigned)

References

(Depends on 9 open bugs, Blocks 1 open bug)

Details

(Keywords: meta)

This is a tracking meta bug for issues related to the YSOD (Yellow Screen of Death) and the effort to mitigate it.

Assignee: nobody → gandalf
Status: NEW → ASSIGNED
Depends on: 418305, 1667772

I think Toolkit :: Startup* is perhaps a better component for this. The XML parser itself should fail if bogus data is given to it.

Component: XML → Startup and Profile System
Product: Core → Toolkit
Depends on: 1690438
Depends on: 1690444
Depends on: 1690445
Depends on: 1692244
Depends on: 1692231

In bug 1692231 the reporter provided us with STR (steps to reproduce) to get about:preferences, one of the main sources of YSODs, to reproduce locally!

That STR is not likely to be what causes YSODs in the wild, as it requires rapid page reloads. Those reloads cause the network layer to cancel the request, but the (empty) data is still fed to the parser for "last token" consumption; since that final token is empty, it causes the YSOD.

In bug 1692231 comment 4 I attached a stack trace of that behavior.

What that STR provides us, though, is a new avenue for reasoning about the potential causes of YSODs in the wild, so we may finally be able to estimate the full severity.

Preliminary analysis:

In that stack trace, we can see that nsInputStreamPump::OnInputStreamReady fires nsInputStreamPump::OnStateStop as a result of STATE_STOP being set. That results in nsParser::OnStopRequest being called, which in turn calls nsParser::ResumeParse, which calls nsExpatDriver::ConsumeToken with the final token. That token is empty, and that triggers the YSOD.
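
To make that chain concrete, below is a minimal standalone sketch using the plain expat C API (not Gecko's nsExpatDriver wrapper): parsing a document that never received any content and then delivering the empty final token produces the classic "no element found" error that the YSOD page renders.

    // Minimal sketch (plain libexpat, not nsExpatDriver): simulates a load
    // that was cancelled before delivering any data, so the parser only
    // ever sees the empty "final token".
    #include <expat.h>
    #include <cstdio>

    int main() {
      XML_Parser parser = XML_ParserCreate(nullptr);

      // No data ever arrives (the load was cancelled); OnStopRequest then
      // drives a final, zero-length parse call with isFinal = 1.
      if (XML_Parse(parser, "", 0, /* isFinal */ 1) == XML_STATUS_ERROR) {
        // Prints "no element found at line 1" -- the same error the YSOD
        // page renders for an empty XHTML document.
        std::printf("%s at line %lu\n",
                    XML_ErrorString(XML_GetErrorCode(parser)),
                    (unsigned long)XML_GetCurrentLineNumber(parser));
      }

      XML_ParserFree(parser);
      return 0;
    }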

STATE_STOP-driven hypothesis

A new hypothesis emerges: setting STATE_STOP for any reason will likely result in a YSOD if timed correctly. It is also a plausible explanation for the DTD-driven YSODs (cases where the XHTML loads correctly, but the associated DTD is empty, which results in missing entity references and triggers a YSOD; that kind of YSOD is the second most common after the empty-document one).

Scope implications

If that were the case, we would have to evaluate all uses of nsInputStreamPump, as any resource loaded via it may be susceptible to a timing problem if STATE_STOP were triggered during the load.
We know that this may cause YSODs in XHTML/XML, and it may explain the known DTD YSODs.

We should evaluate whether a timed STATE_STOP may be causing zero-byte loads of JavaScript, Properties, Fluent and other resources required for the UI to function.

Hypothesis validation

Installing a probe that fires a telemetry event on every STATE_STOP and enumerates which setter was used would allow us to validate that event as a predictor for YSODs, and to identify which setters are used in YSOD cases.

I'm concerned that this would generate a lot of traffic, though, so I think we should start with a smaller probe. Since Nightly does not fully overlap with beta/release, but is close enough for this purpose, I think the probe could be limited to the nightly channel and only collect STATE_STOP events on the following three documents (a minimal sketch of that filtering follows the list):

  • browser.xhtml
  • preferences.xhtml
  • browser.dtd
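
A sketch of that filtering (C++20); the names here are illustrative, and the wiring into the actual telemetry recording API is elided:

    // Illustrative sketch only: filter STATE_STOP notifications down to the
    // nightly channel and the three documents of interest. The caller would
    // then record a telemetry event naming which setter fired STATE_STOP.
    #include <array>
    #include <string_view>

    bool ShouldRecordStateStop(std::string_view channel,
                               std::string_view uriSpec) {
      if (channel != "nightly") {
        return false;  // limit traffic: nightly-only for the first iteration
      }
      constexpr std::array<std::string_view, 3> kDocs = {
          "browser.xhtml", "preferences.xhtml", "browser.dtd"};
      for (std::string_view doc : kDocs) {
        if (uriSpec.ends_with(doc)) {  // match on the trailing file name
          return true;
        }
      }
      return false;
    }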

Once we gain insight into which setters of that state are causing it, we could narrow the probe down to only those setters and open it up to other documents to see what else may be affected.

I evaluated what Fluent does and it seems that synchronous Fluent does not use nsInputStreamPump (it uses nsLocalFile::OpenNSPRFileDesc and URLPreloader), while the asynchronous Fluent does use nsInputStreamPump.

Fluent will not fail if fed a zero-byte string, neither on parsing nor on applying translations. I want to explore this further, but it seems to indicate that Fluent is not only not reporting on that issue, it actually does not fail catastrophically in the case of such a timing problem.
Of course, that's of little consolation, since if an FTL file comes in empty we will just display a UI without strings, but at least if FTL were the only thing affected, Fluent itself would not make the situation worse.
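
The contrast between the two failure modes can be sketched abstractly. This is illustrative pseudologic, not Fluent's or the XML parser's actual code: XML well-formedness is all-or-nothing, so an empty document is fatal, while a recovering parser like Fluent's can treat empty input as a resource with zero entries.

    // Illustrative contrast (not real Gecko/Fluent code): how the two parser
    // philosophies react to a zero-byte input buffer.
    #include <optional>
    #include <string_view>
    #include <vector>

    struct Entry { std::string_view id, value; };

    // XML-style: well-formedness is all-or-nothing, so empty input is fatal
    // ("no element found") and the whole document is lost -> YSOD.
    std::optional<std::vector<Entry>> ParseXmlStyle(std::string_view input) {
      if (input.empty()) return std::nullopt;  // fatal: no root element
      // ... real parsing elided ...
      return std::vector<Entry>{};
    }

    // Fluent-style: the parser recovers per entry, so empty input is simply
    // a resource with zero entries; the UI renders, just without strings.
    std::vector<Entry> ParseFluentStyle(std::string_view input) {
      (void)input;  // per-entry recovery elided; nothing to do for ""
      return {};    // empty resource, not an error
    }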

From the DTD/XHTML telemetry we know that it is not the case that all resource loads fail within a single load, and that it is possible for the XHTML to load properly while the DTD fails. That indicates that the problem is most likely accrued: the more resources are loaded to display the UI, the higher the chance that one of them will fail.
If it's a Fluent one, those strings will be missing (though other Fluent resources loaded for the same UI load may work!) and it will not break the UI.
The same applies to CSS, I believe.

But if it is any of XHTML, DTD, Properties, or JavaScript, we can expect a fully broken UI, since Properties error out when missing, JS breaks, and DTD/XHTML cause a YSOD.
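
To make the accrual intuition concrete, here is a back-of-the-envelope sketch. The per-load failure probability below is a made-up illustration, not a measured value; the point is that P(at least one failure) = 1 - (1 - p)^n grows quickly with the number of loads n.

    // Back-of-the-envelope sketch: probability that at least one of n
    // independent resource loads fails, given a per-load failure
    // probability p. The numbers are illustrative, not measured telemetry.
    #include <cmath>
    #include <cstdio>

    int main() {
      const double p = 0.001;  // assumed per-load failure rate (0.1%)
      const int counts[] = {1, 10, 50, 100};
      for (int n : counts) {
        double atLeastOne = 1.0 - std::pow(1.0 - p, n);  // 1 - (1-p)^n
        std::printf("n = %3d loads -> P(any failure) = %.2f%%\n",
                    n, atLeastOne * 100.0);
      }
      return 0;
    }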

Now the question is whether Properties/JS use async I/O.

Depends on: 1693146

The good news is that the new probe that landed in bug 1693146 returned results! That's awesome!
The bad news is that it returned 1.5 million events on Nightly in one day. That's a lot.

That's a huge body of new data to analyze for potential problems, many of which may be unrelated to YSODs!

We can use that day (2021-02-18) to analyze correlations and data across the broadest spectrum (the volume and some popular URLs may be worth investigating!), but for our use case I think we should narrow down the probe. I'll do that in bug 1693708.

I filed bug 1693711 to separately analyze all the data collected today.

Depends on: 1693708
See Also: → 1693711

:nalexander was able to reproduce the YSOD on Windows, Nightly, browser.xhtml, with the UNDEFINED_ENTITY error type (indicating a load error on the DTD).

He said he received an update, and the browser then started loading with a YSOD.
The update may be a red herring, because we know that YSODs happen consistently on release outside of the release cadence, but since the problem is likely multi-causal, maybe the update mechanism can cause it?

Or maybe the update mechanism is related to DTD errors, but not to XHTML errors?

When you have a moment, can you document your experience? It's the first capture of an engineer experiencing it so it would be great to have it written down for us when we try to reason about where to put the next probe :)

Flags: needinfo?(nalexander)
Depends on: 1694067
Depends on: 1694175

> When you have a moment, can you document your experience? It's the first capture of an engineer experiencing it so it would be great to have it written down for us when we try to reason about where to put the next probe :)

I am on my Windows 10 laptop. I have had an update pending for some time that has been asking for elevation (in fact, since Monday, Feb 15th). On Friday, Feb 19th at 12:16 pm, I elevated to allow it to complete. It did so; I have the logs, and an update.status with success. When Firefox started, I got the undefined entity error in the screenshots.

I immediately contacted Zibi and started archiving things. I captured:

  • GreD
  • ProfD (in fact, Profiles in its entirety)
  • UpdRootD (in fact, updates in its entirety)
  • profiles.ini and installs.ini

I verified that this reproduced a number of times on Friday before leaving it until Monday. On Monday, I started isolating causes.

First, I verified that the complete MAR was in fact what was on disk. Using mar.exe, unxz, and shasum, I concluded that the GreD looks as it should.

Second, I started playing with the profiles. Running firefox.exe --ProfileManager, creating a new ysod profile, and running that (either via Profile Manager or via --profile ...) did not reproduce the issue. But running the default profile, either without arguments or with --profile ..., continued to reproduce the YSOD.

Then I renamed the default profile and renamed the new, almost empty, ysod profile to the default profile name. This continued to reproduce the YSOD!

I then deleted the default profile directory and made a new, empty directory with the default profile name. This let Firefox start... and sadly, appears to have interrupted my ability to reproduce. When I move either the ysod or the default profile back into the default location, Firefox starts up. Further, unpacking my archived failing profile back into place no longer reproduces the failure.

This suggests that there's some property of the profile directory -- maybe ownership, permissions, or locks -- that affects the issue.

Flags: needinfo?(nalexander)
Depends on: 1694176

Further: I tried to reproduce the update to see if that was the issue. Specifically, I found the earlier build ID and fetched a complete MAR for that build ID. Then I unpacked it using mar -x ..., then unxzed the individual files. I did not put the random extra files I had in the GreD back in place, since they do not seem relevant. The nature of the bug meant that I had an update directory and state very close to what was needed to apply the update again. I unpacked the bad updates directory and modified update.status to be pending-service instead of succeeded. I unpacked the bad profile into place. I then ran Firefox and could not reproduce :(

I realized while typing that this procedure would not downgrade the version of the Mozilla Maintenance Service (MMS) to whatever was in place before the original bad update. This may be relevant, because I don't think I should have been prompted to elevate at all! So it seems that there is something funny with the MMS; I definitely see some "error 43" entries in the update logs, which basically means "the service didn't work". I will try to investigate that a little further, but it's very difficult to get much information out of that system.

I'm now back on my Windows 10 laptop, so I can include some additional details.

  • Based on the updates.xml file, I determined that the build ID from before the update was 20210211092822.
  • Based on running with --profile ... and inspecting about:support, I determined that the build ID after the update was 20210215162656.
  • Based on timestamps and various log entries, I determined that the update was downloaded on Monday, Feb 15th, but not applied until Friday, Feb 19th.
  • After the update, while the problem reproduced, I confirmed that the shasums in c:/Program Files/Firefox Nightly all agreed with the shasums of the complete MAR file.
  • The contents of active-update.xml after failure was something like:
    <?xml version="1.0"?>
    <updates xmlns="http://www.mozilla.org/2005/app-update">
      <update xmlns="http://www.mozilla.org/2005/app-update" appVersion="87.0a1" buildID="20210215162656" channel="nightly"   detailsURL="https://www.mozilla.org/en-CA/firefox/nightly/notes/" displayVersion="87.0a1" installDate="1613450417599" isCompleteUpdate="true" name="Nightly 87.0a1" previousAppVersion="87.0a1" promptWaitTime="43200" serviceURL="https://aus5.mozilla.org/update/6/Firefox/87.0a1/20210211092822/WINNT_x86_64-msvc-x64/en-CA/nightly/Windows_NT%2010.0.0.0.18363.1379%20(x64)/ISET:SSE4_2,MEM:16279/default/default/update.xml" type="minor" statusText="Install Pending" platformVersion="87.0a1">
        <patch size="62052389" type="complete" URL="https://archive.mozilla.org/pub/firefox/nightly/2021/02/2021-02-15-16-26-56-mozilla-central-l10n/firefox-87.0a1.en-CA.win64.complete.mar" errorCode="43" selected="true" state="pending-service" hashFunction="sha512" hashValue="82e00e551a426408e6d521ddd5a5763645bf7c365f2a04ca1fe47b93c518ab860adb6453e036dfe6ca8339ff902443b2aa7076450b4c4dfb0736383ce93a9f27" bitsId="{7A8B59A4-47A4-4F4A-BA05-EE95350D7C2C}" bitsResult="0"/>
      </update>
    </updates>
    
    Note the error code 43, which means something like the MMS failed.

Now, the things that I hoped would help us reproduce:

  • The contents of c:/Program Files/Firefox Nightly are zipped here. There's an updated directory which I can't explain yet -- perhaps the remnants of the failed staging that shows up in the logs? Perhaps I installed in this location at some point and it's just hanging around?
  • The contents of c:/ProgramData/Mozilla/updates are zipped here. Be aware that I have many, many versions of Firefox and that most of this is not relevant. The relevant path hash is 6F193CCC56814779. (Edit: just the relevant update directory is zipped here.)
  • The profile is sensitive so I've archived it but won't link to it publicly. In fact, I captured the contents of c:/Users/nalexander/AppData/Roaming/Mozilla/Firefox/Profiles, but I don't think the other profiles are relevant.
  • I did not capture profiles.ini and installs.ini early enough. A slightly later capture, after monkeying with a new ysod profile and (I think) toggling the "start automatically" option in the --ProfileManager, is zipped here.

:agashlin suggested I also capture the contents of c:/Program Files (x86)/Mozilla Maintenance Service, so I've done that, well after the fact, and it's archived here.

Depends on: 1693955
Depends on: 1694263
See Also: → 1695247

For anyone interested, bug 1695247 is a new meta bug counterpart to this one, focused on the network-level analysis.

Depends on: 1695560

Fenix YSOD bug: bug 1668842

Depends on: 1702937
Blocks: 1708675
Depends on: 1703346

Is there anything we can ask the reporter to help diagnose this?

Flags: needinfo?(zbraniecki)

I can just confirm that the issue still exists.

Flags: needinfo?(dtownsend) → needinfo?(zibi)

> Is there anything we can ask the reporter to help diagnose this?

Not that I know of. We need to diagnose it on our own. Mozilla employees have access to detailed telemetry that is observational and shows the state of the network and parser subsystems around the YSOD event, plus the event itself.

Everything we've seen so far, at least until I stopped working on this, indicates subtle cache corruption between subsystems, maybe in nsIJAR? I don't think any in-the-wild user affected by it can really provide us with additional details from a release build.

The main value of collecting in-the-wild observations is to rule out potential narrowing assumptions, such as that the YSOD happens only on one platform (it does not), that it happens only on Nightly (it does not), or that it affects only langpacks or repacks (it does not).

Flags: needinfo?(zibi)
See Also: → 1773504
Depends on: 1696551
Assignee: zibi → nobody
Status: ASSIGNED → NEW
Depends on: 1880317
Depends on: 1639821
See Also: 1639821
Depends on: 1920122