Open Bug 1773504 Opened 2 years ago Updated 1 year ago

Proposal: Silently retry YSOD-related XML Parsing errors once before displaying an error.

Categories

(Core :: XML, enhancement)

enhancement

Tracking

()

People

(Reporter: mhoye, Unassigned)

References

Details

This is quite possibly a naive proposal, and I want to apologize ahead of time for something that sounds as why-don't-you-just-y as this if this comes across as so obvious as to be insulting.

I would like to propose that, rather than showing an XML parsing error to the user, ever, that we:

  • in that moment of failure, capture what we believe the contents of that broken file to be somewhere
  • on the assumption that this is a caching error somewhere, invalidate relevant caches, and
  • silently retry, without showing the user an error.

One of the things iOS does, that is somewhat demoralizing from an engineering-correctness perspective but fantastic from a user-friendliness perspective is that it handles crashes with "bloop, let's restart."

I'd like to propose that we take that approach here: when this happens nobody should ever see it and we should do whatever it takes to not show the user, we should retry the parsing, and if that fails, we should bloop-and-restart the tab.

See Also: → 1639821

Telemetry-wise, I think we should distinguish between "xml parsing errors" and "user-visible errors" to evaluate success here, with the goal being to drive user-visible errors to zero.

(In reply to Mike Hoye [:mhoye] from comment #0)

  • in that moment of failure, capture what we believe the contents of that broken file to be somewhere

Just as a side-note: we're already capturing the last line that we're parsing in telemetry for a YSOD: https://searchfox.org/mozilla-central/rev/552bfc6334b797d92fb2ce8e93a6ace5f47cd56d/parser/htmlparser/nsExpatDriver.cpp#1078. What have we done with that data, and was it insufficient for what we tried to do with it?

What have we done with that data, and was it insufficient for what we tried to do with it?

You'd have to look at telemetry charts that I built (but don't have access to), but my recollection is that:

  1. For YSOD in line 0 - the last line is empty
  2. For YSODs in line 2553 (or similar) - the last line is binary garbage

Overall, we didn't see a useful "cut off" that would indicate broken stream. We saw empty lines and complete gibrrish stream of bytes.

I think we should step back a bit and stop focusing so much on the user-visible part of this. I think there are a couple of scenarios that can lead us to the user-visible XML parse errors, and depending on the cause these can also manifest themselves in other problems that aren't so visible but lead to an equally broken browser.

Here are some of the causes I can think of:

  1. errors in the XML files themselves (either main file or imported DTDs)
  2. corrupted on-disk omni.ja file
  3. inconsistencies between startup cache and omni.ja (a lot of anecdotal evidence for this, but we're not sure how it can lead to XML parse errors?)
  4. bugs in the cache reading/writing code

First of all, we have to be careful that restarting doesn't just end up creating an infinite restart loop (in particular for 1 and 2). Blowing away the caches won't help then, but maybe we can make, or suggest to, the user to redownload/reinstall? There's no question that 1 is clearly something for which we can use the XML parse error to detect it. I think 3 and maybe 4 are the scenarios where blowing away caches might help.

But here's why I want us to step back a bit from the very visible XML parse errors: for any case apart from 1 the XML parse errors are just one of the possible symptoms. If we have corrupted scripts or CSS (or images?) we wouldn't get an XML parse error, but could still end up with potentially catastrophic brokenness. Can we find better ways to detect omni.ja corruption and/or startup cache inconsistencies that are not relying on XML parse errors? And can we take remedial action at that point (blow away cache, restart, …)? Or do we think that the only way to detect the problems is through actual failures in all the different parsers?

It'd certainly be great if we can avoid showing the brokenness to the user and try to recover. But if we think the cause is primarily case 3 then I would really like us to recover from it regardless of whether it leads to an XML parse error or to other brokenness that's not so directly visible.

(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #3)

Overall, we didn't see a useful "cut off" that would indicate broken stream. We saw empty lines and complete gibrrish stream of bytes.

Right, my question was mostly about what collecting the broken file would tell us that we don't already know from the last line.

Right, my question was mostly about what collecting the broken file would tell us that we don't already know from the last line.

The only thing that comes to my mind is learning if in the "last line is garbage" scenario is the whole file stream garbage, or is normal file send until a point where it turns into random read (UAF?).

But also, we don't know what we don't know

I'm reluctant to think that we should continue to let his happen to our users while we figure out a perfect solution, if an imperfect solution that can be deployed quickly solves most and maybe all of the user-visible manifestation of the problem.

To my knowledge we've never managed to capture an example of on-disk file corruption in-situ on machines where this has manifested itself, and that plus the fact that a restart always seems to fix it suggests whatever the multiplicity of causes here might be, the overwhelming majority of them don't live on disk.

(In reply to Mike Hoye [:mhoye] from comment #7)

the fact that a restart always seems to fix it suggests whatever the multiplicity of causes here might be, the overwhelming majority of them don't live on disk.

Maybe I misunderstood, but from reading the thread on Slack that doesn't seem to be true: "It seems pretty persistent - killing the app and restarting doesn't seem to have made a difference.", https://bugzilla.mozilla.org/show_bug.cgi?id=1668842#c9 ("Force closing app and clearing cache doesn't seem to resolve it"), https://github.com/mozilla-mobile/fenix/issues/14700#issuecomment-691206803 ("Disabled all three add-ons and restarted phone. Still happening").

The thread that brought this up describes the problem on Android, which is pretty novel. On desktop the restart resolves it.

It is entirely possible that what we're seeing here is "consumer hardware and XML parsing standards are fundamentally antithetical" and that extremely rare transient errors that would be survivable and likely unnoticeable in any other data stream cause XML parsers to do exactly what they're designed to do, exactly what the spec says they should do: fail immediately and catastrophically on any error. I'm proposing, effectively, that we play the odds here and throw those dice again, just once.

(In reply to Mike Hoye [:mhoye] from comment #9)

The thread that brought this up describes the problem on Android, which is pretty novel. On desktop the restart resolves it.

Not necessarily (bug 1763899 is about Microsoft Store).

(In reply to Mike Hoye [:mhoye] from comment #7)

To my knowledge we've never managed to capture an example of on-disk file corruption in-situ on machines where this has manifested itself, and that plus the fact that a restart always seems to fix it suggests whatever the multiplicity of causes here might be, the overwhelming majority of them don't live on disk.

We definitely do, at least for failures to load JSMs at startup. When those failures start, the same user repeatedly crashes with the same corrupt file contents until they reinstall. And there's no reason to think it's any less likely for that kind of corruption to break an XML file than a JSM.

It's entirely possible that there's also memory corruption going on, and that trying to read and parse the file again may fix it in some cases. I'm open to try it, even though I'm not very optimistic. But at the very least, we should add telemetry that tells us how often it actually succeeds.

See Also: → 1675823
You need to log in before you can comment on or make changes to this bug.