Capture last line in the YSOD telemetry
Categories
(Core :: XML, task)
Tracking
()
Tracking | Status | |
---|---|---|
firefox85 | --- | fixed |
People
(Reporter: zbraniecki, Assigned: zbraniecki)
References
(Blocks 1 open bug)
Details
Attachments
(2 files, 1 obsolete file)
47 bytes,
text/x-phabricator-request
|
Details | Review | |
2.55 KB,
text/plain
|
chutten
:
data-review+
|
Details |
In bug 1657242 and bug 1661646 we added basic YSOD telemetry from the XML Parser point of view.
This has lead to the zero-byte-length
hypothesis and in bug 1675205 we landed a new telemetry probe in the JARChannel to test it.
Unfortunately, the probe so far has not returned any results in Nightly, while we see a flow of ysod
events.
It may yet still turn out that we'll start seeing correlations, but I also want to look beyond that.
There are multiple reasons the jarchannel probe may not work:
- Maybe the issue is not in JARChannel but somewhere further down the logic between JARChannel and XML Parser
** Maybe the error is related to some cache that bypasses JARChannel - Maybe the string that the XML Parser receives is not actually empty but rather mangled/malformed. Maybe it's not decoded from zip?
To test that, I'm going to a new key to the ysod
probe to return the content of the last line of the document with error (mLastLine
).
This should not affect privacy as it is a Chrome document (so, part of UI) and no user data is encoded in the document at the time of parsing.
I'm also a bit suspicious on whether Telemetry will actually store the unicode properly, so I'm also going to capture the length of the last line.
This is imperfect since if the source is a malformed UTF, then likely mLastLine
will be constructed lossy here: https://searchfox.org/mozilla-central/source/parser/htmlparser/nsExpatDriver.cpp#1010
but let's first see what this turns back and then go deeper if needed.
Assignee | ||
Updated•5 years ago
|
Assignee | ||
Updated•5 years ago
|
Assignee | ||
Comment 1•5 years ago
|
||
Assignee | ||
Comment 2•5 years ago
|
||
Comment 3•5 years ago
|
||
Comment on attachment 9188395 [details]
data-review-ysod3.txt
PRELIMINARY NOTES:
It would be helpful to somewhere define that YSOD means Yellow Screen of Death and means the error page users are confronted with if a XML file has a syntax error in it someplace.
There appears to be a copy-pasta error of description for last_line_len
. Please ensure it's correct in the documentation, I will proceed as though it read "The length (in characters) of the content of the last line that led to the error, if present."
On expiry (Q7) it is important to explain whether the collection will expire on its own, or whether it is permanent. In this case it is permanent, and requires an individual be listed as responsible for this probe over its lifetime. Please provide one. I can't give data-review+
without one.
The last line of XML... is there any way that user-crafted strings might end up in it? Is this last line guaranteed to only ever be a string we shipped? Or at least likely that it's just a bitflip or accidental mutation of a string that we ship? I ask because if a user might put data in there, one of them might put something in it that we do not care to store in our data platform.
Assignee | ||
Comment 4•5 years ago
|
||
Thank you! Updated the review request document based on the feedback.
The last line of XML... is there any way that user-crafted strings might end up in it?
No.
Is this last line guaranteed to only ever be a string we shipped?
Yes. It is a document XML (kind of like an outline of the UI) shipped as an XML file.
Or at least likely that it's just a bitflip or accidental mutation of a string that we ship? I ask because if a user might put data in there, one of them might put something in it that we do not care to store in our data platform.
User has no power to put data into an XML file to be parsed as UI. It is like asking if the user can modify omni.ja and place their data in browser.xhtml
- it is not something that we would ever support in any way.
Comment 5•5 years ago
|
||
(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #4)
User has no power to put data into an XML file to be parsed as UI. It is like asking if the user can modify omni.ja and place their data in
browser.xhtml
- it is not something that we would ever support in any way.
This isn't about supporting something, I think it's about whether this can end up exfiltrating private data from a user's system by accident. I am concerned about that too (my r+ of the patch is purely technical), especially since this will end up being hit precisely in error conditions. For example, it's one thing to be accidentally reading the wrong file on a system and displaying it in the UI, but then sending it out over the network is something we should be worried about.
Assignee | ||
Comment 6•5 years ago
|
||
(In reply to Peter Van der Beken [:peterv] from comment #5)
(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #4)
User has no power to put data into an XML file to be parsed as UI. It is like asking if the user can modify omni.ja and place their data in
browser.xhtml
- it is not something that we would ever support in any way.This isn't about supporting something, I think it's about whether this can end up exfiltrating private data from a user's system by accident. I am concerned about that too (my r+ of the patch is purely technical), especially since this will end up being hit precisely in error conditions. For example, it's one thing to be accidentally reading the wrong file on a system and displaying it in the UI, but then sending it out over the network is something we should be worried about.
That makes sense. Thank you for the explanation!
I don't know how to escape this trap - we're hunting down a wide spread, highly impactful, and insanely hard to reproduce problem. We have not been successful in finding STRs or even ability to locally reproduce it by any of our engineers over years, and yet it hits tens of thousands of sessions a day.
We're trying to use telemetry to find out what is happening and we know that parser errors at line 1, column 1 of documents like browser.xhtml
saying that there is no root (it also does it in documents as simple as https://searchfox.org/mozilla-central/source/toolkit/components/thumbnails/content/backgroundPageThumbs.xhtml ). We now know that JARChannel sends at least 1 byte everywhere since we also added telemetry for that.
So this is my last attempt to see what data is sent in such scenarios. Is it empty? Is it binary blob (maybe unencoded zip)? Is it trimmed? Is it one byte?
If it's empty, we know the problem is between JARChannel and parser. If it is not empty, we know how to adjust our JARChannel probe to move to catching it there.
If we cannot land this probe, then I have no further ideas on what to test and where. :(
Therefore, I would like to ask for your help in finding an acceptable way to land it.
Here are my thoughts:
- How do we manage the risk vs value?
The risk described by Peter is technically possible, but would take a lot of coincidence to actually lead to any data being leaked. It is no different from reading random bytes from memory and retrieving user password.
In such scenario, the data will land in our telemetry and we'll be able to notice that and remove - are we to trust ourselves?
If not, does it mean that we just cannot risk such an unlikely scenario to gain the value we're after?
- How is it different from other keys?
There are other probes that theoretically can do the same - in an error scenario any variable can read any region of memory and retrieve some user data. I would say it is very unlikely to be done at random, but I agree that it is not impossible.
How do we allow for other Events.yaml keys to be stored, which can also can in an error scenario read user's system.
- Can we limit the scope to make it okay?
The question will likely be answered within a day or two of collecting the data as we'll get answers to what gets fed into the parser and which next step should we take.
Having the limited probe to only nightly or couple days is very limiting, but it is better than nothing, and the alternative I'm facing is nothing.
Would that alleviate the worry to tilt the risk/value ratio in the eye of the decision maker?
- What can we do to make it okay?
If that is not okay, do you have any ideas on what else can we do in this scenario? I'm facing a scenario of having to come back with no avenue to explore if we cannot understand how correct XHTML omni.ja documents lead to XML parser errors.
I'd appreciate your guidance and ideas
Comment 7•5 years ago
|
||
Luckily we have a process for this : ) I'll fill out the data collection review response form marking it data-review-
pending review from Privacy. I'll needinfo Emily Litka to OK your mitigations to permit the collection. Then, assuming all is good, I mark it data-review+
and you're good to go.
Comment 8•5 years ago
|
||
Comment on attachment 9188442 [details]
data-review-ysod3.txt
DATA COLLECTION REVIEW RESPONSE:
Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?
Yes.
Is there a control mechanism that allows the user to turn the data collection on and off?
Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.
If the request is for permanent data collection, is there someone who will monitor the data over time?
Yes, zbraniecki and vchin are responsible.
Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?
Potentially (though unlikely) Category 4, Highly Sensitive Data.
Is the data collection request for default-on or default-off?
Default on for all channels.
Does the instrumentation include the addition of any new identifiers?
No.
Is the data collection covered by the existing Firefox privacy notice?
Requesting clarification from Privacy.
Does there need to be a check-in in the future to determine whether to renew the data?
Yes. zbraniecki is responsible for renewing or removing the collection as soon as we can.
Result: datareview- pending Privacy's ok.
Comment 9•5 years ago
|
||
Hi all, I'm sorry - I'm having a hard time following what's happening here. Can you give me the high-level explanation of (a) what the issue is you're trying to solve force; (b) what data you're collecting, from who, and when; and (c) what mitigations your putting place to address the risk? I'm also happy to jump on a quick call, if that's easier.
Assignee | ||
Comment 10•5 years ago
|
||
(In reply to Emily Litka from comment #9)
No worries! It is a bit quirky situation! :)
- There is a major issue that we struggle to reproduce in developer environment
- The issue is seen as potential major churn factor
- I'm tasked to use telemetry to try to find out what may causing it
- The telemetry so far is telling us that
a) Parser is rejecting UI documents because they don't contain XML
b) I/O layer is sending non-empty files
The remaining question is:
a) Is Parser receiving empty files?
b) Is Parser receiving non-empty files that are somehow broken?
This patch adds to our telemetry the last line that the Parser sees when it errors out.
It should answer my question and in theory it should not be able to contain any user data, because it basically should look like this:
a) Empty: ""
b) Non Empty XML: "<?xml version="1.0"?>" which is the first line from https://searchfox.org/mozilla-central/source/browser/base/content/webrtcLegacyIndicator.xhtml
c) Non Empty mangled string like "Y29tcHJlc3NkIGRmIHNkZiBzZmQgcw=="
None of those should contain any user data as we feed documents from omni.ja to the parser without any per-user modifications.
In all likely hood I'd say that (a) is most likely, (c) is second and (b) is last.
Peter pointed out that since we are dealing with an error scenario, in theory the problem may be that we somehow fetch the wrong data and send it to the parser and in theory that data may be user identifiable.
So, in theory there's another option:
d) Non-empty string from users password file or users prefs.js or something like that
I'm claiming that the likelyhood of that happening is insanely low, like as if you pointed a gun at a building in a distance and accidentally hit a button in the elevator in the building.
It's not impossible, and attackers use that method, but they spend a lot of effort to calibrate the direction of the shot to hit that elevator button.
We're not even suspecting that we'll be shooting at a building, but in theory there may be one, and in theory we may end up hitting the button.
So, what I'm trying to figure out is how to decide on what can I do to get this information to proceed with my exploration and finally address the problem, and whether because of that theoretical reason we will not be able to learn what the last line of the parser is.
Comment 11•5 years ago
|
||
Hi, I discussed with Zibi. I'm fine with the additional telemetry with the following mitigations: (1) it'll first ship to nightly users; (2) Zibi will review the telemetry that comes in to ensure that there's no PI in it; (3) if there is, we'll delete the PI.
Assuming the nightly test goes well, I'm fine with the additional telemetry in production release. Additionally, and I'll defer to @chutten on this, but I believe that if we're adding new telemetry in production, we also need to update the public-facing telemetry documentation? Please let me know!
Thanks,
Emily
Comment 12•5 years ago
|
||
Oh yes, the documentation will be updated. From the Data Review Request:
This collection is documented in its definitions file Events.yaml and in the Probe Dictionary at https://probes.telemetry.mozilla.org
This happens through the change to the definitions file automatically being picked up by the Probe Info Service which the Probe Dictionary queries.
Comment 13•5 years ago
|
||
Clearing emily's ni?
for her. No outstanding requests of her remain.
Updated•5 years ago
|
Comment 14•5 years ago
|
||
Comment 15•5 years ago
|
||
bugherder |
Updated•5 years ago
|
Description
•