Closed Bug 1529016 Opened 5 years ago Closed 5 years ago

Consider understanding the unexpected files insides the origin directories

Categories

(Core :: Storage: Quota Manager, enhancement, P2)

enhancement

Tracking

()

RESOLVED FIXED
mozilla69
Tracking Status
firefox69 --- fixed

People

(Reporter: tt, Assigned: tt)

References

Details

Attachments

(2 files, 1 obsolete file)

We've landed some fixes based on the reports on Bugzilla. However, the number of having unexpected files on the origin directories goes up and down recently for no visible reasons [1].

We are considering the fix this issue a bit more aggressive rather than just waiting for files bugs from users. So, at least, inside origin directories, all the files are managed by the QuotaManager and it only allows file on the allow list. In this level, it shouldn't have a file with a sensitive filename (e.g. filename with origins or something). Unexpected files are expected to be created by OS or anti-virus application.

Therefore, maybe collected filename on this level (inside the origin directory) and only on the Nightly channel can work. If we can do that, that will help us fix this issue efficiently.

[1] https://sql.telemetry.mozilla.org/queries/61236/source#157855

Hey Chris,

We are considering getting the filenames of unexpected files in the origin directories (For instance, $profile/storage/default/https+++foo.com/UNEXPECTED_FILE).

In these directories, QuotaManager only allows or expect files like: clients' folder (e.g. idb/, cache/, ...), metadata files (e.g. .metadata, .metadata-v2), OS metadata (e.g. Desktop.ini, ...), and dot-files (e.g. .xxx). And, if there is any file aside from this allow-list, then it would just block the process of initialization. We haven't been bother with these for a long while.

Recently, [1] shows that we got more errors about unexpected files on origin directories. Since we haven't landed any relative code in to m-c, I believe these files are just some files haven't been included in our allow-list (instead of a regression).

Our current approach for dealing with this is a bit passive, we adjust the allow-list based on the reports on Bugzilla. I wonder if it's possible to take a little more aggressive approach: collecting these file name on Nighlty only for few versions.

So, do you have any suggestion about collecting these data (filename)? We lean to recokn that there shouldn't have any file with a sensitive filename in this level. I believe most of name we will get is OS-created files (e.g. thumb.db) which should be Category 1.

Furthermore, I don't think there would have sensitive filenames, but if you are worried about this, would you recommend collecting filename in the origin directories on Nightly only by Prio? Thanks!

[1] https://sql.telemetry.mozilla.org/queries/61236/source#157855

Flags: needinfo?(chutten)

Prio encoding at this stage is only capable of recording boolean information, and reporting aggregate counts of the boolean values from the population. Recording unknown strings isn't one of its capabilities at the present time, so I don't think it'd be the right tool for the job.

If we have a list of potential candidate os-created file names, let's write them down and count them. (Maybe as a categorical histogram). If we're still left with too many unknowns we can broach the question of whether the names of files in that directory can be reported, and how we'd report them.

Or, if we'd like to skip right to that, we can pose a question to Marshall about which Category of collection file names in that obscure technical folder would be.

Flags: needinfo?(chutten)

(In reply to Chris H-C :chutten from comment #2)
Thanks for the reply!

Prio encoding at this stage is only capable of recording boolean information, and reporting aggregate counts of the boolean values from the population. Recording unknown strings isn't one of its capabilities at the present time, so I don't think it'd be the right tool for the job.

I see, so the Prio is not an option for now.

If we have a list of potential candidate os-created file names, let's write them down and count them. (Maybe as a categorical histogram). If we're still left with too many unknowns we can broach the question of whether the names of files in that directory can be reported, and how we'd report them.

The thing is that we've put all reported or knowing os-created file in our allow-list and we still got errors for unexpected files on different directories.

So, I guess you are right. My question should be "whether the names of files in the origin directory can be reported?". Personally, I think we can, because:

  1. These directories are managed by QuotaManager, putting a self-defined file would just break the functionality of the QuotaManager. Users cannot get any benefit, so I don't think users intend to do that.
  2. If it's on the parents level of the origin directories, there are more chances to get the names of origin. However, in the origin directories level, we shouldn't get these data.

But, I have to also admit that there is a rather small chance to get a sensitive filename.

Or, if we'd like to skip right to that, we can pose a question to Marshall about which Category of collection file names in that obscure technical folder would be.

Could you point me out where should raise the discussion or just in this bug would be okay? Thanks!

For raising questions with Marshall a needinfo on a bug with a self-contained summary background+question comment usually does the trick.

I wonder, though, if maybe we should look into what we'd do with the data. If we received perfect data that half of the problematic files were path/to/filename.ext... what then? What actions would we take?

What if it were only 1/4 of the problematic files? 1% of them?

Will our actions be different if we can intuit from the filename what its function is? Or will the response always be "whitelist anything that comes up"?

( I ask because sometimes we end up learning that a no-knowledge solution would be roughly equivalent to a perfect-knowledge solution so we can skip the phase of data collection, collation, and analysis and jump straight to solving the problem )

(In reply to Chris H-C :chutten from comment #4)

For raising questions with Marshal a needinfo on a bug with a self-contained summary background+question comment usually does the trick.

Got it! Thanks!

I wonder, though, if maybe we should look into what we'd do with the data. If we received perfect data that half of the problematic files were path/to/filename.ext... what then? What actions would we take?

If the filename doesn't make any sense (e.g. abc.ext) and its counter is quite low, we would consider removing them while initialization.
If the filename is probably created by the OS, we would add them into our allow-list.
I think, if there is a filename and it's counted myriad times, we would consider ignoring these files since that probably means a lot of users are doing that and we should maybe just ignore that.

What if it were only 1/4 of the problematic files? 1% of them?

The number of these unexpected files with the same name might be a factor for us to make the decision. And, that would be useful for us to check if it's a regression or not.

Will our actions be different if we can intuit from the filename what its function is? Or will the response always be "whitelist anything that comes up"?

Yes, basically, our major problem now is that we are not sure if we can either just ignore, remove while finding them if we don't know their name.
If we take an aggressive approach (always remove the unexpected files):

  • we might drop our performance if these are os files.

If we just ignore all the unexpected files,

  • we might not be able to find the problem if these are created by us accidentally.
  • some users might complain about Firefox eats too much space while QuotaManager thinks it's okay (since we ignore these files).

( I ask because sometimes we end up learning that a no-knowledge solution would be roughly equivalent to a perfect-knowledge solution so we can skip the phase of data collection, collation, and analysis and jump straight to solving the problem )

It would be really helpful if we can collect data before making a decision even collecting them just for few releases. Also, we can apply the allow-list to different directories (e.g. repositories directories, this level contains files with origin names).

Great, thank you for so fully answering my questions. Let's see if I can craft a good Marshall comment :)

Marshall, we're looking for a decision on what Data Collection Category filenames within an internal, obscure Firefox folder would be. There should be no reason for anyone to put anything there on purpose unless by some automated means, but there's also nothing preventing someone from putting something meaningful there.

We need to know the names of these files and the frequencies they appear to learn if it would be better to ignore or to wipe out these files. If we can't figure that out, we'd have to treat them the same: either ignoring everything (in which case we might accidentally balloon our own storage without limit) or removing everything (in which case we might be fighting with an automated process about who can use the disk fastest).

If this ends up being Cat3 or above, can you think of a mitigation that would be acceptable to satisfy these requirements?

Flags: needinfo?(merwin)

Chris, I can't think of a mitigation here that would help. In similar situations in the past, we use a white list to only collect non-sensitive fields. But if I understand this bug correctly, if we knew this correct whitelist, we wouldn't need the collection in the first place.

This does though strike me as category 1, with some very small (or possibly non-existent) possibility that the data is category 3. In that case, it seems reasonable to me to collect the data and it is within policy because there is no practical risk. If our assumptions are wrong, we could then make a correction and stop the collection.

Tom, can you plan on revisiting this once you have the data just to confirm that it is category 1 as expected?

Flags: needinfo?(merwin) → needinfo?(shes050117)

(In reply to Merwin from comment #7)

Chris, I can't think of a mitigation here that would help. In similar situations in the past, we use a white list to only collect non-sensitive fields. But if I understand this bug correctly, if we knew this correct whitelist, we wouldn't need the collection in the first place.

This does though strike me as category 1, with some very small (or possibly non-existent) possibility that the data is category 3. In that case, it seems reasonable to me to collect the data and it is within policy because there is no practical risk. If our assumptions are wrong, we could then make a correction and stop the collection.

Tom, can you plan on revisiting this once you have the data just to confirm that it is category 1 as expected?

Thanks for the information! I'll definitely revisit this once I get the data.

My plan is only collect filename of unexpected files in the origin directories on Nightly Channel. Note that these filenames shouldn't contain information related to origin. Let's only collect that for 69 as our first baby step if it's possible. Then, I will discuss the result on our team meeting and make a decision if we want to keep collecting them in the future.

I'll work on a patch as soon as possible.

Flags: needinfo?(shes050117)

IIUC of the document, there seem to have two ways to collect unknown strings. One is Scalar (kind string), but a user might have multiple unexpected files on origin directories of their own profile. Another seems to be keyed Scalar (kind uint/boolean), but I'm not so sure if there will be more than 100 different filenames.

Chris, do you happen to know if there is another way to get unknown strings by Scalar/Histograms? Thanks in advance!

Flags: needinfo?(chutten)

Your understanding is correct: we don't make it easy to collect batches of unknown strings in Telemetry.

Keyed scalars (uint or bool) might be the best way to transport this information, as awkward as that may be.

Flags: needinfo?(chutten)

(In reply to Chris H-C :chutten from comment #10)

Your understanding is correct: we don't make it easy to collect batches of unknown strings in Telemetry.

Keyed scalars (uint or bool) might be the best way to transport this information, as awkward as that may be.

Thanks! I decided to try the keyed scalars (uint).

Attached file request-bug1529016.md
Attachment #9071809 - Flags: data-review?(chutten)
Comment on attachment 9071809 [details]
request-bug1529016.md

DATA COLLECTION REVIEW RESPONSE:

    Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes. This collection is Telemetry so is documented in its definitions file [Scalars.yaml](https://hg.mozilla.org/mozilla-central/file/tip/toolkit/components/telemetry/Scalars.yaml) and the [Probe Dictionary](https://telemetry.mozilla.org/probe-dictionary/).

    Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

    If the request is for permanent data collection, is there someone who will monitor the data over time?

No. This collection will expire in Firefox 70.

    Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical. (Note prior discussion in e.g. comment#7)

    Is the data collection request for default-on or default-off?

Default on for pre-release channels only.

    Does the instrumentation include the addition of any new identifiers?

No.

    Is the data collection covered by the existing Firefox privacy notice?

Yes.

    Does there need to be a check-in in the future to determine whether to renew the data?

Yes. :tt is responsible for renewing or removing the collection before it expires in Firefox 70.

---
Result: datareview+
Attachment #9071809 - Flags: data-review?(chutten) → data-review+

I'm going to split the patch into two pieces. One is implementation and another one is the test because I somehow cannot get Services.telemetry.getSnapshotForKeyedScalars("main", false)["parent"]; on platforms of try server (https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=252322639&repo=try&lineNumber=5742). I can get that on my Mac, though.

I will land the implementation part today.

Attachment #9071628 - Attachment description: Bug 1529016 - Collect unexpected file's name in the origin directories; → Bug 1529016 - P1 - Collect unexpected file's name in the origin directories;

Depends on D34721

Pushed by shes050117@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/0ebbb35e701b
P1 - Collect unexpected file's name in the origin directories; r=asuth,chutten
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla69
Blocks: 1560109
Blocks: 1561611
Attachment #9072705 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: