Closed Bug 1422815 Opened 7 years ago Closed 5 years ago

Categories

(Core :: Storage: Quota Manager, defect, P2)

58 Branch
defect

Tracking


RESOLVED FIXED
mozilla68
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- wontfix
firefox57 --- wontfix
firefox58 --- wontfix
firefox59 --- wontfix
firefox60 --- wontfix
firefox61 --- wontfix
firefox62 --- wontfix
firefox66 --- wontfix
firefox67 --- fixed
firefox68 --- fixed

People

(Reporter: tw0517tw, Assigned: janv)

References

Details

(Keywords: crash, regression, Whiteboard: DWS_NEXT)

Crash Data

Attachments

(1 file, 1 obsolete file)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0
Build ID: 20171130160223

Steps to reproduce:

with Firefox version 58.0b8, go to https://bottender.js.org/docs/GettingStarted 


Actual results:

The browser crashes and asks to send a crash report.
https://crash-stats.mozilla.com/report/index/cad08ee2-38a4-4a59-81df-f87140171202#tab-details


Expected results:

The website loads successfully.
http://bit.ly/2zMerOQ lists the other crashes in 59 and 58, affecting Mac and Windows.
Status: UNCONFIRMED → NEW
Crash Signature: [@ mozilla::dom::cache::BodyMaybeUpdatePaddingSize]
Component: Untriaged → DOM
Ever confirmed: true
Keywords: crash, regression
OS: Unspecified → All
Product: Firefox → Core
Hardware: Unspecified → All
Tom may know what's going on here (looks to be related to padding size).
Flags: needinfo?(ttung)
Priority: -- → P2
I cannot reproduce the crash on my mac with Nightly 59.0a1, but it seems to be related to the quota object.

Skimming the reports, I found that all the crashes happen because we fail to get the quota object [1]. That could imply the originInfo either had not been created yet or was deleted for some reason.

I'll dig more this afternoon.

[1] https://hg.mozilla.org/releases/mozilla-beta/annotate/27dc691eb4a0/dom/cache/FileUtils.cpp#l300
Assignee: nobody → ttung
Status: NEW → ASSIGNED
Flags: needinfo?(ttung)
Another QuotaObject issue (both mQuotaObject and mJournalObject are nullptr) is Bug 1414737. It could be a similar issue.
I still have no idea why the quotaObject is missing while executing the cache put action. I suspect we fail to get the quotaObject because the originInfo was deleted due to eviction by QuotaManager.

In theory, QuotaManager tries to evict inactive origins when increasing usage would push the overall usage over the overall limit [1]. Specifically, QuotaManager determines whether an origin is active by checking whether the origin is holding a directory lock [2]. I guess users somehow hit this situation when either the cache manager didn't hold a directory lock or something went wrong while checking whether the origin holds one.

[1] https://searchfox.org/mozilla-central/source/dom/quota/ActorsParent.cpp#3097-3113
[2] https://searchfox.org/mozilla-central/source/dom/quota/ActorsParent.cpp#3464-3489
 
I'll keep investigating in the direction of my guess. Maybe I'll start with a test that makes eviction happen frequently enough to hit the assertion, and run it under rr chaos mode.

Until I find the root cause, I think we need a runtime check for the quotaObject to avoid accidentally dereferencing the null pointer [3].
[3] https://searchfox.org/mozilla-central/source/dom/cache/FileUtils.cpp#307
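The runtime check described above amounts to an early return before the quota object is used. The following is a hedged sketch, not the actual patch: the names GetQuotaObject and MaybeUpdatePaddingSize are simplified stand-ins for the real code in dom/cache/FileUtils.cpp, and the -1 error value is a placeholder for an nsresult.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Stand-in for mozilla::dom::quota::QuotaObject.
struct QuotaObject {
  bool MaybeUpdateSize(int64_t aSize, bool aTruncate) { return true; }
};

// Returns nullptr when the originInfo is missing, which is what happens in
// this bug when the group string recorded on disk no longer matches.
std::shared_ptr<QuotaObject> GetQuotaObject(bool aOriginInfoExists) {
  return aOriginInfoExists ? std::make_shared<QuotaObject>() : nullptr;
}

// On Nightly/Beta a MOZ_DIAGNOSTIC_ASSERT would fire before this check; on
// Release the early return below is what keeps us off the null pointer.
int MaybeUpdatePaddingSize(bool aOriginInfoExists, int64_t aPaddingSize) {
  std::shared_ptr<QuotaObject> quotaObject = GetQuotaObject(aOriginInfoExists);
  if (!quotaObject) {
    return -1;  // stand-in for returning NS_ERROR_UNEXPECTED
  }
  quotaObject->MaybeUpdateSize(aPaddingSize, /* aTruncate */ true);
  return 0;
}
```

The point of the guard is that failing to get the quota object becomes a recoverable error instead of a null dereference.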
Ben, this patch prevents us from accessing the quotaObject when we fail to get it, until I find the root cause. Could you help me review it? Thanks!

Furthermore, I'm thinking of uplifting this patch to beta. Does that sound reasonable?
Attachment #8934780 - Flags: review?(bkelly)
Hi tw0517tw,

It would be really nice if I could get more information from you, so that I can verify whether my guess in comment #5 is correct. Would you mind testing whether Firefox still crashes on that website after restarting it? Thanks!
Flags: needinfo?(tw0517tw)
Attachment #8934780 - Flags: review?(bkelly) → review+
(In reply to Tom Tung [:tt] from comment #6)
Note that this patch won't fix the issue; it just prevents release users from dereferencing the null pointer, since MOZ_DIAGNOSTIC_ASSERT is only enabled on Nightly and Beta.
Still crashes after restarting Firefox (Dev Edition 58.0b9). Getting this crash report:
https://crash-stats.mozilla.com/report/index/6cea3765-1979-4f9c-8613-764310171206

But I found that the website can be loaded in Private Browsing mode.
Flags: needinfo?(tw0517tw)
(In reply to tw0517tw from comment #10)
Thanks for the reply!

> Still crashes after restarting Firefox (Dev Edition 58.0b9). Getting this
> crash report:
> https://crash-stats.mozilla.com/report/index/6cea3765-1979-4f9c-8613-
> 764310171206

Okay, then my guess might be wrong. When you restart Firefox, it should create an originInfo for each origin while QuotaManager initializes origins.

> But I found that the website can be loaded in Private Browsing mode.

That's probably because service workers don't run in Private Browsing mode. The website uses a service worker to store opaque responses via the Cache API, so if the service worker isn't launched, you won't hit the same condition.
Getting a copy of the profile, or just its storage directories, might be the best way to investigate, if you are willing to share them privately with us.
I tried loading that storage folder on Linux; it doesn't crash there, but it always crashes on my Windows machine.
Thanks to tw0517tw, who shared their crash profile with us privately. I now understand why we hit this situation.

The group string obtained from QuotaManager is "js.org", but the group string from (DOM) Cache is "bottender.js.org". That's why we fail to get the quotaObject: there is no originInfo registered under the group "bottender.js.org".
https://publicsuffix.org/list/effective_tld_names.dat
See js.org. It's actually in the public suffix list.
I guess we have mismatched logic, so we took js.org as the eTLD+1.
(In reply to Tom Tung [:tt] from comment #14)
Note that the group string obtained by QuotaManager is read from the ".metadata/.metadata-v2" file, while (DOM) Cache gets the group string from the principal.

I'm still not sure why we wrote a different group string into ".metadata/.metadata-v2".
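The mismatch can be illustrated with a toy eTLD+1 computation. This is a simplified sketch, not Gecko's effective TLD service: the suffix set stands in for effective_tld_names.dat, and adding "js.org" to it flips the computed group from "js.org" to "bottender.js.org", exactly the two strings seen in this bug.

```cpp
#include <set>
#include <string>
#include <vector>

// Toy "group" computation: the group is the eTLD+1, i.e. the longest
// registered public suffix plus one more label. The suffix set is a
// stand-in for the real public suffix list.
std::string GroupFromHost(const std::string& aHost,
                          const std::set<std::string>& aSuffixes) {
  // Split the host into dot-separated labels.
  std::vector<std::string> labels;
  size_t start = 0;
  while (start <= aHost.size()) {
    size_t dot = aHost.find('.', start);
    if (dot == std::string::npos) {
      labels.push_back(aHost.substr(start));
      break;
    }
    labels.push_back(aHost.substr(start, dot - start));
    start = dot + 1;
  }
  // Find the longest suffix that is registered, then keep one extra label.
  for (size_t i = 0; i < labels.size(); ++i) {
    std::string candidate;
    for (size_t j = i; j < labels.size(); ++j) {
      if (j > i) {
        candidate += ".";
      }
      candidate += labels[j];
    }
    if (aSuffixes.count(candidate)) {
      if (i == 0) {
        return aHost;  // the whole host is itself a public suffix
      }
      return labels[i - 1] + "." + candidate;
    }
  }
  return aHost;
}
```

With a suffix set containing only "org", the group for "bottender.js.org" comes out as "js.org" (the string stored in the old metadata); once "js.org" is in the set, the group becomes "bottender.js.org" (the string the principal now produces).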
(In reply to Shawn Huang [:shawnjohnjr] from comment #13)
> I tried to load that storage folder on Linux, it doesn't crash but always
> crash on my windows.
Correction: I can hit the same problem on Linux.
(In reply to Tom Tung [:tt] from comment #16)
> Still not sure about why we write the different group's string into
> ".metadata/.metadata-v2".
(In reply to Shawn Huang [:shawnjohnjr] from comment #15)
> https://publicsuffix.org/list/effective_tld_names.dat
> See js.org. It's actually in the public suffix list.
> I guess we have mis-matched logic code. So we took js.org as eTLD+1.

This might explain it. Before the list was updated, we treated "js.org" as the group (since "org" was the eTLD) and stored it in ".metadata/.metadata-v2" in QuotaManager. After "js.org" was added to the list, we started treating "bottender.js.org" as the group.

Then, while Firefox initializes origins, QuotaManager creates the originInfo/groupInfo/groupInfoPairs based on the records in ".metadata/.metadata-v2". That makes the website fail to get the quota object from the originInfo/groupInfo/groupInfoPairs.

This scenario could happen on many websites.
Note: js.org was added to the list on Aug 17, 2016.
We discussed this at today's storage meeting with Jan.

I'll investigate more and consider the following options:
1. Recover the group from the origin -> but this needs to happen on the main thread
 - Maybe we could have a runtime recovery mechanism that runs during the initialization step while we are on the main thread.
2. Only update the metadata files whose group has changed.
 - Investigate what service we have for computing eTLD+1.
 - Check the notification time via OS functions.
3. Just delete all the metadata files -> but we may lose the persisted state
 - Add a new table to storage.sqlite for tracking eTLD list updates.
If I understood Tom correctly, this regression isn't new in 58; the defect has been in our code for longer, so I changed the firefox57 status to reflect the situation. Please feel free to correct me if I am wrong.
Updated the commit message.

I'm planning to land and uplift it this week.
Attachment #8934780 - Attachment is obsolete: true
Attachment #8936443 - Flags: review+
Keywords: leave-open
Comment on attachment 8936443 [details] [diff] [review]
[Final] Bug 1422815 - P1: Fix for failing to get quotaObject unexpectedly. r=bkelly

Landing the fix that avoids accessing the null pointer first.
Attachment #8936443 - Attachment description: Bug 1422815 - P1: Fix for failing to get quotaObject unexpectedly. r=bkelly → [Final] Bug 1422815 - P1: Fix for failing to get quotaObject unexpectedly. r=bkelly
Comment on attachment 8936443 [details] [diff] [review]
[Final] Bug 1422815 - P1: Fix for failing to get quotaObject unexpectedly. r=bkelly

This patch mainly prevents Release builds from accessing a null pointer.

I can reproduce this crash locally in Release 57 using the profile provided by the reporter.

Approval Request Comment
[Feature/Bug causing the regression]:
Well, I'm not sure which specific bug caused this. Two things combine to crash the browser.

First, we should be able to get a quotaObject after initializing the origin. However, in this case we can't, due to the group string mismatch.

The other is that Bug 1290481 added a MOZ_DIAGNOSTIC_ASSERT expecting the quotaObject to exist.

Thus, the assertion is hit and crashes the browser. The assert doesn't exist on Release, which means Release builds may access a null pointer in this condition. Therefore, this patch avoids touching the null pointer on Release 58+.

[User impact if declined]:

A crash when visiting a website that used IndexedDB/the Cache API before Firefox 57 and is visited again after Firefox 57, while the website is storing an opaque cached response.

[Is this code covered by automated tests?]:

No, but it's just an early return.

[Has the fix been verified in Nightly?]:

It can't be verified on Nightly or Beta, since the assert fires before the if-statement. But it should work on Release, where the if-statement prevents us from accessing the null pointer.

[Needs manual test from QE? If yes, steps to reproduce]:

No.
 
[List of other uplifts needed for the feature/fix]:

Only this one for now.

[Is the change risky?]:

I don't think so.

[Why is the change risky/not risky?]:

It's just an early return. I verified on my Mac that it no longer crashes, by disabling the assert and applying the patch.

[String changes made/needed]:

No.
Attachment #8936443 - Flags: approval-mozilla-beta?
I've put below the notes from my discussion with Jan, Andrew, and Shawn. Please correct me if I got something wrong.

I might need to:
- Do a minor upgrade of QuotaManager: add a column for the hash of etld.dat to storage.sqlite, and maybe update the group strings the first time.

- Each time QM initializes, if the hash of etld.dat has changed, check whether QM needs to update the group strings in the metadata files.

Others:
- Add a test for downgrade.
- Handle failures during the upgrade (e.g. write to a .tmp file, then finalize it).

Something that might be useful:
- https://searchfox.org/mozilla-central/source/netwerk/dns/prepare_tlds.py
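A minimal sketch of the hash-check idea from the notes above. All names here (Storage, MaybeUpgradeGroups, the in-memory map) are assumptions for illustration, not the real storage.sqlite schema: the idea is just to store a hash of the eTLD data and recompute the group strings only when the hash changes.

```cpp
#include <functional>
#include <map>
#include <string>

// Toy stand-in for the QM bookkeeping: group strings keyed by origin,
// plus the hash of the eTLD data they were computed against.
struct Storage {
  std::map<std::string, std::string> metadataGroups;  // origin -> group
  size_t etldHash = 0;
};

size_t HashEtldData(const std::string& aEtldData) {
  return std::hash<std::string>{}(aEtldData);
}

// Returns true if the group metadata had to be refreshed. aRecomputeGroup
// stands in for re-deriving the group string from the origin's principal.
bool MaybeUpgradeGroups(
    Storage& aStorage, const std::string& aEtldData,
    const std::function<std::string(const std::string&)>& aRecomputeGroup) {
  size_t currentHash = HashEtldData(aEtldData);
  if (currentHash == aStorage.etldHash) {
    return false;  // eTLD list unchanged; stored groups are still valid.
  }
  for (auto& entry : aStorage.metadataGroups) {
    entry.second = aRecomputeGroup(entry.first);
  }
  aStorage.etldHash = currentHash;
  return true;
}
```

The cheap hash comparison is what keeps the common-case startup fast: the per-origin recomputation only runs in the rare release where the shipped eTLD data actually changed.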
Comment on attachment 8936443 [details] [diff] [review]
[Final] Bug 1422815 - P1: Fix for failing to get quotaObject unexpectedly. r=bkelly

Avoid a crash. Beta58+.
Attachment #8936443 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
The issue in this bug is that the group string from the eTLD service was updated, but the one in the directory metadata file wasn't.

We should have a mechanism in QuotaManager to update the directory metadata for the group. However, it might take a lot of time during initialization to check whether the metadata for each group is up-to-date. Also, the current eTLD service optimization lives on the main thread, so we should consider whether getting this service from another thread (e.g. the I/O thread) is appropriate. Besides, we need a minor upgrade in QuotaManager to check and update all the group directories. I planned to use this opportunity to scan all the directories and remove the bad chrome origins created after QM schema 2_1. They don't break QM initialization, but we should remove them somewhere in the future.
Assignee: ttung → nobody
Status: ASSIGNED → NEW
take a look in a few weeks
Tom is back; I think he should take this bug again.

Basically, comment 26 is still valid, but I think we could also first try to get rid of the .metadata files and store the metadata in storage.sqlite. Maintaining our own file format is quite expensive, and I guess we will need to store and update more and more metadata in the future.
Andrew, what do you think?
Flags: needinfo?(shes050117)
Flags: needinfo?(jvarga)
Flags: needinfo?(bugmail)
Assignee: nobody → shes050117
Flags: needinfo?(shes050117)
Agreed.  I think we want to get to a future where on startup we check for an indication that QM had a clean shutdown.  If QM had a clean shutdown, all of the info we need can come from storage.sqlite without doing an initialization sweep of PROFILE/storage.  If there's indication of a dirty shutdown, we would then do the sweep.

And a good first step, as you propose, would be to move the contents of the metadata files into storage.sqlite.  We could maintain the existing initialization sweep, but be rid of the .metadata files.

I might also propose adding a telemetry histogram probe that tracks the duration of the initialization sweep, and landing it prior to this bug. That way we can see the speed-up from removing the .metadata files. We'd also have it record whether initialization succeeded, as a boolean.
Flags: needinfo?(bugmail)
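The clean-shutdown gate described above could be sketched as follows. All names here are assumptions for illustration; the real mechanism would live in QuotaManager's storage initialization, with the marker persisted on disk rather than in memory.

```cpp
// Sketch of the proposed clean-shutdown gate: a marker written at shutdown
// decides whether startup can trust storage.sqlite or must fall back to
// sweeping PROFILE/storage.
struct StorageState {
  bool cleanShutdownMarker = false;
  bool usageLoadedFromDb = false;
  bool sweepPerformed = false;
};

void Shutdown(StorageState& aState) {
  // Written only after all quota clients have flushed successfully.
  aState.cleanShutdownMarker = true;
}

void Startup(StorageState& aState) {
  if (aState.cleanShutdownMarker) {
    aState.usageLoadedFromDb = true;  // fast path: trust storage.sqlite
  } else {
    aState.sweepPerformed = true;     // slow path: sweep PROFILE/storage
  }
  // Clear the marker so a crash before the next clean shutdown forces a
  // sweep on the following startup.
  aState.cleanShutdownMarker = false;
}
```

Clearing the marker at startup is the key design point: any dirty exit automatically leaves it unset, so the next launch falls back to the full sweep without extra crash detection.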
Depends on: 1481716
(In reply to Andrew Sutherland [:asuth] from comment #33)
> Agreed.  I think we want to get to a future where on startup we check for an
> indication that QM had a clean shutdown.  If QM had a clean shutdown, all of
> the info we need can come from storage.sqlite without doing an
> initialization sweep of PROFILE/storage.  If there's indication of a dirty
> shutdown, we would then do the sweep.

I'm not sure I understand that correctly. What if files are accidentally modified (added or deleted) between a clean shutdown and the next startup? For instance, anti-virus software might delete a file under PROFILE/storage while Firefox isn't running; that won't happen frequently, but even after a clean shutdown there are corner cases that can change files. If we want to do that, we need to consider how to ensure the records in storage.sqlite are always correct.

> And a good first step, as you propose, would be to move the contents of the
> metadata files into storage.sqlite.  We could maintain the existing
> initialization sweep, but be rid of the .metadata files.
> 
> I might propose we might want to add a telemetry histogram probe that tracks
> the duration of the initialization sweep and land that prior to this bug. 
> That way we can see the speed-up from the .metadata file removals.  We'd
> also have it include whether initialization succeeded as a boolean.

That sounds like a good idea to me, so I filed bug 1481716.
(In reply to Tom Tung [:tt] from comment #34)
> (In reply to Andrew Sutherland [:asuth] from comment #33)
> > Agreed.  I think we want to get to a future where on startup we check for an
> > indication that QM had a clean shutdown.  If QM had a clean shutdown, all of
> > the info we need can come from storage.sqlite without doing an
> > initialization sweep of PROFILE/storage.  If there's indication of a dirty
> > shutdown, we would then do the sweep.
> 
> I'm not sure if I understand that correctly. What if files are accidentally
> modified (maybe be added or deleted) between a clean shutdown and the next
> open. For instance, an anti-virus software deletes a file under
> PROFILE/storage while Firefox isn't opening, though It won't happen
> frequently. Although it's a clean shutdown, there are a few corner cases
> that make files be changed. If we want to do that, we may need to consider
> how to make sure the record in the storage.sqlite is always correct. 

That's why we haven't done this yet; it's not a trivial task, and all the corner cases have to be covered, or at least mitigated.
There will be more and more origin directories, especially once we land the new LocalStorage implementation. So the performance optimization (avoiding the profile/storage sweep when possible) is inevitable.
(In reply to Tom Tung [:tt] from comment #34)
> I'm not sure if I understand that correctly. What if files are accidentally
> modified (maybe be added or deleted) between a clean shutdown and the next
> open. For instance, an anti-virus software deletes a file under
> PROFILE/storage while Firefox isn't opening, though It won't happen
> frequently. Although it's a clean shutdown, there are a few corner cases
> that make files be changed. If we want to do that, we may need to consider
> how to make sure the record in the storage.sqlite is always correct. 

I think it's important to keep in mind that the main goal of Quota Manager is to make sure an origin doesn't use more space than we allow it to.  If an anti-virus software quarantines or outright deletes a file under PROFILE/storage, QM over-estimating the quota in use by the origin isn't that big a problem as long as we have some mechanism for correcting our numbers, such as re-running the tally when the origin is opened.  And/or doing an entire counting sweep that doesn't block QM startup as part of a maintenance process like how we try and vacuum IDB databases monthly.

If something is adding files under PROFILE/storage, then I'd expect them to either be:
- file-manager artifacts like ".desktop" that we've been seeing, that are user intent and shouldn't be tracked for QM purposes
- The anti-virus putting a file it stole back.  If it puts it back before we notice, it's not a problem.  If it puts it back after, well, I would suggest that if we implement what I propose, we'd probably actually want to spend a release or two where we use the database and still do an initialization counting at startup to make sure they agree and report back telemetry that indicates if in fact files are disappearing or re-appearing a lot, etc.
- Anything else I think amounts to malware trying to hide files in our profile.  I'm not sure there's anything we can reasonably do about this other than report to the user that their computer seems haunted.

One thing that may also be worth looking into is whether we're working with file-systems that can efficiently tell us if a directory sub-tree has changed.  For example, on linux, btrfs may be able to efficiently let us know whether PROFILE/STORAGE has changed at all.  On something like Windows/NTFS, however, I think we might be limited to checking directory timestamps to know if their immediate contents have changed (but not the contents of their sub-directories).

All that said, I think these are our 2 big goals from a performance perspective:
1) For a given origin, especially a ServiceWorker, we want to reduce the latency of initial DOM Cache opening and IDB openings.  This argues against doing *blocking* origin counting when we first go to display a site for the origin.  I think however we could get away with counting the origin lazily after we've already opened it.
2) We want browser startup's time-to-visible about:newtab (or restored site) to be as short as possible. This argues against the initialization counting sweep that we currently do.
Thanks, Jan and Andrew, for explaining!

(In reply to Jan Varga [:janv] from comment #35)
> That's why we haven't done this yet, it's not trivial task, all the corner
> cases
> have to be covered or mitigated at least.
> There will be more and more origin directories, especially once we land new
> LocalStorage implementation. So the performance optimization (which avoids
> profile/storage sweep when possible) is inevitable.

I see!

(In reply to Andrew Sutherland [:asuth] from comment #36)
> I think it's important to keep in mind that the main goal of Quota Manager
> is to make sure an origin doesn't use more space than we allow it to.  If an

That's a really good point! I'll keep it in mind.

> anti-virus software quarantines or outright deletes a file under
> PROFILE/storage, QM over-estimating the quota in use by the origin isn't
> that big a problem as long as we have some mechanism for correcting our
> numbers, such as re-running the tally when the origin is opened.  And/or
> doing an entire counting sweep that doesn't block QM startup as part of a
> maintenance process like how we try and vacuum IDB databases monthly.
> 
> If something is adding files under PROFILE/storage, then I'd expect them to
> either be:
> - file-manager artifacts like ".desktop" that we've been seeing, that are
> user intent and shouldn't be tracked for QM purposes
> - The anti-virus putting a file it stole back.  If it puts it back before we
> notice, it's not a problem.  If it puts it back after, well, I would suggest
> that if we implement what I propose, we'd probably actually want to spend a
> release or two where we use the database and still do an initialization
> counting at startup to make sure they agree and report back telemetry that
> indicates if in fact files are disappearing or re-appearing a lot, etc.
> - Anything else I think amounts to malware trying to hide files in our
> profile.  I'm not sure there's anything we can reasonably do about this
> other than report to the user that their computer seems haunted.
> 
> One thing that may also be worth looking into is whether we're working with
> file-systems that can efficiently tell us if a directory sub-tree has
> changed.  For example, on linux, btrfs may be able to efficiently let us
> know whether PROFILE/STORAGE has changed at all.  On something like
> Windows/NTFS, however, I think we might be limited to checking directory
> timestamps to know if their immediate contents have changed (but not the
> contents of their sub-directories).
> 
> All that said, I think these are our 2 big goals from a performance
> perspective:
> 1) For a given origin, especially a ServiceWorker, we want to reduce the
> latency of initial DOM Cache opening and IDB openings.  This argues against
> doing *blocking* origin counting when we first go to display a site for the
> origin.  I think however we could get away with counting the origin lazily
> after we've already opened it.
> 2) We want browser startup's time to visible about:newtab (or restored site)
> being as short as possible.  This argues against the initialization counting
> sweep that we currently do.
Assignee: shes050117 → nobody
Whiteboard: DWS_NEXT

The leave-open keyword is there and there is no activity for 6 months.
:hsinyi, maybe it's time to close this bug?

Flags: needinfo?(htsai)

This issue still exists in QuotaManager; we just have other, higher-priority issues to fix. I added the leave-open keyword because I only landed a quick fix rather than an actual fix.

I guess it's better to keep it open? Or I can open another bug to track this issue if that's more appropriate.

Thanks, Tom, for the comment. I'm moving this to the QuotaManager component.

Component: DOM → DOM: Quota Manager
Flags: needinfo?(htsai)
Crash Signature: [@ mozilla::dom::cache::BodyMaybeUpdatePaddingSize] → [@ mozilla::dom::cache::BodyMaybeUpdatePaddingSize] [@ mozilla::dom::quota::QuotaObject::LockedMaybeUpdateSize]
Depends on: 1535995
Crash Signature: [@ mozilla::dom::cache::BodyMaybeUpdatePaddingSize] [@ mozilla::dom::quota::QuotaObject::LockedMaybeUpdateSize] → [@ mozilla::dom::cache::BodyMaybeUpdatePaddingSize]

Fixed by bug 1535995.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Assignee: nobody → shes050117
Assignee: shes050117 → jvarga
Target Milestone: --- → mozilla68