Closed Bug 1422815 Opened 7 years ago Closed 5 years ago

Categories

(Core :: Storage: Quota Manager, defect, P2)

58 Branch
defect

Tracking


RESOLVED FIXED
mozilla68
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- wontfix
firefox57 --- wontfix
firefox58 --- wontfix
firefox59 --- wontfix
firefox60 --- wontfix
firefox61 --- wontfix
firefox62 --- wontfix
firefox66 --- wontfix
firefox67 --- fixed
firefox68 --- fixed

People

(Reporter: tw0517tw, Assigned: janv)

References

Details

(Keywords: crash, regression, Whiteboard: DWS_NEXT)

Crash Data

Attachments

(1 file, 1 obsolete file)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0
Build ID: 20171130160223

Steps to reproduce:

with Firefox version 58.0b8, go to https://bottender.js.org/docs/GettingStarted 


Actual results:

The browser crashes and asks to send a crash report.
https://crash-stats.mozilla.com/report/index/cad08ee2-38a4-4a59-81df-f87140171202#tab-details


Expected results:

The website loads successfully.
http://bit.ly/2zMerOQ lists the other crashes in 59 and 58, affecting Mac and Windows.
Status: UNCONFIRMED → NEW
Crash Signature: [@ mozilla::dom::cache::BodyMaybeUpdatePaddingSize]
Component: Untriaged → DOM
Ever confirmed: true
Keywords: crash, regression
OS: Unspecified → All
Product: Firefox → Core
Hardware: Unspecified → All
Tom may know what's going on here (looks to be related to padding size).
Flags: needinfo?(ttung)
Priority: -- → P2
I cannot reproduce the crash on my mac with Nightly 59.0a1, but it seems to be related to the quota object.

Skimming the reports, I found that all the crashes happen because we fail to get the quota object [1]. That could imply the originInfo either had not been created yet or was deleted for some reason.

I'll dig more this afternoon.

[1] https://hg.mozilla.org/releases/mozilla-beta/annotate/27dc691eb4a0/dom/cache/FileUtils.cpp#l300
Assignee: nobody → ttung
Status: NEW → ASSIGNED
Flags: needinfo?(ttung)
Another QuotaObject issue (both mQuotaObject and mJournalObject are nullptr) is Bug 1414737. It could be a similar issue.
I still have no idea why the quotaObject is missing while executing the cache put action. I suspect we fail to get the quotaObject because the originInfo was deleted due to eviction by QuotaManager.

In theory, QuotaManager tries to evict inactive origins when increasing usage would push the overall usage over the overall limit [1]. Specifically, QuotaManager determines whether an origin is active by checking whether the origin is holding a directory lock [2]. I guess users somehow hit this situation when either the cache manager didn't hold a directory lock or something went wrong while checking whether the origin holds one.

[1] https://searchfox.org/mozilla-central/source/dom/quota/ActorsParent.cpp#3097-3113
[2] https://searchfox.org/mozilla-central/source/dom/quota/ActorsParent.cpp#3464-3489
 
I'll keep investigating in the direction of my guess. Maybe I'll start with a test that makes eviction happen frequently enough to hit the assertion, and run it under rr chaos mode.

Until I find the root cause, I think we need a runtime check for the quotaObject to avoid accidentally dereferencing the null pointer [3].
[3] https://searchfox.org/mozilla-central/source/dom/cache/FileUtils.cpp#307
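The runtime check described above amounts to an early return before the quota object is used. The following is a hedged sketch, not the actual patch: the names GetQuotaObject and MaybeUpdatePaddingSize are simplified stand-ins for the real code in dom/cache/FileUtils.cpp, and the -1 error value is a placeholder for an nsresult.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Stand-in for mozilla::dom::quota::QuotaObject.
struct QuotaObject {
  bool MaybeUpdateSize(int64_t aSize, bool aTruncate) { return true; }
};

// Returns nullptr when the originInfo is missing, which is what happens in
// this bug when the group string recorded on disk no longer matches.
std::shared_ptr<QuotaObject> GetQuotaObject(bool aOriginInfoExists) {
  return aOriginInfoExists ? std::make_shared<QuotaObject>() : nullptr;
}

// On Nightly/Beta a MOZ_DIAGNOSTIC_ASSERT would fire before this check; on
// Release the early return below is what keeps us off the null pointer.
int MaybeUpdatePaddingSize(bool aOriginInfoExists, int64_t aPaddingSize) {
  std::shared_ptr<QuotaObject> quotaObject = GetQuotaObject(aOriginInfoExists);
  if (!quotaObject) {
    return -1;  // stand-in for returning NS_ERROR_UNEXPECTED
  }
  quotaObject->MaybeUpdateSize(aPaddingSize, /* aTruncate */ true);
  return 0;
}
```

The point of the guard is that failing to get the quota object becomes a recoverable error instead of a null dereference.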
Ben, this patch prevents us from accessing the quotaObject when we fail to get it, until I find the root cause. Could you help me review it? Thanks!

Furthermore, I'm thinking of uplifting this patch to beta. Does that sound reasonable?
Attachment #8934780 - Flags: review?(bkelly)
Hi tw0517tw,

It would be really nice if I could get more information from you, so that I can verify whether my guess in comment #5 is correct. Would you mind testing whether Firefox still crashes on that website after restarting it? Thanks!
Flags: needinfo?(tw0517tw)
Attachment #8934780 - Flags: review?(bkelly) → review+
(In reply to Tom Tung [:tt] from comment #6)
Note that this patch won't fix the issue; it just prevents release users from dereferencing the null pointer, since MOZ_DIAGNOSTIC_ASSERT is only enabled on Nightly and Beta.
Still crashes after restarting Firefox (Dev Edition 58.0b9). Getting this crash report:
https://crash-stats.mozilla.com/report/index/6cea3765-1979-4f9c-8613-764310171206

But I found that the website can be loaded in Private Browsing mode.
Flags: needinfo?(tw0517tw)
(In reply to tw0517tw from comment #10)
Thanks for the reply!

> Still crashes after restarting Firefox (Dev Edition 58.0b9). Getting this
> crash report:
> https://crash-stats.mozilla.com/report/index/6cea3765-1979-4f9c-8613-
> 764310171206

Okay, then my guess might be wrong. When you restart Firefox, it should create an originInfo for each origin while QuotaManager initializes origins.

> But I found that the website can be loaded in Private Browsing mode.

That's probably because service workers don't run in Private Browsing mode. The website uses a service worker to store opaque responses via the Cache API, so if the service worker isn't launched, you won't hit the same condition.
Getting a copy of the profile, or just its storage directories, might be the best way to investigate, if you are willing to share them privately with us.
I tried loading that storage folder on Linux; it doesn't crash there, but it always crashes on my Windows machine.
Thanks to tw0517tw, who shared their crash profile with us privately. I now understand why we hit this situation.

The group string obtained from QuotaManager is "js.org", but the group string from (DOM) Cache is "bottender.js.org". That's why we fail to get the quotaObject: there is no originInfo registered under the group "bottender.js.org".
https://publicsuffix.org/list/effective_tld_names.dat
See js.org. It's actually in the public suffix list.
I guess we have mismatched logic, so we took js.org as the eTLD+1.
(In reply to Tom Tung [:tt] from comment #14)
Note that the group string obtained by QuotaManager is read from the ".metadata/.metadata-v2" file, while (DOM) Cache gets the group string from the principal.

I'm still not sure why we wrote a different group string into ".metadata/.metadata-v2".
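The mismatch can be illustrated with a toy eTLD+1 computation. This is a simplified sketch, not Gecko's effective TLD service: the suffix set stands in for effective_tld_names.dat, and adding "js.org" to it flips the computed group from "js.org" to "bottender.js.org", exactly the two strings seen in this bug.

```cpp
#include <set>
#include <string>
#include <vector>

// Toy "group" computation: the group is the eTLD+1, i.e. the longest
// registered public suffix plus one more label. The suffix set is a
// stand-in for the real public suffix list.
std::string GroupFromHost(const std::string& aHost,
                          const std::set<std::string>& aSuffixes) {
  // Split the host into dot-separated labels.
  std::vector<std::string> labels;
  size_t start = 0;
  while (start <= aHost.size()) {
    size_t dot = aHost.find('.', start);
    if (dot == std::string::npos) {
      labels.push_back(aHost.substr(start));
      break;
    }
    labels.push_back(aHost.substr(start, dot - start));
    start = dot + 1;
  }
  // Find the longest suffix that is registered, then keep one extra label.
  for (size_t i = 0; i < labels.size(); ++i) {
    std::string candidate;
    for (size_t j = i; j < labels.size(); ++j) {
      if (j > i) {
        candidate += ".";
      }
      candidate += labels[j];
    }
    if (aSuffixes.count(candidate)) {
      if (i == 0) {
        return aHost;  // the whole host is itself a public suffix
      }
      return labels[i - 1] + "." + candidate;
    }
  }
  return aHost;
}
```

With a suffix set containing only "org", the group for "bottender.js.org" comes out as "js.org" (the string stored in the old metadata); once "js.org" is in the set, the group becomes "bottender.js.org" (the string the principal now produces).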
(In reply to Shawn Huang [:shawnjohnjr] from comment #13)
> I tried to load that storage folder on Linux, it doesn't crash but always
> crash on my windows.
Correction: I can hit the same problem on Linux.
(In reply to Tom Tung [:tt] from comment #16)
> Still not sure about why we write the different group's string into
> ".metadata/.metadata-v2".
(In reply to Shawn Huang [:shawnjohnjr] from comment #15)
> https://publicsuffix.org/list/effective_tld_names.dat
> See js.org. It's actually in the public suffix list.
> I guess we have mis-matched logic code. So we took js.org as eTLD+1.

This might explain it. Before the list was updated, we treated "js.org" as the group (since "org" was the eTLD) and stored it in ".metadata/.metadata-v2" in QuotaManager. After "js.org" was added to the list, we started treating "bottender.js.org" as the group.

Then, while Firefox initializes origins, QuotaManager creates the originInfo/groupInfo/groupInfoPairs based on the records in ".metadata/.metadata-v2". That makes the website fail to get the quota object from the originInfo/groupInfo/groupInfoPairs.

This scenario could happen on many websites.
Note: js.org was added to the list on Aug 17, 2016.
We discussed this at today's storage meeting with Jan.

I'll investigate more and consider the following options:
1. Recover the group from the origin -> but this needs to happen on the main thread
 - Maybe we could have a runtime recovery mechanism that runs during the initialization step while we are on the main thread.
2. Only update the metadata files whose group has changed.
 - Investigate what service we have for computing eTLD+1.
 - Check the notification time via OS functions.
3. Just delete all the metadata files -> but we may lose the persisted state
 - Add a new table to storage.sqlite for tracking eTLD list updates.
If I understood Tom correctly, this regression isn't new in 58; the defect has been in our code for longer, so I changed the firefox57 status to reflect the situation. Please feel free to correct me if I am wrong.
Updated the commit message.

I'm planning to land and uplift it this week.
Attachment #8934780 - Attachment is obsolete: true
Attachment #8936443 - Flags: review+
Keywords: leave-open
Comment on attachment 8936443 [details] [diff] [review]
[Final] Bug 1422815 - P1: Fix for failing to get quotaObject unexpectedly. r=bkelly

Landing the fix that avoids accessing the null pointer first.
Attachment #8936443 - Attachment description: Bug 1422815 - P1: Fix for failing to get quotaObject unexpectedly. r=bkelly → [Final] Bug 1422815 - P1: Fix for failing to get quotaObject unexpectedly. r=bkelly
Comment on attachment 8936443 [details] [diff] [review]
[Final] Bug 1422815 - P1: Fix for failing to get quotaObject unexpectedly. r=bkelly

This patch mainly prevents Release builds from accessing a null pointer.

I can reproduce this crash locally in Release 57 using the profile provided by the reporter.

Approval Request Comment
[Feature/Bug causing the regression]:
Well, I'm not sure which specific bug caused this. Two things combine to crash the browser.

First, we should be able to get a quotaObject after initializing the origin. However, in this case we can't, due to the group string mismatch.

The other is that Bug 1290481 added a MOZ_DIAGNOSTIC_ASSERT expecting the quotaObject to exist.

Thus, the assertion is hit and crashes the browser. The assert doesn't exist on Release, which means Release builds may access a null pointer in this condition. Therefore, this patch avoids touching the null pointer on Release 58+.

[User impact if declined]:

A crash when visiting a website that used IndexedDB/the Cache API before Firefox 57 and is visited again after Firefox 57, while the website is storing an opaque cached response.

[Is this code covered by automated tests?]:

No, but it's just an early return.

[Has the fix been verified in Nightly?]:

It can't be verified on Nightly or Beta, since the assert fires before the if-statement. But it should work on Release, where the if-statement prevents us from accessing the null pointer.

[Needs manual test from QE? If yes, steps to reproduce]:

No.
 
[List of other uplifts needed for the feature/fix]:

Only this one for now.

[Is the change risky?]:

I don't think so.

[Why is the change risky/not risky?]:

It's just an early return. I verified on my Mac that it no longer crashes, by disabling the assert and applying the patch.

[String changes made/needed]:

No.
Attachment #8936443 - Flags: approval-mozilla-beta?
I've put below the notes from my discussion with Jan, Andrew, and Shawn. Please correct me if I got something wrong.

I might need to:
- Do a minor upgrade of QuotaManager: add a column for the hash of etld.dat to storage.sqlite, and maybe update the group strings the first time.

- Each time QM initializes, if the hash of etld.dat has changed, check whether QM needs to update the group strings in the metadata files.

Others:
- Add a test for downgrade.
- Handle failures during the upgrade (e.g. write to a .tmp file, then finalize it).

Something that might be useful:
- https://searchfox.org/mozilla-central/source/netwerk/dns/prepare_tlds.py
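A minimal sketch of the hash-check idea from the notes above. All names here (Storage, MaybeUpgradeGroups, the in-memory map) are assumptions for illustration, not the real storage.sqlite schema: the idea is just to store a hash of the eTLD data and recompute the group strings only when the hash changes.

```cpp
#include <functional>
#include <map>
#include <string>

// Toy stand-in for the QM bookkeeping: group strings keyed by origin,
// plus the hash of the eTLD data they were computed against.
struct Storage {
  std::map<std::string, std::string> metadataGroups;  // origin -> group
  size_t etldHash = 0;
};

size_t HashEtldData(const std::string& aEtldData) {
  return std::hash<std::string>{}(aEtldData);
}

// Returns true if the group metadata had to be refreshed. aRecomputeGroup
// stands in for re-deriving the group string from the origin's principal.
bool MaybeUpgradeGroups(
    Storage& aStorage, const std::string& aEtldData,
    const std::function<std::string(const std::string&)>& aRecomputeGroup) {
  size_t currentHash = HashEtldData(aEtldData);
  if (currentHash == aStorage.etldHash) {
    return false;  // eTLD list unchanged; stored groups are still valid.
  }
  for (auto& entry : aStorage.metadataGroups) {
    entry.second = aRecomputeGroup(entry.first);
  }
  aStorage.etldHash = currentHash;
  return true;
}
```

The cheap hash comparison is what keeps the common-case startup fast: the per-origin recomputation only runs in the rare release where the shipped eTLD data actually changed.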
Comment on attachment 8936443 [details] [diff] [review]
[Final] Bug 1422815 - P1: Fix for failing to get quotaObject unexpectedly. r=bkelly

Avoid a crash. Beta58+.
Attachment #8936443 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
The issue in this bug is that the group string from the eTLD service was updated, but the one in the directory metadata file wasn't.

We should have a mechanism in QuotaManager to update the directory metadata for the group. However, it might take a lot of time during initialization to check whether the metadata for each group is up-to-date. Also, the current eTLD service optimization lives on the main thread, so we should consider whether getting this service from another thread (e.g. the I/O thread) is appropriate. Besides, we need a minor upgrade in QuotaManager to check and update all the group directories. I planned to use this opportunity to scan all the directories and remove the bad chrome origins created after QM schema 2_1. They don't break QM initialization, but we should remove them somewhere in the future.
Assignee: ttung → nobody
Status: ASSIGNED → NEW
take a look in a few weeks
Tom is back; I think he should take this bug again.

Basically, comment 26 is still valid, but I think we could also first try to get rid of the .metadata files and store the metadata in storage.sqlite. Maintaining our own file format is quite expensive, and I guess we will need to store and update more and more metadata in the future.
Andrew, what do you think?
Flags: needinfo?(shes050117)
Flags: needinfo?(jvarga)
Flags: needinfo?(bugmail)
Assignee: nobody → shes050117
Flags: needinfo?(shes050117)
Agreed.  I think we want to get to a future where on startup we check for an indication that QM had a clean shutdown.  If QM had a clean shutdown, all of the info we need can come from storage.sqlite without doing an initialization sweep of PROFILE/storage.  If there's indication of a dirty shutdown, we would then do the sweep.

And a good first step, as you propose, would be to move the contents of the metadata files into storage.sqlite.  We could maintain the existing initialization sweep, but be rid of the .metadata files.

I might also propose adding a telemetry histogram probe that tracks the duration of the initialization sweep, and landing it prior to this bug. That way we can see the speed-up from removing the .metadata files. We'd also have it record whether initialization succeeded, as a boolean.
Flags: needinfo?(bugmail)
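The clean-shutdown gate described above could be sketched as follows. All names here are assumptions for illustration; the real mechanism would live in QuotaManager's storage initialization, with the marker persisted on disk rather than in memory.

```cpp
// Sketch of the proposed clean-shutdown gate: a marker written at shutdown
// decides whether startup can trust storage.sqlite or must fall back to
// sweeping PROFILE/storage.
struct StorageState {
  bool cleanShutdownMarker = false;
  bool usageLoadedFromDb = false;
  bool sweepPerformed = false;
};

void Shutdown(StorageState& aState) {
  // Written only after all quota clients have flushed successfully.
  aState.cleanShutdownMarker = true;
}

void Startup(StorageState& aState) {
  if (aState.cleanShutdownMarker) {
    aState.usageLoadedFromDb = true;  // fast path: trust storage.sqlite
  } else {
    aState.sweepPerformed = true;     // slow path: sweep PROFILE/storage
  }
  // Clear the marker so a crash before the next clean shutdown forces a
  // sweep on the following startup.
  aState.cleanShutdownMarker = false;
}
```

Clearing the marker at startup is the key design point: any dirty exit automatically leaves it unset, so the next launch falls back to the full sweep without extra crash detection.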
Depends on: 1481716
(In reply to Andrew Sutherland [:asuth] from comment #33)
> Agreed.  I think we want to get to a future where on startup we check for an
> indication that QM had a clean shutdown.  If QM had a clean shutdown, all of
> the info we need can come from storage.sqlite without doing an
> initialization sweep of PROFILE/storage.  If there's indication of a dirty
> shutdown, we would then do the sweep.

I'm not sure I understand that correctly. What if files are accidentally modified (added or deleted) between a clean shutdown and the next startup? For instance, anti-virus software might delete a file under PROFILE/storage while Firefox isn't running; that won't happen frequently, but even after a clean shutdown there are corner cases that can change files. If we want to do that, we need to consider how to ensure the records in storage.sqlite are always correct.

> And a good first step, as you propose, would be to move the contents of the
> metadata files into storage.sqlite.  We could maintain the existing
> initialization sweep, but be rid of the .metadata files.
> 
> I might propose we might want to add a telemetry histogram probe that tracks
> the duration of the initialization sweep and land that prior to this bug. 
> That way we can see the speed-up from the .metadata file removals.  We'd
> also have it include whether initialization succeeded as a boolean.

That sounds like a good idea to me, so I filed bug 1481716.
(In reply to Tom Tung [:tt] from comment #34)
> (In reply to Andrew Sutherland [:asuth] from comment #33)
> > Agreed.  I think we want to get to a future where on startup we check for an
> > indication that QM had a clean shutdown.  If QM had a clean shutdown, all of
> > the info we need can come from storage.sqlite without doing an
> > initialization sweep of PROFILE/storage.  If there's indication of a dirty
> > shutdown, we would then do the sweep.
> 
> I'm not sure if I understand that correctly. What if files are accidentally
> modified (maybe be added or deleted) between a clean shutdown and the next
> open. For instance, an anti-virus software deletes a file under
> PROFILE/storage while Firefox isn't opening, though It won't happen
> frequently. Although it's a clean shutdown, there are a few corner cases
> that make files be changed. If we want to do that, we may need to consider
> how to make sure the record in the storage.sqlite is always correct. 

That's why we haven't done this yet; it's not a trivial task, and all the corner cases have to be covered, or at least mitigated.
There will be more and more origin directories, especially once we land the new LocalStorage implementation. So the performance optimization (avoiding the profile/storage sweep when possible) is inevitable.
(In reply to Tom Tung [:tt] from comment #34)
> I'm not sure if I understand that correctly. What if files are accidentally
> modified (maybe be added or deleted) between a clean shutdown and the next
> open. For instance, an anti-virus software deletes a file under
> PROFILE/storage while Firefox isn't opening, though It won't happen
> frequently. Although it's a clean shutdown, there are a few corner cases
> that make files be changed. If we want to do that, we may need to consider
> how to make sure the record in the storage.sqlite is always correct. 

I think it's important to keep in mind that the main goal of Quota Manager is to make sure an origin doesn't use more space than we allow it to.  If an anti-virus software quarantines or outright deletes a file under PROFILE/storage, QM over-estimating the quota in use by the origin isn't that big a problem as long as we have some mechanism for correcting our numbers, such as re-running the tally when the origin is opened.  And/or doing an entire counting sweep that doesn't block QM startup as part of a maintenance process like how we try and vacuum IDB databases monthly.

If something is adding files under PROFILE/storage, then I'd expect them to either be:
- file-manager artifacts like ".desktop" that we've been seeing, that are user intent and shouldn't be tracked for QM purposes
- The anti-virus putting a file it stole back.  If it puts it back before we notice, it's not a problem.  If it puts it back after, well, I would suggest that if we implement what I propose, we'd probably actually want to spend a release or two where we use the database and still do an initialization counting at startup to make sure they agree and report back telemetry that indicates if in fact files are disappearing or re-appearing a lot, etc.
- Anything else I think amounts to malware trying to hide files in our profile.  I'm not sure there's anything we can reasonably do about this other than report to the user that their computer seems haunted.

One thing that may also be worth looking into is whether we're working with file-systems that can efficiently tell us if a directory sub-tree has changed.  For example, on linux, btrfs may be able to efficiently let us know whether PROFILE/STORAGE has changed at all.  On something like Windows/NTFS, however, I think we might be limited to checking directory timestamps to know if their immediate contents have changed (but not the contents of their sub-directories).

All that said, I think these are our 2 big goals from a performance perspective:
1) For a given origin, especially a ServiceWorker, we want to reduce the latency of initial DOM Cache opening and IDB openings.  This argues against doing *blocking* origin counting when we first go to display a site for the origin.  I think however we could get away with counting the origin lazily after we've already opened it.
2) We want browser startup's time-to-visible about:newtab (or restored site) to be as short as possible. This argues against the initialization counting sweep that we currently do.
Thanks, Jan and Andrew, for explaining!

(In reply to Jan Varga [:janv] from comment #35)
> That's why we haven't done this yet, it's not trivial task, all the corner
> cases
> have to be covered or mitigated at least.
> There will be more and more origin directories, especially once we land new
> LocalStorage implementation. So the performance optimization (which avoids
> profile/storage sweep when possible) is inevitable.

I see!

(In reply to Andrew Sutherland [:asuth] from comment #36)
> I think it's important to keep in mind that the main goal of Quota Manager
> is to make sure an origin doesn't use more space than we allow it to.  If an

That's a really good point! I'll keep it in mind.

> anti-virus software quarantines or outright deletes a file under
> PROFILE/storage, QM over-estimating the quota in use by the origin isn't
> that big a problem as long as we have some mechanism for correcting our
> numbers, such as re-running the tally when the origin is opened.  And/or
> doing an entire counting sweep that doesn't block QM startup as part of a
> maintenance process like how we try and vacuum IDB databases monthly.
> 
> If something is adding files under PROFILE/storage, then I'd expect them to
> either be:
> - file-manager artifacts like ".desktop" that we've been seeing, that are
> user intent and shouldn't be tracked for QM purposes
> - The anti-virus putting a file it stole back.  If it puts it back before we
> notice, it's not a problem.  If it puts it back after, well, I would suggest
> that if we implement what I propose, we'd probably actually want to spend a
> release or two where we use the database and still do an initialization
> counting at startup to make sure they agree and report back telemetry that
> indicates if in fact files are disappearing or re-appearing a lot, etc.
> - Anything else I think amounts to malware trying to hide files in our
> profile.  I'm not sure there's anything we can reasonably do about this
> other than report to the user that their computer seems haunted.
> 
> One thing that may also be worth looking into is whether we're working with
> file-systems that can efficiently tell us if a directory sub-tree has
> changed.  For example, on linux, btrfs may be able to efficiently let us
> know whether PROFILE/STORAGE has changed at all.  On something like
> Windows/NTFS, however, I think we might be limited to checking directory
> timestamps to know if their immediate contents have changed (but not the
> contents of their sub-directories).
> 
> All that said, I think these are our 2 big goals from a performance
> perspective:
> 1) For a given origin, especially a ServiceWorker, we want to reduce the
> latency of initial DOM Cache opening and IDB openings.  This argues against
> doing *blocking* origin counting when we first go to display a site for the
> origin.  I think however we could get away with counting the origin lazily
> after we've already opened it.
> 2) We want browser startup's time to visible about:newtab (or restored site)
> being as short as possible.  This argues against the initialization counting
> sweep that we currently do.
Assignee: shes050117 → nobody
Whiteboard: DWS_NEXT

The leave-open keyword is there and there is no activity for 6 months.
:hsinyi, maybe it's time to close this bug?

Flags: needinfo?(htsai)

This issue still exists in QuotaManager; we just have other, higher-priority issues to fix. I added the leave-open keyword because I only landed a quick fix rather than an actual fix.

I guess it's better to keep it open? Or I can open another bug to track this issue if that's more appropriate.

Thanks, Tom, for the comment. I'm moving this to the QuotaManager component.

Component: DOM → DOM: Quota Manager
Flags: needinfo?(htsai)
Crash Signature: [@ mozilla::dom::cache::BodyMaybeUpdatePaddingSize] → [@ mozilla::dom::cache::BodyMaybeUpdatePaddingSize] [@ mozilla::dom::quota::QuotaObject::LockedMaybeUpdateSize]
Depends on: 1535995
Crash Signature: [@ mozilla::dom::cache::BodyMaybeUpdatePaddingSize] [@ mozilla::dom::quota::QuotaObject::LockedMaybeUpdateSize] → [@ mozilla::dom::cache::BodyMaybeUpdatePaddingSize]

Fixed by bug 1535995.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Assignee: nobody → shes050117
Assignee: shes050117 → jvarga
Target Milestone: --- → mozilla68