Closed Bug 1402581 Opened 2 years ago Closed 2 years ago

Crash in mozilla::dom::cache::LockedUpdateDirectoryPaddingFile

Categories

(Core :: DOM: Core & HTML, defect, P1, critical)

57 Branch
x86
Windows
defect

Tracking

()

RESOLVED FIXED
mozilla58
Tracking Status
firefox-esr52 --- unaffected
firefox55 --- unaffected
firefox56 --- unaffected
firefox57 --- fixed
firefox58 --- fixed

People

(Reporter: philipp, Assigned: tt)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(5 files, 6 obsolete files)

8.75 KB, patch
Details | Diff | Splinter Review
2.36 KB, patch
Details | Diff | Splinter Review
3.55 KB, patch
Details | Diff | Splinter Review
8.74 KB, patch
Details | Diff | Splinter Review
2.68 KB, patch
Details | Diff | Splinter Review
This bug was filed from the Socorro interface and is 
report bp-ea566dd0-40ab-4378-98ff-216300170922.
=============================================================
Crashing Thread (65), Name: DOMCacheThread
Frame 	Module 	Signature 	Source
0 	xul.dll 	mozilla::dom::cache::LockedUpdateDirectoryPaddingFile(nsIFile*, mozIStorageConnection*, __int64, __int64, bool) 	dom/cache/FileUtils.cpp:813
1 	xul.dll 	`anonymous namespace'::CacheQuotaClient::MaybeUpdatePaddingFileInternal<<lambda_05e898411bffc42f0b8fcdeec8794db5> > 	dom/cache/QuotaClient.cpp:326
2 	xul.dll 	mozilla::dom::cache::MaybeUpdatePaddingFile<<lambda_9016d51e96a641430ae2586b84e342ac> >(nsIFile*, mozIStorageConnection*, __int64, __int64, <lambda_9016d51e96a641430ae2586b84e342ac>) 	dom/cache/QuotaClient.cpp:466
3 	xul.dll 	mozilla::dom::cache::`anonymous namespace'::SetupAction::RunSyncWithDBOnTarget 	dom/cache/Manager.cpp:112
4 	xul.dll 	mozilla::dom::cache::SyncDBAction::RunWithDBOnTarget(mozilla::dom::cache::Action::Resolver*, mozilla::dom::cache::QuotaInfo const&, nsIFile*, mozIStorageConnection*) 	dom/cache/DBAction.cpp:166
5 	xul.dll 	mozilla::dom::cache::DBAction::RunOnTarget(mozilla::dom::cache::Action::Resolver*, mozilla::dom::cache::QuotaInfo const&, mozilla::dom::cache::Action::Data*) 	dom/cache/DBAction.cpp:121
6 	xul.dll 	mozilla::dom::cache::Context::QuotaInitRunnable::Run() 	dom/cache/Context.cpp:459
7 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1039
8 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/threads/nsThreadUtils.cpp:521
9 	xul.dll 	mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:368
10 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc:319
11 	xul.dll 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc:299
12 	xul.dll 	nsThread::ThreadFunc(void*) 	xpcom/threads/nsThread.cpp:427
13 	nss3.dll 	_PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c:397
14 	nss3.dll 	pr_root 	nsprpub/pr/src/md/windows/w95thred.c:95
15 	ucrtbase.dll 	thread_start<unsigned int (__stdcall*)(void*)> 	
16 	kernel32.dll 	BaseThreadInitThunk 	
17 	mozglue.dll 	patched_BaseThreadInitThunk 	mozglue/build/WindowsDllBlocklist.cpp:824
18 	ntdll.dll 	__RtlUserThreadStart 	
19 	ntdll.dll 	_RtlUserThreadStart

these crashes started with a low frequency in 57.0a1, so far from 32bit builds on windows. they come with  	MOZ_RELEASE_ASSERT(currentPaddingSize >= aDecreaseSize) that got added in bug 1290481.
I guess we may need to check the temporaryPaddingFileExist again after acquiring the mutex lock. So does the other place after acquire the mutex lock.

This failure might happen in the following sequence:
A thread checks the temp file and the file is not existed. Then, wait for the lock.
B thread acquires the lock and executes its operation but fails. It leave the temp file.
A thread acquires the lock and doesn't realize the previous operation failed because it think the temp file doesn't exist. Finally, hits the assertion.
Hi Shawn,

This crash happens due to [1], and it only happens on Windows platform so far. There are two possible issues in this bug:
1. Forgetting to re-check the value after acquiring the lock which I put it in comment 1.

2. We somehow update the wrong number to padding file, so that it makes a number minus the other number bigger than itself. If the reason is that, we might need to revisit the whole logic for updating padding size [2] to make sure this won't happen again.

I suspect the reason is 1, since it only happens on Windows (which may fail to delete the temporary file). But, I'm not pretty sure about that. Could you help me to investigate this when you have time? Thanks!

[1] https://hg.mozilla.org/releases/mozilla-beta/annotate/3354b84cc524/dom/cache/FileUtils.cpp#l813
[2] http://searchfox.org/mozilla-central/source/dom/cache/QuotaClient.cpp#450
Flags: needinfo?(shuang)
Patch if the guess in comment 1 is right. There's only one place that forget to recheck the value after acquire the lock, so correct it!
Attachment #8911489 - Flags: feedback?(shuang)
 Hi Shawn, as Tom us off this week, please take it and drive it to completion. Thank you.
Assignee: nobody → shuang
Priority: -- → P1
(In reply to Hsin-Yi Tsai [:hsinyi] from comment #4)
>  Hi Shawn, as Tom us off this week, please take it and drive it to
> completion. Thank you.

I'm working on it now.
Maybe it's okay to use this as an example to reproduce this bug.
https://bug1397128.bmoattachments.org/attachment.cgi?id=8908571

But change it to opaque type. Make the test case consist of "Caches.delete", "Cache.delete", "Cache.put/Cache.add" multiple times.
It sounds like we at least need some kind of runtime check to avoid having this overflow:

https://searchfox.org/mozilla-central/source/dom/cache/FileUtils.cpp#814

We probably can't control if some other application on the machine modifies the padding file.  We need to treat it as external data that might be inconsistent.
Steal this from Shawn with his permission :p
Assignee: shuang → ttung
Yeah, I actually tried to reproduce this bug to figure out the root cause but no luck last week. It's better to know what really creates the problem.
(In reply to Ben Kelly [:bkelly] from comment #7)
> It sounds like we at least need some kind of runtime check to avoid having
> this overflow:
> 
> https://searchfox.org/mozilla-central/source/dom/cache/FileUtils.cpp#814
> 
> We probably can't control if some other application on the machine modifies
> the padding file.  We need to treat it as external data that might be
> inconsistent.

You are right. I'll try to find out the root cause and change this assertion to a runtime if-statement.
I guess I find out the root cause. 

I modified the code to make delete/write padding file fail in the certain probability (0.1), and then I hit the assertion (MOZ_ASSERT) for ensuring the decreasing usage should less than the current usage in QuotaManger [0]. Although it's not the same assertion that on crash report, I think the reason is the same, and it should hit the assertion on the crash report in the next cache action.

I tried to delete the deletedPaddingSize when completing [1, 2, 3]. However, I didn't notice that if the failure happens, I shouldn't decrease the usage because the usage won't be updated to cache.sqlite.

[0] http://searchfox.org/mozilla-central/source/dom/quota/ActorsParent.cpp#5889
[1] http://searchfox.org/mozilla-central/source/dom/cache/Manager.cpp#500-502
[2] http://searchfox.org/mozilla-central/source/dom/cache/Manager.cpp#872-874
[3] http://searchfox.org/mozilla-central/source/dom/cache/Manager.cpp#1112-1114
Hi Ben,

I'm not sure whether I understand the code correctly. Could you help me to check the things below? Thanks!

After digging a bit more, I found no matter the action is resolver with NS_OK or not we will run into CompleteOnInitiatingThread() in STATE_COMPLETEING [1]. It seems that we remove the body file while failing. 

If IIUC, we shouldn't call NoteOrphanedBodyIdList() to remove the bodies when the result is not NS_OK. So that, we won't delete the body file while its header and information still be kept in cache.sqlite. So does the others [3, 4].

[1] http://searchfox.org/mozilla-central/source/dom/cache/Context.cpp#633-657
[2] http://searchfox.org/mozilla-central/source/dom/cache/Manager.cpp#498
[3] http://searchfox.org/mozilla-central/source/dom/cache/Manager.cpp#870
[4] http://searchfox.org/mozilla-central/source/dom/cache/Manager.cpp#1110
Flags: needinfo?(bkelly)
I think you are correct.  In those cases we roll back the transaction on error, so we shouldn't delete the body files.  Good catch!
Flags: needinfo?(bkelly)
Attachment #8911489 - Attachment is obsolete: true
Hi Ben,

This is the patch for not deleting padding size and body file when failure happens. Could you help me to review it? Thanks!
Attachment #8914685 - Flags: review?(bkelly)
This is the patch for making the padding mechanism much error-tolerant (avoid IO failures in a row).
Attachment #8914687 - Flags: review?(bkelly)
Note to myself: 
- Remeber to address comment #7 (add runtime check) after verifying P1 & P2 fixing this issue (Maybe one week later after landing them if they get r+ ?).
- Uplift the fixes to 57-beta
Attachment #8914685 - Flags: review?(bkelly) → review+
Comment on attachment 8914687 [details] [diff] [review]
Bug 1402581 - P2: Ensure the temporary file exists when failure happens. r?bkelly

Review of attachment 8914687 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/cache/QuotaClient.cpp
@@ +328,5 @@
>          LockedUpdateDirectoryPaddingFile(aBaseDir, aConn, aIncreaseSize,
>                                           aDecreaseSize,
>                                           temporaryPaddingFileExist);
>        if (NS_WARN_IF(NS_FAILED(rv))) {
> +        // Not delete the temporary padding file here to force the next action

nit: "Don't delete"

@@ +335,5 @@
>        }
>  
>        rv = aCommitHook();
>        if (NS_WARN_IF(NS_FAILED(rv))) {
> +        // Not delete the temporary padding file here to force the next action

nit: "Don't delete"

@@ +350,5 @@
> +
> +        // Ensure that we are able to force file to be restored.
> +        MOZ_DIAGNOSTIC_ASSERT(
> +          mozilla::dom::cache::
> +          DirectoryPaddingFileExists(aBaseDir, DirPaddingFile::TMP_FILE));

I think I've come to the conclusion we can't assert things like file existence.  We simply cannot control if another application or the OS comes along and changes the disk.  For example, if the disk is full, can we necessarily create the TMP file?

So, maybe make this just a MOZ_ASSERT() so we get the checking in debug builds.  It looks like there is nothing else we can do to recover in this case, but at least we probably shouldn't crash if the something unexpected happens with the disk.
Attachment #8914687 - Flags: review?(bkelly) → review+
(In reply to Ben Kelly [:bkelly] from comment #19)
Thanks for the review! I'll address the comment!

> @@ +350,5 @@
> > +
> > +        // Ensure that we are able to force file to be restored.
> > +        MOZ_DIAGNOSTIC_ASSERT(
> > +          mozilla::dom::cache::
> > +          DirectoryPaddingFileExists(aBaseDir, DirPaddingFile::TMP_FILE));
> 
> I think I've come to the conclusion we can't assert things like file
> existence.  We simply cannot control if another application or the OS comes
> along and changes the disk.  For example, if the disk is full, can we
> necessarily create the TMP file?
> 
> So, maybe make this just a MOZ_ASSERT() so we get the checking in debug
> builds.  It looks like there is nothing else we can do to recover in this
> case, but at least we probably shouldn't crash if the something unexpected
> happens with the disk.

You are right. 

I was trying to handle the failure cases of |LockedDirectoryPaddingFinalizeWrite()|. In that function, it renames the padding file from ".padding-tmp" to ".padding". So, if it fails, the temporary padding file should exist. So that, the padding size will be recalculated when the next action is coming, or QuotaManager is getting origin usage.

I will move this assertion to MOZ_ASSERT() and write a patch to handle if the temporary file doesn't exist.
Pushed by ttung@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/af0092aac7a1
P1: Don't delete the body file and decrease the padding size when the transaction fails. r=bkelly
https://hg.mozilla.org/integration/mozilla-inbound/rev/d95beed1ed18
P2: Ensure the temporary file exists when failure happens. r=bkelly
Mark as "leave-open" for checking whether these patches do fix the crash. If the crash won't happen after landing them, I will write a patch to change assertions to runtime check per comment #7. After that, uplift them to 57-beta.
Status: NEW → ASSIGNED
Keywords: leave-open
Oops, I forgot to update the patch to Bugzilla. Sorry for this...
Attachment #8914685 - Attachment is obsolete: true
Attachment #8915438 - Flags: review+
Attachment #8915438 - Attachment description: Bug 1402581 - P1: Don't delete the body file and decrease the padding size when the transaction fails. r=bkelly → [Final] Bug 1402581 - P1: Don't delete the body file and decrease the padding size when the transaction fails. r=bkelly
Attachment #8915440 - Attachment description: Bug 1402581 - P2: Ensure the temporary file exists when failure happens. r=bkelly → [Final] Bug 1402581 - P2: Ensure the temporary file exists when failure happens. r=bkelly
Can we close this now?
Flags: needinfo?(ttung)
(In reply to Jim Mathies [:jimm] from comment #26)
> Can we close this now?

We still need to change some of the diagnosis assertions to runtime checks. Besides, I just find out current patches can only prevent padding size from being incorrect but not fixing the incorrect padding size to be revised. Fortunately, runtime checks can also fix this issue.
Flags: needinfo?(ttung)
(In reply to Tom Tung [:tt] from comment #27)
We also need to uplift these patches to beta after that.
Hi Ben,

This patch mainly does the things as follows:
1. Reduce reading padding file one more time if calling LockedDirectoryPaddingRestore() when failing to get the padding size from the file.
2. Allow failure happens on restoring padding file when getting usage (Note: Still ensure leaving temporary padding file so that the next cache action will update file again).
3. Make sure WipePaddingFileInternal() finish almost everything before deleting the temporary padding file

Would you mind helping me to review this patch? Thanks!
Attachment #8915885 - Flags: review?(bkelly)
This patch is mainly to make the diagnostic assertions for padding size should not overflow/underflow in LockedUpdateDirectoryPaddingFile() be runtime checks. It also makes sure deleting the padding file while failure happens, so that the padding size will be recalculated in the next cache action.

I not so sure whether should I make it crash if the user runs into this situation. (force QM to recalculate the usage/padding size in the padding file) I put the MOZ_ASSERT(false) for now.

Note: Unfortunately, there is no easy way to update(increase) the padding size to QM [1], so the idea here is that only update the padding file. Therefore, the size will be updated to QM when a user restarts Firefox and initialize QM next time.

The alternative solution is to implement an API in QM to update the usage under an origin, but it's a bit complex while increasing usage because we need to consider what should we do if increasing usage hits the quota limits (group/global).
Attachment #8915891 - Flags: review?(bkelly)
After P1 P2 P3 P4, if there is an error related to padding in each cache action, it should either leave a temporary padding file or delete the padding file. So that, we'll know if we do need to recalculate the padding size in the next action.
Comment on attachment 8915885 [details] [diff] [review]
Bug 1402581 - P3: Allow failures happen when restoring or wiping padding file. r?bkelly

Review of attachment 8915885 [details] [diff] [review]:
-----------------------------------------------------------------

This is a bit of rs+, to be honest.  All of this looks ok by itself, but I've lost the bigger picture a bit.  I trust that you know how this fits in with the rest.

::: dom/cache/QuotaClient.cpp
@@ +85,5 @@
> +                OpenDBConnection(quotaInfo, aDir, getter_AddRefs(conn));
> +  if (rv == NS_ERROR_FILE_NOT_FOUND ||
> +      rv == NS_ERROR_FILE_TARGET_DOES_NOT_EXIST) {
> +    // If the db doesn't exsit, the padding size should be zero.
> +    *aPaddingSizeOut = 0;

nit: I think its slightly safer to set out params to zero at the beginning of the method.  That way we know its a known value when we return an error, etc.
Attachment #8915885 - Flags: review?(bkelly) → review+
Attachment #8915891 - Flags: review?(bkelly) → review+
(In reply to Ben Kelly [:bkelly] from comment #33)
Thanks for the review!

> Comment on attachment 8915885 [details] [diff] [review]
> Bug 1402581 - P3: Allow failures happen when restoring or wiping padding
> file. r?bkelly
> 
> Review of attachment 8915885 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This is a bit of rs+, to be honest.  All of this looks ok by itself, but
> I've lost the bigger picture a bit.  I trust that you know how this fits in
> with the rest.

Thanks! I'll try to explain the overall picture in bug 1290481.

> ::: dom/cache/QuotaClient.cpp
> @@ +85,5 @@
> > +                OpenDBConnection(quotaInfo, aDir, getter_AddRefs(conn));
> > +  if (rv == NS_ERROR_FILE_NOT_FOUND ||
> > +      rv == NS_ERROR_FILE_TARGET_DOES_NOT_EXIST) {
> > +    // If the db doesn't exsit, the padding size should be zero.
> > +    *aPaddingSizeOut = 0;
> 
> nit: I think its slightly safer to set out params to zero at the beginning
> of the method.  That way we know its a known value when we return an error,
> etc.

Done
Attachment #8916471 - Attachment is obsolete: true
Pushed by ttung@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/dfc03e908c89
P3: Allow failures happen when restoring or wiping padding file. r=bkelly
https://hg.mozilla.org/integration/mozilla-inbound/rev/daf2afb7a40b
P4: Make the assertions for increaing and decreaing padding size be runtime checks. r=bkelly
(In reply to Pulsebot from comment #38)
I'll send an uplift request to 57-beta if this issue won't happen after applying these four patches.
(In reply to Tom Tung [:tt] from comment #39)
Note that these patches should fix the crash or at least the crash signature should change on non-debug build since I change the assertion to runtime check.
The new crashes happen in beta-57 frequently, so I decide to uplift these four patches to beta right now.
Comment on attachment 8915438 [details] [diff] [review]
[Final] Bug 1402581 - P1: Don't delete the body file and decrease the padding size when the transaction fails. r=bkelly

Approval Request Comment
[Feature/Bug causing the regression]:
Bug 1290481

[User impact if declined]:
This crash happens because the tracked padding size/usage is wrong and not able to be fixed to correct value. If the user's Firefox somehow jumps into this condition, they will not be able to use the storage types (idb/(dom) cache [service worker]/asmjs cache) under QuotaManager until they restart their firefox. Moreover, the showing usage will be wrong on Preference (Site Data).

[Is this code covered by automated tests?]:
No, it's not easy to have a test since this crash probably happens when the Firefox fails to write/delete a file. I wrote a patch(attachment 8914689 [details] [diff] [review]) to force the write/delete fail, and verify the crash was solved locally.

[Has the fix been verified in Nightly?]:
Yes, though I only landed the part of patches two days ago. 
Note that these patches should at least fix *this crash* because I changed the assertion to runtime check and tried to correct the padding size when running into the condition.

[Needs manual test from QE? If yes, steps to reproduce]:
No
 
[List of other uplifts needed for the feature/fix]:
These four patches

[Is the change risky?]:
Low

[Why is the change risky/not risky?]:
Because these patches mainly changes the code for failure handling (the order for removing files, ignore the error in some conditions, try to fix the wrong usage when failure happens, ... etc) and two assertions to runtime checks

[String changes made/needed]:
No
Attachment #8915438 - Flags: approval-mozilla-beta?
Comment on attachment 8915440 [details] [diff] [review]
[Final] Bug 1402581 - P2: Ensure the temporary file exists when failure happens. r=bkelly

Same as above
Attachment #8915440 - Flags: approval-mozilla-beta?
Comment on attachment 8916472 [details] [diff] [review]
[Final] Bug 1402581 - P3: Allow failures happen when restoring or wiping padding file. r=bkelly

Same as above
Attachment #8916472 - Flags: approval-mozilla-beta?
Comment on attachment 8916473 [details] [diff] [review]
[Final] Bug 1402581 - P4: Make the assertions for increaing and decreaing padding size be runtime checks. r=bkelly

Same as above
Attachment #8916473 - Flags: approval-mozilla-beta?
Hi Andrew, Ben, the size of this change is worrisome. I looked at the crash signature and it seems to be a low volume crash to me. Is this truly a must fix for 57? Need a second opinion. Thanks!
Flags: needinfo?(overholt)
Flags: needinfo?(bkelly)
(In reply to Ritu Kothari (:ritu) from comment #47)
> Hi Andrew, Ben, the size of this change is worrisome. I looked at the crash
> signature and it seems to be a low volume crash to me. Is this truly a must
> fix for 57? Need a second opinion. Thanks!

Since we still have a few weeks for bake time on beta I think it should be uplifted.  Yes, its low frequency now, but I think the frequency could increase a lot on release when we get more users with lower end hardware.

Also, this changes some behavior from assertions (that won't trigger in release) to runtime checks.  Its possible if we go to release without this fix their profile will become corrupted for that site.  In this case the given site may not work right for them until they wipe storage, create a new profile, etc.

Also, the risk from this patch is localized to Cache API.  This isn't a core component used throughout the browser.
Flags: needinfo?(bkelly)
Also, keep in mind the crash volume excludes all of beta population since we don't execute diagnostic assertions there.
Flags: needinfo?(overholt)
Comment on attachment 8915438 [details] [diff] [review]
[Final] Bug 1402581 - P1: Don't delete the body file and decrease the padding size when the transaction fails. r=bkelly

Thanks Ben. I'll take it if you insist this is safe and the rewards outweigh the risk. Beta57+
Attachment #8915438 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8915440 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8916472 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8916473 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.