Closed Bug 1333273 Opened 7 years ago Closed 7 years ago

Intermittent dom/indexedDB/test/test_complex_keyPaths.html | Got success, but did not expect it!

Categories

(Core :: Storage: IndexedDB, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox52 --- unaffected
firefox53 --- unaffected
firefox54 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: bevis)

References

(Depends on 1 open bug)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell disabled])

Attachments

(1 file)

I'm trying to narrow a regression range for this; in doing so, I'm seeing bug 1336848 crop up in many of the same runs that fail here.
See Also: → 1336848
From the logs it seems very likely that the call to QuotaManagerService.clearStoragesForPrincipal is failing to cleanup the database created by the worker.
Flags: needinfo?(jvarga)
See Also: → 1302290
thanks for looking into this :asuth and :janv!  As a note, we have many indexedDB tests which are failing in the same way, cumulative they build a great spot on the top intermittent list :)  When this is fixed, I will verify all is well with the others
:asuth, can you ping :jvarga offline or find someone else to work on this?  This is failing at a 20%+ rate, I am really close to disabling this test for windows*, but would prefer to see this fixed.
Flags: needinfo?(bugmail)
Jan told me he's going to take a look today.
Flags: needinfo?(bugmail)
Update my finding so far for both bug 1336848 and this bug because the root cause looks similar:
1. the origin data was cleared incompletely after test_blob_worker_xhr_post.html was done which was run in prior to test_complex_keyPaths.html.
2. From the test log in [1] from treeherder result in [2], the .metadata and .metadata-v2 are removed when clearAllDatabase() but the deleting operation was stop when removing the directory of 3881517805%l2mFtthe.sttsso%p2_F.files.
3. The suspicious point is that 
  - Files and directories under 3881517805%l2mFtthe.sttsso%p2_F.files were removed successfully.
  - However, the directory of 3881517805%l2mFtthe.sttsso%p2_F.files was not able to be removed even though the files and sub-directories are removed successfully. (The error code 145 from GetLastError() in windows means ERROR_DIR_NOT_EMPTY according to the MSDN.)
4. Unlike other test cases, in test_complex_keyPaths.html, the same database name is used to test in both worker & window, so when the clearAllDatabases() was done after testing in worker, the database was not removed as expected because, internally, the operation of clearAllDatabases() was stopped at GetDirectoryMetadata2WithRestore() while running OriginClearOp::DeleteFiles().[3]  

Hi Jan,

Did you meet this problem before?
I cannot understand, in quota manager, why we got ERROR_DIR_NOT_EMPTY from removing a directory when its files and sub-directories were removed successfully.

In addition, I still have no idea why this only happens under e10s. :|

Do you are any advice of how to investigate these suspicious points in advance?

Thanks!

[1] https://archive.mozilla.org/pub/firefox/try-builds/btseng@mozilla.com-4dd4ff718fc9599db553dbfe6975668e19890ad2/try-win32/try_win7_vm_test-mochitest-e10s-2-bm140-tests1-windows-build2195.txt.gz
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=4dd4ff718fc9599db553dbfe6975668e19890ad2&group_state=expanded&selectedJob=79594278
[3] http://searchfox.org/mozilla-central/rev/b1044cf7c2000c3e75e8181e893236a940c8b6d2/dom/quota/ActorsParent.cpp#6471
Are you sure, the .files directory is empty ?, I see this in the log:
01:30:59     INFO - BVS, nsLocalFile::Remove, removing file: c:\users\cltbld\appdata\local\temp\tmpkcxqeg.mozrunner\storage\default\http+++mochi.test+8888\idb\3881517805%l2mFtthe.sttsso%p2_F.files\1
01:30:59     INFO - BVS, nsLocalFile::Remove, failed to remove file: c:\users\cltbld\appdata\local\temp\tmpkcxqeg.mozrunner\storage\default\http+++mochi.test+8888\idb\3881517805%l2mFtthe.sttsso%p2_F.files\1, rv: 2152857621

So the file named "1" is probably still open, files like this represent stored DOM blobs/files, Mutable files, WASM modules or big structured clones.

Currently, there's no code that would prevent OriginClearOp from starting the delete when stored files are still open :(
There's a hack in OriginClearOp::DeleteFiles, I mean the for loop with PR_Sleep inside of it. The comment says it's windows specific. I think the problem is not directly related to Windows. It's probably not working on windows because Windows doesn't allow to delete files that are still open, unlike other systems.

I tend to think that the primary problem here is that 3881517805%l2mFtthe.sttsso%p2_F.files\1 is still being read by something.

The best solution would be to abort any operations on stored files and wait for correct closing of these files, before we start deleting origins. However, that's quite complex task.
For now, we can maybe identify problematic places and abort operations on stored files manually or find some other workaround.

Anyway, thanks for providing this log, it helped a lot.
Flags: needinfo?(jvarga)
thanks for the updates on this!  :janv, will you be coming up with a fix for this?  With 150+ failures in the last week and still going strong, this is showing up in try pushes and confusing many people.  I would like to see this fixed soon or temporarily disabled if we don't have time to fix this.
Flags: needinfo?(jvarga)
(In reply to Jan Varga [:janv] from comment #25)
> 01:30:59     INFO - BVS, nsLocalFile::Remove, failed to remove file:
> c:\users\cltbld\appdata\local\temp\tmpkcxqeg.
> mozrunner\storage\default\http+++mochi.test+8888\idb\3881517805%l2mFtthe.
> sttsso%p2_F.files\1, rv: 2152857621

Expanding on Jan's response, what I see in the logs is that:
- The "1" file is successfully marked for deletion, but not actually deleted because there's still an open reference to the file.  (Probably the XHR post in test_blob_worker_xhr_post.html is still holding the nsFileInputStream open.)  Per the DeleteFile docs (https://msdn.microsoft.com/en-us/library/windows/desktop/aa363915(v=vs.85).aspx), "The DeleteFile function marks a file for deletion on close. Therefore, the file deletion does not occur until the last handle to the file is closed."
- Removing the directory fails because that file has not yet actually been deleted.
- When the removal attempt occurs again in the excerpted log entry I quote above, we get NS_ERROR_FILE_ACCESS_DENIED (rv=2152857621) is because the previous call to DeleteFile succeeded.  The DeleteFiles docs tell us "Subsequent calls to CreateFile to open the file fail with ERROR_ACCESS_DENIED."  I think it makes sense that the repeated calls to DeleteFile would fail in this fashion based on windows' model of access rights; presumably deletion removes all access rights.


== FILE_SHARE_DELETE notes

In looking into this, I also noticed that our Blob streams are opened for reading safely with FILE_SHARE_DELETE, but Quota streams (both reading and writing, are not).  Files opened without FILE_SHARE_DELETE will similarly return an ERROR_ACCESS_DENIED when DeleteFile is invoked.  This is notable in this case since that's what IDB's FileHelper::CreateFileFromStream uses (http://searchfox.org/mozilla-central/source/dom/indexedDB/ActorsParent.cpp#29435).

FileBlobImpl uses the following behavior flags and so Blob reads should not result in NS_ERROR_FILE_ACCESS_DENIED:
const uint32_t sFileStreamFlags =
  nsIFileInputStream::CLOSE_ON_EOF |
  nsIFileInputStream::REOPEN_ON_REWIND |
  nsIFileInputStream::DEFER_OPEN |
  nsIFileInputStream::SHARE_DELETE;

Quota's FileStreams specify a default behavior flag set of 0, which is what IDB's CreateFileFromStream uses, see http://searchfox.org/mozilla-central/source/dom/quota/FileStreams.h#87.  Perhaps the quota manager stream behavior flags (input and output) should be set to default to include nsIFileInputStream::SHARE_DELETE?  (Note that although nsIFileOutputStream does not expose such a constant, nsFileStreamBase only deals in nsIFileInputStream constants anyways and will honor it.)


== FileInfo refcounting.

(In reply to Jan Varga [:janv] from comment #25)
> The best solution would be to abort any operations on stored files and wait
> for correct closing of these files, before we start deleting origins.
> However, that's quite complex task.

Yeah, having the input streams (which can be serialized) hold FileInfo references is not a small undertaking.  Note that things are slightly worse right now when BlobImplStoredFile blobs are sliced.  They use FileBlobImpl's implementation which returns a new FileBlobImpl which keeps the nsIFile reference but obviously no longer has a FileInfo reference.  While test_blob_worker_xhr_post.html does perform blob slicing and that's what it xhr.send()s, it's not clear that this has a meaningful effect on the test.
(In reply to Jan Varga [:janv] from comment #25)
> There's a hack in OriginClearOp::DeleteFiles, I mean the for loop with
> PR_Sleep inside of it. The comment says it's windows specific. I think the
> problem is not directly related to Windows. It's probably not working on
> windows because Windows doesn't allow to delete files that are still open,
> unlike other systems.

SQLite explicitly retries most of its file operations because of contention with anti-virus software.  I would expect this is a lot of the decision, but as you say, windows' deletion semantics and our inability to reliably close all the handles first is also a reasonable reason to do sleep+retry.  See SQLite's os_win.c (https://www.sqlite.org/src/artifact/2a6c73eef01c51a0) for more details, particularly winDelete and SQLITE_WIN32_IOERR_RETRY and the code/comments around them.
(In reply to Andrew Sutherland [:asuth] from comment #28)
> (In reply to Jan Varga [:janv] from comment #25)
> > There's a hack in OriginClearOp::DeleteFiles, I mean the for loop with
> > PR_Sleep inside of it. The comment says it's windows specific. I think the
> > problem is not directly related to Windows. It's probably not working on
> > windows because Windows doesn't allow to delete files that are still open,
> > unlike other systems.
The PR_Sleep looks useless in windows because the nsresult for nsLocalFile::Rmove(/*aRecursive*/ true) of a directory is not used as a return value in nsLocalFile::Remove() in nsLocalFileWin.cpp.
http://searchfox.org/mozilla-central/rev/b1044cf7c2000c3e75e8181e893236a940c8b6d2/xpcom/io/nsLocalFileWin.cpp#2437

The workaround I have for now are:
1. To return correct nsresult from nsLocalFileWin.cpp
2. To rename the test db name in test_complex_keyPaths.html with the same convention to
   const name = this.window ? window.location.pathname : "test_complex_keyPaths";
I know this is stupid, but I'd like to give it a try to see if the orange factor can be reduced.
(In reply to Bevis Tseng[:bevistseng][:btseng] from comment #30)
> (In reply to Andrew Sutherland [:asuth] from comment #28)
> > (In reply to Jan Varga [:janv] from comment #25)
> > > There's a hack in OriginClearOp::DeleteFiles, I mean the for loop with
> > > PR_Sleep inside of it. The comment says it's windows specific. I think the
> > > problem is not directly related to Windows. It's probably not working on
> > > windows because Windows doesn't allow to delete files that are still open,
> > > unlike other systems.
> The PR_Sleep looks useless in windows because the nsresult for
> nsLocalFile::Rmove(/*aRecursive*/ true) of a directory is not used as a
> return value in nsLocalFile::Remove() in nsLocalFileWin.cpp.
> http://searchfox.org/mozilla-central/rev/
> b1044cf7c2000c3e75e8181e893236a940c8b6d2/xpcom/io/nsLocalFileWin.cpp#2437
> 
> The workaround I have for now are:
> 1. To return correct nsresult from nsLocalFileWin.cpp
> 2. To rename the test db name in test_complex_keyPaths.html with the same
> convention to
>    const name = this.window ? window.location.pathname :
> "test_complex_keyPaths";
> I know this is stupid, but I'd like to give it a try to see if the orange
> factor can be reduced.

Unforunately, returning nsresult from nsLocalFileWin or adding SHARE_DELETE in open input stream doesn't help on this.
Besides, renaming db name in test_complex_keyPaths.html is not enough. We have to rename all the db name "Splendid Test" used in worker.
Instead of doing this, I think we should disable test_blob_worker*.html first in Win 7 VM instead.
Disable test_blob_worker_{crash, xhr_post, xhr_post_multifile}.html in Win7(e10s) to prevent similar intermittent failures in bug 1336848 and this bug.

Wait for treeherder result complete:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1ad97ac4971af51207320a64c22af7362e3b21e7
Comment on attachment 8840794 [details] [diff] [review]
Disable test_blob_worker_{crash, xhr_post, xhr_post_multifile}.html in Win7(e10s).

Review of attachment 8840794 [details] [diff] [review]:
-----------------------------------------------------------------

treeherder looks fine.

Hi Jan,

I've filtered out these 3 test cases that cause the symptom mentioned in comment 24.
In addition, bug 1342415 has been filed to track this problem.

This patch is to disable these test cases in win 7 build(e10s) to remove the orange factor from treeherder first.

May I have your review for this change?

Thanks!
Attachment #8840794 - Flags: review?(jvarga)
BTW, orange factors were gone according to the treeherder result in comment 32.
Comment on attachment 8840794 [details] [diff] [review]
Disable test_blob_worker_{crash, xhr_post, xhr_post_multifile}.html in Win7(e10s).

Review of attachment 8840794 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!
Attachment #8840794 - Flags: review?(jvarga) → review+
Pushed by btseng@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/973ab774ff18
Disable test_blob_worker_{crash, xhr_post, xhr_post_multifile}.html in Win7(e10s). r=janv
Whiteboard: [stockwell disabled]
https://hg.mozilla.org/mozilla-central/rev/973ab774ff18
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Assignee: nobody → btseng
Depends on: 1342415
Flags: needinfo?(jvarga)
See Also: → 1342415
You need to log in before you can comment on or make changes to this bug.