Closed Bug 799532 Opened 7 years ago Closed 7 years ago

Try to find the source of IOError when removing profile directory

Categories

(Testing :: XPCShell Harness, defect)

defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED
mozilla19

People

(Reporter: gps, Assigned: gps)

References

Details

(Whiteboard: [sheriff-want])

Attachments

(1 file)

+++ This bug was initially created as a clone of Bug #793855 +++

I'm getting sick of these IOError's and want to try to find the problem. This patch dumps an |ls -a| like output for the profile directory if we get an IOError when deleting it. It will probably need some refining once we uncover what's actually happening, but it's a start.

Yes, I known the redundant lookups to getpwuid and friends are wasteful. This isn't critical path code: I'm not worried about perf.

Sample output (with mach prepending times):

14.13 TEST-INFO | (xpcshell/head.js) | No (+ 0) checks actually run
14.13 <<<<<<<
14.13 >>>>>>>
14.13 Traceback (most recent call last):
14.13   File "/Users/gps/src/mozilla-central-hg/testing/xpcshell/runxpcshelltests.py", line 896, in runTests
14.14     self.removeDir(self.profileDir)
14.14   File "/Users/gps/src/mozilla-central-hg/testing/xpcshell/runxpcshelltests.py", line 339, in removeDir
14.14     shutil.rmtree(dirname)
14.14   File "/usr/local/Cellar/python/2.7.3/lib/python2.7/shutil.py", line 245, in rmtree
14.14     rmtree(fullname, ignore_errors, onerror)
14.14   File "/usr/local/Cellar/python/2.7.3/lib/python2.7/shutil.py", line 250, in rmtree
14.14     onerror(os.remove, fullname, sys.exc_info())
14.14   File "/usr/local/Cellar/python/2.7.3/lib/python2.7/shutil.py", line 248, in rmtree
14.14     os.remove(fullname)
14.14 OSError: [Errno 13] Permission denied: '/var/folders/sz/9xx9hkm53_q6sgt3k50nwffc0000gn/T/tmpBwWvtS/startupCache/startupCache.8.little'
14.14 <<<<<<<
14.14 >>>>>>>
14.14 Files in profile directory:
14.14 <<<<<<<
14.14 40700 gps staff /var/folders/sz/9xx9hkm53_q6sgt3k50nwffc0000gn/T/tmpBwWvtS/
14.14 40755 root wheel /var/folders/sz/9xx9hkm53_q6sgt3k50nwffc0000gn/T/tmpBwWvtS/startupCache/
14.14 100644 gps staff /var/folders/sz/9xx9hkm53_q6sgt3k50nwffc0000gn/T/tmpBwWvtS/startupCache/startupCache.8.little
Attachment #669577 - Flags: review?(ted.mielczarek)
No longer blocks: 778688
Finally got a Try build to repro!

From https://tbpl.mozilla.org/php/getParsedLog.php?id=16184085&tree=Try&full=1:

Traceback (most recent call last):
  File "xpcshell/runxpcshelltests.py", line 895, in runTests
    self.removeDir(self.profileDir)
  File "xpcshell/runxpcshelltests.py", line 339, in removeDir
    shutil.rmtree(dirname)
  File "/home/cltbld/lib/python2.5/shutil.py", line 175, in rmtree
    rmtree(fullname, ignore_errors, onerror)
  File "/home/cltbld/lib/python2.5/shutil.py", line 167, in rmtree
    onerror(os.listdir, path, sys.exc_info())
  File "/home/cltbld/lib/python2.5/shutil.py", line 165, in rmtree
    names = os.listdir(path)
OSError: [Errno 13] Permission denied: '/tmp/tmpktTCyH/Cache.Trash949609424'
<<<<<<<
Files in profile directory:
40700 cltbld cltbld /tmp/tmpktTCyH/
100664 cltbld cltbld /tmp/tmpktTCyH/runxpcshelltests_leaks.log
100600 cltbld cltbld /tmp/tmpktTCyH/secmod.db
100600 cltbld cltbld /tmp/tmpktTCyH/key3.db
100600 cltbld cltbld /tmp/tmpktTCyH/cert8.db
100600 cltbld cltbld /tmp/tmpktTCyH/_CACHE_CLEAN_
OS Error while performing os.walk!
Traceback (most recent call last):
  File "/home/cltbld/lib/python2.5/os.py", line 286, in walk
    names = listdir(top)
OSError: [Errno 13] Permission denied: '/tmp/tmpktTCyH/Cache.Trash949609424'

40775 cltbld cltbld /tmp/tmpktTCyH/Cache/
100600 cltbld cltbld /tmp/tmpktTCyH/Cache/_CACHE_001_
100600 cltbld cltbld /tmp/tmpktTCyH/Cache/_CACHE_MAP_
100600 cltbld cltbld /tmp/tmpktTCyH/Cache/_CACHE_002_
100600 cltbld cltbld /tmp/tmpktTCyH/Cache/_CACHE_003_
40700 cltbld cltbld /tmp/tmpktTCyH/Cache/D/
40700 cltbld cltbld /tmp/tmpktTCyH/Cache/3/
40700 cltbld cltbld /tmp/tmpktTCyH/Cache/1/
40700 cltbld cltbld /tmp/tmpktTCyH/Cache/7/
40700 cltbld cltbld /tmp/tmpktTCyH/Cache/0/
40700 cltbld cltbld /tmp/tmpktTCyH/Cache/5/
40700 cltbld cltbld /tmp/tmpktTCyH/Cache/B/
40700 cltbld cltbld /tmp/tmpktTCyH/Cache/2/
40700 cltbld cltbld /tmp/tmpktTCyH/Cache/A/
40700 cltbld cltbld /tmp/tmpktTCyH/Cache/6/
40700 cltbld cltbld /tmp/tmpktTCyH/Cache/E/
40700 cltbld cltbld /tmp/tmpktTCyH/Cache/C/
40700 cltbld cltbld /tmp/tmpktTCyH/Cache/F/
40700 cltbld cltbld /tmp/tmpktTCyH/Cache/4/
40700 cltbld cltbld /tmp/tmpktTCyH/Cache/9/
40700 cltbld cltbld /tmp/tmpktTCyH/Cache/8/
40755 cltbld cltbld /tmp/tmpktTCyH/webapps/
100644 cltbld cltbld /tmp/tmpktTCyH/webapps/webapps.json

So, what is "Cache.Trash949609424"?
Perhaps some Necko folk can explain where "Cache.Trash949609424" comes from.
Comment on attachment 669577 [details] [diff] [review]
Dump directory listing on profile removal failure, v1

Review of attachment 669577 [details] [diff] [review]:
-----------------------------------------------------------------

r=me since this is just diagnostic code and we'll remove it eventually.
Attachment #669577 - Flags: review?(ted.mielczarek) → review+
Digging through source code, it appears the Necko cache has a "trash compartment." It appears that old cache directories get renamed to Cache.Trash<random> and are then deleted on a background thread after some delay. Makes sense.

I suspect that permissions are getting "corrupted" as part of the move. From nsDeleteDir.cpp:

    // Important: must rename directory w/o changing parent directory: else on
    // NTFS we'll wait (with cache lock) while nsIFile's ACL reset walks file
    // tree: was hanging GUI for *minutes* on large cache dirs.
    rv = dir->MoveToNative(nullptr, leaf);

dir is just an nsIFile. I suspect we are running into some kind of race condition where the OS or filesystem is attempting to "reparent" directory entries in the directory as part of the move and we try to delete the directory before this has finished. But, that's just speculation.
https://hg.mozilla.org/mozilla-central/rev/9fcf07b50922
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Flags: in-testsuite-
Resolution: --- → FIXED
> I suspect we are running into some kind of race condition where the OS or
> filesystem is attempting to "reparent" directory entries in the directory as
> part of the move and we try to delete the directory before this has finished

I'm not much of an expert on NTFS, but AFAIK that shouldn't be happening:  when nsDeleteDir renames the file we pass null for the parent directory, which winds up resolving to "skipNtfsAclReset" here: 

  http://mxr.mozilla.org/mozilla-central/source/xpcom/io/nsLocalFileWin.cpp#1862

But perhaps there's some race condition even without doing the ACL resets.  The diagnostic output will hopefully give us a clue.
So, what's the easiest way for me to enable diagnostic output on try builds?
Blocks: 809071
Blocks: 823996
You need to log in before you can comment on or make changes to this bug.