Closed Bug 752243 Opened 12 years ago Closed 8 years ago

Intermittent test_service_wipeServer.js test_service_createAccount.js test_service_changePassword.js test_aitc_server.js test_aitc_client.js test_hashcompleter.js | Failed to clean up the test profile directory: [Errno 13] Permission denied: '/tmp/.../..'

Categories

(Core :: General, defect, P3)

defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox17 --- wontfix
firefox18 --- disabled
firefox19 --- disabled
firefox20 --- disabled

People

(Reporter: philor, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [test which aborts the suite][test disabled on OS X and Linux][leave open])

Not sure what the problem is, but

https://tbpl.mozilla.org/php/getParsedLog.php?id=11481405&tree=Mozilla-Inbound
Rev4 MacOSX Snow Leopard 10.6 mozilla-inbound debug test xpcshell on 2012-05-04 15:17:10 PDT for push b3238e70fcb2

OSError: [Errno 13] Permission denied: '/var/folders/Hs/HsDn6a9SG8idoIya6p9mtE+++TI/-Tmp-/tmpdNbNkE/Cache/0'

and

https://tbpl.mozilla.org/php/getParsedLog.php?id=11501111&tree=Mozilla-Inbound
Rev4 MacOSX Snow Leopard 10.6 mozilla-inbound debug test xpcshell on 2012-05-05 09:57:21 PDT for push 7a7c6d1e67e0

OSError: [Errno 13] Permission denied: '/var/folders/Hs/HsDn6a9SG8idoIya6p9mtE+++TI/-Tmp-/tmpqyjFkK/Cache/6'

when nobody else is having trouble running that test (both happened after services/sync/tests/unit/test_service_wipeServer.js finished), and when that slave also sometimes runs xpcshell successfully, seems odd.
Hi ho, hi ho, it's off to Sync we go...

https://tbpl.mozilla.org/php/getParsedLog.php?id=11574898&tree=Mozilla-Inbound
Rev4 MacOSX Snow Leopard 10.6 mozilla-inbound debug test xpcshell on 2012-05-08 10:32:14 PDT for push 24cf58c37749
slave: talos-r4-snow-055

TEST-PASS | /Users/cltbld/talos-slave/test/build/xpcshell/tests/services/sync/tests/unit/test_service_createAccount.js | test passed (time: 1105.603ms)
Traceback (most recent call last):
  File "xpcshell/runxpcshelltests.py", line 923, in <module>
    main()
  File "xpcshell/runxpcshelltests.py", line 919, in main
    if not xpcsh.runTests(args[0], testdirs=args[1:], **options.__dict__):
  File "xpcshell/runxpcshelltests.py", line 806, in runTests
    self.removeDir(self.profileDir)
  File "xpcshell/runxpcshelltests.py", line 339, in removeDir
    shutil.rmtree(dirname)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/shutil.py", line 216, in rmtree
    rmtree(fullname, ignore_errors, onerror)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/shutil.py", line 216, in rmtree
    rmtree(fullname, ignore_errors, onerror)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/shutil.py", line 208, in rmtree
    onerror(os.listdir, path, sys.exc_info())
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/shutil.py", line 206, in rmtree
    names = os.listdir(path)
OSError: [Errno 13] Permission denied: '/var/folders/Hs/HsDn6a9SG8idoIya6p9mtE+++TI/-Tmp-/tmpIoLqNx/Cache/0'

https://tbpl.mozilla.org/php/getParsedLog.php?id=11576004&tree=Mozilla-Inbound
Rev4 MacOSX Snow Leopard 10.6 mozilla-inbound debug test xpcshell on 2012-05-08 11:11:34 PDT for push 24cf58c37749
slave: talos-r4-snow-011

TEST-PASS | /Users/cltbld/talos-slave/test/build/xpcshell/tests/services/sync/tests/unit/test_service_wipeServer.js | test passed (time: 1241.343ms)
Traceback (most recent call last):
  File "xpcshell/runxpcshelltests.py", line 923, in <module>
    main()
  File "xpcshell/runxpcshelltests.py", line 919, in main
    if not xpcsh.runTests(args[0], testdirs=args[1:], **options.__dict__):
  File "xpcshell/runxpcshelltests.py", line 806, in runTests
    self.removeDir(self.profileDir)
  File "xpcshell/runxpcshelltests.py", line 339, in removeDir
    shutil.rmtree(dirname)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/shutil.py", line 216, in rmtree
    rmtree(fullname, ignore_errors, onerror)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/shutil.py", line 216, in rmtree
    rmtree(fullname, ignore_errors, onerror)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/shutil.py", line 208, in rmtree
    onerror(os.listdir, path, sys.exc_info())
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/shutil.py", line 206, in rmtree
    names = os.listdir(path)
OSError: [Errno 13] Permission denied: '/var/folders/Hs/HsDn6a9SG8idoIya6p9mtE+++TI/-Tmp-/tmp20NVwz/Cache/0'
Alias: talos-r4-snow-084
Component: Release Engineering: Machine Management → Firefox Sync: Backend
Product: mozilla.org → Mozilla Services
QA Contact: armenzg → sync-backend
Summary: talos-r4-snow-084 problem tracking → Sync tests (including test_service_wipeServer.js and test_service_createAccount.js) intermittently creating unremovable files
Version: other → unspecified
Blocks: 438871
Whiteboard: [orange]
https://tbpl.mozilla.org/php/getParsedLog.php?id=11571472&tree=Mozilla-Inbound
Rev4 MacOSX Snow Leopard 10.6 mozilla-inbound debug test xpcshell on 2012-05-08 08:05:39 PDT for push 81eed0782699
slave: talos-r4-snow-058

TEST-PASS | /Users/cltbld/talos-slave/test/build/xpcshell/tests/services/sync/tests/unit/test_service_createAccount.js | test passed (time: 1090.495ms)
Traceback (most recent call last):
  File "xpcshell/runxpcshelltests.py", line 923, in <module>
    main()
  File "xpcshell/runxpcshelltests.py", line 919, in main
    if not xpcsh.runTests(args[0], testdirs=args[1:], **options.__dict__):
  File "xpcshell/runxpcshelltests.py", line 806, in runTests
    self.removeDir(self.profileDir)
  File "xpcshell/runxpcshelltests.py", line 339, in removeDir
    shutil.rmtree(dirname)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/shutil.py", line 216, in rmtree
    rmtree(fullname, ignore_errors, onerror)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/shutil.py", line 208, in rmtree
    onerror(os.listdir, path, sys.exc_info())
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/shutil.py", line 206, in rmtree
    names = os.listdir(path)
OSError: [Errno 13] Permission denied: '/var/folders/Hs/HsDn6a9SG8idoIya6p9mtE+++TI/-Tmp-/tmpw9CAhB/Cache'
Most interesting. First time I've heard of this issue. It looks like we're getting the error on an opendir() or readdir()! However, I have no clue where that "Cache" directory comes from. What part of the profile supplies a "Cache" directory (complete with uppercase 'C')? And, why would the permissions lock out the xpcshell test runner? Perhaps the directory doesn't have the execute bit set?

I'd like to say Sync just happened to be caught in the middle of another regression...
This appears to be the necko cache. I was confused because this directory didn't exist in my personal profiles. By default on OS X, the necko cache goes to ~/Library/Caches/Firefox/Profiles/XXXX/Cache. When you run with -profile, the Cache directory gets created in a unified directory tree.

So, the question becomes, "why do we get 'Permission denied' when the test runner attempts to delete the necko cache directory?"
Hmm, I don't recall any changes to the cache that ought to change directory permissions.  Maybe Michal/Nick know something.
https://tbpl.mozilla.org/php/getParsedLog.php?id=11861498&tree=Mozilla-Inbound
Summary: Sync tests (including test_service_wipeServer.js and test_service_createAccount.js) intermittently creating unremovable files → Sync tests (including test_service_wipeServer.js, test_service_createAccount.js, test_service_changePassword.js) intermittently creating unremovable files
Per bug 755949, Sync is disabling the Necko cache, so the cache shouldn't come into play.

To make matters more interesting, the deletion of the profile directory occurs *after* the xpcshell process has exited. So, there shouldn't be a problem with deleting an in-use file/directory.

It is worth mentioning that as soon as Sync makes an HTTP request using Necko, the Cache directory is created in the profile. I've executed these tests locally and verified that the cache is empty before the profile directory is removed (this is expected).

Also, every failure is on OS X 10.6. I wonder if some infrastructure change triggered this. I also wonder if any other component using Neck (like extensions) is seeing failures like this.

*scratches head*
For Necko people: should the Cache directory be created if Sync is properly disabling the cache? Or, is it like an automatic thing that is initialized when Necko is first used?
RelEng: what is /var/folders/Hs/HsDn6a9SG8idoIya6p9mtE+++TI/-Tmp-/ and why does every failure have this as the TMPDIR? I see from other logs that this directory is not always the TMPDIR.

I hypothesize that if /var/folders/Hs/HsDn6a9SG8idoIya6p9mtE+++TI/-Tmp-/ is used, the test will fail.
Look! 2 more logs with TMPDIR=/var/folders/Hs/HsDn6a9SG8idoIya6p9mtE+++TI/-Tmp-/.

I'm sending this back to RelEng so they can investigate this directory.
Component: Firefox Sync: Backend → Release Engineering: Machine Management
Product: Mozilla Services → mozilla.org
QA Contact: sync-backend → armenzg
Version: unspecified → other
(In reply to Gregory Szorc [:gps] from comment #19)
> Look! 2 more logs with
> TMPDIR=/var/folders/Hs/HsDn6a9SG8idoIya6p9mtE+++TI/-Tmp-/.
> 
> I'm sending this back to RelEng so they can investigate this directory.

what test host was this running on?
talos-r4-snow-042
talos-r4-snow-016
talos-r4-snow-025
talos-r4-snow-035
talos-r4-snow-054
talos-r4-snow-058
talos-r4-snow-040
talos-r4-snow-058
talos-r4-snow-027
talos-r4-snow-020
talos-r4-snow-013
talos-r4-snow-011
talos-r4-snow-055
talos-r4-snow-084
talos-r4-snow-084
k, I took a look at one of them:

talos-r4-snow-013:HsDn6a9SG8idoIya6p9mtE+++TI cltbld$ cd /var/folders/Hs/HsDn6a9SG8idoIya6p9mtE+++TI/-Tmp-/
talos-r4-snow-013:-Tmp- cltbld$ ls -la
total 0
drwx------   4 cltbld  staff   136 May 18 19:59 .
drwxr-xr-x   5 cltbld  staff   170 May 18 19:50 ..
-rw-------   1 cltbld  staff     0 May 18 19:58 tmpCyF1dnpidlog
drwxrwxr-x  52 cltbld  staff  1768 May 18 20:00 tmpSVM6OG
talos-r4-snow-013:-Tmp- cltbld$ cd tmpSVM6OG/
talos-r4-snow-013:tmpSVM6OG cltbld$ ls -la
total 37040
drwxrwxr-x  52 cltbld  staff      1768 May 18 20:00 .
drwx------   4 cltbld  staff       136 May 18 19:59 ..
-rw-rw-r--   1 cltbld  staff         0 May 18 19:58 .parentlock
drwxrwxr-x  22 cltbld  staff       748 May 18 19:59 Cache
drwx------   4 cltbld  staff       136 May 18 19:59 OfflineCache
drwx------   2 cltbld  staff        68 May 18 19:58 bookmarkbackups
-rw-------   1 cltbld  staff     65536 May 18 19:59 cert8.db
drwxrwxr-x   2 cltbld  staff        68 May 18 19:58 chrome
-rw-r--r--   1 cltbld  staff     98304 May 18 20:00 chromeappsstore.sqlite
-rw-------   1 cltbld  staff       283 May 18 19:58 compatibility.ini
-rw-r--r--   1 cltbld  staff    229376 May 18 20:00 content-prefs.sqlite
-rw-r--r--   1 cltbld  staff    524288 May 18 19:59 cookies.sqlite
-rw-r--r--   1 cltbld  staff     32768 May 18 19:59 cookies.sqlite-shm
-rw-r--r--   1 cltbld  staff    590288 May 18 19:59 cookies.sqlite-wal
-rw-r--r--   1 cltbld  staff     65536 May 18 20:00 downloads.sqlite
drwxrwxr-x   6 cltbld  staff       204 May 18 19:58 extensions
-rw-r--r--   1 cltbld  staff       486 May 18 19:58 extensions.ini
-rw-r--r--   1 cltbld  staff    425984 May 18 19:58 extensions.sqlite
-rw-r--r--   1 cltbld  staff    262720 May 18 19:58 extensions.sqlite-journal
-rw-r--r--   1 cltbld  staff    196608 May 18 19:59 formhistory.sqlite
-rw-------   1 cltbld  staff     16384 May 18 19:58 key3.db
-rw-rw-r--   1 cltbld  staff       502 May 18 19:58 lightweighttheme-footer
-rw-rw-r--   1 cltbld  staff       501 May 18 19:58 lightweighttheme-header
-rw-rw-r--   1 cltbld  staff      6533 May 18 20:00 localstore.rdf
-rw-rw-r--   1 cltbld  staff      4050 May 18 20:00 mimeTypes.rdf
drwx------   2 cltbld  staff        68 May 18 19:58 minidumps
drwx------   2 cltbld  staff        68 May 18 19:59 mozilla-media-cache
-rw-r--r--   1 cltbld  staff      8192 May 18 19:59 permissions.sqlite
-rw-r--r--   1 cltbld  staff  10485760 May 18 20:00 places.sqlite
-rw-r--r--   1 cltbld  staff     32768 May 18 19:58 places.sqlite-shm
-rw-r--r--   1 cltbld  staff     98408 May 18 20:00 places.sqlite-wal
-rw-------   1 cltbld  staff       682 May 18 19:59 pluginreg.dat
drwxrwxr-x   3 cltbld  staff       102 May 18 19:51 plugins
-rw-------   1 cltbld  staff     25747 May 18 20:00 prefs.js
-rw-r--r--   1 cltbld  staff       188 May 18 19:59 search-metadata.json
-rw-r--r--   1 cltbld  staff     12051 May 18 19:59 search.json
drwxr-xr-x   2 cltbld  staff        68 May 18 19:59 searchplugins
-rw-------   1 cltbld  staff     16384 May 18 19:58 secmod.db
-rw-rw-r--   1 cltbld  staff        11 May 18 19:58 server_alive.txt
-rw-------   1 cltbld  staff      7846 May 18 20:00 sessionstore.js
-rw-r--r--   1 cltbld  staff    327680 May 18 19:59 signons.sqlite
-rw-rw-r--   1 cltbld  staff       635 May 18 19:58 ssltunnel.cfg
drwxrwxr-x   2 cltbld  staff        68 May 18 19:58 startupCache
-rw-rw-r--   1 cltbld  staff      1281 May 18 19:58 testConfig.js
-rw-rw-r--   1 cltbld  staff        89 May 18 19:58 tests.manifest
-rw-r--r--   1 cltbld  staff        32 May 18 19:58 urlclassifier.pset
-rw-r--r--   1 cltbld  staff   5242880 May 18 19:58 urlclassifier3.sqlite
-rw-rw-r--   1 cltbld  staff     22214 May 18 19:58 user.js
-rw-rw-r--   1 cltbld  staff       253 May 18 19:58 userChrome.css
drwxr-xr-x   5 cltbld  staff       170 May 18 19:59 weave
drwxr-xr-x   2 cltbld  staff        68 May 18 19:58 webapps
-rw-r--r--   1 cltbld  staff     98304 May 18 19:58 webappsstore.sqlite
Is this temporary directory on local disk or on some NAS?

Why do different machines have the same TMPDIR? Shouldn't the temporary directory be random across different machines? Shouldn't the temporary directory be chosen randomly at job execution time?
No longer blocks: 438871
It is located on disk. 

[23:00]  <philor> ah, random at GUI login, to prevent attacks from prior knowledge of where something will create a temp file

that's all I know - the other questions need to be answered by the devs or ateam folk who are infinitely more familiar with this code than I am.
And that "random at GUI login" was from a not particularly reputable looking forum post. Not surprisingly, being Apple, they didn't exactly document "we're using random tmp locations, but we're taking the random locations from this set of 30 possibilities and only changing which one is used under these three circumstances." There are lots of possibilities for why it's the same one, good and bad: releng has a reference image from which all the slaves are imaged, so it could be just fine that they all get one with that name, or it could be that something in the imaging or something in the way they are rebooted and logged in doesn't rotate tmp locations like 10.6 expected it to happen.

But as long as there aren't files in there owned by another user, and the owner has write permission, it still smells more like file-in-use than bad infra.
What really boggles me is that the stack trace implies this is on an opendir() or readdir(), not an unlink() or rmdir():

/shutil.py", line 206, in rmtree
    names = os.listdir(path)
OSError: [Errno 13] Permission denied: '/var/folders/Hs/HsDn6a9SG8idoIya6p9mtE+++TI/-Tmp-/tmpIoLqNx/Cache/0'

We should be able to do those operations, even if a file is in use. Of course, this all happens *after* the xpcshell process has exited, so there should be no lingering handles (assuming there were no forks, which I'm pretty sure we don't do). That's why I suspect there is some weird underlying filesystem issue, perhaps a hardware failure manifesting as a permissions error.

*Keeps scratching head*
https://tbpl.mozilla.org/php/getParsedLog.php?id=13552953&tree=Mozilla-Inbound
Whiteboard: [orange] → [orange][test which aborts the suite]
https://tbpl.mozilla.org/php/getParsedLog.php?id=14658250&tree=Birch
Summary: Sync tests (including test_service_wipeServer.js, test_service_createAccount.js, test_service_changePassword.js) intermittently creating unremovable files → Sync tests (including test_service_wipeServer.js, test_service_createAccount.js, test_service_changePassword.js, test_aitc_server.js) intermittently creating unremovable files
https://tbpl.mozilla.org/php/getParsedLog.php?id=14776610&tree=Mozilla-Inbound
Summary: Sync tests (including test_service_wipeServer.js, test_service_createAccount.js, test_service_changePassword.js, test_aitc_server.js) intermittently creating unremovable files → Sync tests (including test_service_wipeServer.js, test_service_createAccount.js, test_service_changePassword.js, test_aitc_server.js, test_aitc_client.js) intermittently creating unremovable files
https://tbpl.mozilla.org/php/getParsedLog.php?id=14992596&tree=Mozilla-Inbound
Summary: Sync tests (including test_service_wipeServer.js, test_service_createAccount.js, test_service_changePassword.js, test_aitc_server.js, test_aitc_client.js) intermittently creating unremovable files → Sync tests (including test_service_wipeServer.js, test_service_createAccount.js, test_service_changePassword.js, test_aitc_server.js, test_aitc_client.js, test_hashcompleter.js) intermittently creating unremovable files
https://tbpl.mozilla.org/php/getParsedLog.php?id=15025325&tree=Mozilla-Inbound
Summary: Sync tests (including test_service_wipeServer.js, test_service_createAccount.js, test_service_changePassword.js, test_aitc_server.js, test_aitc_client.js, test_hashcompleter.js) intermittently creating unremovable files → Sync tests (including test_service_wipeServer.js, test_service_createAccount.js, test_service_changePassword.js, test_aitc_server.js, test_aitc_client.js) and others (including test_hashcompleter.js) intermittently creating unremovable files
Not a sync test, but still 10.6...

https://tbpl.mozilla.org/php/getParsedLog.php?id=15304044&tree=Mozilla-Inbound

Rev4 MacOSX Snow Leopard 10.6 mozilla-inbound debug test xpcshell
talos-r4-snow-021

TEST-PASS | /Users/cltbld/talos-slave/test/build/xpcshell/tests/toolkit/components/url-classifier/tests/unit/test_hashcompleter.js | test passed (time: 530.273ms)
Traceback (most recent call last):
  File "xpcshell/runxpcshelltests.py", line 992, in <module>
    main()
  File "xpcshell/runxpcshelltests.py", line 988, in main
    if not xpcsh.runTests(args[0], testdirs=args[1:], **options.__dict__):
  File "xpcshell/runxpcshelltests.py", line 866, in runTests
    self.removeDir(self.profileDir)
  File "xpcshell/runxpcshelltests.py", line 325, in removeDir
    shutil.rmtree(dirname)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/shutil.py", line 216, in rmtree
    rmtree(fullname, ignore_errors, onerror)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/shutil.py", line 216, in rmtree
    rmtree(fullname, ignore_errors, onerror)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/shutil.py", line 208, in rmtree
    onerror(os.listdir, path, sys.exc_info())
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/shutil.py", line 206, in rmtree
    names = os.listdir(path)
OSError: [Errno 13] Permission denied: '/var/folders/Hs/HsDn6a9SG8idoIya6p9mtE+++TI/-Tmp-/tmp1uOCqW/Cache/0'
program finished with exit code 1
Depends on: 793855
https://tbpl.mozilla.org/php/getParsedLog.php?id=15688834&tree=Try

It's actually been Mac and Linux for quite a while now.
OS: Mac OS X → All
Hardware: x86_64 → All
Please may someone take a look at this?

Since it's a "have to manually star" failure, it's even more frustrating for sheriffs than it's OrangeFactor score would suggest (and even then, it's a top 5 orangefactor failure for the last few days).

Thank you :-)
I am relatively confident the underlying issue is not in Sync's code. The problem occurs *after* the xpcshell process launched by the Python test runner has exited. So, unless the xpcshell process forked a grandchild that was detached from all file descriptors and lives longer than xpcshell itself, this is a problem with the Python xpcshell test runner or the machines running tests.
Also, since the failure frequency has increased recently, what other changes could they be correlated to? One theory is one of RelEng's system images might be buggy. Have new test systems been "flashed" recently to cause an increase in the frequency of these failures? Have there been any xpcshell or xpcshell test runner changes recently?
https://tbpl.mozilla.org/php/getParsedLog.php?id=15782549&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=15781548&tree=Mozilla-Inbound

https://tbpl.mozilla.org/php/getParsedLog.php?id=15786669&tree=Mozilla-Inbound
TEST-INFO | /home/cltbld/talos-slave/test/build/xpcshell/tests/services/sync/tests/unit/test_service_wipeServer.js | running test ...
Traceback (most recent call last):
  File "xpcshell/runxpcshelltests.py", line 992, in <module>
    main()
  File "xpcshell/runxpcshelltests.py", line 988, in main
    if not xpcsh.runTests(args[0], testdirs=args[1:], **options.__dict__):
  File "xpcshell/runxpcshelltests.py", line 866, in runTests
    self.removeDir(self.profileDir)
  File "xpcshell/runxpcshelltests.py", line 325, in removeDir
    shutil.rmtree(dirname)
  File "/home/cltbld/lib/python2.5/shutil.py", line 175, in rmtree
TEST-PASS | /home/cltbld/talos-slave/test/build/xpcshell/tests/services/sync/tests/unit/test_service_wipeServer.js | test passed (time: 1528.884ms)
    rmtree(fullname, ignore_errors, onerror)
  File "/home/cltbld/lib/python2.5/shutil.py", line 167, in rmtree
    onerror(os.listdir, path, sys.exc_info())
  File "/home/cltbld/lib/python2.5/shutil.py", line 165, in rmtree
    names = os.listdir(path)
OSError: [Errno 13] Permission denied: '/tmp/tmpU3c2Fw/Cache'

https://tbpl.mozilla.org/php/getParsedLog.php?id=15787118&tree=Mozilla-Inbound
My patch in bug 793855, gave the following (hopefully helpful) output:
https://tbpl.mozilla.org/php/getParsedLog.php?id=15894268&full=1&branch=try#error0

{
TEST-PASS | /builds/slave/talos-slave/test/build/xpcshell/tests/services/sync/tests/unit/test_service_wipeServer.js | [test_wipeServer_list_success : 62] false == false
wipeServer() will happily ignore the non-existent collection and use the timestamp of the last DELETE that was successful.
Sync.Resource	DEBUG	mesg: DELETE success 200 http://localhost:8080/1.1/johndoe/storage/steam
Sync.Resource	DEBUG	DELETE success 200 http://localhost:8080/1.1/johndoe/storage/steam
Sync.Resource	DEBUG	mesg: DELETE success 200 http://localhost:8080/1.1/johndoe/storage/diesel
Sync.Resource	DEBUG	DELETE success 200 http://localhost:8080/1.1/johndoe/storage/diesel
Processing request
Sync.Resource	DEBUG	mesg: DELETE fail 404 http://localhost:8080/1.1/johndoe/storage/petrol
Sync.Resource	DEBUG	DELETE fail 404 http://localhost:8080/1.1/johndoe/storage/petrol

TEST-PASS | /builds/slave/talos-slave/test/build/xpcshell/tests/services/sync/tests/unit/test_service_wipeServer.js | [test_wipeServer_list_success : 66] 1349611692.35 == 1349611692.35
wipeServer stopped deleting after encountering an error with the 'petrol' collection, thus only 'steam' has been deleted.

...
...

TEST-INFO | (xpcshell/head.js) | test 2 pending
TEST-INFO | /builds/slave/talos-slave/test/build/xpcshell/tests/services/sync/tests/unit/test_service_wipeServer.js | Starting test_wipeServer_list_503
Service.wipeServer() deletes collections given as argument.
Sync.Service	DEBUG	Caching URLs under storage user base: http://localhost:8080/1.1/null/
Sync.Identity	INFO	Username changed. Removing stored credentials.
Sync.Identity	INFO	Basic password has no value. Removing.
Sync.Identity	INFO	Sync Key has no value. Deleting.
Sync.Identity	INFO	Basic password has no value. Removing.
Sync.Identity	INFO	Sync Key being updated.
Sync.SyncKeyBundle	INFO	SyncKeyBundle being created.
Sync.Identity	INFO	Basic password being updated.
Sync.Identity	INFO	Sync Key being updated.
Sync.SyncKeyBundle	INFO	SyncKeyBundle being created.
Sync.Identity	WARN	Provided key is not a passphrase, cannot derive Sync Key Bundle. JS Stack trace: generateFromKey@keys.js:202 < SyncKeyBundle@keys.js:184 < @identity.js:285 < @identity.js:256 < SyncTestingInfrastructure@head_helpers.js:243 < test_wipeServer_list_503@test_service_wipeServer.js:92 < _run_next_test@head.js:888 < .run@head.js:418
Sync.Service	DEBUG	Caching URLs under storage user base: http://localhost:8080/1.1/johndoe/
Confirm initial environment.

TEST-PASS | /builds/slave/talos-slave/test/build/xpcshell/tests/services/sync/tests/unit/test_service_wipeServer.js | [test_wipeServer_list_503 : 95] false == false

TEST-PASS | /builds/slave/talos-slave/test/build/xpcshell/tests/services/sync/tests/unit/test_service_wipeServer.js | [test_wipeServer_list_503 : 96] false == false
wipeServer() will happily ignore the non-existent collection, delete the 'steam' collection and abort after an receiving an error on the 'petrol' collection.
Sync.Resource	DEBUG	mesg: DELETE fail 404 http://localhost:8080/1.1/johndoe/storage/non-existent
Sync.Resource	DEBUG	DELETE fail 404 http://localhost:8080/1.1/johndoe/storage/non-existent
Sync.Resource	DEBUG	mesg: DELETE success 200 http://localhost:8080/1.1/johndoe/storage/steam
Sync.Resource	DEBUG	DELETE success 200 http://localhost:8080/1.1/johndoe/storage/steam
Processing request
Sync.Resource	DEBUG	mesg: DELETE fail 503 http://localhost:8080/1.1/johndoe/storage/petrol
Sync.Resource	DEBUG	DELETE fail 503 http://localhost:8080/1.1/johndoe/storage/petrol
Sync.Service	DEBUG	Aborting wipeServer. Server responded with 503 response for http://localhost:8080/1.1/johndoe/storage/petrol
wipeServer() threw this exception: null

TEST-PASS | /builds/slave/talos-slave/test/build/xpcshell/tests/services/sync/tests/unit/test_service_wipeServer.js | [test_wipeServer_list_503 : 107] 503 == 503
wipeServer stopped deleting after encountering an error with the 'petrol' collection, thus only 'steam' has been deleted.

...
...

TEST-INFO | (xpcshell/head.js) | test 2 pending
TEST-INFO | /builds/slave/talos-slave/test/build/xpcshell/tests/services/sync/tests/unit/test_service_wipeServer.js | Starting test_wipeServer_all_success
Service.wipeServer() deletes all the things.
Sync.Service	DEBUG	Caching URLs under storage user base: http://localhost:8080/1.1/null/
Sync.Identity	INFO	Username changed. Removing stored credentials.
Sync.Identity	INFO	Basic password has no value. Removing.
Sync.Identity	INFO	Sync Key has no value. Deleting.
Sync.Identity	INFO	Basic password has no value. Removing.
Sync.Identity	INFO	Sync Key being updated.
Sync.SyncKeyBundle	INFO	SyncKeyBundle being created.
Try deletion.
Sync.Identity	INFO	Basic password being updated.
Sync.Identity	INFO	Sync Key being updated.
Sync.SyncKeyBundle	INFO	SyncKeyBundle being created.
Sync.Identity	WARN	Provided key is not a passphrase, cannot derive Sync Key Bundle. JS Stack trace: generateFromKey@keys.js:202 < SyncKeyBundle@keys.js:184 < @identity.js:285 < @identity.js:256 < SyncTestingInfrastructure@head_helpers.js:243 < test_wipeServer_all_success@test_service_wipeServer.js:140 < _run_next_test@head.js:888 < .run@head.js:418
Sync.Service	DEBUG	Caching URLs under storage user base: http://localhost:8080/1.1/johndoe/

...
...

TEST-INFO | (xpcshell/head.js) | test 2 pending
TEST-INFO | /builds/slave/talos-slave/test/build/xpcshell/tests/services/sync/tests/unit/test_service_wipeServer.js | Starting test_wipeServer_all_404
Service.wipeServer() accepts a 404.
Sync.Service	DEBUG	Caching URLs under storage user base: http://localhost:8080/1.1/null/
Sync.Identity	INFO	Username changed. Removing stored credentials.
Sync.Identity	INFO	Basic password has no value. Removing.
Sync.Identity	INFO	Sync Key has no value. Deleting.
Sync.Identity	INFO	Basic password has no value. Removing.
Sync.Identity	INFO	Sync Key being updated.
Sync.SyncKeyBundle	INFO	SyncKeyBundle being created.
Try deletion.
Sync.Identity	INFO	Basic password being updated.
Sync.Identity	INFO	Sync Key being updated.
Sync.SyncKeyBundle	INFO	SyncKeyBundle being created.
Sync.Identity	WARN	Provided key is not a passphrase, cannot derive Sync Key Bundle. JS Stack trace: generateFromKey@keys.js:202 < SyncKeyBundle@keys.js:184 < @identity.js:285 < @identity.js:256 < SyncTestingInfrastructure@head_helpers.js:243 < test_wipeServer_all_404@test_service_wipeServer.js:172 < _run_next_test@head.js:888 < .run@head.js:418
Sync.Service	DEBUG	Caching URLs under storage user base: http://localhost:8080/1.1/johndoe/

TEST-PASS | /builds/slave/talos-slave/test/build/xpcshell/tests/services/sync/tests/unit/test_service_wipeServer.js | [storageHandler : 158] DELETE == DELETE

TEST-PASS | /builds/slave/talos-slave/test/build/xpcshell/tests/services/sync/tests/unit/test_service_wipeServer.js | [storageHandler : 159] true == true
Sync.Resource	DEBUG	mesg: DELETE fail 404 http://localhost:8080/1.1/johndoe/storage
Sync.Resource	DEBUG	DELETE fail 404 http://localhost:8080/1.1/johndoe/storage
}
Bah, I misread the debug output; I don't think that helps :-/
https://tbpl.mozilla.org/php/getParsedLog.php?id=15898864&full=1&branch=try#error0

{
WARNING: Could not open cache map file: file ../../../netwerk/cache/nsDiskCacheMap.cpp, line 58
###
### mDiskDevice->Init() failed (0x8052000b)
###    - disabling disk cache for this session.
###

...
...

WARNING: nsExceptionService ignoring thread destruction after shutdown: file ../../../xpcom/base/nsExceptionService.cpp, line 166
WARNING: OOPDeinit() without successful OOPInit(): file ../../../toolkit/crashreporter/nsExceptionHandler.cpp, line 2252
}
Summary: Sync tests (including test_service_wipeServer.js, test_service_createAccount.js, test_service_changePassword.js, test_aitc_server.js, test_aitc_client.js) and others (including test_hashcompleter.js) intermittently creating unremovable files → Intermittent test_service_wipeServer.js test_service_createAccount.js test_service_changePassword.js test_aitc_server.js test_aitc_client.js test_hashcompleter.js | Failed to clean up the test profile directory: <type 'exceptions.OSError'>
Depends on: 799532
inbound now contains code that will dump a lot more context when these errors occur. Hopefully that will tell us something useful that we can use to track down these annoying errors.
Huzzah! The new code in inbound identifier the culprit:

OS Error while performing os.walk!
Traceback (most recent call last):
  File "/home/cltbld/lib/python2.5/os.py", line 286, in walk
    names = listdir(top)
OSError: [Errno 13] Permission denied: '/tmp/tmppBpVJ9/Cache.Trash374715952'

Cache.Trash is some Necko foo. This is a core platform bug or possibly a RelEng bug for weird system state.
Depends on: 808545
Depends on: 809071
Summary: Intermittent test_service_wipeServer.js test_service_createAccount.js test_service_changePassword.js test_aitc_server.js test_aitc_client.js test_hashcompleter.js | Failed to clean up the test profile directory: <type 'exceptions.OSError'> → Intermittent test_service_wipeServer.js test_service_createAccount.js test_service_changePassword.js test_aitc_server.js test_aitc_client.js test_hashcompleter.js | Failed to clean up the test profile directory: [Errno 13] Permission denied: '/tmp/.../..'
Whiteboard: [orange][test which aborts the suite] → [test which aborts the suite]
moving to general to get triaged appropriately. See comment #432 for analysis from :gps
Component: Release Engineering: Machine Management → Release Engineering
QA Contact: armenzg
According to Hal, this is in the wrong component, which is why it's been ignored. Hal, could you move it to the right place and ensure it gets triaged? Thanks!
Assignee: nobody → hwine
Status: NEW → ASSIGNED
(In reply to Richard Newman [:rnewman] from comment #1017)
> According to Hal, this is in the wrong component, which is why it's been
> ignored. Hal, could you move it to the right place and ensure it gets
> triaged? Thanks!

Already done in comment #1009 - assigning to me removes it from triage, setting back to where it belongs.
Assignee: hwine → nobody
gps, any news on this?

It's now consistently in the top 3 oranges seen on our infrastructure - but I'd rather not have to start disabling tests and lose coverage if possible...
Flags: needinfo?(gps)
It's in releng triage. Absolutely nothing to do with Sync, we think. With any luck they'll figure it out.
Status: ASSIGNED → NEW
Flags: needinfo?(gps)
I'm confounded by this issue.

In bug 793855 we improved the error message. In bug 799532 we dumped filesystem permissions on directory deletion failure to aid in tracking down what's happening. It didn't reveal anything useful. In bug 809071 we backed that out and added a sleep before trying again (theorizing it might be a race condition at the OS/filesystem level). That didn't help.

I want to say we are triggering a race condition in Necko or nsIFile where we change permissions before a move and exit the process before permissions are reset. Perhaps the next step should be adjusting the directory's permissions before we attempt cleanup?
I didn't see any mention of anything that even vaguely relates in any way to releng in that. Is this a Testing::XPCShell Harness bug, or a Core::Networking: File bug?
(In reply to Phil Ringnalda (:philor) from comment #1052)
> I didn't see any mention of anything that even vaguely relates in any way to
> releng in that. Is this a Testing::XPCShell Harness bug, or a
> Core::Networking: File bug?

Comment 432: "This is a core platform bug or possibly a RelEng bug for weird system state.". I'm happy to let someone who can look at our builders figure out the root cause and throw it at the right component.
Any MoCo employee can get access to a test slave, it only requires filing a releng bug. Having a releng employee learn enough about xpcshell and netwerk to make even a 51:49 guess about which of the two to throw it in would require hiring at least one new employee, but probably more like eight, since the first seven would get snapped up by P1 jobs which are going undone because they are so shorthanded.
Component: Release Engineering → General
Product: mozilla.org → Core
Version: other → Trunk
mozfile.rmtree might work better than shutil.rmtree.

https://tbpl.mozilla.org/?tree=Try&rev=ed7ebe8049ec

Includes uplift of recent patch to mozfile. Presumably mozbase update will drop in m-c soon.
mozfile/mozbase isn't available to the xpcshell runner. Of course it isn't!

*sigh*
This is consistently one of our top 2-3 intermittent failures.

I appreciate gps' work so far to try and diagnose, but it's getting to the point where I'm just soon going to have to disable these tests on OS X and Linux, unless we make some progress here or someone has a better idea (eg make failing cleanup non-fatal).
(In reply to Ed Morley [UTC+0; email:edmorley@moco] from comment #1138)
> This is consistently one of our top 2-3 intermittent failures.
> 
> I appreciate gps' work so far to try and diagnose, but it's getting to the
> point where I'm just soon going to have to disable these tests on OS X and
> Linux, unless we make some progress here or someone has a better idea (eg
> make failing cleanup non-fatal).

1100 failures is just too many. Disabled on OS X and linux:
https://hg.mozilla.org/integration/mozilla-inbound/rev/93beff102db2
Whiteboard: [test which aborts the suite] → [test which aborts the suite][test disabled on OS X and Linux][leave open]
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.