Closed Bug 905350 Opened 12 years ago Closed 12 years ago

The basedir for r4-lion slaves is set incorrectly, causing OS X 10.7 talos jobs to fail

Categories

(Testing :: Talos, defect)

Platform: x86_64 macOS
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: RyanVM, Assigned: dustin)

Details

Attachments

(1 file)

Some change within the last day has caused OSX 10.7 talos suites to fail with "talosError: 'initialization timed out'" errors a la bug 739089 50+% of the time across all trees. Until the cause of this problem is investigated and fixed, those suites have been hidden across all sheriff-managed trees.

Bug 739089 comment 586 indicates this may be a permissions-related issue with the recently reimaged lion slaves.

(In reply to Chris Cooper [:coop] from comment #586)
> (In reply to TinderboxPushlog Robot from comment #569)
> > RyanVM
> > https://tbpl.mozilla.org/php/getParsedLog.php?id=26545286&tree=Mozilla-Aurora
> > Rev4 MacOSX Lion 10.7 mozilla-aurora talos svgr on 2013-08-14 09:38:00
> > revision: 530cefd1f09e
> > slave: talos-r4-lion-071
>
> Targeted talos-r4-lion-071 slave as a repeat offender.
>
> Is this line of output pertinent?
>
> 09:40:29 INFO - 2013-08-14 09:40:29.439 firefox[1195:2607] Persistent UI failed to open file file://localhost/Users/cltbld/Library/Saved%20Application%20State/org.mozilla.aurora.savedState/window_1.data: No such file or directory (2)
>
> Checking on that particular slave, I see the following:
>
> [cltbld@talos-r4-lion-071.build.scl1.mozilla.com Saved Application State]$ ls -la /Users/cltbld/Library/Saved\ Application\ State/
> total 0
> dr-x------   2 cltbld  staff   68 13 Aug 14:56 .
> drwx------@ 14 cltbld  staff  476 13 Aug 18:57 ..
>
> Are overly-restrictive permissions hindering us here?
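A quick way to check whether that directory would block Firefox from writing its saved state is a stat test along these lines — a minimal sketch, assuming the path from the ls output above; it is not part of the Talos harness:

    import os
    import stat

    # Path taken from the ls output above; adjust per slave.
    saved_state = "/Users/cltbld/Library/Saved Application State"

    mode = stat.S_IMODE(os.stat(saved_state).st_mode)

    # dr-x------ (0500) is missing the owner write bit, so Firefox
    # cannot create org.mozilla.*.savedState/window_1.data under it.
    if not mode & stat.S_IWUSR:
        print("%s is not owner-writable (mode %o)" % (saved_state, mode))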
Lion talos was converted to PuppetAgain during that time range (bug 891880). Where can we find additional logging? What process is ffsetup executing, where does its output go, and how long is this timeout?
Here's the complete list of slaves that have hit this issue and their corresponding counts, as culled from bug 739089:

talos-r4-lion-046   1
talos-r4-lion-061   9
talos-r4-lion-062  16
talos-r4-lion-064   7
talos-r4-lion-065   7
talos-r4-lion-066  12
talos-r4-lion-067   8
talos-r4-lion-068   8
talos-r4-lion-069   3
talos-r4-lion-070  10
talos-r4-lion-071  13
talos-r4-lion-072   6
talos-r4-lion-073   4
talos-r4-lion-074   8
talos-r4-lion-075   7
talos-r4-lion-076  12
talos-r4-lion-077   7
talos-r4-lion-078   9
talos-r4-lion-079  11
talos-r4-lion-080  12
talos-r4-lion-082  11
talos-r4-lion-084   7
talos-r4-lion-085   3
talos-r4-lion-086   2
talos-r4-lion-087  10
talos-r4-lion-088   5
talos-r4-lion-089   9
talos-r4-lion-090   6

Interesting that no slaves below 046 have hit this.
Dustin, the process that it is executing at that time is the browser under test. The timeout that happens means that firefox did not exist the way Talos thought it should within the time that Talos gives to firefox to exit. That code is here: http://mxr.mozilla.org/build/source/talos/talos/ffsetup.py#293

In the logs [1], you can also see a debug message stating that there was an unknown error during cleanup, which does not throw a Python traceback. That error is coming from here: http://mxr.mozilla.org/build/source/talos/talos/ttest.py#230

This error ^ and the other issue in the log where Firefox claims it cannot open something in the profile [2] point me toward thinking this might be permissions-related on the machine. Can we check whether permissions have changed either in the areas where Talos runs or in the areas where the profiles are being created?

[1]: https://tbpl.mozilla.org/php/getParsedLog.php?id=26596801&tree=Mozilla-Inbound#error2
[2]: 10:08:57 INFO - 2013-08-15 10:08:57.887 firefox[1220:2607] Persistent UI failed to open file file://localhost/Users/cltbld/Library/Saved%20Application%20State/org.mozilla.nightly.savedState/window_1.data: No such file or directory (2)
(In reply to Clint Talbert ( :ctalbert ) from comment #3)
> Dustin, the process that it is executing at that time is the browser under
> test. The timeout that happens means that firefox did not exist

Oops, exit, not exist. Sorry.
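For reference, the wait-for-exit logic described above boils down to a pattern like this — a simplified sketch, not the actual ffsetup.py code; wait_for_exit and browser_wait are placeholder names, and proc is assumed to be a subprocess.Popen handle for the browser:

    import time

    def wait_for_exit(proc, browser_wait=300, poll_interval=1):
        # Poll the browser process until it exits. If the deadline
        # elapses with Firefox still running, the harness gives up;
        # that is what surfaces in the logs as
        # "talosError: 'initialization timed out'".
        deadline = time.time() + browser_wait
        while proc.poll() is None:  # None means still running
            if time.time() > deadline:
                raise Exception("timeout waiting for browser to exit")
            time.sleep(poll_interval)
        return proc.returncode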
From the logs: we extract/install the .dmg file to:

/Users/cltbld/talos-slave/test/build/application/FirefoxAurora.app/Contents/MacOS/firefox

and we reference it from talos at:

/Users/cltbld/talos-slave/test/build/application/FirefoxAurora.app/Contents/MacOS/firefox

It all looks good for that data point. Is this only on aurora where we see this problem, or across central-based trees?
All trees, even b2g18*.
Joel, that doesn't make a lot of sense - Buildbot is running under /builds/slave/talos-slave.

[root@talos-r4-lion-050.build.scl1.mozilla.com ~]# find /Users/cltbld/talos-slave
find: /Users/cltbld/talos-slave: No such file or directory

and

[root@talos-r4-lion-050.build.scl1.mozilla.com ~]# find /builds/slave/talos-slave \! -user cltbld
[root@talos-r4-lion-050.build.scl1.mozilla.com ~]# find /builds/slave/talos-slave \! -group staff

(which is to say, both return no results)

However,

[root@talos-r4-lion-050.build.scl1.mozilla.com ~]# find /Users/cltbld \! -user cltbld -ls
4237020  8 -rw-------  1 root  staff  151 Aug 15 19:08 /Users/cltbld/Library/Preferences/.GlobalPreferences.plist
 336019  0 -rwxr-xr-x  1 root  staff    0 Aug 12 10:35 /Users/cltbld/Library/Preferences/.GlobalPreferences.plist.lockfile
 336034  0 -rwxr-xr-x  1 root  staff    0 Aug 12 10:35 /Users/cltbld/Library/Preferences/ByHost/com.apple.screensaver.2A98DCB3-15F4-5A78-8B42-DD181218705C.plist.lockfile
 336028  8 -rw-------  1 root  staff   61 Aug 12 10:35 /Users/cltbld/Library/Preferences/com.apple.LaunchServices.plist
 336027  0 -rwxr-xr-x  1 root  staff    0 Aug 12 10:35 /Users/cltbld/Library/Preferences/com.apple.LaunchServices.plist.lockfile

The last two of those are managed by puppet and shouldn't affect Firefox. Might the first be an issue?

It does seem like it would be useful for http://mxr.mozilla.org/build/source/talos/talos/ttest.py#230 to log the exception!
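Something along these lines would do it — a sketch of the suggested logging change, assuming the cleanup call sits in a try/except as the "unknown error" debug message implies; cleanup_profile is a hypothetical stand-in for whatever ttest.py actually cleans up:

    import logging
    import traceback

    log = logging.getLogger("talos")

    try:
        cleanup_profile()  # hypothetical stand-in for the real cleanup
    except Exception:
        # Rather than a bare "unknown error during cleanup" line, log
        # the traceback so failures like this one are diagnosable.
        log.debug("unknown error during cleanup:\n%s",
                  traceback.format_exc())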
Well, I read it in the log file: https://tbpl.mozilla.org/php/getParsedLog.php?id=26596801&full=1&branch=mozilla-inbound

If the logs are lying, we have other problems :)
cc-ing armenzg because he recently deployed a talos change.
That actually explains a lot! Coop, it looks like the basedir for those slaves wasn't changed in slavealloc.

MariaDB [buildslaves]> select name, basedir from slaves where name like '%r4-lion%';
+-------------------+----------------------------+
| name              | basedir                    |
+-------------------+----------------------------+
| talos-r4-lion-001 | /builds/slave/talos-slave/ |
| talos-r4-lion-002 | /builds/slave/talos-slave/ |
| talos-r4-lion-003 | /builds/slave/talos-slave/ |
| talos-r4-lion-004 | /builds/slave/talos-slave/ |
| talos-r4-lion-005 | /builds/slave/talos-slave/ |
| talos-r4-lion-006 | /builds/slave/talos-slave/ |
| talos-r4-lion-007 | /builds/slave/talos-slave/ |
| talos-r4-lion-008 | /builds/slave/talos-slave/ |
| talos-r4-lion-009 | /builds/slave/talos-slave/ |
| talos-r4-lion-010 | /builds/slave/talos-slave/ |
| talos-r4-lion-011 | /builds/slave/talos-slave/ |
| talos-r4-lion-012 | /builds/slave/talos-slave/ |
| talos-r4-lion-013 | /builds/slave/talos-slave/ |
| talos-r4-lion-014 | /builds/slave/talos-slave/ |
| talos-r4-lion-015 | /builds/slave/talos-slave/ |
| talos-r4-lion-016 | /builds/slave/talos-slave/ |
| talos-r4-lion-017 | /builds/slave/talos-slave/ |
| talos-r4-lion-018 | /builds/slave/talos-slave/ |
| talos-r4-lion-019 | /builds/slave/talos-slave/ |
| talos-r4-lion-020 | /builds/slave/talos-slave/ |
| talos-r4-lion-021 | /builds/slave/talos-slave/ |
| talos-r4-lion-022 | /builds/slave/talos-slave/ |
| talos-r4-lion-023 | /builds/slave/talos-slave/ |
| talos-r4-lion-024 | /builds/slave/talos-slave/ |
| talos-r4-lion-025 | /builds/slave/talos-slave/ |
| talos-r4-lion-026 | /builds/slave/talos-slave/ |
| talos-r4-lion-027 | /builds/slave/talos-slave/ |
| talos-r4-lion-028 | /builds/slave/talos-slave/ |
| talos-r4-lion-029 | /builds/slave/talos-slave/ |
| talos-r4-lion-030 | /builds/slave/talos-slave/ |
| talos-r4-lion-031 | /builds/slave/talos-slave/ |
| talos-r4-lion-032 | /builds/slave/talos-slave/ |
| talos-r4-lion-033 | /builds/slave/talos-slave/ |
| talos-r4-lion-034 | /builds/slave/talos-slave/ |
| talos-r4-lion-035 | /builds/slave/talos-slave/ |
| talos-r4-lion-036 | /builds/slave/talos-slave/ |
| talos-r4-lion-037 | /builds/slave/talos-slave/ |
| talos-r4-lion-038 | /builds/slave/talos-slave/ |
| talos-r4-lion-039 | /builds/slave/talos-slave/ |
| talos-r4-lion-040 | /builds/slave/talos-slave/ |
| talos-r4-lion-041 | /builds/slave/talos-slave/ |
| talos-r4-lion-042 | /builds/slave/talos-slave/ |
| talos-r4-lion-043 | /builds/slave/talos-slave/ |
| talos-r4-lion-044 | /builds/slave/talos-slave/ |
| talos-r4-lion-045 | /builds/slave/talos-slave/ |
| talos-r4-lion-046 | /builds/slave/talos-slave/ |
| talos-r4-lion-047 | /builds/slave/talos-slave/ |
| talos-r4-lion-048 | /builds/slave/talos-slave/ |
| talos-r4-lion-049 | /builds/slave/talos-slave/ |
| talos-r4-lion-050 | /builds/slave/talos-slave/ |
| talos-r4-lion-051 | /builds/slave/talos-slave/ |
| talos-r4-lion-052 | /builds/slave/talos-slave/ |
| talos-r4-lion-053 | /builds/slave/talos-slave/ |
| talos-r4-lion-054 | /builds/slave/talos-slave/ |
| talos-r4-lion-055 | /builds/slave/talos-slave/ |
| talos-r4-lion-056 | /builds/slave/talos-slave/ |
| talos-r4-lion-057 | /builds/slave/talos-slave/ |
| talos-r4-lion-058 | /builds/slave/talos-slave/ |
| talos-r4-lion-059 | /builds/slave/talos-slave/ |
| talos-r4-lion-060 | /builds/slave/talos-slave/ |
| talos-r4-lion-061 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-062 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-063 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-064 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-065 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-066 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-067 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-068 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-069 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-070 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-071 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-072 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-073 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-074 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-075 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-076 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-077 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-078 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-079 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-080 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-081 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-082 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-084 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-085 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-086 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-087 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-088 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-089 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-090 | /Users/cltbld/talos-slave/ |
+-------------------+----------------------------+
89 rows in set (0.01 sec)

At your say-so, I can change those.
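For the record, the fix amounts to a one-line UPDATE against that table. A sketch of the equivalent, assuming direct MySQL access via the old MySQL-python bindings — the connection parameters are placeholders, but the table and column names match the query above:

    import MySQLdb

    # Host/user/password are placeholders for the real slavealloc DB.
    conn = MySQLdb.connect(host="slavealloc-db", user="buildduty",
                           passwd="...", db="buildslaves")
    cur = conn.cursor()
    # Matches all 89 r4-lion rows; MySQL only rewrites the 29 that
    # differ, consistent with "Rows matched: 89  Changed: 29" below.
    cur.execute(
        "UPDATE slaves SET basedir = '/builds/slave/talos-slave/' "
        "WHERE name LIKE '%r4-lion%'")
    conn.commit()
    conn.close()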
Rows matched: 89  Changed: 29  Warnings: 0

So this should start working now?
dustin, I agree with your theory. I'm confident that we should see no more problems with any new jobs.

On another note, could we please track puppet changes on https://wiki.mozilla.org/ReleaseEngineering:Maintenance?

(In reply to Chris Cooper [:coop] from comment #9)
> cc-ing armenzg because he recently deployed a talos change.

I only landed the change for switching to talos-bundles around 2013-08-14 07:40 PT. Based on the comments, I don't think it's related, even though the timing matches.
The basedir was wrong in slavealloc for the affected slaves. This caused Apache to look in the wrong spot for the talos files. Dustin fixed the entries in the db. After the current set of in-flight tests complete (and fail), the next round of talos tests should complete normally (perhaps even successfully) on lion.
Great - thank you everyone! :-)
I've unhidden everything but svgr, which continues to be near perma-fail. That was a known issue prior to this one, however. IIRC, the upcoming svgx replacement suite is expected to fix this.
rafx was unhidden as part of comment 16 - Joel, is it still supposed to be hidden, since it's a WIP?
Flags: needinfo?(jmaher)
Assignee: nobody → dustin
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Summary: Un-hide the OSX 10.7 talos suites when their failure rate isn't unacceptably high → The basedir for r4-lion slaves is set incorrectly, caused OS X 10.7 talos jobs to fail
Summary: The basedir for r4-lion slaves is set incorrectly, caused OS X 10.7 talos jobs to fail → The basedir for r4-lion slaves is set incorrectly, causing OS X 10.7 talos jobs to fail
Yes, let's hide rafx; that is a temporary staging suite which should have some changes today or tomorrow.
Flags: needinfo?(jmaher)
(In reply to Joel Maher (:jmaher) from comment #18)
> Yes, let's hide rafx; that is a temporary staging suite which should have
> some changes today or tomorrow.

Cool, done. Thank you :-)
