The basedir for r4-lion slaves is set incorrectly, causing OS X 10.7 talos jobs to fail

RESOLVED FIXED

Status

Product: Testing
Component: Talos
Reported: 5 years ago
Modified: 5 years ago

People

(Reporter: RyanVM, Assigned: dustin)

Tracking

Trunk
x86_64
Mac OS X
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Reporter)

Description

5 years ago
Some change within the last day has caused OSX 10.7 talos suites to fail with "talosError: 'initialization timed out'" errors a la bug 739089 50+% of the time across all trees. Until the cause of this problem is investigated and fixed, those suites have been hidden across all sheriff-managed trees.

Bug 739089 comment 586 indicates this may be a permissions-related issue with the recently reimaged lion slaves.
(In reply to Chris Cooper [:coop] from comment #586)
> (In reply to TinderboxPushlog Robot from comment #569)
> > RyanVM
> > https://tbpl.mozilla.org/php/getParsedLog.php?id=26545286&tree=Mozilla-Aurora
> > Rev4 MacOSX Lion 10.7 mozilla-aurora talos svgr on 2013-08-14 09:38:00
> > revision: 530cefd1f09e
> > slave: talos-r4-lion-071
> 
> Targeted talos-r4-lion-071 slave as a repeat offender.
> 
> Is this line of output pertinent?
> 
> 09:40:29     INFO -  2013-08-14 09:40:29.439 firefox[1195:2607] Persistent
> UI failed to open file
> file://localhost/Users/cltbld/Library/Saved%20Application%20State/org.
> mozilla.aurora.savedState/window_1.data: No such file or directory (2)
> 
> Checking on that particular slave, I see the following:
> 
> [cltbld@talos-r4-lion-071.build.scl1.mozilla.com Saved Application State]$
> ls -la /Users/cltbld/Library/Saved\ Application\ State/
> total 0
> dr-x------   2 cltbld  staff   68 13 Aug 14:56 .
> drwx------@ 14 cltbld  staff  476 13 Aug 18:57 ..
> 
> Are overly-restrictive permissions hindering us here?
Lion talos has been converted to PuppetAgain during that time range - bug 891880.

Where can we find additional logging - what process is ffsetup executing? where does its output go? how long is this timeout?

Comment 2

5 years ago
Here's the complete list of slaves that have hit this issue and their corresponding counts, as culled from bug 739089:

talos-r4-lion-046 1
talos-r4-lion-061 9
talos-r4-lion-062 16
talos-r4-lion-064 7
talos-r4-lion-065 7
talos-r4-lion-066 12
talos-r4-lion-067 8
talos-r4-lion-068 8
talos-r4-lion-069 3
talos-r4-lion-070 10
talos-r4-lion-071 13
talos-r4-lion-072 6
talos-r4-lion-073 4
talos-r4-lion-074 8
talos-r4-lion-075 7
talos-r4-lion-076 12
talos-r4-lion-077 7
talos-r4-lion-078 9
talos-r4-lion-079 11
talos-r4-lion-080 12
talos-r4-lion-082 11
talos-r4-lion-084 7
talos-r4-lion-085 3
talos-r4-lion-086 2
talos-r4-lion-087 10
talos-r4-lion-088 5
talos-r4-lion-089 9
talos-r4-lion-090 6

Interesting that no slaves below 046 have hit this.

Comment 3

5 years ago
Dustin, the process that it is executing at that time is the browser under test. The timeout that happens means that firefox did not exist the way Talos thought it should within the time that Talos gives to firefox to exit. That code is here:
http://mxr.mozilla.org/build/source/talos/talos/ffsetup.py#293

In the logs [1], you can also see a Debug message stating that there was an unknown error during cleanup which does not throw a python traceback. That error is coming from here:
http://mxr.mozilla.org/build/source/talos/talos/ttest.py#230

This error ^ and the other issue in the log, where Firefox claims it cannot open something in the profile [2], point me toward thinking this might be permissions-related on the machine. Can we check whether permissions have changed in either the areas where Talos runs or the areas where the profiles are being created?

[1]: https://tbpl.mozilla.org/php/getParsedLog.php?id=26596801&tree=Mozilla-Inbound#error2
[2]: 10:08:57     INFO -  2013-08-15 10:08:57.887 firefox[1220:2607] Persistent UI failed to open file file://localhost/Users/cltbld/Library/Saved%20Application%20State/org.mozilla.nightly.savedState/window_1.data: No such file or directory (2)
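For illustration, the "initialization timed out" failure mode described above can be sketched as a poll-until-exit loop. This is a hypothetical sketch, not the actual ffsetup.py code — the names `TalosError` and `wait_for_exit` are invented here; the real timeout logic lives at the mxr link in comment 3.

```python
# Hypothetical sketch of a "wait for the browser to exit" timeout check.
# Not the actual ffsetup.py implementation.
import subprocess
import sys
import time


class TalosError(Exception):
    """Raised when the browser does not exit within the allowed time."""


def wait_for_exit(proc, timeout=30.0, poll_interval=0.5):
    """Poll `proc` until it exits; raise TalosError if the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return proc.returncode
        time.sleep(poll_interval)
    # The browser is still running past the deadline: kill it and fail,
    # analogous to "talosError: 'initialization timed out'".
    proc.kill()
    raise TalosError("initialization timed out")


if __name__ == "__main__":
    # A fast-exiting child finishes well before the deadline.
    child = subprocess.Popen([sys.executable, "-c", "print('ok')"])
    print(wait_for_exit(child, timeout=10, poll_interval=0.1))
```

If the browser hangs on startup (for example, blocked by a permissions problem in the profile directory), the deadline passes and the error above is all the log shows.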

Comment 4

5 years ago
(In reply to Clint Talbert ( :ctalbert ) from comment #3)
> Dustin, the process that it is executing at that time is the browser under
> test. The timeout that happens means that firefox did not exist 
Oops, exit, not exist. Sorry.
From the logs, we extract/install the .dmg file to:
/Users/cltbld/talos-slave/test/build/application/FirefoxAurora.app/Contents/MacOS/firefox

and we reference it from talos at:
/Users/cltbld/talos-slave/test/build/application/FirefoxAurora.app/Contents/MacOS/firefox

it all looks good for that data point.



Do we see this problem only on aurora, or across central-based trees as well?
(Reporter)

Comment 6

5 years ago
All trees, even b2g18*.
Joel, that doesn't make a lot of sense - Buildbot is running under /builds/slave/talos-slave.

[root@talos-r4-lion-050.build.scl1.mozilla.com ~]# find /Users/cltbld/talos-slave
find: /Users/cltbld/talos-slave: No such file or directory

and

[root@talos-r4-lion-050.build.scl1.mozilla.com ~]# find /builds/slave/talos-slave \! -user cltbld
[root@talos-r4-lion-050.build.scl1.mozilla.com ~]# find /builds/slave/talos-slave \! -group staff

(which is to say, both return no results)

However,

[root@talos-r4-lion-050.build.scl1.mozilla.com ~]# find /Users/cltbld \! -user cltbld -ls
4237020        8 -rw-------    1 root             staff                 151 Aug 15 19:08 /Users/cltbld/Library/Preferences/.GlobalPreferences.plist
336019        0 -rwxr-xr-x    1 root             staff                   0 Aug 12 10:35 /Users/cltbld/Library/Preferences/.GlobalPreferences.plist.lockfile
336034        0 -rwxr-xr-x    1 root             staff                   0 Aug 12 10:35 /Users/cltbld/Library/Preferences/ByHost/com.apple.screensaver.2A98DCB3-15F4-5A78-8B42-DD181218705C.plist.lockfile
336028        8 -rw-------    1 root             staff                  61 Aug 12 10:35 /Users/cltbld/Library/Preferences/com.apple.LaunchServices.plist
336027        0 -rwxr-xr-x    1 root             staff                   0 Aug 12 10:35 /Users/cltbld/Library/Preferences/com.apple.LaunchServices.plist.lockfile

The last two of those are managed by puppet and shouldn't affect Firefox.  Might the first be an issue?
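The ownership audit above (`find /Users/cltbld \! -user cltbld`) can also be done programmatically. A hypothetical helper, assuming nothing beyond the stdlib — `files_not_owned_by` is an invented name:

```python
# Hypothetical helper mirroring `find ROOT \! -user USER`: walk a tree
# and yield paths whose owner UID differs from the expected one.
import os


def files_not_owned_by(root, expected_uid):
    """Yield paths under `root` whose owner UID is not `expected_uid`."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if st.st_uid != expected_uid:
                yield path
```

Run against `/Users/cltbld` with cltbld's UID, this would surface the same root-owned plist files the `find -ls` output shows.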

It does seem like it would be useful for http://mxr.mozilla.org/build/source/talos/talos/ttest.py#230 to log the exception!
well, I read it in the log file:
https://tbpl.mozilla.org/php/getParsedLog.php?id=26596801&full=1&branch=mozilla-inbound

if the logs are lying, we have other problems :)
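On the "log the exception" suggestion: an illustrative sketch (not the actual ttest.py code — the function and logger names here are invented) of replacing a bare "unknown error during cleanup" debug message with a full traceback:

```python
# Illustrative sketch only -- not the real ttest.py cleanup code.
# Shows how logging the exception makes permission failures visible.
import logging

log = logging.getLogger("talos.cleanup")


def cleanup(profile_dir, remove_func):
    """Remove the test profile, logging any failure with its traceback."""
    try:
        remove_func(profile_dir)
    except Exception:
        # log.exception() records the message *and* the traceback, so a
        # permission error (EACCES) would show up in the test log instead
        # of an opaque "unknown error during cleanup".
        log.exception("Could not remove profile %s during cleanup", profile_dir)
```

With this, the tbpl log would have pointed straight at the failing path rather than requiring a manual shell session on the slave.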

Comment 9

5 years ago
cc-ing armenzg because he recently deployed a talos change.
Created attachment 791287 [details]
Screenshot of lion machine attempting to run talos
That actually explains a lot!  Coop, it looks like the basedir for those slaves wasn't changed in slavealloc.

MariaDB [buildslaves]> select name, basedir from slaves where name like '%r4-lion%';
+-------------------+----------------------------+
| name              | basedir                    |
+-------------------+----------------------------+
| talos-r4-lion-001 | /builds/slave/talos-slave/ |
| talos-r4-lion-002 | /builds/slave/talos-slave/ |
| talos-r4-lion-003 | /builds/slave/talos-slave/ |
| talos-r4-lion-004 | /builds/slave/talos-slave/ |
| talos-r4-lion-005 | /builds/slave/talos-slave/ |
| talos-r4-lion-006 | /builds/slave/talos-slave/ |
| talos-r4-lion-007 | /builds/slave/talos-slave/ |
| talos-r4-lion-008 | /builds/slave/talos-slave/ |
| talos-r4-lion-009 | /builds/slave/talos-slave/ |
| talos-r4-lion-010 | /builds/slave/talos-slave/ |
| talos-r4-lion-011 | /builds/slave/talos-slave/ |
| talos-r4-lion-012 | /builds/slave/talos-slave/ |
| talos-r4-lion-013 | /builds/slave/talos-slave/ |
| talos-r4-lion-014 | /builds/slave/talos-slave/ |
| talos-r4-lion-015 | /builds/slave/talos-slave/ |
| talos-r4-lion-016 | /builds/slave/talos-slave/ |
| talos-r4-lion-017 | /builds/slave/talos-slave/ |
| talos-r4-lion-018 | /builds/slave/talos-slave/ |
| talos-r4-lion-019 | /builds/slave/talos-slave/ |
| talos-r4-lion-020 | /builds/slave/talos-slave/ |
| talos-r4-lion-021 | /builds/slave/talos-slave/ |
| talos-r4-lion-022 | /builds/slave/talos-slave/ |
| talos-r4-lion-023 | /builds/slave/talos-slave/ |
| talos-r4-lion-024 | /builds/slave/talos-slave/ |
| talos-r4-lion-025 | /builds/slave/talos-slave/ |
| talos-r4-lion-026 | /builds/slave/talos-slave/ |
| talos-r4-lion-027 | /builds/slave/talos-slave/ |
| talos-r4-lion-028 | /builds/slave/talos-slave/ |
| talos-r4-lion-029 | /builds/slave/talos-slave/ |
| talos-r4-lion-030 | /builds/slave/talos-slave/ |
| talos-r4-lion-031 | /builds/slave/talos-slave/ |
| talos-r4-lion-032 | /builds/slave/talos-slave/ |
| talos-r4-lion-033 | /builds/slave/talos-slave/ |
| talos-r4-lion-034 | /builds/slave/talos-slave/ |
| talos-r4-lion-035 | /builds/slave/talos-slave/ |
| talos-r4-lion-036 | /builds/slave/talos-slave/ |
| talos-r4-lion-037 | /builds/slave/talos-slave/ |
| talos-r4-lion-038 | /builds/slave/talos-slave/ |
| talos-r4-lion-039 | /builds/slave/talos-slave/ |
| talos-r4-lion-040 | /builds/slave/talos-slave/ |
| talos-r4-lion-041 | /builds/slave/talos-slave/ |
| talos-r4-lion-042 | /builds/slave/talos-slave/ |
| talos-r4-lion-043 | /builds/slave/talos-slave/ |
| talos-r4-lion-044 | /builds/slave/talos-slave/ |
| talos-r4-lion-045 | /builds/slave/talos-slave/ |
| talos-r4-lion-046 | /builds/slave/talos-slave/ |
| talos-r4-lion-047 | /builds/slave/talos-slave/ |
| talos-r4-lion-048 | /builds/slave/talos-slave/ |
| talos-r4-lion-049 | /builds/slave/talos-slave/ |
| talos-r4-lion-050 | /builds/slave/talos-slave/ |
| talos-r4-lion-051 | /builds/slave/talos-slave/ |
| talos-r4-lion-052 | /builds/slave/talos-slave/ |
| talos-r4-lion-053 | /builds/slave/talos-slave/ |
| talos-r4-lion-054 | /builds/slave/talos-slave/ |
| talos-r4-lion-055 | /builds/slave/talos-slave/ |
| talos-r4-lion-056 | /builds/slave/talos-slave/ |
| talos-r4-lion-057 | /builds/slave/talos-slave/ |
| talos-r4-lion-058 | /builds/slave/talos-slave/ |
| talos-r4-lion-059 | /builds/slave/talos-slave/ |
| talos-r4-lion-060 | /builds/slave/talos-slave/ |
| talos-r4-lion-061 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-062 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-063 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-064 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-065 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-066 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-067 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-068 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-069 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-070 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-071 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-072 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-073 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-074 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-075 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-076 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-077 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-078 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-079 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-080 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-081 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-082 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-084 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-085 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-086 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-087 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-088 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-089 | /Users/cltbld/talos-slave/ |
| talos-r4-lion-090 | /Users/cltbld/talos-slave/ |
+-------------------+----------------------------+
89 rows in set (0.01 sec)

At your say-so, I can change those.
Rows matched: 89  Changed: 29  Warnings: 0

So this should start working now?
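The fix applied above can be sketched as follows. This is an illustrative stand-in using sqlite3 with a two-row table — the real session was against the slavealloc MariaDB database, and the actual UPDATE statement is not shown in this bug.

```python
# Illustrative sketch of the basedir fix (sqlite3 stand-in; the real
# slavealloc database is MariaDB and its schema may differ).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE slaves (name TEXT PRIMARY KEY, basedir TEXT)")
conn.executemany(
    "INSERT INTO slaves VALUES (?, ?)",
    [
        ("talos-r4-lion-060", "/builds/slave/talos-slave/"),  # already correct
        ("talos-r4-lion-061", "/Users/cltbld/talos-slave/"),  # wrong basedir
    ],
)

# MySQL reports "Rows matched" and "Changed" separately (matched 89,
# changed 29 in the real session, since most rows were already correct).
# Here the WHERE clause skips already-correct rows explicitly.
cur = conn.execute(
    "UPDATE slaves SET basedir = '/builds/slave/talos-slave/' "
    "WHERE name LIKE '%r4-lion%' AND basedir <> '/builds/slave/talos-slave/'"
)
print(cur.rowcount)  # 1 row changed in this miniature example
```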

Comment 13

5 years ago
dustin, I agree with your theory. I'm confident that we should see no more problems with any new jobs.

On another note, could we please track puppet changes on https://wiki.mozilla.org/ReleaseEngineering:Maintenance?

(In reply to Chris Cooper [:coop] from comment #9)
> cc-ing armenzg because he recently deployed a talos change.

I only landed the change for switching to talos-bundles around 2013-08-14 07:40 PT. From the comments, I don't think it is related, even though the timing matches.
The basedir was wrong in slavealloc for the affected slaves, which caused Apache to look in the wrong spot for the talos files.

Dustin fixed the entries in the db. After the current set of in-flight tests complete (and fail), the next round of talos tests should complete normally (perhaps even successfully) on lion.
Great - thank you everyone! :-)
(Reporter)

Comment 16

5 years ago
I've unhidden everything but svgr, which continues to be near perma-fail. That was a known issue prior to this one, however. IIRC, the upcoming svgx replacement suite is expected to fix this.
rafx was unhidden as part of comment 16 - Joel, is it still supposed to be hidden, since it's a WIP?
Flags: needinfo?(jmaher)

Updated

5 years ago
Assignee: nobody → dustin
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Summary: Un-hide the OSX 10.7 talos suites when their failure rate isn't unacceptably high → The basedir for r4-lion slaves is set incorrectly, caused OS X 10.7 talos jobs to fail

Updated

5 years ago
Summary: The basedir for r4-lion slaves is set incorrectly, caused OS X 10.7 talos jobs to fail → The basedir for r4-lion slaves is set incorrectly, causing OS X 10.7 talos jobs to fail
Comment 18

5 years ago
yes, let's hide rafx, that is a temporary staging suite which should have some changes today or tomorrow.
Flags: needinfo?(jmaher)
(In reply to Joel Maher (:jmaher) from comment #18)
> yes, lets hide rafx, that is a temporary staging suite which should have
> some changes today or tomorrow.

Cool, done. Thank you :-)