Closed Bug 899784 Opened 12 years ago Closed 12 years ago

Rev4 machines have Puppet disabled which can lose their name and burn talos jobs because they end up with a name like client-builders-mac-mini-10

Categories

(Release Engineering :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Unassigned)

References

Details

https://tbpl.mozilla.org/php/getParsedLog.php?id=25925156&tree=Mozilla-Inbound#error0

./configs/talos/linux_config.py: "title": os.uname()[1].lower().split('.')[0],
./configs/talos/mac_config.py: "title": os.uname()[1].lower().split('.')[0],

12:55:08 CRITICAL - DEBUG : process_Request line: No machine_name called 'client-builders-mac-mini-10' can be found
12:55:08 CRITICAL - DEBUG : process_Request line: raise DatabaseException("No machine_name called '%s' can be found" % self.machine_name)
12:55:13 CRITICAL - DEBUG : process_Request line: No machine_name called 'client-builders-mac-mini-10' can be found
12:55:13 CRITICAL - DEBUG : process_Request line: raise DatabaseException("No machine_name called '%s' can be found" % self.machine_name)
12:55:23 CRITICAL - DEBUG : process_Request line: No machine_name called 'client-builders-mac-mini-10' can be found
12:55:23 CRITICAL - DEBUG : process_Request line: raise DatabaseException("No machine_name called '%s' can be found" % self.machine_name)
12:55:43 CRITICAL - DEBUG : process_Request line: No machine_name called 'client-builders-mac-mini-10' can be found
12:55:43 CRITICAL - DEBUG : process_Request line: raise DatabaseException("No machine_name called '%s' can be found" % self.machine_name)
12:56:23 CRITICAL - DEBUG : process_Request line: No machine_name called 'client-builders-mac-mini-10' can be found
12:56:23 CRITICAL - DEBUG : process_Request line: raise DatabaseException("No machine_name called '%s' can be found" % self.machine_name)
12:57:43 CRITICAL - FAIL: Graph server unreachable (5 attempts)
12:57:43 CRITICAL - RETURN:No machine_name called 'client-builders-mac-mini-10' can be found
12:57:43 CRITICAL - RETURN: raise DatabaseException("No machine_name called '%s' can be found" % self.machine_name)
12:57:43 ERROR - Traceback (most recent call last):
12:57:43 CRITICAL - talos.utils.talosError: 'Graph server unreachable (5 attempts)\nsend failed, graph server says:\nNo machine_name called \'client-builders-mac-mini-10\' can be found\n File "/var/www/html/graphs/server/pyfomatic/collect.py", line 271, in handleRequest\n metadata = MetaDataFromTalos(databaseCursor, databaseModule, inputStream)\n File "/var/www/html/graphs/server/pyfomatic/collect.py", line 63, in __init__\n self.doDatabaseThings(databaseCursor)\n File "/var/www/html/graphs/server/pyfomatic/collect.py", line 92, in doDatabaseThings\n raise DatabaseException("No machine_name called \'%s\' can be found" % self.machine_name)\n\n'
12:57:43 ERROR - Return code: 1
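For anyone reading the log above: the "title" entry in the two talos configs is what ends up being sent to the graph server as the machine name. A minimal sketch of that expression as a standalone function follows. The function name `talos_title` and the example domain are hypothetical and only for illustration; the expression itself is copied from the config lines quoted above.

```python
import os


def talos_title(nodename):
    """Mirror the config expression: lowercase, keep only the first DNS label."""
    return nodename.lower().split('.')[0]


def current_title():
    """Apply the same expression to this machine's node name (POSIX only)."""
    return talos_title(os.uname()[1])


# A correctly named slave (the domain here is made up for the example).
healthy = talos_title("Talos-R4-Snow-029.build.example.com")

# A slave that has fallen back to a generic name, as in the log above.
broken = talos_title("client-builders-mac-mini-10")
```

Here `healthy` is "talos-r4-snow-029" and `broken` is "client-builders-mac-mini-10". Nothing in this expression validates the name, so whatever the OS reports is passed straight through, and the graph server rejects it only at submission time.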
Blocks: 713055
No longer depends on: 713055
Summary: Some machines can loose their name and burn talos jobs because they end up with a name like client-builders-mac-mini-10 → Some machines can lose their name and burn talos jobs because they end up with a name like client-builders-mac-mini-10
We're switching to the new Puppet infra soon (<1 week). If we have problematic slaves, we should disable them until we sync up with the new Puppet infra. Callek, what is the bug for the new Puppet infra?

Slaves with the issue:
slave: talos-r4-snow-029
slave: talos-r4-lion-067
slave: talos-r4-snow-053
Flags: needinfo?(bugspam.Callek)
Summary: Some machines can lose their name and burn talos jobs because they end up with a name like client-builders-mac-mini-10 → Rev4 machines have Puppet disabled which can lose their name and burn talos jobs because they end up with a name like client-builders-mac-mini-10
Flags: needinfo?(bugspam.Callek) → needinfo?(coop)
Even with Puppet attached, we weren't immune to this, but we would error out *in puppet* before taking jobs. I would shy away from disabling these slaves. Wait times on these platforms are already terrible. Running the steps in remote_scutil_cmds.bash, either via the script or by hand, will resurrect a machine in this state: https://hg.mozilla.org/build/braindump/file/8ccc8daef11b/mac-related/remote_scutil_cmds.bash
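The "error out before taking jobs" behaviour mentioned above could, in principle, also be approximated on the slave itself. The following is only a rough sketch under stated assumptions, not what Puppet or the talos harness actually did: the name pattern is inferred solely from the three slave names listed in this bug, and `HostnameError` and `check_title` are hypothetical names.

```python
import re

# Assumed naming pattern, inferred only from the slave names in this bug
# (talos-r4-snow-029, talos-r4-lion-067, talos-r4-snow-053). The real pool
# may well contain other platforms or number widths.
EXPECTED_NAME = re.compile(r"^talos-r4-(snow|lion)-\d{3}$")


class HostnameError(Exception):
    """Raised when the machine name does not look like a known slave name."""


def check_title(title):
    """Return title unchanged if it matches the assumed pattern, else raise."""
    if not EXPECTED_NAME.match(title):
        raise HostnameError(
            "unexpected machine name %r; refusing to take jobs" % title
        )
    return title
```

Calling `check_title("talos-r4-lion-067")` returns the name unchanged, while `check_title("client-builders-mac-mini-10")` raises `HostnameError`. Failing this early would keep the job from burning, but the machine would still need its name restored, for example with the script linked above.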
Flags: needinfo?(coop)
Product: mozilla.org → Release Engineering
RyanVM says he hasn't seen this bug in ages.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Component: General Automation → General