Closed
Bug 899784
Opened 12 years ago
Closed 12 years ago
Rev4 machines have Puppet disabled which can lose their name and burn talos jobs because they end up with a name like client-builders-mac-mini-10
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: armenzg, Unassigned)
References
Details
https://tbpl.mozilla.org/php/getParsedLog.php?id=25925156&tree=Mozilla-Inbound#error0
./configs/talos/linux_config.py: "title": os.uname()[1].lower().split('.')[0],
./configs/talos/mac_config.py: "title": os.uname()[1].lower().split('.')[0],
12:55:08 CRITICAL - DEBUG : process_Request line: No machine_name called 'client-builders-mac-mini-10' can be found
12:55:08 CRITICAL - DEBUG : process_Request line: raise DatabaseException("No machine_name called '%s' can be found" % self.machine_name)
12:55:13 CRITICAL - DEBUG : process_Request line: No machine_name called 'client-builders-mac-mini-10' can be found
12:55:13 CRITICAL - DEBUG : process_Request line: raise DatabaseException("No machine_name called '%s' can be found" % self.machine_name)
12:55:23 CRITICAL - DEBUG : process_Request line: No machine_name called 'client-builders-mac-mini-10' can be found
12:55:23 CRITICAL - DEBUG : process_Request line: raise DatabaseException("No machine_name called '%s' can be found" % self.machine_name)
12:55:43 CRITICAL - DEBUG : process_Request line: No machine_name called 'client-builders-mac-mini-10' can be found
12:55:43 CRITICAL - DEBUG : process_Request line: raise DatabaseException("No machine_name called '%s' can be found" % self.machine_name)
12:56:23 CRITICAL - DEBUG : process_Request line: No machine_name called 'client-builders-mac-mini-10' can be found
12:56:23 CRITICAL - DEBUG : process_Request line: raise DatabaseException("No machine_name called '%s' can be found" % self.machine_name)
12:57:43 CRITICAL - FAIL: Graph server unreachable (5 attempts)
12:57:43 CRITICAL - RETURN:No machine_name called 'client-builders-mac-mini-10' can be found
12:57:43 CRITICAL - RETURN: raise DatabaseException("No machine_name called '%s' can be found" % self.machine_name)
12:57:43 ERROR - Traceback (most recent call last):
12:57:43 CRITICAL - talos.utils.talosError: 'Graph server unreachable (5 attempts)\nsend failed, graph server says:\nNo machine_name called \'client-builders-mac-mini-10\' can be found\n File "/var/www/html/graphs/server/pyfomatic/collect.py", line 271, in handleRequest\n metadata = MetaDataFromTalos(databaseCursor, databaseModule, inputStream)\n File "/var/www/html/graphs/server/pyfomatic/collect.py", line 63, in __init__\n self.doDatabaseThings(databaseCursor)\n File "/var/www/html/graphs/server/pyfomatic/collect.py", line 92, in doDatabaseThings\n raise DatabaseException("No machine_name called \'%s\' can be found" % self.machine_name)\n\n'
12:57:43 ERROR - Return code: 1
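The two config lines quoted above take the Talos title from the short hostname, so a machine that has fallen back to a generic name reports that name to the graph server. A minimal Python sketch of that derivation, plus a pre-flight check, is shown here. The `SLAVE_NAME_RE` pattern and the `looks_like_slave` helper are hypothetical: the pattern is only inferred from the slave names mentioned in this bug and is not part of the Talos configs.

```python
import os
import re

# Hypothetical pattern, inferred from names such as talos-r4-snow-029.
SLAVE_NAME_RE = re.compile(r"^talos-r4-(snow|lion)-\d{3}$")


def talos_title(nodename=None):
    """Derive the title the same way the quoted config lines do."""
    if nodename is None:
        nodename = os.uname()[1]
    return nodename.lower().split('.')[0]


def looks_like_slave(title):
    """Return True if the title matches the assumed Rev4 naming scheme."""
    return SLAVE_NAME_RE.match(title) is not None


if __name__ == "__main__":
    title = talos_title()
    if not looks_like_slave(title):
        # A machine in the state described here would stop before taking jobs.
        raise SystemExit("unexpected machine name: %s" % title)
```

A check like this would only move the failure earlier; it does not restore the lost name.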
Updated • 12 years ago
Blocks: t-snow-r4-0051
Reporter
Updated • 12 years ago
Summary: Some machines can loose their name and burn talos jobs because they end up with a name like client-builders-mac-mini-10 → Some machines can lose their name and burn talos jobs because they end up with a name like client-builders-mac-mini-10
Reporter
Comment 3 • 12 years ago
We're switching to the new Puppet infra soon (in less than a week).
If we have problematic slaves, we should disable them until we sync up with the new Puppet infra.
Callek, what is the bug for the new Puppet infra?
Slaves with the issue:
slave: talos-r4-snow-029
slave: talos-r4-lion-067
slave: talos-r4-snow-053
Flags: needinfo?(bugspam.Callek)
Summary: Some machines can lose their name and burn talos jobs because they end up with a name like client-builders-mac-mini-10 → Rev4 machines have Puppet disabled which can lose their name and burn talos jobs because they end up with a name like client-builders-mac-mini-10
Reporter
Updated • 12 years ago
Flags: needinfo?(bugspam.Callek) → needinfo?(coop)
Comment 4 • 12 years ago
Even with puppet attached, we weren't immune to this, but we would error out *in puppet* before taking jobs.
I would shy away from disabling these slaves. Wait times on these platforms are already terrible.
Running the steps in remote_scutil_cmds.bash, either via the script or by hand, will resurrect a machine in this state:
https://hg.mozilla.org/build/braindump/file/8ccc8daef11b/mac-related/remote_scutil_cmds.bash
Flags: needinfo?(coop)
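The contents of remote_scutil_cmds.bash are not reproduced in this bug. As a rough sketch only, assuming the fix amounts to resetting the three macOS names with `scutil --set`, the commands could be generated as follows. The function name `scutil_rename_cmds` is hypothetical, and the linked script may well do more or do it differently.

```python
def scutil_rename_cmds(hostname):
    """Return the scutil invocations (as argv lists) that set all three macOS names.

    Nothing is executed here; the commands would need to be run as root
    on the affected slave.
    """
    short = hostname.split('.')[0]
    return [
        ["scutil", "--set", "ComputerName", short],
        ["scutil", "--set", "LocalHostName", short],
        ["scutil", "--set", "HostName", hostname],
    ]


if __name__ == "__main__":
    # Print the commands for one of the slaves listed in this bug.
    for cmd in scutil_rename_cmds("talos-r4-snow-029"):
        print(" ".join(cmd))
```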
Assignee
Updated • 12 years ago
Product: mozilla.org → Release Engineering
Comment 12 • 12 years ago
RyanVM says he hasn't seen this bug in ages.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Assignee
Updated • 8 years ago
Component: General Automation → General