Closed Bug 488288 Opened 16 years ago Closed 16 years ago

Some talos boxes no longer reporting

Categories

(Release Engineering :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: johnath, Unassigned)

Details

Looking at the perf dashboard, I see some machines have disappeared on 191 and trunk.

On 191:

XP - qm-pxp-talos02 (e.g. http://graphs.mozilla.org/#show=2419214,2420860,2417627)
Linux - qm-plinux-talos02 and maybe -talos01? (e.g. http://graphs.mozilla.org/#show=2421549,2424804,2418074)

On central:

Vista - qm-pvista-trunk03 (e.g. http://graphs.mozilla.org/#show=787159,787149,787160)
Linux - qm-plinux-trunk02 (e.g. http://graphs.mozilla.org/#show=395143,395148,395172)

Copying joduinn as releng contact this week, and catlee because he was mentioning earlier that he had kicked one of these boxes.

As I understand it, these boxes are watched by Nagios but it will not complain if the box is responding, even if buildbot isn't running or it isn't picking up talos jobs. Is that fixable? The dashboard does make their disappearance more visible since you lose one of the three-coloured lines, but obviously it would be better if it set off alarms whenever some machine wandered off.
If this query can be run on the graph server semi-regularly, we can catch these boxes:

    SELECT machines.name AS machine_name
    FROM machines
    WHERE is_active = 1
      AND EXISTS (
        SELECT * FROM test_runs
        WHERE test_runs.machine_id = machines.id
      )
      AND NOT EXISTS (
        SELECT * FROM test_runs
        WHERE test_runs.machine_id = machines.id
          AND test_runs.date_run > %(cutoff)s
      )

The test_runs table needs to have indexes on the machine_id and date_run columns to make this run efficiently. 'cutoff' is a timestamp that sets how far back we look for data: an active machine with no test run newer than 'cutoff' is reported as missing.

The first EXISTS clause could be omitted if the is_active data is up-to-date for all the hosts. In my copy of the database there are many hosts in the machines table with is_active=1 that have never reported any results.
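For anyone who wants to try the query before wiring it into monitoring, here is a rough sketch in Python. It is not the graph server's code: it uses an in-memory SQLite database as a stand-in, so the placeholder is written as :cutoff instead of %(cutoff)s, and it assumes date_run is stored as a Unix timestamp. The function names, the demo machine names, and the demo rows are all made up for illustration.

```python
import sqlite3
import time

# Same query as in the comment above, with SQLite's named-parameter
# style (:cutoff) instead of the pyformat style (%(cutoff)s).
STALE_MACHINES_SQL = """
SELECT machines.name AS machine_name
FROM machines
WHERE is_active = 1
  AND EXISTS (SELECT * FROM test_runs
              WHERE test_runs.machine_id = machines.id)
  AND NOT EXISTS (SELECT * FROM test_runs
                  WHERE test_runs.machine_id = machines.id
                    AND test_runs.date_run > :cutoff)
"""


def find_stale_machines(conn, max_age_seconds, now=None):
    """Return sorted names of active machines with no run newer than the cutoff.

    Assumption: date_run holds Unix timestamps (seconds).
    """
    now = time.time() if now is None else now
    cutoff = int(now - max_age_seconds)
    rows = conn.execute(STALE_MACHINES_SQL, {"cutoff": cutoff}).fetchall()
    return sorted(row[0] for row in rows)


def build_demo_db():
    """Create a tiny in-memory database with hypothetical data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE machines (
            id INTEGER PRIMARY KEY, name TEXT, is_active INTEGER);
        CREATE TABLE test_runs (
            id INTEGER PRIMARY KEY, machine_id INTEGER, date_run INTEGER);
        -- Indexes on the two columns the query filters on.
        CREATE INDEX test_runs_machine_id ON test_runs (machine_id);
        CREATE INDEX test_runs_date_run ON test_runs (date_run);
    """)
    conn.executemany(
        "INSERT INTO machines (id, name, is_active) VALUES (?, ?, ?)",
        [
            (1, "demo-stale", 1),           # active, last run long ago
            (2, "demo-fresh", 1),           # active, reported recently
            (3, "demo-never-reported", 1),  # active, no runs at all
            (4, "demo-retired", 0),         # inactive, old runs only
        ],
    )
    conn.executemany(
        "INSERT INTO test_runs (machine_id, date_run) VALUES (?, ?)",
        [(1, 1000), (2, 9000), (4, 1000)],
    )
    conn.commit()
    return conn
```

Calling find_stale_machines(build_demo_db(), 3600, now=10000) uses a cutoff of 6400, so only the active machine whose last run is older than that is returned. The machine that never reported is skipped by the first EXISTS clause, and the inactive one by is_active. A cron job or Nagios check could alert whenever the returned list is non-empty.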
I should also mention that I kicked qm-pxp-talos02 this morning...which doesn't seem to have helped. I'll reboot it now.
In terms of monitoring for missing machines, that should be shifted to bug 476966. This bug should follow the missing boxes in question till they are back online.
(In reply to comment #4)
> Looking at the perf dashboard, I see some machines have disappeared on 191 and
> trunk.
>
> On 191:
>
> XP - qm-pxp-talos02 (e.g.
> http://graphs.mozilla.org/#show=2419214,2420860,2417627)

Now reporting on graphs.m.o.

> Linux - qm-plinux-talos02 and maybe -talos01? (e.g.
> http://graphs.mozilla.org/#show=2421549,2424804,2418074 )

Both now reporting on graphs.m.o.

> On central:
> Vista - qm-pvista-trunk03 (e.g.
> http://graphs.mozilla.org/#show=787159,787149,787160 )

Now reporting on graphs.m.o.

> Linux - qm-plinux-trunk02 (e.g.
> http://graphs.mozilla.org/#show=395143,395148,395172 )

Now reporting on graphs.m.o.

All done, so closing.

> Copying joduinn as releng contact this week, and catlee because he was
> mentioning earlier that he had kicked one of these boxes.
>
> As I understand it, these boxes are watched by Nagios but it will not complain
> if the box is responding, even if buildbot isn't running or it isn't picking up
> talos jobs. Is that fixable? The dashboard does make their disappearance more
> visible since you lose one of the three-coloured lines, but obviously it would
> be better if it set off alarms whenever some machine wandered off.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Component: Release Engineering: Talos → Release Engineering
Product: mozilla.org → Release Engineering