Bug 488288
Opened 16 years ago
Closed 16 years ago
Some talos boxes no longer reporting
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: johnath, Unassigned)
Looking at the perf dashboard, I see some machines have disappeared on 191 and trunk.
On 191:
XP - qm-pxp-talos02 (e.g. http://graphs.mozilla.org/#show=2419214,2420860,2417627)
Linux - qm-plinux-talos02 and maybe -talos01? (e.g. http://graphs.mozilla.org/#show=2421549,2424804,2418074 )
On central:
Vista - qm-pvista-trunk03 (e.g. http://graphs.mozilla.org/#show=787159,787149,787160 )
Linux - qm-plinux-trunk02 (e.g. http://graphs.mozilla.org/#show=395143,395148,395172 )
Copying joduinn as releng contact this week, and catlee because he was mentioning earlier that he had kicked one of these boxes.
As I understand it, these boxes are watched by Nagios but it will not complain if the box is responding, even if buildbot isn't running or it isn't picking up talos jobs. Is that fixable? The dashboard does make their disappearance more visible since you lose one of the three-coloured lines, but obviously it would be better if it set off alarms whenever some machine wandered off.
Comment 1•16 years ago
If this query can be run on the graph server semi-regularly, we can catch these boxes:
SELECT
    machines.name AS machine_name
FROM
    machines
WHERE
    is_active = 1 AND
    EXISTS
        (SELECT * FROM
            test_runs
        WHERE
            test_runs.machine_id = machines.id
        ) AND
    NOT EXISTS
        (SELECT * FROM
            test_runs
        WHERE
            test_runs.machine_id = machines.id AND
            test_runs.date_run > %(cutoff)s
        )
The test_runs table needs indexes on the machine_id and date_run columns for this query to run efficiently. 'cutoff' is a timestamp that sets how far back we look for data: any active machine with no test run newer than 'cutoff' is reported.
The first EXISTS clause could be omitted if the is_active data is up-to-date for all the hosts. In my copy of the database there are many hosts in the machines table with is_active=1, but they have never reported any results.
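As a rough illustration of how the query above might be run periodically, here is a minimal sketch using Python's built-in sqlite3 module against an in-memory database. It is not the graph server's actual code. The assumptions: date_run is stored as an integer Unix timestamp, the schema contains only the columns the query mentions, the machine names (demo-fresh, demo-silent, and so on) are made up, the `:cutoff` placeholder replaces `%(cutoff)s` because sqlite3 uses a different parameter style, an ORDER BY is added for stable output, and the helper names (find_silent_machines, nagios_result, build_demo_db) are hypothetical.

```python
import sqlite3

# Same query as in the comment above. Changes made for this sketch only:
# sqlite3's ":cutoff" named placeholder replaces "%(cutoff)s", and an
# ORDER BY is added so the result order is deterministic.
SILENT_MACHINES_SQL = """
SELECT machines.name AS machine_name
FROM machines
WHERE is_active = 1
  AND EXISTS (
      SELECT * FROM test_runs
      WHERE test_runs.machine_id = machines.id
  )
  AND NOT EXISTS (
      SELECT * FROM test_runs
      WHERE test_runs.machine_id = machines.id
        AND test_runs.date_run > :cutoff
  )
ORDER BY machines.name
"""


def find_silent_machines(conn, cutoff):
    """Return names of active machines that have reported before,
    but have no test run newer than `cutoff`."""
    rows = conn.execute(SILENT_MACHINES_SQL, {"cutoff": cutoff}).fetchall()
    return [row[0] for row in rows]


def nagios_result(silent):
    """Map the result to a Nagios-plugin-style (exit code, message) pair.
    Exit code 0 means OK and 2 means CRITICAL in the plugin convention."""
    if silent:
        return 2, "CRITICAL: not reporting: " + ", ".join(silent)
    return 0, "OK: all active machines reporting"


def build_demo_db(now):
    """Create an in-memory database with an assumed minimal schema
    and made-up machines, for demonstration only."""
    day = 86400
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE machines (
            id INTEGER PRIMARY KEY, name TEXT, is_active INTEGER);
        CREATE TABLE test_runs (
            id INTEGER PRIMARY KEY, machine_id INTEGER, date_run INTEGER);
        -- Indexes on the two columns the comment above calls out.
        CREATE INDEX idx_test_runs_machine_id ON test_runs (machine_id);
        CREATE INDEX idx_test_runs_date_run ON test_runs (date_run);
    """)
    conn.executemany(
        "INSERT INTO machines (id, name, is_active) VALUES (?, ?, ?)",
        [
            (1, "demo-fresh", 1),    # reported an hour ago
            (2, "demo-silent", 1),   # last reported three days ago
            (3, "demo-never", 1),    # active flag set, never reported
            (4, "demo-retired", 0),  # inactive, so ignored
        ],
    )
    conn.executemany(
        "INSERT INTO test_runs (machine_id, date_run) VALUES (?, ?)",
        [(1, now - 3600), (2, now - 3 * day), (4, now - 30 * day)],
    )
    return conn


if __name__ == "__main__":
    now = 10_000_000  # fixed fake "current time" so the demo is repeatable
    conn = build_demo_db(now)
    code, message = nagios_result(find_silent_machines(conn, now - 86400))
    print(code, message)
```

In this demo data, demo-never is skipped by the first EXISTS clause, which is the case described above where is_active=1 is set for hosts that have never reported any results.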
Comment 2•16 years ago
I should also mention that I kicked qm-pxp-talos02 this morning...which doesn't seem to have helped. I'll reboot it now.
Comment 3•16 years ago
In terms of monitoring for missing machines, that should be shifted to bug 476966. This bug should follow the missing boxes in question till they are back online.
Comment 4•16 years ago
(In reply to comment #0)
> Looking at the perf dashboard, I see some machines have disappeared on 191 and
> trunk.
>
> On 191:
>
> XP - qm-pxp-talos02 (e.g.
> http://graphs.mozilla.org/#show=2419214,2420860,2417627)
Now reporting on graphs.m.o.
> Linux - qm-plinux-talos02 and maybe -talos01? (e.g.
> http://graphs.mozilla.org/#show=2421549,2424804,2418074 )
Both now reporting on graphs.m.o.
> On central:
> Vista - qm-pvista-trunk03 (e.g.
> http://graphs.mozilla.org/#show=787159,787149,787160 )
Now reporting on graphs.m.o.
> Linux - qm-plinux-trunk02 (e.g.
> http://graphs.mozilla.org/#show=395143,395148,395172 )
Now reporting on graphs.m.o.
All done, so closing.
> Copying joduinn as releng contact this week, and catlee because he was
> mentioning earlier that he had kicked one of these boxes.
>
> As I understand it, these boxes are watched by Nagios but it will not complain
> if the box is responding, even if buildbot isn't running or it isn't picking up
> talos jobs. Is that fixable? The dashboard does make their disappearance more
> visible since you lose one of the three-coloured lines, but obviously it would
> be better if it set off alarms whenever some machine wandered off.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Updated•12 years ago
Product: mozilla.org → Release Engineering