Support deploying the `monitoring-agent` Ansible role on hgssh
Categories
(Developer Services :: Mercurial: hg.mozilla.org, task, P1)
Tracking
(Not tracked)
People
(Reporter: sheehan, Assigned: sheehan)
Attachments
(3 files)
In bug 1525954 I added support for monitoring the hgweb hosts behind hg.mo with the InfluxDB stack. We created a `monitoring-agent` role which installs Telegraf and some associated plugins. These plugins currently assume they are running on a web head (for example, the Apache plugin and the vcsreplicator-consumer monitoring) and are therefore incompatible with hgssh.
Let's refactor our Ansible config to support monitoring hgssh hosts as well. We can add support for monitoring things such as the aggregator daemon lag, Try heads, etc. This will also be useful as we start adding more web heads for CI-private mirrors, which will increase the load on the hgssh servers. Monitoring that load and ensuring it doesn't regress the performance of things like Try pushes will be critical.
Comment 1 (Assignee) • 5 years ago
This way we can deploy our Telegraf config to hgssh,
and monitor host-specific things (such as aggregator
lag on hgssh, consumer lag on hgweb, etc).
Comment 2 (Assignee) • 5 years ago
This commit adds a `--telegraf` flag to the aggregator lag check
script running on hgssh. When the script is run with this flag,
the output will be formatted as JSON, consumable by Telegraf's
`exec` plugin. With this flag added, we will be able to send
aggregator lag data to InfluxDB for monitoring/alerting purposes.
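As a rough illustration of the idea in this comment, here is a minimal sketch of a lag check that switches between human-readable text and JSON based on a `--telegraf` flag. It is not the code from the commit: the function name `format_lag`, the field names (`partition`, `message_lag`, `time_lag`), and the sample data are all assumptions made for the example. It relies only on the fact that Telegraf's JSON data format can parse an array of flat objects with numeric values.

```python
import argparse
import json


def format_lag(lag_by_partition, telegraf=False):
    """Render per-partition lag as JSON (for Telegraf) or as plain text.

    `lag_by_partition` maps a partition number to a tuple of
    (message_lag, time_lag_seconds). The shape is hypothetical.
    """
    if telegraf:
        # One flat object per partition, with numeric values only, so a
        # JSON-parsing collector can turn each value into a field.
        points = [
            {"partition": partition, "message_lag": msgs, "time_lag": secs}
            for partition, (msgs, secs) in sorted(lag_by_partition.items())
        ]
        return json.dumps(points, sort_keys=True)

    # Default: one human-readable line per partition.
    return "\n".join(
        f"partition {partition}: {msgs} messages, {secs}s behind"
        for partition, (msgs, secs) in sorted(lag_by_partition.items())
    )


def main(argv=None):
    parser = argparse.ArgumentParser(description="Report aggregator lag (sketch)")
    parser.add_argument(
        "--telegraf",
        action="store_true",
        help="emit JSON suitable for a JSON-parsing exec collector",
    )
    args = parser.parse_args(argv)

    # Placeholder data; a real check would query the replication system.
    sample = {0: (0, 0.0), 1: (3, 1.5)}
    print(format_lag(sample, telegraf=args.telegraf))


if __name__ == "__main__":
    main()
```

Keeping the formatting in one function means the same lag data feeds both the existing text output and the JSON output, so the two cannot drift apart.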
Comment 3 (Assignee) • 5 years ago
This commit adds an `hgssh`-specific set of configuration options
to the Telegraf configuration file. At the moment we track the
lag of the "aggregator" replication daemons, and the presence
of processes for those daemons.
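To make the shape of such a configuration concrete, the sketch below renders a Telegraf fragment with an `exec` input (for the lag check) and a `procstat` input (for process presence). It is only an illustration, not the template from the commit: the function name `render_hgssh_inputs`, the `aggregator_lag` measurement name, the `5s` timeout, and the command and process pattern passed in the usage example are all hypothetical.

```python
import json


def render_hgssh_inputs(lag_command, process_pattern):
    """Return a Telegraf TOML fragment with an exec and a procstat input."""
    # json.dumps produces a double-quoted, escaped string, which is also a
    # valid TOML basic string for simple ASCII values like these.
    return "\n".join(
        [
            "[[inputs.exec]]",
            f"  commands = [{json.dumps(lag_command)}]",
            '  timeout = "5s"',
            '  data_format = "json"',
            '  name_override = "aggregator_lag"',
            "",
            "[[inputs.procstat]]",
            f"  pattern = {json.dumps(process_pattern)}",
            "",
        ]
    )


# Hypothetical command path and process pattern, for illustration only.
print(render_hgssh_inputs("/hypothetical/check-aggregator-lag --telegraf",
                          "hypothetical-aggregator-daemon"))
```

The `exec` input covers the lag metric and the `procstat` input covers process presence, which mirrors the two things this comment says are tracked.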
Pushed by cosheehan@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/fae277edab13
ansible/monitoring-agent: move `hgweb` specific components behind a variable r=smacleod
https://hg.mozilla.org/hgcustom/version-control-tools/rev/7853f082ab09
vcsreplicator: add `--telegraf` flag to aggregator lag check r=smacleod
https://hg.mozilla.org/hgcustom/version-control-tools/rev/8644eae3410c
ansible/monitoring-agent: add `hgssh` specific plugins to Telegraf config r=smacleod
Pushed by cosheehan@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/a9512159265c
terraform: allow MDC1 hosts to send traffic to InfluxDB test instance