Closed Bug 1545434 Opened 5 years ago Closed 5 years ago

Support deploying the `monitoring-agent` Ansible role on hgssh

Categories

(Developer Services :: Mercurial: hg.mozilla.org, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sheehan, Assigned: sheehan)

Attachments

(3 files)

In bug 1525954 I added support for monitoring the hgweb hosts behind hg.mo with the InfluxDB stack. We created a monitoring-agent role which installs Telegraf and some associated plugins. These plugins currently assume they are running on a web head (for example, the Apache plugin, vcsreplicator-consumer monitoring, etc.) and are therefore incompatible with hgssh.

Let's refactor our Ansible config to support monitoring hgssh hosts as well. We can add support for monitoring things such as the aggregator daemon lag, Try heads, etc. This will also be useful as we start adding more web heads for CI-private mirrors, which will increase the load on the hgssh servers. Monitoring that load and ensuring it doesn't regress the performance of things like Try pushes will be critical.

Type: defect → task
Priority: -- → P1

This way we can deploy our Telegraf config to hgssh,
and monitor host-specific things (such as aggregator
lag on hgssh, consumer lag on hgweb, etc).

This commit adds a --telegraf flag to the aggregator lag check
script running on hgssh. When the script is run with this flag,
the output will be formatted as JSON, consumable by Telegraf's
exec plugin. With this flag added, we will be able to send
aggregator lag data to InfluxDB for monitoring/alerting purposes.
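
As a rough illustration of the flag described above, here is a minimal sketch of a lag check that switches between human-readable and JSON output. It is not the actual vcsreplicator script: the function names, the field names (`partition`, `message_lag`, `time_lag`) and the hard-coded sample values are all assumptions made for the example.

```python
import argparse
import json


def format_lag(partition_lag, telegraf=False):
    """Render per-partition lag as JSON or as plain text.

    partition_lag maps a partition number to a tuple of
    (messages behind, seconds behind). Field names are hypothetical.
    """
    if telegraf:
        # One JSON object per partition, with numeric values only.
        metrics = [
            {"partition": partition, "message_lag": messages, "time_lag": seconds}
            for partition, (messages, seconds) in sorted(partition_lag.items())
        ]
        return json.dumps(metrics, sort_keys=True)
    return "\n".join(
        "partition %d: %d messages behind, %.1fs lag" % (partition, messages, seconds)
        for partition, (messages, seconds) in sorted(partition_lag.items())
    )


def main(argv=None):
    parser = argparse.ArgumentParser(description="Report aggregator lag (sketch)")
    parser.add_argument(
        "--telegraf",
        action="store_true",
        help="format output as JSON for Telegraf's exec plugin",
    )
    args = parser.parse_args(argv)
    # Hard-coded sample data; a real check would query the replication log.
    sample = {0: (0, 0.0), 1: (3, 1.5)}
    print(format_lag(sample, telegraf=args.telegraf))
    return 0
```

Calling `main(["--telegraf"])` prints the JSON form, and `main([])` prints the text form. Keeping the formatting in a separate function means the lag-gathering code does not need to know which consumer is reading the output.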

This commit adds an hgssh-specific set of configuration options
to the Telegraf configuration file. At the moment we track the
lag of the "aggregator" replication daemons and the presence
of processes for those daemons.

Pushed by cosheehan@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/fae277edab13
ansible/monitoring-agent: move hgweb specific components behind a variable r=smacleod
https://hg.mozilla.org/hgcustom/version-control-tools/rev/7853f082ab09
vcsreplicator: add --telegraf flag to aggregator lag check r=smacleod
https://hg.mozilla.org/hgcustom/version-control-tools/rev/8644eae3410c
ansible/monitoring-agent: add hgssh specific plugins to Telegraf config r=smacleod

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED

Pushed by cosheehan@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/a9512159265c
terraform: allow MDC1 hosts to send traffic to InfluxDB test instance