1059325 - [Meta] Improve monitoring/alert coverage of treeherder (eg New Relic, Nagios, CloudWatch)

Reporter

Description

•

11 years ago

This bug is for the long-term fix for this problem: even if this particular instance clears up, getting this far behind is a tree-closing event, and so something that blocks switching to treeherder.

Ed Morley [:emorley]

Reporter

Comment 1

•

11 years ago

To clarify: looking at, say, mozilla-inbound on treeherder right now, I see jobs still marked as running as far back as https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&revision=2ccb65865db7 - even though they appear completed on TBPL and buildapi says they finished ~10 hours ago.

Mauro Doglio [:mdoglio]

Comment 2

•

11 years ago

This backlog in the data ingestion is due to a problem we had on production since the last code push yesterday. I saw an error in the routine that processes the incoming jobs, but it was not present on dev/stage. Everything was fixed by pushing the chief red button, without any code change. In my opinion, one way to mitigate this kind of problem would be to have a staging environment that matches the production architecture. Also, we could write a command to re-process the jobs that failed during ingestion in the last week or so.
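As a rough illustration of the re-processing command suggested above, here is a minimal Python sketch. It is not Treeherder code: the `IngestionFailure` record, the `process_job` callable and the 7-day default window are all hypothetical stand-ins, and it assumes failed ingestions are recorded somewhere with a job identifier and a timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, List, Optional


@dataclass
class IngestionFailure:
    """Hypothetical record of a job that failed during ingestion."""
    job_guid: str
    failed_at: datetime


def select_for_reprocessing(
    failures: Iterable[IngestionFailure],
    now: datetime,
    max_age: timedelta = timedelta(days=7),
) -> List[IngestionFailure]:
    """Keep only the failures that happened within the last `max_age`."""
    cutoff = now - max_age
    return [f for f in failures if f.failed_at >= cutoff]


def reprocess(
    failures: Iterable[IngestionFailure],
    process_job: Callable[[str], None],
    now: Optional[datetime] = None,
    max_age: timedelta = timedelta(days=7),
) -> List[str]:
    """Re-run ingestion for recent failures.

    `process_job` is an assumed hook into whatever routine normally
    processes an incoming job. Returns the GUIDs that failed again, so
    they can be reported rather than silently dropped.
    """
    now = now or datetime.now(timezone.utc)
    still_failing = []
    for failure in select_for_reprocessing(failures, now, max_age):
        try:
            process_job(failure.job_guid)
        except Exception:
            still_failing.append(failure.job_guid)
    return still_failing
```

In a real deployment this would probably be wrapped in a management command and read the failure records from the database; the sketch only shows the selection and retry logic.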

Ed Morley [:emorley]

Reporter

Updated

•

11 years ago

No longer blocks: treeherder-sheriff-transition

Ed Morley [:emorley]

Reporter

Updated

•

11 years ago

Blocks: treeherder-dev-transition

Priority: P1 → P2

Summary: Data ingestion for multiple repos is many hours behind → Set up alerts for when data ingestion is failing and/or backlogged

Ed Morley [:emorley]

Reporter

Updated

•

11 years ago

Priority: P2 → P3

Ed Morley [:emorley]

Reporter

Comment 3

•

11 years ago

Now that we have access to newrelic, this isn't a regression over TBPL (we're already better off than having to read the import logs on tbpl-dev/cache/...).

Blocks: 1076750, 1074927
No longer blocks: treeherder-dev-transition

Ed Morley [:emorley]

Reporter

Comment 4

•

11 years ago

Guess we need to decide whether newrelic alerts are sufficient, or if we should get coverage via nagios too?

Ed Morley [:emorley]

Reporter

Updated

•

11 years ago

Blocks: 1075799

Mauro Doglio [:mdoglio]

Comment 5

•

11 years ago

We should instrument the rabbitmq instance on the admin node to report to new relic. In that way we would be able to see how many tasks we have in each queue, etc. Maybe :fubar can help us?

Flags: needinfo?(klibby)
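To make the queue-depth idea above concrete, here is a minimal Python sketch of a check that turns per-queue message counts into a Nagios-style status. It assumes the input is the parsed JSON list returned by the RabbitMQ management plugin's /api/queues endpoint (each entry has "name" and "messages" fields). The thresholds and any queue names are hypothetical, not values used by treeherder.

```python
from typing import Dict, List, Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_queue_depths(
    queues: List[Dict],
    warn: int = 1000,   # hypothetical warning threshold
    crit: int = 5000,   # hypothetical critical threshold
) -> Tuple[int, str]:
    """Return (exit_code, message) for a list of queue descriptions.

    The overall status is the worst status of any single queue, and the
    message lists every queue at or above the warning threshold.
    """
    worst = OK
    offenders = []
    for queue in queues:
        depth = queue.get("messages", 0)
        if depth >= crit:
            level = CRITICAL
        elif depth >= warn:
            level = WARNING
        else:
            continue
        worst = max(worst, level)
        offenders.append(f"{queue['name']}={depth}")

    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[worst]
    detail = ", ".join(offenders) if offenders else "all queues below thresholds"
    return worst, f"{label}: {detail}"
```

A wrapper script would fetch the queue list, print the message and exit with the returned code. The same counts could equally be pushed to New Relic as custom metrics instead of being checked locally.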

Ed Morley [:emorley]

Reporter

Comment 6

•

11 years ago

Ah great idea :-)

Kendall Libby [:fubar] (he/him)

Comment 7

•

11 years ago

As it happens, we are/were missing monitoring pieces for treeherder in nagios. :-( I've added standard http/s checks for zeus and the webheads. I'm currently looking at what nagios can do for rabbitmq, etc. New Relic is already set up on the admin node.

Flags: needinfo?(klibby)

Ed Morley [:emorley]

Reporter

Comment 8

•

11 years ago

Making this bug more generic, since it sounds like we need to increase coverage for things other than data ingestion too.

(In reply to Mauro Doglio [:mdoglio] from comment #5)
> We should instrument the rabbitmq instance on the admin node to report to
> new relic.
> In that way we would be able to see how many tasks we have in each queue,
> etc.

Using something like this?
http://newrelic.com/plugins/pivotal/95

Priority: P3 → P2

Summary: Set up alerts for when data ingestion is failing and/or backlogged → Improve Nagios & New Relic coverage of treeherder

Ed Morley [:emorley]

Reporter

Updated

•

11 years ago

Blocks: 1072681

Ed Morley [:emorley]

Reporter

Updated

•

11 years ago

Depends on: 1076737

Ed Morley [:emorley]

Reporter

Updated

•

11 years ago

Blocks: 1080757
No longer blocks: 1072681

Ed Morley [:emorley]

Reporter

Updated

•

10 years ago

No longer blocks: 1080757

Ed Morley [:emorley]

Reporter

Updated

•

10 years ago

Component: Treeherder → Treeherder: Infrastructure

Ed Morley [:emorley]

Reporter

Updated

•

10 years ago

Depends on: 1093757

Keywords: meta

Summary: Improve Nagios & New Relic coverage of treeherder → [Meta] Improve Nagios & New Relic coverage of treeherder

Ed Morley [:emorley]

Reporter

Updated

•

10 years ago

Depends on: 1124278

Ed Morley [:emorley]

Reporter

Updated

•

10 years ago

Depends on: 1125395

Ed Morley [:emorley]

Reporter

Updated

•

10 years ago

Depends on: 1125569

Ed Morley [:emorley]

Reporter

Updated

•

10 years ago

Depends on: 1127774

Ed Morley [:emorley]

Reporter

Updated

•

10 years ago

Depends on: 1131130

Ed Morley [:emorley]

Reporter

Updated

•

10 years ago

Depends on: 1131171

Ed Morley [:emorley]

Reporter

Updated

•

10 years ago

Depends on: 1131240

Ed Morley [:emorley]

Reporter

Updated

•

10 years ago

Depends on: 1131244

Ed Morley [:emorley]

Reporter

Updated

•

10 years ago

Depends on: 1131247

Ed Morley [:emorley]

Reporter

Updated

•

10 years ago

Depends on: 1131394

Ed Morley [:emorley]

Reporter

Updated

•

10 years ago

Priority: P2 → P3

Ed Morley [:emorley]

Reporter

Updated

•

10 years ago

Depends on: 1141036

Ed Morley [:emorley]

Reporter

Updated

•

10 years ago

Depends on: 1141993

Ed Morley [:emorley]

Reporter

Updated

•

10 years ago

Depends on: 1076886

Ed Morley [:emorley]

Reporter

Updated

•

10 years ago

Depends on: 1165229

Ed Morley [:emorley]

Reporter

Updated

•

10 years ago

Depends on: 1191080

Ed Morley [:emorley]

Reporter

Updated

•

10 years ago

Depends on: 1200379

Ed Morley [:emorley]

Reporter

Updated

•

10 years ago

Depends on: 1201086

Ed Morley [:emorley]

Reporter

Updated

•

10 years ago

Depends on: 1223450

Ed Morley [:emorley]

Reporter

Updated

•

10 years ago

Depends on: 1223496

Ed Morley [:emorley]

Reporter

Updated

•

10 years ago

Depends on: 1225504

Ed Morley [:emorley]

Reporter

Updated

•

9 years ago

Depends on: 1276249

Mauro Doglio [:mdoglio]

Updated

•

9 years ago

QA Contact: laura

Ed Morley [:emorley]

Reporter

Updated

•

9 years ago

Depends on: 1281850

Ed Morley [:emorley]

Reporter

Updated

•

9 years ago

Depends on: 1284289

Ed Morley [:emorley]

Reporter

Updated

•

9 years ago

Depends on: 1287950

Ed Morley [:emorley]

Reporter

Updated

•

9 years ago

Summary: [Meta] Improve Nagios & New Relic coverage of treeherder → [Meta] Improve monitoring/alert coverage of treeherder (eg New Relic, Nagios, CloudWatch)

Ed Morley [:emorley]

Reporter

Updated

•

9 years ago

Depends on: 1306597

Ed Morley [:emorley]

Reporter

Updated

•

9 years ago

Depends on: 1176412

Ed Morley [:emorley]

Reporter

Updated

•

9 years ago

Depends on: 1201063

Ed Morley [:emorley]

Reporter

Updated

•

9 years ago

Depends on: 1307465

Ed Morley [:emorley]

Reporter

Updated

•

9 years ago

Depends on: 1308549

Ed Morley [:emorley]

Reporter

Updated

•

8 years ago

Depends on: 1336276

Ed Morley [:emorley]

Reporter

Updated

•

8 years ago

Depends on: 1340132, 1340123

Ed Morley [:emorley]

Reporter

Updated

•

8 years ago

Depends on: 1340203

Ed Morley [:emorley]

Reporter

Updated

•

8 years ago

Depends on: 1340216

Ed Morley [:emorley]

Reporter

Updated

•

8 years ago

Depends on: 1346204

Ed Morley [:emorley]

Reporter

Updated

•

8 years ago

Depends on: 1354484

Ed Morley [:emorley]

Reporter

Updated

•

8 years ago

Depends on: 1357538

Ed Morley [:emorley]

Reporter

Updated

•

8 years ago

Depends on: 1371264

Ed Morley [:emorley]

Reporter

Updated

•

8 years ago

Depends on: 1373245

Ed Morley [:emorley]

Reporter

Updated

•

8 years ago

Assignee: nobody → emorley

Ed Morley [:emorley]

Reporter

Updated

•

8 years ago

Depends on: 1387475

Ed Morley [:emorley]

Reporter

Updated

•

8 years ago

Depends on: 1387487

Ed Morley [:emorley]

Reporter

Updated

•

8 years ago

Depends on: 1387543

Ed Morley [:emorley]

Reporter

Updated

•

8 years ago

Depends on: 1387556

Ed Morley [:emorley]

Reporter

Updated

•

8 years ago

Depends on: 1387642

Ed Morley [:emorley]

Reporter

Updated

•

8 years ago

Depends on: 1393194

Ed Morley [:emorley]

Reporter

Updated

•

8 years ago

Depends on: 1397727

Ed Morley [:emorley]

Reporter

Updated

•

8 years ago

Depends on: 1413891

Ed Morley [:emorley]

Reporter

Updated

•

8 years ago

Assignee: emorley → nobody

QA Contact: laura

Ed Morley [:emorley]

Reporter

Comment 9

•

7 years ago

I've just updated the Heroku metrics alerts (see https://devcenter.heroku.com/articles/metrics#threshold-alerting) to go to treeherder-internal@ rather than only me.

Ed Morley [:emorley]

Reporter

Updated

•

7 years ago

Depends on: 1439368

Ed Morley [:emorley]

Reporter

Updated

•

7 years ago

Depends on: 1463709

Ed Morley [:emorley]

Reporter

Updated

•

7 years ago

Depends on: 1483301

Ed Morley [:emorley]

Reporter

Updated

•

7 years ago

Depends on: 1503576

Ed Morley [:emorley]

Reporter

Updated

•

7 years ago

No longer depends on: 1463709

Ed Morley [:emorley]

Reporter

Updated

•

6 years ago

Depends on: 1513506

Karl Thiessen [:kthiessen, he/him]

Updated

•

6 years ago

Type: defect → task

Armen [:armenzg]

Comment 10

•

6 years ago

The two bugs are in the component's queue, and that queue is not massive.
They're filed if we ever plan to tackle them. No need for a meta bug for only two bugs.

Status: NEW → RESOLVED

Closed: 6 years ago

Resolution: --- → INVALID