Closed Bug 1277436 Opened 8 years ago Closed 8 years ago

Trees closed, finished jobs not showing up on treeherder

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: camd)


https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=30e0d233fb29b0ff0d56d9d1704b5b136f16941e is about 12 hours' worth of pushes, and is currently showing every single push as still being in progress, with hundreds of jobs that have been "running" for 500+ minutes down at the bottom, despite buildapi knowing that they are all done. fx-team looks the same. Other trees don't have any pushes from today to show either way, but oddly, try doesn't seem to be affected: it shows pushes that should be done as done.

All trees other than try are closed.
No idea what this could possibly mean, but central/beta/release/esr45 actually did have pushes during the day today, and they are all being properly shown as finished, so it's only mozilla-inbound and fx-team which are showing hundreds of jobs that are hundreds of minutes overdue.
asked #moc to page camd/emorley for this tree closure
cam responded to the ping and fixed something (I guess he can provide more details on what happened when he is awake again :)

Trees reopened
Flags: needinfo?(cdawson)
Depends on: 1277345
(In reply to Carsten Book [:Tomcat] from comment #2)
> asked #moc to page camd/emorley for this tree closure

Ah that would explain the missed call I saw this morning (must have slept through it). For me it was 0600 so no chance of being awake (my days are slightly shifted later to help with US meetings etc). If something occurs this time of day again it may be best to start by calling the west coast people first, since for them it's 2200 (so presumably still awake) and not my sleepy-time :-)

(In reply to Carsten Book [:Tomcat] from comment #3)
> caused by https://bugzilla.mozilla.org/show_bug.cgi?id=1266229 ?

No, that change (a) is a no-op, (b) hasn't been deployed yet.

The last prod deploy (prior to this incident) was on 26th May:
https://rpm.newrelic.com/accounts/677903/applications/4180461/deployments

New Relic error page:
https://rpm.newrelic.com/accounts/677903/applications/4180461/filterable_errors?tw%5Bend%5D=1464857418&tw%5Bstart%5D=1464814218#/heatmap?top_facet=transactionUiName&barchart=barchart&_k=vfl027

As Cameron found last night, this was due to IntegrityError exceptions during job ingestion, for example:
https://rpm.newrelic.com/accounts/677903/applications/4180461/filterable_errors?tw%5Bend%5D=1464857418&tw%5Bstart%5D=1464814218#/show/3fc9ee-831c92f5-2870-11e6-b947-b82a72d22a14/stack_trace?top_facet=transactionUiName&primary_facet=error.class&barchart=barchart&_k=181g3u

This was noticed on stage yesterday and fixed by Will in bug 1277345, but that fix had not yet been deployed. The change that caused the regression had been deployed 7 days prior; presumably we'd just been lucky until now that we'd not hit this edge case.

Last night Cameron cherrypicked the bug 1277345 fix and landed it on prod.

There were no New Relic alerts for this, since the overall exception per transaction rate was still no higher than 0.8%.

Things that would have helped:
1) Fixing some of the background noise exceptions, so we can lower the overall thresholds for New Relic alerts
2) Setting a lower threshold for specific jobs (in this case builds-4hr ingestion) using New Relic's "key transaction" feature. However, we already have a key transaction for builds-4hr; it just didn't alert for some reason, even though that transaction's error rate was over the threshold?! See:
https://rpm.newrelic.com/accounts/677903/key_transactions/14761?tw%5Bend%5D=1464857705&tw%5Bstart%5D=1464814505

I'll message New Relic support to see why #2 didn't work.
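
To illustrate why the two suggestions above matter, here is a rough sketch of how an aggregate error rate can stay under an alert threshold while one transaction is failing most of the time. This is not New Relic's API or its alerting logic: the function names, the "web-request" transaction, the traffic numbers and both thresholds are made up for illustration. Only the builds-4hr name and the roughly 0.8% overall figure come from this bug.

```python
from collections import Counter


def error_rates(events):
    """Return (overall_rate, per_transaction_rates) for (name, had_error) pairs."""
    totals, errors = Counter(), Counter()
    for name, had_error in events:
        totals[name] += 1
        if had_error:
            errors[name] += 1
    per_txn = {name: errors[name] / totals[name] for name in totals}
    overall = sum(errors.values()) / sum(totals.values())
    return overall, per_txn


def alerts(events, overall_threshold, key_thresholds):
    """List which thresholds are exceeded: the overall one and any per-transaction ones."""
    overall, per_txn = error_rates(events)
    fired = []
    if overall > overall_threshold:
        fired.append("overall")
    for name, threshold in key_thresholds.items():
        if per_txn.get(name, 0.0) > threshold:
            fired.append(name)
    return fired


# Hypothetical traffic: lots of healthy web requests, a few mostly failing ingestion runs.
events = (
    [("web-request", False)] * 10000
    + [("builds-4hr", True)] * 80
    + [("builds-4hr", False)] * 20
)

overall, per_txn = error_rates(events)
print(f"overall: {overall:.2%}, builds-4hr: {per_txn['builds-4hr']:.0%}")

# Assumed thresholds: 1% overall, 5% for the key transaction.
print(alerts(events, overall_threshold=0.01, key_thresholds={"builds-4hr": 0.05}))
```

With these made-up numbers the overall rate stays below the assumed 1% threshold, so only the per-transaction check would fire. That is the behaviour we expected from the builds-4hr key transaction.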
Assignee: nobody → cdawson
Status: NEW → RESOLVED
Closed: 8 years ago
Component: Treeherder → Treeherder: Infrastructure
Priority: -- → P1
QA Contact: laura
Resolution: --- → FIXED
Ed-- Thanks for summarizing all that!  I don't have anything more to add.  :)
Flags: needinfo?(cdawson)
Depends on: 1281850