Closed Bug 1384934 Opened 7 years ago Closed 7 years ago

Recent jobs not appearing or updating in Treeherder

Categories

(Tree Management :: Treeherder, defect)

Type: defect
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: garndt, Assigned: camd)

References

Details

Attachments

(2 files)

On multiple repos, such as autoland and inbound, jobs are either not showing up or are not updating to the correct status.

Not sure where the problem currently is, but I'm filing it here and we can move it once the investigation progresses.

For some jobs that I checked, I can see that we published a Pulse message for them, but Treeherder is not showing the jobs as updated. Other apps listening for those same messages show the status updated correctly.

On a side note, I received an email from Heroku about a different issue with mozilla-taskcluster (one that would cause decision tasks not to fire). While looking into it, I was also getting 503 errors from Heroku for some of our Postgres DBs.

I'm not sure if there is a larger issue going on here.
m-c/autoland/inbound closed
Severity: normal → blocker
Received a message from Heroku about one of the Redis DBs for a different app coming back online (total unreachable time was 16 minutes). Hopefully other things fall into place soon.
Looks like some job result data is starting to filter in, but the missing pushes are still MIA.
Heroku says that the incident is over now and all systems are normal. Still missing pushes on our end, though.
I believe the pushes are now showing on all the repos.  I restarted all the Dynos on Heroku and that seems to have un-stuck whatever was stuck.  I still don't see a smoking gun, but will continue to look.
Cam restarted a few things and everything looks caught up now. Trees reopened!
Assignee: nobody → cdawson
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Catching up on my email after the fact, I saw a message from Heroku at 6:52 PT:

Our monitoring app has been unable to reach your database <name redacted> (REDIS on mozilla-taskcluster) since 2017-07-27 13:51:08 UTC. This is likely due to an underlying hardware or network failure and not something caused by your application.

We're attempting to bring it back online automatically. If we can't, we'll page an engineer to help.

Then there was a follow-up email at 7:08 PT saying things were back up. I think what happened was that we lost the connection to the DB, and even after the DB came back up, we couldn't reconnect until the dynos were restarted.
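As an illustration of that reconnect pattern only (this is not the app's actual code, and the connection URL and values below are placeholders), a Redis client configured with socket timeouts and health checks will normally pick up a restored database on its own, without a dyno restart. A minimal redis-py sketch:

# Sketch only; the connection URL and settings are placeholders, not real config.
import redis

client = redis.StrictRedis.from_url(
    "redis://:secret@redis.example.com:6379/0",  # placeholder connection URL
    socket_connect_timeout=5,   # don't hang indefinitely while the DB is unreachable
    socket_timeout=5,           # fail fast on a dead socket instead of blocking
    retry_on_timeout=True,      # retry a command once if the socket times out
    health_check_interval=30,   # ping idle connections before reusing them
)

client.ping()  # raises ConnectionError while the DB is down; succeeds once it's back

With settings like these, a dropped connection surfaces as a quick error and the pool opens a fresh connection on the next call, rather than holding dead sockets until the process is restarted.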
Happened to see this since the alerts came through on Gmail's important filter. Thank you for sorting this out, Cameron!

It looks like AWS had a brief outage that caused Heroku issues across several of their services.

What would be good to answer before the 3-day Papertrail archive period kicks in is:
* Was the manual dyno restart required, or were things catching up anyway?
* If it was required, what got stuck, and what could be fixed to make this self-recover in the future?

To answer that, I'd look at:
* The Heroku "Metrics" tab (it shows when dyno restarts occur)
* New Relic (both the APM transactions section and the plugins page for CloudAMQP queues)
* Papertrail logs
* Whether pulse.mozilla.org was affected at all (since it also uses CloudAMQP, which is on AWS), i.e. the connection might have been broken on that side rather than the Treeherder side (or both!)

Typical that this happens on my first day off, haha.
It appears that there was a short backlog on Pulse, but not for long; it quickly caught up. The large backlog was in our ``store_pulse_jobs`` queue.

We got one of these tracebacks for each of the ``store_pulse_jobs`` worker dynos:

Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.3: [2017-07-27 14:04:20,492: ERROR/MainProcess] consumer: Cannot connect to amqp://galwgrks:**@beige-hedgehog.rmq.cloudamqp.com:5671/galwgrks: timed out. 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1: [2017-07-27 14:04:20,528: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection... 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1: Traceback (most recent call last): 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/celery/worker/consumer.py", line 280, in start 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     blueprint.start(self) 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/celery/bootsteps.py", line 123, in start 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     step.start(parent) 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/celery/worker/consumer.py", line 884, in start 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     c.loop(*c.loop_args()) 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/celery/worker/loops.py", line 73, in asynloop 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     update_qos() 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/kombu/common.py", line 407, in update 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     return self.set(self.value) 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/kombu/common.py", line 400, in set 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     self.callback(prefetch_count=new_value) 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/celery/worker/consumer.py", line 661, in set_prefetch_count 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     apply_global=qos_global, 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/kombu/messaging.py", line 533, in qos 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     apply_global) 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/amqp/channel.py", line 2205, in basic_qos 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     (60, 11),  # Channel.basic_qos_ok 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/amqp/abstract_channel.py", line 67, in wait 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     self.channel_id, allowed_methods, timeout) 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/amqp/connection.py", line 241, in _wait_method 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     channel, method_sig, args, content = read_timeout(timeout) 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/amqp/connection.py", line 330, in read_timeout 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     return self.method_reader.read_method() 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.3: Trying again in 4.00 seconds... 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/amqp/method_framing.py", line 189, in read_method 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     raise m 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1: error: [Errno 110] Connection timed out 

Jul 27 07:04:50 treeherder-prod app/worker_store_pulse_jobs.1: [2017-07-27 14:04:50,624: ERROR/MainProcess] consumer: Cannot connect to amqp://galwgrks:**@beige-hedgehog.rmq.cloudamqp.com:5671/galwgrks: timed out. 
Jul 27 07:04:50 treeherder-prod app/worker_store_pulse_jobs.1: Trying again in 2.00 seconds...
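On the self-recovery question: the workers were retrying the broker connection (the "Trying again in 4.00 seconds" lines above), but by default Celery gives up after a limited number of attempts. If that turns out to be what got stuck, the relevant knobs look roughly like the sketch below (illustrative app and broker URL, not our actual settings module; setting names as in the Celery docs):

# Sketch only -- placeholder app and broker URL, not Treeherder's settings.
from celery import Celery

app = Celery("example", broker="amqp://user:password@broker.example.com:5671//")

app.conf.update(
    broker_connection_retry=True,        # keep retrying when the broker connection drops
    broker_connection_max_retries=None,  # None/0 means retry forever (the default caps at 100 attempts)
    broker_connection_timeout=10,        # seconds to wait for each individual connect attempt
)

With the default cap, the worker eventually stops retrying during a long enough outage; retrying forever means it reconnects on its own as soon as CloudAMQP is reachable again.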
To clarify on Pulse: there was a message that the queue for Pulse jobs had grown to 4000 at 6:53 PT, and it was down to zero again by 7:08 PT. So there was some kind of hiccup, but it resolved quickly.
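For reference, one way to watch a queue's depth directly is the RabbitMQ management HTTP API that CloudAMQP exposes; the host, vhost, credentials, and queue name below are placeholders:

# Sketch only; every identifier below is a placeholder, not a real endpoint.
import requests

url = "https://broker.example.cloudamqp.com/api/queues/myvhost/store_pulse_jobs"
resp = requests.get(url, auth=("user", "password"), timeout=10)
resp.raise_for_status()
queue = resp.json()
print("ready:", queue["messages_ready"], "unacked:", queue["messages_unacknowledged"])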
On Heroku's Metrics tab, here are the events:

7:00-7:10: Warning "Networking issues: Investigating"
Those continue until...
8:20-8:30: The last of those warnings, at the same time as the dyno restart
9:00-9:10: "Networking issues: Resolved"

So the restart at least coincides exactly with the final warning and then the recovery. That isn't necessarily causation, but it seems likely. In the attached screenshot, the blue marker is the restart and the orange ones are the warnings about network issues.
Sorry for all the bug-spam. Looking at New Relic for the 3 hours ending 7/27 8:50 PDT:
https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?type=other&show_browser=false

It appears store_pulse_jobs and log-parser flat-lined around 6:47 PDT, but then started to recover and process again at around 7:38 PDT. That's well before I restarted the dynos at 8:20, though processing picked up a lot after the restart. So it's entirely possible this would have self-recovered without the restart, just a bit more slowly.

I suppose I could have been a bit less trigger-happy with the restart. If we hit this again, I'll take a closer look before restarting. It would have been nice to know whether things would have self-healed.