Closed
Bug 1384934
Opened 7 years ago
Closed 7 years ago
Recent jobs not appearing or updating in Treeherder
Categories
(Tree Management :: Treeherder, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: garndt, Assigned: camd)
References
Details
Attachments
(2 files)
On multiple repos, such as autoland and inbound, jobs are either not showing up or are not updating to the correct status. I'm not sure where the problem currently is, but I'm filing here and we can move the bug as the investigation progresses.

For some jobs that I checked, I see that we published a pulse message for them, but Treeherder is not showing the jobs as updated. Other apps listening for those same messages show the status updated correctly.

On a side note, I received an email from Heroku about a different issue affecting mozilla-taskcluster (which would cause decision tasks to not fire). While looking into it, I was also getting 503 errors from Heroku for some of our Postgres DBs. I'm not sure if there is a larger issue going on here.
Reporter
Comment 2•7 years ago
Received a message from Heroku about one of the Redis DBs for a different app coming back online (total unreachable time was 16 minutes). Hopefully other things fall into place soon.
Reporter
Comment 3•7 years ago
https://status.heroku.com/incidents/1230
Comment 4•7 years ago
Looks like some job result data is starting to filter in, but the missing pushes are still MIA.
Comment 5•7 years ago
Heroku says that the incident is over now and all systems are normal. Still missing pushes on our end, though.
Assignee
Comment 7•7 years ago
I believe the pushes are now showing on all the repos. I restarted all the Dynos on Heroku and that seems to have un-stuck whatever was stuck. I still don't see a smoking gun, but will continue to look.
Comment 8•7 years ago
Cam restarted a few things and everything looks caught up now. Trees reopened!
Assignee: nobody → cdawson
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Assignee
Comment 9•7 years ago
Catching up on my email after the fact, I saw a message from Heroku at 6:52 PT:

> Our monitoring app has been unable to reach your database <name redacted> (REDIS on mozilla-taskcluster) since 2017-07-27 13:51:08 UTC. This is likely due to an underlying hardware or network failure and not something caused by your application. We're attempting to bring it back online automatically. If we can't, we'll page an engineer to help.

A follow-up email at 7:08 PT said things were back up. I think what happened was that we lost the connection to the DB, and even after the DB came back up, we couldn't reconnect until the dynos were restarted.
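(Editor's note: to illustrate the "couldn't reconnect until the dynos were restarted" failure mode, here is a minimal, hypothetical sketch of the kind of retry-with-backoff wrapper that lets a client reconnect on its own once the database comes back; the function and parameter names are illustrative, not Treeherder's actual code.)

```python
import time

def connect_with_backoff(connect, max_retries=8, base_delay=1.0, max_delay=60.0):
    """Call `connect` until it succeeds, sleeping with exponential backoff.

    `connect` is any zero-argument callable that raises on failure,
    e.g. a lambda that opens a Redis or AMQP connection and pings it.
    """
    for attempt in range(max_retries):
        try:
            return connect()
        except OSError as exc:  # socket timeouts and connection errors
            delay = min(base_delay * (2 ** attempt), max_delay)
            print("connect failed (%s); retrying in %.1fs" % (exc, delay))
            time.sleep(delay)
    raise RuntimeError("could not reconnect after %d attempts" % max_retries)
```

A client built this way keeps retrying after the outage ends instead of sitting on a dead connection until someone restarts the process.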
Comment 10•7 years ago
Happened to see this since the alerts came through on the Gmail important filter. Thank you for sorting this, Cameron! Looks like AWS had a brief outage that caused Heroku issues across several of their services.

What would be good to answer before the 3-day Papertrail archive period kicks in:
* Was the manual dyno restart required, or were things catching up anyway?
* If it was required, what got stuck, and what could be fixed to make this self-recover in the future?

For answering that, I'd look at:
* The Heroku "Metrics" tab (shows when dyno restarts occur)
* New Relic (both the APM transactions section and the plugins page for CloudAMQP queues)
* Papertrail logs
* Whether pulse.mozilla.org was affected at all (since it uses CloudAMQP too, so is on AWS). That is, the connection might have been broken on that side rather than the Treeherder side (or both!)

Typical that this happens on my first day off, haha.
Assignee
Comment 11•7 years ago
It appears that there was a short backlog on Pulse, but not for long; it quickly caught up. The large backlog was in our ``store_pulse_jobs`` queue. We got one of these tracebacks for each of the ``store_pulse_jobs`` worker dynos:

```
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.3: [2017-07-27 14:04:20,492: ERROR/MainProcess] consumer: Cannot connect to amqp://galwgrks:**@beige-hedgehog.rmq.cloudamqp.com:5671/galwgrks: timed out.
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1: [2017-07-27 14:04:20,528: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1: Traceback (most recent call last):
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/celery/worker/consumer.py", line 280, in start
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     blueprint.start(self)
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/celery/bootsteps.py", line 123, in start
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     step.start(parent)
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/celery/worker/consumer.py", line 884, in start
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     c.loop(*c.loop_args())
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/celery/worker/loops.py", line 73, in asynloop
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     update_qos()
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/kombu/common.py", line 407, in update
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     return self.set(self.value)
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/kombu/common.py", line 400, in set
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     self.callback(prefetch_count=new_value)
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/celery/worker/consumer.py", line 661, in set_prefetch_count
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     apply_global=qos_global,
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/kombu/messaging.py", line 533, in qos
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     apply_global)
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/amqp/channel.py", line 2205, in basic_qos
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     (60, 11), # Channel.basic_qos_ok
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/amqp/abstract_channel.py", line 67, in wait
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     self.channel_id, allowed_methods, timeout)
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/amqp/connection.py", line 241, in _wait_method
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     channel, method_sig, args, content = read_timeout(timeout)
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/amqp/connection.py", line 330, in read_timeout
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     return self.method_reader.read_method()
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.3: Trying again in 4.00 seconds...
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/amqp/method_framing.py", line 189, in read_method
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     raise m
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1: error: [Errno 110] Connection timed out
Jul 27 07:04:50 treeherder-prod app/worker_store_pulse_jobs.1: [2017-07-27 14:04:50,624: ERROR/MainProcess] consumer: Cannot connect to amqp://galwgrks:**@beige-hedgehog.rmq.cloudamqp.com:5671/galwgrks: timed out.
Jul 27 07:04:50 treeherder-prod app/worker_store_pulse_jobs.1: Trying again in 2.00 seconds...
```
Assignee
Comment 12•7 years ago
To clarify on Pulse: there was a message that the queue for pulse jobs had grown to 4000 at 6:53 PT, and it was down to zero again by 7:08 PT. So there was some kind of hiccup, but it was quickly resolved.
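(Editor's note: queue depths like the 4000-message backlog mentioned above are exposed by the RabbitMQ management API that CloudAMQP provides, as a JSON list with a `messages` count per queue. A minimal sketch of flagging backlogged queues from an already-fetched `/api/queues` response; the threshold and sample data are made up for illustration.)

```python
def backlogged_queues(queues, threshold=1000):
    """Return (name, depth) pairs for queues whose backlog exceeds `threshold`.

    `queues` is the parsed JSON list from the RabbitMQ management API's
    /api/queues endpoint; each entry carries a `messages` count.
    """
    return [(q["name"], q["messages"])
            for q in queues
            if q.get("messages", 0) > threshold]

# Made-up response shaped like the management API output:
sample = [
    {"name": "store_pulse_jobs", "messages": 4000},
    {"name": "log_parser", "messages": 12},
]
```

Running something like this on a schedule and alerting on the result would surface a stuck consumer well before trees have to close.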
Assignee
Comment 13•7 years ago
On Heroku, in Metrics, here are the events:
* 7:00-7:10: Warning "Networking issues: Investigating". Those continue until...
* 8:20-8:30: The last of those warnings, at the same time as the dyno restart
* 9:00-9:10: "Networking issues: Resolved"

So the restart at least perfectly coincides with the final warning and then the recovery. That isn't necessarily causation, but it seems likely. In the screenshot attached, the blue is the restart and the orange are the warnings about network issues.
Assignee
Comment 14•7 years ago
Assignee
Comment 15•7 years ago
Sorry for all the bug-spam. Looking at New Relic for the 3 hours ending 7/27 8:50 PDT: https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?type=other&show_browser=false

It appears store_pulse_jobs and the log parser flat-lined around 6:47 PDT, but then started to recover and process again at around 7:38 PDT. That's well before I restarted the dynos at 8:20, though processing picked up a lot after the restart. So it's entirely possible this would have self-recovered without the restart, though a bit more slowly. I suppose I could have been a bit less trigger-happy with the restart; I'll take a closer look before doing that if we hit this again. It would have been nice to know whether things would have self-healed.
Assignee
Comment 16•7 years ago