Closed Bug 1384934 Opened 7 years ago Closed 7 years ago

Recent jobs not appearing or updating in Treeherder

Categories

(Tree Management :: Treeherder, defect)

Type: defect
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: garndt, Assigned: camd)

References

Details

Attachments

(2 files)

On multiple repos, such as autoland and inbound, jobs are either not showing up or are not updating to the correct status.

Not sure where the problem currently is, but I'm filing it here and we can move it once the investigation progresses.

For some jobs that I checked, I can see that we published a Pulse message for them, but Treeherder is not showing the jobs as updated. Other apps listening for those same messages show the status updated correctly.

On a side note, I received an email from Heroku about a different issue with mozilla-taskcluster (one that would cause decision tasks not to fire). While looking into it, I was also getting 503 errors from Heroku for some of our Postgres DBs.

I'm not sure if there is a larger issue going on here.
m-c/autoland/inbound closed
Severity: normal → blocker
Received a message from Heroku about one of the Redis DBs for a different app coming back online (total unreachable time was 16 minutes). Hopefully other things fall into place soon.
Looks like some job result data is starting to filter in, but the missing pushes are still MIA.
Heroku says that the incident is over now and all systems are normal. Still missing pushes on our end, though.
I believe the pushes are now showing on all the repos.  I restarted all the Dynos on Heroku and that seems to have un-stuck whatever was stuck.  I still don't see a smoking gun, but will continue to look.
Cam restarted a few things and everything looks caught up now. Trees reopened!
Assignee: nobody → cdawson
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Catching up on my email after the fact, I saw a message from Heroku at 6:52 PT:

Our monitoring app has been unable to reach your database <name redacted> (REDIS on mozilla-taskcluster) since 2017-07-27 13:51:08 UTC. This is likely due to an underlying hardware or network failure and not something caused by your application.

We're attempting to bring it back online automatically. If we can't, we'll page an engineer to help.

Then there was a follow-up email at 7:08 PT saying things were back up. I think what happened was that we lost the connection to the DB, and even after the DB came back up, we couldn't reconnect until the dynos were restarted.
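As an illustration of that reconnect pattern only (this is not the app's actual code, and the connection URL and values below are placeholders), a Redis client configured with socket timeouts and health checks will normally pick up a restored database on its own, without a dyno restart. A minimal redis-py sketch:

# Sketch only; the connection URL and settings are placeholders, not real config.
import redis

client = redis.StrictRedis.from_url(
    "redis://:secret@redis.example.com:6379/0",  # placeholder connection URL
    socket_connect_timeout=5,   # don't hang indefinitely while the DB is unreachable
    socket_timeout=5,           # fail fast on a dead socket instead of blocking
    retry_on_timeout=True,      # retry a command once if the socket times out
    health_check_interval=30,   # ping idle connections before reusing them
)

client.ping()  # raises ConnectionError while the DB is down; succeeds once it's back

With settings like these, a dropped connection surfaces as a quick error and the pool opens a fresh connection on the next call, rather than holding dead sockets until the process is restarted.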
Happened to see this since the alerts came through on Gmail's important filter. Thank you for sorting this out, Cameron!

It looks like AWS had a brief outage that caused Heroku issues across several of their services.

What would be good to answer before the 3-day Papertrail archive period kicks in is:
* Was the manual dyno restart required, or were things catching up anyway?
* If it was required, what got stuck, and what could be fixed to make this self-recover in the future?

To answer that, I'd look at:
* The Heroku "Metrics" tab (it shows when dyno restarts occur)
* New Relic (both the APM transactions section and the plugins page for CloudAMQP queues)
* Papertrail logs
* Whether pulse.mozilla.org was affected at all (since it also uses CloudAMQP, which is on AWS), i.e. the connection might have been broken on that side rather than the Treeherder side (or both!)

Typical that this happens on my first day off, haha.
It appears that there was a short backlog on Pulse, but not for long; it quickly caught up. The large backlog was in our ``store_pulse_jobs`` queue.

We got one of these tracebacks for each of the ``store_pulse_jobs`` worker dynos:

Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.3: [2017-07-27 14:04:20,492: ERROR/MainProcess] consumer: Cannot connect to amqp://galwgrks:**@beige-hedgehog.rmq.cloudamqp.com:5671/galwgrks: timed out. 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1: [2017-07-27 14:04:20,528: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection... 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1: Traceback (most recent call last): 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/celery/worker/consumer.py", line 280, in start 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     blueprint.start(self) 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/celery/bootsteps.py", line 123, in start 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     step.start(parent) 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/celery/worker/consumer.py", line 884, in start 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     c.loop(*c.loop_args()) 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/celery/worker/loops.py", line 73, in asynloop 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     update_qos() 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/kombu/common.py", line 407, in update 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     return self.set(self.value) 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/kombu/common.py", line 400, in set 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     self.callback(prefetch_count=new_value) 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/celery/worker/consumer.py", line 661, in set_prefetch_count 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     apply_global=qos_global, 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/kombu/messaging.py", line 533, in qos 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     apply_global) 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/amqp/channel.py", line 2205, in basic_qos 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     (60, 11),  # Channel.basic_qos_ok 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/amqp/abstract_channel.py", line 67, in wait 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     self.channel_id, allowed_methods, timeout) 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/amqp/connection.py", line 241, in _wait_method 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     channel, method_sig, args, content = read_timeout(timeout) 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/amqp/connection.py", line 330, in read_timeout 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     return self.method_reader.read_method() 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.3: Trying again in 4.00 seconds... 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:   File "/app/.heroku/python/lib/python2.7/site-packages/amqp/method_framing.py", line 189, in read_method 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1:     raise m 
Jul 27 07:04:20 treeherder-prod app/worker_store_pulse_jobs.1: error: [Errno 110] Connection timed out 

Jul 27 07:04:50 treeherder-prod app/worker_store_pulse_jobs.1: [2017-07-27 14:04:50,624: ERROR/MainProcess] consumer: Cannot connect to amqp://galwgrks:**@beige-hedgehog.rmq.cloudamqp.com:5671/galwgrks: timed out. 
Jul 27 07:04:50 treeherder-prod app/worker_store_pulse_jobs.1: Trying again in 2.00 seconds...
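On the self-recovery question: the workers were retrying the broker connection (the "Trying again in 4.00 seconds" lines above), but by default Celery gives up after a limited number of attempts. If that turns out to be what got stuck, the relevant knobs look roughly like the sketch below (illustrative app and broker URL, not our actual settings module; setting names as in the Celery docs):

# Sketch only -- placeholder app and broker URL, not Treeherder's settings.
from celery import Celery

app = Celery("example", broker="amqp://user:password@broker.example.com:5671//")

app.conf.update(
    broker_connection_retry=True,        # keep retrying when the broker connection drops
    broker_connection_max_retries=None,  # None/0 means retry forever (the default caps at 100 attempts)
    broker_connection_timeout=10,        # seconds to wait for each individual connect attempt
)

With the default cap, the worker eventually stops retrying during a long enough outage; retrying forever means it reconnects on its own as soon as CloudAMQP is reachable again.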
To clarify on Pulse: there was a message that the queue for Pulse jobs had grown to 4000 at 6:53 PT, and it was down to zero again by 7:08 PT. So there was some kind of hiccup, but it resolved quickly.
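For reference, one way to watch a queue's depth directly is the RabbitMQ management HTTP API that CloudAMQP exposes; the host, vhost, credentials, and queue name below are placeholders:

# Sketch only; every identifier below is a placeholder, not a real endpoint.
import requests

url = "https://broker.example.cloudamqp.com/api/queues/myvhost/store_pulse_jobs"
resp = requests.get(url, auth=("user", "password"), timeout=10)
resp.raise_for_status()
queue = resp.json()
print("ready:", queue["messages_ready"], "unacked:", queue["messages_unacknowledged"])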
On Heroku's Metrics tab, here are the events:

7:00-7:10: Warning "Networking issues: Investigating"
Those continue until...
8:20-8:30: The last of those warnings, at the same time as the dyno restart
9:00-9:10: "Networking issues: Resolved"

So the restart at least coincides exactly with the final warning and then the recovery. That isn't necessarily causation, but it seems likely. In the attached screenshot, the blue marker is the restart and the orange ones are the warnings about network issues.
Sorry for all the bug-spam. Looking at New Relic for the 3 hours ending 7/27 8:50 PDT:
https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?type=other&show_browser=false

It appears store_pulse_jobs and log-parser flat-lined around 6:47 PDT, but then started to recover and process again at around 7:38 PDT. That's well before I restarted the dynos at 8:20, though processing picked up a lot after the restart. So it's entirely possible this would have self-recovered without the restart, just a bit more slowly.

I suppose I could have been a bit less trigger-happy with the restart. If we hit this again, I'll take a closer look before restarting. It would have been nice to know whether things would have self-healed.