Closed Bug 1279556 Opened 8 years ago Closed 8 years ago

Wait times emails stopped on May 10

Categories

(Release Engineering :: General, defect, P3)

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1293659

People

(Reporter: coop, Unassigned)

Details

The last wait times email I received came on May 10. Something is wrong here.
Looking in papertrail, buildapi is returning 500:

Jun 10 06:06:04 relengwebadm.private.scl3.mozilla.com buildapi_waittimes: Error: fetching wait times from location http://buildapi.pvt.build.mozilla.org/buildapi/reports/waittimes/buildpool?maxb=480&endtime=1465542000&mpb=15&format=json : HTTP Error 500: Internal Server Error
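
For reference, the failing request can be picked apart with a short Python sketch. It only parses the URL from the log line above and does not contact buildapi. The meaning of `maxb` and `mpb` is not stated in this bug, so they are left uninterpreted. Treating `endtime` as a Unix timestamp in seconds is an assumption, and `describe_request` is a hypothetical helper name.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit

# URL copied verbatim from the papertrail log line.
FAILING_URL = (
    "http://buildapi.pvt.build.mozilla.org/buildapi/reports/waittimes/buildpool"
    "?maxb=480&endtime=1465542000&mpb=15&format=json"
)


def describe_request(url):
    """Split a wait times URL into its path, query parameters, and end time.

    Assumption: 'endtime' is a Unix timestamp in seconds.
    """
    parts = urlsplit(url)
    # parse_qs returns a list per key; each key appears once here.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    end = datetime.fromtimestamp(int(params["endtime"]), tz=timezone.utc)
    return parts.path, params, end


path, params, end = describe_request(FAILING_URL)
print(path)
print(params)
print(end.isoformat())
```

Under that assumption, `endtime` decodes to 2016-06-10 07:00 UTC, roughly an hour after the cron job logged the error (if the log timestamp is in Pacific time, which is also an assumption).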

No code has changed in buildapi since March 11. I tried re-deploying the existing buildapi code, and saw the following errors:

[2016-06-10 09:36:43] Running push_www
[2016-06-10 09:36:43] [web1.releng.webapp.scl3.mozilla.com] running: /data/bin/update-www.sh buildapi
[2016-06-10 09:36:43] [web2.releng.webapp.scl3.mozilla.com] running: /data/bin/update-www.sh buildapi
[2016-06-10 09:36:43] [celery1.srv.releng.scl3.mozilla.com] running: /data/bin/update-www.sh buildapi
[2016-06-10 09:36:43] [celery1.srv.releng.scl3.mozilla.com] failed: /data/bin/update-www.sh buildapi (0.086s)
[celery1.srv.releng.scl3.mozilla.com] err: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
[2016-06-10 09:36:43] [web2.releng.webapp.scl3.mozilla.com] failed: /data/bin/update-www.sh buildapi (0.091s)
[web2.releng.webapp.scl3.mozilla.com] err: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
[2016-06-10 09:36:43] [web1.releng.webapp.scl3.mozilla.com] failed: /data/bin/update-www.sh buildapi (0.103s)
[web1.releng.webapp.scl3.mozilla.com] err: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
[2016-06-10 09:36:43] Finished push_www (0.104s)
[2016-06-10 09:36:43] Starting new HTTPS connection (1): changelog.allizom.org
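
The deploy output above can be summarized per host with a small Python sketch. The log text is copied from this comment; the regular expressions and the `failed_hosts` helper are illustrative assumptions about the log format, not part of the deploy tooling.

```python
import re

# Relevant lines copied from the push_www output in this comment.
DEPLOY_LOG = """\
[2016-06-10 09:36:43] Running push_www
[2016-06-10 09:36:43] [celery1.srv.releng.scl3.mozilla.com] failed: /data/bin/update-www.sh buildapi (0.086s)
[celery1.srv.releng.scl3.mozilla.com] err: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
[2016-06-10 09:36:43] [web2.releng.webapp.scl3.mozilla.com] failed: /data/bin/update-www.sh buildapi (0.091s)
[web2.releng.webapp.scl3.mozilla.com] err: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
[2016-06-10 09:36:43] [web1.releng.webapp.scl3.mozilla.com] failed: /data/bin/update-www.sh buildapi (0.103s)
[web1.releng.webapp.scl3.mozilla.com] err: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
[2016-06-10 09:36:43] Finished push_www (0.104s)
"""

# "[timestamp] [host] failed: command (seconds)"
FAILED_RE = re.compile(r"^\[[^\]]+\] \[(?P<host>[^\]]+)\] failed: ")
# "[host] err: message"
ERR_RE = re.compile(r"^\[(?P<host>[^\]]+)\] err: (?P<msg>.+)$")


def failed_hosts(log_text):
    """Map each host that reported 'failed' to its last 'err' message, if any."""
    failures = {}
    for line in log_text.splitlines():
        failed = FAILED_RE.match(line)
        if failed:
            failures.setdefault(failed.group("host"), None)
            continue
        err = ERR_RE.match(line)
        if err and err.group("host") in failures:
            failures[err.group("host")] = err.group("msg")
    return failures


result = failed_hosts(DEPLOY_LOG)
for host in sorted(result):
    print(host, "->", result[host])
```

Run against this log, the sketch reports the same publickey rejection for all three targets (web1, web2, and celery1), not just the celery host.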

So we can't talk to celery from the webheads. This is likely related to the recent key changes, so NI-ing Callek for potential insight.
Flags: needinfo?(bugspam.Callek)
Priority: -- → P3
(In reply to Chris Cooper [:coop] from comment #1)
> Looking in papertrail, buildapi is returning 500:
> 
> Jun 10 06:06:04 relengwebadm.private.scl3.mozilla.com buildapi_waittimes:
> Error: fetching wait times from location
> http://buildapi.pvt.build.mozilla.org/buildapi/reports/waittimes/
> buildpool?maxb=480&endtime=1465542000&mpb=15&format=json : HTTP Error 500:
> Internal Server Error

I'd be interested in what the underlying buildapi error was/is.

> 
> No code has changed in buildapi since March 11. I tried re-deploying the
> existing buildapi code, and saw the following errors:
> 
> [celery1.srv.releng.scl3.mozilla.com] err: Permission denied
> (publickey,gssapi-keyex,gssapi-with-mic).
> [web2.releng.webapp.scl3.mozilla.com] err: Permission denied
> (publickey,gssapi-keyex,gssapi-with-mic).
> [web1.releng.webapp.scl3.mozilla.com] err: Permission denied
> (publickey,gssapi-keyex,gssapi-with-mic).
> 
> So we can't talk to celery from the webheads. This is likely related to the
> recent key changes, so NI-ing Callek for potential insight.

These wouldn't be from my key changes; they would be from the recent webops changes made during the rebuilds. They are also the same issues that have lately required us to file bugs about other relengweb pushes (like trychooser).
Flags: needinfo?(bugspam.Callek)
[root@web1.releng.webapp.scl3 ~]# cat /var/log/httpd/buildapi.pvt.build.mozilla.org/error_log
[Fri Jun 10 10:05:01 2016] [error] [client 10.22.81.211] Timeout when reading response headers from daemon process 'buildapi': /data/www/buildapi/buildapi.wsgi



/var/log/buildapi/buildapi.log itself looks fine.

I wonder if this is some sort of fallout from the py2.7 upgrade that broke relengweb during the rebuild.
Flags: needinfo?(hwine)
I'm not aware of the py27 issues mentioned in comment 3.

The trychooser push issue should be fixed as of this morning -- I can push, but I have extra rights atm.
Flags: needinfo?(hwine)
FWIW, the entire relengweb cluster went down around May 10 during the rebuilds and a mod_wsgi/Python update.

http://logs.glob.uno/?c=mozilla%23releng&s=10+May+2016&e=10+May+2016

See ~15:38 and on.
Okay, after some back and forth on IRC to jog my memory, here is a little more detail.

Some of our releng webcluster apps used to rely on data flows authorized with ssh keys. Unfortunately, some of those keys were not documented and/or were "loose" (not in puppet or ldap). The recent cleanups happened in two steps: the first was around May 10, iirc, and the second happened later in May.

As of this morning, the keys have been recovered and properly inserted into puppet for trychooser. (See bug 1278585 comment 5 and on.)

If the wait times report used the same flows, then it should have started working. It appears not to have, as
  http://buildapi.pvt.build.mozilla.org/buildapi/reports/waittimes
still returns a 500.

Next steps would be to identify what credentials are being used in this flow, and to work with webops to get that flow re-enabled and/or switch to the keys that do work.

ni: :coop to report back next week on whether the emails are still a no-show. And whether this is worth pursuing.
Flags: needinfo?(coop)
(In reply to Hal Wine [:hwine] (use NI) from comment #6)
> ni: :coop to report back next week on whether the emails are still a
> no-show. And whether this is worth pursuing.

Emails are still broken.

I think this is still worth pursuing, because even if these particular reports aren't useful, we can't use the data source to build an alternative until this is fixed.

The long-term plan should probably involve moving to Heroku.
Flags: needinfo?(coop)
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → DUPLICATE
Component: Tools → General