Closed Bug 978311 Opened 10 years ago Closed 10 years ago

Staging is throwing a 500 / Internal Server Error (Apache restart needed?)

Categories

(Socorro :: General, task)

task
Not set
major

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: stephend, Unassigned)

References

()

Details

(Whiteboard: [fromAutomation][config change])

Attachments

(1 file)

https://crash-stats.allizom.org/ is down, throwing a 500 / Internal Server Error, apparently at the Apache/webhead level; no clue why (does Apache need a restart? Config issue? Since we're not in New Relic, and Sentry/Errormill doesn't expose this info, it's impossible for Web QA to tell).

We saw this (through automation) at 1:26pm PST, though it might have started earlier.

http://qa-selenium.mv.mozilla.com:8080/view/Socorro/job/socorro.stage.saucelabs/887/
I see errors from the last push. I think the culprit is here.

[socorroadm.stage.private.phx1.mozilla.com] err: mv: cannot stat `socorro-new.tar.gz': No such file or directory

I tried to steamroll over it with another release, but that hit the same problem. http://socorroadm.private.phx1.mozilla.com/chief/socorro.stage/logs/do.it..1393629927
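For context, that mv failure is the shape of error you get when the packaging step never produced its tarball; a hypothetical sketch of such a sequence (the paths and commands here are illustrative, not the actual Chief deploy script):

    # hypothetical packaging step -- not the real Chief script
    tar czf socorro-new.tar.gz socorro/       # if this step errors out or is skipped...
    mv socorro-new.tar.gz /data/releases/     # ...the mv has nothing to move and reports "cannot stat"
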
The problem is the Apache config; it's missing this:

WSGIPythonHome /data/socorro/webapp-django/virtualenv/

That's not in the Socorro repo; it's in a totally separate SVN Puppet repo.
I've peered into stage and prod apache configs and it doesn't look like we have WSGIPythonHome in either. I don't think that one line missing is the cause of our current stage woes, since production is working just fine and it looks like neither file has ever had it (at least for django).

A possible lead -- Chief appears to have stopped updating. I've tried to kick off three manual builds and encountered errors:

[Sun Mar 02 00:00:14 2014] [error] [client 10.22.248.54] (70007)The timeout specified has expired: proxy: error reading response
(In reply to Chris Lonnen :lonnen from comment #3)
> I've peered into stage and prod apache configs and it doesn't look like we
> have WSGIPythonHome in either. I don't think that one line missing is the
> cause of our current stage woes, since production is working just fine and
> it looks like neither file has ever had it (at least for django).

Yes, it looks like you're right... we recommend this in our example config:
https://github.com/mozilla/socorro/blob/master/config/apache.conf-dist

Also, I can confirm that it seems to fix the problem, but I agree that it looks like it was working before without this setting. (I made the change on one of the stage boxes, confirmed it worked, and then let Puppet clobber the change.)

I would like to know exactly how it was working before and why it's not now!
1. Chief breaking was unrelated and was caused by me doing what I thought was some safe technical-debt cleanup. w-e-l-p.com
 * I'll be updating mana with more details on how Chief currently works (it's not pretty) and will hand the baton to phrawzty on how to move forward.

2. Stage breaks after build 932, 932 has been pushed and is working: https://crash-stats.allizom.org/status/

Currently Chief pushes the latest build, but you can push older builds via the CLI (details will be in mana shortly).
(In reply to Brandon Burton [:solarce] from comment #5)
> 2. Stage breaks after build 932, 932 has been pushed and is working:
> https://crash-stats.allizom.org/status/

Note that the next two builds failed for unrelated reasons... I know that we tried wiping the workspace, I am starting to suspect that this was broken earlier but the problem didn't manifest until the workspace was cleared :/

Is there any way to see a log of when the workspace was cleared? I could probably repro this locally in vagrant and bisect.
(In reply to Robert Helmer [:rhelmer] from comment #6)
> (In reply to Brandon Burton [:solarce] from comment #5)
> > 2. Stage breaks after build 932, 932 has been pushed and is working:
> > https://crash-stats.allizom.org/status/
> 
> Note that the next two builds failed for unrelated reasons... I know that we
> tried wiping the workspace, I am starting to suspect that this was broken
> earlier but the problem didn't manifest until the workspace was cleared :/
> 
> Is there any way to see a log of when the workspace was cleared? I could
> probably repro this locally in vagrant and bisect.

Oh! Or I could just diff the builds... that's a lot easier; I'll start with that.
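A minimal sketch of what that diff could look like, assuming the build artifacts are tarballs named by build number (the file names are illustrative):

    # unpack two builds side by side and compare them (hypothetical file names)
    mkdir build-932 build-933
    tar xzf socorro-932.tar.gz -C build-932
    tar xzf socorro-933.tar.gz -C build-933
    diff -ru build-932 build-933 | less
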
I've updated mana with background on the current state of Chief and how to manually push an older build to staging.

https://mana.mozilla.org/wiki/pages/viewpage.action?pageId=5734601#crash-stats.mozilla.com%28Socorro%29-StageUpdateDetails

I've also disabled the stage auto-push cron job until the path forward for Django and the Apache configuration is determined.
I temporarily applied the WSGI directive to use the virtual environment on stage (r83555) and pushed the latest working CI (943). It was a failure.

After the fact I noticed I put it in the wrong part of the file. It may yet work, if someone applies it correctly. I'm out of time today. Trying to restore to a known working state.
Pushing the old 932 build from the command line isn't working. At the tail end of the script I get:

    Starting memcached: chown: cannot access `/var/run/memcached': No such file or directory

The stage web heads have no useful info for me, in the Apache logs or otherwise. I need to step away now, but I can come back to it later tonight.
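That memcached error literally means the directory the init script tries to chown doesn't exist; a likely workaround, if it comes to that, is to recreate the runtime directory by hand (assuming the service runs as a "memcached" user -- worth checking against the init script):

    # recreate memcached's runtime directory, then retry the service (user/group "memcached" is an assumption)
    sudo mkdir -p /var/run/memcached
    sudo chown memcached:memcached /var/run/memcached
    sudo /etc/init.d/memcached start
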
(In reply to Chris Lonnen :lonnen from comment #10)
> I temporarily applied the WSGI directive to use the virtual environment on
> stage (r83555) and pushed the latest working CI (943). It was a failure.
> 
> After the fact I noticed I put it in the wrong part of the file. It may yet
> work, if someone applies it correctly. I'm out of time today. Trying to
> restore to a known working state.

I put it at the top of crash-stats.allizom.org.conf when testing it before - I believe that it needs to be outside of the Virtualenv block.
(In reply to Robert Helmer [:rhelmer] from comment #12)
> (In reply to Chris Lonnen :lonnen from comment #10)
> > I temporarily applied the WSGI directive to use the virtual environment on
> > stage (r83555) and pushed the latest working CI (943). It was a failure.
> > 
> > After the fact I noticed I put it in the wrong part of the file. It may yet
> > work, if someone applies it correctly. I'm out of time today. Trying to
> > restore to a known working state.
> 
> I put it at the top of crash-stats.allizom.org.conf when testing it before -
> I believe that it needs to be outside of the Virtualenv block.

And by "Virtualenv block", I meant "VirtualHost block" :)
(In reply to Robert Helmer [:rhelmer] from comment #13)
> (In reply to Robert Helmer [:rhelmer] from comment #12)
> > (In reply to Chris Lonnen :lonnen from comment #10)
> > > I temporarily applied the WSGI directive to use the virtual environment on
> > > stage (r83555) and pushed the latest working CI (943). It was a failure.
> > > 
> > > After the fact I noticed I put it in the wrong part of the file. It may yet
> > > work, if someone applies it correctly. I'm out of time today. Trying to
> > > restore to a known working state.
> > 
> > I put it at the top of crash-stats.allizom.org.conf when testing it before -
> > I believe that it needs to be outside of the Virtualenv block.
> 
> And by "Virtualenv block", I meant "VirtualHost block" :)

Just confirmed this; Apache refuses to start with:

WSGIPythonHome cannot occur within <VirtualHost> section

I've just committed a fix to SVN: r83561
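For reference, a minimal sketch of where the directive ends up: WSGIPythonHome is a server-scope mod_wsgi directive, so it sits above the VirtualHost rather than inside it (the VirtualHost contents below are illustrative, not the actual crash-stats.allizom.org.conf):

    # server scope -- must be outside any <VirtualHost> block
    WSGIPythonHome /data/socorro/webapp-django/virtualenv/

    <VirtualHost *:80>
        ServerName crash-stats.allizom.org
        # ... the existing Django/WSGI configuration stays inside the vhost ...
    </VirtualHost>
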
http://crash-stats.allizom.org/ appears to be back up now
Marking this so we remember when the prod push comes around.
Whiteboard: [fromAutomation] → [fromAutomation][config change]
Target Milestone: --- → 77
Pushed to prod in 77.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Thanks; staging was fixed, and the push to prod was fine too, so marking this verified.
Status: RESOLVED → VERIFIED