Closed Bug 978311 Opened 10 years ago Closed 10 years ago

Staging is throwing a 500 / Internal Server Error (Apache restart needed?)

Categories

(Socorro :: General, task)

task
Not set
major

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: stephend, Unassigned)

References

()

Details

(Whiteboard: [fromAutomation][config change])

Attachments

(1 file)

https://crash-stats.allizom.org/ is down, throwing a 500 / Internal Server Error, apparently at the Apache/webhead level; no clue why (does Apache need a restart? Config issue? Since we're not in New Relic, and Sentry/Errormill doesn't expose this info, it's impossible for Web QA to tell).

We saw this (through automation) at 1:26pm PST, though it might have started earlier.

http://qa-selenium.mv.mozilla.com:8080/view/Socorro/job/socorro.stage.saucelabs/887/
I see errors from the last push. I think the culprit is here.

[socorroadm.stage.private.phx1.mozilla.com] err: mv: cannot stat `socorro-new.tar.gz': No such file or directory

I tried to steamroll over it with another release, but that hit the same problem. http://socorroadm.private.phx1.mozilla.com/chief/socorro.stage/logs/do.it..1393629927
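For context, that mv failure is the shape of error you get when the packaging step never produced its tarball; a hypothetical sketch of such a sequence (the paths and commands here are illustrative, not the actual Chief deploy script):

    # hypothetical packaging step -- not the real Chief script
    tar czf socorro-new.tar.gz socorro/       # if this step errors out or is skipped...
    mv socorro-new.tar.gz /data/releases/     # ...the mv has nothing to move and reports "cannot stat"
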
The problem is the Apache config; it's missing this:

WSGIPythonHome /data/socorro/webapp-django/virtualenv/

That's not in the Socorro repo; it's in a totally separate SVN Puppet repo.
I've peered into stage and prod apache configs and it doesn't look like we have WSGIPythonHome in either. I don't think that one line missing is the cause of our current stage woes, since production is working just fine and it looks like neither file has ever had it (at least for django).

A possible lead -- Chief appears to have stopped updating. I've tried to kick off three manual builds and encountered errors:

[Sun Mar 02 00:00:14 2014] [error] [client 10.22.248.54] (70007)The timeout specified has expired: proxy: error reading response
(In reply to Chris Lonnen :lonnen from comment #3)
> I've peered into stage and prod apache configs and it doesn't look like we
> have WSGIPythonHome in either. I don't think that one line missing is the
> cause of our current stage woes, since production is working just fine and
> it looks like neither file has ever had it (at least for django).

Yes, it looks like you're right... we recommend this in our example config:
https://github.com/mozilla/socorro/blob/master/config/apache.conf-dist

Also, I can confirm that it seems to fix the problem, but I agree that it looks like it was working before without this setting. (I made the change on one of the stage boxes, confirmed it worked, and then let Puppet clobber the change.)

I would like to know exactly how it was working before and why it's not now!
1. Chief breaking was unrelated and was caused by me doing what I thought was some safe technical-debt cleanup. w-e-l-p.com
 * I'll be updating mana with more details on how Chief currently works (it's not pretty) and will hand the baton to phrawzty on how to move forward.

2. Stage breaks after build 932, 932 has been pushed and is working: https://crash-stats.allizom.org/status/

Currently Chief pushes the latest build, but you can push older builds via the CLI (details will be in mana shortly).
(In reply to Brandon Burton [:solarce] from comment #5)
> 2. Stage breaks after build 932, 932 has been pushed and is working:
> https://crash-stats.allizom.org/status/

Note that the next two builds failed for unrelated reasons... I know that we tried wiping the workspace, I am starting to suspect that this was broken earlier but the problem didn't manifest until the workspace was cleared :/

Is there any way to see a log of when the workspace was cleared? I could probably repro this locally in vagrant and bisect.
(In reply to Robert Helmer [:rhelmer] from comment #6)
> (In reply to Brandon Burton [:solarce] from comment #5)
> > 2. Stage breaks after build 932, 932 has been pushed and is working:
> > https://crash-stats.allizom.org/status/
> 
> Note that the next two builds failed for unrelated reasons... I know that we
> tried wiping the workspace, I am starting to suspect that this was broken
> earlier but the problem didn't manifest until the workspace was cleared :/
> 
> Is there any way to see a log of when the workspace was cleared? I could
> probably repro this locally in vagrant and bisect.

Oh! Or I could just diff the builds... that's a lot easier; I'll start with that.
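A minimal sketch of what that diff could look like, assuming the build artifacts are tarballs named by build number (the file names are illustrative):

    # unpack two builds side by side and compare them (hypothetical file names)
    mkdir build-932 build-933
    tar xzf socorro-932.tar.gz -C build-932
    tar xzf socorro-933.tar.gz -C build-933
    diff -ru build-932 build-933 | less
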
I've updated mana with background on the current state of Chief and how to manually push an older build to staging.

https://mana.mozilla.org/wiki/pages/viewpage.action?pageId=5734601#crash-stats.mozilla.com%28Socorro%29-StageUpdateDetails

I've also disabled the stage auto-push cron job until the path forward for Django and the Apache configuration is determined.
I temporarily applied the WSGI directive to use the virtual environment on stage (r83555) and pushed the latest working CI (943). It was a failure.

After the fact I noticed I put it in the wrong part of the file. It may yet work, if someone applies it correctly. I'm out of time today. Trying to restore to a known working state.
Pushing the old 932 build from the command line isn't working. At the tail end of the script I get:

    Starting memcached: chown: cannot access `/var/run/memcached': No such file or directory

The stage web heads have no useful info for me, in the Apache logs or otherwise. I need to step away now, but I can come back to it later tonight.
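That memcached error literally means the directory the init script tries to chown doesn't exist; a likely workaround, if it comes to that, is to recreate the runtime directory by hand (assuming the service runs as a "memcached" user -- worth checking against the init script):

    # recreate memcached's runtime directory, then retry the service (user/group "memcached" is an assumption)
    sudo mkdir -p /var/run/memcached
    sudo chown memcached:memcached /var/run/memcached
    sudo /etc/init.d/memcached start
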
(In reply to Chris Lonnen :lonnen from comment #10)
> I temporarily applied the WSGI directive to use the virtual environment on
> stage (r83555) and pushed the latest working CI (943). It was a failure.
> 
> After the fact I noticed I put it in the wrong part of the file. It may yet
> work, if someone applies it correctly. I'm out of time today. Trying to
> restore to a known working state.

I put it at the top of crash-stats.allizom.org.conf when testing it before - I believe that it needs to be outside of the Virtualenv block.
(In reply to Robert Helmer [:rhelmer] from comment #12)
> (In reply to Chris Lonnen :lonnen from comment #10)
> > I temporarily applied the WSGI directive to use the virtual environment on
> > stage (r83555) and pushed the latest working CI (943). It was a failure.
> > 
> > After the fact I noticed I put it in the wrong part of the file. It may yet
> > work, if someone applies it correctly. I'm out of time today. Trying to
> > restore to a known working state.
> 
> I put it at the top of crash-stats.allizom.org.conf when testing it before -
> I believe that it needs to be outside of the Virtualenv block.

And by "Virtualenv block", I meant "VirtualHost block" :)
(In reply to Robert Helmer [:rhelmer] from comment #13)
> (In reply to Robert Helmer [:rhelmer] from comment #12)
> > (In reply to Chris Lonnen :lonnen from comment #10)
> > > I temporarily applied the WSGI directive to use the virtual environment on
> > > stage (r83555) and pushed the latest working CI (943). It was a failure.
> > > 
> > > After the fact I noticed I put it in the wrong part of the file. It may yet
> > > work, if someone applies it correctly. I'm out of time today. Trying to
> > > restore to a known working state.
> > 
> > I put it at the top of crash-stats.allizom.org.conf when testing it before -
> > I believe that it needs to be outside of the Virtualenv block.
> 
> And by "Virtualenv block", I meant "VirtualHost block" :)

Just confirmed this; Apache refuses to start with:

WSGIPythonHome cannot occur within <VirtualHost> section

I've just committed a fix to SVN: r83561
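For reference, a minimal sketch of where the directive ends up: WSGIPythonHome is a server-scope mod_wsgi directive, so it sits above the VirtualHost rather than inside it (the VirtualHost contents below are illustrative, not the actual crash-stats.allizom.org.conf):

    # server scope -- must be outside any <VirtualHost> block
    WSGIPythonHome /data/socorro/webapp-django/virtualenv/

    <VirtualHost *:80>
        ServerName crash-stats.allizom.org
        # ... the existing Django/WSGI configuration stays inside the vhost ...
    </VirtualHost>
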
http://crash-stats.allizom.org/ appears to be back up now
Marking this so we remember when the prod push comes around.
Whiteboard: [fromAutomation] → [fromAutomation][config change]
Target Milestone: --- → 77
Pushed to prod in 77.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Thanks; staging was fixed, and the push to prod was fine too, so marking this verified.
Status: RESOLVED → VERIFIED