Closed
Bug 993254
Opened 11 years ago
Closed 11 years ago
dxr.m.o is ISE
Categories
(Infrastructure & Operations Graveyard :: WebOps: Other, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: ashish, Assigned: fubar)
Details
What the summary says
20:46:22 < nagios-phx1> | Mon 20:46:22 PDT [1227] dxr.mozilla.org:HTTP is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 784 bytes in 0.007 second response time (http://m.mozilla.org/HTTP)
Reporter
Comment 1•11 years ago
Error log has tons of tracebacks for config.py:
[Mon Apr 07 21:10:37 2014] [error] [client 10.8.81.215] IOError: [Errno 2] Unable to load configuration file (No such file or directory): '/data/dxr-prod/target/config.py'
Initial errors in the logs start at 20:42. Reaching out to the devs to verify whether there was a push. The documentation says pushes are all automated via cron.
Reporter
Comment 2•11 years ago
Can confirm that /data/dxr-prod/target/config.py is not present on the admin server and that there was quite likely a deployment at 20:40. I can find the file on dxr-staging, though.
:erikrose and :fubar have been paged for assistance
Reporter
Comment 3•11 years ago
I've *temporarily* fixed this by symlinking
`/data/instances/9/target/config.py' -> `/data/instances/8/target/config.py'
directly on the webheads because I do not know where the code lives. From what I gather, it is not a local change on dxradm. Lowering severity for now, but someone will have to put the right fix in place in their morning.
Severity: critical → major
Assignee
Comment 4•11 years ago
So, the webapp is semi-disconnected from, but dependent on, the builds produced by dxr-processor1. Somehow the build got pushed out incompletely: config.py and the jinja_dxr_cache dir were missing. The build log on dxr-processor1 doesn't show any errors for the instance deploy:
(finished building 'comm-central' in 1:34:36.522703)
+ find target.new -type d -exec chmod o+rx '{}' ';'
+ find target.new -type f -exec chmod o+r '{}' ';'
+ '[' -d /data/www/instances/9 ']'
+ mkdir -p /data/www/instances/9.new/www.a
+ cd target.new
+ tar cpf - .
+ cd /data/www/instances/9.new/www.a
+ tar xpf -
+ pushd /data/www/instances/9.new
/data/www/instances/9.new /data/builds/mock_mozilla/prod-6hour/targetdata/dxr-build-env/target.new
+ ln -s www.a target
+ popd
/data/builds/mock_mozilla/prod-6hour/targetdata/dxr-build-env/target.new
+ mv /data/www/instances/9.new /data/www/instances/9
+ echo -e '\nFinished.'
The data in /data/builds/mock_mozilla/prod-6hour/targetdata/dxr-build-env/target.new *IS* complete, however. Manually copying it out now to see if that'll fix things.
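Since the source tree was complete but the deployed copy was not, a check on the copied instance before it is moved into place might catch this class of failure. The following is a minimal sketch only: `verify_instance` is a hypothetical helper, not part of the real deploy scripts, and it assumes that both config.py and jinja_dxr_cache sit directly under the instance's `target/` symlink. The demo runs against throwaway directories, not the real /data paths.

```bash
#!/usr/bin/env bash
# Hypothetical pre-promotion check; NOT taken from the actual DXR deploy scripts.
# Returns 0 only if the two pieces reported missing in this bug are present.
verify_instance() {
    local instance_dir="$1"
    # config.py is the file the webheads failed to load.
    [ -f "$instance_dir/target/config.py" ] || return 1
    # Assumption: jinja_dxr_cache lives next to config.py under target/.
    [ -d "$instance_dir/target/jinja_dxr_cache" ] || return 1
    return 0
}

# Self-contained demo: one complete and one incomplete fake instance,
# each laid out like the trace above (www.a directory plus target symlink).
demo_root=$(mktemp -d)
mkdir -p "$demo_root/good/www.a/jinja_dxr_cache" "$demo_root/bad/www.a"
touch "$demo_root/good/www.a/config.py"
ln -s www.a "$demo_root/good/target"
ln -s www.a "$demo_root/bad/target"

if verify_instance "$demo_root/good"; then good_result=ok; else good_result=incomplete; fi
if verify_instance "$demo_root/bad"; then bad_result=ok; else bad_result=incomplete; fi
echo "good: $good_result"
echo "bad:  $bad_result"
rm -rf "$demo_root"
```

In a deploy script, a check like this would run against the `.new` directory and abort before the final `mv` if it fails, so an incomplete copy never replaces the live instance.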
Assignee: server-ops-webops → klibby
Assignee
Comment 5•11 years ago
dxr.m.o is now correctly displaying the latest build.
Comment 6•11 years ago
Around 1300 PDT, I had to manually roll back a premature deployment of rev 977b8f0d564c3e22e5ac4cd808067265dfb50d56 of the web app. (We had forgotten to update the `format` file, and pieces of the UI broke as a result.) I rolled it back to 63baae14e50b3b3272fc3499a7083e35de383015 by manually remaking the dxr-prod symlink.
I don't see how that could have caused the problem here, but I mention it because of the temporal coincidence. Otherwise, I have no ideas.
Assignee
Comment 7•11 years ago
yeah, that shouldn't have had any impact. it looks like the tar pipeline died part way through the copy, but that should have aborted the scripts (set -eE), which didn't happen. I'm stumped.
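One general bash behavior that could produce this symptom, offered as a possibility and not as a confirmed root cause: under `set -e`, a pipeline's exit status is that of its last command unless `pipefail` is also set, so a failure on the producing side of `tar ... | tar ...` would not abort the script by itself. The deploy script is not shown in this bug, so whether it used `pipefail` is unknown. The sketch below uses `false | cat` as a stand-in for a tar pipeline whose first stage fails; `pipeline_status` is a hypothetical helper for the demo.

```bash
#!/usr/bin/env bash
# Demo of `set -e` with and without `pipefail`.
# `false | cat` stands in for a pipeline whose first stage dies.
pipeline_status() {
    local extra_opts="$1"
    local rc=0
    # Run in a child bash so an errexit abort does not kill this script.
    # The child reaches `exit 0` only if the failed pipeline did not abort it.
    bash -c "set -eE; ${extra_opts} false | cat; exit 0" >/dev/null 2>&1 || rc=$?
    echo "$rc"
}

without_pipefail=$(pipeline_status "")
with_pipefail=$(pipeline_status "set -o pipefail;")

echo "without pipefail: exit $without_pipefail"   # script carried on past the failure
echo "with pipefail:    exit $with_pipefail"      # script aborted at the pipeline
```

Without `pipefail` the child shell continues to `exit 0`; with it, the failed first stage makes the whole pipeline fail and `set -e` stops the script there.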
Assignee
Comment 8•11 years ago
I dislike not having a root cause, but the logs are long since rotated out, it hasn't broken again, and I'm swamped with vcs break/fix atm. Can re-open if it recurs.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 9•11 years ago
I expect we'll have to redo a lot of the deployment scripts anyway, once we go beyond another couple of indexed trees. And once we get Elasticsearch into place, there won't be an FS-move operation to bring the indexes into place anymore; all the data will flow over a socket. So never mind the old problems; I expect we'll have fresh, new ones before long. :-)
Updated•6 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard