Closed Bug 993254 Opened 11 years ago Closed 11 years ago

dxr.m.o is ISE

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

x86
macOS
task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ashish, Assigned: fubar)

Details

What the summary says:

20:46:22 < nagios-phx1> | Mon 20:46:22 PDT [1227] dxr.mozilla.org:HTTP is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 784 bytes in 0.007 second response time (http://m.mozilla.org/HTTP)

The error log has tons of tracebacks for config.py:

[Mon Apr 07 21:10:37 2014] [error] [client 10.8.81.215] IOError: [Errno 2] Unable to load configuration file (No such file or directory): '/data/dxr-prod/target/config.py'

Initial errors in the logs are from 20:42 onwards. Reaching out to the devs to verify whether there was a push; the documentation says pushes are all automated via cron.
Can confirm that /data/dxr-prod/target/config.py is not present on the admin server, and that there quite likely was a deployment at 20:40. I can find the file on dxr-staging, though. :erikrose and :fubar have been paged for assistance.
I've *temporarily* fixed this by symlinking `/data/instances/9/target/config.py` -> `/data/instances/8/target/config.py` directly on the webheads, because I do not know where the code lives. From what I gather it is not a local change on dxradm. Lowering severity for now, but someone will have to put up the right fix for this in the morning.
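The stopgap above amounts to pointing the missing instance-9 config at instance 8's copy. A minimal sketch of that workaround, using a throwaway temp directory in place of the real /data paths (the `ln` flags are my choice, not taken from the comment):

```shell
#!/bin/bash
# Demo of the stopgap: symlink the missing config.py in instance 9 to the
# copy that still exists in instance 8. Uses mktemp instead of /data.
root=$(mktemp -d)
mkdir -p "$root"/instances/8/target "$root"/instances/9/target
echo "previous build config" > "$root/instances/8/target/config.py"

# -s symbolic, -f replace any existing file, -n don't descend into an
# existing symlinked directory
ln -sfn "$root/instances/8/target/config.py" "$root/instances/9/target/config.py"

cat "$root/instances/9/target/config.py"
```

The real fix, of course, is redeploying a complete build; the symlink just keeps the webheads serving until then.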
Severity: critical → major
So, the webapp is semi-disconnected from, but dependent on, the builds produced by dxr-processor1. Somehow the build got pushed out, but incompletely - config.py and the jinja_dxr_cache dir were missing. The build log on dxr-processor1 doesn't show any errors for the instance deploy:

(finished building 'comm-central' in 1:34:36.522703)
+ find target.new -type d -exec chmod o+rx '{}' ';'
+ find target.new -type f -exec chmod o+r '{}' ';'
+ '[' -d /data/www/instances/9 ']'
+ mkdir -p /data/www/instances/9.new/www.a
+ cd target.new
+ tar cpf - .
+ cd /data/www/instances/9.new/www.a
+ tar xpf -
+ pushd /data/www/instances/9.new
/data/www/instances/9.new /data/builds/mock_mozilla/prod-6hour/targetdata/dxr-build-env/target.new
+ ln -s www.a target
+ popd
/data/builds/mock_mozilla/prod-6hour/targetdata/dxr-build-env/target.new
+ mv /data/www/instances/9.new /data/www/instances/9
+ echo -e '\nFinished.'

The data in /data/builds/mock_mozilla/prod-6hour/targetdata/dxr-build-env/target.new *IS* complete, however. Manually copying it out now to see if that'll fix things.
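For reference, the deploy trace above boils down to a copy-then-swap pattern: copy the build into a staged `9.new` directory via a tar pipe (preserving permissions), symlink `target` -> `www.a` inside it, then move the staged directory into place. A self-contained sketch of that pattern under a temp directory (paths shortened; not the actual deploy script):

```shell
#!/bin/bash
set -e
root=$(mktemp -d)
mkdir -p "$root/target.new" "$root/www/instances"
echo "config" > "$root/target.new/config.py"

# Copy the build via a tar pipe, preserving permissions, as in the log.
mkdir -p "$root/www/instances/9.new/www.a"
( cd "$root/target.new" && tar cpf - . ) \
  | ( cd "$root/www/instances/9.new/www.a" && tar xpf - )

# Point 'target' at the payload, then move the staged dir into place.
ln -s www.a "$root/www/instances/9.new/target"
mv "$root/www/instances/9.new" "$root/www/instances/9"

cat "$root/www/instances/9/target/config.py"
```

If the tar pipe dies midway, `9.new` can end up with a partial payload; whether the surrounding script notices depends on how pipeline failures are handled (see the `set -eE` discussion below in this bug).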
Assignee: server-ops-webops → klibby
dxr.m.o is now correctly displaying the latest build.
Around 1300 PDT, I had to manually roll back a premature deployment of rev 977b8f0d564c3e22e5ac4cd808067265dfb50d56 of the web app. (We had forgotten to update the `format` file, and pieces of the UI broke as a result.) I rolled it back to 63baae14e50b3b3272fc3499a7083e35de383015 by manually remaking the dxr-prod symlink. I don't see how that could have caused the problem here, but I mention it because of the temporal coincidence. Otherwise, I have no ideas.
Yeah, that shouldn't have had any impact. It looks like the tar pipeline died partway through the copy, but that should have aborted the script (set -eE), which didn't happen. I'm stumped.
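One plausible gap worth noting: `set -e` alone does not abort on a failure on the left-hand side of a pipeline, because the pipeline's exit status is that of its last command. Bash's `set -o pipefail` closes that gap. A small demonstration (generic bash behavior, not taken from the deploy script, which may or may not set pipefail):

```shell
#!/bin/bash
# Under plain `set -e`, a failure on the left of a pipe is masked by the
# right side's (successful) exit status, so the script keeps going.
set -e
false | cat
echo "still running despite the failed left-hand command"

# With pipefail, the pipeline reports the failing command's status instead.
set -o pipefail
if false | cat; then
    echo "not reached"
else
    echo "pipefail caught the failure"
fi
```

If the deploy script relied on `set -eE` without `pipefail`, a tar producer dying mid-copy would leave a truncated extract behind while the script carried on, which matches the symptoms here.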
I dislike not having a root cause, but the logs have long since rotated out, it hasn't broken again, and I'm swamped with vcs break/fix atm. Can re-open if it recurs.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
I expect we'll have to redo a lot of the deployment scripts anyway, once we go beyond another couple of indexed trees. And once we get Elasticsearch into place, there won't be an FS-move operation to bring the indexes into place anymore; all the data will flow over a socket. So never mind the old problems; I expect we'll have fresh, new ones before long. :-)
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard