Closed Bug 754067 Opened 13 years ago Closed 12 years ago

automatically rotate aus4-dev/aus4-admin-dev application logs

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: nmaul)

References

Details

(Whiteboard: [balrog][triaged 20120831][service interrupt])

I've been getting nothing but ISE 500 errors from aus4-admin-dev this week. The http logs showed that admin.wsgi was returning prematurely, but the application log was absolutely empty. I ended up running through the admin.wsgi code by hand in the python interpreter, and this happened: [bhearsum@node267.seamicro.phx1 aus4-admin]$ python Python 2.6.6 (r266:84292, Sep 12 2011, 14:03:02) [GCC 4.4.5 20110214 (Red Hat 4.4.5-6)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from auslib.web.base import db, app as application Traceback (most recent call last): File "<stdin>", line 1, in <module> File "auslib/web/base.py", line 1, in <module> from flask import Flask, request File "/usr/lib/python2.6/site-packages/flask/__init__.py", line 17, in <module> from werkzeug import abort, redirect ImportError: No module named werkzeug >>> >>> import site >>> site.addsitedir('vendor/lib/python') >>> from auslib.web.base import db, app as application Bus error (core dumped) Interestingly, I can cat the file fine: [bhearsum@node267.seamicro.phx1 aus4-admin]$ cat auslib/web/base.py from flask import Flask, request from auslib import version from auslib.db import AUSDatabase app = Flask(__name__) db = AUSDatabase() @app.errorhandler(500) def isa(error): log.error("Caught ISE 500 error.") log.debug("Balrog version is: %s", version) log.debug("Request path is: %s", request.path) log.debug("Request environment is: %s", request.environ) log.debug("Request headers are: %s", request.headers) return error # All of our View modules contain routing information that needs to be imported # to be active. from auslib.web.views import * This seems like it might be an infra issue, but I want to look into a little bit more before throwing it over the fence.
I dug into this some more and found that the root module that's causing the bus error is ctypes. It doesn't look like that module was updated anytime recently, so I'm not sure what to do from here. IT, any ideas? You can reproduce with the following in the python shell: 'import ctypes' If you add print statements to /usr/lib/python2.6/ctypes/__init__.py you can bisect a bit, but it's the final call to CFUNCTYPE (the last line of the file) that's causing the error. This issue is completely breaking aus4-dev currently, because no new data can be submitted.
Assignee: bhearsum → server-ops
Severity: normal → critical
Component: Release Engineering: Automation (General) → Server Operations: Web Operations
QA Contact: catlee → cshields
Assignee: server-ops → nmaul
[root@node267.seamicro.phx1 ~]# python Python 2.6.6 (r266:84292, Sep 12 2011, 14:03:02) [GCC 4.4.5 20110214 (Red Hat 4.4.5-6)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import ctypes >>> quit() The problem may be that these systems are out of disk space. There is giant log file at /var/httpd/log/aus.log, on both systems. This isn't a standard Apache log file... I suspect maybe it's something the app itself creates. It looks like debugging info... starts with some tracebacks, then a lot of JSON data. Most lines start with 'DEBUG:auslib.web.views.releases:' Any idea where this comes from, and what we can do with it?
Yeah, that's the application log. We can definitely shrink it some, but is there any way we can set-up rotation of it through syslogd or something? How is application log rotation/aging normally done?
I just deleted aus.log, because we don't care too much about the history at this point, to get things unwedged.
Severity: critical → major
According to Jake on IRC, we should be able to use logrotate to rotate/age these logs. We'll need to do this for both apps: one logs to /var/log/aus.log, the other to /var/log/httpd/aus4.mozilla.org/aus.log. I filed bug 754383 on greatly reducing the amount of JSON blob we print to these logs, which should help avoid this situation in the future.
Summary: aus4-admin-dev dumping its core when trying to run admin.wsgi → automatically rotate aus4-dev/aus4-admin-dev application logs
Also, had to create the log files by hand, and chown them to apache to give the app write permission again. 14:34 < solarce> bhearsum|buildduty: jakem we should probably use puppet to make that file and set permissions, but i'll defer to jakem
Severity: major → normal
Assignee: nmaul → bburton
Whiteboard: [balrog] → [balrog] [review 7/19]
Got caught up in some other things, so I fixed the issue that was reported in bug 774659 with [logging] ;Where to put the application log. No rotation is done on this file. -logfile=/var/log/httpd/aus4.mozilla.org/aus.log +logfile=/var/log/httpd/ausadmin.log But I still need to finish the log rotate and will look at this first thing tomorrow
Whiteboard: [balrog] [review 7/19] → [balrog] [finish 7/18]
Any update here? The admin application is still returning ISE 500 errors.
Severity: normal → major
Assignee: bburton → jcrowe
Assignee: jcrowe → bburton
Assignee: bburton → server-ops-webops
Severity: major → normal
Priority: -- → P4
Whiteboard: [balrog] [finish 7/18] → [balrog][triaged 20120831][waiting][webops]
Priority: P4 → P3
Whiteboard: [balrog][triaged 20120831][waiting][webops] → [balrog][triaged 20120831][service interrupt]
This bug has caused the aus4-dev environment to be broken for almost two months now, is there an ETA on having it fixed?
I think see the problem here... it's not the same one as this bug was initially opened for. As a security measure against log file injection, the /var/log/httpd directory is no longer traversable by system accounts (like "nobody" or "apache"). This doesn't bother Apache, which opens its logfiles *before* dropping root privileges. It does affect this app though, because it can only write with the permissions of the workers running it... which can't reach /var/log/httpd. I'm committing a change to the config to move this log file to /var/log/ausadmin.log. Once that goes out I think this should start to work again.
I believe this is resolved now. Please let us know if you're still getting errors, and how we can see replicate them.
Assignee: server-ops-webops → nmaul
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Yup, we're past that error now. The admin application appears to be working 100% fine. The read-only application is hitting a different error now, which I filed in bug 791663. Thanks for getting this resolved!
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.