Closed Bug 1192021 Opened 9 years ago Closed 9 years ago

Clean up logging for relengapi

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: Callek)

Details

Attachments

(2 files, 1 obsolete file)

* Make sure nothing's being missed
  * All Python logging from webheads and celery (bug 1185135)
  * Apache access logs
* Structured logging actually transmitted to MozDef
  * http://mozdef.readthedocs.org/en/latest/usage.html#sending-logs-to-mozdef
  * need to figure out how this is currently routed (if at all)
I'm seeing a bunch of logging go to /var/lib/httpd/error_log -- traceback for a 500 ISE, and a log of "generating signed S3 GET URL for .." from archiver.
Ah, mod_wsgi adds its own handler to the root of the logging structure (output from logging_tree):

 <--"" 
    Level NOTSET so inherits level NOTSET 
    Handler Stream <mod_wsgi.Log object at 0x7f2616b91e70> 
      Level INFO 
      Formatter fmt='%(asctime)s [%(name)s] %(message)s' datefmt=None 
    Handler <cloghandler.ConcurrentRotatingFileHandler object at 0x7f25f4867150> 
      Level INFO 
      Formatter fmt='%(asctime)s [%(name)s] %(message)s' datefmt=None 
    Handler SysLog ('localhost', 514) facility=1 
      Level INFO 
      Formatter fmt='%(asctime)s web1.stage.releng.webapp.scl3.mozilla.com relengapi: %(message)s' datefmt='%b %d %H:%M:%S' 

which is, I assume, what's putting those error messages in /var/log/httpd/error.log.  They should *also* be in /var/log/relengapi/relengapi.log.  Indeed, I can confirm that tracebacks for 500 errors and "generating signed S3 GET URL" are present there.  However, the tracebacks themselves are missing in papertrail; perhaps because they're huge and on a single line.
Part of that was because bug 1185135 was only applied for celery, not WSGI.  But even with that, the tracebacks in papertrail look like

 Sep 02 08:35:16 web1.stage.releng.webapp.scl3.mozilla.com relengapi: Exception on /tokenauth/tokens [POST]#012Traceback (most recent call last):#012  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request#012    rv = self.dispatch_request()#012  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/flask/app.py", line 1461, in dispatch_request#012    return self.view_functions[rule.endpoint](**req.view_args)#012  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/newrelic-2.46.0.37/newrelic/hooks/framework_flask.py", line 30, in _nr_wrapper_handler_#012    return wrapped(*args, **kwargs)#012  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/relengapi/lib/api.py", line 107, in replacement#012    result = wrapped(*args, **kwargs)#012  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/relengapi/blueprints/tokenauth/__init__.py", line 282, in issue_token#012    token = token_issuers[typ](body, requested_permissions)#012  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/relengapi/blueprints/tokenauth/__init__.py", line 235, in issue_usr#012    session.commit()#012  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/sqlalchemy/orm/scoping.py", line 150, in do#012    return getattr(self.registry(), name)(*args, **kwargs)#012  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 790, in commit#012    self.transaction.commit()#012  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 392, in commit#012    self._prepare_impl()#012  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 372, in _prepare_impl#012    self.session.flush()#012  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2004, in flush#012    self._flush(objects)#012  File "/data/www/relengapi/virtualenv/lib/python2.7/site- 

ooglay!
OK, straightening out those exceptions doesn't seem easy.
  http://stackoverflow.com/questions/32359144/log-each-line-of-an-exception-separately

I added the following to relengapi.wsgi and celery:

# monkey-patch Logger._log to log each exception line separately, so that they appear nicely in
# the syslog (rather than concatenated in a single message)
old_log = logging.Logger._log
def new_log(self, level, msg, args, exc_info=None, extra=None):
    old_log(self, level, msg, args, exc_info=None, extra=extra)
    if exc_info:
        if not isinstance(exc_info, tuple):
            exc_info = sys.exc_info()
        exc_text = logging._defaultFormatter.formatException(exc_info)
        for line in exc_text.split("\n"):
            old_log(self, level, line, ())
logging.Logger._log = new_log

and that seems to have worked.
OK, I *think* everything's going successfully to papertrail now!
Attachment #8656219 - Flags: review?(bugspam.Callek) → review+
This is all committed, but not landed yet.  That will happen along with the treestatus landing.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
EXCEPT this is causing celery to not start in staging ...

Traceback (most recent call last):
  File "/data/www/relengapi/celery", line 14, in <module>
    old_log = logging.Logger._log
NameError: name 'logging' is not defined


And if I add import logging above the code block...

Traceback (most recent call last):
  File "/data/www/relengapi/celery", line 83, in <module>
    load_entry_point('celery', 'console_scripts', 'celery')()
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/__main__.py", line 30, in main
    main()
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/bin/celery.py", line 81, in main
    cmd.execute_from_commandline(argv)
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/bin/celery.py", line 769, in execute_from_commandline
    super(CeleryCommand, self).execute_from_commandline(argv)))
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/bin/base.py", line 307, in execute_from_commandline
    return self.handle_argv(self.prog_name, argv[1:])
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/bin/celery.py", line 761, in handle_argv
    return self.execute(command, argv)
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/bin/celery.py", line 693, in execute
    ).run_from_argv(self.prog_name, argv[1:], command=argv[0])
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/bin/worker.py", line 179, in run_from_argv
    return self(*args, **options)
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/bin/base.py", line 270, in __call__
    ret = self.run(*args, **kwargs)
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/bin/worker.py", line 212, in run
    state_db=self.node_format(state_db, hostname), **kwargs
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/worker/__init__.py", line 100, in __init__
    self.setup_instance(**self.prepare_args(**kwargs))
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/worker/__init__.py", line 141, in setup_instance
    self.blueprint.apply(self, **kwargs)
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/bootsteps.py", line 209, in apply
    self._debug('Preparing bootsteps.')
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/bootsteps.py", line 272, in _debug
    return debug(_pre(self, msg), *args)
  File "/usr/lib/python2.7/logging/__init__.py", line 1128, in debug
    self._log(DEBUG, msg, args, **kwargs)
  File "/data/www/relengapi/celery", line 64, in new_log
    old_log(self, level, msg, args, exc_info=None, extra=extra)
  File "/data/www/relengapi/celery", line 18, in new_log
    old_log(self, level, msg, args, exc_info=None, extra=extra)
TypeError: 'NoneType' object is not callable


I'm not going to block relengapi's deploy on this, since its a deployment bug not a repo-bug (but I think we have other issues to address)...
Status: RESOLVED → REOPENED
Flags: needinfo?(dustin)
Resolution: FIXED → ---
Apparantly this does block prod since prod's and stagings celery script is identical... and simply commenting out that code doesn't work.
And... if I comment out the code block:

Traceback (most recent call last):
  File "/data/www/relengapi/celery", line 81, in <module>
    load_entry_point('celery', 'console_scripts', 'celery')()
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/__main__.py", line 30, in main
    main()
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/bin/celery.py", line 81, in main
    cmd.execute_from_commandline(argv)
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/bin/celery.py", line 769, in execute_from_commandline
    super(CeleryCommand, self).execute_from_commandline(argv)))
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/bin/base.py", line 307, in execute_from_commandline
    return self.handle_argv(self.prog_name, argv[1:])
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/bin/celery.py", line 761, in handle_argv
    return self.execute(command, argv)
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/bin/celery.py", line 693, in execute
    ).run_from_argv(self.prog_name, argv[1:], command=argv[0])
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/bin/worker.py", line 179, in run_from_argv
    return self(*args, **options)
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/bin/base.py", line 270, in __call__
    ret = self.run(*args, **kwargs)
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/bin/worker.py", line 212, in run
    state_db=self.node_format(state_db, hostname), **kwargs
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/worker/__init__.py", line 100, in __init__
    self.setup_instance(**self.prepare_args(**kwargs))
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/worker/__init__.py", line 141, in setup_instance
    self.blueprint.apply(self, **kwargs)
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/bootsteps.py", line 209, in apply
    self._debug('Preparing bootsteps.')
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/celery/bootsteps.py", line 272, in _debug
    return debug(_pre(self, msg), *args)
  File "/usr/lib/python2.7/logging/__init__.py", line 1128, in debug
    self._log(DEBUG, msg, args, **kwargs)
  File "/data/www/relengapi/celery", line 62, in new_log
    old_log(self, level, msg, args, exc_info=None, extra=extra)
  File "/usr/lib/python2.7/logging/__init__.py", line 1249, in _log
    fn, lno, func = self.findCaller()
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/structlog/stdlib.py", line 32, in findCaller
    f, name = _find_first_app_frame_and_name(['logging'])
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/structlog/_frames.py", line 41, in _find_first_app_frame_and_name
    while any(name.startswith(i) for i in ignores):
  File "/data/www/relengapi/virtualenv/lib/python2.7/site-packages/structlog/_frames.py", line 41, in <genexpr>
    while any(name.startswith(i) for i in ignores):
AttributeError: 'NoneType' object has no attribute 'startswith'
... which looks like https://github.com/hynek/structlog/commit/655551f0da3dac53849684d893af3aa71f7b308b which is "interesting" since pip list says its structlog 15.2.0
Looking into c#11

On closer look (with debugging) it looks like frame.__name__ exists, but is explicitly set to None.

So in this case, we don't hit the fallback of "?", and is a structlog issue.

I created https://github.com/hynek/structlog/pull/58 for that.
The underlying problem here is that Celery helpfully deletes all symbols from __main__ when it starts up. It's like celery *tries* to be difficult to use :(

That's why logging wasn't defined, and why old_log becomes None.  And that's why I have all of those imports in the `setup` function -- that provides a function-level scope which Celery can't mess with, no matter how hard it tries.  I must have botched a merge or something and ended up monkey-patching Logger.log twice -- once at the module level and once in `setup`.  Just deleting the module-level version (as you've done by commenting it out) seems to help.

I think that frame issue is also related to Celery trying to "clean house".  But it may also be due to the monkey-patching (findCaller is always going to find `new_log` rather than its caller)
Flags: needinfo?(dustin)
Attached patch Monkeypatch Structlog (obsolete) — Splinter Review
Since structlog won't be released for at least a week, here is a monkeypatch.
Attachment #8661229 - Flags: review?(dustin)
Comment on attachment 8661229 [details] [diff] [review]
Monkeypatch Structlog

Review of attachment 8661229 [details] [diff] [review]:
-----------------------------------------------------------------

Err this is still failing, even after trying to monkeypatch
Attachment #8661229 - Flags: review?(dustin) → review-
This was tested on staging, and works.
Attachment #8661229 - Attachment is obsolete: true
Attachment #8661243 - Flags: review?(dustin)
Attachment #8661243 - Flags: review?(dustin) → review+
Assignee: dustin → bugspam.Callek
(In reply to Justin Wood (:Callek) from comment #17)
> Created attachment 8661243 [details] [diff] [review]
> Monkeypatch Structlog take 2
> 
> This was tested on staging, and works.

and this was subsequently deployed to prod. We're good now.
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: