Closed Bug 698345 Opened 13 years ago Closed 12 years ago

notify bmo devs of application errors via arecibo

Categories

(bugzilla.mozilla.org :: Extensions, defect)

Production
x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: glob, Assigned: glob)

References

Details

Attachments

(1 file, 5 obsolete files)

arecibo provides a mechanism for monitoring application errors; let's hook bmo up to it.
arecibo docs:
  http://www.areciboapp.com/docs/client/post.html
  http://www.areciboapp.com/docs/client/variables.html

i have a working prototype extension which i'll attach.
Attached file wip (obsolete) —
Comment on attachment 570624 [details]
wip

Some small review comments.

># XXX move this to lib/Data.pm

Nit: lib/Constants.pm

>use constant ARECIBO_SERVER => 'http://amckay-arecibo.khan.mozilla.org/v/1/';
>use constant ARECIBO_TIMEOUT => 10; # seconds

>our $in_modperl;
>our $apache_server;

our $_in_mod_perl;
our $_apache_server;

>our $agent = LWP::UserAgent->new(agent => 'bugzilla.mozilla.org');
>our $hostname = hostname();

our $_agent = LWP::UserAgent->new(agent => 'bugzilla.mozilla.org');
our $_hostname = hostname();

>sub _post_error {
>    my $class = shift;
>    my @message = @_;
>
>    # TODO - set priority, type based on message content
>    my $priority = $class eq 'ERROR' ? 5 : 10;
>    my $type = '';

This may be difficult since the message for fatal errors can be pretty random,
such as database connectivity issues, strict mode errors, taint issues, etc.

>    my $data = [
>        status => '500',
>        priority => $priority,
>        user_agent => $ENV{HTTP_USER_AGENT},
>        url => Bugzilla->cgi->self_url,
>        type => $type,
>        server => $hostname,
>        msg => join("\n", @message),
>        traceback => longmess(),
>        timestamp => '', # TODO
>        username => Bugzilla->user->login,
>    ];

Nit: Line up =>'s

>    # fork then post
>    $SIG{CHLD} = 'IGNORE';
>    my $pid = fork();
>    if ($pid == 0) {
>        my $response = $agent->post(ARECIBO_SERVER, $data);
>        # TODO log errors
>        exit;

Could we just log to apache log in this case? or maybe data/errorlog. I don't see this filling
up the log too much unless common pages are erroring for everyone. Otherwise this will be mostly
for edge cases if we test well enough any changes we make.
(In reply to David Lawrence [:dkl] from comment #3)
> Comment on attachment 570624 [details]
> wip
> 
> Some small review comments.

sorry, i should have said "don't review" :)
this code was mostly to show how easy the interaction with arecibo is.

> This may be difficult since the message for fatal errors can be pretty
> random, such as database connectivity issues, strict mode errors, taint issues, 
> etc.

the ones i care about the most are the database issues, and they are easy to catch.  we can just add more regexs as we need them.

> Could we just log to apache log in this case?

yes, that's the plan.  that TODO is only to cover what happens when there's a failure to post to arecibo.


i've explored this a lot today, and it looks like we'll have to do this as a patch to Bugzilla::Error rather than as an extension.  catching 'die' is ok, but it doesn't get us coverage of template and code errors which are caught and displayed by bugzilla itself.  a classic example of a thrown template error which we *need* to catch is when posting to the jobqueue table fails.

also, perl's error management is a nightmare, and hurts my head.
Attached patch patch v2 (obsolete) — Splinter Review
in a nutshell this patch changes how errors are displayed:

all seriously fatal errors (db down, etc), all template errors, an some code errors are passed to arecibo for logging and notification.  all warnings are also reported to arecibo (with a not-as-important priority).

any error and warning which is passed to arecibo will also be logged to apache's error_log.

the stacktrace has been removed from the error page, and instead a more verbose trace is logged on arecibo.

when an error is logged, it is assigned an id which is displayed to the user.  this id is logged in apache's error_log and is visible via arecibo.


implementation notes:

i've kept the config inside the arecibo module to make it self-contained.  we can tweak the message string --> type matching as time goes on.

the logging to arecibo is done in a forked and detached process.  failure to post to arecibo isn't logged (the error will be logged in apache's log).
Attachment #570624 - Attachment is obsolete: true
Attachment #571896 - Flags: review?(dkl)
Comment on attachment 571896 [details] [diff] [review]
patch v2

found a problem; Apache2::Log is logging to the main error_log, not to the one configured against the virtual host.  cancelling review.
Attachment #571896 - Flags: review?(dkl)
Attached patch patch v3 (obsolete) — Splinter Review
as well as fixing the apache log issue, this also removes stack traces from user-visible DBI errors.
Attachment #571896 - Attachment is obsolete: true
Attachment #571908 - Flags: review?(dkl)
Comment on attachment 571908 [details] [diff] [review]
patch v3

Review of attachment 571908 [details] [diff] [review]:
-----------------------------------------------------------------

not ok 74 - (en/default) template/en/default/global/code-error.html.tmpl has unfiltered directives:
#   518: error_message 
# --ERROR
#   Failed test '(en/default) template/en/default/global/code-error.html.tmpl has unfiltered directives:
#   518: error_message 
# --ERROR'
#   at t/008filter.t line 130.

not ok 416 - Dependency on Bugzilla::Error from Bugzilla::Arecibo causes loop. --ERROR
#   Failed test 'Dependency on Bugzilla::Error from Bugzilla::Arecibo causes loop. --ERROR'
#   at t/010dependencies.t line 100.

After correcting the above, testing mostly worked even though I can't get to the arecibo server myself yet til
I figure out my VPN issues. Will be able to do more picking when I can get that figured out.

Don't really like the pink background color (is it pink for you? is for me). Can we change to a red box or a different
shade of red for the background?

dkl

::: Bugzilla/Arecibo.pm
@@ +236,5 @@
> +    require CGI::Carp;
> +    CGI::Carp::set_die_handler(sub {
> +        return if _in_eval();
> +        my $message = shift;
> +        eval { ThrowTemplateError($message) };

This line always throws an error for me when I add a random die() somewhere. It can't find ThrowTemplateError(). Works when
I change to:

eval { Bugzilla::Error::ThrowTemplateError($message) };
Attachment #571908 - Flags: review?(dkl) → review-
(In reply to David Lawrence [:dkl] from comment #8)
> not ok 74 - (en/default) template/en/default/global/code-error.html.tmpl has
> [...]

sorry dkl, i've changes my process to ensure this won't happen again :(

> After correcting the above, testing mostly worked even though I can't get to
> the arecibo server myself yet til
> I figure out my VPN issues. Will be able to do more picking when I can get
> that figured out.

i use a tunnel via people along with hosts file shenanigans.

> Don't really like the pink background color (is it pink for you? is for me).
> Can we change to a red box or a different shade of red for the background?

will do.
Attached patch patch v4 (obsolete) — Splinter Review
Attachment #571908 - Attachment is obsolete: true
Attachment #572385 - Flags: review?(dkl)
Comment on attachment 572385 [details] [diff] [review]
patch v4

Looks good. I was finally able to get access to the internal testing Arecibo server and tried server different error scenarios. The information was relayed to the server and displayed properly. 

My only concern is if we have some bad code causing the logs to fill with warnings, will this cause an issue with two many children being spawned off and
possibly not being cleaned up properly? I don't want the IT admins to have to deal with too many zombies. May not be an issue as the main reason for a failure would be the arecibo server bein inaccessible but we have a 10 sec timeout for that. Anything else you can think of that may cause issues similar to this?

Looks good. Color change discussed earlier can be committed on checkin. r=dkl
Attachment #572385 - Flags: review?(dkl) → review+
(In reply to David Lawrence [:dkl] from comment #11)
> My only concern is if we have some bad code causing the logs to fill with
> warnings, will this cause an issue with two many children being spawned off
> and possibly not being cleaned up properly?

the code doesn't generate zombies, so this shouldn't be a problem.
Depends on: 700683
Attached patch patch v5 (obsolete) — Splinter Review
while playing around i found a few gaps in the implementation in v4, so here's v5.

changes:
  - the arecibo server url is now a param
    - should have been a param from v1, sorry about that :)
  - added the ability to ignore errors
    - "Software caused connection abort" isn't an error worth reporting to arecibo
  - added traceback to apache error log when arecibo is disabled
Attachment #572385 - Attachment is obsolete: true
Attachment #573144 - Flags: review?(dkl)
Attached patch patch v6Splinter Review
while working on another bug i exposed a bug with earlier patches which caused it to go into an infinite error handling loop when a compilation failure is encountered.

this version ensures compilation failures don't use the template based error display.
Attachment #573144 - Attachment is obsolete: true
Attachment #573144 - Flags: review?(dkl)
Attachment #573195 - Flags: review?(dkl)
Depends on: 701290
Comment on attachment 573195 [details] [diff] [review]
patch v6

Review of attachment 573195 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good. r+
Attachment #573195 - Flags: review?(dkl) → review+
due to a lack of action on bug 700683, i had a chat with mgoodwin about how we can get some sort of arecibo reporting in place.

i proposed removing all argument data from the stacktrace until bug 700683 is resolved, and he said this would be acceptable.  i'll commit a version which does that.
(In reply to Byron Jones ‹:glob› from comment #17)
> i proposed removing all argument data from the stacktrace until bug 700683
> is resolved, and he said this would be acceptable.  i'll commit a version
> which does that.

Let me know when this is committed; I'll take another quick look.

Thanks
example of the change:

full argument data: http://amckay-arecibo.khan.mozilla.org/view/1326232/
zero argument data: http://amckay-arecibo.khan.mozilla.org/view/1326231/
That's great; I'm happy with this going in without argument data.
thanks mark!

Committing to: bzr+ssh://bjones%40mozilla.com@bzr.mozilla.org/bmo/4.0/
added Bugzilla/Arecibo.pm
modified Bugzilla/DB.pm
modified Bugzilla/Error.pm
modified Bugzilla/Config/Advanced.pm
modified skins/standard/global.css
modified template/en/default/admin/params/advanced.html.tmpl
modified template/en/default/global/code-error.html.tmpl
Committed revision 8077.

Committing to: bzr+ssh://bjones%40mozilla.com@bzr.mozilla.org/bmo/4.2/
added Bugzilla/Arecibo.pm
modified Bugzilla/DB.pm
modified Bugzilla/Error.pm
modified Bugzilla/Config/Advanced.pm
modified skins/standard/global.css
modified template/en/default/admin/params/advanced.html.tmpl
modified template/en/default/global/code-error.html.tmpl
Committed revision 8067.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Component: Extensions: Other → Extensions
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: