Closed Bug 1152742 Opened 9 years ago Closed 8 years ago

Determine why json log parsing is now taking 80x longer than before & re-enable it

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P3)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: emorley, Unassigned)

In bug 1152681, the trees ended up being closed because log parsing was falling behind. This was found to be due to the parse-json-log task taking 80x longer than it had previously, so it was temporarily disabled to let the trees reopen.

This bug is for:
* Figuring out why the parse-json-log task is taking so much longer as of ~5.5 hours ago (the task duration jumped at the same time on both stage and prod). See bug 1152681 for the initial investigation findings.
* Fixing the cause, through likely one or more of:
 - stopping taskcluster from spamming us with malformed logs (or with as many logs as it currently does)
 - increasing the number of log processor nodes
 - tweaking the workers/queues so we actually prioritise the log_parser_fail and log_parser_hp queues (currently we round-robin between all queues, since that's celery's default behaviour); see the sketch after this list
 - profiling the json log parsing to see if we can improve performance - we could even do this on new relic: https://docs.newrelic.com/docs/apm/applications-menu/events/thread-profiler-dashboard#starting
 - fixing bug 1080760 so that taskcluster et al do the work themselves, and instead only submit the end result to us.
* Re-enabling json log parsing and removing the xfail annotation from the test. ie reverting these two commits:
https://github.com/mozilla/treeherder-service/commit/0ce875128189e783f892bc63bcf73fdb56557f09
https://github.com/mozilla/treeherder-service/commit/94a081e480f8f45c1434dda0c44fe074938c5ed8
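A minimal sketch of the queue-prioritisation idea above (Celery 3.x-style setting names; the extra queue names, task module paths and worker concurrency are assumptions, not our actual config) - declare the queues explicitly and point dedicated workers at the important ones with -Q, rather than letting one worker consume every queue round-robin:

from kombu import Queue

# Assumed settings sketch; the task paths below are illustrative, not
# Treeherder's real module layout.
CELERY_QUEUES = (
    Queue('default'),
    Queue('log_parser'),
    Queue('log_parser_fail'),
    Queue('log_parser_hp'),
)
CELERY_DEFAULT_QUEUE = 'default'
CELERY_ROUTES = {
    'log_parser.tasks.parse_log_fail': {'queue': 'log_parser_fail'},
    'log_parser.tasks.parse_log_hp': {'queue': 'log_parser_hp'},
}

# Then run e.g. (concurrency values made up):
#   celery worker -Q log_parser_fail,log_parser_hp --concurrency=4
#   celery worker -Q default,log_parser --concurrency=2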

A trace of one of the 170s json log parsing tasks:
https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?tw[end]=1428570447&tw[start]=1428566847#id=5b224f746865725472616e73616374696f6e2f43656c6572792f70617273652d6a736f6e2d6c6f67222c22225d&sort_by=throughput&app_trace_id=6387634562&tab-detail_6387634562=trace_details
The most relevant findings from bug 1152681:

* The parse-json-log task duration spike occurred at the same time on both stage and prod, even though we hadn't deployed anything, which to me indicates a change in what's being submitted to us (or else a puppet system update or similar, but I'm fairly sure that's less likely).
* The number of HTTP 404s for log parsing went down at around the time the parse-json-log load went up - perhaps taskcluster or another submitter fixed the log URLs being given to us?
* We started seeing MemoryError() exceptions during the json parse task for the first time:
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3528329220/similar_errors?original_error_id=3528329220&tw[end]=1428580182&tw[start]=1428493782
* 90+% of the parse-json-log traces are inside the task, not waiting for AWS or the DB etc.
Blocks: 1074927
Depends on: 1080760, 1079286, 1076761
Depends on: 1152769
We should also set a lower timeout (well, lower than the default celery value) for these tasks, so no single task can run away and monopolise resources. (Though we'll need to make sure the task doesn't get retried 10 times after timing out -> does celery even have an overall timeout for a task?)
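To make that concrete, a hedged Celery sketch (the limit values, task name and body are assumptions): soft_time_limit raises SoftTimeLimitExceeded inside the task so it can give up cleanly, time_limit hard-kills the worker child shortly afterwards, and max_retries bounds how many times the task can be requeued via retry - as far as I can tell Celery has no single overall budget spanning retries, only these per-attempt limits plus the retry cap.

from celery import shared_task
from celery.exceptions import SoftTimeLimitExceeded

@shared_task(bind=True, soft_time_limit=300, time_limit=330, max_retries=3)
def parse_json_log(self, log_url):
    # Hypothetical name/signature; the real Treeherder task differs.
    try:
        pass  # download and parse the log here
    except SoftTimeLimitExceeded:
        # Hit the soft limit: bail out rather than letting this log
        # monopolise a worker or get retried endlessly.
        return None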
The log in question from the comment 0 170s trace:
http://mozilla-releng-blobs.s3.amazonaws.com/blobs/mozilla-inbound/sha512/bb932f6f882913a7b21bb566c84ffe0bb2b9f55b90e702d1db5e4f480126355c9d0379d1b7cb576776b27cf7d1ed252e67bfb1496870c9ce5b6103d0aaddef4d

(To get log URLs, click trace details, then the lens next to the "requests[mozilla-releng-blobs.s3.amazonaws.com]" row)
For the log in comment 3:

[~/src]$ ll -h browser-chrome-chunked_raw.log
-rwxrwx---+ 1 Ed None 1.4G Apr  9 19:17 browser-chrome-chunked_raw.log*
[~/src]$ wc -l browser-chrome-chunked_raw.log
7784944 browser-chrome-chunked_raw.log

...well I guess that answers that one!

We should set a maximum filesize above which we don't attempt to download/parse.
Can we tell if this log came from a try run?  If not, we should investigate, since we shouldn't even be producing logs of that size.
(In reply to Jonathan Griffin (:jgriffin) from comment #5)
> Can we tell if this log came from a try run?  If not, we should investigate,
> since we shouldn't even be producing logs of that size.

The URL contains "mozilla-inbound".

I think we should:
1) Set a max response size (a la http://stackoverflow.com/questions/23514256/http-request-with-timeout-maximum-size-and-connection-pooling); a rough sketch follows below
2) Find out what is creating these too large logs and stop them
3) Set a shorter timeout on the log parsing jobs in general (with the caveat of comment 2)
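A rough sketch of 1) and 3) combined, along the lines of that Stack Overflow answer (the 100MB cap, 30s timeout and function name are assumptions, not actual Treeherder values):

import requests

MAX_LOG_BYTES = 100 * 1024 * 1024  # assumed cap
TIMEOUT_SECS = 30                  # assumed per-request timeout

def fetch_log_capped(url):
    resp = requests.get(url, stream=True, timeout=TIMEOUT_SECS)
    resp.raise_for_status()

    # Reject obviously huge logs before downloading anything.
    length = resp.headers.get('Content-Length')
    if length is not None and int(length) > MAX_LOG_BYTES:
        raise ValueError('log too large: %s bytes' % length)

    # Content-Length can be absent or wrong, so also enforce the cap
    # while streaming.
    chunks = []
    received = 0
    for chunk in resp.iter_content(chunk_size=64 * 1024):
        received += len(chunk)
        if received > MAX_LOG_BYTES:
            raise ValueError('log exceeded %d bytes' % MAX_LOG_BYTES)
        chunks.append(chunk)
    return b''.join(chunks)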
(In reply to Ed Morley [:emorley] (formerly :edmorley) from comment #6)
> (In reply to Jonathan Griffin (:jgriffin) from comment #5)

> 2) Find out what is creating these too large logs and stop them

I'll take care of this, and file a bug.
(In reply to Ed Morley [:emorley] (formerly :edmorley) from comment #6)
> I think we should:
4) Make the harness not allow logs of that size, no matter how much the tests output

(In reply to Jonathan Griffin (:jgriffin) from comment #7)
> I'll take care of this, and file a bug.

Execute:
> SELECT job_id FROM mozilla_inbound_jobs_1.job_log_url WHERE url = "http://mozilla-releng-blobs.s3.amazonaws.com/blobs/mozilla-inbound/sha512/bb932f6f882913a7b21bb566c84ffe0bb2b9f55b90e702d1db5e4f480126355c9d0379d1b7cb576776b27cf7d1ed252e67bfb1496870c9ce5b6103d0aaddef4d" ORDER BY id desc LIMIT 1

+---------+
| job_id  |
+---------+
| 8643929 |
+---------+
1 row


https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=8643929
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=a3f6306dd05e
The "normal" log [1] for that job exceeded the 50MB buildbot cap, and the whole push was pretty broken. Seems like we just need to set a cap for the logs uploaded to S3 too.

[1] http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-macosx64/1428554431/mozilla-inbound_snowleopard_test-mochitest-browser-chrome-3-bm108-tests1-macosx-build710.txt.gz
(In reply to Jonathan Griffin (:jgriffin) from comment #7)
> (In reply to Ed Morley [:emorley] (formerly :edmorley) from comment #6)
> > (In reply to Jonathan Griffin (:jgriffin) from comment #5)
> 
> > 2) Find out what is creating these too large logs and stop them
> 
> I'll take care of this, and file a bug.

That came from some debug statements that were accidentally landed, but since backed out.  See https://hg.mozilla.org/integration/mozilla-inbound/rev/d2a1fc813b91

It would be tempting to try to address this in mozharness, and cap the size of the files, but there are legitimate reasons to use debug prints on try.  We could, potentially, cap the size of files on non-try branches, and turn jobs red when their logs exceed a certain size.
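A very rough, mozharness-agnostic sketch of that idea (the 500MB figure and all names are assumptions, purely to illustrate the shape of it):

import os

MAX_UPLOAD_BYTES = 500 * 1024 * 1024  # assumed cap

def should_upload_log(path, branch):
    # On try, anything goes; on other branches an oversized log is
    # skipped and the caller would turn the job red instead.
    if branch == 'try':
        return True
    return os.path.getsize(path) <= MAX_UPLOAD_BYTES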
See Also: → 1152934
(In reply to Jonathan Griffin (:jgriffin) from comment #10)
> It would be tempting to try to address this in mozharness, and cap the size
> of the files, but there are legitimate reasons to use debug prints on try. 
> We could, potentially, cap the size of files on non-try branches, and turn
> jobs red when their logs exceed a certain size.

The problem is that unless the person has only requested a handful of jobs, it's potentially 400 jobs x 1+ GB uploaded to S3, which seems expensive from an S3 point of view, let alone anything else.

I think at the very least, it should require an in-gecko-repo change to allow uploading files that large (ie someone who knows what they are doing can remove the cap, and request a very narrow trychooser range).

Even then, 1.1GB (or larger, presuming we have no upper limit at present?) seems excessive - the dev should manipulate the harness or test manifests to only run the code they wish to see the debug output for. Or they should debug locally, or else request a machine loan :-)
Depends on: 1152934
See Also: 1152934
(In reply to Ed Morley [:emorley] (formerly :edmorley) from comment #11)
> The problem is that unless the person has only requested a handful of jobs,
> it's potentially 400 jobs x 1+ GB uploaded to S3, which seems expensive from
> an S3 point of view, let alone anything else.

Although given the logs are compressed (and the repetitive nature of log spam), the increase in stored size on S3 is presumably far smaller than the raw sizes suggest.
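For illustration only (numbers depend entirely on the log content; this isn't data from the bug) - repetitive spam compresses dramatically:

import zlib

# ~37MB of a single repeated line shrinks to a few hundred KB when
# compressed, so the stored size grows far more slowly than the raw size.
spam = b"TEST-INFO | some repeated debug line\n" * 1000000
print(len(spam), len(zlib.compress(spam, 9)))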
Blocks: 1160117
So, I'm rather against setting a maximum size for the uncompressed log: the fact that mozharness imposes one on the uncompressed unstructured logs has caused me no end of problems for, as far as I can tell, very little gain. Limiting the compressed size makes a lot more sense, because obviously having hundreds of 1GB gz files with accidental memory dumps in them, or whatever, isn't going to help anyone.

Ideally treeherder could cap the maximum time it would spend processing a log (or, if that's too affected by machine load or whatever, the maximum number of lines/bytes it's prepared to process) and simply stop parsing at that point, with some UI to indicate that's what happened.
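A rough sketch of that budget idea (the limits and names are assumptions, not Treeherder's actual parser API):

MAX_LINES = 1000000
MAX_BYTES = 200 * 1024 * 1024

def parse_with_budget(lines, handle_line):
    # Parse until the line/byte budget is exhausted, then stop and let
    # the caller surface "parsing truncated" in the UI.
    seen_lines = 0
    seen_bytes = 0
    for line in lines:
        seen_lines += 1
        seen_bytes += len(line)
        if seen_lines > MAX_LINES or seen_bytes > MAX_BYTES:
            return True   # truncated
        handle_line(line)
    return False          # parsed everything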
I think Treeherder should set a max size (or time, or whatever) - and that size shouldn't be too high. People can still download the compressed file from S3 using the links associated with a job (note this bug isn't bug 1152934). I think the "occasional developer having no choice but to produce noisy logs" case is much rarer than the "something broke on *every* job for half a dozen pushes in a row, and we now have 2000+ logs that are massive and choking the log parser" case.

That max limit need not be as low as buildbot's 50MB limit of course, which is a bit too low.
Reading your comment back it seems in fact that we were agreeing anyway; ignore :-)
Maybe we should also investigate https://github.com/esnme/ultrajson over simplejson as we move towards using more JSON style logs?
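A throwaway micro-benchmark of the kind we'd want before switching (results depend on the log shape and library versions; simplejson exposes the same loads() call, and ujson's basic loads/dumps API matches):

import json
import timeit

import ujson  # pip install ujson

sample = json.dumps({"action": "test_status", "test": "browser_foo.js",
                     "subtest": None, "status": "PASS",
                     "message": "x" * 200})

print("json ", timeit.timeit(lambda: json.loads(sample), number=100000))
print("ujson", timeit.timeit(lambda: ujson.loads(sample), number=100000))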
We no longer ingest the full structured log, only the error summary - the code for this was removed in:
https://github.com/mozilla/treeherder/commit/7e4084d24fcf6d543fde9d9a2515be1e4072ee70
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX