Closed Bug 1152742 Opened 9 years ago Closed 8 years ago

Determine why json log parsing is now taking 80x longer than before & re-enable it

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P3)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: emorley, Unassigned)

In bug 1152681, the trees ended up being closed because log parsing was falling behind. This was found to be due to the parse-json-log task taking 80x longer than it had previously, so it was temporarily disabled to let the trees reopen.

This bug is for:
* Figuring out why the parse-json-log task is taking so much longer as of ~5.5 hours ago (the task duration jumped at the same time on both stage and prod). See bug 1152681 for the initial investigation findings.
* Fixing the cause, through likely one or more of:
 - stopping taskcluster from spamming us with malformed logs (or with as many logs as it currently does)
 - increasing the number of log processor nodes
 - tweaking the workers/queues so we actually prioritise the log_parser_fail and log_parser_hp queues (currently we round-robin between all queues, since that's celery's default behaviour); see the sketch after this list
 - profiling the json log parsing to see if we can improve performance - we could even do this on new relic: https://docs.newrelic.com/docs/apm/applications-menu/events/thread-profiler-dashboard#starting
 - fixing bug 1080760 so that taskcluster et al do the work themselves, and instead only submit the end result to us.
* Re-enabling json log parsing and removing the xfail annotation from the test. ie reverting these two commits:
https://github.com/mozilla/treeherder-service/commit/0ce875128189e783f892bc63bcf73fdb56557f09
https://github.com/mozilla/treeherder-service/commit/94a081e480f8f45c1434dda0c44fe074938c5ed8
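A minimal sketch of the queue-prioritisation idea above (Celery 3.x-style setting names; the extra queue names, task module paths and worker concurrency are assumptions, not our actual config) - declare the queues explicitly and point dedicated workers at the important ones with -Q, rather than letting one worker consume every queue round-robin:

from kombu import Queue

# Assumed settings sketch; the task paths below are illustrative, not
# Treeherder's real module layout.
CELERY_QUEUES = (
    Queue('default'),
    Queue('log_parser'),
    Queue('log_parser_fail'),
    Queue('log_parser_hp'),
)
CELERY_DEFAULT_QUEUE = 'default'
CELERY_ROUTES = {
    'log_parser.tasks.parse_log_fail': {'queue': 'log_parser_fail'},
    'log_parser.tasks.parse_log_hp': {'queue': 'log_parser_hp'},
}

# Then run e.g. (concurrency values made up):
#   celery worker -Q log_parser_fail,log_parser_hp --concurrency=4
#   celery worker -Q default,log_parser --concurrency=2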

A trace of one of the 170s json log parsing tasks:
https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?tw[end]=1428570447&tw[start]=1428566847#id=5b224f746865725472616e73616374696f6e2f43656c6572792f70617273652d6a736f6e2d6c6f67222c22225d&sort_by=throughput&app_trace_id=6387634562&tab-detail_6387634562=trace_details
The most relevant findings from bug 1152681:

* The parse-json-log task duration spike occurred at the same time on both stage and prod, even though we hadn't deployed anything, which to me indicates a change in what's being submitted to us (or else a puppet system update or similar, but I'm fairly sure that's less likely).
* The number of HTTP 404s for log parsing went down at around the time the parse-json-log load went up - perhaps taskcluster or another submitter fixed the log URLs being given to us?
* We started seeing MemoryError() exceptions during the json parse task for the first time:
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3528329220/similar_errors?original_error_id=3528329220&tw[end]=1428580182&tw[start]=1428493782
* 90+% of the parse-json-log traces are inside the task, not waiting for AWS or the DB etc.
Blocks: 1074927
Depends on: 1080760, 1079286, 1076761
Depends on: 1152769
We should also set a lower timeout (well, lower than the default celery value) for these tasks, so no single task can run away and monopolise resources. (Though we'll need to make sure the task doesn't get retried 10 times after timing out -> does celery even have an overall timeout for a task?)
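To make that concrete, a hedged Celery sketch (the limit values, task name and body are assumptions): soft_time_limit raises SoftTimeLimitExceeded inside the task so it can give up cleanly, time_limit hard-kills the worker child shortly afterwards, and max_retries bounds how many times the task can be requeued via retry - as far as I can tell Celery has no single overall budget spanning retries, only these per-attempt limits plus the retry cap.

from celery import shared_task
from celery.exceptions import SoftTimeLimitExceeded

@shared_task(bind=True, soft_time_limit=300, time_limit=330, max_retries=3)
def parse_json_log(self, log_url):
    # Hypothetical name/signature; the real Treeherder task differs.
    try:
        pass  # download and parse the log here
    except SoftTimeLimitExceeded:
        # Hit the soft limit: bail out rather than letting this log
        # monopolise a worker or get retried endlessly.
        return None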
The log in question from the comment 0 170s trace:
http://mozilla-releng-blobs.s3.amazonaws.com/blobs/mozilla-inbound/sha512/bb932f6f882913a7b21bb566c84ffe0bb2b9f55b90e702d1db5e4f480126355c9d0379d1b7cb576776b27cf7d1ed252e67bfb1496870c9ce5b6103d0aaddef4d

(To get log URLs, click trace details, then the lens next to the "requests[mozilla-releng-blobs.s3.amazonaws.com]" row)
For the log in comment 3:

[~/src]$ ll -h browser-chrome-chunked_raw.log
-rwxrwx---+ 1 Ed None 1.4G Apr  9 19:17 browser-chrome-chunked_raw.log*
[~/src]$ wc -l browser-chrome-chunked_raw.log
7784944 browser-chrome-chunked_raw.log

...well I guess that answers that one!

We should set a maximum filesize above which we don't attempt to download/parse.
Can we tell if this log came from a try run?  If not, we should investigate, since we shouldn't even be producing logs of that size.
(In reply to Jonathan Griffin (:jgriffin) from comment #5)
> Can we tell if this log came from a try run?  If not, we should investigate,
> since we shouldn't even be producing logs of that size.

The URL contains "mozilla-inbound".

I think we should:
1) Set a max response size (a la http://stackoverflow.com/questions/23514256/http-request-with-timeout-maximum-size-and-connection-pooling); a rough sketch follows below
2) Find out what is creating these too large logs and stop them
3) Set a shorter timeout on the log parsing jobs in general (with the caveat of comment 2)
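A rough sketch of 1) and 3) combined, along the lines of that Stack Overflow answer (the 100MB cap, 30s timeout and function name are assumptions, not actual Treeherder values):

import requests

MAX_LOG_BYTES = 100 * 1024 * 1024  # assumed cap
TIMEOUT_SECS = 30                  # assumed per-request timeout

def fetch_log_capped(url):
    resp = requests.get(url, stream=True, timeout=TIMEOUT_SECS)
    resp.raise_for_status()

    # Reject obviously huge logs before downloading anything.
    length = resp.headers.get('Content-Length')
    if length is not None and int(length) > MAX_LOG_BYTES:
        raise ValueError('log too large: %s bytes' % length)

    # Content-Length can be absent or wrong, so also enforce the cap
    # while streaming.
    chunks = []
    received = 0
    for chunk in resp.iter_content(chunk_size=64 * 1024):
        received += len(chunk)
        if received > MAX_LOG_BYTES:
            raise ValueError('log exceeded %d bytes' % MAX_LOG_BYTES)
        chunks.append(chunk)
    return b''.join(chunks)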
(In reply to Ed Morley [:emorley] (formerly :edmorley) from comment #6)
> (In reply to Jonathan Griffin (:jgriffin) from comment #5)

> 2) Find out what is creating these too large logs and stop them

I'll take care of this, and file a bug.
(In reply to Ed Morley [:emorley] (formerly :edmorley) from comment #6)
> I think we should:
4) Make the harness not allow logs of that size, no matter how much the tests output

(In reply to Jonathan Griffin (:jgriffin) from comment #7)
> I'll take care of this, and file a bug.

Execute:
> SELECT job_id FROM mozilla_inbound_jobs_1.job_log_url WHERE url = "http://mozilla-releng-blobs.s3.amazonaws.com/blobs/mozilla-inbound/sha512/bb932f6f882913a7b21bb566c84ffe0bb2b9f55b90e702d1db5e4f480126355c9d0379d1b7cb576776b27cf7d1ed252e67bfb1496870c9ce5b6103d0aaddef4d" ORDER BY id desc LIMIT 1

+---------+
| job_id  |
+---------+
| 8643929 |
+---------+
1 row


https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=8643929
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=a3f6306dd05e
The "normal" log [1] for that job exceeded the 50MB buildbot cap, and the whole push was pretty broken. Seems like we just need to set a cap for the logs uploaded to S3 too.

[1] http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-macosx64/1428554431/mozilla-inbound_snowleopard_test-mochitest-browser-chrome-3-bm108-tests1-macosx-build710.txt.gz
(In reply to Jonathan Griffin (:jgriffin) from comment #7)
> (In reply to Ed Morley [:emorley] (formerly :edmorley) from comment #6)
> > (In reply to Jonathan Griffin (:jgriffin) from comment #5)
> 
> > 2) Find out what is creating these too large logs and stop them
> 
> I'll take care of this, and file a bug.

That came from some debug statements that were accidentally landed, but since backed out.  See https://hg.mozilla.org/integration/mozilla-inbound/rev/d2a1fc813b91

It would be tempting to try to address this in mozharness, and cap the size of the files, but there are legitimate reasons to use debug prints on try.  We could, potentially, cap the size of files on non-try branches, and turn jobs red when their logs exceed a certain size.
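A very rough, mozharness-agnostic sketch of that idea (the 500MB figure and all names are assumptions, purely to illustrate the shape of it):

import os

MAX_UPLOAD_BYTES = 500 * 1024 * 1024  # assumed cap

def should_upload_log(path, branch):
    # On try, anything goes; on other branches an oversized log is
    # skipped and the caller would turn the job red instead.
    if branch == 'try':
        return True
    return os.path.getsize(path) <= MAX_UPLOAD_BYTES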
See Also: → 1152934
(In reply to Jonathan Griffin (:jgriffin) from comment #10)
> It would be tempting to try to address this in mozharness, and cap the size
> of the files, but there are legitimate reasons to use debug prints on try. 
> We could, potentially, cap the size of files on non-try branches, and turn
> jobs red when their logs exceed a certain size.

The problem is that unless the person has only requested a handful of jobs, it's potentially 400 jobs x 1+ GB uploaded to S3, which seems expensive from an S3 point of view, let alone anything else.

I think at the very least, it should require an in-gecko-repo change to allow uploading files that large (ie someone who knows what they are doing can remove the cap, and request a very narrow trychooser range).

Even then, 1.1GB (or larger, presuming we have no upper limit at present?) seems excessive - the dev should manipulate the harness or test manifests to only run the code they wish to see the debug output for. Or they should debug locally, or else request a machine loan :-)
Depends on: 1152934
See Also: 1152934
(In reply to Ed Morley [:emorley] (formerly :edmorley) from comment #11)
> The problem is that unless the person has only requested a handful of jobs,
> it's potentially 400 jobs x 1+ GB uploaded to S3, which seems expensive from
> an S3 point of view, let alone anything else.

Although given the logs are compressed (and the repetitive nature of log spam), the increase in stored size on S3 is presumably far smaller than the raw sizes suggest.
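For illustration only (numbers depend entirely on the log content; this isn't data from the bug) - repetitive spam compresses dramatically:

import zlib

# ~37MB of a single repeated line shrinks to a few hundred KB when
# compressed, so the stored size grows far more slowly than the raw size.
spam = b"TEST-INFO | some repeated debug line\n" * 1000000
print(len(spam), len(zlib.compress(spam, 9)))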
Blocks: 1160117
So, I'm rather against setting a maximum size for the uncompressed log: the fact that mozharness imposes one on the uncompressed unstructured logs has caused me no end of problems for, as far as I can tell, very little gain. Limiting the compressed size makes a lot more sense, because obviously having hundreds of 1GB gz files with accidental memory dumps in them, or whatever, isn't going to help anyone.

Ideally treeherder could cap the maximum time it would spend processing a log (or, if that's too affected by machine load or whatever, the maximum number of lines/bytes it's prepared to process) and simply stop parsing at that point, with some UI to indicate that's what happened.
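A rough sketch of that budget idea (the limits and names are assumptions, not Treeherder's actual parser API):

MAX_LINES = 1000000
MAX_BYTES = 200 * 1024 * 1024

def parse_with_budget(lines, handle_line):
    # Parse until the line/byte budget is exhausted, then stop and let
    # the caller surface "parsing truncated" in the UI.
    seen_lines = 0
    seen_bytes = 0
    for line in lines:
        seen_lines += 1
        seen_bytes += len(line)
        if seen_lines > MAX_LINES or seen_bytes > MAX_BYTES:
            return True   # truncated
        handle_line(line)
    return False          # parsed everything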
I think Treeherder should set a max size (or time, or whatever) - and that size shouldn't be too high. People can still download the compressed file from S3 using the links associated with a job (note this bug isn't bug 1152934). I think the "occasional developer having no choice but to produce noisy logs" case is much rarer than the "something broke on *every* job for half a dozen pushes in a row, and we now have 2000+ logs that are massive and choking the log parser" case.

That max limit need not be as low as buildbot's 50MB limit of course, which is a bit too low.
Reading your comment back it seems in fact that we were agreeing anyway; ignore :-)
Maybe we should also investigate https://github.com/esnme/ultrajson over simplejson as we move towards using more JSON style logs?
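A throwaway micro-benchmark of the kind we'd want before switching (results depend on the log shape and library versions; simplejson exposes the same loads() call, and ujson's basic loads/dumps API matches):

import json
import timeit

import ujson  # pip install ujson

sample = json.dumps({"action": "test_status", "test": "browser_foo.js",
                     "subtest": None, "status": "PASS",
                     "message": "x" * 200})

print("json ", timeit.timeit(lambda: json.loads(sample), number=100000))
print("ujson", timeit.timeit(lambda: ujson.loads(sample), number=100000))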
We no longer ingest the full structured log, only the error summary - the code for this was removed in:
https://github.com/mozilla/treeherder/commit/7e4084d24fcf6d543fde9d9a2515be1e4072ee70
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX