Closed Bug 1152681 Opened 9 years ago Closed 9 years ago

Log parsing falling behind on prod (and stage) due to json log parsing taking 80x longer than normal

Categories: Tree Management :: Treeherder: Infrastructure, defect, P1
Tracking: Not tracked
Status: RESOLVED FIXED
People: (Reporter: emorley, Assigned: emorley)
Attachments: (1 file)

Starting ~2.5 hrs ago, seemingly simultaneously on prod and stage, the "messages available" count started climbing:
https://rpm.newrelic.com/accounts/677903/dashboard/6293241?tw[end]=1428569263&tw[start]=1428547663
https://rpm.newrelic.com/accounts/677903/dashboard/6293252?tw[end]=1428569223&tw[start]=1428547623

I've:
* restarted the log workers on stage and prod
* restarted _all_ processes on prod
* checked that the prod processor nodes are actually doing work (i.e. CPU isn't flat because a worker has disconnected), and they are seemingly fine:
https://rpm.newrelic.com/accounts/677903/servers/4303171
https://rpm.newrelic.com/accounts/677903/servers/4303155
https://rpm.newrelic.com/accounts/677903/servers/5313432
(if anything, they've been doing more work since 7am today, implying an influx of tasks)
* purged the log_parser and log_parser_json queues, since they are less urgent, to give the others a chance to catch up

Things that seem strange:
* The memory usage on the processor nodes went from being flat to being noisy:
https://rpm.newrelic.com/accounts/677903/servers/4303171?tw[end]=1428569644&tw[start]=1428483244

Though this correlates with the nodes doing more work... in fact the amount of work seems to have jumped quite a bit (at least judging by the CPU usage on the nodes)...

(Still looking)
Closed all trees, since there is currently a huge parsing backlog affecting all trees.
The same time yesterday:
https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?tw[end]=1428484020&tw[start]=1428480420#id=5b224f746865725472616e73616374696f6e2f43656c6572792f70617273652d6a736f6e2d6c6f67222c22225d&sort_by=throughput

...the parse-json-log tasks took between 1 and 4 seconds.

For the parse-json-log component of the trace, the average time jumped from 496ms to 37,000ms.
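As a sanity check on the "80x" figure in the summary, the jump from the two averages above works out to roughly 75x:

```python
# Average parse-json-log component times, from the New Relic traces above.
before_ms = 496
after_ms = 37_000

slowdown = after_ms / before_ms
print(round(slowdown, 1))  # 74.6
```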

This seems to correlate with the HTTP 404 errors declining here:
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors?utf8=%E2%9C%93&search[name]=urllib2%3AHTTPError&search[sort_by]=count&search[type]=all

ie: I believe the log URLs being provided by either taskcluster or buildbot were previously 404ing, and now they are not, so the parse-json-log task went from being a near no-op much of the time, to having to do a lot of work perhaps?
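That hypothesis can be illustrated with a sketch (this is not Treeherder's actual code; the function and its shape are invented for illustration): when the log URL 404s, the fetch raises immediately and the expensive parse never runs, so the task is a near no-op.

```python
import urllib.error

def parse_json_log(fetch):
    """Fetch a log and parse it; a 404 short-circuits before any parsing."""
    try:
        body = fetch()
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return "skipped"  # near no-op: nothing to parse
        raise
    return "parsed %d bytes" % len(body)

def fetch_missing():
    # Stand-in for a log URL that 404s (as the TC/buildbot URLs were doing).
    raise urllib.error.HTTPError(
        "https://example.org/log.json", 404, "Not Found", None, None)

print(parse_json_log(fetch_missing))         # skipped
print(parse_json_log(lambda: b"{}" * 1000))  # parsed 2000 bytes
```

Once the URLs stopped 404ing, every task took the second (expensive) path, which would explain the sudden jump in task duration.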
Short term I can just disable the json log parsing added by bug 1113873 a few weeks ago. It's not yet clear whether the json parsing needs optimisation, or if TC is just spamming treeherder in a way it shouldn't - but either way the parsed json logs are not being used for anything at the moment, and they are causing normal log parsing to get behind.
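The short-term mitigation amounts to gating the task behind a flag. A minimal sketch, assuming a hypothetical setting and scheduler function (`PARSE_JSON_LOGS` and `schedule_json_parse` are illustrative names, not the actual Treeherder identifiers):

```python
PARSE_JSON_LOGS = False  # flip back to True once the parsing is optimised

def schedule_json_parse(job, queue, enabled):
    """Enqueue the expensive json parse only when the feature is enabled."""
    if not enabled:
        return queue  # disabled: drop the task, leave the queue alone
    queue.append(job)
    return queue

pending = []
schedule_json_parse({"log_url": "https://example.org/log.json"},
                    pending, enabled=PARSE_JSON_LOGS)
print(len(pending))  # 0 while the flag is off
```

A flag keeps the change easy to revert once the parsing is optimised, compared with deleting the task outright.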
Summary: Log parsing behind on prod (and stage) → Investigate log parsing falling behind on prod (and stage)
Assignee: nobody → emorley
Status: NEW → ASSIGNED
Commit pushed to master at https://github.com/mozilla/treeherder-service

https://github.com/mozilla/treeherder-service/commit/0ce875128189e783f892bc63bcf73fdb56557f09
Bug 1152681 - Temporarily disable the parse-json-log task

Since it is taking 80x longer than it did previously and is causing
backlogs in the standard log parsing tasks too.
Deployed & json queue + non-failure queue purged to expedite the catch up - queue size is falling & should be good to reopen now/soon :-)
(In reply to Ed Morley [:edmorley] from comment #7)
> Deployed & json queue + non-failure queue purged to expedite the catch up -
> queue size is falling & should be good to reopen now/soon :-)

reopened :)
Commit pushed to master at https://github.com/mozilla/treeherder-service

https://github.com/mozilla/treeherder-service/commit/94a081e480f8f45c1434dda0c44fe074938c5ed8
Bug 1152681 - Mark test_parse_mozlog_log as xfail to make Travis green

We've disabled json log parsing, so we need to temporarily mark the
corresponding test as expected fail.
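The Travis fix in that commit amounts to marking the test as an expected failure. A minimal pytest sketch (the test body here is illustrative, not the real `test_parse_mozlog_log`):

```python
import pytest

@pytest.mark.xfail(reason="json log parsing temporarily disabled (bug 1152681)")
def test_parse_mozlog_log():
    # With the parse-json-log task disabled, no parsed artefact is produced,
    # so this test is expected to fail until the task is re-enabled.
    assert False, "json log was not parsed"
```

Unlike `skip`, an `xfail` still runs the test, so Travis will flag it as XPASS if json parsing is quietly re-enabled.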
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Summary: Investigate log parsing falling behind on prod (and stage) → Log parsing falling behind on prod (and stage) due to json log parsing taking 80x longer than normal
Blocks: 1152742
Depends on: 1141993
