Bug 1152681
Opened 9 years ago • Closed 9 years ago
Log parsing falling behind on prod (and stage) due to json log parsing taking 80x longer than normal
Categories: Tree Management :: Treeherder: Infrastructure (defect, P1)
Status: RESOLVED FIXED
People: (Reporter: emorley, Assigned: emorley)
Attachments: (1 file)
Description

Starting ~2.5 hrs ago, seemingly simultaneously on prod and stage, the "messages available" count started climbing:
https://rpm.newrelic.com/accounts/677903/dashboard/6293241?tw[end]=1428569263&tw[start]=1428547663
https://rpm.newrelic.com/accounts/677903/dashboard/6293252?tw[end]=1428569223&tw[start]=1428547623

I've:
* restarted the log workers on stage and prod
* restarted _all_ processes on prod
* checked that the prod processor nodes are actually doing work (ie CPU isn't flat because a worker has disconnected), and they seem fine:
  https://rpm.newrelic.com/accounts/677903/servers/4303171
  https://rpm.newrelic.com/accounts/677903/servers/4303155
  https://rpm.newrelic.com/accounts/677903/servers/5313432
  (if anything, they've been doing more work since 7am today, implying an influx of tasks)
* purged the log_parser and log_parser_json queues, since they are less urgent, to give the others a chance to catch up

Things that seem strange:
* The memory usage on the processor nodes went from flat to noisy:
  https://rpm.newrelic.com/accounts/677903/servers/4303171?tw[end]=1428569644&tw[start]=1428483244
  Though this correlates with the nodes doing more work; the amount of work (at least the CPU usage on the nodes) has jumped quite a bit.

(Still looking)
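For reference, purging a single named queue (rather than every queue) can be done over the broker connection. A minimal sketch, assuming a Celery app object and the queue names mentioned above; the broker URL is a placeholder:

    # Sketch: purge specific Celery queues via the broker connection.
    # The broker URL is hypothetical; the queue names are the ones above.
    from celery import Celery

    app = Celery("treeherder", broker="amqp://guest@localhost//")

    def purge_queues(queue_names):
        """Drop all pending messages from the given queues."""
        with app.connection_for_write() as conn:
            for name in queue_names:
                count = conn.default_channel.queue_purge(name)
                print("purged %d messages from %s" % (count, name))

    if __name__ == "__main__":
        purge_queues(["log_parser", "log_parser_json"])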
Comment 1 • 9 years ago (Assignee)
Ok, so for whatever reason the parse-json-log task is now dominating the wall clock time graph:
https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?tw[end]=1428570274&tw[start]=1428548674#id=5b224f746865725472616e73616374696f6e2f43656c6572792f70617273652d6c6f67222c22225d&sort_by=throughput

The requests per minute look fairly constant, but the task time has rocketed:
https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?tw[end]=1428570274&tw[start]=1428548674#id=5b224f746865725472616e73616374696f6e2f43656c6572792f70617273652d6a736f6e2d6c6f67222c22225d&sort_by=throughput

A trace of a task that took 170s, 96% of which was inside parse-json-log (needs more annotation to show deeper inside, but at least we know it's presumably not the AWS fetch that is taking ages):
https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?tw[end]=1428570447&tw[start]=1428566847#id=5b224f746865725472616e73616374696f6e2f43656c6572792f70617273652d6a736f6e2d6c6f67222c22225d&sort_by=throughput&app_trace_id=6387634562&tab-detail_6387634562=trace_details
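The "needs more annotation" part could be addressed with the New Relic Python agent's function-level tracing, so the transaction trace breaks down where the time goes inside the task. A minimal sketch; the helper names here are hypothetical, not Treeherder's actual code:

    # Sketch: per-function New Relic instrumentation, so the transaction
    # trace shows time inside the task rather than one opaque block.
    # `fetch_log` and `parse_json_log` are illustrative names.
    import newrelic.agent

    @newrelic.agent.function_trace()
    def fetch_log(url):
        ...  # download the gzipped log from S3

    @newrelic.agent.function_trace()
    def parse_json_log(url):
        data = fetch_log(url)
        ...  # decode, parse, and store the log artifacts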
Comment 2 • 9 years ago
Closed all trees, since there is currently a huge log parsing backlog affecting all trees.
Comment 3 • 9 years ago (Assignee)
The same time yesterday:
https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?tw[end]=1428484020&tw[start]=1428480420#id=5b224f746865725472616e73616374696f6e2f43656c6572792f70617273652d6a736f6e2d6c6f67222c22225d&sort_by=throughput
...the parse-json-log tasks took between 1s and 4s. For the parse-json-log component of the trace, the average time jumped from 496ms to 37,000ms.

This seems to correlate with the HTTP 404 errors declining here:
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors?utf8=%E2%9C%93&search[name]=urllib2%3AHTTPError&search[sort_by]=count&search[type]=all

ie: I believe the log URLs being provided by either taskcluster or buildbot were previously 404ing, and now they are not, so the parse-json-log task went from being a near no-op much of the time to having to do a lot of work.
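That would explain the magnitude of the jump: a 404 short-circuits the task almost immediately, whereas a live log has to be fully downloaded and parsed. A sketch of that shape (illustrative only; `parse_lines` is a hypothetical helper, not Treeherder's actual code):

    # Sketch: why a 404 made parse-json-log a near no-op.
    import urllib.request
    from urllib.error import HTTPError

    def parse_lines(data):
        ...  # hypothetical: decode and store each log line

    def parse_json_log(log_url):
        try:
            response = urllib.request.urlopen(log_url)
        except HTTPError as e:
            if e.code == 404:
                # Log not uploaded (yet): nothing to do, returns in ms.
                return None
            raise
        # Log exists: the expensive path - download, decompress, and
        # parse every line, which is where the ~37s now goes.
        return parse_lines(response.read())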
Comment 4 • 9 years ago (Assignee)
Short term I can just disable the json log parsing added by bug 1113873 a few weeks ago. It's not yet clear whether the json parsing needs optimisation, or if TC is just spamming treeherder in a way it shouldn't - but either way the parsed json logs are not being used for anything at the moment, and they are causing normal log parsing to get behind.
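A minimal sketch of what "disable" could look like, gating the json parsing behind a flag rather than deleting the code; every name here is hypothetical, not the actual commit:

    # Sketch: skip scheduling json log parsing behind a feature flag.
    PARSE_JSON_LOGS = False  # flipped off as the short-term fix

    def schedule_log_parsing(job_id, log_url, log_type):
        if log_type == "mozlog_json" and not PARSE_JSON_LOGS:
            # Temporarily skipped (this bug): the task is running ~80x
            # slower than before and starving the standard parsing queues.
            return
        enqueue_parse_task(job_id, log_url, log_type)

    def enqueue_parse_task(job_id, log_url, log_type):
        ...  # hand the work off to the relevant Celery queue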
Summary: Log parsing behind on prod (and stage) → Investigate log parsing falling behind on prod (and stage)
Comment 5 • 9 years ago (Assignee)
Assignee: nobody → emorley
Status: NEW → ASSIGNED
Comment 6 • 9 years ago
Commit pushed to master at https://github.com/mozilla/treeherder-service
https://github.com/mozilla/treeherder-service/commit/0ce875128189e783f892bc63bcf73fdb56557f09

Bug 1152681 - Temporarily disable the parse-json-log task

Since it is taking 80x longer than it did previously and is causing backlogs in the standard log parsing tasks too.
Comment 7 • 9 years ago (Assignee)
Deployed & json queue + non-failure queue purged to expedite the catch up - queue size is falling & should be good to reopen now/soon :-)
Comment 8 • 9 years ago
(In reply to Ed Morley [:edmorley] from comment #7)
> Deployed & json queue + non-failure queue purged to expedite the catch up -
> queue size is falling & should be good to reopen now/soon :-)

reopened :)
Comment 9 • 9 years ago
Commit pushed to master at https://github.com/mozilla/treeherder-service
https://github.com/mozilla/treeherder-service/commit/94a081e480f8f45c1434dda0c44fe074938c5ed8

Bug 1152681 - Mark test_parse_mozlog_log as xfail to make Travis green

We've disabled json log parsing, so we need to temporarily mark the corresponding test as expected fail.
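For reference, pytest's expected-failure marker looks like this; the test body is illustrative, since the actual test isn't quoted in this bug:

    # Sketch: mark the test as expected-fail while json log parsing is
    # disabled, keeping the Travis run green.
    import pytest

    @pytest.mark.xfail(reason="json log parsing disabled - bug 1152681")
    def test_parse_mozlog_log():
        ...  # exercises the (currently disabled) mozlog parser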
Updated • 9 years ago (Assignee)
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Summary: Investigate log parsing falling behind on prod (and stage) → Log parsing falling behind on prod (and stage) due to json log parsing taking 80x longer than normal