log parsing falling behind
Categories: Tree Management :: Treeherder, defect, P1
Tracking: Not tracked
People: Reporter: aryx, Unassigned
Attachments: 1 file
As discussed in https://chat.mozilla.org/#/room/#treeherder:mozilla.org, log parsing is falling behind.
Armen, Cam and Sarah are investigating; restarting the dynos only helped for a short time.
Comment 1•5 years ago
Long story short.
There's something in the queued logs that is making the log parsing workers run out of memory and die within 1-3 minutes of starting up; as a result, we have barely any throughput.
Our current logging (including DEBUG logging) has not yielded anything.
New Relic error analytics has not yielded anything.
The databases on Amazon are not having any trouble as far as we can tell.
We're deploying some code on treeherder-stage to gather some metrics and find the root cause. Its queue is still polluted, which should help us reproduce the problem (clearing the queue on treeherder-prototype proved to help).
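For context, a common mitigation for worker processes that balloon in memory is Celery's per-child recycling options, which kill and replace a worker child once it passes a memory cap so one bad task doesn't take down the whole dyno. A minimal sketch, assuming a Celery 4+ settings module; the setting values are illustrative, not Treeherder's actual configuration:

```python
# Hypothetical Celery worker settings (NOT Treeherder's actual config).
# Celery recycles a worker child process once it exceeds the memory cap,
# so a leaky or oversized log parse kills one child instead of the dyno.
CELERY_MEMORY_SETTINGS = {
    # Restart a child after it has used roughly 500 MB (value is in KiB).
    "worker_max_memory_per_child": 500_000,
    # Also recycle after a fixed number of tasks, as a second safety net.
    "worker_max_tasks_per_child": 100,
}
```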
Comment 2•5 years ago
Comment 3•5 years ago
We've promoted a change that splits the log parsing worker into four separate ones, one for each Celery queue. Now only the workers processing failed jobs are having memory issues.
We can probably re-open the trees while we continue to investigate what's happening when we parse those failed logs.
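Splitting one worker pool into per-queue workers is typically done with Celery task routing plus a dedicated worker consuming each queue. A sketch under assumed task names — only the `log_parser_fail` queue name appears later in this bug; everything else here is illustrative:

```python
# Hypothetical Celery routing sketch: send each log-parsing task to its
# own queue so a dedicated worker (dyno) can be scaled independently.
# Only "log_parser_fail" is a queue name mentioned in this bug; the task
# names and the other queue are made up for illustration.
task_routes = {
    "log_parser.tasks.parse_log": {"queue": "log_parser"},
    "log_parser.tasks.parse_failed_log": {"queue": "log_parser_fail"},
}

def queue_for(task_name: str, default: str = "celery") -> str:
    """Resolve which queue a task would land on under this routing."""
    route = task_routes.get(task_name)
    return route["queue"] if route else default
```

Each worker is then started against a single queue (e.g. `celery worker -Q log_parser_fail`), so a memory blow-up in failed-log parsing no longer stalls the other queues.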
Comment 4•5 years ago
The trees got reopened a few minutes ago.
We haven't found the root cause yet; we will keep looking into it tomorrow.
Updated•5 years ago
Comment 5•5 years ago
We split the dynos up so that each queue has its own worker. When I attempted to turn on the worker that was the main issue here, log_parser_fail, we continued to get memory errors: R14 ("memory quota exceeded") and eventually R15 ("memory quota vastly exceeded"). An R15 triggers a reboot of the dyno.
So I opted to run that worker on a single Performance dyno instance. These are equivalent to 8 dynos and more expensive, but provide more memory headroom, which seems to have alleviated the rebooting. Since then, we haven't had any R15 reboots on that dyno.
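For readers unfamiliar with Heroku's memory error codes: R14 fires once a dyno passes its memory quota and starts swapping (slow but alive), while R15 means the quota was vastly exceeded and Heroku kills and restarts the dyno. A sketch of that behavior — the 2x multiplier used for R15 here is an assumption for illustration, not a documented constant:

```python
from typing import Optional

# Sketch of the Heroku memory error codes described above.
# The exact R15 threshold is assumed to be 2x quota for illustration.
def memory_error_code(used_mb: float, quota_mb: float) -> Optional[str]:
    if used_mb > 2 * quota_mb:
        return "R15"  # "Memory quota vastly exceeded": dyno is restarted
    if used_mb > quota_mb:
        return "R14"  # "Memory quota exceeded": dyno swaps and slows down
    return None       # within quota
```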
Comment 6•5 years ago
Log parsing slowed down to 6-9 minutes.
Comment 7•5 years ago
Cam and I changed the dyno type from Perf-M (equivalent to 8 dynos) to 2X (2 dynos).
I also increased the 2X worker count from 1 worker to 3 workers.
No memory issues.
The queue has cleared up.
We're back to business.
Updated•5 years ago
Updated•4 years ago