Closed Bug 1641645 Opened 5 years ago Closed 5 years ago

log parsing falling behind

Categories

(Tree Management :: Treeherder, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aryx, Unassigned)

References

Details

Attachments

(1 file)

As discussed in https://chat.mozilla.org/#/room/#treeherder:mozilla.org log parsing is falling behind.

Armen, Cam and Sarah are investigating, restarting the dynos only helped for a short time.

Long story short.

There's something in the queued logs that makes the log parsing workers run out of memory and die within 1-3 minutes of starting up, so we have barely any throughput.

Our current logging (including DEBUG logging) has not yielded anything.
New Relic error analytics has not yielded anything.
The databases on Amazon are not having any trouble as far as we can tell.

We're deploying some code on treeherder-stage to gather metrics and find the root cause. Stage still has a polluted queue, which should help us reproduce the problem (clearing the queue on treeherder-prototype proved to help).

We've promoted a change to split the log parsing worker into 4 different ones, one for each Celery queue. Now only the workers processing failed jobs are having memory issues.
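The split can be sketched as Celery-style task routing, with one queue per worker. This is a minimal illustration, not the actual Treeherder code; the task and queue names other than log_parser_fail (mentioned later in this bug) are assumptions.

```python
# Hypothetical sketch: route each log-parsing task type to its own queue,
# so each worker dyno can be scaled and monitored independently.
# Only the log_parser_fail queue name is confirmed in this bug report.
task_routes = {
    "log-parser": {"queue": "log_parser"},
    "store-failure-lines": {"queue": "store_failure_lines"},
    "crossreference-error-lines": {"queue": "crossreference"},
    "log-parser-fail": {"queue": "log_parser_fail"},
}

def queue_for(task_name, routes=task_routes):
    """Return the queue a task is routed to (a default queue otherwise)."""
    return routes.get(task_name, {}).get("queue", "default")
```

With a mapping like this, the failed-job parser can run on its own dynos, so its memory problems no longer starve the other three queues.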

We can probably re-open the trees while we continue investigating what happens when we parse those failed logs.

The trees got reopened a few minutes ago.

We haven't identified the root cause yet. We will keep looking into it tomorrow.

Severity: S1 → S2
Summary: log parsing falling behind - autoland closed → log parsing falling behind
See Also: → 1641884

We split the dynos up to one worker per queue. When I attempted to turn back on the one worker that was the main issue here, log_parser_fail, we continued to get memory errors: R14 ("memory quota exceeded") and eventually R15 ("memory quota vastly exceeded"). R15 triggers a reboot of the dyno.
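For context, R14 and R15 are Heroku memory-quota conditions, and the behavior described above can be sketched roughly as follows. The 512 MB quota and the 2x threshold for R15 are illustrative assumptions; the real quota depends on the dyno size.

```python
def memory_status(rss_mb, quota_mb=512):
    """Rough classification mirroring Heroku's memory error codes.

    quota_mb=512 assumes a standard-1x dyno; the 2x cutoff for R15
    is an assumption for illustration only.
    """
    if rss_mb > 2 * quota_mb:
        return "R15"  # memory quota vastly exceeded: dyno is restarted
    if rss_mb > quota_mb:
        return "R14"  # memory quota exceeded: swapping, but keeps running
    return "ok"
```

In this model the log_parser_fail worker was climbing past R14 into R15 territory within minutes, which is why it kept rebooting and never made progress on the queue.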

So I opted to move the worker to a single performance-tier dyno. These are equivalent to 8 standard dynos and more expensive, but they give more memory headroom, which seemed to alleviate the rebooting. Since then, we haven't had any R15 reboots on that dyno.

Log parsing slowed down to 6-9 minutes.

Cam and I changed the node from Perf-m (8 dynos) to 2x (2 dynos).
I moved the 2x worker from 1 worker to 3 workers.
No memory issues.
The queue has cleared up.
We're back to business.
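The resizing and scaling steps above map onto Heroku CLI commands roughly like the following. This is a sketch only: the app name and process-type name are assumptions, not taken from the actual Treeherder deployment.

```shell
# Hypothetical commands for the dyno changes described above.
# Resize the failed-log parser process type from performance-m to standard-2x:
heroku ps:resize worker_log_parser_fail=standard-2x --app treeherder-prod

# Scale that process type from 1 dyno to 3:
heroku ps:scale worker_log_parser_fail=3 --app treeherder-prod
```

Trading one large dyno for several smaller ones keeps total memory similar while adding parallelism, which is consistent with the queue clearing afterwards.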

See Also: → 1641958
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Component: Treeherder: Log Parsing & Classification → TreeHerder
