Closed
Bug 1484642
Opened 7 years ago
Closed 6 years ago
No bug suggestions generated for jobs that finish around 1am UTC each day
Categories
(Tree Management :: Treeherder, enhancement, P1)
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 1503172
People
(Reporter: aryx, Assigned: camd)
References
Details
Attachments
(1 file)
222.21 KB, image/jpeg
This has hit at least 3 different times during the last 1-2 weeks: around 1am UTC, bug suggestions for failures are often missing.
Example (also many other jobs from that push):
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=f2c0be296f6f10567864fcd0f4d2e7c5da342c34&filter-resultStatus=usercancel&filter-resultStatus=runnable&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&selectedJob=194785710
That one should suggest bug 1387666, which does get suggested for other occurrences of the same failure.
Assignee
Updated•7 years ago
Summary: No failure suggestions during some time windows → No bug suggestions during some time windows
Assignee
Comment 1•7 years ago
I got pinged by Cosmin today that bug suggestions were not showing up where they should. He mentioned a couple of bugs, Bug 1424433 and Bug 1352675, that were not showing up. Sure enough, when I queried the ``bugscache`` table for records with a summary matching those bugs, they were not there. When I manually ran the ``update_bugscache`` command and then re-queried the database, they were there.
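For reference, a minimal sketch of the kind of check described above (the import path and the summary text are assumptions; ``update_bugscache`` is the management command named in this comment):

```
# Minimal sketch, assuming the Bugscache model lives at treeherder.model.models;
# the summary keyword is a placeholder, not an actual failure line.
from django.core.management import call_command
from treeherder.model.models import Bugscache

# Look for the missing intermittent bug by a keyword from its summary.
query = Bugscache.objects.filter(summary__icontains="<failure line from bug 1424433>")
print(query.count())  # returned 0 before the manual refresh

# Re-run the Bugzilla fetch by hand, then re-query (querysets are lazy,
# so count() hits the database again).
call_command("update_bugscache")
print(query.count())  # the matching records are present afterwards
```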
Ed is way better at this stuff than I am, but looking in Papertrail, I see crashes due to memory overloads:
Aug 21 17:26:49 treeherder-prod heroku/worker_default.1: source=worker_default.1 dyno=heroku.47819539.ff36ae31-3a15-48b3-ad81-8cb99016dad4 sample#load_avg_1m=0.58 sample#load_avg_5m=0.41 sample#load_avg_15m=0.31
Aug 21 17:26:49 treeherder-prod heroku/worker_default.1: source=worker_default.1 dyno=heroku.47819539.ff36ae31-3a15-48b3-ad81-8cb99016dad4 sample#memory_total=2279.04MB sample#memory_rss=1023.82MB sample#memory_cache=0.13MB sample#memory_swap=1255.09MB sample#memory_pgpgin=3263247pages sample#memory_pgpgout=3034842pages sample#memory_quota=1024.00MB
Aug 21 17:26:49 treeherder-prod heroku/worker_default.1: Process running mem=2279M(222.6%)
Aug 21 17:26:49 treeherder-prod heroku/worker_default.1: Error R15 (Memory quota vastly exceeded)
Aug 21 17:26:49 treeherder-prod heroku/worker_default.1: Stopping process with SIGKILL
Aug 21 17:26:49 treeherder-prod heroku/worker_default.1: Process exited with status 137
Aug 21 17:26:50 treeherder-prod heroku/worker_default.1: State changed from up to crashed
Aug 21 17:26:50 treeherder-prod heroku/worker_default.1: State changed from crashed to starting
Aug 21 17:26:56 treeherder-prod heroku/worker_default.1: Starting process with command `newrelic-admin run-program celery worker -A treeherder --without-gossip --without-mingle --without-heartbeat -Q default,cycle_data,fetch_bugs,fetch_runnablejobs,generate_perf_alerts,seta_analyze_failures,intermittents_commenter --concurrency=3`
I wonder if this means that we're not successfully running the fetch_bugs process. Ed? What do you think?
Flags: needinfo?(emorley)
Assignee
Comment 2•7 years ago
Perhaps we should split that set of celery queues across one or more separate dynos? Or we could upgrade the dyno, but I'm not sure which is more cost-effective.
Comment 3•7 years ago
The log excerpt in comment 1 does show memory being exceeded; however, without more context from the lines before/after, it's hard to know whether this occurred whilst a fetch-bugs task was executing or during something else (Papertrail is down at the moment or I'd try to find that part of the log). It runs every 15 minutes, so even if one run fails, the next should succeed.
I would suspect bug 1483289 is more to blame.
In the meantime I've run an `optimize table treeherder.bugscache;` on production's RDS instance, which might help improve performance on the table.
Worth noting is that the table now also powers the intermittent failures view, as well as the bugs commenter - and they have been timing out too. Perhaps one of the three consumers is increasing load and breaking the others?
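For context, a hedged sketch (not Treeherder's actual settings; the task and setting names are assumptions) of what a "runs every 15 minutes" celery beat entry for the fetch-bugs task typically looks like:

```
# Illustrative only: a periodic-task entry that fires every 15 minutes, so a
# single failed run is normally covered by the next one.
from datetime import timedelta

CELERY_BEAT_SCHEDULE = {
    'fetch-bugs-every-15-minutes': {
        'task': 'fetch-bugs',  # assumed task name
        'schedule': timedelta(minutes=15),
    },
}
```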
Flags: needinfo?(emorley)
Comment 4•6 years ago
(In reply to Ed Morley [:emorley] from comment #3)
> Worth noting is that the table now also powers the intermittent failures
> view, as well as the bugs commenter - and they have been timing out too.
> Perhaps one of the three consumers is increasing load and breaking the others?
The intermittents commenter and the intermittent failures view both query BugJobMap, not Bugscache. There isn't a foreign key to Bugscache - Bugzilla is queried directly for the metadata that Bugscache would provide - so I'm not sure how Bugscache would be powering them.
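To illustrate the point, a simplified sketch (not the exact Treeherder schema; field names are approximations):

```
# BugJobMap stores the raw Bugzilla bug number as a plain integer rather than
# a ForeignKey into bugscache, so the commenter and the intermittent failures
# view never need to read the bugscache table.
from django.db import models

class BugJobMap(models.Model):
    job = models.ForeignKey("Job", on_delete=models.CASCADE)
    # Not models.ForeignKey(Bugscache, ...): bug metadata is fetched from
    # Bugzilla directly when it is needed.
    bug_id = models.PositiveIntegerField(db_index=True)
```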
Comment 5•6 years ago
Ah, true - I misread, sorry.
Comment 6•6 years ago
Recent examples:
This started with push https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=07b44bee061cf080e70e6872f5f15e22e23819e0&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=success&filter-resultStatus=pending&filter-resultStatus=running
Job examples: https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=198545900
https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=198544262
https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=198548971
Cameron, Ed, can you please take a look?
Updated•6 years ago
Flags: needinfo?(emorley)
Flags: needinfo?(cdawson)
Assignee
Comment 8•6 years ago
Signs point to this being an out-of-memory error at these times, so we're taking a different approach to the problem: we're going to run these scheduled tasks with the Heroku Scheduler rather than celery workers, so that they each get their own dyno. I've already scheduled them on proto, stage and prod, so we may see an immediate benefit. One can hope.
The PR for this change is in bug 1176492.
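As a rough illustration of the shape of that change (hypothetical module path and function name; ``update_bugscache`` is the command mentioned in comment 1), each task becomes a standalone Django management command that the Heroku Scheduler can invoke in its own one-off dyno:

```
# Hypothetical sketch of a management command wrapper; the real Treeherder
# commands may differ. Running it via the Heroku Scheduler gives the task its
# own dyno and memory quota instead of sharing the default celery worker.
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Refresh the bugscache table from Bugzilla"

    def handle(self, *args, **options):
        # Stand-in for whatever the existing fetch-bugs task body calls.
        from treeherder.etl.bugzilla import fetch_intermittent_bugs  # assumed path
        fetch_intermittent_bugs()
```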
Flags: needinfo?(cdawson)
Comment 9•6 years ago
Attached is a screenshot of the metrics for the "default" worker on the stage Heroku app. Prior to bug 1176492, this worker ran all of the tasks listed here (which included fetch_bugs and cycle_data):
https://github.com/mozilla/treeherder/blob/1b52b07728423011d77e298d9df7cef48c508054/Procfile#L11
Once a day, at 1am UTC, memory usage can be seen to spike so high that the events viewer shows an R15 error (memory quota vastly exceeded; dyno killed) [1].
The only tasks that ran once a day (at the time of the errors in the screenshot) were [2]:
* cycle_data (which is known to be a memory hog; bug )
* seta-analyze-failures
* daily-intermittents-commenter (but it's pinned to 7am UTC so likely not responsible)
The green arrow I've added to the screenshot marks when bug 1176492 was deployed, which moved cycle_data and fetch_bugs out of the default worker (since they are now run by the Heroku Scheduler add-on).
After the green arrow there have been no more R15s at 1am UTC, so it seems safe to say cycle_data was at fault and that bug 1176492 worked; a sketch of the kind of schedule entries involved follows the links below.
[1] https://devcenter.heroku.com/articles/error-codes#r15-memory-quota-vastly-exceeded
[2] https://github.com/mozilla/treeherder/blob/255f246960d4902d40df1ec42e986e2f4814b7a5/treeherder/config/settings.py#L331-L382
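For reference, a hedged sketch of the shape of the once-a-day entries in [2] (the crontab values follow the times mentioned above; task names and the exact setting name are assumptions, not the real settings.py):

```
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    # Memory-heavy task suspected of causing the 1am UTC R15 kills while it
    # shared the default worker dyno.
    'cycle-data': {
        'task': 'cycle-data',
        'schedule': crontab(hour=1, minute=0),
    },
    'seta-analyze-failures': {
        'task': 'seta-analyze-failures',
        'schedule': crontab(hour=1, minute=0),
    },
    # Pinned to 7am UTC, so unlikely to be responsible for the 1am spikes.
    'daily-intermittents-commenter': {
        'task': 'intermittents-commenter',
        'schedule': crontab(hour=7, minute=0),
    },
}
```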
Flags: needinfo?(emorley)
Updated•6 years ago
Assignee: nobody → cdawson
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Updated•6 years ago
Priority: -- → P1
Summary: No bug suggestions during some time windows → No bug suggestions generated for jobs that finish around 1am UTC each day
Comment 10•6 years ago
I believe the issue here was actually the one seen in bug 1503172.
(Even though the task here was failing due to running out of memory, it should still never have purged previously ingested bugs.)
Resolution: FIXED → DUPLICATE
Updated•3 years ago
Component: Treeherder: Log Parsing & Classification → TreeHerder