Closed Bug 1484642 Opened 7 years ago Closed 6 years ago

No bug suggestions generated for jobs that finish around 1am UTC each day

Categories

(Tree Management :: Treeherder, enhancement, P1)

enhancement

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1503172

People

(Reporter: aryx, Assigned: camd)

References

Details

Attachments

(1 file)

This has hit at least 3 times during the last 1-2 weeks. At around 1am UTC, bug suggestions for failures are often missing. Example (many other jobs from that push are also affected): https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=f2c0be296f6f10567864fcd0f4d2e7c5da342c34&filter-resultStatus=usercancel&filter-resultStatus=runnable&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&selectedJob=194785710
That job should suggest bug 1387666, which does get suggested for other occurrences of the same failure.
Summary: No failure suggestions during some time windows → No bug suggestions during some time windows
I got pinged by Cosmin today that bug suggestions were not showing up where they should. He mentioned a couple of bugs that were not showing up: bug 1424433 and bug 1352675. Sure enough, when I queried the ``bugscache`` table for records with a summary that matched those bugs, they were not there. When I manually ran the ``update_bugscache`` command and then re-queried the database, they were there.

Ed is way better at this stuff than I am, but looking in Papertrail, I see crashes due to memory overloads:

Aug 21 17:26:49 treeherder-prod heroku/worker_default.1: source=worker_default.1 dyno=heroku.47819539.ff36ae31-3a15-48b3-ad81-8cb99016dad4 sample#load_avg_1m=0.58 sample#load_avg_5m=0.41 sample#load_avg_15m=0.31
Aug 21 17:26:49 treeherder-prod heroku/worker_default.1: source=worker_default.1 dyno=heroku.47819539.ff36ae31-3a15-48b3-ad81-8cb99016dad4 sample#memory_total=2279.04MB sample#memory_rss=1023.82MB sample#memory_cache=0.13MB sample#memory_swap=1255.09MB sample#memory_pgpgin=3263247pages sample#memory_pgpgout=3034842pages sample#memory_quota=1024.00MB
Aug 21 17:26:49 treeherder-prod heroku/worker_default.1: Process running mem=2279M(222.6%)
Aug 21 17:26:49 treeherder-prod heroku/worker_default.1: Error R15 (Memory quota vastly exceeded)
Aug 21 17:26:49 treeherder-prod heroku/worker_default.1: Stopping process with SIGKILL
Aug 21 17:26:49 treeherder-prod heroku/worker_default.1: Process exited with status 137
Aug 21 17:26:50 treeherder-prod heroku/worker_default.1: State changed from up to crashed
Aug 21 17:26:50 treeherder-prod heroku/worker_default.1: State changed from crashed to starting
Aug 21 17:26:56 treeherder-prod heroku/worker_default.1: Starting process with command `newrelic-admin run-program celery worker -A treeherder --without-gossip --without-mingle --without-heartbeat -Q default,cycle_data,fetch_bugs,fetch_runnablejobs,generate_perf_alerts,seta_analyze_failures,intermittents_commenter --concurrency=3`

I wonder if this means that we're not successfully running the fetch_bugs process. Ed? What do you think?
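For reference, here is a minimal sketch of the check described above, as it might be run from ``./manage.py shell``. The model import path is an assumption about where Treeherder keeps the model, and the search terms are placeholders rather than the real failure lines; only the ``update_bugscache`` command and the ``summary`` column are taken from the comment above.

```python
# Hedged sketch of the check described above (run via ./manage.py shell).
# The import path is an assumption about where Treeherder keeps the model;
# the search terms are placeholders, not the real failure lines.
from django.core.management import call_command
from treeherder.model.models import Bugscache  # assumed location

search_terms = [
    "placeholder failure line for bug 1424433",
    "placeholder failure line for bug 1352675",
]

for term in search_terms:
    count = Bugscache.objects.filter(summary__icontains=term).count()
    print(term, "->", count, "cached bug(s)")

# Refresh the cache from Bugzilla, then re-run the queries above.
call_command("update_bugscache")
```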
Flags: needinfo?(emorley)
Perhaps we should split that set of celery queues across one or more separate dynos? Or we could upgrade the dyno, but I'm not sure which is more cost-effective.
The log excerpt in comment 1 does show the memory quota being exceeded; however, without more context from the lines before/after, it's hard to know whether this occurred whilst a fetch-bugs task was executing or something else (Papertrail is down at the moment or I'd try to find that part of the log). The fetch-bugs task runs every 15 minutes, so even if one run fails, the next should succeed. I would suspect bug 1483289 is more to blame. In the meantime I've run an `optimize table treeherder.bugscache;` on production's RDS instance, which might help improve performance on the table. Worth noting is that the table now also powers the intermittent failures view, as well as the bugs commenter - and they have been timing out too. Perhaps one of the three consumers is increasing load and breaking the others?
Flags: needinfo?(emorley)
(In reply to Ed Morley [:emorley] from comment #3)
> Worth noting is that the table now also powers the intermittent failures view,
> as well as the bugs commenter - and they have been timing out too. Perhaps
> one of the three consumers is increasing load and breaking the others?

The intermittents commenter and intermittent failures view both query BugJobMap, not Bugscache. There isn't a foreign key to Bugscache - Bugzilla is queried directly for the metadata that Bugscache would provide - so I'm not sure how Bugscache could be powering them?
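To make that distinction concrete, here is a rough sketch of the pattern described above: read the bare bug numbers from BugJobMap and fetch the metadata straight from the Bugzilla REST API, without touching Bugscache. The model import path and field name are assumptions; only the data-model relationship comes from the comment above.

```python
# Rough sketch only: the import path and field name are assumptions, and the
# Bugzilla call just illustrates "query Bugzilla directly for the metadata".
import requests
from treeherder.model.models import BugJobMap  # assumed location

# BugJobMap stores a bare Bugzilla bug number (assumed field name: bug_id),
# not a foreign key into Bugscache.
bug_ids = list(BugJobMap.objects.values_list("bug_id", flat=True).distinct()[:10])

resp = requests.get(
    "https://bugzilla.mozilla.org/rest/bug",
    params={
        "id": ",".join(str(b) for b in bug_ids),
        "include_fields": "id,summary,status",
    },
)
for bug in resp.json().get("bugs", []):
    print(bug["id"], bug["status"], bug["summary"])
```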
Ah true, I misread sorry.
Flags: needinfo?(emorley)
Flags: needinfo?(cdawson)
Signs point to this being an out-of-memory error at these times. So we're taking a different approach to the problem: we're going to run these scheduled tasks with the Heroku Scheduler rather than Celery workers, so that each of them gets its own dyno. I've already scheduled them on proto, stage and prod, so we may see an immediate benefit. One can hope. The PR with the change is in bug 1176492.
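As a sketch of the pattern (not Treeherder's actual code): a periodic Celery task can be wrapped in a plain Django management command so the Heroku Scheduler add-on can run it in its own one-off dyno, isolating its memory use from the long-lived default worker. The command name and the task import below are hypothetical.

```python
# Hypothetical wrapper, not Treeherder's actual code: expose a periodic task
# body as a management command so the Heroku Scheduler can run it directly
# (e.g. as `./manage.py run_fetch_bugs`) in a one-off dyno.
from django.core.management.base import BaseCommand

from treeherder.model.tasks import fetch_bugs  # hypothetical import path


class Command(BaseCommand):
    help = "Fetch intermittent-failure bugs from Bugzilla into the bugscache table"

    def handle(self, *args, **options):
        # Call the task body synchronously; the scheduler's one-off dyno already
        # provides the isolation that splitting the Celery queues was meant to give.
        fetch_bugs()
```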
Flags: needinfo?(cdawson)
Attached is a screenshot of the metrics for the "default" worker on the stage Heroku app. Prior to bug 1176492, this worker ran all of the tasks listed here (which included fetch_bugs and cycle_data):
https://github.com/mozilla/treeherder/blob/1b52b07728423011d77e298d9df7cef48c508054/Procfile#L11

Once a day at 1am UTC the memory usage can be seen to spike so high that the events viewer shows an R15 error (memory quota exceeded and dyno killed) [1]. The only tasks that ran once a day (at the time of the errors in the screenshot) were [2]:
* cycle_data (which is known to be a memory hog; bug )
* seta-analyze-failures
* daily-intermittents-commenter (but it's pinned to 7am UTC so likely not responsible)

The green arrow I've added to the screenshot marks when bug 1176492 was deployed, which moved cycle_data and fetch_bugs out of the default worker (they are now run by the Heroku Scheduler add-on). After the green arrow there have been no more R15s at 1am UTC, so it seems safe to say cycle_data was at fault and that bug 1176492 worked.

[1] https://devcenter.heroku.com/articles/error-codes#r15-memory-quota-vastly-exceeded
[2] https://github.com/mozilla/treeherder/blob/255f246960d4902d40df1ec42e986e2f4814b7a5/treeherder/config/settings.py#L331-L382
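For context, the schedule at [2] is a Celery beat configuration. A trimmed, hypothetical entry in that shape is shown below; the task names, queues and times are illustrative only, not the real production values.

```python
# Hypothetical excerpt in the shape of a CELERYBEAT_SCHEDULE like the one
# referenced at [2]; task names, queues and times are illustrative only.
from datetime import timedelta

from celery.schedules import crontab

CELERYBEAT_SCHEDULE = {
    "fetch-bugs-every-15-minutes": {
        "task": "fetch-bugs",                   # illustrative task name
        "schedule": timedelta(minutes=15),
        "options": {"queue": "fetch_bugs"},
    },
    "cycle-data-daily": {
        "task": "cycle-data",                   # illustrative task name
        # A fixed daily slot like this is what lines the R15s up at ~1am UTC.
        "schedule": crontab(hour=1, minute=0),  # illustrative time
        "options": {"queue": "cycle_data"},
    },
}
```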
Flags: needinfo?(emorley)
Assignee: nobody → cdawson
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Priority: -- → P1
Summary: No bug suggestions during some time windows → No bug suggestions generated for jobs that finish around 1am UTC each day
Depends on: 1176492
I believe the issue here was actually the one seen in bug 1503172. (Even though the task here was failing due to running out of memory, it should still never have purged previously ingested bugs)
Resolution: FIXED → DUPLICATE
Component: Treeherder: Log Parsing & Classification → TreeHerder