Closed Bug 1484642 Opened 7 years ago Closed 6 years ago

No bug suggestions generated for jobs that finish around 1am UTC each day

Categories

(Tree Management :: Treeherder, enhancement, P1)

enhancement

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1503172

People

(Reporter: aryx, Assigned: camd)

References

Details

Attachments

(1 file)

This has hit at least 3 times during the last 1-2 weeks. At around 1am UTC, bug suggestions for failures are often missing. Example (many other jobs from that push are also affected): https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=f2c0be296f6f10567864fcd0f4d2e7c5da342c34&filter-resultStatus=usercancel&filter-resultStatus=runnable&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&selectedJob=194785710
That job should suggest bug 1387666, which does get suggested for other occurrences of the same failure.
Summary: No failure suggestions during some time windows → No bug suggestions during some time windows
I got pinged by Cosmin today that bug suggestions were not showing up where they should. He mentioned a couple of bugs that were not showing up: bug 1424433 and bug 1352675. Sure enough, when I queried the ``bugscache`` table for records with a summary that matched those bugs, they were not there. When I manually ran the ``update_bugscache`` command and then re-queried the database, they were there.

Ed is way better at this stuff than I am, but looking in Papertrail, I see crashes due to memory overloads:

Aug 21 17:26:49 treeherder-prod heroku/worker_default.1: source=worker_default.1 dyno=heroku.47819539.ff36ae31-3a15-48b3-ad81-8cb99016dad4 sample#load_avg_1m=0.58 sample#load_avg_5m=0.41 sample#load_avg_15m=0.31
Aug 21 17:26:49 treeherder-prod heroku/worker_default.1: source=worker_default.1 dyno=heroku.47819539.ff36ae31-3a15-48b3-ad81-8cb99016dad4 sample#memory_total=2279.04MB sample#memory_rss=1023.82MB sample#memory_cache=0.13MB sample#memory_swap=1255.09MB sample#memory_pgpgin=3263247pages sample#memory_pgpgout=3034842pages sample#memory_quota=1024.00MB
Aug 21 17:26:49 treeherder-prod heroku/worker_default.1: Process running mem=2279M(222.6%)
Aug 21 17:26:49 treeherder-prod heroku/worker_default.1: Error R15 (Memory quota vastly exceeded)
Aug 21 17:26:49 treeherder-prod heroku/worker_default.1: Stopping process with SIGKILL
Aug 21 17:26:49 treeherder-prod heroku/worker_default.1: Process exited with status 137
Aug 21 17:26:50 treeherder-prod heroku/worker_default.1: State changed from up to crashed
Aug 21 17:26:50 treeherder-prod heroku/worker_default.1: State changed from crashed to starting
Aug 21 17:26:56 treeherder-prod heroku/worker_default.1: Starting process with command `newrelic-admin run-program celery worker -A treeherder --without-gossip --without-mingle --without-heartbeat -Q default,cycle_data,fetch_bugs,fetch_runnablejobs,generate_perf_alerts,seta_analyze_failures,intermittents_commenter --concurrency=3`

I wonder if this means that we're not successfully running the fetch_bugs process. Ed? What do you think?
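For reference, here is a minimal sketch of the check described above, as it might be run from ``./manage.py shell``. The model import path is an assumption about where Treeherder keeps the model, and the search terms are placeholders rather than the real failure lines; only the ``update_bugscache`` command and the ``summary`` column are taken from the comment above.

```python
# Hedged sketch of the check described above (run via ./manage.py shell).
# The import path is an assumption about where Treeherder keeps the model;
# the search terms are placeholders, not the real failure lines.
from django.core.management import call_command
from treeherder.model.models import Bugscache  # assumed location

search_terms = [
    "placeholder failure line for bug 1424433",
    "placeholder failure line for bug 1352675",
]

for term in search_terms:
    count = Bugscache.objects.filter(summary__icontains=term).count()
    print(term, "->", count, "cached bug(s)")

# Refresh the cache from Bugzilla, then re-run the queries above.
call_command("update_bugscache")
```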
Flags: needinfo?(emorley)
Perhaps we should split that set of celery queues across one or more separate dynos? Or we could upgrade the dyno, but I'm not sure which is more cost-effective.
The log excerpt in comment 1 does show the memory quota being exceeded; however, without more context from the lines before/after, it's hard to know whether this occurred whilst a fetch-bugs task was executing or something else (Papertrail is down at the moment or I'd try to find that part of the log). The fetch-bugs task runs every 15 minutes, so even if one run fails, the next should succeed. I would suspect bug 1483289 is more to blame. In the meantime I've run an `optimize table treeherder.bugscache;` on production's RDS instance, which might help improve performance on the table. Worth noting is that the table now also powers the intermittent failures view, as well as the bugs commenter - and they have been timing out too. Perhaps one of the three consumers is increasing load and breaking the others?
Flags: needinfo?(emorley)
(In reply to Ed Morley [:emorley] from comment #3)
> Worth noting is that the table now also powers the intermittent failures view,
> as well as the bugs commenter - and they have been timing out too. Perhaps
> one of the three consumers is increasing load and breaking the others?

The intermittents commenter and intermittent failures view both query BugJobMap, not Bugscache. There isn't a foreign key to Bugscache - Bugzilla is queried directly for the metadata that Bugscache would provide - so I'm not sure how Bugscache could be powering them?
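To make that distinction concrete, here is a rough sketch of the pattern described above: read the bare bug numbers from BugJobMap and fetch the metadata straight from the Bugzilla REST API, without touching Bugscache. The model import path and field name are assumptions; only the data-model relationship comes from the comment above.

```python
# Rough sketch only: the import path and field name are assumptions, and the
# Bugzilla call just illustrates "query Bugzilla directly for the metadata".
import requests
from treeherder.model.models import BugJobMap  # assumed location

# BugJobMap stores a bare Bugzilla bug number (assumed field name: bug_id),
# not a foreign key into Bugscache.
bug_ids = list(BugJobMap.objects.values_list("bug_id", flat=True).distinct()[:10])

resp = requests.get(
    "https://bugzilla.mozilla.org/rest/bug",
    params={
        "id": ",".join(str(b) for b in bug_ids),
        "include_fields": "id,summary,status",
    },
)
for bug in resp.json().get("bugs", []):
    print(bug["id"], bug["status"], bug["summary"])
```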
Ah true, I misread sorry.
Flags: needinfo?(emorley)
Flags: needinfo?(cdawson)
Signs point to this being an out-of-memory error at these times. So we're taking a different approach to the problem: we're going to run these scheduled tasks with the Heroku Scheduler rather than Celery workers, so that each of them gets its own dyno. I've already scheduled them on proto, stage and prod, so we may see an immediate benefit. One can hope. The PR with the change is in bug 1176492.
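As a sketch of the pattern (not Treeherder's actual code): a periodic Celery task can be wrapped in a plain Django management command so the Heroku Scheduler add-on can run it in its own one-off dyno, isolating its memory use from the long-lived default worker. The command name and the task import below are hypothetical.

```python
# Hypothetical wrapper, not Treeherder's actual code: expose a periodic task
# body as a management command so the Heroku Scheduler can run it directly
# (e.g. as `./manage.py run_fetch_bugs`) in a one-off dyno.
from django.core.management.base import BaseCommand

from treeherder.model.tasks import fetch_bugs  # hypothetical import path


class Command(BaseCommand):
    help = "Fetch intermittent-failure bugs from Bugzilla into the bugscache table"

    def handle(self, *args, **options):
        # Call the task body synchronously; the scheduler's one-off dyno already
        # provides the isolation that splitting the Celery queues was meant to give.
        fetch_bugs()
```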
Flags: needinfo?(cdawson)
Attached is a screenshot of the metrics for the "default" worker on the stage Heroku app. Prior to bug 1176492, this worker ran all of the tasks listed here (which included fetch_bugs and cycle_data):
https://github.com/mozilla/treeherder/blob/1b52b07728423011d77e298d9df7cef48c508054/Procfile#L11

Once a day at 1am UTC the memory usage can be seen to spike so high that the events viewer shows an R15 error (memory quota exceeded and dyno killed) [1]. The only tasks that ran once a day (at the time of the errors in the screenshot) were [2]:
* cycle_data (which is known to be a memory hog; bug )
* seta-analyze-failures
* daily-intermittents-commenter (but it's pinned to 7am UTC so likely not responsible)

The green arrow I've added to the screenshot marks when bug 1176492 was deployed, which moved cycle_data and fetch_bugs out of the default worker (they are now run by the Heroku Scheduler add-on). After the green arrow there have been no more R15s at 1am UTC, so it seems safe to say cycle_data was at fault and that bug 1176492 worked.

[1] https://devcenter.heroku.com/articles/error-codes#r15-memory-quota-vastly-exceeded
[2] https://github.com/mozilla/treeherder/blob/255f246960d4902d40df1ec42e986e2f4814b7a5/treeherder/config/settings.py#L331-L382
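For context, the schedule at [2] is a Celery beat configuration. A trimmed, hypothetical entry in that shape is shown below; the task names, queues and times are illustrative only, not the real production values.

```python
# Hypothetical excerpt in the shape of a CELERYBEAT_SCHEDULE like the one
# referenced at [2]; task names, queues and times are illustrative only.
from datetime import timedelta

from celery.schedules import crontab

CELERYBEAT_SCHEDULE = {
    "fetch-bugs-every-15-minutes": {
        "task": "fetch-bugs",                   # illustrative task name
        "schedule": timedelta(minutes=15),
        "options": {"queue": "fetch_bugs"},
    },
    "cycle-data-daily": {
        "task": "cycle-data",                   # illustrative task name
        # A fixed daily slot like this is what lines the R15s up at ~1am UTC.
        "schedule": crontab(hour=1, minute=0),  # illustrative time
        "options": {"queue": "cycle_data"},
    },
}
```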
Flags: needinfo?(emorley)
Assignee: nobody → cdawson
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Priority: -- → P1
Summary: No bug suggestions during some time windows → No bug suggestions generated for jobs that finish around 1am UTC each day
Depends on: 1176492
I believe the issue here was actually the one seen in bug 1503172. (Even though the task here was failing due to running out of memory, it should still never have purged previously ingested bugs)
Resolution: FIXED → DUPLICATE
Component: Treeherder: Log Parsing & Classification → TreeHerder