Closed Bug 1151629 Opened 9 years ago Closed 9 years ago

The bugscache population task has a hardcoded limit of 15,000 bugs, which we've now reached

Categories: Tree Management :: Treeherder: Data Ingestion, defect, P1
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: philor, Assigned: emorley
Attachments: 1 file

I've stared at it until my eyes swim, but it sure looks to me like it has kw:intermittent-failure, and browser_compartments.js in the summary. Still, https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=8c716f35d9ec&filter-searchStr=Windows%207%2032-bit%20mozilla-inbound%20debug%20test%20mochitest-browser-chrome-3 only gets the closed bug 1150259 suggested, not the open one with a (truncated) exact match for the failure.
I seem to be having a similar issue with bug 1151711.
Flags: needinfo?(emorley)
Priority: -- → P1
I'm seeing this with other bugs I filed yesterday as well. Seems widespread.
Flags: needinfo?(mdoglio)
Flags: needinfo?(cdawson)
Flags: needinfo?(mdoglio)
Heads up sheriffs - until fixed, this basically means we need to search BMO for dupes before filing any new oranges.
Yeah, confirmed. I saw a lot of these issues today; we should get this fixed ASAP.
Affects stage as well.
For the link in comment 0, the artefact URL is:
https://treeherder.mozilla.org/api/project/mozilla-inbound/artifact/?job_id=8508222&name=Bug+suggestions&type=json

Excerpt from it:

  "blob": [{
    "search": "179 INFO TEST-UNEXPECTED-FAIL | toolkit/components/aboutperformance/tests/browser/browser_compartments.js | Sanity check (): totalUserTime is monotonic.: 15600 <= 0 - false == true - JS frame :: chrome://mochitests/content/browser/toolkit/components/aboutperformance/tests/browser/browser_compartments.js :: Assert_leq :: line 36",
    "search_terms": ["browser_compartments.js"],
    "bugs": {
      "open_recent": [],
      "all_others": [{
        "crash_signature": "",
        "resolution": "FIXED",
        "summary": "Intermittent browser_compartments.js | Test timed out | Found a tab after previous test timed out: browser/browser_compartments.html?test=0.9079043654642461 | A promise chain failed to handle a rejection: - at browser-test.js:743",
        "relevance": 1.0,
        "keywords": "intermittent-failure",
        "os": "Windows XP",
        "id": 1150259
      }]
    }
  },


-> The correct search term is being used, "browser_compartments.js".

Searching with that term:
https://treeherder.mozilla.org/api/bugscache/?search=browser_compartments.js

Gives:
{
  "open_recent": [{
    "crash_signature": "",
    "resolution": "",
    "summary": "Intermittent browser_compartments.js | Sanity check (): totalUserTime is monotonic.: 15600 <= 0 - false == true - JS frame :: chrome://mochitests/content/browser/toolkit/components/aboutperformance/tests/browser/browser_compartments.js :: Assert_leq :: li",
    "relevance": 1.0,
    "keywords": "intermittent-failure",
    "os": "Windows 7",
    "id": 1151240
  }],
  "all_others": [{
    "crash_signature": "",
    "resolution": "FIXED",
    "summary": "Intermittent browser_compartments.js | Test timed out | Found a tab after previous test timed out: browser/browser_compartments.html?test=0.9079043654642461 | A promise chain failed to handle a rejection: - at browser-test.js:743",
    "relevance": 1.0,
    "keywords": "intermittent-failure",
    "os": "Windows XP",
    "id": 1150259
  }]
}

So the bug is there now.
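The comparison made in this comment (what the stored "Bug suggestions" artifact contains versus what a live bugscache search returns) can be sketched as a small set difference. This is only an illustration: `suggested_bug_ids` is a hypothetical helper, not part of Treeherder, and the two dicts are trimmed copies of the JSON quoted above, keeping only the `id` fields.

```python
def suggested_bug_ids(suggestions):
    """Collect bug ids from a bugscache-style result dict.

    Hypothetical helper; assumes the {"open_recent": [...], "all_others": [...]}
    shape shown in the JSON excerpts in this comment.
    """
    ids = set()
    for bucket in ("open_recent", "all_others"):
        for bug in suggestions.get(bucket, []):
            ids.add(bug["id"])
    return ids


# Trimmed from the artifact excerpt (generated when the job was parsed).
artifact_bugs = {"open_recent": [], "all_others": [{"id": 1150259}]}

# Trimmed from the live /api/bugscache/ search result.
live_search = {
    "open_recent": [{"id": 1151240}],
    "all_others": [{"id": 1150259}],
}

# Bugs the cache knows about now but that were not suggested for the job.
missing = suggested_bug_ids(live_search) - suggested_bug_ids(artifact_bugs)
print(sorted(missing))  # prints [1151240]
```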
Flags: needinfo?(emorley)
If I could have some more examples, it would really help...
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #1)
> I seem to be having a similar issue with bug 1151711.
Which was still affected as of ~4h ago on the last push to b2g37
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #9)
> Which was still affected as of ~4h ago on the last push to b2g37

https://treeherder.mozilla.org/#/jobs?repo=mozilla-b2g37_v2_2&revision=9ab8a3ae0fc3&filter-searchStr=b2g_emulator_vm mozilla-b2g37_v2_2 debug test mochitest-debug-5
https://treeherder.mozilla.org/api/project/mozilla-b2g37_v2_2/artifact/?job_id=97349&name=Bug+suggestions&type=json

  {
    "search": "PROCESS-CRASH | dom/canvas/test/test_2d.composite.canvas.color-burn.html | application crashed [None]",
    "search_terms": ["test_2d.composite.canvas.color-burn.html"],
    "bugs": {
      "open_recent": [],
      "all_others": []
    }
  },

https://treeherder.mozilla.org/api/bugscache/?search=test_2d.composite.canvas.color-burn.html

-> {"open_recent": [], "all_others": []}

Execute:
> SELECT * FROM treeherder.bugscache WHERE id = 1151711

+ ------- + ----------- + --------------- + ------------ + -------------------- + ------------- + ------- + ------------- +
| id      | status      | resolution      | summary      | crash_signature      | keywords      | os      | modified      |
+ ------- + ----------- + --------------- + ------------ + -------------------- + ------------- + ------- + ------------- +
| NULL    | NULL        | NULL            | NULL         | NULL                 | NULL          | NULL    | NULL          |
+ ------- + ----------- + --------------- + ------------ + -------------------- + ------------- + ------- + ------------- +
1 rows

(On both master and slave)
The fetch_bugs task runs from the celery_worker queue on rabbitmq1.

The queue is not backlogged, now are others:
[emorley@treeherder-rabbitmq1.private.scl3 ~]$ sudo rabbitmqctl list_queues -p treeherder
Listing queues ...
buildapi        0
calculate_eta   0
celery@buildapi.treeherder-etl1.private.scl3.mozilla.com.celery.pidbox  0
celery@buildapi.treeherder-etl2.private.scl3.mozilla.com.celery.pidbox  0
celery@default.treeherder-rabbitmq1.private.scl3.mozilla.com.celery.pidbox      0
celery@hp.treeherder-rabbitmq1.private.scl3.mozilla.com.celery.pidbox   0
celery@log_parser.treeherder-processor1.private.scl3.mozilla.com.celery.pidbox  0
celery@log_parser.treeherder-processor2.private.scl3.mozilla.com.celery.pidbox  0
celery@log_parser.treeherder-processor3.private.scl3.mozilla.com.celery.pidbox  0
celery@pushlog.treeherder-etl1.private.scl3.mozilla.com.celery.pidbox   0
celery@pushlog.treeherder-etl2.private.scl3.mozilla.com.celery.pidbox   0
celeryev.30087534-00c5-4c58-9821-5c9496bf2858   0
celeryev.3244c743-300e-4735-afce-9fc81e418171   0
celeryev.35020252-8bb5-435a-88a5-7bedadc5d3b9   0
celeryev.7e754135-9ba9-4fd8-977e-0a5f590790d0   0
celeryev.89d31514-286d-481a-a823-95c216367f5e   0
celeryev.9c6176ea-fc6c-4af5-b52f-3763b3e1a59a   0
celeryev.a9a7e73f-412d-450f-936a-496feff741b3   0
celeryev.b6d92059-4d6c-410a-95f9-d875d7b63cf0   0
celeryev.cdcb6436-3c46-4d15-b0b7-ba696aeec73b   0
cycle_data      0
default 0
fetch_bugs      0
fetch_missing_push_logs 0
high_priority   0
log_parser      0
log_parser_fail 0
log_parser_hp   0
log_parser_json 0
populate_performance_series     0
process_objects 0
pushlog 0
...done.

Rabbitmq1 looks ok on New Relic:
https://rpm.newrelic.com/accounts/677903/servers/5575925

There are no fetch-bugs exceptions on:
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors

The 7 day transaction view for fetch-bugs doesn't show anything obvious:
https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?show_browser=false&tw[end]=1428508479&tw[start]=1427903679&type=all#id=5b224f746865725472616e73616374696f6e2f43656c6572792f66657463682d62756773222c22225d
s/now/nor/
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #13)
> Bug 1152289

Thanks :-) Should be good with those now.

There are no errors being shown in /var/log/celery/celery_worker.log on rabbitmq1:

[2015-04-08 08:00:00,064: INFO/MainProcess] Received task: fetch-bugs[77925354-132e-4c67-8003-625c2a937f0c]
...
[2015-04-08 08:05:10,681: INFO/MainProcess] Task fetch-bugs[77925354-132e-4c67-8003-625c2a937f0c] succeeded in 310.615266436s: None

The runtime is sometimes above and sometimes below 300s; however, this has been the case since March (not that exceeding 300s should make any difference, as far as I'm aware).
So the tl;dr of what I have so far is:
* The bugs aren't even in the bugscache table (this is an issue with populating the bugscache, not generating the summaries etc)
* There are zero exceptions/errors/... being reported
* The job is definitely still running and takes the same amount of time to run (so presumably any issue is, say, on the insert into the DB, rather than the task bailing early or just being skipped)

Will continue looking..
Sigh:
https://github.com/mozilla/treeherder-service/blob/f5c0b53e0ce6b527c5eb2d861adeb72e1e5859ea/treeherder/etl/bugzilla.py#L39

offset = 0
limit = 500

# fetch new pages no more than 30 times
# this is a safe guard to not generate an infinite loop
# in case something went wrong
for i in range(1, 30 + 1):
    # fetch the bugzilla service until we have an empty result
    paginated_url = "{0}&offset={1}&limit={2}".format(
        get_bz_source_url(),
        offset,
        limit
    )

30 * 500 = 15,000 results max

[~/src]$ curl 'https://bugzilla.mozilla.org/bzapi/count?keywords=intermittent-failure'
{"data":15048}

I don't know whether to laugh or cry.
Assignee: nobody → emorley
Status: NEW → ASSIGNED
The existing code should have generated an exception, not carried on silently. But we can just remove the limit now IMO, since we know the search terms are correct (ie we're not fetching all of Bugzilla by accident, just intermittent failure bugs).
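A minimal sketch of what "generate an exception rather than carry on silently" could look like, assuming a paginated fetch like the one quoted above. The names `fetch_all_pages`, `fetch_page` and `PageLimitExceeded` are hypothetical and not from the Treeherder codebase; `fetch_page(offset, limit)` stands in for the real Bugzilla request.

```python
class PageLimitExceeded(Exception):
    """Raised when the page cap is hit before the results run out."""


def fetch_all_pages(fetch_page, limit=500, max_pages=30):
    """Fetch pages until a short page is seen, or fail loudly at the cap.

    fetch_page(offset, limit) is a hypothetical callable returning a list
    of at most `limit` results.
    """
    results = []
    offset = 0
    for _ in range(max_pages):
        page = fetch_page(offset, limit)
        results.extend(page)
        if len(page) < limit:
            # A short (or empty) page means there is nothing left to fetch.
            return results
        offset += limit
    # Every page was full, so there may be more results beyond the cap.
    # This is conservative: it also fires when the total is exactly
    # limit * max_pages.
    raise PageLimitExceeded(
        "stopped after {0} pages of {1} results".format(max_pages, limit)
    )
```

With 15,048 matching bugs and the default cap of 30 pages, this would raise instead of quietly returning the first 15,000.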
Flags: needinfo?(cdawson)
Summary: Bug 1151240 isn't suggested for browser_compartments.js failures → The bugscache population task has a hardocded limit of 15,000 bugs, which we've now reached
Summary: The bugscache population task has a hardocded limit of 15,000 bugs, which we've now reached → The bugscache population task has a hardcoded limit of 15,000 bugs, which we've now reached
Component: Treeherder → Treeherder: Data Ingestion
Attachment #8589726 - Flags: review?(mdoglio) → review+
Depends on: 1152426
Commit pushed to master at https://github.com/mozilla/treeherder-service

https://github.com/mozilla/treeherder-service/commit/7be69d14922c10c0a30fcd8b4c842b80e4d89f36
Bug 1151629 - Don't limit the number of bugs retrieved by fetch_bugs

Previously we limited the number of pages to 30, which with a page size
of 500, meant a max of 15,000 intermittent-failure bugs retrieved from
Bugzilla, with no exceptions or log output to indicate this had
occurred.

We now have more than 15,000 intermittent failure bugs, so the limit is
being removed, since we're both confident that the search terms are
correct, and any other infinite loop would be caught by the existing
600s timeout.

The task currently takes ~300s to run, so there is still plenty of
headroom. Plus a timeout exception would be immediately visible in New
Relic and so much less of a pain to debug.
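For illustration, the approach described in the commit message (no page cap, relying on the task timeout to catch a runaway loop) might look roughly like the sketch below. This is not the actual diff from the commit; `fetch_until_empty` and `fetch_page` are hypothetical names, and the 600s timeout is assumed to be enforced outside this function by the task runner.

```python
def fetch_until_empty(fetch_page, limit=500):
    """Keep requesting pages until one comes back empty.

    fetch_page(offset, limit) is a hypothetical stand-in for the Bugzilla
    request. There is no page cap here; a runaway loop is assumed to be
    stopped by the task-level timeout instead.
    """
    bugs = []
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        if not page:
            break
        bugs.extend(page)
        offset += limit
    return bugs
```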
Before...

Execute:
> SELECT COUNT(*) FROM treeherder.bugscache

+ ------------- +
| COUNT(*)      |
+ ------------- +
| 15000         |
+ ------------- +
1 rows

And now another hourly fetch-bugs task has completed, after this was deployed:

Execute:
> SELECT COUNT(*) FROM treeherder.bugscache

+ ------------- +
| COUNT(*)      |
+ ------------- +
| 15043         |
+ ------------- +
1 rows
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED