Closed Bug 1429030 Opened 6 years ago Closed 3 years ago

filing new bugs is not using the bugzilla component defined in-tree

Categories

(Tree Management :: Treeherder, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmaher, Unassigned)

References

(Depends on 1 open bug)

Details

Attachments

(1 file, 1 obsolete file)

There is a process the tree sheriffs use for filing new intermittent failure bugs; unfortunately I don't know where it is documented or what tools are used.

I do know that many of the web-platform-tests intermittent failure bugs are not filed in the Bugzilla component the tests are associated with; instead they are filed in Testing :: web-platform-tests. There are 560 intermittent failures in that component, 55 in the last month. I spot-checked a few of the 55 recent ones and they do have valid components associated in moz.build files. For reference, here are the recent ones:
https://bugzilla.mozilla.org/buglist.cgi?keywords=intermittent-failure%2C%20&keywords_type=allwords&list_id=13958611&resolution=---&classification=Components&chfieldto=Now&query_format=advanced&chfieldfrom=2017-12-01&component=web-platform-tests&product=Testing

We need to fix up our tools to work properly, otherwise these bugs are not going to ever be seen and we are just wasting time.
:coop, I believe you are headmaster of the sheriffs, is this something you can drive to ensure our tools are working properly and the work the sheriffs are doing is not for waste?
Flags: needinfo?(coop)
I think this should be treated as a meta bug rather than duping against a specific technical issue.
Depends on: 1354791
Thank you for filing this - I agree it's important that the tools help the bugs end up in the right place, so they don't get missed.

The bug filer tool that files the intermittent failure bugs was created by and almost solely maintained by Wes. It would be good to have someone outside the Treeherder team take over maintenance now that he's left. If it would help to have a completely separate Bugzilla component for bugs relating to this tool, I have the necessary permissions to create one (it's currently part of the "Treeherder: Log Parsing & Classification" component).

The code for the tool is here:
https://github.com/mozilla/treeherder/blob/master/ui/js/controllers/bugfiler.js
https://github.com/mozilla/treeherder/blob/master/tests/ui/unit/controllers/bugfiler.tests.js

Re this class of bugs (wrong component), the best way to figure out whether it's an issue with the bug filer or the metadata returned from hg.m.o is to:
* follow the treeherder.m.o/logviewer.html link in the bug description
* then click the "revision" link (that links back to the main Treeherder jobs view with that job selected)
* then switch to the failure summary panel
* then click the bugfiler icon next to that failure line (I think this may need "&bugfiler" added to the URL if you're not a sheriff?)
* find the XHR request made to hg.m.o (eg https://hg.mozilla.org/mozilla-central/json-mozbuildinfo?p=testing/web-platform/mozilla/tests/wasm/f32.wast.js.html) and check that (a) the request was for the correct file, (b) the response is as expected (see the sketch below)
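
For the last step, a quick way to sanity-check the hg.m.o response outside the browser is a small script along these lines (just a sketch using the requests library; the path is the example from the URL above, swap in the file from the failure line you're checking):

import requests

# Example path from the URL above; substitute the file you are checking.
test_path = "testing/web-platform/mozilla/tests/wasm/f32.wast.js.html"

resp = requests.get(
    "https://hg.mozilla.org/mozilla-central/json-mozbuildinfo?p=" + test_path)
data = resp.json()

if "error" in data:
    # This is the failure mode seen later in this bug:
    # {"error": "unable to obtain moz.build info"}
    print("hg.m.o lookup failed:", data["error"])
else:
    print(data)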
Component: Treeherder → Treeherder: Log Parsing & Classification
No longer depends on: 1354791
(Sorry missed the dep change when using "make comment anyway")
Depends on: 1354791
OK, the problem looks to be the hg service. Locally I have:
$ ./mach file-info bugzilla-component testing/web-platform/tests/html/semantics/embedded-content/media-elements/track/track-element/track-cues-missed.html
Core :: DOM
  testing/web-platform/tests/html/semantics/embedded-content/media-elements/track/track-element/track-cues-missed.html

but querying hg via the web:
https://hg.mozilla.org/mozilla-central/json-mozbuildinfo/?p=testing/web-platform/tests/html/semantics/embedded-content/media-elements/track/track-element/track-cues-missed.html

I get:
{
  "error": "unable to obtain moz.build info"
}


:gps, do you know why this would be happening?
Flags: needinfo?(gps)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #1)
> :coop, I believe you are headmaster of the sheriffs, is this something you
> can drive to ensure our tools are working properly and the work the sheriffs
> are doing is not for waste?

Redirecting NI request to RyanVM
Flags: needinfo?(coop) → needinfo?(ryanvm)
json-mozbuildinfo has been failing for a while. Bug 1354791 tracks.

Getting it working is a non-trivial amount of work, both now and on an ongoing basis. I would encourage tools to consume the JSON produced by Firefox CI that contains Bugzilla metadata. See e.g. https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=bugzilla&selectedJob=156606589
Flags: needinfo?(gps)
This means that the new bug filing tool needs to use data from another source, not a query to the hg server. One thought is that we could query ActiveData, given that all this information is ingested there.

Kyle, do you have concerns with having all "new bugs" filed with the sheriff/Treeherder tool use ActiveData to look up the Bugzilla component for a given file? The only concern I can think of is uptime/reliability, which comes down to error handling and is probably less error prone than our current method.
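
To make the idea concrete, here is a rough sketch of the shape such a lookup could take. The ActiveData /query endpoint and its JSON query syntax are real, but the table and field names below are hypothetical placeholders - I don't know which index actually holds the file-to-component mapping:

import requests

# Hypothetical sketch only: "files.metadata", "file.path" and "file.bug_component"
# are placeholder names, not a confirmed ActiveData schema.
query = {
    "from": "files.metadata",
    "select": "file.bug_component",
    "where": {"eq": {"file.path": (
        "testing/web-platform/tests/html/semantics/embedded-content/"
        "media-elements/track/track-element/track-cues-missed.html")}},
    "limit": 1,
}
resp = requests.post("https://activedata.allizom.org/query", json=query)
print(resp.json())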
Flags: needinfo?(klahnakoski)
I have no concerns.  The amount of data is tiny, and the queries against it are simple.
Flags: needinfo?(klahnakoski)
Why not just use TH/TC as gps suggested, to avoid a new dependency? The tool just needs to grab https://treeherder.mozilla.org/api/project/mozilla-central/jobs/?count=1&job_type_name=source-test-file-metadata-bugzilla-components or something to work out a recent bugzilla job, and then either convert the job guid to a TC guid directly and grab https://queue.taskcluster.net/v1/task/<guid>/runs/0/artifacts/public/components.json directly, or go via the relevant job details endpoint.
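
A rough sketch of that flow (the two URLs are the ones above; the response field names, the guid format, and the artifact's mapping shape are assumptions that would need checking against the real API responses):

import requests

TH_JOBS = ("https://treeherder.mozilla.org/api/project/mozilla-central/jobs/"
           "?count=1&job_type_name=source-test-file-metadata-bugzilla-components")

# Find a recent bugzilla-components job. The response shape here is assumed;
# treat this as pseudocode for the flow rather than the exact API contract.
jobs = requests.get(TH_JOBS).json()
job_guid = jobs["results"][0]["job_guid"]      # assumed field name
task_id, run_id = job_guid.split("/")          # assumed "<taskId>/<runId>" guid format

# Fetch the components.json artifact produced by that task.
artifact_url = ("https://queue.taskcluster.net/v1/task/%s/runs/%s"
                "/artifacts/public/components.json" % (task_id, run_id))
components = requests.get(artifact_url).json()

# Inspect the mapping before relying on it - whether it is keyed by file or by
# component is an assumption here.
print(list(components.items())[:3])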
Flags: needinfo?(ryanvm)
I'd much rather just fix bug 1354791 than add more complexity in Treeherder, unless I'm overlooking something else?
I would like to work on this. I am not aware of how bug 1354791 will fix this; maybe I don't understand that bug or the solution there.
The bug filer uses an API (on hg.mozilla.org) to fetch the bug component from in-repo metadata. That API is currently broken. 

:gps suggested using data from a different source instead; however, I'm saying it might mean less complexity in Treeherder if we just fixed the original API instead.

If the long term goal were to get everyone to use the new data source instead (or if there were some other advantage of it, other than it not being broken), then I could be persuaded otherwise :-)
In case it isn't obvious, the effect on web-platform-tests intermittents is that they basically all end up in Testing :: web-platform-tests and therefore don't get seen by anyone who could fix the problem, at least not without further manual triage. Therefore this is having a real effect on our ability to handle bugs and other intermittent issues.

It would be good to determine the actual cost/benefit of different approaches to solving this problem, because currently the maintainers of the two most obvious pieces of code where a fix could be applied are both claiming it's too difficult to make this work in their system, and so no progress is being made.
On the Treeherder side, there is no maintainer. The one person who looked after this feature no longer works at Mozilla and the Treeherder team does not have the resources to take another feature under our wing. If people want that to change, then we need more headcount.
I have a pending NI for :gps https://bugzilla.mozilla.org/show_bug.cgi?id=1354791#c11 to solve this same issue.

:aryx, is the Treeherder bug filing tool something you can fix on your end?
Flags: needinfo?(aryx.bugmail)
I understand that the treeherder team is chronically understaffed. I also understand that this is a frustrating case because the feature relied on a "third party" API that suddenly stopped working. However the bug is wasting the time of engineers who have to deal with misclassified bugs, or have to ignore intermittent failures that would be fixed if only the right people knew that they existed.

If we don't have the resources to make the bug filer work correctly, it should be disabled, as we did for the equally undermaintained autoclassify panel. If that isn't an acceptable solution, we need to work out between the Treeherder team and the sheriffs' team how to prioritise fixing this kind of high-impact issue.
I agree this is something that should receive resources to fix. Coop, could you find someone to do this, and coordinate with the sheriffs to ensure they don't continue to mis-file bugs in the meantime?
Flags: needinfo?(coop)
I have a JS implementation locally, doing some cleanup right now.
Flags: needinfo?(aryx.bugmail)
Amazing - thank you! :-)
Flags: needinfo?(coop)
FWIW, the "high impact" part of this could have just been fixed months ago by removing the three lines at https://github.com/mozilla/treeherder/blob/cca48d14df73d470a59f79b8c1a4991c93a0da0b/ui/js/controllers/bugfiler.js#L172
Comment on attachment 8960747 [details] [review]
Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3355

Left some comments :-)

Since leaving those, I've seen Phil's comment above - I agree it's easiest to remove that for now.
Attachment #8960747 - Flags: review?(cdawson) → review-
Now that bug 1447771 has fixed the main issue this was causing, this seems to be a dupe of bug 1354791 (fixing the hg.m.o API), unless we've decided that it's preferred to move away from that API longer term?

This conversation hasn't moved in a while, and we seem to get good-quality bugs these days; closing unless there is new information.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX

Reopening as it's a time waster.

Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---

Hi Chris, are you the new taskcluster SRE person? Could you help get the deployment of https://prototype.treeherder.nonprod.cloudops.mozgcp.net/jobs?repo=autoland and the related database schemas going? I'd like to check that the new data ingestion works as expected before it gets merged into production. Thank you in advance.

Flags: needinfo?(cvalaas)

I've merged the new cronjob/schedule into the nonprod treeherder deploy spec.
I don't see anything regarding databases in the cloudops code, but I'm new on this project, so I could be missing something ... ?

Flags: needinfo?(cvalaas)

It's also my first time having such a request.

Questions:

  1. Does the migration (python manage.py migrate, see https://docs.djangoproject.com/en/3.1/topics/migrations/#workflow) need to be run manually (I think so)?
  2. The call to /api/repository/ fails and causes https://prototype.treeherder.nonprod.cloudops.mozgcp.net/jobs?repo=autoland to remain blank. This data gets loaded from treeherder/model/fixtures/repository.json into the database when the app starts/reboots. It got modified in the recent push, but I don't spot an issue (and it ran successfully on the local machine): https://github.com/mozilla/treeherder/commit/22155749149bfeb018ffc3b1a3ed849b591d07dc#diff-4d440bff7edf0c62897c28f2cd87538d1e2929e06fbc8409856f3f1e492762fe As far as I remember, I found a message about a missing column related to performance earlier this week (in GCP?) but am unable to find it.
    Log Explorer shows nothing for resource.labels.database_id="moz-fx-treeherde-nonprod-34ec:treeherder-nonprod-prototype-v1" - could you check in GCP whether failure messages are being logged elsewhere?
    Thank you in advance.

Found the error: https://console.cloud.google.com/errors/CLLUuf7x2fOAXw?time=P1D&project=moz-fx-treeherde-nonprod-34ec
OperationalError: (1054, "Unknown column 'repository.life_cycle_order' in 'field list'")
So it's missing https://github.com/mozilla/treeherder/pull/7151/files#diff-69a54a8c36fc6091dc4cb899a8593707d4166e68e8db8c08dd33c4740877c879R107 - the migration command from comment 31 (in full: docker-compose run backend ./manage.py migrate) should fix this.
(The performance error message I remembered was this one.)

It does seem like the db migrate needs to be run manually. It is not part of the prototype Jenkins pipeline, though it is part of the stage and production pipelines.
I don't see any documentation in any of cloudops' repos regarding db migrations for treeherder.
:sclements, since you seem to know more than us about how these deploys go, do you know how db migrations work for the prototype env?

Flags: needinfo?(sclements)

After s'more poking, it looks like the MIGRATE step in Jenkins does run a container with the entrypoint set to "release", which, according to entrypoint_prod.sh calls ./bin/pre_deploy, which runs the django migration.
Whether or not this should happen on prototype, though, I'm still unclear. If I understand prototype correctly, it gets reset to master every-so-often, so automatic db migrations may not be desirable (or workable)...

(In reply to chris valaas [:cvalaas] from comment #34)

> After s'more poking, it looks like the MIGRATE step in Jenkins does run a container with the entrypoint set to "release", which, according to entrypoint_prod.sh calls ./bin/pre_deploy, which runs the django migration.
> Whether or not this should happen on prototype, though, I'm still unclear. If I understand prototype correctly, it gets reset to master every-so-often, so automatic db migrations may not be desirable (or workable)...

Hi Chris, prototype should behave the exact same way as stage and prod - so migrations should run on every deploy. People will only occasionally use the prototype branch by pushing directly to it via Git; it's not designed to automatically reset to master as far as I'm aware. Also, for your reference, here are some docs that might be useful for you - just a general FYI. https://treeherder.readthedocs.io/infrastructure/administration.html#database-management-cloudops

Flags: needinfo?(sclements)

Okay, I added the MIGRATE step to the prototype deploy. Should happen next prototype deployment.

Prototype got a new push from me yesterday (a merge of master, because the prototype branch is protected against force pushes) and it shows the pushes now but no tasks - this could be explained if it fails to connect to Pulse and retrieve the messages about the tasks. This was working 11 days earlier and the recent commits look unrelated. The database is working because the pushes are being stored.

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #37)

> Prototype got a new push from me yesterday (a merge of master because the prototype branch is protected force pushes) and it shows the pushes now but no tasks - this could be explained if it fails to connect to Pulse and retrieve the messages about the tasks. This was working 11 days earlier and the recent commits look unrelated. The database is working because the pushes are being stored.

I'd file anything like this under cloudOps. Chris, can you look into this please? You'll probably need to look at the pulse_listener_tasks or other *task worker.

Flags: needinfo?(cvalaas)

Most of the log messages from the last 26 hours for pulse_listener_tasks in prototype look like this (several lines per second):

[2021-08-18 20:34:22,714] DEBUG [treeherder.services.pulse.consumers:155] received job message from exchange/taskcluster-queue/v1/task-completed#primary.QugQcbIrTCmCTpBz6tWUfg.0.us-west-1.i-08cb54ae59a438e02.gecko-t.t-linux-xlarge.gecko-level-1.b-Xxq0DqRJyxCYZSZwqoYA._

If I filter out "received job message", these are the only log messages left:

2021-08-17 22:33:45.847 | WARNING  | mozci.configuration:__init__:123 - Configuration path mozci_config.toml is not a file.
22:33:45.878 | INFO     | mozci.data.base:51  - Sources selected, in order of priority: ('hgmo', 'taskcluster', 'treeherder_client').
[2021-08-17 22:33:46,015] INFO [treeherder.services.pulse.consumers:102] Pulse queue queue/treeherder-prototype/tasks bound to: exchange/taskcluster-queue/v1/task-pending #.#
[2021-08-17 22:33:46,019] INFO [treeherder.services.pulse.consumers:102] Pulse queue queue/treeherder-prototype/tasks bound to: exchange/taskcluster-queue/v1/task-running #.#
[2021-08-17 22:33:46,022] INFO [treeherder.services.pulse.consumers:102] Pulse queue queue/treeherder-prototype/tasks bound to: exchange/taskcluster-queue/v1/task-completed #.#
[2021-08-17 22:33:46,025] INFO [treeherder.services.pulse.consumers:102] Pulse queue queue/treeherder-prototype/tasks bound to: exchange/taskcluster-queue/v1/task-failed #.#
[2021-08-17 22:33:46,028] INFO [treeherder.services.pulse.consumers:102] Pulse queue queue/treeherder-prototype/tasks bound to: exchange/taskcluster-queue/v1/task-exception #.#
[2021-08-17 22:33:46,138] INFO [treeherder.services.pulse.consumers:102] Pulse queue queue/treeherder-prototype/tasks bound to: exchange/taskcluster-queue/v1/task-pending #.#
[2021-08-17 22:33:46,142] INFO [treeherder.services.pulse.consumers:102] Pulse queue queue/treeherder-prototype/tasks bound to: exchange/taskcluster-queue/v1/task-running #.#
[2021-08-17 22:33:46,146] INFO [treeherder.services.pulse.consumers:102] Pulse queue queue/treeherder-prototype/tasks bound to: exchange/taskcluster-queue/v1/task-completed #.#
[2021-08-17 22:33:46,152] INFO [treeherder.services.pulse.consumers:102] Pulse queue queue/treeherder-prototype/tasks bound to: exchange/taskcluster-queue/v1/task-failed #.#
[2021-08-17 22:33:46,159] INFO [treeherder.services.pulse.consumers:102] Pulse queue queue/treeherder-prototype/tasks bound to: exchange/taskcluster-queue/v1/task-exception #.#

(There was a similar batch of messages from 19:41 (which seems to match the time of the deploy) that I left out. Same messages though.)

I don't see any other workload ending in *tasks in the cluster.

Flags: needinfo?(cvalaas)

Let's take this log message:

{
insertId: "h8lmn78h4q5vr7lc"
labels: {10}
logName: "projects/moz-fx-treeherde-nonprod-34ec/logs/stderr"
receiveTimestamp: "2021-08-18T21:14:02.870117823Z"
resource: {2}
severity: "ERROR"
textPayload: "[2021-08-18 21:13:59,696] DEBUG [treeherder.etl.taskcluster_pulse.handler:179] Message received for task YBAIl9huRU6FA3QPXRr0jw"
timestamp: "2021-08-18T21:13:59.696802846Z"
}

The mentioned task ran 9 days ago - the prototype instance might try to process the backlog of messages since my first push to prototype because it's unbounded. https://pulseguardian.mozilla.org/queues shows only my own queue.

The queue should be deleted and prototype restarted, which will recreate it (this worked with my queue) - that should get rid of the backlog. https://mana.mozilla.org/wiki/display/ITEO/Systems+Engineering+Team might have admin access in case you don't have it. The credentials are likely in the vault Sarah passed to cloudOps when they took over (I don't have access).

From what I can see via CloudAMQP.com, the treeherder-prototype instance has 19 queues. All but one are empty. The store_pulse_tasks queue has 3+ million messages.
It looks like I can purge that queue, would that work?
It also seems I can delete it, but if purging it is sufficient, that seems the easier solution.

Are there other store_pulse_tasks queues (one per instance, because acked messages are not sent to other instances)?

If the answer is yes: 3+ million messages sounds like this cannot be production. Please purge the messages.

treeherder-prod and treeherder-stage both have store_pulse_tasks queues, but they're both empty (messages are popping in and out, but they're hovering at 0).

Shall I go ahead and purge store_pulse_tasks on treeherder-prototype?

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #42)

> Are there other store_pulse_tasks queues (one per instance, because acked messages are not sent to other instances (?)).
>
> If the answer is yes: 3+ million messages sound like it cannot be production. Please purge the messages.

Each deployment has its own store_pulse_task worker and listeners, so deleting one should not affect other deployments. To clarify, the pulse_listener_tasks is a cloudAMQP queue, not a pulse guardian queue. It takes those messages from pulse guardian queues and then processes and stores those tasks with store_pulse_tasks, which also then kicks off log process workers if applicable. So we want to purge the store_pulse_tasks queue as Chris said, not delete the pulse guardian queues.
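
For reference, a minimal sketch of what inspecting and purging such a queue could look like with pika (the AMQP URL is a placeholder for the CloudAMQP credentials; in this bug the purge appears to have been done through the CloudAMQP console instead):

import pika

# Placeholder URL - the real credentials live with cloudOps / CloudAMQP.
AMQP_URL = "amqps://user:password@host/vhost"

conn = pika.BlockingConnection(pika.URLParameters(AMQP_URL))
channel = conn.channel()

# passive=True only checks the queue and reports its depth without creating it.
status = channel.queue_declare(queue="store_pulse_tasks", passive=True)
print("messages queued:", status.method.message_count)

# Drop the backlog; workers keep consuming new messages as they arrive.
channel.queue_purge(queue="store_pulse_tasks")
conn.close()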

(In reply to chris valaas [:cvalaas] from comment #43)

> treeherder-prod and treeherder-stage both have store_pulse_tasks queues, but they're both empty (messages are popping in and out, but they're hovering at 0).
>
> Shall I go ahead and purge store_pulse_tasks on treeherder-prototype?

Before you purge it, we should figure out why the messages aren't being acknowledged, because that would be why the tasks are not showing up in prototype. Is it under-resourced or is there some other error?

I'm not seeing the connection between your changes on prototype, Sebastian, and why the tasks stopped being stored in the database.

We've had occasional issues with workers not working for some random infra reason and never being alerted to it until someone notices something is broken. Any ideas on how to set up alerts when something fails or we reach some sort of unacknowledged message limit, Chris?

Aha, I see some errors in new relic: https://onenr.io/08dQeJVA5we

So it looks like this error, OperationalError: (1054, "Unknown column 'repository.life_cycle_order' in 'field list'"), that Sebastian mentioned was the cause of the backlog of storing tasks, and the last occurrence was August 17th. So it's safe to proceed with purging the queue then. But I wonder if the other 1.7 million messages from "Retry in 30s: MissingPushException('No push found in try for revision 0d206fdbd6564bd64904ffac7bf83ea3112fbe13 for task BkRVyClsRUO_OoOwnwmTkw')" are also an issue. That seems to be ongoing. That might need to be looked into more.

Chris, if you're going to be the Treeherder point of contact for troubleshooting I can give you access to New Relic if you'd like.

> So safe to proceed with purging the queue then.

Queue has been purged.

> Chris, if you're going to be the Treeherder point of contact for troubleshooting I can give you access to New Relic if you'd like.

That'd be great, thanks!

Sarah, could you merge this to master, please?

Flags: needinfo?(sclements)

Merged.

Flags: needinfo?(sclements)

Added new cron to stage (I assume you want it in stage? If not, let me know) and prod. Awaiting review and merge.
https://github.com/mozilla-services/cloudops-infra/pull/3358

EDIT: merged.

Flags: needinfo?(cvalaas)

Could you check the config, please? Searching Log Explorer for update_files_bugzilla_map finds nothing for treeherder-prod but does find results for treeherder-nonprod - or is this awaiting deployment?

Flags: needinfo?(cvalaas)

Looks like :sclements approved a prod deploy within the last hour ... ?

Flags: needinfo?(cvalaas)

Thanks, I wasn't sure whether this needed a new TH deploy or whether the cloudops-infra change was managed independently. Production shows the desired behavior now.

Status: REOPENED → RESOLVED
Closed: 4 years ago → 3 years ago
Resolution: --- → FIXED
Component: Treeherder: Log Parsing & Classification → TreeHerder