filing new bugs is not using the bugzilla component defined in-tree
Categories
(Tree Management :: Treeherder, defect)
Tracking
(Not tracked)
People
(Reporter: jmaher, Unassigned)
References
(Depends on 1 open bug)
Details
Attachments
(1 file, 1 obsolete file)
There is a process for the tree sheriffs for filing a new intermittent failure bug - unfortunately I don't know where this is documented or what tools are used. I do know that many of the web-platform-tests bugs are not filed in the bugzilla component the tests are associated with; instead they are filed in Testing::web-platform-tests. There are 560 intermittent failures in that component, 55 in the last month. I spot checked a few of the 55 recent ones and they do have valid components associated in moz.build files.

For reference, here are the recent ones: https://bugzilla.mozilla.org/buglist.cgi?keywords=intermittent-failure%2C%20&keywords_type=allwords&list_id=13958611&resolution=---&classification=Components&chfieldto=Now&query_format=advanced&chfieldfrom=2017-12-01&component=web-platform-tests&product=Testing

We need to fix up our tools to work properly, otherwise these bugs are never going to be seen and we are just wasting time.
Reporter
Comment 1•6 years ago
:coop, I believe you are headmaster of the sheriffs, is this something you can drive to ensure our tools are working properly and the work the sheriffs are doing is not for waste?
Comment 2•6 years ago
The non-working component suggestions are bug 1354791 - duplicate?
Comment 3•6 years ago
I think this should be treated as a meta bug rather than duping against a specific technical issue.
Comment 4•6 years ago
Thank you for filing this - I agree it's important that the tools help the bugs end up in the right place, so they don't get missed.

The bug filer tool that files the intermittent failure bugs was created by, and almost solely maintained by, Wes. It would be good to have someone outside the Treeherder team take over maintenance now that he's left. If it would help to have a completely separate Bugzilla component for bugs relating to this tool, I have the necessary permissions to create one (it's currently part of the "Treeherder: Log Parsing & Classification" component). The code for the tool is here:
https://github.com/mozilla/treeherder/blob/master/ui/js/controllers/bugfiler.js
https://github.com/mozilla/treeherder/blob/master/tests/ui/unit/controllers/bugfiler.tests.js

Re this class of bugs (wrong component), the best way to figure out whether it's an issue with the bug filer or the metadata returned from hg.m.o is to:
* follow the treeherder.m.o/logviewer.html link in the bug description
* then click the "revision" link (that links back to the main Treeherder jobs view with that job selected)
* then switch to the failure summary panel
* then click the bugfiler icon next to that failure line (I think "&bugfiler" may need to be added to the URL if you're not a sheriff?)
* find the XHR request made to hg.m.o (eg https://hg.mozilla.org/mozilla-central/json-mozbuildinfo?p=testing/web-platform/mozilla/tests/wasm/f32.wast.js.html) and check that (a) the request was for the correct file, and (b) the response is as expected
Comment 5•6 years ago
(Sorry missed the dep change when using "make comment anyway")
Reporter
Comment 6•6 years ago
OK, the problem looks to be the hg service. Locally I have:

$ ./mach file-info bugzilla-component testing/web-platform/tests/html/semantics/embedded-content/media-elements/track/track-element/track-cues-missed.html
Core :: DOM
  testing/web-platform/tests/html/semantics/embedded-content/media-elements/track/track-element/track-cues-missed.html

but querying hg via the web (https://hg.mozilla.org/mozilla-central/json-mozbuildinfo/?p=testing/web-platform/tests/html/semantics/embedded-content/media-elements/track/track-element/track-cues-missed.html) I get:

{ "error": "unable to obtain moz.build info" }

:gps, do you know why this would be happening?
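A client of this API needs to handle the error payload quoted above. As a hedged sketch (not Treeherder's actual implementation): the error shape is the one shown in this comment, while the success shape used below ("files" mapping each path to a "bug_component" pair) is an assumption about the endpoint's output, not confirmed from its documentation.

```python
# Hypothetical helper: extract (product, component) from a
# json-mozbuildinfo-style response. The error shape is the one quoted
# above; the success shape is an assumption for illustration.
def component_from_mozbuildinfo(payload, path):
    if not isinstance(payload, dict) or "error" in payload:
        # e.g. {"error": "unable to obtain moz.build info"} -> no component known
        return None
    entry = payload.get("files", {}).get(path, {})
    pair = entry.get("bug_component")
    return tuple(pair) if pair else None

failing = {"error": "unable to obtain moz.build info"}
working = {"files": {"a/test.html": {"bug_component": ["Core", "DOM"]}}}
print(component_from_mozbuildinfo(failing, "a/test.html"))  # → None
print(component_from_mozbuildinfo(working, "a/test.html"))  # → ('Core', 'DOM')
```

Returning None rather than raising lets the caller fall back to a default component, which is roughly the failure mode this bug describes.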
Comment 7•6 years ago
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #1)
> :coop, I believe you are headmaster of the sheriffs, is this something you
> can drive to ensure our tools are working properly and the work the sheriffs
> are doing is not for waste?

Redirecting NI request to RyanVM
Comment 8•6 years ago
json-mozbuildinfo has been failing for a while. Bug 1354791 tracks. Getting it working is a non-trivial amount of work, both now and on an ongoing basis. I would encourage tools to consume the JSON produced by Firefox CI that contains Bugzilla metadata. See e.g. https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=bugzilla&selectedJob=156606589
Reporter
Comment 9•6 years ago
This means that the new bug filing tool needs to use data from another source, not a query to the hg server. One thought is we could query ActiveData, given that all of this information is ingested there. Kyle, do you have concerns with all "new bugs" filed with the sheriff/treeherder tool using ActiveData to query the bugzilla component for a given file? The only concerns I can think of are uptime/reliability - that all falls into error handling and is probably less error prone than our current methods.
Comment 10•6 years ago
I have no concerns. The amount of data is tiny, and the queries against it are simple.
Comment 11•6 years ago
Why not just use TH/TC as gps suggested, to avoid a new dependency? The tool just needs to grab https://treeherder.mozilla.org/api/project/mozilla-central/jobs/?count=1&job_type_name=source-test-file-metadata-bugzilla-components or something to work out a recent bugzilla job, and then either convert the job guid to a TC guid directly and grab https://queue.taskcluster.net/v1/task/<guid>/runs/0/artifacts/public/components.json directly, or go via the relevant job details endpoint.
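Once the components.json artifact is fetched, the lookup step could be sketched roughly as follows. The artifact format assumed here (a map from source directories to [product, component] pairs, with the most specific prefix winning) is an assumption for illustration, not a confirmed spec of the file.

```python
# Sketch of the component lookup described above. Assumes components.json
# maps source directories to [product, component] pairs (format assumed,
# not confirmed).
def lookup_component(components, test_path):
    """Walk up the directory tree until a mapped prefix is found."""
    parts = test_path.split("/")
    while parts:
        candidate = "/".join(parts)
        if candidate in components:
            return tuple(components[candidate])
        parts.pop()  # try the parent directory next
    return None

components = {
    "testing/web-platform/tests/html": ["Core", "DOM"],
    "testing/web-platform": ["Testing", "web-platform-tests"],
}
print(lookup_component(components, "testing/web-platform/tests/html/foo/bar.html"))
# → ('Core', 'DOM'): the most specific mapped prefix wins
```

Walking upward from the full path means a test only falls back to the broad Testing::web-platform-tests bucket when no more specific mapping exists, which is exactly the misfiling this bug complains about.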
Updated•6 years ago
Comment 12•6 years ago
I'd much rather just fix bug 1354791 than add more complexity in Treeherder, unless I'm overlooking something else?
Reporter
Comment 13•6 years ago
I would like to work on this. I am not aware of how bug 1354791 will fix this - maybe I don't understand that bug or the solution there.
Comment 14•6 years ago
The bug filer uses an API (on hg.mozilla.org) to fetch the bug component from in-repo metadata. That API is currently broken. :gps suggested using data from a different source instead, however I'm saying it might mean less complexity in Treeherder if we just fixed the original API instead. If the long term goal were to get everyone to use the new data source instead (or if there were some other advantage of it, other than it not being broken), then I could be persuaded otherwise :-)
Comment 15•6 years ago
In case it isn't obvious, the effect on web-platform-tests intermittents is they basically all end up in Testing::web-platform-tests and therefore don't get seen by anyone who could fix the problem, at least without further manual triage. Therefore this is having a real effect on our ability to handle bugs and other intermittent issues. It would be good to determine the actual cost/benefit of different approaches to solving this problem, because currently the maintainers of the two most obvious pieces of code where a fix could be applied are both claiming it's too difficult to make this work in their system, and so no progress is being made.
Comment 16•6 years ago
On the Treeherder side, there is no maintainer. The one person who looked after this feature no longer works at Mozilla and the Treeherder team does not have the resources to take another feature under our wing. If people want that to change, then we need more headcount.
Reporter
Comment 17•6 years ago
I have a pending NI for :gps https://bugzilla.mozilla.org/show_bug.cgi?id=1354791#c11 to solve this same issue. :aryx- is fixing the treeherder bug filing tool something you can fix on your end?
Comment 18•6 years ago
I understand that the treeherder team is chronically understaffed. I also understand that this is a frustrating case because the feature relied on a "third party" API that suddenly stopped working. However the bug is wasting the time of engineers who have to deal with misclassified bugs, or have to ignore intermittent failures that would be fixed if only the right people knew that they existed. If we don't have the resources to make the bug filer work correctly it should be disabled, as we did for the equally undermaintained autoclassify panel. If that isn't an acceptable solution we need to work out between the treeherder team and the sheriff's team how to prioritise fixing this kind of high-impact issue.
Comment 19•6 years ago
I agree this is something that should receive resources to fix. Coop, could you find someone to do this, and coordinate with the sheriffs to ensure they don't continue to mis-file bugs in the meantime?
Comment 20•6 years ago
I have a JS implementation locally, doing some cleanup right now.
Comment 22•6 years ago
Updated•6 years ago
Comment 23•6 years ago
FWIW, the "high impact" part of this could have just been fixed months ago by removing the three lines at https://github.com/mozilla/treeherder/blob/cca48d14df73d470a59f79b8c1a4991c93a0da0b/ui/js/controllers/bugfiler.js#L172
Comment 24•6 years ago
Comment on attachment 8960747 [details] [review]
Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3355

Left some comments :-) Since writing those I've just seen Phil's comment above - agreed, easiest to remove that for now.
Comment 25•6 years ago
Now that bug 1447771 has fixed the main issue this was causing, this seems to be a dupe of bug 1354791 (fixing the hg.m.o API), unless we've decided that it's preferred to move away from that API longer term?
Reporter
Comment 26•4 years ago
This conversation hasn't had activity in a while, and we seem to have good quality bugs these days - closing unless there is new information.
Comment 27•4 years ago
Reopening as it's a time waster.
Updated•3 years ago
Comment 28•3 years ago
Comment 29•3 years ago
Hi Chris, are you the new Taskcluster SRE person? Could you help get the deployment of https://prototype.treeherder.nonprod.cloudops.mozgcp.net/jobs?repo=autoland and the related database schemas going? I'd like to check that the new data ingestion works as expected before it gets merged into production. Thank you in advance.
Comment 30•3 years ago
I've merged the new cronjob/schedule into the nonprod treeherder deploy spec.
I don't see anything regarding databases in the cloudops code, but I'm new on this project, so I could be missing something ... ?
Comment 31•3 years ago
It's also my first time having such a request.
Questions:
- Does the migration mentioned at https://docs.djangoproject.com/en/3.1/topics/migrations/#workflow (python manage.py migrate) need to be run manually (I think so)?
- The call to /api/repository/ fails and causes https://prototype.treeherder.nonprod.cloudops.mozgcp.net/jobs?repo=autoland to remain blank. This data gets loaded from treeherder/model/fixtures/repository.json into the database when the app starts/reboots. It got modified in the recent push but I don't spot an issue (and it ran successfully on the local machine): https://github.com/mozilla/treeherder/commit/22155749149bfeb018ffc3b1a3ed849b591d07dc#diff-4d440bff7edf0c62897c28f2cd87538d1e2929e06fbc8409856f3f1e492762fe As far as I remember, I found a message about a missing column related to performance earlier this week (in GCP?) but am unable to find it.
- Log Explorer shows nothing for resource.labels.database_id="moz-fx-treeherde-nonprod-34ec:treeherder-nonprod-prototype-v1" - could you check in GCP if failure messages are being logged elsewhere?
Thank you in advance.
Comment 32•3 years ago
Found the error: https://console.cloud.google.com/errors/CLLUuf7x2fOAXw?time=P1D&project=moz-fx-treeherde-nonprod-34ec

OperationalError: (1054, "Unknown column 'repository.life_cycle_order' in 'field list'")

So it's missing https://github.com/mozilla/treeherder/pull/7151/files#diff-69a54a8c36fc6091dc4cb899a8593707d4166e68e8db8c08dd33c4740877c879R107 - the migration command from comment 31 (in full: docker-compose run backend ./manage.py migrate) should fix this.
(The performance error message I remembered was this one.)
Comment 33•3 years ago
It does seem like the db migrate needs to be run manually. It is not part of the prototype Jenkins pipeline, though it is part of the stage and production pipelines.
I don't see any documentation in any of cloudops' repos regarding db migrations for treeherder.
:sclements, since you seem to know more than us about how these deploys go, do you know how db migrations work for the prototype env?
Comment 34•3 years ago
After s'more poking, it looks like the MIGRATE step in Jenkins does run a container with the entrypoint set to "release", which, according to entrypoint_prod.sh, calls ./bin/pre_deploy, which runs the django migration.
Whether or not this should happen on prototype, though, I'm still unclear. If I understand prototype correctly, it gets reset to master every-so-often, so automatic db migrations may not be desirable (or workable)...
Comment 35•3 years ago
(In reply to chris valaas [:cvalaas] from comment #34)
> After s'more poking, it looks like the MIGRATE step in Jenkins does run a container with the entrypoint set to "release", which, according to entrypoint_prod.sh, calls ./bin/pre_deploy, which runs the django migration.
> Whether or not this should happen on prototype, though, I'm still unclear. If I understand prototype correctly, it gets reset to master every-so-often, so automatic db migrations may not be desirable (or workable)...

Hi Chris, prototype should behave the exact same way as stage and prod - so migrations should run on every deploy. People will only occasionally use the prototype branch by pushing directly to it via Git; it's not designed to automatically reset to master as far as I'm aware. Also, for your reference, here are some docs that might be useful for you - just a general FYI: https://treeherder.readthedocs.io/infrastructure/administration.html#database-management-cloudops
Comment 36•3 years ago
Okay, I added the MIGRATE step to the prototype deploy. Should happen next prototype deployment.
Comment 37•3 years ago
Prototype got a new push from me yesterday (a merge of master, because the prototype branch is protected against force pushes) and it shows the pushes now but no tasks - this could be explained if it fails to connect to Pulse and retrieve the messages about the tasks. This was working 11 days earlier and the recent commits look unrelated. The database is working because the pushes are being stored.
Comment 38•3 years ago
(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #37)
> Prototype got a new push from me yesterday (a merge of master because the prototype branch is protected force pushes) and it shows the pushes now but no tasks - this could be explained if it fails to connect to Pulse and retrieve the messages about the tasks. This was working 11 days earlier and the recent commits look unrelated. The database is working because the pushes are being stored.
I'd file anything like this under cloudOps. Chris, can you look into this please? You'll probably need to look at the pulse_listener_tasks or other *task worker.
Comment 39•3 years ago
Most of the log messages from the last 26 hours for pulse_listener_tasks in prototype look like this (several lines per second):
[2021-08-18 20:34:22,714] DEBUG [treeherder.services.pulse.consumers:155] received job message from exchange/taskcluster-queue/v1/task-completed#primary.QugQcbIrTCmCTpBz6tWUfg.0.us-west-1.i-08cb54ae59a438e02.gecko-t.t-linux-xlarge.gecko-level-1.b-Xxq0DqRJyxCYZSZwqoYA._
If I filter out "received job message", these are the only log messages left:
2021-08-17 22:33:45.847 | WARNING | mozci.configuration:__init__:123 - Configuration path mozci_config.toml is not a file.
22:33:45.878 | INFO | mozci.data.base:51 - Sources selected, in order of priority: ('hgmo', 'taskcluster', 'treeherder_client').
[2021-08-17 22:33:46,015] INFO [treeherder.services.pulse.consumers:102] Pulse queue queue/treeherder-prototype/tasks bound to: exchange/taskcluster-queue/v1/task-pending #.#
[2021-08-17 22:33:46,019] INFO [treeherder.services.pulse.consumers:102] Pulse queue queue/treeherder-prototype/tasks bound to: exchange/taskcluster-queue/v1/task-running #.#
[2021-08-17 22:33:46,022] INFO [treeherder.services.pulse.consumers:102] Pulse queue queue/treeherder-prototype/tasks bound to: exchange/taskcluster-queue/v1/task-completed #.#
[2021-08-17 22:33:46,025] INFO [treeherder.services.pulse.consumers:102] Pulse queue queue/treeherder-prototype/tasks bound to: exchange/taskcluster-queue/v1/task-failed #.#
[2021-08-17 22:33:46,028] INFO [treeherder.services.pulse.consumers:102] Pulse queue queue/treeherder-prototype/tasks bound to: exchange/taskcluster-queue/v1/task-exception #.#
[2021-08-17 22:33:46,138] INFO [treeherder.services.pulse.consumers:102] Pulse queue queue/treeherder-prototype/tasks bound to: exchange/taskcluster-queue/v1/task-pending #.#
[2021-08-17 22:33:46,142] INFO [treeherder.services.pulse.consumers:102] Pulse queue queue/treeherder-prototype/tasks bound to: exchange/taskcluster-queue/v1/task-running #.#
[2021-08-17 22:33:46,146] INFO [treeherder.services.pulse.consumers:102] Pulse queue queue/treeherder-prototype/tasks bound to: exchange/taskcluster-queue/v1/task-completed #.#
[2021-08-17 22:33:46,152] INFO [treeherder.services.pulse.consumers:102] Pulse queue queue/treeherder-prototype/tasks bound to: exchange/taskcluster-queue/v1/task-failed #.#
[2021-08-17 22:33:46,159] INFO [treeherder.services.pulse.consumers:102] Pulse queue queue/treeherder-prototype/tasks bound to: exchange/taskcluster-queue/v1/task-exception #.#
(There was a similar batch of messages from 19:41 (which seems to match the time of the deploy) that I left out. Same messages though.)
I don't see any other workload ending in *tasks in the cluster.
Comment 40•3 years ago
Let's take this log message:
{
insertId: "h8lmn78h4q5vr7lc"
labels: {10}
logName: "projects/moz-fx-treeherde-nonprod-34ec/logs/stderr"
receiveTimestamp: "2021-08-18T21:14:02.870117823Z"
resource: {2}
severity: "ERROR"
textPayload: "[2021-08-18 21:13:59,696] DEBUG [treeherder.etl.taskcluster_pulse.handler:179] Message received for task YBAIl9huRU6FA3QPXRr0jw"
timestamp: "2021-08-18T21:13:59.696802846Z"
}
The mentioned task ran 9 days ago - the prototype instance might try to process the backlog of messages since my first push to prototype because it's unbounded. https://pulseguardian.mozilla.org/queues shows only my own queue.
The queue should be deleted and prototype restarted which will recreate it (worked with my queue) - this shall get rid of the backlog. https://mana.mozilla.org/wiki/display/ITEO/Systems+Engineering+Team might have admin access in case you don't have it. The credentials are likely in the vault Sarah passed to cloudOps when they took over (I don't have access).
Comment 41•3 years ago
From what I can see via CloudAMQP.com, the treeherder-prototype instance has 19 queues. All but one are empty. The store_pulse_tasks queue has 3+ million messages.
It looks like I can purge that queue, would that work?
It also seems I can delete it, but if purging it is sufficient, that seems the easier solution.
Comment 42•3 years ago
Are there other store_pulse_tasks queues (one per instance, because acked messages are not sent to other instances)?
If the answer is yes: 3+ million messages sounds like it cannot be production. Please purge the messages.
Comment 43•3 years ago
treeherder-prod and treeherder-stage both have store_pulse_tasks queues, but they're both empty (messages are popping in and out, but they're hovering at 0).
Shall I go ahead and purge store_pulse_tasks on treeherder-prototype?
Comment 44•3 years ago
(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #42)
> Are there other store_pulse_tasks queues (one per instance, because acked messages are not sent to other instances (?)).
> If the answer is yes: 3+ million messages sound like it cannot be production. Please purge the messages.

Each deployment has its own store_pulse_tasks worker and listeners, so deleting one should not affect other deployments. To clarify, the pulse_listener_tasks is a cloudAMQP queue, not a pulse guardian queue. It takes those messages from pulse guardian queues and then processes and stores those tasks with store_pulse_tasks, which also then kicks off log process workers if applicable. So we want to purge the store_pulse_tasks queue as Chris said, not delete the pulse guardian queues.
Comment 45•3 years ago
(In reply to chris valaas [:cvalaas] from comment #43)
> treeherder-prod and treeherder-stage both have store_pulse_tasks queues, but they're both empty (messages are popping in and out, but they're hovering at 0).
> Shall I go ahead and purge store_pulse_tasks on treeherder-prototype?

Before you purge it, we should figure out why the messages aren't being acknowledged, because that would be why the tasks are not showing up in prototype. Is it under-resourced or is there some other error?
I'm not seeing the connection between your changes on prototype, Sebastian, and why the tasks stopped being stored in the database.
We've had occasional issues with workers not working for some random infra reason and never being alerted to it until someone notices something is broken. Any ideas on how to set up alerts when something fails or we reach some sort of unacknowledged message limit, Chris?
Comment 46•3 years ago
Aha, I see some errors in New Relic: https://onenr.io/08dQeJVA5we
So it looks like the error OperationalError: (1054, "Unknown column 'repository.life_cycle_order' in 'field list'") that Sebastian mentioned was the cause of the backlog of storing tasks, and the last occurrence was August 17th. So it's safe to proceed with purging the queue. But I wonder if the other 1.7 million messages from "Retry in 30s: MissingPushException('No push found in try for revision 0d206fdbd6564bd64904ffac7bf83ea3112fbe13 for task BkRVyClsRUO_OoOwnwmTkw')" are also an issue. That seems to be ongoing. That might need to be looked into more.
Chris, if you're going to be the Treeherder point of contact for troubleshooting, I can give you access to New Relic if you'd like.
Comment 47•3 years ago
> So safe to proceed with purging the queue then.

Queue has been purged.

> Chris, if you're going to be the Treeherder point of contact for troubleshooting I can give you access to New Relic if you'd like.

That'd be great, thanks!
Comment 50•3 years ago
Chris, please set up the cron task for production.
Merge to master: https://github.com/mozilla/treeherder/commit/d3973898636dbfd91a4bba763ba742ced4d3979b
Updated•3 years ago
Comment 51•3 years ago
Added new cron to stage (I assume you want it in stage? If not, let me know) and prod. Awaiting review and merge.
https://github.com/mozilla-services/cloudops-infra/pull/3358
EDIT: merged.
Comment 52•3 years ago
Could you check the config, please? Searching Log Explorer for treeherder-prod for update_files_bugzilla_map doesn't find anything, but it does for treeherder-nonprod - or is this awaiting deployment?
Comment 53•3 years ago
Looks like :sclements approved a prod deploy within the last hour ... ?
Comment 54•3 years ago
Thanks, wasn't sure if this needed a new TH deploy or the cloudops-infra change was managed independently. Production shows the desired behavior now.
Assignee
Updated•2 years ago
Description