Add community-tc to PULSE_{PUSH,TASK}_SOURCES
Categories
(Tree Management :: Treeherder: Data Ingestion, enhancement)
Tracking
(Not tracked)
People
(Reporter: dustin, Assigned: armenzg)
Attachments
(1 file: 122.20 KB, image/png)
These are currently of the form
PULSE_PUSH_SOURCES=[{"root_url": "https://taskcluster.net", "github": true, "hgmo": true, "pulse_url": "amqp://<mumble>:<mumble>@pulse.mozilla.org:5671/?ssl=true"}]
PULSE_TASK_SOURCES=[{"root_url": "https://taskcluster.net", "pulse_url": "amqp://<mumble>:<mumble>@pulse.mozilla.org:5671/?ssl=true"}]
and I think this would entail adding
{"root_url": "https://community-tc.services.mozilla.com", "github": true, "hgmo": false, "pulse_url": "amqp://<mumble>:<mumble>@pulse.mozilla.org:5671/communitytc?ssl=true"}
and {"root_url": "https://community-tc.services.mozilla.com", "pulse_url": "amqp://<mumble>:<mumble>@pulse.mozilla.org:5671/communitytc?ssl=true"} respectively.
The service already has permission to access those vhosts. Same for prod, prototype, and staging.
This is an easy thing to roll back if it doesn't work.
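A minimal sketch of the change, assuming both variables hold JSON arrays as shown above. The helper name `add_source` and the placeholder credentials are hypothetical and only illustrate how a second source object could be appended.

```python
import json

COMMUNITY_ROOT_URL = "https://community-tc.services.mozilla.com"
# Placeholder credentials; the real values live in the Heroku config.
COMMUNITY_PULSE_URL = "amqp://user:password@pulse.mozilla.org:5671/communitytc?ssl=true"


def add_source(env_value: str, new_source: dict) -> str:
    """Append new_source to a JSON-encoded list of sources, skipping duplicates."""
    sources = json.loads(env_value)
    if new_source not in sources:
        sources.append(new_source)
    return json.dumps(sources)


# Current value of PULSE_PUSH_SOURCES, with placeholder credentials.
current_push_sources = json.dumps([{
    "root_url": "https://taskcluster.net",
    "github": True,
    "hgmo": True,
    "pulse_url": "amqp://user:password@pulse.mozilla.org:5671/?ssl=true",
}])

community_push_source = {
    "root_url": COMMUNITY_ROOT_URL,
    "github": True,
    "hgmo": False,
    "pulse_url": COMMUNITY_PULSE_URL,
}

updated_push_sources = add_source(current_push_sources, community_push_source)
print(updated_push_sources)
```

Rolling back amounts to restoring the previous string, which is why the change is cheap to undo.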
Comment 1 • 5 years ago
As discussed with Dustin, I'll handle these changes.
Comment 2 • 5 years ago
I tried testing out the community pulse ingestion earlier today on prototype with SimonSapin; however, we weren't able to get it working. I had pushed his branch with the servo tc_root_url changes upstream as the servo-root-url branch (I've confirmed the changes to the root url were applied to the repository table).
I didn't see any errors per se (looking in Papertrail*); it just wasn't showing up in the Treeherder-prototype UI. I kept the taskcluster.net credentials, but added another object to each of PULSE_PUSH_SOURCES and PULSE_TASK_SOURCES for community-tc. However, I've just realized I made one mistake with the push source updates, where I put hgmo:true; I'm not sure if that would've made a difference.
*There was a lot of noise from worker_store_pulse_data for prototype: 'Could not determine Treeherder route. Either there is no route, or more than one matching route exists.Task ID: <id>' but those were in the logs before I had made these changes to prototype. I also didn't find anything of use with the pulse_listener worker.
Armen and Cam probably have more knowledge and insight into pulse ingestion than I do, so if either one of you would like to jump in and try this again with Simon in the morning please feel free. Otherwise I'll continue working on it with Simon.
Note: I have a WIP branch called taskcluster-rooturl-changes that includes changes to repositories.json and a few other places where taskcluster.net is used. Additional backend code changes might still be needed; any further changes required to get this working should go on that branch, so feel free to add commits if you need to.
Dustin, is there a way to see whether a task was processed by the pulse queue/exchange? This is one Simon created for testing: https://community-tc.services.mozilla.com/tasks/NSkZtsUQR-iS9Kui_4Mt1A
Edit: I've removed those credentials from the Heroku configs and reset the branch to master. Armen has said he'll take a stab at this in the morning.
Comment 3 (Assignee) • 5 years ago
I've deployed the branch, updated the env variables and verified that tc_root_url is set to https://community-tc.services.mozilla.com for servo-try.
After a while of not knowing why I wasn't seeing output in Papertrail, I restarted all dynos (there were 14k jobs pending in the celery queues). I've doubled the number of workers to clear the backlog.
I believe the code is getting confused as to what instance it should query.
I've rolled back the PULSE_* env changes and deployed master again.
I've seen these errors in Papertrail:
Skipping push for autoland with incorrect root_url https://community-tc.services.mozilla.com
In New Relic these errors can be found:
taskcluster.exceptions:TaskclusterRestFailure: `HHTCEzzGQ5mP7ZFk0lC01g` does not correspond to a task that exists. Are you sure this task has already been submitted? --- * method: task * errorCode: ResourceNotFound * statusCode: 404 * time: 2019-11-05T16:40:06.108Z
Traceback (most recent call last):
File "/app/.heroku/python/bin/celery", line 11, in <module>
File "/app/.heroku/python/lib/python3.7/site-packages/celery/__main__.py", line 16, in main
File "/app/.heroku/python/lib/python3.7/site-packages/celery/bin/celery.py", line 322, in main
File "/app/.heroku/python/lib/python3.7/site-packages/celery/bin/celery.py", line 496, in execute_from_commandline
File "/app/.heroku/python/lib/python3.7/site-packages/celery/bin/base.py", line 298, in execute_from_commandline
File "/app/.heroku/python/lib/python3.7/site-packages/celery/bin/celery.py", line 488, in handle_argv
...
File "/app/.heroku/python/lib/python3.7/site-packages/celery/app/trace.py", line 648, in __protected_call__
File "/app/treeherder/workers/task.py", line 44, in inner
File "/app/treeherder/etl/tasks/pulse_tasks.py", line 29, in store_pulse_tasks
File "/app/.heroku/python/lib/python3.7/asyncio/base_events.py", line 584, in run_until_complete
File "/app/treeherder/etl/taskcluster_pulse/handler.py", line 113, in handleMessage
File "/app/.heroku/python/lib/python3.7/site-packages/taskcluster/aio/queue.py", line 57, in task
File "/app/.heroku/python/lib/python3.7/site-packages/taskcluster/aio/asyncclient.py", line 95, in _makeApiCall
File "/app/.heroku/python/lib/python3.7/site-packages/taskcluster/aio/asyncclient.py", line 212, in _makeHttpRequest
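The 404 above is consistent with a task ID being looked up on the wrong Taskcluster instance. As a rough illustration, assuming the conventional Taskcluster root-URL layout (the legacy taskcluster.net deployment uses per-service hostnames, newer deployments use `/api/<service>/<version>` under the root URL), the same task ID resolves to different endpoints. The helper `task_api_url` is hypothetical and is not Treeherder code.

```python
LEGACY_ROOT_URL = "https://taskcluster.net"


def task_api_url(root_url: str, task_id: str) -> str:
    """Build the queue 'task' endpoint for a given Taskcluster root URL."""
    root_url = root_url.rstrip("/")
    if root_url == LEGACY_ROOT_URL:
        # Legacy deployment: each service has its own hostname.
        return f"https://queue.taskcluster.net/v1/task/{task_id}"
    # Newer deployments: all services live under <root_url>/api/.
    return f"{root_url}/api/queue/v1/task/{task_id}"


task_id = "HHTCEzzGQ5mP7ZFk0lC01g"
for root in (LEGACY_ROOT_URL, "https://community-tc.services.mozilla.com"):
    print(task_api_url(root, task_id))
```

A task that exists on one instance will not be found on the other, so the consumer has to pair each Pulse message with the root URL of the source it came from.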
Comment 4 (Assignee) • 5 years ago
I've also deleted the celery queue and restored the number of Heroku pulse workers.
Comment 5 (Reporter) • 5 years ago
That noise might be normal -- it looks like we are actually listening to all pulse messages about tasks, not just those with a tc-treeherder.# route. The linked lines could probably be changed to e.g., "exchange/taskcluster-queue/v1/task-pending.tc-treeherder.#". But let's not mess with success.
> Dustin, is there a way to see whether a task was processed by the pulse queue/exchange? This is one Simon created for testing: https://community-tc.services.mozilla.com/tasks/NSkZtsUQR-iS9Kui_4Mt1A
No, not really -- rabbitmq's status is pretty ephemeral.
I'm happy to help out with this debugging -- please look me up.
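To illustrate why a narrower binding would cut the noise, here is a small sketch of AMQP topic matching, where `*` matches exactly one dot-separated word and `#` matches zero or more. The function `topic_matches` and the sample routing keys are illustrative assumptions; real Taskcluster routing keys have more fields than shown here.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Return True if an AMQP topic binding pattern matches a routing key."""
    p = pattern.split(".")
    k = routing_key.split(".") if routing_key else []

    def match(i: int, j: int) -> bool:
        if i == len(p):
            return j == len(k)
        if p[i] == "#":
            # '#' may absorb zero or more words.
            return any(match(i + 1, jj) for jj in range(j, len(k) + 1))
        if j < len(k) and (p[i] == "*" or p[i] == k[j]):
            return match(i + 1, j + 1)
        return False

    return match(0, 0)


# Hypothetical, simplified keys for illustration only.
keys = [
    "tc-treeherder.v2._/servo-try.afbcbf75",
    "index.project.other.thing",
]
for key in keys:
    print(key, topic_matches("#", key), topic_matches("tc-treeherder.#", key))
```

Binding with `#` receives every message, including tasks without a Treeherder route, which matches the "Could not determine Treeherder route" noise described earlier.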
Comment 6 • 5 years ago
Armen and I did some debugging and identified a few issues. Thanks to a script he wrote, he was able to ingest locally the PR and task that Simon had created on the community pulse exchange (and it showed up on the UI). I'll continue with the fixes and test them locally, then on prototype (I'll add them to the PR I'm already working on in bug 1593869). Then we can test live ingestion (Simon creating a new task).
Comment 7 (Assignee) • 5 years ago
My attempt this morning could have failed because I did not include communitytc in pulse.mozilla.org:5671/communitytc?ssl=true.
Comment 8 (Reporter) • 5 years ago
(In reply to Armen [:armenzg] from comment #7)
> My attempt this morning could have failed because I did not include communitytc in pulse.mozilla.org:5671/communitytc?ssl=true.
That would do it!
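A quick way to catch this kind of mistake is to extract the vhost from the Pulse URL before deploying. The sketch below uses only the standard library; `pulse_vhost` is a hypothetical helper, and treating an empty path or a bare `/` as the default vhost is an assumption about how the AMQP client interprets the URL.

```python
from urllib.parse import unquote, urlsplit


def pulse_vhost(pulse_url: str) -> str:
    """Return the AMQP vhost named in a pulse_url ('/' when none is given)."""
    path = urlsplit(pulse_url).path
    # '/communitytc' -> 'communitytc'; '' or '/' -> default vhost.
    return unquote(path[1:]) if len(path) > 1 else "/"


# Placeholder credentials for illustration.
urls = [
    "amqp://user:password@pulse.mozilla.org:5671/?ssl=true",
    "amqp://user:password@pulse.mozilla.org:5671/communitytc?ssl=true",
]
for url in urls:
    print(pulse_vhost(url))
```

A check such as `pulse_vhost(url) == "communitytc"` for the community source would have flagged the configuration used in the first attempt.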
Comment 9 • 5 years ago
I’m in UTC+1 and also traveling part of tomorrow. Dustin, could you give scopes queue:route:tc-treeherder.v2._/servo-try.* to Armen and Sarah? That way they can trigger tasks on https://community-tc.services.mozilla.com/tasks/create while I’m away. It doesn’t need to run any Servo code, only have the definition contain:
scopes:
  - queue:route:tc-treeherder.v2._/servo-try.*  # I’m not sure if this one is actually required in the task definition
routes:
  - tc-treeherder.v2._/servo-try.afbcbf75eaa63ff0eec8fd3858e9155eb8dbadaa
treeherder:
  machine: {platform: Linux}
  labels: [x64]
  symbol: Dummy
afbcbf75eaa63ff0eec8fd3858e9155eb8dbadaa can be any commit SHA from https://github.com/servo/servo/commits/master
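The route in that definition encodes the project and the revision. A small sketch of building and splitting such a route follows; the format is inferred from the examples in this bug rather than taken from Treeherder's schema, and both helper names are hypothetical.

```python
def treeherder_route(project: str, revision: str, prefix: str = "tc-treeherder") -> str:
    """Build a route of the form <prefix>.v2._/<project>.<revision>."""
    return f"{prefix}.v2._/{project}.{revision}"


def parse_treeherder_route(route: str) -> dict:
    """Split a route back into its parts (inverse of treeherder_route)."""
    prefix, version, rest = route.split(".", 2)
    origin_and_project, revision = rest.rsplit(".", 1)
    _origin, _, project = origin_and_project.partition("/")
    return {
        "prefix": prefix,
        "version": version,
        "project": project,
        "revision": revision,
    }


route = treeherder_route("servo-try", "afbcbf75eaa63ff0eec8fd3858e9155eb8dbadaa")
print(route)
print(parse_treeherder_route(route))
```

The revision at the end is what gets matched against a push, so it has to be a commit that Treeherder has ingested for that project.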
Comment 10 • 5 years ago
So we've encountered an interesting error with this taskgroup: https://community-tc.services.mozilla.com/tasks/groups/Ckwize-XTxqXVfQmH5O5sg
I noticed that the taskgroup has the same id as the successful task, whereas the failed task within the group has its own. With Armen's script, I can successfully ingest the associated PR: https://github.com/servo/servo/pull/24662
And then ingest the successful task, but the failed task reports "No push found in servo-prs for revision a4a113d684ab17d784c8e0005c4cdcc913494d64 for task BgOjF0GISDKoR4FJtgejJw".
The two tasks within the group have two different routes/revisions associated with them. Is that supposed to be the case?
Comment 11 • 5 years ago
Oh that’s interesting indeed.
Background
Note that only GitHub PR events are affected. Servo’s CI is mostly based on push events (the bors a.k.a. homu bot creates merge commits and pushes them to purpose-specific branches to trigger tasks), so PR events only provide initial pre-review testing as soon as a PR is opened or updated.
Let’s say that I create a feature-xyz git branch that starts from master of github.com/servo/servo, work on it for a couple days, push it to github.com/SimonSapin/servo, and open a PR #123 to merge it back into master. Other PRs have been merged into master in the meantime.
The push creates a git reference refs/heads/feature-xyz in my fork. When I open the PR, GitHub also creates refs/pull/123/head in servo/servo which points to the same commit in my fork and stays in sync as long as the PR is open. It also immediately creates a merge commit of my branch with master, and keeps it at refs/pull/123/merge. And of course it creates a PR event that Taskcluster listens to.
We want to run tests on the merge commit. Even when there are no git merge conflicts, a branch might have logic conflicts that cause compilation or tests to fail. For example, my branch might rename a function and update all call sites while another PR, merged into master in the meantime, adds another call to that function under the old name.
When Taskcluster receives a PR event we get a bunch of information as input to JSON-e processing of .taskcluster.yml.
What’s going on
Today, Servo has this logic:
version: 1
# …
tasks:
  # …
  - $if: >-
      tasks_for == 'github-pull-request' &&
      event['action'] in ['opened', 'reopened', 'synchronize']
    then:
      # …
      routes:
        - "tc-treeherder.v2._/servo-prs.${event.pull_request.head.sha}"
        - "tc-treeherder-staging.v2._/servo-prs.${event.pull_request.head.sha}"
      payload:
        env:
          # We use the merge commit made by GitHub, not the PR’s branch
          GIT_REF: refs/pull/${event.pull_request.number}/merge
          # `event.pull_request.merge_commit_sha` is tempting but not what we want:
          # https://github.com/servo/servo/pull/22597#issuecomment-451518810
          GIT_SHA: FETCH_HEAD
.taskcluster.yml constructs a Treeherder route for the decision task based on ${event.pull_request.head.sha}, which is the commit hash for refs/pull/123/head. But then we make the decision task’s command run git fetch refs/pull/${event.pull_request.number}/merge in order to test the merge commit. Then when decision_task.py constructs Treeherder routes for tasks that it spawns, it uses the hash of the commit that it has fetched, which is different.
The fix
This is a bug in Servo. ${event.pull_request.merge_commit_sha} would come closer to describing what we’re actually doing, but as the YAML comment mentions I’ve experimentally found it to not always be accurate. I suspect this is because merge commits are created asynchronously, which races with the creation of the event payload, so the payload often ends up with an outdated SHA.
But what matters as far as Treeherder is concerned is not what commit we’re actually testing. It just wants to match up tasks with “pushes” that it separately gathered from GitHub events. That code uses "revision": head_commit["sha"] so ${event.pull_request.head.sha} looks correct. We could separately pass it to the decision task as an environment variable, so that it sets correct routes for sub-tasks.
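The mismatch and the proposed fix can be sketched as follows. This is a toy model, not Servo's decision_task.py: the environment variable name `TREEHERDER_REVISION`, the helper names, and the shortened SHAs are all hypothetical.

```python
HEAD_SHA = "1111111"   # stands in for refs/pull/123/head
MERGE_SHA = "2222222"  # stands in for refs/pull/123/merge (what FETCH_HEAD resolves to)

# Treeherder records pushes keyed by the PR head commit.
pushes_by_revision = {
    HEAD_SHA: {"repository": "servo-prs", "revision": HEAD_SHA},
}


def subtask_route(project: str, fetched_sha: str, env: dict) -> str:
    """Build a sub-task route, preferring an explicitly passed revision."""
    revision = env.get("TREEHERDER_REVISION", fetched_sha)
    return f"tc-treeherder.v2._/{project}.{revision}"


def find_push_for_route(route: str):
    """Look up the push whose revision matches the end of the route."""
    revision = route.rsplit(".", 1)[1]
    return pushes_by_revision.get(revision)


# Current behaviour: the route uses the fetched merge commit, so no push matches.
current = subtask_route("servo-prs", MERGE_SHA, env={})
# Proposed fix: the decision task receives the head SHA and uses it for routes.
fixed = subtask_route("servo-prs", MERGE_SHA, env={"TREEHERDER_REVISION": HEAD_SHA})

print(current, find_push_for_route(current))
print(fixed, find_push_for_route(fixed))
```

The tests still run against the merge commit in both cases; only the revision used for the route changes.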
Or not?
This bug has existed for a long time, so I wanted to check what it looked like on https://treeherder.mozilla.org/#/jobs?repo=servo-prs for pre-migration tasks from taskcluster.net. I’ve discovered that this page is empty, then found bug 1588896 about removing it. So maybe none of this matters and we should simply remove tc-treeherder.v2._/servo-prs.* routes?
Comment 12 (Assignee) • 5 years ago
I have set the env variables (this time with communitytc in the Pulse url) and deployed the branch.
I see that the PR landed on master so we don't need to worry about servo-prs.
sclements, the big issues I saw yesterday on New Relic were because prototype was connecting twice to the Firefox CI Pulse streams while at the same time configuring one consumer to use the Firefox TC instance and the other the Community TC instance (even though all tasks were FX CI tasks). Today I don't see any of those issues on New Relic.
Comment 13 (Assignee) • 5 years ago
If you load this push you will see that two of the decision tasks have task ID UMfSSb0jRcG3pp3g-fCb7g. The UI points you to a nonexistent task in the Firefox CI instance (since Sarah's work needs to land). The task can be found in the Community TC instance.
At the same time, you can see that staging is not configured to ingest from the community Pulse stream, thus, not showing those tasks:
https://treeherder.allizom.org/#/jobs?repo=servo-master&revision=a931a80cf5ad644cba8d8445cd0a62e55f04b906
I believe we have verified that the pipeline works.
What are the next steps? What's left before we switch Servo to only run on the community instance on stage and production?
NOTE TO SELF:
- Include communitytc in the Pulse URL
- Do not include "hgmo": true in PULSE_PUSH_SOURCES for the community configuration
Comment 14 (Reporter) • 5 years ago
> I’m in UTC+1 and also traveling part of tomorrow. Dustin, could you give scopes queue:route:tc-treeherder.v2._/servo-try.* to Armen and Sarah? That way they can trigger tasks on https://community-tc.services.mozilla.com/tasks/create while I’m away.
Please ping in #taskcluster if you still need this.
Comment 15 • 5 years ago
Armen's going to test out the community-ci credentials on stage tomorrow morning. Per my convo with Dustin, we can also test out the firefox-ci "premigration" cluster: {"root_url": "https://firefox-ci-tc.services.mozilla.com", "github": true, "hgmo": true, "pulse_url": "amqp://<mumble>:<mumble>@pulse.mozilla.org:5671/premigration?ssl=true"}
But we'll only get tasks from hgmo, not pushes.
Once I get my PR finished we can test it on prototype and then maybe stage with both those credentials configured. Tom Prince has said he'll start running pushes on the firefox-ci cluster today.
Comment 16 (Assignee) • 5 years ago
If we don't get pushes through the "premigration" exchanges the tasks won't show up on the UI (If a task does not have a revision in the DB it will be retried ten times).
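The retry behaviour described here can be sketched roughly as below. This is a simplified model: in the real system the retries are scheduled by Celery with delays, and the names `ingest_task`, `push_exists`, and `MissingPushError` are hypothetical.

```python
class MissingPushError(Exception):
    """Raised when no push is found for a task's revision after all retries."""


def ingest_task(task: dict, push_exists, max_retries: int = 10) -> int:
    """Try to ingest a task; return how many retries were needed.

    push_exists is a callable that reports whether a push with the given
    revision is already stored.
    """
    # One initial attempt plus max_retries retries.
    for retries_used in range(1 + max_retries):
        if push_exists(task["revision"]):
            return retries_used
    raise MissingPushError(
        f"No push found for revision {task['revision']} for task {task['task_id']}"
    )


task = {"task_id": "DTfl1cWEQA-Pu9qSKAdNMQ", "revision": "8bff839bce795612f832991783b06766a1e6c5c6"}
try:
    ingest_task(task, push_exists=lambda revision: False)
except MissingPushError as error:
    print(error)
```

If the push never arrives on a stream Treeherder listens to, the task is eventually dropped, which is why tasks without pushes do not appear in the UI.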
Comment 17 (Reporter) • 5 years ago
That's tricky because at least for this testing period, there aren't even hg pushes that are initiating the tasks -- AIUI Tom is triggering the hg-push hook manually, as if there was an hg-push message.
Maybe we can simulate those messages...
Comment 18 (Assignee) • 5 years ago
I can modify the code to allow creating pushes for tasks that don't yet have a revision.
We can put the behaviour behind an env variable.
Does that work for you?
Comment 19 (Reporter) • 5 years ago
Tom, would it be useful to have a "fake" hgmo that can send pulse messages on the appropriate domain in the premigration AMQP vhost (the one firefox-ci is configured to use)?
Comment 20 • 5 years ago
I had been thinking, naively, that pushes could be ingested from pulse and tasks from premigration.
Comment 21 (Reporter) • 5 years ago
Ah, that will work. Sorry I didn't see that yesterday!
PULSE_PUSH_SOURCES='[{"root_url": "https://firefox-ci-tc.services.mozilla.com", "github": true, "hgmo": true, "pulse_url": "amqp://<mumble>:<mumble>@pulse.mozilla.org:5671/?ssl=true"}]'
PULSE_TASK_SOURCES='[{"root_url": "https://firefox-ci-tc.services.mozilla.com", "pulse_url": "amqp://<mumble>:<mumble>@pulse.mozilla.org:5671/premigration?ssl=true"}]'
Comment 22 (Assignee) • 5 years ago
Please read this comment very closely.
This comment only involves changes related to comment 21 wrt using https://firefox-ci-tc.services.mozilla.com.
I tried making the changes and everything looked OK until I tried to load https://treeherder-prototype.herokuapp.com/#/jobs?repo=autoland which showed the API was returning 503.
I reverted the changes while I investigate; however, I want to keep track of the notes [1] on how to deploy such a change and monitor it.
We can avoid the careful steps if we remove the check for "tc_root_url" in here and here.
Any objections?
[1]
Because the change involves touching root_url it requires care since we check the repository's tc_root_url value
Steps:
- Disable the two Pulse listeners
- Update the database
  UPDATE treeherder.repository SET tc_root_url="https://firefox-ci-tc.services.mozilla.com" WHERE tc_root_url="https://taskcluster.net";
  - Notice that this change will require a code change
  - A deployment will most likely cause the column to revert (I'm not 100% familiar with the repositories fixture)
- Update the variables
- Re-enable the Pulse listeners
- Visit Pulse guardian and delete the treeherder-prototype queues
  - You can also let it be as they will eventually get deleted automatically
Monitoring of this change:
- Celery queues
  - Load the Rabbit MQ add-on from Heroku
  - Click the green button "RabbitMQ Manager"
  - Click on "queues" tab
  - Notice that there might be fenix and reference-browser tasks stuck in the queue (They get retried 10 times)
    - You will see this in New Relic as
      No push found in fenix for revision 8bff839bce795612f832991783b06766a1e6c5c6 for task DTfl1cWEQA-Pu9qSKAdNMQ
- New Relic
  - The "Error analytics" tab is your friend
  - Shorten the timeline to 30 minutes
  - You will not see a vertical green light because you're not doing a deployment
- Papertrail
  - You can use this search on Papertrail to narrow the events:
    -source and -Could and -newrelic and -app/web and -heroku/router and -rabbitmq and -Relic and -ForkPool
  - The event you don't want to see is
    Skipping push for autoland with incorrect root_url https://firefox-ci-tc.services.mozilla.com
    - This happens if you don't disable the Pulse listeners in time
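For context, the check that makes these steps so delicate presumably looks something like the sketch below. This is a guess at its shape based only on the log message quoted above; it is not copied from Treeherder, and `should_ingest_push` is a hypothetical name.

```python
import logging

logger = logging.getLogger(__name__)


def should_ingest_push(repo_name: str, repo_tc_root_url: str, message_root_url: str) -> bool:
    """Skip pushes whose source root_url differs from the repository's tc_root_url."""
    if repo_tc_root_url != message_root_url:
        logger.info(
            "Skipping push for %s with incorrect root_url %s",
            repo_name,
            message_root_url,
        )
        return False
    return True


# While the database still says taskcluster.net, pushes arriving from the
# new instance's source configuration are dropped.
print(should_ingest_push(
    "autoland",
    "https://taskcluster.net",
    "https://firefox-ci-tc.services.mozilla.com",
))
```

With a check like this in place, the database column and the env variables have to change together while the listeners are stopped; removing the check is what would make the careful ordering unnecessary.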
Comment 23 (Assignee) • 5 years ago
tomprince let me know that there are no tasks coming through premigration unless a push is manually created:
> There is nothing that sends push messages to premigration. Task (when they run) send messages to premigration, but since there are no push messages, task only run when somebody starts them by hand.
Comment 24 (Assignee) • 5 years ago
I have changed the two variables on stage.
If there are no issues we could prep the variables on production in preparation for Saturday.
As of now, are there pushes/tasks coming for servo-auto through the Firefox instance and the Community instance?
I want to determine whether we will stop seeing jobs on stage for it since we have not yet landed this PR https://github.com/mozilla/treeherder/pull/5593
Comment 25 (Assignee) • 5 years ago
tomprince/mtabara re-ran this task on the Firefox CI instance.
I deployed this PR, which avoids skipping jobs with a different TC root URL, along with the env variable changes from comment 21.
I can see the two re-runs here:
https://treeherder-prototype.herokuapp.com/#/jobs?repo=mozilla-central&tier=1%2C2%2C3&revision=e8b7c48d4e7ed1b63aeedff379b51e566ea499d9&searchStr=deb10
Comment 26 (Assignee) • 5 years ago
I have made one more configuration change to prototype in order to make sure we can ingest from all 3 places at the same time.
I can now again see tasks for autoland.
- Pushes are coming from:
  - Original - {"root_url": "https://firefox-ci-tc.services.mozilla.com", "github": true, "hgmo": true, "pulse_url": "amqp://<credentials>@pulse.mozilla.org:5671/?ssl=true"}
  - Community - {"root_url": "https://community-tc.services.mozilla.com", "github": true, "pulse_url": "amqp://<credentials>@pulse.mozilla.org:5671/communitytc?ssl=true"}
- Tasks are coming from:
  - Original - {"root_url": "https://taskcluster.net", "pulse_url": "amqp://<credentials>@pulse.mozilla.org:5671?ssl=true"}
  - Premigration - {"root_url": "https://firefox-ci-tc.services.mozilla.com", "pulse_url": "amqp://<credentials>@pulse.mozilla.org:5671/premigration?ssl=true"}
  - Community - {"root_url": "https://community-tc.services.mozilla.com", "pulse_url": "amqp://<credentials>@pulse.mozilla.org:5671/communitytc?ssl=true"}
Notice that this is possible thanks to the PR I have requested review for.
Comment 27 (Assignee) • 5 years ago
I have rolled this back as per a request on IRC:
Original - {"root_url": "https://taskcluster.net", "pulse_url": "amqp://<credentials>@pulse.mozilla.org:5671?ssl=true"}
Comment 28 • 5 years ago
(In reply to Armen [:armenzg] from comment #24)
> I have changed the two variables on stage.
> If there are no issues we could prep the variables on production in preparation for Saturday.
If we do this we might want to keep the credentials for taskcluster.net in heroku for all environments until Saturday so we don't get warnings about queues overgrowing and losing that data (at least for production).
Comment 29 (Assignee) • 5 years ago
(In reply to Sarah Clements [:sclements] from comment #28)
> (In reply to Armen [:armenzg] from comment #24)
> > I have changed the two variables on stage.
> > If there are no issues we could prep the variables on production in preparation for Saturday.
>
> If we do this we might want to keep the credentials for taskcluster.net in heroku for all environments until Saturday so we don't get warnings about queues overgrowing and losing that data (at least for production).
Unfortunately, I have created two (if not three) different conversations in this bug.
The changes on stage are to have the env variables prepared in advance for when Servo moves to the community repo.
The changes on prototype are to help tomprince/sheriffs validate the Firefox TC instance.
The PR that just landed is to support the latter and even your "prep" PR.
Comment 30 (Assignee) • 5 years ago
I can already see some tasks on stage that are not showing on production.
The release task:
https://community-tc.services.mozilla.com/tasks/ahSlQPuMTHmObDaPxQRpcw
I have deployed the PR from Simon to stage.
I will verify a few more things and promote to production.
Comment 31 (Assignee) • 5 years ago
I've made a deployment to production for Simon's PR and I've made these changes to the two variables:
[{"root_url": "https://taskcluster.net", "github": true, "hgmo": true, "pulse_url": "amqp://<prod credentials>@pulse.mozilla.org:5671/?ssl=true"},{"root_url": "https://community-tc.services.mozilla.com", "github": true, "pulse_url": "amqp://<prod credentials>@pulse.mozilla.org:5671/communitytc?ssl=true"}]
[{"root_url": "https://taskcluster.net", "pulse_url": "amqp://<prod credentials>@pulse.mozilla.org:5671/?ssl=true"},{"root_url": "https://community-tc.services.mozilla.com", "pulse_url": "amqp://<prod credentials>@pulse.mozilla.org:5671/communitytc?ssl=true"}]
This means that tasks from the community Pulse stream are being fed into Treeherder.
If this change needs to be reverted, it would require rolling back to v524, which would revert the two env variable changes plus the code changes. You would also need to revert the changes on master.
Comment 32 (Assignee) • 5 years ago
I've also backfilled some community tasks for the latest push:
heroku run --app treeherder-prod bash
~ $ ./manage.py ingest_push_and_tasks task --root-url https://community-tc.services.mozilla.com --task-id ahSlQPuMTHmObDaPxQRpcw 2>&1 | grep -v Deprecation
~ $ ./manage.py ingest_push_and_tasks task --root-url https://community-tc.services.mozilla.com --task-id Rs08lIFpR1CMsBZATr4PJg 2>&1 | grep -v Deprecation
~ $ ./manage.py ingest_push_and_tasks task --root-url https://community-tc.services.mozilla.com --task-id QcMzmslSQ3qjG1tR30kd5A 2>&1 | grep -v Deprecation
~ $ ./manage.py ingest_push_and_tasks task --root-url https://community-tc.services.mozilla.com --task-id ZMQTUUjURiqPq2ebYIwInQ 2>&1 | grep -v Deprecation
I just wanted to document that this is possible.
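When there are many tasks to backfill, the commands above can be generated rather than typed by hand. This sketch only builds and prints the command lines, using the same subcommand and flags shown above; the helper `backfill_command` is hypothetical.

```python
import shlex

ROOT_URL = "https://community-tc.services.mozilla.com"
TASK_IDS = [
    "ahSlQPuMTHmObDaPxQRpcw",
    "Rs08lIFpR1CMsBZATr4PJg",
    "QcMzmslSQ3qjG1tR30kd5A",
    "ZMQTUUjURiqPq2ebYIwInQ",
]


def backfill_command(task_id: str, root_url: str = ROOT_URL) -> list:
    """Return the argv for ingesting one task (and its push) from root_url."""
    return [
        "./manage.py", "ingest_push_and_tasks", "task",
        "--root-url", root_url,
        "--task-id", task_id,
    ]


for task_id in TASK_IDS:
    # Print a shell-safe command line; run it inside the Heroku dyno.
    print(shlex.join(backfill_command(task_id)))
```

Inside the dyno, each list could be passed to `subprocess.run(..., check=True)` instead of being printed.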
Comment 33 (Assignee) • 5 years ago
tomprince: Let us know what the sheriffs say about prototype
SimonSapin: Keep an eye on the Servo repos and let us know if there's something not working right. I would like to see a new change land on master and compare both treeherder-stage and treeherder-production
Comment 34 • 5 years ago
https://treeherder.allizom.org/#/jobs?repo=servo-auto looks good. I assume prod will do the same soon.
Comment 35 (Assignee) • 5 years ago
It seems that my PULSE_* env changes on production were not being picked up.
The treeherder-prod Pulse credentials were not working for the community vhost.
Dustin and I troubleshot it this morning and there was stray whitespace in a regex.
We can now see the tasks being ingested on production:
https://treeherder.mozilla.org/#/jobs?repo=servo-auto&selectedJob=275335928
There's also some UI code that I want to land today in preparation for tomorrow's migration.
You can follow along in https://github.com/mozilla/treeherder/pull/5612
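The actual regex is not shown in this bug, so the patterns below are invented purely to illustrate how a single stray space in an alternation can silently stop one vhost from matching while the other keeps working.

```python
import re

# Hypothetical permission patterns; only the second contains a stray space.
INTENDED_PATTERN = r"/|communitytc"
BROKEN_PATTERN = r"/| communitytc"


def vhost_allowed(pattern: str, vhost: str) -> bool:
    """Return True if the whole vhost name matches the permission pattern."""
    return re.fullmatch(pattern, vhost) is not None


for vhost in ("/", "communitytc"):
    print(vhost, vhost_allowed(INTENDED_PATTERN, vhost), vhost_allowed(BROKEN_PATTERN, vhost))
```

A failure of this kind is easy to miss because the default vhost still works, which fits the symptom of the community stream not being picked up on production only.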