Closed
Bug 1242038
Opened 9 years ago
Closed 9 years ago
Send the right buildername over the pulse wire
Categories
(Tree Management :: Treeherder, defect, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: armenzg, Assigned: camd)
References
(Blocks 1 open bug)
Details
Attachments
(2 files)
KWierso selected that 'Rev7 MacOSX Yosemite 10.10.5 mozilla-inbound opt test jsreftest' job on rev 114a806647a2.
We received a pulse message asking for 'Rev5 MacOSX Yosemite 10.10 mozilla-inbound opt test reftest' which is the old name of that job.
I can land a change on pulse_actions to rename the builder for now but please fix this issue if possible during this quarter.
This is causing pulse_actions to try to schedule invalid builders [1] over these revisions [a61bb2a7ffa3:114a806647a2] [2].
The Rev5 builder does not show up on allthethings.json anymore.
The Rev7 change landed back on December 21st [3].
[1]
Jan 22 12:18:29 pulse-actions app/worker2.1: th_buildbot INFO: backfill action requested by wkocher@mozilla.com on repo_name mozilla-inbound with job_id: 20313404
Jan 22 12:18:31 pulse-actions app/worker2.1: mozci INFO: We want to find a job for 'Rev5 MacOSX Yosemite 10.10 mozilla-inbound opt test reftest' in this range: [a61bb2a7ffa3:114a806647a2] (7 revisions)
Jan 22 12:18:34 pulse-actions app/worker2.1: mozci INFO: We want to have 1 job(s) of Rev5 MacOSX Yosemite 10.10 mozilla-inbound opt test reftest on these revisions:
Jan 22 12:18:34 pulse-actions app/worker2.1: mozci INFO: We found a job for buildername 'Rev5 MacOSX Yosemite 10.10 mozilla-inbound opt test reftest' on b7ea61be4cad91d1e3d69e22d1c1e0a1b4bb1501
Jan 22 12:18:34 pulse-actions app/worker2.1: mozci INFO: a61bb2a7ffa3667eba17dbe8826faf956656dd8b
Jan 22 12:18:34 pulse-actions app/worker2.1: mozci INFO:
Jan 22 12:18:34 pulse-actions app/worker2.1: mozci INFO: === a61bb2a7ffa3667eba17dbe8826faf956656dd8b ===
Jan 22 12:18:34 pulse-actions app/worker2.1: mozci INFO: We want to have 1 job(s) of Rev5 MacOSX Yosemite 10.10 mozilla-inbound opt test reftest
Jan 22 12:18:36 pulse-actions app/worker2.1: mozci INFO: We have found 0 potential job(s) matching 'Rev5 MacOSX Yosemite 10.10 mozilla-inbound opt test reftest' on a61bb2a7ffa3667eba17dbe8826faf956656dd8b. We need to trigger more.
Jan 22 12:18:36 pulse-actions app/worker2.1: mozci INFO: ===> We want to trigger 'Rev5 MacOSX Yosemite 10.10 mozilla-inbound opt test reftest' on revision 'a61bb2a7ffa3667eba17dbe8826faf956656dd8b' a total of 1 time(s).
Jan 22 12:18:36 pulse-actions app/worker2.1: mozci INFO:
Jan 22 12:18:36 pulse-actions app/worker2.1: mozci ERROR: We didn't find a build job matching Rev5 MacOSX Yosemite 10.10 mozilla-inbound opt test reftest
Jan 22 12:18:36 pulse-actions app/worker2.1: Traceback (most recent call last):
Jan 22 12:18:36 pulse-actions app/worker2.1: File "pulse_actions/worker.py", line 75, in run_pulse
Jan 22 12:18:36 pulse-actions app/worker2.1: pulse.listen()
Jan 22 12:18:36 pulse-actions app/worker2.1: File "/app/.heroku/python/lib/python2.7/site-packages/mozillapulse/consumers.py", line 151, in listen
Jan 22 12:18:36 pulse-actions app/worker2.1: self._drain_events_loop()
Jan 22 12:18:36 pulse-actions app/worker2.1: File "/app/.heroku/python/lib/python2.7/site-packages/mozillapulse/consumers.py", line 198, in _drain_events_loop
Jan 22 12:18:36 pulse-actions app/worker2.1: self.connection.drain_events(timeout=self.timeout)
Jan 22 12:18:36 pulse-actions app/worker2.1: File "/app/.heroku/python/lib/python2.7/site-packages/kombu/connection.py", line 275, in drain_events
Jan 22 12:18:36 pulse-actions app/worker2.1: return self.transport.drain_events(self.connection, **kwargs)
Jan 22 12:18:36 pulse-actions app/worker2.1: File "/app/.heroku/python/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 95, in drain_events
Jan 22 12:18:36 pulse-actions app/worker2.1: return connection.drain_events(**kwargs)
Jan 22 12:18:36 pulse-actions app/worker2.1: File "/app/.heroku/python/lib/python2.7/site-packages/amqp/connection.py", line 326, in drain_events
Jan 22 12:18:36 pulse-actions app/worker2.1: return amqp_method(channel, args, content)
Jan 22 12:18:36 pulse-actions app/worker2.1: File "/app/.heroku/python/lib/python2.7/site-packages/amqp/channel.py", line 1909, in _basic_deliver
Jan 22 12:18:36 pulse-actions app/worker2.1: fun(msg)
Jan 22 12:18:36 pulse-actions app/worker2.1: File "/app/.heroku/python/lib/python2.7/site-packages/kombu/messaging.py", line 598, in _receive_callback
Jan 22 12:18:36 pulse-actions app/worker2.1: return on_m(message) if on_m else self.receive(decoded, message)
Jan 22 12:18:36 pulse-actions app/worker2.1: File "/app/.heroku/python/lib/python2.7/site-packages/kombu/messaging.py", line 564, in receive
Jan 22 12:18:36 pulse-actions app/worker2.1: [callback(body, message) for callback in callbacks]
Jan 22 12:18:36 pulse-actions app/worker2.1: File "pulse_actions/worker.py", line 66, in handler_with_dry_run
Jan 22 12:18:36 pulse-actions app/worker2.1: return event_handler(data, message, dry_run)
Jan 22 12:18:36 pulse-actions app/worker2.1: File "/app/pulse_actions/handlers/route_functions.py", line 13, in route
Jan 22 12:18:36 pulse-actions app/worker2.1: treeherder_buildbot.on_buildbot_event(data, message, dry_run)
Jan 22 12:18:36 pulse-actions app/worker2.1: File "/app/pulse_actions/handlers/treeherder_buildbot.py", line 74, in on_buildbot_event
Jan 22 12:18:36 pulse-actions app/worker2.1: dry_run=dry_run
Jan 22 12:18:36 pulse-actions app/worker2.1: File "/app/.heroku/python/lib/python2.7/site-packages/mozci/mozci.py", line 582, in manual_backfill
Jan 22 12:18:36 pulse-actions app/worker2.1: 'builders': [buildername]}
Jan 22 12:18:36 pulse-actions app/worker2.1: File "/app/.heroku/python/lib/python2.7/site-packages/mozci/mozci.py", line 502, in trigger_range
Jan 22 12:18:36 pulse-actions app/worker2.1: trigger_build_if_missing=trigger_build_if_missing)
Jan 22 12:18:36 pulse-actions app/worker2.1: File "/app/.heroku/python/lib/python2.7/site-packages/mozci/mozci.py", line 389, in trigger_job
Jan 22 12:18:36 pulse-actions app/worker2.1: will_use_buildapi=True
Jan 22 12:18:36 pulse-actions app/worker2.1: File "/app/.heroku/python/lib/python2.7/site-packages/mozci/mozci.py", line 129, in determine_trigger_objective
Jan 22 12:18:36 pulse-actions app/worker2.1: build_buildername = determine_upstream_builder(buildername)
Jan 22 12:18:36 pulse-actions app/worker2.1: File "/app/.heroku/python/lib/python2.7/site-packages/mozci/platforms.py", line 185, in determine_upstream_builder
Jan 22 12:18:36 pulse-actions app/worker2.1: raise MozciError("No build job matching %s found." % buildername)
Jan 22 12:18:36 pulse-actions app/worker2.1: MozciError: No build job matching Rev5 MacOSX Yosemite 10.10 mozilla-inbound opt test reftest found.
[2] https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&tochange=114a806647a2&filter-searchStr=rev5%20macosx%20yosemite%20opt%20reftest&fromchange=a61bb2a7ffa3&selectedJob=20309185
[3] http://hg.mozilla.org/build/buildbot-configs/rev/d6c16fba68e6
Comment 1 (Reporter) • 9 years ago
emorley, I'm working around this. I don't think I will have to filter too many builders as time goes on; however, I believe we should invest some time to at least find where the issue originates.
I could even ask volunteers to help us with this if I had some pointers.
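For anyone picking this up, here is a minimal sketch of the kind of rename workaround mentioned in the description. It is not the actual pulse_actions change: the `STALE_BUILDERNAMES` dict and `normalize_buildername` function are hypothetical names, and the only mapping entry is the Rev5/Rev7 pair reported in this bug.

```python
# Hypothetical mapping of stale buildernames to their current names.
# The single entry is the pair reported in this bug; the names of the
# dict and the function are invented for illustration.
STALE_BUILDERNAMES = {
    "Rev5 MacOSX Yosemite 10.10 mozilla-inbound opt test reftest":
        "Rev7 MacOSX Yosemite 10.10.5 mozilla-inbound opt test reftest",
}


def normalize_buildername(buildername, renames=STALE_BUILDERNAMES):
    """Return the current name for a stale buildername, else the input."""
    return renames.get(buildername, buildername)
```

A map like this has to be maintained by hand for every renamed builder, which is why it only makes sense as a stopgap.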
Comment 2 • 9 years ago
I don't really have any idea of why this might be occurring; I only have a rudimentary understanding of how fetchallthethings works.
If there's a specific DB query I can run to help debug this I'm happy to help, or failing that maybe stale data in the reference data signatures table (more Cameron's expertise) might be to blame?
Comment 3 (Reporter) • 9 years ago
camd: can you see 'Rev7 MacOSX Yosemite 10.10.5 mozilla-inbound opt test reftest' in runnable jobs?
It doesn't show up for me.
I hope this helps:
* Click on link https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=67c37b7d8cd2&filter-searchStr=rev5%20macosx%20yosemite%20opt%20reftest&selectedJob=20445873
* The job's buildername is 'Rev7 MacOSX Yosemite 10.10.5 mozilla-inbound opt test reftest'
* Load https://treeherder.mozilla.org/api/project/try/runnable_jobs/
* Search for 'Rev7 MacOSX Yosemite 10.10.5 mozilla-inbound opt test reftest'; no matches
* Load https://secure.pub.build.mozilla.org/builddata/reports/allthethings.json
* Search for 'Rev7 MacOSX Yosemite 10.10.5 mozilla-inbound opt test reftest'; it matches
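The "search for the buildername" steps above can be scripted. This is a rough sketch that makes no assumption about the layout of either JSON document: it walks whatever structure `json.load` returns and looks for the string in any key or value. The function name `json_mentions` and the sample dicts are made up for illustration; fetching the two URLs is left out.

```python
def json_mentions(data, needle):
    """Return True if `needle` occurs in any key or string value of `data`."""
    if isinstance(data, dict):
        return any(
            json_mentions(key, needle) or json_mentions(value, needle)
            for key, value in data.items()
        )
    if isinstance(data, (list, tuple)):
        return any(json_mentions(item, needle) for item in data)
    return isinstance(data, str) and needle in data


NAME = "Rev7 MacOSX Yosemite 10.10.5 mozilla-inbound opt test reftest"

# Made-up stand-ins for the two parsed JSON documents.
sample_with_name = {"anything": {NAME: {"some": "properties"}}}
sample_without_name = {"anything": [{"some_field": "some other builder"}]}

found = json_mentions(sample_with_name, NAME)
missing = json_mentions(sample_without_name, NAME)
```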
Flags: needinfo?(cdawson)
Comment 4 (Assignee) • 9 years ago
Armen, thanks for the STR. I hate to say that I can't reproduce... but... :)
When I look at those two URLs (/runnable_jobs/ and /allthethings.json) I don't see the buildername in EITHER one.
The query in Treeherder that populates the /runnable_jobs/ endpoint is only run once per day, so that lag may account for it. Perhaps we need to run it more frequently? Once per hour? Every 15 mins? I'm not sure how expensive the query is, tbh.
Flags: needinfo?(cdawson)
Comment 5 (Reporter) • 9 years ago
My apologies.
I didn't realize that "try" was in the URL.
This URL instead:
https://treeherder.mozilla.org/api/project/mozilla-inbound/runnable_jobs/
I see now "Rev7 MacOSX Yosemite 10.10.5 mozilla-inbound opt test reftest" on runnable. Good.
I don't see the old one "Rev5 MacOSX Yosemite 10.10 mozilla-inbound opt test reftest". Good.
However, I should not be able to filter with the Rev5 builder:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=0e9213d8a0f8&filter-searchStr=Rev5%20MacOSX%20Yosemite%2010.10%20mozilla-inbound%20opt%20test%20reftest&selectedJob=20658023
If you look at the "Job details" pane for the selected job you will see _Rev7_ rather than _Rev5_.
Comment 6 (Assignee) • 9 years ago
Interesting: for this job (and a few others I looked at in the same filter), there is some conflicting data:
https://treeherder.mozilla.org/api/project/mozilla-inbound/jobs/20658023/
The "ref_data_name" has "Rev5"
However, the buildapi artifact:
https://treeherder.mozilla.org/api/project/mozilla-inbound/artifact/?job_id=20658023&name=buildapi&type=json
The buildername says "Rev7"
The filter will go off the "ref_data_name" field. But those two values SHOULD be coming from the same place. So, yeah, something is awry alright! :) Looking into it now...
Assignee: nobody → cdawson
Comment 7 (Assignee) • 9 years ago
for reference:
job: "ref_data_name": "Rev5 MacOSX Yosemite 10.10 mozilla-inbound opt test reftest"
artifact: "buildername": "Rev7 MacOSX Yosemite 10.10.5 mozilla-inbound opt test reftest"
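A small sketch of the comparison above, in case someone wants to check more jobs for the same mismatch. It assumes the two API responses have already been fetched and reduced to dicts that carry the "ref_data_name" and "buildername" keys directly; the real responses may nest these differently, and the function name `find_name_mismatch` is hypothetical.

```python
def find_name_mismatch(job, artifact):
    """Return (ref_data_name, buildername) if they differ, else None."""
    ref_name = job.get("ref_data_name")
    buildername = artifact.get("buildername")
    if ref_name != buildername:
        return ref_name, buildername
    return None


# The two values quoted above for job 20658023.
job = {"ref_data_name": "Rev5 MacOSX Yosemite 10.10 mozilla-inbound opt test reftest"}
artifact = {"buildername": "Rev7 MacOSX Yosemite 10.10.5 mozilla-inbound opt test reftest"}

mismatch = find_name_mismatch(job, artifact)
```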
Comment 8 (Assignee) • 9 years ago
I must admit, this is stumping me so far. The code in buildapi.py for ingestion looks good to me. n-i'ing Ed to see if he has any ideas.
The only thing I noticed was that build4hr and pending.js didn't have the Rev5, but running.js DID. Running.js actually had both. I can't tell if that means anything yet; I just wanted to mention it. Those files are constantly changing, so it may mean nothing and just be timing related.
Comment 9 (Assignee) • 9 years ago
Ed, you recently did the blacklisting work with buildernames. Does this issue ring a bell?
Flags: needinfo?(emorley)
Comment 10 (Assignee) • 9 years ago
Oh! So we have 3 records in the ``reference_data_signatures`` table with signature: 2daddeb0761ddf09c02269f244465574018baa3e
One is Rev7 and 2 are Rev5
We are not distinguishing the Rev7 from the Rev5 in the signatures table. All the other fields are identical. So in the DB join, it's just picking the first buildername that matches the signature, which is a Rev5 one.
In fact, doing this query shows we have several cases of this:
SELECT COUNT(id), signature FROM treeherder.reference_data_signatures
GROUP BY signature;
Perhaps we need to expand the platform to be more specific? Not just 10.10, but 10.10.5, etc.?
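To narrow the grouped count down to only the duplicated signatures, a `HAVING` clause helps. Here is a self-contained sketch against an in-memory SQLite table. The three-column schema is a simplification assumed for illustration (the real table has more columns), the exact names of the two Rev5 rows are assumed, and the fourth row with its signature is made up.

```python
import sqlite3

SIG = "2daddeb0761ddf09c02269f244465574018baa3e"  # signature quoted above
OTHER_SIG = "f" * 40  # made-up signature for a non-duplicated row

conn = sqlite3.connect(":memory:")
# Simplified, assumed schema: only the columns needed for the demo.
conn.execute(
    "CREATE TABLE reference_data_signatures ("
    "id INTEGER PRIMARY KEY, name TEXT, signature TEXT)"
)
conn.executemany(
    "INSERT INTO reference_data_signatures (name, signature) VALUES (?, ?)",
    [
        ("Rev5 MacOSX Yosemite 10.10 mozilla-inbound opt test reftest", SIG),
        ("Rev5 MacOSX Yosemite 10.10 mozilla-inbound opt test reftest", SIG),
        ("Rev7 MacOSX Yosemite 10.10.5 mozilla-inbound opt test reftest", SIG),
        ("hypothetical unrelated builder", OTHER_SIG),
    ],
)

# Keep only signatures that map to more than one row.
duplicates = conn.execute(
    "SELECT signature, COUNT(id) FROM reference_data_signatures "
    "GROUP BY signature HAVING COUNT(id) > 1"
).fetchall()
```

Any signature that appears in `duplicates` can resolve to the wrong buildername on retrieval.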
Comment 11 (Assignee) • 9 years ago
So, just to be clear, I think data ingestion went correctly, as planned. But it's on the data retrieval that we're getting the wrong signature entry, and therefore the wrong buildername.
Comment 12 (Assignee) • 9 years ago
Goodness, here's the most egregious case of this:
SELECT * FROM treeherder.reference_data_signatures
where signature="56a5265caddace3ec6217b07162b7ebafe795b42"
Comment 13 (Reporter) • 9 years ago
As for sometimes seeing Rev5 in running jobs: that is because we switched Rev5 for Rev7 on trunk trees first.
Then, as the trains ride, release repositories will also see the switch of names.
I wouldn't pay too much attention to anything that has to do with cedar, as I assume it is the repository where most experimenting happens wrt jobs being added or renamed. However, that is disturbing! :)
Thanks camd for poking at this!
Updated • 9 years ago
Flags: needinfo?(emorley)
Comment 14 (Assignee) • 9 years ago
I'm not totally sure how to fix this, at the moment, but I'll begin working on it next week. I am out from Wednesday to Friday this week.
Priority: -- → P2
Comment 15 (Assignee) • 9 years ago
I think the solution here is just that we need to make fields like ``build_platform`` or ``machine_platform`` more specific. At this point, we lump any "10.10.X" into just "10.10". Since the buildernames differ but we calculate the signature based on "10.10", we get duplicate signatures. If we stored the values as 10.10.5, 10.10.3, etc, we'd get unique signatures like we should.
These signatures are really intended to be unique, so it's a bit of a bug that we've allowed this, I think.
So we need to store more info, but we don't want to increase the number of platforms shown in the UI. All 10.10.x should be lumped together.
I can investigate:
1. storing a new field with a more exact value that the UI doesn't use, but that we use to calculate the signature
2. storing the more exact value in the build_platform field, but aggregating it in the api call
3. like 2, but aggregate in the UI
4. something cleverer than any of those... :)
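To make the collision concrete, here is a toy sketch. It assumes the signature is a SHA-1 over the concatenated field values, which matches the 40-character hex signatures seen in this bug but is not copied from the Treeherder code. The field lists and the `coarse_platform` helper are invented for illustration.

```python
import hashlib


def signature(fields):
    """SHA-1 hex digest over the stringified field values (assumed scheme)."""
    joined = "".join(str(field) for field in fields)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def coarse_platform(os_version):
    """Lump '10.10.5' into '10.10', keeping only major.minor."""
    return ".".join(os_version.split(".")[:2])


# Invented field lists; only the OS version differs between the builders.
rev5_coarse = ["osx", coarse_platform("10.10"), "opt", "reftest"]
rev7_coarse = ["osx", coarse_platform("10.10.5"), "opt", "reftest"]
rev5_exact = ["osx", "10.10", "opt", "reftest"]
rev7_exact = ["osx", "10.10.5", "opt", "reftest"]

collides = signature(rev5_coarse) == signature(rev7_coarse)
distinct = signature(rev5_exact) != signature(rev7_exact)
```

With the lumped platform value both builders hash to the same signature; with the exact value they do not.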
Comment 16 (Assignee) • 9 years ago
I had a thought on a simpler solution. But I'd love feedback from Mauro and Ed. I can't SEE a real downside to it myself at the moment...
What if I included the buildername (if present) in the list of values used to generate the signature hash?
This would have a few implications:
1. Guarantee that we have unique signatures for every unique buildername.
2. The individual fields used to generate the signature are a little pointless wrt uniqueness in the Buildbot case, but still have meaning in the Task Cluster case.
3. Jobs with identical field values from Task Cluster and Buildbot would never have the same signature. But I don't think this matters: jobs of a specific type are EITHER run by Buildbot OR Task Cluster, not both in different circumstances.
4. It would have no effect on backfilling or retriggering with Task Cluster. It wouldn't add the buildername value for those jobs since it won't exist.
This would be easy to remove when Buildbot goes away. It would really just become a no-op.
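Roughly, the idea looks like the sketch below. It is an illustration, not the Treeherder implementation: the SHA-1-over-concatenated-values scheme, the function name `job_signature`, and the field list are all assumptions made for the example.

```python
import hashlib


def job_signature(fields, buildername=None):
    """Hash the field values, plus the buildername when one is present."""
    values = [str(field) for field in fields]
    if buildername:
        # Buildbot jobs carry a buildername; Task Cluster jobs do not,
        # so their hash input (and signature) stays exactly as before.
        values.append(buildername)
    return hashlib.sha1("".join(values).encode("utf-8")).hexdigest()


fields = ["osx", "10.10", "opt", "reftest"]  # invented field values
rev5 = job_signature(fields, "Rev5 MacOSX Yosemite 10.10 mozilla-inbound opt test reftest")
rev7 = job_signature(fields, "Rev7 MacOSX Yosemite 10.10.5 mozilla-inbound opt test reftest")
task_cluster = job_signature(fields)
```

Identical field values now give different signatures for the two buildernames, and a job without a buildername hashes the same way it did before the change.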
Flags: needinfo?(mdoglio)
Flags: needinfo?(emorley)
Comment 17 • 9 years ago
Comment 18 (Assignee) • 9 years ago
Comment on attachment 8735619 [details] [review]
[treeherder] mozilla:bug1242038 > mozilla:master
I figured a PR on this simple fix may be more illustrative. :)
Flags: needinfo?(emorley)
Attachment #8735619 - Flags: feedback?(emorley)
Updated (Assignee) • 9 years ago
Flags: needinfo?(mdoglio)
Attachment #8735619 - Flags: feedback?(mdoglio)
Comment 19 • 9 years ago
Comment on attachment 8735619 [details] [review]
[treeherder] mozilla:bug1242038 > mozilla:master
Sorry for the delay, catching up from DjangoCon.
This looks reasonable enough to me - though I don't have a full understanding of the reference data signatures, so may have missed something :-)
Attachment #8735619 - Flags: feedback?(emorley) → feedback+
Comment 20 • 9 years ago
Comment on attachment 8735619 [details] [review]
[treeherder] mozilla:bug1242038 > mozilla:master
I can't think of any real issue this change could bring. In theory if somewhere in the codebase we used the signature as a grouper/filter, and we expected the result to include both buildbot and non-buildbot jobs together, in that case we would have a problem. But I don't think we have such a case, so f+
Attachment #8735619 - Flags: feedback?(mdoglio) → feedback+
Updated (Assignee) • 9 years ago
Attachment #8735619 - Flags: review?(wlachance)
Attachment #8735619 - Flags: feedback+
Updated • 9 years ago
Attachment #8735619 - Flags: review?(wlachance) → review+
Comment 21 • 9 years ago
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/3b5f39b787bdfc5160f4ae5eebaf4770cd3c4101
Bug 1242038 - Store unique job signatures for buildernames
If our regexes give the same platform values for several different
buildernames, then we may find the wrong signature and buildername for
those jobs. This means that we will send the wrong buildername when
attempting to backfill or retrigger jobs.
This change ensures that the reference_data_signature is unique for
every buildername we get from buildbot.
Comment 22 (Assignee) • 9 years ago
I talked with Ed and James in our meeting yesterday, and we discussed just having our job query get the latest reference_data_signature record in the table with a matching signature.
The problem is that we are actually getting dups in that table that are all current. (I got 4 of one signature after ingesting locally for 1 hour). Also, adding a query to get the ``MAX(id)`` involved a sub-query on the join and was just too slow.
I did it like this:
LEFT JOIN (
SELECT MAX(id) maxid, signature from `REP0`.reference_data_signatures
GROUP BY signature
) rdsmax
ON j.signature = rdsmax.signature
LEFT JOIN `REP0`.reference_data_signatures rds
ON rdsmax.maxid = rds.id
for the ``get_job_list`` procedure in jobs.json. Perhaps my approach was flawed, but I think this was the only way to get that.
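For illustration, the same join shape can be tried against an in-memory SQLite database. This is a toy sketch: the two-table schema is a simplified assumption, the `job` table name and the insertion order of the three signature rows are made up, and SQLite says nothing about how the query performs on the real MySQL tables.

```python
import sqlite3

SIG = "2daddeb0761ddf09c02269f244465574018baa3e"
REV5 = "Rev5 MacOSX Yosemite 10.10 mozilla-inbound opt test reftest"
REV7 = "Rev7 MacOSX Yosemite 10.10.5 mozilla-inbound opt test reftest"

conn = sqlite3.connect(":memory:")
# Simplified, assumed schema with only the columns the join needs.
conn.executescript(
    """
    CREATE TABLE reference_data_signatures (
        id INTEGER PRIMARY KEY, name TEXT, signature TEXT);
    CREATE TABLE job (id INTEGER PRIMARY KEY, signature TEXT);
    """
)
conn.executemany(
    "INSERT INTO reference_data_signatures (name, signature) VALUES (?, ?)",
    [(REV5, SIG), (REV5, SIG), (REV7, SIG)],  # assumed insertion order
)
conn.execute("INSERT INTO job (id, signature) VALUES (?, ?)", (20658023, SIG))

# Resolve each job to the signature row with the highest id.
rows = conn.execute(
    """
    SELECT j.id, rds.name
    FROM job j
    LEFT JOIN (
        SELECT MAX(id) AS maxid, signature
        FROM reference_data_signatures
        GROUP BY signature
    ) rdsmax ON j.signature = rdsmax.signature
    LEFT JOIN reference_data_signatures rds ON rdsmax.maxid = rds.id
    """
).fetchall()
```

Every job sharing the signature resolves to whichever row was inserted last, so this only gives the right buildername when the newest row happens to be the right one for all of those jobs.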
Anyway, I talked with wlach about my original PR and he was good with it. I think it's the best way to guarantee unique signatures for different buildernames.
Bug 1262938 is a follow-up to create a unique index on the signature field for that table to ensure our assumption of uniqueness is guaranteed.
Comment 23 (Assignee) • 9 years ago
This is fixed now, but only for NEWLY ingested jobs. Older jobs may still get the wrong buildername till that data expires.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Comment 24 (Reporter) • 9 years ago
Thank you gentlemen!