Closed
Bug 791385
Opened 13 years ago
Closed 10 years ago
Add an hg hook to the try server in order to prevent having multiple try jobs in flight by default
Categories
(Release Engineering :: General, defect, P4)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: ehsan.akhgari, Unassigned)
Details
(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2233] [tryserver][capacity][hg][hook])
Attachments
(1 file)
2.88 KB, text/plain
We are interested in doing this in order to cut down the infrastructure load caused by people pushing multiple patches to try without realizing that their previous jobs are still active.
Chris has written the hook. mconnor is supposed to write the announcement to dev.something.
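The attached hook is not reproduced in this bug, so the following is only a minimal sketch of the kind of check the summary describes: refuse a push when the pusher already has try jobs in flight, unless the push explicitly opts out. The function name, the override token string, and the mapping of users to in-flight job counts are all hypothetical; the wiring into an actual Mercurial hook and the lookup of running jobs are left out.

```python
# Hypothetical sketch only; this is NOT the hook attached to this bug.
# OVERRIDE_TOKEN and the shape of active_jobs_by_user are assumptions.
OVERRIDE_TOKEN = "--allow-parallel"


def should_reject_push(pusher, commit_message, active_jobs_by_user):
    """Return True when the push should be refused.

    pusher: identifier of the user pushing to try.
    commit_message: message of the tip commit being pushed.
    active_jobs_by_user: mapping of user -> number of try jobs still running.
    """
    # An explicit opt-out in the commit message always lets the push through.
    if OVERRIDE_TOKEN in commit_message:
        return False
    # Otherwise refuse only if this user still has jobs in flight.
    return active_jobs_by_user.get(pusher, 0) > 0
```

With this shape, a second push from the same user is refused by default and accepted once the commit message carries the opt-out token, which is the behaviour the summary asks for.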
Comment 1•13 years ago
Updated•13 years ago
Attachment #661385 -
Attachment mime type: text/x-python → text/plain
Comment 2•13 years ago
What problem are we solving here? I can see it as a clear solution to "we have 280 developers with try access, and 280 machines, so if you push twice you are taking someone else's machine and you need to realize it" but that doesn't strike me as being a problem that we have.
We certainly have problems like "we have a shocking number of people with try access who do not know that they can retrigger tests, who believe that the way to retrigger an intermittent failure is to push again," but this doesn't solve that, it just makes them wait, or qref and not wait.
Are we solving "people push, see the result from the first platform, realize they need to change foo, and push with foo changed without killing the first push"? This seems like a massive annoyance for the normal case ("I'm working on seven separate patches, three of which require Windows coverage, I pushed two of them three hours ago and I'm still waiting on Windows tests") in order to solve that. I don't watch Try very closely, what percentage of the load is from those cases?
| Reporter | ||
Comment 3•13 years ago
(In reply to comment #2)
> Are we solving "people push, see the result from the first platform, realize
> they need to change foo, and push with foo changed without killing the first
> push"? This seems like a massive annoyance for the normal case ("I'm working on
> seven separate patches, three of which require Windows coverage, I pushed two
> of them three hours ago and I'm still waiting on Windows tests") in order to
> solve that. I don't watch Try very closely, what percentage of the load is from
> those cases?
This is the problem that we're trying to solve here. For the people who use try the most, the case you described does not seem to be the normal case at all, after eyeballing the try load for a time (not that I have more concrete evidence than that, of course.)
Note that this will also serve to make people realize that there is a cost incurred on our shared infrastructure when they push something to the try server.
Comment 4•13 years ago
I don't think the normal case is "I'm constantly context switching between lots of different patches." I think the normal case I've observed is "My try run was unsuccessful, I should push a new version with fixes."
Ideally this would be an interactive hook, but hg makes that hard/impossible.
Comment 5•13 years ago
Are we going to alter the trychooser hg extension to totally defeat this hook by automatically repushing with the token, or are we going to make the extension prohibitively annoying to use if you are not wasteful of your time, and work on more than one patch at a time?
Comment 6•13 years ago
So the primary question, as with all bugs, is what is the outcome we want? My assertion is that we want to reduce unnecessary load on our infrastructure, where unnecessary is defined as "jobs that the developer does not actually need." The goal is not "create an arbitrary hoop to jump through" but "do the thing that makes sense for the situation."
In this case, where we are trying to raise/force awareness is "re-pushing the same patch queue with changes without killing the now-obsolete jobs" and not "pushing multiple separate patch queues at once." I've seen relatively little evidence of the latter, and more of the former, but that's not really data, or even relevant. If our solution impairs the latter, that's a cost we should seek to remove.
An alternative approach would be to have trychooser automatically generate a per-repo/branch/patchqueue token (which doesn't need to be universally unique, just per-user unique), and treating each user/token pair as a job key. If a user re-pushes with the same token (implying that they're building the same tree), we'd kill existing jobs with that key and start a new run. If specific platforms are specified in the second push, we should optimize and only kill jobs for those platforms (i.e. a re-push just for Mac would not kill previous Windows jobs).
This would mean devs would be able to run multiple unrelated runs with little effort, but we would catch the "new push obsoletes previous push" cases automatically and free up resources without relying on devs to think it through.
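The token idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Scheduler` class, its `push` method, and the platform names are hypothetical, and real cancellation would go through the build infrastructure rather than an in-memory dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Scheduler:
    # Hypothetical in-memory state: (user, token) -> {platform: push_id}
    # for jobs that are still running.
    running: dict = field(default_factory=dict)

    def push(self, user, token, push_id, platforms):
        """Record a push and return the (platform, push_id) jobs it obsoletes."""
        jobs = self.running.setdefault((user, token), {})
        # Only jobs under the same user/token pair AND on a platform named
        # in the new push are considered obsolete.
        obsolete = [(p, jobs[p]) for p in platforms if p in jobs]
        for p in platforms:
            jobs[p] = push_id
        return obsolete


s = Scheduler()
s.push("alice", "queue-a", 1, ["win", "mac"])  # first push: nothing to cancel
s.push("alice", "queue-a", 2, ["mac"])         # same token: only the Mac job from push 1 is obsolete
s.push("alice", "queue-b", 3, ["win"])         # different token: unrelated work, nothing cancelled
```

Keying on the user/token pair is what lets unrelated patch queues run side by side, while a re-push of the same queue replaces only the jobs it overlaps with.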
Comment 7•13 years ago
(In reply to Mike Connor [:mconnor] from comment #6)
> In this case, where we are trying to raise/force awareness is "re-pushing
> the same patch queue with changes without killing the now-obsolete jobs" and
> not "pushing multiple separate patch queues at once."
So why not just raise awareness, by producing a report on who's doing the former (the most) and contacting them?
Comment 8•13 years ago
This is just pointless annoyance for people who work on more than one patch at a time... Do you think the people who don't even know their LDAP password will bother trying to get it reset because of this hook, if they didn't before?
| Reporter | ||
Comment 9•13 years ago
(In reply to comment #8)
> This is just pointless annoyance for people who work on more than one patch at
> a time... Do you think the people who don't even know their LDAP password will
> bother trying to get it reset because of this hook, if they didn't before?
Hmm, I'm not sure what the LDAP password has to do with what's being suggested here!
Comment 10•13 years ago
You can't cancel builds without one
Comment 11•13 years ago
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) (offline September 29-30 NZ time, i.e. 28-29 US time) from comment #7)
> (In reply to Mike Connor [:mconnor] from comment #6)
> > In this case, where we are trying to raise/force awareness is "re-pushing
> > the same patch queue with changes without killing the now-obsolete jobs" and
> > not "pushing multiple separate patch queues at once."
>
> So why not just raise awareness, by producing a report on who's doing the
> former (the most) and contacting them?
People have been trying to raise awareness for years, with limited success. It's not going to be a one-off process (new people always starting) or usefully effective (usage follows a long tail pattern, and I don't think worst offenders are necessarily the lion's share of unnecessary load). We can/will do this (catlee has a report) but I think that automagically handling things for people feels like a win.
Comment 12•13 years ago
(In reply to Ehsan Akhgari [:ehsan] from comment #0)
> We are interested in doing this in order to cut down the infrastructure load
> caused by people pushing multiple patches to try without realizing that
> their previous jobs are still active.
If the assumption is that people are doing this without realising perhaps a better solution would be to dispatch an email notification.
| Reporter | ||
Comment 13•13 years ago
(In reply to comment #12)
> (In reply to Ehsan Akhgari [:ehsan] from comment #0)
> > We are interested in doing this in order to cut down the infrastructure load
> > caused by people pushing multiple patches to try without realizing that
> > their previous jobs are still active.
>
> If the assumption is that people are doing this without realising perhaps a
> better solution would be to dispatch an email notification.
People already get emails from try. Clearly that is not working!
Comment 14•13 years ago
Actually, they don't get emails from try.
We changed the default to no email, which was a mistake that results in total failure pushes continuing to run until everything has failed, because "everyone" had already filtered all email from try to the trash. So step 1 in any solution that involves email is to send it from a new address.
Updated•13 years ago
Assignee: catlee → nobody
Priority: -- → P4
Comment 15•13 years ago
Any resolution here?
Do we need to create/promote/enforce a set of try best practices?
Component: Release Engineering → Release Engineering: Automation (General)
OS: Mac OS X → All
QA Contact: catlee
Hardware: x86 → All
Whiteboard: [tryserver][capacity][hg][hook]
Comment 16•13 years ago
My impression from a brief conversation with dumitru is that what I suspected was the case (from people sometimes asking for someone to cancel or retrigger jobs for them because they didn't think they had an ldap account) is indeed the case: we never made it clear to IT that whenever someone gets access to push and has an ldap account created for them because of that, they do need to be told that it was created and need to be told the password, because using try calls for using self-serve and thus requires using an ldap password.
Those careless people who aren't cancelling their try jobs when they've obviously gone bad, and aren't cancelling one job when they push a revised patch? They don't know that they can cancel, and they don't know that they should be filing a bug asking to be told their password.
| Assignee | ||
Updated•12 years ago
Product: mozilla.org → Release Engineering
Updated•11 years ago
Whiteboard: [tryserver][capacity][hg][hook] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2219] [tryserver][capacity][hg][hook]
Updated•11 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2219] [tryserver][capacity][hg][hook] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2229] [tryserver][capacity][hg][hook]
Updated•11 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2229] [tryserver][capacity][hg][hook] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2233] [tryserver][capacity][hg][hook]
Comment 17•10 years ago
I think we're way past the point of this being viable.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
| Assignee | ||
Updated•8 years ago
Component: General Automation → General