Closed Bug 591688 Opened 14 years ago Closed 7 years ago

Push to try should validate |try:| parameters

Categories

(Firefox Build System :: Task Configuration, task, P3)

task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: justin.lebar+bug, Assigned: bstack)

References

Details

(Keywords: trychooser, Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2076] [tryserver][capacity])

Attachments

(2 files)

Khuey posted to the dev.platform:

> can we just error in the hook if you lack the proper syntax?

Indeed, can we?  I just pushed with a minor syntax error (--build opt instead of --build o) and lost a few hours after I discovered that I was getting the wrong builds (bug 591686).
It would also be nice if in a post-commit hook it told you what was being run so you knew you got it right.
The parsing isn't done in a hook for mercurial, it's done by buildbot when it gets the list of changes from mercurial. That said we could probably error on bogus syntax and email rather than start building.
(In reply to comment #2)
> The parsing isn't done in a hook for mercurial, it's done by buildbot when it
> gets the list of changes from mercurial.

But this doesn't explain why it shouldn't be done in an hg hook.

> That said we could probably error on
> bogus syntax and email rather than start building.

This would of course be better than what we currently have.  But instant "you screwed up" feedback is a lot better than an e-mail, which I may or may not see for a while.

Additionally, doesn't try already send enough e-mails?  :)
Getting hg commit hook validating input would be a good thing here. I'll take this bug and look into how to make that happen.
Assignee: nobody → lsblakk
Priority: -- → P2
Priority: P2 → P3
I'm merging this with bug 594236 which has the beginnings of an hg extension in it and will better track progress on getting something happening with push to try validation and/or syntax prompting
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → DUPLICATE
I'd like us to reconsider this bug.

Call me incompetent, but I often push to try with invalid parameters.  Today I did it four separate times, because I had a variety of patches in my queue with try syntax in various states of incorrectness.

This bug is not the same as bug 594236.  An hg extension would not have solved my problems here, because I'm pushing to try via a git extension.

But more generally, having an hg extension which prompts you for try syntax is not the same thing as having an hg hook which rejects pushes with invalid trychooser syntax.  We already have a hook that checks for "try:".  Can we just do the right thing here and extend that?

If we wanted to see whether this work would be justified, we could look through the try pushes from the past few months and see what fraction of pushes had invalid syntax.  I'd be willing to bet I'm not the only bozo who can't get it right.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
passing the torch here since I'm not actively working on this and am not longer in RelEng.
Assignee: lsblakk → nobody
QA Contact: hwine
(This comment was posted in bug #693565 but I forgot to finish.)

If the try: push command specified an invalid parameter, would it be possible/
feasible to do the command-line syntax and if something is wrong with the
syntax (i.e. invalid platform), it doesn't do anything but send an email to
the 'pusher' that the syntax was invalid.  

Or should it continue with the try even with the invalid syntax? i.e.
try: -b od -p mall 
(since the platform is invalid, it shouldn't be building at all, but it
doesn't (afaik) tell the pusher there's something wrong with the syntax.)
I think we should be avoiding unnecessary work whenever possible. If the chooser syntax is wrong, we should just fail-n-mail rather than trying to guess what the developer actually wanted.
Component: Release Engineering → Release Engineering: Automation (General)
QA Contact: hwine → catlee
Whiteboard: [tryserver][capacity]
I agree about failing hard and fast, but please don't rely on tryserver e-mail to notify of the failure.  Everyone filters them.
(In reply to Justin Lebar [:jlebar] from comment #11)
> Everyone filters them.

I love how we're our own internal example of how absolutely vital it is to choose the right default the very first time. The default has been failures-only since June of 2011, and has been no-email since May of 2012, and with a fairly large margin of error you're still right that "everyone filters them."
(In reply to Phil Ringnalda (:philor) from comment #12) 
> I love how we're our own internal example of how absolutely vital it is to
> choose the right default the very first time. The default has been
> failures-only since June of 2011, and has been no-email since May of 2012,
> and with a fairly large margin of error you're still right that "everyone
> filters them."

Sigh, yes.

I hope you can understand my frustration though: it seems like people are unwilling to check their trychooser syntax before pushing, and then unwilling to monitor the email that would tell them they made a mistake.

(In reply to Nick Thomas [:nthomas] from comment #2)
> The parsing isn't done in a hook for mercurial, it's done by buildbot when
> it gets the list of changes from mercurial. That said we could probably
> error on bogus syntax and email rather than start building.

I actually favor a pre-commit hook to avoid unnecessary churn everywhere. Ideally, this hook would feed from a data source of allowable options that could also feed a similar system in git, buildbot, trychooser webpage, etc.
(In reply to Chris Cooper [:coop] from comment #13)
> (In reply to Phil Ringnalda (:philor) from comment #12) 
> > I love how we're our own internal example of how absolutely vital it is to
> > choose the right default the very first time. The default has been
> > failures-only since June of 2011, and has been no-email since May of 2012,
> > and with a fairly large margin of error you're still right that "everyone
> > filters them."
> 
> Sigh, yes.
> 
> I hope you can understand my frustration though: it seems like people are
> unwilling to check their trychooser syntax before pushing, and then
> unwilling to monitor the email that would tell them they made a mistake.
> 
> (In reply to Nick Thomas [:nthomas] from comment #2)
> > The parsing isn't done in a hook for mercurial, it's done by buildbot when
> > it gets the list of changes from mercurial. That said we could probably
> > error on bogus syntax and email rather than start building.
> 
> I actually favor a pre-commit hook to avoid unnecessary churn everywhere.
> Ideally, this hook would feed from a data source of allowable options that
> could also feed a similar system in git, buildbot, trychooser webpage, etc.

Pardon my ignorance (as I'm both unfamiliar with hooks and data sources),
I'd like to try (pun not intended) to fix this (unless someone has an
objection to this, or that I don't have the prior qualifications in fixing
this).  My initial thought was fixing it in try_parser, but I know little
to nothing about it (and I don't -atm- see where I can get the e-mail from
the pusher since the try_parser has just message, buildernames, etc as
parameters).  Then on IRC, nthomas pointed out that it could be done
via the try_mandatory hook (pretxnchangegroup hook) but reading comment #2,
I'm now utterly confused. (Not nthomas' fault.  My knowledge in these matters
is non-existent at best.)
ewong, the current situation is that the reference implementation for parsing try: strings is in buildbot scheduler (try_parser and friends), and then we have the mercurial hook, the trychooser website, assorted extensions people have written. When a change is made to the builds and/or platforms that are run by default then all these things get out of sync (unless the person making the change is very diligent).

Coop is proposing that all those tools use some external data source to determine how to behave, so that a single change keeps everything up to date. This is a great idea, but maybe more than you intended to tackle here.
Keywords: trychooser
Status: REOPENED → NEW
Product: mozilla.org → Release Engineering
Whiteboard: [tryserver][capacity] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2067] [tryserver][capacity]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2067] [tryserver][capacity] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2076] [tryserver][capacity]
Component: General Automation → Mercurial: hg.mozilla.org
Product: Release Engineering → Developer Services
QA Contact: catlee
There is no formal definition of what constitutes valid try syntax. buildbot has a parser. Taskcluster / decision graph has a parser. The latter is the closest we have to a canonical parser.

Complicating that is that the syntax changes over time, which means it varies by revision pushed. That means having relatively static code on the server enforcing try syntax will only yield valid results for a subset of time.

Now, we could probably live with a "good enough" parser. But we don't want it to be too strict because then people tweaking try syntax via try pushes will get screened by the hook. Not good.

What I would prefer we do is have automation send a notification if the decision task fails. That failure should occur within 2 minutes of the push - sometimes as long as a few dozen seconds. The problem will be ensuring the user actually sees it. Taskcluster can send browser notifications. But I don't know how we encourage people to enable those or how difficult it would be to hook up "notify on Try push failure" to that system.

bstack?
Flags: needinfo?(bstack)
Assignee: nobody → bstack
Status: NEW → ASSIGNED
Flags: needinfo?(bstack)
I've made an attempt at satisfying the constraints we have here. The current patch will throw an exception if any platforms or build_types are passed in that don't exist in the full task graph in addition to if any unknown options are passed to "try:". I
think that this approach will work for the remaining sorts of arguments we want to validate (jobs, tests, etc) but I've stopped short of implementing them to see if my approach makes sense.

Exceptions thrown here will result in emails to the "owner" email address as decided by mozilla-taskcluster. I'm not sure by which method that it picks that field. I should figure that out before we use it for something like this. 

As it stands the emails are the generic ones produced by taskcluster notify, which have subjects like "Task failed: Gecko Decision Task - CrQU143TTUuF9op-wVs7CA" and the body contains links to the task itself in addition to the task metadata. An image of an email is viewable at https://imbstack.com/2016/09/09/taskcluster-notify.html. I assume that pretty much nobody has filtered out all email from taskcluster as of yet, so we shouldn't have much of an issue there. The email doesn't mention try and does not link to treeherder, but I figure it should be fairly obvious what just happened in 90% of cases. We can do the extra work to make the email more gecko-specific if we want, but that will be a good bit more work.


Examples:

try: -b od -p mall    :::::   https://tools.taskcluster.net/task-inspector/#CrQU143TTUuF9op-wVs7CA/0
try: FOO BAR BAZ      :::::   https://tools.taskcluster.net/task-inspector/#EL1inoRkSLGLkiCrMC-WXA/0
I would be happy if there was any feedback at all, even if I have to wait for it.
At least I could then adjust my future pushes.

It would also be extraordinarily helpful if I could get a list of jobs and machine names.
Alternatively, including on treeherder (or somewhere) the reverse mapping of <actual machine/job> to <try syntax>, so I could figure it out myself.
Jeff, what do you mean by "machine name" here?  We use spot instances in ec2 so they have such informative names as i-1098ab98e70f and only last a day or two.

:ahal is working on -- at least thinking about -- a way to make try drastically simpler which would likely address, or could be informed by, your requests.  See
  - https://ahal.ca/blog/2015/try-syntax/
  - https://ahal.ca/blog/2017/fuzzy-try-chooser/

It's all on a bit of a hold right now until we're not using Buildbot anymore (trying to make parallel changes in two unrelated systems is kind of ridiculously hard), although we can make nice incremental improvements like Brian has here.
"machine name" as in configuration name. I've wanted to run tests on specific versions of Android before an been unable to figure out how to ask for this.
Comment on attachment 8857751 [details]
Bug 591688 - Notify when decision task fails

https://reviewboard.mozilla.org/r/129718/#review132638

On today's list for "why we can't have nice things," the email address used to authenticate hg.mozilla.org SSH requests and stored in the pushlog and recorded in TaskCluster as the owner is not guaranteed to actually work or be the ideal email for the user. The email address is simply an LDAP account. And for various reasons, people have emails in their account name that don't work or don't go to the address for which they want a notification.

Solving this is a hard problem. And perfect is the enemy. So I think a possible email notification is better than no notification.

The approach in this patch seems reasonable to me as a first implementation. I defer to dustin for the code review.
Attachment #8857751 - Flags: review?(gps)
(having not looked at the patch yet...) another option is to just fail the decision task with a nice warning in the log.  that tends to get people's attention!

I'll look at the patch and comment some more.
Comment on attachment 8857751 [details]
Bug 591688 - Notify when decision task fails

https://reviewboard.mozilla.org/r/129718/#review132666

Ah, so this isn't adding a new notification -- it's just doing better validation in the decision task.  So it will show up in TH *and* generate an email, albiet potentially to an unmonitored email address.  Yay!

My concern is, I think that there has been an assumption that unused arguments don't cause errors, and later mozharness can look at those arguments and do its mozharness-y thing. I may be totally off-base with that. Or that may be correct, but not something we wish to continue.  Maybe those try arguments should be parsed by (but not acted on) the try parser. Then users will get early feedback if they typo even a mozharness command.

I think some of that comes from `./mach try` which has some way of communicating "please run this specific test file" to mozharness.  I'm not sure who to ask about that, though.  Anyway, if that's a non-issue then I'm happy to see this land.

::: taskcluster/taskgraph/try_option_syntax.py:346
(Diff revision 2)
> +        all_types = set(t.attributes['build_type']
> +                        for t in full_task_graph.tasks.itervalues()
> +                        if 'build_type' in t.attributes)
> +        if not all_types.issuperset(build_types):
> +            bad_types = ', '.join(set(build_types).difference(all_types))
> +            raise Exception("Unknown type(s) [{}] specified for try".format(bad_types))

"Unknown build type(s)"..
Attachment #8857751 - Flags: review?(dustin) → review+
jmaher, I'm told you might know the answer to dustin's question about unused arguments in comment 33?

For context the way I've written this now will make "try: FOO BAR BAZ" fail because those aren't known arguments to try syntax. Does it make sense to fail in that case or should they be allowed through?
Flags: needinfo?(jmaher)
I believe there are some tools out there that look at pulse and custom parse try syntax to run their own tools.  I know this happened in the past, I am not sure if it happens today.  It seems ok to go forward here and then let the complaints surface :)

One concern is that the try syntax isn't easy to figure out and the choices do not match what people want to run- so this will help and hurt at the same time.

on the topic of mozharness arguments, there is one specific action I know of:
https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/mozilla/testing/talos.py#175
Flags: needinfo?(jmaher)
Comment on attachment 8857751 [details]
Bug 591688 - Notify when decision task fails

https://reviewboard.mozilla.org/r/129718/#review133552

Aside from `set` usage (which isn't critical), this looks good.

Per IRL conversations, I'd like to see a follow-up to explore using web push notifications for the alert. TC's one-click loaner interface already supports web push notifications for telling me when the loaner is ready to use. While I'm not a fan of excessive warnings and interruptions, I think warning people of a dead-in-the-water try push with as little latency as possible is warranted because a) the user likely just did the try push and hasn't fully context switched away yet b) it is really frustrating to either poll Treeherder or find our dozens of minutes or hours later that something was busted.

::: taskcluster/taskgraph/try_option_syntax.py:366
(Diff revision 3)
> +                             for t in full_task_graph.tasks.itervalues()
> +                             if 'test_platform' in t.attributes)
> +        build_platforms = set(t.attributes['build_platform']
> +                              for t in full_task_graph.tasks.itervalues()
> +                              if 'build_platform' in t.attributes)
> +        all_platforms = test_platforms.union(build_platforms)

Nit: nobody really uses the methods of set instances to do set operations: they use the overloaded operators. While not relevant here, use of operators also increases the odds that things still work with different types that quack like sets but aren't actually `set` instances. So this should be `all_platforms = test_platforms | build_platforms`.

::: taskcluster/taskgraph/try_option_syntax.py:367
(Diff revision 3)
> +        if not all_platforms.issuperset(results):
> +            bad_platforms = ', '.join(set(results).difference(all_platforms))

`issuperset()` can be expressed in terms of `>` or `>=` and `difference()` can be expressed in terms of `-`. You typically just ignore the super set computation and do something like:

```
bad_platforms = set(results) - all_platforms
if bad_platforms:
    raise Exception(... % ', '.join(bad_platforms))
```
Attachment #8857751 - Flags: review?(gps) → review+
Comment on attachment 8857751 [details]
Bug 591688 - Notify when decision task fails

https://reviewboard.mozilla.org/r/129718/#review133572

The `set` code in this version is perfect. Thanks for fixing.
Keywords: checkin-needed
hg error in cmd: hg push -r tip ssh://hg.mozilla.org/integration/autoland: pushing to ssh://hg.mozilla.org/integration/autoland
searching for changes
remote: adding changesets
remote: adding manifests
remote: adding file changes
remote: added 1 changesets with 2 changes to 2 files
remote: 
remote: 
remote: ************************** ERROR ****************************
remote: Rev deeae87c6513 uses try syntax. (Did you mean to push to Try instead?)
remote: Brian Stack <bstack@mozilla.com>
remote: Bug 591688 - Notify when decision task fails r=dustin,gps
remote: 
remote: This uses the email provided by mozilla-taskcluster to find who to
remote: email about failed decision tasks. It also adds some validation of
remote: the try syntax that we've previously ignored.
remote: 
remote: Any platforms or build types specified in "try: " that don't exist
remote: in the full task graph will throw an error.
remote: 
remote: 
remote: MozReview-Commit-ID: JOKkLle7hEe
remote: *************************************************************
remote: 
remote: 
remote: transaction abort!
remote: rollback completed
remote: pretxnchangegroup.c_commitmessage hook failed
abort: push failed on remote
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/5748607f3014
Notify when decision task fails r=dustin,gps
Keywords: checkin-needed
Let's put this in a more appropriate component since it no longer has anything to do with hg.
Component: Mercurial: hg.mozilla.org → Task Configuration
Product: Developer Services → Taskcluster
Version: other → unspecified
Attached image task_failed_email.png
It works!

There are some obvious UX improvements. (If I didn't know what the "Gecko Decision Task" and "task graph" where I wouldn't know my Try push was basically dead in the water.) But this is a fantastic start.
I regularly push to try with '-b do -p foo -t none -u none' because that's the only way to voluntarily *not* trigger firefox builds, to focus on, for example, toolchain builds.
Hi Brian,

seems this caused https://treeherder.mozilla.org/logviewer.html#?job_id=92605371&repo=mozilla-central 

since this is tier 2 i do not back this out but we should fix this. Filed Bug 1357673 for this
Depends on: 1357673
dustin and I think possible the best way to fix glandium's issue is to make both hg hooks and the decision task accept "-p none". We'll work on that after bug 1357673 is resolved.
Actually, a workaround for glandium's issue that I've used in the past is to just specify "try: ". This results in just the bare minimum of jobs being run. All of the jobs in https://treeherder.mozilla.org/#/jobs?repo=try&revision=417915a3e29d859b194e2f491ed80bce061324a1 are from me changing python files that automatically trigger their tests. I expect if there was no diff, this would just be a decision task. I'll close this bug for now since the test failure is tracked in another bug and we have a workaround for glandium's issue. Please file another bug and assign to me if the workaround doesn't work.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago7 years ago
Resolution: --- → FIXED
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
Product: TaskCluster → Firefox Build System
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: