Closed Bug 691459 Opened 13 years ago Closed 9 years ago

Should have monitoring of how long it takes to push to try.

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: justin.lebar+bug, Assigned: hwine)

References

(Blocks 1 open bug)

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1355] )

Attachments

(3 files, 1 obsolete file)

Pushing tip of trunk to try often takes a long time for me.  Today it took more than 15 minutes.

I think the reason this problem continues to resurface is that we currently don't monitor how long it takes to push to try.  The feedback loop consists solely of people complaining in IRC.

There are a number of ways we could monitor this.  We could have a process which makes a nop push to try every so often and times how long it takes.  Or we could have pre- and post-commit hooks which detect how long each push took.  (The pre-commit hooks seem to be run immediately, since my pushes are generally rejected without delay for not having trychooser syntax, but of course this would need to be verified.)
Assignee: server-ops → dustin
Hm, I don't think hooks will do the trick here:

dustin@Lorentz ~/tmp/src-repo $ hg push ssh://hg.mozilla.org/try
pushing to ssh://hg.mozilla.org/try
remote: Warning: Permanently added the RSA host key for IP address '63.245.208.189' to the list of known hosts.
searching for changes
<--- hangs here, before the try_mandatory pretxnchangegroup run (which would have failed)
abort: repository is unrelated

This *does*, however, suggest that a push with a bogus commit message will fail in any case, but give a reasonable indication of push delay.

dustin@Lorentz ~/tmp/mozilla-central $ hg push -f ssh://hg.mozilla.org/try
pushing to ssh://hg.mozilla.org/try
searching for changes
<--- hangs here for a while
remote: adding changesets
remote: adding manifests
remote: adding file changes
remote: added 1 changesets with 1 changes to 1 files (+1 heads)
remote: To push to try you must use try syntax in the push comment of the *last* change 
remote: See http://people.mozilla.org/~lsblakk/trychooser/ to build your syntax
remote: For assistance using the syntax and https://wiki.mozilla.org/Build:TryChooser for more information.
remote: Thank you for helping to reduce CPU cyles by asking for exactly what you need.
remote: transaction abort!
remote: rollback completed
remote: abort: pretxnchangegroup.c_try_mandatory hook failed
abort: unexpected response: empty string

That should be pretty easy to monitor with nagios, if it is indeed the timing we're interested in.
So something like

dustin@Lorentz ~/tmp/mozilla-central $ hg pull -u && time hg push -f ssh://hg.mozilla.org/try
pulling from http://hg.mozilla.org/mozilla-central
searching for changes
no changes found
pushing to ssh://hg.mozilla.org/try
searching for changes
remote: waiting for lock on repository /repo/hg/mozilla/try held by 'dm-svn02.mozilla.org:3534'
remote: abort: repository /repo/hg/mozilla/try: timed out waiting for lock held by dm-svn02.mozilla.org:3534
abort: unexpected response: empty string

real    13m49.328s
user    0m1.244s
sys     0m0.271s
I was wondering if we could monitor existing logs on hg to gather this information. Either that or add pre/post commit hooks as suggested in comment #0. Dummy commits scare me because of the possible side-effects they have (triggering try builds, interfering with pushlog)
As I said in comment #1, there are no hooks early enough to be "pre" hooks.

The logfiles only give access times, not durations.

The dummy commits won't have try syntax, so nothing is actually pushed, hence no risk.
Attached file test-push
Here's what I'm working on so far.  It's pretty careful to keep its repo up to date, but it will fail (report an error) if the push actually succeeds.

I think this will take too long for an active nagios check, but it could easily be configured to run from cron every 30m or so (maybe with a lockfile), and report its results passively to nagios, with a freshness timeout.

The tricky part that I don't know the answer to is, what userid and key should this use to push?  Do we have a machine account that's suitable?  I'll also need to figure out what machine to run it on, but I'm less worried about that.
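
For illustration, a minimal version of this kind of check (not the attached test-push script -- the clone path, nagios host/service names, and the 5-minute threshold are placeholders) might look like:

#!/bin/bash
# Sketch only: time a push that the try_mandatory hook should reject,
# then report the duration to nagios as a passive check result.
clone=/home/trymonitor/mozilla-central        # hypothetical local clone
cd "$clone" || exit 2
hg pull -u -q http://hg.mozilla.org/mozilla-central || exit 2
start=$(date +%s)
if hg push -f ssh://hg.mozilla.org/try >/dev/null 2>&1; then
    status=2
    msg="push unexpectedly succeeded (hook should have rejected it)"
else
    elapsed=$(( $(date +%s) - start ))
    status=0
    [ "$elapsed" -gt 300 ] && status=1        # warn if a rejected push takes over 5 minutes
    msg="rejected push round-trip took ${elapsed}s"
fi
# Passive nagios result: host<TAB>service<TAB>return-code<TAB>output
printf 'hgmonitor\ttry-push-time\t%s\t%s\n' "$status" "$msg" \
    | send_nsca -H nagios.example.org -c /etc/nagios/send_nsca.cfg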
It sounds like I should make a new LDAP account with access restricted to this host.  I'm not sure *where* it should run, though - is there precedent for this sort of thing?
dm-svn02 seems to be a decent place for it.
Try is really slow right now.  khuey, bz, espindola, and I all can't push.

Without monitoring, I suspect that these problems are never going to be fixed.
It's not that it's slow, but that multiple people are pushing simultaneously.  Of course, being slow makes that more common.  Still -- the list you give was the list of people whose pushes I was watching block one another.  I'm not sure who won, but at the moment I only see dougt and khuey (which suggests khuey lost..).
For the record, in this case, bz, espindola, and I all cancelled our pushes.  And it looks like dougt and khuey both were unable to push.

Is your theory that multiple people are queueing on the lock for a reason other than try being slow?  Otherwise I don't understand.  It seems that multiple people queueing on the lock is a symptom of the fact that the person holding the lock has had it for a long time -- that is, it's not that multiple people are pushing simultaneously, but that someone is taking a long time to push to try.

Whatever the problem is, do we agree that it's a problem and that it would be detected with the monitoring suggested in this bug?
Ben, over to you - I wonder if this is still worthwhile with the new, awesome hg infra?
Assignee: dustin → bkero
FWIW I've had basically no problems pushing to try in the past few months.
I'm not sure if this is worthwhile anymore either.  If /try isn't improved for multiple simultaneous commits, perhaps refactoring try would be a better approach.  Something like an interface to submit jobs to try that simply checks out a revision/tag of code from somebody's user repository?
That's a fraught question, and probably OT for this bug.  It sounds like we're all reasonably confident we can WONTFIX this, but of course there's that "REOPENED" status when you need it.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
Latest data in bug 734225 comment #17 implies timing issue not solely due to # heads (the only thing releng can measure and affect), so we need the connection timing information.

Reopening & marking as blocker for bug 734225
Blocks: 734225
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Assignee: bkero → nobody
Component: Server Operations → Release Engineering: Developer Tools
QA Contact: cshields → lsblakk
(In reply to Hal Wine [:hwine] from comment #15)
> Latest data in bug 734225 comment #17 implies timing issue not solely due to
> # heads (the only thing releng can measure and affect), so we need the
> connection timing information.
> 
> Reopening & marking as blocker for bug 734225

FYI, some time ago I did local try repo clone timings before and after try head pruning.  I know that the clone time was greatly reduced with a smaller number of heads.  I did not test push time.
sorry, i meant server ops dev tools
Assignee: nobody → server-ops-devservices
Component: Release Engineering: Developer Tools → Server Operations: Developer Services
QA Contact: lsblakk → shyam
(In reply to Justin Lebar [:jlebar] from comment #12)
> FWIW I've had basically no problems pushing to try in the past few months.

Try has been much slower than usual today, and others in #developers have been complaining of late as well.
Despite repeated attempts at pruning heads from the try repo over the weekend, the try server has been almost unusable since late last week. We're doing an emergency 30-minute downtime *for try only* this evening to delete and reclone the try repo. This is the nuclear option, but has a proven track record of success.

Meanwhile, the need for RelEng to have the ability to monitor performance of this repo remains. The former "just delete every month" approach is obviously not working, hence bumping this bug so we have a chance to avoid further last-minute production downtimes like this.
Assignee: server-ops-devservices → server-ops
Severity: normal → major
Component: Server Operations: Developer Services → Server Operations
QA Contact: shyam → phong
Assignee: server-ops → server-ops-devservices
Component: Server Operations → Server Operations: Developer Services
QA Contact: phong → shyam
Assignee: server-ops-devservices → mburns
Assignee: mburns → server-ops-devservices
Severity: major → normal
Bear in mind push to try time depends greatly on the mercurial client version.
Times differ by minutes vs seconds.  Mercurial 2.0.2 is new enough to perform better; 1.8.something is too old.  (2.1 brings in the complexity of phases - bug 725362.)
I mentioned it on the mailing list, but to get it down here too:

A simple way to do this would be to periodically push a head to tryserver with invalid trychooser syntax.  At the moment, a push with invalid syntax will be accepted, but trigger no builds.
Example/syntax of such a push?
(In reply to comment #22)
> Example/syntax of such a push?

hg qnew empty
hg qref -e -m 'try: -b do -p foo -u bar -t baz'
hg qfi .
$ cd path/to/mozilla-central/clone
$ hg pull -u
$ date > tryserver-timing-test-file
$ hg add tryserver-timing-test-file
$ hg commit -m "Testing how long it takes to push to try.  try: INTENTIONALLY INVALID SYNTAX"
$ time hg push ssh://hg.mozilla.org/try
[...]
remote: Thanks for helping save resources, you're the best!
remote: You can view the progress of your build at the following URL:
remote:   https://tbpl.mozilla.org/?tree=Try&rev=36db6b11dd58

I'm pretty sure this will not trigger any jobs.

Remove this dummy commit from your repository with

$ hg strip 'outgoing()'
> $ time hg push ssh://hg.mozilla.org/try

This should be hg push -f.

> $ hg commit -m "Testing how long it takes to push to try.  try: INTENTIONALLY INVALID SYNTAX"

This in fact did trigger builds!  Let's try (per Ehsan's comment) "try: -b foo -p foo -u foo"

https://tbpl.mozilla.org/?tree=Try&rev=4166b85d6337
I've been collecting this data since June, and just updated with the latest. Looks like great improvement on Oct 1 (or the script assumptions broke).

Data from September through roughly now:
  https://docs.google.com/spreadsheet/ccc?key=0AsioYpicZmlndFd1QzdhU0xMc3VJZ0RaLUZ6ZnpiTUE

Compare with data from June:
  https://docs.google.com/spreadsheet/ccc?key=0AsioYpicZmlndGJ6cVhBOExlNmp1clpDbkcwWmw1UUE

Holler if you want more of the raw data - I had to prune to just September to get google docs to accept the upload.
(In reply to comment #25)
> > $ hg commit -m "Testing how long it takes to push to try.  try: INTENTIONALLY INVALID SYNTAX"
> 
> This in fact did trigger builds!  Let's try (per Ehsan's comment) "try: -b foo
> -p foo -u foo"
> 
> https://tbpl.mozilla.org/?tree=Try&rev=4166b85d6337

This seems to have worked.
Hal, you mentioned on IRC that the data you'd been collecting did not match engineers' experience?

From the spreadsheet, it seems like you're pushing to try without try syntax, so that the push is rejected?  IIRC, even when pushing to try was very slow, it was reasonably fast to reject a push.  That is, the slow part must have come after the pre-commit hook.

ISTM that correct testing would actually push something to try, just not kick off any jobs.
Justin, yeah - I don't know if that's a "feature" or a "bug" - there's no _real_ reason to push to try if you aren't going to at least compile.

The other problem with actually doing a commit is then you start tickling the bad behavior in the running version of hgweb w.r.t. "depth of commit" causing timeouts when displaying the summary page (which is what the tbpl links point to, so...)

Since there isn't any major urgency at the moment, it's probably safest to leave this sit until after the next hg upgrade (bug 741353), which I believe has the hgweb fix.
> Justin, yeah - I don't know if that's a "feature" or a "bug"

Oh, it's definitely a bug that we're exploiting for our own nefarious purposes.

> The other problem with actually doing a commit is then you start tickling the bad behavior in the 
> running version of hgweb w.r.t. "depth of commit" causing timeouts when displaying the summary page 
> (which is what the tbpl links point to, so...)

I don't understand what you mean, but if it's only an hgweb problem, why does it matter?  Presumably nobody is going to look at the hgweb reports for these pushes anyway.

Now, there is something to be said for not cluttering TBPL with a bunch of fake commits, but that can't be too hard to solve on TBPL's end.
"-p none" looks like valid syntax.
Yep - lots of this ground covered in other bugs. Recapping here to spread the lore (and to document some things I can't find the bug for at the moment):
 - bug 734225 - set up periodic try merges to keep head count down.
                Abandoned as that (also) tickled the hgweb issue
 - bug 737089 - suppress the extra tbpl noise

The hgweb issue _does_ affect everyone, as the tbpl link is to the summary page of the repo, and that summary page loads the last n (20?) commits.  If any of those n-1 commits irrelevant to the current user were "too deep", then the page load would time out.

Bottom line, it's dev ops call on whether or not this is a pressing issue at the moment.
(In reply to Karl Tomlinson (:karlt) from comment #31)
> "-p none" looks like valid syntax.

So what's the full command that doesn't trigger builds?
(In reply to comment #33)
> (In reply to Karl Tomlinson (:karlt) from comment #31)
> > "-p none" looks like valid syntax.
> 
> So what's the full command that doesn't trigger builds?

See comment 27.
(In reply to Ehsan Akhgari [:ehsan] from comment #34)

> See comment 27.

Doesn't help, which is why I asked again :)
hg pull -u && hg qnew empty && hg qref -m 'try: -p none' && time hg push -f ssh://hg.mozilla.org/try/ && hg qpop -a && hg qrm empty

(with having an existing clone, having ssh set up to not prompt, and having

[extensions]
mq =

in the user's ~/.hgrc left as an exercise for the able)
D'oh, but that's dumb: you don't want to push the tip of m-c, because sometimes you'll be pushing csets that haven't been pushed to try yet, and you also don't want to push from some single old rev, both because we want to know how long pushing from tip takes and because you'll be pushing increasingly deep parents and triggering the hgweb bug. So possibly that, but push twice: once to get any not-yet-pushed csets from m-c out of the way, and a second time for the timing.
(In reply to Ehsan Akhgari [:ehsan] from comment #34)
> (In reply to comment #33)
> > (In reply to Karl Tomlinson (:karlt) from comment #31)
> > > "-p none" looks like valid syntax.
> > 
> > So what's the full command that doesn't trigger builds?
> 
> See comment 27.

Comment 27 said take comment 24 and substitute "try: -b foo -p foo -u foo" for "try: INTENTIONALLY INVALID SYNTAX".

But I feel like if you're having difficulty following this conversation, you may not be the right person to whip up this script.  Is there anyone else in your team who's a bit more familiar with hg and/or our processes that you could pass this off to?
Justin,

The bug isn't assigned to anyone. I'm trying to gather information before we take any action on it. I don't see why clarifying something is so troublesome? In the same tone, I could easily say "how about you give us a script and we'll hook it up to Nagios", but that's not how I like to work. I'm more than happy to help you guys with something, but do understand that what's trivial for you might not be for people who don't necessarily work with try syntax day in and day out.
> The bug isn't assigned to anyone.

You're right; I made the mistake of mixing up the QA contact with the assignee.  I'm sorry.
bug#827123 is an active example of slow try performance which should be detected by nagios. Adding to the tracking bug to avoid losing this bug again.
Blocks: re-nagios
Blocks: 770811
from offline discussions about *how* to detect problems with try repo.

1) jakem: is it possible to detect how long a lock exists for a repo on hg server? From email with developers this morning, they had 5 different stuck lock incidents on try in the last 24 hours. If lock-on-repo is possible to monitor, maybe it'd be a good early-warning of repo problems?

2) fox2mike, hwine, jlebar, ehsan, bkero: I've just read this entire bug, especially comment#27, #31, #32-#36 and also talked offline with some of you. There's lots of back-and-forth in this bug about what command is best to have nagios invoke... but nothing resolved afaict. Who can best give an emotionally neutral, verified-to-work command line that IT can trigger from nagios? If it helps, I'm happy to put together a meeting (will invite anyone who is interested).
Flags: needinfo?(nmaul)
Flags: needinfo?(justin.lebar+bug)
Flags: needinfo?(hwine)
Flags: needinfo?(ehsan)
Flags: needinfo?(bkero)
Can we empower whoever is assigned to fix this bug to decide what's the right command to run?
Flags: needinfo?(justin.lebar+bug)
QA Contact: shyam → nmaul
To repeat comment 27 once again:

$ cd path/to/mozilla-central/clone
$ hg pull -u
$ date > tryserver-timing-test-file
$ hg add tryserver-timing-test-file
$ hg commit -m "Testing how long it takes to push to try.  try: -b do -p foo -u bar -t baz"
$ time hg push ssh://hg.mozilla.org/try

It was tried by Justin back in October.  I have not been involved in any offline conversations, but as far as I can tell, it's pretty obvious what people have tried on the bug so far.
Flags: needinfo?(ehsan)
Quick historical note -- at the time this bug was originally filed, there was no known way to test an actual commit. The best we could do was "time until rejected by pre-commit hook". Everyone agreed that wasn't "good enough". (We couldn't get "time to commit" without triggering builds, which was too expensive to do frequently.)

Time has passed & tools have improved. There's now a way to handle that, as noted in comment 31. (However, support for "-p none" has been dropped in recent commits. It now works only by coincidence.)

Time has passed & servers have been upgraded. The hgweb issue mentioned in comment 32 should no longer apply. That can be monitored independently from push time.

There remain some timing issues, as mentioned in comment 37, which may make the data difficult to interpret. But it's worth another try.

I think a reasonable approach from here is:
 - monitor adhoc usage of that approach for a day or two to ensure it gives meaningful data (I've got this running -- ping for details)
 - figure how to wrap this for nagios
 - deploy
 - file bug to get "-p none" officially supported by try-parser
Flags: needinfo?(hwine)
(In reply to Hal Wine [:hwine] from comment #45)
> I think a reasonable approach from here is:
>  - monitor adhoc usage of that approach for a day or two to ensure it gives
> meaningful data (I've got this running -- ping for details)

The above monitoring has been disabled until folks sort out the requirements:
 - commenters in this thread want "real push" timing
 - running real pushes has the following side effects:
   - inflates load metrics (could be resolved by changes to tbpl & buildbot)
   - causes developer confusion on tbpl (could be resolved by changes to tbpl)
Attached image try_metrics_to_date.png
Data from about a week of try pushes. The push time has remained constant at 4 +/- 1 seconds as the # heads has grown to almost 1,500 (1.5 Kheads as graphed).
(In reply to Hal Wine [:hwine] from comment #46)
> (In reply to Hal Wine [:hwine] from comment #45)
> > I think a reasonable approach from here is:
> >  - monitor adhoc usage of that approach for a day or two to ensure it gives
> > meaningful data (I've got this running -- ping for details)
> 
> The above monitoring has been disabled until folks sort out the requirements:
>  - commenters in this thread want "real push" timing
>  - running real pushes has the following side effects:
>    - inflates load metrics (could be resolved by changes to tbpl & buildbot)
>    - causes developer confusion on tbpl (could be resolved by changes to
> tbpl)

I don't pretend to be an expert here, but creating a new email alias for this tester and filtering it out of tbpl and buildbot seems like it shouldn't require much effort.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #48)
> I don't pretend to be an expert here, but creating a new email alias for
> this tester and filtering it out of tbpl and buildbot seems like it
> shouldn't require much effort.

We could either filter it out at the json-pushes level (&ignorebots=1 ?):
https://hg.mozilla.org/mozilla-central/json-pushes?full=1&maxhours=24
http://hg.mozilla.org/hgcustom/pushlog/file/d445a97d837d/pushlog-feed.py

Or else:
https://hg.mozilla.org/webtools/tbpl/file/d7d9d36625fe/js/PushlogJSONParser.js#l50
(In reply to Hal Wine [:hwine] from comment #45)
> Quick historical note -- at the time this bug was originally filed, there
> was no known way to test an actual commit. The best we could do was "time
> until rejected by pre-commit hook". Everyone agreed that wasn't "good
> enough". (We couldn't get "time to commit" without triggering builds, which
> was too expensive to do frequently.)
> 
> Time has passed & tools have improved. There's now a way to handle that, as
> noted in comment 31. (However, support for "-p none" has been dropped in
> recent commits. It now works only by coincidence.)
> 
> Time has passed & servers have been upgraded. The hgweb issue mentioned in
> comment 32 should no longer apply. That can be monitored independently from
> push time.
> 
> There remain some timing issues, as mentioned in comment 37, which may make
> the data difficult to interpret. But it's worth another try.
> 
> I think a reasonable approach from here is:
>  - monitor adhoc usage of that approach for a day or two to ensure it gives
> meaningful data (I've got this running -- ping for details)
>  - figure how to wrap this for nagios
>  - deploy
>  - file bug to get "-p none" officially supported by try-parser

fyi, work in bug#837303 to set "-p none" as a warning.
(In reply to Ed Morley [:edmorley UTC+1] from comment #49)
> (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #48)
> > I don't pretend to be an expert here, but creating a new email alias for
> > this tester and filtering it out of tbpl and buildbot seems like it
> > shouldn't require much effort.
> 
> We could either filter it out at the json-pushes level (&ignorebots=1 ?):
> https://hg.mozilla.org/mozilla-central/json-pushes?full=1&maxhours=24
> http://hg.mozilla.org/hgcustom/pushlog/file/d445a97d837d/pushlog-feed.py
> 
> Or else:
> https://hg.mozilla.org/webtools/tbpl/file/d7d9d36625fe/js/PushlogJSONParser.
> js#l50

*Or*, we can just send an email to dev.platform and not try to fix a problem which nobody has complained about.  I've never heard anybody complain about pushes not belonging to themselves.

(In reply to Hal Wine [:hwine] from comment #46)
> (In reply to Hal Wine [:hwine] from comment #45)
> > I think a reasonable approach from here is:
> >  - monitor adhoc usage of that approach for a day or two to ensure it gives
> > meaningful data (I've got this running -- ping for details)
> 
> The above monitoring has been disabled until folks sort out the requirements:
>  - commenters in this thread want "real push" timing
>  - running real pushes has the following side effects:
>    - inflates load metrics (could be resolved by changes to tbpl & buildbot)

Even if the push doesn't trigger any jobs?


Really, I'm beginning to think that comment 43 was right on, we should just empower the person who's going to implement this to make these kinds of decisions.
(In reply to John O'Duinn [:joduinn] from comment #42)
> from offline discussions about *how* to detect problems with try repo.
> 
> 1) jakem: is it possible to detect how long a lock exists for a repo on hg
> server? From email with developers this morning, they had 5 different stuck
> lock incidents on try in the last 24 hours. If lock-on-repo is possible to
> monitor, maybe it'd be a good early-warning of repo problems?

It can be done, but will take some doing. One problem is the lock files are in many different places. They're all something like "<repo>/.hg/lock" of course, but the <repo> is not at a consistent depth, and so scanning for them is extremely slow. We essentially have to find out where they all live, monitor them somehow, and then report that data out in a usable way.

The simplest solution I can think of is a cron that (presuming it knows what files to look at) simply runs "ls -l" on each file, writes the output to a file, and puts that file somewhere it becomes web-visible. This contains a timestamp and a username. The consumers are developers and admins... I think this format would be easily understood. :)

This is merely a "who has it locked right now and for how long" capability. Trending on this is harder... it won't answer the question of "how long has repo X been locked this month". That's the kind of thing we would have to think about more.
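
As a rough illustration of that cron idea (the repo list, the output path, and the lock-file details are assumptions, not a description of anything deployed):

#!/bin/bash
# Sketch: given a pre-generated list of repo paths (scanning the whole
# tree live is too slow), report each held lock and how long it has been
# held, and write the result somewhere web-visible.
out=/var/www/html/hg-locks.txt           # hypothetical web-visible location
list=/etc/hg-repo-list.txt               # hypothetical pre-generated repo list
now=$(date +%s)
: > "$out"
while read -r repo; do
    lock="$repo/.hg/lock"
    # hg lock files are usually dangling symlinks whose target names the
    # holder as host:pid (e.g. dm-svn02.mozilla.org:3534)
    if [ -L "$lock" ] || [ -e "$lock" ]; then
        age=$(( now - $(stat -c %Y "$lock") ))
        holder=$(readlink "$lock" 2>/dev/null)
        echo "$repo locked for ${age}s by ${holder:-unknown}" >> "$out"
    fi
done < "$list"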
Flags: needinfo?(nmaul)
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #51)
> > The above monitoring has been disabled until folks sort out the requirements:
> >  - commenters in this thread want "real push" timing
> >  - running real pushes has the following side effects:
> >    - inflates load metrics (could be resolved by changes to tbpl & buildbot)
> 
> Even if the push doesn't trigger any jobs?

Yes -- it counts as a commit, and that skews the ratio of jobs/commit and compute hours/commit.

Those are the longest running metrics we have, and need to be kept meaningful for comparison. We did adjust the numbers for the duration of my experiment, but that was non-trivial.

> Really, I'm beginning to think that comment 43 was right on, we should just
> empower the person who's going to implement this to make these kinds of
> decisions.

And the requirements have now been expanded to "not mess with existing metrics" and "not mess with developer views" (tbpl, etc.).
(In reply to comment #53)
> (In reply to :Ehsan Akhgari (needinfo? me!) from comment #51)
> > > The above monitoring has been disabled until folks sort out the requirements:
> > >  - commenters in this thread want "real push" timing
> > >  - running real pushes has the following side effects:
> > >    - inflates load metrics (could be resolved by changes to tbpl & buildbot)
> > 
> > Even if the push doesn't trigger any jobs?
> 
> Yes -- it counts as a commit, and that skews the ratio of jobs/commit and
> compute hours/commit.
> 
> Those are the longest running metrics we have, and need to be kept meaningful
> for comparison. We did adjust the numbers for the duration of my experiment,
> but that was non-trivial.

What are those data points used for, and who uses them?

> > Really, I'm beginning to think that comment 43 was right on, we should just
> > empower the person who's going to implement this to make these kinds of
> > decisions.
> 
> And the requirements have now been expanded to "not mess with existing metrics"
> and "not mess with developer views" (tbpl, etc.).

Well, this bug was filed for a very specific need.  I asked about the first item above, and I don't agree that these jobs not showing on the try TBPL is a requirement.  Who asked for that?
Flags: needinfo?(hwine)
comment 45 still holds -- lots of things have changed since this bug was filed. The requested monitoring may no longer provide any value.

I just ran 2 monitored pushes over the last day -- we're currently over 5K heads, with a commit time of 8 seconds. That's a 2x increase in time for a 3x increase in heads from one month ago (comment 47).

My monitoring is biased to reduce the chance of lock contention in the measurement (I first attempt a push that will be rejected for bad try syntax). I've seen outlier times of up to 66 seconds for the first, likely due to someone else committing at the same time. I don't know how hg does backoffs before retrying.

To answer Ehsan's questions for the record:
 - folks in New Zealand were the ones who complained about the tbpl noise. Sheriffs also expressed desire to have that data muted.
 - the data points are part of joduinn's load metrics that are used for making projections on equipment needs.
Flags: needinfo?(hwine)
(In reply to Hal Wine [:hwine] from comment #55)
> To answer Ehsan's questions for the record:
>  - folks in New Zealand were the ones who complained about the tbpl noise.

Can you please name specific people, so that we can talk to them and see if their objection is valid?

> Sheriffs also expressed desire to have that data muted.

Needinfo-ing Ryan and Ed.  Guys, my contention here is that very few developers, if any, look at https://tbpl.mozilla.org/?tree=Try directly, so commits coming from a monitoring bot should not disrupt their workflow.  Do you guys agree?

>  - the data points are part of joduinn's load metrics that are used for
> making projections on equipment needs.

Can we then count the number of commits made by the monitoring tool and ask John to subtract them when preparing his reports?

We need to make progress here, I'm just trying to make sure that we're only blocked by issues which are worse than the problem that we've had with the try server over the years.
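
As a very rough sketch of that idea (the monitoring account name is invented, and a real tool should parse the JSON rather than grep it), the pushes could be counted from the json-pushes feed cited in comment 49:

$ curl -s 'https://hg.mozilla.org/try/json-pushes?maxhours=24' \
      | grep -o '"user": *"[^"]*"' | grep -c 'trymonitor@mozilla.com'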
Flags: needinfo?(ryanvm)
Flags: needinfo?(joduinn)
Flags: needinfo?(emorley)
I rarely look at Try directly on TBPL. I know philor does, but I don't know at what frequency.
Flags: needinfo?(ryanvm)
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #57)
> I rarely look at Try directly on TBPL. I know philor does, but I don't know
> at what frequency.

Philor, how much do you care about https://tbpl.mozilla.org/?tree=Try?  Please see comment 56.
Flags: needinfo?(philringnalda)
I care a great deal about it, but I'm a big boy, I can look right past a push every 30 minutes that doesn't even trigger anything.
Flags: needinfo?(philringnalda)
Summarizing a quick chat on IRC with :ehsan -- I pinged him to see if there was a new urgency to resolving this (based on the recent increase in activity on this bug). There is no new urgency.

 - There are no current problems with try push times being reported by devs.

 - We're all interested in avoiding being hit by another "too many heads causing problems" issue.

 - While there are technical solutions to all the collateral damage caused by monitoring, that will take some time and effort.

 - We're all interested in getting the long term fixes implemented ASAP (e.g. moving off NFS).

We're all still in the middle of summer PTO season, so let me propose the following as the actions for now:

 1. I'll continue to pull numbers once per month (comment 45, comment 55). The trend so far is encouraging that we'll be able to spot issues well in advance.

 2. I'll pull numbers if there are any new reports, and we can act as needed to reset the try repository.

My hope is that the recent server upgrade, and the subsequent reset, have cleared out a bunch of the issues. (The upgrade was supposed to help with some of these issues.) That should buy us enough time to complete the longer-term back-end upgrades. And that, we all hope, will eliminate this problem.
(In reply to comment #60)
> Summarizing a quick chat on IRC with :ehsan -- I pinged him to see if there was
> a new urgency to resolving this (based on the recent increase in activity on
> this bug). There is no new urgency.

But there is old urgency here!  :-)

>  - There are no current problems with try push times being reported by devs.
> 
>  - We're all interested in avoiding being hit by another "too many heads
> causing problems" issue.
> 
>  - While there are technical solutions to all the collateral damage caused by
> monitoring, that will take some time and effort.

This is the core of what I'm trying to get at.  Philor said that he can live with those pushes.  I'm really interested in talking about other *concrete* problems we have here and involving the people who think those are problems to see if we can find easy solutions to them.  I'm convinced that the answer is yes, since none of the problems that you mentioned seem difficult to fix.

That said, talking about these things in the abstract will probably not get us anywhere.

>  - We're all interested in getting the long term fixes implemented ASAP (e.g.
> moving off NFS).
> 
> We're all still in the middle of summer PTO season, so let me propose the
> following as the actions for now:
> 
>  1. I'll continue to pull numbers once per month (comment 45, comment 55). The
> trend so far is encouraging that we'll be able to spot issues well in advance.
> 
>  2. I'll pull numbers if there are any new reports, and we can act as needed to
> reset the try repository.

I think you mentioned on IRC that the people who would be able to work on this are all working on the NFS move.  If that's the case, then I think the above is pretty much the only thing that we can do, and this discussion doesn't really change things.  But I still don't see why we can't try to reach consensus on solutions to other problems here in the mean time.
The spurious try pushes appearing in TBPL and hgweb are annoying, but not so much so that I'd stand in the way of that method of monitoring.
Flags: needinfo?(emorley)
Reading this bug again. Half to clear my (probably unneeded at this point) needinfo, and half to suggest this:

I believe that creating new pushes (and heads) on the try repository every 30 minutes is likely to create greater load problems for us in the future. A way to alleviate this is to create a new hook that ultimately rejects the push at the end, although this would need to run before insertion into the pushlog; a sketch of that idea is below.
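
A sketch of what such a rejection hook could look like, purely as an illustration (the sentinel string and hook name are invented, and whether this actually runs before the pushlog hook would need to be verified):

[hooks]
# Reject any push whose first incoming changeset carries a sentinel string
# used only by the monitoring script; a failing pretxnchangegroup hook rolls
# the transaction back, so nothing lands and nothing should reach the pushlog.
# The "a_" prefix mirrors the existing c_try_mandatory naming convention.
pretxnchangegroup.a_reject_monitor = ! hg log -r "$HG_NODE" --template "{desc}\n" | grep -q PUSH-TIMING-SENTINEL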

Would a nagios check to verify repository lock age be sufficient for most of these problems? We haven't had one of these locking issues (afaik) for several months.
Flags: needinfo?(bkero)
Blocks: try-tracker
Updating status:
 - in general try is much improved from when this bug was opened
 - we still need this monitoring data
 - current try issues should be in separate bugs blocking bug 1017724 (current tracker)
 - cancelling existing ni's as being no longer relevant

I'll be attaching a new proposed patch which monitors ssh connections to the hg.mozilla.org servers. That avoids all the "collateral damage" associated with the previous approaches.
Flags: needinfo?(john+bugzilla)
Attached file pash_wrapper (obsolete) —
Wrapper script to be deployed as a replacement for the sshd forced command on the hgssh* machines. It simply logs connection information and invokes the existing forced command, 'pash.py'.

This has been tested on hgssh2 for clones & updates:
 - log messages appear as expected (see below)
 - no problem log messages appear in /var/log/messages
 - no apparent difference in time to clone or pull a repo (used mozilla-central)
 - write (push) testing not done, but hg command is identical

The script provides a pair of entries for each ssh connection to the host, along with the final exit code from pash.py.

Example of successful hg operation:
 Aug  3 16:16:31 hgssh2 pash_wrapper-29666: start user: 'hwine@mozilla.com; ssh_command: 'hg -R build/tools serve --stdio'; connection: '10.22.248.10 56572 10.22.74.29 22'
 Aug  3 16:16:32 hgssh2 pash_wrapper-29666: end user: 'hwine@mozilla.com; ssh_command: 'hg -R build/tools serve --stdio'; connection: '10.22.248.10 56572 10.22.74.29 22'; ec: '0'

Example of failed hg operation:
 Aug  3 16:25:07 hgssh2 pash_wrapper-32150: start user: 'hwine@mozilla.com; ssh_command: 'hg -R build/xxxtools serve --stdio'; connection: '10.22.248.10 56833 10.22.74.29 22'
 Aug  3 16:25:07 hgssh2 pash_wrapper-32150: end user: 'hwine@mozilla.com; ssh_command: 'hg -R build/xxxtools serve --stdio'; connection: '10.22.248.10 56833 10.22.74.29 22'; ec: '255'
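
The attachment itself isn't inlined here; based on the log format above and the interdiff later in this bug, a rough reconstruction of the wrapper (treat it as a sketch -- in particular, which environment variables carry the account name and connection info is assumed, not confirmed) is:

#!/bin/bash

# minimal wrapper to log each connection, then hand off to pash.py

facility=local3             # review later switched this to local2
level=INFO
tag="${0##*/}-$$"           # script-PID (review later asked for script[PID])

prog_dir=$(cd $(dirname "$0");/bin/pwd)

# Assumed: the LDAP account is the login user, and sshd sets
# SSH_ORIGINAL_COMMAND / SSH_CONNECTION for the forced command.
# Format matches the examples above (including the unbalanced quote after the user).
msg="user: '$USER; ssh_command: '$SSH_ORIGINAL_COMMAND'; connection: '$SSH_CONNECTION'"

logger -p "$facility.$level" -t "$tag" "start $msg"
"$prog_dir/pash.py"
ec=$?
logger -p "$facility.$level" -t "$tag" "end $msg; ec: '$ec'"
exit $ec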
Assignee: server-ops-devservices → hwine
Attachment #8466871 - Flags: review?(klibby)
Comment on attachment 8466871 [details]
pash_wrapper

> facility=local3

local[345] are in use in some form (or at least configured in rsyslog.conf). For whatever reason, local[23567] are excluded from /var/log/messages, so if we want logs there, local1 is the way to go.

> tag="${0##*/}-$$"   # script-PID

would prefer tag="${0##*/}[$$]" for script[PID]

> prog_dir=$(cd $(dirname "$0");/bin/pwd)

seems cumbersome given that ForceCommand will always be the full path, but otherwise ok.
Attachment #8466871 - Flags: review?(klibby) → review+
Attached file pash_wrapper
(In reply to Kendall Libby [:fubar] from comment #66)
> Comment on attachment 8466871 [details]
> pash_wrapper
> 
> > facility=local3
> 
> local[345] are in use in some form (or at least configured in rsyslog.conf).
> for whatever reason, local[23567] are excluded from /var/log/messages, so if
> we want logs there, local1 is the way to go. 

Changed to local2 -- this may need fine tuning, and collection from a separate log file is easier for me. 

> would prefer tag="${0##*/}[$$]" for script[PID]

done

> 
> > prog_dir=$(cd $(dirname "$0");/bin/pwd)
> 
> seems cumbersome given that ForceCommand will always be the full path, but
> otherwise ok.

Well, doing it this way leaves the only hard-coded location in the sshd_config file. The only requirement is that both this wrapper and pash.py be deployed in the same directory.


Interdiff:
--- pash_wrapper.old    2014-08-06 11:35:08.000000000 -0700
+++ pash_wrapper        2014-08-06 11:34:40.000000000 -0700
@@ -3,9 +3,9 @@
 
 # minimal wrapper to log
 
-facility=local3
+facility=local2
 level=INFO
-tag="${0##*/}-$$"   # script-PID
+tag="${0##*/}[$$]"   # script[PID]
 
 prog_dir=$(cd $(dirname "$0");/bin/pwd)
Attachment #8466871 - Attachment is obsolete: true
Attachment #8468645 - Flags: review+
Deployed in 91682.
Component: Server Operations: Developer Services → Mercurial: hg.mozilla.org
Product: mozilla.org → Developer Services
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/52]
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/52] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1346] [kanban:engops:https://kanbanize.com/ctrl_board/6/52]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1346] [kanban:engops:https://kanbanize.com/ctrl_board/6/52] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1349] [kanban:engops:https://kanbanize.com/ctrl_board/6/52]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1349] [kanban:engops:https://kanbanize.com/ctrl_board/6/52] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1350] [kanban:engops:https://kanbanize.com/ctrl_board/6/52]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1350] [kanban:engops:https://kanbanize.com/ctrl_board/6/52] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1355] [kanban:engops:https://kanbanize.com/ctrl_board/6/52]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1355] [kanban:engops:https://kanbanize.com/ctrl_board/6/52] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1355]
done as of comment 68
Status: REOPENED → RESOLVED
Closed: 12 years ago → 9 years ago
Resolution: --- → FIXED