Closed Bug 691459 Opened 13 years ago Closed 9 years ago

Should have monitoring of how long it takes to push to try.

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: justin.lebar+bug, Assigned: hwine)

References

(Blocks 1 open bug)

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1355] )

Attachments

(3 files, 1 obsolete file)

Pushing tip of trunk to try often takes a long time for me.  Today it took more than 15 minutes.

I think the reason this problem continues to resurface is that we currently don't monitor how long it takes to push to try.  The feedback loop consists solely of people complaining in IRC.

There are a number of ways we could monitor this.  We could have a process which makes a nop push to try every so often and times how long it takes.  Or we could have pre- and post-commit hooks which detect how long each push took.  (The pre-commit hooks seem to be run immediately, since my pushes are generally rejected without delay for not having trychooser syntax, but of course this would need to be verified.)
Assignee: server-ops → dustin
Hm, I don't think hooks will do the trick here:

dustin@Lorentz ~/tmp/src-repo $ hg push ssh://hg.mozilla.org/try
pushing to ssh://hg.mozilla.org/try
remote: Warning: Permanently added the RSA host key for IP address '63.245.208.189' to the list of known hosts.
searching for changes
<--- hangs here, before the try_mandatory pretxnchangegroup run (which would have failed)
abort: repository is unrelated

This *does*, however, suggest that a push with a bogus commit message will fail in any case, but give a reasonable indication of push delay.

dustin@Lorentz ~/tmp/mozilla-central $ hg push -f ssh://hg.mozilla.org/try
pushing to ssh://hg.mozilla.org/try
searching for changes
<--- hangs here for a while
remote: adding changesets
remote: adding manifests
remote: adding file changes
remote: added 1 changesets with 1 changes to 1 files (+1 heads)
remote: To push to try you must use try syntax in the push comment of the *last* change 
remote: See http://people.mozilla.org/~lsblakk/trychooser/ to build your syntax
remote: For assistance using the syntax and https://wiki.mozilla.org/Build:TryChooser for more information.
remote: Thank you for helping to reduce CPU cyles by asking for exactly what you need.
remote: transaction abort!
remote: rollback completed
remote: abort: pretxnchangegroup.c_try_mandatory hook failed
abort: unexpected response: empty string

That should be pretty easy to monitor with nagios, if it is indeed the timing we're interested in.
So something like

dustin@Lorentz ~/tmp/mozilla-central $ hg pull -u && time hg push -f ssh://hg.mozilla.org/try
pulling from http://hg.mozilla.org/mozilla-central
searching for changes
no changes found
pushing to ssh://hg.mozilla.org/try
searching for changes
remote: waiting for lock on repository /repo/hg/mozilla/try held by 'dm-svn02.mozilla.org:3534'
remote: abort: repository /repo/hg/mozilla/try: timed out waiting for lock held by dm-svn02.mozilla.org:3534
abort: unexpected response: empty string

real    13m49.328s
user    0m1.244s
sys     0m0.271s
I was wondering if we could monitor existing logs on hg to gather this information. Either that or add pre/post commit hooks as suggested in comment #0. Dummy commits scare me because of the possible side-effects they have (triggering try builds, interfering with pushlog)
As I said in comment #1, there are no hooks early enough to be "pre" hooks.

The logfiles only give access times, not durations.

The dummy commits won't have try syntax, so nothing is actually pushed, hence no risk.
Attached file test-push
Here's what I'm working on so far.  It's pretty careful to keep its repo up to date, but it will fail (report an error) if the push actually succeeds.

I think this will take too long for an active nagios check, but it could easily be configured to run from cron every 30m or so (maybe with a lockfile), and report its results passively to nagios, with a freshness timeout.

The tricky part that I don't know the answer to is, what userid and key should this use to push?  Do we have a machine account that's suitable?  I'll also need to figure out what machine to run it on, but I'm less worried about that.
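
For illustration, a minimal version of this kind of check (not the attached test-push script -- the clone path, nagios host/service names, and the 5-minute threshold are placeholders) might look like:

#!/bin/bash
# Sketch only: time a push that the try_mandatory hook should reject,
# then report the duration to nagios as a passive check result.
clone=/home/trymonitor/mozilla-central        # hypothetical local clone
cd "$clone" || exit 2
hg pull -u -q http://hg.mozilla.org/mozilla-central || exit 2
start=$(date +%s)
if hg push -f ssh://hg.mozilla.org/try >/dev/null 2>&1; then
    status=2
    msg="push unexpectedly succeeded (hook should have rejected it)"
else
    elapsed=$(( $(date +%s) - start ))
    status=0
    [ "$elapsed" -gt 300 ] && status=1        # warn if a rejected push takes over 5 minutes
    msg="rejected push round-trip took ${elapsed}s"
fi
# Passive nagios result: host<TAB>service<TAB>return-code<TAB>output
printf 'hgmonitor\ttry-push-time\t%s\t%s\n' "$status" "$msg" \
    | send_nsca -H nagios.example.org -c /etc/nagios/send_nsca.cfg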
It sounds like I should make a new LDAP account with access restricted to this host.  I'm not sure *where* it should run, though - is there precedent for this sort of thing?
dm-svn02 seems to be a decent place for it.
Try is really slow right now.  khuey, bz, espindola, and I all can't push.

Without monitoring, I suspect that these problems are never going to be fixed.
It's not that it's slow, but that multiple people are pushing simultaneously.  Of course, being slow makes that more common.  Still -- the list you give was the list of people whose pushes I was watching block one another.  I'm not sure who won, but at the moment I only see dougt and khuey (which suggests khuey lost..).
For the record, in this case, bz, espindola, and I all cancelled our pushes.  And it looks like dougt and khuey both were unable to push.

Is your theory that multiple people are queueing on the lock for a reason other than try being slow?  Otherwise I don't understand.  It seems that multiple people queueing on the lock is a symptom of the fact that the person holding the lock has had it for a long time -- that is, it's not that multiple people are pushing simultaneously, but that someone is taking a long time to push to try.

Whatever the problem is, do we agree that it's a problem and that it would be detected with the monitoring suggested in this bug?
Ben, over to you - I wonder if this is still worthwhile with the new, awesome hg infra?
Assignee: dustin → bkero
FWIW I've had basically no problems pushing to try in the past few months.
I'm not sure if this is worthwhile anymore either.  If /try isn't improved for multiple simultaneous commits, perhaps refactoring try would be a better approach.  Something like an interface to submit jobs to try that simply checks out a revision/tag of code from somebody's user repository?
That's a fraught question, and probably OT for this bug.  It sounds like we're all reasonably confident we can WONTFIX this, but of course there's that "REOPENED" status when you need it.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
Latest data in bug 734225 comment #17 implies timing issue not solely due to # heads (the only thing releng can measure and affect), so we need the connection timing information.

Reopening & marking as blocker for bug 734225
Blocks: 734225
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Assignee: bkero → nobody
Component: Server Operations → Release Engineering: Developer Tools
QA Contact: cshields → lsblakk
(In reply to Hal Wine [:hwine] from comment #15)
> Latest data in bug 734225 comment #17 implies timing issue not solely due to
> # heads (the only thing releng can measure and affect), so we need the
> connection timing information.
> 
> Reopening & marking as blocker for bug 734225

FYI, some time ago I did local try repo clone timings before and after try head pruning.  I know that the clone time was greatly reduced with a smaller number of heads.  I did not test push time.
sorry, i meant server ops dev tools
Assignee: nobody → server-ops-devservices
Component: Release Engineering: Developer Tools → Server Operations: Developer Services
QA Contact: lsblakk → shyam
(In reply to Justin Lebar [:jlebar] from comment #12)
> FWIW I've had basically no problems pushing to try in the past few months.

Try has been much slower than usual today, and others in #developers have been complaining of late as well.
Despite repeated attempts at pruning heads from the try repo over the weekend, the try server has been almost unusable since late last week. We're doing an emergency 30-minute downtime *for try only* this evening to delete and reclone the try repo. This is the nuclear option, but has a proven track record of success.

Meanwhile, the need for RelEng to have the ability to monitor performance of this repo remains. The former "just delete every month" approach is obviously not working, hence bumping this bug so we have a chance to avoid further last-minute production downtimes like this.
Assignee: server-ops-devservices → server-ops
Severity: normal → major
Component: Server Operations: Developer Services → Server Operations
QA Contact: shyam → phong
Assignee: server-ops → server-ops-devservices
Component: Server Operations → Server Operations: Developer Services
QA Contact: phong → shyam
Assignee: server-ops-devservices → mburns
Assignee: mburns → server-ops-devservices
Severity: major → normal
Bear in mind push to try time depends greatly on the mercurial client version.
Times differ by minutes vs seconds.  Mercurial 2.0.2 is new enough to perform better; 1.8.something is too old.  (2.1 brings in the complexity of phases - bug 725362.)
I mentioned it on the mailing list, but to get it down here too:

A simple way to do this would be to periodically push a head to tryserver with invalid trychooser syntax.  At the moment, a push with invalid syntax will be accepted, but trigger no builds.
Example/syntax of such a push?
(In reply to comment #22)
> Example/syntax of such a push?

hg qnew empty
hg qref -e -m 'try: -b do -p foo -u bar -t baz'
hg qfi .
$ cd path/to/mozilla-central/clone
$ hg pull -u
$ date > tryserver-timing-test-file
$ hg add tryserver-timing-test-file
$ hg commit -m "Testing how long it takes to push to try.  try: INTENTIONALLY INVALID SYNTAX"
$ time hg push ssh://hg.mozilla.org/try
[...]
remote: Thanks for helping save resources, you're the best!
remote: You can view the progress of your build at the following URL:
remote:   https://tbpl.mozilla.org/?tree=Try&rev=36db6b11dd58

I'm pretty sure this will not trigger any jobs.

Remove this dummy commit from your repository with

$ hg strip 'outgoing()'
> $ time hg push ssh://hg.mozilla.org/try

This should be hg push -f.

> $ hg commit -m "Testing how long it takes to push to try.  try: INTENTIONALLY INVALID SYNTAX"

This in fact did trigger builds!  Let's try (per Ehsan's comment) "try: -b foo -p foo -u foo"

https://tbpl.mozilla.org/?tree=Try&rev=4166b85d6337
I've been collecting this data since June, and just updated with the latest. Looks like great improvement on Oct 1 (or the script assumptions broke).

Data from September through roughly now:
  https://docs.google.com/spreadsheet/ccc?key=0AsioYpicZmlndFd1QzdhU0xMc3VJZ0RaLUZ6ZnpiTUE

Compare with data from June:
  https://docs.google.com/spreadsheet/ccc?key=0AsioYpicZmlndGJ6cVhBOExlNmp1clpDbkcwWmw1UUE

Holler if you want more of the raw data - I had to prune to just September to get google docs to accept the upload.
(In reply to comment #25)
> > $ hg commit -m "Testing how long it takes to push to try.  try: INTENTIONALLY INVALID SYNTAX"
> 
> This in fact did trigger builds!  Let's try (per Ehsan's comment) "try: -b foo
> -p foo -u foo"
> 
> https://tbpl.mozilla.org/?tree=Try&rev=4166b85d6337

This seems to have worked.
Hal, you mentioned on IRC that the data you'd been collecting did not match engineers' experience?

From the spreadsheet, it seems like you're pushing to try without try syntax, so that the push is rejected?  IIRC, even when pushing to try was very slow, it was reasonably fast to reject a push.  That is, the slow part must have come after the pre-commit hook.

ISTM that correct testing would actually push something to try, just not kick off any jobs.
Justin, yeah - I don't know if that's a "feature" or a "bug" - there's no _real_ reason to push to try if you aren't going to at least compile.

The other problem with actually doing a commit is then you start tickling the bad behavior in the running version of hgweb w.r.t. "depth of commit" causing timeouts when displaying the summary page (which is what the tbpl links point to, so...)

Since there isn't any major urgency at the moment, it's probably safest to leave this sit until after the next hg upgrade (bug 741353), which I believe has the hgweb fix.
> Justin, yeah - I don't know if that's a "feature" or a "bug"

Oh, it's definitely a bug that we're exploiting for our own nefarious purposes.

> The other problem with actually doing a commit is then you start tickling the bad behavior in the 
> running version of hgweb w.r.t. "depth of commit" causing timeouts when displaying the summary page 
> (which is what the tbpl links point to, so...)

I don't understand what you mean, but if it's only an hgweb problem, why does it matter?  Presumably nobody is going to look at the hgweb reports for these pushes anyway.

Now, there is something to be said for not cluttering TBPL with a bunch of fake commits, but that can't be too hard to solve on TBPL's end.
"-p none" looks like valid syntax.
Yep - lots of this ground covered in other bugs. Recapping here to spread the lore (and to document some things I can't find the bug for at the moment):
 - bug 734225 - set up periodic try merges to keep head count down.
                Abandoned as that (also) tickled the hgweb issue
 - bug 737089 - suppress the extra tbpl noise

The hgweb issue _does_ affect everyone, as the tbpl link is to the summary page of the repo, and that summary page loads the last n (20?) commits.  If any of those n-1 commits irrelevant to the current user were "too deep", then the page load would time out.

Bottom line, it's dev ops call on whether or not this is a pressing issue at the moment.
(In reply to Karl Tomlinson (:karlt) from comment #31)
> "-p none" looks like valid syntax.

So what's the full command that doesn't trigger builds?
(In reply to comment #33)
> (In reply to Karl Tomlinson (:karlt) from comment #31)
> > "-p none" looks like valid syntax.
> 
> So what's the full command that doesn't trigger builds?

See comment 27.
(In reply to Ehsan Akhgari [:ehsan] from comment #34)

> See comment 27.

Doesn't help, which is why I asked again :)
hg pull -u && hg qnew empty && hg qref -m 'try: -p none' && time hg push -f ssh://hg.mozilla.org/try/ && hg qpop -a && hg qrm empty

(with having an existing clone, having ssh set up to not prompt, and having

[extensions]
mq =

in the user's ~/.hgrc left as an exercise for the able)
D'oh, but that's dumb: you don't want to push the tip of m-c, because sometimes you'll be pushing csets that haven't been pushed to try yet, and you also don't want to push from some single old rev, both because we want to know how long pushing from tip takes and because you'll be pushing increasingly deep parents and triggering the hgweb bug. So possibly that, but push twice: once to get any not-yet-pushed csets from m-c out of the way, and a second time for the timing.
(In reply to Ehsan Akhgari [:ehsan] from comment #34)
> (In reply to comment #33)
> > (In reply to Karl Tomlinson (:karlt) from comment #31)
> > > "-p none" looks like valid syntax.
> > 
> > So what's the full command that doesn't trigger builds?
> 
> See comment 27.

Comment 27 said take comment 24 and substitute "try: -b foo -p foo -u foo" for "try: INTENTIONALLY INVALID SYNTAX".

But I feel like if you're having difficulty following this conversation, you may not be the right person to whip up this script.  Is there anyone else in your team who's a bit more familiar with hg and/or our processes that you could pass this off to?
Justin,

The bug isn't assigned to anyone. I'm trying to gather information before we take any action on it. I don't see why clarifying something is so troublesome? In the same tone, I could easily say "how about you give us a script and we'll hook it up to Nagios", but that's not how I like to work. I'm more than happy to help you guys with something, but do understand that what's trivial for you might not be for people who don't necessarily work with try syntax day in and day out.
> The bug isn't assigned to anyone.

You're right; I made the mistake of mixing up the QA contact with the assignee.  I'm sorry.
bug#827123 is an active example of slow try performance which should be detected by nagios. Adding to the tracking bug to avoid losing this bug again.
Blocks: re-nagios
Blocks: 770811
from offline discussions about *how* to detect problems with try repo.

1) jakem: is it possible to detect how long a lock exists for a repo on hg server? From email with developers this morning, they had 5 different stuck lock incidents on try in the last 24 hours. If lock-on-repo is possible to monitor, maybe it'd be a good early-warning of repo problems?

2) fox2mike, hwine, jlebar, ehsan, bkero: I've just read this entire bug, especially comment#27, #31, #32-#36 and also talked offline with some of you. There's lots of back-and-forth in this bug about what command is best to have nagios invoke... but nothing resolved afaict. Who can best give an emotionally neutral, verified-to-work command line that IT can trigger from nagios? If it helps, I'm happy to put together a meeting (will invite anyone who is interested).
Flags: needinfo?(nmaul)
Flags: needinfo?(justin.lebar+bug)
Flags: needinfo?(hwine)
Flags: needinfo?(ehsan)
Flags: needinfo?(bkero)
Can we empower whoever is assigned to fix this bug to decide what's the right command to run?
Flags: needinfo?(justin.lebar+bug)
QA Contact: shyam → nmaul
To repeat comment 27 once again:

$ cd path/to/mozilla-central/clone
$ hg pull -u
$ date > tryserver-timing-test-file
$ hg add tryserver-timing-test-file
$ hg commit -m "Testing how long it takes to push to try.  try: -b do -p foo -u bar -t baz"
$ time hg push ssh://hg.mozilla.org/try

It was tried by Justin back in October.  I have not been involved in any offline conversations, but as far as I can tell, it's pretty obvious what people have tried on the bug so far.
Flags: needinfo?(ehsan)
Quick historical note -- at the time this bug was originally filed, there was no known way to test an actual commit. The best we could do was "time until rejected by pre-commit hook". Everyone agreed that wasn't "good enough". (We couldn't get "time to commit" without triggering builds, which was too expensive to do frequently.)

Time has passed & tools have improved. There's now a way to handle that, as noted in comment 31. (However, support for "-p none" has been dropped in recent commits. It now works only by coincidence.)

Time has passed & servers have been upgraded. The hgweb issue mentioned in comment 32 should no longer apply. That can be monitored independently from push time.

There remain some timing issues, as mentioned in comment 37, which may make the data difficult to interpret. But it's worth another try.

I think a reasonable approach from here is:
 - monitor adhoc usage of that approach for a day or two to ensure it gives meaningful data (I've got this running -- ping for details)
 - figure how to wrap this for nagios
 - deploy
 - file bug to get "-p none" officially supported by try-parser
Flags: needinfo?(hwine)
(In reply to Hal Wine [:hwine] from comment #45)
> I think a reasonable approach from here is:
>  - monitor adhoc usage of that approach for a day or two to ensure it gives
> meaningful data (I've got this running -- ping for details)

The above monitoring has been disabled until folks sort out the requirements:
 - commenters in this thread want "real push" timing
 - running real pushes has the following side effects:
   - inflates load metrics (could be resolved by changes to tbpl & buildbot)
   - causes developer confusion on tbpl (could be resolved by changes to tbpl)
Attached image try_metrics_to_date.png
Data from about a week of try pushes. The push time has remained constant at 4 +/- 1 seconds as the # heads has grown to almost 1,500 (1.5 Kheads as graphed).
(In reply to Hal Wine [:hwine] from comment #46)
> (In reply to Hal Wine [:hwine] from comment #45)
> > I think a reasonable approach from here is:
> >  - monitor adhoc usage of that approach for a day or two to ensure it gives
> > meaningful data (I've got this running -- ping for details)
> 
> The above monitoring has been disabled until folks sort out the requirements:
>  - commenters in this thread want "real push" timing
>  - running real pushes has the following side effects:
>    - inflates load metrics (could be resolved by changes to tbpl & buildbot)
>    - causes developer confusion on tbpl (could be resolved by changes to
> tbpl)

I don't pretend to be an expert here, but creating a new email alias for this tester and filtering it out of tbpl and buildbot seems like it shouldn't require much effort.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #48)
> I don't pretend to be an expert here, but creating a new email alias for
> this tester and filtering it out of tbpl and buildbot seems like it
> shouldn't require much effort.

We could either filter it out at the json-pushes level (&ignorebots=1 ?):
https://hg.mozilla.org/mozilla-central/json-pushes?full=1&maxhours=24
http://hg.mozilla.org/hgcustom/pushlog/file/d445a97d837d/pushlog-feed.py

Or else:
https://hg.mozilla.org/webtools/tbpl/file/d7d9d36625fe/js/PushlogJSONParser.js#l50
(In reply to Hal Wine [:hwine] from comment #45)
> Quick historical note -- at the time this bug was originally filed, there
> was no known way to test an actual commit. The best we could do was "time
> until rejected by pre-commit hook". Everyone agreed that wasn't "good
> enough". (We couldn't get "time to commit" without triggering builds, which
> was too expensive to do frequently.)
> 
> Time has passed & tools have improved. There's now a way to handle that, as
> noted in comment 31. (However, support for "-p none" has been dropped in
> recent commits. It now works only by coincidence.)
> 
> Time has passed & servers have been upgraded. The hgweb issue mentioned in
> comment 32 should no longer apply. That can be monitored independently from
> push time.
> 
> There remain some timing issues, as mentioned in comment 37, which may make
> the data difficult to interpret. But it's worth another try.
> 
> I think a reasonable approach from here is:
>  - monitor adhoc usage of that approach for a day or two to ensure it gives
> meaningful data (I've got this running -- ping for details)
>  - figure how to wrap this for nagios
>  - deploy
>  - file bug to get "-p none" officially supported by try-parser

fyi, work in bug#837303 to set "-p none" as a warning.
(In reply to Ed Morley [:edmorley UTC+1] from comment #49)
> (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #48)
> > I don't pretend to be an expert here, but creating a new email alias for
> > this tester and filtering it out of tbpl and buildbot seems like it
> > shouldn't require much effort.
> 
> We could either filter it out at the json-pushes level (&ignorebots=1 ?):
> https://hg.mozilla.org/mozilla-central/json-pushes?full=1&maxhours=24
> http://hg.mozilla.org/hgcustom/pushlog/file/d445a97d837d/pushlog-feed.py
> 
> Or else:
> https://hg.mozilla.org/webtools/tbpl/file/d7d9d36625fe/js/PushlogJSONParser.
> js#l50

*Or*, we can just send an email to dev.platform and not try to fix a problem which nobody has complained about.  I've never heard anybody complain about pushes not belonging to themselves.

(In reply to Hal Wine [:hwine] from comment #46)
> (In reply to Hal Wine [:hwine] from comment #45)
> > I think a reasonable approach from here is:
> >  - monitor adhoc usage of that approach for a day or two to ensure it gives
> > meaningful data (I've got this running -- ping for details)
> 
> The above monitoring has been disabled until folks sort out the requirements:
>  - commenters in this thread want "real push" timing
>  - running real pushes has the following side effects:
>    - inflates load metrics (could be resolved by changes to tbpl & buildbot)

Even if the push doesn't trigger any jobs?


Really, I'm beginning to think that comment 43 was right on, we should just empower the person who's going to implement this to make these kinds of decisions.
(In reply to John O'Duinn [:joduinn] from comment #42)
> from offline discussions about *how* to detect problems with try repo.
> 
> 1) jakem: is it possible to detect how long a lock exists for a repo on hg
> server? From email with developers this morning, they had 5 different stuck
> lock incidents on try in the last 24 hours. If lock-on-repo is possible to
> monitor, maybe it'd be a good early-warning of repo problems?

It can be done, but will take some doing. One problem is the lock files are in many different places. They're all something like "<repo>/.hg/lock" of course, but the <repo> is not at a consistent depth, and so scanning for them is extremely slow. We essentially have to find out where they all live, monitor them somehow, and then report that data out in a usable way.

The simplest solution I can think of is a cron that (presuming it knows what files to look at) simply runs "ls -l" on each file, writes the output to a file, and puts that file somewhere it becomes web-visible. This contains a timestamp and a username. The consumers are developers and admins... I think this format would be easily understood. :)

This is merely a "who has it locked right now and for how long" capability. Trending on this is harder... it won't answer the question of "how long has repo X been locked this month". That's the kind of thing we would have to think about more.
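
As a rough illustration of that cron idea (the repo list, the output path, and the lock-file details are assumptions, not a description of anything deployed):

#!/bin/bash
# Sketch: given a pre-generated list of repo paths (scanning the whole
# tree live is too slow), report each held lock and how long it has been
# held, and write the result somewhere web-visible.
out=/var/www/html/hg-locks.txt           # hypothetical web-visible location
list=/etc/hg-repo-list.txt               # hypothetical pre-generated repo list
now=$(date +%s)
: > "$out"
while read -r repo; do
    lock="$repo/.hg/lock"
    # hg lock files are usually dangling symlinks whose target names the
    # holder as host:pid (e.g. dm-svn02.mozilla.org:3534)
    if [ -L "$lock" ] || [ -e "$lock" ]; then
        age=$(( now - $(stat -c %Y "$lock") ))
        holder=$(readlink "$lock" 2>/dev/null)
        echo "$repo locked for ${age}s by ${holder:-unknown}" >> "$out"
    fi
done < "$list"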
Flags: needinfo?(nmaul)
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #51)
> > The above monitoring has been disabled until folks sort out the requirements:
> >  - commenters in this thread want "real push" timing
> >  - running real pushes has the following side effects:
> >    - inflates load metrics (could be resolved by changes to tbpl & buildbot)
> 
> Even if the push doesn't trigger any jobs?

Yes -- it counts as a commit, and that skews the ratio of jobs/commit and compute hours/commit.

Those are the longest running metrics we have, and need to be kept meaningful for comparison. We did adjust the numbers for the duration of my experiment, but that was non-trivial.

> Really, I'm beginning to think that comment 43 was right on, we should just
> empower the person who's going to implement this to make these kinds of
> decisions.

And the requirements have now been expanded to "not mess with existing metrics" and "not mess with developer views" (tbpl, etc.).
(In reply to comment #53)
> (In reply to :Ehsan Akhgari (needinfo? me!) from comment #51)
> > > The above monitoring has been disabled until folks sort out the requirements:
> > >  - commenters in this thread want "real push" timing
> > >  - running real pushes has the following side effects:
> > >    - inflates load metrics (could be resolved by changes to tbpl & buildbot)
> > 
> > Even if the push doesn't trigger any jobs?
> 
> Yes -- it counts as a commit, and that skews the ratio of jobs/commit and
> compute hours/commit.
> 
> Those are the longest running metrics we have, and need to be kept meaningful
> for comparison. We did adjust the numbers for the duration of my experiment,
> but that was non-trivial.

What are those data points used for, and who uses them?

> > Really, I'm beginning to think that comment 43 was right on, we should just
> > empower the person who's going to implement this to make these kinds of
> > decisions.
> 
> And the requirements have now been expanded to "not mess with existing metrics"
> and "not mess with developer views" (tbpl, etc.).

Well, this bug was filed for a very specific need.  I asked about the first item above, and I don't agree that these jobs not showing on the try TBPL is a requirement.  Who asked for that?
Flags: needinfo?(hwine)
comment 45 still holds -- lots of things have changed since this bug was filed. The requested monitoring may no longer provide any value.

I just ran 2 monitored pushes over the last day -- we're currently over 5K heads, with a commit time of 8 seconds. That's a 2x increase in time for a 3x increase in heads from one month ago (comment 47).

My monitoring is biased to reduce the chance of lock contention in the measurement (I first attempt a push that will be rejected for bad try syntax). I've seen outlier times of up to 66 seconds for the first, likely due to someone else committing at the same time. I don't know how hg does backoffs before retrying.

To answer Ehsan's questions for the record:
 - folks in New Zealand were the ones who complained about the tbpl noise. Sheriffs also expressed desire to have that data muted.
 - the data points are part of joduinn's load metrics that are used for making projections on equipment needs.
Flags: needinfo?(hwine)
(In reply to Hal Wine [:hwine] from comment #55)
> To answer Ehsan's questions for the record:
>  - folks in New Zealand were the ones who complained about the tbpl noise.

Can you please name specific people, so that we can talk to them and see if their objection is valid?

> Sheriffs also expressed desire to have that data muted.

Needinfo-ing Ryan and Ed.  Guys, my contention here is that very few developers, if any, look at https://tbpl.mozilla.org/?tree=Try directly, so commits coming from a monitoring bot should not disrupt their workflow.  Do you guys agree?

>  - the data points are part of joduinn's load metrics that are used for
> making projections on equipment needs.

Can we then count the number of commits made by the monitoring tool and ask John to subtract them when preparing his reports?

We need to make progress here, I'm just trying to make sure that we're only blocked by issues which are worse than the problem that we've had with the try server over the years.
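
As a very rough sketch of that idea (the monitoring account name is invented, and a real tool should parse the JSON rather than grep it), the pushes could be counted from the json-pushes feed cited in comment 49:

$ curl -s 'https://hg.mozilla.org/try/json-pushes?maxhours=24' \
      | grep -o '"user": *"[^"]*"' | grep -c 'trymonitor@mozilla.com'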
Flags: needinfo?(ryanvm)
Flags: needinfo?(joduinn)
Flags: needinfo?(emorley)
I rarely look at Try directly on TBPL. I know philor does, but I don't know at what frequency.
Flags: needinfo?(ryanvm)
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #57)
> I rarely look at Try directly on TBPL. I know philor does, but I don't know
> at what frequency.

Philor, how much do you care about https://tbpl.mozilla.org/?tree=Try?  Please see comment 56.
Flags: needinfo?(philringnalda)
I care a great deal about it, but I'm a big boy, I can look right past a push every 30 minutes that doesn't even trigger anything.
Flags: needinfo?(philringnalda)
Summarizing a quick chat on IRC with :ehsan -- I pinged him to see if there was a new urgency to resolving this (based on the recent increase in activity on this bug). There is no new urgency.

 - There are no current problems with try push times being reported by devs.

 - We're all interested in avoiding being hit by another "too many heads causing problems" issue.

 - While there are technical solutions to all the collateral damage caused by monitoring, that will take some time and effort.

 - We're all interested in getting the long term fixes implemented ASAP (e.g. moving off NFS).

We're all still in the middle of summer PTO season, so let me propose the following as the actions for now:

 1. I'll continue to pull numbers once per month (comment 45, comment 55). The trend so far is encouraging that we'll be able to spot issues well in advance.

 2. I'll pull numbers if there are any new reports, and we can act as needed to reset the try repository.

My hope is that the recent server upgrade, and the subsequent reset, have cleared out a bunch of the issues. (The upgrade was supposed to help with some of these issues.) That should buy us enough time to complete the longer-term back-end upgrades. And that, we all hope, will eliminate this problem.
(In reply to comment #60)
> Summarizing a quick chat on IRC with :ehsan -- I pinged him to see if there was
> a new urgency to resolving this (based on the recent increase in activity on
> this bug). There is no new urgency.

But there is old urgency here!  :-)

>  - There are no current problems with try push times being reported by devs.
> 
>  - We're all interested in avoiding being hit by another "too many heads
> causing problems" issue.
> 
>  - While there are technical solutions to all the collateral damage caused by
> monitoring, that will take some time and effort.

This is the core of what I'm trying to get at.  Philor said that he can live with those pushes.  I'm really interested in talking about other *concrete* problems we have here and involving the people who think those are problems to see if we can find easy solutions to them.  I'm convinced that the answer is yes, since none of the problems that you mentioned seem difficult to fix.

That said, talking about these things in the abstract will probably not get us anywhere.

>  - We're all interested in getting the long term fixes implemented ASAP (e.g.
> moving off NFS).
> 
> We're all still in the middle of summer PTO season, so let me propose the
> following as the actions for now:
> 
>  1. I'll continue to pull numbers once per month (comment 45, comment 55). The
> trend so far is encouraging that we'll be able to spot issues well in advance.
> 
>  2. I'll pull numbers if there are any new reports, and we can act as needed to
> reset the try repository.

I think you mentioned on IRC that the people who would be able to work on this are all working on the NFS move.  If that's the case, then I think the above is pretty much the only thing that we can do, and this discussion doesn't really change things.  But I still don't see why we can't try to reach consensus on solutions to other problems here in the mean time.
The spurious try pushes appearing in TBPL and hgweb are annoying, but not so much so that I'd stand in the way of that method of monitoring.
Flags: needinfo?(emorley)
Reading this bug again. Half to clear my (probably unneeded at this point) needinfo, and half to suggest this:

I believe that creating new pushes (and heads) on the try repository every 30 minutes is likely to create greater load problems for us in the future. A way to alleviate this is to create a new hook that ultimately rejects the push at the end, although this would need to run before insertion into the pushlog; a sketch of that idea is below.
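
A sketch of what such a rejection hook could look like, purely as an illustration (the sentinel string and hook name are invented, and whether this actually runs before the pushlog hook would need to be verified):

[hooks]
# Reject any push whose first incoming changeset carries a sentinel string
# used only by the monitoring script; a failing pretxnchangegroup hook rolls
# the transaction back, so nothing lands and nothing should reach the pushlog.
# The "a_" prefix mirrors the existing c_try_mandatory naming convention.
pretxnchangegroup.a_reject_monitor = ! hg log -r "$HG_NODE" --template "{desc}\n" | grep -q PUSH-TIMING-SENTINEL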

Would a nagios check to verify repository lock age be sufficient for most of these problems? We haven't had one of these locking issues (afaik) for several months.
Flags: needinfo?(bkero)
Blocks: try-tracker
Updating status:
 - in general try is much improved from when this bug was opened
 - we still need this monitoring data
 - current try issues should be in separate bugs blocking bug 1017724 (current tracker)
 - cancelling existing ni's as being no longer relevant

I'll be attaching a new proposed patch which monitors ssh connections to the hg.mozilla.org servers. That avoids all the "collateral damage" associated with the previous approaches.
Flags: needinfo?(john+bugzilla)
Attached file pash_wrapper (obsolete) —
Wrapper script to be deployed as a replacement for the sshd forced command on the hgssh* machines. It simply logs connection information and invokes the existing forced command, 'pash.py'.

This has been tested on hgssh2 for clones & updates:
 - log messages appear as expected (see below)
 - no problem log messages appear in /var/log/messages
 - no apparent difference in time to clone or pull a repo (used mozilla-central)
 - write (push) testing not done, but hg command is identical

The script provides a pair of entries for each ssh connection to the host, along with the final exit code from pash.py.

Example of successful hg operation:
 Aug  3 16:16:31 hgssh2 pash_wrapper-29666: start user: 'hwine@mozilla.com; ssh_command: 'hg -R build/tools serve --stdio'; connection: '10.22.248.10 56572 10.22.74.29 22'
 Aug  3 16:16:32 hgssh2 pash_wrapper-29666: end user: 'hwine@mozilla.com; ssh_command: 'hg -R build/tools serve --stdio'; connection: '10.22.248.10 56572 10.22.74.29 22'; ec: '0'

Example of failed hg operation:
 Aug  3 16:25:07 hgssh2 pash_wrapper-32150: start user: 'hwine@mozilla.com; ssh_command: 'hg -R build/xxxtools serve --stdio'; connection: '10.22.248.10 56833 10.22.74.29 22'
 Aug  3 16:25:07 hgssh2 pash_wrapper-32150: end user: 'hwine@mozilla.com; ssh_command: 'hg -R build/xxxtools serve --stdio'; connection: '10.22.248.10 56833 10.22.74.29 22'; ec: '255'
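
The attachment itself isn't inlined here; based on the log format above and the interdiff later in this bug, a rough reconstruction of the wrapper (treat it as a sketch -- in particular, which environment variables carry the account name and connection info is assumed, not confirmed) is:

#!/bin/bash

# minimal wrapper to log each connection, then hand off to pash.py

facility=local3             # review later switched this to local2
level=INFO
tag="${0##*/}-$$"           # script-PID (review later asked for script[PID])

prog_dir=$(cd $(dirname "$0");/bin/pwd)

# Assumed: the LDAP account is the login user, and sshd sets
# SSH_ORIGINAL_COMMAND / SSH_CONNECTION for the forced command.
# Format matches the examples above (including the unbalanced quote after the user).
msg="user: '$USER; ssh_command: '$SSH_ORIGINAL_COMMAND'; connection: '$SSH_CONNECTION'"

logger -p "$facility.$level" -t "$tag" "start $msg"
"$prog_dir/pash.py"
ec=$?
logger -p "$facility.$level" -t "$tag" "end $msg; ec: '$ec'"
exit $ec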
Assignee: server-ops-devservices → hwine
Attachment #8466871 - Flags: review?(klibby)
Comment on attachment 8466871 [details]
pash_wrapper

> facility=local3

local[345] are in use in some form (or at least configured in rsyslog.conf). For whatever reason, local[23567] are excluded from /var/log/messages, so if we want logs there, local1 is the way to go.

> tag="${0##*/}-$$"   # script-PID

would prefer tag="${0##*/}[$$]" for script[PID]

> prog_dir=$(cd $(dirname "$0");/bin/pwd)

seems cumbersome given that ForceCommand will always be the full path, but otherwise ok.
Attachment #8466871 - Flags: review?(klibby) → review+
Attached file pash_wrapper
(In reply to Kendall Libby [:fubar] from comment #66)
> Comment on attachment 8466871 [details]
> pash_wrapper
> 
> > facility=local3
> 
> local[345] are in use in some form (or at least configured in rsyslog.conf).
> for whatever reason, local[23567] are excluded from /var/log/messages, so if
> we want logs there, local1 is the way to go. 

Changed to local2 -- this may need fine tuning, and collection from a separate log file is easier for me. 

> would prefer tag="${0##*/}[$$]" for script[PID]

done

> 
> > prog_dir=$(cd $(dirname "$0");/bin/pwd)
> 
> seems cumbersome given that ForceCommand will always be the full path, but
> otherwise ok.

Well, doing it this way leaves the only hard-coded location in the sshd_config file. The only requirement is that both this wrapper and pash.py be deployed in the same directory.


Interdiff:
--- pash_wrapper.old    2014-08-06 11:35:08.000000000 -0700
+++ pash_wrapper        2014-08-06 11:34:40.000000000 -0700
@@ -3,9 +3,9 @@
 
 # minimal wrapper to log
 
-facility=local3
+facility=local2
 level=INFO
-tag="${0##*/}-$$"   # script-PID
+tag="${0##*/}[$$]"   # script[PID]
 
 prog_dir=$(cd $(dirname "$0");/bin/pwd)
Attachment #8466871 - Attachment is obsolete: true
Attachment #8468645 - Flags: review+
Deployed in 91682.
Component: Server Operations: Developer Services → Mercurial: hg.mozilla.org
Product: mozilla.org → Developer Services
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/52]
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/52] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1346] [kanban:engops:https://kanbanize.com/ctrl_board/6/52]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1346] [kanban:engops:https://kanbanize.com/ctrl_board/6/52] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1349] [kanban:engops:https://kanbanize.com/ctrl_board/6/52]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1349] [kanban:engops:https://kanbanize.com/ctrl_board/6/52] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1350] [kanban:engops:https://kanbanize.com/ctrl_board/6/52]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1350] [kanban:engops:https://kanbanize.com/ctrl_board/6/52] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1355] [kanban:engops:https://kanbanize.com/ctrl_board/6/52]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1355] [kanban:engops:https://kanbanize.com/ctrl_board/6/52] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1355]
done as of comment 68
Status: REOPENED → RESOLVED
Closed: 12 years ago → 9 years ago
Resolution: --- → FIXED