Closed Bug 994028 Opened 10 years ago Closed 9 years ago

Timeouts on try while searching for changes ("remote: abort: repository /repo/hg/mozilla/try: timed out waiting for lock held by hgssh1.dmz.scl3.mozilla.com:NNNNN")

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

defect
Not set
major

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: cbook, Assigned: fubar)

References

(Blocks 1 open bug)

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1024] see comment #11 if you run into this problem)

Attachments

(1 file)

A couple of devs complained that pushing to try results in "searching for changes" for more than 5 minutes and then times out.

Not sure what the issue is, but this seems to affect several people in different locations.
> $ try
> pushing to ssh://hg.mozilla.org/try
> searching for changes
> ^Cinterrupted!
> remote: waiting for lock on repository /repo/hg/mozilla/try held by 'hgssh1.dmz.scl3.mozilla.com:1850'
$ hg -v push -f ssh://jseward@mozilla.com@hg.mozilla.org/try/
pushing to ssh://jseward%40mozilla.com@hg.mozilla.org/try/
running ssh jseward@mozilla.com@hg.mozilla.org 'hg -R try/ serve --stdio'
searching for changes
1 changesets found
^Cinterrupted!
remote: waiting for lock on repository /repo/hg/mozilla/try/ held by 'hgssh1.dmz.scl3.mozilla.com:25220'
remote: Killed by signal 2.
Duplicate of: 994647
Seems to be working again. At least for me.
It has been working on and off for a couple of days now, but it's down for a good couple of hours when it fails. Occasionally a push does go through, but it is really slow (10-15 minutes).

I keep getting:
remote: waiting for lock on repository /repo/hg/mozilla/try held by 'hgssh1.dmz.scl3.mozilla.com:25220'

So the same server for everyone in this bug.
No longer blocks: 821809
Hardware: x86 → All
Summary: timeouts on try while searching for changes → Timeouts on try while searching for changes ("remote: abort: repository /repo/hg/mozilla/try: timed out waiting for lock held by hgssh1.dmz.scl3.mozilla.com:NNNNN")
This is happening consistently for me. Is there a workaround we can do locally?
Raising severity since devs are again complaining about this issue and it is blocking them from using try.
Severity: normal → major
Taking until I can find someone from webops to work on this.
Assignee: server-ops-webops → rwatson
Assignee: rwatson → klibby
Try is currently up to 7700 heads, which is around when things start to get ugly. OTOH, we've managed to get up to over 21,000...
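
For reference, counting heads on a local clone of try is a one-liner; this is only a sketch for anyone curious, not part of any official process, and it assumes you have an up-to-date clone:

$ hg heads --template '{node|short}\n' | wc -l   # prints one short hash per head, then counts them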

It would be helpful to know what the parent revs are to changesets that are failing. The older they are, the longer it takes for hg to process, which can cause things to timeout and fall over.
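
One way to gather that locally, sketched here and assuming the standard try push URL (note that the outgoing() revset does its own discovery against the remote, so it can stall for the same reason the push does):

$ hg log -r 'parents(outgoing("ssh://hg.mozilla.org/try"))' --template '{node|short} {date|shortdate}\n'   # parents of the csets you would push, with their dates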

I'd like to also note that we had planned to do regular resets of try during the 6-week tree closing windows, but were told by "devs" that doing so was painful and to not reset try. And yet, here we are, all unhappy. :-(
Note: 5 minutes (comment 1) is not considered excessive. The guidelines at https://wiki.mozilla.org/ReleaseEngineering/TryServer#Pushes_to_try_take_a_very_long_time suggest that 15 minutes is about when things are "bad".

Please confirm you are requesting a reset of try -- that process is disruptive to other devs.
Whiteboard: see comment #11 if you run into this problem
In comment 1 and 2, it seems like people are being blocked by a lock which was not properly released.  That is a different issue than the traditional "push to try takes long because too many heads" (the distinction being the push taking long versus timing out).  Which problem do you see, Joel?

(FWIW I have not noticed pushing to try to take an unusually long amount of time as of Sunday.)
Flags: needinfo?(jmaher)
When I was trying to push last week, I was unable to push to try at all for about 4 hours; then it started working.  While I couldn't push, I did see new pushes to try.  My case was a timeout on April 16th.
Flags: needinfo?(jmaher)
(In reply to comment #14)
> When I was trying to push last week, I was unable to push to try at all for
> about 4 hours; then it started working.  While I couldn't push, I did see new
> pushes to try.  My case was a timeout on April 16th.

Yeah, seems like this was not the usual problem of pushes to try being slow because of the number of heads then.  Thanks!
This has struck again; today I got the following twice:

pushing to ssh://hg.mozilla.org/try
searching for changes
remote: waiting for lock on repository /repo/hg/mozilla/try held by 'hgssh1.dmz.scl3.mozilla.com:15332'
remote: abort: repository /repo/hg/mozilla/try: timed out waiting for lock held by hgssh1.dmz.scl3.mozilla.com:15332
I'm unable to push, no matter what I do. It gets stuck on "searching for changes", and after ~10 minutes it times out.
Same logs for the past 2 hours (I tried to push many times):

pushing to ssh://hg.mozilla.org/try
searching for changes
remote: waiting for lock on repository /repo/hg/mozilla/try held by 'hgssh1.dmz.scl3.mozilla.com:28056'
remote: abort: repository /repo/hg/mozilla/try: timed out waiting for lock held by hgssh1.dmz.scl3.mozilla.com:28056
abort: unexpected response: empty string
I have just been hit by this:

(marionette)☁  marionette  hg try
pushing to ssh://hg.mozilla.org/try
searching for changes
remote: waiting for lock on repository /repo/hg/mozilla/try held by 'hgssh1.dmz.scl3.mozilla.com:18064'
remote: abort: repository /repo/hg/mozilla/try: timed out waiting for lock held by hgssh1.dmz.scl3.mozilla.com:28994
abort: unexpected response: empty string
:hwine asked me on IRC to post this. It is a try push with the --debug option from hg. I actually only have 1 changeset, so I'm not sure where the other 38 came from.
(The other 38 are just csets that you've pulled from upstream [m-c or m-i or wherever] & you're the first person to push those csets to try. That's normal.)
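
If you want to see which changesets a push would send before actually pushing, something like the following works; it is only a sketch, and its discovery step can be slow while the repo is locked:

$ hg outgoing -r . ssh://hg.mozilla.org/try --template '{node|short} {desc|firstline}\n'   # lists every cset that would go along with your current head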

There have been no successful pushes to Try by anyone since 11:55 PDT today (so, approaching 4 hours). hwine, not sure if you're actively looking into this -- if you are, awesome. If not, do you know who should look at it & how we can escalate it?
Flags: needinfo?(hwine)
It unhorked itself, after four hours and ten minutes.

If you don't want to do this again, there are two things to do:

Right now, reply to the thread in https://groups.google.com/d/msg/mozilla.dev.platform/Hb2EKXZmY70/Ijzo3Jo2WxcJ saying that you disagree that it is better to let try get to the point of completely sucking and then do an emergency reset (and that both of those things are better than just sitting around for four hours and ten minutes).

Next time it happens: https://wiki.mozilla.org/ReleaseEngineering/TryServer#Pushes_to_try_take_a_very_long_time says to file a bug if you are unable to push for more than 15 minutes (and implies that your timeouts should keep coming up with the same PID). Judging by the outcome of bug 1001735, where the eventual response, five days later and at a time when it was known by all that nobody had been able to push for three hours, was to ask the reporter whether or not he personally was still unable to push, that means *everyone* should file *their own* bug asking for a reset.
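
A trivial way to keep track of that PID across attempts (a sketch; the log file name is made up for illustration) is to tee your push output to a file and grep it afterwards:

$ hg push ssh://hg.mozilla.org/try 2>&1 | tee -a try-push.log
$ grep 'held by' try-push.log   # if the same hgssh1 PID shows up on every attempt, the lock is stuck
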
(In reply to Phil Ringnalda (:philor) from comment #22)
> It unhorked itself, after four hours and ten minutes.
> 
> If you don't want to do this again, there are two things to do:
> 
> Right now, reply to the thread in
> https://groups.google.com/d/msg/mozilla.dev.platform/Hb2EKXZmY70/
> Ijzo3Jo2WxcJ 

Excellent suggestion. The major impact of resets is on developers. We need their help in coming up with criteria to make the call of "enough pain - reset!".


> Next time it happens, Judging by the outcome of bug 1001735,
> *everyone* file *their own* bug asking for a reset.

Please don't -- what we need is a process that works for all devs. The dev.platform thread is the right place to do this.
Flags: needinfo?(hwine)
Blocks: 770811
Blocks: try-tracker
No longer blocks: 770811
Component: WebOps: Source Control → Mercurial: hg.mozilla.org
Product: Infrastructure & Operations → Developer Services
Whiteboard: see comment #11 if you run into this problem → [kanban:engops:https://kanbanize.com/ctrl_board/6/143] see comment #11 if you run into this problem
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/143] see comment #11 if you run into this problem → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1014] [kanban:engops:https://kanbanize.com/ctrl_board/6/143] see comment #11 if you run into this problem
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1014] [kanban:engops:https://kanbanize.com/ctrl_board/6/143] see comment #11 if you run into this problem [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1023] [kanban:engops:https://kanb… → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1023] [kanban:engops:https://kanbanize.com/ctrl_board/6/143] see comment #11 if you run into this problem [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1024] [kanban:engops:https://kanb…
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1024] [kanban:engops:https://kanbanize.com/ctrl_board/6/143] see comment #11 if you run into this problem → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1024] see comment #11 if you run into this problem
Don't see much value in keeping this bug open.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME