Closed Bug 1769750 Opened 3 years ago Closed 10 months ago

Very slow to create Try push

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aryx, Assigned: sheehan)

References

(Depends on 1 open bug)

Details

There have been reports of very slow Try pushes which take several minutes to create (e.g. by padenot and bszekely). bszekely mentioned they have been slower for him for the last 3-4 days and I observed one occurrence today in a video session and it might have taken 5 minutes to create the Try push (we continued with other work).

This happens for me:

remote: error committing pushlog transaction on attempt 1; retrying

The lock is usually taken after a small number of seconds (e.g. 10-12), but I see waits of a good 30s or more after the lock is taken, and before (I think) seeing this line in my terminal.

Taking the lock takes quite long compared to usual, but that's probably a consequence of the problem mentioned above during concurrent pushes (which is bound to happen all the time, because reviewbot constantly makes pushes these days).

Then this makes a long queue of clients waiting on the hg lock, and we find ourselves in the situation we're in.
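The failure mode described in the comments above (a pushlog commit that fails and is retried while the repository lock is still held, lengthening the queue behind it) can be sketched roughly as follows. This is an illustrative model only, not the actual hook code: `commit_with_retries`, `TransientCommitError`, and the delay values are all invented for the example.

```python
import random
import time


class TransientCommitError(Exception):
    """Stands in for whatever transient failure the pushlog hook retries on."""


def commit_with_retries(commit, max_attempts=3, base_delay=1.0):
    """Retry a commit callable, as the 'attempt N; retrying' messages in the
    logs suggest the pushlog hook does. Illustrative names, not the real API."""
    for attempt in range(1, max_attempts + 1):
        try:
            return commit()
        except TransientCommitError:
            if attempt == max_attempts:
                # out of retries: the caller rolls back the pushlog
                raise
            print(f"error committing pushlog transaction on attempt {attempt}; retrying")
            # Back off while still holding the repo lock -- every retry
            # directly extends the time concurrent pushers spend queued
            # on that lock, which matches the pile-ups reported here.
            time.sleep(base_delay * attempt + random.random() * 0.1)
```

The key point the sketch makes is that the retries happen inside the lock hold, so a few failed attempts multiply the wait for every queued client.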

https://bugzilla.mozilla.org/show_bug.cgi?id=1735633 fits the timeline somewhat, and the error message is suspiciously close to this fix.

Connor, does the above make sense to you?

Flags: needinfo?(sheehan)

I just had to wait ~10 minutes just to get the lock, and then another minute or two to finish the try push:

remote: got lock after 589 seconds

I don't know why there is not more activity on this bug; I have been hitting this almost daily for several months already. As recently as just now:

$ BUILD_DEBUG=1 ./mach try auto
Bundling 9 changesets
warning: unable to access '/root/etc/gitattributes': Permission denied
Bundling 10 changesets
Bundling 10 manifests
Bundling 49 revisions of 37 files
remote: adding changesets
remote: adding manifests
remote: adding file changes
remote: recorded push in pushlog
remote: added 2 changesets with 6 changes to 37 files (+1 heads)
remote:
remote: View your changes here:
remote:   https://hg.mozilla.org/try/rev/aa72d51c94dfa31f913b080beb4ae1567e7da625
remote:   https://hg.mozilla.org/try/rev/0fb7f91cbe1d82fb6ba81f560d522d911d747753
remote:
remote: Follow the progress of your build on Treeherder:
remote:   https://treeherder.mozilla.org/jobs?repo=try&revision=0fb7f91cbe1d82fb6ba81f560d522d911d747753
remote: recorded changegroup in replication log in 0.016s
To hg::ssh://hg.mozilla.org/try
 * [new branch]                  HEAD -> branches/default/tip

There is no mention of lock time, but it stayed blocked on "remote: recorded push in pushlog" for a good 5-7 minutes.

Try pushes are slow for quite a few reasons. The main reason right now is the Kafka cluster we use to replicate hg changesets from the master push server to the hgweb mirrors. There are often long waits in the logs just to send "heartbeat" messages to Kafka while making try pushes. These long waits show up in the lock-acquisition timing, since the lock holder must wait for Kafka.
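A back-of-the-envelope model shows why per-push Kafka waits compound for queued clients. This is a toy calculation invented for illustration, not server code; the function name and the numbers are assumptions.

```python
def queued_lock_wait(position_in_queue, local_work_s, kafka_wait_s):
    """Toy model: each push holds the repo lock for its local work plus
    the time spent waiting on Kafka replication messages, and the k-th
    queued client must wait out every lock hold ahead of it."""
    hold_time = local_work_s + kafka_wait_s
    return position_in_queue * hold_time


# If the push itself takes ~5 s but Kafka waits add ~30 s per push,
# the 10th client in line waits ~350 s just to acquire the lock --
# the same order of magnitude as the multi-minute waits reported here.
print(queued_lock_wait(10, 5, 30))
```

The point is that the Kafka wait is paid inside the critical section, so it multiplies by queue depth rather than adding a constant.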

I don't think backing out the fix from bug 1735633 is a good idea. I doubt it is causing much slowdown, but if it is, it's a tradeoff for not having production outages when pushlog transactions break.

Flags: needinfo?(sheehan)

Is there an ETA for pruning the heads? This got mentioned again today in chat.

Flags: needinfo?(sheehan)

FYI, I got this today:

./mach try fuzzy --preset devtools-linux
Artifact builds enabled, pass --no-artifact to disable
Task configuration changed, generating target task set
estimates: Runs 86 tasks (73 selected, 13 dependencies)
estimates: Total task duration 1 day, 3:35:14
estimates: In the top 40% of durations
estimates: Should take about 1:25:33 (Finished around 2023-02-23 05:48)
Bundling 5 changesets
Bundling 5 manifests
Bundling 40 revisions of 37 files
remote: waiting for lock on working directory of /repo/hg/mozilla/try held by process '4515' on host 'hgssh1.dmz.mdc1.mozilla.com/effffffc'
remote: abort: working directory of /repo/hg/mozilla/try: timed out waiting for lock held by 'hgssh1.dmz.mdc1.mozilla.com/effffffc:5233'
fatal: called `Result::unwrap()` on an `Err` value: Error { kind: UnexpectedEof, message: "failed to fill whole buffer" }

Re-running the same command a few minutes later worked without waiting for any lock.

(In reply to Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout) from comment #7)

Is there an ETA for pruning the heads? This got mentioned again today in chat.

I've pruned half of them now, the remainder should be pruned by the start of next week.

Flags: needinfo?(sheehan)

I just got the following error:

Task configuration changed, generating target task set
estimates: Runs 86 tasks (73 selected, 13 dependencies)
estimates: Total task duration 1 day, 3:04:26
estimates: In the top 42% of durations
estimates: Should take about 1:28:01 (Finished around 2023-03-07 03:11)
Bundling 5 changesets
Bundling 5 manifests
Bundling 40 revisions of 37 files
remote: adding changesets
remote: adding manifests
remote: adding file changes
remote: recorded push in pushlog
remote: error committing pushlog transaction on attempt 1; retrying
remote: error committing pushlog transaction on attempt 2; retrying
remote: error committing pushlog transaction on attempt 3; retrying
remote: transaction abort!
remote: rolling back pushlog
remote: rollback completed
ERROR could not complete push due to pushlog operational errors; please retry, and file a bug if the issue persists

Is it related to this bug?
When this error happens (only once so far), it takes a little while to fail and display the "error committing pushlog transaction" message.
I retried a second time after this error and it worked.

I'm getting new error messages; this is the second time I've hit this this week:

estimates: Runs 86 tasks (73 selected, 13 dependencies)
estimates: Total task duration 1 day, 2:55:17
estimates: In the top 41% of durations
estimates: Should take about 1:30:32 (Finished around 2023-03-14 09:05)
Bundling 5 changesets
Bundling 5 manifests
Bundling 119 revisions of 77 files
remote: waiting for lock on working directory of /repo/hg/mozilla/try held by process '521' on host 'hgssh1.dmz.mdc1.mozilla.com/effffffc'
remote: abort: working directory of /repo/hg/mozilla/try: timed out waiting for lock held by 'hgssh1.dmz.mdc1.mozilla.com/effffffc:1563'
Summary: very slow to create Try push for the last few days → Very slow to create Try push

(In reply to Paul Adenot (:padenot) from comment #1)

Taking the lock is quite long compared to usual, but it's probably a consequence of the problem mentioned above, during concurrent pushes (which is bound to happen all the time because reviewbot constantly make pushes these days).

With apologies for the fly-by, I wonder if we should consider moving reviewbot pushes to a separate repository to reduce the chance of developers hitting races? (This assumes the slowdown is per-repo, and not a result of the overall load on hg.mozilla.org.)

Is there any bug, jira ticket to track the work involved to address this?

This slowness is still happening, now for months, and is a small productivity loss for probably most engineers contributing to mozilla-central.

(In reply to Alexandre Poirot [:ochameau] from comment #13)

Is there any bug, jira ticket to track the work involved to address this?

This slowness is still happening, now for months, and is a small productivity loss for probably most engineers contributing to mozilla-central.

You can follow bug 1835958 for updates on moving try submission to Lando, which will make submissions to try substantially faster.

Depends on: lando-try-push
Duplicate of this bug: 1838953
Component: Mercurial: hg.mozilla.org → Try
Product: Developer Services → Developer Infrastructure

Sorry Mathew, but the component was correct. It is mostly an hg issue.

Component: Try → Mercurial: hg.mozilla.org
Product: Developer Infrastructure → Developer Services

The recent hg upgrade has made this a lot faster.

I have upgraded the storage format of the try repository on the push server, which should provide an improvement in push performance.

Depends on: 1877809

Calling this one fixed: pushing to try via Lando addressed the core concern.

Status: NEW → RESOLVED
Closed: 10 months ago
Resolution: --- → FIXED
Assignee: nobody → sheehan