Closed
Bug 1415233
Opened 8 years ago
Closed 7 years ago
[merge-day] Can't push to repos on the first try: pretxnclose.vcsreplicator hook failed
Categories
(Developer Services :: Mercurial: hg.mozilla.org, defect)
Developer Services
Mercurial: hg.mozilla.org
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jlorenzo, Assigned: sheehan)
References
Details
(Whiteboard: [releaseduty])
Attachments
(2 files)
From bug 1399916 comment 1, bug 1399916 comment 5, and bug 1399916 comment 6, this error occurs each time somebody merges one branch to another:
> 18:19:35 INFO - pushing to ssh://hg.mozilla.org/releases/mozilla-release
> 18:19:35 INFO - searching for changes
> 18:19:35 INFO - remote: adding changesets
> 18:19:35 INFO - remote: adding manifests
> 18:19:35 INFO - remote: adding file changes
> 18:19:35 INFO - remote: added 9557 changesets with 95413 changes to 61664 files
> 18:19:35 INFO - remote: (prevent_symlinks check disabled per config override)
> 18:19:35 INFO - remote: recorded push in pushlog
> 18:19:35 INFO - remote: replication log not available; cannot close transaction
> 18:19:35 INFO - remote: transaction abort!
> 18:19:35 INFO - remote: rolling back pushlog
> 18:19:35 INFO - remote: rollback completed
> 18:19:35 INFO - remote: pretxnclose.vcsreplicator hook failed
> 18:19:35 ERROR - abort: push failed on remote
> 18:19:35 ERROR - Automation Error: hg not responding
> 18:19:36 ERROR - Return code: 255
> 18:19:36 FATAL - Push failed! If there was a push race, try rerunning
This can be worked around by pushing another time. However, such pushes take about 20 minutes to be processed. This makes use wait for 40 minutes until a push is done.
Comment 1•8 years ago
|
||
I thought we had a bug on this general issue (large pushes cause vcsreplicator to fail). But I can't find it at the moment. It is a known issue.
Comment 2•7 years ago
|
||
from irc:
15:12:19 < jlorenzo> fubar: hi! I'm trying to stage a merge day (which will happen next week)
and I can perma repro
https://bugzilla.mozilla.org/show_bug.cgi?id=1415233 . I tried about 5
times. That blocks the testing of the new esr60 config. Would you have
something on the server side logs?
15:12:21 < firebot> Bug 1415233 -- NEW, nobody@mozilla.org -- [merge-day] Can't push to repos
on the first try: pretxnclose.vcsreplicator hook failed
15:12:36 < jlorenzo> here are the logs I have https://irccloud.mozilla.com/pastebin/5F7xYK6e/
On hgssh4:
2018-04-17T19:07:10.153537+00:00 hgssh4.dmz.scl3.mozilla.com vcsreplicator: jlorenzo@mozilla.com {moz}/projects/jamun PRETXNCLOSE_HEARTBEATSENDING
2018-04-17T19:07:10.155667+00:00 hgssh4.dmz.scl3.mozilla.com vcsreplicator: jlorenzo@mozilla.com {moz}/projects/jamun EXCEPTION Traceback (most recent call last):#012 File "/var/hg/version-control-tools/pylib/vcsreplicator/vcsreplicator/hgext.py", line 163, in pretxnclosehook#012 repo.replicationpartition)#012 File "/var/hg/version-control-tools/pylib/vcsreplicator/vcsreplicator/producer.py", line 56, in send_heartbeat#012 }, partition=partition)#012 File "/var/hg/version-control-tools/pylib/vcsreplicator/vcsreplicator/producer.py", line 49, in send_message#012 self.topic, partition, msg)#012 File "/var/hg/venv_pash/lib/python2.7/site-packages/kafka/producer/base.py", line 349, in send_messages#012 return self._send_messages(topic, partition, *msg)#012 File "/var/hg/venv_pash/lib/python2.7/site-packages/kafka/producer/base.py", line 390, in _send_messages#012 fail_on_error=self.sync_fail_on_error#012 File "/var/hg/venv_pash/lib/python2.7/site-packages/kafka/client.py", line 480, in send_produce_request#012 (not fail_on_error or not self._raise_on_response_error(resp))]#012 File "/var/hg/venv_pash/lib/python2.7/site-packages/kafka/client.py", line 247, in _raise_on_response_error#012 raise resp#012FailedPayloadsError
Comment hidden (typo) |
Reporter | ||
Comment 4•7 years ago
|
||
For reference, this blocks the testing of ESR60. Merge day is planned to happen Thursday of next week. Releng would love to run several staging releases beforehand.
Reporter | ||
Comment 5•7 years ago
|
||
Workaround given by gps: send smaller pushes thanks to `hg push -r ${some_old_enough_rev}`. That worked. Data points:
* 22299 changesets with 211455 changes to 90002 files (+1 heads) => perma fail (5/5 times)
* 13585 changesets with 125540 changes to 68303 files (+1 heads) => failed (tried once)
* 4397 changesets with 28479 changes to 16071 files (+1 heads) => passed on first try (right after previous push @ 13585 changesets).
Severity: blocker → normal
Comment 6•7 years ago
|
||
Large pushes with many heads (for example, 1000 heads) send very large
messages to Kafka. When Kafka reads these messages, the default buffer
size for the consumer is 4K. If the message doesn't fit in the buffer,
kafka-python will repeatedly double the buffer size until the message can
be read from the buffer, or the max buffer size is reached. This procedure
takes time, and during that time the process times out and the push fails.
This commit increases the default buffer size of the consumer to 1MB so
large pushes avoid this time consuming procedure.
Comment 7•7 years ago
|
||
Comment on attachment 8971418 [details]
vcsreplicator: set default buffer size to 1MB (Bug 1415233) r?gps
Gregory Szorc [:gps] has approved the revision.
https://phabricator.services.mozilla.com/D1051
Attachment #8971418 -
Flags: review+
Pushed by cosheehan@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/105472c1c29e
vcsreplicator: set default buffer size to 1MB r=gps
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 9•7 years ago
|
||
The push from comment 8 was created as a possible fix for this issue but we don't believe it is truly the root cause. Re-opening.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 10•7 years ago
|
||
I think the root cause is the client being disconnected after a 10 minute timeout. IIRC the timeout is defined in the Kafka server. This is also arguably a bug in the Kafka client not reconnecting automatically.
And the reason why we have a 10 minute delay between Kafka activity leading to that timeout is slow hooks. See bug 1457342 and bug 1460722.
We should make all client code more resilient to disconnects and/or increase the timeout threshold.
Assignee | ||
Comment 11•7 years ago
|
||
A notable event regarding comment 10 - jlorenzo came to #vcs to request the `prevent_symlinks` hook be turned off so he could push to birch. After turning off the symlinks hook, he ran into this issue. The symlinks hook may not be the primary culprit here, however if the overall hook run time is the issue it may be a contributing factor. We will need to do some more digging.
IRC logs from our conversation: https://mozilla.logbot.info/vcs/20180814#c15164712
Assignee | ||
Comment 12•7 years ago
|
||
This commit makes the vcsreplicator Producer retry message
sending on error. We are currently running into an issue
on merge-days and during other large pushes to hg.mozilla.org,
where the push will fail during the vcsreplicator.pretxnclose
hook. This hook does nothing except send a heartbeat to Kafka
after all other message processing, to ensure the replication
log is writable. Since the hook is so simple, we believe the
Kafka connection is timing out while waiting for processing the
large push to complete and therefore the heartbeat sent over
the dead connection fails. This commit makes the message sending
procedure re-initialize the client on a failure and try to
send the message again.
Comment 13•7 years ago
|
||
Comment on attachment 9005653 [details]
vcsreplicator: retry message sending on error (Bug 1415233) r?gps
Gregory Szorc [:gps] has approved the revision.
Attachment #9005653 -
Flags: review+
Comment 14•7 years ago
|
||
Pushed by cosheehan@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/687e1b127694
vcsreplicator: retry message sending on error r=gps
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 15•7 years ago
|
||
Hopefully once we deploy the patch from comment 14 this problem will go away. We should leave this bug open in the meantime.
Assignee | ||
Comment 16•7 years ago
|
||
The fix from comment 14 has been deployed, and I've tested it with some promising results. I made an empty user repo at https://hg.mozilla.org/users/cosheehan_mozilla.com/unified and made a large push from mozilla-unified with the following output:
$ hg push user -r 30000
pushing to ssh://hg.mozilla.org/users/cosheehan_mozilla.com/unified
searching for changes
remote: adding changesets
remote: adding manifests
remote: adding file changes
remote: added 29981 changesets with 146198 changes to 48016 files
remote: recorded push in pushlog
remote:
remote: View the pushlog for these changes here:
remote: https://hg.mozilla.org/users/cosheehan_mozilla.com/unified/pushloghtml?changeset=fe1f520ff772967e481ab11958095eaa14ce7a9d
remote: recorded changegroup in replication log in 0.132s
The push took 24m to complete on the first try, so if the 10 minute timeout was the problem as suggested in comment 10 then this is a fairly promising result!
Reporter | ||
Comment 17•7 years ago
|
||
That looks awesome, Connor :D I've never been about to push 30K changeset since this issue was first filed. Next merge day is either on Friday October 12th or on Monday the 15th. Let's see how this goes!
Comment 18•7 years ago
|
||
The large push on the last merge day "just worked" and we didn't have to disable the hook as a workaround. So calling this one resolved.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Updated•7 years ago
|
Keywords: leave-open
Assignee | ||
Updated•7 years ago
|
Assignee: nobody → sheehan
You need to log in
before you can comment on or make changes to this bug.
Description
•