Closed Bug 1415233 Opened 4 years ago Closed 3 years ago

[merge-day] Can't push to repos on the first try: pretxnclose.vcsreplicator hook failed

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: u504868, Assigned: sheehan)

References

Details

(Whiteboard: [releaseduty])

Attachments

(2 files)

From bug 1399916 comment 1, bug 1399916 comment 5, and bug 1399916 comment 6, this error occurs each time somebody merges one branch to another:

> 18:19:35     INFO -  pushing to ssh://hg.mozilla.org/releases/mozilla-release
> 18:19:35     INFO -  searching for changes
> 18:19:35     INFO -  remote: adding changesets
> 18:19:35     INFO -  remote: adding manifests
> 18:19:35     INFO -  remote: adding file changes
> 18:19:35     INFO -  remote: added 9557 changesets with 95413 changes to 61664 files
> 18:19:35     INFO -  remote: (prevent_symlinks check disabled per config override)
> 18:19:35     INFO -  remote: recorded push in pushlog
> 18:19:35     INFO -  remote: replication log not available; cannot close transaction
> 18:19:35     INFO -  remote: transaction abort!
> 18:19:35     INFO -  remote: rolling back pushlog
> 18:19:35     INFO -  remote: rollback completed
> 18:19:35     INFO -  remote: pretxnclose.vcsreplicator hook failed
> 18:19:35    ERROR -  abort: push failed on remote
> 18:19:35    ERROR -  Automation Error: hg not responding
> 18:19:36    ERROR - Return code: 255
> 18:19:36    FATAL - Push failed!  If there was a push race, try rerunning

This can be worked around by pushing another time. However, such pushes take about 20 minutes to be processed. This makes use wait for 40 minutes until a push is done.
See Also: → 1399916
I thought we had a bug on this general issue (large pushes cause vcsreplicator to fail). But I can't find it at the moment. It is a known issue.
from irc:

15:12:19 < jlorenzo> fubar: hi! I'm trying to stage a merge day (which will happen next week)
                     and I can perma repro
                     https://bugzilla.mozilla.org/show_bug.cgi?id=1415233 . I tried about 5
                     times. That blocks the testing of the new esr60 config. Would you have
                     something on the server side logs?
15:12:21 < firebot> Bug 1415233 -- NEW, nobody@mozilla.org -- [merge-day] Can't push to repos
                    on the first try: pretxnclose.vcsreplicator hook failed
15:12:36 < jlorenzo> here are the logs I have https://irccloud.mozilla.com/pastebin/5F7xYK6e/

On hgssh4:

2018-04-17T19:07:10.153537+00:00 hgssh4.dmz.scl3.mozilla.com vcsreplicator: jlorenzo@mozilla.com {moz}/projects/jamun PRETXNCLOSE_HEARTBEATSENDING
2018-04-17T19:07:10.155667+00:00 hgssh4.dmz.scl3.mozilla.com vcsreplicator: jlorenzo@mozilla.com {moz}/projects/jamun EXCEPTION Traceback (most recent call last):#012  File "/var/hg/version-control-tools/pylib/vcsreplicator/vcsreplicator/hgext.py", line 163, in pretxnclosehook#012    repo.replicationpartition)#012  File "/var/hg/version-control-tools/pylib/vcsreplicator/vcsreplicator/producer.py", line 56, in send_heartbeat#012    }, partition=partition)#012  File "/var/hg/version-control-tools/pylib/vcsreplicator/vcsreplicator/producer.py", line 49, in send_message#012    self.topic, partition, msg)#012  File "/var/hg/venv_pash/lib/python2.7/site-packages/kafka/producer/base.py", line 349, in send_messages#012    return self._send_messages(topic, partition, *msg)#012  File "/var/hg/venv_pash/lib/python2.7/site-packages/kafka/producer/base.py", line 390, in _send_messages#012    fail_on_error=self.sync_fail_on_error#012  File "/var/hg/venv_pash/lib/python2.7/site-packages/kafka/client.py", line 480, in send_produce_request#012    (not fail_on_error or not self._raise_on_response_error(resp))]#012  File "/var/hg/venv_pash/lib/python2.7/site-packages/kafka/client.py", line 247, in _raise_on_response_error#012    raise resp#012FailedPayloadsError
For reference, this blocks the testing of ESR60. Merge day is planned to happen Thursday of next week. Releng would love to run several staging releases beforehand.
Blocks: esr60, 1453993
Workaround given by gps: send smaller pushes thanks to `hg push -r ${some_old_enough_rev}`. That worked. Data points:
* 22299 changesets with 211455 changes to 90002 files (+1 heads) => perma fail (5/5 times)
* 13585 changesets with 125540 changes to 68303 files (+1 heads) => failed (tried once)
* 4397 changesets with 28479 changes to 16071 files (+1 heads) => passed on first try (right after previous push @ 13585 changesets).
Severity: blocker → normal
Depends on: 1455023
Large pushes with many heads (for example, 1000 heads) send very large
messages to Kafka. When Kafka reads these messages, the default buffer
size for the consumer is 4K. If the message doesn't fit in the buffer,
kafka-python will repeatedly double the buffer size until the message can
be read from the buffer, or the max buffer size is reached. This procedure
takes time, and during that time the process times out and the push fails.
This commit increases the default buffer size of the consumer to 1MB so
large pushes avoid this time consuming procedure.
Comment on attachment 8971418 [details]
vcsreplicator: set default buffer size to 1MB (Bug 1415233) r?gps

Gregory Szorc [:gps] has approved the revision.

https://phabricator.services.mozilla.com/D1051
Attachment #8971418 - Flags: review+
Pushed by cosheehan@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/105472c1c29e
vcsreplicator: set default buffer size to 1MB r=gps
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
The push from comment 8 was created as a possible fix for this issue but we don't believe it is truly the root cause. Re-opening.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I think the root cause is the client being disconnected after a 10 minute timeout. IIRC the timeout is defined in the Kafka server. This is also arguably a bug in the Kafka client not reconnecting automatically.

And the reason why we have a 10 minute delay between Kafka activity leading to that timeout is slow hooks. See bug 1457342 and bug 1460722.

We should make all client code more resilient to disconnects and/or increase the timeout threshold.
A notable event regarding comment 10 - jlorenzo came to #vcs to request the `prevent_symlinks` hook be turned off so he could push to birch. After turning off the symlinks hook, he ran into this issue. The symlinks hook may not be the primary culprit here, however if the overall hook run time is the issue it may be a contributing factor. We will need to do some more digging.

IRC logs from our conversation: https://mozilla.logbot.info/vcs/20180814#c15164712
See Also: → 1460722, 1457342
This commit makes the vcsreplicator Producer retry message
sending on error. We are currently running into an issue
on merge-days and during other large pushes to hg.mozilla.org,
where the push will fail during the vcsreplicator.pretxnclose
hook. This hook does nothing except send a heartbeat to Kafka
after all other message processing, to ensure the replication
log is writable. Since the hook is so simple, we believe the
Kafka connection is timing out while waiting for processing the
large push to complete and therefore the heartbeat sent over
the dead connection fails. This commit makes the message sending
procedure re-initialize the client on a failure and try to
send the message again.
Comment on attachment 9005653 [details]
vcsreplicator: retry message sending on error (Bug 1415233) r?gps

Gregory Szorc [:gps] has approved the revision.
Attachment #9005653 - Flags: review+
Pushed by cosheehan@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/687e1b127694
vcsreplicator: retry message sending on error r=gps
Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → FIXED
Hopefully once we deploy the patch from comment 14 this problem will go away. We should leave this bug open in the meantime.
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
The fix from comment 14 has been deployed, and I've tested it with some promising results. I made an empty user repo at https://hg.mozilla.org/users/cosheehan_mozilla.com/unified and made a large push from mozilla-unified with the following output:

$ hg push user -r 30000
pushing to ssh://hg.mozilla.org/users/cosheehan_mozilla.com/unified
searching for changes
remote: adding changesets                                                                                                                                     
remote: adding manifests
remote: adding file changes
remote: added 29981 changesets with 146198 changes to 48016 files
remote: recorded push in pushlog
remote: 
remote: View the pushlog for these changes here:
remote:   https://hg.mozilla.org/users/cosheehan_mozilla.com/unified/pushloghtml?changeset=fe1f520ff772967e481ab11958095eaa14ce7a9d
remote: recorded changegroup in replication log in 0.132s

The push took 24m to complete on the first try, so if the 10 minute timeout was the problem as suggested in comment 10 then this is a fairly promising result!
That looks awesome, Connor :D I've never been about to push 30K changeset since this issue was first filed. Next merge day is either on Friday October 12th or on Monday the 15th. Let's see how this goes!
The large push on the last merge day "just worked" and we didn't have to disable the hook as a workaround. So calling this one resolved.
Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → FIXED
Assignee: nobody → sheehan
You need to log in before you can comment on or make changes to this bug.