Closed Bug 1415233 Opened 8 years ago Closed 7 years ago

[merge-day] Can't push to repos on the first try: pretxnclose.vcsreplicator hook failed

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: jlorenzo, Assigned: sheehan)

References

Details

(Whiteboard: [releaseduty])

Attachments

(2 files)

vcsreplicator: set default buffer size to 1MB (Bug 1415233) r?gps 7 years ago Phabricator Automation 46 bytes, text/x-phabricator-request	gps : review+	Details \| Review
vcsreplicator: retry message sending on error (Bug 1415233) r?gps 7 years ago Connor Sheehan [:sheehan] 46 bytes, text/x-phabricator-request	gps : review+	Details \| Review

Johan Lorenzo [:jlorenzo]

Reporter

Description

•

8 years ago

From bug 1399916 comment 1, bug 1399916 comment 5, and bug 1399916 comment 6, this error occurs each time somebody merges one branch to another: > 18:19:35 INFO - pushing to ssh://hg.mozilla.org/releases/mozilla-release > 18:19:35 INFO - searching for changes > 18:19:35 INFO - remote: adding changesets > 18:19:35 INFO - remote: adding manifests > 18:19:35 INFO - remote: adding file changes > 18:19:35 INFO - remote: added 9557 changesets with 95413 changes to 61664 files > 18:19:35 INFO - remote: (prevent_symlinks check disabled per config override) > 18:19:35 INFO - remote: recorded push in pushlog > 18:19:35 INFO - remote: replication log not available; cannot close transaction > 18:19:35 INFO - remote: transaction abort! > 18:19:35 INFO - remote: rolling back pushlog > 18:19:35 INFO - remote: rollback completed > 18:19:35 INFO - remote: pretxnclose.vcsreplicator hook failed > 18:19:35 ERROR - abort: push failed on remote > 18:19:35 ERROR - Automation Error: hg not responding > 18:19:36 ERROR - Return code: 255 > 18:19:36 FATAL - Push failed! If there was a push race, try rerunning This can be worked around by pushing another time. However, such pushes take about 20 minutes to be processed. This makes use wait for 40 minutes until a push is done.

Johan Lorenzo [:jlorenzo]

Reporter

Updated

•

8 years ago

Comment 1

•

8 years ago

I thought we had a bug on this general issue (large pushes cause vcsreplicator to fail). But I can't find it at the moment. It is a known issue.

Kendall Libby [:fubar] (he/him)

Comment 2

•

7 years ago

from irc: 15:12:19 < jlorenzo> fubar: hi! I'm trying to stage a merge day (which will happen next week) and I can perma repro https://bugzilla.mozilla.org/show_bug.cgi?id=1415233 . I tried about 5 times. That blocks the testing of the new esr60 config. Would you have something on the server side logs? 15:12:21 < firebot> Bug 1415233 -- NEW, nobody@mozilla.org -- [merge-day] Can't push to repos on the first try: pretxnclose.vcsreplicator hook failed 15:12:36 < jlorenzo> here are the logs I have https://irccloud.mozilla.com/pastebin/5F7xYK6e/ On hgssh4: 2018-04-17T19:07:10.153537+00:00 hgssh4.dmz.scl3.mozilla.com vcsreplicator: jlorenzo@mozilla.com {moz}/projects/jamun PRETXNCLOSE_HEARTBEATSENDING 2018-04-17T19:07:10.155667+00:00 hgssh4.dmz.scl3.mozilla.com vcsreplicator: jlorenzo@mozilla.com {moz}/projects/jamun EXCEPTION Traceback (most recent call last):#012 File "/var/hg/version-control-tools/pylib/vcsreplicator/vcsreplicator/hgext.py", line 163, in pretxnclosehook#012 repo.replicationpartition)#012 File "/var/hg/version-control-tools/pylib/vcsreplicator/vcsreplicator/producer.py", line 56, in send_heartbeat#012 }, partition=partition)#012 File "/var/hg/version-control-tools/pylib/vcsreplicator/vcsreplicator/producer.py", line 49, in send_message#012 self.topic, partition, msg)#012 File "/var/hg/venv_pash/lib/python2.7/site-packages/kafka/producer/base.py", line 349, in send_messages#012 return self._send_messages(topic, partition, *msg)#012 File "/var/hg/venv_pash/lib/python2.7/site-packages/kafka/producer/base.py", line 390, in _send_messages#012 fail_on_error=self.sync_fail_on_error#012 File "/var/hg/venv_pash/lib/python2.7/site-packages/kafka/client.py", line 480, in send_produce_request#012 (not fail_on_error or not self._raise_on_response_error(resp))]#012 File "/var/hg/venv_pash/lib/python2.7/site-packages/kafka/client.py", line 247, in _raise_on_response_error#012 raise resp#012FailedPayloadsError

Comment hidden (typo)

Johan Lorenzo [:jlorenzo]

Reporter

Comment 4

•

7 years ago

For reference, this blocks the testing of ESR60. Merge day is planned to happen Thursday of next week. Releng would love to run several staging releases beforehand.

Blocks: esr60, 1453993

Johan Lorenzo [:jlorenzo]

Reporter

Comment 5

•

7 years ago

Workaround given by gps: send smaller pushes thanks to `hg push -r ${some_old_enough_rev}`. That worked. Data points: * 22299 changesets with 211455 changes to 90002 files (+1 heads) => perma fail (5/5 times) * 13585 changesets with 125540 changes to 68303 files (+1 heads) => failed (tried once) * 4397 changesets with 28479 changes to 16071 files (+1 heads) => passed on first try (right after previous push @ 13585 changesets).

Severity: blocker → normal

Johan Lorenzo [:jlorenzo]

Reporter

Updated

•

7 years ago

Depends on: 1455023

Phabricator Automation

Comment 6

•

7 years ago

Attached file vcsreplicator: set default buffer size to 1MB (Bug 1415233) r?gps — Details

Large pushes with many heads (for example, 1000 heads) send very large messages to Kafka. When Kafka reads these messages, the default buffer size for the consumer is 4K. If the message doesn't fit in the buffer, kafka-python will repeatedly double the buffer size until the message can be read from the buffer, or the max buffer size is reached. This procedure takes time, and during that time the process times out and the push fails. This commit increases the default buffer size of the consumer to 1MB so large pushes avoid this time consuming procedure.

Phabricator Automation

Comment 7

•

7 years ago

Comment on attachment 8971418 [details] vcsreplicator: set default buffer size to 1MB (Bug 1415233) r?gps Gregory Szorc [:gps] has approved the revision. https://phabricator.services.mozilla.com/D1051

Attachment #8971418 - Flags: review+

Pulsebot

Comment 8

•

7 years ago

Pushed by cosheehan@mozilla.com: https://hg.mozilla.org/hgcustom/version-control-tools/rev/105472c1c29e vcsreplicator: set default buffer size to 1MB r=gps

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → FIXED

Connor Sheehan [:sheehan]

Assignee

Comment 9

•

7 years ago

The push from comment 8 was created as a possible fix for this issue but we don't believe it is truly the root cause. Re-opening.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Gregory Szorc [:gps]

Comment 10

•

7 years ago

I think the root cause is the client being disconnected after a 10 minute timeout. IIRC the timeout is defined in the Kafka server. This is also arguably a bug in the Kafka client not reconnecting automatically. And the reason why we have a 10 minute delay between Kafka activity leading to that timeout is slow hooks. See bug 1457342 and bug 1460722. We should make all client code more resilient to disconnects and/or increase the timeout threshold.

Mihai Tabara [:mtabara]⌚️GMT

Updated

•

7 years ago

Blocks: 1464026

Connor Sheehan [:sheehan]

Assignee

Comment 11

•

7 years ago

A notable event regarding comment 10 - jlorenzo came to #vcs to request the `prevent_symlinks` hook be turned off so he could push to birch. After turning off the symlinks hook, he ran into this issue. The symlinks hook may not be the primary culprit here, however if the overall hook run time is the issue it may be a contributing factor. We will need to do some more digging. IRC logs from our conversation: https://mozilla.logbot.info/vcs/20180814#c15164712

Comment 12

•

7 years ago

Attached file vcsreplicator: retry message sending on error (Bug 1415233) r?gps — Details

This commit makes the vcsreplicator Producer retry message sending on error. We are currently running into an issue on merge-days and during other large pushes to hg.mozilla.org, where the push will fail during the vcsreplicator.pretxnclose hook. This hook does nothing except send a heartbeat to Kafka after all other message processing, to ensure the replication log is writable. Since the hook is so simple, we believe the Kafka connection is timing out while waiting for processing the large push to complete and therefore the heartbeat sent over the dead connection fails. This commit makes the message sending procedure re-initialize the client on a failure and try to send the message again.

Mihai Tabara [:mtabara]⌚️GMT

Updated

•

7 years ago

Blocks: 1489406

Gregory Szorc [:gps]

Comment 13

•

7 years ago

Comment on attachment 9005653 [details] vcsreplicator: retry message sending on error (Bug 1415233) r?gps Gregory Szorc [:gps] has approved the revision.

Attachment #9005653 - Flags: review+

Pulsebot

Comment 14

•

7 years ago

Pushed by cosheehan@mozilla.com: https://hg.mozilla.org/hgcustom/version-control-tools/rev/687e1b127694 vcsreplicator: retry message sending on error r=gps

Status: REOPENED → RESOLVED

Closed: 7 years ago → 7 years ago

Resolution: --- → FIXED

Connor Sheehan [:sheehan]

Assignee

Comment 15

•

7 years ago

Hopefully once we deploy the patch from comment 14 this problem will go away. We should leave this bug open in the meantime.

Status: RESOLVED → REOPENED

Keywords: leave-open

Resolution: FIXED → ---

Connor Sheehan [:sheehan]

Assignee

Comment 16

•

7 years ago

The fix from comment 14 has been deployed, and I've tested it with some promising results. I made an empty user repo at https://hg.mozilla.org/users/cosheehan_mozilla.com/unified and made a large push from mozilla-unified with the following output: $ hg push user -r 30000 pushing to ssh://hg.mozilla.org/users/cosheehan_mozilla.com/unified searching for changes remote: adding changesets remote: adding manifests remote: adding file changes remote: added 29981 changesets with 146198 changes to 48016 files remote: recorded push in pushlog remote: remote: View the pushlog for these changes here: remote: https://hg.mozilla.org/users/cosheehan_mozilla.com/unified/pushloghtml?changeset=fe1f520ff772967e481ab11958095eaa14ce7a9d remote: recorded changegroup in replication log in 0.132s The push took 24m to complete on the first try, so if the 10 minute timeout was the problem as suggested in comment 10 then this is a fairly promising result!

Johan Lorenzo [:jlorenzo]

Reporter

Comment 17

•

7 years ago

That looks awesome, Connor :D I've never been about to push 30K changeset since this issue was first filed. Next merge day is either on Friday October 12th or on Monday the 15th. Let's see how this goes!

Mihai Tabara [:mtabara]⌚️GMT

Updated

•

7 years ago

Blocks: 1499047

Gregory Szorc [:gps]

Comment 18

•

7 years ago

The large push on the last merge day "just worked" and we didn't have to disable the hook as a workaround. So calling this one resolved.

Status: REOPENED → RESOLVED

Closed: 7 years ago → 7 years ago

Resolution: --- → FIXED

BugBot [:suhaib / :marco/ :calixte]

Updated

•

7 years ago

Keywords: leave-open

Connor Sheehan [:sheehan]

Assignee

Updated

•

7 years ago

Assignee: nobody → sheehan

You need to log in before you can comment on or make changes to this bug.