Closed
Bug 1387407
Opened 7 years ago
Closed 7 years ago
hg.mozilla.org pushlog for Try isn't showing pushes for recent commits
Categories
(Developer Services :: Mercurial: hg.mozilla.org, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: aryx, Unassigned)
inbound and autoland seem to be fine, but Try hasn't shown new jobs for pushes made in the last 100 minutes, even though those pushes are visible at https://hg.mozilla.org/try/shortlog
Updated•7 years ago
Assignee: nobody → emorley
Status: NEW → ASSIGNED
Priority: -- → P1
Comment 1•7 years ago
On New Relic: https://rpm.newrelic.com/accounts/677903/applications/14179757/filterable_errors#/show/a2ee18a3-7911-11e7-931c-0242ac110012_8314_12387/stack_trace?top_facet=transactionUiName&primary_facet=error.class&barchart=barchart&_k=rinpuv

  File "/app/treeherder/etl/tasks/pulse_tasks.py", line 30, in store_pulse_resultsets
  File "/app/treeherder/etl/resultset_loader.py", line 42, in process
  File "/app/treeherder/etl/resultset_loader.py", line 237, in transform
  File "/app/treeherder/etl/resultset_loader.py", line 244, in fetch_resultset
exceptions:IndexError: list index out of range

Corresponds to:
https://github.com/mozilla/treeherder/blob/3648340ab4155a24747a68ddde3f5b8a0eb24f99/treeherder/etl/resultset_loader.py#L244

It looks like the hg.mozilla.org pushlog is broken. The most recent entries here are from 2 hours ago:
https://hg.mozilla.org/try/pushloghtml
https://hg.mozilla.org/try/json-pushes

...whereas the most recent commits are from minutes ago:
https://hg.mozilla.org/try/shortlog

This is the last push that the API returns:
https://hg.mozilla.org/try/json-pushes?version=2&full=1&startID=210594&endID=210595

The next push ID (210596) doesn't exist:
https://hg.mozilla.org/try/json-pushes?version=2&full=1&startID=210595&endID=210596

Note also that when visiting a specific recent commit, the "push id" and related fields all show as "unknown":
https://hg.mozilla.org/try/rev/9b6f166ff58b
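The missing-push-ID symptom above can be checked mechanically against a json-pushes (version=2) style payload. The sketch below is illustrative only: the sample response is made up (shaped like the v2 format, with "pushes" keyed by push ID plus "lastpushid"), and the helper function is hypothetical, not Treeherder code:

```python
# Hypothetical helper: find push IDs that the pushlog claims should exist
# (<= lastpushid) but which have no entry in the "pushes" mapping.

def find_missing_push_ids(payload):
    """Return sorted push IDs up to lastpushid with no pushlog entry."""
    present = {int(push_id) for push_id in payload["pushes"]}
    if not present:
        return []
    start = min(present)
    end = payload["lastpushid"]
    return [i for i in range(start, end + 1) if i not in present]

# Invented sample mirroring the incident: pushes after 210595 are absent.
sample = {
    "lastpushid": 210597,
    "pushes": {
        "210594": {"user": "someone@example.com", "changesets": []},
        "210595": {"user": "someone@example.com", "changesets": []},
    },
}

print(find_missing_push_ids(sample))  # [210596, 210597]
```

Running this against a live response would mean fetching the json-pushes URL first; the detection logic itself is just the set difference shown here.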
Assignee: emorley → nobody
Status: ASSIGNED → NEW
Component: Treeherder: Data Ingestion → Mercurial: hg.mozilla.org
Product: Tree Management → Developer Services
Version: --- → unspecified
Updated•7 years ago
Summary: New jobs/pushes don't show up on Try → hg.mozilla.org pushlog for Try isn't showing pushes for recent commits
Reporter
Updated•7 years ago
Flags: needinfo?(klibby)
Comment 2•7 years ago
Possibly related: Mercurial 4.2 was deployed half a day ago, in https://bugzilla.mozilla.org/show_bug.cgi?id=1385979
Comment 3•7 years ago
Attempted to run replicatesync to make sure there weren't any issues with replication, and it's not happy with try:

13:32 hgssh4.dmz.scl3# /var/hg/venv_tools/bin/hg replicatesync
** unknown exception encountered, please report by visiting
** https://mercurial-scm.org/wiki/BugTracker
** Python 2.7.5 (default, Nov 6 2016, 00:28:07) [GCC 4.8.5 20150623 (Red Hat 4.8.5-11)]
** Mercurial Distributed SCM (version 4.2.2)
** Extensions loaded: blackbox, clonebundles, obsolescencehacks, pushlog, serverlog, readonly, vcsreplicator
Traceback (most recent call last):
  File "/var/hg/venv_tools/bin/hg", line 45, in <module>
    mercurial.dispatch.run()
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 83, in run
    status = (dispatch(req) or 0) & 255
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 164, in dispatch
    ret = _runcatch(req)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 295, in _runcatch
    return _callcatch(ui, _runcatchfunc)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 303, in _callcatch
    return scmutil.callcatch(ui, func)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/scmutil.py", line 146, in callcatch
    return func()
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 285, in _runcatchfunc
    return _dispatch(req)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 912, in _dispatch
    cmdpats, cmdoptions)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 648, in runcommand
    ret = _runcommand(ui, options, cmd, d)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 920, in _runcommand
    return cmdfunc()
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 909, in <lambda>
    d = lambda: util.checksignature(func)(ui, *args, **strcmdopt)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/util.py", line 1080, in check
    return func(*args, **kwargs)
  File "/var/hg/version-control-tools/pylib/vcsreplicator/vcsreplicator/hgext.py", line 317, in replicatecommand
    sendreposyncmessage(ui, repo)
  File "/var/hg/version-control-tools/pylib/vcsreplicator/vcsreplicator/hgext.py", line 224, in sendreposyncmessage
    partition=repo.replicationpartition)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/vcsreplicator/producer.py", line 128, in record_hg_repo_sync
    }, partition=partition)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/vcsreplicator/producer.py", line 49, in send_message
    self.topic, partition, msg)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/producer/base.py", line 349, in send_messages
    return self._send_messages(topic, partition, *msg)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/producer/base.py", line 390, in _send_messages
    fail_on_error=self.sync_fail_on_error
  File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/client.py", line 480, in send_produce_request
    (not fail_on_error or not self._raise_on_response_error(resp))]
  File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/client.py", line 247, in _raise_on_response_error
    raise resp
kafka.common.FailedPayloadsError

and then kafka gets sad and needs restarting. I don't know that this is actually related to the problem at hand, however.
Flags: needinfo?(klibby)
Comment 4•7 years ago
ok, looks like replication of pushlog data is behind:

hgssh4.dmz.scl3# ls -al /repo/hg/mozilla/try/.hg/pushlog2.db
-rw-rw-r-- 1 hg scm_level_1 159708160 Aug 4 13:57 /repo/hg/mozilla/try/.hg/pushlog2.db

hgweb11.dmz.scl3# ls -al /repo/hg/mozilla/try/.hg/pushlog2.db
-rw-rw-r-- 1 hg hg 159698944 Aug 4 10:32 /repo/hg/mozilla/try/.hg/pushlog2.db
Comment 5•7 years ago
Aug 03 07:42:45 hgssh4.dmz.scl3.mozilla.com java[1054]: [2017-08-03 07:42:45,397] ERROR [Replica Manager on Broker 11]: Error when processing fetch request for partition [pushdata,4] offset 4012838 from consumer with correlation id 18. Possible cause: Request for offset 4012838 but we only have log segments in the range 1972025 to 2040814. (kafka.server.ReplicaManager)
Comment 6•7 years ago
(In reply to Kendall Libby [:fubar] from comment #5)
> Aug 03 07:42:45 hgssh4.dmz.scl3.mozilla.com java[1054]: [2017-08-03 07:42:45,397] ERROR [Replica Manager on Broker 11]: Error when processing fetch request for partition [pushdata,4] offset 4012838 from consumer with correlation id 18. Possible cause: Request for offset 4012838 but we only have log segments in the range 1972025 to 2040814. (kafka.server.ReplicaManager)

or that's a red herring, since it predates at least where pushlog data stops on the clients.
Flags: needinfo?(gps)
Comment 7•7 years ago
:gps suggested bouncing Kafka on all of the nodes, but it hasn't helped (and replicatesync still throws the above error). He'll be online in < 30 min to take a look.
Comment 8•7 years ago
The issue is resolved for now; I'm still trying to piece together what happened. I do know there was a failure writing to the pushlog for one autoland push, leaving a few changesets in the repo with no pushlog entry. This almost certainly confused Treeherder, and it may have confused aspects of the replication mechanism.
Comment 9•7 years ago
Timeline of events:

09:40:59 VYV03354@nifty.ne.jp connects via ssh
09:41:11 VYV03354@nifty.ne.jp transaction starts
09:41:18 bind-autoland@mozilla.com connects via ssh
09:41:19 VYV03354@nifty.ne.jp transaction to add 4 changesets with head d8ded36356b0 closes
09:41:19 hgweb mirrors start connecting to replicate changes
09:41:21 hgweb mirrors start fetching bundles
09:41:23 bind-autoland@mozilla.com transaction starts
09:41:25 VYV03354@nifty.ne.jp disconnects
09:41:28 hgweb mirrors start fetching pushlog data
09:41:30 hgweb11 confirms replication of VYV03354@nifty.ne.jp's push
09:41:31 bind-autoland@mozilla.com transaction starts to close
09:41:37 bind-autoland@mozilla.com transaction closes; added 14 changesets with head 82464b3a6ebd
09:41:37 hgweb mirrors start fetching bind-autoland's transaction
09:41:37 bind-autoland@mozilla.com disconnects
09:41:50 hgweb11 confirms replication of autoland's push
09:43:03 mikokm@gmail.com connects via ssh, submits a push with 4 changesets and head 5b4ed2075b1c; hgssh appears happy
09:43:54 hgweb11 confirms replication of mikokm's push
...
XX:XX:XX various other pushes to try complete successfully
...
12:19:XX this bug filed
13:06:XX emorley makes noise in #vcs
13:24:XX fubar answers ping
13:27:47 fubar runs `hg replicatesync`
13:27:57 command fails with FailedPayloadsError attempting to write to Kafka
13:28:12 pid 23830 fails with RequestTimedOutError writing to Kafka (not sure what process this is)
13:28:20 first Nagios check fails due to failing to write heartbeat test message to partition 4 (try's partition)
13:29:51 jkingston@mozilla.com establishes ssh connection; push experiences problems
13:29:21 first nagios alert in #vcs
13:31:24 kafka restarted on hgssh4
XX:XX:XX kafka bounced on other machines and again on hgssh4; other attempts to run `hg replicatesync`
...
16:XX:XX gps runs `hg pull` on hgweb machines; lots of missing changesets pulled in
16:XX:XX gps also runs `hg replicatesync`, manages to make Kafka unhappy

Included in the autoland push is this traceback:

  File "/var/hg/version-control-tools/pylib/vcsreplicator/vcsreplicator/hgext.py", line 359, in wireprotodispatch
    return orig(repo, proto, command)
  File "/var/hg/venv_pash/lib/python2.7/site-packages/mercurial/wireproto.py", line 570, in dispatch
    return func(repo, proto, *args)
  File "/var/hg/venv_pash/lib/python2.7/site-packages/mercurial/wireproto.py", line 982, in unbundle
    proto._client())
  File "/var/hg/venv_pash/lib/python2.7/site-packages/mercurial/exchange.py", line 1777, in unbundle
    lockandtr[2].close()
  File "/var/hg/venv_pash/lib/python2.7/site-packages/mercurial/transaction.py", line 43, in _active
    return func(self, *args, **kwds)
  File "/var/hg/venv_pash/lib/python2.7/site-packages/mercurial/transaction.py", line 490, in close
    self._postclosecallback[cat](self)
  File "/var/hg/version-control-tools/hgext/pushlog/__init__.py", line 245, in onpostclose
    conn.commit()
OperationalError: database is locked

Essentially, SQLite raised an exception when attempting to close the transaction adding the pushlog entry. The Mercurial transaction had completed. The changeset data was there but the pushlog data never was committed.

So, the summary of events is:

1. An error writing to the pushlog meant we recorded Mercurial data without a pushlog entry for head 82464b3a6ebd and the prior 13 changesets
2. Treeherder started choking because of missing pushlog data
3. Upon suspicion that things were wedged, `hg replicatesync` was executed. This overwhelmed Kafka and wedged it.
4. Heartbeat checks and pushes stopped working completely
5. Kafka was bounced
6. 3-5 were repeated a few times
7. Eventually, ran `hg pull` from hgweb to obtain missing changesets; everything restored

I see the following potential follow-up actions:

1. Fix pushlog so this database lock issue doesn't happen (I suspect we're using SQLite non-optimally and are running into a race between readers and a writer)
2. Treeherder to potentially react more robustly if pushlog data is missing (although fail fast applies and I'd like to say that hg.mo can guarantee pushlog data existence)
3. Figure out what is causing Kafka to get wedged during `hg replicatesync` and prevent that from happening (I think it chokes on a very large message)
Flags: needinfo?(gps)
Comment 10•7 years ago
(In reply to Gregory Szorc [:gps] from comment #9)
> 2. Treeherder to potentially react more robustly if pushlog data is missing
> (although fail fast applies and I'd like to say that hg.mo can guarantee
> pushlog data existence)

I could be wrong, but I don't think there was any issue on the Treeherder side. Each push received over Pulse is ingested separately, and the `push_full_json_url` property is used to determine which pushlog JSON URL to retrieve. However none of the more recent pushes existed in the pushlog - not just those 13. For example, the 9b6f166ff58b commit mentioned in comment 1 didn't have a pushlog entry, and that was ~25 pushes after the last good push at the time of the incident.
Comment 11•7 years ago
Regarding the interaction between Mercurial and SQLite, the architectural problem is that we don't have atomic transactions. Mercurial has its transaction. SQLite has its transaction. A failure to close SQLite's transaction doesn't roll back Mercurial's transaction.

Way back in bug 966545 I improved the pushlog code to at least be scoped to Mercurial's transaction. Before, if you aborted a push, we would orphan data in the SQLite database and things would get confused. For several years, we told people not to ^C pushes or it could corrupt repos. After that change, we roll back the SQLite transaction if Mercurial aborts its transaction.

Transactions are fundamentally a hard problem. More so when you try to link discrete software components to the same logical transaction. The thing making it difficult in this case is that there is no easy way to "undo" the last committed transaction in Mercurial and SQLite. If you have discrete transactions, you need a way to undo the last transaction if a subsequent transaction fails in order to make them behave as an atomic unit. Lacking a mechanism to do this, we have to hope and pray or work outside the box.

Up to this day, we've somehow never had (or at least never recorded) an instance of the SQLite transaction failing after the Mercurial transaction closed. So the "hope and pray" solution has worked. But since it failed, we should probably do something. Ideas I see:

1) Create a copy of the SQLite database on transaction open. Save SQLite's transaction before Mercurial's. Use Mercurial's built-in restore-file-on-rollback/copy-file-on-close primitives to lean on file copies instead of SQLite transactions to achieve atomic transactions. (There will likely be a perf hit and the change is quite hacky.)

2) Improve usage of SQLite so we don't get database locks and/or retry harder when a lock is observed. This is still "hope and pray" but it might be good enough.
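For illustration, idea 2 (retry harder when a lock is observed) might look roughly like the sketch below. This is hypothetical, not the actual pushlog extension code; the function name, schema, and backoff numbers are all invented:

```python
import sqlite3
import time

def commit_with_retry(conn, attempts=5, base_delay=0.1):
    """Commit, retrying on 'database is locked' with linear backoff.

    Invented helper for illustration; real code would live in the
    pushlog extension's transaction close callback.
    """
    for attempt in range(attempts):
        try:
            conn.commit()
            return
        except sqlite3.OperationalError as e:
            # Re-raise anything that isn't a lock, or the final failure.
            if "locked" not in str(e) or attempt == attempts - 1:
                raise
            time.sleep(base_delay * (attempt + 1))

# Demo with a throwaway in-memory database and made-up schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pushlog (id INTEGER PRIMARY KEY, user TEXT)")
conn.execute("INSERT INTO pushlog (user) VALUES (?)", ("someone@example.com",))
commit_with_retry(conn)
print(conn.execute("SELECT count(*) FROM pushlog").fetchone()[0])  # 1
```

The retry only papers over contention; it doesn't make the Mercurial and SQLite transactions atomic with respect to each other, which is the deeper problem described above.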
Comment 12•7 years ago
(In reply to Ed Morley [:emorley] from comment #10)
> Each push received over Pulse is ingested separately, and the
> `push_full_json_url` property used to determine what jsonpushlog URL to
> retrieve. However none of the more recent pushes existed in the pushlog -
> not just those 13. For example, the 9b6f166ff58b commit mentioned in comment
> 1 didn't have a pushlog entry, and that was ~25 pushes after the last good
> push at the time of the incident.

OK. Then it appears the hgweb machines didn't replicate pushlog entries until I ran a manual `hg pull`. We don't log the raw command output when running some replication commands. So I can't confirm this. But it is a plausible explanation.

So additional follow-ups would be:

* Log output from replication commands to facilitate forensics
* Make pushlog data mirroring more robust to missing pushlog data
Comment 13•7 years ago
The pushlog SQLite databases don't use WAL, which means there is contention between readers and writers.

Unfortunately, I don't think we can use WAL because the pushlog on hgssh is hosted on NFS. The WAL index is in shared memory, which means it won't work across machines. While we typically only have one master, a shared memory index for SQLite feels brittle to me. I don't want to be dealing with issues from multiple machines attempting to access the SQLite database and touching the index.

You can disable the use of shared memory with the WAL index (https://sqlite.org/wal.html#noshm). But this requires locking_mode=EXCLUSIVE, which prevents concurrent readers. We need concurrent reads for replication requests. So I think WAL is out.

Refreshing my brain on transaction semantics (https://www.sqlite.org/lang_transaction.html#immediate), we /could/ move from a deferred transaction to an IMMEDIATE one. This will obtain a RESERVED lock immediately, thus locking out other writers. However, a RESERVED lock will also lock out other readers. That's not desirable because it will prevent replication.

I think the best we can do is retry harder when the COMMIT fails.

That being said, if we did actually see SQLITE_LOCKED (https://www.sqlite.org/rescode.html#locked) from Python, that means the lock is coming from within the Python process rather than from a reader. Perhaps we are inadvertently opening multiple SQLite connections from the same Python process? Due to the way the pushlog object is cached on the Mercurial repo, this is certainly possible...
Comment 14•7 years ago
According to the SQLite source code, "database is locked" corresponds to SQLITE_BUSY. So this is likely a reader locking out a writer. We should increase the busy timeout when we commit the transaction to maximize our chances of success.
Comment 15•7 years ago
This seems to have happened again today. https://treeherder.mozilla.org/#/jobs?repo=try&revision=9fd674cc9b840320a1f39421141863058c74e45f and https://treeherder.mozilla.org/#/jobs?repo=try&revision=a4a9037ae138067133c64a7d4b83321a26aa8c12 were two pushes I made this morning, yet the latest entry on https://hg.mozilla.org/try/pushloghtml is [was] from ~90 minutes ago.
Comment 16•7 years ago
Tried running `hg pull` on the webheads to catch try back up, but I'm clearly missing something, as it was a no-op. There weren't any 'database is locked' errors that I could find on hgssh4; presumably they occurred elsewhere.
Flags: needinfo?(gps)
Comment 17•7 years ago
Correct me if I'm wrong, but SQLite only has database-wide locks, correct? Is it the right choice for us if we're having problems with concurrency? In MySQL with InnoDB, for instance, locking is generally row-level, so lock contention is much less likely.
Comment 18•7 years ago
We use SQLite because it is simple and fits the requirements (assuming we use it correctly). Using something more complex like MySQL introduces more failure modes, operational complexity, etc.
Reporter
Comment 19•7 years ago
No new pushes on Treeherder for Try for the last 2 hours. Kendall, please take a look.
Flags: needinfo?(klibby)
Comment 20•7 years ago
Restarted kafka, just in case. Pulled data over to the webheads.
Flags: needinfo?(klibby)
Comment 21•7 years ago
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/0795a1fde163
pushlog: increase busy_timeout
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Comment 22•7 years ago
I suspect that change (which I'm deploying now) will wallpaper over the issue; so far I have yet to reproduce in tests the exact failure we're seeing in production. Speaking of tests, I intend to land some more test coverage for database contention, but that doesn't need to block the server change.
Comment 23•7 years ago
This broke again a few minutes ago. Had to do an `hg pull` to fix. Ugh.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 24•7 years ago
I suspect bug 1297153 may be partially responsible for some of the failures. Anyway, I don't think we've had a pushlog failure since both the timeout was increased and the try heads were pruned over the past several days. While I'm still concerned about what caused this, I'm willing to call it resolved once bug 1297153 lands.
Comment 25•7 years ago
There is no indication that this is still a problem.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED