Inbound and autoland seem to be fine, but Try hasn't shown new jobs for pushes made in the last 100 minutes, as can be seen at https://hg.mozilla.org/try/shortlog
Assignee: nobody → emorley
Status: NEW → ASSIGNED
Priority: -- → P1
On New Relic: https://rpm.newrelic.com/accounts/677903/applications/14179757/filterable_errors#/show/a2ee18a3-7911-11e7-931c-0242ac110012_8314_12387/stack_trace?top_facet=transactionUiName&primary_facet=error.class&barchart=barchart&_k=rinpuv

  File "/app/treeherder/etl/tasks/pulse_tasks.py", line 30, in store_pulse_resultsets
  File "/app/treeherder/etl/resultset_loader.py", line 42, in process
  File "/app/treeherder/etl/resultset_loader.py", line 237, in transform
  File "/app/treeherder/etl/resultset_loader.py", line 244, in fetch_resultset
exceptions:IndexError: list index out of range

Corresponds to: https://github.com/mozilla/treeherder/blob/3648340ab4155a24747a68ddde3f5b8a0eb24f99/treeherder/etl/resultset_loader.py#L244

It looks like the hg.mozilla.org pushlog is broken. The most recent entries here are from 2 hours ago:
https://hg.mozilla.org/try/pushloghtml
https://hg.mozilla.org/try/json-pushes

...whereas the most recent commits are from minutes ago:
https://hg.mozilla.org/try/shortlog

This is the last push that the API returns:
https://hg.mozilla.org/try/json-pushes?version=2&full=1&startID=210594&endID=210595

The next push ID (210596) doesn't exist:
https://hg.mozilla.org/try/json-pushes?version=2&full=1&startID=210595&endID=210596

Note also that when visiting a specific recent commit, the "push id" and related fields all show as "unknown":
https://hg.mozilla.org/try/rev/9b6f166ff58b
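For illustration, here is a minimal, hypothetical reconstruction of the failure mode at resultset_loader.py#L244: the loader assumes json-pushes always returns at least one push, so indexing into an empty result raises IndexError. `fetch_resultset` below is a simplification for this bug report, not Treeherder's actual code.

```python
# Hypothetical sketch, not Treeherder's real implementation: the payload shape
# mirrors json-pushes?version=2, where "pushes" maps push IDs to push objects.
def fetch_resultset(pushlog_json):
    pushes = pushlog_json.get("pushes", {})
    if not pushes:
        # Defensive alternative: raise a descriptive error (or retry later)
        # instead of letting an IndexError escape deep inside the ETL task.
        raise ValueError("pushlog returned no pushes for this revision")
    newest_id = max(int(push_id) for push_id in pushes)
    return pushes[str(newest_id)]
```

When the pushlog entry for a commit is missing (as with 9b6f166ff58b above), the pushes mapping comes back empty, which is exactly the case the original code didn't guard against.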
Assignee: emorley → nobody
Status: ASSIGNED → NEW
Component: Treeherder: Data Ingestion → Mercurial: hg.mozilla.org
Product: Tree Management → Developer Services
Version: --- → unspecified
Summary: New jobs/pushes don't show up on Try → hg.mozilla.org pushlog for Try isn't showing pushes for recent commits
Possibly related: Mercurial 4.2 was deployed half a day ago, in https://bugzilla.mozilla.org/show_bug.cgi?id=1385979
Attempted to run `replicatesync` to make sure there weren't any issues with replication, and it's not happy with try:

13:32 hgssh4.dmz.scl3# /var/hg/venv_tools/bin/hg replicatesync
** unknown exception encountered, please report by visiting
** https://mercurial-scm.org/wiki/BugTracker
** Python 2.7.5 (default, Nov 6 2016, 00:28:07) [GCC 4.8.5 20150623 (Red Hat 4.8.5-11)]
** Mercurial Distributed SCM (version 4.2.2)
** Extensions loaded: blackbox, clonebundles, obsolescencehacks, pushlog, serverlog, readonly, vcsreplicator
Traceback (most recent call last):
  File "/var/hg/venv_tools/bin/hg", line 45, in <module>
    mercurial.dispatch.run()
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 83, in run
    status = (dispatch(req) or 0) & 255
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 164, in dispatch
    ret = _runcatch(req)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 295, in _runcatch
    return _callcatch(ui, _runcatchfunc)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 303, in _callcatch
    return scmutil.callcatch(ui, func)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/scmutil.py", line 146, in callcatch
    return func()
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 285, in _runcatchfunc
    return _dispatch(req)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 912, in _dispatch
    cmdpats, cmdoptions)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 648, in runcommand
    ret = _runcommand(ui, options, cmd, d)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 920, in _runcommand
    return cmdfunc()
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 909, in <lambda>
    d = lambda: util.checksignature(func)(ui, *args, **strcmdopt)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/util.py", line 1080, in check
    return func(*args, **kwargs)
  File "/var/hg/version-control-tools/pylib/vcsreplicator/vcsreplicator/hgext.py", line 317, in replicatecommand
    sendreposyncmessage(ui, repo)
  File "/var/hg/version-control-tools/pylib/vcsreplicator/vcsreplicator/hgext.py", line 224, in sendreposyncmessage
    partition=repo.replicationpartition)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/vcsreplicator/producer.py", line 128, in record_hg_repo_sync
    }, partition=partition)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/vcsreplicator/producer.py", line 49, in send_message
    self.topic, partition, msg)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/producer/base.py", line 349, in send_messages
    return self._send_messages(topic, partition, *msg)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/producer/base.py", line 390, in _send_messages
    fail_on_error=self.sync_fail_on_error
  File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/client.py", line 480, in send_produce_request
    (not fail_on_error or not self._raise_on_response_error(resp))]
  File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/client.py", line 247, in _raise_on_response_error
    raise resp
kafka.common.FailedPayloadsError

...and then kafka gets sad and needs restarting. I don't know that this is actually related to the problem at hand, however.
ok, looks like replication of pushlog data:

hgssh4.dmz.scl3# ls -al /repo/hg/mozilla/try/.hg/pushlog2.db
-rw-rw-r-- 1 hg scm_level_1 159708160 Aug 4 13:57 /repo/hg/mozilla/try/.hg/pushlog2.db

hgweb11.dmz.scl3# ls -al /repo/hg/mozilla/try/.hg/pushlog2.db
-rw-rw-r-- 1 hg hg 159698944 Aug 4 10:32 /repo/hg/mozilla/try/.hg/pushlog2.db
Aug 03 07:42:45 hgssh4.dmz.scl3.mozilla.com java: [2017-08-03 07:42:45,397] ERROR [Replica Manager on Broker 11]: Error when processing fetch request for partition [pushdata,4] offset 4012838 from consumer with correlation id 18. Possible cause: Request for offset 4012838 but we only have log segments in the range 1972025 to 2040814. (kafka.server.ReplicaManager)
(In reply to Kendall Libby [:fubar] from comment #5)
> Aug 03 07:42:45 hgssh4.dmz.scl3.mozilla.com java: [2017-08-03 07:42:45,397]
> ERROR [Replica Manager on Broker 11]: Error when processing fetch request
> for partition [pushdata,4] offset 4012838 from consumer with correlation id
> 18. Possible cause: Request for offset 4012838 but we only have log segments
> in the range 1972025 to 2040814. (kafka.server.ReplicaManager)

Or that's a red herring, since it predates at least where pushlog data stops on the clients.
:gps suggested bouncing kafka on all of the nodes, but it hasn't helped (and replicatesync still throws the above error). He'll be online in <30 min to take a look.
The issue is now resolved. I'm still trying to piece together what happened. I do know there was a failure writing to the pushlog for one autoland push, and there are a few changesets in the repo with no pushlog entry. This almost certainly confused Treeherder, and it may have confused aspects of the replication mechanism.
Timeline of events:

09:40:59 VYV03354@nifty.ne.jp connects via ssh
09:41:11 VYV03354@nifty.ne.jp transaction starts
09:41:18 firstname.lastname@example.org connects via ssh
09:41:19 VYV03354@nifty.ne.jp transaction to add 4 changesets with head d8ded36356b0 closes
09:41:19 hgweb mirrors start connecting to replicate changes
09:41:21 hgweb mirrors start fetching bundles
09:41:23 email@example.com transaction starts
09:41:25 VYV03354@nifty.ne.jp disconnects
09:41:28 hgweb mirrors start fetching pushlog data
09:41:30 hgweb11 confirms replication of VYV03354@nifty.ne.jp's push
09:41:31 firstname.lastname@example.org transaction starts to close
09:41:37 email@example.com transaction closes; added 14 changesets with head 82464b3a6ebd
09:41:37 hgweb mirrors start fetching bind-autoland's transaction
09:41:37 firstname.lastname@example.org disconnects
09:41:50 hgweb11 confirms replication of autoland's push
09:43:03 email@example.com connects via ssh, submits a push with 4 changesets and head 5b4ed2075b1c, and hgssh appears happy
09:43:54 hgweb11 confirms replication of mikokm's push
...
XX:XX:XX various other pushes to try complete successfully
...
12:19:XX this bug filed
13:06:XX emorley makes noise in #vcs
13:24:XX fubar answers ping
13:27:47 fubar runs `hg replicatesync`
13:27:57 command fails with FailedPayloadsError attempting to write to Kafka
13:28:12 pid 23830 fails with RequestTimedOutError writing to Kafka (not sure what process this is)
13:28:20 first Nagios check fails due to failing to write heartbeat test message to partition 4 (try's partition)
13:29:51 firstname.lastname@example.org establishes ssh connection; push experiences problems
13:29:21 first nagios alert in #vcs
13:31:24 kafka restarted on hgssh4
XX:XX:XX kafka bounced on other machines and again on hgssh; other attempts to run `hg replicatesync`
...
16:XX:XX gps runs `hg pull` on hgweb machines; lots of missing changesets pulled in
16:XX:XX gps also runs `hg replicatesync`, manages to make Kafka unhappy

Included in the autoland push is this traceback:

  File "/var/hg/version-control-tools/pylib/vcsreplicator/vcsreplicator/hgext.py", line 359, in wireprotodispatch
    return orig(repo, proto, command)
  File "/var/hg/venv_pash/lib/python2.7/site-packages/mercurial/wireproto.py", line 570, in dispatch
    return func(repo, proto, *args)
  File "/var/hg/venv_pash/lib/python2.7/site-packages/mercurial/wireproto.py", line 982, in unbundle
    proto._client())
  File "/var/hg/venv_pash/lib/python2.7/site-packages/mercurial/exchange.py", line 1777, in unbundle
    lockandtr.close()
  File "/var/hg/venv_pash/lib/python2.7/site-packages/mercurial/transaction.py", line 43, in _active
    return func(self, *args, **kwds)
  File "/var/hg/venv_pash/lib/python2.7/site-packages/mercurial/transaction.py", line 490, in close
    self._postclosecallback[cat](self)
  File "/var/hg/version-control-tools/hgext/pushlog/__init__.py", line 245, in onpostclose
    conn.commit()
OperationalError: database is locked

Essentially, SQLite raised an exception when attempting to close the transaction adding the pushlog entry. The Mercurial transaction had completed, so the changeset data was there but the pushlog data was never committed.

So, the summary of events is:

1. An error writing to the pushlog meant we recorded Mercurial data without a pushlog entry for head 82464b3a6ebd and the prior 13 changesets
2. Treeherder started choking because of the missing pushlog data
3. Upon suspicion that things were wedged, `hg replicatesync` was executed. This overwhelmed and wedged Kafka.
4. Heartbeat checks and pushes stopped working completely
5. Kafka was bounced
6. Steps 3-5 were repeated a few times
7. Eventually, `hg pull` was run from hgweb to obtain the missing changesets, and everything was restored

I see the following potential follow-up actions:

1. Fix pushlog so this database lock issue doesn't happen (I suspect we're using SQLite suboptimally and are running into a race between readers and a writer)
2. Treeherder to potentially react more robustly if pushlog data is missing (although fail fast applies, and I'd like to say that hg.mo can guarantee pushlog data existence)
3. Figure out what is causing Kafka to get wedged during `hg replicatesync` and prevent that from happening (I think it chokes on a very large message)
(In reply to Gregory Szorc [:gps] from comment #9)
> 2. Treeherder to potentially react more robustly if pushlog data is missing
> (although fail fast applies and I'd like to say that hg.mo can guarantee
> pushlog data existence)

I could be wrong, but I don't think there was any issue on the Treeherder side. Each push received over Pulse is ingested separately, and its `push_full_json_url` property is used to determine which json-pushes URL to retrieve. However, none of the more recent pushes existed in the pushlog - not just those 13. For example, the 9b6f166ff58b commit mentioned in comment 1 didn't have a pushlog entry, and that was ~25 pushes after the last good push at the time of the incident.
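To illustrate the ingestion step being discussed, here is a hypothetical sketch of resolving a revision against a version=2, full=1 json-pushes payload. `find_push` and the sample payload are illustrative only, not Treeherder's actual code; the point is that a commit with no pushlog entry simply never appears in any push.

```python
import json

# Hypothetical helper: scan a json-pushes?version=2&full=1 payload for the
# push containing a given (possibly abbreviated) revision.
def find_push(payload, revision):
    """Return (push_id, push) for the push containing `revision`, else None."""
    for push_id, push in payload.get("pushes", {}).items():
        if any(cs["node"].startswith(revision) for cs in push.get("changesets", [])):
            return int(push_id), push
    return None  # no pushlog entry: the state this bug describes

# Illustrative payload shaped like the version-2 API response.
sample = json.loads("""
{"lastpushid": 210595,
 "pushes": {"210595": {"user": "someone@example.org", "date": 1501845679,
   "changesets": [{"node": "9b6f166ff58b0000000000000000000000000000"}]}}}
""")
```

During the incident, a lookup like `find_push(payload, "9b6f166ff58b")` would have returned None for every commit after the last good push, even though the commits themselves were in the repo.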
Regarding the interaction between Mercurial and SQLite, the architectural problem is that we don't have atomic transactions. Mercurial has its transaction. SQLite has its transaction. A failure to close SQLite's transaction doesn't roll back Mercurial's transaction.

Way back in bug 966545 I improved the pushlog code to at least be scoped to Mercurial's transaction. Before that, if you aborted a push, we would orphan data in the SQLite database and things would get confused. For several years, we told people not to ^C pushes or it could corrupt repos. After that change, we roll back the SQLite transaction if Mercurial aborts its transaction.

Transactions are fundamentally a hard problem, more so when you try to link discrete software components to the same logical transaction. The thing making it difficult in this case is that there is no easy way to "undo" the last committed transaction in Mercurial or SQLite. If you have discrete transactions, you need a way to undo the last transaction if a subsequent transaction fails, in order to make them behave as an atomic unit. Lacking a mechanism to do this, we have to hope and pray or work outside the box.

Until today, we had never had (or at least never recorded) an instance of the SQLite transaction failing after the Mercurial transaction closed, so the "hope and pray" solution has worked. But since it failed, we should probably do something. Ideas I see:

1) Create a copy of the SQLite database on transaction open. Save SQLite's transaction before Mercurial's. Use Mercurial's built-in restore-file-on-rollback/copy-file-on-close primitives to lean on file copies instead of SQLite transactions to achieve atomic transactions. (There will likely be a perf hit, and the change is quite hacky.)

2) Improve our usage of SQLite so we don't hit database locks, and/or retry harder when a lock is observed. This is still "hope and pray", but it might be good enough.
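Idea (2) could look something like the following sketch: retry the COMMIT with backoff when SQLite reports the database is locked, instead of failing the whole push on the first attempt. The helper name and parameters are illustrative, not the actual pushlog code.

```python
import sqlite3
import time

# Hypothetical retry wrapper for a pushlog-style COMMIT. Retries only on
# "database is locked" (SQLITE_BUSY as surfaced by Python); any other error,
# or exhaustion of attempts, re-raises.
def commit_with_retry(conn, attempts=5, delay=0.1):
    for attempt in range(attempts):
        try:
            conn.commit()
            return True
        except sqlite3.OperationalError as exc:
            if "locked" not in str(exc) or attempt == attempts - 1:
                raise
            time.sleep(delay * (2 ** attempt))  # exponential backoff
    return False
```

This is still "hope and pray" in the sense gps describes: if every retry loses the race against readers, the pushlog entry is still dropped while the Mercurial transaction stands.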
(In reply to Ed Morley [:emorley] from comment #10)
> Each push received over Pulse is ingested separately, and the
> `push_full_json_url` property used to determine what jsonpushlog URL to
> retrieve. However none of the more recent pushes existed in the pushlog -
> not just those 13.

OK. Then it appears the hgweb machines didn't replicate pushlog entries until I ran a manual `hg pull`. We don't log the raw command output when running some replication commands, so I can't confirm this, but it is a plausible explanation.

So additional follow-ups would be:

* Log output from replication commands to facilitate forensics
* Make pushlog mirroring more robust in the face of missing pushlog data
The pushlog SQLite databases don't use WAL, which means there is contention between readers and writers.

Unfortunately, I don't think we can use WAL, because the pushlog on hgssh is hosted on NFS. The WAL index is in shared memory, which means it won't work across machines. While we typically only have one master, a shared-memory index for SQLite feels brittle to me: I don't want to be dealing with issues from multiple machines accessing the SQLite database and touching the index. You can disable the use of shared memory for the WAL index (https://sqlite.org/wal.html#noshm), but that requires locking_mode=EXCLUSIVE, which prevents concurrent readers, and we need concurrent reads for replication requests. So I think WAL is out.

Refreshing my brain on transaction semantics (https://www.sqlite.org/lang_transaction.html#immediate), we /could/ move from a deferred to an IMMEDIATE transaction. This obtains a RESERVED lock immediately, thus locking out other writers. However, a RESERVED lock will also lock out other readers. That's not desirable, because it would prevent replication.

I think the best we can do is retry harder when the COMMIT fails. That being said, if we actually saw SQLITE_LOCKED (https://www.sqlite.org/rescode.html#locked) from Python, that would mean the lock is coming from within the Python process rather than from a reader. Perhaps we are inadvertently opening multiple SQLite connections from the same Python process? Given the way the pushlog object is cached on the Mercurial repo, this is certainly possible...
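The reader/writer contention described above can be reproduced in a few lines. This is an illustrative sketch, not the real pushlog code: in rollback-journal mode (no WAL), a reader holding a SHARED lock blocks a writer's COMMIT, which Python surfaces as `OperationalError: database is locked` (SQLITE_BUSY).

```python
import os
import sqlite3
import tempfile

# Hypothetical file layout; the schema is illustrative, not pushlog2.db's.
db = os.path.join(tempfile.mkdtemp(), "pushlog2.db")

writer = sqlite3.connect(db, timeout=0)  # timeout=0: fail immediately on busy
writer.execute("CREATE TABLE pushlog (id INTEGER PRIMARY KEY, head TEXT)")
writer.commit()

# A reader in an open transaction holds a SHARED lock until it ends.
reader = sqlite3.connect(db, timeout=0, isolation_level=None)
reader.execute("BEGIN")
reader.execute("SELECT COUNT(*) FROM pushlog").fetchone()

writer.execute("INSERT INTO pushlog (head) VALUES ('82464b3a6ebd')")
try:
    writer.commit()  # COMMIT needs an EXCLUSIVE lock; the reader blocks it
    locked = False
except sqlite3.OperationalError as exc:
    locked = "locked" in str(exc)

reader.rollback()  # the reader releases its lock...
writer.commit()    # ...and the same COMMIT now succeeds
```

This mirrors the incident: the failed COMMIT leaves the write uncommitted even though everything else (here, the writer's other work; in production, Mercurial's transaction) already succeeded.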
According to the SQLite source code, "database is locked" corresponds to SQLITE_BUSY. So this is likely a reader locking out a writer. We should increase the busy timeout when we commit the transaction to maximize our chances of success.
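A sketch of what the busy-timeout mitigation buys us, under illustrative names (not the real pushlog code): with a busy timeout set, SQLite's busy handler retries the COMMIT internally while the reader finishes, instead of failing immediately. Python's `timeout` connect parameter sets this; `PRAGMA busy_timeout` does the same.

```python
import os
import sqlite3
import tempfile
import threading

db = os.path.join(tempfile.mkdtemp(), "pushlog2.db")

writer = sqlite3.connect(db, timeout=5.0)  # wait up to 5s on SQLITE_BUSY
writer.execute("CREATE TABLE pushlog (id INTEGER PRIMARY KEY, head TEXT)")
writer.commit()

# A reader holding a SHARED lock, as in the previous demonstration.
reader = sqlite3.connect(db, timeout=0, isolation_level=None,
                         check_same_thread=False)
reader.execute("BEGIN")
reader.execute("SELECT COUNT(*) FROM pushlog").fetchone()

# Simulate the reader finishing 200ms from now, while COMMIT is waiting.
threading.Timer(0.2, reader.rollback).start()

writer.execute("INSERT INTO pushlog (head) VALUES ('82464b3a6ebd')")
writer.commit()  # blocks briefly, then succeeds once the reader releases
rows = writer.execute("SELECT COUNT(*) FROM pushlog").fetchone()[0]
```

The trade-off is that a sufficiently long-lived reader still exhausts the timeout, so this raises the odds of success rather than guaranteeing it.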
This seems to have happened again today. https://treeherder.mozilla.org/#/jobs?repo=try&revision=9fd674cc9b840320a1f39421141863058c74e45f and https://treeherder.mozilla.org/#/jobs?repo=try&revision=a4a9037ae138067133c64a7d4b83321a26aa8c12 are two pushes I made this morning, and the latest entry on https://hg.mozilla.org/try/pushloghtml is [was] from ~90 minutes ago.
Tried running `hg pull` on the webheads to catch try back up, but I'm clearly missing something, as it was a no-op. There weren't any 'database is locked' errors that I could find on hgssh4; presumably they occurred elsewhere.
Correct me if I'm wrong, but SQLite only has table-wide locks, right? Is it the right tool for us if we're having concurrency problems? In MySQL with InnoDB, for instance, locking is generally row-level, so lock contention is much less likely.
We use SQLite because it is simple and fits the requirements (assuming we use it correctly). Using something more complex like MySQL introduces more failure modes, operational complexity, etc.
No new pushes on Treeherder for Try for the last 2 hours. Kendall, please take a look.
Restarted kafka, just in case. Pulled data over to the webheads.
Pushed by email@example.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/0795a1fde163
pushlog: increase busy_timeout
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
I suspect that change (which I'm deploying now) will wallpaper over the issue; so far I have yet to reproduce in tests the exact failure we're seeing in production. Speaking of tests, I intend to land some more test coverage for database contention, but that doesn't need to block the server change.
This broke again a few minutes ago. Had to do an `hg pull` to fix. Ugh.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I suspect bug 1297153 may be partially responsible for some of the failures. Anyway, I don't think we've had a pushlog failure since the timeout increase landed and the try heads were pruned over the past several days. While I'm still concerned about what caused this, I'm willing to call it resolved once bug 1297153 lands.
There is no indication that this is still a problem.
Status: REOPENED → RESOLVED
Last Resolved: a year ago → a year ago
Resolution: --- → FIXED