Closed Bug 1475656 Opened 3 years ago Closed 3 years ago

FilteredIndexError accessing automationrelevance web command

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Assigned: gps)

References

Details

Attachments

(1 file)

After deploying bug 1470606 to production, we're seeing intermittent errors like the following on <repo>/json-automationrelevance/<rev> requests:

[Fri Jul 13 19:13:37.537429 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730] mod_wsgi (pid=29134): Exception occurred processing WSGI script '/repo_local/mozilla/webroot_wsgi/hgweb.wsgi'.
[Fri Jul 13 19:13:37.537517 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730] Traceback (most recent call last):
[Fri Jul 13 19:13:37.537539 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/hgweb/hgwebdir_mod.py", line 229, in run_wsgi
[Fri Jul 13 19:13:37.537612 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]     for r in self._runwsgi(req):
[Fri Jul 13 19:13:37.537627 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/hgweb/hgweb_mod.py", line 318, in run_wsgi
[Fri Jul 13 19:13:37.537808 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]     for r in self._runwsgi(req, repo):
[Fri Jul 13 19:13:37.537823 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]   File "/var/hg/version-control-tools/hgext/serverlog/__init__.py", line 316, in _runwsgi
[Fri Jul 13 19:13:37.537847 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]     for what in super(hgwebwrapped, self)._runwsgi(req, repo):
[Fri Jul 13 19:13:37.537859 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/hgweb/hgweb_mod.py", line 446, in _runwsgi
[Fri Jul 13 19:13:37.537875 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]     content = getattr(webcommands, cmd)(rctx, req, tmpl)
[Fri Jul 13 19:13:37.537887 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/hgweb/webcommands.py", line 455, in changeset
[Fri Jul 13 19:13:37.538155 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]     return tmpl('changeset', **webutil.changesetentry(web, req, tmpl, ctx))
[Fri Jul 13 19:13:37.538170 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]   File "/var/hg/version-control-tools/hgext/hgmo/__init__.py", line 307, in changesetentry
[Fri Jul 13 19:13:37.538399 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]     d = orig(web, req, tmpl, ctx)
[Fri Jul 13 19:13:37.538417 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]   File "/var/hg/version-control-tools/hgext/firefoxreleases/__init__.py", line 188, in changesetentry
[Fri Jul 13 19:13:37.538519 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]     d = orig(web, req, tmpl, ctx)
[Fri Jul 13 19:13:37.538533 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/hgweb/webutil.py", line 452, in changesetentry
[Fri Jul 13 19:13:37.538673 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]     **commonentry(web.repo, ctx))
[Fri Jul 13 19:13:37.538687 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]   File "/var/hg/version-control-tools/hgext/pushlog/__init__.py", line 217, in commonentry
[Fri Jul 13 19:13:37.538898 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]     d = orig(repo, ctx)
[Fri Jul 13 19:13:37.538911 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/hgweb/webutil.py", line 375, in commonentry
[Fri Jul 13 19:13:37.538931 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]     'inbranch': nodeinbranch(repo, ctx),
[Fri Jul 13 19:13:37.538943 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/hgweb/webutil.py", line 230, in nodeinbranch
[Fri Jul 13 19:13:37.538962 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]     branchnode = repo.branchtip(branch)
[Fri Jul 13 19:13:37.538975 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/localrepo.py", line 964, in branchtip
[Fri Jul 13 19:13:37.539002 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]     return self.branchmap().branchtip(branch)
[Fri Jul 13 19:13:37.539013 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/localrepo.py", line 946, in branchmap
[Fri Jul 13 19:13:37.539028 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]     branchmap.updatecache(self)
[Fri Jul 13 19:13:37.539039 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/branchmap.py", line 111, in updatecache
[Fri Jul 13 19:13:37.539168 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]     partial.update(repo, revs)
[Fri Jul 13 19:13:37.539182 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/branchmap.py", line 296, in update
[Fri Jul 13 19:13:37.539201 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]     ancestors = set(cl.ancestors(newheadrevs, floorrev))
[Fri Jul 13 19:13:37.539212 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/ancestor.py", line 328, in __iter__
[Fri Jul 13 19:13:37.539311 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]     for parent in parentrevs(visit.popleft()):
[Fri Jul 13 19:13:37.539324 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/changelog.py", line 379, in parentrevs
[Fri Jul 13 19:13:37.539805 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730]     raise error.FilteredIndexError(rev)
[Fri Jul 13 19:13:37.539836 2018] [wsgi:error] [pid 29134] [remote 10.22.74.212:45730] FilteredIndexError: 1207896

The automationrelevance web command accesses an unfiltered repository. This has access to unreplicated changesets. One of these likely leaks into other code, causing the FilteredIndexError.
Looking at log timestamps, I'm pretty convinced that all these failures occur in the window between when the local repository has replicated data and when that new data is exposed via hgweb. But I'm not yet able to reproduce the failure in a test...
@localrepo.repofilecache doesn't do what its docstring says it does.
The docstring says that the function result will be cached for as long
as stat() on the listed file(s) returns the same result. In reality,
the function result is cached until the cache is cleared, at which
point a stat() will be performed on next access and the cached result
will be returned, if possible. And the cache is only cleared when
the localrepository object is mutated by itself.

(I have submitted a patch upstream to have the docstring reflect
reality.)

In the context of hgweb, localrepository instances are cached between
requests. And there's a list of files in mercurial.hg.foi that
are stat()ed to determine whether a cached localrepository instance
can be used for a subsequent request.

Without .hg/replicated-data in mercurial.hg.foi, this file could change
on disk and a subsequent HTTP request could use the cached value of the
replicated_data attribute. This would lead to badness. The old cached
value would linger until either:

a) the cached repo was evicted from the repo cache
b) the hgweb process exits
c) any other path in mercurial.hg.foi changes (they would change when
   new changesets are added to the repo)
I'm not 100% convinced the patch I just submitted will fix the issue. But the patch certainly fixes *an* issue that could cause the failure. So I'm optimistic.
Assignee: nobody → gps
Status: NEW → ASSIGNED
Comment on attachment 8992091 [details]
vcsreplicator: invalidate repositories when .hg/replicated-data changes (bug 1475656); r?glob, sheehan

Byron Jones ‹:glob› 🎈 has approved the revision.

https://phabricator.services.mozilla.com/D2147
Attachment #8992091 - Flags: review+
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/a3a8b163e065
vcsreplicator: invalidate repositories when .hg/replicated-data changes ; r=glob
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
We are still seeing failures with the original stack after this was deployed.

Looking at the code some more, I have some more theories...
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I performed some big pushes to try this evening. For reasons unknown to me, hgweb6 is *very* slow at replicating. Much slower than its "identical" hgweb5 and hgweb16. This resulted in hgweb6 being several minutes behind hgweb[12] at times.

When things were out of sync, we saw a bunch of FilteredRepoError on the machines that had fully replicated everything.

And json-pushes was giving 404 errors for startID=X&endID=Y queries. But the 404's only occurred when full=1 appeared in the querystring. The stack is getting swallowed. But I suspect the ctx.description() that is called when full=1 appears is triggering the same failure in comment 0.

So it appears branchmap cache generation on a fully replicated but not globally synced repository doesn't work. At least in some DAG shapes. I qualify that because I can't reproduce this locally.

This will actively cause has hardship in CI until it is resolved. So this will be my #1 priority until it is fixed. Ugh.
Status: REOPENED → ASSIGNED
Priority: -- → P1
Keywords: leave-open
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/8f4111327c5d
ansible/hg-web: disable vcsreplicator extension
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/ce8a6ecf5464
vcsreplicator: enable hgweb extension again 
https://hg.mozilla.org/hgcustom/version-control-tools/rev/08ec2f5f0472
vcsreplicator: make replicated_data a regular @property
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/d47e8b2ce5f3
ansible/hg-web: disable vcsreplicator extension
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/99190d8a6d8b
vcsreplicator: copy branchmap.updatecache into extension 
https://hg.mozilla.org/hgcustom/version-control-tools/rev/3359ab172551
vcsreplicator: add forensic logging to branch cache updates
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/d9fdfdebb58b
vcsreplicator: remove @property experiment 
https://hg.mozilla.org/hgcustom/version-control-tools/rev/3c511be410fa
vcsreplicator: change how we look for missing changesets 
https://hg.mozilla.org/hgcustom/version-control-tools/rev/4169f09207aa
vcsreplicator: enable hgweb extension again
I think 3c511be410fa fixed things. It has been deployed for upwards of an hour and I haven't seen any odd errors in logs.

So I'm going to call this one done.
Status: ASSIGNED → RESOLVED
Closed: 3 years ago3 years ago
Keywords: leave-open
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.