Closed
Bug 1118267
Opened 9 years ago
Closed 9 years ago
hgweb8 had corrupted user repos
Categories
(Developer Services :: Mercurial: hg.mozilla.org, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: armenzg, Unassigned)
References
Details
For instance, in this push there are a lot of them: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c99df8976076

I think some of the hg heads are taking longer to have the user repositories updated. Nevertheless, repository_manifest.py should be retrying a few times; however, it takes less than 2 seconds to execute, which means that we are not retrying. From looking at the code, it is clear that I completely missed adding the retry logic for that second retrieval [1].

Judging from jgraham's pushes, this happens more often than in the pushes I did before the Christmas break.

ERROR:root:https://hg.mozilla.org/users/james_hoppipolla.co.uk/mozharness/rev/tip
Traceback (most recent call last):
  File "repository_manifest.py", line 149, in main
    urllib2.urlopen(url, timeout=options.timeout)
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 406, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 444, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 404: Not Found
program finished with exit code 1
elapsedTime=1.555251

[1] hg.mozilla.org/build/tools/file/default/buildfarm/utils/repository_manifest.py#l149
Comment 1•9 years ago
hgweb servers aren't synchronized atomically. There will always be a small window where slave X has data before slave Y. This window should almost never be more than a few seconds. And, if a sync is aborted, we don't have a good mechanism to recover from that in a timely manner, meaning individual slaves can drift out of date for several minutes or hours. I'm not yet sure what's going on here. But, user vs non-user repositories shouldn't matter.
Reporter
Comment 2•9 years ago
For some reason we're getting a very high number of 404s, which we never saw back in December. I will add the missing retry logic, but something is not quite right in the backend.
Comment 3•9 years ago
https://hg.mozilla.org/users/james_hoppipolla.co.uk/mozharness is consistently 404ing on hgweb8.dmz.scl3.mozilla.com. I'm poking around now to see why this server is special.
Comment 4•9 years ago
The /repo_local/mozilla/webroot_wsgi/users/james_hoppipolla.co.uk directory and its hgweb files didn't exist on hgweb8.dmz.scl3.mozilla.com. This is almost certainly a bug with pash and our server setup.
Component: Tools → General
Product: Release Engineering → Developer Services
QA Contact: hwine
Comment 5•9 years ago
James: Do you remember anything about when you created your user repo? Timeouts? Ctrl+c? Error messages? You didn't do anything wrong. I'm just curious.
Flags: needinfo?(james)
Comment 6•9 years ago
I don't remember any such errors, but it was a while ago. I think gbrown was having a similar issue, maybe he recalls something.
Flags: needinfo?(james) → needinfo?(gbrown)
Comment 7•9 years ago
gbrown also has the same problem on this host. So does jgriffin, wlach, and a host of others.
Comment 9•9 years ago
Auditing reveals hgweb8 is the only host with a sync problem.
Comment 10•9 years ago
missing user wsgi dir: atolfsen_mozilla.com
missing user wsgi dir: build
missing user wsgi dir: cvs-trunk-base-old
missing user wsgi dir: edilee_gmail.com
missing user wsgi dir: gbrown_mozilla.com
missing user wsgi dir: jgriffin_mozilla.com
missing user wsgi dir: johnlzeller_gmail.com
missing user wsgi dir: mozilla_christophkerschbaumer.com
missing user wsgi dir: pbrosset_mozilla.com
missing user wsgi dir: ricky060709_gmail.com
missing user wsgi dir: sledru_mozilla.com
missing user wsgi dir: tchou_mozilla.com
missing user wsgi dir: wlachance_mozilla.com

build and cvs-trunk-base-old are missing on all hosts.
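An audit like the one above could be sketched roughly as follows. This is not the script that was actually run; the function names and the two directory arguments are hypothetical, and it assumes each user has one directory under a user repo root and should have a matching directory under the wsgi root.

```python
import os


def missing_wsgi_dirs(user_names, wsgi_names):
    """Return the users that have a repo directory but no wsgi directory.

    Both arguments are plain iterables of directory names, so the
    comparison itself does not touch the filesystem.
    """
    present = set(wsgi_names)
    return sorted(name for name in set(user_names) if name not in present)


def list_subdirs(root):
    """Return the names of the immediate subdirectories of root."""
    return [
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    ]


def audit(user_repo_root, wsgi_user_root):
    """Print one line per user whose wsgi directory is missing.

    user_repo_root and wsgi_user_root are assumptions; pass whatever
    paths hold the user repos and the user wsgi dirs on the host.
    """
    users = list_subdirs(user_repo_root)
    wsgi = list_subdirs(wsgi_user_root)
    for name in missing_wsgi_dirs(users, wsgi):
        print("missing user wsgi dir: %s" % name)
```

Running the same comparison on every web head and diffing the results would show whether a directory is missing on one host or on all of them.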
Comment 11•9 years ago
I created my user repo just a couple of weeks ago and did not see any errors at that time. However, I had the curious problem that I could not clone from http immediately after creating the repo -- I had to clone from ssh.
Flags: needinfo?(gbrown)
Comment 12•9 years ago
I manually created the wsgi dirs for the users mentioned in comment 10. Still no clue why they didn't get created in the first place. Leaving open until the underlying issue is fixed or proved to not be happening any more.
Updated•9 years ago
Component: General → Mercurial: hg.mozilla.org
QA Contact: hwine
Comment 13•9 years ago
The sync code only retries once, so if whatever issue causes the initial sync to fail lasts longer than the retry takes...
Reporter
Updated•9 years ago
Assignee: armenzg → nobody
Summary: Getting intermittent 404 errors for repository_manifest.py → hgweb8 had corrupted user repos
Reporter
Comment 14•9 years ago
I have added retry logic to the second fetch [1], which hard failed against hgweb8. I assume that with the retry logic we will now hit other hg web heads that were not corrupted. I also assume that this only affected a small number of users who lacked their wsgi dir on a specific web head. In other words, even if we get into a similar state in the future, it would only affect try users attempting to use their mozharness user repo. This does not block the general pinning of mozharness on non-try repositories (bug 1110286).

[1] http://hg.mozilla.org/build/tools/file/b0dc9e1cd9b9/buildfarm/utils/repository_manifest.py#l187
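For readers who want the general shape of such a retry, here is a minimal sketch. It is not the code that landed in repository_manifest.py: the function name, the parameters, and the linear backoff are all assumptions, and it uses Python 3's urllib.request rather than the Python 2.7 urllib2 seen in the traceback.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retries(url, attempts=5, delay=1.0, timeout=30,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """Return the response body for url, retrying on HTTP/network errors.

    A 404 from one web head is treated as retryable here, on the
    assumption that a later request may be served by a different host.
    opener and sleep are injectable so the logic can be exercised
    without network access.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            response = opener(url, timeout=timeout)
            return response.read()
        except urllib.error.URLError as exc:  # HTTPError is a subclass
            last_error = exc
            if attempt < attempts:
                # Wait a little longer after each failed attempt.
                sleep(delay * attempt)
    raise last_error
```

Retrying only helps if requests are spread across web heads; if every request lands on the same broken host, the call still fails after the last attempt and re-raises the final error.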
No longer blocks: 1110286
Comment 15•9 years ago
smacleod just ran into this. Was missing wsgi files for his user directory on this host. Running /usr/local/bin/make_user_wsgi_dirs.sh fixed things up. I think I'll add that script to CRON or something.
Comment 16•9 years ago
OK. CRON to run make_user_wsgi_dirs.sh is already present. But for whatever reason /repo/hg/webroot_wsgi is mostly root owned on hgweb8 as opposed to hg owned.
Comment 17•9 years ago
url: https://hg.mozilla.org/hgcustom/version-control-tools/rev/14754f05d688174eca791f51c6f3e49f2cb8e776
changeset: 14754f05d688174eca791f51c6f3e49f2cb8e776
user: Gregory Szorc <gps@mozilla.com>
date: Mon Jun 29 13:39:31 2015 -0700
description:
ansible/hg-web: ensure /repo/hg/webroot_wsgi files are owned by hg (bug 1118267)

Historically, files under /repo/hg/webroot_wsgi were sometimes managed by hand. Numerous files on the production machines are owned by the root user when they should be owned by the hg user. Add a task to mass chown this directory tree.
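The commit fixes ownership with an Ansible task. To check a host for the same condition before or after such a deployment, something like the sketch below might be used. It is an illustration only, not part of the changeset; the function name is hypothetical, and the walk and lstat parameters exist only so the logic can be tested without a real directory tree.

```python
import os


def paths_not_owned_by(root, expected_uid, walk=os.walk, lstat=os.lstat):
    """Return paths under root whose owner uid differs from expected_uid.

    Every directory that walk yields is checked, along with the files
    it contains. lstat is used so a symlink reports its own owner
    rather than its target's. Symlinks that point at directories are
    neither followed nor checked by this simple version.
    """
    offenders = []
    for dirpath, _dirnames, filenames in walk(root):
        candidates = [dirpath]
        candidates.extend(os.path.join(dirpath, name) for name in filenames)
        for path in candidates:
            if lstat(path).st_uid != expected_uid:
                offenders.append(path)
    return offenders
```

On a real host the expected uid could be looked up with `pwd.getpwnam("hg").pw_uid`, assuming the account is named hg as the commit message suggests, and root would be /repo/hg/webroot_wsgi.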
Comment 18•9 years ago
And with a deployment of the commit described above, all permissions in production should be sane and we should no longer have this problem.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED