Closed Bug 1118267 Opened 9 years ago Closed 9 years ago

hgweb8 had corrupted user repos

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Unassigned)

For instance, in this push there are a lot of them:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c99df8976076

I think some of the hg heads are taking longer to have the user repositories updated.

Nevertheless, repository_manifest.py should be retrying a few times. However, it takes less than 2 seconds to execute, which means that we are not retrying.
From looking at the code, it is clear that I completely missed adding the retry logic for that second retrieval [1].

Judging by jgraham's pushes, this is happening more often than in the pushes I did before the Christmas break.

ERROR:root:https://hg.mozilla.org/users/james_hoppipolla.co.uk/mozharness/rev/tip
Traceback (most recent call last):
  File "repository_manifest.py", line 149, in main
    urllib2.urlopen(url, timeout=options.timeout)
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 406, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 444, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 404: Not Found
program finished with exit code 1
elapsedTime=1.555251

[1] hg.mozilla.org/build/tools/file/default/buildfarm/utils/repository_manifest.py#l149
hgweb servers aren't synchronized atomically. There will always be a small window where slave X has data before slave Y. This window should almost never be more than a few seconds. However, if a sync is aborted, we don't have a good mechanism to recover from that in a timely manner, meaning individual slaves can drift out of date for several minutes or hours.

I'm not yet sure what's going on here. But, user vs non-user repositories shouldn't matter.
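
To tell a short sync window apart from a host that has drifted, one option is to ask each hgweb host for its tip and compare the answers. The sketch below only does the comparison step. It assumes the per-host tips were already collected some other way (how to address an individual host is not shown in this bug), uses None for a host that returned an error such as a 404, and treats the most common tip as the expected one, which is a heuristic rather than anything the real sync code does. The function name and the host names other than hgweb8 are made up for illustration.

```python
from collections import Counter


def lagging_mirrors(tips_by_host):
    """Return the hosts whose tip differs from the most common tip.

    tips_by_host maps a host name to the tip revision it reported,
    or to None if the host answered with an error (for example a 404).
    """
    reported = [tip for tip in tips_by_host.values() if tip is not None]
    if not reported:
        # Nobody answered, so every host is suspect.
        return sorted(tips_by_host)
    # Heuristic: whatever most hosts agree on is taken as the expected tip.
    expected, _count = Counter(reported).most_common(1)[0]
    return sorted(host for host, tip in tips_by_host.items()
                  if tip != expected)


if __name__ == "__main__":
    # Hypothetical observations; "abc123" is a placeholder revision.
    observed = {"hgweb1": "abc123", "hgweb2": "abc123", "hgweb8": None}
    for host in lagging_mirrors(observed):
        print("out of sync: %s" % host)
```

Running the comparison a few times, some seconds apart, would separate the two cases: a host that shows up once and then catches up was inside the normal sync window, while a host that shows up every time needs a closer look.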
For some reason we're getting a very high number of 404s, which we never used to see back in December.

I will add the missing retry logic but something is not quite right in the backend.
https://hg.mozilla.org/users/james_hoppipolla.co.uk/mozharness is consistently 404ing on hgweb8.dmz.scl3.mozilla.com. I'm poking around now to see why this server is special.
The /repo_local/mozilla/webroot_wsgi/users/james_hoppipolla.co.uk directory and its hgweb files didn't exist on hgweb8.dmz.scl3.mozilla.com.

This is almost certainly a bug with pash and our server setup.
Component: Tools → General
Product: Release Engineering → Developer Services
QA Contact: hwine
James: Do you remember anything about when you created your user repo? Timeouts? Ctrl+c? Error messages?

You didn't do anything wrong. I'm just curious.
Flags: needinfo?(james)
I don't remember any such errors, but it was a while ago. I think gbrown was having a similar issue, maybe he recalls something.
Flags: needinfo?(james) → needinfo?(gbrown)
gbrown also has the same problem on this host. So do jgriffin, wlach, and a host of others.
Auditing reveals hgweb8 is the only host with a sync problem.
missing user wsgi dir: atolfsen_mozilla.com
missing user wsgi dir: build
missing user wsgi dir: cvs-trunk-base-old
missing user wsgi dir: edilee_gmail.com
missing user wsgi dir: gbrown_mozilla.com
missing user wsgi dir: jgriffin_mozilla.com
missing user wsgi dir: johnlzeller_gmail.com
missing user wsgi dir: mozilla_christophkerschbaumer.com
missing user wsgi dir: pbrosset_mozilla.com
missing user wsgi dir: ricky060709_gmail.com
missing user wsgi dir: sledru_mozilla.com
missing user wsgi dir: tchou_mozilla.com
missing user wsgi dir: wlachance_mozilla.com

build and cvs-trunk-base-old are missing on all hosts.
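
An audit like the one above can be approximated by comparing two directory listings on each host. This is a minimal sketch, not the script that produced the list: the function names are made up, and the location of the user repositories is an assumption passed in as a parameter. Only the wsgi side (a users directory under webroot_wsgi) is taken from the paths mentioned in this bug.

```python
import os


def missing_wsgi_dirs(user_repos_root, wsgi_users_root):
    """Return sorted user names that have a repo dir but no wsgi dir."""
    users = [name for name in os.listdir(user_repos_root)
             if os.path.isdir(os.path.join(user_repos_root, name))]
    return sorted(name for name in users
                  if not os.path.isdir(os.path.join(wsgi_users_root, name)))


def report(user_repos_root, wsgi_users_root):
    """Print one line per missing directory, in the format used above."""
    for name in missing_wsgi_dirs(user_repos_root, wsgi_users_root):
        print("missing user wsgi dir: %s" % name)
```

Running the same check on every hgweb host and diffing the output is what shows whether a missing directory is specific to one host (like the user dirs on hgweb8) or missing everywhere (like build and cvs-trunk-base-old).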
I created my user repo just a couple of weeks ago and did not see any errors at that time. However, I had the curious problem that I could not clone from http immediately after creating the repo -- I had to clone from ssh.
Flags: needinfo?(gbrown)
I manually created the wsgi dirs for the users mentioned in comment 10. Still no clue why they didn't get created in the first place.

Leaving open until the underlying issue is fixed or proved to not be happening any more.
Component: General → Mercurial: hg.mozilla.org
QA Contact: hwine
The sync code only retries once, so if whatever issue causes the initial sync to fail lasts longer than the retry takes...
Assignee: armenzg → nobody
Summary: Getting intermittent 404 errors for repository_manifest.py → hgweb8 had corrupted user repos
I have added retry logic to the second fetch [1], which hard failed against hgweb8.
I assume that with the retry logic in place we will now land on other hgweb heads that are not corrupted. I also assume that this only affected a small number of users whose wsgi dir was missing on a specific web head.

In other words, even if we get into a similar state in the future, it would only affect try users attempting to use their mozharness user repo.

This does not block the general pinning of mozharness on non-try repositories (bug 1110286).

[1] http://hg.mozilla.org/build/tools/file/b0dc9e1cd9b9/buildfarm/utils/repository_manifest.py#l187
No longer blocks: 1110286
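
For readers who want the general shape of such a retry, here is a minimal sketch. It is not the code in repository_manifest.py (that script uses Python 2's urllib2, while this uses Python 3's urllib.request), and the function name, the attempt count, and the linear backoff are all assumptions made for illustration. The opener and sleep parameters exist only so the loop can be exercised without a network.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retries(url, attempts=5, delay=1.0, timeout=30,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url, retrying on HTTP and network errors.

    Returns the response body as bytes, or re-raises the last error
    once all attempts have been used.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            response = opener(url, timeout=timeout)
            try:
                return response.read()
            finally:
                response.close()
        except urllib.error.URLError as error:  # HTTPError is a subclass
            last_error = error
            if attempt < attempts:
                # Linear backoff: wait a little longer after each failure.
                sleep(delay * attempt)
    raise last_error
```

A retry like this only helps if a later request is served by a healthy web head, which is the assumption stated above. A host that 404s consistently still fails every request routed to it.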
smacleod just ran into this. His user directory was missing its wsgi files on this host. Running /usr/local/bin/make_user_wsgi_dirs.sh fixed things up. I think I'll add that script to cron or something.
OK. A cron job to run make_user_wsgi_dirs.sh is already present. But for whatever reason /repo/hg/webroot_wsgi is mostly root owned on hgweb8, as opposed to hg owned.
url:        https://hg.mozilla.org/hgcustom/version-control-tools/rev/14754f05d688174eca791f51c6f3e49f2cb8e776
changeset:  14754f05d688174eca791f51c6f3e49f2cb8e776
user:       Gregory Szorc <gps@mozilla.com>
date:       Mon Jun 29 13:39:31 2015 -0700
description:
ansible/hg-web: ensure /repo/hg/webroot_wsgi files are owned by hg (bug 1118267)

Historically, files under /repo/hg/webroot_wsgi were sometimes managed
by hand. Numerous files on the production machines are owned by the root
user when they should be owned by the hg user.

Add a task to mass chown this directory tree.
And with a deployment of the commit described above, all permissions in production should be sane and we should no longer have this problem.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
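
To confirm after a deployment that the ownership really is sane, a check along these lines could be run on each host. It is only a sketch under assumptions: the function name is made up, and it simply walks a tree and lists anything whose owner differs from the expected uid. It does not fix anything; the chown itself is what the ansible task in the commit above does.

```python
import os


def paths_not_owned_by(root, uid):
    """Return sorted paths under root (root included) not owned by uid."""
    mismatched = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # os.walk yields every directory as dirpath, so checking dirpath
        # plus its files covers each directory and file it visits.
        candidates = [dirpath]
        candidates += [os.path.join(dirpath, name) for name in filenames]
        for path in candidates:
            if os.lstat(path).st_uid != uid:
                mismatched.append(path)
    return sorted(mismatched)


# Possible usage on a Unix host (not executed here):
#   import pwd
#   hg_uid = pwd.getpwnam("hg").pw_uid
#   for path in paths_not_owned_by("/repo/hg/webroot_wsgi", hg_uid):
#       print("not owned by hg: %s" % path)
```

An empty result on every host would back up the claim that root-owned files are no longer getting in the way of the cron job that creates the user wsgi dirs.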