1118267 - hgweb8 had corrupted user repos

Reporter

Description

9 years ago

For instance, this push shows a lot of 404 errors:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c99df8976076

I think some of the hg heads are taking longer to have the user repositories updated.

Nevertheless, repository_manifest.py should retry a few times. However, it takes less than 2 seconds to execute, which means that we are not retrying.
From looking at the code, it is clear that I completely missed adding the retry logic for that second retrieval [1].

From looking at jgraham's pushes, this happens more often than it did in the pushes I made before the Christmas break.

ERROR:root:https://hg.mozilla.org/users/james_hoppipolla.co.uk/mozharness/rev/tip
Traceback (most recent call last):
  File "repository_manifest.py", line 149, in main
    urllib2.urlopen(url, timeout=options.timeout)
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 406, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 444, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 404: Not Found
program finished with exit code 1
elapsedTime=1.555251

[1] hg.mozilla.org/build/tools/file/default/buildfarm/utils/repository_manifest.py#l149

Gregory Szorc [:gps]

Comment 1

9 years ago

hgweb servers aren't synchronized atomically. There will always be a small window where slave X has data before slave Y. This window should almost never be more than a few seconds. However, if a sync is aborted, we don't have a good mechanism to recover from that in a timely manner, meaning individual slaves can drift out of date for several minutes or hours.

I'm not yet sure what's going on here. But, user vs non-user repositories shouldn't matter.

Armen [:armenzg]

Reporter

Comment 2

9 years ago

For some reason we're getting a very high number of 404s when we never used to back in December.

I will add the missing retry logic but something is not quite right in the backend.

Gregory Szorc [:gps]

Comment 3

9 years ago

https://hg.mozilla.org/users/james_hoppipolla.co.uk/mozharness is consistently 404ing on hgweb8.dmz.scl3.mozilla.com. I'm poking around now to see why this server is special.

Gregory Szorc [:gps]

Comment 4

9 years ago

The /repo_local/mozilla/webroot_wsgi/users/james_hoppipolla.co.uk directory and its hgweb files didn't exist on hgweb8.dmz.scl3.mozilla.com.

This is almost certainly a bug with pash and our server setup.

Component: Tools → General

Product: Release Engineering → Developer Services

QA Contact: hwine

Gregory Szorc [:gps]

Comment 5

9 years ago

James: Do you remember anything about when you created your user repo? Timeouts? Ctrl+c? Error messages?

You didn't do anything wrong. I'm just curious.

Flags: needinfo?(james)

James Graham [:jgraham]

Comment 6

9 years ago

I don't remember any such errors, but it was a while ago. I think gbrown was having a similar issue, maybe he recalls something.

Flags: needinfo?(james) → needinfo?(gbrown)

Gregory Szorc [:gps]

Comment 7

9 years ago

gbrown has the same problem on this host. So do jgriffin, wlach, and a host of others.

Gregory Szorc [:gps]

Comment 8

9 years ago

https://hg.mozilla.org/hgcustom/version-control-tools/rev/cd83295d7965

Gregory Szorc [:gps]

Comment 9

9 years ago

Auditing reveals hgweb8 is the only host with a sync problem.

Gregory Szorc [:gps]

Comment 10

9 years ago

missing user wsgi dir: atolfsen_mozilla.com
missing user wsgi dir: build
missing user wsgi dir: cvs-trunk-base-old
missing user wsgi dir: edilee_gmail.com
missing user wsgi dir: gbrown_mozilla.com
missing user wsgi dir: jgriffin_mozilla.com
missing user wsgi dir: johnlzeller_gmail.com
missing user wsgi dir: mozilla_christophkerschbaumer.com
missing user wsgi dir: pbrosset_mozilla.com
missing user wsgi dir: ricky060709_gmail.com
missing user wsgi dir: sledru_mozilla.com
missing user wsgi dir: tchou_mozilla.com
missing user wsgi dir: wlachance_mozilla.com

build and cvs-trunk-base-old are missing on all hosts.
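
An audit like the one above can be sketched as a comparison of two directory listings. This is only an illustration, not the script that produced the output: the function names are hypothetical, and both directory arguments are placeholders because the thread does not say where the user repositories live on disk.

```python
import os


def missing_wsgi_dirs(repo_users_root, wsgi_users_root):
    """Return sorted user names that have a repo directory but no wsgi directory.

    Both arguments are assumptions: pass whatever directories hold the
    per-user repositories and the per-user wsgi files on the host.
    """
    users = [
        name
        for name in os.listdir(repo_users_root)
        if os.path.isdir(os.path.join(repo_users_root, name))
    ]
    return sorted(
        name
        for name in users
        if not os.path.isdir(os.path.join(wsgi_users_root, name))
    )


def format_report(names):
    """Format names in the same style as the audit output in this comment."""
    return ["missing user wsgi dir: %s" % name for name in names]
```

Running the same comparison on every hgweb host and diffing the reports would show which entries are missing everywhere and which are missing on a single host only.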

Geoff Brown [:gbrown]

Comment 11

9 years ago

I created my user repo just a couple of weeks ago and did not see any errors at that time. However, I had the curious problem that I could not clone from http immediately after creating the repo -- I had to clone from ssh.

Flags: needinfo?(gbrown)

Gregory Szorc [:gps]

Comment 12

9 years ago

I manually created the wsgi dirs for the users mentioned in comment 10. Still no clue why they didn't get created in the first place.

Leaving open until the underlying issue is fixed or proved to not be happening any more.

Gregory Szorc [:gps]

Updated

9 years ago

Component: General → Mercurial: hg.mozilla.org

QA Contact: hwine

Kendall Libby [:fubar] (he/him)

Comment 13

9 years ago

The sync code only retries once, so if whatever issue causes the initial sync to fail lasts longer than the retry takes...

Armen [:armenzg]

Reporter

Updated

9 years ago

Assignee: armenzg → nobody

Summary: Getting intermittent 404 errors for repository_manifest.py → hgweb8 had corrupted user repos

Armen [:armenzg]

Reporter

Comment 14

9 years ago

I have added retry logic to the second fetch [1], which hard-failed against hgweb8.
I assume that, with the retry logic, we will now reach other hgweb heads that were not corrupted. I also assume that this only affected a small number of users who lacked their wsgi dir on a specific web head.

In other words, even if we get into a similar state in the future, it would only affect try users attempting to use their mozharness user repo.

This does not block the general pinning of mozharness on non-try repositories (bug 1110286).

[1] http://hg.mozilla.org/build/tools/file/b0dc9e1cd9b9/buildfarm/utils/repository_manifest.py#l187
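
For readers who want the general shape of such a retry, here is a minimal sketch. It is not the code that landed in repository_manifest.py: it uses Python 3's urllib.request rather than the urllib2 seen in the traceback, and the function name, the attempt count, the delay, and the injectable `opener` and `sleep` parameters are all assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retries(url, attempts=5, delay=1.0, timeout=30,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """Return the body of url, retrying on HTTP and network errors.

    Hypothetical helper: waits delay * attempt seconds between tries and
    re-raises the last error once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except OSError as error:
            # HTTPError (e.g. the 404 above) and URLError both derive
            # from OSError, so this also covers socket-level failures.
            last_error = error
            if attempt < attempts:
                sleep(delay * attempt)
    raise last_error
```

A retry only helps here if a later request happens to be served by a web head that has the repository, so it masks a single bad host rather than fixing it.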

No longer blocks: 1110286

Gregory Szorc [:gps]

Comment 15

9 years ago

smacleod just ran into this. Was missing wsgi files for his user directory on this host. Running /usr/local/bin/make_user_wsgi_dirs.sh fixed things up. I think I'll add that script to CRON or something.

Gregory Szorc [:gps]

Comment 16

9 years ago

OK. CRON to run make_user_wsgi_dirs.sh is already present. But for whatever reason /repo/hg/webroot_wsgi is mostly root owned on hgweb8 as opposed to hg owned.
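
One way to spot this kind of ownership drift is to walk the tree and list every path whose owner is not the expected one. The sketch below is an illustration under assumptions: the function name is hypothetical, it is not part of version-control-tools, and it assumes a POSIX host where `st_uid` is meaningful.

```python
import os


def paths_not_owned_by(root, expected_uid):
    """Return sorted paths under root (root included) not owned by expected_uid."""
    mismatched = []
    if os.lstat(root).st_uid != expected_uid:
        mismatched.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat inspects symlinks themselves instead of their targets.
            if os.lstat(path).st_uid != expected_uid:
                mismatched.append(path)
    return sorted(mismatched)
```

On the server, the expected uid could be looked up with `pwd.getpwnam("hg").pw_uid` and the root would be /repo/hg/webroot_wsgi. An empty result would mean the cron job can write everywhere it needs to.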

Gregory Szorc [:gps]

Comment 17

9 years ago

url:        https://hg.mozilla.org/hgcustom/version-control-tools/rev/14754f05d688174eca791f51c6f3e49f2cb8e776
changeset:  14754f05d688174eca791f51c6f3e49f2cb8e776
user:       Gregory Szorc <gps@mozilla.com>
date:       Mon Jun 29 13:39:31 2015 -0700
description:
ansible/hg-web: ensure /repo/hg/webroot_wsgi files are owned by hg (bug 1118267)

Historically, files under /repo/hg/webroot_wsgi were sometimes managed
by hand. Numerous files on the production machines are owned by the root
user when they should be owned by the hg user.

Add a task to mass chown this directory tree.

Gregory Szorc [:gps]

Comment 18

9 years ago

And with a deployment of the commit described above, all permissions in production should be sane and we should no longer have this problem.

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → FIXED

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

Tracking

(Not tracked)

People

(Reporter: armenzg, Unassigned)
