Closed Bug 1047207 Opened 10 years ago Closed 10 years ago

hgtool should retry or exit if it hits a DNS or server error during pull, not clobber and unbundle

Categories

(Release Engineering :: General, defect)

Platform: x86 / All
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: catlee)

Details

(Whiteboard: [capacity])

Attachments

(2 files)

Wasteful:

21:15:04     INFO - Copy/paste: /usr/local/bin/hgtool.py --bundle https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/bundles/mozilla-b2g32_v2_0.hg https://hg.mozilla.org/releases/mozilla-b2g32_v2_0 /builds/b2g_bumper/v2.0/build/mozilla-b2g32_v2_0
21:15:04     INFO - Using env: {'PATH': '/usr/local/bin:/usr/bin:/bin'}
21:15:04     INFO -  Reporting hg version in use
21:15:04     INFO -  command: START
21:15:04     INFO -  command: hg -q version
21:15:04     INFO -  command: cwd: .
21:15:04     INFO -  command: output:
21:15:05     INFO -  Mercurial Distributed SCM (version 2.5.4)
21:15:05     INFO -  command: END (0.34s elapsed)

21:15:05     INFO -  command: START
21:15:05     INFO -  command: hg path default
21:15:05     INFO -  command: cwd: /builds/b2g_bumper/v2.0/build/mozilla-b2g32_v2_0
21:15:05     INFO -  command: output:
21:15:05     INFO -  https://hg.mozilla.org/releases/mozilla-b2g32_v2_0
21:15:05     INFO -  command: END (0.38 elapsed)

21:15:05     INFO -  command: START
21:15:05     INFO -  command: hg pull https://hg.mozilla.org/releases/mozilla-b2g32_v2_0
21:15:05     INFO -  command: cwd: /builds/b2g_bumper/v2.0/build/mozilla-b2g32_v2_0
21:15:05     INFO -  command: output:
21:15:31    ERROR -  abort: error: Name or service not known
21:15:31    ERROR -  Automation Error: hg not responding
21:15:31     INFO -  command: ERROR
21:15:31     INFO -  Traceback (most recent call last):
21:15:31     INFO -    File "<string>", line 47, in run_cmd
21:15:31     INFO -    File "/usr/lib64/python2.6/subprocess.py", line 502, in check_call
21:15:31     INFO -      raise CalledProcessError(retcode, cmd)
21:15:31     INFO -  CalledProcessError: Command '['hg', 'pull', 'https://hg.mozilla.org/releases/mozilla-b2g32_v2_0']' returned non-zero exit status 255
21:15:31     INFO -  command: END (25.54s elapsed)

21:15:31     INFO -  Error pulling changes into /builds/b2g_bumper/v2.0/build/mozilla-b2g32_v2_0 from https://hg.mozilla.org/releases/mozilla-b2g32_v2_0; clobbering
21:17:44     INFO -  Attempting to initialize clone with bundles
21:17:44     INFO -  command: START
21:17:44     INFO -  command: hg init /builds/b2g_bumper/v2.0/build/mozilla-b2g32_v2_0
21:17:44     INFO -  command: cwd: /builds/b2g_bumper/v2.0/build
21:17:44     INFO -  command: output:
21:17:44     INFO -  command: END (0.22s elapsed)

21:17:44     INFO -  Trying to use bundle https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/bundles/mozilla-b2g32_v2_0.hg
21:17:44     INFO -  command: START
21:17:44     INFO -  command: hg unbundle https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/bundles/mozilla-b2g32_v2_0.hg
21:17:44     INFO -  command: cwd: /builds/b2g_bumper/v2.0/build/mozilla-b2g32_v2_0

More than 30 minutes later it is still going, and three other b2g_bumper jobs on bm66 are doing the same.
Summary: hgtool should abort and exit if it hits a DNS error, not clobber and unbundle → hgtool should retry or exit if it hits a DNS error during pull, not clobber and unbundle
Blocks: 1036468
Whiteboard: [capacity]
The same applies to 500/502/503/504 responses. The hard part seems to be that Mercurial's exit statuses are poor (the pull above failed with the generic status 255), so we may have to parse the output.
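Because hg aborts with status 255 for almost everything, the failure type has to come from the output text. A minimal sketch of such a classifier, assuming the DNS message seen in the log above ("abort: error: Name or service not known"), two other common resolver messages, and hg's usual "abort: HTTP Error NNN: ..." wording for server errors. The function name and exact patterns are hypothetical, not taken from hgtool.

```python
import re

# Assumed DNS failure strings: the first is from the log in this bug, the
# others are common getaddrinfo() messages on Linux and macOS.
DNS_PATTERNS = [
    re.compile(r"Name or service not known"),
    re.compile(r"Temporary failure in name resolution"),
    re.compile(r"nodename nor servname provided"),
]
# Assumed hg wording for HTTP errors, e.g. "abort: HTTP Error 503: ...".
SERVER_ERROR = re.compile(r"HTTP Error (50[0234])\b")


def classify_hg_failure(output):
    """Return 'dns', 'server', or 'other' for the output of a failed hg command.

    hg exits with 255 for almost any abort, so the exit status alone cannot
    tell these cases apart; the text has to be inspected instead.
    """
    for pattern in DNS_PATTERNS:
        if pattern.search(output):
            return "dns"
    if SERVER_ERROR.search(output):
        return "server"
    return "other"
```

Only 'dns' and 'server' would be treated as transient; anything else falls through to the existing handling.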
Summary: hgtool should retry or exit if it hits a DNS error during pull, not clobber and unbundle → hgtool should retry or exit if it hits a DNS or server error during pull, not clobber and unbundle
Assignee: nobody → catlee
You'll have to parse the output to distinguish DNS failures from other failures.

Also, I question the sanity of our automation environment if hg.mozilla.org ever fails to resolve. I'm somewhat surprised at how often DNS seems to break in automation, and I think that is a problem you should investigate fixing.

Also, I think wiping the local repo after a single pull failure is bad: hg.mozilla.org only has to go down for a few minutes, and then you effectively DDoS hg.mozilla.org.

I'm trying to think of a valid scenario where wiping the local repo immediately after a pull failure is justified. Corruption is the only one that comes to mind.
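One way to encode "only clobber on corruption" is to check the local repo with `hg verify` before deciding to wipe it. A rough sketch, assuming the caller already knows whether the pull error looked transient (DNS or 5xx); the function names and the 'retry' / 'clobber' / 'fail' return values are hypothetical, and `run` is injectable so the policy can be exercised without a real repository.

```python
import subprocess


def run_hg(args, cwd):
    """Run hg in cwd and return (exit_status, combined stdout/stderr text)."""
    proc = subprocess.Popen(["hg"] + list(args), cwd=cwd,
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    return proc.returncode, out.decode("utf-8", "replace")


def action_after_pull_failure(repo_dir, transient, run=run_hg):
    """Decide what to do after 'hg pull' failed in repo_dir.

    transient: True if the pull output looked like a DNS or HTTP 5xx error.
    Returns 'retry', 'clobber', or 'fail'.
    """
    if transient:
        # The remote end is the problem; the local repo is still usable.
        return "retry"
    status, _ = run(["verify"], repo_dir)
    if status != 0:
        # hg verify reported problems: corruption is the one case where
        # wiping the local repo is justified.
        return "clobber"
    # Local repo is intact and the error is not known to be transient:
    # report it and exit rather than re-cloning.
    return "fail"
```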
Networks and servers are flaky, so we'll always have to deal with some amount of hiccups.

This bug is precisely about what you describe: not blowing away local repos on a single remote failure. My first approach is to parse hg's output for DNS or HTTP 5XX errors and retry pull/clone operations in those cases. Hopefully hg's output here isn't version-, platform-, or locale-dependent.
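The retry half of that approach could look roughly like the following. This is a sketch, not the attached patch: `pull` is assumed to be a callable returning `(exit_status, output)`, `is_transient` is whatever output check decides that a failure is a DNS or 5xx error, and the attempt count and delays are illustrative.

```python
import time


def pull_with_retries(pull, is_transient, attempts=3, base_delay=10,
                      sleep=time.sleep):
    """Call pull() until it succeeds, retrying only transient failures.

    pull() returns (exit_status, output); is_transient(output) says whether
    a failure is worth retrying. Returns the last (exit_status, output).
    """
    for attempt in range(attempts):
        status, output = pull()
        if status == 0:
            return status, output
        if not is_transient(output) or attempt == attempts - 1:
            # Non-transient error, or out of attempts: give up and let the
            # caller exit instead of clobbering.
            break
        # Exponential backoff (10s, 20s, ...) so a short hg.mozilla.org
        # outage does not become a burst of immediate retries.
        sleep(base_delay * 2 ** attempt)
    return status, output
```

Passing `sleep` as a parameter keeps the backoff testable; in production it is just `time.sleep`.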
hg does have locale-dependent output.

Automated agents should have the HGPLAIN environment variable set to keep hg's output as consistent as possible.

HGPLAIN
    When set, this disables any configuration settings that might
    change Mercurial's default output. This includes encoding,
    defaults, verbose mode, debug mode, quiet mode, tracebacks, and
    localization. This can be useful when scripting against Mercurial
    in the face of existing user configuration.

    Equivalent options set via command line flags or environment
    variables are not overridden.
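For a Python wrapper like hgtool, setting HGPLAIN just means passing a modified environment to every hg subprocess. A minimal sketch; `hg_env` and `run_hg_plain` are hypothetical helper names, not existing hgtool functions.

```python
import os
import subprocess


def hg_env(base=None):
    """Return a copy of the environment (or of base) with HGPLAIN=1 set.

    Per the help text above, HGPLAIN disables user configuration that changes
    hg's output, including localization, so error text stays parseable.
    """
    env = dict(os.environ if base is None else base)
    env["HGPLAIN"] = "1"
    return env


def run_hg_plain(args, cwd=None):
    """Run hg with HGPLAIN set; return (exit_status, combined output)."""
    proc = subprocess.Popen(["hg"] + list(args), cwd=cwd, env=hg_env(),
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    return proc.returncode, out.decode("utf-8", "replace")
```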
Attachment #8489362 - Flags: review?(rail)
Attachment #8489362 - Flags: review?(rail) → review+
Comment on attachment 8489362 [details] [diff] [review]
retry pull/clone operations

If this sticks, we still need to update the pre-built version in puppet.
Attachment #8489362 - Flags: checked-in+
Comment on attachment 8490945 [details] [diff] [review]
update dependency-free version of hgtool in puppet.

rubber stamp 8-)
Attachment #8490945 - Flags: review?(rail) → review+
Attachment #8490945 - Flags: checked-in+
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: General Automation → General