Closed Bug 869051 Opened 11 years ago Closed 10 years ago

Race condition between builders that push updates to in-tree files

Categories

(Release Engineering :: General, defect, P2)

x86
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: coop)

References

Details

(Whiteboard: [weekly])

Attachments

(3 files)

The builders for the blocklist update and the hsts update are both triggered by the weekly scheduler at the same time every week. There's a very real chance that one of these builders is going to fail every week because of the changes made by the other builder.

I think the best way to fix this would be to chain the various update tasks together, but not make them dependent on each other, i.e. we would still try to run the hsts update if the blocklist update fails. We could also re-use the same source checkout for the chained jobs to save some time and load on the hg server.

We lose the individual job status in buildbot, but that should matter less if the job has a better chance of succeeding by not racing its peers.
If these jobs aren't updating the same files you can make use of the apply_and_push method we have in our hg library: https://github.com/mozilla/build-tools/blob/master/lib/python/util/hg.py#L479

That's what we use to resolve push races in release automation.
We still risk racing with developers checking in code, so we should be dealing with races anyway.
To my surprise, we're losing far fewer of them to push races than we are to the rash of 500 errors we get from hg.m.o shortly after 3am every night. If you look at any tree that has jobs going at that time of night, you can pick out where 3am is without looking at push times, by just looking for pushes with big swathes of blue like https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=fc0a521fad43.
Product: mozilla.org → Release Engineering
Assignee: coop → nobody
Assignee: nobody → coop
Status: NEW → ASSIGNED
Priority: -- → P2
I'm working on a unified script for this over in bug 1004279.
Pete: by virtue of your recent demonstrated shell script prowess, you've won a review request! ;)

There are undoubtedly better ways to do this (separate function dirs/libs for each type of update), but I took the quick approach and modified the existing script for hsts updates (tools/scripts/hsts/update_hsts_preload_list.sh) and modified it to do blocklist, hsts, and the new hpkp updates in one go. We can optimize at a later date IMO.

Log of the most recent run is here:

http://dev-master1.srv.releng.scl3.mozilla.com:8044/builders/Linux%20x86-64%20mozilla-central%20periodic%20file%20update/builds/6/steps/run_script/logs/stdio

Everything succeeds modulo the final push due to keys issues. The push is working find in production right now for the isolated file updates (blocklist and hsts).
Attachment #8439606 - Flags: review?(pmoore)
Comment on attachment 8439606 [details] [diff] [review]
New script to unify periodic file updates

Hey Coop,

Flattery will get you everywhere!

When do you need this by?

Thanks,
Pete
(In reply to Pete Moore [:pete][:pmoore] from comment #6)
> When do you need this by?

Absolutely before Firefox 31 uplift starts, so that gives you about 4.5 weeks. Sooner is always appreciated, of course. ;)
Comment on attachment 8439606 [details] [diff] [review]
New script to unify periodic file updates

Review of attachment 8439606 [details] [diff] [review]:
-----------------------------------------------------------------

Hey Coop,

It's quite a piece of work you've done - nice work!

This is my first round of feedback from superficial matters - I'll do a second pass now, after I've done some testing.

Therefore I haven't marked it as + or - yet, as I'm still working on the review, but I thought it pragmatic to at least get some initial feedback to you already. :)

High level requests:
1) custom exit codes would be nice, also documented in the help (not essential)
2) preferable to write error messages to standard error, rather than standard out (>&2) - again, not essential
3) quoted strings, when passed to commands as a single argument (e.g. cd "${xxx}", rm "${xxx}", ...) - again not essential if files/directories are known not to contain spaces, but a good principle to follow
4) error messages to contain as much instance data as possible (e.g. file "xxx" missing from directory "yyy" is preferable to "downloaded file xxx not found")

Pete

::: scripts/periodic_file_updates/periodic_file_updates.sh
@@ +4,5 @@
> +cat <<EOF
> +
> +Usage: `basename $0` -h
> +Usage: `basename $0` [-n] [-c] [-d] [-a]
> +           [-p product]

I think it would be good to provide a list of allowed products, or an explanation of how the product parameter is used, so the user can work out what values they are allowed to pass. If not a complete list, a partial list of examples would be useful.

@@ +5,5 @@
> +
> +Usage: `basename $0` -h
> +Usage: `basename $0` [-n] [-c] [-d] [-a]
> +           [-p product]
> +           [--hgtool hgtool_location]

I think this option is not needed - since hgtool is in the same repo, I think the script can always use the hgtool in the same repo, and internally know the relative path from this script to hg tool.

@@ +15,5 @@
> +           --hsts | --hpkp | --blocklist
> +           -b branch
> +
> +EOF
> +}

I think it might be good to give an explanation in the help about exactly what the tool does, what the actions are for, that it is currently called by buildbot. After a description, then it could show the options - otherwise if you have no context, it is difficult to understand what this script is for.

@@ +30,5 @@
> +APPROVAL=false
> +HG_SSH_USER='ffxbld'
> +HG_SSH_KEY='~cltbld/.ssh/ffxbld_dsa'
> +REPODIR=''
> +HGTOOL=''

HGTOOL="$(dirname "${0}")/../../buildfarm/utils/hgtool.py" ?

@@ +40,5 @@
> +LOCALHOST=`/bin/hostname -s`
> +HGHOST="hg.mozilla.org"
> +STAGEHOST="stage.mozilla.org"
> +HG=hg
> +WGET=wget

Maybe:
WGET=wget -nv
so that we don't get all the progress bars in the logs?

@@ +67,5 @@
> +BLOCKLIST_UPDATED=false
> +
> +# Get the current in-tree version for this code branch.
> +function get_version {
> +    cd ${BASEDIR}

in general, if parameters may contain spaces (e.g. a directory name) yet the value is meant to be passed as a single argument, it should be double-quoted - e.g. cd "${BASEDIR}" - otherwise, if it contains a space, the shell will split it into tokens, which will get passed as multiple arguments to the command. throughout the script there are examples of this, but i won't write against all of them. admittedly in environments where it is known that there are no spaces in directory names, it is no problem, so it is not a show stopper, but can affect local testing.

@@ +88,5 @@
> +
> +# Cleanup common artifacts.
> +function preflight_cleanup {
> +    cd ${BASEDIR}
> +    rm -rf ${PRODUCT} tests ${BROWSER_ARCHIVE} ${TESTS_ARCHIVE}

e.g. rm -rf "${PRODUCT}" tests "${BROWSER_ARCHIVE}" "${TESTS_ARCHIVE}"

@@ +142,5 @@
> +	fi
> +    done
> +    for F in ${HSTS_PRELOAD_SCRIPT} ${HSTS_PRELOAD_ERRORS} ${HSTS_PRELOAD_INC}; do
> +	if [ ! -f ${F} ]; then
> +	    echo "Downloaded file ${F} not found."

see comment below from similar block

@@ +156,5 @@
> +
> +    # The created files should be non-empty.
> +    echo "INFO: Checking whether new HSTS preload list is valid..."
> +    if [ ! -s "${HSTS_PRELOAD_ERRORS}" ]; then
> +        echo "New HSTS preload error list is empty. I guess that's good?"

Maybe specify in log file which file was being tested?

@@ +159,5 @@
> +    if [ ! -s "${HSTS_PRELOAD_ERRORS}" ]; then
> +        echo "New HSTS preload error list is empty. I guess that's good?"
> +    fi
> +    if [ ! -s "${HSTS_PRELOAD_INC}" ]; then
> +        echo "New HSTS preload list is empty. That's less good."

same as above - nice to specify ${HSTS_PRELOAD_INC} in log message (also send to standard error)

@@ +169,5 @@
> +    echo "INFO: diffing old/new HSTS error lists..."
> +    ${DIFF} ${HSTS_PRELOAD_ERRORS} ${PRODUCT}/${HSTS_PRELOAD_ERRORS}
> +    DIFF_STATUS=$?
> +    case "${DIFF_STATUS}" in
> +        0) echo "INFO: HSTS preload error lists are the same.";;

maybe specify which lists are different, which urls they came from, where they were downloaded to on the file system etc - to aid debugging. e.g. which project?

@@ +212,5 @@
> +	fi
> +    done
> +    for F in ${HPKP_PRELOAD_SCRIPT} ${HPKP_PRELOAD_JSON} ${HPKP_DER_TEST} ${HPKP_PRELOAD_OUTPUT} ${HPKP_PRELOAD_ERRORS}; do
> +	if [ ! -f ${F} ]; then
> +	    echo "Downloaded file ${F} not found."

Maybe good to say which directory you are looking in? e.g. 

echo "Downloaded file '${F}' not found in directory '$(pwd)' - this should have been downloaded above from url <url>." >&2

Also nice to write error messages to standard error, rather than standard out. Lastly, it might be nice to have different exit codes for different failure cases, e.g. starting from 64, and to output these in the usage function.

In order to know the url, may be nice to combine these two loops into a single loop, like this:

for URL in "${HPKP_PRELOAD_SCRIPT_HG}" "${HPKP_PRELOAD_JSON_HG}" "${HPKP_DER_TEST_HG}" "${HPKP_PRELOAD_OUTPUT_HG}" "${HPKP_PRELOAD_ERRORS_HG}"; do
    ${WGET} --no-check-certificate "${URL}"
    WGET_STATUS=$?
    if [ "${WGET_STATUS}" != 0 ]; then
        echo "ERROR: wget exited with a non-zero exit code: ${WGET_STATUS}" >&2
        exit <custom exit code> # since wget exit code status is known from log
    fi
    if [ ! -f "$(basename "${URL}")" ]; then
        echo "Downloaded file '$(basename "${URL}")' not found in directory '$(pwd)' - this should have been downloaded above from ${URL}." >&2
        exit <custom exit code>
    fi
done

@@ +227,5 @@
> +    if [ ! -s "${HPKP_PRELOAD_ERRORS}" ]; then
> +        echo "New ${HPKP_PRELOAD_ERRORS} error list is empty. I guess that's good?"
> +    fi
> +    if [ ! -s "${HPKP_PRELOAD_OUTPUT}" ]; then
> +        echo "${HPKP_PRELOAD_OUTPUT} is empty. That's less good."

again, nice to give more detail here - which files are being compared, where they came from etc

@@ +259,5 @@
> +# Downloads the current in-tree blocklist file.
> +# Downloads the current blocklist file from AMO.
> +# Compares the AMO blocklist with the in-tree blocklist to determine whether we need to update.
> +function compare_blocklist_files {
> +    BLOCKLIST_URL_AMO="https://addons.mozilla.org/blocklist/3/${APP_ID}/${VERSION}/${APP_NAME}/20090105024647/blocklist-sync/en-US/nightly/blocklist-sync/default/default/"

Is this date really fixed, and does not change? 20090105024647

@@ +297,5 @@
> +    echo "INFO: diffing in-tree blocklist against the blocklist from AMO..."
> +    ${DIFF} blocklist_hg.xml blocklist_amo.xml >/dev/null 2>&1
> +    DIFF_STATUS=$?
> +    case "${DIFF_STATUS}" in
> +        0) echo "INFO: blocklists are the same.";;

which blocklists?

@@ +324,5 @@
> +        else
> +	    # Fallback on vanilla hg
> +	    echo "hgtool.py not found. Falling back to vanilla hg."
> +            CLONE_CMD="${HG} clone"
> +        fi

I think since hgtool is part of tools repo, we should be able to assume relative path?

@@ +460,5 @@
> +    shift
> +done
> +
> +# Must supply a code branch to work with.
> +if [ "${BRANCH}" == "" ]; then

maybe say that a branch name is missing, so the user knows that is the problem, rather than e.g. an action is missing

@@ +501,5 @@
> +    if [ $? != 0 ]; then
> +	HSTS_UPDATED=true
> +    fi
> +fi
> +if [ "${DO_HPKP}" == "true" ]; then

just a trip - you can also just write
if "${DO_HPKP}"
rather than
if [ "${DO_HPKP}" == "true" ]
since it is always 'true' or 'false'.

this obviously applies also to all the other places. just a stylistic matter, not so important. :)

@@ +518,5 @@
> +if [ "${HSTS_UPDATED}" == "false" -a "${HPKP_UPDATED}" == "false" -a "${BLOCKLIST_UPDATED}" == "false" ]; then
> +    exit 1
> +else
> +    if [ "${DRY_RUN}" == "true" ]; then
> +        echo "INFO: Updates are available, bot updating hg in dry-run mode."

maybe the wording can be made clearer/more specific here - e.g. something like hg repos have been cloned to locations xyz, the files have been updated locally, but not committed.

Why do we exit 1? Is that to indicate there are changes? I think we have so many cases where we exit 1, it would be nicer to change these all to different values, because you cannot rely on exit 1 meaning updates are available - other places also exit with exit code 1.

@@ +535,5 @@
> +
> +if [ "${BLOCKLIST_UPDATED}" == "true" ]; then
> +  commit_blocklist_files
> +fi
> +push_repo

should we pull/update before we push, to reduce chance of collisions? (see catlee's comment 2 above)
Comment on attachment 8439606 [details] [diff] [review]
New script to unify periodic file updates

Review of attachment 8439606 [details] [diff] [review]:
-----------------------------------------------------------------

::: scripts/periodic_file_updates/periodic_file_updates.sh
@@ +1,1 @@
> +#!/bin/bash -e

Since you have this first line, it would be good to make the script executable for all (chmod a+x) so that it can be executed directly

@@ +23,5 @@
> +BRANCH=""
> +LATEST_DIR=""
> +PLATFORM="linux-x86_64"
> +PLATFORM_EXT="tar.bz2"
> +UNPACK_CMD="tar jxf"

UNPACK_CMD="$(basename "${0}")/../../release/common/unpack.sh" ?
Currently this can only be tested on linux, but I think if you use the generic unpacker, we could theoretically run this on any platform, including mac and windows? At least it would make local testing on a mac easier. :)

Maybe better to autodetect platform with uname -s, and then set PLATFORM and PLATFORM_EXT based on this?
Note about the above - unpack.sh is a library file, so you probably need to source the unpack.sh file, and then call unpack_build function with appropriate arguments rather than just running ${UNPACK_CMD}. I haven't looked too deeply into it.
Comment on attachment 8439606 [details] [diff] [review]
New script to unify periodic file updates

Review of attachment 8439606 [details] [diff] [review]:
-----------------------------------------------------------------

So I haven't been able to test this fully today (I didn't get round to getting myself a linux loaner) but I don't see anything really sinister there. It all seems to work, and my only comments were around improving logging in case of problems, providing a bit more detail in the help message, and a couple of nice tweaks that were optional.

So based on that, I'm going to give it an r+ with the caveat that if you'd like me to look at it again when I'm back from PTO, I'm happy to do that, and get myself a loaner to do some thorough testing on.
Attachment #8439606 - Flags: review?(pmoore) → review+
(In reply to Pete Moore - on PTO until June 27 [:pete][:pmoore] from comment #9) 
> @@ +23,5 @@
> > +BRANCH=""
> > +LATEST_DIR=""
> > +PLATFORM="linux-x86_64"
> > +PLATFORM_EXT="tar.bz2"
> > +UNPACK_CMD="tar jxf"
> 
> UNPACK_CMD="$(basename "${0}")/../../release/common/unpack.sh" ?
> Currently this can only be tested on linux, but I think if you use the
> generic unpacker, we could theoretically run this on any platform, including
> mac and windows? At least it would make local testing on a mac easier. :)
> 
> Maybe better to autodetect platform with uname -s, and then set PLATFORM and
> PLATFORM_EXT based on this?

While I'm hopefully that the developer-supplied scripts would also work on at least Darwin (since they're just xpcshell), they've only been tested on Linux. There's the possibility of me spending a lot of time to make these scripts work on Mac when we only run them on Linux in production anyway. I'll pass on that.

I'm running a version of the script through staging right now that incorporates all your suggestions from comment #8.

buildbot patches coming shortly.
Comment on attachment 8439606 [details] [diff] [review]
New script to unify periodic file updates

Review of attachment 8439606 [details] [diff] [review]:
-----------------------------------------------------------------

https://hg.mozilla.org/build/tools/rev/6b39c6438489
Attachment #8439606 - Flags: checked-in+
Most recent staging run is here:

http://dev-master1.srv.releng.scl3.mozilla.com:8044/builders/Linux%20x86-64%20mozilla-central%20periodic%20file%20update/builds/12

It's only failing because of the missing keys in staging...the same process works right now in production for weekly blocklist and HSTS updates.
Attachment #8444457 - Flags: review?(kmoir)
Attachment #8444456 - Flags: review?(kmoir) → review+
Attachment #8444457 - Flags: review?(kmoir) → review+
Comment on attachment 8444457 [details] [diff] [review]
[buildbotcustom] Unify periodic file update builders

Review of attachment 8444457 [details] [diff] [review]:
-----------------------------------------------------------------

https://hg.mozilla.org/build/buildbotcustom/rev/c218109ea60b
Attachment #8444457 - Flags: checked-in+
Comment on attachment 8444456 [details] [diff] [review]
[buildbot-configs] Unify periodic file update vars; enable hpkp updates on m-c, aurora, and b2g branches

Review of attachment 8444456 [details] [diff] [review]:
-----------------------------------------------------------------

https://hg.mozilla.org/build/buildbot-configs/rev/7c1d7f728805
Attachment #8444456 - Flags: checked-in+
This got deployed yesterday.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Depends on: 1035003
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: