Figure out how to update Balrog release blobs with multiple partials

RESOLVED FIXED

Status

defect
P2
normal
RESOLVED FIXED
4 years ago
Last year

People

(Reporter: rail, Assigned: rail)

Tracking

(Blocks 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments, 3 obsolete attachments)

ATM if you use multiple instances of the balrog client to update SingleLocale blobs, you end up overwriting the blob every time. Funsize can generate multiple partials in parallel, so we end up with a single random partial in the blob.

Instead of blindly submitting the update info, we can teach the client to read the blob and it's data_version, merge the update info with existing one and submit the result. If the submission fails (due to race condition guarded by data_version), the client should repeat the operation, starting from the "read the current data and data_version" step.

The logic is a bit more complicated for copying the info to the latest blob. Merging partials won't work, because the blob is not as clean as dated blobs - it contains old partials.


There are 2 options available so far for the latest blob:

1) after the dated blob is updated, "transplant" the data from the dated blob to the latest blob for a particular platform/locale combination:
  * read data_version of the latest blob
  * read the dated blob data
  * "transplant" the data to the latest blob using data_version

2) Validate partials whenever submit data to the latest blob. 

Both methods imply stop using of the copyTo parameter.
This is a untested patch. If the approach looks plausible, I'll need to test it and add unittests.
Attachment #8631540 - Flags: feedback?(bhearsum)
Comment on attachment 8631540 [details] [diff] [review]
multiple_submissions-tools.diff

Review of attachment 8631540 [details] [diff] [review]:
-----------------------------------------------------------------

All of the logic around constructing the new sets of updates looks fine to me, but I have reservations about the structure of this change. I see api.py as a minimal companion to the server side endpoints - it's responsibility is translate python args into URLs/post data. I don't think we should be doing major manipulation of the data here - by doing that you're limiting what other consumers of the API can do. IE: if I want to write a different script that, eg, removes some updates from a locale it's not possible to do when construction of the update data is embedded into the API wrapper. Data construction belongs in a level above IMO.

There's a few other structural comments inline as well:

::: lib/python/balrog/submitter/api.py
@@ +127,5 @@
> +            return old_updates
> +        merged_data = set(
> +            Update(u["from"], u["filesize"], u["hashValue"], u["fileUrl"])
> +            for u in old_updates + new_updates)
> +        return [u.to_dict() for u in merged_data]

This is definitely not part of the base API, seems like it should probably just be a function on its own, not a method. If you want to attach it to something, it seems like it might make sense as a staticmethod on the Update class.

@@ +182,5 @@
> +        dated_blob_url = self.api_root + \
> +            self.url_template % dict(name=name, build_target=build_target,
> +                                     locale=locale)
> +        top_blob_url = self.api_root + \
> +            self.prerequest_url_template % dict(name=name)

I'm finding this method to be a little weird. My first instinct is to say that things would be simpler if the caller was made to deal with fallback. It would simply be:

data = {}
data_version = None
try:
  data, data_version = SingleLocale.get_data()
except:
  try:
      _, data_version = Release.get_data()
  except:
    pass

I don't feel super strong about this, but it struck me enough to mention...maybe it doesn't make much of a difference, or maybe I'm missing a subtlely as to why it needs to be handled here.

@@ +256,5 @@
> +            self.request(method='PUT', data=data,
> +                         url_template_vars=url_template_vars)
> +        copyTo = copyTo or []  #
> +        for target_release in copyTo:
> +            self.copy(name, target_release, build_target, locale)

This is another area where I think making the caller deal with will make things simpler. If update_build is left to simply make API requests, you can have logic like this in the caller:

# construct new data
buildData = ...

SingleLocale.updateBuild(datedBlob, buildData)
SingleLocale.updateBuild(latestBlob, buildData)

With this, copyTo should be able to die as part of api.py's interface (and we can kill it on the server later).
Attachment #8631540 - Flags: feedback?(bhearsum) → feedback-
Priority: -- → P2
This is still WIP.

API:

* Refactored Release and SingleLocale to be more "classy" :) It makes some signatures less chatty and the copy_to() method easier to read

* Refactored SingleLocale.get_data() to be more readable 

* SingleLocale.update_build() doesn't accept copyTo anymore, there is a separate method

CLI:

* I released that there is recursive_update(), which I used in NightlySubmitterBase. This should eliminate the Update class I wrote to compare dicts. I can make this operation optional, something like NightlySubmitterBase.run(..., merge_data=False).

* Explicitly call SingleLocale.copy_to(latest)


I haven't tested this yet.
Attachment #8631540 - Attachment is obsolete: true
Attachment #8632202 - Flags: feedback?(bhearsum)
Comment on attachment 8632202 [details] [diff] [review]
multiple_submissions-tools.diff

Review of attachment 8632202 [details] [diff] [review]:
-----------------------------------------------------------------

::: lib/python/balrog/submitter/api.py
@@ +123,5 @@
>              raise
>  
> +    def get_data_generic(self, url_template_vars):
> +        resp = self.request(url_template_vars=url_template_vars)
> +        return (json.loads(resp.content), resp.headers['X-Data-Version'])

Yay, factoring! Have you checked to make sure that all endpoints will return X-Data-Version?

If you wanted to avoid the need for get_data in all the subclasses you could have an url_template_vars class attribute, with a list of things to look for on the class, and then Rule would get this for free ;). I'm fine either way, no pressure.

@@ +166,5 @@
> +        data, data_version = {}, None
> +        top_level = Release(name=self.name, **self.kwargs)
> +        # Use data version from the top level blob
> +        try:
> +            _, data_version = top_level.get_data()

What's the point in doing this look-up first? The full blob URL and any locale URLs underneath it will return the same data version. I don't think there's any point in hitting /releases/:name unless the locale 404s. If you insist on doing it in this class, I think the flow in this block should be:

try:
  return locale data + data version
except 404:
  try:
    return empty data + /releases/:name data version
  except:
    return empty data + no data version

@@ +203,5 @@
>  
>          return self.request(method='PUT', data=data,
>                              url_template_vars=url_template_vars)
>  
> +    def copy_to(self, target_release):

As evidenced on IRC, this part really threw me. If I try to forget what copy_to used to mean it makes it a bit easier to read, but I still think that making the consumer get the data and call update_build itself would be clearer. I'm not going to block on this, though.
Attachment #8632202 - Flags: feedback?(bhearsum) → feedback+
(In reply to Ben Hearsum [:bhearsum] from comment #4)
> Comment on attachment 8632202 [details] [diff] [review]
> multiple_submissions-tools.diff
> 
> Review of attachment 8632202 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: lib/python/balrog/submitter/api.py
> @@ +123,5 @@
> >              raise
> >  
> > +    def get_data_generic(self, url_template_vars):
> > +        resp = self.request(url_template_vars=url_template_vars)
> > +        return (json.loads(resp.content), resp.headers['X-Data-Version'])
> 
> Yay, factoring! Have you checked to make sure that all endpoints will return
> X-Data-Version?

I haven't checked this. TODO.

> If you wanted to avoid the need for get_data in all the subclasses you could
> have an url_template_vars class attribute, with a list of things to look for
> on the class, and then Rule would get this for free ;). I'm fine either way,
> no pressure.

Yeah, maybe. I'll take a look at this too.

> @@ +166,5 @@
> > +        data, data_version = {}, None
> > +        top_level = Release(name=self.name, **self.kwargs)
> > +        # Use data version from the top level blob
> > +        try:
> > +            _, data_version = top_level.get_data()
> 
> What's the point in doing this look-up first? The full blob URL and any
> locale URLs underneath it will return the same data version. I don't think
> there's any point in hitting /releases/:name unless the locale 404s. If you
> insist on doing it in this class, I think the flow in this block should be:


Imagine the following scenario:

1) request the dated blob, get 404, assume data = {}
2) another job updates the dated blob
3) request to top level blob to get data_version, assume data = {}, boom - data overwritten!

:)

I tried to get data_version first. If there is one, it's in the top level blob.
Data comes (or not) from the dated blob, which is requested *after* we know data_version. So the scenario above looks like this:

1) request to the top level blob to get data_version, data is not set yet
2) request to the dated blob to get data
3) another job updates the blob, increment data_version
4) try to update the blob using data_version from step 1) - boom! no data is overwritten, retry!

I hope it describes what it does.

> @@ +203,5 @@
> >  
> >          return self.request(method='PUT', data=data,
> >                              url_template_vars=url_template_vars)
> >  
> > +    def copy_to(self, target_release):
> 
> As evidenced on IRC, this part really threw me. If I try to forget what
> copy_to used to mean it makes it a bit easier to read, but I still think
> that making the consumer get the data and call update_build itself would be
> clearer. I'm not going to block on this, though.

I think this method can be moved to cli.py. I'll take a look. I'll also add some comments.

Thanks a lot for the feedback.
* refactored classes to use self.url_template_vars

* moved the logic of copyTo in cli.py

* Unfortunately recursive_update() didn't work as expected. It doesn't support list merges, it overwrites them. I had to use jsonmerge (it's neat!) to explicitly merge partial updates and overwrite complete updates (specified in the merger schema). Had to add some other required libraries into lib/python/vendor.

* added unittests to verify jsonmerge behavior

I baked a docker image with these changes and used it for today's funsize jobs. The following task shows how it even recovers a race condition, when some other task submits data after we get our data_version: https://tools.taskcluster.net/task-inspector/#-VRBcYkdRdyp74oTANbtEw/0

And these are the blobs generated:
https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-dummy-nightly-20150713004006/builds/Linux_x86-gcc3/en-US
https://aus4-admin.mozilla.org/api/releases/Firefox-mozilla-aurora-dummy-nightly-latest/builds/Linux_x86-gcc3/en-US
Attachment #8632202 - Attachment is obsolete: true
Attachment #8632795 - Flags: review?(bhearsum)
Comment on attachment 8632795 [details] [diff] [review]
multiple_submissions-tools-1.diff

Review of attachment 8632795 [details] [diff] [review]:
-----------------------------------------------------------------

One minor nit, and one question, but this looks great overall! r=me if you rename kwargs when you land.

::: lib/python/balrog/submitter/api.py
@@ +159,5 @@
> +        self.locale = locale
> +        self.url_template_vars = dict(name=name, build_target=build_target,
> +                                      locale=locale)
> +        # keep a copy to be used in get_data()
> +        self.kwargs = kwargs

Naming nit: I think should be something like release_kwargs since it's not just being passed to parent class.

::: lib/python/balrog/submitter/updates.py
@@ +7,5 @@
> +
> +def merge_partial_updates(base_obj, new_obj):
> +    """Merges 2 update objects, merging partials and replacing completes"""
> +    schema = {
> +        "properties": {

What happens to string items that are present in both but differ in value? Eg: appVersion is sibling to partials and completes.

@@ +22,5 @@
> +            }
> +        }
> +    }
> +    merger = jsonmerge.Merger(schema=schema)
> +    return merger.merge(base_obj, new_obj)

jsonmerge is very....wordy, but I'm happy to be using something like it rather than reinventing the wheel.
Attachment #8632795 - Flags: review?(bhearsum) → review+
(In reply to Ben Hearsum [:bhearsum] from comment #7)
> Naming nit: I think should be something like release_kwargs since it's not
> just being passed to parent class.

Good idea.

> What happens to string items that are present in both but differ in value?
> Eg: appVersion is sibling to partials and completes.

All values other than "partials" will be overwritten by values in new_data. This is the default behavior of jsonmerge.
Comment on attachment 8632795 [details] [diff] [review]
multiple_submissions-tools-1.diff

https://hg.mozilla.org/build/tools/rev/41e417746421
Attachment #8632795 - Flags: checked-in+
Comment on attachment 8632795 [details] [diff] [review]
multiple_submissions-tools-1.diff

Python 2.6 didn't like me: https://travis-ci.org/mozilla/build-tools/jobs/70806770

https://hg.mozilla.org/build/tools/rev/ed2440625d8b

Either need to disable 2.6 tests for this library or add repoze.lru and argparse
Attachment #8632795 - Flags: checked-in+ → checked-in-
This one passes python 2.6:

* imported repoze.lru into lib/python/vendor
* python-2.6 uses unittest2 to utilize assertDictEqual()

Interdiff (without lib/python/vendor) is here: https://gist.github.com/rail/7011b5441fb9bb47db74
Attachment #8632795 - Attachment is obsolete: true
Attachment #8633145 - Flags: review?(bhearsum)
Attachment #8633145 - Flags: review?(bhearsum) → review+
Comment on attachment 8633145 [details] [diff] [review]
multiple_submissions-tools-3.diff

https://hg.mozilla.org/build/tools/rev/b2b3389de6a9
Attachment #8633145 - Flags: checked-in+
Backout #2!!!! https://hg.mozilla.org/build/tools/rev/13f02c3a6a16

This time mshal found (thanks!) that we use python 2.6 when we call balrog-submitter.py from mozharness. jsonschema uses repoze.lru on 2.6 and it uses the following in lib/python/vendor/repoze.lru-0.6/repoze/__init__.py:

__import__('pkg_resources').declare_namespace(__name__)

it fails, because python-setuptools is not installed (and we use default python, which is 2.6)

There are 2 options:
1) install python-setuptools and continue using 2.6 (eeew)
2) fix mozharness to use python 2.7

I tend to go with the second option.
Attachment #8633145 - Flags: checked-in+ → checked-in-
using sys.executable ensures that instead of using a random version of python found in PATH or some specific in "exes" we use the same binary that used to run the script. I think, this is a bit better than altering multiple configs and adding "python" to exes or some other hacks.
Attachment #8634126 - Flags: review?(jlund)
Comment on attachment 8634126 [details] [diff] [review]
sys.executable-mozharness.diff

Review of attachment 8634126 [details] [diff] [review]:
-----------------------------------------------------------------

assuming all our mh script calls use python2.7, this should do it :)

a quick glance shows that we explicitly pass 2.7 to most ScriptFactory calls: http://mxr.mozilla.org/build/search?string=mozharness_python&find=configs&findi=&filter=%5E%5B%5E%5C0%5D*%24&hitlimit=&tree=build

Sometimes we don't pass interpreter like here: http://mxr.mozilla.org/build/source/buildbotcustom/misc.py#1877 

which makes me think we are leaving it up to the PATH in mobile_l10n.py to use the right python. Not sure if that will affect you
Attachment #8634126 - Flags: review?(jlund) → review+
(In reply to Jordan Lund (:jlund) from comment #15)
> Sometimes we don't pass interpreter like here:
> http://mxr.mozilla.org/build/source/buildbotcustom/misc.py#1877 
> 
> which makes me think we are leaving it up to the PATH in mobile_l10n.py to
> use the right python. Not sure if that will affect you

I'd rather fix it and make it use 2.7 if it doesn't. I'll keep an eye on it.
Comment on attachment 8634126 [details] [diff] [review]
sys.executable-mozharness.diff

Back out:
remote:   https://hg.mozilla.org/build/mozharness/rev/ff2b9ad5f836
remote:   https://hg.mozilla.org/build/mozharness/rev/239491bc393a

04:16:28     INFO - retry: Calling run_command_m with args: (['/tools/buildbot/bin/python', '/builds/slave/m-aurora-lx-l10n-ntly-1-000000/build/tools/scripts/updates/balrog-submitter.py', '--build-properties', '/builds/slave/m-aurora-lx-l10n-ntly-1-000000/balrog_props.json', '-t', 'nightly', '--credentials-file', '/builds/slave/m-aurora-lx-l10n-ntly-1-000000/oauth.txt', '--verbose', '--api-root', 'https://aus4-admin.mozilla.org/api', '--username', 'ffxbld', '--url-replacement', 'http://ftp.mozilla.org/pub/mozilla.org,http://download.cdn.mozilla.net/pub/mozilla.org'],), kwargs: {}, attempt #2
04:16:28     INFO - Running command: ['mock_mozilla', '-r', 'mozilla-centos6-x86_64', '-q', '--unpriv', '--shell', '/tools/buildbot/bin/python /builds/slave/m-aurora-lx-l10n-ntly-1-000000/build/tools/scripts/updates/balrog-submitter.py --build-properties /builds/slave/m-aurora-lx-l10n-ntly-1-000000/balrog_props.json -t nightly --credentials-file /builds/slave/m-aurora-lx-l10n-ntly-1-000000/oauth.txt --verbose --api-root https://aus4-admin.mozilla.org/api --username ffxbld --url-replacement http://ftp.mozilla.org/pub/mozilla.org,http://download.cdn.mozilla.net/pub/mozilla.org']
04:16:28     INFO - Copy/paste: mock_mozilla -r mozilla-centos6-x86_64 -q --unpriv --shell "/tools/buildbot/bin/python /builds/slave/m-aurora-lx-l10n-ntly-1-000000/build/tools/scripts/updates/balrog-submitter.py --build-properties /builds/slave/m-aurora-lx-l10n-ntly-1-000000/balrog_props.json -t nightly --credentials-file /builds/slave/m-aurora-lx-l10n-ntly-1-000000/oauth.txt --verbose --api-root https://aus4-admin.mozilla.org/api --username ffxbld --url-replacement http://ftp.mozilla.org/pub/mozilla.org,http://download.cdn.mozilla.net/pub/mozilla.org"
04:16:28     INFO -  /bin/sh: /tools/buildbot/bin/python: No such file or directory
04:16:28    ERROR - Return code: 127
04:16:28     INFO - retry: Failed, sl

Of course, because this is mock. :/
Attachment #8634126 - Flags: checked-in+ → checked-in-
Comment on attachment 8633145 [details] [diff] [review]
multiple_submissions-tools-3.diff

I ended up pushing the same patch wit different version of repoze.lru in the vendors directory. They fixed the issue in https://github.com/repoze/repoze.lru/commit/2465a8b2538326f56304c6a542c9386b1760b495#diff-5ad6eb84b87d1c3a56c7c4165a241154

https://hg.mozilla.org/build/tools/rev/da436987c292
Attachment #8633145 - Flags: checked-in- → checked-in+
It's been working fine since Friday.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Depends on: 1186058
Resolution: FIXED → ---
This fixes release related usage. This is an incremental patch.
Attachment #8637183 - Flags: review?(bhearsum)
Attachment #8637183 - Flags: review?(bhearsum) → review+
Comment on attachment 8637183 [details] [diff] [review]
fix-release-submission.diff

Combined push: https://hg.mozilla.org/build/tools/rev/8ff1afeae374
Attachment #8637183 - Flags: checked-in+
It worked this time \o/
Status: REOPENED → RESOLVED
Closed: 4 years ago4 years ago
Resolution: --- → FIXED
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.