Closed Bug 847640 Opened 11 years ago Closed 10 years ago

db-based mapper on web cluster

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mozilla, Assigned: pmoore)

References

Details

Attachments

(4 files, 3 obsolete files)

This may include making the mapper app public.

The status should solve bug 799845 (when was the last successful sync; when did we last try; how many failures and when).  We could also monitor these via nagios or the like.

This is potentially a discrete portion of work that might be assignable to someone else, but I'm taking it for now.
Blocks: 799845
Assignee: aki → seth
Attachment #723151 - Attachment is obsolete: true
Argh. Sorry about that. Posted to wrong bug.
Assignee: seth → nobody
Product: mozilla.org → Release Engineering
Blocks: 929336
Morphing, since otherwise this is a dup of bug 929336.
Summary: publish hg-git mapfiles, status somewhere public → db-based mapper on web cluster
Blocks: 946019
Assignee: nobody → aki
Component: Other → Tools
QA Contact: hwine
Blocks: 962853
Blocks: 962863
Blocks: 962869
Blocks: 962870
Assignee: aki → pmoore
Attached file test_mapper_post.py
This script:

* restarts apache via apachectl, since I kept forgetting to after updating the app
* takes the 'foo' file in cwd as the mapfile to insert
* inserts that mapfile into the 'gecko-dev' project in localhost/mapper (currently via ignoredups; change the URL to http://localhost/mapper/gecko-dev/insert to error on dups, or to localhost/mapper/gecko-projects/... to insert into gecko-projects)

The mapfiles you can insert from are http://people.mozilla.org/~asasaki/vcs2vcs/gecko-dev/gecko-mapfile for gecko-dev and http://people.mozilla.org/~asasaki/vcs2vcs/gecko-projects/project-branches-mapfile for gecko-projects.
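A minimal sketch of what such a POST looks like from Python (the URL layout and the text/plain content type come from the description above; the function name and defaults are illustrative, and sending it requires a running mapper instance):

```python
from urllib import request

def build_insert_request(mapfile_bytes, project="gecko-dev", ignore_dups=True,
                         base="http://localhost/mapper"):
    """Build the POST described above. /insert/ignoredups tolerates
    duplicate rows, while /insert errors on them. The mapfile body is
    sent verbatim with Content-Type: text/plain."""
    suffix = "insert/ignoredups" if ignore_dups else "insert"
    url = "%s/%s/%s" % (base, project, suffix)
    return request.Request(url, data=mapfile_bytes,
                           headers={"Content-Type": "text/plain"},
                           method="POST")

# Sending it (only works against a running mapper):
#   with open("foo", "rb") as f:
#       request.urlopen(build_insert_request(f.read()))
```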
Thanks Aki.

Working on this fork now: https://github.com/petemoore/mapper
So, having looked into this a bit more deeply, I think there may be value in (also) using git notes to store the hg revisions in git. I'm not aware of any equivalent feature that would allow the git commit shas to be visible in hg; however, this may be possible by creating an hg extension, and I'll have a chat with gps to get his opinion on this too.

With git notes, any dev using git would be able to see, natively from their local git repo and without having to query any external service, the hg SHA for a commit retrieved from the git repo (obviously not for local commits that have not yet been pushed).

To demonstrate:

$ git clone -o gd git@github.com:mozilla/gecko-dev b2g-dev
$ cd b2g-dev
$ git fetch gd
$ git remote add gp git@github.com:mozilla/gecko-projects
$ git fetch gp
$ git show -s c2d4e3781d2f9d0077837ce021afdb68dc036bd4
$ git notes add -m 'hg id: 0000b936e5c67d1fa428c263633bcb27325a195b' c2d4e3781d2f9d0077837ce021afdb68dc036bd4
$ git show -s c2d4e3781d2f9d0077837ce021afdb68dc036bd4

commit c2d4e3781d2f9d0077837ce021afdb68dc036bd4
Author: Wes Johnston <wjohnston@mozilla.com>
Date:   Tue Feb 15 16:54:47 2011 -0800

    Bug 631058 - Part 2 -Only decode Only decode metadata for preload=metadata. r=cpearce, a=blocking-fennec

Notes:
    hg id: 0000b936e5c67d1fa428c263633bcb27325a195b

Above, you see I simply add a "note" to a commit to say what its hg counterpart is, and a git show or git log will show the hg id.

This mechanism is nice for the following reasons:

1) The git developer has everything locally - no network calls to find an hg id - it is already in the repository alongside the rest of the metadata.
2) The git developer can choose whether to receive this data - i.e. whether to include refs/notes/* in the fetch refspec or not.
3) When fetching from the git remote, only the new notes get fetched, not the full notes history.

Hopefully I'll be able to find a way to do a similar thing on the hg side.

I've cloned gecko-dev and gecko-projects, and am currently creating the notes, and will push back to https://github.com/petermoore/gecko-dev and https://github.com/petermoore/gecko-projects.

When this is done I will provide some simple instructions about how to set a refspec to fetch notes from there, to access the notes. This is just a proof-of-concept. If people like it, we can of course push the notes to the mozilla account in the end (https://github.com/mozilla/gecko-dev and https://github.com/mozilla/gecko-projects).
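For anyone who wants to try the refspec mechanism before those instructions land, here is a self-contained sketch using two throwaway local repos (paths, identity, and the note text are all illustrative):

```shell
set -e
tmp=$(mktemp -d)

# "Upstream" repo with one commit and a note attached to it.
git -c init.defaultBranch=main init -q "$tmp/upstream"
git -C "$tmp/upstream" -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "demo commit"
git -C "$tmp/upstream" -c user.email=demo@example.com -c user.name=demo \
    notes add -m "hg id: 0000000000000000000000000000000000000000" HEAD

# A plain clone does NOT bring refs/notes/* over by default.
git clone -q "$tmp/upstream" "$tmp/clone"

# Opt in by adding a fetch refspec for notes, then fetch them.
git -C "$tmp/clone" config --add remote.origin.fetch "+refs/notes/*:refs/notes/*"
git -C "$tmp/clone" fetch -q origin

# The note now shows up in git log / git show output.
git -C "$tmp/clone" log -1 --notes
```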
I've published the notes to my clone of gecko-dev.

For example, see: https://github.com/petermoore/gecko-dev/commit/c2d4e3781d2f9d0077837ce021afdb68dc036bd4

Here you see the hg id appear under the "notes" section.

This is the same commit as comment 8, where we see what it looks like from the command line.

I've not experienced any performance issues - I added hg ids for all commits, apart from the last few which I picked up in a pull before pushing.
OK - so a belated update!

The code now sits nicely in the relengapi framework. It has been migrated from bottle to flask to fit with relengapi. The database layer has been converted to sqlalchemy with simple object-relational mapping. Unit tests have been written and can be run using nose; all 20 of them are currently passing.

There are different HTTP return codes for the various types of failures. The methods are returning json when possible, rather than HTML.

Currently the code lives here: https://github.com/petemoore/mapper/tree/master

Work is still in progress, so I have not asked for a review yet.

I'm in the process of following https://wiki.mozilla.org/Security/Reviews/Review_Request_Form#Questions_to_Address_within_Request_Body to start the security review process, and of getting it deployed to staging for further testing.

Currently using OAuth2 authentication (the same scheme used by the github and bitbucket APIs). This will be included in the security review, so no doubt we'll get some feedback about whether it is acceptable.
Flags: sec-review?
=== Who is/are the point of contact(s) for this review? ===

:pmoore

=== Please provide a short description of the feature / application (e.g. problem solved, use cases, etc.) ===

This is a RESTful API that provides mappings between hg changeset SHAs and git commit SHAs after the vcs2vcs mirroring process has run. It uses OAuth2 authentication, and we plan to run it in relengapi on the web cluster.

This service is consumed by the b2g_build.py mozharness module for building b2g. The b2g build requires the mappings in order to build its manifest file, which describes the commits that were pulled in from all repos as part of the build process.

=== Please provide links to additional information (e.g. feature page, wiki) if available and not yet included in feature description ===

Code currently lives here: https://github.com/petemoore/mapper/tree/master
Wiki page here: https://wiki.mozilla.org/ReleaseEngineering/Applications/Mapper

=== Does this request block another bug? If so, please indicate the bug number ===

Blocks: 799719 799845 929336 946019 962853 962863 962869 962870

=== This review will be scheduled amongst other requested reviews. What is the urgency or needed completion date of this review? ===

Because this blocks bugs 962863 and 946019, it is quite urgent.

=== To help prioritize this work request, does this project support a goal specifically listed on this quarter's goal list? If so, which goal? ===

Yes, this bug (847640) is listed as a 2014 Q1 releng goal in this doc:
https://docs.google.com/a/mozilla.com/spreadsheet/ccc?key=0Ajlqy6z1WDBvdGNxaFFpbnQ4UDdySzZVcGU1RkN4b0E&usp=sharing#gid=7

=== Please answer the following few questions: (Note: If you are asked to describe anything, 1-2 sentences shall suffice.) ===
        1) Does this feature or code change affect Firefox, Thunderbird or any product or service the Mozilla ships to end users?

No

        2) Are there any portions of the project that interact with 3rd party services?

No

        3) Will your application/service collect user data? If so, please describe 

No

    If you feel something is missing here or you would like to provide other kind of feedback, feel free to do so here (no limits on size):

=== Desired Date of review (if known from https://mail.mozilla.com/home/ckoenig@mozilla.com/Security%20Review.html) and whom to invite. ===

Tue 25 March seems like the soonest available slot?


Many thanks!
Pete
Add me as a POC too, for the relengapi bits.
Depends on: 985672
Flags: sec-review? → sec-review?(yboily)
Assignee: pmoore → sarentz
Hey folks just a heads up that I will be doing a quick review of this project.

Some additional questions:

* Where will this be deployed?
* Do you have an instance running anywhere at this moment where I can take a look?
* It looks like this is part of a bigger project, relengapi, does that project also need a review? Or was that already done?
Flags: needinfo?(pmoore)
This is a component in the relengapi *framework*, yes -- and relengapi is responsible for some of the security implementations here.  However, relengapi itself isn't quite ready for security review -- I'll be filing a request soon, though.
I have done a code review of this project. Since this is basically a single-file Flask Blueprint, I will just report my findings here in the bug.

I was a little concerned about abort(404, "<script>...</script>") possibilities in get_rev() where it passes the vcs_type blindly, but it seems that flask.abort() (which is an alias for werkzeug.exceptions.abort) properly escapes the error message.

Another thing that I find a little scary is this:

> q = q.filter("%s_changeset == :changeset" % (vcs_type,)).params(changeset=changeset)

That happens in _check_existing_sha() and in get_rev(). But, in the first case it is only called with "hg" and in the second case the vcs_type is properly checked. I think an explicit check for a valid vcs_type in _check_existing_sha() would give me a better feeling though. Just to have that check in one place and so that you don't forget when future changes are made. It is the one place where people could potentially mess with a (prepared) statement.

So no major concerns that must be addressed. Keep up the secure work!
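The suggested one-place check could look like this (a sketch only; the function and constant names are assumptions, not the actual mapper code):

```python
VALID_VCS_TYPES = ("git", "hg")

def changeset_filter_expression(vcs_type):
    """Whitelist vcs_type before it is interpolated into a filter
    string, so a caller can never smuggle SQL fragments into the
    (prepared) statement."""
    if vcs_type not in VALID_VCS_TYPES:
        raise ValueError("unknown vcs type: %r" % (vcs_type,))
    return "%s_changeset == :changeset" % (vcs_type,)
```

Both `_check_existing_sha()` and `get_rev()` could then route through this single helper instead of checking (or not checking) independently.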
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Just reopening, as my work is not quite finished on this yet.
Status: RESOLVED → REOPENED
Flags: needinfo?(pmoore)
Resolution: FIXED → ---
Assignee: sarentz → pmoore
Deployed in test environment, but still need to finish debugging before requesting review, e.g.:

https://api-pub-build.allizom.org/mapper/gecko-projects/mapfile/full
https://api-pub-build.allizom.org/mapper/gecko-dev/mapfile/full
https://api-pub-build.allizom.org/mapper/gecko/mapfile/full

are currently all returning the same content. Work in progress is here: https://github.com/petemoore/mapper (but please don't review yet as work is not completed).

Also working on bug 962853 in parallel (so that vcs sync pushes to mapper service):
https://github.com/petemoore/build-mozharness/compare/bug962853?expand=1

Pete
(In reply to Pete Moore [:pete][:pmoore] from comment #17)

> https://api-pub-build.allizom.org/mapper/gecko-projects/mapfile/full
> https://api-pub-build.allizom.org/mapper/gecko-dev/mapfile/full
> https://api-pub-build.allizom.org/mapper/gecko/mapfile/full
> 
> are currently all returning the same content.

I have finally just fixed this now (and will deploy in staging over the weekend or first thing on Monday - tomorrow I'm on build duty again): https://github.com/petemoore/mapper/commit/c917b6cad25fb0f96fb32e50d6525de58a456318

> Also working on bug 962853 in parallel (so that vcs sync pushes to mapper
> service):
> https://github.com/petemoore/build-mozharness/compare/bug962853?expand=1

This is now pretty much wrapped up, except relengapi does not yet natively have an authentication/authorisation mechanism we can use for pushing the inserts (:dustin is currently working on https://github.com/mozilla/build-relengapi/issues/7 which should fix this for us - there were some setbacks to the original plans to use oAuth2).

Tomorrow (Friday) I have build duty again, but I would hope on Monday morning to get the new version of mapper deployed to Staging, and will look at enabling the old check_client_ip decorator again, until the relengapi authentication/authorization mechanism is in place.
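The check_client_ip idea can be sketched framework-free like this (the real decorator would read flask.request.remote_addr; here the address lookup is injected, and all names are assumptions):

```python
def check_client_ip(allowed, get_remote_addr):
    """Reject calls whose client IP is not in the whitelist.
    `get_remote_addr` is injected so the sketch needs no web framework;
    in the real thing it would come from the request context."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            addr = get_remote_addr()
            if addr not in allowed:
                raise PermissionError("client IP %s not permitted" % addr)
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```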

Then I'd expect a couple of days of testing - by midweek to post patches for both mapper and vcs2vcs sync.

I've tested vcs sync against my "petermoore" github account (*not* the petemoore account).

See e.g.: https://github.com/petermoore/build-tools/commit/741a6ffe89d164d2c8220909347f509c5449f507

Note the git note at the bottom of this page:
Upstream source: https://hg.mozilla.org/build/tools/log/f64669f52207ed64577c563beaa2d6dded6c2b1b


Here are some comments from Hal:

 - is this in addition to mapper? or replace it? or be the implementation behind mapper?
> additional
 - git perf if it's behind mapper
> in testing, performance has been good
 - how to handle l10n -- no current combined repo
> this is a good point - if we go with it, we will need to merge the git notes from each individual repo before pushing to refs/notes/commits - not yet done
 - impact on signed commits (not something we do today, but does this preclude ever doing that)
> notes are registered against the SHA of the commit, and stored in "their own tree" - so the notes don't need to be signed, and they don't change the commits they refer to, so it should not be a problem
 - would need to be run by devs, as this will impact what they see, iiuc
> if they want the notes, they would add it to their refspec. by default, they won't get refs/notes/commits when cloning
 - supported by all git tooling? (tortoise on win, etc)
> git notes is a standard core feature of git - we can doublecheck tortoise on win - but since it is a core git feature, I'd be surprised if it wasn't supported on any platform
 - need to review for partner impact
> sure - but when they clone, unless they explicitly choose to include notes, they won't get them - also any git repo can have notes, so they would already potentially have this issue if any dev wrote notes to a native git repo they are pulling from
 - how are notes displayed on web views (github & git.m.o)?
See https://github.com/petermoore/build-tools/commit/741a6ffe89d164d2c8220909347f509c5449f507 for an example

What's the advantage? I just see the bug mention an alternative -- not what it saves us.
> it means you can be offline and do the lookup - you don't rely on a real-time online service to be available to perform mappings - as soon as you clone, you have both sets of SHAs.

How do we support tbpl & treeherder (which don't have checkouts afaik)?
> They can still use mapper.

What happens to "round tripped" changes? Can notes be deleted without changing shas?
> Yes - they are completely separate trees. They can be added later in time, for any historic commit, and they do not affect the commit, they are just registered against the sha of the commit.

That's a few thoughts off the top of my head. I _think_ this is a potential value add for dev users of the repo. I'm not sure it belongs in the official repos (i.e. yes in gecko-dev, gecko-projects, no in gecko.git) I do like that it provides offline access to the data for devs.

I'm also aware this might be a big maintenance challenge if we commit to it, then get surprised. So, I'd like to make sure we can make a case for how many devs will save how much time, not just the coolness factor. But I'm old school that way.

Neat thinking. fwiw, it's "easy" to attach meta-data to git (and I think hg - gps would know) repositories. e.g. github implements pull requests as additional refs (a full clone of a github repo that has had pull requests includes all of them -- that's why legacy explicitly only pulls certain parts from github -- I think Aki's code does the same)

> Yes - pull requests use a similar technique - they use refs under refs/pull/* - notes just uses refs/notes/* (or you can change it to anything you like - refs/notes is just the default).

--Hal
I am now deploying the following 4 products:

Product and version:
    relengapi 0.1.8
Environment:
    Staging
Location:
    https://api-pub-build.allizom.org/
Source code:
    https://github.com/mozilla/build-relengapi/tree/relengapi-0.1.8/base

Product and version:
    relengapi-docs 0.1.8
Environment:
    Staging
Location:
    https://api-pub-build.allizom.org/docs/
Source code:
    https://github.com/mozilla/build-relengapi/tree/relengapi-docs-0.1.8/docs

Product and version:
    mapper 0.3
Environment:
    Staging
Location:
    https://api-pub-build.allizom.org/mapper/
Source code:
    https://github.com/petemoore/mapper/tree/relengapi-mapper-0.3

Product and version:
    VCS Sync (mozharness) 1d409f44ebcd4c183b85079e8f6cef29319c814f
Environment:
    Staging
Location:
    github-sync4.dmz.scl3.mozilla.com, pmoore user, code in home directory (currently 54GB free space - build repos processing takes about 2.2GB)
Source code:
    https://github.com/petemoore/build-mozharness/commit/1d409f44ebcd4c183b85079e8f6cef29319c814f


I plan to leave the four of them running for a week in parallel to production. All being well, I will then attach the relengapi and mapper patches to this bug and a mozharness patch to bug 962853 for formal review (assuming staging tests all pass with flying colours).

I've completed my development / local testing, and am relatively confident that the new mapper is fully functional, the secure authentication is working, and the code in vcs sync to use/update the new mapper is working. I've run many tests locally and believe I've ironed out all the teething issues. I've updated the mapper tests to cover the new features (such as authentication), and I've also set up continuous integration for mapper on travis, which runs the full suite of mapper unit tests on every check-in, to keep the codebase as stable as possible.

I've checked with both Aki and Dustin that they are happy for me to use these environments for this purpose for this week.


Environment configuration
=========================

Despite being a staging deployment, it will be a full end-to-end production-like environment.

I will be mirroring the build-repos repositories only (https://hg.mozilla.org/build/*) to github (https://github.com/petermoore?tab=repositories).

VCS Sync email notifications will be sent to: pmoore@mozilla.com


Testing strategy
================

While the staging environment is running, I will be checking that all mappings are published, checking log files for errors, and will also try to do some destructive testing where I bring down e.g. relengapi and bring it back up again later, to check that mappings do not get missed and that email notifications get sent etc. I will also try disabling access to github (by removing the authorised key from github, and then re-adding it again) - and as many of these types of scenarios as possible.

Thanks,
Pete
Just a quick update: things are working well. Some example urls:

Repositories are continually updated, and hashes match official hashes, e.g.
https://github.com/petermoore?tab=repositories vs
https://github.com/mozilla?query=build-

Staging mapper can be queried, e.g.:
https://api-pub-build.allizom.org/mapper/build-puppet-manifests/mapfile/full

Git commits contain a note to the original source:
https://github.com/petermoore/build-tools/commit/a5066ac946bb61e1cbccab3a9766b654d2db206b (see Git Notes at bottom)

I've done quite a bit of destructive testing, but not been able to cause any long-term damage yet.

There have been a few fixes applied this week that came up in staging, will update the bug when the testing is complete, and attach patches for formal review.

This is just a "heads up" that things are going well. :)

Pete
So there have been a bunch of commits since last week, for a variety of issues:

VCS Sync:
*) Fix for double git note entries
*) Replaced sets module with builtins (see https://bugzilla.mozilla.org/show_bug.cgi?id=962853#c8)
*) Reports http response code in logs, if not 200
*) Now publishes mappings in batches of 200 at a time to mapper, to avoid load balancer timeouts
*) Made publish-to-mapper an action in its own right, as opposed to a part of update-work-mirror

Mapper:
*) git commit and hg changeset were stored the wrong way around - fixed all code and tests to have git commit first, hg changeset second, when receiving mappings
*) switched sqlalchemy filters to use .one() instead of .first() where appropriate, and handle NoResultFound, MultipleResultsFound
*) improved logging and comments

Relengapi:
*) fix to packaging of relengapi which caused some resources to be missing from package
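The batching change above can be sketched as follows (the batch size of 200 is from the list above; the function name and calling convention are assumptions):

```python
def batched(mappings, size=200):
    """Yield successive chunks of at most `size` mappings, so that each
    POST to mapper stays small enough to complete before the load
    balancer times the request out."""
    for i in range(0, len(mappings), size):
        yield mappings[i:i + size]
```

Each chunk would then be POSTed to the insert endpoint in its own request.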
Attached patch bug847640_mapper.patch (obsolete) — Splinter Review
So this is quite a big delta. Let me know if you'd like to go through any of it with me. It is deployed in relengapi staging, and I've had it running a week. It is also passing the unit tests in travis, with the exception of one pylint message:

************* Module relengapi.blueprints.mapper
E:224,34: Instance of 'tuple' has no 'utctimetuple' member (but some types could not be inferred) (maybe-no-member)

I haven't quite worked out what this means yet, but it shouldn't be blocking, and I'm on holiday tomorrow so wanted to get you this today.

Regarding relengapi changes, I've managed to get all those reviewed by dustin/callek, and they have all landed.

I will attach mozharness/vcs sync patch to bug 962853 for the vcs sync changes to publish to mapper. Both parts are done: this bug for the mapper code, and bug 962853 for vcs sync to write to mapper.

If you'd like to meet face-to-face to go through this, just drop me a meeting request and I'm sure we can find a good time.

Thanks,
Pete
Attachment #8430314 - Flags: review?(aki)
Pylint was dropped on its head as a child, you can't blame it.  That's obviously bogus, so it's fine to just disable that warning.
Flags: sec-review?(yboily)
(In reply to Dustin J. Mitchell [:dustin] from comment #23)
> Pylint was dropped on its head as a child, you can't blame it.  That's
> obviously bogus, so it's fine to just disable that warning.

Thanks Dustin!
Mapper is using the relengapi pylint config file, so I'm disabling this pylint test in relengapi - is that ok with you?

See:
https://github.com/mozilla/build-relengapi/pull/76

Thanks,
Pete
Comment on attachment 8430314 [details] [diff] [review]
bug847640_mapper.patch

Thanks Pete! I think this is close.

>+def _build_mapfile(q):
<snip>
>+    Returns:
>+        Text output, which is 40 characters of hg changeset sha, a space,
>+        40 characters of git commit sha, and a newline; or None if the query
>+        yields no results.
>+    """
>+    contents = '\n'.join('%s %s' % (r.git_commit, r.hg_changeset) for r in q)
>+    if contents:
>+        return Response(contents + '\n', mimetype='text/plain')

I think I 'yield'ed the data as it came in in the original:
https://github.com/escapewindow/mapper/blob/master/mapper/app.py#L57
I'm not sure how Response works, but we may want to keep an eye on response times when we query large mapfiles like gecko.  I imagine build/* repos are small enough that it doesn't matter terribly.

>+@bp.route('/<project>/rev/<vcs_type>/<commit>')
>+def get_rev(project, vcs_type, commit):
<snip>
>+    if vcs_type not in ("git", "hg"):
>+        abort(400, "Unknown vcs type %s" % vcs_type)

Hm, we could check for the valid vcs_type in _check_well_formed_sha() as well, but it looks like we only do it the one place, so no real difference.

>+@bp.route('/<project>/mapfile/full')
>+def get_full_mapfile(project):
>+    """Get a full mapfile. <project> can be a comma-delimited set of projects
>+
>+    Args:
>+        project: comma-delimited project names(s) string
>+
>+    Yields:
>+        full mapfile for the given project (git sha, hg sha) ordered by hg sha

I think it Returns, not Yields, correct?

>+def _insert_many(project, dups=False):
<snip>
>+    Exceptions:
>+        HTTPError 409: if dups=False and there were duplicate entries.
>+        HTTPError 400: if the content-type is incorrect
<snip>
>+        if dups:
>+            try:
>+                session.commit()
>+            except sa.exc.IntegrityError:
>+                session.rollback()

Does the rollback raise an HTTPError 206 like insert_many_no_dups docstring expects, or is it silent?  We need some sort of error here.

>+    if not dups:
>+        try:
>+            session.commit()
>+        except sa.exc.IntegrityError:
>+            abort(409, "some of the given mappings already exist")
>+    return jsonify()

I think 'dups' is a confusing name, which looks like my bug.  We should rename 'dups' to... 'allow_dups'? to clarify.

>+@bp.route('/<project>/insert', methods=('POST',))
>+@actions.mapper.mapping.insert.require()
>+def insert_many_no_dups(project):
>+    """Insert many mapfile entries via POST, and error on duplicate SHAs.
>+
>+    Args:
>+        project: single project name string (not comma-delimited list)
>+        POST data: mapfile lines (git_commit hg_changeset\n)
>+        Content-Type: text/plain
>+
>+    Returns:
>+        A string of unsuccessful entries.
>+
>+    Exceptions:
>+        HTTPError 206: if there were duplicate entries.

I don't think we actually throw a 206, though I don't know what session.rollback() does above.

>+@bp.route('/<project>/insert/ignoredups', methods=('POST',))
>+@actions.mapper.mapping.insert.require()
>+def insert_many_allow_dups(project):
>+    """Insert many mapfile entries via POST, and don't error on duplicate SHAs.
>+
>+    Args:
>+        project: single project name string (not comma-delimited list)
>+        POST data: mapfile lines (git_commit hg_changeset\n)
>+        Content-Type: text/plain
>+
>+    Returns:
>+        A string of unsuccessful entries.

Is this accurate?  I don't know what jsonify() does, but aiui we get that or a 409.

>+@bp.route('/<project>/insert/<git_commit>/<hg_changeset>', methods=('POST',))
>+@actions.mapper.mapping.insert.require()
>+def insert_one(project, git_commit, hg_changeset):
>+    """Insert a single mapping
>+
>+    Args:
>+        project: single project name string (not comma-delimited list)
>+        git_commit: 40 char hexadecimal string.
>+        hg_changeset: 40 char hexadecimal string.
>+
>+    Returns:
>+        A string row on success
>+
>+    Exceptions:
>+        HTTPError 500: No results found
>+        HTTPError 409: Mapping already exists for this project
>+        HTTPError 400: Badly formed sha.
>+    """
>+    session = g.db.session('mapper')
>+    proj = _get_project(session, project)
>+    _add_hash(session, git_commit, hg_changeset, proj)
>+    try:
>+        session.commit()
>+        q = Hash.query.join(Project).filter(_project_filter(project))
>+        q = q.filter("git_commit == :commit").params(commit=git_commit)
>+        return q.one().as_json()
>+    except sa.exc.IntegrityError:
>+        abort(409, "mapping already exists")
>+    except NoResultFound:
>+        abort(500, "row was not successfully entered in database!")
>+    except MultipleResultsFound:
>+        abort(500, "duplicate rows found in database!")

I think your docstring and returns are mismatched again.

>+@bp.route('/<project>', methods=('POST',))
>+@actions.mapper.project.insert.require()
>+def add_project(project):
>+    """Insert a new project into the DB.
>+
>+    Args:
>+        project: single project name string (not comma-delimited list)
>+
>+    Returns:
>+        200 OK on success
>+        409 Conflict if the project already exists

Does the jsonify() return a 200 ok, or is this also a mismatched docstring?

diff --git a/relengapi/blueprints/test_mapper.py b/relengapi/blueprints/test_mapper.py

tests++

diff --git a/setup.py b/setup.py
<snip>
>+      url='https://github.com/petemoore/mapper',

We'll want to move this someplace shared soon before/after deployment.
Attachment #8430314 - Flags: review?(aki) → review-
BTW, I've just updated https://wiki.mozilla.org/ReleaseEngineering/Applications/Mapper with some updated information.
Hi Aki,

Many thanks for your careful review.

I've created a new version, made the fixes you suggested, and did a few more doc rewrites (as they were not so consistent and clearly written) and have deployed the new mapper test version 0.7 to staging:

https://api-pub-build.allizom.org/versions

The updated code is here: https://github.com/petemoore/mapper/commits/relengapi-mapper-0.7

I'm testing this, and will then send it back over for review.

In the mean time, I'll also reply to your individual questions in the next comment.

Thanks,
Pete
(In reply to Aki Sasaki [:aki] from comment #25)
> Comment on attachment 8430314 [details] [diff] [review]
> bug847640_mapper.patch
> 
> Thanks Pete! I think this is close.
> 
> >+def _build_mapfile(q):
> <snip>
> >+    Returns:
> >+        Text output, which is 40 characters of hg changeset sha, a space,
> >+        40 characters of git commit sha, and a newline; or None if the query
> >+        yields no results.
> >+    """
> >+    contents = '\n'.join('%s %s' % (r.git_commit, r.hg_changeset) for r in q)
> >+    if contents:
> >+        return Response(contents + '\n', mimetype='text/plain')
> 
> I think I 'yield'ed the data as it came in in the original:
> https://github.com/escapewindow/mapper/blob/master/mapper/app.py#L57
> I'm not sure how Response works, but we may want to keep an eye on response
> times when we query large mapfiles like gecko.  I imagine build/* repos are
> small enough that it doesn't matter terribly.

This is an excellent spot. Indeed, in this implementation, the response can only begin to send once the contents string has been created, which it does by iterating through the elements in 'q'. This could take longer for bigger files, and use a lot of memory, when we have e.g. 1,000,000 rows. I have not tested with inserting very large mapfiles - I would like to do this next week. I can do a standalone test for this, inserting an existing map file for e.g. gecko-dev, and seeing how it performs when I query it. This test should not be too much work to set up or perform. If we have problems, I will rewrite to yield lines in the response as it iterates through the elements in 'q', rather than constructing the whole response in memory first.
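A pure-Python sketch of that yielding alternative (a Flask Response accepts a generator, so lines can be streamed as the query iterates; the Row type here is a stand-in for the SQLAlchemy model, not the deployed code):

```python
from collections import namedtuple

# Stand-in for the ORM-mapped Hash row.
Row = namedtuple("Row", "git_commit hg_changeset")

def iter_mapfile(rows):
    """Yield one 'git_sha hg_sha\n' line per row as the query streams,
    instead of joining the whole mapfile into one in-memory string.
    Passing this generator to Response lets the server start sending
    before the full result set has been materialised."""
    for r in rows:
        yield "%s %s\n" % (r.git_commit, r.hg_changeset)
```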

> 
> >+@bp.route('/<project>/rev/<vcs_type>/<commit>')
> >+def get_rev(project, vcs_type, commit):
> <snip>
> >+    if vcs_type not in ("git", "hg"):
> >+        abort(400, "Unknown vcs type %s" % vcs_type)
> 
> Hm, we could check for the valid vcs_type in _check_well_formed_sha() as
> well, but it looks like we only do it the one place, so no real difference.

Done.

> >+        if dups:
> >+            try:
> >+                session.commit()
> >+            except sa.exc.IntegrityError:
> >+                session.rollback()
> 
> Does the rollback raise an HTTPError 206 like insert_many_no_dups docstring
> expects, or is it silent?  We need some sort of error here.
> 

This section applies when duplicates are allowed, so it should be ok. Each mapping is inserted in turn, and if one causes an integrity error, only that one will be rolled back (i.e. the duplicate will be ignored). The code path where duplicates are not allowed will throw an HTTP error.
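That per-mapping commit/rollback behaviour can be modelled like this (sqlite3 stands in for the real SQLAlchemy session; the schema and names are assumptions, a sketch only):

```python
import sqlite3

def insert_mappings(conn, mappings, ignore_dups=False):
    """Insert (git_commit, hg_changeset) pairs one at a time. On a
    duplicate, only that row is rolled back; it is skipped when
    ignore_dups=True, otherwise the error propagates to the caller."""
    inserted = 0
    for git_commit, hg_changeset in mappings:
        try:
            conn.execute(
                "INSERT INTO hashes (git_commit, hg_changeset) VALUES (?, ?)",
                (git_commit, hg_changeset))
            conn.commit()
            inserted += 1
        except sqlite3.IntegrityError:
            conn.rollback()
            if not ignore_dups:
                raise
    return inserted
```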

> >+    if not dups:
> >+        try:
> >+            session.commit()
> >+        except sa.exc.IntegrityError:
> >+            abort(409, "some of the given mappings already exist")
> >+    return jsonify()
> 
> I think 'dups' is a confusing name, which looks like my bug.  We should
> rename 'dups' to... 'allow_dups'? to clarify.

I renamed it to ignore_dups to be consistent with the function names, and also because I thought "allow dups" might be interpreted as meaning that duplicate entries are allowed in the database, whereas "ignore" is more like "don't put it in there" (it is only a subtle difference).

> 
> >+@bp.route('/<project>/insert/ignoredups', methods=('POST',))
> >+@actions.mapper.mapping.insert.require()
> >+def insert_many_allow_dups(project):
> >+    """Insert many mapfile entries via POST, and don't error on duplicate SHAs.
> >+
> >+    Args:
> >+        project: single project name string (not comma-delimited list)
> >+        POST data: mapfile lines (git_commit hg_changeset\n)
> >+        Content-Type: text/plain
> >+
> >+    Returns:
> >+        A string of unsuccessful entries.
> 
> Is this accurate?  I don't know what jsonify() does, but aiui we get that or
> a 409.

No, it wasn't accurate - good spot. Basically, I hadn't been keeping the docstrings up to date, so I've gone through and reviewed them all - the biggest part of the changes you'll see is (hopefully) improved docstrings.


> 
> >+@bp.route('/<project>', methods=('POST',))
> >+@actions.mapper.project.insert.require()
> >+def add_project(project):
> >+    """Insert a new project into the DB.
> >+
> >+    Args:
> >+        project: single project name string (not comma-delimited list)
> >+
> >+    Returns:
> >+        200 OK on success
> >+        409 Conflict if the project already exists
> 
> Does the jsonify() return a 200 ok, or is this also a mismatched docstring?

Yes - jsonify() is normally used to return JSON built from the keyword arguments you give it (e.g. jsonify(a='car', b='dog') would return something like {"a": "car", "b": "dog"} in the response body), with a 200 HTTP status code. In the case above, it just returns empty JSON. In general, I wanted to return JSON where suitable (although this case is arguable) so that callers of the API can interpret results by parsing the JSON responses.
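As a rough illustration (stdlib json standing in for Flask's jsonify, which additionally wraps the body in a response object carrying the 200 status):

```python
import json

# Roughly what the response bodies described above would contain.
# jsonify(a='car', b='dog') -> JSON object built from the kwargs:
body = json.dumps({'a': 'car', 'b': 'dog'}, sort_keys=True)
print(body)   # {"a": "car", "b": "dog"}

# jsonify() with no arguments -> an empty JSON object:
empty = json.dumps({})
print(empty)  # {}
```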

> diff --git a/setup.py b/setup.py
> <snip>
> >+      url='https://github.com/petemoore/mapper',
> 
> We'll want to move this someplace shared soon before/after deployment.

Agreed - I'll ping Hal about this to check what the process is for adding a new repo.

Thanks Aki!
Pete
(In reply to Pete Moore [:pete][:pmoore] from comment #28)
> (In reply to Aki Sasaki [:aki] from comment #25)
> > >+def _build_mapfile(q):
> > <snip>
> > >+    Returns:
> > >+        Text output, which is 40 characters of hg changeset sha, a space,
> > >+        40 characters of git commit sha, and a newline; or None if the query
> > >+        yields no results.
> > >+    """
> > >+    contents = '\n'.join('%s %s' % (r.git_commit, r.hg_changeset) for r in q)
> > >+    if contents:
> > >+        return Response(contents + '\n', mimetype='text/plain')
> > 
> > I think I 'yield'ed the data as it came in in the original:
> > https://github.com/escapewindow/mapper/blob/master/mapper/app.py#L57
> > I'm not sure how Response works, but we may want to keep an eye on response
> > times when we query large mapfiles like gecko.  I imagine build/* repos are
> > small enough that it doesn't matter terribly.
> 
> This is an excellent spot. Indeed, in this implementation, the response can
> only begin to send once the contents string has been created, which it does
> by iterating through the elements in 'q'. This could take longer for bigger
> files, and use a lot of memory, when we have e.g. 1,000,000 rows. I have not
> tested with inserting very large mapfiles - I would like to do this next
> week. I can do a standalone test for this, inserting an existing map file
> for e.g. gecko-dev, and seeing how it performs when I query it. This test
> should not be too much work to set up or perform. If we have problems, I
> will rewrite to yield lines in the response as it iterates through the
> elements in 'q', rather than constructing the whole response in memory first.

Currently pushing a test mapfile to https://api-pub-build.allizom.org/mapper/build-big-test/mapfile/full with 235323 mappings - this should be uploaded in the next hour or so.

Funnily enough, I got the sample mapfile from: https://raw.githubusercontent.com/Nephyrin/mozilla-git-mapfile/12c7cddc4d73e5b5d67df95052a30dcef8108018/hg-git-mapfile

When it has uploaded, I'll run some more tests, like hitting https://api-pub-build.allizom.org/mapper/build-big-test/mapfile/full and seeing if there is a long delay before returning the page.

> > diff --git a/setup.py b/setup.py
> > <snip>
> > >+      url='https://github.com/petemoore/mapper',
> > 
> > We'll want to move this someplace shared soon before/after deployment.
> 
> Agreed - I'll ping Hal about this to check what the process is for adding a
> new repo.

Found the wiki page for the process and raised: https://bugzilla.mozilla.org/show_bug.cgi?id=1020131
So for a combined mapfile (with 235323 mappings):
https://raw.githubusercontent.com/Nephyrin/mozilla-git-mapfile/12c7cddc4d73e5b5d67df95052a30dcef8108018/hg-git-mapfile
it can be downloaded from mapper here:
https://api-pub-build.allizom.org/mapper/build-big-test/mapfile/full

It typically takes 10-15 seconds to start returning content.

Therefore, there does seem to be a reasonable risk that, at some point, a call to this URL could hit the 30s limit (load balancer HTTP 500 timeout) if the machine were performing very slowly under heavy load, for a mapfile of this size (which is pretty much the biggest we will get).

The mapfile I tested with is a mapfile combining all mozilla-central-like trees (see https://github.com/Nephyrin/mozilla-git) - very similar to how https://github.com/mozilla/gecko-projects will be.

I think this shouldn't block the rollout of mapper, since retrieving the full mapfile is not something that will occur in any automation, but it is certainly something that should be addressed, so I'll create a bug for it now.

Thanks Aki for spotting this.

Pete
Created bug 1020554 for this
(In reply to Pete Moore [:pete][:pmoore] from comment #30)
> I think this shouldn't block the rollout of mapper, since retrieving the
> full mapfile is not something that will occur in any automation, but it is
> certainly something that should be addressed, so I'll create a bug for it
> now.
> 
> Thanks Aki for spotting this.

Np.  We can hopefully resolve before we put gecko* on new mapper, or at least before retiring the old mapper/mapfile publishing code.
For completeness, attaching also a complete patch which is equivalent to the two incremental patches so far (i.e. https://bugzilla.mozilla.org/attachment.cgi?id=8430314 and https://bugzilla.mozilla.org/attachment.cgi?id=8434431)

Thanks,
Pete
Attachment #8430314 - Attachment is obsolete: true
Attachment #8434433 - Flags: review?(aki)
Comment on attachment 8434431 [details] [diff] [review]
Incremental mapper patch, against previous patch

>+@bp.route('/<projects>/rev/<vcs_type>/<commit>')
>+def get_rev(projects, vcs_type, commit):
<snip>
>     Exceptions:
>-        HTTPError 400: if an unknown vcs
>-        HTTPError 400: if a badly formed sha
>-        HTTPError 404: if row not found.
>+        HTTP 400: if an unknown vcs
>+        HTTP 400: if a badly formed sha
>+        HTTP 404: if row not found
>     """
<snip>
>     except MultipleResultsFound:
>-        return "internal error - multiple results returned, should not be possible in database", 500
>+        abort(500, "Internal error - multiple results returned for %s commit %s"
>+              "in project %s - this should not be possible in database"
>+              % (vcs_type, commit, projects))

Should we add the ISE 500 to the Exceptions in the docstring?

>     for line in request.stream.readlines():
>         line = line.rstrip()
>         try:
>             (git_commit, hg_changeset) = line.split(' ')
>         except ValueError:
>             logger.error("Received input line: '%s' for project %s", line, project)
>             logger.error("Was expecting an input line such as "
>                          "'686a558fad7954d8481cfd6714cdd56b491d2988 "
>                          "fef90029cb654ad9848337e262078e403baf0c7a'")
>             logger.error("i.e. where the first hash is a git commit SHA "
>                          "and the second hash is a mercurial changeset SHA")
>-            abort(400, "Input line received did not contain a space")
>+            abort(400, "Input line '%s' received for project %s did not contain a space"
>+                  % (line, project))
>             # header/footer won't match this format
>             continue
>         _add_hash(session, git_commit, hg_changeset, proj)
>-        if dups:
>+        if ignore_dups:
>             try:
>                 session.commit()
>             except sa.exc.IntegrityError:
>                 session.rollback()
>-    if not dups:
>+    if not ignore_dups:
>         try:
>             session.commit()
>         except sa.exc.IntegrityError:
>-            abort(409, "some of the given mappings already exist")
>+            abort(409, "Some of the given mappings for project %s already exist"
>+                  % project)

Ok, to clarify, if ignore_dups we do a session.commit() per line and roll back if it's a dup.
And if ignore_dups is False we do only one session.commit() covering every line.
Accurate?  Do we need a session.rollback() if ignore_dups is False?

Just clarifying for my own edification.

> @bp.route('/<project>/insert/<git_commit>/<hg_changeset>', methods=('POST',))
> @actions.mapper.mapping.insert.require()
> def insert_one(project, git_commit, hg_changeset):
<snip>
>     Exceptions:
>-        HTTPError 500: No results found
>-        HTTPError 409: Mapping already exists for this project
>-        HTTPError 400: Badly formed sha.
>+        HTTP 500: No results found
>+        HTTP 409: Mapping already exists for this project
>+        HTTP 400: Badly formed sha
>     """
>     except NoResultFound:
>-        abort(500, "row was not successfully entered in database!")
>+        abort(500, "Provided mapping %s %s for project %s could not be inserted "
>+              "into the database" % (git_commit, hg_changeset, project))
>     except MultipleResultsFound:
>-        abort(500, "duplicate rows found in database!")
>+        abort(500, "Provided mapping %s %s for project %s has been inserted into "
>+              "the database multiple times" % (git_commit, hg_changeset, project))

Do we want to add the "duplicate rows" 500 to the docstring?

Sorry for being a stickler about docstrings; I'd like them to be accurate from the beginning, at the very least.

r+ with the docstring updates, assuming my read of the _insert_many() code is right.
Attachment #8434431 - Flags: review?(aki)
Comment on attachment 8434433 [details] [diff] [review]
Replacement mapper patch, including all changes so far

r+ based on the incremental patch.  Thanks Pete!
Attachment #8434433 - Flags: review?(aki) → review+
(In reply to Aki Sasaki [:aki] from comment #35)
> 
> Should we add the ISE 500 to the Exceptions in the docstring?
> 
Agreed - done
> >     for line in request.stream.readlines():
> >         line = line.rstrip()
> >         try:
> >             (git_commit, hg_changeset) = line.split(' ')
> >         except ValueError:
> >             logger.error("Received input line: '%s' for project %s", line, project)
> >             logger.error("Was expecting an input line such as "
> >                          "'686a558fad7954d8481cfd6714cdd56b491d2988 "
> >                          "fef90029cb654ad9848337e262078e403baf0c7a'")
> >             logger.error("i.e. where the first hash is a git commit SHA "
> >                          "and the second hash is a mercurial changeset SHA")
> >-            abort(400, "Input line received did not contain a space")
> >+            abort(400, "Input line '%s' received for project %s did not contain a space"
> >+                  % (line, project))
> >             # header/footer won't match this format
> >             continue
> >         _add_hash(session, git_commit, hg_changeset, proj)
> >-        if dups:
> >+        if ignore_dups:
> >             try:
> >                 session.commit()
> >             except sa.exc.IntegrityError:
> >                 session.rollback()
> >-    if not dups:
> >+    if not ignore_dups:
> >         try:
> >             session.commit()
> >         except sa.exc.IntegrityError:
> >-            abort(409, "some of the given mappings already exist")
> >+            abort(409, "Some of the given mappings for project %s already exist"
> >+                  % project)
> 
> Ok, to clarify, if ignore_dups we do a session.commit() per line and roll
> back if it's a dup.
> And if ignore_dups is False we do only one session.commit() covering every line.
> Accurate?  Do we need a session.rollback() if ignore_dups is False?
> 
> Just clarifying for my own edification.

That's right - only one commit after inserting all mappings if ignore_dups is False; if it is True, we instead commit after every row, so that any failure results only in rolling back the unsuccessful entry. But indeed, you are quite right that we should also have a session.rollback() if the (one and only) commit fails when ignore_dups is False - so I've now added this. Thanks for spotting this.
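To make the agreed behaviour concrete, here is a toy model of the control flow. FakeSession and the local IntegrityError are illustrative stand-ins, not the real SQLAlchemy objects used in the patch, and insert_many is a sketch of _insert_many's logic rather than the actual view code:

```python
class IntegrityError(Exception):
    """Stands in for sqlalchemy.exc.IntegrityError."""

class FakeSession:
    """A 'table' with a uniqueness constraint, plus commit/rollback."""
    def __init__(self):
        self.rows = set()      # committed rows
        self.pending = []      # uncommitted rows
    def add(self, row):
        self.pending.append(row)
    def commit(self):
        if any(r in self.rows for r in self.pending):
            raise IntegrityError("duplicate row")
        self.rows.update(self.pending)
        self.pending = []
    def rollback(self):
        self.pending = []

def insert_many(session, mappings, ignore_dups):
    for m in mappings:
        session.add(m)
        if ignore_dups:
            try:
                session.commit()       # one commit per row...
            except IntegrityError:
                session.rollback()     # ...so only the duplicate is dropped
    if not ignore_dups:
        try:
            session.commit()           # a single commit for the whole batch
        except IntegrityError:
            session.rollback()         # the fix discussed above
            raise                      # the view would abort(409, ...) here
```

With ignore_dups=True, duplicates are silently skipped and the rest of the batch survives; with ignore_dups=False, any duplicate makes the whole batch fail.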

> 
> Do we want to add the "duplicate rows" 500 to the docstring?
> 
> Sorry, being a stickler for docstrings because I'd like them to be accurate
> at the beginning at the very least.

Done.

> 
> r+ with the docstring updates, assuming my read of the _insert_many() code
> is right.

In addition to the above, there was a backwardly incompatible change to the relengapi authentication module, which also required some minor changes on the mapper side.

I'm going to go ahead and raise a (dependent) bug now to get this (mapper and relengapi) deployed into production tomorrow (assuming you are happy with this final iteration of changes). And I'll also raise a new bug for updating vcs_sync so the build/* repos publish to the new mapper, so hopefully by the end of tomorrow the build/* repos are publishing to the new mapper in production, securely.

Thanks Aki!
Pete
Attachment #8434433 - Attachment is obsolete: true
Attachment #8437921 - Flags: review?(aki) → review+
Depends on: 1023541
Attachment #8437920 - Flags: checked-in+
Component: Tools → General