Closed Bug 453046 Opened 16 years ago Closed 16 years ago

create hook to mirror pushlog db

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Pike, Assigned: Pike)

References

Details

Attachments

(2 files, 4 obsolete files)

I'd love to have a hook to get the actual pushlog.db data out of hg.m.o.

Preferably in json(p). The api would be rather simple, you'd paste in a range of push ids, possible open ended, and it would return a list of pushes and a list of changesets, just as they're in the actual db.

That would enable richer applications to do pushdb-based work without having to hit hg.m.o for each operation.

Possible applications would be a sign-off web app or custom queries or feeds. For example, I'd love to have a syndicated feed of all l10n-check-ins. I could set that up easily on the l10n server, if I could get to the real pushlog data locally.

It might be great for future development or other analysis, too.
Can you specify the API a bit more? Would this take a start= param? Would that param always be an actual push, or just a cset? Do we need to throttle this at all?
Attached patch non-web hook for json export (obsolete) — Splinter Review
I didn't do the web parts yet, but this should give you an idea.

I'm still playing with the idea of adding jsonp support in the web part, though I'm not sure how useful that would be. It'd be mostly useful for web apps directly using the output, and I'm not sure there's a lot they could do with it now.

I think we should offer start (and end) params, as that's making it easier to process the results, like "no news is good news" is much easier than to diff two complete database dumps.
This needs a modification to map-json to work,

+export ='{data|mozjson}'

but is otherwise pretty fine.

The webhook goes to /json-export, not sure if that's the best name for it.

I'm ambiguous on whether this should be in a separate hook file or in pushlog-feed.py. I'm leaning towards having a separate webcommand for this, as the params are different, startID and endID, where the pushids are > startID and <= endID.
Assignee: nobody → l10n
Attachment #339261 - Flags: review?(ted.mielczarek)
Comment on attachment 339261 [details]
json-export hook and printpushlog command

>def exportworker(basepath, startID=0, endID=None):
>    push_statement = 'SELECT id, user, date FROM pushlog where id > ?'
>    ctx_statement = 'SELECT pushid, rev, node FROM changesets where pushid > ?'

I'd declare ctx_statement a little later, where you're actually using it. As it stands, it's completely unclear why you need two queries at all; maybe some comments in that area would be fine as well. And/or could the queries be folded into one join?

>    args = (startID,)
>    if endID is not None:
>        push_statement += ' and id <= ?'
>        args = (startID, endID)
>    if os.path.basename(basepath) != '.hg':
>        basepath = os.path.join(basepath, '.hg')
>    conn = sqlite.connect(os.path.join(basepath, 'pushlog2.db'))
>    pushes = conn.execute(push_statement, args).fetchall()
>    pushjsons = [dict(zip(('id','user','date'), t)) for t in pushes]

Style nit: 'id', 'user', 'date' (with spaces), as done by Python code everywhere. Some extra vertical whitespace to group lines would perhaps also be nice.

>    ctxs = []
>    if pushes:
>        if endID is not None:
>            ctx_statement += ' and pushid <= ?'
>        changesets = conn.execute(ctx_statement, args).fetchall()
>        ctxs = [dict(zip(('pushid','rev','node'), t)) for t in changesets]
>    return dict(pushlog = pushjsons, changesets = ctxs)

More style nits: same thing for pushid, rev, node. Returning {'pushlog': pushjsons, 'changesets': ctxs} would be a little clearer.

>def export(web, req, tmpl):
>    repo = web.repo
>    repopath = repo.path

Why not fold those into one line?

>    startID = ('startID' in req.form and req.form['startID'][0]) or 0
>    endID = ('endID' in req.form and req.form['endID'][0]) or None

You can just leave out the parentheses here to make it a little more readable.
(In reply to comment #4)
> (From update of attachment 339261 [details])
> >def exportworker(basepath, startID=0, endID=None):
> >    push_statement = 'SELECT id, user, date FROM pushlog where id > ?'
> >    ctx_statement = 'SELECT pushid, rev, node FROM changesets where pushid > ?'
> 
> I'd declare ctx_statement a little later, where you're actually using it. As it
> stands, it's completely unclear why you need two queries at all; maybe some
> comments in that area would be fine as well. And/or could the queries be folded
> into one join?

There can be N changesets per push, so the two queries will return a different amount of return values. So while I probably could join the two queries, I had to rip them into two distinct lists afterwards. Doesn't sound like a good compromise to me.

<...>

> More style nits: same thing for pushid, rev, node. Returning {'pushlog':
> pushjsons, 'changesets': ctxs} would be a little clearer.

I find dict(foo = bar) easier to read than {'foo': bar}

> >def export(web, req, tmpl):
> >    repo = web.repo
> >    repopath = repo.path
> 
> Why not fold those into one line?

Copy and paste is my reason, I'll fix that.

> >    startID = ('startID' in req.form and req.form['startID'][0]) or 0
> >    endID = ('endID' in req.form and req.form['endID'][0]) or None
> 
> You can just leave out the parentheses here to make it a little more readable.

Not sure if "foo in bar and bar or stuff" is easier to read with or without parenthesis. Probably depends on how popular that pattern is in the first place, and how it's commonly written.
(In reply to comment #5)
> There can be N changesets per push, so the two queries will return a different
> amount of return values. So while I probably could join the two queries, I had
> to rip them into two distinct lists afterwards. Doesn't sound like a good
> compromise to me.

Ah, yes. It just seems to me that it'd make more sense to structure the output document not as two tables but as one list of pushes where each push has its own list of included changesets. As an export feature, that would be a more sensible option, IMO, although as a mirror/backup feature your design makes more sense.

> Not sure if "foo in bar and bar or stuff" is easier to read with or without
> parenthesis. Probably depends on how popular that pattern is in the first
> place, and how it's commonly written.

It's pretty popular pattern, and usually written without the parentheses. In Python 2.5, it gets replaced with "bar if foo in bar else stuff", but I think the server is still running 2.4.
Comment on attachment 339261 [details]
json-export hook and printpushlog command

First off, I think you should just put this all in pushlog-feed.py. Arguably we should rename that file at this point. That will make it less of a hassle to deploy, and keep all the pushlog web extensions in one place.

>def exportworker(basepath, startID=0, endID=None):
>    push_statement = 'SELECT id, user, date FROM pushlog where id > ?'
>    ctx_statement = 'SELECT pushid, rev, node FROM changesets where pushid > ?'

I think you should just use one query here, instead of doing one query per push. The only reason the normal pushlog does it that way is because I'm using LIMIT, and that won't play nice otherwise. Just do:

select id, user, date, rev, node from pushlog INNER JOIN changesets ON id = pushid;

Then you can maintain a dict of the pushes, and their associated changesets. Something like:
pushes = { 1: { 'user': 'luser', 'date': '123456', 'changesets': ['abcd', ... ] },
           2: ... }

The side benefit here would be that you could return this data structure directly. I don't think the JSON data structure you're returning feels very natural, as you have to map back and forth between pushes and changesets using the push id, where with my proposal above, you can ignore the push ids (except to use them for ordering), and the changesets are grouped inside the data structure by push.

>    return dict(pushlog = pushjsons, changesets = ctxs)

Aside from my comments I'm fine with the general concept.
Attachment #339261 - Flags: review?(ted.mielczarek) → review-
Right now, I'm doing two queries, one for all pushes, one for all changesets.

I'm not sure I favor a completely different data structure, too. What I want in the end is to recreate the pushlog2.db locally. Changing the format will require me to postprocess the lists on the server side into a, granted, more natural datastructure, and then parse that back into the original tables on the client.

I wonder, I should probably explicitly ORDER BY no matter which, right? ORDER BY id ASC, rev ASC ?
Attached patch address comments, change format (obsolete) — Splinter Review
Let's try this this way around.
Attachment #336316 - Attachment is obsolete: true
Attachment #339261 - Attachment is obsolete: true
Attachment #341434 - Flags: review?(ted.mielczarek)
Attachment #341434 - Flags: review?(dirkjan)
Comment on attachment 341434 [details] [diff] [review]
address comments, change format

+    stmnt = 'SELECT id, user, date, rev, node from pushlog INNER JOIN changesets ON id = pushid WHERE id > ? %s ORDER BY id ASC, rev ASC'

ASC is the default, I'd just leave it out personally.

+    for id, user, date, rev, node in conn.execute(stmnt, args):
+        if id in pushes:
+            pushes[id]['changesets'].append(node)
+            continue
+        pushes[id] = {'user': user,
+                      'date': date,
+                      'changesets': [node]
+                      }

Could you just stick the pushes[id] in an else: block, and ditch the continue? Doesn't buy you much here since there's only one statement after it.

+    return tmpl('export', data=data)

Is there an export template somewhere I'm missing?

+addwebcommand(export, 'export')

I don't think export is the best name for this command, since that doesn't tell you at all what it's exporting. I might suggest 'pushlogexport', but djc would probably say 'pushlog-json' and use the style or whatever.

Overall looks good though, r=me.
Attachment #341434 - Flags: review?(ted.mielczarek) → review+
Attached patch missing patch to hg_templates (obsolete) — Splinter Review
Right, forgot the patch to hg_templates, doh.
Attachment #341443 - Flags: review?(ted.mielczarek)
Attachment #341443 - Flags: review?(ted.mielczarek) → review+
Attachment #341434 - Flags: review?(dirkjan) → review+
Comment on attachment 341434 [details] [diff] [review]
address comments, change format

+    if endID is not None:
+        stmnt = stmnt % 'and id <= ?'
+        args = (startID, endID)
+    else:
+        stmnt = stmnt % ""

Nit: I think stmt (rather than stmnt, which reads awkwardly IMO) is kind of a convention.

>+def export(web, req, tmpl):
>+    repo = web.repo
>+    repopath = repo.path

These could still be folded into one.

>+addwebcommand(export, 'export')

Yeah, I think json-pushlog is the best solution, but that's not very easy to do with current hg. I like export because it's short, but pushlogjson also works, I guess.
The actual remote url is http://localhost:8000/json-export, btw.
That's still not really clear to me what you're exporting JSON of. Should mention pushlog in there somewhere.
json-pushes, maybe?
json-pushes sounds cool to me.

{
 "1": {
  "date": 1222953629, 
  "changesets": [
   "56c0cbfb3a49b29fe6e5c34ccb2a352d38115e4b", 
   "251c44a713899678b3b7787560301caa8d2288c1"
  ], 
  "user": "axelhecht"
 }, 
 "2": {
  "date": 1222953629, 
  "changesets": [
   "972d59af0ec5710cb052b1c4b6accdb233edab5b"
  ], 
  "user": "axelhecht"
 }
}

is the answer to that question, which sounds reasonable.
As discussed on #developers with Ted and Dirkjan, we're going for "pushes" as the entry point. We might consolidate the other formats on that entry point, too.

Addressed the other comments, too.

New patch for hg_templates coming up.

Carrying over the reviews.
Attachment #341434 - Attachment is obsolete: true
Attachment #341491 - Flags: review+
Attachment #341443 - Attachment is obsolete: true
Attachment #341493 - Flags: review+
How do I proceed? Push to ssh://hg.mozilla.org/hg_templates/ and ssh://hg.mozilla.org/users/bsmedberg_mozilla.com/hgpoller/ and file an IT bug on putting them on hg.m.o?
Yes, that would become a Server Ops bug. Although asking for deployment isn't done for every bug, I presume you have an actual use for this you'd want to deploy, so you should probably go ahead and do that.
http://hg.mozilla.org/hg_templates/rev/27ed52884b85 and http://hg.mozilla.org/users/bsmedberg_mozilla.com/hgpoller/rev/58dba74cfa4b, marking FIXED.

I'll file a server ops bug to get this live, I'd like to use this to make opt-in easier. The server ops bug will depend on this one.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Blocks: 458276
Product: mozilla.org → Release Engineering
Product: Release Engineering → Developer Services
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: