Closed Bug 849345 Opened 11 years ago Closed 9 years ago

Design schema for MySQL pushlog database

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: ted, Unassigned)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/738] )

We'll need a schema for a MySQL pushlog db. The current single-repository sqlite schema is here:
http://hg.mozilla.org/hgcustom/hghooks/file/tip/mozhghooks/pushlog.py#l40
One thing that we're exposing in the pushlog API right now is that per-repo push ID.

It'd be good if that would continue to work as is, at least for a transition period.
As we move forward, we're more likely to be interested in other attributes of the cset, such as branch or bookmark. While hg bookmarks can be retrieved remotely, branches can not.

Since we're redoing the schema, seems a good time to add (at least) hg branch to the schema. Bonus points for adding the bookmarks (but note that is plural in general, fine to just store/report the active one, imo).

If we do add branch & bookmark, we'll have to pick an arbitrary maximum length for MySQL (based on Sheeri's comments in bug 827123). Please coordinate with RelEng on picking something appropriate (as much of the automation generates the longer strings)
Hal's comment may cover this already, but there are some pushlog operations which are slow because they need to query the repo to get more information. Bonus points for stuffing that into the db instead. I can't remember any examples right now but hopefully Ted/Axel/others might.
I only use what's currently in the json version of pushes,

 "2240": {
  "date": 1362670203, 
  "changesets": [
   "b859178117e3f70abb4e7e0a5c8b39eb9115577c", 
   "1295953a4b7b93e3ead2fb22f8bff4c9b67a6a2d"
  ], 
  "user": "ffxbld"
 }

the key "2240" being the push id per repo. I don't think I use that actual data for anything but keeping track which pushes I already got for an individual repo.

All the details I need beyond that are added by inspecting a local clone of the repo, which I update once I get a push.
How much data do we really need out of the repository? Essentially, as you can see from the schema linked in comment 0, the pushlog db stores a pretty minimal set of information: a push has an id, the user who pushed it, the timestamp, and the changeset ids that were included in this push. Not included is any of the metadata from the changeset: the changeset author, description, branch/tag info and files changed.
Since we're building for the future here, can we check with the tbpl2 folks to see if they need something to avoid expensive queries?
Ed: can you speak to comment 6?
(In reply to Hal Wine [:hwine] from comment #6)
> Since we're building for the future here, can we check with the tbpl2 folks
> to see if they need something to avoid expensive queries?

I can't think of anything, but jeads might.
Flags: needinfo?(jeads)
Can we also do proactive query reviews on dev/stage to periodically check for expensive queries, so we can catch them before they go into production?
Ted - one other item dawned on me. It's highly likely that we'll have some repositories on git.m.o being committed to within the next year. That will require that git.m.o support pushlog. With per repo pushlog databases, that didn't seem like a major hurdle.

I'm not sure what the right answer will be - one grand pushlog service, or a per-repo-type pushlog service, or something else. And, I'm not suggesting making those decisions is a blocker to this work. Rather, it may influence some design tradeoffs in this new schema.
Hal: is there a bug on supporting a git pushlog? That's the first I've heard of it.
The expensive query part comes from looking at the pushlog joined with test data over long time courses. We've been storing json returned by json-pushes in this mysql schema, https://github.com/mozilla/datazilla/blob/master/datazilla/model/sql/template_schema/schema_hgmozilla.sql.tmpl#L95, to analyze talos data trends over time. For tbpl2 we would like to store either the push id or relevant timestamps directly with test data so we can use it to order test data according to push order when we retrieve it. That way we avoid the expensive part of the join altogether, it would enable us to analyze data over longer time courses. To do this we will likely need to retrieve the push id or timestamp for a particular changeset when we receive/load test data for storage in tbpl2. I don't think this will require anything different in the pushlog schema design but it will make sense to change the data ingestion protocol in tbpl2 to make use of a non-polling strategy, we're hoping to utilize http://pulse.mozilla.org/ to accomplish this.

If there's going to be a git.m.o pushlog in the future it would be great if it could be encapsulated under a single web service for downstream consumers.
Flags: needinfo?(jeads)
Product: mozilla.org → Release Engineering
Product: Release Engineering → Developer Services
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/112]
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/112] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/738] [kanban:engops:https://kanbanize.com/ctrl_board/6/112]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/738] [kanban:engops:https://kanbanize.com/ctrl_board/6/112] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/738]
There are no plans to move pushlog off SQLite at this time.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.