827123 - Replicate sqlite pushlog files to all mercurial hosts so we can eliminate NFS

Shyam Mani [:fox2mike]

Reporter

Description

•

13 years ago

Tracker/Discussion bug (spawned from bug 826099)

Shyam Mani [:fox2mike]

Reporter

Updated

•

13 years ago

Assignee: server-ops → server-ops-devservices

Component: Server Operations → Server Operations: Developer Services

Ben Kero [:bkero]

Comment 1

•

13 years ago

I've recently been considering the idea of switching pushlog over to a real database (with a table per repo). This would make life much easier for the hgweb heads, since that will remove the requirement of rsyncing (or recalculating) a new pushlog file with every commit. Callek said he has no problems with this, although needs to consult with the rest of releng before making this decision.

Aki Sasaki (not active)

Comment 2

•

12 years ago

This could help with bug 498641.

(not currently active) Ted Mielczarek

Comment 3

•

12 years ago

(In reply to Ben Kero [:bkero] from comment #1) > I've recently been considering the idea of switching pushlog over to a real > database (with a table per repo). This would make life much easier for the > hgweb heads, since that will remove the requirement of rsyncing (or > recalculating) a new pushlog file with every commit. I'm fine with this. The schema could probably stand a once-over from a DBA while we're at it.

Shyam Mani [:fox2mike]

Reporter

Comment 4

•

12 years ago

Ted, Can we get the current schema attached here please? or copy pasted if it's small enough. Ben, Should probably have a call with Sheeri to discuss this. Might need it's own MySQL cluster in scl3.

Justin Wood (:Callek)

Comment 5

•

12 years ago

(In reply to Shyam Mani [:fox2mike] from comment #4) > Ted, > > Can we get the current schema attached here please? or copy pasted if it's > small enough. I'm not ted, but: (from http://hg.mozilla.org/hgcustom/hghooks/file/3b0c66182bb0/mozhghooks/pushlog.py#l40 ) CREATE TABLE IF NOT EXISTS changesets (pushid INTEGER, rev INTEGER, node text); CREATE TABLE IF NOT EXISTS pushlog (id INTEGER PRIMARY KEY AUTOINCREMENT, user TEXT, date INTEGER); CREATE UNIQUE INDEX IF NOT EXISTS changeset_node ON changesets (node); CREATE UNIQUE INDEX IF NOT EXISTS changeset_rev ON changesets (rev); CREATE INDEX IF NOT EXISTS changeset_pushid ON changesets (pushid); CREATE INDEX IF NOT EXISTS pushlog_date ON pushlog (date); CREATE INDEX IF NOT EXISTS pushlog_user ON pushlog (user); with the knowledge that we can happily adjust schema as we migrate off sqlite, as well as the fact that we do need to occassionally "reset" the pushlog (data) entirely (e.g. due to twig resets, or a try reset; as well as new repos) ---- My [not-a-dba] recommendation if we do this, as far as new schema; is to have a separate trees table, which records which tree a push belongs to, store all pushes in a single (partitioned) table, such that we keep the last say month worth of pushes in a fast-access mode, and further back can be slower. store which tree a push belongs to in a new column, paired with new indexes when resetting a tree, assign a new tree index (set tree name in trees table to "" or NULL), and purge old pushes of said tree after-x-time of tree being defunct. (when we know we don't need them anymore).

Sheeri Cabral [:sheeri]

Comment 6

•

12 years ago

How many pushes are there per week/month/time period? What gets stored in the "node" field, that has the TEXT type? What gets stored in the "user" field, that has the TEXT type? When do pushes get defunct? Is is mostly a time-based thing or does it depend on versions? Moving it to a real db is possible. There is a generic in scl3 (right now tbpl is on generic in phx1 so it's not a stretch that it would be on a shared server), and scl3 generic doesn't have a ton on it, so I'm willing to try it there first.

Ben Kero [:bkero]

Comment 7

•

12 years ago

There are dozens of pushes per day, two INSERTs for each commit. Here's an example of the data. sqlite> .schema changesets CREATE TABLE changesets (pushid INTEGER, rev INTEGER, node text); CREATE UNIQUE INDEX changeset_node ON changesets (node); CREATE INDEX changeset_pushid ON changesets (pushid); CREATE UNIQUE INDEX changeset_rev ON changesets (rev); sqlite> .schema pushlog CREATE TABLE pushlog (id INTEGER PRIMARY KEY AUTOINCREMENT, user TEXT, date INTEGER); CREATE INDEX pushlog_date ON pushlog (date); CREATE INDEX pushlog_user ON pushlog (user); sqlite> select * from changesets limit 10; 1|13382|61007906a1f8ad5c303b0815ac4e9821168d3937 1|0|8ba995b74e18334ab3707f27e9eb8f4e37ba3d29 1|1|9b2a99adc05e53cd4010de512f50118594756650 1|2|10cab3c34c28b0746436f8bc7ffc8c47f421ee23 1|3|a00ac31e8ae4fe6cdf5f40a007c1ab36ae01ffae 1|4|e943454a2e49b2860353c9449359c1822cb14827 1|5|331cb67f2a3cb141465e0da88f8cd1ef36e85ffc 1|6|dad02d3ebc7d9e5fdfed17234d31d10e3b1b55ec 1|7|cd100ce4677919334ec2e3ffb57b444aabf81141 1|8|33654b51bca91fab0faed723e281c76bd65896c1 sqlite> select * from pushlog limit 10; 1|bsmedberg@mozilla.com|1206031764 2|jorendorff@mozilla.com|1206553146 3|bsmedberg@mozilla.com|1206641628 4|jorendorff@mozilla.com|1206727429 5|jorendorff@mozilla.com|1207177491 6|jorendorff@mozilla.com|1207607933 7|bsmedberg@mozilla.com|1207691072 8|bsmedberg@mozilla.com|1207714702 9|bsmedberg@mozilla.com|1207846866 10|bsmedberg@mozilla.com|1207850431

Sheeri Cabral [:sheeri]

Comment 8

•

12 years ago

The TEXT data type in MySQL will be vast overkill for those fields, then. It's not a matter of space on disk, but when MySQL is processing those in memory, it becomes trickier. Check out http://www.pythian.com/blog/text-vs-varchar/ Is the changeset node always 40 characters? If so, I'd recommend using CHAR(40). If it's variable, VARCHAR(x) should be fine, like VARCHAR(50) or VARCHAR(100). Also, the e-mail address doesn't need to be TEXT. VARCHAR(50) should probably be enough, unless you have different guidelines for e-mail addresses (the ones above are about 20 characters long). This data is pretty small, I think we can handle the purging. I'm still waiting for the answer to: When do pushes get defunct? Is is mostly a time-based thing or does it depend on versions?

Ben Kero [:bkero]

Comment 9

•

12 years ago

It's a hash, it's always going to be 40 characters. I don't know if we even have a policy about what to accept as storage formats for email addresses (I'm guessing some have UTF-8 encoding). AFAIK the pushes never become defunct. Unless something goes really wrong (like sensitive data is committed) and I have to go tear it out manually. Sometimes branches of bigger repositories like mozilla-central are cloned onto 'twigs'. These twigs can be 'reset' to be a fresh clone of mozilla-central again, although start out with blank pushlogs. One of our oldest (and probably biggest) belongs to mozilla-central: -rw-rw-r-- 1 hg scm_level_3 19M Feb 11 11:12 pushlog2.db

Sheeri Cabral [:sheeri]

Comment 10

•

12 years ago

Hrm, maybe :Callek can address what he meant by pushes going defunct in comment 5?

Nick Thomas [:nthomas] (UTC+12)

Comment 11

•

12 years ago

(In reply to Ben Kero [:bkero] from comment #9) > Sometimes branches of bigger repositories like mozilla-central are cloned > onto 'twigs'. These twigs can be 'reset' to be a fresh clone of > mozilla-central again, although start out with blank pushlogs. This! We must have a very safe way of being able to wipe a pushlog for a branch if we are combining all the branches into one database.

(not currently active) Ted Mielczarek

Comment 12

•

12 years ago

I don't think it's important to be able to save the old data for a branch if we reset it, we don't currently do that. We can simply delete all the pushlog records for that branch and start clean.

(not currently active) Ted Mielczarek

Comment 13

•

12 years ago

Also note that the column types are what they are just because it's sqlite, not for any important reasons.

(not currently active) Ted Mielczarek

Updated

•

12 years ago

Summary: Figure out a caching mechanism for pushlog → Move pushlog to MySQL backend

Justin Wood (:Callek)

Comment 14

•

12 years ago

(In reply to Sheeri Cabral [:sheeri] from comment #10) > Hrm, maybe :Callek can address what he meant by pushes going defunct in > comment 5? Yea, I merely meant defunct as in, "we reset a tree" (try, project-branch, etc.) currently that purging is done by deleting the .sqlite file. If we use a real shared DB it might make sense to do this slightly different, but certainly just deleting the rows will be fine. The ability to easily reset a repo to a clean/purged state is needed. but beyond that the pushlog data doesn't expire.

Shyam Mani [:fox2mike]

Reporter

Comment 15

•

12 years ago

(In reply to Sheeri Cabral [:sheeri] from comment #6) > Moving it to a real db is possible. There is a generic in scl3 (right now > tbpl is on generic in phx1 so it's not a stretch that it would be on a > shared server), and scl3 generic doesn't have a ton on it, so I'm willing to > try it there first. This needs to have uber uptime. Any issues with this will essentially kill hg operations since pushlog is enabled globally. I'm a little paranoid about sharing hardware for this for the above reason, but will go with what the DBAs recommend. More than happy to order hardware to run this too, if needed.

Sheeri Cabral [:sheeri]

Comment 16

•

12 years ago

Callek - thanx for clarifying, I was trying to assess if we'd need some kind of partitioning or defragmentation if we are constantly deleting stuff. -------------------- Shyam - Right now only pootle and graphite_mozilla_org is on generic in scl3, but I'm totally OK with having separate hardware too. This is important, like putting bouncer on its own cluster. Can we reuse some of the old addons in scl3 hardware for this? or is that already claimed? (we had one master and several slaves, like 3-6 slaves) ------------------------- Callek, bkero, whoever: Is there a sense of the "working set" of data? e.g. what "most" queries will be? Usually in a system like this it's something like "the most recent pushes". I'm just trying to get a sense of how large a buffer pool size we'd need.

(not currently active) Ted Mielczarek

Comment 17

•

12 years ago

We should get some access logs off the hg.mo webheads and see what queries get hit most often. I suspect the answer there is "whatever queries TBPL uses", and everything else is noise.

Shyam Mani [:fox2mike]

Reporter

Comment 18

•

12 years ago

Ben, can you help with what ted needs here? Ted/Sheeri : I'll keep an eye, but I'd like Ben to drive the project. Sheeri : we will probably order hardware for this when the time is right (aka before we go to production).

Assignee: server-ops-devservices → bkero

Whiteboard: [2013Q2]

Ben Kero [:bkero]

Updated

•

12 years ago

Blocks: 498641

Axel Hecht [:Pike]

Comment 19

•

12 years ago

Note, one pushlog offender is the pushlog scraper on the l10n dashboard, that's currently pounding the /json-pushes/ api on some 800 repos. Queries looks like json-pushes?startID=2134&endID=2234, i.e., we're getting new pushes, but at max some 200 at a time. We'd love to move that to a single db query, though, at which point that'd collapse to one frequent ping to the central database, if we get access to the db or an API on top of that db.

Axel Hecht [:Pike]

Comment 20

•

12 years ago

Can we accellerate this? It seems like the cleanest approach to make sure we're not running into issues like bugs 842536, 847864, where repos don't record new pushes due to file permissions. The l10n repos in particular for new locales are not as reliable as we'd like them to be, and that makes it hard for us to hit our goals this quarter.

(not currently active) Ted Mielczarek

Updated

•

12 years ago

Depends on: 849343

(not currently active) Ted Mielczarek

Updated

•

12 years ago

Depends on: 849345

(not currently active) Ted Mielczarek

Comment 21

•

12 years ago

I filed what I think are the prerequisites to getting the pushlog code to the point where we could make this switch.

Shyam Mani [:fox2mike]

Reporter

Updated

•

12 years ago

Whiteboard: [2013Q2]

Use revlogs for storing pushlog data 12 years ago Gregory Szorc [:gps] 21.12 KB, patch		Details \| Diff \| Splinter Review
Use revlogs for storing pushlog data 12 years ago Gregory Szorc [:gps] 34.57 KB, patch		Details \| Diff \| Splinter Review
Extension to synchronize pushlog via Mercurial wire protocol 12 years ago Gregory Szorc [:gps] 11.42 KB, patch		Details \| Diff \| Splinter Review