Closed Bug 1343328 Opened 7 years ago Closed 5 months ago

Switch to bigint to protect against auto_increment overflow

Categories

(Tree Management :: Perfherder, defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [fxp])

Attachments

(4 files)

Not urgent yet at all (presumably we're still a few years away from any issues), but I'm filing this now so I don't forget, since we no longer have DBAs doing this kind of monitoring for us.

The performance_datum table uses a signed int for the auto_increment primary key (which has a max value of 2147483647), and has currently used 10.7% of the available range.

Calculated using the SQL query here:
http://code.openark.org/blog/mysql/checking-for-auto_increment-capacity-with-single-query
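For convenience, a trimmed-down version of that check for just this table (a sketch; the linked query handles any table and column type generically):

    -- How much of the signed-int range performance_datum's id has consumed
    SELECT t.AUTO_INCREMENT,
           t.AUTO_INCREMENT / 2147483647 AS used_ratio
    FROM information_schema.TABLES AS t
    WHERE t.TABLE_SCHEMA = 'treeherder'
      AND t.TABLE_NAME = 'performance_datum';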

There is no data expiry currently performed on the perfherder tables.
Priority: P5 → P3
(In reply to Ed Morley [:emorley] from comment #0)
> The performance_datum table uses a signed int for the auto_increment primary
> key (which has a max value of 2147483647), and has currently used 10.7% of
> the available range.

Now up to 17.4%.
Priority: P3 → P1

Update: now up to 37.6%.

See Also: → 1346567

Using bigint will increase the size of the performance_datum table, so fixing bug 1346567 becomes an even higher priority.

Priority: P1 → P2

Ionut, is this going to happen before July or in H2? And do you need anything from the Treeherder team?

Assignee: nobody → igoldan

(In reply to Karl Thiessen [:kthiessen, he/him] from comment #4)

Ionut, is this going to happen before July or in H2? And do you need anything from the Treeherder team?

This is going to happen before July, by using bigint. We've now depleted 38.8% of the available ids.

igoldan, when do you think you will be able to get to this?

Flags: needinfo?(igoldan)
Flags: needinfo?(igoldan)

(In reply to Armen [:armenzg] from comment #6)

igoldan, when do you think you will be able to get to this?

By Aug 14 I should have the small alter patch.
If that doesn't happen, it will be because I've uncovered some unexpected technical issues and can't proceed further.

Assuming all goes smoothly, we will need to plan a weekend deploy, as I'm pretty sure the schema migration will consume quite a bit of time.

Meanwhile, we've depleted 41.28% of available ids.

Priority: P2 → P1

@camd: I don't think I have enough experience to be sure of my comment here. Could you please make corrections as needed?

What happens once we reach the end of the id range? Does it start from 0 again?

3 years ago we were at 10.7% and we're now at 41.28%. It might take another year or two to fill up. By this I mean we have some time to fix this properly rather than rush it; I was under the impression that we were on a shorter timeline.

How long would a migration with your PR take?

As I started reading about this topic, I think we might want to take a step back on the solution:
https://dba.stackexchange.com/questions/95740/alter-primary-key-column-from-int-to-bigint-in-production-mysql-5-6-19a
https://www.red-gate.com/simple-talk/sql/database-administration/changing-data-types-large-tables-int-bigint-conundrum/
https://brianschiller.com/blog/2017/03/23/django-migrate-to-bigint

IIUC, switching the field from int to bigint during a migration could take a long time. Does anyone know what happens during that time? Does that specific table become unavailable? Would inserts and reads get queued? Would that mean that the Perfherder UI backend calls would not succeed? Would log parsing operations that attempt writes to that table fail? Or would they request the write operation (which the DB would queue) and move on? (Meaning log parsing would not be affected, but MySQL perf_datum inserts would be queued.)

What is our rollback plan if a migration fails?

FYI for now, we might still be able to restore a snapshot to stage or dev/prototype (1TB DBs) since production (2TB DB) will have more than 1TB data in the next few months. Alternatively, we can create a temporary snapshot from production to test against instead of restoring a backup to dev/stage.

Some notes on MySQL migrations:
https://docs.djangoproject.com/en/2.2/topics/migrations/#mysql

In addition, MySQL will fully rewrite tables for almost every schema operation and generally takes a time proportional to the number of rows in the table to add or remove columns

Assuming we do the migration over a weekend.
Is there a way we can stop writes to the performance table while a migration happens?
Maybe stop log parsing while the migration completes? (In Heroku we can disable log parsing by reducing the number of nodes; we would let the Celery queues grow and resume it once the migration ends).

Side note, we can probably delete this:

    # the following properties are obsolete and should be removed at some point
    ds_job_id = models.PositiveIntegerField(db_column="ds_job_id", null=True)
    result_set_id = models.PositiveIntegerField(null=True)
Flags: needinfo?(cdawson)

I did a bunch of this kind of thing when doing the migration from datasource -> django orm. In general the pattern that works the best with MySQL is something like this:

  1. Create a copy of the existing table, perform the migration on it
  2. Copy over any rows added since then to the new table
  3. Temporarily turn off data ingestion
  4. Copy any last rows over that were added in that short interval
  5. Swap the copy with the current one
  6. Deploy the new code
  7. Turn data ingestion back on

This ensures that the window where things are offline is relatively short. You can find lots of hand-wringing and gnashing of teeth about this online; here's an example:

https://dba.stackexchange.com/questions/3760/alter-table-on-a-large-table-with-an-indexed-column
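A minimal MySQL sketch of the copy-and-swap pattern outlined above (all names are illustrative, and the real thing has to carry over every column, index, and foreign key):

    -- 1-2. Create a structurally identical copy with the wider id, then backfill it
    CREATE TABLE performance_datum_new LIKE performance_datum;
    ALTER TABLE performance_datum_new
      MODIFY id BIGINT NOT NULL AUTO_INCREMENT;
    INSERT INTO performance_datum_new
      SELECT * FROM performance_datum;

    -- 3-5. With ingestion paused and the last stragglers copied, swap atomically
    RENAME TABLE performance_datum TO performance_datum_old,
                 performance_datum_new TO performance_datum;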

This is one of the main reasons I would never recommend anyone use MySQL for a new project; table schema migrations in Postgres are usually much less painful.

Deleting the unused columns in performance_datum, as you suggest, is definitely a good idea!

(In reply to Armen [:armenzg] from comment #11)

There are a lot of great questions/points here.

I think the first question we should answer is (as you say): do we need to do this now? Could we wait a year? Perhaps we could migrate to Postgres in the meantime? As was said, these things are less disruptive on Postgres.

Will's approach sounds like a good one. Perhaps there's a DBA that could help us out with this? I don't feel like my DB skills are quite strong enough to drive it.

You're right that if we go forward with this migration, turning off the log parsing workers would be a good way to go. Then it would pause action on that table (IIUC). It would still mean Treeherder is non-functional for that time period, since we wouldn't get any bug suggestions for new jobs. But it would be a good approach to prevent something overloading. Our log parsing queues would get huge and all, but that's ok. We would just need to do it at the right time (perhaps a weekend or off-hours), and notify our stakeholders of the planned outage.

I suppose the other interesting question is: How long will this migration actually take? That's easy to find out, I think. We duplicate the DB from prod. Set prototype to point to that snapshot and give it a try.

Flags: needinfo?(cdawson)

A possible DBA-type person to consult about this would be :bobm, in Habib's CloudOps group. Bob's had experience with both MySQL 5 and MariaDB 10 doing ALTER TABLE runs on tables with tens of millions of rows, as part of his work on Persona, Firefox Sync, and Firefox Accounts.

(In reply to Karl Thiessen [:kthiessen, he/him] from comment #14)

A possible DBA-type person to consult about this would be :bobm, in Habib's CloudOps group. Bob's had experience with both MySQL 5 and MariaDB 10 doing ALTER TABLE runs on tables with tens of millions of rows, as part of his work on Persona, Firefox Sync, and Firefox Accounts.

I think we had an advantage in Sync by using MariaDB with TokuDB, where schema changes were able to be done online. We've used a couple of tools in the past to make otherwise-offline migrations online, which you might want to test on a test DB.

pt-online-schema-change needs some extra settings to work on RDS: https://www.percona.com/blog/2016/07/01/pt-online-schema-change-amazon-rds/
https://github.com/github/gh-ost/blob/master/doc/rds.md

I think you really need to test a regular ALTER TABLE on a DB clone to see how much time you're looking at before speculating any further on ways to speed it up or do it without downtime.

Status: NEW → ASSIGNED

Unassigning myself from this ticket, as I no longer have the cycles to work on it.

Assignee: igoldan → nobody
Status: ASSIGNED → NEW
Priority: P1 → P3

Update: now at 45.28% on production.

Severity: normal → critical

Update: now at 49% on production.

Joel, we need to revisit this ticket, as it's basically a ticking time bomb.
If this somehow slips under the radar & the available ids are depleted, it'll basically freeze the perf sheriffs' monitoring abilities and potentially lose all investigation perf data.

Flags: needinfo?(jmaher)

I want to make sure we are not ingesting data we are not using and we are only storing data we care about (i.e. do we need a full year of data, will 6 months be acceptable, old tests/frameworks are deleted, etc.)

:igoldan, can you drive confirming the data we have is good so we have data we care about to migrate?

Flags: needinfo?(jmaher) → needinfo?(igoldan)
See Also: → 1599859

The first step is to alter Treeherder/Django so it provides the record ID, effectively overriding MySQL's auto increment. This would allow us to go one of a few ways:

  1. Alter table with the new bigint column for ID (keeping the old), resting on MySQL's new zero-downtime column adding feature
  2. Tell Django to loop around to zero, using the same ID column.
  3. Tell MySQL the current column is UNSIGNED, and give us a few more years.

There may be a way to reset MySQL's auto increment counter back to 1:

ALTER TABLE tablename AUTO_INCREMENT = 1
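For option 3 above, a minimal sketch of the unsigned change (illustrative; on InnoDB this most likely still rewrites the whole table):

    -- Doubles the usable id range of the existing column without moving to bigint
    ALTER TABLE performance_datum
      MODIFY id INT UNSIGNED NOT NULL AUTO_INCREMENT;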

(In reply to Joel Maher ( :jmaher ) (UTC-4) (back Feb 19th) from comment #20)

I want to make sure we are not ingesting data we are not using and we are only storing data we care about (i.e. do we need a full year of data, will 6 months be acceptable, old tests/frameworks are deleted, etc.)

:igoldan, can you drive confirming the data we have is good so we have data we care about to migrate?

We are planning 2020/Q2 for cleaning up the data & data producers, especially those which are frequently generating invalids & reassigns.

For example, one of the big & questionable data producers we have is platform_microbench, but our sporadic checks indicate there may be value in it now. We could do a very simple preliminary check to confirm it's indeed useful, by monitoring it for 2 weeks, maybe during this quarter?
This would only confirm that this specific data should be kept. If it turns out it shouldn't be kept, we'd still remove it in the next quarter after a thorough evaluation.

Dave, could you provide more details about this undertaking? Not for this specific example, but for the overall situation.

Flags: needinfo?(igoldan) → needinfo?(dave.hunt)

(In reply to Ionuț Goldan [:igoldan] from comment #23)

Dave, could you provide more details about this undertaking? Not for this specific example, but for the overall situation.

We discussed establishing a way to measure whether a test is considered suitable for sheriffing. There are many metrics we could use to evaluate this, such as deviant noise (based on Kyle's measure noise work), the number of alerts during a fixed evaluation period, and the resolution of these alerts (such as % invalid). Once we have decided on our method, we can try it out with our existing tests to see if there are any we should revisit.

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #20)

I want to make sure we are not ingesting data we are not using and we are only storing data we care about (i.e. do we need a full year of data, will 6 months be acceptable, old tests/frameworks are deleted, etc.)

Do we expect that we'll exceed limits by keeping 12 months of data? If so, could we introduce test rankings*, so that tests we care about and alert on would be rank 1 and kept for 12 months, and all other tests would be rank 2 and kept for only 6 months? We could even have a third rank that is kept for a matter of weeks, which could be used for anything experimental.

* I'm avoiding the term 'tiers' as it's used elsewhere, but this is effectively the same thing. We could potentially use the same term but it could be confusing. Perhaps it would be better still to have a category/label such as critical/experimental.

Flags: needinfo?(dave.hunt)

I am on board with the different ranks proposed. Try server results should be rank 3 (6 weeks). Anything that isn't sheriffed should be considered for removal, or at the very least rank 2; maybe the only exception is subtests. Deprecated tests/configs could also fall into rank 2. One example is when we upgrade hardware or a platform and the signatures change: do we need them all? Possibly.

Depends on: 1617871

The data is large and approximately in the order it was ingested (in ID order). Being particular about what gets deleted will take time for the machine to find the rows, will be write-intensive (many pages of data must be swapped to execute the manoeuvre), will leave holes in the data pages, and will not save much space. The performance_datum table uses an index on push_date to identify the records to delete, leaving some holes. The job table is deleted in id order to ensure fast and hole-free table "truncation".

Any delete strategy (or "ranking") has no impact on the auto-increment overflow problem: the same number of records churn through the performance_datum table; we are just keeping them for less time.

See Also: → 1613448
Assignee: nobody → klahnakoski

This PR (https://github.com/mozilla/treeherder/pull/6364) is an attempt to avoid a migration: It explicitly sets the PerformanceDatum.id to a unique value across processes and threads. The major drawback is it requires 'isolation_level': 'serializable', which may slow down all transactions.

A proper data migration, as :wlach suggests, can be done in a stored procedure. This procedure would batch-copy the old table to the new one, and when done swap the table names. The problem is Django's schema and the MySQL schema will be different, and I am not certain how to synchronize the two. Maybe it is as simple as updating the django_migrations table, and bouncing all clients?

(In reply to Kyle Lahnakoski [:ekyle] from comment #29)

A proper data migration, as :wlach suggests, can be done in a stored procedure. This procedure would batch-copy the old table to the new one, and when done swap the table names. The problem is Django's schema and the MySQL schema will be different, and I am not certain how to synchronize the two. Maybe it is as simple as updating the django_migrations table, and bouncing all clients?

You can manually adjust the django_migrations table if the migration files in the code base have gotten out of sync with that table. But ultimately, any change that alters a table needs to be recorded via the migration files (so an alter column to bigint). I'm wondering if there's a "django"-esque way of doing this. Dealing with out-of-sync migrations is a pain. :)

Not sure what you mean by bouncing all clients?

I think this is the kind of thing we'd do incrementally anyways. Try it out on prototype after doing as much as you can locally, get all of the fiddly details right, and then do it on stage, and finally prod.

I don't know how :camd and :armenzg feel about it, but I think prototype could probably be tested without a scheduled down time (maybe at least giving people a heads up the day before) since we can restore the database from the previous snapshot (this one we can do without dev services involved, I believe).

A few naive questions.

What's the worst that could happen on a table of that size if you just change the field id = models.BigAutoField(primary_key=True) and run the makemigrations/migrate commands? Is the big unknown how long it would take?

If we stop data ingestion entirely, that would mean not processing tasks temporarily (which kick off the log parsing that performance data is derived from) and then as soon as we turned it back on we'd still have those tasks. I think the only caveat there is that the pulse guardian queue limits messages to 20,000.

I guess as long as I'm getting emails about this, I may as well give my two cents. :) One other resource you can look at is my old guide to doing these types of migrations, which unfortunately never got merged (this is probably my fault):

https://github.com/mozilla/treeherder/pull/2058/files

(In reply to Sarah Clements [:sclements] from comment #30)

You can manually adjust the django_migrations table if the migration files in the code base have gotten out of sync with that table. But ultimately, any change that alters a table needs to be recorded via the migration files (so an alter column to bigint). I'm wondering if there's a "django"-esque way of doing this. Dealing with out-of-sync migrations is a pain. :)

Correct, this is how I would do it: perform the migration manually, then update the django migrations afterwards. It's not so bad.

I think the "django"-esque way would be to use Postgres instead of mySQL where these types of migrations aren't so big a deal. :) Sadly I suspect that's not a realistic option...

I think this is the kind of thing we'd do incrementally anyways. Try it out on prototype after doing as much as you can locally, get all of the fiddly details right, and then do it on stage, and finally prod.

Yup, this is what I did back in the day.

(In reply to Sarah Clements [:sclements] from comment #31)

A few naive questions.

What's the worst that could happen on a table of that size if you just change the field id = models.BigAutoField(primary_key=True) and run the makemigrations/migrate commands? Is the big unknown how long it would take?

This will take a very long time, as it will essentially need to rewrite the table behind the scenes. While this is happening, inserts to the performance datum table will be paused, which could cause all sorts of weird and unpleasant side effects if data ingestion is otherwise continuing.
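For reference, on MySQL that model change should boil down to roughly this single statement (worth confirming locally with manage.py sqlmigrate; exact output varies by Django version):

    -- Approximately what Django's AlterField to BigAutoField emits on MySQL
    ALTER TABLE performance_datum
      MODIFY id BIGINT AUTO_INCREMENT NOT NULL;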

If we stop data ingestion entirely, that would mean not processing tasks temporarily (which kick off the log parsing that performance data is derived from) and then as soon as we turned it back on we'd still have those tasks. I think the only caveat there is that the pulse guardian queue limits messages to 20,000.

Yup! This should be ok to do unless things have changed a lot from when I used to work on Treeherder. You will probably want to do this at a less busy time of day, if at all possible (I think I tried to do early EST, long before the Pacific people had started working).

heh, thanks Will.

(In reply to Sarah Clements [:sclements] from comment #30)

You can manually adjust the django_migrations table if the migration files in the code base have gotten out of sync with that table. But ultimately, any change that alters a table needs to be recorded via the migration files (so an alter column to bigint). I'm wondering if there's a "django"-esque way of doing this. Dealing with out-of-sync migrations is a pain. :)

Correct, this is how I would do it: perform the migration manually, then update the django migrations afterwards. It's not so bad.

When you say "perform the migration manually", are you referring to what Kyle is saying in comment 29, followed by a django migration (for that alter column), resolving any differences in the django_migrations table if needed?

(In reply to Sarah Clements [:sclements] from comment #33)

Correct, this is how I would do it: perform the migration manually, then update the django migrations afterwards. It's not so bad.

When you say "perform the migration manually", are you referring to what Kyle is saying in comment 29, followed by a django migration (for that alter column), resolving any differences in the django_migrations table if needed?

Similar to what Kyle is suggesting, but I didn't bother with the stored procedure part.

For the django migrations, I just meant modifying the django_migrations table directly. There are some explicit instructions on how to do that in the PR I linked to:

https://github.com/mozilla/treeherder/pull/2058/files#diff-11610aa3a9d6b0f78a749d7d89942d2fR41
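In SQL terms, that bookkeeping is just a row in django_migrations (the app label and migration name below are illustrative, not the real file names):

    -- Record the hand-applied migration so Django considers it applied
    INSERT INTO django_migrations (app, name, applied)
    VALUES ('perf', '00xx_performance_datum_bigint', NOW());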

Update: now at 57% on production.

Update: now at 60% on production.

Just for reference, I've found another solution on SO.
It manually updates the ids, starting from an initial counter, and assumes that the foreign keys have ON UPDATE CASCADE set.
It then resets the AUTO_INCREMENT value to this last counter + 1.

Pros:

  • 3 lines of easy raw SQL
  • handles foreign key relations

Cons:

  • would take a large amount of time (maybe several hours)
  • during that time, no ingestion or background workers should touch the affected tables
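The rough shape of those three statements (a sketch only; it assumes ON UPDATE CASCADE on the foreign keys and exclusive access to the table for the whole run):

    SET @counter = 0;
    -- Renumber the ids compactly from 1 upward; FKs follow via ON UPDATE CASCADE
    UPDATE performance_datum SET id = (@counter := @counter + 1) ORDER BY id;
    -- InnoDB clamps this back up to MAX(id) + 1, i.e. the new counter value
    ALTER TABLE performance_datum AUTO_INCREMENT = 1;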

We're starting to work on our roadmap so I'll flag it as something that needs to be addressed this year, preferably by end of H1.

Assignee: klahnakoski → nobody
Summary: Add data expiration for performance_datum or switch to bigint to protect against auto_increment overflow → Switch to bigint to protect against auto_increment overflow

Chris, can you take a peek at the performance_datum table (just prod is fine) and see where we're at with the ints remaining in the allocated range?

Flags: needinfo?(cvalaas)
| treeherder      | performance_datum               | id          | int       | int(11)     |           0 |          2147483647 |     1476139398 |               0.6874 |

So we're at 68.7%

Flags: needinfo?(cvalaas)

Hello,
I started working on this issue, and something bothered me.
I followed @ekyle's and @wlach's advice, and worked on a migration based on this approach: https://github.com/mozilla/treeherder/pull/2058/files#diff-11610aa3a9d6b0f78a749d7d89942d2fR41.

I generated 10M fake PerformanceDatum entries, started the MySQL default container and tried to copy data to a new table with a bigint PK.
It seems that copying a large amount of data is much slower than copying small amounts (running INSERT INTO performance_datum_copy SELECT * FROM performance_datum).

| Rows | Time             | Estimated for 1.5B rows (full table) |
| ---- | ---------------- | ------------------------------------ |
| 100K | 1.95 sec         | 8.1 h                                |
| 200K | 3.36 sec         | 7.0 h                                |
| 1M   | 22.01 sec        | 9.2 h                                |
| 2M   | 3 min 2.42 sec   | 37.5 h                               |
| 10M  | 59 min 40.08 sec | 150 h                                |

I am afraid the full migration will take a tremendous amount of time.
These are stats from my machine though; do you know if this is usual behavior for MySQL?
Should I perform the copy in chunks, in your opinion?

I wonder if there is data we can purge first to reduce the total size; still, it might take a few days. I would prefer to break this up into smaller transactions that take <4 hours each.

:kimberlythegeek, did you find out if there were data sets/frameworks that could be removed?

Flags: needinfo?(ksereduck)

(In reply to rigal from comment #41)

Hello,
I started working on this issue, and something bothered me.
I followed @ekyle's and @wlach's advice, and worked on a migration based on this approach: https://github.com/mozilla/treeherder/pull/2058/files#diff-11610aa3a9d6b0f78a749d7d89942d2fR41.

I generated 10M fake PerformanceDatum entries, started the MySQL default container and tried to copy data to a new table with a bigint PK.
It seems that copying a large amount of data is much slower than copying small amounts (running INSERT INTO performance_datum_copy SELECT * FROM performance_datum).

I wouldn't be wedded to the INSERT INTO performance_datum_copy SELECT * FROM performance_datum approach. Dumping performance_datum into a CSV (via mysqldump) then recreating the table from that may well be faster. You could also try recreating the table without indexes and then adding them at the end (I can't remember if that part's expensive).
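A rough sketch of the dump-and-reload route (illustrative only; SELECT ... INTO OUTFILE writes on the database host's filesystem, so this needs adapting for RDS):

    -- Export the existing rows to CSV on the database host
    SELECT * FROM performance_datum
      INTO OUTFILE '/tmp/performance_datum.csv'
      FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
      LINES TERMINATED BY '\n';

    -- Bulk-load them into the new bigint-keyed table (indexes can be added afterwards)
    LOAD DATA INFILE '/tmp/performance_datum.csv'
      INTO TABLE performance_datum_new
      FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
      LINES TERMINATED BY '\n';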

This really isn't that much data when it comes down to it. There's gotta be a way to make this part reasonably fast. :)

A final point is that things are definitely going to be a lot slower on your local machine than they are in production.

In my opinion, it would be easier/safer to split the procedure into stages:

  • Clean possible retained data
  • Create the new table (with the bigint PK) in a django migration
  • Run a command that, step by step, fills the new table with existing rows (error recovery would be easy to handle with the autoincrement). This could (but hopefully won't) take a while.
  • Swap the tables in a final migration (that sync/locks the tables).

I would like to benchmark the CSV approach, and probably CREATE TABLE AS too for reference.
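A sketch of what the chunked fill could look like (table name and batch size are assumptions; the command described above would loop over this, advancing the lower bound until it passes MAX(id) on the source table):

    -- Copy one id range per pass; a failed batch can simply be re-run
    SET @lo = 0;
    SET @batch = 100000;
    INSERT INTO performance_datum_new
      SELECT * FROM performance_datum
      WHERE id > @lo AND id <= @lo + @batch;
    SET @lo = @lo + @batch;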

:jmaher Hey, sorry for the delay; I was sick and then out of town. As such, I haven't gotten started on this yet, but I will begin looking into it next week. Beatrice has offered to help as well. Leaving the needinfo so this is easy to find later.

Flags: needinfo?(ksereduck)

annnnnd I forgot to uncheck "clear need info" -_-

Flags: needinfo?(ksereduck)

I just checked the related table perf_multicommitdatum, which is the only one that has a foreign key to performance_datum. This table is only used to temporarily flag a performance datum element, and the feature does not seem to be used at the moment.

It's currently empty on both production and staging, so we can take a simple approach for that table:

  • the migration will check if the table is empty and fail otherwise
  • the migration will drop the table, then recreate a new one with the bigint PK/FK (it only has the one field)
  • finally the foreign key constraint will be restored once the performance_datum table has been migrated
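A rough sketch of that approach for perf_multicommitdatum (column and constraint names are assumptions, not the actual schema):

    -- Only safe because the table is empty; the preceding check must enforce that
    DROP TABLE perf_multicommitdatum;
    CREATE TABLE perf_multicommitdatum (
      perf_datum_id BIGINT NOT NULL PRIMARY KEY
    );

    -- Added back only once performance_datum itself has the bigint id
    ALTER TABLE perf_multicommitdatum
      ADD CONSTRAINT fk_multicommit_perf_datum
      FOREIGN KEY (perf_datum_id) REFERENCES performance_datum (id);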

Per our discussion at the performance work week in Toronto last week, we decided that storing replicate data is easily justifiable, and we have no plans to remove frameworks or shorten the data expiry time. Removing needinfo.

Flags: needinfo?(ksereduck)
No longer blocks: 1770152
Whiteboard: [fxp]
Status: NEW → RESOLVED
Closed: 5 months ago
Resolution: --- → FIXED