Closed Bug 685669 Opened 14 years ago Closed 13 years ago

[amo] Refresh -dev db and filesystem

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

Platform: All Other
Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: clouserw, Assigned: mpressman)

Details

Please refresh the -dev database. This should be imported from production and go into a differently named database. Once it's done, swap the configs to point to the new one, which will minimize downtime. The filesystem sync has no such fortune and just kinda has to go. Bonus points for not syncing the update_counts and download_counts tables; your call on how to do that. It'll save us hours of import time if you skip them.
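For the stats tables, one low-effort option would be mysqldump's --ignore-table flag combined with -x/--lock-all-tables. A minimal sketch of what that could look like, wrapped in Python purely for illustration (host, credentials, and the database name are placeholders, not the real production values):

import subprocess

db = 'addons_prod'  # placeholder name for the production database

cmd = [
    'mysqldump',
    '--host', 'prod-slave.example.com',        # placeholder host
    '--user', 'backup', '--password=secret',   # placeholder credentials
    '--lock-all-tables',                        # the -x flag: one consistent snapshot
    '--ignore-table=%s.update_counts' % db,
    '--ignore-table=%s.download_counts' % db,
    db,
]

# Stream the dump straight to disk for the later import into the new -dev db.
with open('amo_dev_refresh.sql', 'w') as out:
    subprocess.check_call(cmd, stdout=out)

Note that --ignore-table skips both the schema and the rows for those tables, so empty copies would have to be created separately if -dev needs them to exist at all.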
Summary: Refresh -dev db and filesystem → [amo] Refresh -dev db and filesystem
What's the status of this? I'd like to do another run on the preview site before doing it in production, but since this takes so long that may not be possible. Can you update this without the stats tables and maybe we can do it today?
Severity: normal → major
Assignee: server-ops → mpressman
What's the word on this? The SDK was released yesterday but we haven't been able to upgrade our add-ons because this isn't done yet. -> critical
Severity: major → critical
which dev host are you looking for a refresh on? The host addons-webdev1.db.phx1 is currently up to date and replicating from the AMO master
(In reply to Matt Pressman from comment #3)
> which dev host are you looking for a refresh on? The host
> addons-webdev1.db.phx1 is currently up to date and replicating from the AMO
> master

addons-dev.allizom.org
(In reply to Wil Clouser [:clouserw] from comment #4)
> (In reply to Matt Pressman from comment #3)
> > which dev host are you looking for a refresh on? The host
> > addons-webdev1.db.phx1 is currently up to date and replicating from the AMO
> > master
>
> addons-dev.allizom.org

Er, comment went too soon. That's the host I'm interested in. I don't know what addons-webdev1.db.phx1 is. Is that the host serving addons-dev.allizom.org?
in #webdev fligtar said he needed the host for a demo, so I am prepping the dump, but won't import it until I get the go-ahead.
Demo is over, let's do it
amo refresh for addons_dev db on dev1.db.phx1.mozilla.com is complete
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
The site is currently down. Looks like something in the versions table is missing. How did you filter the tables you didn't load?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I simply dumped all tables except for the update_counts and download_counts tables. There is no direct foreign key constraint on versions to either of those tables. Can you provide me with the query or the debug output that points to the versions table?
Umm, not really. With some debug flags on we can make it log queries, but they'd be coming fast and furious and I don't know how to do that with WSGI. Can you describe how you dumped the tables? If you took it right off a live slave, I suspect it dumped one table (addons?) well before the alphabetically-distant versions and they got out of sync.
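If the dump really did capture addons and versions at different moments, a quick orphan check from the Django shell on -dev would show it. A minimal sketch (the current_version column name on addons is an assumption based on the model code that appears in the later traceback; adjust if the schema differs):

from django.db import connections

c = connections['default'].cursor()
# Find add-ons whose current_version points at a row missing from versions.
c.execute("""
    SELECT a.id, a.current_version
    FROM addons a
    LEFT JOIN versions v ON v.id = a.current_version
    WHERE a.current_version IS NOT NULL AND v.id IS NULL
    LIMIT 20
""")
print(c.fetchall())

A non-empty result would point at missing rows in the dump or import rather than a code regression.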
The site is 100% down so this is blocking all QA as well as upgrading jetpacks to the new sdk which was released 2 days ago. Going to mark as blocker in the morning.
Severity: critical → blocker
I did dump it off a slave, but locked all tables before the dump so as not to get anything out of sync.
Severity: blocker → critical
Since this is a blocker I'll be more verbose: I used the -x (--lock-all-tables) flag with mysqldump to lock all tables across all databases.
So, for next steps - we don't have access to the box this is running on, so debugging is very slow and difficult. If it's not something obvious, we should just do the standard:

- complete dump to a new db
- change the config to point to the new db name
there appears to be a db running on the same host called addons_dev2 ???
(In reply to Matt Pressman from comment #16)
> there appears to be a db running on the same host called addons_dev2 ???

As people without access to these boxes there isn't a lot we can help with here. This is probably a symptom of reloading the db in the past - we toggle back and forth. The settings_local.py files will tell you what db AMO is hitting right now.

Alice: Once this is switched you'll need to change your configs to insert results into the new database.
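Since settings_local.py is what decides which database the site hits, the swap is a one-line change there. A minimal sketch of the relevant block in a Django settings file (engine, host, and credentials here are illustrative, not copied from the real file):

# settings_local.py (illustrative values only)
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'addons_dev2',   # flip between addons_dev and addons_dev2 on each refresh
        'HOST': 'dev1.db.phx1.mozilla.com',
        'USER': 'example_user',
        'PASSWORD': 'example_password',
    },
}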
A full dump including the update_counts and download_counts tables just completed. I will transfer it over and start loading it into addons_dev, as it appears that addons_dev2 does not contain two tables that do exist in addons_dev: addons_premium and blca.
Sounds like addons_dev2 is currently the inactive db, then. Since the site is broken it doesn't matter, but on a normal day the correct action would be to load into the addons_dev2 database and then switch the config.
I'm loading the full dump now into addons_dev
Full database load has completed. This was taken from addons-webdev1.db.phx1.mozilla.com which is an up to date actively replicated slave of the amo master. As far as I know, this host is only used for metrics queries which would otherwise be too much load on the production hosts. If there are any other issues with the data I can dump from a host that is currently out of production because of hardware issues, but is still replicating.
(In reply to Matt Pressman from comment #21)
> Full database load has completed. This was taken from
> addons-webdev1.db.phx1.mozilla.com which is an up to date actively
> replicated slave of the amo master. As far as I know, this host is only used
> for metrics queries which would otherwise be too much load on the production
> hosts. If there are any other issues with the data I can dump from a host
> that is currently out of production because of hardware issues, but is still
> replicating.

https://addons-dev.allizom.org/en-US/firefox/ is still bombing out; I've repurposed bug 687034, which was originally filed about this issue (comment 9). I'm unclear if the new stacktrace is a code or DB issue.
Matt: Can you give me the database dump in a db on cm-webdev01-master01 so I can try my code with it? I import a new db from production every night onto that box (the `remora` db) and I've never had these problems so I'm not really sure what's going on. Alternatively you could use one of the standard dumps from production that are mounted in the /data/backup-drop/ folder on that box. I don't know if there is a difference in how they are dumped.
This is now blocking the builder team as well. We need to get this resolved this morning. How can I help?
Severity: critical → blocker
As Stephen mentioned in comment 22, the traceback getting generated is "DoesNotExist: Version matching query does not exist" and the details are available at bug 687034.
(In reply to krupa raj 82[:krupa] from comment #25)
> As Stephen mentioned in comment 22, the traceback getting generated is
> "DoesNotExist: Version matching query does not exist" and the details are
> available at bug 687034.

Are you saying it's a code problem and not the database refresh causing the traceback?
(In reply to Wil Clouser [:clouserw] from comment #26)
> Are you saying it's a code problem and not the database refresh causing the
> traceback?

Not sure -- still seems DB/schema-related, according to the stacktrace.
(In reply to Stephen Donner [:stephend] from comment #27)
> (In reply to Wil Clouser [:clouserw] from comment #26)
> > Are you saying it's a code problem and not the database refresh causing the
> > traceback?
>
> Not sure -- still seems DB/schema-related, according to the stacktrace.

I agree. Matt: we're all on IRC, please come talk to us if there isn't a clear course of action here.
I'm putting the dump on cm-webdev01-master01 - additionally, whatever file you're using from the /data/backup-drop/ folder would not be from production, since it got moved from sjc to phx. I believe this was part of the reason for creating the addons-webdev1.db.phx1 host.
If this is db/schema related, I can very quickly determine the differences returned based on the queries that are running. Since we have a stack trace, can someone post the SQL around the trace?
(In reply to Matt Pressman from comment #29)
> I'm putting the dump on cm-webdev01-master01 - additionally, whatever file
> you're using from the /data/backup-drop/ folder would not be from production
> since it got moved from sjc to phx. I believe this was part of the reason
> for creating the addons-webdev1.db.phx1 host

They are stale, that's bug 685746. :-/
Would it be of any value to import the most recent dump from cm-webdev01-master01 for the time being? I realize it's out of date, but "up-and-old" might be better than "down-and-new", while we troubleshoot the issue.

FWIW, the "mysqldump -x" flag locks all tables in all databases for the duration of the dump. There's no way such a dump would develop any inconsistencies *during* the dump... the data would have had to be inconsistent when the dump started. The only other possibilities that occur to me are that the dump itself was corrupted, or that the import somehow failed.

If we can dig up any dumps between Aug 24 and today that are useful, it might be a good idea to try a "git bisect"-style attack, and see if we can get *something* to work.
Up-and-old isn't great because it's weeks old at this point and the FS is out of sync. I certainly hope we can dig up dumps between Aug 24 and today - they should be happening every night (see the last comment in bug 685746). I don't know if that will help us or not, since we haven't had this problem in the past. The import for the db you gave me is still running, unfortunately.
I am loading the dump file into a 5.1 instance on cm-webdev01-slave01 right now. Once this is complete, I will let you know and we can test against it.
It took a little under 4 hours for me earlier today. Is yours done now? If so, can you let me know the db name?
The dump is complete. The database name is addons_dev and the user/pass credentials are the same as on dev1.db.phx1.mozilla.com.
I have the code from HEAD running with that db at http://khan.mozilla.org:8008/en-US/firefox/ . I haven't been able to make it fail yet by clicking around. Can you flush -dev's memcache? The broken result may be cached there. Next ideas?
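For reference, one way to flush -dev's memcache is from the Django shell on a web node, assuming the running settings point the cache at the -dev memcached pool (a sketch, not necessarily the exact procedure used here):

# e.g. python manage.py shell from the zamboni checkout on a -dev web node
from django.core.cache import cache

# Drops every key in the configured cache backend, so any stale
# "no current version" results cached by the site go away.
cache.clear()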
I have a minimal set of steps to reproduce this bug now. Below is the output showing the failure case on -dev:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

In [1]: from addons.models import Addon

In [2]: x = Addon.objects.get(pk=2108)

In [3]: x.current_version
---------------------------------------------------------------------------
DoesNotExist                              Traceback (most recent call last)

/data/www/addons-dev.allizom.org/zamboni/<ipython console> in <module>()

/data/www/addons-dev.allizom.org/zamboni/apps/addons/models.py in current_version(self)
    520         if self.type == amo.ADDON_PERSONA:
    521             return
--> 522         if not self._current_version:
    523             self.update_version()
    524         return self._current_version

/data/www/addons-dev.allizom.org/zamboni/vendor/src/django/django/db/models/fields/related.py in __get__(self, instance, instance_type)
    312             db = router.db_for_read(self.field.rel.to, instance=instance)
    313             if getattr(rel_mgr, 'use_for_related_fields', False):
--> 314                 rel_obj = rel_mgr.using(db).get(**params)
    315             else:
    316                 rel_obj = QuerySet(self.field.rel.to).using(db).get(**params)

/data/www/addons-dev.allizom.org/zamboni/vendor/src/django/django/db/models/query.py in get(self, *args, **kwargs)
    347         if not num:
    348             raise self.model.DoesNotExist("%s matching query does not exist."
--> 349                     % self.model._meta.object_name)
    350         raise self.model.MultipleObjectsReturned("get() returned more than one %s -- it returned %s! Lookup parameters were %s"
    351                 % (self.model._meta.object_name, num, kwargs))

DoesNotExist: Version matching query does not exist.

In [4]:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Running those exact same steps on my box using the copy of the -dev database results in success:

In [3]: x.current_version
Out[3]: <Version: 1.2.2>

Can you verify everything is updating correctly (particularly everything in vendor/)? It may be using different libraries. I haven't had the chance to dig into this yet this morning.
The slave is out of sync. Please fix and add to nagios.

In [3]: from django.db import connections

In [4]: c = connections['default'].cursor()

In [5]: c.execute('select * from versions where id=1269487')
Out[5]: 1L

In [6]: c.fetchall()
Out[6]: ((1269487L, 2108L, 6L, u'1.2.2', u'', 2450543L, datetime.datetime(2011, 9, 2, 21, 32, 10), datetime.datetime(2011, 9, 2, 21, 33, 30), 1020200200100L, None, None, 0, 0),)

In [7]: c = connections['slave'].cursor()

In [8]: c.execute('select * from versions where id=1269487')
Out[8]: 0L

In [9]: c.fetchall()
Out[9]: ()

In [10]:
Applications writing to the slave caused replication from the master to stop. That's why the slave didn't match the master - writes were occurring during the initial db load. It's back up and replicating now.
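A minimal sketch of how that state could be confirmed and guarded against from the Django shell (SHOW SLAVE STATUS and the read_only flag are standard MySQL; the 'slave' connection alias matches the session above):

from django.db import connections

c = connections['slave'].cursor()
c.execute('SHOW SLAVE STATUS')
row = c.fetchone()
cols = [col[0] for col in c.description]
status = dict(zip(cols, row)) if row else {}
print('IO=%s SQL=%s lag=%s' % (status.get('Slave_IO_Running'),
                               status.get('Slave_SQL_Running'),
                               status.get('Seconds_Behind_Master')))

# Making the slave read-only stops stray application writes from breaking
# replication again (requires SUPER privilege, so normally a DBA step).
c.execute('SET GLOBAL read_only = 1')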
Can we please set up a nagios check for replication on the dev servers? They are pretty important to the teams that use them.
https://bugzilla.mozilla.org/show_bug.cgi?id=687960 has been created to add nagios checks
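For illustration, a bare-bones Nagios-style replication check could look something like the sketch below; host, credentials, and the lag threshold are placeholders, and the actual check added in bug 687960 may well be a stock plugin instead:

#!/usr/bin/env python
# Nagios plugin convention: exit 0 = OK, 1 = WARNING, 2 = CRITICAL.
import sys
import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(host='dev1.db.phx1.mozilla.com',   # placeholder host
                       user='nagios', passwd='secret',    # placeholder credentials
                       cursorclass=MySQLdb.cursors.DictCursor)
cur = conn.cursor()
cur.execute('SHOW SLAVE STATUS')
status = cur.fetchone()

if not status or status['Slave_IO_Running'] != 'Yes' or status['Slave_SQL_Running'] != 'Yes':
    print('CRITICAL: replication is not running')
    sys.exit(2)

lag = status['Seconds_Behind_Master']
if lag is None or lag > 600:
    print('WARNING: slave is %s seconds behind master' % lag)
    sys.exit(1)

print('OK: slave is %s seconds behind master' % lag)
sys.exit(0)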
The front page is still a traceback. https://addons-dev.allizom.org/services/monitor suggests that all 4 redis boxes are running as masters, whereas 2 are supposed to be slaves. This could be the cause for our caching problems.
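A quick way to confirm what each box thinks it is, using redis-py (the hostnames are placeholders; the /services/monitor page above is surfacing the same INFO fields):

import redis

# Placeholder hostnames - substitute the four -dev redis boxes.
hosts = ['redis1.example', 'redis2.example', 'redis3.example', 'redis4.example']

for host in hosts:
    info = redis.Redis(host=host).info()
    # A correctly configured slave reports role:slave plus its master's address.
    print('%s role=%s master=%s' % (host, info.get('role'), info.get('master_host')))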
(In reply to Wil Clouser [:clouserw] from comment #43)
> The front page is still a traceback.
> https://addons-dev.allizom.org/services/monitor suggests that all 4 redis
> boxes are running as masters, whereas 2 are supposed to be slaves. This
> could be the cause for our caching problems.

Redis problem split off into https://bugzilla.mozilla.org/show_bug.cgi?id=688035
What's the status of this bug?
The site is back up, but I don't remember if it got a new dump or not. I suggest we close this and we'll file a fresh bug the next time we need a dump. Thanks.
Closing, reopen the next time you need a dump
Status: REOPENED → RESOLVED
Closed: 14 years ago → 13 years ago
Resolution: --- → FIXED
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard