Closed Bug 685669 Opened 14 years ago Closed 13 years ago

[amo] Refresh -dev db and filesystem

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

Platform: All Other
Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: clouserw, Assigned: mpressman)

Details

Please refresh the -dev database. This should be imported from production and go into a differently named database. Once it's done, swap the configs to point to the new one, which will minimize downtime. The filesystem sync has no such fortune and just kinda has to go. Bonus points for not syncing the update_counts and download_counts tables; your call on how to do that. It'll save us hours of import time if you skip them.
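For the stats tables, one low-effort option would be mysqldump's --ignore-table flag combined with -x/--lock-all-tables. A minimal sketch of what that could look like, wrapped in Python purely for illustration (host, credentials, and the database name are placeholders, not the real production values):

import subprocess

db = 'addons_prod'  # placeholder name for the production database

cmd = [
    'mysqldump',
    '--host', 'prod-slave.example.com',        # placeholder host
    '--user', 'backup', '--password=secret',   # placeholder credentials
    '--lock-all-tables',                        # the -x flag: one consistent snapshot
    '--ignore-table=%s.update_counts' % db,
    '--ignore-table=%s.download_counts' % db,
    db,
]

# Stream the dump straight to disk for the later import into the new -dev db.
with open('amo_dev_refresh.sql', 'w') as out:
    subprocess.check_call(cmd, stdout=out)

Note that --ignore-table skips both the schema and the rows for those tables, so empty copies would have to be created separately if -dev needs them to exist at all.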
Summary: Refresh -dev db and filesystem → [amo] Refresh -dev db and filesystem
What's the status of this? I'd like to do another run on the preview site before doing it in production, but since this takes so long that may not be possible. Can you update this without the stats tables and maybe we can do it today?
Severity: normal → major
Assignee: server-ops → mpressman
What's the word on this? The SDK was released yesterday but we haven't been able to upgrade our add-ons because this isn't done yet. -> critical
Severity: major → critical
which dev host are you looking for a refresh on? The host addons-webdev1.db.phx1 is currently up to date and replicating from the AMO master
(In reply to Matt Pressman from comment #3)
> which dev host are you looking for a refresh on? The host
> addons-webdev1.db.phx1 is currently up to date and replicating from the AMO
> master

addons-dev.allizom.org
(In reply to Wil Clouser [:clouserw] from comment #4)
> (In reply to Matt Pressman from comment #3)
> > which dev host are you looking for a refresh on? The host
> > addons-webdev1.db.phx1 is currently up to date and replicating from the AMO
> > master
>
> addons-dev.allizom.org

Er, comment went too soon. That's the host I'm interested in. I don't know what addons-webdev1.db.phx1 is. Is that the host serving addons-dev.allizom.org?
in #webdev fligtar said he needed the host for a demo, so I am prepping the dump, but won't import it until I get the go-ahead.
Demo is over, let's do it
amo refresh for addons_dev db on dev1.db.phx1.mozilla.com is complete
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
The site is currently down. Looks like something in the versions table is missing. How did you filter the tables you didn't load?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I simply dumped all tables except for the update_counts and download_counts tables. There is no direct foreign key constraint on versions to either of those tables. Can you provide me with the query or the debug output that points to the versions table?
Umm, not really. With some debug flags on we can make it log queries, but they'd be coming fast and furious and I don't know how to do that with WSGI. Can you describe how you dumped the tables? If you took it right off a live slave, I suspect it dumped one table (addons?) well before the alphabetically-distant versions and they got out of sync.
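If the dump really did capture addons and versions at different moments, a quick orphan check from the Django shell on -dev would show it. A minimal sketch (the current_version column name on addons is an assumption based on the model code that appears in the later traceback; adjust if the schema differs):

from django.db import connections

c = connections['default'].cursor()
# Find add-ons whose current_version points at a row missing from versions.
c.execute("""
    SELECT a.id, a.current_version
    FROM addons a
    LEFT JOIN versions v ON v.id = a.current_version
    WHERE a.current_version IS NOT NULL AND v.id IS NULL
    LIMIT 20
""")
print(c.fetchall())

A non-empty result would point at missing rows in the dump or import rather than a code regression.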
The site is 100% down so this is blocking all QA as well as upgrading jetpacks to the new sdk which was released 2 days ago. Going to mark as blocker in the morning.
Severity: critical → blocker
I did dump it off a slave, but locked all tables before the dump so as not to get anything out of sync.
Severity: blocker → critical
Since this is a blocker I'll be more verbose: I used the -x (--lock-all-tables) flag with mysqldump to lock all tables across all databases.
So, for next steps - we don't have access to the box this is running on, so debugging is very slow and difficult. If it's not something obvious, we should just do the standard:

- complete dump to a new db
- change the config to point to the new db name
there appears to be a db running on the same host called addons_dev2 ???
(In reply to Matt Pressman from comment #16)
> there appears to be a db running on the same host called addons_dev2 ???

As people without access to these boxes there isn't a lot we can help with here. This is probably a symptom of reloading the db in the past - we toggle back and forth. The settings_local.py files will tell you what db AMO is hitting right now.

Alice: Once this is switched you'll need to change your configs to insert results into the new database.
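Since settings_local.py is what decides which database the site hits, the swap is a one-line change there. A minimal sketch of the relevant block in a Django settings file (engine, host, and credentials here are illustrative, not copied from the real file):

# settings_local.py (illustrative values only)
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'addons_dev2',   # flip between addons_dev and addons_dev2 on each refresh
        'HOST': 'dev1.db.phx1.mozilla.com',
        'USER': 'example_user',
        'PASSWORD': 'example_password',
    },
}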
A full dump including the update_counts and download_counts tables just completed. I will transfer it over and start loading it into addons_dev, as it appears that addons_dev2 does not contain two tables that do exist in addons_dev: addons_premium and blca.
Sounds like addons_dev2 is currently the inactive db, then. Since the site is broken it doesn't matter, but on a normal day the correct action would be to load into the addons_dev2 database and then switch the config.
I'm loading the full dump now into addons_dev
Full database load has completed. This was taken from addons-webdev1.db.phx1.mozilla.com which is an up to date actively replicated slave of the amo master. As far as I know, this host is only used for metrics queries which would otherwise be too much load on the production hosts. If there are any other issues with the data I can dump from a host that is currently out of production because of hardware issues, but is still replicating.
(In reply to Matt Pressman from comment #21)
> Full database load has completed. This was taken from
> addons-webdev1.db.phx1.mozilla.com which is an up to date actively
> replicated slave of the amo master. As far as I know, this host is only used
> for metrics queries which would otherwise be too much load on the production
> hosts. If there are any other issues with the data I can dump from a host
> that is currently out of production because of hardware issues, but is still
> replicating.

https://addons-dev.allizom.org/en-US/firefox/ is still bombing out; I've repurposed bug 687034, which was originally filed about this issue (comment 9). I'm unclear if the new stacktrace is a code or DB issue.
Matt: Can you give me the database dump in a db on cm-webdev01-master01 so I can try my code with it? I import a new db from production every night onto that box (the `remora` db) and I've never had these problems so I'm not really sure what's going on. Alternatively you could use one of the standard dumps from production that are mounted in the /data/backup-drop/ folder on that box. I don't know if there is a difference in how they are dumped.
This is now blocking the builder team as well. We need to get this resolved this morning. How can I help?
Severity: critical → blocker
As Stephen mentioned in comment 22, the traceback getting generated is "DoesNotExist: Version matching query does not exist" and the details are available at bug 687034.
(In reply to krupa raj 82[:krupa] from comment #25)
> As Stephen mentioned in comment 22, the traceback getting generated is
> "DoesNotExist: Version matching query does not exist" and the details are
> available at bug 687034.

Are you saying it's a code problem and not the database refresh causing the traceback?
(In reply to Wil Clouser [:clouserw] from comment #26)
> Are you saying it's a code problem and not the database refresh causing the
> traceback?

Not sure -- still seems DB/schema-related, according to the stacktrace.
(In reply to Stephen Donner [:stephend] from comment #27)
> (In reply to Wil Clouser [:clouserw] from comment #26)
> > Are you saying it's a code problem and not the database refresh causing the
> > traceback?
>
> Not sure -- still seems DB/schema-related, according to the stacktrace.

I agree. Matt: we're all on IRC, please come talk to us if there isn't a clear course of action here.
I'm putting the dump on cm-webdev01-master01 - additionally, whatever file you're using from the /data/backup-drop/ folder would not be from production, since it got moved from sjc to phx. I believe this was part of the reason for creating the addons-webdev1.db.phx1 host.
If this is db/schema related, I can very quickly determine the differences returned based on the queries that are running. Since we have a stack trace, can someone post the SQL around the trace?
(In reply to Matt Pressman from comment #29)
> I'm putting the dump on cm-webdev01-master01 - additionally, whatever file
> you're using from the /data/backup-drop/ folder would not be from production
> since it got moved from sjc to phx. I believe this was part of the reason
> for creating the addons-webdev1.db.phx1 host

They are stale, that's bug 685746. :-/
Would it be of any value to import the most recent dump from cm-webdev01-master01 for the time being? I realize it's out of date, but "up-and-old" might be better than "down-and-new", while we troubleshoot the issue.

FWIW, the "mysqldump -x" flag locks all tables in all databases for the duration of the dump. There's no way such a dump would develop any inconsistencies *during* the dump... the data would have had to be inconsistent when the dump started. The only other possibilities that occur to me are that the dump itself was corrupted, or that the import somehow failed.

If we can dig up any dumps between Aug 24 and today that are useful, it might be a good idea to try a "git bisect"-style attack, and see if we can get *something* to work.
Up-and-old isn't great because it's weeks old at this point and the FS is out of sync. I certainly hope we can dig up dumps between Aug 24 and today - they should be happening every night (see the last comment in bug 685746). I don't know if that will help us or not, since we haven't had this problem in the past. The import for the db you gave me is still running, unfortunately.
I am loading the dump file into a 5.1 instance on cm-webdev01-slave01 right now. Once this is complete, I will let you know and we can test against it.
It took a little under 4 hours for me earlier today. Is yours done now? If so, can you let me know the db name?
The dump is complete. The database name is addons_dev and the user/pass credentials are the same as on dev1.db.phx1.mozilla.com.
I have the code from HEAD running with that db at http://khan.mozilla.org:8008/en-US/firefox/ . I haven't been able to make it fail yet by clicking around. Can you flush -dev's memcache? The broken result may be cached there. Next ideas?
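For reference, one way to flush -dev's memcache is from the Django shell on a web node, assuming the running settings point the cache at the -dev memcached pool (a sketch, not necessarily the exact procedure used here):

# e.g. python manage.py shell from the zamboni checkout on a -dev web node
from django.core.cache import cache

# Drops every key in the configured cache backend, so any stale
# "no current version" results cached by the site go away.
cache.clear()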
I have a minimal set of steps to reproduce this bug now. Below is the output showing the failure case on -dev:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

In [1]: from addons.models import Addon

In [2]: x = Addon.objects.get(pk=2108)

In [3]: x.current_version
---------------------------------------------------------------------------
DoesNotExist                              Traceback (most recent call last)

/data/www/addons-dev.allizom.org/zamboni/<ipython console> in <module>()

/data/www/addons-dev.allizom.org/zamboni/apps/addons/models.py in current_version(self)
    520         if self.type == amo.ADDON_PERSONA:
    521             return
--> 522         if not self._current_version:
    523             self.update_version()
    524         return self._current_version

/data/www/addons-dev.allizom.org/zamboni/vendor/src/django/django/db/models/fields/related.py in __get__(self, instance, instance_type)
    312             db = router.db_for_read(self.field.rel.to, instance=instance)
    313             if getattr(rel_mgr, 'use_for_related_fields', False):
--> 314                 rel_obj = rel_mgr.using(db).get(**params)
    315             else:
    316                 rel_obj = QuerySet(self.field.rel.to).using(db).get(**params)

/data/www/addons-dev.allizom.org/zamboni/vendor/src/django/django/db/models/query.py in get(self, *args, **kwargs)
    347         if not num:
    348             raise self.model.DoesNotExist("%s matching query does not exist."
--> 349                     % self.model._meta.object_name)
    350         raise self.model.MultipleObjectsReturned("get() returned more than one %s -- it returned %s! Lookup parameters were %s"
    351                 % (self.model._meta.object_name, num, kwargs))

DoesNotExist: Version matching query does not exist.

In [4]:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Running those exact same steps on my box using the copy of the -dev database results in success:

In [3]: x.current_version
Out[3]: <Version: 1.2.2>

Can you verify everything is updating correctly (particularly everything in vendor/)? It may be using different libraries. I haven't had the chance to dig into this yet this morning.
The slave is out of sync. Please fix and add to nagios.

In [3]: from django.db import connections

In [4]: c = connections['default'].cursor()

In [5]: c.execute('select * from versions where id=1269487')
Out[5]: 1L

In [6]: c.fetchall()
Out[6]: ((1269487L, 2108L, 6L, u'1.2.2', u'', 2450543L, datetime.datetime(2011, 9, 2, 21, 32, 10), datetime.datetime(2011, 9, 2, 21, 33, 30), 1020200200100L, None, None, 0, 0),)

In [7]: c = connections['slave'].cursor()

In [8]: c.execute('select * from versions where id=1269487')
Out[8]: 0L

In [9]: c.fetchall()
Out[9]: ()

In [10]:
Applications writing to the slave caused replication from the master to stop. That's why the slave didn't match the master - writes were occurring during the initial db load. It's back up and replicating now.
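A minimal sketch of how that state could be confirmed and guarded against from the Django shell (SHOW SLAVE STATUS and the read_only flag are standard MySQL; the 'slave' connection alias matches the session above):

from django.db import connections

c = connections['slave'].cursor()
c.execute('SHOW SLAVE STATUS')
row = c.fetchone()
cols = [col[0] for col in c.description]
status = dict(zip(cols, row)) if row else {}
print('IO=%s SQL=%s lag=%s' % (status.get('Slave_IO_Running'),
                               status.get('Slave_SQL_Running'),
                               status.get('Seconds_Behind_Master')))

# Making the slave read-only stops stray application writes from breaking
# replication again (requires SUPER privilege, so normally a DBA step).
c.execute('SET GLOBAL read_only = 1')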
Can we please set up a nagios check for replication on the dev servers? They are pretty important to the teams that use them.
https://bugzilla.mozilla.org/show_bug.cgi?id=687960 has been created to add nagios checks
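For illustration, a bare-bones Nagios-style replication check could look something like the sketch below; host, credentials, and the lag threshold are placeholders, and the actual check added in bug 687960 may well be a stock plugin instead:

#!/usr/bin/env python
# Nagios plugin convention: exit 0 = OK, 1 = WARNING, 2 = CRITICAL.
import sys
import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(host='dev1.db.phx1.mozilla.com',   # placeholder host
                       user='nagios', passwd='secret',    # placeholder credentials
                       cursorclass=MySQLdb.cursors.DictCursor)
cur = conn.cursor()
cur.execute('SHOW SLAVE STATUS')
status = cur.fetchone()

if not status or status['Slave_IO_Running'] != 'Yes' or status['Slave_SQL_Running'] != 'Yes':
    print('CRITICAL: replication is not running')
    sys.exit(2)

lag = status['Seconds_Behind_Master']
if lag is None or lag > 600:
    print('WARNING: slave is %s seconds behind master' % lag)
    sys.exit(1)

print('OK: slave is %s seconds behind master' % lag)
sys.exit(0)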
The front page is still a traceback. https://addons-dev.allizom.org/services/monitor suggests that all 4 redis boxes are running as masters, whereas 2 are supposed to be slaves. This could be the cause for our caching problems.
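A quick way to confirm what each box thinks it is, using redis-py (the hostnames are placeholders; the /services/monitor page above is surfacing the same INFO fields):

import redis

# Placeholder hostnames - substitute the four -dev redis boxes.
hosts = ['redis1.example', 'redis2.example', 'redis3.example', 'redis4.example']

for host in hosts:
    info = redis.Redis(host=host).info()
    # A correctly configured slave reports role:slave plus its master's address.
    print('%s role=%s master=%s' % (host, info.get('role'), info.get('master_host')))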
(In reply to Wil Clouser [:clouserw] from comment #43)
> The front page is still a traceback.
> https://addons-dev.allizom.org/services/monitor suggests that all 4 redis
> boxes are running as masters, whereas 2 are supposed to be slaves. This
> could be the cause for our caching problems.

Redis problem split off into https://bugzilla.mozilla.org/show_bug.cgi?id=688035
What's the status of this bug?
The site is back up, but I don't remember if it got a new dump or not. I suggest we close this and we'll file a fresh bug the next time we need a dump. Thanks.
Closing, reopen the next time you need a dump
Status: REOPENED → RESOLVED
Closed: 14 years ago → 13 years ago
Resolution: --- → FIXED
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard