Bug 685669: [amo] Refresh -dev db and filesystem
Opened 14 years ago • Closed 13 years ago
Categories: Infrastructure & Operations Graveyard :: WebOps: Other (task)
Tracking: not tracked
Status: RESOLVED FIXED
People: Reporter: clouserw • Assignee: mpressman
Please refresh the -dev database. This should be imported from production and go into a differently named database. Once it's done swap the configs to point to the new one which will minimize downtime.
The filesystem sync has no such fortune and just kinda has to go.
Bonus points for not syncing the update_counts and download_counts tables; your call on how to do that. It'll save us hours of import time if you skip them.
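For reference, mysqldump can skip individual tables with --ignore-table (one flag per table, in db.table form). A minimal sketch of the dump side, with hypothetical host, user, and database names:

  # Dump production while skipping the two stats tables; --single-transaction
  # keeps InnoDB tables consistent without a global lock.
  mysqldump -h PROD_DB_HOST -u USER -p --single-transaction \
    --ignore-table=addons.update_counts \
    --ignore-table=addons.download_counts \
    addons > addons_refresh.sql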
Updated • Reporter • 14 years ago
Summary: Refresh -dev db and filesystem → [amo] Refresh -dev db and filesystem
Comment 1 • Reporter • 14 years ago
What's the status of this? I'd like to do another run on the preview site before doing it in production, but since this takes so long that may not be possible. Can you update this without the stats tables and maybe we can do it today?
Severity: normal → major
Updated • 14 years ago
Assignee: server-ops → mpressman
Comment 2 • Reporter • 14 years ago
What's the word on this? The SDK was released yesterday but we haven't been able to upgrade our add-ons because this isn't done yet. -> critical
Severity: major → critical
Comment 3 • Assignee • 14 years ago
which dev host are you looking for a refresh on? The host addons-webdev1.db.phx1 is currently up to date and replicating from the AMO master
Comment 4 • Reporter • 14 years ago
(In reply to Matt Pressman from comment #3)
> which dev host are you looking for a refresh on? The host
> addons-webdev1.db.phx1 is currently up to date and replicating from the AMO
> master
addons-dev.allizom.org
Comment 5 • Reporter • 14 years ago
(In reply to Wil Clouser [:clouserw] from comment #4)
> (In reply to Matt Pressman from comment #3)
> > which dev host are you looking for a refresh on? The host
> > addons-webdev1.db.phx1 is currently up to date and replicating from the AMO
> > master
>
> addons-dev.allizom.org
Er, comment went too soon. That's the host I'm interested in. I don't know what addons-webdev1.db.phx1 is. Is that the host serving addons-dev.allizom.org?
Comment 6 • Assignee • 14 years ago
in #webdev fligtar said he needed the host for a demo, so I am prepping the dump, but won't import it until I get the go-ahead.
Comment 7 • Reporter • 14 years ago
Demo is over, let's do it
Comment 8 • Assignee • 14 years ago
amo refresh for addons_dev db on dev1.db.phx1.mozilla.com is complete
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment 9 • Reporter • 14 years ago
The site is currently down. Looks like something in the versions table is missing. How did you filter the tables you didn't load?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 10 • Assignee • 14 years ago
I simply dumped all tables except for the update_counts and download_counts tables. There is no direct foreign key constraint on versions to either of those tables. Can you provide me with the query or debug that is pointing to the versions table?
Comment 11 • Reporter • 14 years ago
Umm, not really. With some debug flags on we can make it log queries but they'd be fast and furious and I don't know how to do that with wsgi.
Can you describe how you dumped the tables? If you took it right off a live slave I suspect it dumped one table (addons?) well before the alphabetically-distant versions and they got out of sync.
Comment 12 • Reporter • 14 years ago
The site is 100% down so this is blocking all QA as well as upgrading jetpacks to the new sdk which was released 2 days ago. Going to mark as blocker in the morning.
Updated • 14 years ago
Severity: critical → blocker
Comment 13 • Assignee • 14 years ago
I did dump it off a slave, but locked all tables before the dump so that nothing would get out of sync.
Severity: blocker → critical
Comment 14 • Assignee • 14 years ago
since this is a blocker I'll be more verbose, I used the -x flag with mysqldump to lock all tables across all databases
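For context, those are the two standard ways mysqldump can produce a point-in-time snapshot:

  # -x (--lock-all-tables) takes a global read lock for the whole dump, so every
  # table is captured at the same instant, at the cost of blocking writes:
  mysqldump -x --all-databases > full_dump.sql

  # For InnoDB-only data, --single-transaction gives the same consistency from a
  # REPEATABLE READ snapshot without blocking writers:
  mysqldump --single-transaction SOME_DB > some_db.sql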
Comment 15 • Reporter • 14 years ago
So, for next steps - we don't have access to the box this is running on, so debugging is very slow and difficult. If it's not something obvious, we should just do the standard procedure (sketched below):
- complete dump to a new db
- change the config to point to the new db name
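A rough sketch of that procedure, with illustrative names; the site keeps serving the old database until the final config flip, which is what keeps downtime minimal:

  # Load the full dump into a fresh, differently named database.
  mysql -e "CREATE DATABASE addons_dev2"
  mysql addons_dev2 < full_dump.sql

  # Only after the load finishes, point the app's database NAME at the new db
  # (in settings_local.py) and reload the app.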
Comment 16 • Assignee • 14 years ago
there appears to be a db running on the same host called addons_dev2 ???
Comment 17 • Reporter • 14 years ago
(In reply to Matt Pressman from comment #16)
> there appears to be a db running on the same host called addons_dev2 ???
As people without access to these boxes there isn't a lot we can help with here. This is probably a symptom of reloading the db in the past - we toggle back and forth. The settings_local.py files will tell you what db AMO is hitting right now.
Alice: Once this is switched you'll need to change your configs to insert results into the new database.
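A quick way to see which database the app is actually hitting, assuming settings_local.py sits at the zamboni root on the web host (that deploy path shows up in the tracebacks later in this bug):

  grep -n -A 6 "DATABASES" /data/www/addons-dev.allizom.org/zamboni/settings_local.py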
Comment 18 • Assignee • 14 years ago
A full dump including the update_counts and download_counts tables just completed. I will transfer it over and start loading it into addons_dev, since addons_dev2 is missing two tables that do exist in addons_dev: addons_premium and blca
Comment 19 • Reporter • 14 years ago
sounds like addons_dev2 is the inactive db currently then. Since the site is broken it doesn't matter, but on a normal day the correct action would be to load into the addons_dev2 database and then switch the config.
Comment 20 • Assignee • 14 years ago
I'm loading the full dump now into addons_dev
Comment 21 • Assignee • 14 years ago
Full database load has completed. This was taken from addons-webdev1.db.phx1.mozilla.com which is an up to date actively replicated slave of the amo master. As far as I know, this host is only used for metrics queries which would otherwise be too much load on the production hosts. If there are any other issues with the data I can dump from a host that is currently out of production because of hardware issues, but is still replicating.
Comment 22 • 14 years ago
(In reply to Matt Pressman from comment #21)
> Full database load has completed. This was taken from
> addons-webdev1.db.phx1.mozilla.com which is an up to date actively
> replicated slave of the amo master. As far as I know, this host is only used
> for metrics queries which would otherwise be too much load on the production
> hosts. If there are any other issues with the data I can dump from a host
> that is currently out of production because of hardware issues, but is still
> replicating.
https://addons-dev.allizom.org/en-US/firefox/ is still bombing out; I've repurposed bug 687034, which was originally filed about this issue (comment 9). I'm unclear if the new stacktrace is a code or DB issue.
Comment 23 • Reporter • 14 years ago
Matt: Can you give me the database dump in a db on cm-webdev01-master01 so I can try my code with it? I import a new db from production every night onto that box (the `remora` db) and I've never had these problems so I'm not really sure what's going on.
Alternatively you could use one of the standard dumps from production that are mounted in the /data/backup-drop/ folder on that box. I don't know if there is a difference in how they are dumped.
Comment 24 • Reporter • 14 years ago
This is now blocking the builder team as well. We need to get this resolved this morning. How can I help?
Severity: critical → blocker
Comment 25 • 14 years ago
As Stephen mentioned in comment 22, the traceback getting generated is "DoesNotExist: Version matching query does not exist" and the details are available at bug 687034.
Comment 26 • Reporter • 14 years ago
(In reply to krupa raj 82[:krupa] from comment #25)
> As Stephen mentioned in comment 22, the traceback getting generated is
> "DoesNotExist: Version matching query does not exist" and the details are
> available at bug 687034.
Are you saying it's a code problem and not the database refresh causing the traceback?
Comment 27 • 14 years ago
(In reply to Wil Clouser [:clouserw] from comment #26)
> Are you saying it's a code problem and not the database refresh causing the
> traceback?
Not sure -- still seems DB/schema-related, according to the stacktrace.
Comment 28 • Reporter • 14 years ago
(In reply to Stephen Donner [:stephend] from comment #27)
> (In reply to Wil Clouser [:clouserw] from comment #26)
> > Are you saying it's a code problem and not the database refresh causing the
> > traceback?
>
> Not sure -- still seems DB/schema-related, according to the stacktrace.
I agree. Matt: we're all on IRC, please come talk to us if there isn't a clear course of action here.
Comment 29 • Assignee • 14 years ago
I'm putting the dump on cm-webdev01-master01 - additionally, whatever file you're using from the /data/backup-drop/ folder would not be from production since it got moved from sjc to phx. I believe this was part of the reason for creating the addons-webdev1.db.phx1 host.
Comment 30 • Assignee • 14 years ago
If this is db/schema related, I can very quickly determine the differences returned based on the queries that are running. Since we have a stack trace, can someone post the SQL around the trace?
Comment 31 • Reporter • 14 years ago
(In reply to Matt Pressman from comment #29)
> I'm putting the dump on cm-webdev01-master01 - additionally, whatever file
> your using from the /data/backup-drop/ folder would not be from production
> since it got moved from sjc to phx. I believe this was part of the reason
> for creating the addons-webdev1.db.phx1 host
They are stale, that's bug 685746. :-/
Comment 32 • 14 years ago
Would it be of any value to import the most recent dump from cm-webdev01-master01 for the time being? I realize it's out of date, but "up-and-old" might be better than "down-and-new", while we troubleshoot the issue.
FWIW, the "mysqldump -x" flag locks all tables in all databases for the duration of the dump. There's no way such a dump would develop any inconsistencies *during* the dump... the data would have had to be inconsistent when the dump started. The only other possibilities that occur to me are that the dump itself was corrupted, or that the import somehow failed.
If we can dig up any dumps between Aug 24 and today that are useful, it might be a good idea to try a "git bisect"-style attack, and see if we can get *something* to work.
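A sketch of that bisect-style approach, with hypothetical dump file names: load each dated dump into a scratch database and smoke-test until one works.

  for dump in /data/backup-drop/addons-2011-08-*.sql.gz; do
      mysql -e "DROP DATABASE IF EXISTS addons_bisect; CREATE DATABASE addons_bisect"
      zcat "$dump" | mysql addons_bisect
      # point a test config at addons_bisect and hit the front page before
      # moving on to the next dump
  done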
Comment 33 • Reporter • 14 years ago
up-and-old isn't great because it's weeks old at this point and the FS is out of sync.
I certainly hope we can dig up dumps between Aug 24 and today - they should be happening every night (see the last comment in bug 685746). I don't know if that will help us or not, since we haven't had this problem in the past. The import for the db you gave me is still running, unfortunately.
Comment 34 • Assignee • 14 years ago
I am loading the dump file into a 5.1 instance on cm-webdev01-slave01 right now. Once this is complete, I will let you know and we can test against it.
Comment 35 • Reporter • 14 years ago
It took a little under 4 hours for me earlier today. Is yours done now? If so, can you let me know the db name?
Comment 36 • Assignee • 14 years ago
The dump is complete. The database name is addons_dev, and the user/pass credentials are the same as on dev1.db.phx1.mozilla.com.
Comment 37 • Reporter • 14 years ago
I have the code from HEAD running with that db at http://khan.mozilla.org:8008/en-US/firefox/ . I haven't been able to make it fail yet by clicking around.
Can you flush -dev's memcache? The broken result may be cached there.
Next ideas?
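For the memcache flush, the usual approach without restarting the daemon is flush_all from the memcached text protocol; the host name below is a placeholder:

  echo flush_all | nc MEMCACHE_HOST 11211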
Comment 38 • Reporter • 14 years ago
I have a minimal set of steps to reproduce this bug now. Below is the output showing the failure case on -dev:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
In [1]: from addons.models import Addon
In [2]: x = Addon.objects.get(pk=2108)
In [3]: x.current_version
---------------------------------------------------------------------------
DoesNotExist Traceback (most recent call last)
/data/www/addons-dev.allizom.org/zamboni/<ipython console> in <module>()
/data/www/addons-dev.allizom.org/zamboni/apps/addons/models.py in current_version(self)
520 if self.type == amo.ADDON_PERSONA:
521 return
--> 522 if not self._current_version:
523 self.update_version()
524 return self._current_version
/data/www/addons-dev.allizom.org/zamboni/vendor/src/django/django/db/models/fields/related.py in __get__(self, instance, instance_type)
312 db = router.db_for_read(self.field.rel.to, instance=instance)
313 if getattr(rel_mgr, 'use_for_related_fields', False):
--> 314 rel_obj = rel_mgr.using(db).get(**params)
315 else:
316 rel_obj = QuerySet(self.field.rel.to).using(db).get(**params)
/data/www/addons-dev.allizom.org/zamboni/vendor/src/django/django/db/models/query.py in get(self, *args, **kwargs)
347 if not num:
348 raise self.model.DoesNotExist("%s matching query does not exist."
--> 349 % self.model._meta.object_name)
350 raise self.model.MultipleObjectsReturned("get() returned more than one %s -- it returned %s! Lookup parameters were %s"
351 % (self.model._meta.object_name, num, kwargs))
DoesNotExist: Version matching query does not exist.
In [4]:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Running those exact same steps on my box using the copy of the -dev database results in success:
In [3]: x.current_version
Out[3]: <Version: 1.2.2>
Can you verify everything is updating correctly (particularly everything in vendor/)? It may be using different libraries. I haven't had the chance to dig into this yet this morning.
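On the vendor/ question: the traceback paths above show zamboni loading Django from vendor/src/, typically a checkout of a separate vendor repo that uses git submodules, so a stale checkout can mean different library versions per box. A typical refresh, though the exact workflow on the -dev host may differ:

  cd /data/www/addons-dev.allizom.org/zamboni/vendor
  git pull
  git submodule update --init --recursive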
Comment 39 • Reporter • 14 years ago
The slave is out of sync. Please fix and add to nagios.
In [3]: from django.db import connections
In [4]: c = connections['default'].cursor()
In [5]: c.execute('select * from versions where id=1269487')
Out[5]: 1L
In [6]: c.fetchall()
Out[6]:
((1269487L,
2108L,
6L,
u'1.2.2',
u'',
2450543L,
datetime.datetime(2011, 9, 2, 21, 32, 10),
datetime.datetime(2011, 9, 2, 21, 33, 30),
1020200200100L,
None,
None,
0,
0),)
In [7]: c = connections['slave'].cursor()
In [8]: c.execute('select * from versions where id=1269487')
Out[8]: 0L
In [9]: c.fetchall()
Out[9]: ()
In [10]:
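For the requested check, SHOW SLAVE STATUS is the standard probe; the fields to watch are the replication thread states and the lag:

  mysql -e "SHOW SLAVE STATUS\G" | grep -E "Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master"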
Comment 40 • Assignee • 14 years ago
Applications writing to the slave caused replication from the master to stop; the slave didn't match the master because those writes occurred during the initial db load. It's back up and replicating now.
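One way to prevent a repeat is to keep the slave read-only; the replication thread, and users with SUPER, can still write, but stray application writes get rejected:

  mysql -e "SET GLOBAL read_only = ON;"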
Comment 41 • 14 years ago
Can we please set up a nagios check for replication on the dev servers? They are pretty important to the teams that use them.
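The stock nagios-plugins check_mysql already ships a slave check (-S / --check-slave) that verifies the replication threads are running; the plugin path, host, and credentials below are placeholders:

  /usr/lib/nagios/plugins/check_mysql -H cm-webdev01-slave01 -u nagios -p PASSWORD -S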
Comment 42 • Assignee • 14 years ago
https://bugzilla.mozilla.org/show_bug.cgi?id=687960 has been created to add nagios checks
Comment 43 • Reporter • 14 years ago
The front page is still a traceback. https://addons-dev.allizom.org/services/monitor suggests that all 4 redis boxes are running as masters, whereas 2 are supposed to be slaves. This could be the cause for our caching problems.
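For checking the boxes: redis-cli INFO reports a role field (role:master or role:slave), and SLAVEOF (the reattach command in redis of this era) points a slave back at its master. Host names are placeholders:

  redis-cli -h REDIS_HOST INFO | grep ^role
  redis-cli -h REDIS_SLAVE_HOST SLAVEOF REDIS_MASTER_HOST 6379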
Comment 44 • Reporter • 14 years ago
(In reply to Wil Clouser [:clouserw] from comment #43)
> The front page is still a traceback.
> https://addons-dev.allizom.org/services/monitor suggests that all 4 redis
> boxes are running as masters, whereas 2 are supposed to be slaves. This
> could be the cause for our caching problems.
Redis problem split off into https://bugzilla.mozilla.org/show_bug.cgi?id=688035
Comment 45 • 13 years ago
What's the status of this bug?
Comment 46 • Reporter • 13 years ago
The site is back up, but I don't remember if it got a new dump or not. I suggest we close this and we'll file fresh the next time we need a dump.
Thanks.
Comment 47 • Assignee • 13 years ago
Closing, reopen the next time you need a dump
Status: REOPENED → RESOLVED
Closed: 14 years ago → 13 years ago
Resolution: --- → FIXED
Updated • 12 years ago
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Updated • 6 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard