Closed
Bug 685269
Opened 14 years ago
Closed 12 years ago
Bugzilla graph data needs to get synced between SCL3 and PHX
Categories
(Developer Services :: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: justdave, Assigned: fubar)
References
Details
The cron jobs that generate all the stats in Bugzilla are still running in SJC. Whether they stay there or move to PHX, some of the data updated by these cron jobs is stored on disk, and needs to be synced between the colos. There used to be some general automation that synced the NFS stores for all of the clusters between SJC and PHX, but that seems to have gone away at some point.
In particular, Bugzilla needs /mnt/bugzilla/data/mining to be cloned and kept up to date in both colos. This is all generated by a once-daily cron job, so having it rsync from mradm02 immediately after the existing cron job completes would be fine (and a good way to keep track of it in case the cron job moves).
Reporter
Comment 1•14 years ago
SJC is Bugzilla's failover, so even if the cron job moves to PHX, it'll need to sync back the other direction to keep SJC up-to-date in that case.
Comment 2•14 years ago
can we go ahead and get this moved to phx1 in prep for scl3??
Updated•14 years ago
Assignee: server-ops → justdave
Comment 3•14 years ago
we should keep the entire data/ directory in sync between the two sites, not just data/mining.
Reporter
Comment 4•14 years ago
I don't know how this works, that's why I put it in the queue. Last time I tried to reverse-engineer how the nfs rsyncs were working I broke it. Needs to go to someone in webops who knows how the nfs rsync scripts work on the cluster masters.
Assignee: justdave → server-ops
Reporter
Comment 5•14 years ago
(In reply to Byron Jones ‹:glob› from comment #3)
> we should keep the entire data/ directory in sync between the two sites, not
> just data/mining.
Actually, no, because most of it is temp storage that's generated at runtime.
What specifically needs to be synced? I know we should avoid syncing temp files.
Comment 6•14 years ago
i didn't think there would be any real harm in syncing the entire data/ directory, but you're right, if we can avoid syncing the temp files we should.
minimally we just need data/mining/ and data/params
Comment 7•14 years ago
What is wrong with just running the relevant cron scripts for Bugzilla in both locations, each keeping its own data store, instead of treating one as authoritative and syncing it to the other? Since the databases are replicating, they are both getting the same data and generating identical files anyway (in theory). The only issue I could see is if one (SJC, probably) had not had its cron scripts running for a period of time unchecked and then an emergency changeover was needed. If the previously active location no longer had its data contents for whatever reason and the backup location's data contents were also broken, then we would have a problem.
Also, this becomes a bigger issue if we ever enable storage of bug attachments on disk instead of in the DB. Not sure why we would do that other than to save space in the databases themselves, but if we did, it would be imperative to have some sort of syncing system in place.
dkl
Updated•14 years ago
Assignee: server-ops → scabral
Component: Server Operations: Web Operations → Server Operations: Database
QA Contact: mrz → cshields
Comment 8•14 years ago
So, a few things:
* the MAILTO on /etc/cron.d/bugzilla.cron points to justdave -- I copied that to phx but we may want to rethink that (change in sjc and phx)
* There's a collectstats.pl script and a whine.pl script, and I have set up their cron entries on phx. I commented out the 3rd script, which checks for unsent bugmail and sends it if it hasn't been sent, because that script isn't on phx.
I'm going to close this as FIXED now, but feel free to check out what I did.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment 9•14 years ago
(In reply to Sheeri Cabral [:sheeri] from comment #8)
> There's a collectstas.pl script and a whine.pl script that exist and I have
> set their cron scripts on phx.
i don't think the whine.pl script should run from multiple locations; can you please disable it on PHX for now? it's probably safe, but i'm concerned about race conditions causing multiple emails to be sent.
> I commented out the 3rd script, which checks for unsent bugmail and sends if
> it hasn't been sent, because that script isn't on phx.
that script should be run somewhere; please uncomment it on SJC only.
if you're not going to sync data/mining between the two instances, then the only cron change that should happen right now is running collectstats.pl on SJC. i think it would be wise to hold off any other bmo deployment changes until after the christmas break, when we're back from leave.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter
Comment 10•14 years ago
whine.pl and send_unsent should only be run in one place. Either place is fine, as long as it's only one. collectstats.pl updates stuff both on the filesystem and in the database. It also needs to run in only one place, with the results on the filesystem rsynced to the other. Otherwise you might screw up stuff in the database by having both of them trying to write to it at once.
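One simple way to enforce "run in only one place" is a guard at the top of the cron wrapper that consults a flag recording which site is active. This is purely illustrative (the flag-file path and site names are invented, not how BMO was actually configured):

```shell
#!/bin/sh
# Sketch: run the cron payload only if this host is in the active site,
# as recorded in a shared flag file (path and site names hypothetical).
set -eu

is_active_site() {
    flag=$1      # file containing the name of the currently active colo
    mysite=$2    # name of the site this host belongs to
    [ -r "$flag" ] && [ "$(cat "$flag")" = "$mysite" ]
}

# Example cron wrapper usage (everything here is an assumption):
#   is_active_site /etc/bugzilla/active-site scl3 && ./collectstats.pl
```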
Comment 11•14 years ago
Verified that only collectstats.pl is running in phx. SJC is running whine.pl and collectstats.pl.
Status: REOPENED → RESOLVED
Closed: 14 years ago → 14 years ago
Resolution: --- → FIXED
Reporter
Comment 12•14 years ago
(In reply to Sheeri Cabral [:sheeri] from comment #11)
> Verified that only collectstats.pl is running in phx. SJC is running
> whine.pl and collectstats.pl.
Which was exactly the problem. collectstats.pl can only be run in one place also. Running it in both places will cause data conflicts.
(In reply to Dave Miller [:justdave] from comment #10)
> Collectstats updates stuff both on the
> filesystem and in the database. It needs to run in only one place also and
> the results that are on the filesystem be rsynced to the other. Otherwise
> you might screw up stuff in the database by having both of them trying to
> write to it at once.
I just removed it from phx for now.
Trying to figure out which direction the rsync is going, will put it back on whichever side is sourcing.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter
Comment 13•14 years ago
hmm, can't find an rsync anywhere; the sync-nfs thing on ip-admin02 still has bugzilla's entry commented out (which is good, because that script was messing things up the last time we used it on bugzilla). Can't find anything equivalent on mradm02.
Reporter
Comment 14•14 years ago
This is actually the reason I tossed this back into the queue before: there was a sync-nfs script set up specifically for this purpose a while back, and I didn't understand how it was working and couldn't make it do what I wanted. Pretty sure oremj was the one who set it up, but I don't really remember.
I should probably just write a dedicated script for it and not try to fit in with the rest of the cluster stuff.
Assignee: scabral → justdave
Reporter
Comment 16•13 years ago
hmm, this has nothing to do with databases.
Assignee: justdave → server-ops
Status: REOPENED → NEW
Component: Server Operations: Database → Server Operations: Web Operations
Updated•13 years ago
Assignee: server-ops → ashish
Comment 17•13 years ago
Is this still an issue now that the sjc servers are gone? If so, we should probably build some production bugzilla servers in scl3, which is another issue entirely.
Comment 18•13 years ago
Is this still an issue now that the sjc servers are gone? If so, we should probably build some production bugzilla servers in scl3, which is another issue entirely.
Comment 19•13 years ago
(In reply to Sheeri Cabral [:sheeri] from comment #18)
> We should probably build some production bugzilla servers in scl3, which is another
> issue entirely.
that's bug 726710
Comment 20•13 years ago
OK, so I've renamed this bug to be about syncing the graph data between scl3 and phx, and made it depend on bug 726710. If that's not right, let me know... but now we know what the next action for this bug is.
Depends on: 726710
Summary: Bugzilla graph data needs to get synced between SJC and PHX → Bugzilla graph data needs to get synced between SCL3 and PHX
Comment 21•13 years ago
Punting this to Dev Svcs to set this up along with Bug 726710.
Assignee: ashish → server-ops-devservices
Component: Server Operations: Web Operations → Server Operations: Developer Services
QA Contact: cshields → shyam
Comment 22•12 years ago
Over to storage for some advice. Not super critical, but at some point we need to find a nice method to keep 10.22.82.11:/vol/bugzilla_prod in sync with a new netapp volume in phx1 (when we get to bug 820918).
Assignee: server-ops-devservices → server-ops-storage
Component: Server Operations: Developer Services → Server Operations: Storage
QA Contact: shyam → dparsons
Comment 23•12 years ago
Is a solution for this still needed?
Comment 24•12 years ago
(In reply to Dan Parsons [:lerxst] from comment #23)
> Is a solution for this still needed?
Yes. We still don't have anything except rsyncing things over.
Comment 25•12 years ago
I'm not sure there's going to be anything to offer here.
There's no bidirectional syncing mechanism for r/w vols.
If r/o on the standby side were an option, a side door would open, except that we don't have snapmirror licensed. But if someone ponied up some Really Serious Dough, we could:
* have an r/w vol on one side mirrored r/o on the other, and vice versa (picture an X), but it'd still take localized crons to massage the data from the r/o copy up to the local r/w one on the standby site (picture an X with one side looping). Though at that point it feels like just an expensive way to rsync not-directly-across-the-WAN; or
* have a mirror from the active site to the standby site. But that would mean having steps to reverse the polarity on the mirror during failovers, and I don't know if there's any weirdness in the code if it finds that vol r/o on the standby site.
Comment 26•12 years ago
:justdave, given the options presented so far here, which sounds best to you?
Comment 27•12 years ago
CC'ing Jake. He's going to be responsible for the operational side of this.
Comment 28•12 years ago
:jakem, any thoughts on the best way to do this? is there any way the storage team can help?
Flags: needinfo?(nmaul)
Reporter
Comment 29•12 years ago
The data that needs to be synced is all generated by a cron job; it just needs to be available in both places. Wherever the cron job runs should be considered the master copy, and if Bugzilla is running in multiple locations, that data needs to be synced to the other locations as soon as possible after the cron job completes (I think it's only daily anyway).
Comment 30•12 years ago
:justdave, unfortunately there is no way we can really help you with this. With NetApp's snapmirror, one side is read only, the other side is the master, and switching them is not an easy or simple thing, unfortunately.
If there's a way a read-only mirror can help you, let me know and we can set that up (only between phx1-hci and scl3-hci, however).
Status: NEW → RESOLVED
Closed: 14 years ago → 12 years ago
Resolution: --- → WONTFIX
Reporter
Comment 31•12 years ago
This was assigned to Storage for advice, not completion (see comment 22). This needs to happen whether you're the ones to solve it or not; comment 30 says you're not, so it goes back to dev services. An rsync on a cron job is probably plenty for this.
Assignee: server-ops-storage → server-ops-devservices
Status: RESOLVED → REOPENED
Component: Server Operations: Storage → Server Operations: Developer Services
QA Contact: dparsons → nmaul
Resolution: WONTFIX → ---
Comment 32•12 years ago
Kendall, this needs to be done as part of DR for Bugzilla anyway, so if you have some time, let's chat about this and we can put in a cron job and get this squared away.
Assignee: server-ops-devservices → klibby
Comment 33•12 years ago
Clearing my own NEEDINFO, sounds like we have a reasonable solution in comment 31. I'm happy with that.
Side note: this is a problem that Labs is currently toiling away at, using Ceph (http://ceph.com/). It's a modern distributed filesystem, which should be well-suited to at least low-traffic-but-highly-critical data (not yet sure about its performance; IME that's frequently a weak point in distributed filesystems). In this particular case cron+rsync is certainly much simpler, but Ceph should provide us with a fairly generic solution to this basic problem... we've run into it a few times before and never had a great answer.
The other option we've espoused at times is to push this sort of data off to Amazon's S3. It's not the same semantics (the application has to change), but it's a relatively cheap and effective way to outsource this kind of problem.
Flags: needinfo?(nmaul)
Comment 34•12 years ago
(In reply to Jake Maul [:jakem] from comment #33)
> The other option we've espoused at times is to push this sort of data off to
> Amazon's S3. It's not the same semantics (the application has to change),
> but it's a relatively cheap and effective way to outsource this kind of
> problem.
fwiw if we're to change the application, putting this data into the database makes a lot more sense than s3.
Assignee
Comment 35•12 years ago
Cron job to sync bugzilla_prod/{data,graphs} has been set up. It's currently running between push1.bugs.(scl3|phx1), since fox2mike had already set up ssh keys for them and, until this week, we didn't have an admin node in phx1. It needs to be moved to the admin nodes, and ideally all of the cron stuff should be controllable by a puppet param to make failovers easier, but that's a larger task.
It currently runs twice an hour (at :15 and :45): it's not a large amount of data, I wasn't sure how long collectstats.pl takes to run, and a lower recovery point objective isn't a bad thing. Output, if any, goes to cron-bugzilla.
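Based on that description, the schedule would look roughly like this crontab fragment (hostnames and volume paths are assumptions; only the :15/:45 schedule, the data/graphs directories, and the cron-bugzilla mail alias come from the comment):

```shell
# /etc/cron.d/bugzilla-sync -- illustrative only; real hosts/paths differ
MAILTO=cron-bugzilla
# Mirror the stats data and graphs to the other site twice an hour.
15,45 * * * * root rsync -a --delete /data/bugzilla_prod/data/ phx1-admin:/data/bugzilla_prod/data/
15,45 * * * * root rsync -a --delete /data/bugzilla_prod/graphs/ phx1-admin:/data/bugzilla_prod/graphs/
```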
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Updated•11 years ago
Component: Server Operations: Developer Services → General
Product: mozilla.org → Developer Services