Closed Bug 656135 Opened 13 years ago Closed 13 years ago

push MindTouch 2010 to developer.mozilla.org on 2011-08-30

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

Platform: All
OS: Other
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: groovecoder, Assigned: nmaul)

References

()

Details

      No description provided.
Jeremy,

Do you have notes from the MindTouch 2010 staging upgrade?
I think it is going to be:
* revert all the security patches
* svn switch to the mindtouch 2010 branch
* sync that out
* run php maintenance/update-db.php
* reapply all the patches
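A rough command-line sketch of those steps, for reference (the document root path, patch filenames, and branch name below are placeholders, not the real values):

  cd /data/dekiwiki                                  # assumed document root on the admin host
  # revert the locally-applied security patches (placeholder filenames), in reverse order
  for p in $(ls -r local-patches/*.patch); do patch -R -p0 < "$p"; done
  # switch the checkout to the MindTouch 2010 branch (BRANCH is a placeholder)
  svn switch https://svn.mindtouch.com/source/public/dekiwiki/BRANCH/web
  # sync the switched checkout out to the web heads (site-specific push step, not shown)
  # run the schema upgrade
  php maintenance/update-db.php
  # re-apply the security patches
  for p in local-patches/*.patch; do patch -p0 < "$p"; done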
is there a revert-db.php also?
(In reply to comment #3)
> is there a revert-db.php also?

Nope.
did we test a rollback in staging? :/

Craig, did you make most of the MindTouch 2010 bug fixes? Can you be on-hand for the push?
After scanning through the bug 600834 dependencies, I think the potential issues could be:

* Product Activation Key - we may have to re-activate MindTouch after the upgrade to fix API issues. (bug 605549 and bug 605645)
* Restore site preferences - such as the From: email (bug 646989)
* Re-apply security patches (do we have a list of all applied patches? one way to pull one together is sketched below)

We would like to push tomorrow at 2pm PST.
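
On the "list of all applied patches" question: a minimal sketch of one way to build that list, assuming the security fixes live as uncommitted local modifications in the SVN working copy (an assumption; the path is a placeholder):

  cd /data/dekiwiki                                # assumed document root of the dekiwiki checkout
  svn status | grep '^M'                           # files carrying local (patched) changes
  svn diff > applied-security-patches.diff         # one diff we can keep and re-apply after the switch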
Blocks: 593941
Depends on: 656444
No longer blocks: 593941
Blocks: 593941
Assignee: server-ops → nmaul
Should this be WONTFIX'd? We're upgrading developer-stage9 first, right? Then this sometime afterwards?

The project plan from MindTouch (received today) calls for stage9 to be upgraded on or around May 27, and presumably a prod upgrade by June 14, if not sooner (depending on QA + bugs found + fixes from MT).
Status: NEW → ASSIGNED
No need to WONTFIX it - we will fix it. Just not until 6/14. :)
Whiteboard: [waiting on 656444]
During this upgrade, we also need to upgrade Mono to 2.10. Just noting this here so it's not forgotten.
See this bug for details on the -stage9 upgrade, which this should roughly mirror:

https://bugzilla.mozilla.org/show_bug.cgi?id=656444
We are still seeing huge CPU loads on sm-devmostage01 and 02 at times, presumably corresponding with some type of script being run against developer-stage9.mozilla.org.

I cannot in good conscience recommend this upgrade. Performance/concurrency on -stage9 is undeniably worse, and the only explanation is that it can't handle pages with large numbers of images or attachments. We are having to restart mono processes daily (or more) to get the servers back operational.

The problem is not just memory usage (which seems to be better now), as that was apparently primarily caused by sheppy's mass-import script adding many images/attachments to the same page. The current problem is completely maxed-out CPU usage... high enough to cause mono to completely stop responding on both servers, although the rest of the server is fine. Apache/SSH respond normally... mono/dekiwiki doesn't.

I believe this problem will persist in production; it just might take a bit longer before the whole cluster is frozen solid. I don't believe this is some type of legitimate high CPU usage that will go away if it runs long enough; we've tried letting it run for several minutes, with no obvious effect except that Mono/dekiwiki is non-responsive throughout.


If you'd like to proceed anyway, just let us know. I don't have any concern about being able to do the upgrade, only that I believe it will adversely affect cluster performance and reliability. The main reason to proceed as planned would be that you know what is causing -stage9 to die tonight, and can prevent it somehow. :)


In the meantime, I have stopped -stage9 for the night. Something is killing it, causing on-call to be paged repeatedly. We can easily bring this back up tomorrow morning with a simple 'service dekiwiki9 start' on the 2 servers.
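
For whoever is on call, a minimal restart sketch (hostnames taken from the earlier comment; nothing here beyond the command already quoted above):

  # on sm-devmostage01 and sm-devmostage02
  service dekiwiki9 start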
jakem: can you contact MindTouch tomorrow morning to see what else they might have to say about this? It's disappointing that even after working with them, we aren't confident enough to do the upgrade... nor do we know exactly why sheppy's script (or any other odd behavior) is bringing stage9 down.

A few random questions we might want to ask:

1. If this is a Mono issue, and MindTouch 10 is dependent on the new Mono... what are our options? Is there any way to get the fixes in MindTouch 10 without the Mono upgrade?

2. Why can't they work with you to investigate this further and better understand our issues? Getting a simple answer about what *might* be causing the meltdown is not satisfactory... even if what sheppy was doing is an edge case and can't be supported without perf problems.

3. What is the proper escalation path to get to the bottom of our current issues? Do they need to fly out their best support team to work directly with our servers? Is that something they can do remotely with cooperation from our IT team?
(In reply to comment #11)

> I cannot in good conscience recommend this upgrade. Performance/concurrency
> on -stage9 is undeniably worse, and the only explanation is that it can't
> handle pages with large numbers of images or attachments. We are having to
> restart mono processes daily (or more) to get the servers back operational.

I did change my script to no longer attach lots of stuff to one page; instead of a single page for all attachments, they're now actually being attached to the pages that use them.

I wish I had known before you killed stage9; you sort of bunged up the test I was running against it. :)
Depends on: 661370
No longer depends on: 656444
Whiteboard: [waiting on 656444] → [waiting on 661370]
heads up on this. bug 661370 is resolving, so we're tentatively scheduling this for Aug 30th if that's okay.
(In reply to Luke Crouch [:groovecoder] from comment #14)
> heads up on this. bug 661370 is resolving, so we're tentatively scheduling
> this for Aug 30th if that's okay.

Let's go for it.  It's been long enough... we'll see what happens.
Summary: push MindTouch 2010 to developer.mozilla.org → push MindTouch 2010 to developer.mozilla.org on 2011-08-30
Realistically, it's unlikely to be worse than the current situation, no matter what. :D
Don't ever say that! :)
Crap, now I've jinxed it. Dammit. :)
From the evaluation bug, here's the output from the 10.0.9 -> 10.1 upgrade:
http://etherpad.mozilla.com:9000/miy99arFVY

The 'steps' referred to are from here:
http://projects.mindtouch.com/Mozilla/Documentation/10.1.1_Upgrade_Steps

Note that prod actually has to upgrade from 9.12.3, not 10.0.9. sheppy emailed them to ask whether that can be done in one shot (9.12->10.1) or whether we have to do it in two stages (9.12->10.0->10.1).
Jake, you and Sheppy have more experience with this upgrade, but considering how much hassle it's been so far, I would hesitate to do anything other than what we did on stage9.
Jake, Sheppy:

Are you both still comfortable doing this tomorrow?
Yes.

Jake has synced up with Brian at MindTouch, and they plan to begin at 11 AM PDT; Brian will be in IRC just in case.
Severity: enhancement → normal
Whiteboard: [waiting on 661370]
After quite a bit of hassle, this is now completed.

http://etherpad.mozilla.com:9000/36zk8fSzT1


The basic procedure was as follows:

1) backup the dekiwiki document root on mradm02

2) backup the database(s) being upgraded

Note: we had a custom robots.txt in place, and it broke svn switch. Move it out of the way *before* switching (moving it afterward might work too, but I don't know how to get a clean switch that way).
3) switch to the 10.1.1 SVN branch: 
svn switch https://svn.mindtouch.com/source/public/dekiwiki/10.1.1/web

4) Deploy to frontends, do *not* restart deki

5) run database upgrade script:  cd dekiwiki/maintenance; php update-db.php

6) upgrade Mono by loosely following these instructions, but via puppet instead of directly: http://developer.mindtouch.com/en/docs/mindtouch_setup/010Installation/060Installing_on_CentOS/Installing%2f%2fUpgrade_Mono_on_CentOS

7) issue-multi-command dekiwiki service dekiwiki restart
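
Condensed into a hedged shell sketch for future reference (the paths, database name, and backup locations below are assumptions; the real deploy ran from mradm02 with the usual sync scripts and puppet):

  # on mradm02; paths, DB name, and backup locations are placeholders
  cp -a /data/dekiwiki /data/dekiwiki.bak.$(date +%F)                       # 1) back up the document root
  mysqldump --single-transaction wikidb | gzip > /backups/wikidb.sql.gz     # 2) back up the database(s)
  mv /data/dekiwiki/robots.txt /data/robots.txt.pre-upgrade                 # custom robots.txt breaks svn switch
  cd /data/dekiwiki
  svn switch https://svn.mindtouch.com/source/public/dekiwiki/10.1.1/web    # 3) switch to the 10.1.1 branch
  # 4) deploy to the frontends with the normal sync, but do NOT restart deki yet
  php maintenance/update-db.php                                             # 5) schema upgrade
  # 6) Mono upgrade rolled out via puppet (see the MindTouch doc linked in step 6)
  issue-multi-command dekiwiki service dekiwiki restart                     # 7) restart deki on all web heads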



Major hurdles encountered (that I can remember... we went through a lot):

1) On step 4, the part about not restarting dekiwiki was not initially understood. This resulted in 500 ISE errors for the whole devmo site until we could roll back.

2) Lucene indexes apparently needed to be rebuilt. These are stored on the NetApp, and it was a simple matter of removing the existing indexes, once this was diagnosed. I believe this is still rebuilding now, in the background (automatically).

3) There is a separate license key in MT 10 that is needed to allow anonymous (non-logged-in) access. This is now pushed out from mradm02 along with the rest of the content. Without this all wiki pages redirect to a login page.

4) Mono needed to refresh its SSL keystore cache of CAs on each web head (see the mozroots sketch after this list).

5) The UI cache had to be flushed to fix various resource string errors reported by sheppy and others. This is done inside the dekiwiki admin interface. Sheppy and Brian from MT did this.

6) A few extensions "lost" their manifest setting, and sheppy had to port these over from the staging site. I'm not entirely sure what this means, but these settings are apparently pointers to files in the dekiwiki/ dir somewhere. Why these were lost is a mystery.
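
For hurdle 4, Mono's stock tool for this is mozroots, which pulls the Mozilla root CA bundle into Mono's certificate store; the exact invocation below is an assumption, not a record of what was actually run:

  # on each web head: refresh Mono's machine-wide CA store from the Mozilla root bundle
  mozroots --import --machine --sync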


There are still some small issues here and there that the MDN team is opening separate bugs for. However, the main push is done, so I'm closing this one out... finally. Yay!
Severity: normal → enhancement
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
verified fixed http://developer.mozilla.org
Status: RESOLVED → VERIFIED
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard