developer-stage9 MindTouch 10.0.9 upgrade evaluation

RESOLVED FIXED

Status

RESOLVED FIXED
8 years ago
4 years ago

People

(Reporter: sheppy, Assigned: nmaul)

Details

(Whiteboard: waiting on feedback)

(Reporter)

Description

8 years ago
I'm running scripts that dump large amounts of content into the wiki on developer-stage9.mozilla.org as a test, and the site is quite sluggish (to be fair, it was sluggish even without the scripts).

We need to understand why it's slow, and whether the same problem will arise on production.

In particular, we get a lot of timeouts accessing services, blank pages appearing, and so forth.
(Reporter)

Comment 1

8 years ago
To be clear, I'm adding one page or file at a time to the wiki, one after another. Given that the site should be able to handle dozens of people editing all at once, this should not have a noticeable impact on the site, IMHO.

Updated

8 years ago
Assignee: server-ops → nmaul
(Assignee)

Comment 2

8 years ago
Looking back at the time this bug was opened, I do see a good bit of iowait on sm-devmostage01 and 02. Right *now*, I see that they are both maxed out on CPU usage, and have been for about 2 hours. They're also low on RAM, and throughout the day they are swapping in and out of RAM to disk.


These servers are actually VMware VMs, and thus they share hardware with other VMs. They're far weaker than the production equipment:

The stage cluster is 2 VMs, with 4 total (shared) CPU cores and 2GB total RAM.

The prod cluster is 3 dedicated servers, with 16 total (not shared) CPU cores and 14GB total RAM.


The prod servers have zero swap usage, and each is using 1-2GB for cache and buffer space. I see no significant I/O wait on them.

In terms of per-core performance, a couple of really simple benchmarks show the difference ('bc' is single-threaded, so this is a 1-core benchmark... I didn't bother trying to run more than one at a time):

[root@sm-devmostage02 /]# time echo "scale=2000; a(1)*4" | bc -l > /dev/null
real	0m7.193s
[root@pm-dekiwiki03 ~]# time echo "scale=2000; a(1)*4" | bc -l > /dev/null
real	0m3.688s

[root@sm-devmostage01 usr]# time echo '2^2^20' | bc > /dev/null
real	0m13.689s
[root@pm-dekiwiki02 ~]# time echo '2^2^20' | bc > /dev/null
real	0m7.114s
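
For reference, the ratio between the timings above can be worked out with a short calculation. This is only arithmetic on the four `real` values quoted in this comment; the dictionary keys and the `slowdown` helper are labels invented for this sketch.

```python
# Wall-clock ("real") times in seconds, copied from the bc runs above.
timings = {
    "a(1)*4 at scale=2000": {"stage": 7.193, "prod": 3.688},
    "2^2^20": {"stage": 13.689, "prod": 7.114},
}


def slowdown(stage_seconds, prod_seconds):
    """Return how many times longer the stage run took than the prod run."""
    return stage_seconds / prod_seconds


ratios = {name: slowdown(t["stage"], t["prod"]) for name, t in timings.items()}

for name, ratio in ratios.items():
    print(f"{name}: stage took {ratio:.2f}x as long as prod")
```

Both ratios come out a little under 2, so on this single-threaded test a stage core takes roughly twice as long as a prod core.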


In terms of disk cache efficiency, sm-devmostage01 is nearly 100% miss rate at the moment, while pm-dekiwiki02 is near 100% hit rate.

Obviously the current workload has a big effect on this, but it definitely looks like there is both a memory and CPU concern on the staging servers, which does not exist on the prod servers.


The only concern I would have is that perhaps whatever operation is currently running is problematic not because of weak hardware, but because of some type of bug or problem in MindTouch 10 or Mono2.10. There's no good way I can test this easily.

What I can tell you is that there are hardware limitations in stage that are much reduced in prod. Because of this I'm confident prod will be better than stage, but how much better would really depend on MindTouch.


Are the problems on stage causing significant problems for you? We may be able to bump up their RAM or CPU allocations if it's an issue... I'm not really sure how much capacity we have available to beef them up, but there might be some.
Status: NEW → ASSIGNED
Sheppy, we should check w/ MindTouch if there are known performance issues and optimizations for MindTouch 10.
(Reporter)

Comment 4

8 years ago
Ah! I was under the impression this was real hardware rather than VM. That does make a difference.

The machine is getting hit super hard right now while I do some stuff on it; my test had a couple of errors to fix, and I'm in the process of deleting all the stuff I added earlier so I can re-run the test.

We've already talked to MindTouch about performance and they seem to think we should see good performance. It's hard to be sure until we have it on real hardware though.
(Reporter)

Comment 5

8 years ago
Can anyone look at it and see what it's doing right now? I've had it working on that delete operation since yesterday evening and as far as I can tell it's still not done. Even this many files and pages I would have expected to be done by now...
Please restart stage9 + MindTouch.
(Assignee)

Comment 7

8 years ago
Restarted.

(In reply to comment #5)
> Can anyone look at it and see what it's doing right now? I've had it working
> on that delete operation since yesterday evening and as far as I can tell
> it's still not done. Even this many files and pages I would have expected to
> be done by now...

One of the CPU cores on sm-devmostage01 had been maxed out since around 12:45am. Other than that, I don't see anything (now) that looks suspicious.

If the delete operation runs in serial, this would make sense: at any given time only 1 CPU on 1 server would be doing anything. If you're passing in one big group of files to delete, it's definitely conceivable that it would be handled in a single-threaded manner, by whichever server gets the request. It might be worth splitting up jobs like that, if possible, which would spread the load.
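
A minimal sketch of what splitting up such a job could look like, assuming the tool holds the list of items to delete in memory. The `chunked` helper and the page names are hypothetical, and how each batch is actually submitted to MindTouch is left out because it depends on the tool.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical page names standing in for the real delete list.
to_delete = [f"page-{n}" for n in range(10)]

batches = list(chunked(to_delete, 4))
for number, batch in enumerate(batches, start=1):
    # Each batch would go out as its own request, so the load balancer
    # has a chance to hand different batches to different servers.
    print(f"batch {number}: {len(batch)} items")
```

Whether this actually spreads the load depends on how the load balancer distributes separate requests, which I haven't verified for this case.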
(Assignee)

Updated

8 years ago
Duplicate of this bug: 662209
(Reporter)

Comment 9

8 years ago
The delete is done. Now I'm uploading files one after another. I added in a sleep(2) to make it wait a couple of seconds in between to let the server "rest" a bit. Maybe that will help.

But realistically, it ought to be able to handle this. The wiki should be able to cope with one article or file being uploaded at a time over and over again, shouldn't it?
(Assignee)

Comment 10

8 years ago
This has happened again, and I've restarted dekiwiki9 again.

I've emailed MindTouch for their input as to why the new version might be performing worse than the old one, looking for suggestions.
(Reporter)

Comment 11

8 years ago
Could be because I just started my tool back up, with the 2-sec delay between uploads. We should keep an eye on it while that runs (even if all goes smoothly, it will take at least a day or so).
(Assignee)

Comment 12

8 years ago
(In reply to comment #9)
> The delete is done. Now I'm uploading files one after another. I added in a
> sleep(2) to make it wait a couple of seconds in between to let the server
> "rest" a bit. Maybe that will help.

Are you submitting multiple independent HTTP requests, or are they all bundled into one connection (either several requests over one pipelined connection, or even just one big request)?

Have we done the same type of thing on the old MindTouch 9 system?
(Reporter)

Comment 13

8 years ago
I'm doing individual requests; actually spawning off a curl process using popen() for each file.

We never did this on MindTouch 9; this is a one-off tool I'm testing in preparation to migrate content from the old mozilla.org site onto devmo sometime in the next couple of weeks.
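
For illustration only, the pattern described here (one curl process per file, with a pause between requests) might look like the following Python sketch. It is not the actual import tool: the curl options, the `upload_with_curl` and `upload_all` helpers, and the retry count are all assumptions made for the example.

```python
import subprocess
import time


def upload_with_curl(path, url):
    """Run one curl process for one file; True if curl exited with status 0.

    --fail makes curl exit non-zero on HTTP error responses, so 4xx/5xx
    replies count as failures. The URL is supplied by the caller.
    """
    result = subprocess.run(
        ["curl", "--fail", "--silent", "--upload-file", path, url]
    )
    return result.returncode == 0


def upload_all(paths, upload, delay=2.0, max_attempts=3, sleep=time.sleep):
    """Upload files one at a time, pausing `delay` seconds after each attempt.

    Returns the list of paths that still failed after `max_attempts` tries.
    """
    failed = []
    for path in paths:
        for _ in range(max_attempts):
            ok = upload(path)
            sleep(delay)  # the "rest" between requests
            if ok:
                break
        else:
            # The loop ran out of attempts without a successful upload.
            failed.append(path)
    return failed
```

Passing the upload step in as a function keeps the loop testable without a server; a real run would pass something like `lambda p: upload_with_curl(p, target_url)`, where `target_url` is whatever endpoint the tool uses.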
(Assignee)

Comment 14

8 years ago
Is the timing of the curl jobs consistently high, or spiky (10 fast, 1 very slow)? How long do they take (min/max/avg estimates, if you can)? Many thanks.
(Reporter)

Comment 15

8 years ago
Seems spiky. Sometimes they go by pretty quick, and sometimes they take a long time for no apparent reason, even on small files. Often it takes multiple failed attempts in a row before they work, getting 502 and 503 errors (which seem to happen on transient database connection errors) or 404 or 405 errors (which happen when the web server fails to respond).

The 5xx errors are by far the most common.

Do you want more detailed statistics? I can add some logging to track elapsed time per transaction and to count up the errors if you want.
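
A rough sketch of the kind of summary that logging could produce, assuming each transaction is recorded as an (HTTP status, elapsed seconds) pair. The `summarize` helper is hypothetical and the sample numbers are made up purely to show the shape of the output; they are not measurements from stage9.

```python
from collections import Counter


def summarize(results):
    """Summarize (http_status, elapsed_seconds) pairs from an upload run."""
    if not results:
        raise ValueError("no results to summarize")
    times = [elapsed for _, elapsed in results]
    return {
        "min": min(times),
        "max": max(times),
        "avg": sum(times) / len(times),
        "by_status": Counter(status for status, _ in results),
    }


# Made-up sample data, only to show the shape of the output.
sample = [(200, 0.5), (503, 30.0), (502, 30.0), (200, 1.0), (404, 2.5)]
stats = summarize(sample)

print("min/max/avg:", stats["min"], stats["max"], stats["avg"])
print("by status:", stats["by_status"].most_common())
```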
(Assignee)

Comment 16

8 years ago
If it's easy, that would be handy to have - especially the ratio between the different error codes - but don't go too far out of your way for it.
Jake, what's your username on stage9 so sheppy and I can give you (or maybe a generic webops user?) admin access.
(Assignee)

Comment 18

8 years ago
I don't have a stage9 account currently. I tried to register here: https://developer-stage9.mozilla.org/Special:UserRegistration, but all I get upon submission is "Text does not match the picture."... there is no picture. I'm assuming it's some sort of missing captcha field.
(Assignee)

Comment 19

8 years ago
Summary so far:

It appears that the new MindTouch 10.0.9 + Mono 2.10 deployment is using up more memory than the old MindTouch 9.12.3 + Mono 2.4 deployment. We don't have exact numbers from the old system, but prod's status page shows ~100-400MB per server. The devmostage instances were hitting around 1GB, which put them into swap. This caused massive I/O wait times. Eventually the load balancer would take that server out of the rotation due to nonresponsiveness, at which time it would slowly shift itself from mostly I/O wait time to a 50/50 split between user and system CPU time, with no I/O wait (but still lots of swap used). The system would never really recover, even after several hours.

I have emailed MindTouch about this, and am awaiting a response.

sheppy has a MindTouch VM (provided by MindTouch) which runs 10.0.8 + Mono 2.4.3, and has only 256MB RAM allocated to it... it does not exhibit this problem. This has a myriad of implications, but the Mono version difference is the one that sticks out in my mind. The other major differences are 1) load balancing, 2) shared NFS storage, and 3) external database server. I am disinclined to believe any of these is at fault... I think it more likely that either Mono or some other server-side configuration is responsible, or we'd be having performance problems with lots of other staging sites as well.
(Assignee)

Comment 20

8 years ago
Renaming to reflect what this bug has kinda turned into.
Summary: Check load situation on developer-stage9.mozilla.org → developer-stage9 MindTouch 10.0.9 upgrade evaluation
(Assignee)

Updated

8 years ago
Blocks: 656135
(Reporter)

Comment 21

8 years ago
I talked to Steve Bjorg at MindTouch about this in IRC a bit, and while he's still cogitating a bit, his current working theory is that the indexer is the culprit here -- it's trying to update the index while this chain of rapid-fire page commits is occurring, and is bogging down trying to keep up.

Whether or not this is really what's going on, I don't know. But it would make sense: our index is kept on an NFS share, while the one on my home VM is on the same disk as the wiki. I suspect that causes a noticeable performance difference, which would explain why performance is good on my VM but not on developer-stage9.

We don't yet have a solution, but if this is in fact the problem, we can at least progress toward one.

In the short term, we can probably disable the indexer while the migration tool runs then turn it back on to let the index update. That might be an experiment worth performing on dev-stage9 if you're game, Jake.

Obviously this is a potential DoS attack vector if the API can be used to bog down the server, and that's something that needs to be addressed (again, assuming this hypothesis is correct). I'll continue talking with Bjorg about it but wanted to post an update on the current thinking.
(Assignee)

Comment 22

8 years ago
That sounds good to me, let's try disabling the indexer during a script run tomorrow sometime.

Typically when someone says "index", I think "database". What's it indexing? If it's the pages themselves, that would surely be database access rather than NFS.

If this was an NFS I/O problem, I would expect to see a lot of I/O wait times and NFS traffic to the NetApp. I've got a case open with NetApp to get monitoring data off of that filer, but as far as the I/O wait times, there are none. When the problem occurs, we have 100% user and 0% iowait time.
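
For anyone wanting to reproduce the user vs. iowait split, here is one way to compute it from two samples of the aggregate `cpu` line in /proc/stat on Linux. This is a sketch, not necessarily how the numbers above were collected; the function names and the sample lines are invented for the example, and it assumes a kernel that reports the iowait field.

```python
# Order of the first eight counters on the "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_percentages(before, after):
    """Return each counter's share (percent) of the ticks between two samples."""
    deltas = {name: after[name] - before[name] for name in before if name in after}
    total = sum(deltas.values())
    if total == 0:
        raise ValueError("no ticks elapsed between samples")
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Invented samples: only the user counter advances between them.
first = parse_cpu_line("cpu  1000 0 500 8000 100 0 0 0")
second = parse_cpu_line("cpu  1400 0 500 8000 100 0 0 0")
shares = cpu_percentages(first, second)
print(f"user {shares['user']:.0f}%  iowait {shares['iowait']:.0f}%")
```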

In any case, if we can figure out how to disable it, let's give it a shot and see what happens.
(Reporter)

Comment 23

8 years ago
The "indexer" is generating the Lucene index used by the site's search engine; this is done externally from the database and we store that index on the NFS share.

I suspect the problem may be along the lines of the indexer trying to start indexing over and over again as more and more pages get piled into its queue and it's not handling it correctly; instead of finishing the current run then looking to see if there's anything new, it's doing something like starting a new run atop the existing one, then blocking up because things are already in progress.
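
To make the distinction concrete, here is a toy model of the behavior I would expect: finish the current run, then check whether anything new has queued up. It is in no way MindTouch's indexer code; the `CoalescingIndexer` class and everything in it is made up for illustration.

```python
from collections import deque


class CoalescingIndexer:
    """Toy model: pages queue up, and only one indexing run is active at a time."""

    def __init__(self):
        self.pending = deque()
        self.running = False
        self.runs = []  # each entry is the list of pages handled in one run

    def enqueue(self, page):
        """Record that a page needs (re)indexing."""
        self.pending.append(page)

    def run_until_idle(self):
        """Process queued pages in batches until the queue is empty."""
        if self.running:
            # A run is already in progress; it will pick up the new pages
            # itself, so do not start a second run on top of it.
            return
        self.running = True
        try:
            while self.pending:
                batch = list(self.pending)
                self.pending.clear()
                self.runs.append(batch)  # stand-in for the real indexing work
        finally:
            self.running = False


indexer = CoalescingIndexer()
indexer.enqueue("a")
indexer.enqueue("b")
indexer.run_until_idle()
indexer.enqueue("c")
indexer.run_until_idle()
print(indexer.runs)
```

The failure mode I'm guessing at would correspond to skipping the `running` check, so that every new batch of commits kicks off another run on top of the one already going.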

That said, I have no actual evidence that's what's happening; it's a semi-educated guess based on what I've been told by MindTouch, what we're seeing, and a partial understanding of how the indexing process works.

But it would definitely be interesting to disable the indexer and see what happens.
(Assignee)

Comment 24

8 years ago
The last news I heard was that MindTouch was able to replicate the problem, given a copy of our database and sheppy's import script. Anything since then?
(Reporter)

Comment 25

8 years ago
I'm waiting on another update from MindTouch; I've just pinged them again. Apparently they don't do anything if someone isn't riding them like a trick pony.
(Assignee)

Updated

7 years ago
Whiteboard: waiting on feedback
(Assignee)

Comment 26

7 years ago
As of today (8/05) the test import is still failing. After fixing some things on their end, and altering some LB configs on ours, we aren't getting 500 errors during the run anymore. Instead, we get a complete lack of response from the API after some time, although the site continues to operate.

The procedure they gave us most recently (to upgrade from 10.0.9 to 10.1.1 and put in place some fixes for this import issue) is here: http://projects.mindtouch.com/Mozilla/Documentation/10.1.1_Upgrade_Steps

Note this does address the indexer possibility mentioned previously, by reducing the indexer count from the default (10) to 2 workers.

sheppy has re-emailed MindTouch with some details, and we're still in a holding pattern.
(Assignee)

Comment 27

7 years ago
Here's the output generated from the upgrade mentioned in comment 26:

http://etherpad.mozilla.com:9000/miy99arFVY
(Reporter)

Comment 28

7 years ago
OK, so my import tool is still failing, but it's no longer obliterating performance on the server; the failure now happens when it tries to import a 1-megabyte file, and the connection times out.

I think we can consider this particular problem fixed.

Which leads me to ask: when can we deploy this upgrade onto the production site? This has been going on for far too long now.
We have Django pushes scheduled for the 16th, 23rd, and 30th already. We might be able to do this alongside one of the smaller Django code pushes - I would suggest the 30th.
(Reporter)

Comment 30

7 years ago
While I'd like to get this done sooner, I can't say it's urgent, so the 30th is okay with me.
(Reporter)

Comment 31

7 years ago
OK, the latest test run of my code against stage9 went off without a hitch. 0 HTTP errors, all files transferred successfully.
sheppy, it sounds like we can plan to push MT 10 on 08/30 then?
(Reporter)

Comment 33

7 years ago
Yep. Are we pushing 10.1 then, rather than 10.0.9?
Whichever version you and Jake are comfortable with from stage9.
(Assignee)

Comment 35

7 years ago
Let's touch base with MT about a direct upgrade from 9.12 to 10.1. If they're okay with that, I am too. If not, we'll do an intermediary stop-off at 10.0.9... although I see no good reason to stop there... if we're already upgrading, we might as well go all the way to 10.1.

Sheppy, can you hit them up about it?
(Reporter)

Comment 36

7 years ago
Email is off to MindTouch; I cc'd you, Jake. What time on the 30th are we looking at doing it?
(Assignee)

Comment 37

7 years ago
Any time 9am-5pm... whatever works for you guys. There's a SUMO push at 3pm, but I can probably get out of that and get someone else to do it.
(Assignee)

Comment 38

7 years ago
In fact, I'm going to close this bug out... the evaluation is effectively completed, so we'll continue discussions in the main upgrade bug, bug 656135.
Status: ASSIGNED → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard