Performance of MXR has been abysmal recently



Infrastructure & Operations
WebOps: Other
7 years ago
5 years ago


(Reporter: Dolske, Assigned: jakem)





7 years ago
The short-term problem here is that over the last few days (or more?) MXR has become rather slow. Hopefully it just needs a swift kick to right whatever is wrong and get it happy again.

The long-term problem is that MXR is a pretty important tool for developers (especially the mozilla-central index), and when it's unhappy work slows down and people get grumpy. What kinds of things can we do with IT to keep MXR happy?

 - What's the current IT support level? Should it be higher?
 - Can we get faster hardware for it? It's often been a bit pokey
   even when healthy.
 - Can we get Nagios monitoring to make noise when MXR performance drops?
   (EG if loading page X takes > Y ms, or search for A takes > B ms)
 - Maybe hire a contractor to take a stab at making it more efficient?
Wonder whether this is related to the issues in bug 659724? AFAIK tinderbox is on the same box (though I thought perhaps mail handling had been separated).

Comment 2

7 years ago
As recently determined in another MXR-related bug, nobody in IT really knows anything about MXR, with the likely exception of justdave. We don't know what kind of performance is normal, or how it works, or who the maintainers of it are. When it breaks, we don't know what to do. When someone needs something done to it, we don't know how to do it.

For all intents and purposes, this is a bottom-tier (least-supported) web app for us. I don't think it has any redundancy, and I have no idea about backups. If the server falls over, I'm highly skeptical we could get it back online and usable in a reasonable time frame... doubly-so if it falls over when justdave is on vacation. :)

I think at a minimum, we need someone in webdev to claim ownership of this app and work with us on improving (and documenting) it.

To answer your questions:

1) Current IT support level is minimal. Judging by recent bugs, it should be higher.

2) Faster hardware is always a possibility. *What* hardware needs to be upgraded, or whether it would help, is up for debate. It's not a dedicated mxr server... some tinderbox stuff runs on it, at least. The box also uses external storage. It's hard to judge what changes would benefit MXR, and what would only benefit tinderbox.

3) Nagios monitoring should be a possibility. We would need to know what, specifically, to monitor (what URL), and what a reasonable response time is, and what to do if the monitor triggers.

4) Don't know, sounds like something the web dev team would be better equipped to answer. The last external contractor project I know of did not go well at all from a performance standpoint. :)
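
For item 3, a check along these lines might be a starting point. It is only a sketch: the URL and both thresholds are placeholders that someone who knows MXR would have to supply, and the only Nagios-specific part is the plugin exit status convention (0 OK, 1 WARNING, 2 CRITICAL).

```shell
#!/bin/sh
# Hypothetical Nagios-style latency check; URL and thresholds are placeholders.

# Map a response time in milliseconds to a Nagios plugin exit status:
# 0 = OK, 1 = WARNING, 2 = CRITICAL.
classify_latency() {
    ms=$1 warn_ms=$2 crit_ms=$3
    if [ "$ms" -ge "$crit_ms" ]; then
        echo "CRITICAL - response took ${ms} ms"
        return 2
    elif [ "$ms" -ge "$warn_ms" ]; then
        echo "WARNING - response took ${ms} ms"
        return 1
    fi
    echo "OK - response took ${ms} ms"
    return 0
}

# Time one request with curl and classify it. A failed fetch is CRITICAL.
check_url() {
    url=$1 warn_ms=$2 crit_ms=$3
    secs=$(curl -s -o /dev/null -w '%{time_total}' --max-time 120 "$url") || {
        echo "CRITICAL - could not fetch $url"
        return 2
    }
    # curl reports seconds with a fractional part; convert to whole ms.
    ms=$(awk -v s="$secs" 'BEGIN { printf "%d", s * 1000 }')
    classify_latency "$ms" "$warn_ms" "$crit_ms"
}
```

Something like `check_url "$MXR_URL" 2000 10000` would then warn at 2 seconds and go critical at 10, with `MXR_URL` standing in for whichever page or search is agreed to be representative.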

Sorry to be such a downer about this. I know it's pretty important for some folks, and I'd like to help make it better. :(
No apology necessary, Jake. We can't fix problems until we know they ARE problems, and making a deliberate choice about the level of support we think MXR should have is far superior to just quietly fighting fires and hoping nothing really bad happens.

I agree that mxr is a system we depend on a great deal in day to day development, but I don't know what level of IT support it merits because I'm not sure I know what all my options are, there.


7 years ago
Assignee: server-ops → justdave

Comment 4

7 years ago
I'll followup offline / in a separate bug for dealing with the long-term support issue. So let's focus this bug on just fixing mxr for now.

And, on that note, mxr's index seems to be about a week stale. :( Possibly related to why it's being slow?
I don't know very much about it either, I mostly just poked at whatever timeless told me to poke at, and he hasn't been around much lately.  The number one thing that would really help would be having an active developer owning it again.
So projects-central got added recently.  It's F***ING HUGE.  The index generation on it had been running for 45 hours when I just killed it.  That's been piling up and what's been soaking the CPU.  Everything else has also been waiting in line behind it to get indexed.  I just disabled the index updating on projects-central, and killed off all the existing jobs.  The performance should improve very shortly if it hasn't already, and the out-of-date stuff (except for projects-central) should be updated again overnight tonight.
Last Resolved: 7 years ago
Resolution: --- → FIXED

Comment 7

7 years ago
still terrible.


bah:~ dougt$ curl > a
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1891k    0 1891k    0     0  31498      0 --:--:--  0:01:01 --:--:-- 11878


bah:~ dougt$ curl > a
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12.8M  100 12.8M    0     0  2588k      0  0:00:05  0:00:05 --:--:-- 3257k
Resolution: FIXED → ---
your test is irrelevant.

1 minute to render a 1.8 MB file with a billion cross reference links and a ton of ajax hooks in it sounds pretty normal to me.
Last Resolved: 7 years ago
Resolution: --- → FIXED

Comment 9

7 years ago
This lacks an actual owner who understands the code and can help work to make it scale.  My team's a bit blocked on figuring out what the bottleneck is.  It doesn't appear to be hardware - faster hardware won't actually fix the root scalability issues.

@morgamic - can someone from your team help?

@dougt - your test shows that a caching load balancer with a backend server running an ftpd written in C can outperform a single web app written mostly in perl that renders the page instead of just shoving it out the network.
Assignee: justdave → mrz
Resolution: FIXED → ---
Two paths to splitting the work, both relatively straightforward:

1) Load-balancing proxy to put different trees on different machines:

Easy proxy config, since it's very much separated per repository and uniformly path-prefixed.

2) Move indexing off to another machine or machines:

genxref uses filesystem moves to put a new database in place, so:

* Move the update-and-genxref cron job to another machine.
* After genxref completes:
  scp fileidx mxr:/wherever/
  ssh mxr mv /wherever/ /wherever/fileidx

Similarly with the glimpse index files.
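
The copy-then-rename step could be sketched roughly as follows. This is an illustration only: the staging suffix, paths and host variable are made up, and the one property it relies on is that a rename within a single filesystem is atomic, so the web app sees either the old database or the new one.

```shell
#!/bin/sh
# Sketch: publish a freshly built index file so readers never see a partial copy.
# Paths, the ".new" suffix and host names are placeholders, not the real MXR layout.

# Copy $1 to a staging name next to $2, then rename it into place.
# The staging file lives in the same directory as the destination so the
# final mv is a rename on one filesystem rather than a second copy.
publish_index() {
    src=$1 dest=$2
    staging="$dest.new.$$"
    cp "$src" "$staging" || { rm -f "$staging"; return 1; }
    mv -f "$staging" "$dest"
}

# Remote variant of the same idea, run from the indexing machine
# ($host and $dir are hypothetical):
#   scp fileidx "$host:$dir/fileidx.new" &&
#       ssh "$host" mv -f "$dir/fileidx.new" "$dir/fileidx"
```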

Parallelizing genxref looks pretty feasible too, because at its core it's already a map and reduce operation with the map run serially, and the data is slammed into a Berkeley db.  There's some other state that may need to be communicated (like the total number of xrefs, if that's used for something other than just diagnostic output), but the core of it should be separable by basically running the &find* calls at the bottom of genxref in different invocations of the script.  Could get fancier by doing the tree-walk once, but I suspect that's not where the time is spent.
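
To make the map-and-reduce shape concrete, here is a toy version. It is not genxref: the "map" step is just a grep for identifier-like tokens, the shard count and file names are invented, and the real script would have its own state to carry. The point is only that each worker writes its own shard and a single reduce step merges them.

```shell
#!/bin/sh
# Toy map/reduce split. The map step is a stand-in for genxref's &find* work.

# index_tree_parallel TREE OUTFILE [SHARDS]
# Runs SHARDS workers in the background, each handling every SHARDS-th file
# and writing "identifier<TAB>file" lines to a private shard, then merges.
index_tree_parallel() {
    tree=$1 out=$2 shards=${3:-4}
    tmp=$(mktemp -d) || return 1
    find "$tree" -type f | sort > "$tmp/files"
    i=0
    while [ "$i" -lt "$shards" ]; do
        awk -v n="$shards" -v i="$i" 'NR % n == i' "$tmp/files" |
            while IFS= read -r f; do
                grep -oE '[A-Za-z_][A-Za-z0-9_]*' "$f" |
                    awk -v f="$f" '{ print $0 "\t" f }'
            done > "$tmp/shard.$i" &
        i=$((i + 1))
    done
    wait                                  # all map workers finished
    sort -u "$tmp"/shard.* > "$out"       # reduce: merge and de-duplicate
    rm -rf "$tmp"
}
```

Writing to separate shard files avoids interleaved output from concurrent workers; whether the tree walk or the per-file work dominates would still need measuring.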

None of this helps with search speed or source-page display speed, if that's the core problem.  Caching might, but without looking at an index-interval's worth of http logs it's hard to tell.

I haven't looked at the /source or /search stuff yet, to see what wins might be lurking.

(Hope this is helpful to whoever is looking at this.)
(That's all just from code-reading, though.  If some devops type can profile it in situ, that would be actual data to replace the semaphore I'm doing right now.)
aiui, glimpse is a big issue here - takes forever to index and we are re-indexing everything over and over..  That was cool a decade ago, but it doesn't scale.  So, when I suggested that we need developer help with this in twitter, the site could use some core design help.  Matthew is spot-on in comment 9 about getting developer help for this.

If it is such a critical tool for developers, then we need to give it the proper TLC it deserves.  Too many of our tools were not built to scale past the point we are reaching now, and hardware alone won't fix the issues.
I don't really care how long it takes to index mozilla-central (and the size of that tree doesn't grow *that* fast), as long as it doesn't interfere with response latency.

It would be great to get incremental indexing, and I am working on something little right now that will make it more cache-friendly (computing and tracking last-modified), but getting good -- and consistent -- performance really wants the indexing separated from the latency-sensitive page serving.

How long does glimpse take (clock time and CPU time) over m-c?  m-c web response is priorities 1 through 10. projects-central indexing just should never be allowed to compete with it for cycles or I/O bandwidth, IMO.

Can we try moving glimpse and genxref to another machine, with the trivial scp/ssh stuff I outline above, and see how much that gets us?  Then we can track CPU ratios per request and see what we need to optimize next.

Also, could I get a day's http logs somewhere?  (And how often the reindexing happens?)  That would let us estimate the upside of caching.
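
As a rough idea of what mining those logs could look like, the sketch below counts how many GET requests in one log are repeats of a URL already seen. It assumes Apache common/combined log format (request path in the seventh whitespace-separated field) and that the log covers a single index interval; the repeat percentage is then an upper bound on what a cache could serve, not a prediction.

```shell
#!/bin/sh
# Estimate the best-case cache hit rate from one access log.
# Assumes common/combined log format, where field 6 is the quoted method
# and field 7 is the request path.
cache_upside() {
    awk '$6 == "\"GET" {
             total++
             if (seen[$7]++) repeats++    # URL already requested earlier
         }
         END {
             if (total == 0) { print "no GET requests"; exit }
             printf "requests=%d unique=%d repeat_pct=%.1f\n",
                    total, total - repeats, 100 * repeats / total
         }' "$1"
}
```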
I don't mean to say, to be clear, that there is not opportunity for win in hacking on mxr a bit, or that we shouldn't do that hacking.  I just think we can get some pretty big wins through the two low-impact (and low risk) config changes I listed above.
The long-term plan for MXR is DXR, for which we have maintainers and which was designed with more modern practices.  I don't know if it does incremental reindexing, but see above as regards having maintainers.

Having a load profile for the app would help make sure that we appropriately validate the performance of DXR, but if you guys already have tools for replaying HTTP logs (with some mechanical transformation to the new pattern) then I won't worry about it.
Sorry Mike, we've been too busy fighting fires to get you the data you've asked for here (and will continue to be so for the foreseeable future).

About MXR performance:  we are working with Dell on a -possible- issue with the SAN that MXR and Tinderbox are using.  I say possible because right now there are no indicators to show what is to blame, all we know is that we can see the degradation and it coincides with the bugs we have gotten.

More on this when we know more, but we are working on this as I speak.
Can I get my key installed on the machine in question? I can gather the data myself; totally understand that you're slammed.
With all of the moving parts right now that are making it hard to troubleshoot this issue, I'd rather not right now.

Once we have the fire at hand put out, then we can talk about the log access..  (and with the angle we are working right now this has little to do with apache hits and everything to do with SAN access)
I'm not talking about changing things, I'm talking about looking at what's going on, how big the databases are, profiling the script, etc.  I don't see how you can get dev help without more detail about the help you need, and what's actually going to the system. :-(

People have been talking about how mxr is just too slow, and the app needs work, for ages -- probably true, but is it the case that anyone analyzed the system to see what the I/O and so forth rates were, before blaming it all on the (terrible, terrible) code?

I'm willing to help, but I'm not really that willing to wait to "talk about log access" or be told I can't look at the system (which has previously been described as unsupported, but now is inside The Wall where only IT is allowed to touch?).  I'm not really interested in jumping through hoops to earn my "can have unprivileged account on barely-supported system" merit badge, so that I can get the data you can't give me for the foreseeable future, and thereby give you the dev help that people keep asking for.

Maybe someone else will be willing to play mother-may-I to help out, but I'm done here. Best of luck with your analysis.
(In reply to Mike Shaver (:shaver) from comment #19)
> Maybe someone else will be willing to play mother-may-I to help out, but I'm
> done here. Best of luck with your analysis.


Sorry if you misunderstood me:

we need dev help with mxr..  we want dev help with mxr..

In the process of time we have discovered an IO problem with the infrastructure that is unrelated to the MXR code.

Your time is valuable, and I don't want to waste it debugging issues in an app that is sitting on instability when the problem is not in the app itself right now.

Once we get the system stable, we will need to figure out how to scale mxr out beyond its current state.  I would love your help then, but unless you are anxious and willing to help debug SAN/ISCSI issues right now, I just feel it would be fruitless to try and debug IO performance when the underlying IO is degraded..
Yes, seriously.

If you're worried about my time being wasted (ignoring that nobody even *responded* to my concrete suggestions for fast wins above), that's sweet, but then the right answer is "sure, you can have access, but we're in the middle of a storage disaster".  If there's even half-assed logging in place I can learn a lot, and get the data I need to have a reasonable setup locally to test with.  It's not "we can talk about log access later".  *Log access* should be such a no-brainer that it's easier than telling me no in the bug.

I was fighting this same fight in 2007.  When it's not so pointlessly hard for a trusted and experienced member of the community to get the access he needs to write some 10-line shell scripts and help with a problem that IT can't address themselves, my contact info is in phonebook.

(If it were up to me, I'd just point MXR at a local disk, honestly. It doesn't need HA, and its disk use grows slowly unless we add another tree. Having it contend with Tinderbox (!) for I/O capacity seems like almost the worst storage configuration possible for each of them.  Even if I took over MXR, though, I doubt I'd actually get to make decisions like that, I wasn't able to when it came to things that mattered to Engineering when I ran that.)
We are shuffling around some volumes on the SAN per suggestion from Dell.  These operations will take time, so we will update again tomorrow on the outcome.

Comment 23

7 years ago
Just to throw out a random idea (inspired by dougt):

Since the most demanding users of MXR are developers, and developers tend to have beefy hardware anyway, it might help as a (short-term?) solution to figure out how to make it easy for folks to run local MXRs. If interested people could spend 15 minutes to get a superfast MXR on their local box (or a shared box in an office), that would be compelling _and_ help reduce load on the official-MXR. [Aside: it would be interesting to know if MXR has a small number of heavy users causing a majority of the load, or just huge numbers of random users.]

Might also be a path for skipping ahead to solving "how can IT support DXR", and just continue mostly-ignoring MXR until DXR saves us.
(In reply to Justin Dolske [:Dolske] from comment #23)
> Since the most demanding users of MXR are developers, and developers tend to
> have beefy hardware anyway, it might help as a (short-term?) solution to
> figure out how to make it easy for folks to run local MXRs. If interested
> people could spend 15 minutes to get a superfast MXR on their local box (or
> a shared box in an office), that would be compelling _and_ help reduce load
> on the official-MXR. [Aside: it would be interesting to know if MXR has a
> small number of heavy users causing a majority of the load, or just huge
> numbers of random users.]

This is one way to go about it, except then we have to consider how to distribute all of the data for it (currently 176G on disk)

As mentioned in Comment 20 and Comment 22 the problem that has plagued us since around 7/27 has been identified in the SAN and we are working on a fix. We have a tray of SATA disks that is at max IOPs capacity.  The fix for this is taking a long time since the controller is competing for time on those disks when there is none to go around.

I still see this as our short term fix even though "short term" is taking days.

Long-term I would like to look at the entire site, how do we re-engineer it to scale (not to mention it is a single box and a single point of failure).
Is it 176G for everything, or just for m-c, or just the indices?  That's one of the things I wanted to look at, to see what the working set was likely to be.

Comment 26

7 years ago
(In reply to Corey Shields [:cshields] from comment #24)

> This is one way to go about it, except then we have to consider how to
> distribute all of the data for it (currently 176G on disk)

No, I'm talking about having a local install of the MXR tool (a few perl scripts + glimpse?), pointing it at a local clone of mozilla-central, and building an index locally. No bulk data distribution required.
hg clone ~/Sites/mxr, sudo bash, port install glimpse, edit lxr.conf, do random bits from mxr's INSTALL, Bob's your uncle. Only took me partial attention for three hours, though I don't quite have every part working right yet.
philor, it would be great if you could write down the steps you took in a wiki page for anyone else who might want to do the same.
Can someone comment on the performance of MXR today (aside from the updating issues, which we will address next - and I believe some of those update procs were not able to finish due to the IO performance issue).  tl;dr - it should be better today.

The problem with performance here is the same problem that we have had with tinderbox lately (see bug 676822).  Long story short its data was residing on a tray of SATA disks that was at its max IOPs capacity.  (same tray as tinderbox too)

It took a lot of work with Dell to figure this out, and many days to get data shuffled around but now the MXR data is shared across 3 trays of SAS disks.  The iscsi connection is still coming in to the SATA tray but hopefully soon, it will renegotiate and move that connection to one of the SAS trays.

From the system graphs it seems to be handling a bit more load now (and IO is faster) so I would like verification that "Performance of MXR" is not so abysmal now  :)

(I would still like to work on the core design of MXR - we are outgrowing tools like this quickly and need to address scaling this out..  but let's get the fires put out first)
It definitely feels zippier to me today than it has for the last few days; thanks!
I have a local version of mxr that uses the browser cache as appropriate, but I'm not sure if it'll be a win.  (It won't hurt, but I'm not sure how much energy to put into making it robust enough to deploy.)

Is there somewhere I could get a day of mxr server logs so I can mine them for the facts I seek?  (Also: how often do we reindex m-c?)

Next up will be incremental reindexing, though putting the indexing process on another machine would likely win us hugely enough that it wouldn't need to be incremental.  (On the other hand, making it incremental might make it fast enough that it can live on one machine.)
(In reply to Corey Shields [:cshields] from comment #12)
> aiui, glimpse is a big issue here - takes forever to index and we are
> re-indexing everything over and over..

I have some simple changes to propose here.

Indexing m-c on my laptop (MBA, on battery):

Stock MXR config: 
       51.42 real        37.64 user        10.95 sys

Modified config, full index:
       27.91 real        24.73 user         2.75 sys

Modified config, incremental reindex after hg update:
        2.57 real         1.44 user         0.47 sys

One liner patch!
Sadly, I fear that update-xref is a bigger problem than update-search, by a huge margin.  It takes more than 10x as long on my laptop, then I killed it to save my battery.

Could be an artifact of my setup, though.  Corey, can you confirm that it's glimpse that's hurting us here?
I've put shaver's key on to the box..

I am not sure where the src of MXR resides and how it is updated, but I can confirm that there are no puppet classes assigned to the box that would clobber any of the MXR scripts.  So, in other words, local fixes may be safe  (but I would appreciate a pointer to a repo we update off of, and I will document it appropriately)
Looking at update-xref performance now, because it's a disaster, but here's the one-liner patch referenced above:

-my $cmd = "($TIME glimpseindex -n -b -B -M 64 -H . $src_dir $STDERRTOSTDOUT) >> $log";
+my $cmd = "($TIME glimpseindex -n -f -B -M 64 -H . $src_dir $STDERRTOSTDOUT) >> $log";

 * turns off byte-offset tracking (-b), since we don't use the byte-offset
   data anyway, and
 * turns on incremental reindexing (-f).

Had the effects I list in comment 32.  I'll push this and some other bits to my user repo tonight, likely.
That change isn't sufficient for the full win (though it gets the 2x win), because update-search goes out of its way to blow away the old indexes.  I had a hack in place to make it not do that, which I will now have to make PRODUCTION GRADE.
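
One possible shape for such a hack, as a sketch and not the actual update-search change: only wipe the old indexes when a full rebuild is explicitly requested or nothing exists yet. The `.glimpse_index` file name and the `force_full` argument are assumptions; the glimpseindex flags are the ones from the patched command above.

```shell
#!/bin/sh
# Sketch: make "blow away the old indexes" opt-in so -f can reuse them.
# The .glimpse_index file name is an assumption about glimpse's on-disk layout.

# Print "full" or "incremental" for the given index directory.
index_mode() {
    index_dir=$1 force_full=${2:-no}
    if [ "$force_full" = yes ] || [ ! -e "$index_dir/.glimpse_index" ]; then
        echo full
    else
        echo incremental
    fi
}

update_search() {
    src_dir=$1 index_dir=$2 force_full=${3:-no}
    if [ "$(index_mode "$index_dir" "$force_full")" = full ]; then
        rm -f "$index_dir"/.glimpse_*     # the old behaviour, now opt-in
    fi
    # Same flags as the patched command; -f makes the run incremental
    # when previous index files are present.
    glimpseindex -n -f -B -M 64 -H "$index_dir" "$src_dir"
}
```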

Comment 37

7 years ago
cshields: the data is in /var/www/webtools/mxr. It is managed by Hg, although I don't see any cron's responsible for updating it. However, fwiw, it's current:
03c6f34cb54f: Bug 654576 update mxr root sourcing for mozilla-aurora and friends per bug 652229 default tip - Wed, 04 May 2011 00:08:05 +0200 - rev 406

[root@dm-webtools04 mxr]# hg log | head
changeset:   406:03c6f34cb54f
tag:         tip
date:        Wed May 04 00:08:05 2011 +0200
summary:     Bug 654576 update mxr root sourcing for mozilla-aurora and friends per bug 652229

There *are* 5 files which have been modified locally and apparently not checked in to hg:

[root@dm-webtools04 mxr]# hg status
M genxref
M lxr.conf

All changes look relatively simple (10 lines or less), but some look rather critical and things will probably break if the changes get lost.


7 years ago
Assignee: mrz → shaver
I see what you did there.
I've fallen out of touch with where we are here and where we need to go (not to mention where I would -like- to go with MXR).  Shaver, if you have time this week can we touch base at all-hands?

Comment 40

7 years ago
I have implemented a local change of comment 35. It looks to not be updated in the repo yet:

However, I've looked into this a bit more, and comment 33 is correct... genxref seems to take much more CPU time than glimpseindex.

numcalls %calls  real time  %real    CPU time  %cpu     process
291888  100.00%  9161.20re  100.00%  308.10cp  100.00%    (total)
35      0.01%    312.64re   3.41%    231.94cp  75.28%   genxref
2568    0.88%    28.88re    0.32%    20.09cp   6.52%    hg
15158   5.19%    73.48re    0.80%    16.20cp   5.26%    source
105     0.04%    119.23re   1.30%    15.59cp   5.06%    glimpseindex
507     0.17%    9.62re     0.10%    5.96cp    1.93%    rsync

CPU time is the sum of user and system time, and does not include I/O wait time. In this case the usage is almost entirely user time, not system time.

As you can see, genxref consumes roughly 15x as much CPU time as glimpseindex.
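
For anyone wanting to redo that arithmetic on a fresh accounting dump, a small helper like this would do it. It assumes the column layout shown above, with CPU time third from the end and the process name last.

```shell
#!/bin/sh
# cpu_ratio A B: read the accounting table on stdin and print CPU(A) / CPU(B).
# Assumes the layout above: "...  <cpu>cp  <pct>%  <process>".
cpu_ratio() {
    awk -v a="$1" -v b="$2" '
        { cpu[$NF] = $(NF - 2) + 0 }      # "231.94cp" becomes 231.94
        END { if (cpu[b] > 0) printf "%.1f\n", cpu[a] / cpu[b] }'
}
```

Fed the genxref and glimpseindex rows above, `cpu_ratio genxref glimpseindex` prints 14.9 (231.94 / 15.59).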

I don't have highly reliable disk stats per-process just yet, but what I do have seems to indicate that genxref is much harder on the disk than glimpseindex as well.

I am looking in to getting a dedicated (more powerful) machine for MXR. Of particular interest is getting more CPU power and some faster storage.
Assignee: shaver → nmaul
Component: Server Operations → Server Operations: Web Operations
QA Contact: mrz → cshields

Comment 41

7 years ago
The time spent by genxref is largely spread across a number of regex's. I've done some work with NYTProf against it on a couple different trees, and have isolated some of the worst offenders. Unfortunately some of them already seem rather fast, but just get called very very often.

If anyone would be interested in taking a shot at optimizing the regex's, here are some of the worst offenders that I've been able to spot. These are from a run against 'mozilla-central', which took 26 minutes.

271860 times 326 sec by main::findident at line 880, avg 1.20ms/call
325439 times 168 sec by main::dumpdb at line 1191 of genxref, avg 517µs/call
595329 times 152 sec by main::findident at line 891, avg 255µs/call
968850 times 93.0 sec by main::findident at line 880, avg 96µs/call
1753431 times 61.4 sec by main::findident at line 910, avg 35µs/call
11088 times 56.6 sec by main::findusage at line 1151, avg 5.10ms/call
1120432 times 51.5 sec by main::findident at line 914, avg 46µs/call
3431456 times 18.6 sec by main::findident at line 914, avg 5µs/call
12291 times 15.2 sec by main::c_clean at line 613, avg 1.24ms/call
12291 times 13.0 sec by main::c_clean at line 610, avg 1.06ms/call

This seems to be somewhat consistent across trees. That is, the 'firefox' tree generates a similar list, with the same line numbers represented as some of the worst offenders (880, 891, 910, 914, 1151).
Depends on: 681197

Comment 42

7 years ago
No luck optimizing the regex's  themselves, but have stumbled onto several other possible optimizations. Unfortunately, most of them are a bit out of my reach.

The actual indexing job improvements are being handled in Bug 681197. Some of that may have a direct or indirect impact on MXR performance:

Direct: increasing glimpseindex index size from "tiny" to "small", by adding the -o flag. This should hopefully speed up some types of searches.

Indirect: indexing runs being parallelized or skipped altogether, freeing up CPU cycles and IOPS for searches.

An example of an improvement I can't easily make myself would be to introduce parallelism *within* the "" or "genxref" scripts, and to make it skip files that haven't changed. This would drastically reduce the amount of work that happens on any given indexing run, but I'm hesitant to undertake something like this myself. If it breaks I won't necessarily notice, because I don't use MXR.
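
The "skip files that haven't changed" half could start from something as simple as a timestamp file. This is a sketch with invented names (`changed_files`, the stamp path), it says nothing about how genxref would consume the list, and it does not notice deleted files.

```shell
#!/bin/sh
# changed_files TREE STAMP: list files modified since the stamp file was
# last touched. With no stamp yet, every file counts as changed.
changed_files() {
    tree=$1 stamp=$2
    if [ -e "$stamp" ]; then
        find "$tree" -type f -newer "$stamp"
    else
        find "$tree" -type f
    fi
}

# Intended use (hypothetical reindex step):
#   touch "$stamp.next"                       # taken before scanning
#   changed_files "$tree" "$stamp" | reindex_these_files &&
#       mv "$stamp.next" "$stamp"             # advance only on success
```

Taking the next stamp before the scan means a file edited mid-run is picked up again next time instead of being missed.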

Another improvement is to replace dm-webtools04 outright, with one or more replacement systems. It may be possible to generate the indexes on one system, while the web app runs on another (or even to put the web app on 2 systems, behind a load balancer). This will be much more feasible once we get moved into our new SCL3 datacenter as we should have better shared storage performance there, which is key to making this work well.

Comment 43

7 years ago
It appears that LXR (the app MXR is built off of... or maybe was forked from, or forked by... I can't get a good history of this) has evolved quite a bit beyond what we have. Specifically, it seems to have some of these features- genxref appears to be tied in to the revision control system. How thorough it is I can't say, but it makes a whole lot more sense to me to look at upgrading our installation rather than tweaking it.

It's evolved considerably. Upstream genxref is under 300 lines and makes heavy use of perl modules... ours is over 1300, and is mostly self-contained spaghetti. The dates in ours say 2006, but judging by the upstream LXR changelog, I think those may have been updated by Mozilla, and we might actually be running a fork from a much older LXR... 1999-ish even, maybe.

The updating mechanism doesn't resemble ours very much. The current version uses a single call to genxref directly, which I think handles everything (updating, xref, glimpseindex), whereas our version has several layers of perl and shell scripts that wrap that.

I think our next step towards serious MXR performance enhancements should be to build out a real MXR cluster, set it up with the current version of LXR, and see how effectively we can migrate to it.

I'm going to open a new bug for this, and consider this one resolved. It's a big enough change in direction to warrant it, I think.
Last Resolved: 7 years ago
Resolution: --- → INCOMPLETE


7 years ago
Blocks: 704374

Comment 45

7 years ago
Is DXR any better documented or supported than the current MXR?

I ask because LXR is a supported upstream app that we apparently deployed and never updated, except for local tweaks. What is the status of DXR, by comparison?

DXR appears to be a one-man effort, specific to Mozilla. That's fine (maybe good in fact), but what happens when that one man loses interest?

I'm all for switching if there's a better solution. That DXR installation is marked as being outdated, and there appears to have been no effort to improve this, or add other trees. If we're willing to put serious dev muscle behind it and get it up to MXR/LXR standards, I'm perfectly happy using it instead of a current LXR.

It should be noted that it was just brought up to me that DXR is not usable for all of the same use cases that MXR is. I'm not a dev actively using both tools though, so take that FWIW. :)

The ultimate problem I want to solve here (beyond performance concerns) is that we should not hobble along on an app that nobody maintains, hoping and waiting and praying for someone to dive in and handle all the random bug reports. This is where MXR is right now- an unmaintained fork of an unknown version of LXR. I can't even tell which code came from Mozilla and which is upstream. If we got up to current LXR and either stick with stock or actively maintain a fork, we'll be better off.
I might suggest you take this to a dev.planning thread, Jake. Mozilla devs will have nuanced opinions on mxr/dxr/lxr and those might help you clarify the best way forward. It might also be a tarpit of epic proportions, but keep your eye on the prize!


7 years ago
Blocks: 705760
Component: Server Operations: Web Operations → WebOps: Other
Product: → Infrastructure & Operations