NOTE: seeing this on production as well as staging
Steps to Reproduce:
1. Log in to SUMO and load https://support-stage.mozilla.org/en-US/kb/Creating+articles
2. Type something in the "Quick edit an article" textfield and click "Edit"
3. When the page loads (something like https://support-stage.mozilla.org/tiki-editpage.php?page=Hotdogs), type some stuff into the "Edit:" field and the "Edit Summary:" field
4. Click "Save"
The article is created
Page Load Error: connect reset
Created attachment 364404 [details]
I upgraded all four back end servers to RHEL 5.3 (because it hadn't been done yet, and just to eliminate unapplied security updates from being an issue) and rebooted them. Didn't seem to help any. There's no unusual load on the databases, nor on the back end machines. The Connection Reset error seems to imply that it could be a netscaler issue, since the browser is actually talking to the netscaler.
Note that this is reproducible on both stage and production, and started happening in production BEFORE last night's code push, so the code push is not responsible for this.
There's been bugs on other sites having issues with the netscaler doing this to them that were fixed by tweaks to the netscaler config, (seach fixed mozilla.org bugs for "conn reset") but I don't know what was done or if it's still relevant (it's been a while).
Both stage and production do go through the netscaler, also.
Are you certain this problem is also occurring in production? If so, could you provide a sample case or, at least, a timestamp for a recent incident?
I believe I've identified the source of the problem for the staging site, but it's not relevant to production.
mrapp-stage02 appears to have an inappropriate ethernet bonding configuration. It's missing necessary options in /etc/modprobe.conf, which is causing it to revert to the default round-robin balancing of traffic between eth0 and eth1. This is confusing the forwarding tables on core1 and core2, creating packet loss for this host. The packet loss is significant enough that it is causing the NetScaler to fail health checks on a regular basis.
I believe fixing the configuration for bond0 should resolve our issues for support-stage. I'm waiting to hear if we can reproduce on production. The production VIP uses different backend servers, all of which appear to be properly configured.
Derek, it now seems fine to me on production (knock on wood); testing on staging now, too, it seems fine there.
Have you guys been changing stuff in the interim?
(In reply to comment #6)
> Derek, it now seems fine to me on production (knock on wood); testing on
> staging now, too, it seems fine there.
> Have you guys been changing stuff in the interim?
Seeing it again; I'm going to hold off on claiming it's OK until I haven't seen it for quite a while.
Seeing it on production or staging? We won't be making the staging fix until tomorrow's maintenance window, so I would expect that the problem is still present there.
Mostly seeing it on staging; it's highly sporadic, though, and to test it I have to create new articles which spams the heck out of our SUMO editors, so I'm not doing it as often as I'd like.
I'll wait until tomorrow's fix to test further on staging.
[root@mrapp-stage02 ~]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Slave Interface: eth0
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:13:21:1b:9b:ee
Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:13:21:1b:9b:ed
Was the fix applied tonight? Seem to be having better luck tonight WRT timeouts; I wrote a small/simple Selenium script that:
* logs in
* click "Edit" on an article
* logs out
* load a URL to edit an article while logged out, thus forcing login with an edit-page parameter.
Didn't see any issues. We _were_ seeing this on production, though, a few days back, and it's worrisome to me that it could've "fixed itself"; did we have any related fixes/changes on production recently?
Yep, push was put in tonight. I'll leave you in dmoore's hands now :)
Eric Cooper and I both just hit this tonight while creating new en-US pages on staging.
Webserver logs confirm at least three timeouts in the past 24 hours. While that's a significant improvement, we're still working on isolating these last few errors.
Is this at all tied to our planned push for the SUMO 1.0 release? If so, this bug needs to be marked as a blocker for that release.
Throwing back to server-ops so that Someone Smarter than Me (tm) can take a look.
Here's a screencast: http://screencast.com/t/PY45mPhAw
Sorry that it came with a soundtrack; I'm a screencasting n00b.
Given that folks seem to have been troubleshooting this bug for over a few weeks now, is this a page me in the middle of Friday night blocker?
Can this wait till Monday for further troubleshooting?
(In reply to comment #17)
> Given that folks seem to have been troubleshooting this bug for over a few
> weeks now, is this a page me in the middle of Friday night blocker?
> Can this wait till Monday for further troubleshooting?
We're going to configure a duplicate VIP for support-stage.mozilla.org on our new load-balancing appliance. This will allow us to debug the issue using a different hardware path and, hopefully, improve performance. I'll update with a new URL for testing later in the day.
(In reply to comment #19)
> We're going to configure a duplicate VIP for support-stage.mozilla.org on our
> new load-balancing appliance. This will allow us to debug the issue using a
> different hardware path and, hopefully, improve performance. I'll update with a
> new URL for testing later in the day.
On the layer 4 Cisco ACE or Zeus? AFAIK, we have some Netscaler-specific code in SUMO that at some point we should abstract away.
My intention was to use the Cisco ACE, as the configuration isn't necessarily complex enough to warrant spinning up VIP on the Zeus. Could you elaborate a bit more on the Netscaler-specific interations?
(In reply to comment #21)
> My intention was to use the Cisco ACE, as the configuration isn't necessarily
> complex enough to warrant spinning up VIP on the Zeus. Could you elaborate a
> bit more on the Netscaler-specific interations?
Sure; bug 437787 at least; there might be more. I'll bring this up in our SUMO meeting, if we're to (later) switch over to Zeus on SUMO.
btw, i was only talking about the databases for SUMO, not SUMO itself.
to be clear, i'm not at all advocating moving away from the netscaler for web stuff on sumo.
I think the example Stephen refers to should be fine on another caching solutions: it's just sending cache control headers. I don't think there is anything particularly weird about how the NS handles these, but obviously it would be good to test if we did switch. Even though I realize that is not what is being advocated here, it's important to note that the code shouldn't be a barrier to switching.
Does submitting a knowledge base article touch or depend on the tiki_comments table? I noticed that table is often locked for long periods of time due to: select max(`commentDate`) from
`tiki_comments`,`tiki_forums` where `object` = `forumId`and `object
webroot/tiki-editpage.php doesn't AFIK.
This appears to be with tikiwiki exceeding the php memory limits. I ran through the example and got "Connection Interrupted" and the logs show "PHP Fatal error: Allowed memory size of 33554432 bytes exhausted (tried to allocate 40961 bytes)".
I've increased the memory limit from 32 to 128 on stage.
Can anyone reproduce this after Jeremy's swift help?
On production? Yes, I just ran into it, while creating the start page a/b test pages.
(In reply to comment #29)
> Can anyone reproduce this after Jeremy's swift help?
Yes, which is one of the reasons that Eric filed bug 484100; I'm not clear on the relationship, though.
(In reply to comment #30)
> On production? Yes, I just ran into it, while creating the start page a/b test
I haven't applied the fix to production yet. I was going to let it bake a bit on stage.
Sorry to say that I just got the error on stage after creating the article https://support-stage.mozilla.org/sv-SE/kb/Localization+Dashboard?bl=n :(
Not sure if this is related, but I can't edit that page I just created: https://support-stage.mozilla.org/tiki-editpage.php?page=Localization+Dashboard
I get a "Connection interrupted" error every other time, and just a blank page every other page.
14:54 <oremj> [Thu Mar 19 14:14:51 2009] [error] [client 10.2.81.4] PHP Fatal error: Maximum execution time of 30 seconds exceeded in
/data/www/support-stage.mozilla.com/webroot/lib/smarty_tiki/outputfilter.urlt.php on line 311
14:56 <oremj> that's a giant while loop
14:56 <oremj> does page have a lot of a or img tags?
14:57 <oremj> the comment above the function that is timing out says:
14:57 <oremj> * Although care has been taken to write fast regexes, this is likely to
14:57 <oremj> * be slow on large files, very complex files etc.
Memory fix from comment 28 has been applied to production. Let me know if you still see this problem.
Jeremy, we are seeing this problem on stage again (blocking testing the release). Can you check the logs for us?
What time did you see the error?
Just 5 seconds ago on the URL: https://support-stage.mozilla.org/tiki-editpage.php?locale=en-US&page=Localization+Dashboard&login
[Tue Mar 24 11:06:00 2009] [error] [client 10.2.81.4] PHP Fatal error: Maximum execution time of 30 seconds exceeded in /data/www/support-stage.mozilla.com/webroot/lib/smarty_tiki/outputfilter.urlt.php on line 309, referer: https://support-stage.mozilla.org/tiki-editpage.php?locale=en-US&page=Localization+Dashboard
Via Jeremy, you can profile this online by adding an appropriate parameter, e.g.
and then see the results at
I see for example 77000 calls to preg_match (ouch).
Eric, can you take a look at the profile and see what you can come up with?
I'll add potential solutions later today.
This seems to fallen off the radar, and it still happens often. So....bump.
For reference, this issue was partially addressed in bug 484100.
(In reply to comment #44)
> For reference, this issue was partially addressed in bug 484100.
I suppose we'll just have to do more of that :)
I'm getting a head start on this, especially since it's only going to get worse with the addition of the screencasts feature.
For whatever reason, new cachegrind.out files aren't showing up for support-stage on webgrind.authstage.mozilla.com. Did we lose it?
Local testing shows that the RemoveXSS function (and the subsequent RemoveXSSregexp calls) executed during webroot/tiki-setup_base.php is taking up about 27% to 35% of the processing time. These functions check all POST/GET data and run it through numerous pattern checking methods. On massive POST/GET data, like wiki pages, it's no wonder it's chugging along.
Again, I can't confirm this is the issue on stage yet, but I would imagine it is if it's happening locally.
What we could do is take the new engine from tiki 3.0 and merge it back into our 1.9 tiki-setup_base.php. From a cursory glance, the sanitization portion of the file is mostly modular. We can presumably pull out the current method and plop in the new one without much issue. Try it?
Optimizing tiki-setup_base.php will cause a performance boost on all pages.
ccing oremj for comment on webgrind.
RemoveXSS is newish, we backported it from a newer version of tiki sometime at the end of last year. This was after we did our last round of optimization (for Fx3 launch).
We can certainly try putting in the new version.
If it doesn't help much then we should roll our own, have been through this exercise before with badly performing XSS filters.
(In reply to comment #48)
> ccing oremj for comment on webgrind.
> RemoveXSS is newish, we backported it from a newer version of tiki sometime at
> the end of last year. This was after we did our last round of optimization
> (for Fx3 launch).
> We can certainly try putting in the new version.
> If it doesn't help much then we should roll our own, have been through this
> exercise before with badly performing XSS filters.
In an order to avoid potential confusion, bug 497109 has been filed to address
We think that the fix for bug 497109 will fix this. Downing this and targeting out. We'll push it and see.
David told me to warn if and when I would find the problem again. At this moment I'm trying to add a screencast to this article
and I got 4 of this errors in a row (and the screencast has not been uploaded).
Mozilla/5.0 (X11; U; Linux i686; it; rv:126.96.36.199) Gecko/2009060214 Firefox/3.0.11
Also had this today. The change was saved even though the error was shown.
Another timeout a few seconds ago, upon saving changes to the public version of
Jeremy, see comment 53 for when this happened last (2009-06-21 11:52:12 PDT). Is this helpful in determining what the cause is?
Just had the timeout error saving the changes to
fyi: with all the articles I have saved today I had this timeout error.
Jermey, any information about status today?
I don't get a timeout when saving changes to the page mentioned in comment 55. Are there any articles timing out today?
Yup; it just happened on https://support.mozilla.com/tiki-editpage.php?locale=en-US&page=*Cannot+connect+after+upgrading+Firefox.
Here's a screencast: http://screencast.com/t/W07aD3ZtI1S
On my local copy saving a page takes a long time. I say we do some benchmarking and see which parts of the process are slowing down the whole thing.
Status of the benchmarking?
Here are some benchmarks with explanations on my local machine.
Things to note:
1. I don't have timeouts set up (e.g. even if it takes 15 minutes, pages load)
2. I don't have a mail server set up, so I have tested with the email notifications portion commented out. (tikilib.php:4519-4528)
3. The wiki text I used to create this is the same as in our "Clearing Private Data" article (my dump is from July 1st)
4. I added categories, folksonomy tags, and checked "Allow HTML"
Test 1: Regular, staging article:
0.279 s -- setup
0.004 s -- after translation bits (l467)
0.002 s -- before screencasts (l709)
0.154 s -- before check_ticket() (l869)
0.162 s -- logs
0.162 s -- email notifications
0 s -- create staging
0.544 s -- after create stage and so much else (l1063)
0.044 s -- freetags (l1150)
0.003 s -- categories (l1171)
Test 2: Creating a pre-approved article (straight in the KB category) -- this doesn't occur on saving existing pages unless they are being approved.
0.235 s -- setup
0.003 s -- after translation bits (l467)
0.002 s -- before screencasts (l709)
0.148 s -- before check_ticket() (l869)
0.296 s -- logs x 2
0.612 s -- create staging
1.328 s -- after create stage and so much else (l1063)
0.041 s -- freetags (l1150)
0.017 s -- categories (l1171)
0.021 s -- the rest (l1252)
0.277 s -- display
The portion of the code I had commented out is in lines 4512:4528 of tikilib.php. Looking at the code, these creation times can vary a lot, depending on:
* email notifications
* uploaded files
* log size
* page history
It seems that on stage it takes about 7 seconds from submitting the page to having a full page load (on the above test of pre-approved article), but taking away the display time (~4s?) it should be about 2-3 seconds, which still seems longer than on my local machine (with the difference of mail notifications off).
I'll attach an xdebug table and the cachegrind.out file. Laura: perhaps you can take a look at it and see if you can pin point more exactly what our issues are.
Created attachment 390855 [details]
Not sure how much this helps, but here's a screenshot of top lengthy tasks in xdebug.
Ah, yes, the xdebug output is too big. Here's a link to the file:
So it appears that on prod creating a page takes a loooong time and times out (for example, see stephend's screencast at http://screencast.com/t/gP9EoGsFPd ). It's possible that's because of the db being overloaded? Or?
On stage it does NOT time out and the creation time is much lower.
IT, can we get some help? Based on the article title in the screencast (URL request for page=...), you can look up the exact times in the logs and see what causes the timeouts. If we could at least pin point what to focus on optimizing, that would be great.
Laura, if you have other ideas...
Clearly isn't going to be done for 1.2.1.
I don't think IT are going to be able to help you (I don't see how they would get the information you want). I'll take a look at this for 1.3.
An idea I have is to try turning off email notifications and see how much time that saves. We could rewrite the email library to queue notifications in a separate file that will be cronned. We should have a cron anyway, for that and other tasks (such as additional actions for renaming a page -- see bug 504663).
So one of the slowdowns of creating/updating an article is the action log. Lines 4516 and 6969 of tikilib.php (r48953) add to the log. There may be other log actions on create page, but I have benchmarked these two in comment 62.
Taking them out might be a significant help, but this needs to be tested with proper php timeouts to see if timeouts on staging can be reproduced before but not after.
There is a concern for metrics from removing these, though, and Cww mentions:
as long as an article isn't created and deleted in the same 24 hours (And I'm not even sure I care about it in that case), we can just fish it out of tiki_pages or tiki_pagehistory
So we should be fine?
We'll also need to basically rework the way we do metrics for KB articles so we don't rely on actionlog for some but not others. (So translations/updates etc will need to be pulled directly from tiki_pagehistory etc.)
This is quite a bit of change in terms of queries -- basically a rewrite of all queries we're using right now for KB metrics. There will also possibly need to be query optmization etc especially for the metrics dashboard. So if we go this route lots of advance warning would be awesome.
If you want to go this route for forum also, warning is VERY good since the same applies (only moreso since there are lots more forum threads than KB articles).
Cheng, what would it take to completely eliminate the internal tiki log? How hard would that be? Maybe we should file a tracking bug to evaluate this, since it may improve site-wide performance.
Basically, I'd have to rewrite all the metrics queries not to use it and then figure out if we can replace all the functionality we currently get out of the actionlog with the metrics dashboard (likely) and then checking to make sure we don't actually lose anything we may want to track in the future (which may mean expanding the metrics DB a little).
That's probably at least a couple weeks of work assuming all goes well.
There's no need to take this work on yet. Will let you know if there is a need to do this, but I hope we can find another solution.
Looking at , maybe optimizing multilingual support will help.
I profiled on Thursday and the key problem is (as mentioned somewhere above) RemoveXSS, where the script spends ~30% of its clock time. I started tweaking the regexes but I don't think I want to add this to 1.4 at this stage given the problems with search and limitations on QA time. I'll have a patch up this week and it can go into 1.4.1.
On our test instance of Sumo, roland encountered this bug too, it seems.
Looking at the logs, I can see 3 POSTs to a KB article, 2 which worked, one that didn't.
The 2 successfull ones took 222 and 550 msecs and returned a 302 (redirect) response of 20 bytes. So I am guessing the POST just does the work and redirects somewhere else.
The 1 failed request took 665msec and returned, interestingly, a 200 OK response of only 20 bytes. I am guessing it was :
HTTP/1.1 200 OK
That would certainly cause the browser to display a white page, no ?
(In reply to comment #75)
> HTTP/1.1 200 OK
> Location: /kb/something
> That would certainly cause the browser to display a white page, no ?
Yes. Something is causing the 200 header to be sent before the location header. You can see this in PHP when there is whitespace in an included file (which triggers output to the browser, with a plain old 200 in front of it). Usually in that situation PHP will give you a "Headers already sent" warning. This gives me a place to start digging, thanks for the info.
Created attachment 403189 [details]
White screen of death when changing a KB page from Staging to Knowledge Base
Attached are the HTTP Headers from the Firefox 3.5 Live HTTP Headers extension when I reproduced it as follows:
I. create a wiki page as non admin
1. Login as non admin e.g. RolandNonAdmin
2. Create a new article e.g.https://sumo-preview.mozillamessaging.com/en-US/kb/yahoo
3. Click "Create this page", add content, click Save and then the article is created https://sumo-preview.mozillamessaging.com/en-US/kb/yahoo?bl=n
II. login in as admin and approve
1. Login as admin
2. via "Last Changes" go to the article https://sumo-preview.mozillamessaging.com/en-US/kb/yahoo "This is a new staging page that has not been approved before. Edit and manually move it to the category for approved pages to approve it for the first time. " and click "Edit this page", change "Categorize" from "Staging area" to "Knowledge Base", add a reason under "Describe the changes you made" and clicking "Save" and then I got WSOD
I have *still* never reproed this (spent a good bit of time trying this morning).
If anyone - Stephen? Krupa? - could get a screencast of it, maybe I'll be able to - if there's a checkbox you're checking or something I am missing.
Also, if you see errors on stage when this happens - or if you don't - please paste them in the bug. error_reporting is now turned way up on that box so we can diagnose this.
(In reply to comment #79)
> Also, if you see errors on stage when this happens - or if you don't - please
> paste them in the bug. error_reporting is now turned way up on that box so we
> can diagnose this.
Has anyone encountered this in the past week? Seeing any PHP errors?
Since I still can't repro this locally, I'm pushing it out to the .next milestone.
If anyone sees this on stage and sees a PHP error, please post the error here.
I finally ran into a local time out, although it was on a completely different page. (tiki-theme_control_objects.php) When I profiled it the two parts that stuck out were the RemoveXSS section and the URLT output filter.
LPH says the XSS filtering is not so eager/aggressive in the TikiWiki trunk, though the process looks similar to me. I have hope that after bug 522813, we'll get to a place where we can remove the URLT filter completely.
Anyway, these are the places I'll concentrate, but since I still can't reproduce the exact bug locally, and I hope that it will go away with future work, I'm pushing it out yet again.
I'm still thinking the two problem sections are the URL output filter and the RemoveXSS section. But these aren't exactly low-hanging fruit. I'm almost tempted to wait until post-upgrade to see if this gets better. But for now -> pushing to 1.5.1
*** Bug 539609 has been marked as a duplicate of this bug. ***
Still seeing this on stage.support.mozillamessaging.com (aka SuMoMo Stage). Can reproduce every time by following the steps in comment 77. I can give access to SuMoMo Stage to whoever wants to debug further. Pleae email me roland AT mozillamessaging.com for access
Fixed by Kitsune.