Bug 480415 - Timeouts when submitting a Knowledge Base article on SUMO
Status: VERIFIED FIXED
Product: support.mozilla.org
Classification: Other
Component: Knowledge Base Software
Version: unspecified
Hardware: All
OS: All
Importance: P1 major
Target Milestone: 1.6
Assigned To: James Socol [:jsocol, :james]
Mentors:
URL: https://support-stage.mozilla.org/en-...
Duplicates: 539609
Depends on: 484100 497109
Blocks: 520628
Reported: 2009-02-26 14:38 PST by Stephen Donner [:stephend]
Modified: 2011-01-03 12:29 PST
CC: 21 users
See Also:
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
Screenshot (154.35 KB, image/png)
2009-02-26 15:16 PST, Stephen Donner [:stephend]
no flags
xdebug screenshot (126.18 KB, image/png)
2009-07-27 11:31 PDT, Paul Craciunoiu [:paulc]
no flags
White screen of death when changing a KB page from Staging to Knowledge Base (3.92 KB, text/plain)
2009-09-27 22:33 PDT, Roland Tanglao :rolandtanglao
no flags

Description Stephen Donner [:stephend] 2009-02-26 14:38:56 PST
NOTE: seeing this on production as well as staging

Steps to Reproduce:

1. Log in to SUMO and load https://support-stage.mozilla.org/en-US/kb/Creating+articles
2. Type something in the "Quick edit an article" textfield and click "Edit"
3. When the page loads (something like https://support-stage.mozilla.org/tiki-editpage.php?page=Hotdogs), type some stuff into the "Edit:" field and the "Edit Summary:" field
4. Click "Save"

Expected Results:

The article is created

Actual Results:

Page Load Error: connection reset

(See screenshot.)
Comment 1 Stephen Donner [:stephend] 2009-02-26 15:16:22 PST
Created attachment 364404 [details]
Screenshot
Comment 2 Dave Miller [:justdave] (justdave@bugzilla.org) 2009-02-27 00:59:22 PST
I upgraded all four back end servers to RHEL 5.3 (because it hadn't been done yet, and just to eliminate unapplied security updates from being an issue) and rebooted them.  Didn't seem to help any.  There's no unusual load on the databases, nor on the back end machines.  The Connection Reset error seems to imply that it could be a netscaler issue, since the browser is actually talking to the netscaler.

Note that this is reproducible on both stage and production, and started happening in production BEFORE last night's code push, so the code push is not responsible for this.

There have been bugs on other sites having issues with the NetScaler doing this to them that were fixed by tweaks to the NetScaler config (search fixed mozilla.org bugs for "conn reset"), but I don't know what was done or if it's still relevant (it's been a while).
Comment 3 Dave Miller [:justdave] (justdave@bugzilla.org) 2009-02-27 01:00:51 PST
Both stage and production do go through the netscaler, also.
Comment 4 Derek Moore [:dmoore] 2009-03-04 12:14:31 PST
Stephen,

Are you certain this problem is also occurring in production? If so, could you provide a sample case or, at least, a timestamp for a recent incident?

I believe I've identified the source of the problem for the staging site, but it's not relevant to production.
Comment 5 Derek Moore [:dmoore] 2009-03-04 13:02:30 PST
Dave,

mrapp-stage02 appears to have an inappropriate ethernet bonding configuration. It's missing necessary options in /etc/modprobe.conf, which is causing it to revert to the default round-robin balancing of traffic between eth0 and eth1. This is confusing the forwarding tables on core1 and core2, creating packet loss for this host. The packet loss is significant enough that it is causing the NetScaler to fail health checks on a regular basis.

I believe fixing the configuration for bond0 should resolve our issues for support-stage. I'm waiting to hear if we can reproduce on production. The production VIP uses different backend servers, all of which appear to be properly configured.
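
For reference, an active-backup bond of the kind described here is normally declared in /etc/modprobe.conf with a couple of lines like the following (a sketch; the exact interface names and options used on mrapp-stage02 aren't shown in this bug):

alias bond0 bonding
# active-backup keeps one slave live with failover, instead of the default round-robin (mode 0)
options bond0 mode=active-backup miimon=100

Without the "options" line, the bonding driver falls back to round-robin, which is what produces the forwarding-table confusion described above.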
Comment 6 Stephen Donner [:stephend] 2009-03-04 13:51:45 PST
Derek, it now seems fine to me on production (knock on wood); testing on staging now, too, it seems fine there.

Have you guys been changing stuff in the interim?
Comment 7 Stephen Donner [:stephend] 2009-03-04 16:47:34 PST
(In reply to comment #6)
> Derek, it now seems fine to me on production (knock on wood); testing on
> staging now, too, it seems fine there.
> 
> Have you guys been changing stuff in the interim?

Seeing it again; I'm going to hold off on claiming it's OK until I haven't seen it for quite a while.
Comment 8 Derek Moore [:dmoore] 2009-03-04 16:49:25 PST
Seeing it on production or staging? We won't be making the staging fix until tomorrow's maintenance window, so I would expect that the problem is still present there.
Comment 9 Stephen Donner [:stephend] 2009-03-04 17:02:01 PST
Mostly seeing it on staging; it's highly sporadic, though, and to test it I have to create new articles which spams the heck out of our SUMO editors, so I'm not doing it as often as I'd like.

I'll wait until tomorrow's fix to test further on staging.
Comment 10 matthew zeier [:mrz] 2009-03-05 19:08:17 PST
[root@mrapp-stage02 ~]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:13:21:1b:9b:ee

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:13:21:1b:9b:ed
Comment 11 Stephen Donner [:stephend] 2009-03-05 23:07:40 PST
Was the fix applied tonight?  Seem to be having better luck tonight WRT timeouts; I wrote a small/simple Selenium script that:

* logs in
* clicks "Edit" on an article
* logs out
* loads a URL to edit an article while logged out, thus forcing login with an edit-page parameter.

Didn't see any issues.  We _were_ seeing this on production, though, a few days back, and it's worrisome to me that it could've "fixed itself"; did we have any related fixes/changes on production recently?
Comment 12 matthew zeier [:mrz] 2009-03-05 23:15:39 PST
Yep, push was put in tonight.  I'll leave you in dmoore's hands now :)
Comment 13 Stephen Donner [:stephend] 2009-03-09 19:48:46 PDT
Eric Cooper and I both just hit this tonight while creating new en-US pages on staging.
Comment 14 Derek Moore [:dmoore] 2009-03-10 13:39:18 PDT
Webserver logs confirm at least three timeouts in the past 24 hours. While that's a significant improvement, we're still working on isolating these last few errors.
Comment 15 David Tenser [:djst] 2009-03-13 10:15:17 PDT
Is this at all tied to our planned push for the SUMO 1.0 release? If so, this bug needs to be marked as a blocker for that release.
Comment 16 Stephen Donner [:stephend] 2009-03-13 22:50:11 PDT
Throwing back to server-ops so that Someone Smarter than Me (tm) can take a look.

Here's a screencast: http://screencast.com/t/PY45mPhAw

Sorry that it came with a soundtrack; I'm a screencasting n00b.
Comment 17 Aravind Gottipati [:aravind] 2009-03-13 23:08:38 PDT
Given that folks seem to have been troubleshooting this bug for a few weeks now, is this a "page me in the middle of Friday night" blocker?

Can this wait till Monday for further troubleshooting?
Comment 18 Stephen Donner [:stephend] 2009-03-13 23:14:36 PDT
(In reply to comment #17)
> Given that folks seem to have been troubleshooting this bug for a few weeks
> now, is this a "page me in the middle of Friday night" blocker?
> 
> Can this wait till Monday for further troubleshooting?

Sorry, yeah.
Comment 19 Derek Moore [:dmoore] 2009-03-16 12:57:18 PDT
Stephen,

We're going to configure a duplicate VIP for support-stage.mozilla.org on our new load-balancing appliance. This will allow us to debug the issue using a different hardware path and, hopefully, improve performance. I'll update with a new URL for testing later in the day.
Comment 20 Stephen Donner [:stephend] 2009-03-16 17:11:01 PDT
(In reply to comment #19)
> Stephen,
> 
> We're going to configure a duplicate VIP for support-stage.mozilla.org on our
> new load-balancing appliance. This will allow us to debug the issue using a
> different hardware path and, hopefully, improve performance. I'll update with a
> new URL for testing later in the day.

On the layer 4 Cisco ACE or Zeus?  AFAIK, we have some Netscaler-specific code in SUMO that at some point we should abstract away.
Comment 21 Derek Moore [:dmoore] 2009-03-16 17:26:28 PDT
Stephen,

My intention was to use the Cisco ACE, as the configuration isn't necessarily complex enough to warrant spinning up a VIP on the Zeus. Could you elaborate a bit more on the NetScaler-specific interactions?
Comment 22 Stephen Donner [:stephend] 2009-03-16 17:44:07 PDT
(In reply to comment #21)
> Stephen,
> 
> My intention was to use the Cisco ACE, as the configuration isn't necessarily
> complex enough to warrant spinning up a VIP on the Zeus. Could you elaborate a
> bit more on the NetScaler-specific interactions?

Sure; bug 437787 at least; there might be more.  I'll bring this up in our SUMO meeting, if we're to (later) switch over to Zeus on SUMO.
Comment 23 matthew zeier [:mrz] 2009-03-16 19:01:37 PDT
btw, i was only talking about the databases for SUMO, not SUMO itself.
Comment 24 matthew zeier [:mrz] 2009-03-16 19:07:35 PDT
to be clear, i'm not at all advocating moving away from the netscaler for web stuff on sumo.
Comment 25 Laura Thomson :laura 2009-03-16 19:54:09 PDT
I think the example Stephen refers to should be fine on another caching solution: it's just sending cache control headers.  I don't think there is anything particularly weird about how the NS handles these, but obviously it would be good to test if we did switch.  Even though I realize that is not what is being advocated here, it's important to note that the code shouldn't be a barrier to switching.
Comment 26 Jeremy Orem [:oremj] 2009-03-18 15:06:03 PDT
Does submitting a knowledge base article touch or depend on the tiki_comments table? I noticed that table is often locked for long periods of time due to: select max(`commentDate`) from
		`tiki_comments`,`tiki_forums` where `object` = `forumId`and `object
Comment 27 E. Cooper [:ecooper] 2009-03-18 15:22:11 PDT
webroot/tiki-editpage.php doesn't, AFAIK.
Comment 28 Jeremy Orem [:oremj] 2009-03-18 15:49:36 PDT
This appears to be TikiWiki exceeding the PHP memory limit.  I ran through the example and got "Connection Interrupted", and the logs show "PHP Fatal error:  Allowed memory size of 33554432 bytes exhausted (tried to allocate 40961 bytes)".

I've increased the memory limit from 32 MB to 128 MB on stage.
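
For reference, the change described is typically a one-line php.ini (or per-vhost) setting; a sketch, assuming the limit is set globally:

; php.ini -- raise the per-request memory ceiling past the 32 MB that was being exhausted
memory_limit = 128M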
Comment 29 David Tenser [:djst] 2009-03-19 09:04:40 PDT
Can anyone reproduce this after Jeremy's swift help?
Comment 30 Chris Ilias [:cilias] 2009-03-19 09:08:21 PDT
On production? Yes, I just ran into it, while creating the start page a/b test pages.
Comment 31 Stephen Donner [:stephend] 2009-03-19 09:10:11 PDT
(In reply to comment #29)
> Can anyone reproduce this after Jeremy's swift help?

Yes, which is one of the reasons that Eric filed bug 484100; I'm not clear on the relationship, though.
Comment 32 Jeremy Orem [:oremj] 2009-03-19 10:48:45 PDT
(In reply to comment #30)
> On production? Yes, I just ran into it, while creating the start page a/b test
> pages.

I haven't applied the fix to production yet.  I was going to let it bake a bit on stage.
Comment 33 David Tenser [:djst] 2009-03-19 14:01:28 PDT
Sorry to say that I just got the error on stage after creating the article https://support-stage.mozilla.org/sv-SE/kb/Localization+Dashboard?bl=n :(
Comment 34 David Tenser [:djst] 2009-03-19 14:17:01 PDT
Not sure if this is related, but I can't edit that page I just created: https://support-stage.mozilla.org/tiki-editpage.php?page=Localization+Dashboard

I get a "Connection interrupted" error half the time, and just a blank page the other half.
Comment 35 Jeremy Orem [:oremj] 2009-03-19 14:58:17 PDT
14:54 <oremj> [Thu Mar 19 14:14:51 2009] [error] [client 10.2.81.4] PHP Fatal error:  Maximum execution time of 30 seconds exceeded in 
              /data/www/support-stage.mozilla.com/webroot/lib/smarty_tiki/outputfilter.urlt.php on line 311
14:56 <oremj> that's a giant while loop
14:56 <oremj> does page have a lot of a or img tags?
14:57 <oremj> the comment above the function that is timing out says:
14:57 <oremj>  * Although care has been taken to write fast regexes, this is likely to
14:57 <oremj>  * be slow on large files, very complex files etc.
Comment 36 Jeremy Orem [:oremj] 2009-03-23 12:53:26 PDT
Memory fix from comment 28 has been applied to production.  Let me know if you still see this problem.
Comment 37 Laura Thomson :laura 2009-03-24 06:03:36 PDT
Jeremy, we are seeing this problem on stage again (blocking testing the release). Can you check the logs for us?
Comment 38 Jeremy Orem [:oremj] 2009-03-24 11:03:33 PDT
What time did you see the error?
Comment 39 David Tenser [:djst] 2009-03-24 11:06:19 PDT
Just 5 seconds ago on the URL: https://support-stage.mozilla.org/tiki-editpage.php?locale=en-US&page=Localization+Dashboard&amp;login

Blank page.
Comment 40 Jeremy Orem [:oremj] 2009-03-24 11:15:41 PDT
[Tue Mar 24 11:06:00 2009] [error] [client 10.2.81.4] PHP Fatal error:  Maximum execution time of 30 seconds exceeded in /data/www/support-stage.mozilla.com/webroot/lib/smarty_tiki/outputfilter.urlt.php on line 309, referer: https://support-stage.mozilla.org/tiki-editpage.php?locale=en-US&page=Localization+Dashboard
Comment 41 Laura Thomson :laura 2009-03-24 11:57:08 PDT
Via Jeremy, you can profile this online by adding an appropriate parameter, e.g.
https://support-stage.mozilla.org/tiki-editpage.php?locale=en-US&page=Localization+Dashboard&XDEBUG_PROFILE=true

and then see the results at 
https://webgrind.authstage.mozilla.com/

I see for example 77000 calls to preg_match (ouch).

Eric, can you take a look at the profile and see what you can come up with?
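
For anyone recreating the profiling setup, trigger-based profiling like this is normally enabled in the Xdebug 2.x configuration; a sketch, with an assumed output directory:

; php.ini (Xdebug section) -- profile only requests carrying the XDEBUG_PROFILE trigger
xdebug.profiler_enable_trigger = 1
; assumed path; webgrind then reads the cachegrind.out.* files written here
xdebug.profiler_output_dir = /tmp/cachegrind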
Comment 42 E. Cooper [:ecooper] 2009-03-24 12:47:32 PDT
Will do. 

I'll add potential solutions later today.
Comment 43 Chris Ilias [:cilias] 2009-05-20 18:37:20 PDT
This seems to have fallen off the radar, and it still happens often. So... bump.
Comment 44 E. Cooper [:ecooper] 2009-05-26 09:43:14 PDT
For reference, this issue was partially addressed in bug 484100.
Comment 45 Paul Craciunoiu [:paulc] 2009-05-26 09:53:25 PDT
(In reply to comment #44)
> For reference, this issue was partially addressed in bug 484100.
I suppose we'll just have to do more of that :)
Comment 46 E. Cooper [:ecooper] 2009-06-01 07:51:01 PDT
I'm getting a head start on this, especially since it's only going to get worse with the addition of the screencasts feature.
Comment 47 E. Cooper [:ecooper] 2009-06-08 10:08:17 PDT
For whatever reason, new cachegrind.out files aren't showing up for support-stage on webgrind.authstage.mozilla.com. Did we lose it?

Local testing shows that the RemoveXSS function (and the subsequent RemoveXSSregexp calls) executed during webroot/tiki-setup_base.php is taking up about 27% to 35% of the processing time. These functions check all POST/GET data and run it through numerous pattern checking methods. On massive POST/GET data, like wiki pages, it's no wonder it's chugging along.

Again, I can't confirm this is the issue on stage yet, but I would imagine it is if it's happening locally.

What we could do is take the new engine from tiki 3.0 and merge it back into our 1.9 tiki-setup_base.php. From a cursory glance, the sanitization portion of the file is mostly modular. We can presumably pull out the current method and plop in the new one without much issue. Try it?

Optimizing tiki-setup_base.php will cause a performance boost on all pages.
Comment 48 Laura Thomson :laura 2009-06-08 11:56:12 PDT
ccing oremj for comment on webgrind.

RemoveXSS is newish, we backported it from a newer version of tiki sometime at the end of last year.  This was after we did our last round of optimization (for Fx3 launch).

We can certainly try putting in the new version.

If it doesn't help much then we should roll our own, have been through this exercise before with badly performing XSS filters.
Comment 49 E. Cooper [:ecooper] 2009-06-09 08:19:16 PDT
(In reply to comment #48)
> ccing oremj for comment on webgrind.
> 
> RemoveXSS is newish, we backported it from a newer version of tiki sometime at
> the end of last year.  This was after we did our last round of optimization
> (for Fx3 launch).
> 
> We can certainly try putting in the new version.
> 
> If it doesn't help much then we should roll our own, have been through this
> exercise before with badly performing XSS filters.

In order to avoid potential confusion, bug 497109 has been filed to address this.
Comment 50 Laura Thomson :laura 2009-06-17 15:57:45 PDT
We think that the fix for bug 497109 will fix this.  Downgrading this and targeting it out.  We'll push it and see.
Comment 51 Underpass 2009-06-19 21:43:12 PDT
David told me to report if and when I found the problem again. At this moment I'm trying to add a screencast to this article

https://support.mozilla.com/it/kb/*Impostare+Firefox+come+browser+predefinito

and I got 4 of these errors in a row (and the screencast has not been uploaded).

Using

Mozilla/5.0 (X11; U; Linux i686; it; rv:1.9.0.11) Gecko/2009060214 Firefox/3.0.11
Comment 52 Tom Ellins [:TMZ] 2009-06-21 10:07:34 PDT
Also had this today. The change was saved even though the error was shown.
Comment 53 Underpass 2009-06-21 11:52:12 PDT
Another timeout a few seconds ago, upon saving changes to the public version of

https://support.mozilla.com/it/kb/Impossibile+cancellare+la+cronologia+dalla+Barra+degli+indirizzi?bl=n
Comment 54 David Tenser [:djst] 2009-06-22 04:46:16 PDT
Jeremy, see comment 53 for when this happened last (2009-06-21 11:52:12 PDT). Is this helpful in determining what the cause is?
Comment 55 Underpass 2009-06-24 13:42:00 PDT
Just had the timeout error saving the changes to

https://support.mozilla.com/it/kb/Impostare+Firefox+come+browser+predefinito?bl=n
Comment 56 Underpass 2009-07-10 12:43:48 PDT
FYI: I had this timeout error with every article I saved today.
Comment 57 Laura Thomson :laura 2009-07-21 10:23:57 PDT
Jeremy, any information about status today?
Comment 58 Jeremy Orem [:oremj] 2009-07-21 10:29:18 PDT
I don't get a timeout when saving changes to the page mentioned in comment 55. Are there any articles timing out today?
Comment 60 Paul Craciunoiu [:paulc] 2009-07-24 10:07:28 PDT
On my local copy saving a page takes a long time. I say we do some benchmarking and see which parts of the process are slowing down the whole thing.
Comment 61 Laura Thomson :laura 2009-07-27 09:14:49 PDT
Status of the benchmarking?
Comment 62 Paul Craciunoiu [:paulc] 2009-07-27 11:29:03 PDT
Here are some benchmarks with explanations on my local machine.
Things to note:
1. I don't have timeouts set up (i.e. even if it takes 15 minutes, pages load)
2. I don't have a mail server set up, so I have tested with the email notifications portion commented out. (tikilib.php:4519-4528)
3. The wiki text I used to create this is the same as in our "Clearing Private Data" article (my dump is from July 1st)
4. I added categories, folksonomy tags, and checked "Allow HTML"

Test 1: Regular, staging article:
0.279 s -- setup 
0.004 s -- after translation bits (l467)
0.002 s -- before screencasts (l709)
0.154 s -- before check_ticket() (l869) 
0.162 s -- logs
0.162 s -- email notifications
0 s -- create staging 
0.544 s -- after create stage and so much else (l1063) 
0.044 s -- freetags (l1150)
0.003 s -- categories (l1171)

Test 2: Creating a pre-approved article (straight in the KB category) -- this doesn't occur on saving existing pages unless they are being approved.
0.235 s -- setup 
0.003 s -- after translation bits (l467)
0.002 s -- before screencasts (l709)
0.148 s -- before check_ticket() (l869) 
0.296 s -- logs x 2
0.612 s -- create staging 
1.328 s -- after create stage and so much else (l1063) 
0.041 s -- freetags (l1150)
0.017 s -- categories (l1171)
0.021 s -- the rest (l1252)
0.277 s -- display

The portion of the code I had commented out is in lines 4512:4528 of tikilib.php. Looking at the code, these creation times can vary a lot, depending on:
* email notifications
* uploaded files
* log size
* page history

It seems that on stage it takes about 7 seconds from submitting the page to having a full page load (on the above test of pre-approved article), but taking away the display time (~4s?) it should be about 2-3 seconds, which still seems longer than on my local machine (with the difference of mail notifications off).

I'll attach an xdebug table and the cachegrind.out file. Laura: perhaps you can take a look at it and see if you can pinpoint more exactly what our issues are.
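
The checkpoint timings above can be reproduced with a few microtime() calls dropped into tiki-editpage.php; a minimal sketch (the checkpoint() helper is hypothetical, not part of Tiki):

<?php
// crude section timer of the kind used for the checkpoints above
$GLOBALS['bench_last'] = microtime(true);

function checkpoint($label) {
    $now = microtime(true);
    printf("%.3f s -- %s\n", $now - $GLOBALS['bench_last'], $label);
    $GLOBALS['bench_last'] = $now;
}

// usage: sprinkle calls after each suspect section, e.g.
// checkpoint('after translation bits (l467)');
// checkpoint('before check_ticket() (l869)');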
Comment 63 Paul Craciunoiu [:paulc] 2009-07-27 11:31:48 PDT
Created attachment 390855 [details]
xdebug screenshot

Not sure how much this helps, but here's a screenshot of top lengthy tasks in xdebug.
Comment 64 Paul Craciunoiu [:paulc] 2009-07-27 11:36:55 PDT
Ah, yes, the xdebug output is too big. Here's a link to the file:
http://craciunoiu.net/files/mozilla/tiki-editpage-test1.xdebug
Comment 65 Paul Craciunoiu [:paulc] 2009-07-27 15:57:05 PDT
So it appears that on prod creating a page takes a loooong time and times out (for example, see stephend's screencast at http://screencast.com/t/gP9EoGsFPd ). It's possible that's because of the db being overloaded? Or?
On stage it does NOT time out and the creation time is much lower.

IT, can we get some help? Based on the article title in the screencast (URL request for page=...), you can look up the exact times in the logs and see what causes the timeouts. If we could at least pinpoint what to focus on optimizing, that would be great.

Laura, if you have other ideas...
Comment 66 Laura Thomson :laura 2009-07-28 08:42:41 PDT
Clearly isn't going to be done for 1.2.1.

I don't think IT are going to be able to help you (I don't see how they would get the information you want).  I'll take a look at this for 1.3.
Comment 67 Paul Craciunoiu [:paulc] 2009-07-31 13:47:05 PDT
An idea I have is to try turning off email notifications and see how much time that saves. We could rewrite the email library to queue notifications in a separate file that will be cronned. We should have a cron anyway, for that and other tasks (such as additional actions for renaming a page -- see bug 504663).
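
A minimal sketch of the queue-and-cron idea, assuming notifications are appended to a spool file that a cron job drains later (the function and file names are hypothetical, not existing Tiki APIs):

<?php
// hypothetical drop-in for the inline mail calls in tikilib.php: append to a spool file
function queue_notification($to, $subject, $body) {
    $line = base64_encode(serialize(array($to, $subject, $body))) . "\n";
    file_put_contents('/var/spool/sumo-notify.queue', $line, FILE_APPEND | LOCK_EX);
}

// hypothetical cron job: drain the spool and actually send the mail
function flush_notifications($queue = '/var/spool/sumo-notify.queue') {
    if (!file_exists($queue)) {
        return;
    }
    foreach (file($queue, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        list($to, $subject, $body) = unserialize(base64_decode($line));
        mail($to, $subject, $body);
    }
    unlink($queue);
}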
Comment 68 Paul Craciunoiu [:paulc] 2009-08-11 10:51:07 PDT
So one of the slowdowns of creating/updating an article is the action log. Lines 4516 and 6969 of tikilib.php (r48953) add to the log. There may be other log actions on create page, but I have benchmarked these two in comment 62.

Taking them out might be a significant help, but this needs to be tested with proper php timeouts to see if timeouts on staging can be reproduced before but not after.

There is a concern about metrics if we remove these, though; Cww mentions:
---
as long as an article isn't created and deleted in the same 24 hours (And I'm not even sure I care about it in that case), we can just fish it out of tiki_pages or tiki_pagehistory
---
So we should be fine?
Comment 69 [:Cww] 2009-08-11 10:59:36 PDT
We'll also need to basically rework the way we do metrics for KB articles so we don't rely on actionlog for some but not others.  (So translations/updates etc will need to be pulled directly from tiki_pagehistory etc.)

This is quite a bit of change in terms of queries -- basically a rewrite of all queries we're using right now for KB metrics.  There will also possibly need to be query optimization etc., especially for the metrics dashboard. So if we go this route, lots of advance warning would be awesome.

If you want to go this route for forum also, warning is VERY good since the same applies (only moreso since there are lots more forum threads than KB articles).
Comment 70 Paul Craciunoiu [:paulc] 2009-08-11 12:06:03 PDT
Cheng, what would it take to completely eliminate the internal tiki log? How hard would that be? Maybe we should file a tracking bug to evaluate this, since it may improve site-wide performance.
Comment 71 [:Cww] 2009-08-11 13:54:59 PDT
Basically, I'd have to rewrite all the metrics queries not to use it, figure out if we can replace all the functionality we currently get out of the actionlog with the metrics dashboard (likely), and then check to make sure we don't actually lose anything we may want to track in the future (which may mean expanding the metrics DB a little).

That's probably at least a couple weeks of work assuming all goes well.
Comment 72 Laura Thomson :laura 2009-08-11 14:04:39 PDT
There's no need to take this work on yet.  Will let you know if there is a need to do this, but I hope we can find another solution.
Comment 73 Paul Craciunoiu [:paulc] 2009-08-13 14:37:58 PDT
Looking at [1], maybe optimizing multilingual support will help.
[1] http://viewvc.svn.mozilla.org/vc/projects/sumo/trunk/webroot/tiki-editpage.php?r1=9526&r2=9859
Comment 74 Laura Thomson :laura 2009-09-21 07:18:22 PDT
I profiled on Thursday and the key problem is (as mentioned somewhere above) RemoveXSS, where the script spends ~30% of its clock time.  I started tweaking the regexes but I don't think I want to add this to 1.4 at this stage given the problems with search and limitations on QA time.  I'll have a patch up this week and it can go into 1.4.1.
Comment 75 Philippe M. Chiasson (:gozer) 2009-09-25 12:09:30 PDT
On our test instance of SUMO, Roland encountered this bug too, it seems.

Looking at the logs, I can see 3 POSTs to a KB article, 2 of which worked and one that didn't.

The 2 successful ones took 222 and 550 msec and returned a 302 (redirect) response of 20 bytes. So I am guessing the POST just does the work and redirects somewhere else.

The 1 failed request took 665 msec and returned, interestingly, a 200 OK response of only 20 bytes. I am guessing it was:

HTTP/1.1 200 OK
Location: /kb/something

That would certainly cause the browser to display a white page, no ?
Comment 76 Laura Thomson :laura 2009-09-25 12:20:55 PDT
(In reply to comment #75)

> HTTP/1.1 200 OK
> Location: /kb/something
> 
> That would certainly cause the browser to display a white page, no ?

Yes.  Something is causing the 200 header to be sent before the location header.  You can see this in PHP when there is whitespace in an included file (which triggers output to the browser, with a plain old 200 in front of it).  Usually in that situation PHP will give you a "Headers already sent" warning.  This gives me a place to start digging, thanks for the info.
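
A minimal PHP sketch of the failure mode described here (file names hypothetical): stray whitespace emitted by an included file starts the response with a plain 200 before header() can issue the redirect, so the user is left on a blank page:

<?php
// buggy_include.php ends with "?>" followed by a blank line, so requiring it
// emits whitespace -- PHP starts the response with a default "200 OK" at that point
require 'buggy_include.php';

// too late: output has already started, so PHP warns "headers already sent"
// and the redirect never reaches the browser, leaving a blank (white) page
header('Location: /kb/something');
exit;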
Comment 77 Roland Tanglao :rolandtanglao 2009-09-27 22:33:40 PDT
Created attachment 403189 [details]
White screen of death when changing a KB page from Staging to Knowledge Base

Attached are the HTTP Headers from the Firefox 3.5 Live HTTP Headers extension when I reproduced it as follows:
I. create a wiki page as non admin

   1. Login as non admin e.g. RolandNonAdmin
   2. Create a new article, e.g. https://sumo-preview.mozillamessaging.com/en-US/kb/yahoo
   3. Click "Create this page", add content, click "Save", and then the article is created: https://sumo-preview.mozillamessaging.com/en-US/kb/yahoo?bl=n

II. login in as admin and approve

   1. Login as admin 
   2. Via "Last Changes", go to the article https://sumo-preview.mozillamessaging.com/en-US/kb/yahoo ("This is a new staging page that has not been approved before. Edit and manually move it to the category for approved pages to approve it for the first time."), click "Edit this page", change "Categorize" from "Staging area" to "Knowledge Base", add a reason under "Describe the changes you made", and click "Save"; I then got the WSOD
Comment 78 Laura Thomson :laura 2009-09-28 09:15:39 PDT
I have *still* never reproed this (spent a good bit of time trying this morning).

If anyone - Stephen?  Krupa? - could get a screencast of it, maybe I'll be able to - if there's a checkbox you're checking or something I am missing.
Comment 79 Laura Thomson :laura 2009-09-29 08:04:04 PDT
Also, if you see errors on stage when this happens - or if you don't - please paste them in the bug.  error_reporting is now turned way up on that box so we can diagnose this.
Comment 80 James Socol [:jsocol, :james] 2009-10-05 17:25:08 PDT
(In reply to comment #79)
> Also, if you see errors on stage when this happens - or if you don't - please
> paste them in the bug.  error_reporting is now turned way up on that box so we
> can diagnose this.

Has anyone encountered this in the past week? Seeing any PHP errors?
Comment 81 James Socol [:jsocol, :james] 2009-10-15 14:30:31 PDT
Since I still can't repro this locally, I'm pushing it out to the .next milestone.

If anyone sees this on stage and sees a PHP error, please post the error here.
Comment 82 James Socol [:jsocol, :james] 2009-10-28 15:19:39 PDT
I finally ran into a local timeout, although it was on a completely different page (tiki-theme_control_objects.php). When I profiled it, the two parts that stuck out were the RemoveXSS section and the URLT output filter.

LPH says the XSS filtering is not so eager/aggressive in the TikiWiki trunk, though the process looks similar to me. I have hope that after bug 522813, we'll get to a place where we can remove the URLT filter completely.

Anyway, these are the places I'll concentrate on, but since I still can't reproduce the exact bug locally, and I hope that it will go away with future work, I'm pushing it out yet again.
Comment 83 James Socol [:jsocol, :james] 2009-11-23 12:58:23 PST
I'm still thinking the two problem sections are the URL output filter and the RemoveXSS section. But these aren't exactly low-hanging fruit. I'm almost tempted to wait until post-upgrade to see if this gets better. But for now -> pushing to 1.5.1
Comment 84 Philippe M. Chiasson (:gozer) 2010-01-15 10:09:46 PST
*** Bug 539609 has been marked as a duplicate of this bug. ***
Comment 85 Roland Tanglao :rolandtanglao 2010-01-15 10:36:56 PST
Still seeing this on stage.support.mozillamessaging.com (aka SuMoMo Stage). I can reproduce it every time by following the steps in comment 77. I can give access to SuMoMo Stage to whoever wants to debug further. Please email me at roland AT mozillamessaging.com for access.
Comment 86 James Socol [:jsocol, :james] 2010-12-07 10:33:55 PST
Fixed by Kitsune.
Comment 87 Stephen Donner [:stephend] 2011-01-03 12:29:58 PST
Ver-ah-fied
