Closed Bug 480415 Opened 15 years ago Closed 14 years ago

Timeouts when submitting a Knowledge Base article on SUMO

Categories

(support.mozilla.org :: Knowledge Base Software, task, P1)

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: stephend, Assigned: jsocol)


Details

Attachments

(3 files)

NOTE: seeing this on production as well as staging

Steps to Reproduce:

1. Log in to SUMO and load https://support-stage.mozilla.org/en-US/kb/Creating+articles
2. Type something in the "Quick edit an article" textfield and click "Edit"
3. When the page loads (something like https://support-stage.mozilla.org/tiki-editpage.php?page=Hotdogs), type some stuff into the "Edit:" field and the "Edit Summary:" field
4. Click "Save"

Expected Results:

The article is created

Actual Results:

Page Load Error: connection reset

(See screenshot.)
OS: Mac OS X → All
Assignee: server-ops → justdave
I upgraded all four back end servers to RHEL 5.3 (because it hadn't been done yet, and just to eliminate unapplied security updates from being an issue) and rebooted them.  Didn't seem to help any.  There's no unusual load on the databases, nor on the back end machines.  The Connection Reset error seems to imply that it could be a netscaler issue, since the browser is actually talking to the netscaler.

Note that this is reproducible on both stage and production, and started happening in production BEFORE last night's code push, so the code push is not responsible for this.

There have been bugs on other sites having issues with the netscaler doing this to them that were fixed by tweaks to the netscaler config (search fixed mozilla.org bugs for "conn reset"), but I don't know what was done or if it's still relevant (it's been a while).
Assignee: justdave → dmoore
Both stage and production do go through the netscaler, as well.
No longer blocks: 444703
Stephen,

Are you certain this problem is also occurring in production? If so, could you provide a sample case or, at least, a timestamp for a recent incident?

I believe I've identified the source of the problem for the staging site, but it's not relevant to production.
Dave,

mrapp-stage02 appears to have an inappropriate ethernet bonding configuration. It's missing necessary options in /etc/modprobe.conf, which is causing it to revert to the default round-robin balancing of traffic between eth0 and eth1. This is confusing the forwarding tables on core1 and core2, creating packet loss for this host. The packet loss is significant enough that it is causing the NetScaler to fail health checks on a regular basis.

I believe fixing the configuration for bond0 should resolve our issues for support-stage. I'm waiting to hear if we can reproduce on production. The production VIP uses different backend servers, all of which appear to be properly configured.
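
For reference, a minimal sketch of the /etc/modprobe.conf entries that select active-backup bonding on RHEL 5 (the exact options used on mrapp-stage02 aren't recorded in this bug, so treat these values as illustrative):

alias bond0 bonding
options bond0 mode=active-backup miimon=100

Without the options line, the bonding driver falls back to its default balance-rr (round-robin) mode, which is what confuses the switches' forwarding tables as described above.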
Derek, it now seems fine to me on production (knock on wood); testing on staging now, too, it seems fine there.

Have you guys been changing stuff in the interim?
(In reply to comment #6)
> Derek, it now seems fine to me on production (knock on wood); testing on
> staging now, too, it seems fine there.
> 
> Have you guys been changing stuff in the interim?

Seeing it again; I'm going to hold off on claiming it's OK until I haven't seen it for quite a while.
Seeing it on production or staging? We won't be making the staging fix until tomorrow's maintenance window, so I would expect that the problem is still present there.
Mostly seeing it on staging; it's highly sporadic, though, and to test it I have to create new articles, which spams the heck out of our SUMO editors, so I'm not doing it as often as I'd like.

I'll wait until tomorrow's fix to test further on staging.
[root@mrapp-stage02 ~]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:13:21:1b:9b:ee

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:13:21:1b:9b:ed
Was the fix applied tonight?  I seem to be having better luck tonight WRT timeouts; I wrote a small/simple Selenium script that:

* logs in
* clicks "Edit" on an article
* logs out
* loads a URL to edit an article while logged out, thus forcing a login with an edit-page parameter.

Didn't see any issues.  We _were_ seeing this on production, though, a few days back, and it's worrisome to me that it could've "fixed itself"; did we have any related fixes/changes on production recently?
Yep, the push went in tonight.  I'll leave you in dmoore's hands now :)
Eric Cooper and I both just hit this tonight while creating new en-US pages on staging.
Severity: critical → blocker
Webserver logs confirm at least three timeouts in the past 24 hours. While that's a significant improvement, we're still working on isolating these last few errors.
Is this at all tied to our planned push for the SUMO 1.0 release? If so, this bug needs to be marked as a blocker for that release.
Throwing back to server-ops so that Someone Smarter than Me (tm) can take a look.

Here's a screencast: http://screencast.com/t/PY45mPhAw

Sorry that it came with a soundtrack; I'm a screencasting n00b.
Assignee: dmoore → server-ops
Given that folks seem to have been troubleshooting this bug for more than a few weeks now, is this a page-me-in-the-middle-of-Friday-night blocker?

Can this wait till Monday for further troubleshooting?
Assignee: server-ops → aravind
(In reply to comment #17)
> Given that folks seem to have been troubleshooting this bug for more than a
> few weeks now, is this a page-me-in-the-middle-of-Friday-night blocker?
> 
> Can this wait till Monday for further troubleshooting?

Sorry, yeah.
Assignee: aravind → server-ops
Severity: blocker → major
Stephen,

We're going to configure a duplicate VIP for support-stage.mozilla.org on our new load-balancing appliance. This will allow us to debug the issue using a different hardware path and, hopefully, improve performance. I'll update with a new URL for testing later in the day.
Assignee: server-ops → dmoore
(In reply to comment #19)
> Stephen,
> 
> We're going to configure a duplicate VIP for support-stage.mozilla.org on our
> new load-balancing appliance. This will allow us to debug the issue using a
> different hardware path and, hopefully, improve performance. I'll update with a
> new URL for testing later in the day.

On the layer 4 Cisco ACE or Zeus?  AFAIK, we have some Netscaler-specific code in SUMO that at some point we should abstract away.
Stephen,

My intention was to use the Cisco ACE, as the configuration isn't necessarily complex enough to warrant spinning up a VIP on the Zeus. Could you elaborate a bit more on the Netscaler-specific interactions?
(In reply to comment #21)
> Stephen,
> 
> My intention was to use the Cisco ACE, as the configuration isn't necessarily
> complex enough to warrant spinning up a VIP on the Zeus. Could you elaborate a
> bit more on the Netscaler-specific interactions?

Sure; bug 437787 at least; there might be more.  I'll bring this up in our SUMO meeting, if we're to (later) switch over to Zeus on SUMO.
BTW, I was only talking about the databases for SUMO, not SUMO itself.
To be clear, I'm not at all advocating moving away from the netscaler for web stuff on SUMO.
I think the example Stephen refers to should be fine on another caching solution: it's just sending cache control headers.  I don't think there is anything particularly weird about how the NS handles these, but obviously it would be good to test if we did switch.  Even though I realize that is not what is being advocated here, it's important to note that the code shouldn't be a barrier to switching.
Assignee: dmoore → oremj
Does submitting a knowledge base article touch or depend on the tiki_comments table? I noticed that table is often locked for long periods of time due to:
select max(`commentDate`) from `tiki_comments`, `tiki_forums` where `object` = `forumId` and `object
webroot/tiki-editpage.php doesn't, AFAIK.
This appears to be TikiWiki exceeding the PHP memory limit.  I ran through the example and got "Connection Interrupted", and the logs show "PHP Fatal error:  Allowed memory size of 33554432 bytes exhausted (tried to allocate 40961 bytes)".

I've increased the memory limit from 32 MB to 128 MB on stage.
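
For reference, 33554432 bytes is the old 32M limit. The change described above amounts to something like the following (a sketch; whether it was made in php.ini or in the Apache vhost config isn't stated in this bug):

In php.ini:
memory_limit = 128M

or, per-vhost in the Apache config:
php_admin_value memory_limit 128M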
Can anyone reproduce this after Jeremy's swift help?
On production? Yes, I just ran into it, while creating the start page a/b test pages.
(In reply to comment #29)
> Can anyone reproduce this after Jeremy's swift help?

Yes, which is one of the reasons that Eric filed bug 484100; I'm not clear on the relationship, though.
(In reply to comment #30)
> On production? Yes, I just ran into it, while creating the start page a/b test
> pages.

I haven't applied the fix to production yet.  I was going to let it bake a bit on stage.
Sorry to say that I just got the error on stage after creating the article https://support-stage.mozilla.org/sv-SE/kb/Localization+Dashboard?bl=n :(
Not sure if this is related, but I can't edit that page I just created: https://support-stage.mozilla.org/tiki-editpage.php?page=Localization+Dashboard

I get a "Connection interrupted" error every other time, and just a blank page the rest of the time.
14:54 <oremj> [Thu Mar 19 14:14:51 2009] [error] [client 10.2.81.4] PHP Fatal error:  Maximum execution time of 30 seconds exceeded in 
              /data/www/support-stage.mozilla.com/webroot/lib/smarty_tiki/outputfilter.urlt.php on line 311
14:56 <oremj> that's a giant while loop
14:56 <oremj> does page have a lot of a or img tags?
14:57 <oremj> the comment above the function that is timing out says:
14:57 <oremj>  * Although care has been taken to write fast regexes, this is likely to
14:57 <oremj>  * be slow on large files, very complex files etc.
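
To make the IRC log above a bit more concrete, here is a simplified illustration of the pattern being described -- a while loop that re-runs a regex over the whole output buffer for every link it rewrites. This is NOT the real outputfilter.urlt.php code (the pattern and replacement below are made up); it only shows why the cost grows quickly on pages with many a/img tags:

<?php
function rewrite_links_slowly($html) {
    $pattern = '/href="tiki-index\.php\?page=([^"]+)"/';
    // Each iteration rescans the buffer from the start and rebuilds it,
    // so n matches on an m-byte page is roughly O(n * m) work.
    while (preg_match($pattern, $html, $m, PREG_OFFSET_CAPTURE)) {
        $pretty = 'href="/kb/' . urlencode($m[1][0]) . '"';
        $html = substr_replace($html, $pretty, $m[0][1], strlen($m[0][0]));
    }
    return $html;
}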
Memory fix from comment 28 has been applied to production.  Let me know if you still see this problem.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Jeremy, we are seeing this problem on stage again (blocking testing the release). Can you check the logs for us?
Severity: major → critical
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
What time did you see the error?
[Tue Mar 24 11:06:00 2009] [error] [client 10.2.81.4] PHP Fatal error:  Maximum execution time of 30 seconds exceeded in /data/www/support-stage.mozilla.com/webroot/lib/smarty_tiki/outputfilter.urlt.php on line 309, referer: https://support-stage.mozilla.org/tiki-editpage.php?locale=en-US&page=Localization+Dashboard
Via Jeremy, you can profile this online by adding an appropriate parameter, e.g.
https://support-stage.mozilla.org/tiki-editpage.php?locale=en-US&page=Localization+Dashboard&XDEBUG_PROFILE=true

and then see the results at 
https://webgrind.authstage.mozilla.com/

I see for example 77000 calls to preg_match (ouch).

Eric, can you take a look at the profile and see what you can come up with?
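
For anyone else profiling this way: the XDEBUG_PROFILE=true trigger only does something if trigger-based profiling is enabled on the server. A sketch of the Xdebug 2 settings involved (the actual values on support-stage aren't shown in this bug, and the output directory below is made up):

xdebug.profiler_enable_trigger = 1
xdebug.profiler_output_dir = /tmp/cachegrind
xdebug.profiler_output_name = cachegrind.out.%t.%p

webgrind then just reads the cachegrind.out.* files from that directory.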
Will do. 

I'll add potential solutions later today.
Assignee: oremj → nobody
Component: Server Operations → Knowledge Base Software
Product: mozilla.org → support.mozilla.com
QA Contact: mrz → kb-software
Version: other → unspecified
This seems to have fallen off the radar, and it still happens often. So... bump.
Target Milestone: --- → 1.2
For reference, this issue was partially addressed in bug 484100.
Depends on: 484100
(In reply to comment #44)
> For reference, this issue was partially addressed in bug 484100.
I suppose we'll just have to do more of that :)
I'm getting a head start on this, especially since it's only going to get worse with the addition of the screencasts feature.
Assignee: nobody → smirkingsisyphus
For whatever reason, new cachegrind.out files aren't showing up for support-stage on webgrind.authstage.mozilla.com. Did we lose it?

Local testing shows that the RemoveXSS function (and the subsequent RemoveXSSregexp calls) executed during webroot/tiki-setup_base.php is taking up about 27% to 35% of the processing time. These functions check all POST/GET data and run it through numerous pattern checking methods. On massive POST/GET data, like wiki pages, it's no wonder it's chugging along.

Again, I can't confirm this is the issue on stage yet, but I would imagine it is if it's happening locally.

What we could do is take the new engine from tiki 3.0 and merge it back into our 1.9 tiki-setup_base.php. From a cursory glance, the sanitization portion of the file is mostly modular. We can presumably pull out the current method and plop in the new one without much issue. Try it?

Optimizing tiki-setup_base.php will cause a performance boost on all pages.
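
For context, here is a simplified sketch of why that sanitization scales with request size (this is illustrative only, not the actual RemoveXSS/tiki-setup_base.php code; the pattern list is made up):

<?php
// Every GET/POST/cookie value -- including a full wiki page body in
// $_POST -- is walked recursively and run through each pattern.
$xss_patterns = array(
    '/javascript\s*:/i',
    '/<\s*script/i',
    '/on\w+\s*=/i',
);

function sanitize_request_value($value, array $patterns) {
    if (is_array($value)) {
        foreach ($value as $k => $v) {
            $value[$k] = sanitize_request_value($v, $patterns);
        }
        return $value;
    }
    // Each pattern rescans the whole string, so a large article body costs
    // (number of patterns) full passes over the text.
    foreach ($patterns as $p) {
        $value = preg_replace($p, '', $value);
    }
    return $value;
}

foreach (array('_GET', '_POST', '_COOKIE') as $super) {
    $GLOBALS[$super] = sanitize_request_value($GLOBALS[$super], $xss_patterns);
}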
ccing oremj for comment on webgrind.

RemoveXSS is newish; we backported it from a newer version of tiki sometime at the end of last year.  This was after we did our last round of optimization (for Fx3 launch).

We can certainly try putting in the new version.

If it doesn't help much, then we should roll our own; I've been through this exercise before with badly performing XSS filters.
Depends on: 497109
(In reply to comment #48)
> ccing oremj for comment on webgrind.
> 
> RemoveXSS is newish; we backported it from a newer version of tiki sometime at
> the end of last year.  This was after we did our last round of optimization
> (for Fx3 launch).
> 
> We can certainly try putting in the new version.
> 
> If it doesn't help much, then we should roll our own; I've been through this
> exercise before with badly performing XSS filters.

In order to avoid potential confusion, bug 497109 has been filed to address this.
We think that the fix for bug 497109 will fix this.  Lowering the severity and moving the target milestone out.  We'll push it and see.
Severity: critical → major
Target Milestone: 1.2 → 1.3
David told me to report it if and when I ran into the problem again. At the moment I'm trying to add a screencast to this article

https://support.mozilla.com/it/kb/*Impostare+Firefox+come+browser+predefinito

and I got 4 of these errors in a row (and the screencast has not been uploaded).

Using

Mozilla/5.0 (X11; U; Linux i686; it; rv:1.9.0.11) Gecko/2009060214 Firefox/3.0.11
Also had this today. The change was saved even though the error was shown.
Another timeout a few seconds ago, upon saving changes to the public version of

https://support.mozilla.com/it/kb/Impossibile+cancellare+la+cronologia+dalla+Barra+degli+indirizzi?bl=n
Jeremy, see comment 53 for when this happened last (2009-06-21 11:52:12 PDT). Is this helpful in determining what the cause is?
Just had the timeout error saving the changes to

https://support.mozilla.com/it/kb/Impostare+Firefox+come+browser+predefinito?bl=n
FYI: I got this timeout error with every article I saved today.
Target Milestone: 1.3 → 1.2.1
Jeremy, any information about the status today?
I don't get a timeout when saving changes to the page mentioned in comment 55. Are there any articles timing out today?
On my local copy saving a page takes a long time. I say we do some benchmarking and see which parts of the process are slowing down the whole thing.
Blocks: 487598
Status of the benchmarking?
Here are some benchmarks with explanations on my local machine.
Things to note:
1. I don't have timeouts set up (e.g. even if it takes 15 minutes, pages load)
2. I don't have a mail server set up, so I have tested with the email notifications portion commented out (tikilib.php:4519-4528)
3. The wiki text I used to create this is the same as in our "Clearing Private Data" article (my dump is from July 1st)
4. I added categories, folksonomy tags, and checked "Allow HTML"

Test 1: Regular, staging article:
0.279 s -- setup 
0.004 s -- after translation bits (l467)
0.002 s -- before screencasts (l709)
0.154 s -- before check_ticket() (l869) 
0.162 s -- logs
0.162 s -- email notifications
0 s -- create staging 
0.544 s -- after create stage and so much else (l1063) 
0.044 s -- freetags (l1150)
0.003 s -- categories (l1171)

Test 2: Creating a pre-approved article (straight in the KB category) -- this doesn't occur on saving existing pages unless they are being approved.
0.235 s -- setup 
0.003 s -- after translation bits (l467)
0.002 s -- before screencasts (l709)
0.148 s -- before check_ticket() (l869) 
0.296 s -- logs x 2
0.612 s -- create staging 
1.328 s -- after create stage and so much else (l1063) 
0.041 s -- freetags (l1150)
0.017 s -- categories (l1171)
0.021 s -- the rest (l1252)
0.277 s -- display

The portion of the code I had commented out is in lines 4512:4528 of tikilib.php. Looking at the code, these creation times can vary a lot, depending on:
* email notifications
* uploaded files
* log size
* page history

It seems that on stage it takes about 7 seconds from submitting the page to having a full page load (on the above test of a pre-approved article), but taking away the display time (~4 s?) it should be about 2-3 seconds, which still seems longer than on my local machine (keeping in mind that mail notifications were off locally).

I'll attach an xdebug table and the cachegrind.out file. Laura: perhaps you can take a look at it and see if you can pinpoint more exactly what our issues are.
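
For reference, a minimal sketch of the kind of checkpoint timing behind the numbers above (the real instrumentation isn't attached to this bug; the helper name and labels are just illustrative):

<?php
// Drop-in timing checkpoints for tiki-editpage.php: each call logs the
// time elapsed since the previous checkpoint.
$GLOBALS['bench_last'] = microtime(true);

function bench_checkpoint($label) {
    $now = microtime(true);
    error_log(sprintf('%.3f s -- %s', $now - $GLOBALS['bench_last'], $label));
    $GLOBALS['bench_last'] = $now;
}

// Example usage, sprinkled through the save path:
// bench_checkpoint('after translation bits (l467)');
// bench_checkpoint('before check_ticket() (l869)');
// bench_checkpoint('after create stage and so much else (l1063)');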
Attached image xdebug screenshot
Not sure how much this helps, but here's a screenshot of top lengthy tasks in xdebug.
Ah, yes, the xdebug output is too big. Here's a link to the file:
http://craciunoiu.net/files/mozilla/tiki-editpage-test1.xdebug
So it appears that on prod creating a page takes a loooong time and times out (for example, see stephend's screencast at http://screencast.com/t/gP9EoGsFPd ). It's possible that's because of the db being overloaded? Or?
On stage it does NOT time out and the creation time is much lower.

IT, can we get some help? Based on the article title in the screencast (URL request for page=...), you can look up the exact times in the logs and see what causes the timeouts. If we could at least pinpoint what to focus on optimizing, that would be great.

Laura, if you have other ideas...
Assignee: smirkingsisyphus → server-ops
Clearly isn't going to be done for 1.2.1.

I don't think IT are going to be able to help you (I don't see how they would get the information you want).  I'll take a look at this for 1.3.
Assignee: server-ops → laura
Target Milestone: 1.2.1 → 1.3
An idea I have is to try turning off email notifications and see how much time that saves. We could rewrite the email library to queue notifications in a separate file that will be cronned. We should have a cron anyway, for that and other tasks (such as additional actions for renaming a page -- see bug 504663).
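
A rough sketch of that idea, to make it concrete (the spool path, record format, and function name below are made up for illustration; the real change would live in the email library plus a matching cron script):

<?php
// Instead of sending mail inline while the page save is still running,
// append the notification to a spool file and return immediately.
function queue_notification($to, $subject, $body) {
    $record = json_encode(array(
        'to'      => $to,
        'subject' => $subject,
        'body'    => $body,
        'queued'  => time(),
    )) . "\n";
    file_put_contents('/var/spool/sumo/notifications.queue', $record,
                      FILE_APPEND | LOCK_EX);
}

// A cron job would then read the spool, send each entry with the existing
// mail library, and truncate the file -- outside the user's request.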
So one of the slowdowns in creating/updating an article is the action log. Lines 4516 and 6969 of tikilib.php (r48953) add to the log. There may be other log actions on page creation, but I have benchmarked these two in comment 62.

Taking them out might be a significant help, but this needs to be tested with proper php timeouts to see if timeouts on staging can be reproduced before but not after.

There is a concern about metrics if we remove these, though; Cww mentions:
---
as long as an article isn't created and deleted in the same 24 hours (And I'm not even sure I care about it in that case), we can just fish it out of tiki_pages or tiki_pagehistory
---
So we should be fine?
We'll also need to basically rework the way we do metrics for KB articles so we don't rely on actionlog for some but not others.  (So translations/updates etc will need to be pulled directly from tiki_pagehistory etc.)

This is quite a bit of change in terms of queries -- basically a rewrite of all queries we're using right now for KB metrics.  There will also possibly need to be query optimization, etc., especially for the metrics dashboard. So if we go this route, lots of advance warning would be awesome.

If you want to go this route for the forums also, warning is VERY good, since the same applies (only more so, since there are lots more forum threads than KB articles).
Cheng, what would it take to completely eliminate the internal tiki log? How hard would that be? Maybe we should file a tracking bug to evaluate this, since it may improve site-wide performance.
Basically, I'd have to rewrite all the metrics queries not to use it, then figure out if we can replace all the functionality we currently get out of the actionlog with the metrics dashboard (likely), and then check to make sure we don't actually lose anything we may want to track in the future (which may mean expanding the metrics DB a little).

That's probably at least a couple weeks of work assuming all goes well.
There's no need to take this work on yet.  Will let you know if there is a need to do this, but I hope we can find another solution.
Looking at [1], maybe optimizing multilingual support will help.
[1] http://viewvc.svn.mozilla.org/vc/projects/sumo/trunk/webroot/tiki-editpage.php?r1=9526&r2=9859
Target Milestone: 1.3 → 1.4
I profiled on Thursday and the key problem is (as mentioned somewhere above) RemoveXSS, where the script spends ~30% of its clock time.  I started tweaking the regexes but I don't think I want to add this to 1.4 at this stage given the problems with search and limitations on QA time.  I'll have a patch up this week and it can go into 1.4.1.
Target Milestone: 1.4 → 1.4.1
On our test instance of SUMO, Roland encountered this bug too, it seems.

Looking at the logs, I can see 3 POSTs to a KB article: 2 that worked and one that didn't.

The 2 successful ones took 222 and 550 msec and returned a 302 (redirect) response of 20 bytes. So I am guessing the POST just does the work and redirects somewhere else.

The 1 failed request took 665 msec and returned, interestingly, a 200 OK response of only 20 bytes. I am guessing it was:

HTTP/1.1 200 OK
Location: /kb/something

That would certainly cause the browser to display a white page, no ?
(In reply to comment #75)

> HTTP/1.1 200 OK
> Location: /kb/something
> 
> That would certainly cause the browser to display a white page, no ?

Yes.  Something is causing the 200 header to be sent before the Location header.  You can see this in PHP when there is whitespace in an included file (which triggers output to the browser, with a plain old 200 in front of it).  Usually in that situation PHP will give you a "Headers already sent" warning.  This gives me a place to start digging; thanks for the info.
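
A minimal sketch of that failure mode, assuming a helper file with stray whitespace after its closing PHP tag (illustrative only, not actual SUMO/tiki code; the file name is made up):

<?php
// helper.php is assumed to end with a closing PHP tag followed by a blank
// line; requiring it emits that whitespace, which forces PHP to send the
// status line (200 OK) and default headers immediately.
require 'helper.php';

// By the time the redirect is attempted it is too late: PHP warns
// "Cannot modify header information - headers already sent", the Location
// header never reaches the browser, and the user just sees a 200 with a
// near-empty body -- i.e. the white page described above.
header('Location: /kb/something');
exit;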
Attached are the HTTP Headers from the Firefox 3.5 Live HTTP Headers extension when I reproduced it as follows:
I. create a wiki page as non admin

   1. Login as non admin e.g. RolandNonAdmin
   2. Create a new article, e.g. https://sumo-preview.mozillamessaging.com/en-US/kb/yahoo
   3. Click "Create this page", add content, click Save and then the article is created https://sumo-preview.mozillamessaging.com/en-US/kb/yahoo?bl=n 

II. login in as admin and approve

   1. Login as admin 
   2. Via "Last Changes", go to the article https://sumo-preview.mozillamessaging.com/en-US/kb/yahoo ("This is a new staging page that has not been approved before. Edit and manually move it to the category for approved pages to approve it for the first time."), click "Edit this page", change "Categorize" from "Staging area" to "Knowledge Base", add a reason under "Describe the changes you made", and click "Save"; that's when I got a WSOD
I have *still* never reproed this (spent a good bit of time trying this morning).

If anyone - Stephen? Krupa? - could get a screencast of it, maybe I'll be able to repro it; perhaps there's a checkbox you're checking or something else I am missing.
Also, if you see errors on stage when this happens - or if you don't - please paste them in the bug.  error_reporting is now turned way up on that box so we can diagnose this.
Assignee: laura → james
Priority: -- → P1
(In reply to comment #79)
> Also, if you see errors on stage when this happens - or if you don't - please
> paste them in the bug.  error_reporting is now turned way up on that box so we
> can diagnose this.

Has anyone encountered this in the past week? Seeing any PHP errors?
Since I still can't repro this locally, I'm pushing it out to the .next milestone.

If anyone sees this on stage and sees a PHP error, please post the error here.
Hardware: x86 → All
Target Milestone: 1.4.1 → 1.4.2
I finally ran into a local timeout, although it was on a completely different page (tiki-theme_control_objects.php). When I profiled it, the two parts that stuck out were the RemoveXSS section and the URLT output filter.

LPH says the XSS filtering is not so eager/aggressive in the TikiWiki trunk, though the process looks similar to me. I have hope that after bug 522813, we'll get to a place where we can remove the URLT filter completely.

Anyway, these are the places I'll concentrate, but since I still can't reproduce the exact bug locally, and I hope that it will go away with future work, I'm pushing it out yet again.
Target Milestone: 1.4.2 → 1.5
I'm still thinking the two problem sections are the URL output filter and the RemoveXSS section. But these aren't exactly low-hanging fruit. I'm almost tempted to wait until post-upgrade to see if this gets better. But for now -> pushing to 1.5.1
Target Milestone: 1.5 → 1.5.1
Target Milestone: 1.5.1 → 1.6
Still seeing this on stage.support.mozillamessaging.com (aka SuMoMo Stage). I can reproduce it every time by following the steps in comment 77. I can give access to SuMoMo Stage to whoever wants to debug further; please email me roland AT mozillamessaging.com for access.
Fixed by Kitsune.
Status: REOPENED → RESOLVED
Closed: 15 years ago → 14 years ago
Resolution: --- → FIXED
Ver-ah-fied
Status: RESOLVED → VERIFIED