Bug 532498 - Error message "Service Unavailable" with every my edit
Status: RESOLVED INCOMPLETE
Product: support.mozilla.org
Classification: Other
Component: Knowledge Base Software
Version: unspecified
Platform: All All
Importance: P1 major
Target Milestone: Future
Assigned To: James Socol [:jsocol, :james]
Depends on: 551513
Reported: 2009-12-02 13:43 PST by Pavel Cvrcek [:JasnaPaka]
Modified: 2010-10-24 15:51 PDT


Attachments
Headers (5.65 KB, text/plain)
2009-12-03 12:33 PST, Pavel Cvrcek [:JasnaPaka]
no flags
patch, moves sending wiki edit notifications to a gearman worker (15.26 KB, patch)
2010-03-23 16:57 PDT, James Socol [:jsocol, :james]
paul+moz: review+
helpful testing patch (1.51 KB, patch)
2010-03-23 17:00 PDT, James Socol [:jsocol, :james]
no flags
patch to patch 1, moves DB connecting/closing inside the function and adds debug output (8.16 KB, patch)
2010-03-30 17:31 PDT, James Socol [:jsocol, :james]
laura: review+

Description Pavel Cvrcek [:JasnaPaka] 2009-12-02 13:43:08 PST
Every time I try to save an edit on SUMO, I get this error page:

"Service Unavailable
The service is temporarily unavailable. Please try again later." 

When I resubmit the POST data from this error page, everything works fine. Chris Ilias said that this problem may be related to the Amsterdam server, because he doesn't see it and I'm in Europe.

This problem was discussed in Contributors forum:
https://support.mozilla.com/en-US/forum/3/513605
Comment 1 Stephen Donner [:stephend] 2009-12-02 13:47:30 PST
Pavel, can you get and attach (as a plain text file) Live HTTP Headers [1] output of just these submission attempts?

[1] https://addons.mozilla.org/en-US/firefox/addon/3829

Thanks!
Comment 2 Tom Ellins [:TMZ] 2009-12-02 14:02:37 PST
IT please check this out.

http://mozilla-uk.org/headers
Comment 3 Tom Ellins [:TMZ] 2009-12-02 14:04:26 PST
per jsocol's request. http://mozilla-uk.org/service.JPG
Comment 4 Shyam Mani [:fox2mike] 2009-12-03 11:10:39 PST
I think Derek's fixed this..TMZ confirmed over IRC.
Comment 5 Pavel Cvrcek [:JasnaPaka] 2009-12-03 12:33:10 PST
Created attachment 415930
Headers

Looks like it still doesn't work for me. See my headers.
Comment 6 Tom Ellins [:TMZ] 2009-12-03 12:35:25 PST
This appears to be an intermittent issue. 3 people in the EU cannot reproduce it. Fox2mike, could you look at this?
Comment 7 Tom Ellins [:TMZ] 2009-12-03 12:36:08 PST
Test article on prod where this can be tested without editing live docs: https://support.mozilla.com/en-US/kb/tmztest?bl=n
Comment 8 [:Cww] 2009-12-03 14:07:16 PST
re-assign to server-ops if you need IT attention.
Comment 9 Derek Moore [:dmoore] 2009-12-03 14:41:15 PST
I believe we have isolated the issue related to SUMO timeouts in Amsterdam. Any client request (particularly edits) which took more than 10 seconds to complete could trigger a service timeout for other users of the site. We've removed the monitor which controlled this behavior and we've slightly expanded the monitoring timeouts in general.

I'll leave this bug open for now, as I'd like for everyone to confirm their issues have been resolved.
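
To illustrate why a single slow edit could surface as an error for other users, here is a minimal Python sketch of a response-time monitor. How the actual Amsterdam monitor worked is not described in this bug; the function name, the `max_slow` tolerance, the 30-second relaxed timeout and the sample latencies are all assumptions made for illustration only.

```python
from typing import Iterable


def pool_is_healthy(latencies_s: Iterable[float],
                    timeout_s: float = 10.0,
                    max_slow: int = 0) -> bool:
    """Return True if the backend should stay in rotation.

    Hypothetical rule: the backend is marked down once more than
    `max_slow` observed requests exceed `timeout_s` seconds.
    """
    slow = sum(1 for t in latencies_s if t > timeout_s)
    return slow <= max_slow


# One 12-second edit among otherwise fast requests (made-up numbers).
samples = [0.4, 0.8, 12.0, 0.5]

# Strict monitor: a single slow request takes the backend out for everyone.
strict = pool_is_healthy(samples, timeout_s=10.0)

# Relaxed monitor: with a longer timeout the same traffic is tolerated.
relaxed = pool_is_healthy(samples, timeout_s=30.0)

print(strict, relaxed)
```

Under these assumptions the strict check fails and the relaxed one passes, which matches the direction of the change described above (removing the monitor and widening the timeouts).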
Comment 10 Pavel Cvrcek [:JasnaPaka] 2009-12-03 14:57:14 PST
At this moment all works fine for me. Thanks!
Comment 11 Underpass 2009-12-03 23:23:23 PST
Just had the error editing the Italian version of ((How to make Firefox the default browser)).
Comment 12 Derek Moore [:dmoore] 2009-12-04 00:32:55 PST
Thanks, Simone. I've investigated and confirmed the problem. We'll continue to work on it.
Comment 13 Michele Rodaro [:michro] 2009-12-05 09:15:42 PST
Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; it; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5

Just had the error when I approved an edit I've made to the Italian version of ((Firefox consumes a lot of CPU resources)).

It happens even when I try to connect to the "all Knowledge Base articles" page
(after a refresh I can view the page).

Michele
Comment 14 Thomas Schwecherl 2009-12-12 07:02:38 PST
The situation has been unchanged for several weeks:
"Service Unavailable" error after every(!) approval or save of an article, and most times when I try to open https://support.mozilla.com/kb/all+Knowledge+Base+articles or https://support.mozilla.com/kb/Localization Dashboard

Mozilla/5.0 (X11; U; Linux i686; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5
Comment 15 Kadir Topal [:atopal] 2009-12-15 14:17:22 PST
I can reproduce, just saw this error when I opened this page:
https://support.mozilla.com/en-US/kb/All%20Knowledge%20Base%20articles
Not editing anything.
Comment 16 Derek Moore [:dmoore] 2009-12-17 15:13:27 PST
Thanks for the input, everyone. We're still tracking problems with pages which take a significant amount of time to load (such as the All Articles link, above).
Comment 17 Boersenfeger 2009-12-19 06:51:41 PST
The problem occurs on every SUMO page that I translated or changed in the last week, and there were a lot of them. I live in Germany; maybe the server in Amsterdam is responsible here? Sorry, my English is bad. Have a nice Christmas, all.
Comment 18 Vito Smolej 2010-01-07 09:55:48 PST
I get it pretty regularly (sl l10n, located in Munich). It has no negative effects, as far as I am concerned - to me, "it's just another process timing out on you".
Comment 19 matthew zeier [:mrz] 2010-01-07 10:48:56 PST
Vito, I dumbed down the health check more.  Want to let this bake a bit before rolling this to other sites.
Comment 20 matthew zeier [:mrz] 2010-01-13 21:05:32 PST
I'm going to call this baked and close it.  We'll use this same strategy for other sites.
Comment 21 Chris Ilias [:cilias] 2010-02-13 18:48:53 PST
We're getting reports of this still happening. Has this been rolled out?
Comment 22 matthew zeier [:mrz] 2010-02-13 19:54:30 PST
Yes, long time ago.  Still seeing it with sumo?
Comment 23 Vito Smolej 2010-02-13 22:46:40 PST
I can >>NEVER<< move SUMO material from the staging area to the knowledge base without seeing this. The process finishes OK; I just need to refresh the page.

It works for me, it's just a drag.
Comment 24 Boersenfeger 2010-02-14 02:32:04 PST
Nothing has changed here. See my first post, comment 17!
I see the "Service Unavailable
The service is temporarily unavailable. Please try again later." page every time I write or change a page on SUMO. I have to send the changed information again, then it works.
Comment 25 Underpass 2010-02-14 03:19:50 PST
Same here (Italy). It happens about 9 times out of 10.

Thought you're still working on this.
Comment 26 matthew zeier [:mrz] 2010-02-14 11:11:56 PST
> Thought you're still working on this.

After comment 19 we let it sit for a week, didn't hear any complaints, and assumed the issue was resolved.  No one's been working on it since, largely because this bug was marked resolved.

I'll re-open and we'll re-investigate this week.
Comment 27 Derek Moore [:dmoore] 2010-02-16 09:55:36 PST
Everyone,

We've made more changes to improve the proxy stability between Amsterdam and San Jose. We have also migrated some prior experimental changes on non-SSL sumo to all SSL-enabled sites.

Please update here if you are still experiencing problems. It is particularly helpful if you include the URL where you encounter a timeout and the timestamp of when it occurred.
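
For anyone who wants to collect the URL and timestamp in one go, a small Python helper along these lines could be used. This is only a sketch: `timed_report` is a hypothetical name, and the `fetch` argument is any callable that raises on failure.

```python
import time
from datetime import datetime, timezone
from typing import Callable


def timed_report(url: str, fetch: Callable[[str], object]) -> str:
    """Call fetch(url) and return a one-line report for the bug.

    `fetch` is any callable that raises on failure, for example
    urllib.request.urlopen.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    start = time.monotonic()
    try:
        fetch(url)
        outcome = "ok"
    except Exception as exc:  # timeouts, connection resets, HTTP errors
        outcome = f"error: {exc}"
    elapsed = time.monotonic() - start
    return f"{stamp} {url} {outcome} ({elapsed:.1f}s)"
```

Passing `urllib.request.urlopen` as `fetch` would time a real request; the resulting line contains the timestamp, the URL, the outcome and the elapsed time, and can be pasted into a comment here.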
Comment 28 Underpass 2010-02-16 10:03:28 PST
Hello,

This URL

https://support.mozilla.com/it/kb/all+Knowledge+Base+articles

always gives me "Connection reset"
Comment 29 Underpass 2010-02-16 23:26:55 PST
Just had this error when editing the article ((*Eliminare i cookie))
Comment 31 Boersenfeger 2010-02-17 09:30:29 PST
This Site https://support.mozilla.com/de/kb/Article+list?style_mode=inproduct says
Service Unavailable

The service is temporarily unavailable. Please try again later.
Comment 32 Derek Moore [:dmoore] 2010-02-17 09:42:50 PST
Thank you for the specific feedback, everyone. There are several problems here, both on the San Jose and the Amsterdam side, and the specific examples are very useful for tracking them down.
Comment 33 Guillermo López :willyaranda (probably SLOW response) 2010-02-17 09:47:16 PST
I'm getting the problem *every* time I edit, or try to approve a change. It's so frustrating.

Anyway, the slowness of the site is painful.
Comment 34 Boersenfeger 2010-02-17 10:10:49 PST
This page was changed here: https://support.mozilla.com/de/kb/*Das+Download-Fenster+%C3%B6ffnet+sich+nicht?bl=n
Same Problem ...
Comment 35 Michele Rodaro [:michro] 2010-02-18 02:29:51 PST
Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; it; rv:1.9.2) Gecko/20100115 Firefox/3.6

Always the same error.
Just had "Service Unavailable" saving the edits to the articles

https://support.mozilla.com/it/kb/*Improvvisa+apertura+di+molte+schede+o+finestre+in+Firefox

https://support.mozilla.com/it/kb/*Errore+nel+caricamento+di+un+sito+web

and every time I try to connect to https://support.mozilla.com/it/kb/all+Knowledge+Base+articles
Comment 36 David Tenser [:djst] 2010-02-18 05:23:51 PST
Same here, getting this problem every time I edit (from Sweden). What's the status of this bug? What does [baking] mean?
Comment 37 John Daggett (:jtd) 2010-02-18 05:33:17 PST
This was happening to me three hours ago, grrr....
Comment 38 David Tenser [:djst] 2010-02-18 05:55:49 PST
I can confirm comment 28, loading https://support.mozilla.com/it/kb/all+Knowledge+Base+articles always times out.

The other steps to reproduce require editing an article. Every time I save an edit, it times out.
Comment 39 Boersenfeger 2010-02-20 01:56:35 PST
https://support.mozilla.com/de/kb/*Firefox+als+Standardbrowser+festlegen+funktioniert+nicht?bl=n
and others; the problem persists.
Is it necessary to report every article where this problem occurs? It shows up on every article I have worked on!
(Bad English, sorry)
Comment 40 matthew zeier [:mrz] 2010-02-24 10:08:08 PST
oremj is going to poke at this today.  if we can't find a quick fix we'll turn down the GLB service and investigate more without impacting you.
Comment 41 Jeremy Orem [:oremj] 2010-02-26 11:53:44 PST
Let's see if moving to zeus magically fixes this.
Comment 42 Boersenfeger 2010-02-28 02:59:02 PST
(In reply to comment #41)
The problem is not fixed. I changed this article https://support.mozilla.com/de/kb/*Lesezeichen+zum+Internet+Explorer+exportieren?bl=n and it happened again.
Comment 43 Jeremy Orem [:oremj] 2010-03-01 01:08:40 PST
We haven't moved it yet. I'll comment when it has been moved.
Comment 44 Jeremy Orem [:oremj] 2010-03-05 11:00:21 PST
Looks like moving to zeus didn't fix the problem.
Comment 45 Jeremy Orem [:oremj] 2010-03-05 11:12:38 PST
Found the root of this problem! Anytime a page is created it sends out a ton of e-mails and zeus eventually just sees this as timing out.

If I hit the webhead directly it doesn't time out, but my current page create is up to 240 seconds with no end in sight. This also explains why we don't see the issue in stage, it doesn't send e-mail.

I'll reassign to support and it will be up to them to figure out how to send e-mail faster (most likely needs to be asynchronous).
Comment 46 Jeremy Orem [:oremj] 2010-03-05 11:14:24 PST
*** Bug 549961 has been marked as a duplicate of this bug. ***
Comment 47 Jeremy Orem [:oremj] 2010-03-05 11:16:41 PST
A quick workaround would be to turn off sending e-mails on page create until this is fixed.
Comment 48 Vito Smolej 2010-03-05 11:19:32 PST
re "Looks like moving to zeus didn't fix the problem." 

- it did not make it worse either. 

smo
Comment 49 Laura Thomson :laura 2010-03-09 08:36:57 PST
We can't turn off those emails.  Ways to fix:

- Flush user output and then send the emails (hiding problems from the user)
- Move the emails out of the process, sample code here
http://gearman.org/index.php?id=php_-_mail_queue
but requires gearman.  Jeremy, shall I file a separate bug for IT for gearman for SUMO?
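
The second option (moving the emails out of the request) can be sketched in a few lines of Python. This is not the Gearman code from the link above: an in-process `queue.Queue` and a thread stand in for the job server and the worker, and `send_mail`, `save_page` and the addresses are hypothetical placeholders.

```python
import queue
import threading

sent = []                      # stands in for the real mail transport
jobs = queue.Queue()           # stands in for the job server


def send_mail(recipient, page):
    """Placeholder for the slow notification code."""
    sent.append((recipient, page))


def worker():
    """Drain notification jobs outside the request path."""
    while True:
        job = jobs.get()
        if job is None:        # sentinel: shut down
            jobs.task_done()
            break
        send_mail(*job)
        jobs.task_done()


def save_page(page, watchers):
    """Request handler: store the edit, enqueue mail, return at once."""
    for recipient in watchers:
        jobs.put((recipient, page))
    return "saved"


t = threading.Thread(target=worker, daemon=True)
t.start()
status = save_page("tmztest", ["a@example.com", "b@example.com"])
jobs.join()                    # only the demo waits; a real request would not
jobs.put(None)
t.join()
print(status, len(sent))
```

The point of the design is that `save_page` returns as soon as the jobs are queued, so the time spent sending mail no longer counts against the proxy's timeout.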
Comment 50 James Socol [:jsocol, :james] 2010-03-09 09:45:21 PST
If IT is OK with it, it feels like Gearman is the way to go. Assigning to me, 1.5.3, Major. Any objections?
Comment 51 David Tenser [:djst] 2010-03-09 10:08:25 PST
(In reply to comment #50)
> Any objections?

Only cheerful excitement!
Comment 52 Jeremy Orem [:oremj] 2010-03-15 09:59:28 PDT
I'm okay with gearman, but I've heard rumors of the amo team switching to celery (http://ask.github.com/celery/). Should probably talk to them first.
Comment 53 James Socol [:jsocol, :james] 2010-03-16 10:17:54 PDT
(In reply to comment #52)
> I'm okay with gearman, but I've heard rumors of the amo team switching to
> celery (http://ask.github.com/celery/). Should probably talk to them first.

I asked in #amo and no one said they were moving. Also, I couldn't find PHP APIs (maybe I'm looking under the wrong names?) where Gearman has published APIs in a bunch of languages, including the PECL extension.
Comment 54 James Socol [:jsocol, :james] 2010-03-23 16:57:56 PDT
Created attachment 434416
patch, moves sending wiki edit notifications to a gearman worker

Sorry this took so long. I spent a long time trying to follow what Tiki does when sending these notifications, and I'm fairly sure I understand why it's so slow. However, I couldn't duplicate that functionality in a reasonable amount of time without just letting Tiki do it, itself.

So, this patch basically does a copy-paste of the contents of sendWikiEmailNotifications() (in webroot/lib/notification/notificationemaillib.php) and puts it in a worker (in scripts/gearman/notification.php). To make this work, the worker needs to cd to webroot and include tiki-setup.php, which means that the worker needs a full checkout of SUMO to work. (I apologize for that but after a week in the rabbit hole, I needed to solve this.)

This adds a new configuration option to webroot/db/local.php.dist, namely $gearman_servers, an array. Each member of $gearman_servers is an array with 'host' and 'port' vars.
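
As an illustration of the shape described above (a list of entries, each with a host and a port), here is a Python rendering. The real option is a PHP array in local.php.dist; the addresses below are made up, 4730 is used because it is gearmand's usual default port, and `server_list` is a hypothetical helper.

```python
# Hypothetical values; only the 'host' and 'port' keys come from the
# description of $gearman_servers.
gearman_servers = [
    {"host": "127.0.0.1", "port": 4730},
    {"host": "10.0.0.2", "port": 4730},
]


def server_list(servers):
    """Join entries into a 'host:port,host:port' string."""
    return ",".join(f"{s['host']}:{s['port']}" for s in servers)


print(server_list(gearman_servers))  # prints 127.0.0.1:4730,10.0.0.2:4730
```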

To run the worker, make sure gearmand is running, then just type
  php notification.php -d

The worker will start a daemon (the '-d' bit) and write its PID to scripts/gearman/etc/notification.pid.

Now, edit a page (I'll attach something in a second that helps not send mail). The email should all still get sent. If you kill the worker process, and edit a page, the notifications will get sent out the next time you start the worker.

This bit of code is shockingly deep. I can understand how it would slow down the response so much. If Jeremy was correct that this is the pain point, this patch will go a long way toward alleviating that.
Comment 55 James Socol [:jsocol, :james] 2010-03-23 17:00:45 PDT
Created attachment 434420
helpful testing patch

This helped me test the notifications on real data. Shutting down the mail services might also work, if you can find a way to clear the queue before restarting them. (I couldn't, frustratingly. Still working on that.)

All this does is patch webroot/lib/webmail/htmlMimeMail.php (which at least _some_ things use to do their sending) to write out some information to the file /tmp/mailq instead of sending mail. It let me test with real data.
Comment 56 Paul Craciunoiu [:paulc] 2010-03-23 20:46:53 PDT
Comment on attachment 434416
patch, moves sending wiki edit notifications to a gearman worker

Looks good. It was a lot easier than I expected. I did a diff of the sendWikiEmailNotifications() function contents just to make sure all that deep code stayed the same and that looks fine too.

And I saw useful output in /tmp/mailq, so your test file helped a lot! ;)


One thing I wasn't able to test is resuming. Perhaps I missed a step? Here's what I did:
* after seeing expected output in /tmp/mailq (in other words, things were working), I killed the worker and made another edit to the page.
* started the worker with |php notification.php -d| as before
* checked /tmp/mailq for an update

The update wasn't there. Maybe I'm missing something? An explicit "resume"? I'm cool with having another look at this if so.

However, emails get sent out of the main thread so for that purpose, r+.
Comment 57 James Socol [:jsocol, :james] 2010-03-25 19:03:38 PDT
r64841. Still talking to fox2mike in bug 551513 about getting it running.
Comment 58 Guillermo López :willyaranda (probably SLOW response) 2010-03-29 07:48:03 PDT
I'm getting this problem while adding a new translation every time with:

https://support.mozilla.com/tiki-edit_translation.php?locale=es&page=Firefox%20crashes%20when%20you%20exit%20it

And I don't think this could be caused by mailing, since this does not involve any mail…
Comment 59 James Socol [:jsocol, :james] 2010-03-29 23:59:55 PDT
Ran into a problem that I'd missed locally: occasionally instead of sending notifications, the worker will spew a bunch of HTML to the terminal and then die.

My theory is that the database server is closing the connection and the client is not trying to reopen it before sending queries. I need to verify this tomorrow but it's by far my best lead.

Assuming that's the case, what needs to happen: the DB connection needs to be made in the task function (see webroot/db/tiki-db.php for what happens). This probably means creating and destroying a tikilib object in the task as well. Quite possibly it also means putting that DB connection object into the global scope, and then creating all the objects that extend TikiLib. Joy.
Comment 60 James Socol [:jsocol, :james] 2010-03-30 17:31:56 PDT
Created attachment 436085
patch to patch 1, moves DB connecting/closing inside the function and adds debug output

This patch does a lot to the normal environment set up by tiki-setup.php:

1. Destroys a number of DB connection-dependent globals immediately after they're created.
2. Connects to the DB and recreates the globals when it needs to (when it receives a job).
3. Destroys the globals and disconnects again after the job is done.
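
The connect-per-job pattern in the three steps above can be sketched as follows. This is a Python illustration using SQLite, not the Tiki/PHP code in the patch; `handle_job`, the `sent` table and the temporary database path are assumptions made for the example.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def handle_job(db_path, page):
    """Open a connection for this job only, do the work, then close it."""
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS sent (page TEXT)")
        conn.execute("INSERT INTO sent (page) VALUES (?)", (page,))
        conn.commit()
        return conn.execute("SELECT COUNT(*) FROM sent").fetchone()[0]


# Each call connects and disconnects on its own, so a long idle period
# between jobs cannot leave the worker holding a stale connection.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
first = handle_job(path, "tmztest")
second = handle_job(path, "tmztest")
print(first, second)
```

The trade-off is a connection setup cost on every job, which is usually small next to the cost of sending the notifications themselves.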

It also adds a constant (DAEMON) that is TRUE if the script was run with the -d flag. It then uses that constant to control printing debug statements to the terminal (it only does if DAEMON is FALSE). The debugging statements are prettied-up for easy reading.
Comment 61 Laura Thomson :laura 2010-03-31 12:30:20 PDT
Comment on attachment 436085
patch to patch 1, moves DB connecting/closing inside the function and adds debug output

Code itself works (easier than I thought, too, yay!)...my only general comment is consider putting the debug info into an error log if you're running in DAEMON mode, as this may help to debug problems in prod down the track.
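
One way to follow that suggestion, sketched in Python rather than in the worker's PHP: send debug output to a log file when running as a daemon and to the terminal otherwise. `make_logger`, the logger names and the default `worker.log` path are hypothetical.

```python
import logging
import sys


def make_logger(daemon: bool, log_path: str = "worker.log") -> logging.Logger:
    """Log to a file in daemon mode, to the terminal otherwise."""
    logger = logging.getLogger(f"worker.{'daemon' if daemon else 'cli'}")
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()          # avoid duplicate handlers on re-use
    if daemon:
        handler = logging.FileHandler(log_path)
    else:
        handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


# Foreground run: debug statements go to the terminal.
log = make_logger(daemon=False)
log.debug("waiting for jobs")
```

With `daemon=True` the same calls would end up in the log file, so problems in production leave a trace even though nothing is printed.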
Comment 62 James Socol [:jsocol, :james] 2010-03-31 14:31:04 PDT
(In reply to comment #61)
> (From update of attachment 436085)
> Code itself works (easier than I thought, too, yay!)...my only general comment
> is consider putting the debug info into an error log if you're running in
> DAEMON mode, as this may help to debug problems in prod down the track.

That's a good idea, but I'm going to hold off for now as I don't know the best place/way to write out logs, and Shyam's not here today.

r65138.

Pinging the on-call to reboot the worker.
Comment 63 James Socol [:jsocol, :james] 2010-04-01 11:06:24 PDT
Backed out in r65201.

I've spent a week or so on this, and while I've made progress, I'm also heading down the rabbit hole, which is a bad way to allocate our resources right now.

This has been a problem for a while. It will probably be a problem for a little while longer, but it will be fixed in Kitsune, and it's better for us to focus on that now.

I'm not WONTFIXing this, but it also can't block 1.5.3 any more.
Comment 64 Guillermo López :willyaranda (probably SLOW response) 2010-04-01 14:32:55 PDT
I don't want to be arrogant, and maybe in the US you don't see this, but I think this problem is too big in the Europe datacenter to leave it for Kitsune to fix (how long until we can see that in production, BTW?).

I'm getting reports from people who try to help with SUMO translations, and everyone is having this issue while saving. And they lose their work while translating.

In my community only a few of us know about this problem and, since SUMO is by nature a wiki, we can't reach everyone who wants to help. I think a lot of people have lost time and effort while saving because of this bug.
Comment 65 James Socol [:jsocol, :james] 2010-04-01 19:30:01 PDT
We realize that this is a huge pain, especially for localizers. You should know that we didn't just back this out of 1.5.3 because it was frustrating or we didn't feel like working on it. We did it because it was basically eating up all of our resources and preventing other work from getting done, and it looked like it would continue to do so. Especially if, like comment 58 indicates, our approach wouldn't actually solve everything.

SUMOdev is a two-person team right now. If one of us is completely dedicated to one bug, the other person can't get code reviewed, check in, get feedback, etc--at least not quickly enough for it to be helpful. While that's fine for a day or so, this was eating up all of my time for a week with no end in sight. Essentially every day spent on this bug delays Kitsune by a day right now.

Ultimately, it was a decision of resource allocation and how to best serve the SUMO project and community as a whole. While fixing this bug would be a good use of our time, there are higher priorities right now.

For an overview of the rough plan for the next 6-9 months, you can look here:
* https://wiki.mozilla.org/Support/Kitsune_Milestones
* https://wiki.mozilla.org/Support/SUMOdev_Meeting_Notepad/2010_Q1#Mar_30.2C_2010

I'll try to write up a blog post about what we've been doing--I know it hasn't been highly visible (though check http://support-stage-new.mozilla.com/en/search hopefully sometime tomorrow) or seemed very important--it's just search results. But it's actually hugely important in laying the foundations for the new platform.
Comment 66 Vito Smolej 2010-04-01 22:57:31 PDT
As for me, "... this is a huge pain, especially for localizers ..." does not hold. As long as I can assume that "it's not a bug, it's a feature" (g), I'm doing fine (...press "reload" and continue). Unfortunately, it seems every new member of SUMO is bound to hit this pothole sooner or later...

Goes without saying, that your work is much appreciated.

regards

smo
Comment 67 Boersenfeger 2010-04-02 07:38:02 PDT
Sure, you did not have enough manpower. But new translators who lost their work are not amused and have left translation work on SUMO. Is that better? I've never lost my text on this error. Maybe you can warn translators that they should save their text before they click send! Have a nice Easter weekend, all.
Comment 68 [:Cww] 2010-04-02 10:56:32 PDT
As far as I know, the submit still works and no data is lost.  It just seems that way because the server gets so caught up processing the request that it throws an error message.  Would it help if we put a notice above the submit button saying that you may see an error, but that it's just because the server is working on your request and you don't have to resubmit anything?
Comment 69 Vito Smolej 2010-04-02 12:37:06 PDT
There ?may? be a situation, leading to loss of data, namely (just a scenario of what happened to me some time ago) that you get a fresh file, localize it and send it off >>without making a staging copy<< (i.e. translated to a beginner's mindset >>knowing<< there's something like a staging and a production copy). 

Checking for that would make sense in any case and probably keep some feathers unruffled. 

Re msgbox on timeout, something like "...in case you get XYZ message, press reload to continue" would make sense - it would indicate the problem is known, and help the user climb over it.
Comment 70 Kadir Topal [:atopal] 2010-04-02 13:33:18 PDT
Yes, I'd also suggest we inform users that we know about the problem and provide them with a workaround. Generally I'd also say that this problem is unacceptable, but given the situation, trying to fix it would mean spending resources on a dying piece of software. Instead we should try to bear with it just a little bit longer until this part of the KB is replaced by Django code as well.

I know this isn't really satisfying and being from Europe I hate that as much as you, but the alternative is even worse :/
Comment 71 David Tenser [:djst] 2010-04-20 02:22:24 PDT
We currently have 1.5 people working on SUMO web development (James x 1, Paul x 0.5) and James spent approximately a full week trying to nail this down (see comment 63). We can't justify having James or Paul spend another week or more going around in circles and further delaying the development of SUMO 2.0.

mrz: Is there *any* way this problem can at least be reduced with some sort of IT quickfix -- e.g. more DB mirrors, more RAM, faster servers, etc? If so, we really should look into that now so we can improve the situation.

This really hurts the usability of SUMO for anyone editing pages, and seems to be happening mostly in Europe (although I'm not 100% sure about that).
Comment 72 Kadir Topal [:atopal] 2010-04-20 03:36:29 PDT
Just wanted to add that it's not only editing anymore, I also got the same message when I used the search and when I replied to a forum thread, again from Europe.
Comment 73 matthew zeier [:mrz] 2010-04-20 09:35:22 PDT
> mrz: Is there *any* way this problem can at least be reduced with some sort of
> IT quickfix -- e.g. more DB mirrors, more RAM, faster servers, etc? If so, we
> really should look into that now so we can improve the situation.
> 
> This really hurts the usability of SUMO for anyone editing pages, and seems to
> be happening mostly in Europe (although I'm not 100% sure about that).

Wasn't aware this was still an issue!

It has to do with the proxy setup we have in Amsterdam.  US-based hits aren't having any problems.  The only quick fix is to disable the Amsterdam proxy and force everyone to come to the US.

Should we do that?
Comment 74 David Tenser [:djst] 2010-04-20 09:40:26 PDT
Could we try that to see if it's a net positive? Would this only affect logged in sessions? If so, it seems like the right thing to do (given my lack of insights about the potential downsides of such a change).
Comment 75 matthew zeier [:mrz] 2010-04-20 11:17:08 PDT
This will affect all users.  

I took Amsterdam out of the GLB pool.  Let me know if there are still issues (and if there are, please let me know what IP address your computer is getting for support.mozilla.com).
Comment 76 Tobias (:Tobbi) Markus 2010-04-20 14:11:39 PDT
(In reply to comment #75)
> This will affect all users.  
> 
> I took Amsterdam out of the GLB pool.  Let me know if there are still issues
> (and if there are, please let me know what IP address your computer is getting
> for support.mozilla.com).

There are definitely still issues with Service Unavailable:

support.mozilla.com has the IP 63.245.213.89 for me.
Comment 77 matthew zeier [:mrz] 2010-04-20 15:00:34 PDT
you probably have to wait for dns ttls to expire.
Comment 78 Jeremy Orem [:oremj] 2010-04-20 15:34:26 PDT
(In reply to comment #58)
> I'm getting this problem while adding a new translation every time with:
> 
> https://support.mozilla.com/tiki-edit_translation.php?locale=es&page=Firefox%20crashes%20when%20you%20exit%20it
> 
> And I don't think this could be caused by mailing, since this does not involve
> any mail…

I think this is bug 549961.
Comment 79 matthew zeier [:mrz] 2010-04-21 12:43:22 PDT
Still seeing those Amsterdam IPs (63.245.213.0/24)?
Comment 80 Kadir Topal [:atopal] 2010-04-21 14:08:01 PDT
What I get is: 63.245.213.89 not sure if that's Amsterdam as well
Comment 81 matthew zeier [:mrz] 2010-04-21 15:01:39 PDT
Anything in 63.245.213.0/24 is.  Are you on a unix-like machine?  Can you send me the output of "host -v developer.mozilla.org" ?
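
Whether a given address falls in the Amsterdam range from comment 79 (63.245.213.0/24) can be checked with Python's standard `ipaddress` module. The helper name `is_amsterdam` is made up for this sketch; the two sample addresses are ones that appear in this bug.

```python
import ipaddress

# Range given in comment 79 for the Amsterdam proxies.
AMSTERDAM = ipaddress.ip_network("63.245.213.0/24")


def is_amsterdam(addr: str) -> bool:
    """Return True if addr lies inside the Amsterdam /24."""
    return ipaddress.ip_address(addr) in AMSTERDAM


print(is_amsterdam("63.245.213.89"))    # address reported for support.mozilla.com
print(is_amsterdam("63.245.209.139"))   # address returned for devmo.glb.mozilla.net
```

The first call prints True and the second prints False, so a client that still resolves support.mozilla.com to 63.245.213.89 is still being sent to Amsterdam.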
Comment 82 Kadir Topal [:atopal] 2010-04-21 15:07:00 PDT
sure

Trying "developer.mozilla.org"
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13716
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;developer.mozilla.org.         IN      A

;; ANSWER SECTION:
developer.mozilla.org.  517     IN      CNAME   developer-mozilla-org.geo.mozilla.com.
developer-mozilla-org.geo.mozilla.com. 3517 IN CNAME devmo.glb.mozilla.net.
devmo.glb.mozilla.net.  97      IN      A       63.245.209.139

Received 141 bytes from 192.168.178.1#53 in 30 ms
Trying "devmo.glb.mozilla.net"
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59184
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;devmo.glb.mozilla.net.         IN      AAAA

;; AUTHORITY SECTION:
glb.mozilla.net.        300     IN      SOA     ns.mozilla.org. sysadmins.mozilla.org. 2010042100 10800 3600 604800 1800

Received 99 bytes from 192.168.178.1#53 in 43 ms
Trying "devmo.glb.mozilla.net"
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13479
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;devmo.glb.mozilla.net.         IN      MX

;; AUTHORITY SECTION:
glb.mozilla.net.        300     IN      SOA     ns.mozilla.org. sysadmins.mozilla.org. 2010042100 10800 3600 604800 1800
Comment 83 matthew zeier [:mrz] 2010-04-21 15:09:43 PDT
> developer-mozilla-org.geo.mozilla.com. 3517 IN CNAME devmo.glb.mozilla.net.
> devmo.glb.mozilla.net.  97      IN      A       63.245.209.139

209.139 is San Jose.  Something in your system is still returning the wrong address but DNS is working. 

Nothing in /etc/hosts right?
Comment 84 Kadir Topal [:atopal] 2010-04-22 00:33:57 PDT
I still see this (63.245.213.88), and caches should have expired by now. And no, nothing in /etc/hosts.
Comment 85 matthew zeier [:mrz] 2010-04-22 08:56:02 PDT
A bit of a disconnect - DNS is returning the correct results for you but you're not seeing the right address.  What tool are you using to get 213.88?
Comment 86 Kadir Topal [:atopal] 2010-04-23 11:11:41 PDT
the ping command
Comment 87 David Tenser [:djst] 2010-04-27 10:37:30 PDT
*Just* got a service unavailable error when posting a forum reply at https://support.mozilla.com/en-US/forum/3/656442. In Sweden, one minute ago.


Output:

Trying "developer.mozilla.org"
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 24996
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 2, ADDITIONAL: 2

;; QUESTION SECTION:
;developer.mozilla.org.		IN	A

;; ANSWER SECTION:
developer.mozilla.org.	159	IN	CNAME	developer-mozilla-org.geo.mozilla.com.
developer-mozilla-org.geo.mozilla.com. 2904 IN CNAME devmo.glb.mozilla.net.
devmo.glb.mozilla.net.	27	IN	A	63.245.209.139

;; AUTHORITY SECTION:
glb.mozilla.net.	294	IN	NS	ns4-glb.mozilla.net.
glb.mozilla.net.	294	IN	NS	ns1-glb.mozilla.net.

;; ADDITIONAL SECTION:
ns1-glb.mozilla.net.	167	IN	A	63.245.208.15
ns4-glb.mozilla.net.	167	IN	A	63.245.212.25

Received 217 bytes from 213.80.98.2#53 in 7 ms
Trying "devmo.glb.mozilla.net"
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 31179
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;devmo.glb.mozilla.net.		IN	AAAA

;; AUTHORITY SECTION:
glb.mozilla.net.	290	IN	SOA	ns.mozilla.org. sysadmins.mozilla.org. 2010042100 10800 3600 604800 1800

Received 99 bytes from 213.80.98.2#53 in 12 ms
Trying "devmo.glb.mozilla.net"
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28539
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;devmo.glb.mozilla.net.		IN	MX

;; AUTHORITY SECTION:
glb.mozilla.net.	290	IN	SOA	ns.mozilla.org. sysadmins.mozilla.org. 2010042100 10800 3600 604800 1800

Received 99 bytes from 213.80.98.2#53 in 12 ms
Comment 88 matthew zeier [:mrz] 2010-04-27 11:19:59 PDT
This seems more app related then - you're hitting San Jose on that last query.
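
The by-hand check in the last few comments (resolve the name, then see which site the address belongs to) could be sketched roughly as below. The prefix-to-site mapping is an assumption inferred only from addresses quoted in this bug (63.245.213.88/89 for Amsterdam, 63.245.209.139 for San Jose), and `site_for` and `resolve_sites` are hypothetical helper names. `socket.getaddrinfo` goes through the system resolver, the same path `ping` uses, so it also reflects /etc/hosts and local caches.

```python
import ipaddress
import socket

# Assumed mapping, inferred only from addresses quoted in this bug;
# the real allocation of these ranges may differ.
SITE_PREFIXES = {
    "Amsterdam": ipaddress.ip_network("63.245.213.0/24"),
    "San Jose": ipaddress.ip_network("63.245.209.0/24"),
}


def site_for(address):
    """Return the assumed site name for an IPv4 address, or "unknown"."""
    ip = ipaddress.ip_address(address)
    for site, network in SITE_PREFIXES.items():
        if ip in network:
            return site
    return "unknown"


def resolve_sites(hostname):
    """Resolve hostname via the system resolver and label each IPv4 address."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = sorted({info[4][0] for info in infos})
    return {address: site_for(address) for address in addresses}
```

Calling `resolve_sites("support.mozilla.com")` from an affected machine and pasting the result here would show at a glance whether that machine is still being sent to the Amsterdam addresses.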
Comment 89 Tobias (:Tobbi) Markus 2010-05-12 08:23:16 PDT
I'm still hitting Amsterdam (I'm located in Northern Germany). See my traceroute paste here:

Tracing route to support.mozilla.com [63.245.213.89] over a maximum of 30 hops:

  1     1 ms     1 ms     1 ms  192.168.1.1
  2    41 ms    41 ms    40 ms  217.0.119.143
  3    43 ms    43 ms    41 ms  217.0.86.6
  4    44 ms    46 ms    45 ms  hh-eb1-i.HH.DE.NET.DTAG.DE [62.154.32.230]
  5    44 ms    43 ms    44 ms  so-7-1.car2.Hamburg1.Level3.net [4.68.127.241]
  6    44 ms    43 ms    43 ms  ae-11-11.car1.Hamburg1.Level3.net [4.69.133.177]
  7    50 ms    50 ms    50 ms  ae-4-4.ebr1.Dusseldorf1.Level3.net [4.69.133.182]
  8    50 ms    50 ms    49 ms  ae-1-100.ebr2.Dusseldorf1.Level3.net [4.69.141.150]
  9    53 ms    53 ms    53 ms  ae-47-47.ebr1.Amsterdam1.Level3.net [4.69.143.205]
 10    54 ms    53 ms    53 ms  ae-12-51.car2.Amsterdam1.Level3.net [4.69.139.131]
 11    54 ms    54 ms    55 ms  212.72.43.14
 12    54 ms    54 ms    54 ms  92.60.240.130
 13    54 ms    54 ms    54 ms  sumo02.zlb.nl.mozilla.com [63.245.213.89]

Trace complete.
Comment 90 matthew zeier [:mrz] 2010-05-12 08:30:49 PDT
For the second time I've taken Amsterdam out of the GLB pool.  I don't know why it put itself back in but it's out now.
Comment 91 James Socol [:jsocol, :james] 2010-05-12 08:32:40 PDT
Everyone in Europe:

Please keep track of timeouts (they may appear as blank pages, "Service Unavailable" or "Server is not responding" errors) over the next few days (give the change a little window to propagate first). If the timeout rate is noticeably higher or lower, please let us know here.
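
For anyone who would rather count than estimate, here is a minimal sketch of one way to tally these outcomes. It is not an official tool: `classify`, `probe` and `failure_rate` are hypothetical names, and treating an empty 200 response as a "blank page" and any 503 as "Service Unavailable" is an assumption about how the errors look on the wire. It only issues GET requests, so it will not reproduce failures that happen only when saving an edit.

```python
import collections
import urllib.error
import urllib.request


def classify(status, body):
    """Map an HTTP status code and response body (bytes) to an outcome label."""
    if status == 503:
        return "service_unavailable"
    if status == 200 and not body.strip():
        return "blank_page"
    if status == 200:
        return "ok"
    return "other_error"


def probe(url, timeout=30):
    """Fetch url once with GET and return the outcome label."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return classify(response.status, response.read())
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses are raised as HTTPError; classify them too.
        return classify(error.code, error.read())
    except OSError:
        # Covers timeouts, refused connections and DNS failures.
        return "no_response"


def failure_rate(outcomes):
    """Return the fraction of outcomes that are not "ok" (0.0 for an empty list)."""
    counts = collections.Counter(outcomes)
    total = sum(counts.values())
    return 0.0 if total == 0 else 1 - counts["ok"] / total
```

Running `probe` against the same page a few dozen times per day and reporting `failure_rate` before and after the change gives a number to compare, rather than an impression.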
Comment 92 Kadir Topal [:atopal] 2010-05-31 06:43:03 PDT
Okay, I don't see Europe in the traceroute any more. However, SUMO is still giving me blank pages from time to time, though that's probably a different bug. Does anyone else still see "service unavailable" messages?
Comment 93 Pavel Cvrcek [:JasnaPaka] 2010-05-31 06:51:15 PDT
I worked with SUMO last week and I didn't see the "service unavailable" message, but many times I got a blank page (as Kadir said). My location: Czech Republic, Central Europe.
Comment 94 Boersenfeger 2010-05-31 08:34:32 PDT
(In reply to comment #93)
I have had both issues since my first post in this bug tracker. Every time I change an article on SUMO or make a translation and try to save it, this happens. When I look at the page that lists all articles, I often get this message: "Service Unavailable
The service is temporarily unavailable. Please try again later."
Comment 95 Thomas Schwecherl 2010-05-31 10:43:05 PDT
I don't see the "service unavailable" message any more - but a blank page instead (on large pages) or the request to download the page (after an action, e.g. saving a review). Location: Upper Austria.
Comment 96 David Tenser [:djst] 2010-06-01 01:04:27 PDT
Via James: "The error message changed and so it's been difficult to get an answer on whether it's better or not. The test is: do you see blank pages more or less often than you saw "Service Unavailable?" It's the same error, just without the AMS Zeus error message."

Thomas, that download dialog is bug 549961. We're trying to figure out what we can do about that. 

These two bugs are both results of the current Tiki-based SUMO not scaling to meet the increased load from more users, more contributors, and more KB articles. We're trying to strike a balance: most of our resources go to building the new Django-based SUMO, but bug 549961 in particular is enough of a problem right now that we need to spend cycles on fixing it before continuing with the next-gen SUMO. Stay tuned, and thanks everyone for your patience and understanding.
Comment 97 Kadir Topal [:atopal] 2010-06-01 11:58:41 PDT
We will increase the Apache timeout window on Thursday when SUMO is moved to its own hardware, so please report back if that fixes the blank page issues. The change is tracked in bug 569412.
Comment 98 Michele Rodaro [:michro] 2010-06-16 03:09:22 PDT
Hi from Italy,

I haven't gotten blank pages or the Save/Open file dialog (bug 549961) for the last two or three days when editing or approving articles.
I hope that this issue has been definitively fixed.
Comment 99 Boersenfeger 2010-07-08 10:46:42 PDT
In the last few days, I have read and changed nearly 20 articles for the German SUMO. Every time I click Save after a change, a white page appears; then I have to press F5 and confirm the resubmission prompt that follows. It still happens for me. :-(
