Closed Bug 675426 Opened 14 years ago Closed 14 years ago

Categories

(mozilla.org Graveyard :: Server Operations, task)

task
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: robbie.bykowski, Assigned: dparsons)

References

Details

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:8.0a1) Gecko/20110729 Firefox/8.0a1
Build ID: 20110729030824

Steps to reproduce:
Went into the add-on manager, clicked Plugins, and told it to check for updates. The page that opens then begins checking for plugin updates.

Actual results:
The page then fails, with no additional information.

Expected results:
Either give more information about the reason for the error, or there is something wrong with Firefox when it comes to plugin checking. In some ways it would be better if Firefox did this automatically and popped up with information on the latest plugin.
Component: General → plugins.mozilla.org
Product: Firefox → Websites
QA Contact: general → plugins-mozilla-org
Version: 8 Branch → unspecified
Seems to work for me. Is this still a problem?
I've heard from Twitter users that the plugin check sometimes doesn't work, and I can confirm the issue: the plugins.mozilla.org server sometimes returns 500 Internal Server Error.
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Mac OS X → All
Hardware: x86 → All
The error happens roughly once every 10 times.
Well, it was failing, but now it appears to be working.
I think the pfs server gets too many requests. See Bug 589060 for how the requests can be reduced to one.
(In reply to comment #2)
> I've heard from Twitter users that the plugin check sometimes doesn't work,
> and I can confirm the issue: the plugins.mozilla.org server sometimes
> returns 500 Internal Server Error.

Ozten, mrz: do you know if plugincheck is monitored by IT/nagios? If not, it would be good to have it monitored in case the service is not working or ?
Assignee: nobody → server-ops
Component: plugins.mozilla.org → Server Operations
Product: Websites → mozilla.org
QA Contact: plugins-mozilla-org → mrz
Version: unspecified → other
(In reply to comment #6)
> Ozten, mrz: do you know if plugincheck is monitored by IT/nagios - if not
> would be good to be monitored in case the service is not working or ?

Kohei, do you have the concrete URL, like https://www.mozilla.com/en-US/plugincheck/, where you get these 500 errors? IT will look into this.
Additional information like dates and times when the service "didn't work" and maybe screenshots will help too. Thanks! Also, since this is under mozilla.com, it's got a bunch of hardware behind it and shouldn't have any issues processing too many requests.
Assignee: server-ops → shyam
Status: NEW → UNCONFIRMED
Ever confirmed: false
Summary: plugin checker does not work → Possible issues with http://www.mozilla.com/en-US/plugincheck/
I'd look at plugins.mozilla.org first, as it seems to be erroring out a tonne these days. There's nothing reported beyond http://www.flickr.com/photos/moz_kev/5714576832/, but I'm going to bet that's where the issue is.
(In reply to comment #8)
> Additional information like dates and times when the service "didn't work"
> and maybe screenshots will help too.

Shyam: the plugincheck page makes several calls to plugins.mozilla.org, and if one of them times out the user is presented with an error. The error is not related to mozilla.com infra, iirc; it happens whenever there are communication issues with plugins.mozilla.org (which is the reference server for the plugin lookups). I don't know what kind of resiliency/capacity is built into the boxen that serve up plugins, but I have been able to replicate the problems seen. Hitting plugins.mozilla.org has resulted in multiple "Oops" error messages this morning, and that has happened on a fairly regular basis over the last couple of months.
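A rough way to see why a modest backend failure rate is so visible to users: if the page makes several lookups and any single failure produces the error page, the failures compound. The sketch below assumes the lookups fail independently; the per-call rate of 0.1 is borrowed from the "roughly once every 10 times" observation earlier in this bug, and the number of calls is a hypothetical value, since the thread only says "several".

```python
def page_failure_rate(per_call_failure, calls):
    """Chance that at least one of `calls` independent lookups fails.

    Assumes each lookup fails independently with the same probability,
    which is a simplification of how a load-balanced pool behaves.
    """
    return 1.0 - (1.0 - per_call_failure) ** calls


# Hypothetical numbers: 10% per-call failure, three lookups per page load.
rate = page_failure_rate(0.1, 3)
print(f"{rate:.1%} of page loads would show the error")
```

Under those assumptions more than a quarter of page loads would hit the error page, even though nine out of ten individual requests succeed.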
Rob, can we please have the production URL in comment #11 put into Nagios? You might want to compare strings or maybe just look for the HTTP error code. This should page oncall and display on IRC, so we can catch the error as it's happening.

(In reply to comment #10)
> Hitting plugins.mozilla.org has resulted in multiple "Oops" error messages
> this morning, and that happens on a fairly regular basis over the last
> couple months.

Kev, plugins.mozilla.org is backed by 23 servers in the AMO cluster in the San Jose Datacenter. That is quite a bit of hardware. Granted that the same hardware serves AMO as well, but if plugins.mozilla.org was generating quite a bit of traffic, we could be seeing issues because of that and vice-versa (heavy load on AMO can affect plugins). We'll keep an eye on it for now and see what we can do.
Assignee: shyam → rtucker
Shyam, I can add the check. Are we just looking for a normal 200, and to throw a critical page for a 500?
(In reply to comment #13)
> Shyam,
> I can add the check. Are we just looking for a normal 200, and to throw a
> critical page for a 500?

10-4. Check should obey redirects (if any) and finally get a 200. Critical on anything else.
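The rule agreed above (follow redirects, expect a final 200, critical on anything else) could be sketched roughly as follows. This is an illustration only, not the check that was actually installed: the function names are hypothetical, and the only external conventions relied on are the standard Nagios plugin exit codes (0 for OK, 2 for CRITICAL) and the fact that `urllib.request.urlopen` follows redirects by default.

```python
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def classify(final_status):
    """Map the final HTTP status (after redirects) to a Nagios exit code."""
    return OK if final_status == 200 else CRITICAL


def check_url(url, timeout=10):
    """Fetch url, following redirects, and return (exit_code, message)."""
    try:
        # urlopen follows redirects and raises HTTPError on 4xx/5xx responses.
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code
    except OSError as err:  # covers URLError, timeouts, connection failures
        return CRITICAL, f"CRITICAL - request failed: {err}"
    code = classify(status)
    label = "OK" if code == OK else "CRITICAL"
    return code, f"{label} - final HTTP status {status}"
```

A wrapper script would print the message and exit with the returned code, which is how a monitoring system would decide whether to page.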
Shyam, as discussed on IRC, I'm able to get the "Ouch Something went south..." error page repeatedly. This will make the check essentially useless. We'll want to get the underlying issue fixed, then I'll install the check.
I believe that I've isolated the issue to pm-app-amo07
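One way to arrive at a conclusion like this is to record which backend served each response and compare error rates per host. The sketch below is a guess at that kind of tally, not a description of what was actually done here; the observations and the host names other than pm-app-amo07 are made up for illustration.

```python
from collections import Counter


def error_rate_by_backend(observations):
    """observations: iterable of (backend_name, http_status) pairs."""
    total, errors = Counter(), Counter()
    for backend, status in observations:
        total[backend] += 1
        if status >= 500:
            errors[backend] += 1
    return {backend: errors[backend] / total[backend] for backend in total}


def suspects(observations, threshold=0.5):
    """Backends whose share of 5xx responses is at or above threshold."""
    rates = error_rate_by_backend(observations)
    return sorted(b for b, rate in rates.items() if rate >= threshold)


# Hypothetical sample; only pm-app-amo07 is a host name from this bug.
sample = [
    ("pm-app-amo07", 500), ("backend-a", 200), ("backend-b", 200),
    ("pm-app-amo07", 500), ("backend-a", 200), ("backend-b", 200),
]
print(suspects(sample))
```

If every failure maps to one host while its peers stay clean, the problem is more likely local configuration on that host than overall load.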
Status: UNCONFIRMED → NEW
Ever confirmed: true
(In reply to Rob Tucker from comment #16)
> I believe that I've isolated the issue to pm-app-amo07

Can you elaborate some more? I could remove pm-app-amo07 from the pool if that would solve the issue, but we should ideally investigate why this one host has an issue.
Status: NEW → UNCONFIRMED
Ever confirmed: false
Status: UNCONFIRMED → NEW
Ever confirmed: true
Any updates on this? I'm still seeing issues from pm-app-amo07 and I cannot install the check until it's resolved.
I've disabled pm-app-amo07, I'll try and debug that in a bit. Kev, plugins is only being backed by 5 of the AMO nodes, not all of them as I had originally mentioned. Still, if this was a node that kept failing, removing it should take away the error pages. Also, if this is a service we really depend upon, it's time to look into moving this into a cluster of its own. Do we have traffic stats for this website?
(In reply to Shyam Mani [:fox2mike] from comment #19)
> Kev, plugins is only being backed by 5 of the AMO nodes, not all of them as
> I had originally mentioned. Still, if this was a node that kept failing,
> removing it should take away the error pages.

Fixing the root cause would be a good thing. I'd like to understand what amo07 is having problems with, and I'm betting the AMO team would too.

> Also, if this is a service we really depend upon, it's time to look into
> moving this into a cluster of its own. Do we have traffic stats for this
> website?

This server drives plugincheck, which has been actively marketed by the engagement team since 3.5. The plugincheck web page will be linked to from within the product starting with Firefox 6 in a bunch of different languages. If it's not resilient, it should be :)
I've added checks to each of the individual systems. They are green. Closing this out, please let me know if there are any issues.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
(In reply to Kev [:kev] Needham from comment #20)
> This server drives plugincheck, which has been actively marketed by the
> engagement team since 3.5. The plugincheck web page will be linked to from
> within the product starting with Firefox 6 in a bunch of different
> languages.
>
> If it's not resilient, it should be :)

Days before a release is the wrong time to hear that a piece of backend infrastructure will become part of the product. I don't know how we can change our culture such that IT gets a proper "heads up", but for this case we will just have to roll with the punches and see what we get.

oremj is on this bug, I'll poke him today about spreading the plugincheck around. We don't have any more room in sjc1 for this, but when AMO moves to phx1 (a matter of a week or two now?) we should be better off.
(In reply to Corey Shields [:cshields] from comment #22)
> (In reply to Kev [:kev] Needham from comment #20)
> > This server drives plugincheck, which has been actively marketed by the
> > engagement team since 3.5. The plugincheck web page will be linked to from
> > within the product starting with Firefox 6 in a bunch of different
> > languages.
> >
> > If it's not resilient, it should be :)
>
> Days before a release is the wrong time to hear that a piece of backend
> infrastructure will become part of the product. I don't know how we can
> change our culture such that IT gets a proper "heads up", but for this case
> we will just have to roll with the punches and see what we get.

Plugincheck is a released product, and it has been for a couple of years. We've done major marketing around it before, and I don't think you'll see a significant change in traffic to the system (you have to open the addons manager, then go to the plugins section, then click on the link that takes you there). My comments around resiliency are centered mainly around my expectation that it would already be meeting needs, because it's been out there since 3.5. That said, the product change is definitely something that IT should be aware of. Who should be point for these kinds of changes? I'll ask Deb about ensuring IT is part of the product process to try and mitigate surprises.

> oremj is on this bug, I'll poke him today about spreading the plugincheck
> around. We don't have any more room in sjc1 for this, but when AMO moves to
> phx1 (a matter of a week or two now?) we should be better off.

Cool. Again, I don't think you'll see a significant impact, but that's an assumption only. Thanks for taking care of this, folks. I haven't seen an "Oops" page in prod the last few days.
(In reply to Kev [:kev] Needham from comment #23)
> Thanks for taking care of this, folks. I haven't seen an "Oops" page in prod
> the last few days.

Only because the server is out of the pool. Ozten, can you poke me on IRC when you have some time? The server that was pulled out is still 500'ing and I can't get it to spit logs :|
Assignee: rtucker → shyam
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Hey folks, We're still getting "Oops, something went South" errors, and QA has also reported seeing errors with bug 679812 that is probably related. Can we get some traction on tracking down the root cause and addressing? If there's additional info/people needed, please let me know and I'll do what I can to assist.
Severity: normal → critical
Assignee: shyam → dparsons
(In reply to Austin King [:ozten] from comment #11)
> Stage:
> https://plugins.stage.mozilla.com/pfs/v2?appID=%7Bec8030f7-c20a-464f-9b0e-13a3a9e97384%7D&appRelease=5.0&appVersion=20110615151330&clientOS=Linux+i686&chromeLocale=en&detection=original&mimetype=application%2Fx-shockwave-flash+application%2Ffuturesplash&callback=C
>
> Production:
> https://plugins.mozilla.org/pfs/v2?appID=%7Bec8030f7-c20a-464f-9b0e-13a3a9e97384%7D&appRelease=5.0&appVersion=20110615151330&clientOS=Linux+i686&chromeLocale=en&detection=original&mimetype=application%2Fx-shockwave-flash+application%2Ffuturesplash&callback=C

Can I get verification that 1) we want the output of these 2 to be exactly the same, 2) stage is still correct and prod is broken, and 3) any ideas as to why prod is returning differently?

In addition, knowing the backend server of any Oops page, along with date/time, would be REAL helpful here (you can find the server in the return header).
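For anyone comparing the two endpoints, the quoted URLs differ only in host name, so both can be generated from one parameter set. The sketch below rebuilds them with the standard library; the parameter values are copied from the URLs quoted above, while the helper name `pfs_url` is just an illustrative choice.

```python
from urllib.parse import urlencode

# Query parameters taken from the stage/production URLs quoted in this bug.
PFS_PARAMS = {
    "appID": "{ec8030f7-c20a-464f-9b0e-13a3a9e97384}",
    "appRelease": "5.0",
    "appVersion": "20110615151330",
    "clientOS": "Linux i686",
    "chromeLocale": "en",
    "detection": "original",
    "mimetype": "application/x-shockwave-flash application/futuresplash",
    "callback": "C",
}


def pfs_url(host, params=PFS_PARAMS):
    """Build the /pfs/v2 lookup URL for the given host."""
    return f"https://{host}/pfs/v2?{urlencode(params)}"


STAGE_URL = pfs_url("plugins.stage.mozilla.com")
PROD_URL = pfs_url("plugins.mozilla.org")
print(STAGE_URL)
print(PROD_URL)
```

Fetching both and diffing the response bodies would answer the first two questions; which response header names the backend is not stated in this bug, so that part is left out.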
This is now fixed. The problem was that memcache support in php wasn't functioning. The reason it wasn't functioning is here:

[root@pm-app-amo07 httpd]# php
PHP Warning: PHP Startup: memcache: Unable to initialize module
Module compiled with module API=20050922, debug=0, thread-safety=0
PHP compiled with module API=20060613, debug=0, thread-safety=0
These options need to match

To fix it, I did this:
pecl install memcache
and then restarted httpd.

I have no idea how amo07's memcache .so file got compiled against a different PHP API than the version installed. But, they match now and the errors have disappeared.

Extra info:

(1) An easy way to reproduce the error - execute this from pm-app-amo07:
curl -I -L -H "Host: plugins.mozilla.org" "http://localhost:81/pfs/v2?appID=%7Bec8030f7-c20a-464f-9b0e-13a3a9e97384%7D&appRelease=3.6.18&appVersion=20110614230723&clientOS=Windows+NT+6.0&chromeLocale=en-US&detection=original&mimetype=application%2Fsdp+application%2Fx-sdp+application%2Fx-rtsp+video%2Fquicktime+video%2Fflc+audio%2Fx-wav+audio%2Fwav&callback=C"

(2) To enable application logging for plugins.mozilla.org, change line 5 of /data/www/plugins.mozilla.org/application/config/config-local.php to say:
$config['core.log_threshold'] = 6;
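The warning above carries both API numbers, so the same mismatch could be spotted on other hosts by scanning captured `php` startup output. The following is a minimal sketch of that idea; it assumes the warning text has the same wording as the output pasted in this comment, and the function names are hypothetical.

```python
import re

# Matches the two "compiled with module API=N" lines of the PHP warning.
API_RE = re.compile(r"(Module|PHP)\s+compiled with module API=(\d+)")


def parse_api_versions(warning_text):
    """Return (module_api, php_api); either is None if not found."""
    found = {who: int(api) for who, api in API_RE.findall(warning_text)}
    return found.get("Module"), found.get("PHP")


def api_mismatch(warning_text):
    """True when both API numbers are present and differ."""
    module_api, php_api = parse_api_versions(warning_text)
    return None not in (module_api, php_api) and module_api != php_api


# Warning text as pasted in this bug.
WARNING = """PHP Warning: PHP Startup: memcache: Unable to initialize module
Module compiled with module API=20050922, debug=0, thread-safety=0
PHP compiled with module API=20060613, debug=0, thread-safety=0
These options need to match"""

print(parse_api_versions(WARNING), api_mismatch(WARNING))
```

Running this against startup output collected from each node in the pool would flag any other host whose extension was built against a different PHP API.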
Status: REOPENED → RESOLVED
Closed: 14 years ago → 14 years ago
Resolution: --- → FIXED
Additionally, jason put pm-app-amo07 back in zeus for plugins.mozilla.org
(In reply to Dan Parsons [:lerxst] from comment #27)
> This is now fixed. The problem was that memcache support in php wasn't
> functioning. The reason it wasn't functioning is here:
>
> [root@pm-app-amo07 httpd]# php
> PHP Warning: PHP Startup: memcache: Unable to initialize module
> Module compiled with module API=20050922, debug=0, thread-safety=0
> PHP compiled with module API=20060613, debug=0, thread-safety=0
> These options need to match
>
> To fix it, I did this:
> pecl install memcache

Hrmm, we usually install stuff using rpms, was there a mismatched rpm on the machine? Something that didn't get upgraded with the rest of php on that machine?
I also very much prefer to handle everything via packages wherever possible, however I compared the respective RPM versions between amo07 and the other boxes and they were identical, except for the fact that amo07 was 64-bit and the others were 32-bit. I've seen RedHat-based PHP packages cause this problem before (CentOS & RHEL), and the final RPM-based "fix" usually involves reinstalling a lot of packages (php, php modules and sometimes apache too) and I didn't want to risk whatever else was being served from that machine. Also, I was told that this machine would likely be replaced by something in phx1 soon. Additionally, there was some difficulty in completely removing this machine from zeus (it was not ever completely removed except for a few minutes today). I chose the pecl route because it seemed like the best balance of "fix it now" and "don't break anything else".
In this case we are fine - this is a quick fix to a system that will be phased out with AMO moving to phx1 and the scl3 portion will be built from scratch, leaving these nodes behind. thanks Dan!
Product: mozilla.org → mozilla.org Graveyard