Bug 675426
Opened 14 years ago
Closed 14 years ago
Possible issues with http://www.mozilla.com/en-US/plugincheck/
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: robbie.bykowski, Assigned: dparsons)
User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:8.0a1) Gecko/20110729 Firefox/8.0a1
Build ID: 20110729030824
Steps to reproduce:
Go into the Add-ons Manager, click Plugins, and tell it to check for updates. The page that opens then begins checking for updates to plugins.
Actual results:
The page then fails, with no additional information.
Expected results:
Either give more information about the reason for the error, or there is something wrong with Firefox when it comes to plugin checking. In some ways it would be better if Firefox did this automatically and popped up information on the latest plugin.
Updated•14 years ago
Component: General → plugins.mozilla.org
Product: Firefox → Websites
QA Contact: general → plugins-mozilla-org
Version: 8 Branch → unspecified
Comment 1•14 years ago
Seems to work for me. Is this still a problem?
Comment 2•14 years ago
I've heard from Twitter users that the plugin check sometimes doesn't work, and I can confirm the issue: the plugins.mozilla.org server sometimes returns 500 Internal Server Error.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Updated•14 years ago
OS: Mac OS X → All
Hardware: x86 → All
Comment 3•14 years ago
The error happens roughly once every 10 times.
| Reporter | ||
Comment 4•14 years ago
Well, it was failing, but now it appears to be working.
Comment 5•14 years ago
I think the PFS server gets too many requests. See Bug 589060 for how they can be reduced to one.
Comment 6•14 years ago
(In reply to comment #2)
> I've heard from Twitter users that the plugin check sometimes doesn't work,
> and I can confirm the issue: the plugins.mozilla.org server sometimes
> returns 500 Internal Server Error.
Ozten, mrz: do you know if plugincheck is monitored by IT/Nagios? If not, it would be good to monitor it in case the service is not working.
Updated•14 years ago
Assignee: nobody → server-ops
Component: plugins.mozilla.org → Server Operations
Product: Websites → mozilla.org
QA Contact: plugins-mozilla-org → mrz
Version: unspecified → other
Comment 7•14 years ago
(In reply to comment #6)
>
> Ozten, mrz: do you know if plugincheck is monitored by IT/nagios - if not
> would be good to be monitored in case the service is not working or ?
Kohei, do you have the concrete URL (like https://www.mozilla.com/en-US/plugincheck/) where you get these 500 errors? IT will look into this.
Comment 8•14 years ago
Additional information like dates and times when the service "didn't work" and maybe screenshots will help too.
Thanks!
Also, since this is under mozilla.com, it's got a bunch of hardware behind it and shouldn't have any issues processing too many requests.
Assignee: server-ops → shyam
Status: NEW → UNCONFIRMED
Ever confirmed: false
Summary: plugin checker does not work → Possible issues with http://www.mozilla.com/en-US/plugincheck/
Comment 9•14 years ago
I'd look at plugins.mozilla.org first, as it seems to be erroring out a tonne these days. There's nothing reported beyond http://www.flickr.com/photos/moz_kev/5714576832/, but I'm going to bet that's where the issue is.
Comment 10•14 years ago
(In reply to comment #8)
> Additional information like dates and times when the service "didn't work"
> and maybe screenshots will help too.
Shyam: the plugincheck page makes several calls to plugins.mozilla.org, and if one times out the user is presented with an error. The error is not related to mozilla.com infra; IIRC, it happens whenever there are communication issues with plugins.mozilla.org (which is the reference server for the plugin lookups). I don't know what kind of resiliency/capacity is built into the boxen that serve up plugins, but I have been able to replicate the problems seen.
Hitting plugins.mozilla.org has resulted in multiple "Oops" error messages this morning, and that happens on a fairly regular basis over the last couple months.
Comment 11•14 years ago
(In reply to comment #6)
> Ozten, mrz: do you know if plugincheck is monitored by IT/nagios - if not
> would be good to be monitored in case the service is not working or ?
I think we need to monitor the PFS backend.
Stage:
https://plugins.stage.mozilla.com/pfs/v2?appID=%7Bec8030f7-c20a-464f-9b0e-13a3a9e97384%7D&appRelease=5.0&appVersion=20110615151330&clientOS=Linux+i686&chromeLocale=en&detection=original&mimetype=application%2Fx-shockwave-flash+application%2Ffuturesplash&callback=C
Production:
https://plugins.mozilla.org/pfs/v2?appID=%7Bec8030f7-c20a-464f-9b0e-13a3a9e97384%7D&appRelease=5.0&appVersion=20110615151330&clientOS=Linux+i686&chromeLocale=en&detection=original&mimetype=application%2Fx-shockwave-flash+application%2Ffuturesplash&callback=C
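For anyone wiring these URLs into a monitor, the query string can be rebuilt from its parts instead of pasted whole. A minimal Python sketch, assuming only the parameter names and values visible in the production URL above; `PFS_PARAMS` and `build_pfs_url` are names made up for illustration and are not part of the PFS code:

```python
from urllib.parse import urlencode

# Parameter values copied from the production URL above.
PFS_PARAMS = {
    "appID": "{ec8030f7-c20a-464f-9b0e-13a3a9e97384}",
    "appRelease": "5.0",
    "appVersion": "20110615151330",
    "clientOS": "Linux i686",
    "chromeLocale": "en",
    "detection": "original",
    "mimetype": "application/x-shockwave-flash application/futuresplash",
    "callback": "C",
}


def build_pfs_url(host, params=PFS_PARAMS):
    """Return the /pfs/v2 lookup URL for the given host (hypothetical helper)."""
    # urlencode percent-encodes braces and slashes and turns spaces into "+",
    # which matches the encoding used in the URLs quoted in this comment.
    return "https://{}/pfs/v2?{}".format(host, urlencode(params))
```

Passing `plugins.stage.mozilla.com` or `plugins.mozilla.org` as the host gives the stage and production variants, so a monitor can compare the two without keeping two long strings in sync.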
Comment 12•14 years ago
Rob,
Can we please have the production URL in comment #11 put into Nagios? You might want to compare strings or maybe just look for an HTTP error code. This should page on-call and display on IRC, so we can catch the error as it's happening.
(In reply to comment #10)
> Hitting plugins.mozilla.org has resulted in multiple "Oops" error messages
> this morning, and that happens on a fairly regular basis over the last
> couple months.
Kev, plugins.mozilla.org is backed by 23 servers in the AMO cluster in the San Jose Datacenter. That is quite a bit of hardware. Granted that the same hardware serves AMO as well, but if plugins.mozilla.org was generating quite a bit of traffic, we could be seeing issues because of that and vice-versa (heavy load on AMO can affect plugins).
We'll keep an eye on it for now and see what we can do.
Assignee: shyam → rtucker
Comment 13•14 years ago
Shyam,
I can add the check. Are we just looking for a normal 200, and to throw a critical page for a 500?
Comment 14•14 years ago
(In reply to comment #13)
> Shyam,
> I can add the check. Are we just looking for a normal 200, and to throw a
> critical page for a 500?
10-4. Check should obey redirects (if any) and finally get a 200. Critical on anything else.
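The check described here could look roughly like the following Python sketch. It assumes the standard Nagios plugin exit codes (0 for OK, 2 for CRITICAL); `classify_status` and `check_url` are hypothetical names, and this is not the check that was actually deployed.

```python
import urllib.error
import urllib.request

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def classify_status(status):
    """Map the final HTTP status to (exit_code, label).

    Only a 200 counts as OK; anything else, including no response
    at all (None), is CRITICAL.
    """
    if status == 200:
        return OK, "OK"
    return CRITICAL, "CRITICAL"


def check_url(url, timeout=10):
    """Fetch url and classify the final status.

    urllib follows redirects by default, so the status seen here is the
    one at the end of any redirect chain.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # 4xx/5xx responses are raised as HTTPError
    except (urllib.error.URLError, OSError):
        status = None  # timeouts, DNS failures, refused connections
    code, label = classify_status(status)
    return code, "{}: HTTP {} from {}".format(label, status, url)
```

Called with the production URL from comment #11, `check_url` returns the exit code for Nagios plus a one-line message to show on the page or in IRC.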
Comment 15•14 years ago
Shyam,
As discussed on IRC, I'm able to get the "Ouch Something went south..." error page repeatedly. This will make the check essentially useless. We'll want to get the underlying issue fixed, then I'll install the check.
Comment 16•14 years ago
I believe that I've isolated the issue to pm-app-amo07
Updated•14 years ago
Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment 17•14 years ago
(In reply to Rob Tucker from comment #16)
> I believe that I've isolated the issue to pm-app-amo07
Can you elaborate some more? I could remove pm-app-amo07 from the pool if that would solve the issue, but we should ideally investigate why this one host has an issue.
Status: NEW → UNCONFIRMED
Ever confirmed: false
Updated•14 years ago
Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment 18•14 years ago
Any updates on this? I'm still seeing issues from pm-app-amo07 and I cannot install the check until it's resolved.
Comment 19•14 years ago
I've disabled pm-app-amo07, I'll try and debug that in a bit.
Kev, plugins is only being backed by 5 of the AMO nodes, not all of them as I had originally mentioned. Still, if this was a node that kept failing, removing it should take away the error pages.
Also, if this is a service we really depend upon, it's time to look into moving this into a cluster of its own. Do we have traffic stats for this website?
Comment 20•14 years ago
(In reply to Shyam Mani [:fox2mike] from comment #19)
> Kev, plugins is only being backed by 5 of the AMO nodes, not all of them as
> I had originally mentioned. Still, if this was a node that kept failing,
> removing it should take away the error pages.
Fixing the root cause would be a good thing. I'd like to understand what amo07 is having problems with, and I'm betting the AMO team would too.
> Also, if this is a service we really depend upon, it's time to look into
> moving this into a cluster of its own. Do we have traffic stats for this
> website?
This server drives plugincheck, which has been actively marketed by the engagement team since 3.5. The plugincheck web page will be linked to from within the product starting with Firefox 6 in a bunch of different languages.
If it's not resilient, it should be :)
Comment 21•14 years ago
I've added checks to each of the individual systems. They are green. Closing this out, please let me know if there are any issues.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment 22•14 years ago
(In reply to Kev [:kev] Needham from comment #20)
> This server drives plugincheck, which has been actively marketed by the
> engagement team since 3.5. The plugincheck web page will be linked to from
> within the product starting with Firefox 6 in a bunch of different
> languages.
>
> If it's not resilient, it should be :)
Days before a release is the wrong time to hear that a piece of backend infrastructure will become part of the product. I don't know how we can change our culture such that IT gets a proper "heads up", but for this case we will just have to roll with the punches and see what we get.
oremj is on this bug; I'll poke him today about spreading the plugincheck around. We don't have any more room in sjc1 for this, but when AMO moves to phx1 (a matter of a week or two now?) we should be better off.
Comment 23•14 years ago
(In reply to Corey Shields [:cshields] from comment #22)
> (In reply to Kev [:kev] Needham from comment #20)
> > This server drives plugincheck, which has been actively marketed by the
> > engagement team since 3.5. The plugincheck web page will be linked to from
> > within the product starting with Firefox 6 in a bunch of different
> > languages.
> >
> > If it's not resilient, it should be :)
>
> Days before a release is the wrong time to hear that a piece of backend
> infrastructure will become part of the product. I don't know how we can
> change our culture such that IT gets a proper "heads up", but for this case
> we will just have to roll with the punches and see what we get.
Plugincheck is a released product, and it has been for a couple of years. We've done major marketing around it before, and I don't think you'll see a significant change in traffic to the system (you have to open the Add-ons Manager, then go to the plugins section, then click on the link that takes you there).
My comments around resiliency are centered mainly around my expectation that it would already be meeting needs, because it's been out there since 3.5. That said, the product change is definitely something that IT should be aware of. Who should be the point of contact for these kinds of changes? I'll ask Deb about ensuring IT is part of the product process to try and mitigate surprises.
> oremj is on this bug, I'll poke him today about spreading the plugincheck
> around. We don't have any more room in sjc1 for this, but when AMO moves to
> phx1 (a matter of a week or two now?) we should be better off.
Cool. Again, I don't think you'll see a significant impact, but that's an assumption only.
Thanks for taking care of this, folks. I haven't seen an "Oops" page in prod the last few days.
Comment 24•14 years ago
(In reply to Kev [:kev] Needham from comment #23)
> Thanks for taking care of this, folks. I haven't seen an "Oops" page in prod
> the last few days.
Only because the server is out of the pool.
Ozten, can you poke me on IRC when you have some time? The server that was pulled out is still 500'ing and I can't get it to spit logs :|
Assignee: rtucker → shyam
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 25•14 years ago
Hey folks,
We're still getting "Oops, something went South" errors, and QA has also reported seeing errors in bug 679812 that are probably related.
Can we get some traction on tracking down the root cause and addressing it? If there's additional info/people needed, please let me know and I'll do what I can to assist.
Severity: normal → critical
Updated•14 years ago
Assignee: shyam → dparsons
Comment 26•14 years ago
(In reply to Austin King [:ozten] from comment #11)
> Stage:
> https://plugins.stage.mozilla.com/pfs/v2?appID=%7Bec8030f7-c20a-464f-9b0e-13a3a9e97384%7D&appRelease=5.0&appVersion=20110615151330&clientOS=Linux+i686&chromeLocale=en&detection=original&mimetype=application%2Fx-shockwave-flash+application%2Ffuturesplash&callback=C
>
> Production:
> https://plugins.mozilla.org/pfs/v2?appID=%7Bec8030f7-c20a-464f-9b0e-13a3a9e97384%7D&appRelease=5.0&appVersion=20110615151330&clientOS=Linux+i686&chromeLocale=en&detection=original&mimetype=application%2Fx-shockwave-flash+application%2Ffuturesplash&callback=C
Can I get verification that 1) we want the output of these 2 to be exactly the same, 2) stage is still correct and prod is broken, and 3) any ideas as to why prod is returning differently?
In addition, knowing the backend server of any Oops page, along with date/time, would be REAL helpful here (you can find the server in the return header)
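For reports like this, the backend name can be pulled out of the response headers with a few lines of Python. This is only a sketch: the thread does not say which header carries the backend name, so `X-Backend-Server` below is an assumed placeholder, and `backend_from_headers` is a hypothetical helper.

```python
def backend_from_headers(headers, header_name="X-Backend-Server"):
    """Return the backend host named in the response headers, or None.

    header_name is an assumption; substitute whatever header the load
    balancer actually adds. The lookup is case-insensitive because HTTP
    header names are.
    """
    wanted = header_name.lower()
    for name, value in headers.items():
        if name.lower() == wanted:
            return value.strip()
    return None
```

Logging that value next to the response's `Date` header on every failed request would give the server and date/time asked for above.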
| Assignee | ||
Comment 27•14 years ago
This is now fixed. The problem was that memcache support in php wasn't functioning. The reason it wasn't functioning is here:
[root@pm-app-amo07 httpd]# php
PHP Warning: PHP Startup: memcache: Unable to initialize module
Module compiled with module API=20050922, debug=0, thread-safety=0
PHP compiled with module API=20060613, debug=0, thread-safety=0
These options need to match
To fix it, I did this:
pecl install memcache
and then restarted httpd.
I have no idea how amo07's memcache .so file got compiled against a different PHP API than the version installed. But, they match now and the errors have disappeared.
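To catch the same mismatch on other nodes, the startup warning can be parsed mechanically. A small Python sketch, assuming the warning keeps the wording shown above; `module_api_versions` and `api_mismatch` are illustrative names, not existing tools:

```python
import re

# Warning text as printed by php on pm-app-amo07 (copied from above).
WARNING = """PHP Warning: PHP Startup: memcache: Unable to initialize module
Module compiled with module API=20050922, debug=0, thread-safety=0
PHP compiled with module API=20060613, debug=0, thread-safety=0
These options need to match"""

_API_LINE = re.compile(r"^(Module|PHP) compiled with module API=(\d+)", re.MULTILINE)


def module_api_versions(php_output):
    """Return the API numbers found in the warning, keyed by "Module" and "PHP"."""
    return {who: int(api) for who, api in _API_LINE.findall(php_output)}


def api_mismatch(php_output):
    """True when both API numbers are present and they differ."""
    apis = module_api_versions(php_output)
    return "Module" in apis and "PHP" in apis and apis["Module"] != apis["PHP"]
```

For the warning above, the module reports API 20050922 while PHP itself reports 20060613, so `api_mismatch(WARNING)` is true. Output without the warning yields an empty dict and no mismatch.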
Extra info:
(1) An easy way to reproduce the error - execute this from pm-app-amo07:
curl -I -L -H "Host: plugins.mozilla.org" "http://localhost:81/pfs/v2?appID=%7Bec8030f7-c20a-464f-9b0e-13a3a9e97384%7D&appRelease=3.6.18&appVersion=20110614230723&clientOS=Windows+NT+6.0&chromeLocale=en-US&detection=original&mimetype=application%2Fsdp+application%2Fx-sdp+application%2Fx-rtsp+video%2Fquicktime+video%2Fflc+audio%2Fx-wav+audio%2Fwav&callback=C"
(2) To enable application logging for plugins.mozilla.org, change line 5 of /data/www/plugins.mozilla.org/application/config/config-local.php to say:
$config['core.log_threshold'] = 6;
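The curl command in (1) works because it talks to one backend directly while sending the public `Host` header. The same request can be built in Python for per-node checks. This is a sketch under assumptions: port 81 and the `plugins.mozilla.org` vhost are taken from the curl line, and `backend_request` is a hypothetical helper.

```python
import urllib.request


def backend_request(backend, path, vhost="plugins.mozilla.org", port=81):
    """Build a HEAD request aimed at one backend with Host set to the public vhost.

    This mirrors `curl -I -H "Host: ..."`: the connection goes to the
    backend itself, but the web server sees the public site name.
    """
    url = "http://{}:{}{}".format(backend, port, path)
    return urllib.request.Request(url, headers={"Host": vhost}, method="HEAD")
```

Passing the result to `urllib.request.urlopen` sends it. Looping over the pool members with the `/pfs/v2?...` path from (1) would show which node returns the 500.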
Status: REOPENED → RESOLVED
Closed: 14 years ago → 14 years ago
Resolution: --- → FIXED
| Assignee | ||
Comment 28•14 years ago
Additionally, Jason put pm-app-amo07 back in Zeus for plugins.mozilla.org.
Comment 29•14 years ago
(In reply to Dan Parsons [:lerxst] from comment #27)
> This is now fixed. The problem was that memcache support in php wasn't
> functioning. The reason it wasn't functioning is here:
>
> [root@pm-app-amo07 httpd]# php
> PHP Warning: PHP Startup: memcache: Unable to initialize module
> Module compiled with module API=20050922, debug=0, thread-safety=0
> PHP compiled with module API=20060613, debug=0, thread-safety=0
> These options need to match
>
> To fix it, I did this:
> pecl install memcache
Hrmm, we usually install stuff using RPMs. Was there a mismatched RPM on the machine? Something that didn't get upgraded with the rest of PHP on that machine?
| Assignee | ||
Comment 30•14 years ago
I also very much prefer to handle everything via packages wherever possible. However, I compared the respective RPM versions between amo07 and the other boxes and they were identical, except for the fact that amo07 was 64-bit and the others were 32-bit.
I've seen Red Hat-based PHP packages cause this problem before (CentOS & RHEL), and the final RPM-based "fix" usually involves reinstalling a lot of packages (PHP, PHP modules and sometimes Apache too), and I didn't want to risk whatever else was being served from that machine. Also, I was told that this machine would likely be replaced by something in phx1 soon. Additionally, there was some difficulty in completely removing this machine from Zeus (it was never completely removed except for a few minutes today).
I chose the pecl route because it seemed like the best balance of "fix it now" and "don't break anything else".
Comment 31•14 years ago
In this case we are fine: this is a quick fix to a system that will be phased out when AMO moves to phx1, and the scl3 portion will be built from scratch, leaving these nodes behind.
Thanks, Dan!
Updated•10 years ago
Product: mozilla.org → mozilla.org Graveyard