Open Bug 798756 Opened 8 years ago Updated 8 months ago

plugins.mozilla.org and kohana

Categories

(Websites :: plugins.mozilla.org, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

People

(Reporter: ericz, Assigned: laura)

References

Details

Attachments

(2 files)

plugins[5-6].webapp.phx1 flapped with these alerts:

[00:17] < nagios-phx1> | Sat 00:17:19 PDT [171] plugins6.webapp.phx1.mozilla.com:https - port 81 plugins.m.o is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1007 bytes in 0.037 second response time
[01:40] < nagios-phx1> | Sat 01:40:26 PDT [183] plugins5.webapp.phx1.mozilla.com:https - port 81 plugins.m.o is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1007 bytes in 0.047 second response time

Trying it in a browser gives this error:

An error occurred during a connection to plugins5.webapp.phx1.mozilla.com:81.
SSL received a record that exceeded the maximum permissible length.
(Error code: ssl_error_rx_record_too_long)
plugins4.webapp.phx1.mozilla.com just alerted the same way.
Summary: plugins[5-6].webapp.phx1 SSL received a record that exceeded the maximum permissible length → plugins[4-6].webapp.phx1 SSL received a record that exceeded the maximum permissible length
plugins.mozilla.org is down now (was working before) even though all of the individual plugins alerts above cleared (and the site was working a few minutes ago).
Severity: normal → critical
More specifically https://plugins.mozilla.org/en-us says only:

Ouch
Something went south...
And now it is back up again.  Lowering severity.
Severity: critical → normal
Summary: plugins[4-6].webapp.phx1 SSL received a record that exceeded the maximum permissible length → plugins.mozilla.org briefly down, SSL alerts
Noticed this in the logs on each of the plugins[3-6]:

[Sun Sep 30 03:33:01 2012] [warn] WARNING: Attempt to change ServerLimit ignored during restart

Not sure what that means but after manually restarting httpd it doesn't show.
Looks like we have increased traffic, and we probably need to add additional nodes.
I am also seeing a lot of these in my the plugins db nodes:

[root@plugins2.db.phx1 ~]# tail -n2  /var/log/mysql/mysql.err 
121006  6:23:41 [Warning] Too many connections
121006  6:23:41 [Warning] Too many connections

Can we raise the connection limit or do we need to stand up additional slaves?
See Also: → 787663
max_connections is 1200, which is pretty high, but we can raise it.
updated max_connections to 5000 (from 1200, the default for us) in puppet and on the machines. plugins1, the rw/master, is OK, only about 200 connections, but immediately the slave went from 1200 connections to 2779 connections....looking to see if there's a smoking gun wrt a query.
tons of these (over 1200):

1666072743      plugins_user    10.8.70.202:22948       plugins_mozilla_org     Query   0       Opening tables  SHOW COLUMNS FROM `plugins`        0       0       1
1666072744      plugins_user    10.8.70.202:22949       plugins_mozilla_org     Query   0       Opening tables  SHOW COLUMNS FROM `plugins`        0       0       1
1666072745      plugins_user    10.8.70.202:22950       plugins_mozilla_org     Query   0       Opening tables  SHOW COLUMNS FROM `plugins`        0       0       1
[root@plugins2.db.phx1 ~]# mysql -e "show full processlist" | grep SHOW.COLUMNS.FROM..plugins | wc -l
1223
The SHOW COLUMNS btw go away right away, so it's not anything locking/killing things.
The max time any query is taking is 6 seconds....but we're still seeing the SHOW COLUMNS FROM plugins query which is odd...especially in conjunction with temporary tables (this is on plugins2):



mysql> select * from processlist order by TIME DESC limit 10\G
*************************** 1. row ***************************
           ID: 1
         USER: system user
         HOST: 
           DB: NULL
      COMMAND: Connect
         TIME: 14323605
        STATE: Waiting for master to send event
         INFO: NULL
      TIME_MS: 14323604575
    ROWS_SENT: 0
ROWS_EXAMINED: 0
    ROWS_READ: 1
*************************** 2. row ***************************
           ID: 3
         USER: repl
         HOST: 10.8.70.153:34773
           DB: NULL
      COMMAND: Binlog Dump
         TIME: 14323589
        STATE: Master has sent all binlog to slave; waiting for binlog to be up
         INFO: NULL
      TIME_MS: 14323588958
    ROWS_SENT: 0
ROWS_EXAMINED: 0
    ROWS_READ: 2
*************************** 3. row ***************************
           ID: 2
         USER: system user
         HOST: 
           DB: NULL
      COMMAND: Connect
         TIME: 1422
        STATE: Slave has read all relay log; waiting for the slave I/O thread t
         INFO: NULL
      TIME_MS: 1421624
    ROWS_SENT: 0
ROWS_EXAMINED: 0
    ROWS_READ: 2
*************************** 4. row ***************************
           ID: 1666361804
         USER: plugins_user
         HOST: 10.8.70.202:12580
           DB: plugins_mozilla_org
      COMMAND: Query
         TIME: 9
        STATE: removing tmp table
         INFO: SHOW COLUMNS FROM `platforms`
      TIME_MS: 8346
    ROWS_SENT: 5
ROWS_EXAMINED: 5
    ROWS_READ: 6
*************************** 5. row ***************************
           ID: 1666363653
         USER: plugins_user
         HOST: 10.8.70.202:15505
           DB: plugins_mozilla_org
      COMMAND: Query
         TIME: 8
        STATE: removing tmp table
         INFO: SHOW COLUMNS FROM `plugins`
      TIME_MS: 8064
    ROWS_SENT: 12
ROWS_EXAMINED: 12
    ROWS_READ: 13
*************************** 6. row ***************************
           ID: 1666362557
         USER: plugins_user
         HOST: 10.8.70.202:13741
           DB: plugins_mozilla_org
      COMMAND: Query
         TIME: 8
        STATE: removing tmp table
         INFO: SHOW COLUMNS FROM `platforms`
      TIME_MS: 8014
    ROWS_SENT: 5
ROWS_EXAMINED: 5
    ROWS_READ: 6
*************************** 7. row ***************************
           ID: 1666360426
         USER: plugins_user
         HOST: 10.8.70.202:10488
           DB: plugins_mozilla_org
      COMMAND: Query
         TIME: 8
        STATE: removing tmp table
         INFO: SHOW COLUMNS FROM `plugin_releases`
      TIME_MS: 8047
    ROWS_SENT: 29
ROWS_EXAMINED: 29
    ROWS_READ: 30
*************************** 8. row ***************************
           ID: 1666359693
         USER: plugins_user
         HOST: 10.8.70.202:9307
           DB: plugins_mozilla_org
      COMMAND: Query
         TIME: 8
        STATE: removing tmp table
         INFO: SELECT `plugins`.`id` AS `plugin_id`, `plugins`.`pfs_id` AS `plugin_pfs_id`, `plugins`.`original_plugin_id` AS `plugin_original_plugin_id`, `plugins`.`sandbox_profile_id` AS `plugin_sandbox_profile_id`, `plugins`.`name` AS `plugin_name`, `plugins`.`description` AS `plugin_description`, `plugins`.`vendor` AS `plugin_vendor`, `plugins`.`url` AS `plugin_url`, `plugins`.`icon_url` AS `plugin_icon_url`, `plugins`.`license_url` AS `plugin_license_url`, `plugins`.`modified` AS `plugin_modified`, `plugins`.`created` AS `plugin_created`, `plugin_releases`.`id` AS `pluginrelease_id`, `plugin_releases`.`plugin_id` AS `pluginrelease_plugin_id`, `plugin_releases`.`os_id` AS `pluginrelease_os_id`, `plugin_releases`.`platform_id` AS `pluginrelease_platform_id`, `plugin_releases`.`status_code` AS `pluginrelease_status_code`, `plugin_releases`.`name` AS `pluginrelease_name`, `plugin_releases`.`description` AS `pluginrelease_description`, `plugin_releases`.`vendor` AS `pluginrelease_vendor`, `plugin_releases`.`url` AS `pluginrelease_url`, `plugin_releases`.`icon_url` AS `pluginrelease_icon_url`, `plugin_releases`.`vulnerability_description` AS `pluginrelease_vulnerability_description`, `plugin_releases`.`vulnerability_url` AS `pluginrelease_vulnerability_url`, `plugin_releases`.`guid` AS `pluginrelease_guid`, `plugin_releases`.`filename` AS `pluginrelease_filename`, `plugin_releases`.`version` AS `pluginrelease_version`, `plugin_releases`.`detected_version` AS `pluginrelease_detected_version`, `plugin_releases`.`detection_type` AS `pluginrelease_detection_type`, `plugin_releases`.`xpi_location` AS `pluginrelease_xpi_location`, `plugin_releases`.`installer_location` AS `pluginrelease_installer_location`, `plugin_releases`.`installer_hash` AS `pluginrelease_installer_hash`, `plugin_releases`.`installer_shows_ui` AS `pluginrelease_installer_shows_ui`, `plugin_releases`.`manual_installation_url` AS `pluginrelease_manual_installation_url`, `plugin_releases`.`license_url` AS `pluginrelease_license_url`, `plugin_releases`.`needs_restart` AS `pluginrelease_needs_restart`, `plugin_releases`.`min` AS `pluginrelease_min`, `plugin_releases`.`max` AS `pluginrelease_max`, `plugin_releases`.`xpcomabi` AS `pluginrelease_xpcomabi`, `plugin_releases`.`modified` AS `pluginrelease_modified`, `plugin_releases`.`created` AS `pluginrelease_created`, `oses`.`id` AS `os_id`, `oses`.`name` AS `os_name`, `platforms`.`id` AS `platform_id`, `platforms`.`app_id` AS `platform_app_id`, `platforms`.`app_release` AS `platform_app_release`, `platforms`.`app_version` AS `platform_app_version`, `platforms`.`locale` AS `platform_locale`
FROM (`plugins`)
JOIN `plugin_releases` ON (`plugin_releases`.`plugin_id` = `plugins`.`id`)
JOIN `oses` ON (`oses`.`id` = `plugin_releases`.`os_id`)
JOIN `platforms` ON (`platforms`.`id` = `plugin_releases`.`platform_id`)
JOIN `mimes_plugins` ON (`mimes_plugins`.`plugin_id` = `plugins`.`id`)
JOIN `mimes` ON (`mimes_plugins`.`mime_id` = `mimes`.`id`)
WHERE plugins.sandbox_profile_id IS NULL
AND `detection_type` IN ('version_available','*')
AND `oses`.`name` IN ('windows nt 6.1','win','*')
AND `platforms`.`app_id` IN ('{ec8030f7-c20a-464f-9b0e-13a3a9e97384}','*')
AND `platforms`.`app_version` IN (20120905151427,'*')
AND `platforms`.`app_release` IN ('15.0.1','*')
AND `platforms`.`locale` IN ('en-US','*')
AND `mimes`.`name` IN ('application/x-wlpg3-detect','application/x-wlpg-detect')
GROUP BY `plugin_releases`.`id`
ORDER BY `plugin_releases`.`version` DESC, `oses`.`name` DESC, `platforms`.`locale` DESC, `platforms`.`app_id` DESC, `platforms`.`app_version` DESC, `platforms`.`app_release` DESC
      TIME_MS: 7803
    ROWS_SENT: 0
ROWS_EXAMINED: 6
    ROWS_READ: 1
*************************** 9. row ***************************
           ID: 1666363983
         USER: plugins_user
         HOST: 10.8.70.202:15974
           DB: plugins_mozilla_org
      COMMAND: Query
         TIME: 7
        STATE: removing tmp table
         INFO: SHOW COLUMNS FROM `plugins`
      TIME_MS: 6737
    ROWS_SENT: 12
ROWS_EXAMINED: 12
    ROWS_READ: 13
*************************** 10. row ***************************
           ID: 1666364699
         USER: plugins_user
         HOST: 10.8.70.202:17171
           DB: plugins_mozilla_org
      COMMAND: Query
         TIME: 6
        STATE: removing tmp table
         INFO: SHOW COLUMNS FROM `plugins`
      TIME_MS: 5083
    ROWS_SENT: 12
ROWS_EXAMINED: 12
    ROWS_READ: 13
10 rows in set (5.49 sec)
Laura - is the SHOW COLUMNS stuff MySQL's magic, or is that actually in the code?

Also was there a release or something at noon yesterday? graphs are showing an increase in overall traffic pretty sharply at noon yesterday, about an hour ramping up, and staying pretty high until just now.
Added plugins1.db (rw server) to ro pool in zeus for now.
Bug 798793 for additional web nodes
It would be  nice if we could offload the traffic by caching on Zeus. The app is currently setting cache-control headers.

➜  ~  curl -I "https://plugins.mozilla.org/pfs/v2?appID=%7Bec8030f7-c20a-464f-9b0e-13a3a9e97384%7D&appRelease=15.0&appVersion=20120824154833&clientOS=Windows+NT+5.1&chromeLocale=de-DE&detection=version_available&mimetype=application%2Fx-director&callback=C"
HTTP/1.1 200 OK
Server: Apache
X-Backend-Server: plugins2.webapp.phx1.mozilla.com
Cache-Control: no-store, no-cache, must-revalidate, private
Content-Type: text/javascript
Date: Sat, 06 Oct 2012 15:43:43 GMT
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
Transfer-Encoding: chunked
Connection: Keep-Alive
Set-Cookie: auth_profiles=deleted; expires=Fri, 07-Oct-2011 15:43:42 GMT; path=/; httponly
Set-Cookie: kohanasession=jmsrlgqr6pn4pivudhrmfp4tb6; expires=Sat, 06-Oct-2012 17:43:43 GMT; path=/; HttpOnly
Set-Cookie: kohanasession=jmsrlgqr6pn4pivudhrmfp4tb6; expires=Sat, 06-Oct-2012 17:43:43 GMT; path=/; httponly
Set-Cookie: auth_profiles=deleted; expires=Fri, 07-Oct-2011 15:43:42 GMT; path=/; httponly
Set-Cookie: auth_profiles=deleted; expires=Fri, 07-Oct-2011 15:43:42 GMT; path=/; httponly
Set-Cookie: kohanasession_data=c2Vzc2lvbl9pZHxzOjI2OiJqbXNybGdxcjZwbjRwaXZ1ZGhybWZwNHRiNiI7dG90YWxfaGl0c3xpOjE7X2tmX2ZsYXNoX3xhOjA6e311c2VyX2FnZW50fHM6Nzg6ImN1cmwvNy4yNC4wICh4ODZfNjQtYXBwbGUtZGFyd2luMTIuMCkgbGliY3VybC83LjI0LjAgT3BlblNTTC8wLjkuOHIgemxpYi8xLjIuNSI7aXBfYWRkcmVzc3xzOjEzOiI3Mi40NC4yMTAuMjIwIjtsYXN0X2FjdGl2aXR5fGk6MTM0OTUzODIyMzs%3D; expires=Sat, 06-Oct-2012 17:43:43 GMT; path=/; httponly
X-Frame-Options: DENY
X-Powered-By: PHP/5.3.3
X-Cache-Info: not cacheable; response specified "Cache-Control: no-store"
Assignee: server-ops-webops → server-ops
Component: Server Operations: Web Operations → Server Operations: AMO Operations
QA Contact: cshields → oremj
Assignee: server-ops → jthomas
plugins[7-12].webapp.phx1 configured and added to Zeus plugins pool.
Sad again, all the webheads reached MaxClients. Zeus graphs show an increase in the last hours to almost 200k hits/minute, which is more than double of what it was 24h ago.
(In reply to Dumitru Gherman [:dumitru] from comment #19)
> Sad again, all the webheads reached MaxClients. Zeus graphs show an increase
> in the last hours to almost 200k hits/minute, which is more than double of
> what it was 24h ago.

Was there some type of change in the latest Firefox release or nightly that is causing this traffic increase? This seems very odd.
Duplicate of this bug: 798771
I doubled the MaxClients on all 11 webheads(!) (from 256 to 512), but per comment 17, some Zeus caching love would be really great.
The database servers are also hitting "Too many connections", obviously...
Severity: normal → critical
I've been stracing some processes. The SHOW COLUMNS queries do seem to be the culprit.
Connections to the zeus db vip where taking seconds to respond. I moved the db VIP over to pp-zlb08 for now and the problem has gone away.
happening again...
Traffic is way lower on plugins2:

mysql> show status like 'max_used_connections';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| Max_used_connections | 1005  |
+----------------------+-------+
1 row in set (0.00 sec)


Even though plugins1 is hovering around max_connections (5000). I've updated max_connections for both of them (dynamic and in puppet) to 10,000.
A lot of these http requests are the same. I modified the Cache-Control headers so these requests are being cached in Zeus and off load some of the traffic from our web nodes.
Traffic graphs for the past 7 days.
We are still experiencing increased traffic. Caching at Zeus is helping with performance and availability.
Severity: critical → normal
I would venture the spike was caused by this:
https://blog.mozilla.org/addons/2012/10/05/prompting-our-users-to-update-their-plugins/
and a press push surrounding that.
See bug https://bugzilla.mozilla.org/show_bug.cgi?id=798995 - can we check if the traffic came from one IP or was malicious?
Laura - can you speak to all the "SHOW COLLUMNS" that seems to be coming in? is it from the code?
Getting a full thread from the general log:

http://scabral.pastebin.mozilla.org/1860939

[10:08:31] <@laura> yeah, looks like framework games to me
[10:08:38] <@laura> ugh, kohana, why can't I quit you
[10:08:49] <@laura> I'll dig up the code that is the offender
[10:08:53] <@laura> but it's not easy to fix
[10:08:59] <sheeri> OK. Let me know if you want other threads
[10:09:00] <@laura> not userspace code
[10:09:07] <sheeri> :-\
[10:09:14] <sheeri> laura what's the framework?
[10:09:24] <sheeri> I can look to see if there's a switch or something that can bypass that crap
Is there still a load problem that we need to be solving, or has it resolved?  It's very unclear from the bug.
I changed the zeus config to make ro go only to plugins2, which has the mysql query cache on now. The zeus cache is still on, but we are already getting query cache hits.
We have a temporary fix in place, but we need a more permanent fix.
I'll see if I can solve the ORM issue, give me some time to test.
(In reply to Laura Thomson :laura from comment #31)
> I would venture the spike was caused by this:
> https://blog.mozilla.org/addons/2012/10/05/prompting-our-users-to-update-
> their-plugins/
> and a press push surrounding that.

bug 797378
Assignee: jthomas → laura
Component: Server Operations: AMO Operations → plugins.mozilla.org
Product: mozilla.org → Websites
QA Contact: oremj
Is there a plan/another bug for the ORM issue? Or has this been resolved/addressed?
There is a plan, and it may as well be this bug.  I have not had time yet to address this, sorry, but have other upcoming work on plugins (in bug 811841) so I will address it then.
Summary: plugins.mozilla.org briefly down, SSL alerts → plugins.mozilla.org and kohana
Assignee: laura → bsavage
Assignee: bsavage → laura
Does plugins still use kohana? If so, are you (Laura) still the right contact for this issue?
(In reply to Sheeri Cabral [:sheeri] from comment #42)
> Does plugins still use kohana? If so, are you (Laura) still the right
> contact for this issue?

Hey Sheeri, for the moment unfortunately yes it does. Ricky Rosario is going to rewrite this service in Django and we are most likely going to move the whole thing to AWS.

I am going to ping both Ricky and Lonnen for more info here.
Flags: needinfo?(rrosario)
Flags: needinfo?(chris.lonnen)
OK, great. Actuall
The intent is a whole rewrite into python this Q. We'll deploy it in *the cloud* and then this can be decomm'd. I'm not sure what it means for this bug.
Flags: needinfo?(rrosario)
Flags: needinfo?(chris.lonnen)
You need to log in before you can comment on or make changes to this bug.