
Intranet is caching everything with an expire time of 2038.

RESOLVED FIXED

Status

Infrastructure & Operations
WebOps: IT-Managed Tools
--
major
3 years ago
3 years ago

People

(Reporter: cyborgshadow, Assigned: cyliang)

Tracking

Details

(Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/408] )

(Reporter)

Description

3 years ago
Recently, we started getting disk alerts for the intranet backup server.

Upon investigating, we first thought the backups were simply taking too much space. Eventually we realized that, while it technically is the backups taking the space, they have been growing by several hundred MB a day.

Looking further, it seems the "objectcache" table is now caching everything with an exptime in 2038 or later. Out of 4.8 million rows in the table, 4.79 million had an exptime in 2038.

I'm clearing out these rows to preserve space as it's only a cache table, but this should be looked at so that it's caching things to a more relevant timeframe. 24+ years isn't a valid cache expiration for 99% of the content being cached.
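For context, the 2038 exptime is not arbitrary: it is the ceiling of a signed 32-bit Unix timestamp, which the object cache appears to use as a "never expires" sentinel. A quick sketch (the constant name is mine, not from the MediaWiki codebase):

```python
from datetime import datetime, timezone

# The largest value a signed 32-bit Unix timestamp can hold.
INT32_MAX = 2**31 - 1

# Interpreting that ceiling as a date lands on the classic "Y2038" limit,
# which matches the exptime seen on the runaway cache rows.
limit = datetime.fromtimestamp(INT32_MAX, tz=timezone.utc)
print(limit.isoformat())  # 2038-01-19T03:14:07+00:00
```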
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/408]
(Reporter)

Comment 1

3 years ago
More info: It appears that the majority of the new cache rows being set for 2038 are minify-js and minify-css.

These are being cached thousands of times, but with a different hash at the end.

Guessing someone decided to cache forever, someone else decided to rotate the cache keys, and the two combined are expanding without bound.
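That failure mode — an eternal TTL combined with keys that embed a content hash — can be sketched like this (a toy dict-backed cache and simplified key scheme, not MediaWiki's actual implementation):

```python
import hashlib

# A toy object cache: key -> value, entries never expire.
cache = {}

def cache_minified(prefix, data):
    # The key embeds an md5 of the content, mirroring the
    # "minify-js:7:<hash>" keys seen in the objectcache table.
    key = f"{prefix}:{hashlib.md5(data.encode()).hexdigest()}"
    cache[key] = data  # no exptime: "cache forever"
    return key

# Every time the minified source varies (new build, per-user variation),
# a brand-new key is created and no old key is ever evicted.
for build in range(1000):
    cache_minified("minify-js:7", f"var build = {build};")

print(len(cache))  # 1000 distinct rows, none reclaimable
```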
(Reporter)

Comment 2

3 years ago
In about 20 minutes since clearing all the 2038 cache:

-- This query lists the unique key names from the table, grouped on the first 55 characters (the key prefix up to where the hash starts)

mysql> select left(keyname,55) AS keyname from objectcache where exptime > '2038-01-01' group by left(keyname,55);
+---------------------------------------------------------+
| keyname                                                 |
+---------------------------------------------------------+
| intranet_mozilla_org:resourceloader:filter:minify-css:7 |
| intranet_mozilla_org:resourceloader:filter:minify-js:7: |
+---------------------------------------------------------+
2 rows in set (0.06 sec)

-- Showing that we already have nearly 10,000 such rows again, 20 minutes later.

mysql> select count(*) from objectcache where exptime > '2038-01-01';
+----------+
| count(*) |
+----------+
|     9945 |
+----------+
1 row in set (0.01 sec)
(Reporter)

Updated

3 years ago
Severity: normal → major
(Reporter)

Comment 3

3 years ago
A note as of today:

A fresh search confirms that minify-js is the caching offender.

mysql> select left(keyname,55), count(left(keyname,55)) AS keyname from objectcache where exptime > '2038-01-01' group by left(keyname,55);
+---------------------------------------------------------+---------+
| left(keyname,55)                                        | keyname |
+---------------------------------------------------------+---------+
| intranet_mozilla_org:resourceloader:filter:minify-css:7 |       6 |
| intranet_mozilla_org:resourceloader:filter:minify-js:7: | 1642896 |
+---------------------------------------------------------+---------+
2 rows in set (31.23 sec)
(Reporter)

Comment 4

3 years ago
Given the amount of rows and the time it took to get here, whatever change caused this issue had to have occurred in the past 1-2 weeks.
(Assignee)

Updated

3 years ago
Assignee: server-ops-webops → cliang
(Assignee)

Comment 5

3 years ago
This looks like it might be a result of an upstream bug, which hasn't been touched.[1]  There's more discussion of this in http://www.mediawiki.org/wiki/Talk:ResourceLoader#Cache_consumption_20440.  

There is a workaround mentioned in one of the comments:
  > I've just hacked ResourceLoader.php -- instead of $cache->set( $key, $result); we now use
  > $cache->set( $key, $result, time()+86400);

I've put this in place in the dev and stage environments for intranet. 
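The workaround amounts to attaching an explicit one-day TTL instead of the "forever" default. The same pattern in miniature (a hypothetical cache class, not MediaWiki's BagOStuff API):

```python
import time

class TTLCache:
    """Toy cache where every entry carries an absolute expiry time."""
    def __init__(self):
        self._store = {}

    def set(self, key, value, exptime=None):
        # No exptime mirrors the buggy call site: the entry never expires.
        self._store[key] = (value, exptime if exptime is not None else float("inf"))

    def get(self, key):
        value, exptime = self._store.get(key, (None, 0.0))
        if time.time() >= exptime:
            self._store.pop(key, None)  # lazily evict expired entries
            return None
        return value

cache = TTLCache()
# The patched call site: time() + 86400 caps the entry at one day.
cache.set("minify-js:7:abc", "minified js...", time.time() + 86400)
```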

@cyborgshadow: Can you confirm if this has had the desired effect in the dev and staging DBs?  That is, are there now cache entries for minify-js with a shorter exptime?

@jd: Would it be incredibly stupid to put this workaround into production?  Other folks are doing a straight "TRUNCATE TABLE objectcache;" on a regular basis which doesn't seem entirely correct but would mean no changes to the codebase. 


[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=42094
Flags: needinfo?(jcrowe)
Flags: needinfo?(bjohnson)
(Reporter)

Comment 6

3 years ago
Both dev and staging still show a minimum exptime of 2038-01-19. Unfortunately, the workaround does not appear to have taken effect.
Flags: needinfo?(bjohnson)
(Reporter)

Updated

3 years ago
Blocks: 1028545
(Assignee)

Comment 7

3 years ago
Hrm.  So, I truncated the objectcache table in dev and then started interacting with intranet-dev.allizom.org.  This seems to produce the effect I expected.  [1]  I'm wondering if the cache object is set for each unique viewer and, thus, won't get set with the shorter exptime without removing the entry with the 2038-01-19 exptime first.

Of course, this assumes that the entries with the shorter exptime will get properly cleaned out of the db.  I'll verify this tomorrow (June 25th).  If the db doesn't get cleaned out, I'm going to advocate for reaming out the cache on a regular basis. 


[1] mysql> select left(keyname,55), count(left(keyname,55)) AS keyname from objectcache where exptime > '2038-01-01' group by left(keyname,55);
Empty set (0.00 sec)

mysql> select left(keyname,55), count(left(keyname,55)) AS keyname from objectcache where exptime < '2038-01-01' group by left(keyname,55);
+---------------------------------------------------------+---------+
| left(keyname,55)                                        | keyname |
+---------------------------------------------------------+---------+
| intranet_dev_allizom_org:messages:en                    |       1 |
| intranet_dev_allizom_org:pcache:idhash:6575-0!*!0!!en!* |       1 |
| intranet_dev_allizom_org:pcache:idoptions:6575          |       1 |
| intranet_dev_allizom_org:resourceloader:filter:minify-c |       1 |
| intranet_dev_allizom_org:resourceloader:filter:minify-j |       1 |
+---------------------------------------------------------+---------+
5 rows in set (0.00 sec)
(Assignee)

Comment 8

3 years ago
I'm going to advocate for not having this particular tidbit of information cached at all, which will involve:

   1. neutering ResourceLoader.php to not cache the info and 
   2. clearing out the existing entries in the objectcache table with an exptime of 2038-01-19

I've done this for the intranet dev and staging environments.

@cyborgshadow: Is there a good / bad time to try to coordinate doing this in production?  I'd like to do it early enough in the day that we can roll this change back if it looks detrimental to performance. 


 
Right now, it looks like 1) this info does not get cleaned out and 2) it gets generated per session per user, which makes the problem worse.   $wgMainCacheType is currently set to 'CACHE_NONE'.  

The bit that is killing us in ResourceLoader.php is:

        // Try for cache hit
        // Use CACHE_ANYTHING since filtering is very slow compared to DB queries
        $key = wfMemcKey( 'resourceloader', 'filter', $filter, self::$filterCacheVersion, md5( $data ) );
        $cache = wfGetCache( CACHE_ANYTHING );
        $cacheEntry = $cache->get( $key );

According to the docs:

CACHE_ANYTHING - Use $wgMessageCacheType or $wgParserCacheType if they are set to something other than CACHE_NONE or CACHE_ANYTHING. Otherwise use CACHE_DB.

So, this filter information is being cached, despite having $wgMainCacheType set to 'CACHE_NONE'.  I talked with JD this morning, who thinks that whatever caching savings we're getting on Intranet is small enough that it's not useful.
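The fallback the docs describe can be sketched as a selection function (the constant names follow MediaWiki's; the function itself is illustrative, not the actual wfGetCache() implementation):

```python
CACHE_NONE = "none"
CACHE_ANYTHING = "anything"
CACHE_DB = "db"

def resolve_cache_anything(message_cache_type, parser_cache_type):
    # CACHE_ANYTHING prefers any concretely configured cache backend...
    for candidate in (message_cache_type, parser_cache_type):
        if candidate not in (CACHE_NONE, CACHE_ANYTHING):
            return candidate
    # ...and otherwise falls back to the database -- which is how the
    # filter cache ends up in objectcache even with caching "disabled".
    return CACHE_DB

print(resolve_cache_anything(CACHE_NONE, CACHE_NONE))  # db
```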
Flags: needinfo?(jcrowe) → needinfo?(bjohnson)
(Reporter)

Comment 9

3 years ago
Confirmed this works. I'd say ready for prod.
Flags: needinfo?(bjohnson)
(Assignee)

Comment 10

3 years ago
Currently scheduled to go into prod on Monday (to avoid pushing a potentially breaking change right before the weekend).  Brandon has kindly done a preliminary purge on the DB, so we *ought* to be good on space until then.
(Assignee)

Comment 11

3 years ago
Change pushed out to production this morning at around 7:30 AM PDT; I'll be asking Brandon to do a table clear at his leisure.
(Reporter)

Comment 12

3 years ago
Purged, tested, and verified. We're good to go!
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED