Closed Bug 921783 Opened 11 years ago Closed 11 years ago

carbon-cache is hitting MAX_CACHE_SIZE limit

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dividehex, Assigned: ericz)

References

Details

I noticed that some new metrics failed to get created when firing up some test nodes, and I also see metrics being dropped.

29/09/2013 01:15:00 :: MetricCache is full: self.size=10000361
29/09/2013 01:15:00 :: MetricCache is full: self.size=10000362
29/09/2013 01:15:00 :: MetricCache is full: self.size=10000363
29/09/2013 01:15:00 :: MetricCache is full: self.size=10000364
29/09/2013 01:15:00 :: MetricCache is full: self.size=10000365
This is on graphite6.private.scl3
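
A quick way to spot this condition is to scan the carbon-cache console logs for the lines above.  A minimal sketch, not the real check (that went in later via bug 921106); the log path is an assumption and would need adjusting per instance (a-h):

import sys

LOG_PATH = "/var/log/carbon/console.log"  # hypothetical path; adjust per instance

def count_full_cache_events(path):
    """Count how many times this carbon-cache reported 'MetricCache is full'."""
    hits = 0
    with open(path) as log:
        for line in log:
            if "MetricCache is full" in line:
                hits += 1
    return hits

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else LOG_PATH
    print("MetricCache-full events: %d" % count_full_cache_events(path))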
Thanks, Jake, for the heads up.  Restarted the a and g instances, which were the offenders here, and it has cleared up.
Assignee: server-ops → eziegenhorn
After applying the fix you found in 921789, I suspect this will not be much of an issue any more.  The caches are comically small compared to what they used to regularly be.  I opened bug 921106 to put monitoring in place for this condition and with that I think this is done.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
It's early in the morning and I have insomnia, but I think I figured out why graphite6 is still having major issues even after rolling back the incoming metrics from collectd to 60s.  Copy and paste from what I wrote in IRC (a quick arithmetic check follows the paste):

<dividehex> we've been raising the metric cache sizes on the carbon-cache when they started to get overloaded
<dividehex> from the increased metrics
<dividehex> this is causing them to starve for memory
<dividehex> the reason the metric cache was filling up in the first place is because the dump-cache-to-disk method is rate limited, and all the new metrics coming from the nodes we've been adding collectd to have exceeded that rate limit
<dividehex> right now we have 363276 metrics which should be incoming at 363276 per min, that's 6054.6/s
<dividehex> split that against 8 carbon-cache instances
<dividehex> that's 756.825 per instance (assuming every instance is running and processing)
<dividehex> and each instance is currently rate limited to 400 metric writes a second
<dividehex> we need to up the metric write rate limit to be able to keep up with the incoming cache growth
<dividehex> decrease the carbon-cache max cache size (which gets sorted in place before yielding to the writer method)
<dividehex> and increase the carbon-relay queue cache size (which doesn't get sorted)
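
The quick arithmetic check mentioned above (it assumes every metric arrives once per minute and that the 8 carbon-cache instances share the load evenly):

# Rough check of the incoming rate vs. the per-instance write limit.
metrics = 363276                      # unique metrics currently incoming
incoming_per_sec = metrics / 60.0     # 6054.6 datapoints/s across the host
per_instance = incoming_per_sec / 8   # ~756.8 datapoints/s per carbon-cache
write_limit = 400                     # current per-instance write rate limit

print("%.3f datapoints/s per instance vs. a %d/s write limit"
      % (per_instance, write_limit))
# -> 756.825 datapoints/s per instance vs. a 400/s write limit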
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
It's not as bad as that because that 400 limit is for update_many() calls which can write out many values for a single metric at the same time.  So it can write a lot more than 400 datapoints per second per carbon-cache.

That being said, I'll look at the relay and cache queue sizes more because we've absolutely had to shrink them due to memory concerns in the last week and it needs revisiting.
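
To illustrate the batching described above: whisper's update_many() takes a whole list of (timestamp, value) points for a single metric file, so one rate-limited write can flush an arbitrarily deep per-metric queue.  A minimal sketch, assuming the whisper module is importable and using a throwaway file name:

import time
import whisper  # the library carbon uses to write .wsp files

path = "example.wsp"  # throwaway file, for illustration only

# 1-minute resolution, one day of retention.
whisper.create(path, [(60, 1440)])

# Pretend the cache had queued 30 minutes of datapoints for this one metric.
now = int(time.time())
points = [(now - 60 * i, float(i)) for i in range(30)]

# A single update_many() call flushes all 30 datapoints, yet it counts as only
# one write against the per-instance rate limit discussed above.
whisper.update_many(path, points)

Note the cap is on calls (i.e. on how many distinct .wsp files get touched), not on datapoints.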
(In reply to Eric Ziegenhorn :ericz from comment #5)
> It's not as bad as that because that 400 limit is for update_many() calls
> which can write out many values for a single metric at the same time.  So it
> can write a lot more than 400 datapoints per second per carbon-cache.
> 
> That being said, I'll look at the relay and cache queue sizes more because
> we've absolutely had to shrink them due to memory concerns in the last week
> and it needs revisiting.

I'm aware of that, and I still think this is a bottleneck and should be increased.  update_many() does take multiple datapoints for a single metric, but as the number of metrics increases, it is still held to this constant.  If a single carbon-cache queue is holding a single datapoint for every metric (all 363276), it will still be limited to writing to only 400 .wsp files per second.
Blocks: 920626, 920629
Raised MAX_UPDATES_PER_MINUTE to 800 and so far I'm seeing no adverse effects on the disk.  This change is keeping the carbon-cache queues really low, which is great.  This, combined with the memory upgrade, means we likely won't have any trouble with memory now.  Going to keep this bug open for a while to monitor things.
I just checked, and carbon-cache instances b, c, e, and g are currently reporting that MetricCache is full.
Restarted and raised the MAX_UPDATES_PER_MINUTE to 1200.  Will continue to investigate what is going on after I get alerting in place for this.
Got the alerts in place in bug 921106, and they already caught instance C misbehaving.  I looked at it: the cache is full and the process isn't using the CPU at all.  So this appears to be a bug where it just gets stuck, but I have to investigate further.
The relays show incoming traffic of about 285k metrics/minute, which is a slight increase from what we had before, as expected.  We have 8 carbon-cache instances, so that's a little under 36k metrics per cache that need to be written per minute.  The main configuration control on this is MAX_UPDATES_PER_MINUTE, which limits the number of unique metrics each cache writes to disk per minute (there is no limit on how many values per metric it writes; this is how Graphite batches writes and tries to go easy on the disks).

After the work in bug 923224, the disk (i.e. the I/O subsystem) no longer appears to be a constraint.  Hence we're raising MAX_UPDATES_PER_MINUTE to determine whether the caches can keep up with a higher rate of updates.
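
A rough way to reason about whether a given MAX_UPDATES_PER_MINUTE keeps up; this is a sketch under the simplifying assumptions that each metric arrives once per minute, the limit is actually hit every minute, and updates rotate evenly across metrics:

# If a cache holds `metrics_per_cache` unique metrics and may touch at most
# `max_updates_per_min` .wsp files per minute, each metric gets flushed about
# every metrics_per_cache / max_updates_per_min minutes, and roughly that many
# datapoints per metric sit in the cache in the meantime.
metrics_per_cache = 285000.0 / 8      # ~35.6k unique metrics per carbon-cache
max_updates_per_min = 1200.0          # the value currently being tested

minutes_between_flushes = metrics_per_cache / max_updates_per_min    # ~29.7 min
steady_state_datapoints = metrics_per_cache * minutes_between_flushes

print("each metric flushed every %.1f min; ~%d datapoints queued at steady state"
      % (minutes_between_flushes, steady_state_datapoints))
# -> each metric flushed every 29.7 min; ~1057617 datapoints queued at steady state

Under those assumptions the queue levels off around a million datapoints, comfortably below the ~10M limit the "MetricCache is full" lines at the top of this bug were bumping into.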
So most of the config changes were red herrings (though tuning MAX_UPDATES_PER_MINUTE was worth doing).  The big issue is that we were triggering what appears to be a bug with carbon's WHISPER_LOCK_WRITES setting when it was set to true.  Symptoms include:

* A cache would hum along fine for a while and then simply stop writing to disk while still receiving metrics, eventually filling up its cache and at that point discarding metrics.

* strace'ing the process would show a series of flock and epoll_wait calls, likely indicating some kind of locking problem plus the reception of new metrics.  A normally behaving carbon process would also show lots of reads and writes to disk.

I'm going to try to replicate this in dev and file a bug and/or patch with the Graphite project if I can narrow it down.  In the meantime, I've determined through reading the code and tracking all the metrics that we don't need this problematic setting turned on: the relays' consistent hashing mechanism is genuinely consistent and never sends the same metric to more than one cache -- hence the caches won't trample each other and we don't need to worry about locking.
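
To illustrate the property being relied on (a simplified sketch, not carbon's actual ConsistentHashRing): the relay hashes each metric name deterministically to exactly one cache instance, so no two carbon-cache processes ever write the same .wsp file and per-file write locking isn't needed.

import hashlib

CACHE_INSTANCES = ["a", "b", "c", "d", "e", "f", "g", "h"]

def instance_for(metric_name):
    """Deterministically map a metric name to exactly one cache instance."""
    digest = hashlib.md5(metric_name.encode("utf-8")).hexdigest()
    return CACHE_INSTANCES[int(digest, 16) % len(CACHE_INSTANCES)]

# The same metric always lands on the same instance, so only one carbon-cache
# process ever opens its .wsp file for writing.  (Made-up metric name.)
metric = "collectd.node1.load.load.shortterm"
assert instance_for(metric) == instance_for(metric)
print("%s -> %s" % (metric, instance_for(metric)))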

The caches have been stable for 5 days without getting anywhere near MAX_CACHE_SIZE.  Everything seems to be flowing smoothly and we have very good headroom in terms of CPU, memory and disk I/O.  Much thanks to :dividehex for all his help!
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard