Closed Bug 1033288 Opened 10 years ago Closed 8 years ago

Create documentation on Mana about our Memcache clusters

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: arich, Assigned: Atoll)

References

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/236] )

As part of the releng nagios audit, we've come up with a few open monitoring and documentation questions/actions for webops.

Questions:
1) why isn't upload2.dmz.scl3.mozilla.com in hostgroup seamicro-nodes like other seamicro nodes?
2) we seem to monitor product-delivery-ftp-vip but not product-delivery-ftp. Should we be monitoring both?


Actions:
1) please add documentation for memcached to mana
2) the tbpl documentation doesn't list any database dependencies, but there are two tbpl databases. Can you please update the documentation and/or document what the tbpl databases are for?

Thanks!
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/479]
Blocks: 993044
I suspect that the requests in this bug have been rendered moot during the intervening time.  (>_<)  I wanted to check if any of these were currently relevant and, if so, get them done.

RE: upload2.dmz.scl3.mozilla.com
This server went virtual in bug 1061825, so it should no longer need to be grouped with the seamicro nodes.


RE: monitoring of product-delivery-ftp-vip versus product-delivery-ftp
I'm not quite sure I grok this.  (This request may have been fulfilled by someone else.)  It looks like these are two host groups in SCL3, with the  ftp-vip hostgroup monitoring the ZLB-related properties and the ftp hostgroup monitoring things on a server level.  What were you looking to have monitored?


RE: memcached docs 
Are you looking for "generic" memcached documentation?  (I didn't find any in mana.)  There are docs for responding to Nagios memcache alerts (https://mana.mozilla.org/wiki/display/NAGIOS/memcached).


RE: documenting TBPL databases
It looks like both the mana page for tbpl.mozilla.org (https://mana.mozilla.org/wiki/display/websites/tbpl.mozilla.org#tbpl.mozilla.org-Database) and the wiki architecture docs (https://wiki.mozilla.org/Sheriffing/TBPL/DeveloperDocs#Architecture) have the DBs listed.  Is this sufficient or are you looking to capture what would be in the table descriptions?
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/479] → [kanban:https://webops.kanbanize.com/ctrl_board/2/236]
(In reply to C. Liang [:cyliang] from comment #1)
> RE: monitoring of product-delivery-ftp-vip versus product-delivery-ftp
> I'm not quite sure I grok this.  (This request may have been fulfilled by
> someone else.)  It looks like these are two host groups in SCL3, with the 
> ftp-vip hostgroup monitoring the ZLB-related properties and the ftp
> hostgroup monitoring things on a server level.  What were you looking to
> have monitored?

> RE: memcached docs 
> Are you looking for "generic" memcached documentation?  (I didn't find any
> in mana.)  There are docs for responding to Nagios memcache alerts
> (https://mana.mozilla.org/wiki/display/NAGIOS/memcached).
> 
> 
> RE: documenting TBPL databases
> It looks like both the mana page for tbpl.mozilla.org has the DBs listed
> (https://mana.mozilla.org/wiki/display/websites/tbpl.mozilla.org#tbpl.
> mozilla.org-Database) as does the wiki architecture docs
> (https://wiki.mozilla.org/Sheriffing/TBPL/DeveloperDocs#Architecture).  Is
> this sufficient or are you looking to capture what would be in the table
> descriptions?

Setting needinfo? :arr for these three questions.
Flags: needinfo?(arich)
Since product delivery is changing, I think the first is a noop.

It looks like the database info is there for TBPL now, cool.

As for memcached, I'm looking for documentation on how it's architected/configured at Mozilla.
Flags: needinfo?(arich)
(In reply to Amy Rich [:arich] [:arr] from comment #3)
> As far as memcached, documentation on how it's architected/configured at
> mozilla.

I looked through all of our mentions of memcached in mana and it appears that memcached is generally documented as "1-X servers" on each individual app's page. It's not immediately apparent to me how each app chooses to make use of its memcached server pool, but I have a suspicion (based on various mentions of Couchbase at certain points) that it's basically up to each app to make use of the memcached pool however they see fit.

Specifically, which apps are of interest to releng as part of this Nagios check review bug? I can pin down their sharding mechanisms if needed, or if the general answer "each app uses the configured pool of memcached as designed" without a more precise definition of sharding is okay, cool. (Or if there are other questions about memcached I'm not addressing in this reply, I can try to answer those as well.)
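
For reference, here is a rough sketch of what "sharding" across a memcached pool usually means on the client side: the app hashes each cache key and uses the result to pick one server from its configured pool. This is only an illustration of the general technique, not a description of how any particular app here does it. The hostnames, the pick_server helper, and the choice of MD5-modulo hashing are all assumptions made up for the example.

```python
import hashlib
from typing import Sequence

# Hypothetical pool: placeholder hostnames, not real servers.
POOL = [
    "memcache1.example.com:11211",
    "memcache2.example.com:11211",
    "memcache3.example.com:11211",
]


def pick_server(key: str, pool: Sequence[str]) -> str:
    """Map a cache key to one server via a stable hash modulo the pool size."""
    if not pool:
        raise ValueError("memcached pool is empty")
    # Use a stable digest; Python's built-in hash() is randomized per process.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(pool)
    return pool[index]


if __name__ == "__main__":
    for key in ("session:abc", "user:42", "page:/index"):
        print(key, "->", pick_server(key, POOL))
```

Note that with plain modulo hashing the mapping depends on the pool size, so adding or removing a server remaps most keys. Many client libraries offer consistent hashing to reduce that, which is one reason the per-app mechanism would need to be pinned down individually.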
Flags: needinfo?(arich)
The perspective I was looking at this from was that memcached is its own service, so I would expect to see some sort of documentation in either SysAdminWiki or IT Wiki (possibly using the ServiceTemplate) that documents the service itself. Since we have multiple different memcached "clusters," maybe that documentation just talks about the defaults (what specs the VMs are created with, do we do kernel tuning, what's our default for sharding, cachesize, etc). Maybe a reference to the puppet module and node defs where these things can be configured. The service documentation page would also have links to the nagios checks, etc. Does that make sense?
Flags: needinfo?(arich)
Assignee: server-ops-webops → rsoderberg
QA Contact: nmaul → smani
Summary: releng nagios audit: open questions/actions for webops → Create documentation on Mana about our Memcache clusters
(In reply to Amy Rich [:arr] [:arich] from comment #5)
> The perspective I was looking at this from was that memcached is its own
> service, so I would expect to see some sort of documentation in either
> SysAdminWiki or IT Wiki (possibly using the ServiceTemplate) that documents
> the service itself. Since we have multiple different memcached "clusters,"
> maybe that documentation just talks about the defaults (what specs the VMs
> are created with, do we do kernel tuning, what's our default for sharding,
> cachesize, etc). Maybe a reference to the puppet module and node defs where
> these things can be configured. The service documentation page would also
> have links to the nagios checks, etc. Does that make sense?

This does make sense, but we don't have *any* of these things defined. They are all inherited from legacy snowflake deployments, and there are no best practices evaluated, tested, defined, and so on. This is obviously wildly suboptimal, but at least we can accurately define how it functions today.

I'd like to say that we're going to do this someday, but we clearly aren't going to get around to it for a very long time to come. RESO INCO until we can get further traction on this.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INCOMPLETE
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard