Closed Bug 894942 Opened 12 years ago Closed 12 years ago

RFO needed for multi-server tegra outage just after 8am PT, July 17

Categories

(Data & BI Services Team :: DB: MySQL, task)

Platform: x86_64
OS: Linux
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Unassigned)

Details

Attachments

(3 files)

We had a ton of tegra build slaves disconnect from their buildbot master just after 8am this morning. I'm trying to compile a longer list, but one example is:

foopy128.build.mtv1.mozilla.com <-> buildbot-master20.build.mtv1.mozilla.com

I know that these machines are very close to each other, but is there any evidence of a network problem that could've caused them to fail to communicate?
Another example is:

foopy124.build.mtv1.mozilla.com <-> buildbot-master22.build.mtv1.mozilla.com

So it's not isolated to a single buildbot master.
I think this was a DB problem, either in connectivity or in the DB itself. All of the mozpool servers suicided (bug 817762) at exactly the same time, although there were no connection-failure errors. For the foopy/slave connections, if the master was hung on a DB connection in the main thread, you'd see exactly this kind of problem.

I asked Sheeri, and she said there was some backup work running at the time. Sheeri, is it possible that that activity locked tables on the rw master for more than a second or two at a time?
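To illustrate that hypothesis, here is a minimal sketch -- not buildbot's actual code; the timings, callbacks, and sleep duration are invented -- of how a single blocking DB call on the Twisted reactor thread starves every other event, including the keepalives that slave connections depend on:

    # Hypothetical sketch: one blocking DB call in the reactor thread.
    import time
    from twisted.internet import reactor

    def blocking_db_query():
        # Stands in for a query stuck behind a table lock on the rw master.
        time.sleep(30)

    def answer_keepalive():
        # Scheduled for t=1s, but cannot run until the blocking call returns.
        print("keepalive answered")

    reactor.callLater(0, blocking_db_query)
    reactor.callLater(1, answer_keepalive)
    reactor.callLater(35, reactor.stop)
    reactor.run()

If the blocked interval outlasts the slaves' keepalive timeout, they give up and disconnect, which would match the mass disconnects described above.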
Assignee: network-operations → server-ops-database
Component: Server Operations: Netops → Server Operations: Database
Flags: needinfo?(scabral)
QA Contact: ravi → scabral
Summary: network blip just after 8am PT, July 17? → database blip just after 8am PT, July 17?
It was an InnoDB hot backup, so it's very unlikely that it locked the tables. It was doing a lot of reads, though.
Flags: needinfo?(scabral)
(And it was streamed directly to the backup server, so it's unlikely the load was write I/O.)
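For reference, a streamed hot backup of this kind might be driven roughly like the sketch below; the host name, paths, and exact xtrabackup invocation are assumptions rather than the actual backup job. The point is that the reads happen on the DB host while the write I/O lands on the far end of the pipe.

    # Hypothetical sketch of a streamed hot backup; not the actual job.
    import subprocess

    backup = subprocess.Popen(
        ["xtrabackup", "--backup", "--stream=xbstream", "--target-dir=/tmp"],
        stdout=subprocess.PIPE,
    )
    ship = subprocess.Popen(
        ["ssh", "backup1.example.com", "xbstream -x -C /backups/db1"],
        stdin=backup.stdout,
    )
    backup.stdout.close()  # let the ssh side see EOF when the backup exits
    ship.wait()
    backup.wait()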
(In reply to Sheeri Cabral [:sheeri] from comment #3)
> It was an InnoDB hot backup, so it's very unlikely that it locked the
> tables.

So it sounds like this "shouldn't" be a db issue. Back to netops to confirm that nothing happened to the local switches, etc. After that, route to dcops to find out whether there was an accidental cable pull. SOMETHING happened - we need to know what.

dustin: the bug you linked in comment 2 has nothing in it about today's outage. Is there a different bug showing a db impact today?
Assignee: server-ops-database → network-operations
Component: Server Operations: Database → Server Operations: Netops
Flags: needinfo?(dustin)
QA Contact: scabral → ravi
change summary to not imply root cause yet -- we don't know it
Summary: database blip just after 8am PT, July 17? → RFO needed for multi-server tegra outage just after 8am PT, July 17
I linked to bug 817762 as a reference for "suicide". I didn't file a bug, since the Mozpool services recovered automatically (as the releng Buildbot install will, eventually, need to learn to do).

To be clear, it's not certain this was an outage. In any large system, things fail from time to time, and the correct solution is to make the pieces resilient to those failures. This is more of an "event" that we haven't characterized completely, and thus are having a lot of trouble building resilience against.

Netops: this would be something unusual affecting flows between {build.mtv1,p*.releng.scl1} and db.scl3.
Flags: needinfo?(dustin)
What is the action item here?
Oh. I see comment 5. No network activity during that time frame according to logs.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Component: Server Operations: Netops → NetOps: Other
Product: mozilla.org → Infrastructure & Operations
One more place to check - dmoore - any "oops" around this time in the data centers? (0800 PT July 17)
Assignee: network-operations → server-ops-dcops
Status: RESOLVED → REOPENED
Component: NetOps: Other → Server Operations: DCOps
Flags: needinfo?(dmoore)
Product: Infrastructure & Operations → mozilla.org
QA Contact: ravi → dmoore
Resolution: FIXED → ---
Any DCOps incident would have also manifested as a network outage. We definitely had no staff in Mountain View at that time.
Status: REOPENED → RESOLVED
Closed: 12 years ago
Flags: needinfo?(dmoore)
Resolution: --- → INCOMPLETE
From further discussion with Sheeri, the Buildbot DB is MyISAM, which *would* lock during a hot copy. That left a query from each of the buildbot masters waiting for the lock, and since those queries run in the main thread, everything on those masters hung, including connections from slaves. When the lock cleared, things returned to normal, but lots of timeouts had occurred in the interim.

So, that means we shouldn't be doing hot copies from the master.

That doesn't explain the mozpool suicides, though - mozpool's tables are all InnoDB, which doesn't take those locks. It looks like the thread pool on the DB server filled up (at ~325 threads) and stalled out the remaining queries. Sheeri, can you shed a little light on what that spike in threads might mean? I'll upload some graphs to stare at.
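A quick way to confirm which schemas are exposed to that kind of locking is to check the storage engine per table. The sketch below is hedged: the host, credentials, and the assumption that the schemas are literally named 'buildbot' and 'mozpool' are placeholders.

    # Hypothetical check: which schemas still carry MyISAM tables?
    # Only those are locked by a hot copy; InnoDB tables are not.
    import mysql.connector  # MySQL Connector/Python

    conn = mysql.connector.connect(
        host="db-host.example.com", user="readonly", password="...",
        database="information_schema",
    )
    cur = conn.cursor()
    cur.execute(
        "SELECT table_schema, engine, COUNT(*) "
        "FROM tables "
        "WHERE table_schema IN ('buildbot', 'mozpool') "
        "GROUP BY table_schema, engine"
    )
    for schema, engine, count in cur.fetchall():
        print(schema, engine, count)
    cur.close()
    conn.close()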
Assignee: server-ops-dcops → server-ops-database
Status: RESOLVED → REOPENED
Component: Server Operations: DCOps → Server Operations: Database
QA Contact: dmoore → scabral
Resolution: INCOMPLETE → ---
:cyborgshadow and I had a discussion about this. We had a look at CPU and IO load, which were not elevated during the 17-minute outage. The commit log above shows that nothing was written. The threads graph shows that threads were stacking up during that time, but locks weren't; it's not clear whether the number of locks at that time was zero or one.

The most plausible hypothesis is that xtrabackup ran FLUSH TABLES WITH READ LOCK, which has the side effect of draining all running queries before returning and locking the entire server until it does. Given the painful queries that Buildbot runs regularly, it's not at all surprising that the FLUSH TABLES ended up waiting 17 minutes for such a query. During that time, nothing else -- not even mozpool, with its InnoDB tables -- could do any reads or writes. And, as we've seen, Buildbot 0.8.3 does lots of blocking queries on the main thread, so once such a query waits on a lock, *everything* else on that master stops, which is what caused the outage.

So, I think the remedies here are:
 * buildbot should get smarter about DB queries
 * mozpool should move to a less abused server cluster

I'll open a bug for the latter. The former might do well with a contractor, but that's too far above my pay grade to even file a bug on.
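The FLUSH TABLES WITH READ LOCK pile-up described above can be reproduced on a scratch server; the sketch below is purely illustrative, with the host, credentials, and table names all hypothetical. A long table-scanning query holds the flush back, and a later read on an unrelated InnoDB table then queues behind the pending flush, which is the same pattern as the thread-count spike in the graphs.

    # Hypothetical reproduction of the pile-up on a scratch MySQL server.
    import threading
    import time
    import mysql.connector

    def run(sql):
        conn = mysql.connector.connect(host="db-test.example.com", user="test",
                                       password="...", database="test")
        cur = conn.cursor()
        cur.execute(sql)
        if cur.with_rows:
            cur.fetchall()
        cur.close()
        conn.close()

    # 1. A long query keeps a table open (stands in for a painful buildbot
    #    query; assumes slow_table has at least a few rows).
    slow = threading.Thread(
        target=run, args=("SELECT COUNT(*) FROM slow_table WHERE SLEEP(10) = 0",))
    slow.start()
    time.sleep(1)

    # 2. FLUSH TABLES WITH READ LOCK (what xtrabackup issues) now has to wait
    #    for that query to drain.
    ftwrl = threading.Thread(target=run, args=("FLUSH TABLES WITH READ LOCK",))
    ftwrl.start()
    time.sleep(1)

    # 3. Even a read on a different InnoDB table queues behind the pending
    #    flush ("Waiting for table flush" in SHOW PROCESSLIST).
    reader = threading.Thread(
        target=run, args=("SELECT COUNT(*) FROM innodb_table",))
    reader.start()

    for t in (slow, ftwrl, reader):
        t.join()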
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Capturing the prior remedy recommendation from comment 12:

(In reply to Dustin J. Mitchell [:dustin] from comment #12)
> From further discussion with Sheeri, the Buildbot DB is MyISAM, which
> *would* lock during a hot copy.
...
> So, that means we shouldn't be doing hot copies from the master.

I've opened bug 897128 to track if there are alternatives to hot copy for this situation.
There are - do hot copies from slaves.
Hal,

For the record, this is a one-off. Hot copies are generally read/write safe and are not what caused the issue here. The root of the problem is that buildbot has some nasty queries, which is something we're aware of and that needs some work. That doesn't change the suggestion, however, that the solution is twofold, as dustin mentioned:

- make better buildbot queries
- move mozpool out of the way so it's unaffected by this in the interim

Also, for the record, it's not a common occurrence that we hot-copy a database. We do this only if we're provisioning a new slave or a new backup, or fixing irreparable checksum failures of an existing one.
To update the recommendations in comment 16:

- bug 897109 is to move mozpool to a different cluster
- bug 848082 was to remove the problem queries

Since bug 848082 was resolved prior to this incident, more may need to be done there.
Product: mozilla.org → Data & BI Services Team