Do some manual prototyping of capturing intermittent failures using rr

NEW
Unassigned

Status

3 years ago
3 years ago

People

(Reporter: ted, Unassigned)

roc proposed doing test recording with rr on a dedicated machine on dev.platform:
https://lists.mozilla.org/pipermail/dev-platform/2015-March/008977.html

jmaher and I have been talking about this for the past week. He has some machines sitting idle in a datacenter that were intended for perf testing and have CPUs new enough to run rr, so he's volunteered to donate them to the cause. We'd like to spend a little time prototyping a manual setup that runs tests under rr to see if we can successfully capture intermittent failures and have engineers debug them. If we're successful, we can talk about writing some automation around the process.

There's some historical precedent for this: we used to have a dedicated Windows machine in Mountain View that had a VMware record and replay setup, and I know some tricky bugs were debugged using that.

Some things we've talked about:
* We should try to set these up to match the test machines as closely as possible. If we can use RelEng's puppet scripts to image them that would be fantastic, but if not, installing the same version of Ubuntu + the same packages might suffice.
* roc says we should use btrfs as the filesystem, so that is going to require us to set these machines up from scratch anyway.
* We want to hand-pick some intermittents that we know show up on Linux and kick off manual runs. I'm not sure that we want to use --run-until-failure, since I think that keeps the same browser open and re-runs tests, which would make the rr recording very large. A better solution is probably just to run the harness in a loop until it fails and delete the rr recording if the test didn't fail, something like:
  while runtests.py --debugger=rr <whatever tests> ; do
    # the run passed: delete the rr recording and go around again
  done
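
A slightly fuller sketch of that loop, under some assumptions that are not from this bug: the harness exits non-zero when a test fails, rr honours the `_RR_TRACE_DIR` environment variable to choose where traces are written, and `record_until_failure` is just a hypothetical helper name. It keeps re-running the command while it passes, throwing away each passing run's recording, and stops at the first failure with that recording left in place.

```shell
#!/bin/sh
# Hypothetical helper: record_until_failure TRACE_DIR COMMAND [ARGS...]
# Re-runs COMMAND until it exits non-zero, deleting the rr recording of
# every passing run so only the failing run's trace is kept.
# Assumption: rr writes its traces under $_RR_TRACE_DIR.
record_until_failure() {
    trace_dir=$1
    shift
    attempt=0
    while :; do
        attempt=$((attempt + 1))
        mkdir -p "$trace_dir"
        if _RR_TRACE_DIR="$trace_dir" "$@"; then
            # Passing run: the recording is not interesting, drop it.
            echo "run $attempt passed; deleting recording"
            rm -rf "$trace_dir"
        else
            # Failing run: leave the recording in place for debugging.
            echo "run $attempt failed; recording kept in $trace_dir"
            return 0
        fi
    done
}
```

It would be called with a scratch trace directory followed by the same harness command line as in the loop above (runtests.py with --debugger=rr and the chosen tests). Starting a fresh harness process for every attempt is what keeps each recording small.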

Our plan is for jmaher and me to do the setup and run the harnesses; then, if we capture a failure, we'll hand the machine over to roc or someone else to try debugging.
I talked to dustin about whether we could puppetize these machines. Apparently we'd either need to put them on the QA VLAN or make the QA puppet master reachable from the A-Team VLAN (which they're currently on), but after that it sounded pretty doable.

(I also filed bug 1147481 for something we'll probably need to use this in practice, getting debug symbols for system libs installed.)
See Also: → bug 1147481
The best bet is to move a couple of machines into the QA VLAN instead of getting PuppetAgain to work in the A-Team VLAN.

IIRC, we need to:
* find the machine names and get the network ports
* file an IT request to change the ports to the correct VLAN
* rename the machines to something more accurate

then we can:
* figure out how to use puppetagain on the machines
* adjust it so we can use the proper filesystem (btrfs)


Right now we should take:
perf-w7-002.ateam.scl3.mozilla.com
perf-w7-003.ateam.scl3.mozilla.com

If these two work, I would be happy to consider moving the other 4 over.

Dustin, can you help confirm these steps and maybe figure out the network ports?
Flags: needinfo?(dustin)
That sounds correct.  You can file an Infrastructure & Operations: DCOps bug to move the hosts.  If you give the hostnames then they can determine the ports and the new names from inventory.  Please double-check with them that these machines have working IPMI interfaces, too, as we'll need those to do the re-images.

Starting with two machines makes a lot of sense.  Just to check, though, you're looking for one of our supported Linux operating systems here, right?  Ubuntu Precise?
Flags: needinfo?(dustin)
correct, we have windows on these machines right now, and we will reinstall to have ubuntu (12.04 just as the test machines have)
If using btrfs is hard for some reason, you can drop that requirement. btrfs supports cheap copy-on-write file copies, which rr could utilize in the future to make traces more robust to system and browser updates, but currently we don't utilize it.
Depends on: 1148331
roc: thanks for the info. We'll see if it's a hassle to make that work with the existing puppet scripts. If not, we'll do it.
It's definitely a hassle :)

We can revisit when rr supports it.  From comment 5 I gather that a trace would include a COW copy of basically the entire filesystem, so that the trace can index binaries as they were when the trace was captured, even if those binaries are later modified.
Dustin, how do we use puppet to reimage these machines?
Flags: needinfo?(dustin)
https://mana.mozilla.org/wiki/display/DC/How+To+Reimage+Releng+iX+and+HP+Linux+Machines is the reimaging process.  You'll need to work with Henrik to add node definitions first, though.
Flags: needinfo?(dustin)
:whimboo is on pto and could be for a while. Dustin, what are node definitions?  I would be happy to do this or figure out how to be a backup for :whimboo when he is afk.

that mana page looks easy to follow!
Flags: needinfo?(dustin)
Node defs are in qa-nodes.pp:

  http://hg.mozilla.org/qa/puppet/file/e54aef533448/manifests/qa-nodes.pp

Note that this repository live-deploys from the production branch, so generally patches land in default and are merged to production.

I'm not *terribly* familiar with the current status of the QA puppet infrastructure -- what's running on it, what the ramifications of a patch might be, and so on.  But I think that adding some node definitions and doing an install of some new hardware is unlikely to hurt anything critical, once the hosts are moved.
Created attachment 8585597 [details] [diff] [review]
add puppet support for the rr machines (1.0)

the two machines are in the QA vlan, I have renamed them in inventory.mozilla.org as well as renaming the existing w7 images.

I presume once this patch lands and is in 'production' that we can pxe boot and install via puppet!
Attachment #8585597 - Flags: review?(dustin)
Comment on attachment 8585597 [details] [diff] [review]
add puppet support for the rr machines (1.0)

Review of attachment 8585597 [details] [diff] [review]:
-----------------------------------------------------------------

The patch itself looks fine, and yes, we can try pxe-booting.  I don't feel generally comfortable OK'ing changes to QA PuppetAgain, but this one is simple enough.
Attachment #8585597 - Flags: review?(dustin) → review+
pushed to the QA repo, Dustin, let me know when this makes it to production and I can give this a try!
Looks like you only pushed to default.  If you merge that to production, it will deploy in the next five minutes (it's pulled in a crontask).  I don't have any visibility into the system to know when that occurs, though.
Comment on attachment 8585597 [details] [diff] [review]
add puppet support for the rr machines (1.0)

I have gone through the puppetagain steps and the installation hangs at kickseed.  This has happened twice, so it is repeatable.

I chose IX machines and Ubuntu 12.04 x64.

Dustin, you don't accept needinfo, so I am doing feedback :)
Flags: needinfo?(dustin)
Attachment #8585597 - Flags: feedback?(dustin)
The whole point of not requiring needinfo is that I am one of the rare breed who actually reads his emails.  Don't abuse feedback on the incorrect assumption that I won't read a message from bugzilla that doesn't set a flag.
Attachment #8585597 - Flags: feedback?(dustin) → feedback?
ah, sending an ALT+F1 macro yields a text console.  I see stuff like this:

Volume group "sda" not found
Skipping volume group sda
Setting partition 1 of /dev/sda to active... done.
umount: can't umount /target/proc /sys: No such file or directory
DEBUG: resolver (libnewt0.52): package doesn't exist (ignored)
DEBUG: resolver (ext2-modules): package doesn't exist (ignored)
INFO: menu item 'finish-install' selected
info: Running /usr/lib/finish-install.d/01kickseed
kickseed: Running post script /var/spool/kickseed/parse/post/0.script using interpreter /bin/sh (chrooted: 1):


that is the end of it.  Dustin, any thoughts?
Looks like a hardware issue.  Do you know what recent Linuxes call the disk devices in this version?  Apparently not sda :(

If you can get to a command line, 'dmesg' or 'mount' might be able to help.
running mount, I see:
rootfs on / type rootfs (rw)
none on /run type tmpfs (rw,...)
none on /proc type proc (rw,relatime)
none on /sys type sysfs (rw,relatime)
devtmpfs on /dev type devtmpfs (rw,...)
devpts on /dev/pts type devpts (rw,...)
/dev/sda1 on /target type ext4 (rw,relatime,errors=remount-ro,user_xattr,barrier=1,data=ordered)
devtmpfs on /target/dev type devtmpfs (rw,...)
none on /proc type proc (rw,relatime)
devpts on /target/dev/pts type devpts (rw,...)
none on /target/run type tmpfs (rw,...)


looking at dmesg is fun as there is no scrolling in the console redirection java applet.

What I see at the end of dmesg is:
eth0: no IPv6 routers present
NTFS driver 2.1.30 [Flags: R/O MODULE].
Btrfs loaded
JFS: nTxBlock = 8192, nTxLock = 65536
SGI XFS with ACLs, security attributes, realtime, large block/inode numbers, no debug enabled
SGI XFS Quota Management subsystem
Adding 3999740k swap on /dev/sda2.  Priority:-1 extents:1 across:399740k
EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: errors=remount-ro


Not sure if this is really helpful. We appear to have sda* partitions; maybe they are read-only somehow?
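
A rough way to check the read-only question from a shell is to filter `mount` output for an exact `ro` option. This is only a sketch: `list_readonly_mounts` is a hypothetical helper name, and it assumes `awk` is available in the installer environment, which I have not verified.

```shell
#!/bin/sh
# Hypothetical helper: reads `mount`-style lines on stdin
# ("DEV on DIR type FS (opt,opt,...)") and prints the mount point of
# every entry whose option list contains the exact option "ro".
# Options such as errors=remount-ro do not count as read-only.
list_readonly_mounts() {
    awk '{
        opts = $0
        sub(/^.*\(/, "", opts)    # drop everything up to the last "("
        sub(/\).*$/, "", opts)    # drop the closing ")" and anything after
        n = split(opts, o, /, */) # options, tolerating "rw, relatime"
        for (i = 1; i <= n; i++) {
            if (o[i] == "ro") { print $3; break }
        }
    }'
}
```

Running `mount | list_readonly_mounts` prints nothing if every filesystem is mounted read-write, which is what the listing above suggests for /dev/sda1.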
dustin, any thoughts on this?  I am really not sure how to proceed here.  I don't mind trying out a few things.
let me take a poke..
Depends on: 1167687
Depends on: 1167847
Attachment #8585597 - Flags: feedback?
See Also: → bug 1182298