Closed Bug 781012 Opened 12 years ago Closed 11 years ago

sanity check hg 2.5.4 on hgssh.stage.dmz.scl3.mozilla.com

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: hwine, Assigned: Callek)

References

Details

(Whiteboard: [reit])

IT has staged a 2.2.3 on hg1.dmz.scl3.mozilla.com - test to ensure no known breakage.

Bonus points for documenting all the tests to simplify the next update cycle.
Assignee: nobody → catlee
Whiteboard: [reit] → [reit][buildduty]
(In reply to Hal Wine [:hwine] from comment #0)
> IT has staged a 2.2.3 on hg1.dmz.scl3.mozilla.com - test to ensure no known
> breakage.

Per IRC, the host in question is hgweb1.dmz.scl3.mozilla.com. It is a vhost that doesn't respond to that name, only to hg.m.o, so you should adjust your local hosts file to test against it.
Summary: sanity check hg 2.2.3 on hg1.dmz.scl3.mozilla.com → sanity check hg 2.2.3 on hgweb1.dmz.scl3.mozilla.com
I tried using bld-centos6-hp-002.build.mozilla.org (10.12.52.43) to test, but I got connection timeouts.

bkero, let me know when I should try again.
Assignee: catlee → nobody
Can you ensure that "nc -z hgweb1.dmz.scl3.mozilla.com 80" actually works?  Otherwise we're going to have to add a network flow.  Or wait for 781925 to finish, then actually use that address.
There is no flow from build-vpn to DMZ (although the host cited in comment #2 doesn't have any tools installed). From a build-vpn box with tools:
[cltbld@buildbot-master33 ~]$ time nc -zv hgweb1.dmz.scl3.mozilla.com 80
nc: connect to hgweb1.dmz.scl3.mozilla.com port 80 (tcp) failed: Connection timed out

real    3m9.000s
user    0m0.000s
sys     0m0.001s
[cltbld@buildbot-master33 ~]$ host hgweb1.dmz.scl3.mozilla.com
hgweb1.dmz.scl3.mozilla.com has address 10.22.74.32
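
For repeat checks, the nc probe above could be scripted with a short timeout rather than waiting out the full 3m9s. A minimal Python sketch; the function name and the 5-second default are assumptions, not something used in this bug:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the name and attempts a full TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS failures alike.
        return False
```

From a build-vpn box this would be called as port_is_open("hgweb1.dmz.scl3.mozilla.com", 80). A False result does not distinguish a missing network flow from a down service, so it is only a first-pass check.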

NOTE: the following command can be used to verify URLs without modifying the /etc/hosts file:
  curl -IH "Host: hg.mozilla.org" http://hgweb1.dmz.scl3.mozilla.com/build/tools
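
The same Host-header trick can be scripted if the check needs to run from a test harness. This is a rough Python equivalent of the curl command above, using only the standard library; the helper names are made up for illustration:

```python
import urllib.request


def vhost_head_request(backend: str, vhost: str, path: str) -> urllib.request.Request:
    """Build a HEAD request aimed at `backend` but asking for the virtual host `vhost`."""
    url = f"http://{backend}/{path.lstrip('/')}"
    # An explicit Host header takes the place of the one urllib would derive from the URL.
    return urllib.request.Request(url, headers={"Host": vhost}, method="HEAD")


def fetch_status(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code (raises on network or HTTP errors)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

For the URL in the note, that would be fetch_status(vhost_head_request("hgweb1.dmz.scl3.mozilla.com", "hg.mozilla.org", "/build/tools")), which still needs a working network flow to port 80.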
Still the same: connection timeout on TCP port 80.
We need to test ssh access as well, so a VM is being set up in bug 781995 for this purpose.
Depends on: 781995
correct blocking bug number - apologies for spam
Depends on: 781955
No longer depends on: 781995
This doesn't look like something actionable by buildduty to me.
Whiteboard: [reit][buildduty] → [reit]
Host is hg-test.dmz.scl3.mozilla.com

Hal/releng,

Can we get a list of tests you'd like to run? We'd like to make sure the right firewall holes are opened up on this host for you.
Summary: sanity check hg 2.2.3 on hgweb1.dmz.scl3.mozilla.com → sanity check hg 2.3.x on hg-test.dmz.scl3.mozilla.com
Shyam,

We need http access (assuming https is terminated at the lb), and ssh access. That's it.

Also, we need some dummy content we can update - one of my user repos is fine.
Hal,

I was planning on getting a local dump of the entire repo for you to play with. Or, to save time, we could just do a few important ones, like m-c and try?
Yes - our interest is that nothing changed with the functionality of the hooks we care about. m-c & try would be good candidates to capture the hooks we care about.

Also, on the test box, it'd be handy to have a shell account with sufficient rights to view all log files (to confirm actions, since it won't be hooked up to an operational pushlog, etc.)
note: holding on this to understand bug introduced in hg 2.1 see http://bz.selenic.com/show_bug.cgi?id=3648#c8

Current discussion is about hg client versions - want to confirm that updating server version (from current 2.0.2) won't cause the problem more often or for more users.
OS: Mac OS X → All
Hardware: x86 → All
Sorry again, the right host is hgssh.stage.dmz.scl3.mozilla.com
Summary: sanity check hg 2.3.x on hg-test.dmz.scl3.mozilla.com → sanity check hg 2.3.x on hgssh.stage.dmz.scl3.mozilla.com
I've created a user for you, the VM is available to test at hgssh.stage.dmz.scl3.mozilla.com.  Please login with the username 'hwine' which has sudo access.  The hg repos are mounted read-only, please do whatever testing you need on there.
(In reply to Hal Wine [:hwine] from comment #13)
> note: holding on this to understand bug introduced in hg 2.1 see
> http://bz.selenic.com/show_bug.cgi?id=3648#c8
> 
> Current discussion is about hg client versions - want to confirm that
> updating server version (from current 2.0.2) won't cause the problem more
> often or for more users.

per http://bz.selenic.com/show_bug.cgi?id=3648#c10 the case folding bug is client side only. Yay!
does anything else need to be done for this?
Yes - this is not yet done - the other issue was a temporary blocker. Now we can get back to the normal process.
FWIW, I ran the hghooks/pushlog unit tests on:

$ hg --version
Mercurial Distributed SCM (version 2.3.2)

on my local machine and they all pass, so no problems there.
This is a quarterly goal for us. Can we get a status update so we can fix 741353?
I'll bring it up for discussion at our next team meeting, and we'll update after that.
(In reply to Hal Wine [:hwine] from comment #21)
> I'll bring it up for discussion at our next team meeting, and we'll update
> after that.

Hal, any updates here?
(In reply to Shyam Mani [:fox2mike] from comment #22)
> 
> Hal, any updates here?

Yeah - we're swamped, so this is unlikely to complete this quarter without help.

Our main grief is we have to cross check across all the hg client versions we support (see, e.g. bug 779569 comment #0 - slightly out of date, but the list is still long).

If you know of any compatibility matrix that shows the testing done by the mercurial team as part of their release process, that'd be very helpful.
(In reply to Hal Wine [:hwine] from comment #23)
> Our main grief is we have to cross check across all the hg client versions
> we support (see, e.g. bug 779569 comment #0 - slightly out of date, but the
> list is still long).
> 

I don't understand what you're afraid of here. The build slaves don't do anything particularly complicated, they just clone and pull/update, right? Unless you're afraid upstream broke the wire protocol this doesn't seem worth the effort. I think we should assume that our upstream tools have done at least minimal QA and not waste our time trying to do it for them. (Especially when we don't have the time to waste, as evidenced by how long this bug has been sitting.)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #24)
> (In reply to Hal Wine [:hwine] from comment #23)
> > Our main grief is we have to cross check across all the hg client versions
> > we support (see, e.g. bug 779569 comment #0 - slightly out of date, but the
> > list is still long).
> > 
> 
> I don't understand what you're afraid of here. The build slaves don't do
> anything particularly complicated, they just clone and pull/update, right?
> Unless you're afraid upstream broke the wire protocol this doesn't seem
> worth the effort. I think we should assume that our upstream tools have done
> at least minimal QA and not waste our time trying to do it for them.
> (Especially when we don't have the time to waste, as evidenced by how long
> this bug has been sitting.)

Upstream has broken the wire protocol for us before; e.g. they changed how each side advertised its heads to the other, which busted build slaves trying to clone the try repo because the HTTP headers exploded in size and caused the load balancer to drop or truncate the requests.
Thanks for the info. That doesn't sound like "broke the wire protocol" to me, but "broke using our load balancer". That's a pretty fair thing to test against.
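
To get a feel for why a many-headed repo like try is the risky case, here is a back-of-the-envelope Python sketch. It assumes the heads are sent as space-separated 40-character hex changeset IDs, and the limit passed in is hypothetical; the real load balancer limit is not stated in this bug:

```python
NODE_HEX_LENGTH = 40  # a Mercurial changeset ID is a SHA-1 hash, 40 hex characters


def heads_argument_bytes(head_count: int) -> int:
    """Bytes needed to list `head_count` node IDs separated by single spaces."""
    if head_count <= 0:
        return 0
    return head_count * NODE_HEX_LENGTH + (head_count - 1)


def exceeds_limit(head_count: int, limit_bytes: int) -> bool:
    """True if the heads list alone would not fit under `limit_bytes`."""
    return heads_argument_bytes(head_count) > limit_bytes
```

Under these assumptions 1000 heads already need 40999 bytes, so exceeds_limit(1000, 8192) is True while exceeds_limit(100, 8192) is False. A realistic test would push enough heads to the staging try repo to cross whatever the load balancer actually allows.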
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #24)
> I think we should assume that our upstream tools have done
> at least minimal QA and not waste our time trying to do it for them.

Agreed that we shouldn't duplicate any upstream testing. But besides the case mentioned in comment 25, there have been some hg client side regressions, so we do need some level of internal testing. 

For example, hg 2.1 introduced a client side regression (still not fixed) that impacted non-case sensitive file systems (Mac & Windows for us). You might recall the hassle: http://bz.selenic.com/show_bug.cgi?id=3648#c8. (Fortunately, we move slowly enough to not have been affected by this bug in production. :/)
Assignee: nobody → bugspam.Callek
Mercurial 2.5 has just been released (http://mercurial.selenic.com/wiki/WhatsNew) with some substantial performance improvements. 

Would be good to see if we can get onto 2.3 as soon as possible, to then pave the way for 2.5 :-)
Another item to test - interaction with MercurialVCS per bug 828029
See Also: → 828029
I'd suggest we just upgrade straight to 2.5.

I can bang out the script to test all versions in an afternoon, but only on Linux. If you're concerned with OS X (case insensitive) and Windows support, you'll have to do that legwork on your own. The thing we seem to be concerned with here is wire protocol, not platform support.

What operations/tests should this be doing?

hg update? hg clone? hg pull? Between all mercurial versions 1.5 or newer?
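
One possible shape for such a script is a version-by-operation matrix. This Python sketch only builds the command lines; the version list is illustrative, and the /opt/hg/<version>/bin/hg layout is an assumption about how side-by-side clients might be installed, not something from this bug:

```python
from itertools import product

# Illustrative inputs only: the full supported-client list lives elsewhere.
CLIENT_VERSIONS = ["1.5", "2.0.2", "2.5.4"]
OPERATIONS = ["clone", "pull", "update"]


def hg_binary(version: str, prefix: str = "/opt/hg") -> str:
    """Path to one client build, assuming side-by-side installs under `prefix`."""
    return f"{prefix}/{version}/bin/hg"


def build_matrix(versions, operations, repo_url, workdir):
    """Return one (version, operation, argv) tuple per version/operation pair."""
    matrix = []
    for version, operation in product(versions, operations):
        hg = hg_binary(version)
        clone_dir = f"{workdir}/{version}"
        if operation == "clone":
            argv = [hg, "clone", repo_url, clone_dir]
        elif operation == "pull":
            argv = [hg, "-R", clone_dir, "pull", repo_url]
        elif operation == "update":
            argv = [hg, "-R", clone_dir, "update"]
        else:
            raise ValueError(f"unsupported operation: {operation}")
        matrix.append((version, operation, argv))
    return matrix
```

Each argv can then be handed to subprocess.run and the exit codes collected per cell. Because product iterates operations within each version, the clone for a given version is listed before the pull and update that depend on it.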
(In reply to Ben Kero [:bkero] from comment #30)
> I'd suggest we just upgrade straight to 2.5.

The (supposedly) final 2.5.x version, 2.5.3, just got released.
Depends on: 867470
Depends on: 868279
:ted, 

You mentioned in IRC that you had run the hooks through the test suite for 2.5.4 and that one had failed. Can you officially state the following here:

* For the hook/test that failed {one of:}
** The test is broken, with the hook being fine, and will be fixed in Bug XXX
** The hook is broken and has problem X, described in Bug XXX and will be fixed there.
** The results of failed test are {x} but I {ted} have no time to look further... Bug XXX has been opened to track the issue.

* That the [remaining] full list of hooks are working in a satisfactory way with hg 2.5.4.
** (making sure that we either have a full test suite that was run, or by human code scanning/etc of hooks without automated tests)

* There are no config changes or patches blocking this deployment from the hook side.

:ted, :fox2mike, :bkero,

* hgweb templates?
-- I would like to identify which teams/persons will be able to qualify that our hgweb templates meet our use-case needs.
-- Two primary concerns around that:
  -- changes that break functionality [for human or app] (e.g. the pushlog json views)
  -- changes that will surprise our community/developers in their use-cases.
   -- surprises will likely not block rollout but will need to be enumerated and communicated prior to cutover to new hg server.
Flags: needinfo?(ted)
Flags: needinfo?(shyam)
Flags: needinfo?(bkero)
Summary: sanity check hg 2.3.x on hgssh.stage.dmz.scl3.mozilla.com → sanity check hg 2.5.4 on hgssh.stage.dmz.scl3.mozilla.com
The hooks work fine with 2.5.4. I tested the pushlog extension and I'm having some issues. I pushed one change to fix an issue I hit:
http://hg.mozilla.org/hgcustom/pushlog/rev/e4b06bae4d5b

Now the tests are running (and claim they're passing), but there's a lot of error spew from the hgweb HTTP server the test suite runs, so I'm not sure if it's actually passing or not.

If the pushlog test suite succeeds then the pushlog JSON+ATOM feeds should be fine, and it runs some very basic sanity checks on the HTML view, so that shouldn't be horribly broken.
Flags: needinfo?(ted)
I poked at it a bit more and the errors I was seeing seem harmless, they look like they're just a result of the way I'm running the tests. I pushed a wallpaper fix, so all the pushlog tests pass with hg 2.5.4 now.
Not quite sure what use I can be with the templates stuff, ben might have a better answer for you there.
Flags: needinfo?(shyam)
I've just finished setting up the SSH and HTTP portions of the hgssh.stage server.

You have 2 repositories in place, mozilla-central and try
All templates should be the same, all extensions should be the same.

Please let me know if anything is missing.
Flags: needinfo?(bkero)
First [known] failure/issue:

* pushing to try/ is hanging, in a test with no trychooser syntax present.
** Expected is remote: aborting with an informative error message.
** client not returning to shell

:bkero, can you tell me if there is anything obvious here to fix, as well as what version of the hg hooks we have installed?

(I'm working in https://etherpad.mozilla.org/hg-server-testing-plan so far which has a link to my pastebin of output -- I'll drop final plans to real bug/wiki when done)
Flags: needinfo?(bkero)
I talked with Callek on IRC last night and we determined (with the --debug flag and running htop on the server) that the "hanging" was actually a large system load on the server (8+ on a 1-core server). It eventually finished I believe.

I've also fixed the permission issue you were referring to earlier.

Callek, is there anything more you need from me?
Flags: needinfo?(bkero)
per Callek in IRC, all good to go. Callek - please confirm & resolve.
Status: NEW → ASSIGNED
Flags: needinfo?(bugspam.Callek)
(In reply to Hal Wine [:hwine] from comment #39)
> per Callek in IRC, all good to go. Callek - please confirm & resolve.

Indeed. I thought of one helpful tidbit that would be good for me, attaching the cltbld pubkey to my @gmail.com privkey for my hg user on the host.

This would simplify the windows testing so I don't have to pull my own privkey onto a windows server, since ssh -A doesn't work for connecting to our windows systems.
Flags: needinfo?(bugspam.Callek)
I've added the cltbld public key to Callek's LDAP credentials as he requested so he can do further testing on Windows.
I was just working on this, since I woke up early today, and realized (with Usul's help) that we added the cltbld@gitolite key, not the actual hg production key.

:Usul is fixing that up now, but wanted an explicit additional request here for the fix.
Added Callek's key (http://pastebin.mozilla.org/2422550) to LDAP too.
This has been sanity checked and releng signs off.

I note that Bug 774766 was deployed outside of my testing window, and any fallout from the upgrade around that change is unverified. (My comments regarding that are in its bug.)

I also note that due to technical limitations we did not test anything with the load balancer:
* see-related https://bugzil.la/781012#c25

The following was part of my sanity check:
* Clone over ssh yields the same repo end state as a clone over ssh of the current production repo
* Clone over http yields the same repo end state as a clone over http of the current production repo
* Pushlog/json-pushes yields correct results
* Pushing multiple heads on a branch fails
* Pulling by rev for try continues to only get one head grabbed
* Pushing to try still works with more heads.
* Hg Phase behavior will not regress in the upgrade
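
For the next update cycle, the "same repo end state" checks in the list above could be approximated by comparing the head changesets of a staging clone and a production clone. This is a hedged Python sketch, not the procedure actually used here; the function names are invented, and it assumes an hg binary on PATH:

```python
import subprocess


def repo_heads(repo_path: str, hg: str = "hg") -> set:
    """Return the set of head changeset IDs in a local clone."""
    out = subprocess.run(
        [hg, "-R", repo_path, "log", "-r", "heads(all())", "--template", "{node}\n"],
        check=True, capture_output=True, text=True,
    ).stdout
    return set(out.split())


def diff_heads(staging: set, production: set) -> dict:
    """Summarise which heads are missing from, or extra in, the staging clone."""
    return {
        "missing_from_staging": sorted(production - staging),
        "extra_in_staging": sorted(staging - production),
        "identical": staging == production,
    }
```

Since a changeset ID covers its whole ancestry, identical head sets imply identical history. Phases, bookmarks and pushlog data are not covered by this comparison and would need their own checks.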

The following are suggestions/improvements to be made during or shortly after the upgrade:
* A single push to try (to make sure the cache of the repo is populated before it opens; the initial push will be slow, but later ones should be fast)
* Set mercurial phase for try to be non-publishing (Bug 725362)
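
For reference, a repository is made non-publishing by setting publish = False in the [phases] section of its hgrc. The Python sketch below produces that text with the standard-library configparser; treating an hgrc as plain INI is an assumption that may not hold for files using Mercurial-specific syntax such as %include, so take it as an illustration of the setting rather than a deployment tool:

```python
import configparser
import io


def with_non_publishing(hgrc_text: str = "") -> str:
    """Return hgrc text with [phases] publish set to False."""
    parser = configparser.RawConfigParser()
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(hgrc_text)
    if not parser.has_section("phases"):
        parser.add_section("phases")
    parser.set("phases", "publish", "False")
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()
```

Called with no argument it returns just the [phases] section; called on existing text it keeps the other sections and adds or overwrites the setting.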
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering