Closed Bug 1055319 Opened 5 years ago Closed 5 years ago

Add soft-start mechanism for Loop in release builds

Categories

(Hello (Loop) :: Client, defect)

defect
Not set

Tracking

(firefox34+ verified, firefox35 fixed)

RESOLVED FIXED
mozilla35
Tracking Status
firefox34 + verified
firefox35 --- fixed

People

(Reporter: abr, Assigned: abr)

References

Details

(Whiteboard: [loop-uplift])

Attachments

(1 file, 6 obsolete files)

In order to provide a controlled ramp-up of the Loop service once it gets into release
(mostly to prove out server capacity at various load levels), we need to have some means
of throttling its appearance in the release toolbar.

This patch adds a set of "soft-start" configuration prefs that control when the
"loop.enabled" pref transitions from false to true in release builds. Upon first run,
the client chooses an integer between 1 and 4294967294 ((2 ^ 32) - 2) inclusive. It then
checks a configured URL to retreive an integer in this range. When initially deployed,
this integer will be relatively small, and will let on a relatively small number of
users. As we wish to ramp up usage, we adjust this number upwards until it reaches
((2 ^ 32) - 1), which will serve to activate every client.

We still need to identify a server that can serve the static integer, with enough scale
to do so once per Firefox instance startup.
This implementation is code complete, and requires testing.
Anthony: similar to the push server scaling, I wanted to tag you on this bug to let you know that we plan to land this for 34 also, so that we can throttle client deployment.
Flags: needinfo?(anthony.s.hughes)
Apart from checking integer values in about:config, is there any other way to test this is working?
Flags: needinfo?(anthony.s.hughes)
QA Contact: anthony.s.hughes
We would like to have the number defined in a DNS TXT record rather than getting it with an http request.

On the radar is the ability (or non ability) to get information from DNS TXT records with javascript from inside gecko.
I have implemented a PROOF of Concept here:

    host -t txt loop-dev.stage.mozaws.net 

And a python script to give back the rampup number:

    # pip install dnspython
    import dns.resolver
    answer = dns.resolver.query("loop-dev.stage.mozaws.net", "TXT")
    for data in answer:
        key, value = data.to_text().strip('"').split("=")
        if key == "rampup":
            print int(value)
(In reply to Alexis Metaireau (:alexis) from comment #4)
> We would like to have the number defined in a DNS TXT record rather than
> getting it with an http request.
> 
> On the radar is the ability (or non ability) to get information from DNS TXT
> records with javascript from inside gecko.

Right. This is clever, and I like the scaling properties, but the Gecko DNS resolver doesn't allow the lookup of things like TXT or SRV (see Bug 14328). I've double checked the code in netwerk/dns, and it's very much oriented towards A and AAAA records.

So, we have two options: (1) Use an HTTP endpoint as I suggest above, or (2) use a hack involving A records.

We're planning on using a simple 32-bit integer for this value anyway, and DNS A records happen to return 32-bit integers. We represent them in dotted-decimal notation, but that's just a nicety to make them easier to read and write.

So, keeping with the DNS-based approach, we could provision an A record representing the integer, and query this with normal A-record resolution techniques.

I just double checked, and bind9 is more than happy to accept an A record of the form:

  soft-start     	 	A	4294967295

Which makes editing the number relatively fool-proof -- in other words, you don't need to worry about translating the desired load number into dotted-decimal format when you add/change up the DNS record.

If you're happy with this, I'll tweak the patch to use this technique instead.
Flags: needinfo?(alexis+bugs)
I like it it is even easier than to parse a TXT value.
With Route53 I couldn't configure directly an integer.

    host soft-start.loop-dev.stage.mozaws.net

Here is a quick script to convert an integer to ip and the reverse:

    import socket, struct

    def ip2int(addr):
        return struct.unpack("!I", socket.inet_aton(addr))[0]


    def int2ip(addr):
        return socket.inet_ntoa(struct.pack("!I", addr))
Attachment #8474847 - Attachment is obsolete: true
Thanks!
Flags: needinfo?(alexis+bugs)
Attachment #8479106 - Attachment is obsolete: true
Attachment #8480892 - Attachment is obsolete: true
Attachment #8480892 - Flags: review?(standard8)
Comment on attachment 8480896 [details] [diff] [review]
Add DNS-based soft-start mechanism for Loop in release buildsBug 1055319 - Add DNS-based soft-start mechanism for Loop in release builds

Mark or one of the desktop team should review this.
Attachment #8480896 - Flags: review?(standard8) → review?(mhammond)
Dolske, are you able to both sanity-check the concept (including considering if we really want to do this immediately on delayed-browser-start) and take (or delegate) the review?
Flags: needinfo?(dolske)
Comment on attachment 8480896 [details] [diff] [review]
Add DNS-based soft-start mechanism for Loop in release buildsBug 1055319 - Add DNS-based soft-start mechanism for Loop in release builds

Review of attachment 8480896 [details] [diff] [review]:
-----------------------------------------------------------------

I assume this is all essentially temporary code, in that after Loop is fully-launched, this can be removed a release or two later? (Unless disaster strikes and scaling up Loop takes longer than that.)

I suppose I'll also note that running the softstart from a DNS resolution can result in the oddity of a button magically appearing some arbitrary amount of time after startup... Not sure there's any solution to that -- best I can think of is always having Loop disabled until the _second_ startup (ie, the slowstart effect isn't seen until you've reset the browser), but that seems pretty undesirable and would make testing a pain.

Another inherent "hmm how to do that" is that if we're going to promo Loop in first-run tour pages, the page obviously can't promo a button that's been throttled. Nor will the first-run page be shown on future startups, after throttling is over. That could be a significant problem for exposure?

r- mainly for the .enabled bit, but I obviously have a few questions too.

::: browser/app/profile/firefox.js
@@ +1578,5 @@
>  #else
>  pref("loop.enabled", false);
> +pref("loop.soft_start", true);
> +pref("loop.soft_start_ticket_number", "4294967295"); // (2^32)-1
> +pref("loop.soft_start_hostname", "soft-start.loop-dev.stage.mozaws.net");

Followup bug to move this to a non-dev / non-staging host?

::: browser/base/content/browser-loop.js
@@ +41,5 @@
> +        buttonNode.hidden = true;
> +        try {
> +          if (Services.prefs.getBoolPref("loop.soft_start")) {
> +            MozLoopService.checkSoftStart(buttonNode);
> +          }

Seems like this should all go down a few lines, so that when loop.enabled == false nothing at all happens. In other words, if Loop is disabled, there should be no soft-start code running.

::: browser/components/loop/MozLoopService.jsm
@@ +654,5 @@
> +      }
> +      return;
> +    }
> +    // We use a char pref because int prefs are signed 32-bit
> +    // integers, and we want to use numbers larger than (2^31)-1

But surely you don't need > 1 part-per-billion precision? The pref can be an arbitrary scale mapped to the DNS value.

@@ +656,5 @@
> +    }
> +    // We use a char pref because int prefs are signed 32-bit
> +    // integers, and we want to use numbers larger than (2^31)-1
> +    let ticket = parseInt(Services.prefs.getCharPref("loop.soft_start_ticket_number"));
> +    if (!ticket || ticket > MAX_SOFT_START_TICKET_NUMBER || ticket < 0) {

Personally, I'd probably have phrased this all around a "self-throttle percentage", with the server initially returning ~100 going to 0. But fine, same end effect. :)

I think it would be less confusing if the uninitialized ticket number pref was -1. That's more human-readable, and makes it more obvious when troubleshooting that something hasn't yet been set.

Probably is unlikely to happen, but storing an int in a char pref opens up the possibility for a subtle bug if someone creates it as an integer pref in a testing profile. I've footgunned myself with dumber things, so if this saves a single developer an hour of "WTF?!" time it would be a win. ;)

Finally, even less likely: if Math.random() returns 0 (or so close that ticket ends up 0), you could end up soft-starting again?

@@ +693,5 @@
> +
> +      if (now_serving > ticket) {
> +        // Hot diggity! It's our turn! Activate the service.
> +        console.log("MozLoopService: Activating Loop via soft-start");
> +        Services.prefs.setBoolPref("loop.enabled", true);

This shouldn't use loop.enabled, since you could effectively be enabling it if someone has explicitly disabled it (including us, should we need to urgently disable Loop for some reason). I think this should instead just flip loop.soft_start to false, and leave .enabled as a generic master switch that isn't touched.

@@ +706,5 @@
> +    // We use DNS to propagate the slow-start value, since it has well-known
> +    // scaling properties. Ideally, this would use something more semantic,
> +    // like a TXT record; but we don't support TXT in our DNS resolution (see
> +    // Bug 14328), so we instead treat the IP address corresponding to our
> +    // "slow start DNS name" as a 32-bit integer.

Clever, although I worry about two things:

1) Does this provide adequate control? DNS caching can be finicky, and could result is not being able to throttle fast enough, or weird effects if the throttle drops faster than clients see the new value (possibly resulting in a surge when caches expire).

2) This opens up the potential for security fun, since you'd be pointing a mozaws.net(?) hostname at random IP addresses. Should a lucky owner of that IP decide to take advantage of it, they could do interesting things like read cookies, since they'd be a bona-fide site for that domain. [This would require separate effort to get browsers to visit the site -- it's not a direct Loop bug -- but that's easy enough for someone to do.]

So I think you ought to get a secreview on that. But I suspect there's an easy workaround: limit the IP range to a subnet we control. Even just 8 bits from a class C would would give you 0.5% throttling precision, which seems like plenty. (At 500million users, that's ~2mil a tick).
Attachment #8480896 - Flags: review?(mhammond) → review-
I'll be around over the weekend for a followup review, although I'll be traveling on Monday. Shouldn't be a big deal to uplift after Tuesday, but I assume you're trying to get it in before that.
Flags: needinfo?(dolske)
Thanks for the quick turnaround!

(In reply to Justin Dolske [:Dolske] from comment #15)
> I assume this is all essentially temporary code, in that after Loop is
> fully-launched, this can be removed a release or two later? (Unless disaster
> strikes and scaling up Loop takes longer than that.)

That's the plan, yes.

> I suppose I'll also note that running the softstart from a DNS resolution
> can result in the oddity of a button magically appearing some arbitrary
> amount of time after startup... Not sure there's any solution to that --
> best I can think of is always having Loop disabled until the _second_
> startup (ie, the slowstart effect isn't seen until you've reset the
> browser), but that seems pretty undesirable and would make testing a pain.

That was my analysis also.

> Another inherent "hmm how to do that" is that if we're going to promo Loop
> in first-run tour pages, the page obviously can't promo a button that's been
> throttled. Nor will the first-run page be shown on future startups, after
> throttling is over. That could be a significant problem for exposure?

Would it be acceptable to have a supplemental first-time page that is shown
some time after the button appears?

> r- mainly for the .enabled bit, but I obviously have a few questions too.

I considered doing this two different ways, and it sounds like you would
prefer the other one.

My analysis here was that code that looks at "loop.enabled" should discover
that the service is off if the current client hasn't been activated (rather
than having to check both loop.enabled and whatever pref we use for the
throttling). If, for whatever reason, we wanted to shut down the service in
the field, we could set the slow-start pref to off (leaving the loop.enabled
pref off also).

I'm hearing that you want the loop.enabled pref to be on, even if the client
hasn't yet been selected to participate in Loop; and that any code that wants
to check if it should do Loopy things needs to check both loop.enabled *and*
the throttling flag. I'm happy to change it to work that way, with the caveat
that there's an outside chance that someone might land code that only looks at
"enabled" and ignores the throttling flag.

> ::: browser/app/profile/firefox.js
> @@ +1578,5 @@
> >  #else
> >  pref("loop.enabled", false);
> > +pref("loop.soft_start", true);
> > +pref("loop.soft_start_ticket_number", "4294967295"); // (2^32)-1
> > +pref("loop.soft_start_hostname", "soft-start.loop-dev.stage.mozaws.net");
> 
> Followup bug to move this to a non-dev / non-staging host?

Bug 1060809.

> ::: browser/components/loop/MozLoopService.jsm
> @@ +654,5 @@
> > +      }
> > +      return;
> > +    }
> > +    // We use a char pref because int prefs are signed 32-bit
> > +    // integers, and we want to use numbers larger than (2^31)-1
> 
> But surely you don't need > 1 part-per-billion precision? The pref can be an
> arbitrary scale mapped to the DNS value.

Sure. I was trying to make this as foolproof as possible -- if we use the
entire 32-bit space as our range, then whoever adjusts the number can't do
something like think we're using all 32 bits when in fact we're only using,
say, 16, thereby getting 65000 times the load they were expecting. But we can
manage this by making sure that there are very clear expectations around the
number range.

I also have an idea around a technical solution that helps with this a bit,
which I explain below.

> @@ +656,5 @@
> > +    }
> > +    // We use a char pref because int prefs are signed 32-bit
> > +    // integers, and we want to use numbers larger than (2^31)-1
> > +    let ticket = parseInt(Services.prefs.getCharPref("loop.soft_start_ticket_number"));
> > +    if (!ticket || ticket > MAX_SOFT_START_TICKET_NUMBER || ticket < 0) {
> 
> Personally, I'd probably have phrased this all around a "self-throttle
> percentage", with the server initially returning ~100 going to 0. But fine,
> same end effect. :)
> 
> I think it would be less confusing if the uninitialized ticket number pref
> was -1. That's more human-readable, and makes it more obvious when
> troubleshooting that something hasn't yet been set.

Sure. Based on your comments, I'm changing this to be a 24-bit number, so -1
opens up as an option.

> Finally, even less likely: if Math.random() returns 0 (or so close that
> ticket ends up 0), you could end up soft-starting again?

"ticket = Math.floor(Math.random() * MAX_SOFT_START_TICKET_NUMBER) + 1;"
can't return 0 because of the "+ 1" at the end. It shouldn't be larger than
MAX_SOFT_START_TICKET_NUMBER because Math.random() always returns less than
1.0 (although floating point arithmetic is dicey enough that I think I'll add
an explicit check to ensure this doesn't happen).

> @@ +706,5 @@
> > +    // We use DNS to propagate the slow-start value, since it has well-known
> > +    // scaling properties. Ideally, this would use something more semantic,
> > +    // like a TXT record; but we don't support TXT in our DNS resolution (see
> > +    // Bug 14328), so we instead treat the IP address corresponding to our
> > +    // "slow start DNS name" as a 32-bit integer.
> 
> Clever, although I worry about two things:
> 
> 1) Does this provide adequate control? DNS caching can be finicky, and could
> result is not being able to throttle fast enough, or weird effects if the
> throttle drops faster than clients see the new value (possibly resulting in
> a surge when caches expire).

The idea here would be to set the TTL on the DNS record to something quite
low, probably in the range of 10 minutes to an hour. I'll leave it up to the
ops guys to determine how they want to trade off authoritative DNS server load
against load ramp-up responsivity, but I suspect that values in this range
would give them the right balance.

Yes, some DNS caches might not honor the TTL, and some hosts might do exhibit
odd caching behavior, but this is a stochastic startup, and there's some
tolerance for a little bit of slop.

The intention here is that the DNS value won't be changed very frequently --
probably on the order of once per day or so -- so that we have time to let
load settle in and time to evaluate the impact on the server. This is
inherently a conservative mechanism, and the intention is certainly to deploy
it conservatively.

> 2) This opens up the potential for security fun, since you'd be pointing a
> mozaws.net(?) hostname at random IP addresses. Should a lucky owner of that
> IP decide to take advantage of it, they could do interesting things like
> read cookies, since they'd be a bona-fide site for that domain. [This would
> require separate effort to get browsers to visit the site -- it's not a
> direct Loop bug -- but that's easy enough for someone to do.]

While that would be ludicrously difficult to exploit, I do see the danger
you're concerned about. Rather than restricting this to a set of routable IP
addresses that we control, it seems to be even safer to make these addresses
functionally useless. The most foolproof address range meeting this criteria
(that is, the one that is most enforced by OSes and routing equipment) is
127.0.0.0/24, which is required never to leave the host.

So that's what I've done in this revised patch.
Attachment #8481818 - Flags: review?(dolske)
Attachment #8480896 - Attachment is obsolete: true
(In reply to Adam Roach [:abr] from comment #17)

> > Another inherent "hmm how to do that" is that if we're going to promo Loop
> > in first-run tour pages, the page obviously can't promo a button that's been
> > throttled. Nor will the first-run page be shown on future startups, after
> > throttling is over. That could be a significant problem for exposure?
> 
> Would it be acceptable to have a supplemental first-time page that is shown
> some time after the button appears?

More of a question for Product/Marketing/UX. Having Loop open a new tab on its wouldn't be hard, it's just a question of if that's the UX we want. And noodling over the edgecases, like avoiding a first-run tab and another Loop-first-run tab when it's unthrottled on first run, and when exactly the Loop tab should open (immediately? next run?), etc.

Followup fodder!

> I'm hearing that you want the loop.enabled pref to be on, even if the client
> hasn't yet been selected to participate in Loop; and that any code that wants
> to check if it should do Loopy things needs to check both loop.enabled *and*
> the throttling flag. I'm happy to change it to work that way, with the caveat
> that there's an outside chance that someone might land code that only looks
> at "enabled" and ignores the throttling flag.

Right. I think it's important that .enabled == false firmly disables things.

> > Finally, even less likely: if Math.random() returns 0 (or so close that
> > ticket ends up 0), you could end up soft-starting again?
> 
> "ticket = Math.floor(Math.random() * MAX_SOFT_START_TICKET_NUMBER) + 1;"
> can't return 0 because of the "+ 1" at the end.

Hey, I don't have time to read the whole line! *ahem* :)

> Yes, some DNS caches might not honor the TTL, and some hosts might do exhibit
> odd caching behavior, but this is a stochastic startup, and there's some
> tolerance for a little bit of slop.

Yeah. I'm not deathly afraid of it, just wanted to note it. Sounds like you've thought it through.

> The most foolproof address range meeting this criteria
> (that is, the one that is most enforced by OSes and routing equipment) is
> 127.0.0.0/24, which is required never to leave the host.

Ah, nice, WFM.
Comment on attachment 8481818 [details] [diff] [review]
Add DNS-based soft-start mechanism for Loop in release builds

Review of attachment 8481818 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/components/loop/MozLoopService.jsm
@@ +696,5 @@
> +        return
> +      }
> +
> +      // Can't use bitwise operations here because JS treats all bitwise
> +      // operations as 32-bit *signed* integers.

FWIW, if I remember my asm.js trickery correct, you can sprinkle in ">>> 0" as a way to get unsigned results (e.g. "-1 >>> 0"). But it's kind of ugly and I've found it easy to screw up.
Attachment #8481818 - Flags: review?(dolske) → review+
Landed a follow-up fix for unit-tests:

https://hg.mozilla.org/integration/fx-team/rev/b2bae68e3809
Target Milestone: --- → mozilla34
Unfortunately, we had to back this out due to the unit test failures continuing even after the attempted fix:

https://hg.mozilla.org/integration/fx-team/rev/dd158b25ffc2
https://hg.mozilla.org/integration/fx-team/rev/aae76004579d

13:37:24     INFO -  1412 INFO TEST-START | chrome://mochitests/content/browser/browser/components/tabview/test/browser_tabview_bug659594.js
13:37:24     INFO -  JavaScript error: resource:///modules/loop/MozLoopService.jsm, line 710: NS_ERROR_OFFLINE: Component returned failure code: 0x804b0010 (NS_ERROR_OFFLINE) [nsIDNSService.asyncResolve]
13:38:09     INFO -  TEST-INFO | screenshot: exit status 0
13:38:09     INFO -  dumping last 3 message(s)
13:38:09     INFO -  if you need more context, please use SimpleTest.requestCompleteLog() in your test
13:38:09     INFO -  1413 INFO checking window state
13:38:09     INFO -  1414 INFO TEST-PASS | chrome://mochitests/content/browser/browser/components/tabview/test/browser_tabview_bug659594.js | It is now offline
13:38:09     INFO -  1415 INFO Console message: [JavaScript Error: "NS_ERROR_OFFLINE: Component returned failure code: 0x804b0010 (NS_ERROR_OFFLINE) [nsIDNSService.asyncResolve]" {file: "resource:///modules/loop/MozLoopService.jsm" line: 710}]
13:38:09     INFO -  1416 INFO TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/browser/components/tabview/test/browser_tabview_bug659594.js | Test timed out
[Tracking Requested - why for this release]:

This goes with Bug 1055139 and Bug 1060610. We need this to reduce load on the push infrastructure when Loop goes to release.  This bug may re-land before uplift, but if it doesn't, we need this in Fx34.
Now we check to make sure the browser is online before we try to hit DNS.
Running a full try since I've broken the tree enough today.

https://tbpl.mozilla.org/?tree=Try&rev=bfe1e6078c02
https://hg.mozilla.org/mozilla-central/rev/583c3b9d82aa
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
(In reply to Phil Ringnalda (:philor) from comment #25)
> https://hg.mozilla.org/mozilla-central/rev/583c3b9d82aa

But backed out in https://hg.mozilla.org/mozilla-central/rev/aae76004579d

New, improved version on try:
https://tbpl.mozilla.org/?tree=Try&rev=f432b62f0ed7
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Well, that hit some XPCSHELL failures because the loop.throttled pref got left out of the "EARLY_BETA_OR_EARLIER" section in my patch unrotting. Here's a fix to *that*, against try, for the XPCSHELL tests:

https://tbpl.mozilla.org/?tree=Try&rev=43c2189a4384
https://hg.mozilla.org/mozilla-central/rev/e4ef776167ca
Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED
Target Milestone: mozilla34 → mozilla35
Adam, is the testsuite coverage sufficient here or will this need specific manual testing?
Flags: qe-verify?
(In reply to Anthony Hughes, QA Mentor (:ashughes) from comment #30)
> Adam, is the testsuite coverage sufficient here or will this need specific
> manual testing?

The mochi tests that landed with this patch end up mocking enough of the functionality that I think we probably want some basic manual testing. A minimal manual test would look something like this:

1) Set "loop.loop.soft_start_hostname" to point to a DNS A record that you can easily update, with a very, very small TTL (like, 10 seconds or so). Start with the DNS record set to 127.0.0.0.

2) With "loop.enabled" and "loop.throttled" set to true, reset "loop.soft_start_ticket_number" to -1. Restart the browser. 

3) Verify that "loop.soft_start_ticket_number" is greater than 0, and less than 16777215.

4) Verify that "loop.throttled" is still "true" and that the Loop button is not visible.

5) Change "loop.soft_start_ticket_number" to 8388607

6) Set the DNS A record to 127.127.255.255

7) Restart the browser. Verify that "loop.soft_start_ticket_number" is still 8388607. Verify that "loop.throttled" is still "true" and that the Loop button is not visible.

8) Set the DNS A record to 127.128.0.0.

9) Restart the browser. Verify that "loop.soft_start_ticket_number" is still 8388607. Verify that "loop.throttled" is now FALSE, and that the Loop button is VISIBLE.
Attachment #8481818 - Attachment is obsolete: true
Attachment #8482425 - Attachment is obsolete: true
Comment on attachment 8484464 [details] [diff] [review]
Add DNS-based soft-start mechanism for Loop in release builds (as landed)

Carrying forward r+ from dolske

Approval Request Comment
[Feature/regressing bug #]:
Loop

[User impact if declined]:
Probable unavailability of Loop service sure to unpredictably large ramp-up of load when 34 hits release. Probable unavailability of all push services as Mozilla's push server cluster craters under the load. Possible service outage for some or all users of TokBox's services, if resulting load exceeds their deployed capacity.

[Describe test coverage new/current, TBPL]:
Core logic automatically checked by browser_mozLoop_softStart.js
Manual testing performed as described in comment 31.

[Risks and why]: 
The changes in this patch are pretty well isolated to Loop-specific components, so the chances that it impacts other features are fairly low. The risk to not taking this patch is the potential for a catastrophic overload of various infrastructure -- both Mozilla's and TokBox's -- when 34 gets into release.

[String/UUID change made/needed]:
None
Attachment #8484464 - Flags: review+
Attachment #8484464 - Flags: approval-mozilla-aurora?
ni Philipp to comment on the solution from a Firefox UX perspective. 
ni Chad for comment on the Loop roll out plan.

Both - See comment 0, comment 15, comment 18.
Flags: needinfo?(philipp)
Flags: needinfo?(cweiner)
I'm not sure if I understand everything correctly here, so let me recap:

- When Loop first hits release in 34, it won't be enabled for all users.
- Firefox will periodically poll the server to check if Loop can be enabled
- When the server gives a positive response, loop becomes available for this instance
- Then and only then will the loop icon show up anywhere in the UI

FWIW, to my knowledge, the plan was to only have Loop as a customize option in 34 and then expose it in primary UI in 35 (along with some explanation/promotion). Not sure if and how that plays into this.

In any case, I think we should only show the icon following the next restart after getting enabled. That way it is less likely to distract the user and we have a better chance to use a promotional mechanism to explain what is happening.
Flags: needinfo?(philipp)
(In reply to Adam Roach [:abr] from comment #31)
> 1) Set "loop.loop.soft_start_hostname" to point to a DNS A record that you
> can easily update, with a very, very small TTL (like, 10 seconds or so).
> Start with the DNS record set to 127.0.0.0.

I'm not quite sure how to do this. Can you please provide further instruction?
Flags: qe-verify?
Flags: qe-verify+
Flags: needinfo?(adam)
(In reply to Anthony Hughes, QA Mentor (:ashughes) from comment #36)
> (In reply to Adam Roach [:abr] from comment #31)
> > 1) Set "loop.loop.soft_start_hostname" to point to a DNS A record that you
> > can easily update, with a very, very small TTL (like, 10 seconds or so).
> > Start with the DNS record set to 127.0.0.0.
> 
> I'm not quite sure how to do this. Can you please provide further
> instruction?

I assume that you're referring to the DNS part of the process rather than the "loop.soft_start_hostname" pref (sorry for the additional ".loop" in the instructions).

Ideally, if you can get someone who is running a public DNS server to let you get access to that, this will be a lot easier. Just ask them how they update their zone files, add a new record for this soft-start host, and go to town.

That might be tricky, though. If you can't get that set up, you'll need to configure up your own DNS server and point your machine at it to run these tests. For these instructions, I'm going to assume you can set up a Linux machine running on something substantially similar to Ubuntu 12 LTS. This isn't the machine you're running the tests from; it's the machine you're going to use as your DNS server for testing. You'll need to perform these instructions as root.

First, install a DNS server:

# apt-get install bind9

Now, add a new local zone. In /etc/bind/named.conf.local, add:

> zone "moztest.net" IN {
>         type master;
>         file "/etc/bind/moztest.net";
> };

Create a new zone file, called /etc/bind/moztest.net, and put the following in it:

> $TTL 4h
> $ORIGIN moztest.net.
> @  1D  IN       SOA     dns.moztest.net. root.localhost. (
>         2014091201 10m 5m 10m 1m )
> 
>         IN   NS         dns.moztest.net.
> 
> soft-start              1m A            127.0.0.0
> dns                     5d A            w.x.y.z

Except you need to set "w.x.y.z" to whatever the IP address of your Linux machine is.

Now, start the bind server:

# service bind9 start

You should be able to verify that everything is working by typing:

# dig soft-start.moztest.net @localhost

Somewhere in the output, you should see:

> ;; ANSWER SECTION:
> soft-start.moztest.net.	60	IN	A	127.0.0.0

Assuming that all works, you're ready to use this. Go to the machine you're going to run Firefox on, and get into the network settings. Change the setting for the DNS server to the IP address of your Linux server that you just set up a DNS server on.

Now, run the test steps; when you need to set the value of "loop.soft_start_hostname", use soft-start.moztest.net. Every time you need to change the value of the DNS record, go to the Linux machine, update /etc/bind/moztest.net to reflect the new value, and restart the DNS server:

# service bind9 restart

Then, after that completes, set a timer for 60 seconds. Go grab a cup of coffee or just stretch your legs.

When your timer goes off, you're ready to run the next step of the test.

There's a lot of easily-googled information on the web about setting up and running bind9, if any of the preceding steps causes you any trouble.
Flags: needinfo?(adam)
Given how complex this is to test I do not think it's reasonable to expect a smoketest for this. I also don't think there's much value in going to all that effort if the test has already been run and is mostly covered by mochitest (comment 31 and 33). I'll re-evaluate if you think this is a high risk change but I need to balance that with how much time it's going to take to get the right environment set up to test this reliably.
Flags: needinfo?(adam)
(In reply to Anthony Hughes, QA Mentor (:ashughes) from comment #38)
> Given how complex this is to test I do not think it's reasonable to expect a
> smoketest for this. I also don't think there's much value in going to all
> that effort if the test has already been run and is mostly covered by
> mochitest (comment 31 and 33). I'll re-evaluate if you think this is a high
> risk change but I need to balance that with how much time it's going to take
> to get the right environment set up to test this reliably.

I don't think it's particularly high-risk. If you want to leave it in the hands of the mochi tests, they're likely to catch the vast majority of any conceivable regressions.
Flags: needinfo?(adam)
Thanks for your help with this Adam. I'll mark this qe-verify- for now but may loop back around to this if we get through higher risk work in time.
Flags: qe-verify+ → qe-verify-
You may also simply set the value that you want for this test in your /etc/hosts file.

> 127.0.0.0     soft-start.loop.services.mozilla.com
(In reply to Philipp Sackl [:phlsa] from comment #35)

> 
> - When Loop first hits release in 34, it won't be enabled for all users.

I'm not sure what enabled means in this context.  If it means "the Loop icon won't show up in chrome by default", this is my understanding.

> - Firefox will periodically poll the server to check if Loop can be enabled

same question with "enabled" here.

> - When the server gives a positive response, loop becomes available for this
> instance
> - Then and only then will the loop icon show up anywhere in the UI
> 
> FWIW, to my knowledge, the plan was to only have Loop as a customize option
> in 34 and then expose it in primary UI in 35 (along with some
> explanation/promotion). Not sure if and how that plays into this.

The plan, as you note below, is to have the Loop icon in the customize panel in 34 GA.  The goal of this work, as I understand it, is to allow us to surface the icon more prominently to a subset of our release users if and when we feel the need to actively drive more usage so we can get answers to the questions we're trying to answer in 34.  

Adam, Romain--let's make sure we're all on the same page?

> 
> In any case, I think we should only show the icon following the next restart
> after getting enabled. That way it is less likely to distract the user and
> we have a better chance to use a promotional mechanism to explain what is
> happening.
Flags: needinfo?(rtestard)
Flags: needinfo?(cweiner)
Flags: needinfo?(adam)
This patch will actually control whether the icon appears in the customization palette at all, not where it appears. We aren't going to have a mechanism in 34 to make the button appear in the default set of buttons.

Because of its relatively obscure location, we're counting on users not discovering it in droves, and maybe that's safe. But maybe it's not.

Basically, having this mechanism in 34 gives us a set of brakes that we can gradually let off as we get a feel for how the feature impacts server load. If, a week into this, we decide that the load we're seeing does not warrant a throttle, then we can pull out the stops and let it appear in everyone's browser.

But it's a lot better to have the control there and not need it than it is to need it and not have it.
Flags: needinfo?(adam)
Whiteboard: [loop-uplift]
Blocks: 1067845
 > In any case, I think we should only show the icon following the next restart
> after getting enabled. That way it is less likely to distract the user and
> we have a better chance to use a promotional mechanism to explain what is
> happening.

Adam, how about following Philipp's UX recommendation? ^
Flags: needinfo?(adam)
(In reply to Lawrence Mandel [:lmandel] from comment #44)
>  > In any case, I think we should only show the icon following the next
> restart
> > after getting enabled. That way it is less likely to distract the user and
> > we have a better chance to use a promotional mechanism to explain what is
> > happening.
> 
> Adam, how about following Philipp's UX recommendation? ^

I don't think we want to do that.

As Justin points out in comment 15, this would be very complicated to test.

But the real engineering issue here is that this is a throttle that's being used as part of a feedback loop to gradually ramp up load on the service. When we turn it up, we want to see the load impact as quickly as possible; otherwise, it becomes very difficult to overshoot what we're aiming for. Basically, it's the difference between having, say, a one-second lag between pressing the brakes and having the car slow down, and a five-second lag: the second one makes it much harder to slow down when you see a reason to do so.

As a final comment on the actual user-perceived experience here: except for the highly unusual circumstances where DNS takes multiple seconds to respond to a query, activation of the button will happen before the UI even finishes rendering (you can play around with the patch and demonstrate this to yourself). In other words, the only user-visible difference is whether the button appears when their browser is selected to be part of the service, or whether they only see it the next time. The second situation seems strictly worse, UX-wise.
Flags: needinfo?(adam)
(In reply to Adam Roach (Work week through Sep 22) [:abr] from comment #45)
> (In reply to Lawrence Mandel [:lmandel] from comment #44)
> >  > In any case, I think we should only show the icon following the next
> > restart
> > > after getting enabled. That way it is less likely to distract the user and
> > > we have a better chance to use a promotional mechanism to explain what is
> > > happening.
> > 
> > Adam, how about following Philipp's UX recommendation? ^
> 
> I don't think we want to do that.
> 
> As Justin points out in comment 15, this would be very complicated to test.
> 
> But the real engineering issue here is that this is a throttle that's being
> used as part of a feedback loop to gradually ramp up load on the service.
> When we turn it up, we want to see the load impact as quickly as possible;
> otherwise, it becomes very difficult to overshoot what we're aiming for.
> Basically, it's the difference between having, say, a one-second lag between
> pressing the brakes and having the car slow down, and a five-second lag: the
> second one makes it much harder to slow down when you see a reason to do so.
> 
> As a final comment on the actual user-perceived experience here: except for
> the highly unusual circumstances where DNS takes multiple seconds to respond
> to a query, activation of the button will happen before the UI even finishes
> rendering (you can play around with the patch and demonstrate this to
> yourself). In other words, the only user-visible difference is whether the
> button appears when their browser is selected to be part of the service, or
> whether they only see it the next time. The second situation seems strictly
> worse, UX-wise.

Oh, I wasn't aware that the check would only happen on startup. If we can be reasonably sure that the icon will appear, say, in the first 3 seconds after the browser window is visible, then that's OK with me.

The situation that would not be OK is when a user is in the middle of a browsing session an suddenly gets a new icon thrown into his toolbar.
Comment on attachment 8484464 [details] [diff] [review]
Add DNS-based soft-start mechanism for Loop in release builds (as landed)

I think we're all in agreement now about this bug. Approving for Aurora. Any work to surface Loop more prominently should be done in a followup bug.

Adam - Is the plan to remove the soft start mechanism sometime in the future after we have successfully launched to the entire user base? If so, can you file a bug on that now so that don't forget?
Attachment #8484464 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Flags: needinfo?(adam)
Depends on: 1066609
Clearing my NI.
This bug addresses a soft start solution for Fx34 launch where the feature is exposed on the customization palette.
Another bug will be created to deliver a similar soft start solution for Fx35 where the feature will be exposed on the toolbar.
Flags: needinfo?(rtestard)
:rpappa - could you attempt to verify this via Comment 31 . It didn't seem to work for me in today's nightly.  I'm adding the following prefs myself as I don't see them in my about:config

loop.soft_start_ticket_number
loop.soft_start_hostname
Flags: needinfo?(rpappalardo)
:edwong - looks like it's much more involved see Comments 36-40.
Flags: needinfo?(rpappalardo)
Blocks: 1073215
Blocks: 1073218
(In reply to Lawrence Mandel [:lmandel] from comment #47)
> Adam - Is the plan to remove the soft start mechanism sometime in the future
> after we have successfully launched to the entire user base? If so, can you
> file a bug on that now so that don't forget?

Sure -- thanks for the reminder. See Bug 1073215 and (more directly relevant to your comment) Bug 1073218.
Flags: needinfo?(adam)
I tried this todays Aurora 34.0a2 (2014-10-09)

(In reply to Adam Roach [:abr] from comment #31)
> 1) Set "loop.loop.soft_start_hostname" to point to a DNS A record that you
> can easily update, with a very, very small TTL (like, 10 seconds or so).
> Start with the DNS record set to 127.0.0.0.

Is this really loop.loop.soft_start_hostname or was that a typo and it should be loop.soft_start_hostname like Edwin used above?
 
> 2) With "loop.enabled" and "loop.throttled" set to true, reset
> "loop.soft_start_ticket_number" to -1. Restart the browser. 

Just to make sure: I can't reset loop.soft_start_ticket_number as the pref does not exist. So added this a new variable with the value -1.
 
> 3) Verify that "loop.soft_start_ticket_number" is greater than 0, and less
> than 16777215.
>
> 4) Verify that "loop.throttled" is still "true" and that the Loop button is
> not visible.
>
> 5) Change "loop.soft_start_ticket_number" to 8388607
> 
> 6) Set the DNS A record to 127.127.255.255
> 
> 7) Restart the browser. Verify that "loop.soft_start_ticket_number" is still
> 8388607. Verify that "loop.throttled" is still "true" and that the Loop
> button is not visible.

Steps 3-7 work as expected.
 
> 8) Set the DNS A record to 127.128.0.0.
> 
> 9) Restart the browser. Verify that "loop.soft_start_ticket_number" is still
> 8388607. Verify that "loop.throttled" is now FALSE, and that the Loop button
> is VISIBLE.

This last step fails. Neither doe the loop.throttled pref gets changed nor does Loop button become available.

Adam, any suggestions on how to investigate this any further? Could you maybe briefly explain the magic numbers in the ticket number (I assume that is relevant) and does that relate to the IP address at all?

FYI I use loop.rfc3261.com and verified via dig that the new value is properly cached on my local system.
Flags: needinfo?(adam)
(In reply to Nils Ohlmeier [:drno] from comment #53)
> Could you
> maybe briefly explain the magic numbers in the ticket number (I assume that
> is relevant) and does that relate to the IP address at all?

Ah, sorry. I should have included a pointer to the explanation of how this all works:

https://wiki.mozilla.org/Loop/Load_Handling
Flags: needinfo?(adam)
> FYI I use loop.rfc3261.com and verified via dig that the new value is properly cached on my local system.

Ah, I forgot to mention -- dig bypasses the DNS cache. What ttl are you using on your DNS record?
Depends on: 1081192
(In reply to Adam Roach [:abr] from comment #55)
> Ah, I forgot to mention -- dig bypasses the DNS cache.

What other tool do you suggest which respects the DNS cache?

> What ttl are you using on your DNS record?

Currently it is 1 minute. And I waited for the TTL count on the dig reply to drop to/below zero.

BTW I just tried starting Aurora again (after the night caches should be invalid). Still no Loop button.
(In reply to Nils Ohlmeier [:drno] from comment #53)
> I tried this todays Aurora 34.0a2 (2014-10-09)
> 
> (In reply to Adam Roach [:abr] from comment #31)
> > 1) Set "loop.loop.soft_start_hostname" to point to a DNS A record that you
> > can easily update, with a very, very small TTL (like, 10 seconds or so).
> > Start with the DNS record set to 127.0.0.0.
> 
> Is this really loop.loop.soft_start_hostname or was that a typo and it
> should be loop.soft_start_hostname like Edwin used above?

After looking at the code it turns out this is really suppose to be just loop.soft_start_hostname. And then everything works as designed.

Verified with 34.0a2 (2014-10-10) on Mac OSX all 9 steps from comment #31.
Verified that put of range values 126.255.255.255 and 128.0.0.0 and invalid host names get ignored and Hello icons does not show up.
Flags: qe-verify- → qe-verify+
Thanks Nils. Could you please also test this with Firefox 35?
Flags: needinfo?(drno)
QA Contact: anthony.s.hughes → drno
According to a IRC discussion with Adam http://logs.glob.uno/?c=mozilla%23media&s=9+Oct+2014&e=9+Oct+2014
this feature is going to change in 35 once bug 1073215. Bug 1073215 recommends removing the code from this bug and replace it with other code. So I think we should rather test 35 once bug 1073215 has landed.
Flags: needinfo?(drno)
(In reply to Nils Ohlmeier [:drno] from comment #59)
> According to a IRC discussion with Adam
> http://logs.glob.uno/?c=mozilla%23media&s=9+Oct+2014&e=9+Oct+2014
> this feature is going to change in 35 once bug 1073215. Bug 1073215
> recommends removing the code from this bug and replace it with other code.
> So I think we should rather test 35 once bug 1073215 has landed.

Nils is correct. This behavior isn't landed on 35, and the behavior that lands on 35 will be substantially different.
I untracking this for further verification based on comment 60. Any further testing can take place in bug 1073215.
Flags: qe-verify+ → qe-verify-
You need to log in before you can comment on or make changes to this bug.