Closed Bug 1404332 Opened 7 years ago Closed 7 years ago

[Shield] Pioneer Enrollment - failure code: 0x80004005 on show pop-up notification

Categories

(Shield :: Shield Study, defect)

defect
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: aflorinescu, Assigned: osmose)

References

Details

[Prerequisites:]
Ensure the following preferences are set :
        extensions.pioneer-enrollment-study.updateTimerInterval = 1
        extensions.pioneer-enrollment-study.firstPromptDelay = 1
        extensions.pioneer-enrollment-study.secondPromptDelay = 180000
        extensions.pioneer-enrollment-study.studyEndDelay = 180000
        extensions.pioneer-enrollment-study.studyEnrolledEndDelay = 180000
        extensions.pioneer-enrollment-study.enrollmentState = ‘’ (empty string)

[Steps:]
1. Open FF and set the prerequisites.
2. Open browser console.
3. Install add-on https://net-mozaws-prod-addons-signxpi-output.s3.amazonaws.com/pioneer-enrollment-study-2.0.1.xpi.


[Actual Result:]
When the pop-up notification is displayed, the browser console states:

[Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIURI.hostPort]"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: resource://gre/modules/PopupNotifications.jsm :: PopupNotifications_refreshPanel/< :: line 816"  data: no]

[Note1:] There is no visible failure in the expected workflow.
[Note2:] There are two additional errors in the console, but not sure if there are also pioneer related:
       a. Object { error: "connectionClosed", message: "'getTab' active request packet to '…" }
        b. NotFoundError: Node was not found  notification.xml:219 in view-source:chrome://global/content/bindings/notification.xml
After talking with Gijs, the PopupNotifications error does not appear to be related to Pioneer (PopunderNotifications has to do with doorhanger notifications, which we don't use at all). I also cannot replicate on MacOS or Windows 10.

The getTab error also looks unrelated, and I cannot replicate. However, the "Node was not found" error does look like it's our fault, and I've filed https://github.com/mozilla/pioneer-enrollment-study/pull/1 to deal with it.
Assignee: nobody → mkelly
Commits pushed to master at https://github.com/mozilla/pioneer-enrollment-study

https://github.com/mozilla/pioneer-enrollment-study/commit/6472143c35fc5d515357f084b7c22c43d46ba984
Bug 1404332: Do not throw if failing to clean up notifications.

https://github.com/mozilla/pioneer-enrollment-study/commit/09a3934341dae127c166cf3024d476524039c5cc
Merge pull request #1 from Osmose/node-not-found

Bug 1404332: Do not throw if failing to clean up notifications.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Comment 2 fix resolves the node issue. I cannot spot the node error anymore. Mike, any idea what the other two console errors might be? (resolve this and log separate issues for those?)

Also related to browser console, I'm not sure if this is actually related to the Pioneer study, but I do get a bunch of telemetry errors - I wonder if these are expected: https://pastebin.mozilla.org/9069025
Status: RESOLVED → VERIFIED
Flags: needinfo?(mkelly)
(In reply to Adrian Florinescu [:AdrianSV] from comment #3)
> Comment 2 fix resolves the node issue. I cannot spot the node error anymore.
> Mike, any idea what the other two console errors might be? (resolve this and
> log separate issues for those?)

Yeah, I think logging them under Firefox::General with STR is our best bet. The popup notification one in particular could probably use a look from someone more familiar with that code.
 
> Also related to browser console, I'm not sure if this is actually related to
> the Pioneer study, but I do get a bunch of telemetry errors - I wonder if
> these are expected: https://pastebin.mozilla.org/9069025

gfritzsche: chutten: We're getting the errors linked above when sending the persisted custom ping data from a Shield study, but I can't trace them to anything in particular that we're doing. What do you think about that error log?
Flags: needinfo?(mkelly) → needinfo?(chutten)
`TelemetryStopwatch: key "TELEMETRY_PENDING_LOAD_MS" was already initialized` means that TelemetryStopwatch.start() was called twice without an interleaving TelemetryStopwatch.stop or .cancel for the name "TELEMETRY_PENDING_LOAD_MS

This isn't a good sign as this is a TELEMETRY_* probe which measures internal work. If the internals are misbehaving, something is likely wrong.

`TelemetrySend::sendPersistedPings - failed to send ping b684490f-806e-4b7f-a249-003aac449b99: SyntaxError: The URI is malformed.` as you can imagine this isn't a particularly nice error to receive either.

This means the URI we're trying to send pings to is malformed.

When is this happening? During tests?
Flags: needinfo?(chutten)
(In reply to Chris H-C :chutten from comment #5)
> `TelemetryStopwatch: key "TELEMETRY_PENDING_LOAD_MS" was already
> initialized` means that TelemetryStopwatch.start() was called twice without
> an interleaving TelemetryStopwatch.stop or .cancel for the name
> "TELEMETRY_PENDING_LOAD_MS
> 
> This isn't a good sign as this is a TELEMETRY_* probe which measures
> internal work. If the internals are misbehaving, something is likely wrong.
> 
> `TelemetrySend::sendPersistedPings - failed to send ping
> b684490f-806e-4b7f-a249-003aac449b99: SyntaxError: The URI is malformed.` as
> you can imagine this isn't a particularly nice error to receive either.
> 
> This means the URI we're trying to send pings to is malformed.
> 
> When is this happening? During tests?

Yes, but as far as I can imagine, the tests are as relevant as the real thing, since we are using production environment. (production normandy with Shield opt-out study which delivers the add-on from  https://bugzilla.mozilla.org/show_bug.cgi?id=1382761#c10).

I could try to see if I get the same telemetry result having the add-on installed manually (I don't remember seeing these errors during the first tests w/o the Shield Study).
Flags: needinfo?(adrian.florinescu)
(In reply to Chris H-C :chutten from comment #5)
> When is this happening? During tests?

I'm able to replicate this by following the steps from bug 1382761 comment 11 on a brand new profile in Firefox 56 on release. Those steps install a Shield study add-on, and a few minutes after the study is installed and shows its prompts, the errors show up in the console, presumably when Telemetry gets around to sending the custom pings that the study generates.

The add-on sends custom telemetry pings here: https://github.com/mozilla/pioneer-enrollment-study/blob/master/extension/StudyUtils.jsm#L901

Are there any particular usage patterns of submitExternalPing that might cause this? I can provide more context about how we send these pings if there's specific things to look out for.
tests have sometimes been known to not clean up after themselves, which is why I've gotten used to ignoring TelemetryStopwatch warnings in tests.

To debug this we'd benefit from trace logging (Services.prefs.setCharPref("toolkit.telemetry.log.level", "Trace");)

That ought to give us the information we'd need to get this sorted. I'm especially interested in the server of the _doPing call. The paths ought to be straightforward, so it's only the server portion that remains unknown.
Here's a gist of my logs with tracing enabled: https://gist.github.com/Osmose/20f955fdbb26cdd12641db4ab5b9e9c6

Note that I'm only able to replicate the stopwatch errors, and not the invalid URL errors. I'll leave it to Adrian to provide a trace log for that.
I can see the malformed URI lines for Ubuntu and Mac alike. Adding the browser console with Trace enabled:
https://pastebin.mozilla.org/9069143

The steps to reproduce are:
1. Open FF with a new profile.
2. Set extensions.shield-recipe-client.user_id pref to the string "pioneer-quick-timer-2".
3. Set toolkit.telemetry.log.level - Trace
4. Set "extensions.shield-recipe-client.dev_mode" pref to true
5. Restart FF.
6. Open Browser console.


The Pioneer notification is shown but not interacted with. Might this be the part we are missing?
Flags: needinfo?(adrian.florinescu)
Well, the "URL is malformed" part is easy to explain:

1507013861888   Toolkit.Telemetry  TRACE     TelemetrySend::_doPing - server: undefined, persisted: true, id: 89e1cc3f-a761-4343-bf90-ad37ac87cda7

We're trying to send a ping before TelemetrySend::setup (which is when the server host is grabbed from the prefs).

Now, this should just be a temporary failure. We'll store that ping and resend in a scheduled interval. In fact, it should (and is) caught much earlier than _doPing:

1507013857014   Toolkit.Telemetry  TRACE     TelemetrySend::submitPing - can't send ping now, persisting to disk - canSendNow: false

However, accepting the data submission policy seems to kick off triggerSendingPings (via _userNotified) which doesn't bother to check if it can actually send before trying.

My guess is that this triggers an unlikely error flow that breaks our net of async functions allowing crazy things like re-entrant loadPendingPings:

1507013861855   Toolkit.Telemetry  TRACE     TelemetryStorage::loadPendingPing - id: 89e1cc3f-a761-4343-bf90-ad37ac87cda7
1507013861856   Toolkit.Telemetry  TRACE     TelemetryStorage::loadPendingPing - id: 2d8c1919-b9bb-408e-a058-8c0513108225

Which in turn causes the TelemetryStopwatch traces.

tl;dr - It seems as though Telemetry doesn't handle (and doesn't test) the user accepting the data submission policy before TelemetrySend::setup, which causes a lot of noise in the logs. I'm not sure if this eventually results in the error as originally reported, so I'll file a bug specific to the Telemetry parts of this.
Depends on: 1405299
You need to log in before you can comment on or make changes to this bug.