Closed Bug 1625157 Opened 5 years ago Closed 4 years ago

Glean runs as a separate instance in iOS application extensions

Categories

(Data Platform and Tools :: Glean: SDK, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: travis_, Assigned: travis_)

References

Details

(Whiteboard: [telemetry:glean-rs:backlog])

Recently it was noticed that we were receiving schema-rejection errors for Lockwise-iOS under the application id/document namespace of: org-mozilla-ios-Lockbox-CredentialProvider. The expected application ID is org-mozilla-ios-Lockbox which leads me to believe that we are seeing AppInfo.name report a different ID where we use it to build the configuration to pass to the FFI layer when initializing Glean, but only when it's used/invoked to provide credentials, and not when the app is launched by the user.

Whiteboard: [telemetry:glean-rs:m?] → [telemetry:glean-rs:backlog]

Based on a conversation with Kayla, it looks like using AppInfo.baseBundleIdentifier rather than AppInfo.name might help guarantee that we always get the org-mozilla-ios-Lockbox identifier, even if the CredentialProvider is the bit that's running. She also helped out in giving some information on how to launch the CredentialProvider from Xcode.

From Slack:

Kayla 12:01 PM
You build lockwise with xcode, log in with a test account or something that has a login for a website. Then you set your autofill settings in your
device or simulator to autofill with Lockwise. Then you build the CredentialProvider extension and it asks you like what app you want to build it
with and you click safari. (edited)
12:01
Then you go to the website you have the autofill password for and it will launch credential provider to fill it
12:02
I know it's a few steps, but that is how you launch and test the CredentialProvider extension and see what it is doing

The iOS team wasn't sure this would work, but it's easy enough to test. I think it's also important to ensure that AppInfo.baseBundleIdentifier returns the same thing as AppInfo.name so as to not cause issues with the telemetry we've already collected under that identifier.

(In reply to Travis Long [:travis_] from comment #1)

I think it's also important to ensure that AppInfo.baseBundleIdentifier returns the same thing as AppInfo.name so as to not cause issues with the telemetry we've already collected under that identifier.

Looks like AppInfo.baseBundleIdentifier is what was used on our iOS apps so far. Can you provide the official documentation entries for both these properties?

It sounds like using AppInfo.name is wrong. We should probably make this bug a P1 rather than a backlog entry.

Flags: needinfo?(tlong)

Currently we are using bundleIdentifier in Glean in the Utils.swift file for AppInfo.name

After taking a look at the method used to get the "base identifier" in the searchfox link you posted, this should be returning the same bundle identifier as the way we are retrieving it and this code is manually chopping off the extension id part if it is present.

So, it looks like we would have to allow the app to set the app-id in Glean, otherwise, I don't see a way to intelligently detect and remove an extension id like the CredentialProvider bit in the case we are looking at for this bug.

Flags: needinfo?(tlong)

(In reply to Travis Long [:travis_] from comment #3)

So, it looks like we would have to allow the app to set the app-id in Glean, otherwise, I don't see a way to intelligently detect and remove an extension id like the CredentialProvider bit in the case we are looking at for this bug.

We should not allow the definition of application ids, this must be automated.

So, this code seems to be chopping off based on the number of components, relying on a fixed format of the bundle identifier. Is this correct? If so, can we use such an approach?

Flags: needinfo?(tlong)

Looking at the linked code: What even is the "XPC!" package type? Why is it an indicator for chopping of the last component based on dots?
Did we establish that we do want these two different pieces of data under the same application ID? (I guess it makes sense because same app = same database, or not?)

(I do agree and prefer an automatic way over allowing to set application IDs)

(In reply to Alessio Placitelli [:Dexter] from comment #4)

(In reply to Travis Long [:travis_] from comment #3)

So, it looks like we would have to allow the app to set the app-id in Glean, otherwise, I don't see a way to intelligently detect and remove an extension id like the CredentialProvider bit in the case we are looking at for this bug.

We should not allow the definition of application ids, this must be automated.

So, this code seems to be chopping off based on the number of components, relying on a fixed format of the bundle identifier. Is this correct? If so, can we use such an approach?

This particular function appears to be very app specific and I don't think we could do this in a generic way that would work for all apps, I'm afraid. We would have a difficult time detecting this and when to do it. Not every app has extensions like this that would require this, and there are even different kinds of app extensions. The names of these unknown app-id's could be unconventional making this more difficult to detect in an automated way. I think that while this is possible, it's definitely not feasible.

After speaking with Kayla and Elise on the Lockwise team, they seemed to agree that having the data in a separate table wasn't terrible and could be interesting to see what users were doing in Autofill vs. the app, but they admitted that such a decision would be best made by Data-Science/Product.

So I think the best course of action right now would be to start a proposal and involve DS on how to handle this situation in a way that allows us to best analyze the data, before we try to implement some fancy detection code for when the app-id gets things appended to it.

Flags: needinfo?(tlong)

(In reply to Travis Long [:travis_] from comment #6)

After speaking with Kayla and Elise on the Lockwise team, they seemed to agree that having the data in a separate table wasn't terrible and could be interesting to see what users were doing in Autofill vs. the app, but they admitted that such a decision would be best made by Data-Science/Product.

I think there are two problems here:

(1) - How to deal with the lockwise case, if the best solution is to have two ids. And that's something that DS/Product should chime in.
(2) - Figure out if there's a technical solution to get the id of the application as it is advertised on the app store.

So I think the best course of action right now would be to start a proposal and involve DS on how to handle this situation in a way that allows us to best analyze the data, before we try to implement some fancy detection code for when the app-id gets things appended to it.

I think it is important for us to have a clear understanding about (2) before moving forward with (1).

So, this code seems to be chopping off based on the number of components, relying on a fixed format of the bundle identifier. Is this correct? If so, can we use such an approach?

This particular function appears to be very app specific and I don't think we could do this in a generic way that would work for all apps, I'm afraid. We would have a difficult time detecting this and when to do it. Not every app has extensions like this that would require this, and there are even different kinds of app extensions. The names of these unknown app-id's could be unconventional making this more difficult to detect in an automated way. I think that while this is possible, it's definitely not feasible.

I think that, at this point, we don't have all the elements we need to decide on this. At the very minimum, I think we need to understand the following:

  • What's the XPC! package type?
  • What's the usual format of the identifiers on apple platforms?
  • Is there a specific format/standard for defining ids for the services/companion components? (e.g. mozilla.ios.Lockbox.CredentialProvider)
Flags: needinfo?(tlong)

I think that, at this point, we don't have all the elements we need to decide on this. At the very minimum, I think we need to understand the following:

  • What's the XPC! package type?

This appears to be a package type that defines lightweight services in the Apple Ecosystem. Based on this, it might be possible to detect at least these sorts of service applications that fall into the 'package' categories, but, just looking at the documentation, even this seems like it has a lot of opportunity to go wrong

  • What's the usual format of the identifiers on apple platforms?

They typically take the form of reverse URLs, such as org.mozilla.ios.Lockbox or com.apple.camera, but there is some flexibility to this as while they are typically named in camel case, dashes and underscores are still allowed.

  • Is there a specific format/standard for defining ids for the services/companion components? (e.g. mozilla.ios.Lockbox.CredentialProvider)

The only standard or method I could find through a quick search is that they take the "base" identifier and tack on the class name of the service or extension, so it would be dependent on the actual class name.

Flags: needinfo?(tlong)

(In reply to Travis Long [:travis_] from comment #8)

I think that, at this point, we don't have all the elements we need to decide on this. At the very minimum, I think we need to understand the following:

  • What's the XPC! package type?

This appears to be a package type that defines lightweight services in the Apple Ecosystem. Based on this, it might be possible to detect at least these sorts of service applications that fall into the 'package' categories, but, just looking at the documentation, even this seems like it has a lot of opportunity to go wrong

We can at least tell if we're in a service, that's a good start. Is there any way from the service to access the parent's bundle id?

Flags: needinfo?(tlong)

We can at least tell if we're in a service, that's a good start. Is there any way from the service to access the parent's bundle id?

While you cannot access it through any Apple API directly, we can still do the same thing the function you linked from searchfox does by stripping off the last bit separated by the .'s.

Flags: needinfo?(tlong)

(In reply to Travis Long [:travis_] from comment #10)

We can at least tell if we're in a service, that's a good start. Is there any way from the service to access the parent's bundle id?

While you cannot access it through any Apple API directly, we can still do the same thing the function you linked from searchfox does by stripping off the last bit separated by the .'s.

Can services define more than one component ? e.g. mozilla.ios.lockbox.credential.provider?

Flags: needinfo?(tlong)

(In reply to Alessio Placitelli [:Dexter] from comment #11)

Can services define more than one component ? e.g. mozilla.ios.lockbox.credential.provider?

I'm not sure if a sub-component like a service could define something another level like this or not. My instincts tell me that this is probably possible, but I will do my due diligence and research this possibility.

Flags: needinfo?(tlong)
Assignee: nobody → tlong
Priority: P3 → P1

(In reply to Travis Long [:travis_] from comment #12)

(In reply to Alessio Placitelli [:Dexter] from comment #11)

Can services define more than one component ? e.g. mozilla.ios.lockbox.credential.provider?

I'm not sure if a sub-component like a service could define something another level like this or not. My instincts tell me that this is probably possible, but I will do my due diligence and research this possibility.

Super-worst-case: the glean "build system" define some project-wide variable that can be read by the service?

I've dug into this a bit more and I think that the approach taken by Focus (and thus by FF-iOS via their fork) in using the XPC! CFBundlePackageType to determine if this is an extension or not and then chopping off the last bit is probably the best approach if we decide that the extension data needs to be in the same table.

The more I think about this, the more I think it does need to be in the same table, and just be annotated somehow as :Dexter has suggested, as this solves the issue of being able to tell the difference between the telemetry generated by the app vs. the extension.

Now that we have explored this a bit (thanks a bunch Dexter!), let me put this together in a concise way and present it to Data Science and Lockwise to see what they think about the situation.

(In reply to Travis Long [:travis_] from comment #14)

Now that we have explored this a bit (thanks a bunch Dexter!), let me put this together in a concise way and present it to Data Science and Lockwise to see what they think about the situation.

I don't think this is something DS should care about.
This is an internal discussion for an SDK-related change :-)

Once we make this change, the Lockwise team might be interested, but mostly because they will need to decide whether or not they care about knowing if some data came from the service or from the app itself (i.e. they need to add a metric for that).

I feel that this internal discussion still requires some investigation, though:

  • Can the service and the app run in parallel?
  • If the two can run in parallel, do they run using the same data dir?
  • If the above are true, does this result in concurrent writes/reads to the underlying glean dataset?
  • Can the glean instances compete for ping uploads?
Flags: needinfo?(tlong)

(In reply to Alessio Placitelli [:Dexter] from comment #15)

I feel that this internal discussion still requires some investigation, though:

  • Can the service and the app run in parallel?

Yes, but the service would only ever run while the app was in the background, because the service is only invoked for other apps or for the OS, not for the parent Lockwise app. From what I have read and seen, the service is invoked and destroyed by the OS for each interaction, I don't see any evidence of the service continuing to run after Autofill is complete, but I'll do a little more digging just to be sure.

  • If the two can run in parallel, do they run using the same data dir?

I believe so, but this is definitely something to find out for sure. I wonder if I can get access to a few of the discarded CredentialProvider pings...

  • If the above are true, does this result in concurrent writes/reads to the underlying glean dataset?

I think that the data directory is probably the same and I'm fairly certain that we wouldn't have both the app and autofill running at the same time, but I can see where maybe there would be some interesting interactions if they ran either concurrently or right after each other so I'll focus my investigation on whether this is a problem or not.

  • Can the glean instances compete for ping uploads?

I wouldn't think that they could compete for ping uploads, but what I see happening is that the "app" is trying to collect the baseline ping while the service is setting baseline metrics.. a read/write conflict. I doubt that this would be the case, but since the service and the app run in different processes and potentially access the same data directories, it's definitely worth investigating to ensure we aren't colliding.

Flags: needinfo?(tlong)

Here's what I found:

  • The CredentialProvider creates its own Glean object and initializes it.
  • The CredentialProvider uses a different directory for data storage than the base Lockwise application
  • Both the CredentialProvider and the Lockwise app could potentially be running at the same time (but not in the foreground together, Lockwise cannot use the CredentialProvider to log into iteself, which is the only way this could happen).
  • The application lifetime of the CredentialProvider is completely controlled by the OS, it is launched when the OS needs it and it is terminated as the OS sees fit.
  • We still seem to see the lifecycle events that we expect as the OS invokes the service, so it should work as a standalone Glean app/service.

Based on this information, I don't think there is going to be an easy way to make the embedded service use the same application ID as the base Lockwise application because they are essentially separate things. In light of this, I think that the best way forward would be to add the Lockwise CredentialProvider application id to the pipeline in order to capture this usage. I'm not certain how bad this is to implement on the analysis side of things, but it would put some burden on it to join the data from the tables to get a full usage picture for the application.

Thoughts, Dexter?

Flags: needinfo?(alessio.placitelli)

(In reply to Travis Long [:travis_] from comment #17)

Here's what I found:

  • The CredentialProvider creates its own Glean object and initializes it.
  • The CredentialProvider uses a different directory for data storage than the base Lockwise application

Ouch. This means that if some user uses both the app and CredentialProvider then they'll have two instanfces of Glean with different client ids, meaning that analysis will end up potentially overcounting users.

  • Any chance you can verify that?
  • Can you check how legacy telemetry in Lockbox behaves in this case? (i.e. do we get the same client ids?)

Given that we ruled out potential data corruption, this is my biggest concern. Before going to product/DS, I'd recommend giving one last look at the above points to paint a complete picture of the problem. This is the reason wny Glean doesn't want to know about processes :-)

An alternative solution, for example, could be to do something similar to what lib-crash does on Android to make sure that autofill usage is attributed to the right clients.

Flags: needinfo?(alessio.placitelli)

(In reply to Alessio Placitelli [:Dexter] from comment #18)

Ouch. This means that if some user uses both the app and CredentialProvider then they'll have two instanfces of Glean with different client ids, meaning that analysis will end up potentially overcounting users.

  • Any chance you can verify that?

I didn't consider this when I was working on it yesterday, but yes, this looks to telemetry like two completely separate applications. Each has it's own directory, client-id, everything.

  • Can you check how legacy telemetry in Lockbox behaves in this case? (i.e. do we get the same client ids?)

Legacy telemetry allows setting the application ID, this is why the "chopping off" of the CredentialProvider bit works.

An alternative solution, for example, could be to do something similar to what lib-crash does on Android to make sure that autofill usage is attributed to the right clients.

Ewwww, that would also be icky as we'd have to find a good way to share data between processes. Even sharing a file like lib-crash does could potentially be problematic due to sandboxing, and I'm not sure we could even do it within Glean, it would have to be a Lockwise/app specific sort of change. The other challenge is because the telemetry code is shared between the app and the extension so I'd have to come up with some way to detect that and prevent the extension from initializing Glean.

Let me think on this a bit more to see if there is any way possible to treat all of this like a single Glean app...

(In reply to Travis Long [:travis_] from comment #19)

  • Can you check how legacy telemetry in Lockbox behaves in this case? (i.e. do we get the same client ids?)

Legacy telemetry allows setting the application ID, this is why the "chopping off" of the CredentialProvider bit works.

Sorry if I'm being pedantic :-) Does that mean that "core" pings from CredentialProvider and the Lockwise app report the same client_id? How would the application id affect this?

Flags: needinfo?(tlong)

(In reply to Alessio Placitelli [:Dexter] from comment #20)

(In reply to Travis Long [:travis_] from comment #19)

  • Can you check how legacy telemetry in Lockbox behaves in this case? (i.e. do we get the same client ids?)

Legacy telemetry allows setting the application ID, this is why the "chopping off" of the CredentialProvider bit works.

Sorry if I'm being pedantic :-) Does that mean that "core" pings from CredentialProvider and the Lockwise app report the same client_id? How would the application id affect this?

They use UserDefaults to store the client-id, which is a key-value store (similar to what our original Glean-AC used) that appears to be shared and accessible to both the app and extension. By allowing setting of the application ID explicitly when telemetry starts, they don't have to care running the app or the extension, they pull the client-id from shared key-value storage and set the application ID using their detection of the app extension and chopping off the CredentialProvider if needed, which is then used for the telemetry URL path, ensuring that all telemetry is going to the same endpoint with the same client ID.

For Glean to be able to accomplish this, we would have to trick the extension into using the same directory as the base application (I'm not entirely certain if the extension can access the directory where the app stores Glean data due to sandboxing). That would at least mean that both were storing in the same place and using the same client-id. Unfortunately, since both could run in different processes this would mean that we would have to know when it was the extension and record data to a file so that the app could read it and process it like the hoop we jump through for lib-crash. The problem here is how do we know when it's the extension? Do we force the consuming application to worry about this? Do we try and detect it internally in Glean and always check to see if that XPC! flag is set like Lockwise/Focus does to do the right thing, and if so are we sure we are handling it the way that the application requires?

Flags: needinfo?(tlong)
Flags: needinfo?(alessio.placitelli)

(In reply to Travis Long [:travis_] from comment #21)

(In reply to Alessio Placitelli [:Dexter] from comment #20)

(In reply to Travis Long [:travis_] from comment #19)

  • Can you check how legacy telemetry in Lockbox behaves in this case? (i.e. do we get the same client ids?)

Legacy telemetry allows setting the application ID, this is why the "chopping off" of the CredentialProvider bit works.

Sorry if I'm being pedantic :-) Does that mean that "core" pings from CredentialProvider and the Lockwise app report the same client_id? How would the application id affect this?

[...] they pull the client-id from shared key-value storage [...]

Ah ha! That's the trick!

For Glean to be able to accomplish this, we would have to trick the extension into using the same directory as the base application (I'm not entirely certain if the extension can access the directory where the app stores Glean data due to sandboxing). That would at least mean that both were storing in the same place and using the same client-id.

I think this may get us into comment 15 land again :(

Do we try and detect it internally in Glean and always check to see if that XPC! flag is set like Lockwise/Focus does to do the right thing, and if so are we sure we are handling it the way that the application requires?

"For some definition of right thing" :-) I think the SDK should not know/care about processes: this would complicate things quite a lot given how stuff differs across use-cases and platform. The fact that legacy did that, in Lockwise, doesn't mean we should. Given the status of Lockwise maybe this is not a problem we need to solve in the short run.

Maybe, given the outcome of this investigation, the best option right now is to add this id to the pipeline. Mike, thoughts?

Flags: needinfo?(alessio.placitelli) → needinfo?(mdroettboom)

I was just sort of in these parts for Python reasons... So I can probably answer some of these.

(In reply to Alessio Placitelli [:Dexter] from comment #15)

  • Can the service and the app run in parallel?
  • If the two can run in parallel, do they run using the same data dir?
  • If the above are true, does this result in concurrent writes/reads to the underlying glean dataset?

LMDB explicitly handles concurrent reads/writes from multiple processes. And iOS is Unixy enough I'm confident it's correct there.

  • Can the glean instances compete for ping uploads?

For this, there is good discussion over in bug 1628294.

Maybe, given the outcome of this investigation, the best option right now is to add this id to the pipeline. Mike, thoughts?

IIUC correctly, there's two separate things here:

  1. The app is sending different client_ids in the two different contexts. This, to me, seems like a critical bug to fix. If each user/device is sending multiple client_ids, it will be impossible to get a real measure on just about anything.

  2. The app sends to two different doctypes depending on context. This worries me less, but we'd want to document it clearly at a minimum. I still think this feels like the kind of thing where analyses could be incorrect because someone isn't aware of this behavior. I think I'd prefer trying to find a way to merge the two into the same table. If there's no good client solution, maybe there's a data platform solution to that?

Flags: needinfo?(mdroettboom)

(In reply to Michael Droettboom [:mdroettboom] from comment #23)

I was just sort of in these parts for Python reasons... So I can probably answer some of these.

(In reply to Alessio Placitelli [:Dexter] from comment #15)

  • Can the service and the app run in parallel?
  • If the two can run in parallel, do they run using the same data dir?
  • If the above are true, does this result in concurrent writes/reads to the underlying glean dataset?

LMDB explicitly handles concurrent reads/writes from multiple processes. And iOS is Unixy enough I'm confident it's correct there.

  • Can the glean instances compete for ping uploads?

For this, there is good discussion over in bug 1628294.

Maybe, given the outcome of this investigation, the best option right now is to add this id to the pipeline. Mike, thoughts?

IIUC correctly, there's two separate things here:

  1. The app is sending different client_ids in the two different contexts. This, to me, seems like a critical bug to fix. If each user/device is sending multiple client_ids, it will be impossible to get a real measure on just about anything.

In order to fix this, we will have to trick the Glean in the extension to use the directory of the base application's Glean store, and I'm not certain how trivial or even possible that this is due to sandboxing. There may be other ways of dealing with this, but it's not as simple as treating the extension like a library that uses Glean because libraries don't typically run on their own and in their own process. I'm not convinced that this is the easiest way to the solution but I'm starting to get to the point that I need to know what the requirements of analysis are in this case so that I can target solutions that get us there.

  1. The app sends to two different doctypes depending on context. This worries me less, but we'd want to document it clearly at a minimum. I still think this feels like the kind of thing where analyses could be incorrect because someone isn't aware of this behavior. I think I'd prefer trying to find a way to merge the two into the same table. If there's no good client solution, maybe there's a data platform solution to that?

If we can overcome the differing client-id problem, then the document-id problem is easier to fix because we have identified that it's possible to detect when it's the extension running and strip off the additional id piece to get back to the base document-id.

(In reply to Michael Droettboom [:mdroettboom] from comment #23)

Maybe, given the outcome of this investigation, the best option right now is to add this id to the pipeline. Mike, thoughts?

IIUC correctly, there's two separate things here:

  1. The app is sending different client_ids in the two different contexts. This, to me, seems like a critical bug to fix. If each user/device is sending multiple client_ids, it will be impossible to get a real measure on just about anything.

Yes, I concur that this problem needs to be fixed. I'm not sure the fix should be on the SDK side, though: this is an example of platform-specific concept (app extentions?) that walks outside of the process boundary (something that the SDK doesn't want to know about).
In this case I believe the lower-hanging fruit fix is in Lockwise: the extension can report back to the main application that some telemetry needs to be reported.

The fact that makes this real quirky is that the extension app (autofill) has a completely different lifecycle compared to the Lockwise application (and they can even run at the same time...?!).

Given that iOS makes it so easy to use these XPC! thingies, maybe we need a better strategy for handling this on iOS. However, this doesn't help us with the immediate problem of the data having a different client id and being in a different table.

Sounds like lightweight-proposal time? I have a couple of ideas on how to deal with this (some of which need a little more thought to be complete), ranging from documenting this behavior/drawback to more creative solutions. Considering the feedback that this has gathered so far, I think a proposal/doc might be a good place to plan out the strategy for dealing with this on iOS.

Flags: needinfo?(mdroettboom)
Flags: needinfo?(alessio.placitelli)
Flags: needinfo?(alessio.placitelli)
Flags: needinfo?(mdroettboom)

Updating the title to better reflect the direction this investigation has gone.

Summary: Lockwise-iOS appears to report a different application ID when used for Autofill → Glean runs as a separate instance in iOS application extensions
See Also: → 1632218
Blocks: 1632218
See Also: 1632218
Blocks: 1637987
Blocks: 1637990

It appears that the investigation on this has wound down to a resolution and it doesn't look like we can do much on the Glean side of things to make this easier and will have to put the burden on the integrating application to ensure that telemetry is recorded through the base application process.

I've filed a couple of follow-up bugs, blocked by this bug, that will track documenting the problem with using extensions with Glean and will also implement something to prevent Glean from being initialized in the extension processes to prevent Glean from running and sending telemetry from them.

Closing this investigation/proposal bug as FIXED.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.