Closed Bug 1635865 · Opened 4 years ago · Closed 4 years ago

Sentry: ConcurrentModificationException

Categories

(Data Platform and Tools :: Glean: SDK, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mdroettboom, Assigned: mdroettboom)

Details

Attachments

(3 files)

java.util.ConcurrentModificationException: null
    at java.util.LinkedHashMap$LinkedHashIterator.nextNode(LinkedHashMap.java:757)
    at java.util.LinkedHashMap$LinkedKeyIterator.next(LinkedHashMap.java:780)
    at mozilla.telemetry.glean.GleanInternalAPI$initialize$1.invokeSuspend(Glean.kt:16)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:2)
    at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:19)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:457)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:301)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1162)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:636)
    at java.lang.Thread.run(Thread.java:764)

https://sentry.prod.mozaws.net/operations/fenix/issues/7519801/

There is at least one theoretical cause of this that I can see:

The known ping types in PingTypeQueue are iterated over here, and then the ping is added again (which should be functionally a no-op) here.
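To make the failure mode concrete: `LinkedHashMap`/`LinkedHashSet` iterators are fail-fast, so any *structural* modification during iteration throws `ConcurrentModificationException`, while re-adding an element that is already present is not a structural modification and is harmless. A minimal sketch of both cases, in plain Java since the exception comes from `java.util` (the class and method names here are illustrative, not Glean code):

```java
import java.util.ConcurrentModificationException;
import java.util.LinkedHashSet;
import java.util.Set;

public class CmeDemo {
    // Returns true if mutating the set mid-iteration threw CME.
    static boolean addDuringIteration(String element) {
        Set<String> pingTypes = new LinkedHashSet<>();
        pingTypes.add("baseline");
        pingTypes.add("metrics");
        try {
            for (String name : pingTypes) {
                // Adding an element the set already contains does not
                // change modCount, so it is safe; adding a *new* element
                // is a structural modification and the iterator throws
                // on its next step.
                pingTypes.add(element);
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("re-add existing: " + addDuringIteration("baseline"));
        System.out.println("add brand new:   " + addDuringIteration("events"));
    }
}
```

This is consistent with the comment above: the "added again, which should be functionally a no-op" path cannot trip the exception on its own, so the crash plausibly requires a *new* ping type being registered (from another thread) while initialization is iterating.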

Assignee: nobody → mdroettboom
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED

Hi Michael, I was going through the crashes on Nightly, and I'm seeing this signature continue to show up, even in build IDs from 5/13.

https://crash-stats.mozilla.org/report/index/1aae6d08-110e-4ad1-b0e4-2c86c0200515

(See "More Reports" and sort by build id for other reports.)

For our 5.0 release, this was a top 3 crasher from Google Play Store (although that was before your fix landed). I don't have as good visibility into Nightly, so I just wanted to raise it in case those reports are helpful.

Feel free to close again if there are other reasons this could be showing up! We'll also know more when we release 5.1 next week.

Status: RESOLVED → REOPENED
Flags: needinfo?(mdroettboom)
Resolution: FIXED → ---

Thanks @liuche. This has been tricky, because I still haven't been able to reproduce locally -- there's some combination of things folks are hitting in the wild here. I will take a renewed look -- I agree that this is high priority.

Flags: needinfo?(mdroettboom)
Priority: P3 → P1
Whiteboard: [telemetry:glean-rs:m?]

This Sentry link for Nightly shows the problem persisting with Glean 29.1.0 (which contains the first "fix"): https://sentry.prod.mozaws.net/operations/fenix-nightly/issues/7960114

Update: I was able to write a unit test that registers pings while Glean is being initialized, but this still doesn't lead to a ConcurrentModificationException:

    @Test
    fun `Initializing while registering pings isn't a race condition`() {
        Glean.testDestroyGleanHandle()

        Dispatchers.API.setTaskQueueing(true)
        Dispatchers.API.setTestingMode(false)

        val stringMetric = StringMetricType(
                disabled = false,
                category = "telemetry",
                lifetime = Lifetime.Application,
                name = "string_metric",
                sendInPings = listOf("store1")
        )
        stringMetric.set("foo")

        for (i in 1..1000) {
            val ping = PingType<NoReasonCodes>(
                    name = "race-condition-ping${i}",
                    includeClientId = true,
                    sendIfEmpty = false,
                    reasonCodes = listOf()
            )
            Glean.registerPingType(ping)
        }
        val config = Configuration()
        Glean.initialize(context, true, config)

        val ping = PingType<NoReasonCodes>(
                name = "race-condition-ping",
                includeClientId = true,
                sendIfEmpty = false,
                reasonCodes = listOf()
        )
        while (Dispatchers.API.taskQueue.size > 0) {
            Glean.registerPingType(ping)
        }
    }
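My reading of why this main-thread version never trips the exception (hedged -- I haven't verified this against the Glean `Dispatchers` internals): if both the queued initialization work and the `registerPingType` calls are serialized through the same single-threaded task queue, the iteration and the mutation can never overlap in time. A JVM sketch of that serialization property, in Java (hypothetical names, not Glean code):

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SerializedAccess {
    // Push one "iterate" task and many "add" tasks through the same
    // single-threaded executor; tasks run strictly one after another,
    // so the iteration can never observe a concurrent add.
    static int run() throws InterruptedException {
        Set<String> pingTypes = new LinkedHashSet<>();
        ExecutorService queue = Executors.newSingleThreadExecutor();

        queue.execute(() -> pingTypes.add("baseline"));
        queue.execute(() -> {
            for (String name : pingTypes) {
                // Would throw ConcurrentModificationException if an
                // add() could interleave with this loop -- here it can't.
            }
        });
        for (int i = 0; i < 1000; i++) {
            final int n = i;
            queue.execute(() -> pingTypes.add("ping" + n));
        }

        queue.shutdown();
        queue.awaitTermination(10, TimeUnit.SECONDS);
        return pingTypes.size();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("final size: " + run());
    }
}
```

Under that assumption, reproducing the crash requires the mutation to come from a thread *outside* the serialized queue -- which is exactly what the next test does.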

This unit test (very similar to the one above, but registering pings from another coroutine thread rather than the main thread) reproduces the bug more than 50% of the time. Woohoo. Progress... Now to experiment with various mitigations.

    @Test
    fun `Initializing while registering pings isn't a race condition`() {
        Glean.testDestroyGleanHandle()

        Dispatchers.API.setTaskQueueing(true)
        Dispatchers.API.setTestingMode(false)

        val stringMetric = StringMetricType(
                disabled = false,
                category = "telemetry",
                lifetime = Lifetime.Application,
                name = "string_metric",
                sendInPings = listOf("store1")
        )
        stringMetric.set("foo")

        for (i in 1..1000) {
            val ping = PingType<NoReasonCodes>(
                    name = "race-condition-ping${i}",
                    includeClientId = true,
                    sendIfEmpty = false,
                    reasonCodes = listOf()
            )
            Glean.registerPingType(ping)
        }
        val config = Configuration()
        Glean.initialize(context, true, config)

        GlobalScope.launch {
            val ping = PingType<NoReasonCodes>(
                    name = "race-condition-ping", 
                    includeClientId = true,
                    sendIfEmpty = false,
                    reasonCodes = listOf()
            )
            while (Dispatchers.API.taskQueue.size > 0) {
                Glean.registerPingType(ping)
            }
        }
    }
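For reference, one common mitigation for this class of race is to back the registry with a collection whose iterators walk an immutable snapshot, so a concurrent registration can never invalidate an in-progress iteration. A sketch in Java -- the `SafeRegistry` name and methods are hypothetical, and this is not necessarily the exact fix that landed in Glean 29.1.1:

```java
import java.util.Set;
import java.util.concurrent.CopyOnWriteArraySet;

public class SafeRegistry {
    // CopyOnWriteArraySet iterators operate on an immutable snapshot of
    // the backing array, so add() from another thread cannot throw
    // ConcurrentModificationException mid-iteration. Writes copy the
    // array, which is cheap for a small, rarely-mutated set of ping types.
    private final Set<String> pingTypes = new CopyOnWriteArraySet<>();

    public void register(String name) {
        pingTypes.add(name);
    }

    // Stand-in for the initialize-time loop that hands each known ping
    // type to the core; returns how many this iteration saw.
    public int snapshotCount() {
        int count = 0;
        for (String name : pingTypes) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        SafeRegistry registry = new SafeRegistry();
        registry.register("baseline");
        registry.register("baseline"); // duplicate: set semantics keep one
        registry.register("metrics");
        System.out.println("known ping types: " + registry.snapshotCount());
    }
}
```

An alternative with the same effect is to copy the set under a lock before iterating; either way the invariant is that the collection being iterated is never the one being mutated.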

Not seeing any reports from Fenix Nightly with Glean 29.1.1 (with the second fix), so closing this. Feel free to reopen if we see them again.

Status: REOPENED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED