Last Comment Bug 687367 - Bionic domain name functions are not thread-safe on pre-3.0 Android
: Bionic domain name functions are not thread-safe on pre-3.0 Android
Status: RESOLVED FIXED
[mobile-crash][B2G]
: crash, topcrash
Product: Core
Classification: Components
Component: Networking (show other bugs)
: unspecified
: ARM Android
: -- major (vote)
: mozilla10
Assigned To: Steve Workman [:sworkman] (please use needinfo)
:
Mentors:
Depends on: 693103 693280
Blocks: 689989 694325
  Show dependency treegraph
 
Reported: 2011-09-18 12:34 PDT by Jim Chen [:jchen] [:darchons]
Modified: 2012-01-02 03:36 PST (History)
22 users (show)
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
+
fixed
+
fixed
fixed


Attachments
WIP: Android 2.3 getaddrinfo(), with stdio replaced by mmap(), adjusted for use with --wrap (109.88 KB, patch)
2011-09-27 22:51 PDT, Michael K. Edwards (:medwards)
dougt: review-
Details | Diff | Splinter Review
diff android 2.3 bionic code vs WIP patch (8.54 KB, patch)
2011-09-27 23:07 PDT, Andreas Gal :gal
no flags Details | Diff | Splinter Review
diff android 2.3 bionic code vs WIP patch (8.20 KB, patch)
2011-09-27 23:08 PDT, Andreas Gal :gal
no flags Details | Diff | Splinter Review
Updated patch, with Honeycomb handling (128.43 KB, patch)
2011-09-30 15:28 PDT, Michael K. Edwards (:medwards)
no flags Details | Diff | Splinter Review
Updated patch, addressing review comments (128.11 KB, patch)
2011-10-04 01:25 PDT, Michael K. Edwards (:medwards)
dougt: review-
Details | Diff | Splinter Review
patch (1.12 KB, patch)
2011-10-11 23:36 PDT, Andreas Gal :gal
mcmanus: feedback-
Details | Diff | Splinter Review
no-concurrent-android-dns patch 2 (2.62 KB, patch)
2011-10-12 07:01 PDT, Patrick McManus [:mcmanus]
bzbarsky: review+
akeybl: approval‑mozilla‑aurora+
christian: approval‑mozilla‑beta+
Details | Diff | Splinter Review

Description Jim Chen [:jchen] [:darchons] 2011-09-18 12:34:32 PDT
Offshoot from Bug 662936, which was the same bug occurring on dual-core Tegras running talos.

In bionic (ugh), domain name functions (getaddrinfo, gethostbyname_r, et al) are not thread-safe because stdio is not thread-safe... These functions rely on stdio for reading from the /etc/hosts file.

AFAIK in Gecko, we launch worker threads which call these domain functions, and we crash inside bionic when we run into such a race condition in stdio. This is the #2 top crasher in 7.0 Beta and #1 in 6.0.

I haven't seen single-core devices running into these crashes, but on dual-core devices they are a lot more severe. The ugly fix would be using locks around getaddrinfo calls. Another workaround would be setting the CPU affinity on worker threads; doesn't fix the issue, but should make it a lot less common. Also, might be possible to not use getaddrinfo, but that seems to have implications in IPv6 support (Bug 626866); I don't know enough to say.
Comment 1 Doug Turner (:dougt) 2011-09-18 20:53:28 PDT
lets add locking around the domain resolution and ifdef it for #android.  This should only be done for pre-honeycomb (i think).  jim, want to throw up a patch?
Comment 2 Patrick McManus [:mcmanus] 2011-09-19 07:28:09 PDT
(In reply to Doug Turner (:dougt) from comment #1)
> lets add locking around the domain resolution and ifdef it for #android. 
> This should only be done for pre-honeycomb (i think).  jim, want to throw up
> a patch?

wow. That could have such a huge negative impact. We need parallelism on high latency operations.

Do you have a pointer to the android getaddrinfo() implementation in question? I can't find it because android.git.kernel.org is still offline. Is the io conditional on anything - a lazy init perhaps?

Maybe we can clone that code and remove the stdio (or just lock it?)?

If we really are forced to go down this path I would prefer MAX_RESOLVER_THREADS_FOR_ANY_PRIORITY, and MAX_RESOLVER_THREADS_FOR_HIGH_PRIORITY were changed into prefs that could be set to {1,0} for this case rather than the lock. But I really think that's a big step back for phones as modern as gingerbread!
Comment 3 Doug Turner (:dougt) 2011-09-19 07:33:31 PDT
slow is better than crashy.  You're right, we also could just move the bionic code into nspr (or necko), but I am not sure how much that code fans out.  That would be better than locking for sure.
Comment 4 Jim Chen [:jchen] [:darchons] 2011-09-19 08:32:38 PDT
(In reply to Patrick McManus from comment #2)
> (In reply to Doug Turner (:dougt) from comment #1)
> > lets add locking around the domain resolution and ifdef it for #android. 
> > This should only be done for pre-honeycomb (i think).  jim, want to throw up
> > a patch?
> 
> wow. That could have such a huge negative impact. We need parallelism on
> high latency operations.
> 
> Do you have a pointer to the android getaddrinfo() implementation in
> question? I can't find it because android.git.kernel.org is still offline.
> Is the io conditional on anything - a lazy init perhaps?
> 
Here's a mirror (code is forked from netbsd): https://github.com/android/platform_bionic/blob/master/libc/netbsd/net/getaddrinfo.c

fopen() is called inside _sethtent() and fclose() in _endhtent(). Both are called by _files_getaddrinfo() which is responsible for reading /etc/hosts. I don't see a way to skip that (unless you count the workaround for our testing infra which is to delete /etc/hosts and make fopen() fail; but obviously a dirty hack used internally)

(In reply to Doug Turner (:dougt) from comment #3)
> slow is better than crashy.  You're right, we also could just move the
> bionic code into nspr (or necko), but I am not sure how much that code fans
> out.  That would be better than locking for sure.

Yeah. Also setting affinity will keep parallelism while reducing chance of race condition to pre-dual-core levels, which wasn't really an issue back then (I don't think).
Comment 5 Patrick McManus [:mcmanus] 2011-09-19 08:57:55 PDT
> Here's a mirror (code is forked from netbsd):
> https://github.com/android/platform_bionic/blob/master/libc/netbsd/net/
> getaddrinfo.c
> 
> fopen() is called inside _sethtent() and fclose() in _endhtent(). Both are
> called by _files_getaddrinfo() which is responsible for reading /etc/hosts.

Thanks Jim!

Obviously this code can crash on single cpu devices too, it just doesn't happen often. right? It seems worth trying to fix that. I don't really think process affinity is a replacement for a data safety model.

> I don't see a way to skip that 

One thing we could do would be to cache the contents of the file in ram in a read-only threadsafe buffer. Update it under lock for the same conditions we call res_ninit() right now. 

A slightly uglier approach would be to just put a lock around _files_get_addrinfo. If STDIO is really not threadsafe (independent of file) couldn't either of these still have races against the content process saving a download, for instance? Or is the issue that all the threads are reading /etc/hosts?

Either way means pretty much importing the c file you referenced and calling that version of getaddrinfo on android < honeycomb, but that actually looks relatively easy to do to me. (almost everything in there is static, so just drag it along and make the necessary changes).
Comment 6 Jim Chen [:jchen] [:darchons] 2011-09-19 11:32:57 PDT
(In reply to Patrick McManus from comment #5)
> > I don't see a way to skip that 
> 
> One thing we could do would be to cache the contents of the file in ram in a
> read-only threadsafe buffer. Update it under lock for the same conditions we
> call res_ninit() right now. 
> 
> A slightly uglier approach would be to just put a lock around
> _files_get_addrinfo. If STDIO is really not threadsafe (independent of file)
> couldn't either of these still have races against the content process saving
> a download, for instance? Or is the issue that all the threads are reading
> /etc/hosts?

Hm, good point; actually stdio is not thread-safe in general :(

There might be other race conditions, but the one I found happens when allocating FILE handles: (the netbsd code had locks but they were removed in the android fork; boo)
https://github.com/android/platform_bionic/blob/master/libc/stdio/findfp.c#L95

So in theory, even if we lock getaddrinfo, we could race against other stdio usages.

> Either way means pretty much importing the c file you referenced and calling
> that version of getaddrinfo on android < honeycomb, but that actually looks
> relatively easy to do to me. (almost everything in there is static, so just
> drag it along and make the necessary changes).

Thanks for the ideas!
Comment 7 Alon Zakai (:azakai) 2011-09-20 14:43:36 PDT
Sorry for the long list of questions here.

Do we know why other apps on Android running in parallel do not hit this crash? Do they do all their IO through Java? Or does it have to do with each Android app being a separate unix user somehow?

Also, do we know why we are not hitting this with our two-process model? I guess we simply don't do much stdio stuff in the child? Surely we do plenty of it indirectly though, through various libraries (font loading, for example)?

How sure are we that the problem is limited to getaddrinfo and gethostbyname? Do we run a risk of this crash with every file opened, for example? Or do we feel it is limited to those two functions?

If we pull this code and fix it in our codebase, wouldn't we run into potential problems with different versions of bionic on different phones? (I mean, bionic could be patched for some issue on some device, and our single fixed implementation would not have that stuff.)

Finally, how does the actual threading issue happen: On one side we are calling getaddrinfo or gethostbyname, what is the other thread doing that causes the problem, do we know?
Comment 8 Patrick McManus [:mcmanus] 2011-09-20 15:04:27 PDT
(In reply to Alon Zakai (:azakai) from comment #7)

> Finally, how does the actual threading issue happen: On one side we are
> calling getaddrinfo or gethostbyname, what is the other thread doing that
> causes the problem, do we know?

I would assume, from the way this is described, the other thread is also commonly doing getaddrinfo(). There are (up to) 8 different dedicated DNS threads because getaddrinfo() is a blocking network operation that we need to perform in parallel. The system libraries do not provide an async API for it, so we do it in parallel.

It would also be common for those parallel lookups to start almost simultaneously, maximizing the chances to be executing the same stdio code simultaneously, because they are discovered in the same HTML parse (perhaps as sharded domain names for images or perhaps as DNS pefetches discovered by parsing links).
Comment 9 Michael K. Edwards (:medwards) 2011-09-20 17:30:19 PDT
My plan is to pull in the gingerbread bionic implementation of getaddrinfo() (in libc/netbsd/net/getaddrinfo.c) and reimplement _gethtent() without stdio calls.  Note that freeaddrinfo() and gai_strerror() will also be provided, to ensure that they match the getaddrinfo() implementation.

I intend to add getaddrinfo.o to libmozutils, inside /other-licenses/android, and to rename the public symbols to __wrap_getaddrinfo(), etc. so they will shadow the libc symbols at link time.
Comment 10 Andreas Gal :gal 2011-09-26 14:46:24 PDT
medwards will try to fix this today for android, we need a separate patch for B2G later
Comment 11 Andreas Gal :gal 2011-09-27 14:31:09 PDT
Michael, can you please post a status update?
Comment 12 Michael K. Edwards (:medwards) 2011-09-27 14:44:46 PDT
I'm setting up to push the candidate fix to the TryChooser.  Should have a testable build this afternoon.
Comment 13 Andreas Gal :gal 2011-09-27 22:20:45 PDT
I couldn't find any try server runs with this. Can you please link it?

Also, can you please attach the patch?

Thanks.
Comment 14 Michael K. Edwards (:medwards) 2011-09-27 22:48:39 PDT
Note that I have hacked out the code in getaddrinfo.c that looks aside to dnsproxyd, since the DNS proxy is not enabled in Android 2.3 or earlier, according to http://code.google.com/p/android/issues/detail?id=15722 .

(see android_getaddrinfo_proxy() in the original code; I copied http://git.android-x86.org/?p=platform/bionic.git;a=blob;f=libc/netbsd/net/getaddrinfo.c;h=edb4f707e730057d01c10872f01c932c31f42fbf;hb=refs/heads/gingerbread-x86)
Comment 15 Michael K. Edwards (:medwards) 2011-09-27 22:51:15 PDT
Created attachment 562974 [details] [diff] [review]
WIP: Android 2.3 getaddrinfo(), with stdio replaced by mmap(), adjusted for use with --wrap

(Lightly tested in a local build; bsmith is helping with tryserver run)
Comment 16 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-09-27 22:58:19 PDT
Thanks for your try submission (http://hg.mozilla.org/try/pushloghtml?changeset=4974bec3a8d3).  It's the best!

Watch https://tbpl.mozilla.org/?tree=Try&usebuildbot=1&rev=4974bec3a8d3 for your results to come in.

Builds and logs will be available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/bsmith@mozilla.com-4974bec3a8d3.

This directory won't be created until the first builds are uploaded, so please be patient.
Comment 17 Andreas Gal :gal 2011-09-27 23:07:18 PDT
Created attachment 562978 [details] [diff] [review]
diff android 2.3 bionic code vs WIP patch
Comment 18 Andreas Gal :gal 2011-09-27 23:07:38 PDT
Comment on attachment 562978 [details] [diff] [review]
diff android 2.3 bionic code vs WIP patch

>--- getaddrinfo.c	2011-09-27 23:03:51.000000000 -0700
>+++ /Users/gal/workspace/B2G/./glue/gonk/bionic/libc/netbsd/net/getaddrinfo.c	2011-08-16 00:45:19.000000000 -0700
>@@ -31,16 +31,6 @@
>  */
> 
> /*
>- * This version of getaddrinfo.c is derived from Android 2.3 "Gingerbread",
>- * which contains uncredited changes by Android/Google developers.  It has
>- * been modified in 2011 for use in the Android build of Mozilla Firefox by
>- * Mozilla contributors (including Michael Edwards <m.k.edwards@gmail.com>).
>- * These changes are offered under the same license as the original NetBSD
>- * file, whose copyright and license are unchanged above.
>- */
>-#define ANDROID_CHANGES 1
>-
>-/*
>  * Issues to be discussed:
>  * - Thread safe-ness must be checked.
>  * - Return values.  There are nonstandard return values defined and used
>@@ -87,14 +77,10 @@
>  *	  friends.
>  */
> 
>-#include <fcntl.h>
> #include <sys/cdefs.h>
> #include <sys/types.h>
>-#include <sys/stat.h>
> #include <sys/param.h>
> #include <sys/socket.h>
>-#include <sys/un.h>
>-#include <sys/mman.h>
> #include <net/if.h>
> #include <netinet/in.h>
> #include <arpa/inet.h>
>@@ -102,10 +88,10 @@
> #include <assert.h>
> #include <ctype.h>
> #include <errno.h>
>-#include <fcntl.h>
> #include <netdb.h>
> #include "resolv_private.h"
> #include <stddef.h>
>+#include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
>@@ -114,87 +100,6 @@
> #include <stdarg.h>
> #include "nsswitch.h"
> 
>-#ifdef ANDROID_CHANGES
>-#include <sys/system_properties.h>
>-#endif /* ANDROID_CHANGES */
>-
>-typedef struct _pseudo_FILE {
>-    int fd;
>-    off_t maplen;
>-    void* mapping;
>-    off_t offset;
>-} _pseudo_FILE;
>-
>-#define _PSEUDO_FILE_INITIALIZER { -1, 0, MAP_FAILED, 0 }
>-
>-static void
>-_pseudo_fclose(_pseudo_FILE* fp)
>-{
>-    fp->offset = 0;
>-    if (fp->mapping != MAP_FAILED) {
>-        (void) munmap(fp->mapping, fp->maplen);
>-        fp->mapping = MAP_FAILED;
>-    }
>-    fp->maplen = 0;
>-    if (fp->fd != -1) {
>-        (void) close(fp->fd);
>-        fp->fd = -1;
>-    }
>-}
>-
>-static _pseudo_FILE*
>-_pseudo_fopen_r(_pseudo_FILE* fp, const char* fname)
>-{
>-    struct stat statbuf;
>-    fp->fd = open(fname, O_RDONLY);
>-    if (fp->fd < 0) {
>-        fp->fd = -1;
>-        return NULL;
>-    }
>-    if ((0 != fstat(fp->fd, &statbuf)) || (statbuf.st_size <= 0)) {
>-        close(fp->fd);
>-        fp->fd = -1;
>-        return NULL;
>-    }
>-    fp->maplen = statbuf.st_size;
>-    fp->mapping = mmap(NULL, fp->maplen, PROT_READ, MAP_PRIVATE, fp->fd, 0);
>-    if (fp->mapping == MAP_FAILED) {
>-        close(fp->fd);
>-        fp->fd = -1;
>-        return NULL;
>-    }
>-    fp->offset = 0;
>-    return fp;
>-}
>-
>-static void
>-_pseudo_rewind(_pseudo_FILE* fp)
>-{
>-    fp->offset = 0;
>-}
>-
>-static char*
>-_pseudo_fgets(char* buf, int bufsize, _pseudo_FILE* fp)
>-{
>-    char* current;
>-    char* endp;
>-    int maxcopy = fp->maplen - fp->offset;
>-    if (fp->mapping == MAP_FAILED)
>-        return NULL;
>-    if (maxcopy > bufsize - 1)
>-        maxcopy = bufsize - 1;
>-    if (maxcopy <= 0)
>-        return NULL;
>-    //fprintf(stderr, "_pseudo_fgets(): copying up to %d bytes\n", maxcopy);
>-    current = ((char*) fp->mapping) + fp->offset;
>-    endp = memccpy(buf, current, '\n', maxcopy);
>-    if (endp)
>-        maxcopy = endp - buf;
>-    buf[maxcopy] = '\0';
>-    fp->offset += maxcopy;
>-    return buf;
>-}
>-
> typedef union sockaddr_union {
>     struct sockaddr     generic;
>     struct sockaddr_in  in;
>@@ -319,9 +224,9 @@
> static struct addrinfo *getanswer(const querybuf *, int, const char *, int,
> 	const struct addrinfo *);
> static int _dns_getaddrinfo(void *, void *, va_list);
>-static void _sethtent(_pseudo_FILE *);
>-static void _endhtent(_pseudo_FILE *);
>-static struct addrinfo *_gethtent(_pseudo_FILE *, const char *,
>+static void _sethtent(FILE **);
>+static void _endhtent(FILE **);
>+static struct addrinfo *_gethtent(FILE **, const char *,
>     const struct addrinfo *);
> static int _files_getaddrinfo(void *, void *, va_list);
> 
>@@ -391,20 +296,8 @@
> #define MATCH(x, y, w) 							\
> 	((x) == (y) || (/*CONSTCOND*/(w) && ((x) == ANY || (y) == ANY)))
> 
>-#pragma GCC visibility push(default)
>-
>-extern const char *
>-__wrap_gai_strerror(int ecode);
>-extern void
>-__wrap_freeaddrinfo(struct addrinfo *ai);
>-extern int
>-__wrap_getaddrinfo(const char *hostname, const char *servname,
>-    const struct addrinfo *hints, struct addrinfo **res);
>-
>-#pragma GCC visibility pop
>-
> const char *
>-__wrap_gai_strerror(int ecode)
>+gai_strerror(int ecode)
> {
> 	if (ecode < 0 || ecode > EAI_MAX)
> 		ecode = EAI_MAX;
>@@ -412,7 +305,7 @@
> }
> 
> void
>-__wrap_freeaddrinfo(struct addrinfo *ai)
>+freeaddrinfo(struct addrinfo *ai)
> {
> 	struct addrinfo *next;
> 
>@@ -499,7 +392,7 @@
> }
> 
> int
>-__wrap_getaddrinfo(const char *hostname, const char *servname,
>+getaddrinfo(const char *hostname, const char *servname,
>     const struct addrinfo *hints, struct addrinfo **res)
> {
> 	struct addrinfo sentinel;
>@@ -694,7 +587,7 @@
>  free:
>  bad:
> 	if (sentinel.ai_next)
>-		__wrap_freeaddrinfo(sentinel.ai_next);
>+		freeaddrinfo(sentinel.ai_next);
> 	*res = NULL;
> 	return error;
> }
>@@ -755,7 +648,7 @@
> 
> free:
> 	if (result)
>-		__wrap_freeaddrinfo(result);
>+		freeaddrinfo(result);
> 	return error;
> }
> 
>@@ -823,7 +716,7 @@
> 
> free:
> 	if (sentinel.ai_next)
>-		__wrap_freeaddrinfo(sentinel.ai_next);
>+		freeaddrinfo(sentinel.ai_next);
> 	return error;
> }
> 
>@@ -910,7 +803,7 @@
> free:
> bad:
> 	if (sentinel.ai_next)
>-		__wrap_freeaddrinfo(sentinel.ai_next);
>+		freeaddrinfo(sentinel.ai_next);
> 	return error;
> }
> 
>@@ -1473,6 +1366,8 @@
> 		const struct sockaddr_in6 *addr6 = (const struct sockaddr_in6 *)addr;
> 		if (IN6_IS_ADDR_LOOPBACK(&addr6->sin6_addr)) {
> 			return 0;
>+		} else if (IN6_IS_ADDR_ULA(&addr6->sin6_addr)) {
>+			return 1;
> 		} else if (IN6_IS_ADDR_V4MAPPED(&addr6->sin6_addr)) {
> 			return 3;
> 		} else if (IN6_IS_ADDR_6TO4(&addr6->sin6_addr)) {
>@@ -1512,6 +1407,8 @@
> 		const struct sockaddr_in6 *addr6 = (const struct sockaddr_in6 *)addr;
> 		if (IN6_IS_ADDR_LOOPBACK(&addr6->sin6_addr)) {
> 			return 60;
>+		} else if (IN6_IS_ADDR_ULA(&addr6->sin6_addr)) {
>+			return 50;
> 		} else if (IN6_IS_ADDR_V4MAPPED(&addr6->sin6_addr)) {
> 			return 30;
> 		} else if (IN6_IS_ADDR_6TO4(&addr6->sin6_addr)) {
>@@ -1900,24 +1797,27 @@
> }
> 
> static void
>-_sethtent(_pseudo_FILE *hostf)
>+_sethtent(FILE **hostf)
> {
> 
>-	if (hostf->mapping == MAP_FAILED)
>-		(void) _pseudo_fopen_r(hostf, _PATH_HOSTS);
>+	if (!*hostf)
>+		*hostf = fopen(_PATH_HOSTS, "r" );
> 	else
>-		_pseudo_rewind(hostf);
>+		rewind(*hostf);
> }
> 
> static void
>-_endhtent(_pseudo_FILE *hostf)
>+_endhtent(FILE **hostf)
> {
> 
>-	(void) _pseudo_fclose(hostf);
>+	if (*hostf) {
>+		(void) fclose(*hostf);
>+		*hostf = NULL;
>+	}
> }
> 
> static struct addrinfo *
>-_gethtent(_pseudo_FILE *hostf, const char *name, const struct addrinfo *pai)
>+_gethtent(FILE **hostf, const char *name, const struct addrinfo *pai)
> {
> 	char *p;
> 	char *cp, *tname, *cname;
>@@ -1926,18 +1826,15 @@
> 	const char *addr;
> 	char hostbuf[8*1024];
> 
>-	//fprintf(stderr, "_gethtent() name = '%s'\n", name);
>+//	fprintf(stderr, "_gethtent() name = '%s'\n", name);
> 	assert(name != NULL);
> 	assert(pai != NULL);
> 
>-	if (hostf->mapping == MAP_FAILED)
>-		(void) _pseudo_fopen_r(hostf, _PATH_HOSTS);
>-	if (hostf->mapping == MAP_FAILED)
>+	if (!*hostf && !(*hostf = fopen(_PATH_HOSTS, "r" )))
> 		return (NULL);
>  again:
>-	if (!(p = _pseudo_fgets(hostbuf, sizeof hostbuf, hostf)))
>+	if (!(p = fgets(hostbuf, sizeof hostbuf, *hostf)))
> 		return (NULL);
>-	//fprintf(stderr, "/etc/hosts line: %s", p);
> 	if (*p == '#')
> 		goto again;
> 	if (!(cp = strpbrk(p, "#\n")))
>@@ -1959,7 +1856,7 @@
> 		tname = cp;
> 		if ((cp = strpbrk(cp, " \t")) != NULL)
> 			*cp++ = '\0';
>-		//fprintf(stderr, "\ttname = '%s'\n", tname);
>+//		fprintf(stderr, "\ttname = '%s'", tname);
> 		if (strcasecmp(name, tname) == 0)
> 			goto found;
> 	}
>@@ -1968,7 +1865,7 @@
> found:
> 	hints = *pai;
> 	hints.ai_flags = AI_NUMERICHOST;
>-	error = __wrap_getaddrinfo(addr, NULL, &hints, &res0);
>+	error = getaddrinfo(addr, NULL, &hints, &res0);
> 	if (error)
> 		goto again;
> 	for (res = res0; res; res = res->ai_next) {
>@@ -1977,7 +1874,7 @@
> 
> 		if (pai->ai_flags & AI_CANONNAME) {
> 			if (get_canonname(pai, res, cname) != 0) {
>-				__wrap_freeaddrinfo(res0);
>+				freeaddrinfo(res0);
> 				goto again;
> 			}
> 		}
>@@ -1993,12 +1890,12 @@
> 	const struct addrinfo *pai;
> 	struct addrinfo sentinel, *cur;
> 	struct addrinfo *p;
>-	_pseudo_FILE hostf = _PSEUDO_FILE_INITIALIZER;
>+	FILE *hostf = NULL;
> 
> 	name = va_arg(ap, char *);
> 	pai = va_arg(ap, struct addrinfo *);
> 
>-	//fprintf(stderr, "_files_getaddrinfo() name = '%s'\n", name);
>+//	fprintf(stderr, "_files_getaddrinfo() name = '%s'\n", name);
> 	memset(&sentinel, 0, sizeof(sentinel));
> 	cur = &sentinel;
>
Comment 19 Andreas Gal :gal 2011-09-27 23:08:54 PDT
Created attachment 562979 [details] [diff] [review]
diff android 2.3 bionic code vs WIP patch
Comment 20 Michael K. Edwards (:medwards) 2011-09-27 23:33:04 PDT
Comment on attachment 562974 [details] [diff] [review]
WIP: Android 2.3 getaddrinfo(), with stdio replaced by mmap(), adjusted for use with --wrap

Review of attachment 562974 [details] [diff] [review]:
-----------------------------------------------------------------

See also https://bug687367.bugzilla.mozilla.org/attachment.cgi?id=562979, which shows the delta between the new files and the Android 2.3 fork of NetBSD from which they were derived.
Comment 21 Andreas Gal :gal 2011-09-27 23:51:32 PDT
dougt, can you land when you r+, medwards has no commit access yet. thanks
Comment 22 Doug Turner (:dougt) 2011-09-28 12:58:17 PDT
Comment on attachment 562974 [details] [diff] [review]
WIP: Android 2.3 getaddrinfo(), with stdio replaced by mmap(), adjusted for use with --wrap

Review of attachment 562974 [details] [diff] [review]:
-----------------------------------------------------------------

Discussed on irc.  I wanted to see the ofllow:

* add new files, in patched state
* patch that removes the DNS proxy lookaside from the original Android 2.3 files
* patch, incremental, that replaces stdio with mmap
* changes to configure.in and Makefile.in to compile the new file and do the wrap mechanics
* pre-honeycomb only
Comment 23 Michael K. Edwards (:medwards) 2011-09-28 18:00:23 PDT
Detecting >= Honeycomb can only be done from Java.  (See http://mxr.mozilla.org/mozilla-central/source/xpcom/base/nsSystemInfo.cpp#146 and Christopher's comments on http://stackoverflow.com/questions/2441680/how-do-i-know-what-android-ndk-version-im-running-in .)  Are you OK with polling it during startup and poking it into a global?  (Yuck, I know, but a callback into Java on every call to getaddrinfo()/freeaddrinfo() sounds painful.)
Comment 24 Doug Turner (:dougt) 2011-09-29 20:50:46 PDT
JNI can do this pretty easily, right?
Comment 25 Doug Turner (:dougt) 2011-09-29 20:55:52 PDT
how about this Michael... lets do this for all versions of android, then follow up making this pre-honeycomb only.
Comment 26 Michael K. Edwards (:medwards) 2011-09-30 15:28:07 PDT
Created attachment 563867 [details] [diff] [review]
Updated patch, with Honeycomb handling
Comment 27 Doug Turner (:dougt) 2011-09-30 15:53:23 PDT
Comment on attachment 563867 [details] [diff] [review]
Updated patch, with Honeycomb handling

Review of attachment 563867 [details] [diff] [review]:
-----------------------------------------------------------------

looking better.  Please fixup.  mwu should also look at these changes when the patch is cleaned up.

mwu - are you comfortable with the license here?  is it similar to the APKOpen stuff?

::: memory/mozutils/Makefile.in
@@ +83,5 @@
>  ifeq (Android, $(OS_TARGET))
>  # Add Android linker
>  EXTRA_DSO_LDOPTS += $(ZLIB_LIBS)
>  SHARED_LIBRARY_LIBS += $(call EXPAND_LIBNAME_PATH,android,$(DEPTH)/other-licenses/android)
> +WRAP_LDFLAGS = -Wl,--wrap=getaddrinfo,--wrap=freeaddrinfo,--wrap=gai_strerror

Why is this needed, but wasn't needed to wrap malloc?

::: other-licenses/android/002-replace-stdio-with-mmap.patch
@@ +28,5 @@
> +@@ -94,7 +105,6 @@
> + #include <netdb.h>
> + #include "resolv_private.h"
> + #include <stddef.h>
> +-#include <stdio.h>

why this change?

@@ +54,5 @@
> ++
> ++#define _PSEUDO_FILE_INITIALIZER { -1, 0, MAP_FAILED, 0 }
> ++
> ++static void
> ++_pseudo_fclose(_pseudo_FILE* fp)

test for a null fp? and elsewhere

@@ +111,5 @@
> ++    if (maxcopy > bufsize - 1)
> ++        maxcopy = bufsize - 1;
> ++    if (maxcopy <= 0)
> ++        return NULL;
> ++    //fprintf(stderr, "_pseudo_fgets(): copying up to %d bytes\n", maxcopy);

remove

@@ +259,5 @@
> +@@ -1373,8 +1508,6 @@
> + 		const struct sockaddr_in6 *addr6 = (const struct sockaddr_in6 *)addr;
> + 		if (IN6_IS_ADDR_LOOPBACK(&addr6->sin6_addr)) {
> + 			return 0;
> +-		} else if (IN6_IS_ADDR_ULA(&addr6->sin6_addr)) {

what is this change about?

@@ +268,5 @@
> +@@ -1414,8 +1547,6 @@
> + 		const struct sockaddr_in6 *addr6 = (const struct sockaddr_in6 *)addr;
> + 		if (IN6_IS_ADDR_LOOPBACK(&addr6->sin6_addr)) {
> + 			return 60;
> +-		} else if (IN6_IS_ADDR_ULA(&addr6->sin6_addr)) {

and this?

@@ +278,5 @@
> + }
> + 
> + static void
> +-_sethtent(FILE **hostf)
> ++_sethtent(_pseudo_FILE *hostf)

can't you just

#undef FILE
#define FILE _pseudo_FILE

for this and similar redefs. at the top of this file?  It will safe a bunch of these changes.

@@ +313,5 @@
> + 	const char *addr;
> + 	char hostbuf[8*1024];
> + 
> +-//	fprintf(stderr, "_gethtent() name = '%s'\n", name);
> ++	//fprintf(stderr, "_gethtent() name = '%s'\n", name);

revert this change.

@@ +335,5 @@
> + 		tname = cp;
> + 		if ((cp = strpbrk(cp, " \t")) != NULL)
> + 			*cp++ = '\0';
> +-//		fprintf(stderr, "\ttname = '%s'", tname);
> ++		//fprintf(stderr, "\ttname = '%s'\n", tname);

same.

@@ +368,5 @@
> + 	name = va_arg(ap, char *);
> + 	pai = va_arg(ap, struct addrinfo *);
> + 
> +-//	fprintf(stderr, "_files_getaddrinfo() name = '%s'\n", name);
> ++	//fprintf(stderr, "_files_getaddrinfo() name = '%s'\n", name);

same.

::: xpcom/base/nsSystemInfo.cpp
@@ +48,5 @@
>  #ifdef ANDROID
>  #include "AndroidBridge.h"
> +
> +extern "C" {
> +extern volatile int android_sdk_version;

volatile keyword not needed.  you should not use extern here either.  (this is the implicit definition).  you will need extern for the explicit declaration (where it is used outside of this file)
Comment 28 Michael K. Edwards (:medwards) 2011-09-30 16:08:50 PDT
(In reply to Doug Turner (:dougt) from comment #27)
> Comment on attachment 563867 [details] [diff] [review] [diff] [details] [review]
> Updated patch, with Honeycomb handling
> 
> Review of attachment 563867 [details] [diff] [review] [diff] [details] [review]:
> -----------------------------------------------------------------
> 
> looking better.  Please fixup.  mwu should also look at these changes when
> the patch is cleaned up.
> 
> mwu - are you comfortable with the license here?  is it similar to the
> APKOpen stuff?
> 
> ::: memory/mozutils/Makefile.in
> @@ +83,5 @@
> >  ifeq (Android, $(OS_TARGET))
> >  # Add Android linker
> >  EXTRA_DSO_LDOPTS += $(ZLIB_LIBS)
> >  SHARED_LIBRARY_LIBS += $(call EXPAND_LIBNAME_PATH,android,$(DEPTH)/other-licenses/android)
> > +WRAP_LDFLAGS = -Wl,--wrap=getaddrinfo,--wrap=freeaddrinfo,--wrap=gai_strerror
> 
> Why is this needed, but wasn't needed to wrap malloc?

Now that we call __real_getaddrinfo() from __wrap_getaddrinfo(), libmozutils linking breaks without it.

> ::: other-licenses/android/002-replace-stdio-with-mmap.patch
> @@ +28,5 @@
> > +@@ -94,7 +105,6 @@
> > + #include <netdb.h>
> > + #include "resolv_private.h"
> > + #include <stddef.h>
> > +-#include <stdio.h>
> 
> why this change?

Because the point of the patch is to remove all use of stdio from getaddrinfo().  The commented-out debug log lines, if restored, should be changed from stderr to the Android log mechanism (which is useful on a non-rooted device, unlike stderr).

> @@ +54,5 @@
> > ++
> > ++#define _PSEUDO_FILE_INITIALIZER { -1, 0, MAP_FAILED, 0 }
> > ++
> > ++static void
> > ++_pseudo_fclose(_pseudo_FILE* fp)
> 
> test for a null fp? and elsewhere

Really just needs a "restrict" keyword, since it's never valid to pass it a NULL pointer.

> @@ +111,5 @@
> > ++    if (maxcopy > bufsize - 1)
> > ++        maxcopy = bufsize - 1;
> > ++    if (maxcopy <= 0)
> > ++        return NULL;
> > ++    //fprintf(stderr, "_pseudo_fgets(): copying up to %d bytes\n", maxcopy);
> 
> remove

remove what?  just the commented-out log line?

> @@ +259,5 @@
> > +@@ -1373,8 +1508,6 @@
> > + 		const struct sockaddr_in6 *addr6 = (const struct sockaddr_in6 *)addr;
> > + 		if (IN6_IS_ADDR_LOOPBACK(&addr6->sin6_addr)) {
> > + 			return 0;
> > +-		} else if (IN6_IS_ADDR_ULA(&addr6->sin6_addr)) {
> 
> what is this change about?

That macro isn't defined by the older headers in the NDK we're using.  It's a side effect of another patch that was already applied to the version of getaddrinfo.c I started from; we want the rest of that patch.

> @@ +268,5 @@
> > +@@ -1414,8 +1547,6 @@
> > + 		const struct sockaddr_in6 *addr6 = (const struct sockaddr_in6 *)addr;
> > + 		if (IN6_IS_ADDR_LOOPBACK(&addr6->sin6_addr)) {
> > + 			return 60;
> > +-		} else if (IN6_IS_ADDR_ULA(&addr6->sin6_addr)) {
> 
> and this?

ditto

> @@ +278,5 @@
> > + }
> > + 
> > + static void
> > +-_sethtent(FILE **hostf)
> > ++_sethtent(_pseudo_FILE *hostf)
> 
> can't you just
> 
> #undef FILE
> #define FILE _pseudo_FILE
> 
> for this and similar redefs. at the top of this file?  It will safe a bunch
> of these changes.

No.  Note the different number of *s.  I changed the APIs so that the _pseudo_FILE struct is caller-allocated, in order to avoid the extra malloc/free.

> @@ +313,5 @@
> > + 	const char *addr;
> > + 	char hostbuf[8*1024];
> > + 
> > +-//	fprintf(stderr, "_gethtent() name = '%s'\n", name);
> > ++	//fprintf(stderr, "_gethtent() name = '%s'\n", name);
> 
> revert this change.

OK; it was just for consistency throughout the commented-out log lines (which are bad practice IMHO anyway, but whatever).

> @@ +335,5 @@
> > + 		tname = cp;
> > + 		if ((cp = strpbrk(cp, " \t")) != NULL)
> > + 			*cp++ = '\0';
> > +-//		fprintf(stderr, "\ttname = '%s'", tname);
> > ++		//fprintf(stderr, "\ttname = '%s'\n", tname);
> 
> same.

OK

> @@ +368,5 @@
> > + 	name = va_arg(ap, char *);
> > + 	pai = va_arg(ap, struct addrinfo *);
> > + 
> > +-//	fprintf(stderr, "_files_getaddrinfo() name = '%s'\n", name);
> > ++	//fprintf(stderr, "_files_getaddrinfo() name = '%s'\n", name);
> 
> same.

OK

> ::: xpcom/base/nsSystemInfo.cpp
> @@ +48,5 @@
> >  #ifdef ANDROID
> >  #include "AndroidBridge.h"
> > +
> > +extern "C" {
> > +extern volatile int android_sdk_version;
> 
> volatile keyword not needed.  you should not use extern here either.  (this
> is the implicit definition).  you will need extern for the explicit
> declaration (where it is used outside of this file)

A bare "volatile" is never really the right answer, I agree; but I am concerned about the access getting optimized away, given that the poke happens from a different thread than the peek.  But I'll take it off if you'd rather, along with the extern (which I am aware is the implicit default; I am just in the habit of making it explicit).
Comment 29 Michael Wu [:mwu] 2011-09-30 17:23:48 PDT
(In reply to Doug Turner (:dougt) from comment #27)
> mwu - are you comfortable with the license here?  is it similar to the
> APKOpen stuff?

The license looks fine to me but IANAL. You can always open up a bug to have legal check. The APKOpen stuff was also lifted from bionic.

It would probably be nice to have *just* the bionic files come in as their own check-in and then the actual changes/integration in another patch.
Comment 30 Doug Turner (:dougt) 2011-09-30 18:03:48 PDT
pushed to try: https://tbpl.mozilla.org/?tree=Try&rev=a7ea2b47360f
Comment 31 Mike Hommey [:glandium] 2011-09-30 23:05:47 PDT
(In reply to Michael K. Edwards from comment #28)
> (In reply to Doug Turner (:dougt) from comment #27)
> > Comment on attachment 563867 [details] [diff] [review] [diff] [details] [review] [diff] [details] [review]
> > Updated patch, with Honeycomb handling
> > 
> > Review of attachment 563867 [details] [diff] [review] [diff] [details] [review] [diff] [details] [review]:
> > -----------------------------------------------------------------
> > 
> > looking better.  Please fixup.  mwu should also look at these changes when
> > the patch is cleaned up.
> > 
> > mwu - are you comfortable with the license here?  is it similar to the
> > APKOpen stuff?
> > 
> > ::: memory/mozutils/Makefile.in
> > @@ +83,5 @@
> > >  ifeq (Android, $(OS_TARGET))
> > >  # Add Android linker
> > >  EXTRA_DSO_LDOPTS += $(ZLIB_LIBS)
> > >  SHARED_LIBRARY_LIBS += $(call EXPAND_LIBNAME_PATH,android,$(DEPTH)/other-licenses/android)
> > > +WRAP_LDFLAGS = -Wl,--wrap=getaddrinfo,--wrap=freeaddrinfo,--wrap=gai_strerror
> > 
> > Why is this needed, but wasn't needed to wrap malloc?
> 
> Now that we call __real_getaddrinfo() from __wrap_getaddrinfo(), libmozutils
> linking breaks without it.

Just call getaddrinfo() from __wrap_getaddrinfo(), and you don't need -Wl,--wrap... like __wrap_dl* functions call dl* (see dlfcn.h)
Comment 32 Mike Hommey [:glandium] 2011-09-30 23:06:22 PDT
(In reply to Mike Hommey [:glandium] from comment #31)
> Just call getaddrinfo() from __wrap_getaddrinfo(), and you don't need
> -Wl,--wrap... like __wrap_dl* functions call dl* (see dlfcn.h)

err, dlfcn.c.
Comment 33 Andreas Gal :gal 2011-10-03 19:03:38 PDT
How are we doing with a new patch here?
Comment 34 Michael K. Edwards (:medwards) 2011-10-04 01:25:47 PDT
Created attachment 564480 [details] [diff] [review]
Updated patch, addressing review comments
Comment 35 Michael K. Edwards (:medwards) 2011-10-04 01:29:35 PDT
This updated patch isn't properly tested yet.  I also didn't address Mike Hommey's comment, because I think it's safer to use __real_getaddrinfo() and thus require that direct users of getaddrinfo() use the --wrap flags.  (Indirect users don't need them.)
Comment 36 Doug Turner (:dougt) 2011-10-04 12:19:08 PDT
michael, did you look at the debug test failures?  Are they random, or were they caused by this patch?
Comment 37 Michael K. Edwards (:medwards) 2011-10-04 12:25:00 PDT
I looked at them with people on #mobile (mbrubeck, IIRC) on Friday, who declared them to be symptoms of perma-orange issues on Android, unrelated to my patch.
Comment 38 Mike Hommey [:glandium] 2011-10-05 09:00:32 PDT
(In reply to Michael K. Edwards from comment #35)
> This updated patch isn't properly tested yet.  I also didn't address Mike
> Hommey's comment, because I think it's safer to use __real_getaddrinfo() and
> thus require that direct users of getaddrinfo() use the --wrap flags. 
> (Indirect users don't need them.)

There could be a concern if libmozutils was much bigger and was doing much more. Anyways, if ted is fine with that (i.e. using __real_getaddrinfo and --wrap flags), then so be it.
Comment 39 Ted Mielczarek [:ted.mielczarek] 2011-10-05 09:05:08 PDT
I defer to glandium here.
Comment 40 Mike Hommey [:glandium] 2011-10-05 09:24:58 PDT
ooook. So, the only calls intended for libc's getaddrinfo() from libmozutils, at the moment, are from that wrapper code. I don't see any reason in the near future (and even longer term) for something from libmozutils to call getaddrinfo(). I don't think there's a strong case for cluttering the build rules with additional --wrap flags (for libmozutils, that is ; other uses are required).
Comment 41 Michael K. Edwards (:medwards) 2011-10-05 09:55:28 PDT
The point is not that other code in libmozutils needs to call getaddrinfo(); it doesn't.  The point is that libraries that call getaddrinfo() need to be linked against libmozutils, with the appropriate --wrap flags, in order to ensure that their calls to getaddrinfo() get rewritten as calls to __wrap_getaddrinfo().  Otherwise they will get the wrong getaddrinfo() (the bionic one) when the library is loaded via dlopen().  The only occurrence of the getaddrinfo() symbol in the final set of binaries should be the one inside the implementation of __wrap_getaddrinfo(); and the "official" way to achieve that is to call __real_getaddrinfo() in the source code and let the linker rewrite that to getaddrinfo().
Comment 42 Mike Hommey [:glandium] 2011-10-05 10:11:55 PDT
the "official" way to achieve that is for when all your code compiles to a program or library. Here, we have libmozutils, and the rest. The rest needs --wrap so that getaddrinfo() calls are properly redirected to __wrap_getaddrinfo and linked against the corresponding symbols in libmozutils. Libmozutils, on the other hand, just needs to declare __wrap_getaddrinfo functions, and these functions can just call getaddrinfo. No need for __real_getaddrinfo, no need for additional --wrap in memory/mozutils/Makefile.in. That was my point.
Comment 43 Andreas Gal :gal 2011-10-05 17:06:49 PDT
medwards is heads down in B2G work and mwu agreed to take this over the finish line and land it. Reassigning.
Comment 44 Michael Wu [:mwu] 2011-10-06 21:35:42 PDT
Landed with wrapping adjusted.

https://hg.mozilla.org/mozilla-central/rev/c3a50afc2243
Comment 45 Brad Lassey [:blassey] (use needinfo?) 2011-10-10 07:45:32 PDT
backed out due to the crashes reported in bug 693103:
https://hg.mozilla.org/mozilla-central/rev/3a910924c50c
Comment 46 Matt Brubeck (:mbrubeck) 2011-10-10 09:58:05 PDT
The backout was backed out until we can figure out which of several patches broke Android tests:
https://hg.mozilla.org/mozilla-central/rev/bcf12565b95b

Leaving this bug open, as we still plan to back it out ASAP.
Comment 47 Matt Brubeck (:mbrubeck) 2011-10-10 11:08:54 PDT
Re-landed the backout:
https://hg.mozilla.org/mozilla-central/rev/1db52ff3bfe9
Comment 48 Michael K. Edwards (:medwards) 2011-10-10 12:52:32 PDT
The crashes look exotic to me, in that they are segfaults due to an assortment of unreasonable values for the "name" pointer, at what happens to be the first place where it is dereferenced after it is received through the inter-thread dispatch mechanism.  Either it doesn't make it through dispatch correctly or it gets corrupted thereafter.  The only reasonable thing I can think of to do is to dereference the pointer in the caller (nsHostResolver::ThreadFunc) and re-land the patch (with the fixes from 693280), and see whether the crashes appear in ThreadFunc or down in strlen.  (One could also add a couple more dereferences along the way, in __wrap_getaddrinfo and _dns_getaddrinfo.)
Comment 49 Matt Brubeck (:mbrubeck) 2011-10-10 16:41:02 PDT
This was also re-landed and then re-backed out on inbound:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a5b88a4500aa
https://hg.mozilla.org/integration/mozilla-inbound/rev/09ca3ba9ccb0

(These should be a no-op in the next merge from inbound to m-c.  Inbound sheriff, you can just leave this bug OPEN after merging these two changesets.)
Comment 50 Michael K. Edwards (:medwards) 2011-10-10 20:53:21 PDT
jchen points out that it could be strlen(domain) that fails, and that the domain pointer is extracted from a structure returned by a function internal to the NetBSD resolver code (__res_get_state()), which we haven't copied.  If this is where the problem originates -- due to incompatible layout of struct _res_state, or some such -- we're going to have to pull in res_state.c, and perhaps more (if we also need to replace res_ninit()).  Does someone who can reproduce the crashes want to check this hypothesis before we dive further into it?
Comment 51 Marco Bonardo [::mak] (Away 6-20 Aug) 2011-10-11 02:32:36 PDT
https://hg.mozilla.org/mozilla-central/rev/09ca3ba9ccb0
Comment 52 Andreas Gal :gal 2011-10-11 09:23:32 PDT
This seems to be quite a rat hole here. Is there an easier fix? Can we do DNS lookups from one thread only? Can we ferry the lookups to the Java layer? (which would give us free caching, btw). Doug? Brian?
Comment 53 Doug Turner (:dougt) 2011-10-11 14:46:03 PDT
I suggested that we just put a lock around the call site in comment #1, then fix it for real.  This allows us to unfuck our users.  We can fix it correctly post that commit.
Comment 54 Andreas Gal :gal 2011-10-11 16:42:03 PDT
medwards was argueing that might not help since the crash is in stdio, which is called from other paths as well. Though, since this is all one huge clusterfuck in Android, I would be down to just giving that a try and see whether the crash frequency goes down. Also, we should check what webkit does.
Comment 55 Michael K. Edwards (:medwards) 2011-10-11 17:52:41 PDT
I do think that throwing a lock around the use of getaddrinfo() will reduce the crash frequency, because the contention on the stdio FILE table is most often caused by concurrent DNS requests initiated from two or more worker threads.  It won't *fix* the problem; but it will probably stem the bleeding.
Comment 56 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-10-11 18:34:06 PDT
(In reply to Patrick McManus from comment #2)
> If we really are forced to go down this path I would prefer
> MAX_RESOLVER_THREADS_FOR_ANY_PRIORITY, and
> MAX_RESOLVER_THREADS_FOR_HIGH_PRIORITY were changed into prefs that could be
> set to {1,0} for this case rather than the lock. But I really think that's a
> big step back for phones as modern as gingerbread!

Let's stop the crash now and deal with the performance fallout later.

Patrick, could you take this bug? It seems to be something that urgently needs to be fixed and you seem to know best what needs to be done. If you cannot take it up right away, please assign it to me and I will figure it out.
Comment 57 Andreas Gal :gal 2011-10-11 23:36:27 PDT
Created attachment 566450 [details] [diff] [review]
patch

Lets get this minimal fix landed. I heard we only have around 10k honeycomb users, so we can take a couple days for a follow-up fix with dynamic detection.
Comment 58 Patrick McManus [:mcmanus] 2011-10-12 06:28:10 PDT
Comment on attachment 566450 [details] [diff] [review]
patch

Review of attachment 566450 [details] [diff] [review]:
-----------------------------------------------------------------

I got any/high backwards in my bugzilla comment.

I'll post a different patch when I test it (shortly!).

bz will probably be the right reviewer.
Comment 59 Patrick McManus [:mcmanus] 2011-10-12 07:01:03 PDT
Created attachment 566517 [details] [diff] [review]
no-concurrent-android-dns patch 2

This will create concurrency of 1 and disable dns prefetch for android.

A followup bug is really important.
Comment 60 Boris Zbarsky [:bz] 2011-10-12 10:46:46 PDT
Comment on attachment 566517 [details] [diff] [review]
no-concurrent-android-dns patch 2

r=me
Comment 61 Patrick McManus [:mcmanus] 2011-10-12 15:03:37 PDT
http://hg.mozilla.org/integration/mozilla-inbound/rev/332bc69d0f2d

please file a high priority followup bug. Resolutions that timeout will severely gum up the works on mobile now.
Comment 62 Marco Bonardo [::mak] (Away 6-20 Aug) 2011-10-13 07:31:06 PDT
https://hg.mozilla.org/mozilla-central/rev/332bc69d0f2d
Comment 63 Mark Finkle (:mfinkle) (use needinfo?) 2011-10-13 07:40:41 PDT
(In reply to Andreas Gal :gal from comment #57)
> Created attachment 566450 [details] [diff] [review] [diff] [details] [review]
> patch
> 
> Lets get this minimal fix landed. I heard we only have around 10k honeycomb
> users, so we can take a couple days for a follow-up fix with dynamic
> detection.

FWIW, we have >120K honeycomb users. Not that it changes the type of fix.
Comment 64 Andreas Gal :gal 2011-10-13 08:35:36 PDT
Sorry I misspoke. I meant we only have 10k gingerbread (3.0) users. Is that correct Mark?
Comment 65 Mark Finkle (:mfinkle) (use needinfo?) 2011-10-13 10:32:03 PDT
(In reply to Andreas Gal :gal from comment #64)
> Sorry I misspoke. I meant we only have 10k gingerbread (3.0) users. Is that
> correct Mark?

"Gingerbread (3.0)" is confusing me, but the breakdown:

Gingerbread (2.3.*): >1.4 million users
Honeycomb (3.*): >120K users
Comment 66 Andreas Gal :gal 2011-10-13 11:01:55 PDT
Ok, thats curious. In the tablet planning meetings we were using wrong numbers than. Where is the 120k number coming from?
Comment 67 Mark Finkle (:mfinkle) (use needinfo?) 2011-10-13 11:19:30 PDT
(In reply to Andreas Gal :gal from comment #66)
> Ok, thats curious. In the tablet planning meetings we were using wrong
> numbers than. Where is the 120k number coming from?

The Android Market developer portal
Comment 68 Matt Brubeck (:mbrubeck) 2011-10-13 11:26:54 PDT
Note that the Android Market gives us numbers of "Active Installs", which I believe counts devices that are active (i.e. still pinging the Market server) and currently have Firefox installed (i.e. have not uninstalled it).

This is very different from our own "Active Daily Users" metric based on blocklist pings, which only counts actual usage.

We currently have a total of about 2,700,000 "Active Installs" but only about 330,000 "Active Daily Users".   Comparing usage metrics is impossible unless you actually specify which metric you are talking about.
Comment 69 Brad Lassey [:blassey] (use needinfo?) 2011-10-13 11:44:43 PDT
also note that we typically use a multiplier on ADUs to estimate our actual install base since not everyone uses there browser every day or long enough to get a block list ping. I believe the current multiplier is around 6x. So if we have 10K ADUs we'd estimate 60k active users, which would be half our 120k current installs.
Comment 70 Andreas Gal :gal 2011-10-13 11:55:34 PDT
I see. Thanks.
Comment 71 Steve Workman [:sworkman] (please use needinfo) 2011-11-01 12:05:06 PDT
Noticed that this was identified as one of the top 3 crashes for Android - adjusted tracking.  There should be no stability downside to the patch, but there is also no good way to determine what if any performance regression there would be

Note: patch fixes crash, but does so by disabling parallelism for Android DNS lookups.
Comment 72 Alex Keybl [:akeybl] 2011-11-01 14:23:21 PDT
Comment on attachment 566517 [details] [diff] [review]
no-concurrent-android-dns patch 2

[Triage Comment]
* Approving for Aurora given this is a top crasher
* Leaving approval‑mozilla‑beta as "?" since we'd reconsider if another bug causes us to re-roll
Comment 73 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-11-01 14:43:47 PDT
Will check into mozilla-aurora after finishing XPCOM proxy patch fixups.
Comment 74 Steve Workman [:sworkman] (please use needinfo) 2011-11-01 14:46:41 PDT
Thanks, Alex, Brian.
Comment 75 christian 2011-11-02 17:34:01 PDT
We respun for a different issue so decided to take this on beta (and I transplanted to aurora):

http://hg.mozilla.org/releases/mozilla-aurora/rev/4b3528fc67a8
http://hg.mozilla.org/releases/mozilla-beta/rev/dc48fbe54c4a
Comment 76 juan becerra [:juanb] 2011-11-03 09:29:50 PDT
Adding [qa+] for bug verification tracking.

We'll look at crash-stats before and after the fix, but it would be great if anyone has suggestions on how to verify this bug directly.
Comment 77 Steve Workman [:sworkman] (please use needinfo) 2011-12-06 10:55:48 PST
Juan, reproducing the bug seems to be difficult, and I don't have a lot to offer other than it was happening mostly on dual core devices, pre-honeycomb.  I never observed it myself.

Jchen might be able to offer some more help - he added a stack trace highlighting the weakness in thise code in bug 662936 comment 54.
Comment 78 Jim Chen [:jchen] [:darchons] 2011-12-06 12:46:23 PST
I think Bug 689989 should be enough. After removing the original workaround, our tests on the Tegra boards should be able to verify the fix.

Note You need to log in before you can comment on or make changes to this bug.