Rewrite oom-msg-logger in C or C++

RESOLVED FIXED

Status

defect
RESOLVED FIXED
6 years ago
5 years ago

People

(Reporter: dhylands, Assigned: danieru.dressler)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [good first bug][mentor=dhylands][lang=C++])

Attachments

(3 attachments, 8 obsolete attachments)

oom-msg-logger is currently a shell script which runs as root. It should be rewritten as a C++ program which assigns itself the CAP_SYSLOG capability, and then lowers its uid/gid to shell/system.

It turns out that bionic supports regex and a sample which uses it can be found in B2G/gonk-misc/b2g-ingo/process.cpp
See Also: → 939370
Assignee: nobody → danieru.dressler
@Dave: Thank you, you've been a lot of help so far already.

If there are no objections I would like to suggest that oom-msg-logger be put into service as fxos's go to klog monitor.

The kernel log is a ring buffer in kernel space. Through /proc/kmsg userland can retrieve the log messages only once, anything read is removed from the buffer. Reading from /proc/kmsg is how the current oom monitor script functions. This has the side-effect of hiding kernel messages from dmesg which may make debugging low level issues harder. dmesg works by echoing the kernel ring buffer without flushing.

Through klogctl(2) it is possible to read the log without flushing. The tradeoff for not flushing being no easy method for getting only new messages, and more complex yielding. Finding only the new messages is possible since kernel messages are given timestamps but it requires further parsing of the messages. We lose the automatic yielding during waits for kernel messages. At present when we request more kernel messages our thread will get slept until messages exist. Under a flush-less monitor we must add a polling interval else we burn battery and cpu.

Assuming more things may need to be monitored from the kernel log in the future I propose oom-msg-logger be renamed to something grander yet more general like 'kmsgd', 'kmsg-monitor', 'klog-mon', or similar (or dissimilar). Then people should feel free to add extra regex for the monitor to watch for. This rewrite must already touch the init scripts since we'll be dropping the '.sh' thus a rename is not much extra work.

Also for consideration would be logging all kernel messages to android's log by default under verbose priority. Thus if a message matches a regex it would get a higher priority (example: info). Others will need to weigh in on the idea since since kernel messages may contain privileged information. 

I have attached a .patch of my current work. If someone could give it a look over I would appreciate the help. Warning, the patch is a work in progress as I must rewrite the capabilities handling to use libcap and fix an issue with some headers on my machine, it shouldn't compile.
Work in progress, please don't kill me!
Comment on attachment 8334763 [details] [diff] [review]
0001-Add-regex-based-kernel-log-monitor.patch

Review of attachment 8334763 [details] [diff] [review]:
-----------------------------------------------------------------

I think that you'll also need to remove oom-msg-logger from gonk-misc/Android.mk

::: oom-msg-logger/oom-msg-logger.c
@@ +17,5 @@
> +
> +#define PUBLICNAME "OOM"
> +
> +#ifdef ANDROID
> +	#define dest_log(...) do {\

I'd change dest_log to be all caps. That seems to be the most common convention in the code base (all caps for macros)

@@ +18,5 @@
> +#define PUBLICNAME "OOM"
> +
> +#ifdef ANDROID
> +	#define dest_log(...) do {\
> +		__android_log_print( \

nit: Our coding standard https://developer.mozilla.org/en-US/docs/Developer_Guide/Coding_Style uses 2 spaces per indent.

@@ +28,5 @@
> +	#warning "Warning: OOM messages shall only get sent to stdout"
> +	#define dest_log(...) do { printf(__VA_ARGS__); } while (0)
> +#endif
> +
> +				

nit: trailing space

@@ +46,5 @@
> +	// Saves us from brining in libcap
> +	// This will drop all caps but cap_syslog
> +	// Is an effective de-root
> +	struct __user_cap_header_struct request_header;
> +		request_header.version = _LINUX_CAPABILITY_VERSION_2;

nit: I wouldn't indent here. I understand what you're trying to do, but it isn't something I see in the rest of our code base. I'd probably just put it at the same indentation and perhaps add a blank line after the assignments.

Optionally, since we know we're using gcc, you could also do this style of initialization:

  struct __user_cap_header_struct request_header = {
    .version = _LINUX_CAPABILITY_VERSION_2,
    .pid = getpid()
  };

@@ +61,5 @@
> +	//       Now we make it obvious to userland
> +	struct passwd *shell_user = getpwnam("shell");
> +	if (shell_user == NULL) {
> +		dest_log("Error: Failed to find 'shell' user uid, exiting\n");
> +		exit(-3);

I think that the exit codes should be positive numbers.

@@ +66,5 @@
> +	}
> +	setuid(shell_user->pw_uid);
> +
> +	// Precompile our regex
> +	for (int i = 0; i < regex_count; i++) {

Declaring int in the for seems like a C++ism (as is declaring the variables throughout the function and using // for comments).

I'd probably name this to be a .cpp file and then you don't need to make it look like a C file.

@@ +80,5 @@
> +
> +	// -- Monitor kernel log
> +	// Forward interesting messages onto android log
> +	int msg_buf_size = klogctl(SYSLOG_ACTION_SIZE_BUFFER, NULL, 0);
> +	char *passthrough_buf = malloc((msg_buf_size + 1) * sizeof(char));

null check malloc return? I think its better to fail with an explicit message rather than a segfault.

@@ +86,5 @@
> +		// yields until kernel log is non-empty
> +		int read_size = klogctl(SYSLOG_ACTION_READ, passthrough_buf, msg_buf_size);
> +
> +		// Paranoid transformation of buffer into string
> +		read_size= msg_buf_size < read_size ? msg_buf_size : read_size;

nit: space before =

super nit: Personally, I would have coded this this way (I find it flows a bit easier - but that's just a personal preference)

read_size = read_size >= msg_buf_size ? msg_buf_size : read_size;

@@ +88,5 @@
> +
> +		// Paranoid transformation of buffer into string
> +		read_size= msg_buf_size < read_size ? msg_buf_size : read_size;
> +		passthrough_buf[read_size] = '\0';
> +		

nit: trailing space

@@ +94,5 @@
> +		// for each line
> +		do {
> +			char *line_begin = line_end;
> +			// find end of current line
> +			while (*line_end++ != '\0')

nit: Use braces here (even for single line stuff - see coding style guide - Search for "Control structures")

@@ +96,5 @@
> +			char *line_begin = line_end;
> +			// find end of current line
> +			while (*line_end++ != '\0')
> + 				if (*line_end == '\n')
> +					break;

Actually, you should probably use: line_end = strchr('\n') and line_end will be NULL if no \n is found.

@@ +107,5 @@
> +			for (int i = 0; i < regex_count; i++) {
> +				int not_matched = regexec(&(regex_compiled[i]), line_begin, 0, NULL, 0);
> +				if (not_matched)
> +					continue;
> +					

nit: trailing space
(In reply to Daniel Dressler from comment #1)
> @Dave: Thank you, you've been a lot of help so far already.
> 
> If there are no objections I would like to suggest that oom-msg-logger be
> put into service as fxos's go to klog monitor.

Philisophically, I have no objections to this.

> The kernel log is a ring buffer in kernel space. Through /proc/kmsg userland
> can retrieve the log messages only once, anything read is removed from the
> buffer. Reading from /proc/kmsg is how the current oom monitor script
> functions. This has the side-effect of hiding kernel messages from dmesg
> which may make debugging low level issues harder. dmesg works by echoing the
> kernel ring buffer without flushing.
> 
> Through klogctl(2) it is possible to read the log without flushing. The
> tradeoff for not flushing being no easy method for getting only new
> messages, and more complex yielding. Finding only the new messages is
> possible since kernel messages are given timestamps but it requires further
> parsing of the messages. We lose the automatic yielding during waits for
> kernel messages. At present when we request more kernel messages our thread
> will get slept until messages exist. Under a flush-less monitor we must add
> a polling interval else we burn battery and cpu.

If we are going to consume the messages, then we should also provide a dmesg that explains whats going on so that the user can reconfigure things, look in logcat or whatever is appropriate.

> Assuming more things may need to be monitored from the kernel log in the
> future I propose oom-msg-logger be renamed to something grander yet more
> general like 'kmsgd', 'kmsg-monitor', 'klog-mon', or similar (or
> dissimilar). Then people should feel free to add extra regex for the monitor
> to watch for. This rewrite must already touch the init scripts since we'll
> be dropping the '.sh' thus a rename is not much extra work.
> 
> Also for consideration would be logging all kernel messages to android's log
> by default under verbose priority. Thus if a message matches a regex it
> would get a higher priority (example: info). Others will need to weigh in on
> the idea since since kernel messages may contain privileged information. 

My general feeling is that most people aren't interested in the kernel messages and would consider it to be log spam.

What if we make it possible to do what you're suggesting but use some command line options to control how things get logged?

Or better yet, have a configuration file which contains the regular expressions and what log level they should be record at? Then a developer can customzie the config file to get more or less information?

> I have attached a .patch of my current work. If someone could give it a look
> over I would appreciate the help. Warning, the patch is a work in progress
> as I must rewrite the capabilities handling to use libcap and fix an issue
> with some headers on my machine, it shouldn't compile.
Adding a needinfo for mwu to weigh in on comment #1

So the question is, is it ok to consume the messages? (i.e. if you ran dmesg you'd get no output)
Or should we put in the extra complexity to continue to allow dmesg to work?
Or is having a configuration file to allow the kernel messages to goto logcat sufficient?
Or something else?
Flags: needinfo?(mwu)
Hmm, I wouldn't like kernel messages to be consumed. dmesg is useful for me in many cases where I didn't have any idea what I was looking for.

Can we have b2g main process check for messages every time a child process dies?
Flags: needinfo?(mwu)
Thanks for the review Dave, I'll get all those issues you noticed fixed.

Just to give some more context for mwu. The extra complexity for a dmesg compatible solution is to walk the buffer backwards until we find a message older than our "last message" timestamp. Not a big performance issue. The real performance question comes from the need to use polling. The kernel does not log too much in normal operation. Wifi, sleeps, only big events from my laptop's perspective. My desktop is even quieter, lots of noise during the first 11 seconds of startup then nothing for hours. If we're waiting up every N seconds we'll often wake up to no new messages, yet we'll require a full copying of the ring buffer to know that.

Augmenting with a configuration file and usurping dmesg by sending kernel messages to logcat might be the best approach. Logcat is already well used with developer friendly infrastructure, meanwhile dmesg requires users to shell in. We get the low performance impact thanks to the kernel managed yielding. The downside is the log noise. I'm not sure how much noise verbose is expected too endure but with a config file it could be used only on dev builds.
So the original motivation behind all of this was that we have processes dying due to OOM, and there isn't any indication given in logcat to indicate that.

My very original idea was that the oom logger would run as a thread inside the b2g process, and then it got morphed into being a standalone process.

I think mwu's idea of doing a check whenever the main process detects a child process dying is worth considering.

Developers are used to looking in logcat and providing logcat attachments to bugs, but most developers are unaware of dmesg and the fact that OOM messages are only available there (which is the real motivation behind the OOM logger).

So lets not over-engineer this and keep it as simple as we can.

I like the idea of doing this in the b2g process (has less system impact since it will use less memory).
I also think that we really only need to check when a child process dies.

I think that if we register an observer on the ipc:content-shutdown topic in the parent then that observer will be called every time a child dies. We could then make a single pass through the kernel log to report any new message what we haven't yet reported.
Checking in b2g process only on dead children does sound like the better approach. It saves us a stack and should cause even fewer wakeups.

Meta: What must we do to make that our new plan? Should anyone else weigh in?
Sorry for my short response to your review Dave. Your comments were helpful.

> I'd change dest_log to be all caps. That seems to be the most common
> convention in the code base (all caps for macros)
You're right, without CAPS it is not obvious there are no return values.


> nit: Our coding standard
> https://developer.mozilla.org/en-US/docs/Developer_Guide/Coding_Style uses 2
> spaces per indent.
Ok, not good news but at least its consistent.


> Optionally, since we know we're using gcc, you could also do this style of
> initialization:
> 
>   struct __user_cap_header_struct request_header = {
>     .version = _LINUX_CAPABILITY_VERSION_2,
>     .pid = getpid()
>   };
Thank you that is just plain better.

> Declaring int in the for seems like a C++ism (as is declaring the variables
> throughout the function and using // for comments).
This might be a nit-pick but those are features of C99.  It was C89 and C90 which only allowed var declaration at the beginning of blocks. Likewise C99 added C++'s comment style. I used to use /**/ but another project I contributed to wanted //. Their reasoning is /**/ does not allow nested comments but otherwise is fantastic for commenting out code. So by default I've been trying to use //

The issue with C99 is that microsoft has never implemented it. Internal to Microsoft only the kernel team was C only. Thus the compilers team convinced the kernel team to migrate to compiling under C++. Similar to how GCC has migrated to C++. Thus this might be the only mozilla project which can ignore MSVC's "issues".

 

> I'd probably name this to be a .cpp file and then you don't need to make it
> look like a C file.
> 
> @@ +80,5 @@
> > +
> > +	// -- Monitor kernel log
> > +	// Forward interesting messages onto android log
> > +	int msg_buf_size = klogctl(SYSLOG_ACTION_SIZE_BUFFER, NULL, 0);
> > +	char *passthrough_buf = malloc((msg_buf_size + 1) * sizeof(char));
> 
> null check malloc return? I think its better to fail with an explicit
> message rather than a segfault.
> 
> @@ +86,5 @@
> > +		// yields until kernel log is non-empty
> > +		int read_size = klogctl(SYSLOG_ACTION_READ, passthrough_buf, msg_buf_size);
> > +
> > +		// Paranoid transformation of buffer into string
> > +		read_size= msg_buf_size < read_size ? msg_buf_size : read_size;
> 
> nit: space before =
> 
> super nit: Personally, I would have coded this this way (I find it flows a
> bit easier - but that's just a personal preference)
> 
> read_size = read_size >= msg_buf_size ? msg_buf_size : read_size;
> 
> @@ +88,5 @@
> > +
> > +		// Paranoid transformation of buffer into string
> > +		read_size= msg_buf_size < read_size ? msg_buf_size : read_size;
> > +		passthrough_buf[read_size] = '\0';
> > +		
> 
> nit: trailing space
> 
> @@ +94,5 @@
> > +		// for each line
> > +		do {
> > +			char *line_begin = line_end;
> > +			// find end of current line
> > +			while (*line_end++ != '\0')
> 
> nit: Use braces here (even for single line stuff - see coding style guide -
> Search for "Control structures")
> 
> @@ +96,5 @@
> > +			char *line_begin = line_end;
> > +			// find end of current line
> > +			while (*line_end++ != '\0')
> > + 				if (*line_end == '\n')
> > +					break;
> 
> Actually, you should probably use: line_end = strchr('\n') and line_end will
> be NULL if no \n is found.
> 
> @@ +107,5 @@
> > +			for (int i = 0; i < regex_count; i++) {
> > +				int not_matched = regexec(&(regex_compiled[i]), line_begin, 0, NULL, 0);
> > +				if (not_matched)
> > +					continue;
> > +					
> 
> nit: trailing space
I am very sorry. Somehow that comment got posted while I was still writing it!

Only thing I missed commenting on would be thanking you for pointing out strchr. So thank you for pointing out strchr()!
Oh I also missed the part where you found me not checking malloc failure, that's pretty embarrassing. That check was not skipped on purpose.
I think its still fairly isolated. The biggest difference is that the code would now go in the gecko tree (which is maintained under mercurial).

You can still do your work using the git tree, the final results will just get delivered a little differently.

I took a look around the tree, and I think that logical place to put the new file would be in gecko/hal/gonk

You can register an observer against ipc:content-shutdown, as is done here:
http://dxr.mozilla.org/mozilla-central/source/dom/ipc/ProcessPriorityManager.cpp#415

I'm pretty sure that this will cause the Observe method to be called whenever a child process goes away.
http://dxr.mozilla.org/mozilla-central/source/dom/ipc/ProcessPriorityManager.cpp#415

This function: http://dxr.mozilla.org/mozilla-central/source/hal/gonk/GonkHal.cpp#1042 seems like a reasonable place to call your OOM logger initialization.
Thank you Dave those pointers are the exact thing I need! Without them I honestly would not know where to start.
Uncompile tested code, please do not try and compile. Hand indented from the finest space bar I have. Warning: made in text editor which processes tab indented files, may contain traces of tabs.
Attachment #8334763 - Attachment is obsolete: true
Flags: needinfo?(dhylands)
Hey Dave, I'm not MIA. Hope you're not already busy off your feet with this merge window (Or I suppose it is closed now).

Do you know how to compile my altered HAL? The build instructions I followed to get my nexus s build working only mentioned pulling updates from the mainline repos.
Hey Daniel,

So the method I use for building is to do the following:

git clone https://github.com/mozilla-b2g/B2G.git
cd B2G
./config.sh nexus-s
./build.sh
./flash.sh

I normally only do the ./config.sh step once, and when I want to update to the  latest I do:
cd B2G
git pull
./repo sync

I think that there is still a bug that crops up occaisonally which causes the build to fail due to some incorrect dependency chain when doing parallel builds, so I sometimes get a build failure and then need to rebuild.

You'll need all of the prerequisites from here:
https://developer.mozilla.org/en-US/Firefox_OS/Firefox_OS_build_prerequisites
Flags: needinfo?(dhylands)
Updated the patch which my latest progress. The patch in theory should work. In practice my OOM testing webpage (danieru.com/mxaas/index.html) is crashing the phone. In logcat I'm seeing:

F/libc    (  713): Fatal signal 11 (SIGSEGV) at 0x00000000 (code=1)

Then much later the 'nice' for one pid gets set to 18 then a bit later set back to 1.

I'm on a nexus s but I wouldn't expect an OOM handling bug to be phone specific. Dave, could you confirm if your phone crashes on my test page?
Dave could you take a look at this? I have some questions.
Attachment #8339080 - Attachment is obsolete: true
Flags: needinfo?(dhylands)
What are your questions?
Flags: needinfo?(dhylands)
Sorry Dave I should have mentioned that the questions are all in-line in patch. Most are coding standard related, plus a design one.
Fanstatic news Dave, the crashing was just a bug in my code. Not sure where but a quick early return test avoid the crash on OOM. With that said most of the questions I wrote in the patch are things like "what should I return here instead of NS_OK?". So you shouldn't need to test anything, or even make a close reading of the code.
Comment on attachment 8341337 [details] [diff] [review]
add_oom_victim_logging_to_hal_v2.patch

Review of attachment 8341337 [details] [diff] [review]:
-----------------------------------------------------------------

I think I answered all of your questions.

I also threw in nits (with references) for style related stuff.

::: hal/gonk/GonkHal.cpp
@@ +1043,5 @@
> +class OomVictimLogger MOZ_FINAL
> +  : public nsIObserver
> +{
> +public:
> +  OomVictimLogger() {

nit: For functions the open curly brace should be on the next line
See: https://developer.mozilla.org/en-US/docs/Developer_Guide/Coding_Style#Control_Structures

And, in general we should initializes mRegexes as part of the constructor initializer list.

@@ +1055,5 @@
> +
> +  NS_DECL_ISUPPORTS
> +  NS_DECL_NSIOBSERVER
> +private:
> +  regex_t* regexes;

This should be called mRegexes
https://developer.mozilla.org/en-US/docs/Developer_Guide/Coding_Style#Prefixes

@@ +1058,5 @@
> +private:
> +  regex_t* regexes;
> +};
> +NS_IMPL_ISUPPORTS1(OomVictimLogger,
> +                   nsIObserver);

nit: Not sure why this is split across 2 lines?

@@ +1069,5 @@
> +{
> +  nsDependentCString event_type(aTopic);
> +
> +  if (!event_type.EqualsLiteral("ipc:content-shutdown")) {
> +    return NS_OK; // This should not occur. @dave: what is the correct ret?

You could use MOZ_ASSERT like this:
http://dxr.mozilla.org/mozilla-central/source/dom/power/WakeLock.cpp#118

Returning NS_OK seems to be what other ::Observe methods do in this situation. I think either is fine.

@@ +1074,5 @@
> +  }
> +
> +  // @dave: would it be possible to
> +  // only proceed if we know the app
> +  // died during a low mem state?

Unfortunately, the OOM killer just does a kill -9, so the child process gets no chance to do anything.

Similarly, I'm not aware that the parent can find out why the child was killed, which is why we're doing this in the first place :)

@@ +1075,5 @@
> +
> +  // @dave: would it be possible to
> +  // only proceed if we know the app
> +  // died during a low mem state?
> +  // That is to ask: who gets notified?

The parent gets notified that a child died, but no particular reason (we don't actually use wait, we detect that the other end of the IPC pipe went away.)

Child processes really only die for one of a few reasons:
1 - a crash of some sort
2 - The user forcibly killed the process (swipe up from task switcher)
3 - The kernel forcibly killed the process due to OOM
4 - Somebody did a kill from an adb shell command line

@@ +1076,5 @@
> +  // @dave: would it be possible to
> +  // only proceed if we know the app
> +  // died during a low mem state?
> +  // That is to ask: who gets notified?
> +  // and how could victim logger get notified?

victim logger? If you mean the child. There is no way for the child to be notified as the kernel uses a kill -9.

@@ +1077,5 @@
> +  // only proceed if we know the app
> +  // died during a low mem state?
> +  // That is to ask: who gets notified?
> +  // and how could victim logger get notified?
> + 

nit: trailing space

@@ +1086,5 @@
> +    "*lowmem_shrink*, return",
> +    "*lowmem_shrink*, ofree*"};
> +  const size_t regex_count =
> +    sizeof(regexes_raw) /
> +      sizeof(regexes_raw[0]); 

nit: trailing spaces.
nit: why split over 3 lines?
It looks like this would all fit on one line and still be under 80 characters.

You could also use NS_ARRAY_LENGTH
http://dxr.mozilla.org/mozilla-central/source/xpcom/glue/nsMemory.h#124

@@ +1090,5 @@
> +      sizeof(regexes_raw[0]); 
> +
> +  // Compile our regex just in time
> +  if (regexes == nullptr) {
> +    // @Dave: should we be locking regexes?

I'm pretty sure that the Observe method will always be called on the main thread. 

You can add

MOZ_ASSERT(NS_IsMainThread());

to be sure.

@@ +1092,5 @@
> +  // Compile our regex just in time
> +  if (regexes == nullptr) {
> +    // @Dave: should we be locking regexes?
> +    // Or are we only going to get one callback
> +    // at a time?

Since Observe is only called from the main thread, there would only be one callback at a time.

@@ +1097,5 @@
> +    regexes = (regex_t*)malloc(sizeof(*regexes) * regex_count);
> +    for (size_t i = 0; i < regex_count; i++) {
> +      int compilation_err = regcomp(&(regexes[i]), regexes_raw[i], REG_NOSUB);
> +	  if (compilation_err) {
> +		// @dave: should we log this error? What is a proper NS_*?

I'd be inclined to report the regex compilation problem using the same method that's being used to log the messages.

I think returning NS_OK is probably fine

@@ +1111,5 @@
> +  // Our current bionic does not hit this
> +  // change yet so handle the future change.
> +  #define KLOG_SIZE_UNREAD KLOG_UNREADSIZE
> +#else
> +  // Once the change hits our bionic this ifndef 

nit: trailing space

@@ +1117,5 @@
> +  #warning "Please remove KLOG_UNREAD_SIZE compatability def"
> +#endif
> +  // Retreive kernel log
> +  int kern_buf_size = klogctl(KLOG_SIZE_UNREAD, NULL, 0);
> +  // @Dave: Is there something I should be using instead of malloc()?

Using malloc is fine. However, I would use the ScopedFreePtr to ensure that the buffer is released:
http://dxr.mozilla.org/mozilla-central/source/mfbt/Scoped.h#22
http://dxr.mozilla.org/mozilla-central/source/mfbt/Scoped.h#189

So we call malloc or new, but should never explicitly call free ot delete, but rather rely on an auto ptr of some type to release the memory.

@@ +1118,5 @@
> +#endif
> +  // Retreive kernel log
> +  int kern_buf_size = klogctl(KLOG_SIZE_UNREAD, NULL, 0);
> +  // @Dave: Is there something I should be using instead of malloc()?
> +  char* kern_buf = (char*) malloc((kern_buf_size + 1) * sizeof(char));

And inside gecko malloc is infallible, so we don't have to check for nullptr return.

See: https://developer.mozilla.org/en-US/docs/Infallible_memory_allocation (the docs seem to be out of date, since moz_malloc just calls malloc)

I know new is definitely infallible, so you could use new with AutoPtr as well.

@@ +1250,5 @@
>      WriteToFile("/sys/module/lowmemorykiller/parameters/notify_trigger",
>        nsPrintfCString("%d", lowMemNotifyThresholdMB * 1024 * 1024 / PAGE_SIZE).get());
>    }
> +
> +  // Ensure OOM events appear in catlog

nit: logcat

@@ +1262,5 @@
> +  // This relates to my earlier question about
> +  // how victim logger could check for OOM state.
> +  // Should victim logger be a singleton?
> +  // Does it matter if the object is only cleaned up
> +  // at our process end?

I'd probably make it be a singleton. Search of sUsbCableObserver in 
http://dxr.mozilla.org/mozilla-central/source/dom/system/gonk/AutoMounter.cpp#626

You can register for the xpcom-shutdown event and do the cleanup at that time.
Thank you Dave. Your review was most useful. Almost every suggestion has been implemented.

The best part is: this patch works. Kills are logged and no extra messages are logged. The second patch removes the old script based oom logger which needs to be applied to gonk-misc else it will eat all our messages.
Attachment #8341337 - Attachment is obsolete: true
Attachment #8345115 - Flags: review?(dhylands)
Attachment #8345112 - Flags: review+
Comment on attachment 8345115 [details] [diff] [review]
add_oom_victim_logging_to_hal_v3.patch

Review of attachment 8345115 [details] [diff] [review]:
-----------------------------------------------------------------

So this looks good.

It needs a few cleanups, so I gave it feedback+ rather than review+, and I'd like to take another look once you've reworked it.

::: hal/gonk/GonkHal.cpp
@@ +1048,5 @@
> +  {
> +    mLastLineChecked = 0;
> +    mRegexes = nullptr;
> +  }
> +  ~OomVictimLogger()

This can be removed if you use nsAutoPtr (as per note below)

@@ +1059,5 @@
> +  NS_DECL_ISUPPORTS
> +  NS_DECL_NSIOBSERVER
> +private:
> +  double mLastLineChecked;
> +  regex_t* mRegexes;

We try really hard not to ever call free directly, but rather use helper classes to make sure it happens automatically.

So I would use ScopedFreePtr here:

ScopedFreePtr<regex_t> mRegexes;

If you did want to free, then you can just assign mRegexes = nullptr; but if you just want it to go away with the destructor then you don't need to do anything.

@@ +1080,5 @@
> +    ".*select.*to kill.*",
> +    ".*send sigkill to.*",
> +    ".*lowmem_shrink.*, return",
> +    ".*lowmem_shrink.*, ofree.*"};
> +  const size_t regex_count = NS_ARRAY_LENGTH(regexes_raw); 

nit: trailing space

@@ +1123,5 @@
> +  char* line_begin = msg_buf.rwget();
> +  do {
> +    // find line end
> +    line_end = strchr(line_begin, '\n');
> +	if (!line_end) {

Indentation seems incorrect (might be a space versus tab issue)

@@ +1130,5 @@
> +
> +    // make line into cstring
> +    *line_end = '\0';
> +
> +	// kernel messages look like:

Indentation looks messed up here (probably a space versus tab issue).

We use spaces for indentation.

Ref: https://developer.mozilla.org/en-US/docs/Developer_Guide/Coding_Style#Whitespace

@@ +1136,5 @@
> +	// 5 is the loging level
> +	// [*] is the time timestamp, seconds since boot
> +	// last comes the logged message
> +	//
> +	// This logging format is part of the 

nit: trailing space

@@ +1155,5 @@
> +        LOG("OomLogger: %s\n", line_begin);
> +        break;
> +      }
> +    }
> +  } while (true && (line_begin = line_end + 1));

You can drop the true &&

While I personally don't mind compond statements like this, other reviewers seem to dislike these.

And this particular condition/assignment doesn't make any sense to me.

What you're really doing here is:

while ((line_begin = line_end + 1) != nullptr) which is essentially equivalent to while (true)

So I'd just put line_begin = line_end + 1 at the end of the loop and use a while (true) loop, or perhaps a 

while (*line_begin)

@@ +1255,5 @@
>        nsPrintfCString("%d", lowMemNotifyThresholdMB * 1024 * 1024 / PAGE_SIZE).get());
>    }
> +
> +  // Ensure OOM events appear in logcat
> +  OomVictimLogger *oomLogger = new OomVictimLogger();

Hmm this looks like a memory leak.

Since OomVictimLogger is a reference counted object, you should really do:

nsRefPtr<OomVictimLogger> oomLogger = new OomVictimLogger();

The reference count will be incremented when you call new, and again when you call AddObserver.
The nsRefPtr destructor will remove a count when this function exits.

Then when the victim logger is eventually removed from the observer list, it will be destructed.

This also takes care of the case where os is null. Then the oomLogger that was allocated will be freed when this function exits.
Attachment #8345115 - Flags: review?(dhylands) → feedback+
Thanks Dave! You caught a lot of things I'd have missed. I should be able to fix them all but one or two things I should reply to, even just to explain why I did something silly.

> Indentation seems incorrect (might be a space versus tab issue)
Yup, looks like it. I've been manually adding two spaces but on every new line vim wants to indent using tabs so I have to delete those, looks like I missed it here.


> You can drop the true &&
> 
> While I personally don't mind compond statements like this, other reviewers
> seem to dislike these.
> 
> And this particular condition/assignment doesn't make any sense to me.
> 
> What you're really doing here is:
> 
> while ((line_begin = line_end + 1) != nullptr) which is essentially
> equivalent to while (true)
> 
> So I'd just put line_begin = line_end + 1 at the end of the loop and use a
> while (true) loop, or perhaps a 
> 
> while (*line_begin)

This was something I mulled over. You're that it's equivalent to while (true) but the increment of line_begin cannot be at the loop's bottom. The 'true &&' is there to make the loop look odd so that readers might stop and wonder. The issue is in the middle we check the line's timestamp and conditional "continue", without the incrementation this would cause an infinite loop.

I considered four options:
1. Duplicate the incrementing of line_begin to right before the continue.
2. Use a label and goto instead of the continue, thus bringing us to the single incrementation.
3. Reorder the incrementation of line_begin so it can occur at the loop's top (I couldn't think of a clean re-ordering, but I'll spend some more time on it).
4. Move the regex checking loop into the else clause of the timestamp check.

Thank you for bringing this up, it's not simple but looking at it again I think I can do better. I'll spend some time seeing if I can reorder the incrementation. Otherwise it'll be option 3. That's for the meta cultural info on compound loop control expressions.


> Hmm this looks like a memory leak.
> 
> Since OomVictimLogger is a reference counted object, you should really do:
> 
> nsRefPtr<OomVictimLogger> oomLogger = new OomVictimLogger();
> 
> The reference count will be incremented when you call new, and again when
> you call AddObserver.
> The nsRefPtr destructor will remove a count when this function exits.
> 
> Then when the victim logger is eventually removed from the observer list, it
> will be destructed.
> 
> This also takes care of the case where os is null. Then the oomLogger that
> was allocated will be freed when this function exits.

Thank you Dave, you mentioned the ref counting in the last review but I didn't understand what you meant. I thought it was related to singletons and all I could find at the link was ref-counting stuff. I was assuming that the logger would live as long as gecko. Except I don't actually know that so if it's not a mem leak yet it could soon become one when my assumption changes.
(In reply to Daniel Dressler from comment #27)
> > You can drop the true &&
> > 
> > While I personally don't mind compond statements like this, other reviewers
> > seem to dislike these.
> > 
> > And this particular condition/assignment doesn't make any sense to me.
> > 
> > What you're really doing here is:
> > 
> > while ((line_begin = line_end + 1) != nullptr) which is essentially
> > equivalent to while (true)
> > 
> > So I'd just put line_begin = line_end + 1 at the end of the loop and use a
> > while (true) loop, or perhaps a 
> > 
> > while (*line_begin)
> 
> This was something I mulled over. You're that it's equivalent to while
> (true) but the increment of line_begin cannot be at the loop's bottom. The
> 'true &&' is there to make the loop look odd so that readers might stop and
> wonder. The issue is in the middle we check the line's timestamp and
> conditional "continue", without the incrementation this would cause an
> infinite loop.
> 
> I considered four options:
> 1. Duplicate the incrementing of line_begin to right before the continue.
> 2. Use a label and goto instead of the continue, thus bringing us to the
> single incrementation.
> 3. Reorder the incrementation of line_begin so it can occur at the loop's
> top (I couldn't think of a clean re-ordering, but I'll spend some more time
> on it).
> 4. Move the regex checking loop into the else clause of the timestamp check.
> 
> Thank you for bringing this up, it's not simple but looking at it again I
> think I can do better. I'll spend some time seeing if I can reorder the
> incrementation. Otherwise it'll be option 3. That's for the meta cultural
> info on compound loop control expressions.

You could put the increment at the beginning of the loop by doing

if (!line_end) {
  line_begin = ...
}

That seems alot clearer to me than the funky construct in the while, especially with no comments to explain what's really going on.

> > Hmm this looks like a memory leak.
> > 
> > Since OomVictimLogger is a reference counted object, you should really do:
> > 
> > nsRefPtr<OomVictimLogger> oomLogger = new OomVictimLogger();
> > 
> > The reference count will be incremented when you call new, and again when
> > you call AddObserver.
> > The nsRefPtr destructor will remove a count when this function exits.
> > 
> > Then when the victim logger is eventually removed from the observer list, it
> > will be destructed.
> > 
> > This also takes care of the case where os is null. Then the oomLogger that
> > was allocated will be freed when this function exits.
> 
> Thank you Dave, you mentioned the ref counting in the last review but I
> didn't understand what you meant. I thought it was related to singletons and
> all I could find at the link was ref-counting stuff. I was assuming that the
> logger would live as long as gecko. Except I don't actually know that so if
> it's not a mem leak yet it could soon become one when my assumption changes.

Yeah - I'd still reference count it even if it does have the same lifetime of the program.

We actually do leak detection (as part of debug builds) and this would eventually get exposed....
This should fix the indentation, the missed ref counting, and simplify the control in the main loop.
Attachment #8345115 - Attachment is obsolete: true
Attachment #8347662 - Flags: review?(dhylands)
Comment on attachment 8347662 [details] [diff] [review]
add_oom_victim_logging_to_hal_v4.patch

Review of attachment 8347662 [details] [diff] [review]:
-----------------------------------------------------------------

Overall this looks pretty good.

I think we need to fixup the stuff about the timestamps and err on the side of over-reporting.

So I'll r- this since I'll want to look at the next revision.

::: hal/gonk/GonkHal.cpp
@@ +1046,5 @@
> +public:
> +  OomVictimLogger()
> +  {
> +    mLastLineChecked = 0;
> +    mRegexes = nullptr;

nit: There seems to be a preference to use initializer lists for simple initializatons, so

OomVictimLogger()
  : mLastLineChecked(0.0),
    mRegexes(nullptr)
{
}

And since mLastLineChecked is a fdouble, I'd use 0.0 - 0 works, but 0.0 is a little clearer.

@@ +1077,5 @@
> +    ".*lowmem_shrink.*, ofree.*"};
> +  const size_t regex_count = NS_ARRAY_LENGTH(regexes_raw);
> +
> +  // Compile our regex just in time
> +  if (!mRegexes.get()) {

nit: IIRC You can just use: if (!mRegexes) (I might be wrong and thinking of nsRefPtr)

You don't often need to call get. One case where you do have to use it is when calling varadic functions (like printf et al). T

I'm pretty sure that
    operator const Resource&() const { return value; }
is what gets invoked when you don't use .get()

@@ +1078,5 @@
> +  const size_t regex_count = NS_ARRAY_LENGTH(regexes_raw);
> +
> +  // Compile our regex just in time
> +  if (!mRegexes.get()) {
> +    mRegexes = (regex_t*) malloc(sizeof(regex_t) * regex_count);

nit: You shouldn't need a cast here (malloc returns void *) and C++ style casts are preferred over C casts.

@@ +1104,5 @@
> +  #warning "Please remove KLOG_UNREAD_SIZE compatability def"
> +#endif
> +  // Retreive kernel log
> +  int msg_buf_size = klogctl(KLOG_SIZE_BUFFER, NULL, 0);
> +  ScopedFreePtr<char> msg_buf((char*) malloc((msg_buf_size + 1) * sizeof(char)));

nit: Same comment about casting.

@@ +1112,5 @@
> +  read_size = read_size > msg_buf_size ? msg_buf_size : read_size;
> +  msg_buf.rwget()[read_size] = '\0';
> +
> +  // foreach line in msg_buf check for matching messages
> +  char* line_end = nullptr;

nit: No need for assigment here (since you do the assignment below in the for loop)

@@ +1114,5 @@
> +
> +  // foreach line in msg_buf check for matching messages
> +  char* line_end = nullptr;
> +  char* line_begin = msg_buf.rwget();
> +  for (; line_end = strchr(line_begin, '\n'); line_begin = line_end + 1) {

This is much better!

@@ +1126,5 @@
> +    // last comes the logged message
> +    //
> +    // This logging format is part of the
> +    // unchanging kernel ABI so we can rely on it.
> +    double lineTimestamp = atof(line_begin + 4); // skip <n>[

It occured to me that the timestamp is actually optional.

echo 0 > /sys/module/printk/parameters/time
echo "Message with no timestamp" > /dev/kmsg
echo 1 > /sys/module/printk/parameters/time
echo "Message with timestamp" > /dev/kmsg

On my unagi, the timestamp is being logged in the format: [12-16 22:32:00.940]
at least that's what dmesg is showing me.

My nexus 4 is showing: [73785.732358]

Unagi:

<4>[12-16 23:03:04.826] [681: sh]Line with no timestamp
<4>[681: sh]Line with no timestamp

Nexus 4:

<4>Line with no timestamp
<4>[73910.565991] Line with timestamp

My hamachi (aka buri) shows the same as the Nexus 4.

So my inclination would be that we should check and see if the line looks like "[%5lu.%06lu] " (this is what an unmodified kernel would print) then we can go ahead and do the timestamp "optimization" as you've code it.

Technically we don't need to support the unagi, but other phones may have similarly modified kernels.

If we don't detect the timestamp, we should probably run the regexes anyways. This will lead to over-reporting, but I think that's better than under-reporting. We might want to report a warning if we detect a regex match with no timestamp.

We should probably do the equivalent of "echo 1 > /sys/module/printk/parameters/time" at initialization time (if the file exists).
Attachment #8347662 - Flags: review?(dhylands) → review-
Wow, I did not expect the timestamps to be changed. Thank you for catching that, and thank you for testing it on the nexus 4 and unagi.

I have to apologize, our next review might not be until you get back from holidays. A group member on a school project has bailed and left me in "is there even enough time in the day?" panic mode. The assignment must be demoed on the 18th right before our final, and I havn't even studied for said final. Today I wrote our entire sql layer, and tomorrow I debug said layer and write the forms!

So, Merry Christmas!
Still working on the patch. Noticed some more complications:
1. Some log messages include [ brackets in the message.
2. Found a <12> logging level, my server even had a <30>.
3. The timestamp is space padded so not all timestamps will have the same number of digits.
4. My server has been running for a while and has 7 digits before the decimal place and still the standard 6 digits of precision. The kernel is not dropping precision to keep a static timestamp length.

So our new constraints:
1. Timestamp may not be formated as float -> Verify timestamp matches expected format, else complain
2. Number of Logging levels cannot be assumed -> Position of timestamp must be search for
3. Expected timestamp format is variable width -> End of timestamp should be detected by presence of ']'
4. Expected timestamp format contains digits along with spaces -> We must skip the spaces

My inclination is to have a small hardcoded finite state machine which can verify that the timestamp is formatted as expected.
As usual tried to address everything you noticed plus the extra constraints I mentioned in the prior comment.

Extra notes:
1. C++ requires casting when using malloc(), C does not, so I switched it to a C++ static cast.
2. To keep the malloc line under 80 chars I abused that the C standard defines sizeof(char) == 1
3. We only warn about invalid timestamps if the line contains a kill message.
4. Used WriteToFile() instead of C++ file handling since it's used else where in GonkHal. There is no C interface for this setting so we need to write to the FS.
5. Wasn't sure how to approach commenting the finite state machine, sided on  verbosity.

Merry Chirstmas!
Attachment #8347662 - Attachment is obsolete: true
Attachment #8351467 - Flags: review?(dhylands)
Comment on attachment 8351467 [details] [diff] [review]
add_oom_victim_logging_to_hal_v5.patch

Review of attachment 8351467 [details] [diff] [review]:
-----------------------------------------------------------------

So, not too bad, but I'd much rather see something simple like sscanf rather than writing a state machine (I think that the sscanf is easier to understand)

::: hal/gonk/GonkHal.cpp
@@ +1075,5 @@
> +    char c = *line;
> +    if (!foundBracket) {
> +      if (c == '[') {
> +        // state change
> +        foundBracket = true;

Since there is only ever a single transition to the state foundBracket and we never transition out of that state, I think it would be cleaner to do that state transition as a separate loop, or even:

line = strchr(line, '[');
if (!line) {
  return nullptr;
}

and then do the for loop now that we've detected the [

It also seems like this could all be simplified to:

line = strchr(line, '[');
if (!line) {
  return nullptr;
}
line++;
char terminator;
rc = sscanf(line, "%*d.%*d%c", &terminator);
if (rc != 1 || terminator != ']') {
  ...failed to match...
}

or use %lf along with a pointer to double and then return the parsed timestamp as well (the * in %*d says to perform the conversion but toss the results).

@@ +1133,5 @@
> +    mRegexes = static_cast<regex_t*>(malloc(sizeof(regex_t) * regex_count));
> +    for (size_t i = 0; i < regex_count; i++) {
> +      int compilation_err = regcomp(&(mRegexes[i]), regexes_raw[i], REG_NOSUB);
> +      if (compilation_err) {
> +        LOG("[OomLogger][Erro] Cannot compile regex \"%s\"\n", regexes_raw[i]);

nit: Typo: Error instead of Erro ?

@@ +1179,5 @@
> +    if (timestamp_begin) {
> +      double lineTimestamp = atof(timestamp_begin);
> +      if (lineTimestamp <= mLastLineChecked) {
> +         continue;
> +      } else {

nit: no else after continue

@@ +1182,5 @@
> +         continue;
> +      } else {
> +         mLastLineChecked = lineTimestamp;
> +      }
> +	}

nit: indentation (probably tab related)
Attachment #8351467 - Flags: review?(dhylands) → review-
Thank you for making me recheck sscanf. I didn't know sscan would treat a literal space character as zero or more whitespace chars. Also thank you for reviewing the state machine even though I'm sure you'd be screaming inside if I insisted on keeping the machine.

The 'erro' was to keep all the prefixes fixed length but if you noticed it then I'm sure someone else will too. I've replace erro with error and warn with warning. It should help with grep-ability.
Attachment #8351467 - Attachment is obsolete: true
Attachment #8355631 - Flags: review?(dhylands)
This is the little test program I used fot playing with sscanf
Comment on attachment 8355631 [details] [diff] [review]
add_oom_victim_logging_to_hal_v6.patch

Review of attachment 8355631 [details] [diff] [review]:
-----------------------------------------------------------------

Just a couple more tweaks and I think we'll have it :)

I tested this out on my unagi.

::: hal/gonk/GonkHal.cpp
@@ +1064,5 @@
> +NS_IMETHODIMP
> +OomVictimLogger::Observe(
> +  nsISupports* aSubject,
> +  const char* aTopic,
> +  const PRUnichar* aData)

You'll need to replace PRUnichar with char16_t (it changed over the Christmas break) now that bug 956507 has landed.

@@ +1131,5 @@
> +    // Since the logging level can be a string we must
> +    // skip it before using sscanf
> +    char *timestamp_begin = strchr(line_begin, '[');
> +    double lineTimestamp = -1;
> +	if (timestamp_begin &&

nit: indentation

@@ +1133,5 @@
> +    char *timestamp_begin = strchr(line_begin, '[');
> +    double lineTimestamp = -1;
> +	if (timestamp_begin &&
> +         // scanf treats a ' ' as [ ]*
> +         EOF != sscanf(timestamp_begin, "[ %lf]", &lineTimestamp)) {

This should be checking that sscanf returns 1. If it doesn't match, it will return 0.

Also, in my testing, I noticed that sscanf doesn't care if the ] matches (since its after the last match - it does care if it's between matches). So it will return 1 if the input is:

[99.99.99 99:99:99]

So I'd recommend using "[ %lf%c", verifying that sscanf returns 2 and that the %c contains the ] character.

I attached a little test program I used to try out variations of sscanf. And running that on my unagi timestamps shows:

519 >./sscanf "[01-09 23:36:11.198]"
Success: d = '1.000000' term = '-'

So it winds up using 1 as lineTimeStamp and never logs any messages.

You can also change if (lineTimestamp <= mLastLineChecked) to be &&'d after the sscanf. When I changed this to be:

if (timestamp_begin &&
    1 == sscanf(timestamp_begin, "[ %lf%c", &lineTimestamp, &term) &&
    term == ']' &&
    lineTimestamp <= mLastLineChecked) {
  continue;
}

then it worked as expected.

@@ +1147,5 @@
> +      if (matching) {
> +        // log content of kernel message
> +        line_begin = strchr(line_begin, ']') + 2;
> +        if (lineTimestamp == -1) {
> +          LOG("[OomLogger][Warning]: following kill message may be a duplicate");

Rather than putting [Warning] in here, lets create warning and error variants of LOG, which pass in ANDROID_LOG_WARN for warnings, and ANDROID_LOG_ERROR for the "interesting" lines.

Then the lines will be colored for people like me who use a colored-logcat
https://wiki.mozilla.org/User:Dhylands#colored_logcat

Leaving in [Kill] in the interesting lines seems reasonable to me.

Also you may want to not use LOG at all, but create an OOM_LOG so that you can do:

#define OOM_LOG(args...)  __android_log_print(ANDROID_LOG_INFO, "OomLogger", ##args)

Then you can remove [OomLogger] as well, since it will show up as the component doing the logging.
Attachment #8355631 - Flags: review?(dhylands) → review-
There was an error in my "what worked" It should have showed:

if (timestamp_begin &&
    2 == sscanf(timestamp_begin, "[ %lf%c", &lineTimestamp, &term) &&
    term == ']' &&
    lineTimestamp <= mLastLineChecked) {
  continue;
}

The result from sscanf has to be 2.
Good to hear we're getting close!

Thank you finding that scanf behaviour, I would not have expected it to do that! I didn't merge the (lineTimestamp <= mLastLineChecked) check with the scanf since the last checked needs to be updated only if the line is newer than the old one. If we merged updating the last checked with the continue then we'd be rolling back last checked to the past, which would be bad.

Thank you for mentioning the logging. That's reminded me I want to take a look at HAL's logging situation in time.

Daniel
Attachment #8355631 - Attachment is obsolete: true
Attachment #8358496 - Flags: review?(dhylands)
Comment on attachment 8358496 [details] [diff] [review]
add_oom_victim_logging_to_hal_v7.patch

Review of attachment 8358496 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good. Just one little detail that showed up in testing.

::: hal/gonk/GonkHal.cpp
@@ +1152,5 @@
> +      if (matching) {
> +        // log content of kernel message
> +        line_begin = strchr(line_begin, ']') + 2;
> +        if (lineTimestamp == -1) {
> +          OOM_LOG(ANDROID_LOG_WARN, "following kill message may be a duplicate");

In testing on my unagi, I don't see this message.

What's happening is that sscanf is parsing [01-11 00:02:45.161] and returning 2 with lineTimestamp set to 1 and after_float set to '-'

So rather than using lineTimestamp set to -1, I think we need a foundTimestamp bool that gets set if we pass the if test.

I tested on my hamachi (which has the expected timestamps) and things work as expected.
Attachment #8358496 - Flags: review?(dhylands) → review+
Thank you Dave! This has been pretty incredible! Attached is the 8th revision, is there anything I need to do after you review it?

Over the weekend I would like blog on this experience. Now the reason I mention this is to give you a heads up: I'd like to apply for a position at Mozilla. I was at a hackathon in the Toronto office in mid september and afterwards I emailed Greg Wilson and asked him how one should apply. Greg CC'd Mike Hoye who suggested I make a contribution and blog about that.

So this was that and I hope you're not dreading the email ("Ugh, he took 8 revisions and now he wants to apply. Hasn't he taken enough of my time!"). If that is the case I hope to give you the confidence that I can make a positive contribution.

Also: Thank you for staying up late to review so it didn't get pushed to monday.
Attachment #8358496 - Attachment is obsolete: true
Attachment #8358744 - Flags: review?(dhylands)
(In reply to Daniel Dressler from comment #41)
> Created attachment 8358744 [details] [diff] [review]
> add_oom_victim_logging_to_hal_v8.patch
> 
> Thank you Dave! This has been pretty incredible! Attached is the 8th
> revision, is there anything I need to do after you review it?

Since you don't have permissions to land the patch, you would normally put [checkin-needed] in the whiteboard, and one of the sheriffs would land it.

But I can also go ahead and land it.

> Over the weekend I would like blog on this experience. Now the reason I
> mention this is to give you a heads up: I'd like to apply for a position at
> Mozilla. I was at a hackathon in the Toronto office in mid september and
> afterwards I emailed Greg Wilson and asked him how one should apply. Greg
> CC'd Mike Hoye who suggested I make a contribution and blog about that.
> 
> So this was that and I hope you're not dreading the email ("Ugh, he took 8
> revisions and now he wants to apply. Hasn't he taken enough of my time!").
> If that is the case I hope to give you the confidence that I can make a
> positive contribution.
> 
> Also: Thank you for staying up late to review so it didn't get pushed to
> monday.

No problem. I'll go ahead and land it now.

The fact the it took 8 revisions doesn't really mean anything (at least not to me).

Thanks for fixing the problem, and hopefully you learned something along the way (which is my own personal motivation for mentoring bugs). I always learn something, even being the mentor :)
Comment on attachment 8358744 [details] [diff] [review]
add_oom_victim_logging_to_hal_v8.patch

Review of attachment 8358744 [details] [diff] [review]:
-----------------------------------------------------------------

Great work.
Attachment #8358744 - Flags: review?(dhylands) → review+
So I'd be every interested to read your blog, when you post it. Please attach a link to the bug.
For Daniel's benefit - here's the TBPL results of the push.
https://tbpl.mozilla.org/?tree=B2g-Inbound

So everything looks green (or orange and starred) so this should get merged to mozilla-central by the sheriffs when they do their next merge.
https://hg.mozilla.org/mozilla-central/rev/45369d72225b
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
So in bug 996991 I've noticed that the OomLogger/oom-msg-logger is missing some important regexes.  Specifically, here's an excerpt from attachment 8412189 [details] which will not trigger the current regexes at http://dxr.mozilla.org/mozilla-central/source/hal/gonk/GonkHal.cpp#1107

<3>[93351.288052] Out of memory: Kill process 2210 (E-Mail) score 659 or sacrifice child
<3>[93351.288070] Killed process 2210 (E-Mail) total-vm:253708kB, anon-rss:29848kB, file-rss:66432kB

The above happens after a bunch of kernel state-dumps like when there's an internal kernel fault.  The boilerplate starts with:

<4>[93351.284340] Compositor invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
<4>[93351.284399] [<c00497f4>] (unwind_backtrace+0x0/0x12c) from [<c01355c8>] (dump_header.clone.1+0x6c/0x18c)


I know I need to file a new bug on this, but this is not a great product/component pair to get a lot of visibility and there are no dependent bugs to give me an idea of the best place to file and other bugs against the same file all seem to live in a wide variety of products/components, so I figure I will bother you good people first.
Andrew - I think that either "Firefox:General" or "Core:Hardware Abstraction Layer (HAL)" would be fine.

However, the fact that different messages are being used must mean that different kernel code is being used. Which phone are you seeing these new messages on?

I'd like to get the output of:

adb shell cat /proc/version

as well for the phone that you're seeing these message on.

Thanks
Depends on: 964537
Trunk hamachi/buri build:
Linux version 3.0.21-perf-g23c4dd6-00003-g412d6fb (archermind@aclgcl-ubnt) (gcc version 4.4.3 (GCC) ) #1 SMP PREEMPT Thu Nov 14 11:09:01 CST 2013

Inference from https://bugzilla.mozilla.org/show_bug.cgi?id=996991#c0 is that this also affects v1.4 hamachi/buri builds too.
See Also: → 1001206
Blocks: 1036666
You need to log in before you can comment on or make changes to this bug.