memory leaks in ubuntu
Episode I: Detection
An important piece of optimizing the Ubuntu core on the Nexus 7 is slimming down Ubuntu’s memory requirements. It turns out this focus area has plenty of opportunity to help contribute, and today, I’ll talk about how to find memory leaks in an individual application using valgrind.
The best part? You don’t even have to be a developer to help. The second best part? You don’t even need a Nexus 7! What I describe below works on any Ubuntu machine. Let’s get started!
The first step is to find an application to profile. This is the easiest step. Maybe you have an app you use all the time and really care about making it perform as well as possible. Or maybe you’re experiencing a strange behavior problem in an app that takes a little while to show up. Or maybe you just pick a random application from the dash because you’re in a great mood. They’re all good.
In my case, I’ll use nm-applet as my example, since I’ve been struggling with LP: #780602 for a while, where the list of wifi access points would stop displaying after a day or two. Trés annoying!
Next, install valgrind if it is not already installed.
Pay attention to the next bit because it is important. In order for your valgrind report to be as helpful as possible for developers, you will also need to install debug packages related to your app. The debug packages contain information to help developers narrow in on exactly where problems might be. “Great” you say, “what do I need to install?”
UPDATE: 29 January 2013
After a bit more thinking and discussing with smart folks like infinity, xnox, and pitti, we realized that I was essentially reinventing a lot of code that already exists in apport-retrace, as that tool already knows how to go from a binary to a package and then solve dependencies.
I tossed the idea (and a really rough crappy version of a prototype) to Kyle Nitzsche who took the idea, ran with it, and fixed all my crap! Woo hoo! With a little bit of effort, we ended up with apport-valgrind which has already landed in raring (along with the required valgrind support patch). Even better, Kyle wrote a great apport-valgrind introduction explaining how it works.
So ignore the script below and use apport-valgrind instead (unfortunately only available in raring).
Today is your lucky day because I’ve written a small script to help you figure out which debug packages you’ll need. Go ahead and grab the python version of valgrind-ubuntu-dbg-packages. (Ignore the go version for now, that’s just something I’m playing with in my other spare time!)
Ok, now comes the tricky part. We have to do a quick valgrind run to see what libraries your app uses. Then we’ll use the helper script to see if there are debug packages for those libraries. Ready?
To run valgrind, use this command:
Replace with the name of your app.
Let this run long enough for your app to launch (which may take a while under valgrind) and then play with your app just a bit where you would reproduce your bug but without actually reproducing the bug. In the case of nm-applet, I did the following sequence:
Then I clicked the “More networks” menu item in the applet just to get it to display the other wifi access points, since this is the thing that was breaking for me. After doing that just once, I stopped my valgrind run completely by pressing control-c in the terminal where I launched it.
A valgrind log file should now exist, and you can run the helper script on the log:
You will see quite a bit of output, but at the end, you will get a list of recommended extra packages to install.
Go ahead and install the packages.
Now we are finally ready to collect our real logs.
Update: 29 January 2013
Instead of doing all that janky stuff above, just:
- apt-get install apport-valgrind
- run: apport-valgrind
- Do step 2 for as long as it takes to reproduce the bug. There is no step 3! Re-run valgrind exactly as above, but this time, let the app run as long as it needs to reproduce the bug. In the case of nm-applet, I had to let it just sit there and run normally for 24 hours before I saw the bug again. Hopefully your bug reproduces faster! Patience is key. I recommend eating a delicious sandwich if you can’t think of anything better to do.
After your bug has reproduced itself, kill the valgrind run. File a bug — you can use the Ubuntu Nexus7 project — and be sure to attach the valgrind log. It would also be great if you could describe how you reproduced the bug. Be sure to read the bug filing guidelines for more detail.
Huzzah, you’ve contributed something extremely valuable to making Ubuntu leaner and meaner — a great log file. With any luck, a developer will be able to pick up your bug and fix the problem.
And… if we’re even luckier, maybe that developer will be you! Next time I’ll show you how to actually analyze the valgrind log. Stay tuned.
Episode II: Analysis
In our last exciting episode, we learned how to capture a valgrind log. Today we’re going to take the next step and learn how to actually use it to debug memory leaks.
There are a few prerequisites:
- know C. If you don’t know it, go read the C programming language which is often referred to as K&R C. Be sure to understand the sections on pointers, and after you do, come back to my blog. See you in 2 weeks!
- a nice supply of your favorite beverages and snacks. I prefer coffee and bacon, myself. Get ready because you’re about to read an epic 2276 word blog entry.
That’s it. Ok, ready? Let’s go!
navigate the valgrind log
Open the valgrind log that you collected. If you don’t have one, you can grab one that I’ve already collected. Take a deep breath. It looks scary but it’s not so bad. I like to skip straight to the good part near the bottom. Search the file for “LEAK SUMMARY”. You’ll see something like:
You can see that valgrind thinks we’ve definitely leaked some memory. So let’s go figure out what leaked.
Valgrind lists all the leaks, in order from smallest to largest. The leaks are also categorized as “possibly” or “definitely”. We’ll want to focus on “definitely” for now. Right above the summary, you’ll see the worst, definite leak:
Wow, we lost 300K of memory in just a few hours. Now imagine if you don’t reboot your laptop for a week. Yeah, that’s not so good. Time for a coffee and bacon break, the next part is about to get fun.
read the stack trace
What you saw above is a stack trace, and it’s printed chronologically “backwards”. In this example, malloc() was called by g_malloc(), which was called by g_slice_alloc(), which in turn was called by g_slist_prepend(), which itself was called by get_menu_item_for_ap() and so forth. The first function ever called was main(), which should hopefully make sense.
At this point, we need to use a little bit of extra knowledge to understand what is happening. The first function, main() is in our program, nm-applet. That’s fairly easy to understand. However, the next few functions that begin with g_main_ don’t actually live inside nm-applet. They are part of glib, which is a library that nm-applet depends on. I happened to have just known this off the top of my head, but if you’re ever unsure, you can just google for the function name. After searching, we can see that those functions are in glib, and while there is some magic that is happening, we can blissfully ignore it because we see that we soon jump back into nm-applet code, starting with applet_update_indicator_menu().
Many Linux programs will have a stack trace similar to the above. The program starts off in its own main(), but will call various other libraries on your system, such as glib, and then jump back to itself. What’s going on? Well, glib provides a feature known as a “main loop” which is used by the program to look for inputs and events, and then react to them. It’s a common programming paradigm, and rather than have every application in the world write their own main loop, it’s easier if everyone just uses the one provided by glib.
The other observation is to note how the function names appear prominently in the stack trace. Pundits wryly say that naming things is one of the hardest things in computer science, and I completely agree. So take care when naming your functions, because people other than you will definitely see them and need to understand them!
Alright, let’s get back to the stack trace. We can see a few functions that look like they belong to nm-applet, based on their names and their associated filenames. For example, the function wireless_add_menu_item() is in the file applet-device-wifi.c on line 1138. Now you see why we wanted symbols from the last episode. Without the debug symbols, all we would have seen would have been a bunch of useless ??? and we’d be gnashing our teeth and wishing for more bacon right now.
Finally, we see a few more g_* functions, which means we’re back in the memory allocation functions provided by glib. It’s important to understand at this point that g_malloc() is _not_ the memory leak. g_malloc() is simply doing whatever nm-applet asks it to do, which is to allocate memory. The leak is _highly likely_ to be in nm-applet losing a reference to the pointer returned by g_malloc().
What does it mean?
Now we’re ready to start the real debugging. We know approximately where we are leaking memory inside nm-applet: get_menu_item_for_ap() which is the last function before calling the g_* memory functions. Time to top off on coffee because we’re about to get our hands dirty.
reading the source
The whole point of open source is being able to read the source. Are you as excited as I am? I know you are!
First, let’s get the source to nm-applet. Assuming you are using Ubuntu and you are using 12.04, you’d simply say:
Woo hoo! That wasn’t hard, right?
Luckily there are at least a few tools to help you do this. My favorite tools are cscope and ctags, which help me to rapidly understand the skeleton of a program and navigate around its complex structure.
Assuming you are in the network-manager-applet-0.9.4.1 source tree:
You are now presented with a menu. Use control-n and control-p to navigate input fields at the bottom. Try navigating to “Find this C symbol:” and then type in get_menu_item_for_ap, and press enter. The search results are returned, and you can press ‘0’ or ‘1’ to jump to either of the locations where the function is referenced. You can also press the space bar to see the final search result. Play around with some of the other search types and see what happens. I’ll talk about ctags in a bit.
Alrighty, let’s go looking for our suspicious nm-applet function. Start up cscope as described above. Navigate to “Find this global definition:” and search for get_menu_item_for_ap. cscope should just directly put you in the right spot.
Based on our stack trace, it looks like we’re doing something suspicious on line 879, so let’s go see what it means.
Cool, we can now see where the source code is matching up with the valgrind log.
Let’s start doing some analysis. The first thing to note are the #ifdef blocks on lines 870, 873, and 880. You should know that ENABLE_INDICATOR is defined, meaning we do not execute the code in lines 871 and 872. Instead, we do lines 874 to 879, and then we do 881. Why do we do 881 if it is after the #endif? That’s because we fell off the end of the #ifdef block, and then we do whatever is next, after we fall off, namely returning NULL.
Don’t worry, I don’t know what’s going on yet, either. Time for a refill!
Back? Great. Alright, valgrind says that we’re doing something funky with g_slist_prepend().
And our relevant code is:
We can see that we declare the pointer *dupes on line 874, but we don’t do anything with it. Then, we assign something to it on line 877. Then, we assign something to it again on line 879. Finally, we end up not doing anything with *dupes at all, and just return NULL on line 881.
This definitely seems weird and worth a second glance. At this point, I’m asking myself the following questions:
- did g_object_get_data() allocate memory?
- did g_slist_prepend() allocate memory?
- are we overwriting *dupes on line 879? that might be a leak.
- is it safe to just return NULL without doing anything to dupes? maybe that’s our leak?
Let’s take them in order.
did g_object_get_data() allocate memory?
g_object_get_data has online documentation, so that’s our first stop. The documentation says:
Since I am not 100% familiar with glib terminology, I guess [transfer none] means that g_object_get_data() doesn’t actually allocate memory on its own. But let’s be 100% sure. Time to grab the glib source and find out for ourselves.
Pretty simple function.
Except I have no idea what g_datalist_get_data() does. Maybe that guy is allocating memory. Now I’ll use ctags to make my life easier. In vim, put your cursor over the “g” in “g_datalist_get_data” and then press control-]. This will “step into” the function. Magic!
his is a pretty simple loop, walking through an existing list of pointers which have already been allocated somewhere else, starting on line 861. We do our comparison on line 863, and if we get a match, we assign whatever we found to res on line 865. Note that all we are doing here is a simple assignment. We are not allocating any memory!
Finally, we return our pointer on line 874. Press control-t in vim to pop back to your last location.
Now we know for sure that g_object_get_data() and g_datalist_get_data() do not allocate any memory at all, so there can be no possibility of a leak here. Let’s try the next function.
did g_slist_prepend() allocate memory?
First, read the documentation, which says:
This probably means it allocates memory for us, but let’s double-check just to be sure. Back to cscope!
Ah ha! Look at line 265. We are 100% definitely allocating memory, and returning it on line 269. Things are looking up! Let’s keep going with our questions.
are we overwriting dupes on line 879? that might be a leak.
We’ve already proven to ourselves that line 877 doesn’t allocate any memory. It just sets dupes to some value. However, on line 879, we do allocate memory. It is equivalent to this code:
So simply setting dupes to the return value of g_object_get_data() and later overwriting it with the return value of malloc() does not inherently cause a leak.
By way of counter-example, the below code is a memory leak:
The above essentially illustrates the scenario I was worried about. I was worried that g_object_get_data() allocated memory, and then g_slist_prepend() also allocated memory which would have been a leak because the first value of dupes got scribbled over by the second value. My worry turned out to be incorrect, but that is the type of detective work you have to think about.
As a clearer example of why the above is a leak, consider the next snippet:
First we allocate dupes1. Then allocate dupes2. Finally, we set dupes1 = dupes2, and now we have a leak. No one knows what the old value of dupes1 was, because we scribbled over it, and it is gone forever.
is it safe to just return NULL without doing anything to dupes? maybe that’s our leak?
We can definitively say that it is not safe to return NULL without doing anything to dupes. We definitely allocated memory, stuck it into dupes, and then threw dupes away. This is our smoking gun.
Next time, we’ll see how to actually fix the problem.
Episode III: Fixing
As a gentle reminder, during analysis, we saw the following block of code:
And we concluded with:
Is it safe to just return NULL without doing anything to dupes? maybe that’s our leak?
We can definitively say that it is not safe to return NULL without doing anything to dupes. We definitely allocated memory, stuck it into dupes, and then threw dupes away. This is our smoking gun.
But there’s a twist! Eagle-eyed reader Dave Jackson (a former colleague of mine from HP, natch) spotted a second leak! It turns out that line 879 was exceptionally leaky during its inception. As Dave points out, the call to g_slist_prepend() passes g_strdup() as an argument. And as the documentation says:
Duplicates a string. If str is NULL it returns NULL. The returned string should be freed with g_free() when no longer needed.
In memory-managed languages like python, the above idiom of passing a function as an argument to another function is quite common. However, one needs to be more careful about doing so in C and C++, taking great care to observe if your function-as-argument allocates memory and returns it. There is no mechanism in the language itself to automatically free memory in the above situation, and thus the call to g_strdup() seems like it also leaks memory. Yowza!
So, what to do about it?
The basic goal here is that we don’t want to throw dupes away. We need to actually do something with it. Here again are the 3 most pertinent lines.
Let’s break these lines down.
- On line 877, we retrieve the dupes list from the dup_data.found object
- Line 878 gets a path to the duplicate wifi access point
- Finally, line 879 adds the duplicate access point to the old dupes list
- Line 881 throws it all away!
To me, the obvious thing to do is to change the code between lines 879 and 881, so that after we modify the duplicates list, we save it back into the dup_data object. That way, the next time around, the list stored inside of dup_data will have our updated list. Makes sense, right?
As long as you agree with me conceptually (and I hope you do), I’m going to take a quick shortcut and show you the end result of how to store the new list back into the dup_data object. The reason for the shortcut is that we are now deep in the details of how to program using the glib API, and like many powerful APIs, the key is to know which functions are necessary to accomplish your goal. Since this is a memory leak tutorial and not a glib API tutorial, just trust me that the patch hunk will properly store the dupes list back into the dup_data object. And if it’s confusing, as always, read the documentation for g_object_steal_data and g_object_set_data_full.
If the above patch format looks funny to you, it’s because we are changing a patch.
This means the old patch had the line calling g_object_get_data() and the refreshed patch now calls g_object_steal_data() instead. Likewise…
The above call to g_object_set_data_full is a brand new line in the new and improved patch.
Totally clear, right? Don’t worry, the more sitting and contemplating of the above you do, the fuller and more awesomer your neckbeard grows. Don’t forget to check it every once in a while for small woodland creatures who may have taken up residence there.
And thus concludes our series on how to detect, analyze, and fix memory leaks. All good? Good.
I can hear the observant readers out there already frantically scratching their necks and getting ready to point out the mistake I made. After all, our newly refreshed patch still contains this line:
And as we determined earlier, that’s our incepted memory leak, right? RIGHT!?‽
Not so fast. Take a look at the new line in our updated patch:
See that? The last argument to g_object_set_data_full() looks quite interesting indeed. It is in fact, a cleanup function named clear_dupes_list(), which according to the documentation, will be called
In other words, when we are ready to get rid of the dup_data.found object, as part of cleaning up that object, we’ll call the clear_dupes_list() function. And what does clear_dupes_list() do, praytell? Why, let me show you!
Trés interesante! You can see that we iterate across the dupes list, and call g_free on each of the strings we did a g_strdup() on before. So there wasn’t an inception leak after all. Tricky tricky.
A quick digression is warranted here. Contrary to popular belief, it is possible to write object oriented code in plain old C, with inheritance, method overrides, and even some level of “automatic” memory management. You don’t need to use C++ or python or whatever the web programmers are using these days. It’s just that in C, you build the OO features you want yourself, using primitives such as structs and function pointers and smart interface design.
Notice above we have specified that whenever the dup_data object is destroyed, it will free the memory that was stuffed into it. Yes, we had to specify the cleanup function manually, but we are thinking of our data structures in terms of objects.
In fact, the fancy features of many dynamic languages are implemented just this way, with the language keeping track of your objects for you, allocating them when you need, and freeing them when you’re done with them.
And finally now that you’ve seen just how tedious and verbose it is to track all this memory, it should no longer be a surprise to you that most fancy languages are slower than C. Paperwork. It’s always paperwork.
And here we come to the upshot, which is, tracking down memory leaks can be slow and time consuming and trickier than first imagined (sorry for the early head fake). But with the judicious application of science and taking good field notes, it’s ultimately just like putting a delicious pork butt in the slow cooker for 24 hours. Worth the wait, worth the effort, and it has a delicious smoky sweet payoff.