I am pretty diligent about following up on crash reports, probably because when it comes down to it, chasing crashes is one of the funner aspects of software development to me. Sure, when I’m hours into a frustrating problem and can’t make heads or tails of why something’s blowing up, it’s exhausting, and I wish to hell I could simply find the solution, but upon finally unwinding the riddle of a failure, it’s all those hours of toil that pay back dividends as the thrill of success.
Recently I had an odd MarsEdit crash report come in with a brief but ultimately very meaningful comment attached to it:
“I just updated a post with a PDF embedded from OneDrive.”
I love it when customers take the time to write something about the circumstances surrounding a crash. Often even a little clue can be enough to lead to the unique series of steps that will ultimately reproduce the problem. In this case though, I was confused. I know OneDrive is Microsoft’s cloud storage solution, but is the customer uploading a PDF from their mounted OneDrive, or … ?
Luckily the customer in this case also left an email address, so I got in touch and, after a few rounds of zeroing in on the meaning of the expression “embedded from OneDrive,” I learned that the issue was specifically to do with pasted HTML source code that Microsoft offers to customers who wish to literally embed a OneDrive document into another web page.
So I created a OneDrive account, added a PDF document to my storage, copied the “embed code” for the item, and pasted it into a new MarsEdit post. To my great surprise and satisfaction, when I pressed Send and published the post to my blog, MarsEdit crashed with exactly the same stack trace as the customer had submitted.
Well, I’ll be darned.
Pointing The Finger
The stack trace in this case was pretty cryptic, and implied that a private Apple class, NSURLConnectionInternal, was crashing while trying to message what I could only assume was its delegate. Here’s the gist of the very tip of the crashing thread’s trace:
Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 libobjc.A.dylib objc_release
1 libobjc.A.dylib AutoreleasePoolPage::pop(void*)
2 com.apple.CFNetwork -[NSURLConnectionInternal _withConnectionAndDelegate:onlyActive:]
3 com.apple.CFNetwork -[NSURLConnectionInternal _withActiveConnectionAndDelegate:]
Usually, when you see something like this where a thread is crashing while trying to release an object, the first thing to suspect is that you’ve screwed something up. Probably by over-releasing some object or neglecting to nil-out a delegate attribute when one of your objects is destroyed. But in the case of logs like this where the crash occurs with a backtrace that is separate from your own code’s control (it’s a run loop callback to Apple’s own networking code), it can be hard to trace it back to specific code in your app. In this case, I can tell by the NSURL clues that a network request is probably being serviced, but which one? And which object was its delegate? Luckily, the clue about OneDrive had led me to a working test case, so I could dig into this question with more care.
The first step I like to take is to hopefully separate the crashing issue from my own app’s code, by creating a stripped-down test app that exhibits the problem. If I couldn’t achieve that, then the bug probably did rest somewhere in MarsEdit, and if I could create such a test case, then the problem of poking at it to deduce what is going wrong is made much simpler.
I created a new document based application whose document windows possess a single WebView. When a new document is created, the document does one thing: load one of those OneDrive embedded documents as the HTML content of the view. Armed with this test app, I was able to create new WebView instanced with the presumably problematic HTML code, and observe the behavior of each WebView in isolation from other junk that might have been going on in a more complex app such as MarsEdit.
The only problem was that my test app didn’t seem to exhibit any problems. I could load 10, 20 WebViews with the OneDrive embed code in it, and they all ran like butter. Damn it, there was no bug here! Until, that is, I deigned to close one of those windows. Blam! The test app crashes, yielding an identical stack trace to the one provided by my customer and reproduced in MarsEdit.
It started to make sense: when you publish a post from MarsEdit, the default setting is to also then close the window associated with that post. The reason the customer experienced the crash after sending the post to their blog was only because that happened to correlate with MarsEdit subsequently closing the window, and thus destroying the WebView that had held the OneDrive embed content.
Whose Zombie Is This?
Armed with a simple test case that reproduces the bug at will, the first thing I did was to enable Apple’s built-in facility for debugging so-called “zombie” objects. These are perfectly named because they are objects that have been deallocated but which some code within your app treats as though they were still alive. When you message a zombie object, the behavior is unpredictable because all the data that had been associated with the object is now correlated with whatever other use the system has found for that area of memory. Xcode has a convenient checkbox that makes it easy to turn on zombie detection. Just open the scheme editor and look for the Diagnostics tab under the Run configuration:
This option instructs the Objective-C runtime to abstain from completely deallocating objects. In doing so, it transforms deallocated objects into a special kind of placeholder object that can detect when it has been messaged, and scream to high hell about it. When I run with zombie objects enabled, and reproduce the crash, I also see this message in the console log:
*** -[WebView release]: message sent to deallocated instance 0x6080001273a0
Ah ha! So the object in question, the one being delegated to by NSURLConnection, is probably the instance of the WebView itself. But, I don’t set a WebView as a delegate of any NSURLConnection, and certainly this simplified test case doesn’t. So it must be WebKit itself that is setting itself as an NSURLConnection’s delegate. Not my problem, I can file the bug and move on … or can I?
Going The Extra Mile
This is where the open-sourced nature of WebKit is both a blessing and a curse. If this issue had been revealed to exist somewhere deep within AppKit, or any of Apple’s other closed-source frameworks, I most certainly would have thrown up my arms, filed a bug, and hoped for the best. But Apple’s WebKit framework is something I can download, build, and debug. Sure, it takes nearly all day to build, and there is a steep learning curve to debugging it, but why did I become a software developer if I’m not up for a challenge?
Since I knew that the issue at its core appeared to be the over-release of an object, specifically of the WebView, I moved up from the basic utility of zombie objects to the high-octane “Zombies” flavored analysis configuration of Apple’s Instruments tool. Like the runtime flag we set earlier in Xcode, this makes it easy to detect attempts to message zombie objects, but it also goes a step further by correlating that knowledge with a recorded history of a given object, and all the memory-management related manipulations of that object.
The tail end of a very long table shows the running retain count for an object, in this case the WebView itself, as well information about which library, method, or function was responsible for the call that affected the object’s memory management status.
To make a long, long, long, long, story very short: I used Instruments and its “pairing” functionality, which lets you hide balanced pairs of retain/release calls so that you are left with a more manageable subset of calls to examine and scrutinize. After scratching my head over this for hours and not making much headway, I finally came upon a specific call stack that had a very suspicious heritage. I confirmed through Instruments and then by direct examination of WebKit source code, that the WebView instance in question was in fact retaining and then autoreleasing itself as part of its dealloc method.
In this era of ARC, some of you may not be so intricately familiar with Cocoa memory management to see at a glance why that’s a very bad thing. Here it is a nutshell:
- By the time an object’s dealloc method is done running, the object is gone. I mean, literally gone, or else being managed as a zombie object for debugging purposes.
- An object can be retained and released as much as you like, but if you’re already in the midst of dealloc’ing the object, it won’t make a bit of difference.
- Autoreleasing an object postpones release until a later time, when the active autorelease pool is released.
- Thus, autoreleasing an object in the midst of being dealloc’d guarantees that it will later be sent a release message. Oops, that would be bad. It could even cause a crash like the one we’re seeing…
How in the heck does WebView manage to autorelease itself as a consequence of dealloc’ing itself? And what unique relationship does that possibility have to the embedding of Microsoft OneDrive documents? I’ll try to be as brief as possible:
Sending the request would be fine, except for WebView’s own conscientious behavior which takes care to retain and autorelease itself whenever a new batch of network requests is made on its behalf. This guarantees that the WebView won’t disappear while an open request is running. But of course, it can’t make that guarantee when the WebView is in the process of being destroyed.
Short story shorter: WebView inadvertently adds itself to the autorelease pool when it’s on the verge of being destroyed, and subsequent release of that autorelease pool causes a zombie message to be sent to the deallocated instance.
I reported the bug, and submitted a patch to the WebKit team. My proposed fix works around the problem by wrapping the bulk of WebView’s -dealloc method with a separate autorelease pool, such that any inadvertent autoreleasing of the WebView will be tidied up before the object is literally destroyed. Because WebKit is open source, I was able to make the change, build and test it, and confirm that the crashes disappear both in my test app and in MarsEdit.
I would like to thank Alexey Proskuryakov for taking the time to help dig into the issue in parallel with me. He was on to the nut of the problem long before I was, but I glossed over some key hints in his analysis. He also helped in guiding me through the steps of submitting a proper patch with my bug report.
In The Mean Time
Hopefully Apple will accept my patch, or come up with a smarter way of fixing the problem for a future update to WebKit. In the mean time, I want my app to be as reliable as possible for the widest variety of users as possible. Yes, that even means Microsoft OneDrive users. How can I work around the problem today?
The simplest solution, requiring neither subclassing of WebView, swizzling, or anything like that, is simply to take care before releasing a WebView in your app, to manually call -[WebView close] on it first. This takes care of all the complicated business that runs the risk of autoreleasing the WebView, such that when WebView’s own dealloc method is reached, it won’t have any such complexities to expose itself to.
Tracking down and fixing this bug has been an incredibly tiresome yet rewarding experience. I hope you have learned a thing or two from the approach I took here, and that I have inspired you to take crash reports with the seriousness they deserve. Your customers will thank you for it, and you might even get a kick out of occasionally coming out on top!