Intrinsic String Encoding

I was baffled today while investigating a bug in MarsEdit, which a customer reported as only seeming to affect the app when writing in Japanese.

I pasted some Japanese text into the app and was able to reproduce the bug easily. What really confused me, though, was that the bug persisted even after I replaced the Japanese text with very straightforward ASCII-compatible English. I opened a new editor window, copied and pasted the English text in, and the bug disappeared. I copied and pasted back into the problematic editor, and the bug returned. What the heck? Two windows with identical editors, containing identical text, exhibiting varying behavior? I knew this was going to be good.

It turns out there’s a bug in my app where I erroneously ask for a string’s “fastestEncoding” in the process of converting it. The bug occurs when fastestEncoding returns something other than ASCII or UTF-8. For example, with a string of Japanese characters, the fastestEncoding tends to be NSUnicodeStringEncoding.
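To make the shape of the mistake concrete, here’s a sketch of the pattern. The function is a hypothetical stand-in for my actual conversion code, not the real thing:

import Foundation

// Hypothetical sketch of the bug: encoding a string with whatever encoding
// happens to be "fastest" for that particular string, instead of a fixed
// interchange encoding.
func buggyEncodedData(for string: NSString) -> Data? {
    // For Japanese text, fastestEncoding is typically NSUnicodeStringEncoding,
    // so a consumer expecting ASCII- or UTF-8-encoded bytes gets UTF-16.
    return string.data(using: string.fastestEncoding)
}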

But why did the bug continue to occur even after I replaced the text with plain English? Well…

The documentation for NSString encourages developers to view it as a kind of encoding-agnostic repository of characters, which can be used to manipulate arbitrary strings, converting to a specific encoding only as needed:

An NSString object encodes a Unicode-compliant text string, represented as a sequence of UTF-16 code units. All lengths, character indexes, and ranges are expressed in terms of 16-bit platform-endian values, with index values starting at 0.

This might lead you to believe that no matter how you create an NSString representation of “Hello”, the resulting objects will be identical both in value and in behavior. But it’s not true. Once I had worked with Japanese characters in my NSTextView, the editor’s text storage must have graduated to understanding its content as intrinsically Unicode-based. Thus, when I proceeded to copy the string out of the editor and manipulate it, it behaved differently from a string generated in an editor that had never contained non-ASCII characters.
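I can’t reproduce my editor’s text storage in a few lines, but a rough stand-in for the scenario is a mutable string that once held Japanese and was later reset to plain ASCII. I won’t promise what this prints on any given OS release, but it’s how I’d probe the behavior in a playground:

import Foundation

// A probe, not a claim: mimic text storage that once contained Japanese,
// then had its contents replaced with plain ASCII.
let storage = NSMutableString(string: "こんにちは")
storage.setString("Hello")
print(storage.fastestEncoding)   // does the Unicode backing store linger?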

In a nutshell: NSString’s fastestEncoding can return different values for the same string, depending upon how the string was created. An NSString constant created from ASCII-compatible bytes in an Objective-C source file reports NSASCIIStringEncoding (1) for both smallest and fastest encoding:

printf("%lu\n", (unsigned long)[@"Hello" fastestEncoding]);	// ASCII (1)

And a Swift string constant coerced to NSString at creation behaves exactly the same way:

let helloAscii = "Hello" as NSString
helloAscii.fastestEncoding			// ASCII (1)
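The smallest encoding matches, too:

helloAscii.smallestEncoding		// ASCII (1)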

But here’s the same plain string constant, left as a native Swift String and only bridged to NSString when calling the method:

let helloUnicode = "Hello"
helloUnicode.fastestEncoding		// Unicode (10)
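Despite the differing encodings, the two objects still compare as equal; the divergence is purely in the internal representation:

helloAscii.isEqual(to: helloUnicode)	// true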

As confusing as I found this at first, I have to concede that the behavior makes sense. The high-level documentation describing NSString as representing “a sequence of UTF-16 code units” says nothing about the implementation details. It’s a conceptual description of the class, and for the most part all methods operating on an NSString comprising the same characters should behave the same way. But the documentation for fastestEncoding is actually pretty clear:

“Fastest” applies to retrieval of characters from the string. This encoding may not be space efficient.

As I said earlier, my usage of fastestEncoding was erroneous, so the solution to my bug involves removing the call to the method completely. In fact, I don’t expect most developers will ever have a legitimate need to call this method. For those who do, be very aware that it can and does behave differently, depending on the provenance of your string data!
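For the record, the replacement is the boring, explicit version. A minimal sketch, with editorText as a hypothetical stand-in for the string actually being converted:

import Foundation

// Name the interchange encoding explicitly instead of asking the string
// which encoding is fastest for it.
let editorText = "Hello"                 // hypothetical stand-in
let utf8Data = editorText.data(using: .utf8)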