Properly handle supplementary characters when saving XML files
About a month ago, I posted a pull request on GitHub that should resolve the numerous "reference to invalid character number at line ###" errors that have been posted on the Windows board. Here's the pull request:
What's going on here is that Audacity for Windows (but not Linux and probably not macOS) is saving Unicode supplementary characters (code points above U+FFFF) as escaped UTF-16 surrogate pairs, which are illegal in XML. (Supplementary characters include, among other things, many emoji and lesser-used writing systems.) Supplementary characters instead need to be saved either directly or as a single character reference with a 5- or 6-digit hexadecimal code, which I have confirmed happens on Linux.
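To make the difference concrete, here is a small sketch of what a single character reference for a supplementary character looks like. The function name numericCharRef is made up for this post; it is not part of Audacity or wxWidgets, and this is not code from the pull request.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper (not an Audacity function): format one Unicode code
// point as a single hexadecimal XML character reference.
std::string numericCharRef(char32_t cp) {
    assert(cp <= 0x10FFFF);  // highest valid Unicode code point
    char buf[16];
    std::snprintf(buf, sizeof buf, "&#x%lX;", static_cast<unsigned long>(cp));
    return buf;
}

// Legal in XML:   numericCharRef(0x1F600) gives "&#x1F600;" (one reference)
// Illegal in XML: "&#xD83D;&#xDE00;" (the same character written as two
//                 escaped surrogate halves, which is what triggers the error)
```

For U+1F600 this yields one 5-digit reference, whereas the broken output contains two 4-digit references in the surrogate range.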
The cause of this problem is that Audacity uses wxString, which, at least on Windows*, stores characters as wchar_t. A wchar_t is 2 bytes (a UTF-16 code unit) on Windows and 4 bytes (UTF-32) on Linux and macOS. While a 4-byte wchar_t can store any Unicode code point in one unit, a 2-byte wchar_t cannot store a supplementary character, so on Windows such characters are represented as surrogate pairs (a high surrogate in U+D800..U+DBFF followed by a low surrogate in U+DC00..U+DFFF). However, the GetChar function of wxString doesn't seem to decode surrogate pairs, and Audacity does nothing to handle them in the XMLWriter::XMLEsc function, which escapes and filters out certain characters for project files. As a result, each half of a pair is escaped on its own.
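For anyone unfamiliar with surrogate pairs, the arithmetic that turns a pair back into one code point is standard UTF-16. A minimal sketch follows; the helper names are my own for illustration and are not wxWidgets or Audacity functions.

```cpp
// Hypothetical helpers (names invented for this example).
constexpr bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool isLowSurrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a high/low surrogate pair into the code point it encodes.
// The caller is expected to have checked both halves first.
constexpr char32_t combineSurrogates(char16_t high, char16_t low) {
    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)
         + (static_cast<char32_t>(low) - 0xDC00);
}

// U+1F600 is stored in UTF-16 as the pair D83D DE00.
static_assert(isHighSurrogate(0xD83D) && isLowSurrogate(0xDE00), "valid pair");
static_assert(combineSurrogates(0xD83D, 0xDE00) == 0x1F600, "decodes to U+1F600");
```

With a 4-byte wchar_t none of this is needed, since the code point is already stored in one unit.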
My pull request fixes this by detecting surrogate pairs and passing them unescaped to the output string; the surrogate pairs are eventually decoded and the characters they encode are written properly to the project file.
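The idea can be sketched roughly as follows. This is not the code from the pull request and not the real XMLWriter::XMLEsc; it is a simplified stand-in that works on std::u16string instead of wxString, with a made-up function name, and it assumes that unpaired surrogates should simply be dropped.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for an XML escaping routine over UTF-16 code units.
std::u16string escapeUtf16ForXml(const std::u16string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char16_t c = in[i];
        switch (c) {  // the five predefined XML entities
            case u'&':  out += u"&amp;";  continue;
            case u'<':  out += u"&lt;";   continue;
            case u'>':  out += u"&gt;";   continue;
            case u'"':  out += u"&quot;"; continue;
            case u'\'': out += u"&apos;"; continue;
            default: break;
        }
        const bool high = c >= 0xD800 && c <= 0xDBFF;
        const bool low  = c >= 0xDC00 && c <= 0xDFFF;
        if (high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Valid pair: copy both halves unescaped so a later encoding
            // step can write the real character.
            out += c;
            out += in[i + 1];
            ++i;
            continue;
        }
        if (high || low) continue;  // lone surrogate: dropped (my assumption)
        out += c;
    }
    return out;
}

// Small usage check: markup characters are escaped, an emoji passes through.
void checkEscapeSketch() {
    assert(escapeUtf16ForXml(u"a<b") == u"a&lt;b");
    assert(escapeUtf16ForXml(u"\U0001F600") == u"\U0001F600");
}
```

The important part is the pair check: both halves go to the output together and untouched, rather than each being turned into its own character reference.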