Fwd: [Audacity-devel] Properly handle supplementary characters when saving XML files


Is the bug that this pull request addresses logged on Bugzilla?
I couldn't find it, though it is an issue that we see
frequently on the Windows forum.


---------- Forwarded message ----------
From: Ryan Sakowski <[hidden email]>
Date: 6 July 2017 at 22:27
Subject: [Audacity-devel] Properly handle supplementary characters
when saving XML files
To: "[hidden email]"
<[hidden email]>

About a month ago, I posted a pull request on GitHub that should
resolve the numerous "reference to invalid character number at line
###" errors that have been posted on the Windows board. Here's the
pull request:


What's going on here is that Audacity for Windows (but not Linux and
probably not macOS) is saving Unicode supplementary characters
(characters greater than U+FFFF) as escaped UTF-16 surrogate pairs,
which are illegal in XML. (Supplementary characters include, among
other things, many emoji and lesser-used writing systems.)
Supplementary characters instead need to be saved either directly or
as single 5- or 6-digit codes, which I have confirmed happens on Linux.
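
To make the difference concrete, here is a small C++ sketch. It is not
Audacity code: the function names and the hexadecimal "&#x...;" reference
format are assumptions chosen for illustration. It splits a supplementary
code point into its UTF-16 surrogate pair and builds both the broken and
the valid escaped forms.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <utility>

// Split a supplementary code point (U+10000..U+10FFFF) into its UTF-16
// surrogate pair, following the standard UTF-16 encoding rules.
std::pair<std::uint16_t, std::uint16_t> ToSurrogates(std::uint32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const std::uint32_t v = cp - 0x10000;  // 20 significant bits remain
    return { static_cast<std::uint16_t>(0xD800 + (v >> 10)),      // high surrogate
             static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)) };  // low surrogate
}

// Format a numeric character reference such as "&#x1F600;".
// (Hypothetical helper; the exact format a given writer uses may differ.)
std::string NumericRef(std::uint32_t value)
{
    char buf[16];
    std::snprintf(buf, sizeof buf, "&#x%X;", static_cast<unsigned>(value));
    return buf;
}

// Illegal in XML: each surrogate is escaped as if it were a character.
std::string BrokenEscape(std::uint32_t cp)
{
    const auto pair = ToSurrogates(cp);
    return NumericRef(pair.first) + NumericRef(pair.second);
}

// Legal in XML: one reference for the whole code point.
std::string CorrectEscape(std::uint32_t cp)
{
    return NumericRef(cp);
}
```

For U+1F600, ToSurrogates gives D83D and DE00. BrokenEscape then produces
"&#xD83D;&#xDE00;", which an XML parser rejects because surrogate code
points are not valid XML characters, while CorrectEscape produces the
valid reference "&#x1F600;".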

The cause of this problem is that Audacity uses wxString, which, at
least on Windows*, stores characters as wchar_t, which is 2 bytes
(UTF-16) on Windows and 4 bytes (UTF-32) on Linux and macOS. While a
4-byte wchar_t can store any Unicode code point in one unit, a 2-byte
wchar_t cannot store supplementary characters, so in the latter case such
characters are represented as surrogate pairs (U+D800..U+DBFF followed
by U+DC00..U+DFFF). However, the GetChar function of wxString doesn't
seem to decode surrogate pairs, and Audacity does nothing to handle
them in the XMLWriter::XMLEsc function (which is what escapes and
filters out certain characters for project files, and is where
surrogates are incorrectly being escaped).
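
As a rough illustration of what decoding surrogate pairs involves, the
following C++ sketch walks a sequence of UTF-16 code units and combines
each valid pair into a single code point. It uses std::u16string as a
stand-in for a 2-byte wchar_t string, and all names in it are
hypothetical; it does not show how wxString or Audacity work internally.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

inline bool IsHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool IsLowSurrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a valid surrogate pair into the code point it encodes.
inline char32_t CombineSurrogates(char16_t hi, char16_t lo)
{
    assert(IsHighSurrogate(hi) && IsLowSurrogate(lo));
    return 0x10000 + ((static_cast<char32_t>(hi) - 0xD800) << 10)
                   + (static_cast<char32_t>(lo) - 0xDC00);
}

// Decode UTF-16 code units into code points. Unpaired surrogates are
// copied through unchanged here (an assumption of this sketch; real code
// would need to pick a policy for malformed input).
std::vector<char32_t> DecodeUtf16(const std::u16string &units)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < units.size(); ++i) {
        const char16_t u = units[i];
        if (IsHighSurrogate(u) && i + 1 < units.size() &&
            IsLowSurrogate(units[i + 1])) {
            out.push_back(CombineSurrogates(u, units[i + 1]));
            ++i;  // the low surrogate has been consumed as well
        } else {
            out.push_back(u);
        }
    }
    return out;
}
```

Code that reads such a string one unit at a time, without a step like
CombineSurrogates, sees the two halves of the pair as two separate
"characters", which is the situation described above.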

My pull request fixes this by detecting surrogate pairs and passing
them unescaped to the output string; the surrogate pairs are
eventually decoded and the characters they encode are written properly
to the project file.
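
The general idea can be sketched in C++ as follows. This is only an
illustration under stated assumptions, not the code from the pull
request: EscapeXml is a hypothetical function operating on
std::u16string rather than wxString, it handles only the five predefined
XML entities, and dropping unpaired surrogates is a policy chosen for the
sketch.

```cpp
#include <cstddef>
#include <string>

static bool IsHigh(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static bool IsLow(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical stand-in for an XML escaping routine on UTF-16 code units.
std::u16string EscapeXml(const std::u16string &in)
{
    std::u16string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char16_t c = in[i];
        switch (c) {
        case u'&':  out += u"&amp;";  continue;
        case u'<':  out += u"&lt;";   continue;
        case u'>':  out += u"&gt;";   continue;
        case u'"':  out += u"&quot;"; continue;
        case u'\'': out += u"&apos;"; continue;
        default: break;
        }
        if (IsHigh(c) && i + 1 < in.size() && IsLow(in[i + 1])) {
            // Valid pair: copy both units untouched so that a later
            // encoding step can turn them into one character.
            out += c;
            out += in[++i];
        } else if (IsHigh(c) || IsLow(c)) {
            // Unpaired surrogate: dropped (assumption of this sketch).
        } else {
            out += c;
        }
    }
    return out;
}
```

With this approach the escaping step never emits a numeric reference for
half of a pair; the pair survives intact until the string is converted to
the file's encoding.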

*According to the "Performance characteristics" section of the
wxString documentation, wxString uses wchar_t as its character type by
default. However, the documentation for wxStringCharType
says that it is the type used by wxString and that it is, by default,
wchar_t on Windows and char on various other platforms (in which case
I think wxString uses the UTF-8 encoding). I haven't tested which of
these is more accurate, but since Audacity on Linux saves
supplementary characters properly to project files, I don't think it's
a concern.
