On 08/21/2018 11:18 AM, Simon Kobyda wrote:
> On Thu, 2018-08-16 at 12:28 +0100, Daniel P. Berrangé wrote:
>> On Thu, Aug 16, 2018 at 12:56:24PM +0200, Simon Kobyda wrote:
>>>
>>
>> After asking around I have found the right solution that we need to
>> use for measuring string width. mbstowcs()/wcswidth() will get the
>> answer wrong wrt zero-width characters, combining characters,
>> non-printable characters, etc. We need to use the libunistring library:
>>
>> https://www.gnu.org/software/libunistring/manual/libunistring.html#uniwid...
>>
> I've tried what you've suggested, but it seems that it doesn't work
> well with all unicode characters. I'm looking into the code of the
> library, and each function uN_strwidth calls function uN_width, and
> that function calls uc_width for calculation of width of characters.
> And if we look into the code of uc_width here:
>
> http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/uniwidth/wi...
>
> it seems that this library is limited only to certain Unicode ranges, e.g.
> Hangul characters, angle brackets, CJK characters... But it doesn't
> cover all multiple-width characters. Example: if I pass it any emoji
> (e.g. 🙉, 🦀, 🏙), it returns a width of 1 column for each character,
> even though these characters take up 2 columns on a terminal.
>
> BTW, it seems the unistring library imports those functions from gnulib.
I guess the only option then is to try smartcols [1]. If it is good for
util-linux it's going to be good for us too. Although, I'd prefer to
have our own wrappers over their API.
[1] https://github.com/karelzak/util-linux/tree/master/libsmartcols
The util-linux code does something that uses mbstowcs / wcwidth to
convert the characters and count their width, sort of like the original
version of this patch. They have further code that decides to convert
certain unicode characters into "\xNN" escaped sequences, which avoids
the problems I raised wrt non-printable strings.
So we could pull that helper API into our code, since it's LGPL licensed.
I'm unclear whether it correctly handles all the cases, though, as
there are no unit tests for it in util-linux AFAICT.
Really the only way for us to be sure is to provide a unit test which
stresses our code with a variety of Unicode input strings.
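To make the idea concrete, here is a rough sketch of the approach described
above: convert each multibyte character, ask wcwidth() for its width, and
count anything non-printable as if it were printed as "\xNN" per byte. This
is not the util-linux code. The name safe_display_width() is made up for
illustration, and it uses mbrtowc() (the incremental counterpart of
mbstowcs()) so that the byte length of each character is known.

```c
#define _XOPEN_SOURCE 700   /* exposes wcwidth() in <wchar.h> */
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper, not taken from libvirt or util-linux.
 *
 * Returns the number of terminal columns needed to print `str` in the
 * current locale, assuming every byte of a non-printable or undecodable
 * character is shown as a 4-column "\xNN" escape.
 */
long
safe_display_width(const char *str)
{
    mbstate_t state;
    size_t remaining = strlen(str);
    long width = 0;

    memset(&state, 0, sizeof(state));

    while (remaining > 0) {
        wchar_t wc;
        size_t len = mbrtowc(&wc, str, remaining, &state);
        int w;

        if (len == (size_t)-1 || len == (size_t)-2) {
            /* Invalid or truncated sequence: escape one byte and resync. */
            memset(&state, 0, sizeof(state));
            width += 4;
            str++;
            remaining--;
            continue;
        }
        if (len == 0)
            len = 1;    /* defensive: embedded NUL, not reachable via strlen() */

        /* wcwidth() returns -1 for characters it considers non-printable. */
        w = iswprint((wint_t)wc) ? wcwidth(wc) : -1;
        if (w < 0)
            width += 4 * (long)len;   /* one "\xNN" escape per byte */
        else
            width += w;

        str += len;
        remaining -= len;
    }

    return width;
}
```

In the default "C" locale, safe_display_width("a\tb") gives 6: one column
each for 'a' and 'b' plus four for the escaped tab. For non-ASCII input the
caller would need setlocale(LC_CTYPE, "") with a UTF-8 locale first, and the
result for things like emoji still depends on the C library's wcwidth()
tables.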
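
As a starting point, such a test could be table driven. The sketch below is
only an assumption about how it might look: string_columns() stands in for
whatever width function we end up with (here a plain mbstowcs() + wcswidth()
version, similar in spirit to the original patch), and struct width_case and
run_width_cases() are hypothetical names, not existing test helpers.

```c
#define _XOPEN_SOURCE 700   /* exposes wcswidth() in <wchar.h> */
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical function under test: mbstowcs() + wcswidth().
 * Returns the column count, or -1 if the string cannot be converted or
 * contains a character that wcswidth() treats as non-printable.
 */
int
string_columns(const char *str)
{
    size_t n = mbstowcs(NULL, str, 0);   /* wide chars needed, sans NUL */
    wchar_t *wstr;
    int width;

    if (n == (size_t)-1)
        return -1;                       /* invalid multibyte sequence */

    wstr = calloc(n + 1, sizeof(*wstr));
    if (!wstr)
        return -1;

    mbstowcs(wstr, str, n + 1);
    width = wcswidth(wstr, n);
    free(wstr);
    return width;
}

/* One test vector: an input string and the width we expect for it. */
struct width_case {
    const char *input;
    int expected;
};

/* Runs every case through `fn` and returns the number of mismatches. */
int
run_width_cases(int (*fn)(const char *),
                const struct width_case *cases, size_t ncases)
{
    int failures = 0;
    size_t i;

    for (i = 0; i < ncases; i++) {
        int got = fn(cases[i].input);

        if (got != cases[i].expected) {
            fprintf(stderr, "case %zu: expected %d, got %d\n",
                    i, cases[i].expected, got);
            failures++;
        }
    }
    return failures;
}
```

The table would then grow entries for combining characters, zero-width
characters, CJK text and emoji, run after setlocale() has selected a UTF-8
locale. The expected values for those need to be checked against a real
terminal rather than assumed.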
Regards,
Daniel