On Tue, 2018-08-21 at 11:46 +0100, Daniel P. Berrangé wrote:
On Tue, Aug 21, 2018 at 12:27:34PM +0200, Michal Privoznik wrote:
> On 08/21/2018 11:18 AM, Simon Kobyda wrote:
> > On Thu, 2018-08-16 at 12:28 +0100, Daniel P. Berrangé wrote:
> > > On Thu, Aug 16, 2018 at 12:56:24PM +0200, Simon Kobyda wrote:
> > > >
> > >
> > > After asking around I have found the right solution that we need to
> > > use for measuring string width. mbstowcs()/wcswidth() will get the
> > > answer wrong wrt zero-width characters, combining characters,
> > > non-printable characters, etc. We need to use the libunistring
> > > library:
> > >
> > > https://www.gnu.org/software/libunistring/manual/libunistring.html#uniwid...
> > >
> >
> > I've tried what you've suggested, but it seems that it doesn't work
> > well with all unicode characters. I'm looking into the code of the
> > library: each uN_strwidth function calls uN_width, and that function
> > calls uc_width to calculate the width of each character. And if we
> > look into the code of uc_width here:
> >
> > http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/uniwidth/wi...
> >
> > it seems that this library only handles certain ranges of code
> > points, e.g. hangul characters, angle brackets, CJK characters... But
> > it doesn't cover all characters that are wider than one column.
> > Example: if I pass it any emoji (e.g. 🙉, 🦀, 🏙), it returns a width
> > of 1 column for each character, even though these characters are
> > 2 columns wide on a terminal.
> >
> > BTW, it seems the unistring library imports those functions from
> > gnulib.
>
> I guess the only option then is to try smartcols [1]. If it is good
> for util-linux it's going to be good for us too. Although, I'd prefer
> to have our own wrappers over their API.
>
> [1] https://github.com/karelzak/util-linux/tree/master/libsmartcols

The util-linux code uses mbstowcs / wcwidth to convert the characters
and count their width, sort of like the original version of this patch.
They have further code that converts certain unicode characters into
"\xNN" escape sequences, which avoids the problems I raised wrt
non-printable strings.

https://github.com/karelzak/util-linux/blob/master/lib/mbsalign.c
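
To make the escaping idea concrete, here is a minimal sketch of a width
calculation that counts "\xNN" escapes for anything the locale cannot
print. It is not the util-linux code: the name escaped_width() and the
policy of one escape per byte are assumptions made for illustration.

```c
#define _XOPEN_SOURCE 700 /* for wcwidth() */
#include <assert.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (not the util-linux API): number of terminal
 * columns needed for @s if every character that is invalid or not
 * printable in the current locale is shown as "\xNN", one escape per
 * byte. The caller is assumed to have set LC_CTYPE via setlocale().
 */
static int
escaped_width(const char *s)
{
    mbstate_t st;
    size_t left;
    int cols = 0;

    assert(s != NULL);
    memset(&st, 0, sizeof(st));
    left = strlen(s);

    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &st);
        int w;

        if (n == (size_t)-1 || n == (size_t)-2) {
            /* Invalid or truncated sequence: escape one byte, resync. */
            memset(&st, 0, sizeof(st));
            cols += 4;
            s++;
            left--;
            continue;
        }
        if (n == 0)
            break; /* defensive: embedded NUL cannot occur here */

        /* wcwidth() returns -1 for non-printable characters. */
        w = iswprint((wint_t)wc) ? wcwidth(wc) : -1;
        cols += (w < 0) ? 4 * (int)n : w; /* "\xNN" takes 4 columns */
        s += n;
        left -= n;
    }
    return cols;
}
```

With this policy a control character such as a tab is counted as the
four columns of its escape rather than making the whole string's width
come back as -1, which is what wcswidth() would do.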

So we could pull that helper API into our code, since it's LGPL
licensed. I'm unclear if this correctly handles all the cases or not
though, as there are no unit tests for it in util-linux AFAICT.

Really the only way for us to be sure is to provide a unit test which
stresses the code with a variety of unicode input strings.

About unit tests: right now I've got tests for non-printable,
zero-width and combining characters, and for right-to-left text.
Has anybody got an idea what else could be problematic with
mbstowcs()/wcswidth() and should therefore be tested?

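For reference, a table-driven check along these lines could look like
the sketch below. The names wcs_width(), width_case and
run_width_cases() are hypothetical, and the expected column counts are
my assumptions about what a terminal shows, not verified results.

```c
#define _XOPEN_SOURCE 700 /* for wcswidth() */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Width via mbstowcs()/wcswidth(); -1 on invalid or non-printable input. */
static int
wcs_width(const char *s)
{
    size_t n = mbstowcs(NULL, s, 0);
    wchar_t *buf;
    int w;

    if (n == (size_t)-1)
        return -1;
    buf = calloc(n + 1, sizeof(*buf));
    if (!buf)
        return -1;
    mbstowcs(buf, s, n + 1);
    w = wcswidth(buf, n);
    free(buf);
    return w;
}

struct width_case {
    const char *name;
    const char *str; /* UTF-8 bytes */
    int expected;    /* columns we assume a terminal would use */
};

/* Illustrative inputs only; the expected values are assumptions. */
static const struct width_case width_cases[] = {
    { "ascii",      "abc",                              3 },
    { "combining",  "e\xcc\x81",                        1 }, /* e + U+0301 */
    { "zero-width", "a\xe2\x80\x8b" "b",                2 }, /* U+200B */
    { "rtl",        "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d", 4 }, /* Hebrew */
    { "emoji",      "\xf0\x9f\xa6\x80",                 2 }, /* U+1F980 */
};

/* Runs @fn over @n cases and returns how many did not match. */
static int
run_width_cases(int (*fn)(const char *),
                const struct width_case *c, size_t n)
{
    int failures = 0;
    size_t i;

    assert(fn != NULL);
    for (i = 0; i < n; i++) {
        int got = fn(c[i].str);
        if (got != c[i].expected) {
            fprintf(stderr, "%s: expected %d, got %d\n",
                    c[i].name, c[i].expected, got);
            failures++;
        }
    }
    return failures;
}
```

A caller would first call setlocale(LC_ALL, "") with a UTF-8 locale and
then pass width_cases to run_width_cases(). Because the width function
is a parameter, the same table can be run against any other candidate
implementation.
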
Simon Kobyda.