On Tue, 2018-08-21 at 11:46 +0100, Daniel P. Berrangé wrote:
> On Tue, Aug 21, 2018 at 12:27:34PM +0200, Michal Privoznik wrote:
> > On 08/21/2018 11:18 AM, Simon Kobyda wrote:
> > > On Thu, 2018-08-16 at 12:28 +0100, Daniel P. Berrangé wrote:
> > > > On Thu, Aug 16, 2018 at 12:56:24PM +0200, Simon Kobyda wrote:
> > > > >
> > > >
> > > > After asking around I have found the right solution that we need
> > > > to use for measuring string width. mbstowcs()/wcswidth() will get
> > > > the answer wrong wrt zero-width characters, combining characters,
> > > > non-printable characters, etc. We need to use the libunistring
> > > > library:
> > > >
> > > > https://www.gnu.org/software/libunistring/manual/libunistring.html#uniwid...
> > > >
> > >
> > > I've tried what you've suggested, but it seems that it doesn't work
> > > well with all Unicode characters. I'm looking into the code of the
> > > library: each uN_strwidth function calls uN_width, which in turn
> > > calls uc_width to calculate the width of each character. And if we
> > > look into the code of uc_width here:
> > >
> > > http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/uniwidth/wi...
> > > it seems that this library only covers certain Unicode ranges, e.g.
> > > Hangul characters, angle brackets, CJK characters... But it doesn't
> > > cover all double-width characters. Example: when I pass it any emoji
> > > (e.g. 🙉, 🦀, 🏙), it returns a width of 1 column for each character,
> > > even though these characters have a width of 2 columns on a terminal.
> > >
> > > BTW, it seems the unistring library imports those functions from
> > > gnulib.
> >
> > I guess the only option then is to try smartcols [1]. If it is good for
> > util-linux it's going to be good for us too. Although, I'd prefer to
> > have our own wrappers over their API.
> >
> > [1] https://github.com/karelzak/util-linux/tree/master/libsmartcols
>
> The util-linux code uses mbstowcs / wcwidth to convert the characters
> and count their width, sort of like the original version of this patch.
> They have further code that decides to convert certain Unicode
> characters into "\xNN" escape sequences, which avoids the problems I
> raised wrt non-printable strings.
>
> https://github.com/karelzak/util-linux/blob/master/lib/mbsalign.c
>
> So we could pull that helper API into our code, since it's LGPL
> licensed. I'm unclear if this correctly handles all the cases or not
> though, as there are no unit tests for it in util-linux AFAICT.
>
> Really the only way for us to be sure is to provide a unit test which
> stresses our code with a variety of Unicode input strings.

About unit tests: right now I've got tests for non-printable, zero-width
and combining characters, and for right-to-left writing. Has anybody got
any idea what else could be problematic with mbstowcs()/wcswidth(), and
should therefore be tested?

I think that sounds reasonable enough for now - passing such tests would
already be massively better than the code that exists today with strlen().
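
For reference, here is a minimal sketch of the mbstowcs()/wcswidth()
approach discussed in this thread, as opposed to counting bytes with
strlen(). The name display_width() is hypothetical and is not an existing
libvirt or util-linux function. The sketch assumes the caller has already
selected a suitable LC_CTYPE locale, and it simply reports -1 for invalid
or non-printable input instead of escaping it the way mbsalign.c does.

```c
#define _XOPEN_SOURCE 700   /* exposes wcswidth() in <wchar.h> on glibc */
#include <assert.h>
#include <locale.h>         /* setlocale(), to be called by the user of display_width() */
#include <stdlib.h>         /* mbstowcs(), malloc(), free() */
#include <wchar.h>          /* wcswidth() */

/*
 * display_width() is a hypothetical helper, for illustration only.
 *
 * Returns the number of terminal columns that @s occupies according to
 * the C library, or -1 if @s contains an invalid multibyte sequence or
 * a non-printable character, or if memory allocation fails.
 *
 * The result depends on the current LC_CTYPE locale.
 */
static int display_width(const char *s)
{
    size_t nchars;
    wchar_t *wide;
    int width;

    assert(s != NULL);

    /* With a NULL destination, mbstowcs() only counts the wide characters. */
    nchars = mbstowcs(NULL, s, 0);
    if (nchars == (size_t)-1)
        return -1;

    wide = malloc((nchars + 1) * sizeof(*wide));
    if (wide == NULL)
        return -1;

    if (mbstowcs(wide, s, nchars + 1) == (size_t)-1) {
        free(wide);
        return -1;
    }

    /* wcswidth() returns -1 if any of the characters is non-printable. */
    width = wcswidth(wide, nchars);
    free(wide);
    return width;
}
```

With a UTF-8 locale selected, display_width("\xc3\xa9") (the character é)
should report 1 column where strlen() reports 2 bytes. What it reports for
emoji or combining sequences depends on the Unicode tables shipped with
the C library, which is exactly what the unit tests need to pin down.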
Regards,
Daniel