Re: [libvirt PATCH 00/51] Use permutable format strings in translations

27 Mar 2023

On Mon, Mar 27, 2023 at 01:08:09PM +0200, Jiri Denemark wrote:
...
On Fri, Mar 10, 2023 at 17:14:32 +0000, Daniel P. Berrangé wrote:
...
Even if fixed, it might be worth switching the .pot file anyway, but
this can't be done without us bulk updating the translations, and
bulk re-importing them, which will be challenging. We'll almost
certainly want to try this on a throw-away repo in weblate first,
not our main repo.
I was able to come up with steps leading to the desired state:
 0. lock weblate repository
 1. update libvirt.pot from the most recent potfile job
 2. push to libvirt.git
 3. wait for translations update from Fedora Weblate and merge it
 4. pull from libvirt.git
 5. apply the first 50 patches from this series (with required changes
    to make sure all translation strings are updated)
 6. update all po files with the attached script
 7. update libvirt.pot by running meson compile libvirt-pot
 8. apply patch 51 of this series
 9. push to libvirt.git
10. wait for translations update from Fedora Weblate and merge it
11. unlock weblate repository
The process takes about an hour if we're lucky, as weblate is quite slow
when processing such a large number of changes.
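
The "update all po files with the attached script" step above amounts to
one invocation per file, since the script takes a single PO-FILE argument.
A minimal sketch of that loop, assuming the script is saved as
format-strings.py (the file name seen in the diff later in this thread) and
that the translations live in a po/ directory. The helper names po_commands
and run_all are made up for illustration:

```python
import subprocess
import sys
from pathlib import Path


def po_commands(po_dir, script="format-strings.py"):
    """Build one command line per .po file (hypothetical helper)."""
    # The script accepts exactly one PO-FILE argument, so every file
    # gets its own invocation.  "*.po" does not match libvirt.pot.
    return [[sys.executable, script, str(po)]
            for po in sorted(Path(po_dir).glob("*.po"))]


def run_all(po_dir, script="format-strings.py"):
    """Run the conversion script over every .po file in po_dir."""
    for cmd in po_commands(po_dir, script):
        # check=True aborts at the first file the script fails on
        subprocess.run(cmd, check=True)
```

Calling run_all("po") from the top of a libvirt checkout would rewrite the
po files in place, so it is best tried on a throw-away branch first.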
The result can be seen at
https://gitlab.com/jirkade/libvirt/-/commits/format-strings
and the corresponding weblate repository at
https://translate.fedoraproject.org/projects/libvirt/test/
I used d05ad0f15e737fa2327dd68870a485821505b58f commit as a base.
Looking at this, I picked a random language (Bengali) and compared
stats:

  https://translate.fedoraproject.org/projects/libvirt/test/bn_IN/

vs

  https://translate.fedoraproject.org/projects/libvirt/libvirt/bn_IN/

Translated strings match to within 2 words, which is probably
accounted for by being based on a different HEAD.

The number of strings with failing checks is massively different, and
that is the fault of 'failing check: C format' - 1300 more failing
checks afterwards.

Comparing

https://translate.fedoraproject.org/browse/libvirt/test/bn_IN/?q=check%3Ac_format&sort_by=source&offset=3

with

https://translate.fedoraproject.org/browse/libvirt/libvirt/bn_IN/?offset=1&q=check%3Ac_format&sort_by=source&checksum=

we can see some obvious missing examples

https://translate.fedoraproject.org/translate/libvirt/test/bn_IN/?checksum=260fc1387343083b&q=check%3Ac_format&sort_by=source

Which is:

 msgid  "active commit requested but '%1$s' is not active"
 msgstr "সংরক্ষণের পুল '%s' সক্রিয় নয়"

Looking at po/bn_IN.po I see that this string was already marked as
'fuzzy' before your changes, and thus your script did not try to
convert its format string.

Skipping fuzzy strings makes sense when the number of format
strings is mis-matched. If there's a matching count and matching
ordering, I think we ought to update the msgstr even when fuzzy,
but *keep* it marked fuzzy, so translators can review.
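
To make that rule concrete, here is a rough sketch of the decision, not the
attached script itself. It assumes a simplified printf matcher (no '*'
width or precision, no space or 'I' flag), and the names SPEC, specs and
should_convert are made up for illustration:

```python
import re

# Simplified printf matcher (assumption): optional "N$" index, flags,
# width, precision, length modifier, conversion.  A literal "%%" is
# matched separately so that it can be skipped.
SPEC = re.compile(
    r"%%|%(?:[1-9][0-9]*\$)?[-#0+']*[0-9]*(?:\.[0-9]+)?"
    r"(?:hh|h|ll|l|q|L|j|z|Z|t)?[diouxXeEfFgGaAcspnm]")


def specs(text):
    """Return the format specifiers of text, in order, ignoring '%%'."""
    return [m.group(0) for m in SPEC.finditer(text) if m.group(0) != "%%"]


def should_convert(msgid, msgstr):
    """True if msgstr has the same specifiers as msgid, in the same order.

    The fuzzy flag is deliberately not an input: a fuzzy entry that
    passes this test gets converted but stays marked fuzzy.
    """
    return specs(msgid) == specs(msgstr)


# Matching count and order: convert, even if the entry is fuzzy
print(should_convert("storage pool '%s' is not active",
                     "সংরক্ষণের পুল '%s' সক্রিয় নয়"))   # True
# Mismatched count: leave the msgstr alone
print(should_convert("%s has %d items", "%s"))         # False
# Reordered specifiers: leave the msgstr alone
print(should_convert("%s has %d items", "%d / %s"))    # False
```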

Anyway, broadly speaking this script seems to have done the right
thing such that we don't lose translation coverage in the
compiled .mo files. My query is merely about fuzzy strings,
which already get excluded from .mo files.
...
If we agree this is a reasonable approach, I think we should apply it
just after a release to give translators the whole release cycle to
check or update the translations if they wish.
Yep, doing it at the start makes sense.
...
The attached script analyzes a single po file and updates all msgid
strings to use permutable format strings. It also tries to update all
translations, but only if the format strings in them exactly match
(including their order) the corresponding msgid format strings. That is,
a msgstr will not be updated if the format strings in it were incorrect,
reordered, or already used the permutable form. The processing should
therefore be a NO-OP except for strings that already used the permutable
format in msgstr: such translations were failing the c-format check in
weblate before but would be marked as correct now.
NB, even though your script would fix those cases of pre-existing use
of format positions, they'd still be left marked 'fuzzy' so will need
manual review in weblate. At least that is possible now that the
c-format check no longer fails.
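
For readers unfamiliar with the term, "permutable" means each format
specifier carries an explicit argument position (%1$s, %2$d, ...), so a
translation can reorder them. The following is a stripped-down sketch of
the rewrite, much simpler than the attached script: it assumes no '*' width
or precision and leaves already-indexed specifiers untouched. The names
SPEC and to_permutable are made up for illustration:

```python
import re

# Either a literal "%%", or a non-indexed specifier whose body
# (flags, width, precision, length modifier, conversion) is captured.
SPEC = re.compile(
    r"%%|%([-#0+']*[0-9]*(?:\.[0-9]+)?"
    r"(?:hh|h|ll|l|q|L|j|z|Z|t)?[diouxXeEfFgGaAcspnm])")


def to_permutable(text):
    """Number the specifiers in text: '%s ... %d' -> '%1$s ... %2$d'."""
    position = 0

    def repl(m):
        nonlocal position
        if m.group(0) == "%%":
            return "%%"          # literal percent sign, not an argument
        position += 1
        return f"%{position}${m.group(1)}"

    return SPEC.sub(repl, text)


print(to_permutable("domain '%s' has %d vCPUs"))
# domain '%1$s' has %2$d vCPUs
print(to_permutable("'%1$s' is not active"))
# '%1$s' is not active
```

A translator can then write, for example, "%2$d ... '%1$s'" in the msgstr
and the arguments still land in the right places.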
...
Jirka

...
#!/usr/bin/env python3

import sys
import re


# see man 3 printf
reIndex = r"([1-9][0-9]*\$)?"
reFlags = r"([-#0+I']|' ')*"
reWidth = rf"([1-9][0-9]*|\*{reIndex})?"
rePrecision = rf"(\.{reWidth})?"
reLenghtMod = r"(hh|h|l|ll|q|L|j|z|Z|t)?"
reConversion = r"[diouxXeEfFgGaAcspnm%]"

reCFormat = "".join([
    r"%",
    rf"(?P<index>{reIndex})",
    rf"(?P<flags>{reFlags})",
    rf"(?P<width>{reWidth})",
    rf"(?P<precision>{rePrecision})",
    rf"(?P<length>{reLenghtMod})",
    rf"(?P<conversion>{reConversion})"])


def translateFormat(fmt, idx, m):
    groups = m.groupdict()

    if groups["index"] or groups["conversion"] == "%":
        print(f"Ignoring c-format '{fmt}'")
        return idx, fmt

    for field in "width", "precision":
        if "*" in groups[field]:
            groups[field] = f"{groups[field]}{idx}$"
            idx += 1

    newFmt = f"%{idx}${''.join(groups.values())}"
    idx += 1

    return idx, newFmt


def process(ids, strs, fuzzy):
    regex = rf"(.*?)({reCFormat})(.*)"
    fmts = []
    idx = 1

    newIds = []
    for s in ids:
        new = []
        m = re.search(regex, s)
        while m is not None:
            new.append(m.group(1))

            oldFmt = m.group(2)
            idx, newFmt = translateFormat(oldFmt, idx, m)
            fmts.append((oldFmt, newFmt))
            new.append(newFmt)

            s = m.group(m.lastindex)
            m = re.search(regex, s)

        new.append(s)
        newIds.append("".join(new))

    if fuzzy:
        return newIds, strs

    n = 0
    newStrs = []
    for s in strs:
        new = []
        m = re.search(regex, s)
        while m is not None:
            new.append(m.group(1))

            if n < len(fmts) and fmts[n][0] == m.group(2):
                new.append(fmts[n][1])
                n += 1
            else:
                print("Ignoring translation", strs)
                print("              for id", newIds)
                return newIds, strs

            s = m.group(m.lastindex)
            m = re.search(regex, s)

        new.append(s)
        newStrs.append("".join(new))

    return newIds, newStrs


def writeMsg(po, header, strs):
    if len(strs) == 0:
        return

    po.write(header)
    po.write(" ")
    for s in strs:
        po.write('"')
        po.write(s)
        po.write('"\n')


if len(sys.argv) != 2:
    print(f"usage: {sys.argv[0]} PO-FILE", file=sys.stderr)
    sys.exit(1)

pofile = sys.argv[1]

with open(pofile, "r") as po:
    polines = po.readlines()

with open(pofile, "w") as po:
    current = None
    cfmt = False
    fuzzy = False
    ids = []
    strs = []

    for line in polines:
        m = re.search(r'^(([a-z]+) )?"(.*)"', line)
        if m is None:
            if cfmt:
                ids, strs = process(ids, strs, fuzzy)

            writeMsg(po, "msgid", ids)
            writeMsg(po, "msgstr", strs)
            po.write(line)

            cfmt = line.startswith("#,") and " c-format" in line
            fuzzy = line.startswith("#,") and " fuzzy" in line

            current = None
            ids = []
            strs = []
            continue

        if m.group(2):
            current = m.group(2)

        if current == "msgid":
            ids.append(m.group(3))
        elif current == "msgstr":
            strs.append(m.group(3))

    if cfmt:
        ids, strs = process(ids, strs, fuzzy)

    writeMsg(po, "msgid", ids)
    writeMsg(po, "msgstr", strs)

My attempt at converting fuzzy strings involved this diff:

--- /home/berrange/format-strings.py~	2023-03-27 13:29:05.777343030 +0100
+++ /home/berrange/format-strings.py	2023-03-27 13:43:33.950701633 +0100
@@ -62,9 +62,6 @@
         new.append(s)
         newIds.append("".join(new))
 
-    if fuzzy:
-        return newIds, strs
-
     n = 0
     newStrs = []
     for s in strs:
@@ -77,8 +74,9 @@
                 new.append(fmts[n][1])
                 n += 1
             else:
-                print("Ignoring translation", strs)
-                print("              for id", newIds)
+                if not fuzzy:
+                    print("Ignoring translation", strs)
+                    print("              for id", newIds)
                 return newIds, strs
 
             s = m.group(m.lastindex)
@@ -87,6 +85,12 @@
         new.append(s)
         newStrs.append("".join(new))
 
+    if n != len(fmts):
+        if not fuzzy and "".join(strs) != "":
+            print("Ignoring mismatched format count", strs)
+            print("                          for id", newIds)
+        return newIds, strs
+            
     return newIds, newStrs
 
 


With that I believe "Failing check: C format" should match before/after
your changes.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|