po: minimize & canonicalize translations stored in git

Similar to the libvirt.pot, .po files contain line numbers and file
names identifying where in the source a translatable string comes from.
The source locations in the .po files are thrown away and replaced with
content from the libvirt.pot whenever msgmerge is run, so this is not
precious information that needs to be stored in git.

When msgmerge processes a .po file, it will add in any msgids from the
libvirt.pot that were not already present. Thus, if a particular msgid
currently has no translation, it can be considered redundant and again
does not need storing in git.

When msgmerge processes a .po file and can't find an exact existing
translation match, it will try to do fuzzy matching instead, marking such
entries with a "#, fuzzy" comment to alert the translator to take a
look and either discard, edit or accept the match. Looking at the
existing fuzzy matches in .po files shows that the quality is awful,
with many having a completely different set of printf format specifiers
between the msgid and fuzzy msgstr entry. Fortunately when msgfmt
generates the .gmo, the fuzzy entries are all ignored anyway. The fuzzy
entries could be useful to translators if they were working on the .po
files directly from git, but Libvirt outsourced translation to the
Fedora Zanata system, so keeping fuzzy matches in git is not much help.
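
To illustrate the format specifier problem described above, here is a
minimal Python sketch that is not part of this patch. It compares the
printf-style conversions in a msgid with those in its msgstr. The names
format_specs and specs_match are hypothetical, and the regular expression
is a simplification: it ignores the space flag and compares positional
forms such as "%1$s" literally.

```python
import re

# Simplified pattern for printf-style conversions: optional positional
# argument ("1$"), flags, width, precision, length modifier, conversion.
# "%%" is matched as well so that it can be filtered out as a literal.
PRINTF_SPEC = re.compile(
    r"%(?:\d+\$)?[-+#0']*(?:\*|\d+)?(?:\.(?:\*|\d+))?"
    r"(?:hh|h|ll|l|j|z|t|L)?[diouxXeEfFgGaAcspn%]"
)


def format_specs(text):
    """Return the conversion specifiers in text, ignoring literal '%%'."""
    return [spec for spec in PRINTF_SPEC.findall(text) if spec != "%%"]


def specs_match(msgid, msgstr):
    """True if both strings use the same specifiers, in any order."""
    return sorted(format_specs(msgid)) == sorted(format_specs(msgstr))
```

A fuzzy entry pairing "Unable to read %s" with a msgstr containing "%d"
would be reported as a mismatch by specs_match, which is the kind of
entry that makes the existing fuzzy matches unsafe to use.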

Finally, by default msgids are sorted based on source location. Thus, if
a bit of code with translatable text is moved from one file to another,
it may shift around in the .po file, despite the msgid not itself changing.
If the msgids were sorted alphabetically, the .po files would have
stable ordering when code is refactored.
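
The effect of the two orderings can be shown with a small Python sketch.
The entries and file names below are made up for illustration, and the
plain string sort on locations only approximates what the gettext tools
do; the point is that only the alphabetical ordering is unaffected when a
string moves between source files.

```python
# Hypothetical (source location, msgid) pairs before a refactoring.
before = [
    ("src/a.c:10", "Unable to open file"),
    ("src/b.c:20", "Domain not found"),
]

# The same strings after "Unable to open file" moved to src/z.c.
after = [
    ("src/z.c:5", "Unable to open file"),
    ("src/b.c:20", "Domain not found"),
]


def by_location(entries):
    """Order msgids by source location, roughly the default behaviour."""
    return [msgid for _location, msgid in sorted(entries)]


def by_msgid(entries):
    """Order msgids alphabetically, independent of source location."""
    return sorted(msgid for _location, msgid in entries)
```

Here by_location(before) and by_location(after) differ, so the .po file
would churn, while by_msgid gives the same list for both.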

This patch takes advantage of the above observations to canonicalize
and minimize the content stored for .po files in git. Instead of storing
the real .po files, we now store .mini.po files.

The .mini.po files are the same file format as .po files, but have no
source location comments, are sorted alphabetically, and all fuzzy
msgstrs and msgids with no translation are discarded. This cuts the size
of content in the po directory from 109MB to 19MB.
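
As a rough illustration of what ends up in a .mini.po file, the Python
sketch below applies the same filtering to .po text held in a string. It
is not the script added by this patch and not a line-by-line port of it:
the names minimize_po and is_translated are hypothetical, entries are
assumed to be separated by blank lines, and sorting is left out because
msgmerge handles that.

```python
import re


def is_translated(lines):
    """True if a msgstr line, or a continuation after it, is non-empty."""
    seen_msgstr = False
    for line in lines:
        if line.startswith("msgstr"):
            seen_msgstr = True
        if seen_msgstr and re.search(r'".+"', line):
            return True
    return False


def minimize_po(text):
    """Return .po text without locations, fuzzy, obsolete or empty entries."""
    kept = []
    for block in re.split(r"\n{2,}", text.strip("\n")):
        lines = block.split("\n")
        if any(line.startswith("#~") for line in lines):
            continue  # obsolete entry, no longer present in the template
        if any(line.startswith("#,") and "fuzzy" in line for line in lines):
            continue  # fuzzy match, ignored by msgfmt anyway
        if not is_translated(lines):
            continue  # msgid with no translation
        # Drop "#:" source location comments from the entries that remain.
        kept.append("\n".join(l for l in lines if not l.startswith("#:")))
    return "\n\n".join(kept) + "\n"
```

The header entry survives this filter because its msgstr is followed by
non-empty continuation lines.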

Users working from a libvirt git checkout who need the full .po files
can run "make update-po", which merges the libvirt.pot and .mini.po
file to create a .po file containing all the content previously stored
in git.
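
Conceptually, the merge performed by "make update-po" can be pictured
with the following Python sketch. It is only a model of the idea, with
hypothetical names and data structures: the real work is done by
msgmerge, which also handles headers, plural forms and obsolete entries.

```python
def merge_po(pot_entries, translations):
    """Rebuild full entries from the template and the known translations.

    pot_entries:  list of (location, msgid) pairs, in template order
    translations: dict mapping msgid to msgstr, as kept in a .mini.po
    """
    return [
        (location, msgid, translations.get(msgid, ""))
        for location, msgid in pot_entries
    ]


# Hypothetical template content and minimized translations.
pot = [
    ("src/a.c:10", "Domain not found"),
    ("src/a.c:20", "Unable to open %s"),
]
mini = {"Domain not found": "Domaine introuvable"}

full = merge_po(pot, mini)
```

Every msgid from the template appears in the result with its source
location, and msgids without a stored translation come back with an
empty msgstr, which is why neither needs to be kept in git.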

Conversely, if a full .po file has been modified, for example, by
downloading new content from Zanata, the .mini.po files can be updated
by running "make update-mini-po". The resulting diffs of the .mini.po
file will clearly show the changed translations without any of the noise
that previously obscured content. Being able to see content changes
clearly actually identified a bug in the zanata python client where it
was adding bogus "fuzzy" annotations to many messages:

  https://bugzilla.redhat.com/show_bug.cgi?id=1564497

Users working from libvirt releases should not see any difference in
behaviour, since the tarballs only contain the full .po files, not the
.mini.po files.

As an added benefit, generating tarballs with "make dist" will no
longer cause creation of dirty files in git, since it won't touch the
.mini.po files, only the .po files which are no longer kept in git.

To avoid creating a single commit 100+MB in size, each language is
minimized separately in a following commit.

Reviewed-by: Ján Tomko <jtomko@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Daniel P. Berrangé 2018-04-04 19:45:40 +01:00
parent d3034944e4
commit a3857dbeeb
4 changed files with 102 additions and 21 deletions

.gitignore

@@ -101,6 +101,9 @@
 /mingw-libvirt.spec
 /mkinstalldirs
 /po/*gmo
+/po/*po
+!/po/*.mini.po
+/po/*pot
 /proxy/
 /python/
 /run

build-aux/minimize-po.pl Executable file

@@ -0,0 +1,37 @@
+#!/usr/bin/perl
+my @block;
+my $msgstr = 0;
+my $empty = 0;
+my $unused = 0;
+my $fuzzy = 0;
+while (<>) {
+    if (/^$/) {
+        if (!$empty && !$unused && !$fuzzy) {
+            print @block;
+        }
+        @block = ();
+        $msgstr = 0;
+        $fuzzy = 0;
+        push @block, $_;
+    } else {
+        if (/^msgstr/) {
+            $msgstr = 1;
+            $empty = 1;
+        }
+        if (/^#.*fuzzy/) {
+            $fuzzy = 1;
+        }
+        if (/^#~ msgstr/) {
+            $unused = 1;
+        }
+        if ($msgstr && /".+"/) {
+            $empty = 0;
+        }
+        push @block, $_;
+    }
+}
+if (@block && !$empty && !$unused) {
+    print @block;
+}


@@ -20,6 +20,8 @@ POTFILE := $(DOMAIN).pot
 POFILES := $(LANGS:%=%.po)
 GMOFILES := $(LANGS:%=%.gmo)
+MAINTAINERCLEANFILES = $(POTFILE) $(POFILES) $(GMOFILES)
 EXTRA_DIST = \
 	POTFILES \
 	$(POTFILE) \
@@ -46,26 +48,26 @@ SED_PO_FIXUP_ARGS = \
 	-e "s|Copyright (C) YEAR|Copyright (C) $$(date +'%Y')|" \
 	$(NULL)
-# Although they're in EXTRA_DIST, we still need to
-# copy these again, because update-gmo will change
-# their content, and dist-hook runs after the
-# things in EXTRA_DIST are copied.
-dist-hook: $(GMOFILES)
-	cp -f $(POTFILE:%=$(srcdir)/%) $(distdir)/
-	cp -f $(POFILES:%=$(srcdir)/%) $(distdir)/
-	cp -f $(GMOFILES:%=$(srcdir)/%) $(distdir)/
 update-po: $(POFILES)
 update-gmo: $(GMOFILES)
+update-mini-po: $(POTFILE)
+	for lang in $(LANGS); do \
+	  echo "Minimizing $$lang content" && \
+	  $(MSGMERGE) --no-location --no-fuzzy-matching --sort-output \
+	    $$lang.po $(POTFILE) | \
+	  $(SED) $(SED_PO_FIXUP_ARGS) | \
+	  $(PERL) $(top_srcdir)/build-aux/minimize-po.pl > \
+	    $(srcdir)/$$lang.mini.po ; \
+	done
 push-pot: $(POTFILE)
 	zanata push --push-type=source
 pull-po: $(POTFILE)
 	zanata pull --create-skeletons
 	$(MAKE) update-po
+	$(MAKE) update-mini-po
 	$(MAKE) update-gmo
 $(POTFILE): POTFILES $(POTFILE_DEPS)
@@ -74,9 +76,9 @@ $(POTFILE): POTFILES $(POTFILE_DEPS)
 	$(SED) $(SED_PO_FIXUP_ARGS) < $@-t > $@
 	rm -f $@-t
-%.po: $(POTFILE)
-	cd $(srcdir) && \
-	  $(MSGMERGE) --backup=off --no-fuzzy-matching --update $@ $(POTFILE)
+%.po: %.mini.po $(POTFILE)
+	$(MSGMERGE) --no-fuzzy-matching $< $(POTFILE) | \
+	  $(SED) $(SED_PO_FIXUP_ARGS) > $@
 %.gmo: %.po
 	rm -f $(srcdir)/$@ $@-t


@@ -7,17 +7,39 @@ file formats, in combination with the Zanata web service.
 Source repository
 =================
-The libvirt GIT repository stores the master "libvirt.pot" file and full "po"
-files for translations. The master "libvirt.pot" file can be re-generated using
+The libvirt GIT repository does NOT store the master "libvirt.pot" file, nor
+does it store full "po" files for translations. The master "libvirt.pot" file
+can be generated at any time using
     make libvirt.pot
-The full po files can have their source locations and msgids updated using
+The translations are kept in minimized files that are the same file format
+as normal po files but with all redundant information stripped and messages
+re-ordered. The key differences between the ".mini.po" files in GIT and the
+full ".po" files are
+- msgids with no current translation are omitted
+- msgids are sorted in alphabetical order not source file order
+- msgids with a msgstr marked "fuzzy" are discarded
+- source file locations are omitted
+The full po files can be created at any time using
     make update-po
-Normally these updates are only done when either refreshing translations from
-Zanata, or when creating a new release.
+This merges the "libvirt.pot" with the "$LANG.mini.po" for each language, to
+create the "$LANG.po" files. These are included in the release archives created
+by "make dist".
+When a full po file is updated, changes can be propagated back into the
+minimized po files using
+    make update-mini-po
+Note, however, that this is generally not something that should be run by
+developers normally, as it is triggered by 'make pull-po' when refreshing
+content from Zanata.
 Zanata web service
 ==================
@@ -32,5 +54,22 @@ directly to libvirt GIT. Any changes made to "$LANG.mini.po" files in libvirt
 GIT will be overwritten and lost the next time content is imported from Zanata.
 The master "libvirt.pot" file is periodically pushed to Zanata to provide the
-translation team with content changes. New translated text is then periodically
-pulled down from Zanata to update the po files.
+translation team with content changes, using
+    make push-pot
+New translated text is then periodically pulled down from Zanata to update the
+minimized po files, using
+    make pull-po
+Sometimes the translators make mistakes, most commonly with handling printf
+format specifiers. The "pull-po" command re-generates the .gmo files to try to
+identify such mistakes. If a mistake is made, the broken msgstr should be
+deleted in the local "$LANG.mini.po" file, and the Zanata web interface used
+to reject the translation so that the broken msgstr isn't pulled down next time.
+After pulling down new content the diff should be examined to look for any
+obvious mistakes that are not caught automatically. There have been bugs in
+Zanata tools which caused messages to go missing, so pay particular attention to
+diffs showing deletions where the msgid still exists in libvirt.pot