Re: Unicode under Mac OS
From:
Tom Christiansen
Date:
May 27, 2009 19:23
Subject:
Re: Unicode under Mac OS
Message ID:
31559.1243477313@chthon
SUMMARY:
· When you're using Unicode data, try Unicode sorting.
· Don't trust locales all that much. They often suck.
Alberto Simões, do Departamento de Informática,
Universidade do Minho Campus de Gualtar, Braga
Portugal, wrote:
» My perl 5.10 (check below for full conf
» settings) is not working properly with unicode:
Happens to the best of us. :-)
» #!/usr/bin/perl
»
» use utf8;
» use POSIX qw(locale_h);
» setlocale(LC_COLLATE, "pt_PT.UTF-8");
» setlocale(LC_CTYPE, "pt_PT.UTF-8");
» use locale;
» binmode(STDOUT, ":utf8");
»
» @a = qw.ý a é í o ú ã é z y.;
» print join("|",sort @a),"\n";
» ---------------------------------
» prints a|o|y|z|ã|é|é|í|ú|ý
» But under linux (perl 5.10 as well)
» prints a|ã|é|é|í|o|ú|y|ý|z
» Any hint?
» [ambs@rachmaninoff ProjectoDicionario]$ locale -a |grep pt_PT
» pt_PT
» pt_PT.ISO8859-1
» pt_PT.ISO8859-15
» pt_PT.UTF-8
Many--I hope.
First off, this may or may not be a problem, but your program
came across the wire with ISO8859-1 literals, and was marked as
ISO-8859-1, but was saying internally that it was in UTF-8.
Something may have been lost in translation, because otherwise
when run, it produces nonsense about malformed UTF-8.
BTW, if you but use the pt_PT.ISO8859-1 locale on the Mac
(since you seem to have just 8859-1 data), it turns out to
work perfectly well, at least with the data you provided:
use POSIX qw(locale_h);
setlocale(LC_COLLATE, "pt_PT.ISO8859-1");
setlocale(LC_CTYPE, "pt_PT.ISO8859-1");
use locale;
binmode(STDOUT, ":encoding(ISO8859-1)");
@a = qw.ý a é í o ú ã é z y.;
print join("|",sort @a),"\n";
This now prints a|ã|é|é|í|o|ú|y|ý|z for me on the Mac.
Secondly...
Alas! Saying C< use utf8 > is neither necessary
nor sufficient to guarantee that you actually
have utf8 characters--or semantics. Similarly,
so too with setting LC_COLLATE: maybe not enough.
¡¡¡ So very sorry !!! :-{
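To make that concrete, here is a minimal sketch using the core Encode module (my choice for illustration; the program above does not use it). The pragma only tells perl how to read literals in the source file, so data that arrives as bytes still has to be decoded before perl treats it as characters:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two UTF-8 bytes that encode e-acute, as they would arrive
# from a file or a socket.
my $bytes = "\xC3\xA9";

# Until decoded, perl sees two separate characters, not one letter.
my $chars = decode("UTF-8", $bytes);

my $byte_len = length $bytes;   # counts bytes
my $char_len = length $chars;   # counts characters

printf "undecoded length %d, decoded length %d\n", $byte_len, $char_len;
```

Sorting the undecoded form compares bytes, whatever pragmas are in effect.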
I'd also be a bit more comfortable seeing
something more along these lines:
#!/usr/bin/env perl5.10.0
use 5.10.0;
use strict;
use warnings;
use encoding "latin1";
use POSIX qw[ :locale_h ];
our $LOC_PT;
BEGIN {
$LOC_PT = "pt_PT.ISO8859-1";
my $retstr;
if ($retstr = setlocale(LC_COLLATE, $LOC_PT)) {
# say "setlocale LC_COLLATE to $LOC_PT returned $retstr";
} else {
die "can't setlocale LC_COLLATE to $LOC_PT: $!"
}
if ($retstr = setlocale(LC_CTYPE, $LOC_PT)) {
# say "setlocale LC_CTYPE to $LOC_PT returned $retstr";
} else {
die "can't setlocale LC_CTYPE to $LOC_PT: $!"
}
}
use locale;
my @letras = split(/\s+/, "\x{FD} a \x{E9} \x{ED} o \x{FA} \x{E3} \x{E9} z y");
printf "[ %s ] sort to [ %s ]\n",
join(" " => @letras),
join(" " => sort @letras);
# now show that it works externally, too
$ENV{LC_CTYPE} = $ENV{LC_COLLATE} = $LOC_PT;
open (SORTER, "| sort") || die "can't open pipe to sort: $!";
binmode(SORTER, ":encoding(latin1)")
|| die "can't binmode to :encoding(latin1): $!";
for (@letras) { say SORTER }
close(SORTER) || die "can't close pipe to sort: $!";
However, you are in a very real sense correct that
the PT UTF-8 locale under Leopard seems "broken".
It may be even worse than you thought, though.
Witness:
Mac% locate pt_PT.UTF-8
/usr/share/locale/pt_PT.UTF-8
/usr/share/locale/pt_PT.UTF-8/LC_COLLATE
/usr/share/locale/pt_PT.UTF-8/LC_CTYPE
/usr/share/locale/pt_PT.UTF-8/LC_MESSAGES
/usr/share/locale/pt_PT.UTF-8/LC_MESSAGES/LC_MESSAGES
/usr/share/locale/pt_PT.UTF-8/LC_MONETARY
/usr/share/locale/pt_PT.UTF-8/LC_NUMERIC
/usr/share/locale/pt_PT.UTF-8/LC_TIME
Now the tragic part:
Mac% ls -l /usr/share/locale/pt_PT.UTF-8/LC_COLLATE
lrwxr-xr-x 1 root wheel 28 Nov 7 2008 /usr/share/locale/pt_PT.UTF-8/LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE
Which is pretty (sub-)par for the course, I'm afraid.
Look how many are screwed up that same way:
Mac% find /usr/share/locale -name LC_COLLATE -ls | wc -l
206
Mac% find /usr/share/locale -name LC_COLLATE -ls | grep -c US-ASCII
122
I would never trust a vendor's LC_COLLATE. *Which* sort
of collation are they speaking of? Just for one thing,
you don't usually use the same for a dictionary as you
do for a phone book. Think about "sort -df" for example.
I HAVE GOOD NEWS FOR YOU: if you just want to do something
exceedingly simple, then there's a pretty easy way to get
that done. This works:
#!/usr/bin/env perl5.10.0
use 5.10.0;
use strict;
use warnings;
use Unicode::Collate;
my @a = qw[ ý a é í o ú ã é z y ];
my @sa = Unicode::Collate->new->sort(@a);
printf "[ %s ] sorts to [ %s ]\n",
join(" " => @a),
join(" " => @sa);
Here's a larger suite/demo where I create a PT_sorter
object and repeatedly use it.
#!/usr/bin/env perl5.10.0
use 5.10.0;
use strict;
use warnings;
use encoding "latin1", STDOUT => "utf8";
use List::Util qw[ shuffle ];
use Unicode::Collate;
my $DEBUG = 0;
my $PT_sorter = Unicode::Collate::->new();
$| = 1;
srand(42);
my @tests = (
[qw[ aba abá abacá abaçai abacalhoar abaçanar abacate ]],
[qw[ abominável abomínio abominoso abonação abonado ]],
[qw[ sequela sequencial sequências seqüências ]],
[qw[ avo avó avô avoação avoaçar avoaçasse avoaçásseis
avoaçassem avoaçássemos avondo avós ]],
[qw[ eferente eferescer efesíaco efésio éfetas
eficácia eficaz eficiência eficiente ]],
[qw[ economia económico econômico económicos economismo ]],
[qw[ falamos falámos falar falara falará falaram
faláramos falarem falaremos falaríamos falárica
falarmos falaz falem falemos falhado falo falou ]],
[qw[ faça façanha facão facção facções facha
facílimo facneia faço faz fazeis fazem fazer
fazermonos fazermos fazes fazível ]],
);
my $testno = 0;
for ($testno = 0; $testno < @tests; $testno++) {
print "test $testno... ";
my @set = @{ $tests[$testno] };
my %order;
for (my $i = 0; $i < @set; $i++) {
$order{ $set[$i] } = $i;
}
my @shuffled = ();
my $try = 0;
do {
@shuffled = shuffle(@set);
print "\b. " if "@shuffled" eq "@set";
} until ("@shuffled" ne "@set" || ++$try >= 20);
if ("@shuffled" eq "@set") {
die "couldn't shuffle @set after 20 tries";
}
my @sorted = $PT_sorter->sort(@shuffled);
print "\n\tsorted:\t@shuffled\n\tinto:\t@sorted\n" if $DEBUG;
my @sindices = ();
for (@sorted) { push @sindices, $order{$_} };
my $idx_wanted = join(" " => 0 .. $#sindices);
my $idx_gotten = join(" " => @sindices);
if ($idx_wanted eq $idx_gotten) {
print "ok\n";
} else {
print <<"EO_OOPS";
not ok:
\twanted\t$idx_wanted
\t\t(@set)
\tbut got\t$idx_gotten
\t\t(@sorted)
EO_OOPS
}
}
When run without debuggery enabled, it produces:
test 0... ok
test 1... ok
test 2... ok
test 3... ok
test 4... ok
test 5... ok
test 6... ok
test 7... ok
or with debugging, this:
test 0...
sorted: abacate abaçai abacalhoar abá aba abacá abaçanar
into: aba abá abacá abaçai abacalhoar abaçanar abacate
ok
test 1...
sorted: abominável abonação abonado abomínio abominoso
into: abominável abomínio abominoso abonação abonado
ok
test 2...
sorted: sequela sequências seqüências sequencial
into: sequela sequencial sequências seqüências
ok
test 3...
sorted: avós avoaçar avoaçássemos avó avô avoaçásseis avoação avondo avoaçasse avoaçassem avo
into: avo avó avô avoação avoaçar avoaçasse avoaçásseis avoaçassem avoaçássemos avondo avós
ok
test 4...
sorted: éfetas efesíaco efésio eficiente eferescer eficácia eficiência eficaz eferente
into: eferente eferescer efesíaco efésio éfetas eficácia eficaz eficiência eficiente
ok
test 5...
sorted: económico econômico economismo economia económicos
into: economia económico econômico económicos economismo
ok
test 6...
sorted: falaríamos falará falou falem falo falhado falamos falaremos falarem falaram faláramos falámos falar falarmos falaz falara falemos falárica
into: falamos falámos falar falara falará falaram faláramos falarem falaremos falaríamos falárica falarmos falaz falem falemos falhado falo falou
ok
test 7...
sorted: facha fazermonos facções fazeis fazermos fazer façanha fazes fazem faz faço facílimo faça facneia fazível facão facção
into: faça façanha facão facção facções facha facílimo facneia faço faz fazeis fazem fazer fazermonos fazermos fazes fazível
ok
However, consider data where precomposed characters are sometimes
mixed with decomposed (base letter plus combining mark) forms, like this:
"economia",
"econ\N{o_acute}mica",
"econo\N{CB_circ}micas",
"econo\N{CB_acute}mico",
"econ\N{o_circ}micos",
"economismo"
where those are named character aliases enabled via:
use charnames qw[ :full :alias ] => {
o_acute => "LATIN SMALL LETTER O WITH ACUTE",
o_circ => "LATIN SMALL LETTER O WITH CIRCUMFLEX",
CB_circ => "COMBINING CIRCUMFLEX ACCENT",
CB_acute => "COMBINING ACUTE ACCENT",
};
You still want that to show up looking like:
economia económica econômicas económico econômicos economismo
Well, it will just magically work right. Yah!
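A small sketch of why that works, assuming the collator's default normalization is in effect (it is when Unicode::Normalize is installed, which it is in a stock perl). The strings are spelled with \x{} escapes rather than charnames aliases purely to keep the example short:

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new;

# U+00F3 is precomposed o-acute; "o" followed by U+0301 is the
# decomposed spelling of the same letter.
my $precomposed = "econ\x{F3}mica";
my $decomposed  = "econo\x{301}mica";

# As plain strings they differ, but the collator weighs them identically.
my $same_string = $precomposed eq $decomposed ? 1 : 0;
my $same_weight = $collator->eq($precomposed, $decomposed) ? 1 : 0;

my @words = (
    "economismo",
    "econo\x{302}micas",   # decomposed o-circumflex
    "econ\x{F3}mica",      # precomposed o-acute
    "econ\x{F4}micos",     # precomposed o-circumflex
    "economia",
    "econo\x{301}mico",    # decomposed o-acute
);
my @sorted = $collator->sort(@words);

binmode(STDOUT, ":encoding(UTF-8)");
print "same string: $same_string, same weight: $same_weight\n";
print "@sorted\n";
```

The sort hands back the original strings untouched; only the comparison sees the normalized form.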
You are *so* lucky you're dealing with Portuguese: you get off easy!
It orders (as of 2009) just as English does with 26 letters A..Z,
and it disregards diacriticals unless there's a tie, in which
case the unadorned letter precedes the one with a marking, in a
normal left to right fashion.
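That rule can be watched in action with a sketch like the following. The level option restricts the comparison to base letters only; the variable names are mine:

```perl
use strict;
use warnings;
use Unicode::Collate;

my $full    = Unicode::Collate->new;               # all comparison levels
my $primary = Unicode::Collate->new(level => 1);   # base letters only

my ($avo, $avo_acute, $avos) = ("avo", "av\x{F3}", "av\x{F3}s");

# At level 1 the accent is invisible, so the two words tie.
my $tie = $primary->eq($avo, $avo_acute) ? 1 : 0;

# The full comparison breaks the tie: the unadorned letter comes first.
my $plain_first = $full->cmp($avo, $avo_acute);

# A real letter difference outranks any accent, so "avondo"
# precedes the accented "avos".
my $letters_win = $full->cmp("avondo", $avos);

print "tie at level 1: $tie\n";
print "plain vs accented: $plain_first\n";
print "avondo vs accented avos: $letters_win\n";
```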
While Spanish also discards accents except for ties, it doesn't
count the tilde over the N as an accent--it's a whole new letter.
In Portuguese, as I know you know but other readers may not,
the til(de) over an A or an O is but a diacritic ("accent
mark") for stress and nasalization, so doesn't count as
anything special there.
But not so everywhere in Iberia! In Castilian and Galician,
the letter Ñ falls after N and before O, making *this* the
proper ordering of these words in Spanish:
radio ráfaga rana ranúnculo raña rápido rastrillo
You therefore must create your sorter this way:
$ES_sorter = Unicode::Collate->new(entry => <<'END_SPANISH_ENTRY');
00F1 ; [.112B.0020.0002.00F1] # n-tilde
006E 0303 ; [.112B.0020.0002.00F1] # n-tilde
00D1 ; [.112B.0020.0008.00D1] # N-tilde
004E 0303 ; [.112B.0020.0008.00D1] # N-tilde
END_SPANISH_ENTRY
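The weights in a table like that are tied to the particular collation table the module ships with, so they may need adjusting from release to release. As an alternative sketch, assuming a Unicode::Collate recent enough to include the Unicode::Collate::Locale module (which this message does not otherwise use), the Spanish tailoring can be requested by locale name instead:

```perl
use strict;
use warnings;
use Unicode::Collate;
use Unicode::Collate::Locale;

my @palabras = (
    "ra\x{F1}a", "rastrillo", "rana", "r\x{E1}pido",
    "radio", "ran\x{FA}nculo", "r\x{E1}faga",
);

# Untailored: n-tilde is just an accented N, so it lands
# right after "rana".
my @default = Unicode::Collate->new->sort(@palabras);

# Tailored for Spanish: n-tilde is a letter of its own
# between N and O.
my $ES_sorter = Unicode::Collate::Locale->new(locale => "es");
my @spanish   = $ES_sorter->sort(@palabras);

binmode(STDOUT, ":encoding(UTF-8)");
print "default: @default\n";
print "spanish: @spanish\n";
```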
Again you luck out working in Portuguese. When you see
a Ç in either Portuguese or French, well, it's just a C
with a diacritical.
No big deal.
But in Catalan, it makes for a whole new letter, one coming
after C but before D. This leads to a Catalan sort object
declared like so:
$CA_sorter = Unicode::Collate->new(entry => <<'END_CATALAN_ENTRY');
00E7 ; [.0FFC.0020.0002.0063] # c-cedilla
0063 0327 ; [.0FFC.0020.0002.0063] # c-cedilla
00C7 ; [.0FFC.0020.0002.0043] # C-cedilla
0043 0327 ; [.0FFC.0020.0002.0043] # C-cedilla
END_CATALAN_ENTRY
Similarly in Spanish aka Castilian, prior to 1997 the standard
said that CH was its own letter of the alphabet (named "che")
falling after C and before D. That means "chocolate" comes
*AFTER* "color" in dictionaries before 1997, but before it in
those published later. What fun!
Also until that year of orthographic reform, they had
historically always decreed that LL was its own letter, one
falling after L and before M. Many people got confused about
whether to write "LLave" or "Llave". The second was and is
right, but the first turned up disturbingly often.
Still is sometimes -- "Next Exit to LLérida", or whatever.
So you'd have to create your sorter this way:
$ES_trad_sorter = Unicode::Collate->new(entry => <<'TRAD_SPANISH_ENTRY');
0063 0068 ; [.1000.0020.0002.0063] # ch
0043 0068 ; [.1000.0020.0007.0043] # Ch
0043 0048 ; [.1000.0020.0008.0043] # CH
006C 006C ; [.10F5.0020.0002.006C] # ll
004C 006C ; [.10F5.0020.0007.004C] # Ll
004C 004C ; [.10F5.0020.0008.004C] # LL
00F1 ; [.112B.0020.0002.00F1] # n-tilde
006E 0303 ; [.112B.0020.0002.00F1] # n-tilde
00D1 ; [.112B.0020.0008.00D1] # N-tilde
004E 0303 ; [.112B.0020.0008.00D1] # N-tilde
TRAD_SPANISH_ENTRY
In French, the accent marks are disregarded save for
tie-breaking, just like in Portuguese -- EXCEPT that
instead of going left-to-right as I'm pretty sure you
do in Portuguese, in French (but which French? :-), it
appears that they resolve ties by going right to left!
# Level 2 (diacrits) tie-breakers must be
# weighted by reverse order here:
$FR_sorter = Unicode::Collate->new(backwards => 2);
Can you believe it? That means that, for example, using
made-up words:
WRONG: bebe bebé bébe bébé
RIGHT: bebe bébe bebé bébé
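Here is a quick sketch that reproduces both lines with the same made-up words, comparing an untailored collator against one built with backwards => 2:

```perl
use strict;
use warnings;
use Unicode::Collate;

my @mots = ("b\x{E9}b\x{E9}", "bebe", "beb\x{E9}", "b\x{E9}be");

# Accent ties resolved left to right (the default).
my @forward = Unicode::Collate->new->sort(@mots);

# Accent ties resolved right to left, French style.
my @backward = Unicode::Collate->new(backwards => 2)->sort(@mots);

binmode(STDOUT, ":encoding(UTF-8)");
print "forward:  @forward\n";
print "backward: @backward\n";
```

With right-to-left tie-breaking, an accent on the last syllable counts for more than one on the first, so the word accented only on its first E moves ahead of the word accented only on its last.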
Then there's what to do about non-letters, like hyphens or
apostrophes. Are they part of the word? `sort -df` doesn't
think so, although it DOES count spaces. Most dictionaries
I use do not seem to, though.
That makes for sequences in PT like these:
avo avó avô avondo à-vontade avós
and
faca faça facalhão façalvo faca-marcador facaneia façanha
facão faca-sola facção facções facha facílimo faço fac-símile
faz fazeis fazê-lo fazem fazer fazermonos fazermos fazes
fazível faz-tudo
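In Unicode::Collate terms this is the "variable" option. A sketch comparing two of its settings on a few of those words follows; "shifted" is what the module uses when you say nothing, and the variable names are mine:

```perl
use strict;
use warnings;
use Unicode::Collate;

my @palavras = ("av\x{F3}s", "\x{E0}-vontade", "avondo", "avo");

# Hyphens and other punctuation are set aside when comparing letters.
my $ignoring = Unicode::Collate->new(variable => "shifted");

# Punctuation is weighed like any other character.
my $counting = Unicode::Collate->new(variable => "non-ignorable");

my @ignored = $ignoring->sort(@palavras);
my @counted = $counting->sort(@palavras);

binmode(STDOUT, ":encoding(UTF-8)");
print "hyphen ignored: @ignored\n";
print "hyphen counted: @counted\n";
```

When the hyphen counts, it sorts ahead of every letter, which drags the hyphenated phrase to the front of the list instead of leaving it between "avondo" and the accented "avos".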
What about case? Upper first, or lower? Or the same?
If you're sorting book-titles or place-names, shouldn't you
disregard a leading article? That is, strip off what would be
"The", "A", and "An" if it were in English?
But in Portuguese, you may wish to strip the article
contractions, too, I imagine.
I once needed to sort a bunch of "Spanish" city names (that is:
Castilian and Galician and Catalan toponyms), so wound up
cobbling together ("cobble" in that it was not quite right for
Galician, but was better with handling the many Catalan names,
since it counts Ç as its own letter) like so:
$Pueblo_Sorter = Unicode::Collate->new( entry => <<'END_ENTRY',
0063 0068 ; [.1000.0020.0002.0063] # ch
0043 0068 ; [.1000.0020.0007.0043] # Ch
0043 0048 ; [.1000.0020.0008.0043] # CH
006C 006C ; [.10F5.0020.0002.006C] # ll
004C 006C ; [.10F5.0020.0007.004C] # Ll
004C 004C ; [.10F5.0020.0008.004C] # LL
00E7 ; [.0FFC.0020.0002.0063] # c-cedilla
0063 0327 ; [.0FFC.0020.0002.0063] # c-cedilla
00C7 ; [.0FFC.0020.0002.0043] # C-cedilla
0043 0327 ; [.0FFC.0020.0002.0043] # C-cedilla
00F1 ; [.112B.0020.0002.00F1] # n-tilde
006E 0303 ; [.112B.0020.0002.00F1] # n-tilde
00D1 ; [.112B.0020.0008.00D1] # N-tilde
004E 0303 ; [.112B.0020.0008.00D1] # N-tilde
END_ENTRY
upper_before_lower => 1,
normalization => "NFKD",
preprocess => sub { # strip leading articles
local $_ = shift; # localize $_ for the substitutions below
s/^L'//; # Catalan
s{ ^ # remove leading articles etc
(?:
# Castilian
El
| Los
| La
| Las
# Catalan
| Els
| Les
| Sa
| Es
# Galego
| O
| Os
| A
| As
)
\s+
}{}x;
# strip various internal, low-importance particles
s/\b[dl]'//; # Catalan
s{
\b
(?:
el | los | la | las | de | del | y # ES
| els | les | i | sa | es | dels # CA
| o | os | a | as | do | da | dos | das # GAL
)
\b
}{}gx;
return $_;
},
) || die ...
Fun, eh?! :-)
When you've mixed data from three different languages,
operating under three conflicting sets of rules,
something just has to give. Oh well.
It actually worked out well for me here though.
What's the lesson in all this?
#1: If you're using Unicode data, use Unicode sorting.
#2: Don't trust locales so much. (Well, *I* don't.)
I actually don't know that LC_COLLATE can *ever* do what
UTS #10 (the Unicode collating standard) requires, but
I've never had as much luck with it as I've had with
the slowishly multipass but Correct-As-You-Can-Code-It
full-blown collator approach outlined above.
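If the speed ever bites, one possible mitigation (a sketch of mine, with made-up records) is to compute each sort key once with getSortKey and then compare the keys with a plain string comparison. That also makes it easy to sort whole records by one field:

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new;

# Hypothetical records to be ordered by their "palavra" field.
my @records = (
    { palavra => "fazer",       id => 1 },
    { palavra => "fa\x{E7}a",   id => 2 },
    { palavra => "facha",       id => 3 },
    { palavra => "fac\x{E3}o",  id => 4 },
);

# Compute every key exactly once, sort on the keys, then
# throw the keys away.
my @sorted =
    map  { $_->[1] }
    sort { $a->[0] cmp $b->[0] }
    map  { [ $collator->getSortKey($_->{palavra}), $_ ] }
    @records;

my @ids = map { $_->{id} } @sorted;
print "record ids in collation order: @ids\n";
```

The keys can also be stored alongside the data, provided they are regenerated whenever the module or its options change.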
Good luck--hope this helps give some ideas.
--tom
--
##########################################
# some useful(?) PT charname aliasings #
##########################################
use charnames qw[ :full latin :alias ] => {
CB_acute => "COMBINING ACUTE ACCENT",
CB_circ => "COMBINING CIRCUMFLEX ACCENT",
CB_tilde => "COMBINING TILDE",
CB_grave => "COMBINING GRAVE ACCENT",
CB_cedil => "COMBINING CEDILLA",
CB_trema => "COMBINING DIAERESIS",
A_acute => "LATIN CAPITAL LETTER A WITH ACUTE",
a_acute => "LATIN SMALL LETTER A WITH ACUTE",
A_grave => "LATIN CAPITAL LETTER A WITH GRAVE",
a_grave => "LATIN SMALL LETTER A WITH GRAVE",
A_tilde => "LATIN CAPITAL LETTER A WITH TILDE",
a_tilde => "LATIN SMALL LETTER A WITH TILDE",
E_acute => "LATIN CAPITAL LETTER E WITH ACUTE",
E_open => "LATIN CAPITAL LETTER E WITH ACUTE",
e_acute => "LATIN SMALL LETTER E WITH ACUTE",
e_open => "LATIN SMALL LETTER E WITH ACUTE",
E_circ => "LATIN CAPITAL LETTER E WITH CIRCUMFLEX",
E_closed => "LATIN CAPITAL LETTER E WITH CIRCUMFLEX",
e_circ => "LATIN SMALL LETTER E WITH CIRCUMFLEX",
e_closed => "LATIN SMALL LETTER E WITH CIRCUMFLEX",
E_tilde => "LATIN CAPITAL LETTER E WITH TILDE",
e_tilde => "LATIN SMALL LETTER E WITH TILDE",
I_acute => "LATIN CAPITAL LETTER I WITH ACUTE",
i_acute => "LATIN SMALL LETTER I WITH ACUTE",
O_acute => "LATIN CAPITAL LETTER O WITH ACUTE",
O_open => "LATIN CAPITAL LETTER O WITH ACUTE",
o_acute => "LATIN SMALL LETTER O WITH ACUTE",
o_open => "LATIN SMALL LETTER O WITH ACUTE",
O_circ => "LATIN CAPITAL LETTER O WITH CIRCUMFLEX",
O_closed => "LATIN CAPITAL LETTER O WITH CIRCUMFLEX",
o_circ => "LATIN SMALL LETTER O WITH CIRCUMFLEX",
o_closed => "LATIN SMALL LETTER O WITH CIRCUMFLEX",
O_tilde => "LATIN CAPITAL LETTER O WITH TILDE",
o_tilde => "LATIN SMALL LETTER O WITH TILDE",
U_acute => "LATIN CAPITAL LETTER U WITH ACUTE",
u_acute => "LATIN SMALL LETTER U WITH ACUTE",
U_trema => "LATIN CAPITAL LETTER U WITH DIAERESIS",
u_trema => "LATIN SMALL LETTER U WITH DIAERESIS",
C_cedil => "LATIN CAPITAL LETTER C WITH CEDILLA",
c_cedil => "LATIN SMALL LETTER C WITH CEDILLA",
};