Re: [Alpine-info] Re: Alpine is converting á to ??
Joshua Daniel Franklin
jdf.lists at gmail.com
Mon Dec 1 14:02:24 PST 2008
On Mon, 1 Dec 2008, Mike Miller wrote:
> $ echo Chavez | perl -pe 's/a/\341/g' | \alpine ""
I don't think that's UTF-8. What you want is:
echo Chavez | perl -pe 's/a/\303\241/g'
Chávez
# or alternatively:
echo -e 'Ch\xc3\xa1vez'
Chávez
(That's on Mac Terminal ssh'd to RHEL5 with LANG=en_US.UTF-8)
I think you've got U+00E1 (aka perl's \341, the code point in octal) mixed up
with the actual UTF-8 bytes.
For completeness, here is the unum output for á:
unum á
   Octal  Decimal      Hex        HTML    Character   Unicode
    0341      225     0xE1    &aacute;    "á"         LATIN SMALL LETTER A WITH ACUTE
You can use the 0xE1 to find the character on this Unicode website:
http://www.fileformat.info/info/unicode/char/00e1/index.htm
This page tells you that the UTF-8 bytes for U+00E1 are:
UTF-8 (hex) 0xC3 0xA1 (c3a1)
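If you want to see the difference without trusting the terminal, you can dump the raw bytes. This is only a rough sketch: hex_bytes is a helper name I made up, and it assumes od and tr are available. printf with octal escapes is used instead of echo -e so it behaves the same in any POSIX shell.

```shell
# hex_bytes: hypothetical helper, prints stdin as one run of hex digits
hex_bytes() {
    od -An -v -tx1 | tr -d ' \n'
}

# One byte E1: the code point written as a single ISO-8859-1 byte
printf 'Ch\341vez' | hex_bytes; echo        # 4368e176657a

# Two bytes C3 A1: the UTF-8 encoding of U+00E1
printf 'Ch\303\241vez' | hex_bytes; echo    # 4368c3a176657a
```

The first line is what the original perl substitution produces; only the second is valid UTF-8.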
Here's some boilerplate I keep around for Unicode/terminal/LANG
problems:
echo -e '\xc2\xa3' # that's UTF-8 for a sterling symbol.
If you see capital-a-circumflex and a sterling symbol you're
using ISO-8859-1.
echo -e '\xc2\xa3' > pound-symbol
cat pound-symbol
vi pound-symbol
If vi (or nano or whatever) doesn't show a sterling symbol but the file displays
OK when you cat it, then vi/nano is messing you about. If the file contains
something else, it'll be interesting to know what, because then the terminal
emulator is broken.
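To take the terminal and the editor out of the picture entirely, you can compare the file against the bytes it ought to hold. A minimal sketch, assuming the file was written by the echo above on a UTF-8 system (so it should be C2 A3 plus a newline); pound_file_ok is a name I made up for illustration.

```shell
# pound_file_ok: hypothetical helper, succeeds only if the file is
# exactly the bytes C2 A3 0A (UTF-8 sterling symbol plus newline)
pound_file_ok() {
    printf '\302\243\n' | cmp -s - "$1"
}

if [ -f pound-symbol ]; then
    if pound_file_ok pound-symbol; then
        echo "pound-symbol holds the expected UTF-8 bytes"
    else
        echo "pound-symbol holds something else:"
        od -An -tx1 pound-symbol    # show what is really in there
    fi
fi
```

If the bytes are right but the display is wrong, the problem is in whatever is doing the displaying.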
Note that the HTML hex entity reference &#xA3; drops the 'C2' part because it
names the code point (U+00A3), not the UTF-8 bytes. The UTF-8 for Unicode
128-2047 (0x80-0x07FF) is spread across two bytes. The first byte will have the
two high bits set and the third bit clear (i.e. 0xC2 to 0xDF). The second byte
will have the top bit set and the second bit clear (i.e. 0x80 to 0xBF). These
bit patterns keep multi-byte sequences distinct from ASCII and make stray
ISO-8859-1 bytes easy to spot.
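The two-byte rule can be worked through with shell arithmetic. This is just a sketch under the assumption that the code point is in the two-byte range 0x80 to 0x7FF; utf8_two_byte is a made-up name, not part of any tool mentioned here.

```shell
# utf8_two_byte: hypothetical helper, prints the UTF-8 bytes (in hex)
# for a code point in the two-byte range 0x80..0x7FF
utf8_two_byte() {
    cp=$(( $1 ))
    if [ "$cp" -lt 128 ] || [ "$cp" -gt 2047 ]; then
        echo "code point outside the two-byte range" >&2
        return 1
    fi
    lead=$(( 0xC0 | (cp >> 6) ))     # 110xxxxx carries the top 5 bits
    cont=$(( 0x80 | (cp & 0x3F) ))   # 10xxxxxx carries the low 6 bits
    printf '%02x %02x\n' "$lead" "$cont"
}

utf8_two_byte 0xE1    # á  -> c3 a1
utf8_two_byte 0xA3    # £  -> c2 a3
```

The smallest input (0x80) gives a lead byte of 0xC2 and the largest (0x7FF) gives 0xDF, which is where the 0xC2 to 0xDF range comes from.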
By the way, there's a lot of confusion that UTF-8 is "wasting" bits vs ASCII;
for chars 0-127 UTF-8 is one byte--same as ASCII. So for virtually all your
code and scripts UTF-8 == ASCII. The exceptions are things like MS Word Smart
Quotes which should be exorcised anyway. But to detect those exceptions you
need to be using Unicode-aware tools that will complain when your files are
mixing ISO-8859-1 and UTF-8.
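One cheap way to get a tool to complain is to round-trip the file through iconv. This is a sketch, not a full validator: is_valid_utf8 is a made-up name, and it assumes your iconv exits non-zero on an illegal input sequence (GNU iconv does; I haven't checked every implementation).

```shell
# is_valid_utf8: hypothetical helper, succeeds if the file decodes as UTF-8.
# Assumes iconv returns a non-zero status on invalid input.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Report every file on the command line that is not clean UTF-8
for f in "$@"; do
    is_valid_utf8 "$f" || echo "$f: not valid UTF-8 (stray ISO-8859-1 bytes?)"
done
```

Pure ASCII files pass too, since ASCII is a subset of UTF-8.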
If you end up messing with this a lot, there are two excellent tools I
use all the time:
* unum.pl which allows you to look up Unicode and HTML characters by
name or number
http://www.fourmilab.ch/webtools/unum/
Note that you need to convert from the Unicode number to whatever encoding
(like UTF-8) you're using.
* FeedParser (RSS/Atom, but you can wrap any file) which has a "bozo
bit" for wrong encodings
http://feedparser.org/docs/character-encoding.html
http://www.fileformat.info/info/unicode/utf8.htm