An “encoding” describes the correspondence between CHARACTERs and raw bytes during input/output via STREAMs with STREAM-ELEMENT-TYPE CHARACTER.
An EXT:ENCODING is an object composed of the following facets:
character set: the CHARACTERs that can be represented and passed through the I/O channel, and the way these characters translate into raw bytes, i.e., the map between sequences of CHARACTER and (UNSIGNED-BYTE 8) in the form of STRINGs and (VECTOR (UNSIGNED-BYTE 8)) as well as character and byte STREAMs. In this context, for example, CHARSET:UTF-8 and CHARSET:UCS-4 are considered different, although they can represent the same set of characters.
line terminator mode: the way line terminators are represented in the raw bytes.
EXT:ENCODINGs are also TYPEs. As such, they represent the set of characters encodable in the character set. In this context, the way characters are translated into raw bytes is ignored, and the line terminator mode is ignored as well. TYPEP and SUBTYPEP can be used on encodings:
(SUBTYPEP CHARSET:UTF-8 CHARSET:UTF-16) ⇒ T ; ⇒ T
(SUBTYPEP CHARSET:UTF-16 CHARSET:UTF-8) ⇒ T ; ⇒ T
(SUBTYPEP CHARSET:ASCII CHARSET:ISO-8859-1) ⇒ T ; ⇒ T
(SUBTYPEP CHARSET:ISO-8859-1 CHARSET:ASCII) ⇒ NIL ; ⇒ T
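As a small illustration of treating an encoding as a type, the sketch below checks individual characters with TYPEP. The helper name ENCODABLE-P is hypothetical and not part of CLISP; only TYPEP, CODE-CHAR and the CHARSET constants come from the text.

```lisp
;; Hypothetical helper (not part of CLISP): an encoding used as a type
;; denotes the set of characters encodable in its character set.
(defun encodable-p (char encoding)
  (typep char encoding))

(encodable-p #\a charset:ascii)                    ; ASCII letter: true
(encodable-p (code-char 233) charset:ascii)        ; U+00E9 is outside ASCII: false
(encodable-p (code-char 233) charset:iso-8859-1)   ; U+00E9 is in ISO-8859-1: true
```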
“1:1” encodings. Encodings which define a bijection between character and byte sequences are called “1:1” encodings. CHARSET:ISO-8859-1 is an example of such an encoding: any byte sequence corresponds to some character sequence and vice versa. ASCII, however, is not a “1:1” encoding: there are no characters for bytes in the range [128;255]. CHARSET:UTF-8 is not a “1:1” encoding either: some byte sequences do not correspond to any character sequence.
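The “1:1” property can be probed roughly as follows. This is only a sketch: the helpers BYTE-VECTOR, ROUND-TRIPS-P and COUNT-ENCODABLE are hypothetical names introduced here, and the round trip deliberately skips the control bytes below 32 so that line terminator handling does not enter the picture.

```lisp
;; Hypothetical helper: a (VECTOR (UNSIGNED-BYTE 8)) holding START .. END-1.
(defun byte-vector (start end)
  (let ((v (make-array (- end start) :element-type '(unsigned-byte 8))))
    (dotimes (k (length v) v)
      (setf (aref v k) (+ start k)))))

;; Hypothetical helper: decode BYTES, encode the result again and
;; compare with the original bytes.
(defun round-trips-p (bytes encoding)
  (equalp bytes
          (ext:convert-string-to-bytes
           (ext:convert-string-from-bytes bytes encoding)
           encoding)))

;; Hypothetical helper: how many of the codes 0..255 name a character
;; that belongs to ENCODING when it is viewed as a type.
(defun count-encodable (encoding)
  (loop :for i :from 0 :below 256
        :count (typep (code-char i) encoding)))

(round-trips-p (byte-vector 32 256) charset:iso-8859-1) ; every byte survives
(count-encodable charset:iso-8859-1)                    ; 256
(count-encodable charset:ascii)                         ; 128
```

ASCII covers only half of the byte values, which is why it cannot be a “1:1” encoding.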
The following character sets are supported, as values of the corresponding (constant) symbol in the “CHARSET” package:
Symbols in package “CHARSET”
UCS-2 ≡ UNICODE-16 ≡ UNICODE-16-BIG-ENDIAN, the 16-bit basic multilingual plane of the UNICODE character set. Every character is represented as two bytes.
UNICODE-16-LITTLE-ENDIAN
UCS-4 ≡ UNICODE-32 ≡ UNICODE-32-BIG-ENDIAN, the 21-bit UNICODE character set. Every character is represented as four bytes. This encoding is used by CLISP internally.
UNICODE-32-LITTLE-ENDIAN
UTF-8, the 21-bit UNICODE character set. Every character is represented as one to four bytes. ASCII characters represent themselves and need one byte per character. Most Latin/Greek/Cyrillic/Hebrew characters need two bytes per character. Most other characters need three bytes per character, and the rarely used remaining characters need four bytes per character. This is therefore, in general, the most space-efficient encoding of all of Unicode.
UTF-16, the 21-bit UNICODE character set. Every character in the 16-bit basic multilingual plane is represented as two bytes, and the rarely used remaining characters need four bytes per character. This character set is only available on platforms with GNU libc or GNU libiconv.
UTF-7, the 21-bit UNICODE character set. This is a stateful 7-bit encoding. Not all ASCII characters represent themselves. This character set is only available on platforms with GNU libc or GNU libiconv.
JAVA, the 21-bit UNICODE character set. ASCII characters represent themselves and need one byte per character. All other characters of the basic multilingual plane are represented by \unnnn sequences (nnnn a hexadecimal number) and need 6 bytes per character. The remaining characters are represented by \uxxxx\uyyyy and need 12 bytes per character. While this encoding is very comfortable for editing Unicode files using only ASCII-aware tools and editors, it cannot faithfully represent all UNICODE text. Only text which does not contain \u (backslash followed by lowercase Latin u) can be faithfully represented by this encoding.
ASCII, the well-known US-centric 7-bit character set (American Standard Code for Information Interchange - ASCII).
ISO-8859-1, an extension of the ASCII character set, suitable for the Afrikaans, Albanian, Basque, Breton, Catalan, Cornish, Danish, Dutch, English, Færoese, Finnish, French, Frisian, Galician, German, Greenlandic, Icelandic, Irish, Italian, Latin, Luxemburgish, Norwegian, Portuguese, Ræto-Romanic, Scottish, Spanish, and Swedish languages. This encoding has the nice property that
(LOOP :for i :from 0 :to CHAR-CODE-LIMIT
      :for c = (CODE-CHAR i)
      :always (OR (NOT (TYPEP c CHARSET:ISO-8859-1))
                  (EQUALP (EXT:CONVERT-STRING-TO-BYTES (STRING c) CHARSET:ISO-8859-1)
                          (VECTOR i)))) ⇒ T
i.e., it is compatible with CLISP CODE-CHAR/CHAR-CODE in its own domain.
ISO-8859-2, an extension of the ASCII character set, suitable for the Croatian, Czech, German, Hungarian, Polish, Slovak, Slovenian, and Sorbian languages.
ISO-8859-3, an extension of the ASCII character set, suitable for the Esperanto and Maltese languages.
ISO-8859-4, an extension of the ASCII character set, suitable for the Estonian, Latvian, Lithuanian and Sami (Lappish) languages.
ISO-8859-5, an extension of the ASCII character set, suitable for the Bulgarian, Byelorussian, Macedonian, Russian, Serbian, and Ukrainian languages.
ISO-8859-6, suitable for the Arabic language.
ISO-8859-7, an extension of the ASCII character set, suitable for the Greek language.
ISO-8859-8, an extension of the ASCII character set, suitable for the Hebrew language (without punctuation).
ISO-8859-9, an extension of the ASCII character set, suitable for the Turkish language.
ISO-8859-10, an extension of the ASCII character set, suitable for the Estonian, Icelandic, Inuit (Greenlandic), Latvian, Lithuanian, and Sami (Lappish) languages.
ISO-8859-13, an extension of the ASCII character set, suitable for the Estonian, Latvian, Lithuanian, Polish and Sami (Lappish) languages.
ISO-8859-14, an extension of the ASCII character set, suitable for the Irish Gælic, Manx Gælic, Scottish Gælic, and Welsh languages.
ISO-8859-15, an extension of the ASCII character set, suitable for the ISO-8859-1 languages, with improvements for French, Finnish and the Euro.
ISO-8859-16, an extension of the ASCII character set, suitable for the Rumanian language.
KOI8-R, an extension of the ASCII character set, suitable for the Russian language (very popular, especially on the internet).
KOI8-U, an extension of the ASCII character set, suitable for the Ukrainian language (very popular, especially on the internet).
KOI8-RU, an extension of the ASCII character set, suitable for the Russian language. This character set is only available on platforms with GNU libiconv.
JIS_X0201, a character set for the Japanese language.
MAC-ARABIC, a platform specific extension of the ASCII character set.
MAC-CENTRAL-EUROPE, a platform specific extension of the ASCII character set.
MAC-CROATIAN, a platform specific extension of the ASCII character set.
MAC-CYRILLIC, a platform specific extension of the ASCII character set.
MAC-DINGBAT, a platform specific character set.
MAC-GREEK, a platform specific extension of the ASCII character set.
MAC-HEBREW, a platform specific extension of the ASCII character set.
MAC-ICELAND, a platform specific extension of the ASCII character set.
MAC-ROMAN ≡ MACINTOSH, a platform specific extension of the ASCII character set.
MAC-ROMANIA, a platform specific extension of the ASCII character set.
MAC-SYMBOL, a platform specific character set.
MAC-THAI, a platform specific extension of the ASCII character set.
MAC-TURKISH, a platform specific extension of the ASCII character set.
MAC-UKRAINE, a platform specific extension of the ASCII character set.
CP437, a DOS oldie, a platform specific extension of the ASCII character set.
CP437-IBM, an IBM variant of CP437.
CP737, a DOS oldie, a platform specific extension of the ASCII character set, meant to be suitable for the Greek language.
CP775, a DOS oldie, a platform specific extension of the ASCII character set, meant to be suitable for some Baltic languages.
CP850, a DOS oldie, a platform specific extension of the ASCII character set.
CP852, a DOS oldie, a platform specific extension of the ASCII character set.
CP852-IBM, an IBM variant of CP852.
CP855, a DOS oldie, a platform specific extension of the ASCII character set, meant to be suitable for the Russian language.
CP857, a DOS oldie, a platform specific extension of the ASCII character set, meant to be suitable for the Turkish language.
CP860, a DOS oldie, a platform specific extension of the ASCII character set, meant to be suitable for the Portuguese language.
CP860-IBM, an IBM variant of CP860.
CP861, a DOS oldie, a platform specific extension of the ASCII character set, meant to be suitable for the Icelandic language.
CP861-IBM, an IBM variant of CP861.
CP862, a DOS oldie, a platform specific extension of the ASCII character set, meant to be suitable for the Hebrew language.
CP862-IBM, an IBM variant of CP862.
CP863, a DOS oldie, a platform specific extension of the ASCII character set.
CP863-IBM, an IBM variant of CP863.
CP864, a DOS oldie, meant to be suitable for the Arabic language.
CP864-IBM, an IBM variant of CP864.
CP865, a DOS oldie, a platform specific extension of the ASCII character set, meant to be suitable for some Nordic languages.
CP865-IBM, an IBM variant of CP865.
CP866, a DOS oldie, a platform specific extension of the ASCII character set, meant to be suitable for the Russian language.
CP869, a DOS oldie, a platform specific extension of the ASCII character set, meant to be suitable for the Greek language.
CP869-IBM, an IBM variant of CP869.
CP874, a DOS oldie, a platform specific extension of the ASCII character set, meant to be suitable for the Thai language.
CP874-IBM, an IBM variant of CP874.
WINDOWS-1250 ≡ CP1250, a platform specific extension of the ASCII character set, heavily incompatible with ISO-8859-2.
WINDOWS-1251 ≡ CP1251, a platform specific extension of the ASCII character set, heavily incompatible with ISO-8859-5, meant to be suitable for the Russian language.
WINDOWS-1252 ≡ CP1252, a platform specific extension of the ISO-8859-1 character set.
WINDOWS-1253 ≡ CP1253, a platform specific extension of the ASCII character set, gratuitously incompatible with ISO-8859-7, meant to be suitable for the Greek language.
WINDOWS-1254 ≡ CP1254, a platform specific extension of the ISO-8859-9 character set.
WINDOWS-1255 ≡ CP1255, a platform specific extension of the ASCII character set, gratuitously incompatible with ISO-8859-8, suitable for the Hebrew language. This character set is only available on platforms with GNU libc or GNU libiconv.
WINDOWS-1256 ≡ CP1256, a platform specific extension of the ASCII character set, meant to be suitable for the Arabic language.
WINDOWS-1257 ≡ CP1257, a platform specific extension of the ASCII character set.
WINDOWS-1258 ≡ CP1258, a platform specific extension of the ASCII character set, meant to be suitable for the Vietnamese language. This character set is only available on platforms with GNU libc or GNU libiconv.
HP-ROMAN8, a platform specific extension of the ASCII character set.
NEXTSTEP, a platform specific extension of the ASCII character set.
EUC-JP, a multibyte character set for the Japanese language. This character set is only available on platforms with GNU libc or GNU libiconv.
SHIFT-JIS, a multibyte character set for the Japanese language. This character set is only available on platforms with GNU libc or GNU libiconv.
CP932, a Microsoft variant of SHIFT-JIS. This character set is only available on platforms with GNU libc or GNU libiconv.
ISO-2022-JP, a stateful 7-bit multibyte character set for the Japanese language. This character set is only available on platforms with GNU libc or GNU libiconv.
ISO-2022-JP-2, a stateful 7-bit multibyte character set for the Japanese language. This character set is only available on platforms with GNU libc 2.3 or newer or GNU libiconv.
ISO-2022-JP-1, a stateful 7-bit multibyte character set for the Japanese language. This character set is only available on platforms with GNU libiconv.
EUC-CN, a multibyte character set for simplified Chinese. This character set is only available on platforms with GNU libc or GNU libiconv.
HZ, a stateful 7-bit multibyte character set for simplified Chinese. This character set is only available on platforms with GNU libiconv.
GBK, a multibyte character set for Chinese. This character set is only available on platforms with GNU libc or GNU libiconv.
CP936, a Microsoft variant of GBK. This character set is only available on platforms with GNU libc or GNU libiconv.
GB18030, a multibyte character set for Chinese. This character set is only available on platforms with GNU libc or GNU libiconv.
EUC-TW, a multibyte character set for traditional Chinese. This character set is only available on platforms with GNU libc or GNU libiconv.
BIG5, a multibyte character set for traditional Chinese. This character set is only available on platforms with GNU libc or GNU libiconv.
CP950, a Microsoft variant of BIG5. This character set is only available on platforms with GNU libc or GNU libiconv.
BIG5-HKSCS, a multibyte character set for traditional Chinese. This character set is only available on platforms with GNU libc or GNU libiconv.
ISO-2022-CN, a stateful 7-bit multibyte character set for Chinese. This character set is only available on platforms with GNU libc or GNU libiconv.
ISO-2022-CN-EXT, a stateful 7-bit multibyte character set for Chinese. This character set is only available on platforms with GNU libc or GNU libiconv.
EUC-KR, a multibyte character set for Korean. This character set is only available on platforms with GNU libc or GNU libiconv.
CP949, a Microsoft variant of EUC-KR. This character set is only available on platforms with GNU libc or GNU libiconv.
ISO-2022-KR, a stateful 7-bit multibyte character set for Korean. This character set is only available on platforms with GNU libc or GNU libiconv.
JOHAB, a multibyte character set for Korean used mostly on DOS. This character set is only available on platforms with GNU libc or GNU libiconv.
ARMSCII-8, an extension of the ASCII character set, suitable for the Armenian language. This character set is only available on platforms with GNU libc or GNU libiconv.
GEORGIAN-ACADEMY, an extension of the ASCII character set, suitable for the Georgian language. This character set is only available on platforms with GNU libc or GNU libiconv.
GEORGIAN-PS, an extension of the ASCII character set, suitable for the Georgian language. This character set is only available on platforms with GNU libc or GNU libiconv.
TIS-620, an extension of the ASCII character set, suitable for the Thai language. This character set is only available on platforms with GNU libc or GNU libiconv.
MULELAO-1, an extension of the ASCII character set, suitable for the Laotian language. This character set is only available on platforms with GNU libiconv.
CP1133, an extension of the ASCII character set, suitable for the Laotian language. This character set is only available on platforms with GNU libc or GNU libiconv.
VISCII, an extension of the ASCII character set, suitable for the Vietnamese language. This character set is only available on platforms with GNU libc or GNU libiconv.
TCVN, an extension of the ASCII character set, suitable for the Vietnamese language. This character set is only available on platforms with GNU libc or GNU libiconv.
BASE64, encodes arbitrary byte sequences with 64 ASCII characters ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ as specified by MIME; 3 bytes are encoded with 4 characters, line breaks are inserted after every 76 characters. While this is not a traditional character set (i.e., it does not map a set of characters in a natural language into bytes), it does define a map between arbitrary byte sequences and certain character sequences, so it falls naturally into the EXT:ENCODING class.
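The byte counts quoted for UTF-8 in the list above can be checked with a short sketch like the following. The helper UTF-8-LENGTH is a hypothetical name introduced here, and the four sample code points (a Latin letter, an accented Latin letter, the Euro sign, and a character outside the basic multilingual plane) are chosen only for illustration.

```lisp
;; Hypothetical helper: number of bytes that CHARSET:UTF-8 needs for the
;; character with the given code.
(defun utf-8-length (code)
  (length (ext:convert-string-to-bytes (string (code-char code))
                                       charset:utf-8)))

;; U+0041 (ASCII), U+00E9 (Latin-1 range), U+20AC (Euro sign),
;; U+1F600 (outside the basic multilingual plane)
(mapcar #'utf-8-length '(#x41 #xE9 #x20AC #x1F600))   ; (1 2 3 4)
```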
The character sets provided by the library function iconv can also be used as encodings. To create such an encoding, call EXT:MAKE-ENCODING with the character set name (a string) as the :CHARSET argument.
When an EXT:ENCODING is available both as a built-in and through iconv, the built-in is used, because it is more efficient and available across all platforms.
These encodings are not assigned to global variables, since there is no portable way to get the list of all character sets supported by iconv.
On standard-compliant UNIX systems (e.g., GNU systems, such as GNU/Linux and GNU/Hurd) and on systems with GNU libiconv you get this list by calling the program: iconv -l.
The reason we use only GNU libc 2.2 or GNU libiconv is that the other iconv implementations are broken in various ways and we do not want to deal with random CLISP crashes caused by those bugs. If your system supplies an iconv implementation which passes GNU libiconv's test suite, please report that to clisp-list and a future CLISP version will use iconv on your system.
The line terminator mode can be one of the following three keywords:
:UNIX, a newline is represented by the ASCII LF character (code 10).
:MAC, a newline is represented by the ASCII CR character (code 13).
:DOS, a newline is represented by the ASCII CR character followed by the ASCII LF character.
Windows programs typically use the :DOS line terminator; sometimes they also accept :UNIX line terminators or produce :MAC line terminators. The HTTP protocol also requires :DOS line terminators.
The line terminator mode is relevant only for output (writing to a file/pipe/socket STREAM). During input, all three kinds of line terminators are recognized. See also Section 13.11, “Treatment of Newline during Input and Output sec_13-1-8”.
EXT:MAKE-ENCODING
The function (EXT:MAKE-ENCODING &KEY :CHARSET :LINE-TERMINATOR :INPUT-ERROR-ACTION :OUTPUT-ERROR-ACTION) returns an EXT:ENCODING. The :CHARSET argument may be an encoding, a string, or :DEFAULT. The possible values for the line terminator argument are the keywords :UNIX, :MAC, :DOS.
The :INPUT-ERROR-ACTION argument specifies what happens when an invalid byte sequence is encountered while converting bytes to characters. Its value can be :ERROR, :IGNORE or a character to be used instead. The UNICODE character #\uFFFD is typically used to indicate an error in the input sequence.
The :OUTPUT-ERROR-ACTION argument specifies what happens when an invalid character is encountered while converting characters to bytes. Its value can be :ERROR, :IGNORE, a byte to be used instead, or a character to be used instead. The UNICODE character #\uFFFD can be used here only if it is encodable in the character set.
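A minimal sketch of the two error actions, assuming an ASCII-based encoding that substitutes a question mark in both directions. The variable name *LENIENT-ASCII* is hypothetical; the byte 255 and the character U+00E9 are picked only because neither is valid in ASCII.

```lisp
;; Hypothetical encoding object: ASCII that replaces anything it cannot
;; convert with a question mark instead of signaling an error.
(defparameter *lenient-ascii*
  (ext:make-encoding :charset charset:ascii
                     :line-terminator :unix
                     :input-error-action #\?
                     :output-error-action #\?))

;; Byte 255 is not valid ASCII, so the input error action applies.
(ext:convert-string-from-bytes
 (coerce #(97 255 98) '(vector (unsigned-byte 8)))
 *lenient-ascii*)                                   ; "a?b"

;; U+00E9 is not encodable in ASCII, so the output error action applies.
(ext:convert-string-to-bytes (string (code-char 233))
                             *lenient-ascii*)       ; #(63)
```

Replacing the character with :IGNORE would drop the offending element instead, and :ERROR would signal an error.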
EXT:ENCODING-CHARSET
The function (EXT:ENCODING-CHARSET encoding) returns the charset of the encoding, as a SYMBOL or a STRING. (STRING (EXT:ENCODING-CHARSET encoding)) is not necessarily a valid MIME name.
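For illustration, the sketch below reads the charset back from a built-in encoding and from one derived from it with a different line terminator. CHARSET-NAME is a hypothetical helper, not a CLISP function, and its result should not be assumed to be a valid MIME name.

```lisp
;; Hypothetical helper: the charset of ENCODING as a string, whether
;; EXT:ENCODING-CHARSET returned a SYMBOL or a STRING.
(defun charset-name (encoding)
  (string (ext:encoding-charset encoding)))

(charset-name charset:utf-8)

;; An encoding derived from a built-in one keeps the same charset;
;; only the line terminator facet differs.
(charset-name (ext:make-encoding :charset charset:iso-8859-1
                                 :line-terminator :dos))
```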
Besides every file/pipe/socket STREAM containing an encoding, the following SYMBOL-MACRO places contain global EXT:ENCODINGs:
SYMBOL-MACRO CUSTOM:*DEFAULT-FILE-ENCODING*. The SYMBOL-MACRO place CUSTOM:*DEFAULT-FILE-ENCODING* is the encoding used for new file/pipe/socket STREAMs when no :EXTERNAL-FORMAT argument was specified.
The following are SYMBOL-MACRO places.
CUSTOM:*PATHNAME-ENCODING* is the encoding used for converting filenames in the file system (represented with byte sequences by the OS) to lisp PATHNAME components (STRINGs). If this encoding is incompatible with some file names on your system, file system access (e.g., DIRECTORY) may SIGNAL ERRORs, thus extreme caution is recommended if this is not a “1:1” encoding. Sometimes it may not be obvious that the encoding is involved at all. E.g., on Win32:
(PARSE-NAMESTRING (STRING #\ARMENIAN_SMALL_LETTER_RA))
*** - PARSE-NAMESTRING: syntax error in filename "ռ" at position 0
when CUSTOM:*PATHNAME-ENCODING* is CHARSET:UTF-16 because then #\ARMENIAN_SMALL_LETTER_RA corresponds to the 4 bytes #(255 254 124 5) and the byte 124 is not a valid byte for a Win32 file name because it means | in ASCII.
The set of valid pathname bytes is determined by the GNU autoconf test src/m4/filecharset.m4 at configure time. While rather stable for the bytes 0-127, on Win32 it varies wildly for the bytes 128-255, depending on the OS version and the file system.
The line terminator mode of CUSTOM:*PATHNAME-ENCODING* is ignored.
Platform Dependent: Mac OS X platform only: Mac OS X pathnames are actually UNICODE STRINGs, so CUSTOM:*PATHNAME-ENCODING* is a constant with value CHARSET:UTF-8.
CUSTOM:*TERMINAL-ENCODING* is the encoding used for communication with the terminal, in particular by *TERMINAL-IO*.
CUSTOM:*MISC-ENCODING*
CUSTOM:*FOREIGN-ENCODING*
The default encoding objects are initialized according to -Edomain encoding.
You have to use EXT:LETF/EXT:LETF* for SYMBOL-MACROs; LET/LET* will not work!
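A rough sketch of a temporary rebinding with EXT:LETF follows. The wrapper CALL-WITH-FILE-ENCODING is a hypothetical name; it only assumes that EXT:LETF restores the previous value of the place when the body exits.

```lisp
;; Hypothetical wrapper: run THUNK with CUSTOM:*DEFAULT-FILE-ENCODING*
;; temporarily set to ENCODING.  EXT:LETF is used because the variable
;; is a SYMBOL-MACRO place, so LET would not rebind it.
(defun call-with-file-encoding (encoding thunk)
  (ext:letf ((custom:*default-file-encoding* encoding))
    (funcall thunk)))

;; Inside the call, streams opened without :EXTERNAL-FORMAT would use
;; ISO-8859-1; here the charset of the temporary default is just returned.
(call-with-file-encoding
 charset:iso-8859-1
 (lambda () (ext:encoding-charset custom:*default-file-encoding*)))
```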
The line terminator facet of the above EXT:ENCODINGs is determined by the following logic: since CLISP understands all possible line terminators on input (see Section 13.11, “Treatment of Newline during Input and Output sec_13-1-8”), all that matters is which line terminator most other programs expect.
If the O_BINARY cpp constant is defined and non-zero, we assume that the OS distinguishes between text and binary files, and, since the encodings are relevant only for text files, we thus use :DOS; otherwise the default is :UNIX.
On Win32, the default is :DOS.
This boils down to the following code in src/encoding.d:
#if defined(WIN32) || (defined(UNIX) && (O_BINARY != 0))
Both of the above tests pass on Cygwin, so the default line terminator is :DOS. If you so desire, you can change it in your RC file.
Encodings can also be used to convert directly between strings and their corresponding byte vector representation according to that encoding.
(EXT:CONVERT-STRING-FROM-BYTES vector encoding &KEY :START :END) converts the subsequence of vector (a (VECTOR (UNSIGNED-BYTE 8))) from start to end to a STRING, according to the given encoding, and returns the resulting string.
(EXT:CONVERT-STRING-TO-BYTES string encoding &KEY :START :END) converts the subsequence of string from start to end to a (VECTOR (UNSIGNED-BYTE 8)), according to the given encoding, and returns the resulting byte vector.
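The two conversion functions can be exercised as in the sketch below. The sample string and the variable name *HI-BYTES* are illustrative only; the text contains no line terminators, so that facet plays no role here.

```lisp
;; Encode only the characters at positions 1 and 2 ("el") of the string.
(ext:convert-string-to-bytes "hello" charset:ascii :start 1 :end 3)
;; bytes 101 108

;; Hypothetical sample data: the ASCII bytes of "hi!".
(defparameter *hi-bytes*
  (coerce #(104 105 33) '(vector (unsigned-byte 8))))

;; Decode only the first two bytes.
(ext:convert-string-from-bytes *hi-bytes* charset:ascii :end 2)   ; "hi"

;; A full round trip through UTF-8 returns an equal string.
(ext:convert-string-from-bytes
 (ext:convert-string-to-bytes "hello" charset:utf-8)
 charset:utf-8)                                                   ; "hello"
```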
These notes document CLISP version 2.49.93+ | Last modified: 2018-02-19 |