31.5. Encodings


31.5.1. Introduction

An “encoding” describes the correspondence between CHARACTERs and raw bytes during input/output via STREAMs with STREAM-ELEMENT-TYPE CHARACTER.

An EXT:ENCODING is an object composed of the following facets:

character set: This denotes both the set of CHARACTERs that can be represented and passed through the I/O channel, and the way these characters translate into raw bytes, i.e., the map between sequences of CHARACTER and (UNSIGNED-BYTE 8) in the form of STRINGs and (VECTOR (UNSIGNED-BYTE 8)) as well as character and byte STREAMs. In this context, for example, CHARSET:UTF-8 and CHARSET:UCS-4 are considered different, although they can represent the same set of characters.
line terminator mode: This denotes the way newline characters are represented.

EXT:ENCODINGs are also TYPEs. As such, they represent the set of characters encodable in the character set. In this context, the way characters are translated into raw bytes is ignored, and the line terminator mode is ignored as well. TYPEP and SUBTYPEP can be used on encodings:

(SUBTYPEP CHARSET:UTF-8 CHARSET:UTF-16) ⇒ T ; T
(SUBTYPEP CHARSET:UTF-16 CHARSET:UTF-8) ⇒ T ; T
(SUBTYPEP CHARSET:ASCII CHARSET:ISO-8859-1) ⇒ T ; T
(SUBTYPEP CHARSET:ISO-8859-1 CHARSET:ASCII) ⇒ NIL ; T
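
Because encodings are TYPEs, TYPEP can serve as an encodability check for individual characters. The following is a minimal sketch, assuming a CLISP built with the UNICODE flag (so that the “CHARSET” package exists); encodable-p is a hypothetical helper name and not part of CLISP:

```lisp
;; encodable-p is a hypothetical helper, not a CLISP built-in.
;; It tests whether every character of STRING belongs to the
;; character set of ENCODING, using the encoding as a type.
(DEFUN encodable-p (string encoding)
  (EVERY (LAMBDA (c) (TYPEP c encoding)) string))

(encodable-p "plain text" CHARSET:ASCII)

;; (CODE-CHAR 233) is LATIN SMALL LETTER E WITH ACUTE: it lies
;; outside ASCII but inside ISO-8859-1.
(encodable-p (STRING (CODE-CHAR 233)) CHARSET:ASCII)
(encodable-p (STRING (CODE-CHAR 233)) CHARSET:ISO-8859-1)
```

The three calls are expected to return T, NIL, and T, respectively. Only the set of encodable characters is consulted here; the byte representation and the line terminator mode play no role.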

“1:1” encodings. Encodings which define a bijection between character and byte sequences are called “1:1” encodings. CHARSET:ISO-8859-1 is an example of such an encoding: any byte sequence corresponds to some character sequence and vice versa. ASCII, however, is not a “1:1” encoding: there are no characters for bytes in the range [128;255]. CHARSET:UTF-8 is not a “1:1” encoding either: some byte sequences do not correspond to any character sequence.
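
One half of the bijection can be checked directly: every byte vector should survive decoding followed by re-encoding. The sketch below assumes a CLISP built with the UNICODE flag; round-trips-p and *all-bytes* are hypothetical names introduced for illustration only:

```lisp
;; round-trips-p is a hypothetical helper, not a CLISP built-in.
;; It decodes BYTES with ENCODING, re-encodes the result, and checks
;; that the original byte vector comes back unchanged.
(DEFUN round-trips-p (bytes encoding)
  (EQUALP bytes
          (EXT:CONVERT-STRING-TO-BYTES
           (EXT:CONVERT-STRING-FROM-BYTES bytes encoding)
           encoding)))

;; All 256 byte values, in ascending order.
(DEFPARAMETER *all-bytes*
  (LET ((v (MAKE-ARRAY 256 :ELEMENT-TYPE '(UNSIGNED-BYTE 8))))
    (DOTIMES (i 256 v)
      (SETF (AREF v i) i))))

(round-trips-p *all-bytes* CHARSET:ISO-8859-1)
```

For CHARSET:ISO-8859-1 the call should return T. For an encoding that is not “1:1”, such as CHARSET:ASCII or CHARSET:UTF-8, the same vector contains bytes with no corresponding character, so the outcome depends on the error actions of the encoding.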

31.5.2. Character Sets

Platform Dependent: Only in CLISP built without compile-time flag UNICODE

Only one character set is understood: the platform's native (8-bit) character set. See Chapter 13, Characters.

Platform Dependent: Only in CLISP built with compile-time flag UNICODE

The following character sets are supported, as values of the corresponding (constant) symbol in the “CHARSET” package:

Symbols in package “CHARSET”

UCS-2 ≡ UNICODE-16 ≡ UNICODE-16-BIG-ENDIAN, the 16-bit basic multilingual plane of the UNICODE character set. Every character is represented as two bytes.
UNICODE-16-LITTLE-ENDIAN
UCS-4 ≡ UNICODE-32 ≡ UNICODE-32-BIG-ENDIAN, the 21-bit UNICODE character set. Every character is represented as four bytes. This encoding is used by CLISP internally.
UNICODE-32-LITTLE-ENDIAN
UTF-8, the 21-bit UNICODE character set. Every character is represented as one to four bytes. ASCII characters represent themselves and need one byte per character. Most Latin/Greek/Cyrillic/Hebrew characters need two bytes per character. Most other characters need three bytes per character, and the rarely used remaining characters need four bytes per character. This is therefore, in general, the most space-efficient encoding of all of Unicode.
UTF-16, the 21-bit UNICODE character set. Every character in the 16-bit basic multilingual plane is represented as two bytes, and the rarely used remaining characters need four bytes per character. This character set is only available on platforms with GNU libc or GNU libiconv.
UTF-7, the 21-bit UNICODE character set. This is a stateful 7-bit encoding. Not all ASCII characters represent themselves. This character set is only available on platforms with GNU libc or GNU libiconv.
JAVA, the 21-bit UNICODE character set. ASCII characters represent themselves and need one byte per character. All other characters of the basic multilingual plane are represented by \unnnn sequences (nnnn a hexadecimal number) and need 6 bytes per character. The remaining characters are represented by \uxxxx\uyyyy and need 12 bytes per character. While this encoding is very comfortable for editing Unicode files using only ASCII-aware tools and editors, it cannot faithfully represent all UNICODE text. Only text which does not contain \u (backslash followed by lowercase Latin u) can be faithfully represented by this encoding.
ASCII, the well-known US-centric 7-bit character set (American Standard Code for Information Interchange - ASCII).
ISO-8859-1, an extension of the ASCII character set, suitable for the Afrikaans, Albanian, Basque, Breton, Catalan, Cornish, Danish, Dutch, English, Færoese, Finnish, French, Frisian, Galician, German, Greenlandic, Icelandic, Irish, Italian, Latin, Luxemburgish, Norwegian, Portuguese, Ræto-Romanic, Scottish, Spanish, and Swedish languages.
This encoding has the nice property that
```
(LOOP :for i :from 0 :below CHAR-CODE-LIMIT :for c = (CODE-CHAR i)
  :always (OR (NOT (TYPEP c CHARSET:ISO-8859-1))
              (EQUALP (EXT:CONVERT-STRING-TO-BYTES (STRING c) CHARSET:ISO-8859-1)
                      (VECTOR i))))
⇒ T
```
i.e., it is compatible with CLISP CODE-CHAR/CHAR-CODE in its own domain.
ISO-8859-2, an extension of the ASCII character set, suitable for the Croatian, Czech, German, Hungarian, Polish, Slovak, Slovenian, and Sorbian languages.
ISO-8859-3, an extension of the ASCII character set, suitable for the Esperanto and Maltese languages.
ISO-8859-4, an extension of the ASCII character set, suitable for the Estonian, Latvian, Lithuanian and Sami (Lappish) languages.
ISO-8859-5, an extension of the ASCII character set, suitable for the Bulgarian, Byelorussian, Macedonian, Russian, Serbian, and Ukrainian languages.
ISO-8859-6, suitable for the Arabic language.
ISO-8859-7, an extension of the ASCII character set, suitable for the Greek language.
ISO-8859-8, an extension of the ASCII character set, suitable for the Hebrew language (without punctuation).
ISO-8859-9, an extension of the ASCII character set, suitable for the Turkish language.
ISO-8859-10, an extension of the ASCII character set, suitable for the Estonian, Icelandic, Inuit (Greenlandic), Latvian, Lithuanian, and Sami (Lappish) languages.
ISO-8859-13, an extension of the ASCII character set, suitable for the Estonian, Latvian, Lithuanian, Polish and Sami (Lappish) languages.
ISO-8859-14, an extension of the ASCII character set, suitable for the Irish Gælic, Manx Gælic, Scottish Gælic, and Welsh languages.
ISO-8859-15, an extension of the ASCII character set, suitable for the ISO-8859-1 languages, with improvements for French, Finnish and the Euro.
ISO-8859-16, an extension of the ASCII character set, suitable for the Romanian language.
KOI8-R, an extension of the ASCII character set, suitable for the Russian language (very popular, especially on the internet).
KOI8-U, an extension of the ASCII character set, suitable for the Ukrainian language (very popular, especially on the internet).
KOI8-RU, an extension of the ASCII character set, suitable for the Russian language. This character set is only available on platforms with GNU libiconv.
JIS_X0201, a character set for the Japanese language.
MAC-ARABIC, a platform specific extension of the ASCII character set.
MAC-CENTRAL-EUROPE, a platform specific extension of the ASCII character set.
MAC-CROATIAN, a platform specific extension of the ASCII character set.
MAC-CYRILLIC, a platform specific extension of the ASCII character set.
MAC-DINGBAT, a platform specific character set.
MAC-GREEK, a platform specific extension of the ASCII character set.
MAC-HEBREW, a platform specific extension of the ASCII character set.
MAC-ICELAND, a platform specific extension of the ASCII character set.
MAC-ROMAN ≡ MACINTOSH, a platform specific extension of the ASCII character set.
MAC-ROMANIA, a platform specific extension of the ASCII character set.
MAC-SYMBOL, a platform specific character set.
MAC-THAI, a platform specific extension of the ASCII character set.
MAC-TURKISH, a platform specific extension of the ASCII character set.
MAC-UKRAINE, a platform specific extension of the ASCII character set.
CP437, a DOS oldie, a platform specific extension of the ASCII character set.
CP437-IBM, an IBM variant of CP437.
CP737, a DOS oldie, a platform specific extension of the ASCII character set, meant to be suitable for the Greek language.
CP775, a DOS oldie, a platform specific extension of the ASCII character set, meant to be suitable for some Baltic languages.
CP850, a DOS oldie, a platform specific extension of the ASCII character set.
CP852, a DOS oldie, a platform specific extension of the ASCII character set.
CP852-IBM, an IBM variant of CP852.
CP855, a DOS oldie, a platform specific extension of the ASCII character set, meant to be suitable for the Russian language.
CP857, a DOS oldie, a platform specific extension of the ASCII character set, meant to be suitable for the Turkish language.
CP860, a DOS oldie, a platform specific extension of the ASCII character set, meant to be suitable for the Portuguese language.
CP860-IBM, an IBM variant of CP860.
CP861, a DOS oldie, a platform specific extension of the ASCII character set, meant to be suitable for the Icelandic language.
CP861-IBM, an IBM variant of CP861.
CP862, a DOS oldie, a platform specific extension of the ASCII character set, meant to be suitable for the Hebrew language.
CP862-IBM, an IBM variant of CP862.
CP863, a DOS oldie, a platform specific extension of the ASCII character set.
CP863-IBM, an IBM variant of CP863.
CP864, a DOS oldie, meant to be suitable for the Arabic language.
CP864-IBM, an IBM variant of CP864.
CP865, a DOS oldie, a platform specific extension of the ASCII character set, meant to be suitable for some Nordic languages.
CP865-IBM, an IBM variant of CP865.
CP866, a DOS oldie, a platform specific extension of the ASCII character set, meant to be suitable for the Russian language.
CP869, a DOS oldie, a platform specific extension of the ASCII character set, meant to be suitable for the Greek language.
CP869-IBM, an IBM variant of CP869.
CP874, a DOS oldie, a platform specific extension of the ASCII character set, meant to be suitable for the Thai language.
CP874-IBM, an IBM variant of CP874.
WINDOWS-1250 ≡ CP1250, a platform specific extension of the ASCII character set, heavily incompatible with ISO-8859-2.
WINDOWS-1251 ≡ CP1251, a platform specific extension of the ASCII character set, heavily incompatible with ISO-8859-5, meant to be suitable for the Russian language.
WINDOWS-1252 ≡ CP1252, a platform specific extension of the ISO-8859-1 character set.
WINDOWS-1253 ≡ CP1253, a platform specific extension of the ASCII character set, gratuitously incompatible with ISO-8859-7, meant to be suitable for the Greek language.
WINDOWS-1254 ≡ CP1254, a platform specific extension of the ISO-8859-9 character set.
WINDOWS-1255 ≡ CP1255, a platform specific extension of the ASCII character set, gratuitously incompatible with ISO-8859-8, suitable for the Hebrew language. This character set is only available on platforms with GNU libc or GNU libiconv.
WINDOWS-1256 ≡ CP1256, a platform specific extension of the ASCII character set, meant to be suitable for the Arabic language.
WINDOWS-1257 ≡ CP1257, a platform specific extension of the ASCII character set.
WINDOWS-1258 ≡ CP1258, a platform specific extension of the ASCII character set, meant to be suitable for the Vietnamese language. This character set is only available on platforms with GNU libc or GNU libiconv.
HP-ROMAN8, a platform specific extension of the ASCII character set.
NEXTSTEP, a platform specific extension of the ASCII character set.
EUC-JP, a multibyte character set for the Japanese language. This character set is only available on platforms with GNU libc or GNU libiconv.
SHIFT-JIS, a multibyte character set for the Japanese language. This character set is only available on platforms with GNU libc or GNU libiconv.
CP932, a Microsoft variant of SHIFT-JIS. This character set is only available on platforms with GNU libc or GNU libiconv.
ISO-2022-JP, a stateful 7-bit multibyte character set for the Japanese language. This character set is only available on platforms with GNU libc or GNU libiconv.
ISO-2022-JP-2, a stateful 7-bit multibyte character set for the Japanese language. This character set is only available on platforms with GNU libc 2.3 or newer or GNU libiconv.
ISO-2022-JP-1, a stateful 7-bit multibyte character set for the Japanese language. This character set is only available on platforms with GNU libiconv.
EUC-CN, a multibyte character set for simplified Chinese. This character set is only available on platforms with GNU libc or GNU libiconv.
HZ, a stateful 7-bit multibyte character set for simplified Chinese. This character set is only available on platforms with GNU libiconv.
GBK, a multibyte character set for Chinese. This character set is only available on platforms with GNU libc or GNU libiconv.
CP936, a Microsoft variant of GBK. This character set is only available on platforms with GNU libc or GNU libiconv.
GB18030, a multibyte character set for Chinese. This character set is only available on platforms with GNU libc or GNU libiconv.
EUC-TW, a multibyte character set for traditional Chinese. This character set is only available on platforms with GNU libc or GNU libiconv.
BIG5, a multibyte character set for traditional Chinese. This character set is only available on platforms with GNU libc or GNU libiconv.
CP950, a Microsoft variant of BIG5. This character set is only available on platforms with GNU libc or GNU libiconv.
BIG5-HKSCS, a multibyte character set for traditional Chinese. This character set is only available on platforms with GNU libc or GNU libiconv.
ISO-2022-CN, a stateful 7-bit multibyte character set for Chinese. This character set is only available on platforms with GNU libc or GNU libiconv.
ISO-2022-CN-EXT, a stateful 7-bit multibyte character set for Chinese. This character set is only available on platforms with GNU libc or GNU libiconv.
EUC-KR, a multibyte character set for Korean. This character set is only available on platforms with GNU libc or GNU libiconv.
CP949, a Microsoft variant of EUC-KR. This character set is only available on platforms with GNU libc or GNU libiconv.
ISO-2022-KR, a stateful 7-bit multibyte character set for Korean. This character set is only available on platforms with GNU libc or GNU libiconv.
JOHAB, a multibyte character set for Korean used mostly on DOS. This character set is only available on platforms with GNU libc or GNU libiconv.
ARMSCII-8, an extension of the ASCII character set, suitable for the Armenian language. This character set is only available on platforms with GNU libc or GNU libiconv.
GEORGIAN-ACADEMY, an extension of the ASCII character set, suitable for the Georgian language. This character set is only available on platforms with GNU libc or GNU libiconv.
GEORGIAN-PS, an extension of the ASCII character set, suitable for the Georgian language. This character set is only available on platforms with GNU libc or GNU libiconv.
TIS-620, an extension of the ASCII character set, suitable for the Thai language. This character set is only available on platforms with GNU libc or GNU libiconv.
MULELAO-1, an extension of the ASCII character set, suitable for the Laotian language. This character set is only available on platforms with GNU libiconv.
CP1133, an extension of the ASCII character set, suitable for the Laotian language. This character set is only available on platforms with GNU libc or GNU libiconv.
VISCII, an extension of the ASCII character set, suitable for the Vietnamese language. This character set is only available on platforms with GNU libc or GNU libiconv.
TCVN, an extension of the ASCII character set, suitable for the Vietnamese language. This character set is only available on platforms with GNU libc or GNU libiconv.
BASE64, encodes arbitrary byte sequences with 64 ASCII characters

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/

as specified by MIME; 3 bytes are encoded with 4 characters, and line breaks are inserted after every 76 characters.
While this is not a traditional character set (i.e., it does not map a set of characters in a natural language into bytes), it does define a map between arbitrary byte sequences and certain character sequences, so it falls naturally into the EXT:ENCODING class.
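
Since BASE64 maps arbitrary bytes to a restricted set of characters, the raw data sits on the byte side and the BASE64 text on the character side. The following is a minimal sketch under that reading, assuming a CLISP built with the UNICODE flag; base64-encode and base64-decode are hypothetical helper names, not CLISP functions:

```lisp
;; base64-encode and base64-decode are hypothetical helper names.
;; With CHARSET:BASE64 the raw data is the byte side and the
;; BASE64 text is the character side of the encoding.
(DEFUN base64-encode (bytes)
  (EXT:CONVERT-STRING-FROM-BYTES bytes CHARSET:BASE64))

(DEFUN base64-decode (text)
  (EXT:CONVERT-STRING-TO-BYTES text CHARSET:BASE64))

;; Encode the UTF-8 bytes of a string, then decode them again and
;; compare with the original byte vector.
(LET* ((data (EXT:CONVERT-STRING-TO-BYTES "Hi!" CHARSET:UTF-8))
       (text (base64-encode data)))
  (EQUALP data (base64-decode text)))
```

Text that is to be transported as BASE64 therefore passes through two encodings: one (here CHARSET:UTF-8) to obtain bytes, and CHARSET:BASE64 to obtain the ASCII-only representation.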

Platform Dependent: Only on GNU systems with GNU libc 2.2 or better and other systems (UNIX and Win32) on which the GNU libiconv C library has been installed

The character sets provided by the library function iconv can also be used as encodings. To create such an encoding, call EXT:MAKE-ENCODING with the character set name (a string) as the :CHARSET argument.

When an EXT:ENCODING is available both as a built-in and through iconv, the built-in is used, because it is more efficient and available across all platforms.

These encodings are not assigned to global variables, since there is no portable way to get the list of all character sets supported by iconv.

On standard-compliant UNIX systems (e.g., GNU systems, such as GNU/Linux and GNU/Hurd) and on systems with GNU libiconv you get this list by calling the program: iconv -l.

The reason we use only GNU libc 2.2 or GNU libiconv is that the other iconv implementations are broken in various ways and we do not want to deal with random CLISP crashes caused by those bugs. If your system supplies an iconv implementation which passes the GNU libiconv's test suite, please report that to clisp-list and a future CLISP version will use iconv on your system.
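
Creating an encoding from a character set name might look as follows. This is a sketch only: encoding-or-nil is a hypothetical helper name, and the code assumes that EXT:MAKE-ENCODING signals an ERROR for a name known neither to CLISP nor to iconv, which the text above does not state explicitly:

```lisp
;; encoding-or-nil is a hypothetical helper, not a CLISP built-in.
;; It returns an EXT:ENCODING for the character set called NAME,
;; or NIL if no such encoding can be created.
;; Assumption: EXT:MAKE-ENCODING signals an ERROR for unknown names.
(DEFUN encoding-or-nil (name)
  (HANDLER-CASE (EXT:MAKE-ENCODING :CHARSET name)
    (ERROR () NIL)))

;; A name that is also built in; per the text above, the built-in
;; implementation is preferred over iconv in this case.
(encoding-or-nil "UTF-8")
```

Names taken from the output of iconv -l can be passed the same way on platforms where iconv support is enabled.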

31.5.3. Line Terminators

The line terminator mode can be one of the following three keywords:

:UNIX: Newline is represented by the ASCII LF character (U+000A).
:MAC: Newline is represented by the ASCII CR character (U+000D).
:DOS: Newline is represented by the ASCII CR followed by the ASCII LF.

Windows programs typically use the :DOS line terminator; sometimes they also accept :UNIX line terminators or produce :MAC line terminators.

The HTTP protocol also requires :DOS line terminators.

The line terminator mode is relevant only for output (writing to a file/pipe/socket STREAM). During input, all three kinds of line terminators are recognized. See also Section 13.11, “Treatment of Newline during Input and Output”.
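
The effect on output can be observed by writing a line through a text STREAM and reading the file back as raw bytes. The following is a minimal sketch, assuming a CLISP built with the UNICODE flag; write-line-as, file-bytes, and the scratch file name are hypothetical and chosen only for this illustration:

```lisp
;; write-line-as and file-bytes are hypothetical helper names.
(DEFUN write-line-as (path text line-terminator)
  "Write TEXT plus a newline to PATH in UTF-8 with LINE-TERMINATOR."
  (WITH-OPEN-FILE (out path :DIRECTION :OUTPUT
                            :IF-EXISTS :SUPERSEDE
                            :EXTERNAL-FORMAT
                            (EXT:MAKE-ENCODING
                             :CHARSET CHARSET:UTF-8
                             :LINE-TERMINATOR line-terminator))
    (WRITE-LINE text out)))

(DEFUN file-bytes (path)
  "Return the raw contents of PATH as a list of byte values."
  (WITH-OPEN-FILE (in path :ELEMENT-TYPE '(UNSIGNED-BYTE 8))
    (LOOP :for b = (READ-BYTE in NIL NIL)
          :while b
          :collect b)))

;; "line-terminator-demo.txt" is a scratch file name for this sketch.
(write-line-as "line-terminator-demo.txt" "ok" :DOS)
(file-bytes "line-terminator-demo.txt")
```

With :DOS the byte list should end in 13 10, with :UNIX in 10, and with :MAC in 13, matching the three modes listed above.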

31.5.4. Function `EXT:MAKE-ENCODING`

The function (EXT:MAKE-ENCODING &KEY :CHARSET :LINE-TERMINATOR :INPUT-ERROR-ACTION :OUTPUT-ERROR-ACTION) returns an EXT:ENCODING. The :CHARSET argument may be an encoding, a string, or :DEFAULT. The possible values for the line terminator argument are the keywords :UNIX, :MAC, :DOS.

The :INPUT-ERROR-ACTION argument specifies what happens when an invalid byte sequence is encountered while converting bytes to characters. Its value can be :ERROR, :IGNORE or a character to be used instead. The UNICODE character #\uFFFD is typically used to indicate an error in the input sequence.

The :OUTPUT-ERROR-ACTION argument specifies what happens when an invalid character is encountered while converting characters to bytes. Its value can be :ERROR, :IGNORE, a byte to be used instead, or a character to be used instead. The UNICODE character #\uFFFD can be used here only if it is encodable in the character set.
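
The two error actions can be illustrated with one lenient decoder and one lossy encoder. This is a sketch under the assumption of a CLISP built with the UNICODE flag; *lenient-utf-8* and *lossy-ascii* are hypothetical variable names introduced here:

```lisp
;; Decoding: replace each invalid byte sequence with U+FFFD
;; instead of signaling an error.
(DEFPARAMETER *lenient-utf-8*
  (EXT:MAKE-ENCODING :CHARSET CHARSET:UTF-8
                     :INPUT-ERROR-ACTION (CODE-CHAR #xFFFD)))

;; Encoding: replace each unencodable character with a question mark.
(DEFPARAMETER *lossy-ascii*
  (EXT:MAKE-ENCODING :CHARSET CHARSET:ASCII
                     :OUTPUT-ERROR-ACTION #\?))

;; The byte 255 never occurs in valid UTF-8.
(EXT:CONVERT-STRING-FROM-BYTES
 (COERCE '(72 105 255) '(VECTOR (UNSIGNED-BYTE 8)))
 *lenient-utf-8*)

;; (CODE-CHAR 233) is not an ASCII character.
(EXT:CONVERT-STRING-TO-BYTES
 (CONCATENATE 'STRING "caf" (STRING (CODE-CHAR 233)))
 *lossy-ascii*)
```

Substituting a character is lossy in both directions, so :ERROR remains the safer choice wherever silent data corruption would be worse than a signaled condition.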

31.5.5. Function `EXT:ENCODING-CHARSET`

Platform Dependent: Only in CLISP built with compile-time flag UNICODE

The function (EXT:ENCODING-CHARSET encoding) returns the charset of the encoding, as a SYMBOL or a STRING.

Warning

(STRING (EXT:ENCODING-CHARSET encoding)) is not necessarily a valid MIME name.

31.5.6. Default encodings

In addition to the encoding that every file/pipe/socket STREAM carries, the following SYMBOL-MACRO places contain global EXT:ENCODINGs:

SYMBOL-MACRO CUSTOM:*DEFAULT-FILE-ENCODING*. The SYMBOL-MACRO place CUSTOM:*DEFAULT-FILE-ENCODING* is the encoding used for new file/pipe/socket STREAMs when no :EXTERNAL-FORMAT argument is specified.

Platform Dependent: Only in CLISP built with compile-time flag UNICODE

The following are SYMBOL-MACRO places.

CUSTOM:*PATHNAME-ENCODING*

is the encoding used for converting filenames in the file system (represented with byte sequences by the OS) to lisp PATHNAME components (STRINGs). If this encoding is incompatible with some file names on your system, file system access (e.g., DIRECTORY) may SIGNAL ERRORs, thus extreme caution is recommended if this is not a “1:1” encoding. Sometimes it may not be obvious that the encoding is involved at all. E.g., on Win32:

(PARSE-NAMESTRING (STRING #\ARMENIAN_SMALL_LETTER_RA))
*** - PARSE-NAMESTRING: syntax error in filename "ռ" at position 0

when CUSTOM:*PATHNAME-ENCODING* is CHARSET:UTF-16 because then #\ARMENIAN_SMALL_LETTER_RA corresponds to the 4 bytes #(255 254 124 5) and the byte 124 is not a valid byte for a Win32 file name because it means | in ASCII.

The set of valid pathname bytes is determined by the GNU autoconf test src/m4/filecharset.m4 at configure time. While rather stable for the first 128 bytes (0-127), on Win32 it varies wildly for the bytes 128-255, depending on the OS version and the file system.

The line terminator mode of CUSTOM:*PATHNAME-ENCODING* is ignored.

Platform Dependent: Mac OS X platform only: Mac OS X pathnames are actually UNICODE STRINGs, so CUSTOM:*PATHNAME-ENCODING* is a constant with value CHARSET:UTF-8.

CUSTOM:*TERMINAL-ENCODING*

is the encoding used for communication with the terminal, in particular by *TERMINAL-IO*.

CUSTOM:*MISC-ENCODING*

is the encoding used for access to environment variables, command line options, and the like. Its line terminator mode is ignored.

CUSTOM:*FOREIGN-ENCODING*

is the encoding for strings passed through the “FFI” (some platforms only). If it is a “1:1” encoding, i.e. an encoding in which every character is represented by one byte, it is also used for passing characters through the “FFI”.

The default encoding objects are initialized according to the -Edomain encoding command-line option.

Reminder

You have to use EXT:LETF/EXT:LETF* for SYMBOL-MACROs; LET/LET* will not work!
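
A temporary change of a default encoding might be written as follows. This is a minimal sketch; read-first-line-as is a hypothetical helper name, not part of CLISP:

```lisp
;; read-first-line-as is a hypothetical helper, not a CLISP built-in.
(DEFUN read-first-line-as (path encoding)
  "Read the first line of PATH while ENCODING is the default file encoding."
  (EXT:LETF ((CUSTOM:*DEFAULT-FILE-ENCODING* encoding))
    ;; No :EXTERNAL-FORMAT is given, so the temporarily assigned
    ;; default file encoding applies to this STREAM.
    (WITH-OPEN-FILE (in path)
      (READ-LINE in NIL NIL))))
```

EXT:LETF restores the previous value of the place when control leaves its body, so the change does not leak into STREAMs opened later.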

31.5.6.1. Default line terminator

The line terminator facet of the above EXT:ENCODINGs is determined by the following logic: since CLISP understands all possible line terminators on input (see Section 13.11, “Treatment of Newline during Input and Output”), all that matters is which line terminator most other programs expect.

Platform Dependent: UNIX platform only: If a nonzero O_BINARY cpp constant is defined, we assume that the OS distinguishes between text and binary files and, since the encodings are relevant only for text files, we use :DOS; otherwise the default is :UNIX.
Platform Dependent: Win32 platform only: Since most Win32 programs expect CRLF, the default line terminator is :DOS.

This boils down to the following code in src/encoding.d:

 #if defined(WIN32) || (defined(UNIX) && (O_BINARY != 0))

Default line terminator on Cygwin

Both of the above tests pass on Cygwin, so the default line terminator is :DOS. If you so desire, you can change it in your RC file.

31.5.7. Converting between strings and byte vectors

Encodings can also be used to convert directly between strings and their corresponding byte vector representation according to that encoding.

(EXT:CONVERT-STRING-FROM-BYTES vector encoding &KEY :START :END): converts the subsequence of vector (a (VECTOR (UNSIGNED-BYTE 8))) from start to end to a STRING, according to the given encoding, and returns the resulting string.
(EXT:CONVERT-STRING-TO-BYTES string encoding &KEY :START :END): converts the subsequence of string from start to end to a (VECTOR (UNSIGNED-BYTE 8)), according to the given encoding, and returns the resulting byte vector.
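
A short illustration of both functions and of the :START and :END arguments, assuming a CLISP built with the UNICODE flag so that CHARSET:UTF-8 is available:

```lisp
;; Encode a string to UTF-8 bytes, encode a suffix of it, and
;; decode a prefix of the byte vector again.
(LET* ((bytes (EXT:CONVERT-STRING-TO-BYTES "abc" CHARSET:UTF-8))
       (tail  (EXT:CONVERT-STRING-TO-BYTES "abc" CHARSET:UTF-8 :START 1)))
  (LIST bytes    ; the bytes of "abc": 97 98 99
        tail     ; the bytes of "bc":  98 99
        ;; Decode only the first two bytes.
        (EXT:CONVERT-STRING-FROM-BYTES bytes CHARSET:UTF-8 :END 2)))
```

For multibyte encodings, :START and :END in EXT:CONVERT-STRING-FROM-BYTES count bytes, not characters, so a subsequence boundary that falls inside a multibyte sequence yields an invalid byte sequence.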
