Support for Unicode with Various Databases

Most database management systems support a "national" character set data type (NCHAR with SQL Server / Oracle / MySQL and GRAPHIC in the case of DB2). The Adelia table generator uses this data type for ALPHA types with the "Unicode" option. However, depending how this data type is implemented, the semantics can vary and are not necessarily related to Unicode characters.

Except with MySQL, when a database stores character data in a multi-byte code page, the length of the "CHAR" type is always expressed in bytes, whereas the length of the "NCHAR" type may be expressed in bytes or characters, depending on the context.

The table below lists the specific features of the character types in each DBMS:

DBMS (Adelia type)	CHAR (ALPHA non Unicode)	NCHAR (ALPHA Unicode)
SQL Server	By default, characters are stored in the system's ANSI code page.	Characters are stored using Unicode coding (UTF-16).
Oracle	Characters are stored in the database's code page. A Unicode Oracle base uses UTF-8 as its main code page.	Characters are stored in the database's national code page. A Unicode Oracle base uses UTF-16 as its national code page.
DB2 UDB (non Unicode base)	Characters are stored in the database's code page (single- or multi-byte coding).	The NCHAR type is called "GRAPHIC". Characters are stored in the database's code page (with single- or multi-byte coding, as for CHAR), but the length is expressed in characters rather than bytes (internally, each character is stored on two bytes). The "GRAPHIC" data type is only available if the database uses an MBCS code page.
DB2 UDB (Unicode base)	Characters are stored in the UTF-8 code page.	Characters are stored in the UTF-16 code page.
DB2/400	Characters are stored in the table's code page (which, by default, is the system's code page).	The NCHAR type is called "GRAPHIC", and Adelia generates it with UCS-2 coding.
MySQL	Characters are stored in the database's code page.	With MySQL 5, data is stored in UTF-8; the "NCHAR" type was previously considered to be synonymous with the "CHAR" type. Adelia generates a "char character set utf-8" column to maintain compatibility with both versions.
PostgreSQL	Characters are stored in the database's code page.	Characters are stored in the database's code page.
Access	Character data is stored in Unicode.	Character data is stored in Unicode.

The problem with UTF-8 (and with MBCS code pages in general):

Code pages in which characters are coded with a variable length pose a number of problems, primarily with DB2 and Oracle:

In DB2 UDB the NCHAR (GRAPHIC) type is not always available. For that, the database's code page must be MBCS. In addition, in order to be Unicode-compatible, the DB2 base must be explicitly declared for Unicode (with UTF-8 code pages for CHAR, and UTF-16 for GRAPHIC), which can be problematic if you store non-ASCII characters in a CHAR field (see below).
With DB2 UDB and Oracle, the length of "CHAR" data fields is expressed in bytes, not as a "number of characters". This means that a CHAR(10) may not be able to store 10 characters, if the characters are coded on more than one byte. This limitation also applies to Oracle NCHAR types.
The general semantics used for string handling instructions can be affected by the coding system. For example, LENGTH returns lengths in bytes by default (DB2, Oracle).
The UTF-16 standard is not a fixed-width standard. However, in most cases, coding can be assumed to be in UCS-2 (a subset of UTF-16 with a fixed, double-byte width), with any characters requiring coding on more than two bytes (e.g. Ancient Persian) being special cases. Most commonly-used languages, including Asian languages, can be represented in UCS-2.

Recommendations for use:

- If you want to use an all-Unicode (UTF-8 code page) DB2 UDB or Oracle database, we recommend generating any fields that are liable to contain non-ASCII characters in Unicode.
- For DB2, we recommend using version 9.5, which supports "character" semantics for SQL scalar functions such as length, extract, etc. In earlier versions, SQL scalar functions had byte-oriented semantics that can result in incorrect results with UTF-8 fields.
- For Oracle, although the recommended coding system for Unicode bases is UTF-8 for the base and UTF-16 for the national character set, we advise opting instead for a single-byte code page (ISO8859-1 or WE8MSWIN1252) for the character set and AL16UTF16 for the national character set.
- For PostgreSQL, we recommend creating the database with the UTF-8 code page to ensure compatibility with UNICODE applications.

↑ Top of page