Oracle® Database Globalization Support Guide
11g Release 2 (11.2)

6 Supporting Multilingual Databases with Unicode

This chapter illustrates how to use Unicode in an Oracle Database environment. This chapter includes the following topics:

  • Overview of Unicode

  • What is Unicode?

  • Implementing a Unicode Solution in the Database

  • Unicode Case Studies

  • Designing Database Schemas to Support Multiple Languages

Overview of Unicode

Unicode is a character encoding system that defines every character in most of the spoken languages in the world.

To overcome the limitations of existing character encodings, several organizations began working on the creation of a global character set in the late 1980s. The need for this became even greater with the development of the World Wide Web in the mid-1990s. The Internet has changed how companies do business, with an emphasis on the global market that has made a universal character set a major requirement.

A global character set must fulfill a number of conditions and provide a number of capabilities beyond those of the legacy encodings it replaces.

Unicode, which is now in wide use, meets all of the requirements and capabilities of a global character set.

What is Unicode?

Unicode is a universally encoded character set that enables information from any language to be stored using a single character set. Unicode provides a unique code value for every character, regardless of the platform, program, or language.

The Unicode standard has been adopted by many software and hardware vendors. Many operating systems and browsers now support Unicode. Unicode is required by standards such as XML, Java, JavaScript, LDAP, and WML. It is also synchronized with the ISO/IEC 10646 standard.

Oracle began supporting Unicode as a database character set in Oracle 7. In Oracle Database 11g, Unicode support has been expanded to cover Unicode 5.0.

See Also:

http://www.unicode.org for more information about the Unicode standard

This section contains the following topics:

  • Supplementary Characters

  • Unicode Encodings

  • Support for Unicode in Oracle Database

Supplementary Characters

The first version of Unicode was a 16-bit, fixed-width encoding that used two bytes to encode each character. This enabled 65,536 characters to be represented. However, more characters needed to be supported, especially additional CJK ideographs that are important for the Chinese, Japanese, and Korean markets.

Unicode defines supplementary characters to meet this need. A supplementary character is represented by two 16-bit code units, known as a surrogate pair. The implementation of supplementary characters enables more than a million additional characters to be defined.

Adding supplementary characters has increased the complexity of Unicode; however, this is less complex than managing several different encodings in the same configuration.
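For example (an illustration added here, not part of the original text), the treble clef symbol U+1D11E lies beyond the 16-bit range, so it is written as the surrogate pair D834 DD1E. The SQL function UNISTR assembles the pair into a single character:

-- The two surrogate code units below together encode the one
-- supplementary character U+1D11E (treble clef).
SELECT UNISTR('\D834\DD1E') AS treble_clef FROM dual;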

Unicode Encodings

The Unicode standard encodes characters in different ways: UTF-8, UCS-2, and UTF-16. Conversion between different Unicode encodings is a simple bit-wise operation that is defined in the Unicode standard.

This section contains the following topics:

  • UTF-8 Encoding

  • UCS-2 Encoding

  • UTF-16 Encoding

  • Examples: UTF-16, UTF-8, and UCS-2 Encoding

UTF-8 Encoding

UTF-8 is the 8-bit encoding of Unicode. It is a variable-width encoding and a strict superset of ASCII: every character in the ASCII character set is available in UTF-8 with the same code point value. One Unicode character can occupy 1, 2, 3, or 4 bytes in UTF-8 encoding. Characters from European scripts are represented in either 1 or 2 bytes. Characters from most Asian scripts are represented in 3 bytes. Supplementary characters are represented in 4 bytes.

UTF-8 is the Unicode encoding used for HTML and most Internet browsers.
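These byte widths can be observed directly. A minimal sketch, assuming an AL32UTF8 database character set (TO_CHAR converts the UNISTR result into the database character set, and LENGTHB returns its length in bytes):

SELECT LENGTHB('a')                           AS one_byte,    -- ASCII letter
       LENGTHB(TO_CHAR(UNISTR('\00E4')))      AS two_bytes,   -- Latin a with umlaut
       LENGTHB(TO_CHAR(UNISTR('\4E9C')))      AS three_bytes, -- CJK ideograph
       LENGTHB(TO_CHAR(UNISTR('\D834\DD1E'))) AS four_bytes   -- treble clef
FROM   dual;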

The benefits of UTF-8 are as follows:

UCS-2 Encoding

UCS-2 is a fixed-width, 16-bit encoding. Each character is 2 bytes. UCS-2 is the Unicode encoding used for internal processing by Java before version J2SE 5.0 and by Microsoft Windows NT. UCS-2 supports characters defined for Unicode 3.0, so there is no support for supplementary characters.

The benefits of UCS-2 over UTF-8 are as follows:

  • More compact storage for Asian scripts, because all characters are two bytes

  • Faster string processing, because characters are fixed-width

  • Better compatibility with Java and Microsoft clients

UTF-16 Encoding

UTF-16 encoding is the 16-bit encoding of Unicode. UTF-16 is an extension of UCS-2: it supports supplementary characters by representing each one as two UCS-2 code points (a surrogate pair). UTF-16 is a strict superset of UCS-2.

One character can be either 2 bytes or 4 bytes in UTF-16. Characters from European and most Asian scripts are represented in 2 bytes. Supplementary characters are represented in 4 bytes. UTF-16 is the main Unicode encoding used for internal processing by Java since version J2SE 5.0 and by Microsoft Windows since version 2000.

The benefits of UTF-16 over UTF-8 are as follows:

Examples: UTF-16, UTF-8, and UCS-2 Encoding

Figure 6-1 shows some characters and their character codes in UTF-16, UTF-8, and UCS-2 encoding. The last character is a treble clef (a music symbol), a supplementary character.

Figure 6-1 UTF-16, UTF-8, and UCS-2 Encoding Examples


Support for Unicode in Oracle Database

Oracle Database began supporting Unicode as a database character set in release 7. Table 6-1 summarizes the Unicode character sets supported by Oracle Database.

Table 6-1 Unicode Character Sets Supported by Oracle Database

Character Set: AL24UTFFSS
  Supported in RDBMS releases: 7.2 through 8i
  Unicode encoding: UTF-8
  Unicode version: 1.1
  Database character set: Yes
  National character set: No

Character Set: UTF8
  Supported in RDBMS releases: 8.0 through 11g
  Unicode encoding: UTF-8
  Unicode version: 2.1 (releases 8.0 through 8.1.6); 3.0 (release 8.1.7 and later)
  Database character set: Yes
  National character set: Yes (Oracle9i Database and newer only)

Character Set: UTFE
  Supported in RDBMS releases: 8.0 through 11g
  Unicode encoding: UTF-EBCDIC
  Unicode version: 2.1 (releases 8.0 through 8.1.6); 3.0 (release 8.1.7 and later)
  Database character set: Yes
  National character set: No

Character Set: AL32UTF8
  Supported in RDBMS releases: 9i through 11g
  Unicode encoding: UTF-8
  Unicode version: 3.0 (Oracle9i release 1); 3.1 (Oracle9i release 2); 3.2 (Oracle Database 10g release 1); 4.0 (Oracle Database 10g release 2); 5.0 (Oracle Database 11g)
  Database character set: Yes
  National character set: No

Character Set: AL16UTF16
  Supported in RDBMS releases: 9i through 11g
  Unicode encoding: UTF-16
  Unicode version: 3.0 (Oracle9i release 1); 3.1 (Oracle9i release 2); 3.2 (Oracle Database 10g release 1); 4.0 (Oracle Database 10g release 2); 5.0 (Oracle Database 11g)
  Database character set: No
  National character set: Yes


Implementing a Unicode Solution in the Database

Unicode characters can be stored in an Oracle database in two ways:

  • In SQL CHAR data types (CHAR, VARCHAR2, and CLOB) in a Unicode database, where the database character set is a Unicode character set

  • In SQL NCHAR data types (NCHAR, NVARCHAR2, and NCLOB), whose national character set is always a Unicode character set, regardless of the database character set

Note:

You can combine a Unicode database solution with a Unicode data type solution.

The following sections explain how to use the two Unicode solutions and how to choose between them:

Enabling Multilingual Support with Unicode Databases

The database character set specifies the encoding used for the SQL CHAR data types as well as for metadata such as table names, column names, and SQL statement text. A Unicode database is a database whose database character set is a UTF-8 character set. Three Oracle character sets implement the UTF-8 encoding: the first two are designed for ASCII-based platforms, while the third should be used on EBCDIC platforms.

  • AL32UTF8

    The AL32UTF8 character set supports the latest version of the Unicode standard. It encodes characters in one, two, or three bytes. Supplementary characters require four bytes. It is for ASCII-based platforms.

  • UTF8

    The UTF8 character set encodes characters in one, two, or three bytes. It is for ASCII-based platforms.

    Supplementary characters inserted into a UTF8 database do not corrupt the data in the database. A supplementary character is treated as two separate, user-defined characters that occupy 6 bytes. Oracle recommends that you switch to AL32UTF8 for full support of supplementary characters in the database character set.

  • UTFE

    The UTFE character set is for EBCDIC platforms. It is similar to UTF8 on ASCII platforms, but it encodes characters in one, two, three, or four bytes. Supplementary characters are stored as two 4-byte sequences.

Example 6-1 Creating a Database with a Unicode Character Set

To create a database with the AL32UTF8 character set, use the CREATE DATABASE statement and include the CHARACTER SET AL32UTF8 clause. For example:

CREATE DATABASE sample 
CONTROLFILE REUSE 
LOGFILE 
GROUP 1 ('diskx:log1.log', 'disky:log1.log') SIZE 50K, 
GROUP 2 ('diskx:log2.log', 'disky:log2.log') SIZE 50K 
MAXLOGFILES 5 
MAXLOGHISTORY 100 
MAXDATAFILES 10 
MAXINSTANCES 2 
ARCHIVELOG
CHARACTER SET AL32UTF8
NATIONAL CHARACTER SET AL16UTF16 
DATAFILE 
'disk1:df1.dbf' AUTOEXTEND ON, 
'disk2:df2.dbf' AUTOEXTEND ON NEXT 10M MAXSIZE UNLIMITED 
DEFAULT TEMPORARY TABLESPACE temp_ts 
UNDO TABLESPACE undo_ts 
SET TIME_ZONE = '+02:00';

Note:

Specify the database character set when you create the database.
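After creation, the chosen character sets can be confirmed. A minimal check using the NLS_DATABASE_PARAMETERS view:

SELECT parameter, value
FROM   nls_database_parameters
WHERE  parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');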

Enabling Multilingual Support with Unicode Data Types

An alternative to storing Unicode data in the database is to use the SQL NCHAR data types (NCHAR, NVARCHAR2, NCLOB). You can store Unicode characters in columns of these data types regardless of how the database character set has been defined. The NCHAR data type is exclusively a Unicode data type, which means that it stores data encoded in Unicode.

You can create a table using the NVARCHAR2 and NCHAR data types. The column length specified for NCHAR and NVARCHAR2 columns is always a number of characters, not a number of bytes:

CREATE TABLE product_information 
    ( product_id          NUMBER(6) 
    , product_name        NVARCHAR2(100) 
    , product_description VARCHAR2(1000));
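To illustrate the character-versus-byte distinction, a sketch (the inserted row is hypothetical, and the byte counts assume the default AL16UTF16 national character set):

INSERT INTO product_information (product_id, product_name, product_description)
VALUES (1, N'Käse', 'German product name stored in Unicode');

SELECT LENGTH(product_name)  AS char_count,  -- 4 characters
       LENGTHB(product_name) AS byte_count   -- 8 bytes in AL16UTF16
FROM   product_information
WHERE  product_id = 1;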

The encoding used in the SQL NCHAR data types is the national character set specified for the database. You can specify one of the following Oracle character sets as the national character set:

  • AL16UTF16

    This is the default character set for SQL NCHAR data types. This character set encodes Unicode data in the UTF-16 encoding. It supports supplementary characters, which are stored as four bytes.

  • UTF8

    When UTF8 is specified for SQL NCHAR data types, the data stored in the SQL data types is in UTF-8 encoding.

You can specify the national character set for the SQL NCHAR data types when you create a database using the CREATE DATABASE statement with the NATIONAL CHARACTER SET clause. The following statement creates a database with WE8ISO8859P1 as the database character set and AL16UTF16 as the national character set.

Example 6-2 Creating a Database with a National Character Set

CREATE DATABASE sample 
CONTROLFILE REUSE 
LOGFILE 
GROUP 1 ('diskx:log1.log', 'disky:log1.log') SIZE 50K, 
GROUP 2 ('diskx:log2.log', 'disky:log2.log') SIZE 50K 
MAXLOGFILES 5 
MAXLOGHISTORY 100 
MAXDATAFILES 10 
MAXINSTANCES 2 
ARCHIVELOG
CHARACTER SET WE8ISO8859P1
NATIONAL CHARACTER SET AL16UTF16 
DATAFILE 
'disk1:df1.dbf' AUTOEXTEND ON, 
'disk2:df2.dbf' AUTOEXTEND ON NEXT 10M MAXSIZE UNLIMITED 
DEFAULT TEMPORARY TABLESPACE temp_ts 
UNDO TABLESPACE undo_ts 
SET TIME_ZONE = '+02:00';

How to Choose Between a Unicode Database and a Unicode Data Type Solution

To choose the correct Unicode solution for your database, consider the following questions:

  • Programming environment: What are the main programming languages used in your applications? How do they support Unicode?

  • Ease of migration: How easily can your data and applications be migrated to take advantage of the Unicode solution?

  • Types of data: Is your data mostly Asian or European? Do you need to store multilingual documents in LOB columns?

  • Types of applications: What type of application are you implementing: a packaged application or a customized end-user application?

This section describes some general guidelines for choosing a Unicode database or a Unicode data type solution. The final decision largely depends on your exact environment and requirements. This section contains the following topics:

When Should You Use a Unicode Database?

Use a Unicode database in the situations described in Table 6-2.

Table 6-2 Using a Unicode Database

Situation: You need easy code migration for Java or PL/SQL.

Explanation: If your existing application is mainly written in Java and PL/SQL and your main concern is to minimize the code changes required to support multiple languages, then you may want to use a Unicode database solution. If the data types used to store data remain SQL CHAR data types, then the Java and PL/SQL code that accesses these columns does not need to change.

Situation: You have evenly distributed multilingual data.

Explanation: If the multilingual data is evenly distributed in existing schema tables and you are not sure which tables contain multilingual data, then you should use a Unicode database because it does not require you to identify the kind of data that is stored in each column.

Situation: Your SQL statements and PL/SQL code contain Unicode data.

Explanation: You must use a Unicode database. SQL statements and PL/SQL code are converted into the database character set before being processed. If the SQL statements and PL/SQL code contain characters that cannot be converted to the database character set, then those characters are lost. A common place to use Unicode data in a SQL statement is in a string literal; see the sketch following this table.

Situation: You want to store multilingual documents in BLOB format and use Oracle Text for content searching.

Explanation: You must use a Unicode database. The BLOB data is converted to the database character set before being indexed by Oracle Text. If your database character set is not UTF8, then data is lost when the documents contain characters that cannot be converted to the database character set.
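As a sketch of the string-literal pitfall (the table and values here are hypothetical), consider an application that embeds a translated name directly in a SQL statement. In an AL32UTF8 database the literal survives; in a database whose character set cannot represent é or 本, those characters are corrupted before the statement even executes:

-- The statement text itself is converted to the database character set
-- before processing, so the literal is only safe in a Unicode database.
INSERT INTO product_names (product_id, name)
VALUES (300, 'Café 本');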


When Should You Use Unicode Data Types?

Use Unicode data types in the situations described in Table 6-3.

Table 6-3 Using Unicode Data Types

Situation: You want to add multilingual support incrementally.

Explanation: If you want to add Unicode support to the existing database without migrating the character set, then consider using Unicode data types to store Unicode data. You can add columns of the SQL NCHAR data types to existing tables or new tables to support multiple languages incrementally.

Situation: You want to build a packaged application.

Explanation: If you are building a packaged application to sell to customers, then you may want to build the application using SQL NCHAR data types. The SQL NCHAR data type is a reliable Unicode data type in which the data is always stored in Unicode, and the length of the data is always specified in UTF-16 code units. As a result, you need to test the application only once. The application will run on customer databases with any database character set.

Situation: You want better performance with single-byte database character sets.

Explanation: If performance is your main concern, then consider using a single-byte database character set and storing Unicode data in the SQL NCHAR data types.

Situation: You require UTF-16 support in Windows clients.

Explanation: If your applications are written in Visual C/C++ or Visual Basic running on Windows, then you may want to use the SQL NCHAR data types. You can store UTF-16 data in SQL NCHAR data types in the same way that you store it in the wchar_t buffer in Visual C/C++ and string buffer in Visual Basic. You can avoid buffer overflow in client applications because the lengths of the wchar_t and string data types match the length of the SQL NCHAR data types in the database.


Note:

You can use a Unicode database with Unicode data types.

Comparing Unicode Character Sets for Database and Data Type Solutions

Oracle provides two solutions to store Unicode characters in the database: a Unicode database solution and a Unicode data type solution. After you select the Unicode database solution, the Unicode data type solution, or a combination of both, you then determine the character set to be used in the Unicode database or the Unicode data type.

Table 6-4 contains advantages and disadvantages of character sets for a Unicode database solution. The Oracle character sets that can be Unicode database character sets are AL32UTF8, UTF8, and UTFE.

Table 6-4 Character Set Advantages and Disadvantages for a Unicode Database Solution

Database Character Set: AL32UTF8

Advantages:

  • Supplementary characters are stored in 4 bytes, so there is no data conversion when supplementary characters are retrieved and inserted if the client setting is UTF-8.

  • The storage for supplementary characters requires less disk space in AL32UTF8 than in UTF8.

Disadvantages:

  • You cannot specify the length of SQL CHAR types in number of UCS-2 code points for supplementary characters. Supplementary characters are treated as one code point rather than the standard two code points.

  • The binary order for SQL CHAR columns is different from the binary order of SQL NCHAR columns when the data consists of supplementary characters. As a result, CHAR columns and NCHAR columns do not always have the same sort for identical strings.

Database Character Set: UTF8

Advantages:

  • You can specify the length of SQL CHAR types in number of UCS-2 code points.

  • The binary order of the SQL CHAR columns is always the same as the binary order of the SQL NCHAR columns when the data consists of the same supplementary characters. As a result, CHAR columns and NCHAR columns have the same sort for identical strings.

Disadvantages:

  • Supplementary characters are stored as 6 bytes instead of the 4 bytes defined by Unicode 4.0. As a result, Oracle must convert data for supplementary characters if the client setting is UTF-8.

Database Character Set: UTFE

Advantages:

  • This is the only Unicode character set for the EBCDIC platform.

  • You can specify the length of SQL CHAR types in number of UCS-2 code points.

  • The binary order of the SQL CHAR columns is always the same as the binary order of the SQL NCHAR columns when the data consists of the same supplementary characters. As a result, CHAR columns and NCHAR columns have the same sort for identical strings.

Disadvantages:

  • Supplementary characters are stored as 8 bytes (two 4-byte sequences) instead of the 5 bytes defined by the Unicode standard. As a result, Oracle must convert data for those supplementary characters.

  • UTFE is not a standard encoding in the Unicode standard. As a result, clients requiring standard UTF-8 encoding must convert data from UTFE to the standard encoding when data is retrieved and inserted.


Table 6-5 contains advantages and disadvantages of different character sets for a Unicode data type solution. The Oracle character sets that can be national character sets are AL16UTF16 and UTF8. The default is AL16UTF16.

Table 6-5 Character Set Advantages and Disadvantages for a Unicode Data Type Solution

National Character Set: AL16UTF16

Advantages:

  • Asian data in AL16UTF16 is usually more compact than in UTF8. As a result, you save disk space and have less disk I/O when most of the multilingual data stored in the database is Asian data.

  • It is usually faster to process strings encoded in the AL16UTF16 character set than strings encoded in UTF8, because Oracle processes most characters in an AL16UTF16 encoded string as fixed-width characters.

  • The maximum length limits for the NCHAR and NVARCHAR2 columns are 1000 and 2000 characters, respectively. Because the data is fixed-width, the lengths are guaranteed.

Disadvantages:

  • European ASCII data requires more disk space to store in AL16UTF16 than in UTF8. If most of your data is European data, then it uses more disk space than if it were UTF8 data.

  • The maximum lengths for NCHAR and NVARCHAR2 are 1000 and 2000 characters, which is less than the limits for NCHAR (2000) and NVARCHAR2 (4000) in UTF8.

National Character Set: UTF8

Advantages:

  • European data in UTF8 is usually more compact than in AL16UTF16. As a result, you save disk space and have better response time when most of the multilingual data stored in the database is European data.

  • The maximum lengths for the NCHAR and NVARCHAR2 columns are 2000 and 4000 characters respectively, which is more than those for NCHAR (1000) and NVARCHAR2 (2000) in AL16UTF16. Although the maximum lengths of the NCHAR and NVARCHAR2 columns are larger in UTF8, the actual storage size is still bound by the byte limits of 2000 and 4000 bytes, respectively. For example, you can store 4000 UTF8 characters in an NVARCHAR2 column if all the characters are single byte, but only 4000/3 characters if all the characters are three bytes.

Disadvantages:

  • Asian data requires more disk space to store in UTF8 than in AL16UTF16. If most of your data is Asian data, then disk space usage is not as efficient as when the character set is AL16UTF16.

  • Although you can specify larger length limits for NCHAR and NVARCHAR2, you are not guaranteed to be able to insert the number of characters specified by these limits, because UTF8 allows variable-width characters.

  • It is usually slower to process strings encoded in UTF8 than strings encoded in AL16UTF16, because UTF8 encoded strings consist of variable-width characters.


Unicode Case Studies

This section describes typical scenarios for storing Unicode characters in an Oracle database:

Example 6-3 Unicode Solution with a Unicode Database

An American company running a Java application would like to add German and French support in the next release of the application. They would like to add Japanese support at a later time. The company currently has the following system configuration:

In this case, a typical solution is to choose UTF8 for the database character set because of the following reasons:

Example 6-4 Unicode Solution with Unicode Data Types

A European company that runs its applications mainly on Windows platforms wants to add new Windows applications written in Visual C/C++. The new applications will use the existing database to support Japanese and Chinese customer names. The company currently has the following system configuration:

A typical solution is to take the following actions:

The reasons for this solution are:

Example 6-5 Unicode Solution with a Unicode Database and Unicode Data Types

A Japanese company wants to develop a new Java application. The company expects that the application will support as many languages as possible in the long run.

In this case, the typical solution is to create a Unicode database using AL32UTF8 as the database character set and use the SQL NCHAR data types to store multilingual data. The national character set should be set to AL16UTF16. The reasons for this solution are as follows:

Designing Database Schemas to Support Multiple Languages

In addition to choosing a Unicode solution, the following issues should be taken into consideration when the database schema is designed to support multiple languages:

Specifying Column Lengths for Multilingual Data

When you use NCHAR and NVARCHAR2 data types for storing multilingual data, the column size specified for a column is defined in number of characters. (The number of characters means the number of Unicode code units.) Table 6-6 shows the maximum size of the NCHAR and NVARCHAR2 data types for the AL16UTF16 and UTF8 national character sets.

Table 6-6 Maximum Data Type Size

National Character Set: AL16UTF16
  Maximum column size of NCHAR data type: 1000 characters
  Maximum column size of NVARCHAR2 data type: 2000 characters

National Character Set: UTF8
  Maximum column size of NCHAR data type: 2000 bytes
  Maximum column size of NVARCHAR2 data type: 4000 bytes


When you use CHAR and VARCHAR2 data types for storing multilingual data, the maximum length specified for each column is, by default, in number of bytes. If the database needs to support Thai, Arabic, or multibyte languages such as Chinese and Japanese, then the maximum lengths of the CHAR, VARCHAR, and VARCHAR2 columns may need to be extended. This is because the number of bytes required to encode these languages in UTF8 or AL32UTF8 may be significantly larger than the number of bytes for encoding English and Western European languages. For example, one Thai character in the Thai character set requires 3 bytes in UTF8 or AL32UTF8. In addition, the maximum column lengths for CHAR, VARCHAR, and VARCHAR2 data types are 2000 bytes, 4000 bytes, and 4000 bytes respectively. If applications need to store more than 4000 bytes, then they should use the CLOB data type.
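Where byte semantics are too restrictive, column lengths can instead be declared in characters. A sketch (the table name is hypothetical; the CHAR qualifier is standard Oracle syntax for character length semantics):

-- product_name can hold 100 characters even if each one needs 3 bytes
-- in AL32UTF8; Oracle reserves up to 300 bytes, within the 4000-byte limit.
CREATE TABLE translated_product
  ( product_id   NUMBER(6)
  , product_name VARCHAR2(100 CHAR)
  );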

Storing Data in Multiple Languages

The Unicode character set includes characters of most written languages around the world, but it does not contain information about the language to which a given character belongs. In other words, a character such as ä does not indicate whether it is French or German. In order to present data in the language a user desires, data stored in a Unicode database should be accompanied by information identifying the language to which it belongs.

There are many ways for a database schema to relate data to a language. The following sections discuss different approaches:

Store Language Information with the Data

For data such as product descriptions or product names, you can add a language column (language_id) of CHAR or VARCHAR2 data type to the product table to identify the language of the corresponding product information. This enables applications to retrieve the information in the desired language. The possible values for this language column are the 3-letter abbreviations of the valid NLS_LANGUAGE values of the database.

See Also:

Appendix A, "Locale Data" for a list of NLS_LANGUAGE values and their abbreviations

You can also create a view to select the data of the current language. For example:

ALTER TABLE scott.product_information ADD (language_id VARCHAR2(50));

CREATE OR REPLACE VIEW product AS 
   SELECT product_id, product_name 
   FROM   product_information 
   WHERE  language_id = SYS_CONTEXT('USERENV','LANG');
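A usage sketch: after the session language is switched, the view automatically returns the matching translations. (D is the abbreviation that SYS_CONTEXT('USERENV', 'LANG') returns for German.)

ALTER SESSION SET NLS_LANGUAGE = 'GERMAN';

-- Returns only rows whose language_id is 'D'
SELECT product_name FROM product;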

Select Translated Data Using Fine-Grained Access Control

Fine-grained access control enables you to limit the degree to which a user can view information in a table or view. Typically, this is done by appending a WHERE clause. When you add a WHERE clause as a fine-grained access policy to a table or view, Oracle automatically appends the WHERE clause to any SQL statements on the table at run time so that only those rows satisfying the WHERE clause can be accessed.

You can use this feature to avoid specifying the desired language of a user in the WHERE clause in every SELECT statement in your applications. The following WHERE clause limits the view of a table to the rows corresponding to the desired language of a user:

WHERE language_id = SYS_CONTEXT('userenv', 'LANG')

Specify this WHERE clause as a fine-grained access policy for product_information as follows:

CREATE OR REPLACE FUNCTION func1 (sch VARCHAR2, obj VARCHAR2)
RETURN VARCHAR2
AS
BEGIN
  RETURN 'language_id = SYS_CONTEXT(''userenv'', ''LANG'')';
END;
/

BEGIN
  DBMS_RLS.ADD_POLICY('scott', 'product_information', 'lang_policy',
                      'scott', 'func1', 'select');
END;
/

Oracle then automatically appends the WHERE clause to any SELECT statement on the product_information table.

See Also:

Oracle Database Advanced Application Developer's Guide for more information about fine-grained access control

Storing Documents in Multiple Languages in LOB Data Types

You can store documents in multiple languages in CLOB, NCLOB, or BLOB data types and set up Oracle Text to enable content search for the documents.

Data in CLOB columns is stored in a format that is compatible with UCS-2 when the database character set is multibyte, such as UTF8 or AL32UTF8. This means that the storage space required for an English document doubles when the data is converted. Storage for an Asian language document in a CLOB column requires less storage space than the same document in a LONG column using UTF8, typically around 30% less, depending on the contents of the document.

Documents in NCLOB format are also stored in a proprietary format that is compatible with UCS-2 regardless of the database character set or national character set. The storage space requirement is the same as for CLOB data. Document contents are converted to UTF-16 when they are inserted into a NCLOB column. If you want to store multilingual documents in a non-Unicode database, then choose NCLOB. However, content search on NCLOB is not yet supported.

Documents in BLOB format are stored as they are. No data conversion occurs during insertion and retrieval. However, SQL string manipulation functions (such as LENGTH or SUBSTR) and collation functions (such as NLS_SORT and ORDER BY) cannot be applied to the BLOB data type.

Table 6-7 lists the advantages and disadvantages of the CLOB, NCLOB, and BLOB data types when storing documents:

Table 6-7 Comparison of LOB Data Types for Document Storage

Data Types Advantages Disadvantages

CLOB

  • Content search support

  • String manipulation support

  • Depends on database character set

  • Data conversion is necessary for insertion

  • Cannot store binary documents

NCLOB

  • Independent of database character set

  • String manipulation support

  • No content search support

  • Data conversion is necessary for insertion

  • Cannot store binary documents

BLOB

  • Independent of database character set

  • Content search support

  • No data conversion, data stored as is

  • Can store binary documents such as Microsoft Word or Microsoft Excel

  • No string manipulation support


Creating Indexes for Searching Multilingual Document Contents

Oracle Text enables you to build indexes for content search on multilingual documents stored in CLOB format and BLOB format. It uses a language-specific lexer to parse the CLOB or BLOB data and produces a list of searchable keywords.

Create a multilexer to search multilingual documents. The multilexer chooses a language-specific lexer for each row, based on a language column. This section describes the high-level steps to create indexes for documents in multiple languages. It contains the following topics:

Creating Multilexers

The first step in creating the multilexer is the creation of language-specific lexer preferences for each language supported. The following example creates English, German, and Japanese lexers with PL/SQL procedures:

BEGIN
  ctx_ddl.create_preference('english_lexer', 'basic_lexer');
  ctx_ddl.set_attribute('english_lexer', 'index_themes', 'yes');
  ctx_ddl.create_preference('german_lexer', 'basic_lexer');
  ctx_ddl.set_attribute('german_lexer', 'composite', 'german');
  ctx_ddl.set_attribute('german_lexer', 'alternate_spelling', 'german');
  ctx_ddl.set_attribute('german_lexer', 'mixed_case', 'yes');
  ctx_ddl.create_preference('japanese_lexer', 'JAPANESE_VGRAM_LEXER');
END;
/

After the language-specific lexer preferences are created, they need to be gathered together under a single multilexer preference. First, create the multilexer preference, using the MULTI_LEXER object:

BEGIN
  ctx_ddl.create_preference('global_lexer', 'multi_lexer');
END;
/

Now add the language-specific lexers to the multilexer preference using the add_sub_lexer call:

BEGIN
  ctx_ddl.add_sub_lexer('global_lexer', 'german', 'german_lexer');
  ctx_ddl.add_sub_lexer('global_lexer', 'japanese', 'japanese_lexer');
  ctx_ddl.add_sub_lexer('global_lexer', 'default', 'english_lexer');
END;
/

This nominates the german_lexer preference to handle German documents, the japanese_lexer preference to handle Japanese documents, and the english_lexer preference to handle everything else, using DEFAULT as the language.

Creating Indexes for Documents Stored in the CLOB Data Type

The multilexer decides which lexer to use for each row based on a language column in the table. The language column is a character column that stores the language of the document in each row; use the Oracle language name to identify the language of a document in this column. For example, if you use the CLOB data type to store your documents, then add the language column to the table where the documents are stored:

CREATE TABLE globaldoc
  ( doc_id    NUMBER       PRIMARY KEY,
    language  VARCHAR2(30),
    text      CLOB );

To create an index for this table, use the multilexer preference and specify the name of the language column:

CREATE INDEX globalx ON globaldoc(text)
  indextype IS ctxsys.context
  parameters ('lexer global_lexer
               language column language');

Creating Indexes for Documents Stored in the BLOB Data Type

In addition to the language column, the character set and format columns must be added in the table where the documents are stored. The character set column stores the character set of the documents using the Oracle character set names. The format column specifies whether a document is a text or binary document. For example, the CREATE TABLE statement can specify columns called characterset and format:

CREATE TABLE globaldoc ( 
   doc_id       NUMBER       PRIMARY KEY, 
   language     VARCHAR2(30), 
   characterset VARCHAR2(30), 
   format       VARCHAR2(10), 
   text         BLOB 
  ); 

You can put word-processing or spreadsheet documents into the table and specify binary in the format column. Documents in HTML, XML, or plain text format can be stored with text specified in the format column.

Because there is a column in which to specify the character set, you can store text documents in different character sets.
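A hypothetical row (names and values are illustrative only) tags a Shift-JIS text document with its language, character set, and format so that the multilexer and character set filter can process it:

INSERT INTO globaldoc (doc_id, language, characterset, format, text)
VALUES (1, 'japanese', 'JA16SJIS', 'text',
        TO_BLOB(UTL_RAW.CAST_TO_RAW('...')));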

When you create the index, specify the names of the format and character set columns:

CREATE INDEX globalx ON globaldoc(text) 
  indextype is ctxsys.context 
  parameters ('filter inso_filter 
                lexer global_lexer 
                language column language 
                format  column format 
                charset column characterset'); 

You can use the charset_filter if all documents are in text format. The charset_filter converts data from the character set specified in the charset column to the database character set.
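A hedged sketch of that alternative (the preference name cs_filter is hypothetical; CHARSET_FILTER and its charset attribute are Oracle Text preference types, but verify the details against your release):

BEGIN
  -- Convert documents from the per-row character set to the database
  -- character set before lexing; 'UTF8' is the default source charset.
  ctx_ddl.create_preference('cs_filter', 'charset_filter');
  ctx_ddl.set_attribute('cs_filter', 'charset', 'UTF8');
END;
/

CREATE INDEX globalx ON globaldoc(text)
  indextype IS ctxsys.context
  parameters ('filter cs_filter
               lexer global_lexer
               language column language
               charset column characterset');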