Actian Zen Simplifies Code Pages and Data Encoding
When information is saved to a file, it should be encoded to support globalization of the data. Properly encoded data is easily read by different users in different locales. This is the case for databases or even simple text documents.
This page covers the topics listed at right.
Encoding Systems
ASCII is one of the most common encoding systems. Under the modern ASCII standard, the letter A is encoded as the numerical value 65, B as 66, and so on. Early forms of ASCII supported only 127 characters (upper- and lower-case English characters, digits, punctuation, and other special characters). Later forms of ASCII supported up to 255 characters, but even this wasn’t enough to represent all the symbols used by different writing systems.
To support the characters needed for various languages, the same values are repurposed to represent other characters under different writing systems. The different systems of how letters are encoded are called code pages, most of which are based on ASCII.
Companies and organizations have defined their own code pages and given them names. Sometimes the same system of encoding may exist under more than one name because of different organizations using their own naming conventions. Examples of some code pages are Microsoft Windows-1252 (for Western Latin-based languages), code page 932 (for Japanese characters), and ISO-8859-1, which is almost identical to Windows-1252.
In a flat file, it isn’t necessarily clear what encoding is being used. Many encodings use the same representations for Latin characters that we find in ASCII, so if you’re viewing a document that contains a lot of text, at first glance it might seem the correct encoding is being used. But sometimes there’s a problem that won’t be noticed until certain other characters come into play.
Thus, it’s important to ensure that the same encoding is used for reading from and writing to a file. Unfortunately, with plain text files there is no standard way to identify the default code file used to create the file and needed to read it accurately.
In contrast, a database can, and should, allow configuration to tailor the code page to the data. If there’s a difference between the code page of the saved data and the code page that an application expects, the database should automatically convert between the two.
Code Pages and Text Files
Let’s look at a common scenario: Encoding and decoding data saved in a flat text file with the correct code page. We have a file with a list of items and prices formatted as a CSV. When viewed properly, the lines of the file look similar to the following:
A file like this may be read or written with the following code snippets, which assume there’s a structure named CurrencyRecord with fields for the above data.
"Australia","AUS","Dollar","$" "Canada","CAD","Dollar","$" "France","EUR","Euro","€" "United Kingdom","GBP","Pound","£" "United States","Dollar","$"
A file like this might be read or written with the following code snippets, which assume there’s a structure named CurrencyRecord with fields for the above data.
void writeCurrencyRecords() { std::ofstream outputFile("CurrencyData.csv"); for (auto it = currencyList.begin(); it < currencyList.end(); ++it) { outputFile << it->country << ", " << it->code << ", " << it->name << ", " << it->symbol << std::endl; } } void readCurrencyRecords() { std::ifstream inputFile("CurrencyData.csv"); std::vector<std::string> fieldList; std::string line; if (!inputFile.good()) return; while (std::getline(inputFile,line)) { CurrencyRecord r = { 0 }; std::string field; std::stringstream ss(line); fieldList.clear(); while (std::getline(ss, field, ',')) fieldList.push_back(field); strncpy(r.country, fieldList[0].c_str(), 32); strncpy(r.code, fieldList[1].c_str(), 4); strncpy(r.name,fieldList[2].c_str(),32); r.symbol = fieldList[3][0]; currencyList.push_back(r); } inputFile.close(); }
Whether this file is stored and read using the code page for ISO-8859-1 or stored and read using Windows-1252, the results are identical. If you display the data in a program without performing any additional processing on the data, the expected letters appear.
Code Page Confusion
An application typically uses the code page set by the operating system, although sometimes the application has its own setting. As long as the file is written and read by the same program with the same setting, it will be handled appropriately. It’s when the file is copied to a machine with a different code page setting or used by an application with a different setting that problems may arise.
Here’s the original version for reference:
"Australia","AUS","Dollar","$" "Canada","CAD","Dollar","$" "France","EUR","Euro","€" "United Kingdom","GBP","Pound","£" "United States","Dollar","$"
If my file is encoded in ISO-8859-1 and the application reads it using Windows 1252, the results look slightly different:
"Australia","AUS","Dollar","$" "Canada","CAD","Dollar","$" "France","EUR","Euro","¤" "United Kingdom","GBP","Pound","£" "United States","Dollar","$"
The symbol for the Euro (€) has been replaced with another character, although the other currency symbols are correct. These two code pages use different numerical values to encode the Euro symbol.
This is a mild example of what can go wrong. Encoding differences can lead to problems far more severe. In the following example, a plain text file was read with the wrong encoding. The result is completely unreadable and might be treated as corrupted by a program trying to read it.
EÿþD Q ¹Ø-ØãØóØÐØÈØéØüغlQ_ ,gåeˆØŠØ↑Ø)YzznØþvØLØ
When data is passed from one system to another, a conversion is needed if there are encoding differences. The functions available for converting between code pages can be compiler- and platform-specific.
To convert only the differences in the code pages for Windows 1252 and ISO-8859-1, one method is to make a map of the characters that are different in the encodings and swap those characters, allowing the others to go through conversion untouched:
void populateMap() { cp1252_To_iso85591.emplace(std::make_pair('\x80', '\0xa4')); cp1252_To_iso85591.emplace(std::make_pair('\x8A', '\0xa6')); cp1252_To_iso85591.emplace(std::make_pair('\x8c', '\0xbc')); cp1252_To_iso85591.emplace(std::make_pair('\x8e', '\0xb4')); cp1252_To_iso85591.emplace(std::make_pair('\x9a', '\0xa8')); cp1252_To_iso85591.emplace(std::make_pair('\x9c', '\0xbd')); cp1252_To_iso85591.emplace(std::make_pair('\x98', '\0xb8')); cp1252_To_iso85591.emplace(std::make_pair('\x9f', '\0xbe')); //make a reverse map for conversions in the other direction for (auto it = cp1252_To_iso85591.begin(); it != cp1252_To_iso85591.end(); ++it) iso85591_to_cp1252.emplace(std::make_pair(it->second, it->first)); } std::string convert_85591_to_1252(std::string source) { std::string returnValue; for (std::string::iterator it = source.begin(); it != source.end(); ++it) { char c = *it; //For characters encoded with values 0x00 to 0x7f the encodings are identical if (c <= 0x7f) { returnValue.push_back(c); } else { auto result = cp1252_To_iso85591.find(c); if (result != cp1252_To_iso85591.end()) returnValue.push_back(result->second); } } return returnValue; } std::string convert_1252_to_8559(std::string source) { std::string returnValue; for (std::string::iterator it = source.begin(); it != source.end(); ++it) { char c = *it; //For characters encoded with values 0x00 to 0x7f the encodings are identical if (c <= 0x7f) { returnValue.push_back(c); } else { auto result = iso85591_to_cp1252.find(c); if (result != iso85591_to_cp1252.end()) returnValue.push_back(result->second); } } return returnValue; }
This takes a lot of code. These functions must be called on data being read or written to convert it to the desired code page. And while the code can perform the conversions between these two specific code pages, it can’t handle other ones.
Furthermore, it doesn’t address the problem of not knowing what encoding was used to save the data in the first place. Because different code pages may encode some characters identically, you might glance at a file and make an assumption that will work for some but not all of the data.
Code Page Support in Actian Zen
Let’s look now at a similar scenario that uses the Actian Zen database with a SQL access method connection and its native Btrieve 2 API.
Open Database Connectivity (ODBC) is an API for accessing a database. In applications that use ODBC, a driver sits between the application and the database. The driver serves as an adapter to the database and translates a request from the application into something appropriate for the database system.
In addition to supporting communication with the database, the Actian Zen ODBC driver can also convert between code pages. Both the database and the application each have their own code page setting, and the ODBC driver uses these settings to determine the conversions to make.
When you create a new database in Zen Control Center, the settings include an option to change the code page. Selecting Change code page gives a list of available choices. The initial value is the code page in use by the operating system. The operating system default depends on the language setting. For example, on Windows platforms in English you commonly see Windows-1252. The option to change the code page is also offered in the dialogs for File Open, Save File, Import Data, Export Data, and Export Schema.
Computers within the same region can have different language settings, especially in regions where more than one language is spoken. If your data is going to be used by more than one system, you should select a code page that matches the data you plan to write using the Btrieve 2 API. Do this when you’re creating a database. Changing the setting after the database is created and populated won’t automatically update the data that’s already been written to it. Once created, the text encoding is an attribute of the database and is used by the driver to know what conversion to apply.
The following code reads information from a table in an existing database and prints it to the console. Since code page conversions are handled by the ODBC driver, there is nothing in the following code that converts code pages. It happens automatically.
#include <windows.h> #include <iostream> #include <sql.h> #include <sqlext.h> #include <stdio.h> using namespace std; const WCHAR* CONNECTION_STRING = L"Driver={Pervasive ODBC Unicode Interface};DSN=MyZenSrc"; int main() { SQLWCHAR* SELECT_STATEMENT = (SQLWCHAR*)L"SELECT ID,Country, CurrencyCode,CurrencyName, Symbol FROM CurrencyInfo "; SQLWCHAR* INSERT_STATEMENT = (SQLWCHAR*)L"INSERT INTO CurrencyInfo (Country,CurrencyCode, CurrencyName, Symbol) values (?, ?, ?, ?)"; SQLRETURN sret; SQLHENV hEnvironment = nullptr; SQLHDBC hDB; //database handle SQLHSTMT hSqlStatement; //The return value for each call should be checked for errors. //Checks are omitted here for clarity. sret = SQLAllocHandle(SQL_HANDLE_ENV, SQL_NULL_HANDLE, &hEnvironment); sret = SQLSetEnvAttr(hEnvironment,SQL_ATTR_ODBC_VERSION,reinterpret_cast<SQLPOINTER>(SQL_OV_ODBC3),0); sret = SQLAllocHandle(SQL_HANDLE_DBC, hEnvironment, &hDB); sret = SQLDriverConnect(hDB, NULL, (SQLWCHAR*)CONNECTION_STRING, SQL_NTS, NULL, 0, NULL, SQL_DRIVER_COMPLETE); sret = SQLAllocHandle(SQL_HANDLE_STMT, hDB, &hSqlStatement); //Buffers for parameters WCHAR CountryName[51] = { 0 }; WCHAR CurrencyCode[4] = { 0 }; WCHAR CurrencyName[17] = { 0 }; WCHAR CurrencySymbol[3] = { 0 }; SQLLEN CountryLen, CurrencyNameLen, CurrencyCodeLen, CurrencySymbolLen, IDLen; //Copy values into buffers wcscpy_s(CountryName, L"Japan"); wcscpy_s(CurrencyCode, L"JPY"); wcscpy_s(CurrencyName, L"Yen"); wcscpy_s(CurrencySymbol, L"¥"); //Set those buffers as parameters for next statement CountryLen = CurrencyCodeLen = CurrencyNameLen = CurrencySymbolLen = SQL_NTSL; sret = SQLBindParameter(hSqlStatement, 1, SQL_PARAM_INPUT, SQL_C_WCHAR, SQL_VARCHAR, 50, 0, CountryName, 50, &CountryLen); sret = SQLBindParameter(hSqlStatement, 2, SQL_PARAM_INPUT, SQL_C_WCHAR, SQL_VARCHAR, 4, 0, CurrencyCode, 4, &CurrencyCodeLen); sret = SQLBindParameter(hSqlStatement, 3, SQL_PARAM_INPUT, SQL_C_WCHAR, SQL_VARCHAR, 17, 0, CurrencyName, 16, &CurrencyNameLen); sret = SQLBindParameter(hSqlStatement, 4, SQL_PARAM_INPUT, SQL_C_WCHAR, SQL_VARCHAR, 3, 0, CurrencySymbol, 50, &CurrencySymbolLen); //Insert record sret = SQLExecDirect(hSqlStatement, (SQLWCHAR*)INSERT_STATEMENT, SQL_NTS); //Free previous statement handle and create new one to use for retrieval statement SQLFreeHandle(SQL_HANDLE_STMT, hSqlStatement); sret = SQLAllocHandle(SQL_HANDLE_STMT, hDB, &hSqlStatement); if (sret != SQL_SUCCESS) { SQLWCHAR SqlState[6], SQLStmt[100], Msg[SQL_MAX_MESSAGE_LENGTH]; SQLINTEGER NativeError; SQLSMALLINT i, MsgLen; SQLRETURN rc1, rc2; SQLLEN numRecs = 0; SQLGetDiagField(SQL_HANDLE_STMT, hSqlStatement, 0, SQL_DIAG_NUMBER, &numRecs, 0, 0); if (sret == SQL_SUCCESS) { bool moreData = true; int ID; //Bind columns to our variables. Every time SQLFetch is called, // these variables will be populated with values of next row. sret = SQLSetStmtAttr(hSqlStatement, SQL_ATTR_PARAM_BIND_TYPE, SQL_PARAM_BIND_BY_COLUMN, 0); sret = SQLBindCol(hSqlStatement, 1, SQL_C_SLONG, &ID, sizeof(ID), &IDLen); sret = SQLBindCol(hSqlStatement, 2, SQL_C_TCHAR, CountryName, sizeof(CountryName), &CountryLen); sret = SQLBindCol(hSqlStatement, 3, SQL_C_TCHAR, CurrencyCode, sizeof(CurrencyCode), &CurrencyCodeLen); sret = SQLBindCol(hSqlStatement, 4, SQL_C_TCHAR, CurrencyName, sizeof(CurrencyName), &CurrencyNameLen);| sret = SQLBindCol(hSqlStatement, 5, SQL_C_TCHAR, CurrencySymbol, sizeof(CurrencySymbol), &CurrencySymbolLen); sret = SQLExecDirect(hSqlStatement, SELECT_STATEMENT, SQL_NTS); do { sret = SQLFetch(hSqlStatement); if (sret == SQL_NO_DATA) moreData = false; else wcout << ID << L", " << CountryName << L", " << CurrencyCode << ", " << CurrencyName << ", " << CurrencySymbol << endl; } while (moreData); } if (hSqlStatement) SQLFreeHandle(SQL_HANDLE_STMT, hSqlStatement); if (hDB) { SQLDisconnect(hDB); SQLFreeHandle(SQL_HANDLE_DBC, hDB); } if (hEnvironment) SQLFreeHandle(SQL_HANDLE_ENV, hEnvironment); return 0; }
When using the Btrieve 2 API to communicate with a Zen data file (which may or may not be part of a database), the data passed via Btrieve 2 is written in the same form in which it was passed. This highlights a difference between using ODBC and Btrieve 2. Btrieve 2 provides fast access to the file because data access is more direct and doesn’t go through translations and conversions before being written. Using ODBC for data access isn’t as fast as the direct methods, but has the conveniences that come with the additional functionality provided by the driver.
To take advantage of the automatic code page conversion if the Btrieve file is part of a database, other applications that access the data can connect through ODBC. Otherwise the application would need to perform its own code page conversion.
In the following code, the Btrieve 2 API is used to read a collection of records from a Btrieve file. The text is written to the file using the same code page as the application.
status = btrieveClient.FileOpen(&btrieveFile, FILE_NAME, NULL, Btrieve::OPEN_MODE_NORMAL); if (status != Btrieve::STATUS_CODE_NO_ERROR) return -1; int bytesRead = btrieveFile.RecordRetrieveFirst(sortIndex, (char*)&record, sizeof(record)); if (bytesRead > 0) { while (status == Btrieve::STATUS_CODE_NO_ERROR ) { currencyList.push_back(record); btrieveFile.RecordRetrieveNext((char*)&record, sizeof(record)); status = btrieveFile.GetLastStatusCode(); } } btrieveClient.FileClose(&btrieveFile); btrieveClient.Reset();
Now we’ll add a list of records to a database:
Btrieve::StatusCode status = btrieveClient.FileOpen(&btrieveFile, FILE_NAME, NULL, Btrieve::OPEN_MODE_NORMAL); for (int i=0;i<currencyList.size();++i) { btrieveFile.RecordCreate((char*)¤cyList[i], sizeof(CurrencyRecord)); } btrieveClient.FileClose(&btrieveFile); btrieveClient.Reset();
The Btrieve 2-based code is much smaller and doesn’t require the use of SQL. Btrieve 2 code is also faster than code that uses ODBC because fewer calls are needed to read from and write to the data files.
Wrapping Up
Applications sharing data must use either the same code page or a reliable method of translating data from one code page to another. For applications that use SQL methods to access the data (in these examples, via the ODBC driver), the code page conversion is already available through the database drivers.
With applications that directly manage their own files, code page conversion isn’t automatic. Their data is written using the encoding used by the particular application. For Btrieve files that are part of a Zen database, other applications that access information can import data using SQL methods even if the application generating the data does not use SQL.
Making use of this existing code page translation solution can help prevent data from being corrupted when it must be passed between systems. You can find more information on Btrieve in the The Btrieve C++ and C API documentation, where you’ll also find other C++ samples on the Examples tab.