Handling different character encodings when importing tabular data from CSV

I’ve been fixing some minor issues with the utility in HBSMR that imports data from the Portable Antiquities Scheme. One issue was that some characters were mojibaked during the import, for example this text:

A Medieval copper alloy fragment of an ornate horse harness pendant (12th – 14th century). The pendant has an integral loop for suspension set at 90º to the pendant.

was being imported as

A Medieval copper alloy fragment of an ornate horse harness pendant (12th – 14th century).  The pendant has an integral loop for suspension set at 90º to the pendant. 

The user first selects the records they are interested in from the PAS database, and save this data to a local CSV file. The utility in HBSMR prompts for the location of the file and then imports the data into a temporary holding table using the Microsoft.Jet.OLEDB.4.0 provider. By default this assumes the incoming data is ANSI, whereas in fact the downloaded CSV files are encoded as “UTF-8 without BOM” (Unicode).

Our code is along these lines:

  schemaINI = CreateSchemaIni(PASFileName)
  Set db = CurrentDb()
  strConn = "[Text;FMT=Delimited;HDR=Yes;DATABASE=" & Left$(PASFileName, InStrRev(PASFileName, "\") - 1) & ";].[" & Right$(PASFileName, Len(PASFileName) - InStrRev(PASFileName, "\")) & "]"
  db.Execute "SELECT * INTO " & cPASDataTabName & " FROM " & strConn, dbFailOnError

The solution lay in the CreateSchemaIni function. This creates a file called schema.ini in the same folder as the incoming file, which is used by the text file driver to help read the file correctly (as described here). Note however that while that documentation page says the character set can be set to either ANSI or OEM, in fact it can be set to other code page identifiers as well (I don’t know how many are supported however).

I needed to add a line to this function to specify the character set, as shown in bold here, where 65001 is the appropriate code for “UTF-8 without BOM”:

  Open strPath & "Schema.ini" For Output As #1
  Print #1, "[" & strFile & "]"
  Print #1, "ColNameHeader=True"
  Print #1, "Format=CSVDelimited"
  Print #1, "CharacterSet=65001"
etc etc

Now the data imports correctly. Note: although we’re working in MS Access here, the same will apply if using the Microsoft.Jet.OLEDB.4.0 provider to read text files into .Net.

Comments

Find out more