By HandyD

2019-02-07 03:12:28 8 Comments

I am currently in the process of migrating data from Oracle to SQL Server and I'm encountering an issue trying to validate the data post-migration.

Environment Details:

Oracle 12

  • AL32UTF8 character set
  • Client NLS_LANG - WE8MSWIN1252
  • VARCHAR2 field

SQL Server 2016

  • Latin1_General_CI_AS collation
  • NVARCHAR field

I'm using DBMS_CRYPTO.HASH on Oracle to generate a checksum of the whole row, then copying to SQL and using HASHBYTES to generate a checksum of the whole row, which I'm then comparing to validate the data matches.

The checksums match for all rows, except those with multibyte characters.

For example, rows with this character: ◦ do not match in the checksums, even though the data is transferred correctly. When I use DUMP in Oracle or convert to VARBINARY in SQL Server the data matches exactly except for the bytes for this character.

In SQL Server, the bytes are 0xE625 and in Oracle they are 0x25E6.

Why are they ordered differently, and is there a reliable way to convert one to the other to ensure the checksum at the other end matches for strings with multi-byte characters?


@Solomon Rutzky 2019-02-07 19:02:05

The collation of an NVARCHAR / NCHAR / NTEXT column has no bearing on the encoding used to store the data in that column. NVARCHAR data is always UTF-16 Little Endian (LE). The collation of NVARCHAR data only impacts sorting and comparison. Collation does impact the encoding of VARCHAR data since the collation determines the code page used to store the data in that column / variable / literal, but we aren't dealing with that here.

As sepupic mentioned, what you are seeing when you view the data in binary form is a difference in endianness (Oracle is using Big Endian while SQL Server is using Little Endian). HOWEVER, what you are seeing when you view the binary form of the string in Oracle is not how the data is actually being stored. You are using AL32UTF8 which is UTF-8, which encodes that character in 3 bytes, not 2, as: E2, 97, A6.
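These byte sequences are easy to verify outside either database. A minimal Python sketch (illustration only; the migration itself stays in SQL):

```python
# U+25E6 WHITE BULLET: its bytes under the three encodings involved here.
ch = "\u25e6"

print(ch.encode("utf-8").hex())      # e297a6 -> 3 bytes, Oracle AL32UTF8 storage
print(ch.encode("utf-16-be").hex())  # 25e6   -> Oracle AL16UTF16 (big endian)
print(ch.encode("utf-16-le").hex())  # e625   -> SQL Server NVARCHAR (little endian)
```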

Also, it is not possible for the hashes to match for rows containing just "a" but not for rows that include "◦", unless the hashing in Oracle was done without conversion (hence on the UTF-8 bytes) and the hashing in SQL Server accidentally converted to VARCHAR first. Otherwise, no hash algorithm behaves as you are describing, as you can verify by running the following in SQL Server:

DECLARE @Algorithm NVARCHAR(50) = N'MD4';
SELECT HASHBYTES(@Algorithm, 0x3100), HASHBYTES(@Algorithm, 0x0031);
SET @Algorithm = N'MD5';
SELECT HASHBYTES(@Algorithm, 0x3100), HASHBYTES(@Algorithm, 0x0031);
SET @Algorithm = N'SHA1';
SELECT HASHBYTES(@Algorithm, 0x3100), HASHBYTES(@Algorithm, 0x0031);
SET @Algorithm = N'SHA2_256';
SELECT HASHBYTES(@Algorithm, 0x3100), HASHBYTES(@Algorithm, 0x0031);
SET @Algorithm = N'SHA2_512';
SELECT HASHBYTES(@Algorithm, 0x3100), HASHBYTES(@Algorithm, 0x0031);
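The same point can be checked outside SQL Server with Python's hashlib (a sketch, not part of the migration): no common digest algorithm yields the same hash for the byte-swapped inputs 0x3100 and 0x0031.

```python
import hashlib

le = bytes.fromhex("3100")  # UTF-16LE bytes of '1'
be = bytes.fromhex("0031")  # UTF-16BE bytes of '1'

# Swapping the bytes changes the digest under every common algorithm.
for name in ("md5", "sha1", "sha256", "sha512"):
    same = hashlib.new(name, le).digest() == hashlib.new(name, be).digest()
    print(name, same)  # all False
```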

In Oracle, you should use the CONVERT function to get the string into the AL16UTF16LE encoding, and then hash that value. That should match up to what SQL Server has. For example, you can see the different encoding forms of White Bullet (U+25E6) and how using CONVERT along with AL16UTF16LE will correct this on dbfiddle and below:

SELECT DUMP(CHR(14849958), 1016) AS "UTF8",
       DUMP(CHR(9702 USING NCHAR_CS), 1016) AS "UTF16BE",
       DUMP(CONVERT(CHR(9702 USING NCHAR_CS), 'AL16UTF16LE'), 1016) AS "UTF16LE"
FROM DUAL;

SELECT DUMP('a' || CHR(14849958), 1016) AS "UTF8",
       DUMP('a' || CHR(9702 USING NCHAR_CS), 1016) AS "UTF16BE",
       DUMP(CONVERT('a' || CHR(9702 USING NCHAR_CS), 'AL16UTF16LE'), 1016) AS "UTF16LE"
FROM DUAL;

That returns:

UTF8:     Typ=1 Len=3 CharacterSet=AL32UTF8: e2,97,a6
UTF16BE:  Typ=1 Len=2 CharacterSet=AL16UTF16: 25,e6
UTF16LE:  Typ=1 Len=2 CharacterSet=AL16UTF16: e6,25

UTF8:     Typ=1 Len=4 CharacterSet=AL32UTF8: 61,e2,97,a6
UTF16BE:  Typ=1 Len=4 CharacterSet=AL16UTF16: 0,61,25,e6
UTF16LE:  Typ=1 Len=4 CharacterSet=AL16UTF16: 61,0,e6,25

As you can see in the 3rd column, the character set is misreported as being Big Endian when it is clearly Little Endian based on the order of the two bytes. You can also see that both characters ("a" and "◦") are two bytes each in UTF-16, and the byte order of both differs between Big and Little Endian, not just that of the characters that take more than 1 byte in UTF-8.
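For comparison, the same three byte sequences for "a◦" can be reproduced with a quick Python check (illustrative only):

```python
s = "a\u25e6"  # 'a' followed by U+25E6, as in the DUMP example above

print(s.encode("utf-8").hex())      # 61e297a6 -> 61,e2,97,a6
print(s.encode("utf-16-be").hex())  # 006125e6 -> 0,61,25,e6
print(s.encode("utf-16-le").hex())  # 6100e625 -> 61,0,e6,25
```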

Given all of this, since the data is being stored as UTF-8 yet you are seeing it as UTF-16 Big Endian via the DUMP function, it seems like you are already converting it to UTF-16, but probably not realizing that the default UTF-16 in Oracle is Big Endian.

Looking at the "UTF-16" definition on the Glossary page of the Oracle documentation, it states (I broke the following sentences into two parts so it would be easier to distinguish between BE and LE):

AL16UTF16 implements the big-endian encoding scheme of the UTF-16 encoding form (more significant byte of each code unit comes first in memory). AL16UTF16 is a valid national character set.


AL16UTF16LE implements the little-endian UTF-16 encoding scheme. It is a conversion-only character set, valid only in character set conversion functions such as SQL CONVERT or PL/SQL UTL_I18N.STRING_TO_RAW.

P.S. Since you are using AL32UTF8 in Oracle, you should be using the Latin1_General_100_CI_AS_SC collation in SQL Server, not Latin1_General_CI_AS. The one you are using is older and doesn't fully support Supplementary Characters (no data loss if they exist, but built-in functions handle them as 2 characters instead of a single entity).

@HandyD 2019-02-08 01:00:52

My question was a little light on details, apologies, but your answer has given me the solution. I was converting to AL16UTF16 but not specifying BE or LE. I had looked into the server settings, and according to what I'd found it was "supposed" to be little endian, but specifying AL16UTF16LE in my conversion has flipped the bytes and my checksum now matches correctly. Thanks for the info on the collation as well.

@Solomon Rutzky 2019-02-08 06:41:54

@HandyD Glad to hear that it's working now :-) Yeah, AL16UTF16 is Big Endian, which is the default. The only way to get Little Endian in Oracle is to specify that character set explicitly, and it can only be done in the CONVERT function, or a few other similar functions. I updated my answer to include something I had found earlier but forgot to include. It's the quotes from the Oracle doc towards the bottom of my answer.

@sepupic 2019-02-07 15:22:55

What you are seeing is the Little-Endian encoding that SQL Server uses to store Unicode characters (more precisely, UTF-16 LE; historically UCS-2 LE).

More on Little-Endian here: Difference between Big Endian and little Endian Byte order

I don't know how it was possible that

When I use DUMP in Oracle or convert to VARBINARY in SQL Server the data matches exactly except for the bytes for this character

All Unicode characters stored in SQL Server, when converted to binary, are "inverted": to see the real code points you should split the bytes into groups of 2 and swap the order within every pair.
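That pairwise swap is all the endianness conversion amounts to. A hypothetical helper in Python (the name is mine, not from either database) makes it explicit:

```python
def swap_utf16_pairs(data: bytes) -> bytes:
    """Swap the byte order within each 2-byte UTF-16 code unit."""
    if len(data) % 2 != 0:
        raise ValueError("UTF-16 data must contain an even number of bytes")
    return b"".join(data[i + 1:i + 2] + data[i:i + 1]
                    for i in range(0, len(data), 2))

be = "a\u25e6".encode("utf-16-be")  # 00 61 25 e6 (Oracle AL16UTF16 order)
le = swap_utf16_pairs(be)           # 61 00 e6 25 (SQL Server order)
print(le.hex())
print(le == "a\u25e6".encode("utf-16-le"))  # True
```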


declare @str varchar(3) = 'abc';
declare @str_n nvarchar(3) = N'abc';

select cast(@str as varbinary(3));
select cast(@str_n as varbinary(6));

The result is:

0x616263
0x610062006300

As you see, in the case of Unicode characters the bytes are inverted: "a" is represented as 0x6100 and not as 0x0061.
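Outside SQL Server, the same bytes fall out of a short Python sketch ("latin-1" stands in here for the VARCHAR column's single-byte code page; that choice is an assumption for illustration):

```python
s = "abc"

# VARCHAR: one byte per character under a single-byte code page.
print(s.encode("latin-1").hex())    # 616263
# NVARCHAR: UTF-16 Little Endian, low byte first in each pair.
print(s.encode("utf-16-le").hex())  # 610062006300
```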

The same applies to 0x25E6, which is the real Unicode code point, while in the binary representation in SQL Server you see it as 0xE625, i.e. inverted.
