By Evan Carroll

2017-11-28 16:18:12 8 Comments

ICU Specifies different LDML Collation Settings. Some of them seem pretty interesting, especially the ones on case and accent,

  • “Ignore accents”: strength=primary
  • “Ignore accents” but take case into account: strength=primary caseLevel=on
  • “Ignore case”: strength=secondary
  • “Ignore punctuation” (completely): strength=tertiary alternate=shifted
  • “Ignore punctuation” but distinguish among punctuation marks: strength=quaternary alternate=shifted potentially a better method of doing what

You can also see these documented here. Are these ICU options and settings possible with PostgreSQL 10 ICU collation support?

CREATE COLLATION special (provider = icu, locale = '[email protected]=primary');
SELECT 'Å' LIKE 'A' COLLATE "special"; # returns false

I've also tried the CLDR BCP47

Starting with ICU 54, collation attributes can be specified via locale keywords as well, in the old locale extension syntax ("[email protected]=upper") or in language tag syntax ("el-u-kf-upper"). Keywords and values are case-insensitive. See the LDML Collation spec, Collation Settings, and the data file listing the valid collation keywords and their values. (The deprecated attributes kh/colHiraganaQuaternary and vt/variableTop are not supported.)

For that, this looked right

CREATE COLLATION special (provider = icu, locale = 'en-ks-level1');
SELECT 'Å' LIKE 'A' COLLATE "special"; # returns false

Also tried en-u-ks-level1 That method seems to be what the docs go for,

CREATE COLLATION german_phonebook (provider = icu, locale = 'de-u-co-phonebk');


@Daniel Vérité 2017-12-01 14:53:23

Case insensitive or accent-insensitive collations cannot be used consistently because internally PostgreSQL considers that strings with a different binary representation are not equal. When the collation-aware comparator says they are equal, it uses the non-collation-aware strcmp() function as a tie-breaker.

It must do that, at least because of hashing. Hashing is used in joins and having hash(s1)!=hash(s2) whereas s1=s2 is a killer problem for a normal hash table implementation. This happens with any collation that ignores case or accents.

This is discussed in detail in this thread in the hackers mailing list: strcmp() tie-breaker for identical ICU-collated strings. The developers expect to solve that problem at some point, but not as of PostgreSQL 10.

Related Questions

Sponsored Content

0 Answered Questions

Does Postgresql logical replication support hot standby?

2 Answered Questions

[SOLVED] Does PostgreSQL support online schema modification (DDL)?

1 Answered Questions

0 Answered Questions

Does PostgreSQL support subtransactions?

1 Answered Questions

1 Answered Questions

0 Answered Questions

What do LC_TIME and LC_NUMERIC settings actually change in PostgreSQL?

1 Answered Questions

[SOLVED] Does PostgreSQL support multi-threaded replication?

1 Answered Questions

[SOLVED] Unicode support of PostgreSQL and its performance

Sponsored Content