Uploaded image for project: 'eZ Publish / Platform'
  1. eZ Publish / Platform
  2. EZP-14859

Documentation of multilingual collation issues

    Details

      Description

      We should add documentation about the problem of database collation.

      In short:

      • On single language sites the admin should make sure the correct collation is used. (If the language is english, the default is fine.)
      • On multi language sites, the admin must choose the collation that best fits the site. We can not support multiple collations (see below) due to limitations in the databases we use.
      • The problem affects matching of text (like when searching or when using attribute based filters), where you may get too many or too few matches. It also affects sorting by text, where you may get results not sorted in the correct order (as defined by your language).

      Documentation pages that should mention this include:

      • Installation guide, under database setup
      • Multilingual information pages
      • Attribute filter (under e.g. the content / fetch page)
      • eZ Search Engine
      • eZ Find

      The following is what I have found while researching this topic.

      There may not be many sites that are affected by the problem*. Many multi language sites can get away with choosing one collation that fits all languages more or less well.

      • On a norwegian/english site the collations are quite compatible. If we add german, however, there is a collation conflict with regards to y and ü, for example.

      On sites where we do have conflicts, the problem is serious however. Imagine a norwegian/german site. For a norwegian user it is a bug if a filter for "a*" also returns "å*", while for a german user it is a bug if "y*" also returns "ü*". Yet given the current ezp, we have to choose one of these two evils, or the third evil, binary collation, which makes the filter case sensitive.

      As far as ezsearchengine is concerned, the situation is worse IMHO, in that it gives to few rather than too many results. Given standard collation, a search for "var" returns only those objects containing both "var" and "vår", for example. In other words, it expands the search to include all word variants that the current collation considers to be equal, and does an AND search with them. An OR search would be preferable. It may be tricky to implement, but can be implemented regardless of any collation solution.

      MySQL

      Connection collation is overridden by table and column collation, so it is useless for us. The only way to collate correctly on a multi language site (where collations conflict) is to specify collation for each select query where text is compared. This is a huge amount of work, and must be made database independent as well. Collations in SQL are only standardized on the column level, while collation per query is MySQL specific.

      Oracle

      Collation for sorting and for matching are two different issues. For sorting it is not so bad, for matching we may have to do extra work like function based indexes or column replication. Research needs to be done, it is not well supported and recent versions (11) may be required.

      Postgres

      Postgres currently only supports one collation per installation, which makes this problem unsolvable in a nice way. From v. 8.4 it will reportedly support collations per database, which doesn't help. So in the foreseeable future it can only be supported by column replication.

      Column replication solution

      It is a waste of work and performance to implement the collation per query fix, since it only works for MySQL. We need to support the others too, and so far column replication seems to bee the only way to do it. What this means is that for each column where we need to compare and sort texts, we need to copy the column to normalised versions, one extra column per collation we need to support. Example for the norwegian/english/german site discussed above:

      -----------------------------------

      text utf8_danish_ci utf8_unicode_ci

      -----------------------------------

      var var var
      vår vår var
      über yber über
      yber yber yber
      pie pie pie

      -----------------------------------

      Then, based on the language of the current site access, we choose the column to use for collation.

      This is messy, it complicates a lot of code, it hurts performance, I don't like it at all, and I'm sure it will never happen

      Binary collation solution

      This will work fairly well, and it will be fast, if we can only get around the fact that it is case sensitive. I believe ezsearchengine already stores the indexed words in lower case. If we can just make sure that lower casing is done correctly for utf-8 (using mb_strtolower() or similar) it may be a workable solution.

      This way, no variants will ever be considered equal. It is not correct, but may be a useful compromise. Sorting will not be correct in some languages. Matching will also not always be correct, but it will fail in a more sensible way (IMHO) than the current situation.

        Issue Links

          Activity

          Hide
          (inactive) Gunnstein Lye added a comment -

          Be aware that MySQL has default settings for character sets and collations at four levels: server, database, table, and column. The more specific settings override the more general settings.

          Example for setting database collation:
          ALTER DATABASE ezpublish CHARSET utf8 COLLATE utf8_danish_ci;

          Example for setting table collation:
          ALTER TABLE collation_test CHARSET utf8 COLLATE utf8_danish_ci;

          Example for setting column collation:
          ALTER TABLE collation_test MODIFY word varchar(32) CHARSET utf8 COLLATE utf8_danish_ci;

          MySQL manual 9.1.3. Specifying Character Sets and Collations:
          http://dev.mysql.com/doc/refman/5.0/en/charset-syntax.html

          Show
          (inactive) Gunnstein Lye added a comment - Be aware that MySQL has default settings for character sets and collations at four levels: server, database, table, and column. The more specific settings override the more general settings. Example for setting database collation: ALTER DATABASE ezpublish CHARSET utf8 COLLATE utf8_danish_ci; Example for setting table collation: ALTER TABLE collation_test CHARSET utf8 COLLATE utf8_danish_ci; Example for setting column collation: ALTER TABLE collation_test MODIFY word varchar(32) CHARSET utf8 COLLATE utf8_danish_ci; MySQL manual 9.1.3. Specifying Character Sets and Collations: http://dev.mysql.com/doc/refman/5.0/en/charset-syntax.html
          Hide
          (inactive) Gunnstein Lye added a comment - - edited

          Another important point:
          ezsearch may return the wrong results even when the collation is correct. In the Norwegian/Danish collation, å (a ring) equals aa (double a). This means that ezsearch will expand a search for "så" to "så AND saa", which will usually give zero results even if there are many cases of "så".

          Show
          (inactive) Gunnstein Lye added a comment - - edited Another important point: ezsearch may return the wrong results even when the collation is correct. In the Norwegian/Danish collation, å (a ring) equals aa (double a). This means that ezsearch will expand a search for "så" to "så AND saa", which will usually give zero results even if there are many cases of "så".
          Hide
          Dominika Kurek added a comment -

          Concerns outdated doc.

          Show
          Dominika Kurek added a comment - Concerns outdated doc.

            People

            • Assignee:
              Unassigned
              Reporter:
              (inactive) Gunnstein Lye
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: