We should add documentation about the problem of database collation.
- On single language sites the admin should make sure the correct collation is used. (If the language is english, the default is fine.)
- On multi language sites, the admin must choose the collation that best fits the site. We can not support multiple collations (see below) due to limitations in the databases we use.
- The problem affects matching of text (like when searching or when using attribute based filters), where you may get too many or too few matches. It also affects sorting by text, where you may get results not sorted in the correct order (as defined by your language).
Documentation pages that should mention this include:
- Installation guide, under database setup
- Multilingual information pages
- Attribute filter (under e.g. the content / fetch page)
- eZ Search Engine
- eZ Find
The following is what I have found while researching this topic.
There may not be many sites that are affected by the problem*. Many multi language sites can get away with choosing one collation that fits all languages more or less well.
- On a norwegian/english site the collations are quite compatible. If we add german, however, there is a collation conflict with regards to y and ü, for example.
On sites where we do have conflicts, the problem is serious however. Imagine a norwegian/german site. For a norwegian user it is a bug if a filter for "a*" also returns "å*", while for a german user it is a bug if "y*" also returns "ü*". Yet given the current ezp, we have to choose one of these two evils, or the third evil, binary collation, which makes the filter case sensitive.
As far as ezsearchengine is concerned, the situation is worse IMHO, in that it gives to few rather than too many results. Given standard collation, a search for "var" returns only those objects containing both "var" and "vår", for example. In other words, it expands the search to include all word variants that the current collation considers to be equal, and does an AND search with them. An OR search would be preferable. It may be tricky to implement, but can be implemented regardless of any collation solution.
Connection collation is overridden by table and column collation, so it is useless for us. The only way to collate correctly on a multi language site (where collations conflict) is to specify collation for each select query where text is compared. This is a huge amount of work, and must be made database independent as well. Collations in SQL are only standardized on the column level, while collation per query is MySQL specific.
Collation for sorting and for matching are two different issues. For sorting it is not so bad, for matching we may have to do extra work like function based indexes or column replication. Research needs to be done, it is not well supported and recent versions (11) may be required.
Postgres currently only supports one collation per installation, which makes this problem unsolvable in a nice way. From v. 8.4 it will reportedly support collations per database, which doesn't help. So in the foreseeable future it can only be supported by column replication.
Column replication solution
It is a waste of work and performance to implement the collation per query fix, since it only works for MySQL. We need to support the others too, and so far column replication seems to bee the only way to do it. What this means is that for each column where we need to compare and sort texts, we need to copy the column to normalised versions, one extra column per collation we need to support. Example for the norwegian/english/german site discussed above:
Then, based on the language of the current site access, we choose the column to use for collation.
This is messy, it complicates a lot of code, it hurts performance, I don't like it at all, and I'm sure it will never happen
Binary collation solution
This will work fairly well, and it will be fast, if we can only get around the fact that it is case sensitive. I believe ezsearchengine already stores the indexed words in lower case. If we can just make sure that lower casing is done correctly for utf-8 (using mb_strtolower() or similar) it may be a workable solution.
This way, no variants will ever be considered equal. It is not correct, but may be a useful compromise. Sorting will not be correct in some languages. Matching will also not always be correct, but it will fail in a more sensible way (IMHO) than the current situation.