eZ Publish / Platform
EZP-21239

eZ Find's auto-complete functionality does not work with Kanji and Hiragana Japanese characters

    Details

      Description

      eZ Find's autocomplete functionality does not work, on both backend and frontend siteaccesses, with Kanji and Hiragana Japanese characters. However, it does work with Katakana characters.

      Steps to reproduce:

      1. Configure the CJKTokenizer in Solr. Following Solr's example schema (http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_6/solr/example/solr/conf/schema.xml), I added the following block to ./ezpublish_legacy/extension/ezfind/java/solr/conf/schema.xml:

      <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
      	<analyzer>
      		<tokenizer class="solr.StandardTokenizerFactory"/>
      		<!-- normalize width before bigram, as e.g. half-width dakuten combine  -->
      		<filter class="solr.CJKWidthFilterFactory"/>
      		<!-- for any non-CJK -->
      		<filter class="solr.LowerCaseFilterFactory"/>
      		<filter class="solr.CJKBigramFilterFactory"/>
      	</analyzer>
      </fieldType>
      

      ...just after:

      <fieldtype name="geohash" class="solr.GeoHashField"/>
      

      Please note that you must restart Solr for the changes to take effect. Re-indexing is not necessary, though.
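To make the effect of the bigram filter concrete, here is a minimal Python sketch (a conceptual illustration only, not the actual Lucene filter) of how the text_cjk field type tokenizes a run of CJK characters:

```python
# Conceptual sketch: CJKBigramFilterFactory indexes CJK text as overlapping
# two-character tokens, which is what enables matching without a
# morphological dictionary.
def cjk_bigrams(text: str) -> list[str]:
    """Emit overlapping character bigrams, as a CJK bigram filter would."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(cjk_bigrams("漢字かんじ"))  # ['漢字', '字か', 'かん', 'んじ']
```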

      2. Create Japanese content. For the sake of completeness, I created content in Kanji, Hiragana and Katakana:

      Kanji: 漢字(かんじ) no auto-complete

      Hiragana: ひらがな no auto-complete

      Katakana: カタカナ auto-complete works

        Activity

        Yannick Roger (Inactive) added a comment (edited)

        There are two bugs:
        1. The JavaScript autocomplete feature provided by YUI2 does not work with Japanese. It should either be debugged or upgraded to YUI3. I would recommend upgrading, both to avoid maintaining a patched version of YUI2 and because YUI3 is already loaded.

        2. Solr does not autocomplete these Japanese alphabets.

        We used the default schema, started Solr with solr.sh, and created an article with the name ひらがな.

        Example request for the string "sh" that we want to autocomplete (this one works):

         http://localhost:8983/solr/select?indent=on&version=2.2&q=*%3A*&facet=true&facet.mincount=1&facet.prefix=sh&facet.limit=10&json.nl=arrarr&facet.field=ezf_sp_words&facet.method=fc&wt=xml&fq=meta_language_code_ms:%28eng-GB%29&rows=0

        Now in Japanese with the string "ひら" (this one does not work):

         http://localhost:8983/solr/select?indent=on&version=2.2&q=*%3A*&facet=true&facet.mincount=1&facet.prefix=%E3%81%B2%E3%82%89&facet.limit=10&json.nl=arrarr&facet.field=ezf_sp_words&facet.method=fc&wt=xml&fq=meta_language_code_ms:%28eng-GB%29&rows=0
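The facet.prefix value in the second request is simply the URL-encoded UTF-8 byte sequence for ひら, which can be reproduced with the Python standard library:

```python
from urllib.parse import quote

# facet.prefix must be percent-encoded UTF-8: "ひら" is the byte
# sequence E3 81 B2 E3 82 89, encoded as below.
prefix = quote("ひら")
print(prefix)  # %E3%81%B2%E3%82%89
```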

        Paul Borgermans (Inactive) added a comment (edited)

        I looked at various options. From what I understand of the use cases, and from testing locally with Japanese test strings, the configuration below should provide tailored Japanese processing and auto-complete on the relevant sections of mixed Japanese/English content.

        Note that this is valid for installations using Solr 3.6.1 (eZ Publish 5.x, and patched versions of 4.5, 4.6 and 4.7).

        The field type "spell" in Solr's conf/schema.xml should be changed to:

            
            <fieldtype name="spell" class="solr.TextField" positionIncrementGap="100">
              <analyzer>
                <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
                <!-- Reduces inflected verbs and adjectives to their base forms -->
                <filter class="solr.JapaneseBaseFormFilterFactory"/>
                <!-- Normalizes full-width romaji to half-width and half-width kana to full-width (Unicode NFKC subset) -->
                <filter class="solr.CJKWidthFilterFactory"/>
                <!-- Normalizes common katakana spelling variations ending in a long sound character -->
                <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
                <!-- Removes common tokens typically not useful for search that have a negative effect on ranking -->
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.StandardFilterFactory"/>
                <!-- Lower-cases romaji characters -->
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" splitOnNumerics="0" generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
              </analyzer>
            </fieldtype>
        
        

        In this way, Japanese text is split and normalized morphologically. The test string above (ひらがな) will still not autocomplete, but a string like シニアソフトウェアエンジニア will be split into シニア ソフトウェア エンジニア.

        If you want the string ひらがな to autocomplete, you may consider the more drastic option of whitespace-only tokenisation, by changing the "spell" field type to:

            <fieldtype name="spell" class="solr.TextField" positionIncrementGap="100">
              <analyzer>
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.StandardFilterFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" splitOnNumerics="0" generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
              </analyzer>
            </fieldtype>
         
        
        

        This will not do any language-specific analysis, though.
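The trade-off between the two configurations can be sketched in a few lines of Python (a simulation of the facet.prefix concept, not of Solr itself): autocomplete matches the prefix against indexed terms, so ひらがな only matches when tokenization leaves it in the index as a single term.

```python
def facet_prefix(terms: set[str], prefix: str) -> list[str]:
    """Simulate Solr's facet.prefix: keep indexed terms starting with prefix."""
    return sorted(t for t in terms if t.startswith(prefix))

# With whitespace tokenization, ひらがな survives as one indexed term,
# so the prefix ひら finds it.  (Sample document text is hypothetical.)
whitespace_terms = set("ひらがな の テスト".split())
print(facet_prefix(whitespace_terms, "ひら"))  # ['ひらがな']

# If the analyzer drops or rewrites the term, nothing is left to match.
print(facet_prefix({"テスト"}, "ひら"))  # []
```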

        Yannick Roger (Inactive) added a comment

        PR: https://github.com/ezsystems/ezfind/pull/122

        It needs Paul's 2nd configuration in order to work with Japanese (we will need to update the doc).

        Yannick Roger (Inactive) added a comment

        Fixed in master:
        https://github.com/ezsystems/ezfind/commit/edc6d1d660dbbbb66bfaf9e9ceeabec65b4200f7
        https://github.com/ezsystems/ezfind/commit/f3c4c0138f5cc9d775de6e3a695d8406a4f21fcb
        Ricardo Correia (Inactive) added a comment

        The eZ Find documentation for versions 2.4, 2.5, 2.6, 2.7, 5.0.0 and 5.1.0 has been updated accordingly, in the following locations:
        http://doc.ez.no/Extensions/eZ-Publish-extensions/eZ-Find/eZ-Find-LS-5.1.0/Customization/Auto-complete-search
        http://doc.ez.no/Extensions/eZ-Publish-extensions/eZ-Find/eZ-Find-LS-5.0.0/Customization/Auto-complete-search
        http://doc.ez.no/Extensions/eZ-Publish-extensions/eZ-Find/eZ-Find-2.7/Customization/Auto-complete-search
        http://doc.ez.no/Extensions/eZ-Publish-extensions/eZ-Find/eZ-Find-2.6/Customization/Auto-complete-search
        http://doc.ez.no/Extensions/eZ-Publish-extensions/eZ-Find/eZ-Find-2.5/Customization/Auto-complete-search
        http://doc.ez.no/Extensions/eZ-Publish-extensions/eZ-Find/eZ-Find-2.4/Customization/Auto-complete-search
        Joao Pingo (Inactive) added a comment

        Tested in master, 5.2 and 5.1 with tc-1833.
        Test passed. QA approved.


          People

          • Assignee: Unassigned
          • Reporter: Nuno Oliveira (Inactive)
          • Votes: 0
          • Watchers: 7


              Time Tracking

              Estimated: Not Specified
              Remaining: 0m
              Logged: 1w 4d 30m