eZ Publish / Platform, issue EZP-18485

Indexing fails on File objects with PDF with default settings

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Medium
    • Environment: Ubuntu 10.04

    Description

      It seems that Solr 3.1 no longer accepts some non-UTF-8 characters sent while indexing a document. This happens with the attached PDF and the default settings (pstotext is used to convert the PDF to text).

      Solr log:

      22 juil. 2011 09:31:33 org.apache.solr.common.SolrException log
      GRAVE: org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0x20 (at char #2314, byte #-1)
      	at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
      	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
      	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
      	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
      	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
      	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
      	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
      	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
      	at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
      	at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
      	at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
      	at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
      	at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
      	at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
      	at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
      	at org.mortbay.jetty.Server.handle(Server.java:326)
      	at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
      	at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
      	at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
      	at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
      	at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
      	at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
      	at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
      Caused by: com.ctc.wstx.exc.WstxIOException: Invalid UTF-8 middle byte 0x20 (at char #2314, byte #-1)
      	at com.ctc.wstx.sr.StreamScanner.throwFromIOE(StreamScanner.java:708)
      	at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1086)
      	at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:98)
      	at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
      	... 22 more
      Caused by: java.io.CharConversionException: Invalid UTF-8 middle byte 0x20 (at char #2314, byte #-1)
      	at com.ctc.wstx.io.UTF8Reader.reportInvalidOther(UTF8Reader.java:313)
      	at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:204)
      	at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
      	at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
      	at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
      	at com.ctc.wstx.sr.StreamScanner.getNext(StreamScanner.java:763)
      	at com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:1995)
      	at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1069)
      	... 24 more
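
The message "Invalid UTF-8 middle byte 0x20" means the decoder read a lead byte that announces a multi-byte sequence, expected a continuation byte (0x80 to 0xBF), and found a space instead. One plausible cause, not confirmed for the attached PDF, is text in a single-byte encoding such as ISO-8859-1 being sent as if it were UTF-8. The sketch below reproduces that situation on a made-up sample; the function name and the sample bytes are illustrative and do not come from eZ Publish or Solr.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the byte that begins the bad sequence.
        return exc.start
    return None


# Hypothetical sample: ISO-8859-1 bytes where 0xE9 ('é') is followed by a
# space (0x20). Read as UTF-8, 0xE9 opens a three-byte sequence, so the
# space is rejected as an invalid continuation ("middle") byte.
sample = b"caf\xe9 au lait"
print(first_invalid_utf8_offset(sample))
```

Running the same check on the text produced by the PDF converter would show whether the bytes sent to Solr are already invalid before they leave the indexer.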
      
      Steps to reproduce

      Create a File object with the attached PDF, then check that the file content is not indexed.
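
A possible workaround, sketched here under assumptions rather than taken from the eZ Find code base, is to normalize the extracted text to valid UTF-8 before it is placed in the Solr update request. The fallback encoding (ISO-8859-1) is an assumption about the converter output, and the helper name `to_clean_utf8` is hypothetical. The sketch also replaces control characters that XML 1.0 does not allow, such as the form feed often used as a page separator in extracted text.

```python
import re

# C0 control characters that XML 1.0 forbids (tab, LF and CR stay allowed).
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def to_clean_utf8(raw: bytes, fallback_encoding: str = "latin-1") -> bytes:
    """Return `raw` as valid UTF-8 bytes with XML-illegal controls replaced.

    Assumption: input that is not valid UTF-8 is in `fallback_encoding`.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so this cannot fail.
        text = raw.decode(fallback_encoding)
    # Replace forbidden control characters with a space to keep word breaks.
    text = _XML_ILLEGAL.sub(" ", text)
    return text.encode("utf-8")


# Hypothetical converter output: Latin-1 bytes plus a form feed.
print(to_clean_utf8(b"caf\xe9 au lait\x0cpage 2").decode("utf-8"))
```

Guessing the fallback encoding can mis-decode text that is in some other encoding, so this trades occasional wrong characters for a document that Solr accepts and indexes.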

          People

            unknown
            dp@ez.no