Details
-
Bug
-
Resolution: Unresolved
-
Medium
-
None
-
None
-
None
-
Ubuntu 10.04
Description
It seems like that Solr 3.1 does not accept anymore some non-UTF8 characters send while indexing a document. This is the case with the attached PDF and the default settings (pstotext to transform PDF in text)
Solr log :
22 juil. 2011 09:31:33 org.apache.solr.common.SolrException log GRAVE: org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0x20 (at char #2314, byte #-1) at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) Caused by: com.ctc.wstx.exc.WstxIOException: Invalid UTF-8 middle byte 0x20 (at char #2314, byte #-1) at com.ctc.wstx.sr.StreamScanner.throwFromIOE(StreamScanner.java:708) at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1086) at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:98) at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77) ... 22 more Caused by: java.io.CharConversionException: Invalid UTF-8 middle byte 0x20 (at char #2314, byte #-1) at com.ctc.wstx.io.UTF8Reader.reportInvalidOther(UTF8Reader.java:313) at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:204) at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84) at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57) at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992) at com.ctc.wstx.sr.StreamScanner.getNext(StreamScanner.java:763) at com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:1995) at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1069) ... 24 more
Steps to reproduce
create a File object with the attached PDF and check that the file content is not indexed.