Recently, a colleague on another team told me that importing a CSV file into a Solr server failed with the following exception:
SEVERE: org.apache.solr.common.SolrException: CSVLoader: input=file:/sample.txt, line=1,can’t read line: 12450
        values={NO LINES AVAILABLE}
        at org.apache.solr.handler.loader.CSVLoaderBase.input_err(CSVLoaderBase.java:320)
        at org.apache.solr.handler.loader.CSVLoaderBase.load(CSVLoaderBase.java:359)
        at org.apache.solr.handler.loader.CSVLoader.load(CSVLoader.java:31)
  …
Caused by: java.io.IOException: (line 1) invalid char between encapsulated token end delimiter.
        at org.apache.solr.internal.csv.CSVParser.encapsulatedTokenLexer(CSVParser.java:481)
        at org.apache.solr.internal.csv.CSVParser.nextToken(CSVParser.java:359)
        at org.apache.solr.internal.csv.CSVParser.getLine(CSVParser.java:231)
        at org.apache.solr.handler.loader.CSVLoaderBase.load(CSVLoaderBase.java:356)

To track it down, I enabled remote debugging and used the Eclipse Display view to inspect the invalid character and the two characters after it, then searched the CSV file for that sequence. The cause was a double quote (”) inside the value of the from field: |  | “an,xxxx”
For more information, please read: 
Use Eclipse Display View While Debugging to Fix Real Problem
Import CSV that Contains Double-Quotes into Solr
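
One way to avoid this error at the source is to quote field values before writing the CSV file. The sketch below is a hypothetical helper (CsvFieldQuoter is not part of Solr) and assumes the usual CSV convention, with the double quote as the encapsulator and a doubled double quote standing for a literal one inside a quoted value:

```java
// Hypothetical helper, not part of Solr: quotes one CSV field value.
final class CsvFieldQuoter {
    private CsvFieldQuoter() {}

    /** Wraps the value in double quotes and doubles every embedded double quote. */
    static String quote(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 2);
        sb.append('"');
        for (int i = 0; i < value.length(); i++) {
            char ch = value.charAt(i);
            if (ch == '"') {
                sb.append('"'); // escape an embedded quote by doubling it
            }
            sb.append(ch);
        }
        sb.append('"');
        return sb.toString();
    }
}
```

For example, CsvFieldQuoter.quote("say \"hi\"") produces "say ""hi""", so the embedded quotes no longer end the field early.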

This prompted me to change Solr’s code so that the next time a similar problem occurs, we can find the cause directly in the log instead of having to debug remotely again.

The modified code looks like this (unchanged branches are omitted):

private Token encapsulatedTokenLexer(Token tkn, int c) throws IOException {
  for (;;) {
    c = in.read();
    // ... (earlier branches omitted)
    else if (c == strategy.getEncapsulator()) {
      if (in.lookAhead() == strategy.getEncapsulator()) {
        // ... (doubled encapsulator handling omitted)
      } else {
        for (;;) {
          c = in.read();
          // ... (earlier branches omitted)
          else if (!isWhitespace(c)) {
            // error: invalid char between token and next delimiter
            throw new IOException(
                "(line " + getLineNumber()
                + ") invalid char between encapsulated token end delimiter, invalid char: "
                + (char) c + ", context " + getContextChars(c));
          }
        }
      }
    }
    // ... (remaining branches omitted)
  }
}
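// new method: read up to 3 more characters after the invalid one
private String getContextChars(int c) {
  StringBuilder moreChars = new StringBuilder().append((char) c);
  try {
    for (int count = 0; count < 3; count++) {
      int tmpc = in.read();
      if (tmpc < 0) {
        break; // end of input: nothing more to show
      }
      moreChars.append((char) tmpc);
    }
  } catch (IOException e) {
    // best effort: return whatever context was read so far
  }
  return moreChars.toString();
}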
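
To try the context-reading idea outside Solr, here is a minimal standalone sketch. It assumes a plain java.io.Reader in place of Solr's internal reader, and the class and method names (ContextCharsDemo, contextChars) are made up for illustration:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical standalone version of the helper above; not Solr code.
final class ContextCharsDemo {
    private ContextCharsDemo() {}

    /**
     * Returns the offending character followed by up to maxExtra characters
     * read from the reader. Stops early at end of input or on a read error.
     */
    static String contextChars(Reader in, int offending, int maxExtra) {
        StringBuilder sb = new StringBuilder().append((char) offending);
        try {
            for (int i = 0; i < maxExtra; i++) {
                int next = in.read();
                if (next < 0) {
                    break; // end of input
                }
                sb.append((char) next);
            }
        } catch (IOException e) {
            // best effort: we are already building an error message
        }
        return sb.toString();
    }
}
```

Calling ContextCharsDemo.contextChars(new StringReader("bcdef"), 'a', 3) returns "abcd". Note that the extra characters are consumed from the reader, which is acceptable here only because parsing is about to abort with an exception anyway.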

via Blogger http://lifelongprogrammer.blogspot.com/2013/10/improve-solr-csvparser-to-log-invalid-characters.html
