The Problem
In my project, we run queries against a remote Solr server and return a combined response to the client (we can't use SolrJ, because the request goes through a proxy application that adds extra functionality). Some text fields are very large, and we would like to reduce their size on the wire.
So at the remote Solr server we use ZipOutputStream and Base64OutputStream to compress the string, which reduces the size by more than 85%: an original 134 MB string is compressed to 16 MB.
Read more: Java: Use Zip Stream and Base64 Encoder to Compress Large String Data
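
The server side looks roughly like the following. This is a minimal sketch, not the original implementation: it assumes Apache Commons Codec's Base64OutputStream, and the class and entry names are illustrative.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;
import org.apache.commons.codec.binary.Base64OutputStream;

public class ZipBase64Compressor {
  // Compress a large string: chars -> UTF-8 bytes -> zip -> base64 text.
  public static String compress(String text) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    // ZipOutputStream compresses; Base64OutputStream encodes the result.
    ZipOutputStream zos = new ZipOutputStream(new Base64OutputStream(bos));
    zos.putNextEntry(new ZipEntry("content"));
    Writer writer = new OutputStreamWriter(zos, "UTF-8");
    writer.write(text);
    writer.flush();
    zos.closeEntry();
    writer.close(); // closes the whole stream chain, flushing base64 padding
    return bos.toString("UTF-8");
  }
}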

On the client side, when it receives the zipped Base64 string, it first Base64-decodes it, decompresses it, and then adds it as a field to Solr.

But if we do all this in memory, the huge original unzipped string (134 MB) gets loaded into memory, which will cause an OutOfMemoryError. Obviously this is not desirable.

Instead, we want to use streams (Base64InputStream and ZipInputStream) to decompress the data and write the original string to a temporary file. When Solr adds this field, it can use a FileReader to read from the temp file, and delete the temp file once it's done.

In Lucene, we can pass a Reader to a Field constructor; Lucene will consume the Reader and close it when it's done.
Such a field can only be indexed, not stored, but this is fine for us, as this field is only used for search.
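
In plain Lucene, that looks roughly like this (a minimal sketch; the file path is hypothetical):

import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.TextField;

public class ReaderFieldExample {
  // Build a document whose "content" field is fed from a Reader:
  // Lucene streams the Reader during indexing and closes it when done.
  public static Document docFromReader(String path) throws IOException {
    Reader reader = new FileReader(path);
    Document doc = new Document();
    doc.add(new TextField("content", reader)); // tokenized + indexed, never stored
    return doc;
  }
}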

Solr doesn't expose this capability, but we can easily extend Solr to define a custom field type that accepts a Reader.


The Solution
Custom Solr Field Type: FileTextField
FileTextField extends solr.schema.TextField. The value added to a FileTextField can be a string or a Reader. If it's a Reader, createField creates a Lucene Field with the Reader as a parameter: Field f = new Field(name, fr, type); Lucene consumes the Reader and closes it when it's done.
FileTextField has one configuration parameter, deleteFile. If true, it deletes the file after Lucene has read it and written its contents to the index; if false, it keeps the file. We require the encoding to be set in the ReaderWrapper constructors, which avoids the problem of different encodings being used when writing and reading the file.

import java.io.Reader;
import java.util.Map;

import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexableField;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.schema.TextField;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class FileTextField extends TextField {
  private static final Logger log = LoggerFactory.getLogger(FileTextField.class);
  private boolean deleteFile;

  @Override
  protected void init(IndexSchema schema, Map<String,String> args) {
    // Can't be stored: Lucene disallows storing fields with a Reader value.
    restrictProps(STORED);
    String str = args.remove("deleteFile");
    if (str != null) {
      deleteFile = Boolean.parseBoolean(str);
    }
    super.init(schema, args);
  }

  @Override
  public IndexableField createField(SchemaField field, Object value, float boost) {
    if (!field.indexed() && !field.stored()) {
      if (log.isTraceEnabled()) log.trace("Ignoring unindexed/unstored field: "
          + field);
      return null;
    }
    if (value instanceof ReaderWrapper) {
      ReaderWrapper reader = (ReaderWrapper) value;
      reader.setDeleteFile(deleteFile);
      // Index-only field type: tokenized and indexed, never stored.
      org.apache.lucene.document.FieldType newType = new org.apache.lucene.document.FieldType();
      newType.setIndexed(true);
      newType.setTokenized(true);
      return createFileTextField(field.getName(), reader, newType, boost);
    } else {
      return super.createField(field, value, boost);
    }
  }

  public Field createFileTextField(String name, Reader fr,
      org.apache.lucene.document.FieldType type, float boost) {
    Field f = new Field(name, fr, type);
    f.setBoost(boost);
    return f;
  }
}

ReaderWrapper
In order to get the file path, we create a wrapper, ReaderWrapper, which extends InputStreamReader. If deleteFile is true, its close method deletes the file after closing the stream.

public static class ReaderWrapper extends InputStreamReader {
    private String filePath;
    private boolean deleteFile;
    
    public ReaderWrapper(File file, String encoding)
        throws FileNotFoundException, UnsupportedEncodingException {
      super(new FileInputStream(file), encoding);
      this.filePath = file.getAbsolutePath();
    }
    public ReaderWrapper(String filename, String encoding)
        throws FileNotFoundException, UnsupportedEncodingException {
      super(new FileInputStream(filename), encoding);
      this.filePath = filename;
    }
    public void close() throws IOException {
      super.close();
      // Optionally delete the backing temp file once Lucene has finished reading it.
      if (deleteFile) {
        boolean deleted = new File(filePath).delete();
        if (!deleted) {
          log.error("Unable to delete " + filePath);
        }
      }
    }
    public void setDeleteFile(boolean deleteFile) {
      this.deleteFile = deleteFile;
    } 
  }
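
For illustration, a hypothetical usage (the field name and file path are made up; the content field must use the FileTextField type in the schema):

SolrInputDocument doc = new SolrInputDocument();
// Stream the unzipped temp file into Solr; when deleteFile=true on the
// field type, the file is removed after indexing.
doc.addField("content", new ReaderWrapper("/tmp/unzipped-content.txt", "UTF-8"));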

Define FileTextField field in schema
FileTextField is similar to Solr's built-in TextField: we can define a tokenizer and filters for index-time and query-time analysis. It accepts one additional parameter, deleteFile, and stored can't be true.
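
A sketch of what the schema.xml definition might look like (the class package and analyzer chain here are illustrative, not from the original post):

<fieldType name="fileText" class="com.example.solr.FileTextField"
           deleteFile="true" indexed="true" stored="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="content" type="fileText" indexed="true" stored="false"/>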

Read Document from Stream and add FileTextField into Solr
The following code uses Gson's streaming JsonReader to read one document. We determine the size of zippedcontent from the size field: if it's too large, we write the decompressed string to a temporary file and add a ReaderWrapper to the content field; otherwise we decompress it directly into a string.

private static void readOneDoc(JsonReader reader, SolrQueryRequest request)
    throws IOException {
  String str;
  reader.beginObject();
  long size = 0;
  Object unzippedcontent = null;
  boolean useFileText = false;
  SolrInputDocument solrDoc = new SolrInputDocument();
  while (reader.hasNext()) {
    str = reader.nextName();
    if ("size".equals(str)) {
      size = Long.parseLong(reader.nextString());
      if (size > SIZE_LIMIT_FILETEXT) {
        useFileText = true;
      }
    } else if ("zippedcontent".equals(str) && reader.peek() != JsonToken.NULL) {
      if (useFileText) {
        // unzippedcontent is a ReaderWrapper backed by a temp file
        unzippedcontent = unzipValueToTmpFile(reader);
      } else {
        // unzippedcontent is the decoded and decompressed string
        unzippedcontent = unzipValueDirectly(reader);
      }
    } else {
      // skip unknown fields, in case the server-side code changes
      reader.skipValue();
    }
  }
  reader.endObject();
  if (unzippedcontent != null) {
    solrDoc.addField("content", unzippedcontent);
  }
  // add it to Solr through the /update processor chain
  UpdateRequestProcessorChain updateChain = request.getCore()
      .getUpdateProcessingChain("/update");
  AddUpdateCommand command = new AddUpdateCommand(request);
  command.solrDoc = solrDoc;
  UpdateRequestProcessor processor = updateChain.createProcessor(request,
      new SolrQueryResponse());
  processor.processAdd(command);
}

private static String unzipValueDirectly(JsonReader reader)
    throws IOException {
  String value = reader.nextString();
  ZipInputStream zi = null;
  try {
    Base64InputStream base64is = new Base64InputStream(new ByteArrayInputStream(
        value.getBytes("UTF-8")));
    zi = new ZipInputStream(base64is);
    zi.getNextEntry();
    return IOUtils.toString(zi, "UTF-8");
  } finally {
    IOUtils.closeQuietly(zi);
  }
}

private static ReaderWrapper unzipValueToTmpFile(JsonReader reader) throws IOException {
  File tmpFile = File.createTempFile(TMP_FILE_PREFIX_ZIPPEDCONTENT, TMP_FILE_PREFIX_SUFFIX);
  String value = reader.nextString();
  Base64InputStream base64is = null;
  ZipInputStream zi = null;
  OutputStreamWriter osw = null;
  try {
    base64is = new Base64InputStream(new ByteArrayInputStream(
        value.getBytes("UTF-8")));
    zi = new ZipInputStream(base64is);
    zi.getNextEntry();
    osw = new OutputStreamWriter(new FileOutputStream(tmpFile), "UTF-8");
    IOUtils.copy(zi, osw, "UTF-8");
    zi.closeEntry();
  } finally {
    IOUtils.closeQuietly(osw);
    IOUtils.closeQuietly(zi);
    IOUtils.closeQuietly(base64is);
  }
  return new ReaderWrapper(tmpFile.getAbsolutePath(), "UTF-8");
}

via Blogger http://lifelongprogrammer.blogspot.com/2013/11/creating-custom-solr-type-to-stream-Large-text-field.html
