Recently, we hit a problem with the highlighter. I set hl.fragsize to 300, as shown below:

<str name="hl">on</str>
<str name="hl.fl">title,body_stored</str>
<str name="hl.fragsize">300</str>
<str name="f.title.hl.fragsize">0</str>
<str name="f.body_stored.hl.fragsize">300</str>

Yet the highlighting section for one document still returned more than 2000 characters.

Looking into the code, I found that in org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(TokenStream, String, boolean, int), after the for loop, the whole remaining text is appended to the last fragment:

if (
  // if there is text beyond the last token considered..
  (lastEndOffset < text.length())
  &&
  // and that text is not too large...
  (text.length()<= maxDocCharsToAnalyze)
 )
{
 //append it to the last fragment
 newText.append(encoder.encodeText(text.substring(lastEndOffset)));
}
currentFrag.textEndPos = newText.length();
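
To make the effect concrete, here is a minimal sketch in plain Java that mimics only the branch quoted above. It does not use any Lucene classes: the class and method names are hypothetical, and the sample text, the offset 18, and the limit 51200 are made-up values chosen for illustration.

```java
// Plain-Java illustration of the quoted branch; not Lucene code.
final class UnboundedTailDemo {

    // Mimics the branch: if text remains after lastEndOffset and the whole
    // text fits within maxDocCharsToAnalyze, append ALL of the remainder.
    static String appendTail(String fragmentSoFar, String text,
                             int lastEndOffset, int maxDocCharsToAnalyze) {
        StringBuilder newText = new StringBuilder(fragmentSoFar);
        if (lastEndOffset < text.length()
                && text.length() <= maxDocCharsToAnalyze) {
            newText.append(text.substring(lastEndOffset));
        }
        return newText.toString();
    }

    public static void main(String[] args) {
        // Hypothetical document: two words followed by a 2000-character tail.
        StringBuilder doc = new StringBuilder("lucene highlighter ");
        for (int i = 0; i < 2000; i++) {
            doc.append('.');
        }
        String text = doc.toString();

        // Assume the token loop stopped right after "highlighter" (offset 18).
        int lastEndOffset = "lucene highlighter".length();
        String lastFragment = appendTail(
                text.substring(0, lastEndOffset), text, lastEndOffset, 51200);

        System.out.println(lastFragment.length()); // prints 2019
    }
}
```

Nothing in this branch looks at the fragment size, so the last fragment grows to whatever length the remaining text has.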

This code is problematic: in some cases the last fragment is the most relevant one, so it is the fragment that gets selected and returned to the client.

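I changed the code as shown below, so that when the fragmenter is a SimpleFragmenter, only as much of the remaining text is appended as the fragment size still allows. Now it works 🙂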

//Test what remains of the original text beyond the point where we stopped analyzing
if(lastEndOffset < text.length())
{
 if(textFragmenter instanceof SimpleFragmenter)
 {
  SimpleFragmenter simpleFragmenter = (SimpleFragmenter) textFragmenter;
  int remain = simpleFragmenter.getFragmentSize() - (newText.length() - currentFrag.textStartPos);
  if(remain > 0 )
  {
   int endIndex = lastEndOffset + remain;
   if (endIndex > text.length()) {
    endIndex = text.length();
   }
   newText.append(encoder.encodeText(text.substring(lastEndOffset,
     endIndex)));
  }
 }
 else
 {
  newText.append(encoder.encodeText(text.substring(lastEndOffset)));
 }
}
currentFrag.textEndPos = newText.length();
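
The clamping arithmetic in the patch can be tried in isolation. The following is a rough sketch in plain Java, not the patched Lucene class: the class and method names are hypothetical, the fragment is assumed to start at position 0 of the buffer (so its current length stands in for newText.length() - currentFrag.textStartPos), and encoding is left out.

```java
// Plain-Java sketch of the bounded append; not the actual Lucene patch.
final class BoundedTailSketch {

    // Appends at most (fragmentSize - current fragment length) characters
    // of the text that follows lastEndOffset.
    static String appendBoundedTail(String fragmentSoFar, String text,
                                    int lastEndOffset, int fragmentSize) {
        StringBuilder newText = new StringBuilder(fragmentSoFar);
        if (lastEndOffset < text.length()) {
            int remain = fragmentSize - newText.length();
            if (remain > 0) {
                // Never read past the end of the text.
                int endIndex = Math.min(lastEndOffset + remain, text.length());
                newText.append(text, lastEndOffset, endIndex);
            }
        }
        return newText.toString();
    }

    public static void main(String[] args) {
        // Same hypothetical document as before: 19 characters plus a long tail.
        StringBuilder doc = new StringBuilder("lucene highlighter ");
        for (int i = 0; i < 2000; i++) {
            doc.append('.');
        }
        String text = doc.toString();
        int lastEndOffset = "lucene highlighter".length(); // 18

        String lastFragment = appendBoundedTail(
                text.substring(0, lastEndOffset), text, lastEndOffset, 300);
        System.out.println(lastFragment.length()); // prints 300
    }
}
```

With a fragment size of 300, the last fragment now stops at 300 characters. A short tail is still appended in full, and a fragment that has already reached the limit is left unchanged.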

Resources
https://issues.apache.org/jira/browse/LUCENE-5381

via Blogger http://lifelongprogrammer.blogspot.com/2014/01/lucene-fix-highlighter-issue-to-always-honor-hl-fragsize.html
