Tags

,

The Problem
We use Nutch2 to crawl our internal documentation site, and save index to Solr. We noticed that if the title is too long (longer than 100 characters), the title would be truncated to the frist 100 chracters.

For example: 
The original title is:
Getting Started -….(omit 62 characters) Firewall for Windows File System
In solr search result, the title would be:
Getting Started -….(omit 62 characters) Firewall for Wi
This is bad for the user experience. We want to save entire title to Solr.
How Nutch Works?
In parsing phrase, Nutch gets the entire title:
org.apache.nutch.parse.html.DOMContentUtils.getTitle(StringBuilder, Node)

org.apache.nutch.parse.html.HtmlParser.getParse(String, WebPage)
utils.getTitle(sb, root); // extract title
title = sb.toString().trim();
Parse parse = new Parse(text, title, outlinks, status);

//<![CDATA[
if(showAdsense){
document.write("(adsbygoogle = window.adsbygoogle || []).push({});”)
} else {
if (window.CHITIKA === undefined) {
window.CHITIKA = { ‘units’ : [] };
};
var unit = {
‘publisher’ : “jefferyyuan”,
‘width’ : 728,
‘height’ : 90,
‘type’ : “mpu”,
‘sid’ : “Chitika Default”,
‘color_site_link’ : “FFFFFF”,
‘color_title’ : “FFFFFF”,
‘color_border’ : “FFFFFF”,
‘color_text’ : “4E2800”,
‘color_bg’ : “F7873D”
};
var placement_id = window.CHITIKA.units.length;
window.CHITIKA.units.push(unit);
document.write(“

“);
var s = document.createElement(‘script’);
s.type = ‘text/javascript’;
s.src = ‘http://scripts.chitika.net/getads.js&#8217;;
try {
document.getElementsByTagName(‘head’)[0].appendChild(s);
} catch(e) {
document.write(s.outerHTML);
}
}
//]]>

But in indexing phrase, in BasicIndexingFilter, it will only crawl the first X characters of title. X is defined by property indexer.max.title.length.

org.apache.nutch.indexer.basic.BasicIndexingFilter
It will read property indexer.max.title.length from nutch-default.xml or nutch-site.xml. The value in nutch-site.xml will overwrite the one in nutch-default.xml.

public void setConf(Configuration conf) {
 this.conf = conf;
 this.MAX_TITLE_LENGTH = conf.getInt("indexer.max.title.length", 100);
 LOG.info("Maximum title length for indexing set to: " + this.MAX_TITLE_LENGTH);
}
public NutchDocument filter(NutchDocument doc, String url, WebPage page) throws IndexingException {
 String title = TableUtil.toString(page.getTitle());
 if (title.length() > MAX_TITLE_LENGTH) { // truncate title if needed
  title = title.substring(0, MAX_TITLE_LENGTH);
 }
 if (title.length() > 0) {
  doc.add("title", title);
 }
}

The default value of indexer.max.title.length is 100, as defined in nutch-default.xml.

<property>
 <name>indexer.max.title.length</name>
 <value>100</value>
 <description>The maximum number of characters of a title that are
   indexed.
   Used by index-basic.
 </description>
</property>

The Solution
Now the fix is straightforward, we can define indexer.max.title.length to a larger value in nutch-site.xml such as indexer.max.title.length=500.

Misc
Nutch includes many indexer plugin such as index-(basic|static|metadata), which add some field name and value pairs. We can check all added fields by opening call hierarchy on method: org.apache.nutch.indexer.NutchDocument.add(String, String).

Resources
nutch-default.xml

via Blogger http://lifelongprogrammer.blogspot.com/2013/12/nutch2-save-entire-title-to-solr.html

Advertisements