This series describes how to use Nutch and Solr to implement features like Google Search’s “Jump to” and anchor links. This article covers the second step: using an UpdateRequestProcessor to store anchor tags and content in Solr.
The Problem 
In search results, to help users jump straight to the section they may be interested in, we want to add anchor links below the page description, just like Google Search’s “Jump to” and anchor links features.
Main Steps
1. Extract anchor tag, text and content in Nutch.
Also refer to
Using Nutch to Extract Anchor Tag and Content
Using HTML Parser Jsoup and Regex to Extract Text between Two Tags
Debugging and Optimizing Regular Expression
2. Using UpdateRequestProcessor to Store Anchor Tag and Content into Solr
This is described in this article
3. Using DocTransformer to Add Anchor tag and content into response. 

Task: Using UpdateRequestProcessor to Store Anchor Tag and Content into Solr
In the previous article, we used Nutch to extract the anchor tag, text and content from each web page and added them to the Solr document as three fields: anchorTags, anchorTexts and anchorContents. Each of these fields is a list of strings.

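On the Solr side, an UpdateRequestProcessor removes these three fields from the page document and adds a new document for each anchor. A docType field tells the two kinds apart: 0 means the document is a web page, 1 means it is an anchor.
The web page document and its anchor documents have a parent-child relationship: each anchor document stores the parent's unique key in a foreign-key field.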
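The effect of this split can be sketched without Solr. The following is a minimal illustration, not the processor itself: a plain Map stands in for a Solr document, the class name AnchorSplitSketch is made up for this example, and the page's unique key is assumed to be a field called id. The target field names (anchorTag, anchorText, anchorContent, anchorOrder, url) are the ones used in this article's configuration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a Map stands in for a Solr document (hypothetical names).
final class AnchorSplitSketch {

  // Returns one anchor document per list entry, followed by the page document.
  // The page map is modified in place: the three list fields are removed.
  static List<Map<String, Object>> split(Map<String, Object> page) {
    List<?> tags = (List<?>) page.remove("anchorTags");
    List<?> texts = (List<?>) page.remove("anchorTexts");
    List<?> contents = (List<?>) page.remove("anchorContents");
    page.put("docType", 0); // 0 = web page

    List<Map<String, Object>> docs = new ArrayList<>();
    if (tags != null && texts != null && contents != null) {
      if (tags.size() != texts.size() || tags.size() != contents.size()) {
        throw new IllegalArgumentException("anchor lists differ in size");
      }
      for (int i = 0; i < tags.size(); i++) {
        Map<String, Object> anchor = new LinkedHashMap<>();
        anchor.put("anchorTag", tags.get(i).toString());
        anchor.put("anchorText", texts.get(i).toString());
        anchor.put("anchorContent", contents.get(i).toString());
        anchor.put("anchorOrder", i);
        // Foreign key back to the page; "id" as the key field is an assumption.
        anchor.put("url", page.get("id"));
        anchor.put("docType", 1); // 1 = anchor
        docs.add(anchor);
      }
    }
    docs.add(page);
    return docs;
  }

  public static void main(String[] args) {
    Map<String, Object> page = new HashMap<>();
    page.put("id", "http://example.com/page");
    page.put("anchorTags", Arrays.asList("intro", "usage"));
    page.put("anchorTexts", Arrays.asList("Introduction", "Usage"));
    page.put("anchorContents", Arrays.asList("Intro text", "Usage text"));
    for (Map<String, Object> doc : split(page)) {
      System.out.println(doc);
    }
  }
}
```

For a page with two anchors, split returns three documents: two anchor documents with docType 1 and anchorOrder 0 and 1, followed by the page document with docType 0 and without the three list fields.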

// Guava's checkNotNull is assumed here, based on the checkNotNull calls below.
import static com.google.common.base.Preconditions.checkNotNull;

import java.io.IOException;
import java.util.Collection;
import java.util.Iterator;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class AnchorContentProcessorFactory extends
    UpdateRequestProcessorFactory {
  private String fromFlAnchorTags, fromFlAnchorTexts, fromFlAnchorContents;
  private String toFlAnchorTag, toFlAnchorText, toFlAnchorContent,
      toFlAnchorOrder, flForeignKey;

  // Reads the source and target field names from the processor configuration.
  @Override
  public void init(NamedList args) {
    if (args != null) {
      SolrParams params = SolrParams.toSolrParams(args);
      fromFlAnchorTags = checkNotNull(params.get("fromFlAnchorTags"),
          "fromFlAnchorTags can't be null");
      fromFlAnchorTexts = checkNotNull(params.get("fromFlAnchorTexts"),
          "fromFlAnchorTexts can't be null");
      fromFlAnchorContents = checkNotNull(params.get("fromFlAnchorContents"),
          "fromFlAnchorContents can't be null");
      toFlAnchorTag = checkNotNull(params.get("toFlAnchorTag"),
          "toFlAnchorTag can't be null");
      toFlAnchorText = checkNotNull(params.get("toFlAnchorText"),
          "toFlAnchorText can't be null");
      toFlAnchorContent = checkNotNull(params.get("toFlAnchorContent"),
          "toFlAnchorContent can't be null");
      toFlAnchorOrder = checkNotNull(params.get("toFlAnchorOrder"),
          "toFlAnchorOrder can't be null");
      flForeignKey = checkNotNull(params.get("flForeignKey"),
          "flForeignKey can't be null");
    }
  }

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new AnchorContentProcessor(next);
  }

  class AnchorContentProcessor extends UpdateRequestProcessor {
    public AnchorContentProcessor(UpdateRequestProcessor next) {
      super(next);
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrInputDocument oldDoc = cmd.solrDoc;
      // docType 0 means this item is a full web page.
      // docType 1 means this item is an anchor.
      oldDoc.setField("docType", 0);
      Collection<Object> fromAnchorTags = oldDoc
          .getFieldValues(fromFlAnchorTags);
      Collection<Object> fromAnchorTexts = oldDoc
          .getFieldValues(fromFlAnchorTexts);
      Collection<Object> fromAnchorContents = oldDoc
          .getFieldValues(fromFlAnchorContents);
      if (fromAnchorTags != null && fromAnchorTexts != null
          && fromAnchorContents != null) {
        if (fromAnchorTags.size() != fromAnchorTexts.size()
            || fromAnchorTags.size() != fromAnchorContents.size()) {
          throw new RuntimeException(
              "size doesn't match: size of fromAnchorTags: "
                  + fromAnchorTags.size() + ", size of fromAnchorTexts: "
                  + fromAnchorTexts.size() + ", size of fromAnchorContents: "
                  + fromAnchorContents.size());
        }
        String uniqueFl = cmd.getReq().getSchema().getUniqueKeyField()
            .getName();
        Iterator<Object> it1 = fromAnchorTags.iterator(), it2 = fromAnchorTexts
            .iterator(), it3 = fromAnchorContents.iterator();
        int order = 0;
        while (it1.hasNext()) {
          // Use a fresh document and command for each anchor so that no
          // field values or cached state carry over from the previous one.
          SolrInputDocument newDoc = new SolrInputDocument();
          newDoc.addField(toFlAnchorTag, it1.next().toString());
          newDoc.addField(toFlAnchorText, it2.next().toString());
          newDoc.addField(toFlAnchorContent, it3.next().toString());
          newDoc.addField(toFlAnchorOrder, order++);
          // point the anchor back to its parent page
          newDoc.addField(flForeignKey, oldDoc.getFieldValue(uniqueFl));
          // set docType 1 for the anchor item
          newDoc.addField("docType", 1);
          AddUpdateCommand newCmd = new AddUpdateCommand(cmd.getReq());
          newCmd.solrDoc = newDoc;
          super.processAdd(newCmd);
        }
      }
      // The list fields only carry data from Nutch; remove them from the page.
      oldDoc.removeField(fromFlAnchorTags);
      oldDoc.removeField(fromFlAnchorTexts);
      oldDoc.removeField(fromFlAnchorContents);
      super.processAdd(cmd);
    }
  }
}


Parameters passed to the processor in solrconfig.xml:
      <str name="fromFlAnchorTags">anchorTags</str>
      <str name="fromFlAnchorTexts">anchorTexts</str>
      <str name="fromFlAnchorContents">anchorContents</str>

      <str name="toFlAnchorTag">anchorTag</str>
      <str name="toFlAnchorText">anchorText</str>
      <str name="toFlAnchorContent">anchorContent</str>
      <str name="toFlAnchorOrder">anchorOrder</str>
      <str name="flForeignKey">url</str>


Fields added to schema.xml:
    <field name="docType" type="tint" indexed="true" stored="true" multiValued="false" />
    <field name="anchorTag" type="string" indexed="false" stored="true" multiValued="false" />
    <field name="anchorText" type="string" indexed="false" stored="true" multiValued="false" />
    <field name="anchorContent" type="text_rev" indexed="true" stored="false" multiValued="false" />
    <field name="anchorOrder" type="tint" indexed="true" stored="true" multiValued="false" />
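
With these fields in place, the anchors of one page could be fetched with an ordinary query. The sketch below only builds the query string and sort clause. It assumes the foreign-key field is url as configured above, that url is indexed in the schema, and that the standard Lucene query syntax is used. The class and method names are hypothetical and not part of the original code.

```java
// Illustration only: builds a query for the anchor documents of one page.
final class AnchorQuerySketch {

  // Sort clause that returns anchors in the order they appear on the page.
  static final String SORT = "anchorOrder asc";

  // Wraps a value in double quotes, escaping backslashes and double quotes
  // so the value is read as one phrase by the Lucene query parser.
  static String quote(String value) {
    return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  // Query for all anchor documents (docType 1) that point to the given page.
  static String anchorsOf(String pageKey) {
    return "docType:1 AND url:" + quote(pageKey);
  }

  public static void main(String[] args) {
    System.out.println("q=" + anchorsOf("http://example.com/page"));
    System.out.println("sort=" + SORT);
  }
}
```

Calling anchorsOf("http://example.com/page") yields docType:1 AND url:"http://example.com/page", which can be sent as the q parameter together with sort=anchorOrder asc.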


via Blogger http://ift.tt/1jhx5D8