HTTP Proxy Settings in HttpURLConnection and Apache HTTP Client

During development, we often need to use Fiddler to monitor and debug requests and responses. This article introduces how to set a proxy, in code or on the command line, so that Fiddler can act as the proxy.


Set Proxy When Using HttpURLConnection
If we are using Java HttpURLConnection, we can set the following system properties in test code:
System.setProperty("http.proxyHost", "localhost");
System.setProperty("http.proxyPort", "8888");
or set them as JVM parameters on the command line:
-Dhttp.proxyHost=localhost -Dhttp.proxyPort=8888
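
If we prefer not to change JVM-wide settings, HttpURLConnection can also take a java.net.Proxy instance directly. A minimal sketch (the target URL is just a placeholder):

import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;

public class FiddlerProxyExample {
  public static void main(String[] args) throws Exception {
    // route only this connection through Fiddler on localhost:8888
    Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("localhost", 8888));
    HttpURLConnection conn =
        (HttpURLConnection) new URL("http://example.com/").openConnection(proxy);
    System.out.println(conn.getResponseCode());
    conn.disconnect();
  }
}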

Set Proxy in the Code When Using Apache HTTP Client 4.x
HttpHost proxy = new HttpHost("127.0.0.1", 8888, "http");
httpclient.getParams().setParameter(ConnRoutePNames.DEFAULT_PROXY, proxy);
Set Proxy in the Code When Using Apache HTTP Client 3.x
HttpClient client = new HttpClient();
client.getHostConfiguration().setProxy("127.0.0.1", 8888);

Set Proxy in Command Line When Using Apache HTTP Client 4.2 or Newer
If we are using Apache HTTP Client 4.2 or newer, we can use SystemDefaultHttpClient, which honors JSSE and networking system properties such as http.proxyHost and http.proxyPort.
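
A minimal sketch, assuming HttpClient 4.2+ is on the classpath; start the JVM with -Dhttp.proxyHost=localhost -Dhttp.proxyPort=8888 (the target URL is a placeholder):

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.SystemDefaultHttpClient;

public class SystemProxyExample {
  public static void main(String[] args) throws Exception {
    // SystemDefaultHttpClient picks up http.proxyHost/http.proxyPort automatically
    HttpClient client = new SystemDefaultHttpClient();
    HttpResponse rsp = client.execute(new HttpGet("http://example.com/"));
    System.out.println(rsp.getStatusLine());
  }
}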

How This Is Implemented in SystemDefaultHttpClient
SystemDefaultHttpClient uses ProxySelector.getDefault(), which returns a DefaultProxySelector. DefaultProxySelector uses NetProperties to read the system properties.
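
We can verify which proxy the default selector resolves for a given URI with a small sketch like this:

import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;
import java.util.List;

public class ProxySelectorCheck {
  public static void main(String[] args) {
    // with -Dhttp.proxyHost=localhost -Dhttp.proxyPort=8888 this should print
    // something like [HTTP @ localhost:8888]
    List<Proxy> proxies = ProxySelector.getDefault().select(URI.create("http://example.com/"));
    System.out.println(proxies);
  }
}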

Set Proxy in Command Line When Using Apache HTTP Client 3.x
If we are using Apache HTTP Client 3.x, we can read the system properties http.proxyHost and http.proxyPort; if they are not empty, set the proxy ourselves:

String proxyHost = System.getProperty("http.proxyHost");
String proxyPort = System.getProperty("http.proxyPort");

if (StringUtils.isNotBlank(proxyHost)
  && StringUtils.isNotBlank(proxyPort)) {
 client.getHostConfiguration().setProxy(proxyHost,
   Integer.parseInt(proxyPort));
}

We can use similar logic to set the proxy when using older Apache HTTP Client 4.x versions.
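
A sketch of that logic, reusing the 4.x API shown above (httpclient is assumed to be a DefaultHttpClient, and StringUtils is from Apache Commons Lang):

String proxyHost = System.getProperty("http.proxyHost");
String proxyPort = System.getProperty("http.proxyPort");

if (StringUtils.isNotBlank(proxyHost)
    && StringUtils.isNotBlank(proxyPort)) {
  HttpHost proxy = new HttpHost(proxyHost, Integer.parseInt(proxyPort));
  httpclient.getParams().setParameter(ConnRoutePNames.DEFAULT_PROXY, proxy);
}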

Resources
Java Networking and Proxies

C# Parse Negative Number When Using Double.Parse(String, NumberStyles)

The Problem
Our C# application sends a query to the Solr server, parses the response, and generates a graph report. Today the application threw the error “Input string was not in a correct format.” in one test environment.

At first, we thought it was due to the language and region settings, as described in the post: C# Parsing is Regional(Culture) Sensitive

But we found that in the customer environment it doesn’t always fail; it only failed in some rare cases.
The Analysis
We re-executed the Solr stats query and found some unexpected numbers in the response: the min value of the stats query was negative. This should not happen in normal cases.

The real problem is where these negative values come from; we should reject invalid values when pushing data to Solr. We will fix that problem separately.

But why does it fail when C# parses the response? The same code is used to parse all double values from the Solr response, which may contain negative values.

We checked the code and ran it with the negative number:

string value = "-10.01";
double dvalue = double.Parse(value, System.Globalization.NumberStyles.AllowExponent | System.Globalization.NumberStyles.AllowDecimalPoint);
Console.WriteLine(dvalue);

It failed. Now it’s clear that this is caused by Double.Parse(String, NumberStyles).
From Double.Parse Method (String, NumberStyles)
Converts the string representation of a number in a specified style to its double-precision floating-point number equivalent.

As we only specify AllowExponent and AllowDecimalPoint, the sign symbol is disallowed: the parser only accepts non-negative values and throws an exception when given a negative one. This code should be updated to use Double.Parse(String), whose default style allows a leading sign, or to include NumberStyles.AllowLeadingSign.

The Double.Parse method accepts strings in the format: [ws][sign][integral-digits[,]]integral-digits[.[fractional-digits]][E[sign]exponential-digits][ws], with the allowed elements controlled by the NumberStyles flags.

Resources
C# Parsing is Regional(Culture) Sensitive
Solr: Extend StatsComponent to Support stats.query, stats.facet and facet.topn

PowerShell: Working with CSV Files

Background
When importing a CSV file to Solr, the import may fail because the CSV is incorrectly formatted: mostly related to double quotes in column values, or rows that don’t have enough columns.

When this happens, we may have to dig into the CSV files. PowerShell is a great tool in this case.
Task: Get Line Number of the CSV Record
When Solr fails to import a CSV file, it may report the following error:
SEVERE: Import csv1.csv failed: org.apache.solr.common.SolrException: CSVLoader: input=file:/C:/csv1.csv, line=134370,expected 19 values but got 17
                values={field_values_in_this_row}
Solr reports the error at line 134370, but if we use Get-Content csv1.csv | Select-Object -index 134370, we may find the content of line 134370 is totally different. This is because if there are multi-line records in the CSV file, the reported line number does not correspond to the record’s physical line number.

  /**
   * ATTENTION: in case your csv has multiline-values the returned
   *            number does not correspond to the record-number
   * 
   * @return  current line number
   */
  public int org.apache.solr.internal.csv.CSVParser.getLineNumber() {
    return in.getLineNumber();  
  }

To get the correct line of the CSV record, use the following PowerShell command:
select-string -pattern 'field_values_in_this_row' csv1.csv | select Line,LineNumber
Line                                                                                              LineNumber
----                                                                                              ----------
field_values_in_this_row                                                                134378
Task: Get Record Number of CSV File
Users want to know whether all records in the CSV file were imported to Solr. To do this, we need to get the number of all non-empty records in the CSV file. The line count of the file is not useful, as there may be empty lines or multi-line records in the CSV file.

We can use the following PowerShell command; the Where-Object excludes empty records:
(Import-Csv csv1.csv | Where-Object { ($_.PSObject.Properties | ForEach-Object {$_.Value}) -ne $null} | Measure-Object).count

The previous command is slow. If we are sure there are no empty records (lines) in the CSV file, we can use the following command:
(Import-Csv .\csv1.csv | Measure-Object).count

Other CSV-Related PowerShell Commands
Select fields from a CSV file:
Import-Csv csv1.csv | select f1,f2 | Export-Csv -Path csv2.csv -NoTypeInformation
Add new fields into a CSV file:
Import-CSV csv1.csv | Select @{Name="Surname";Expression={$_."Last Name"}}, @{Name="GivenName";Expression={$_."First Name"}} | Export-Csv -Path csv2.csv -NoTypeInformation

Resources
Import CSV that Contains Double-Quotes into Solr
Improve Solr CSVParser to Log Invalid Characters

Part2: Run Time-Consuming Solr Query Faster: Use Guava CacheBuilder to Cache Response

The Problem
In our web application, the very first request to the Solr server is a stats query. When there are more than 50 million documents, the first stats query may take one, two, or more minutes, as Solr needs to load millions of documents and terms into memory.

Subsequent stats queries run faster once Solr has loaded the data into its caches, but they still take 5 to 15 or more seconds, as a stats query is a compute-intensive task over a lot of data.

We need to make it run faster so the web GUI is more responsive.
Main Steps
1. Auto-run queries X minutes after the last update following startup or commit, to make the first stats query run faster.
2. Use Guava CacheBuilder to cache the Solr response.
Step 2 is described in this article.

Task: Use Guava CacheBuilder to Cache Solr Response
We would like to store the responses of time-consuming requests in a cache, so later identical requests will be much faster.

The Implementation
CacheManager
CacheManager is the key class in the implementation. The key of the outer ConcurrentHashMap is the SolrCore; its value is another ConcurrentHashMap. The key of the inner ConcurrentHashMap is the cache type, such as Solr request; its value is a Guava Cache.

By default the cache is CacheBuilder.newBuilder().concurrencyLevel(16).expireAfterAccess(10, TimeUnit.MINUTES).softValues().recordStats().build(). We can specify the parameter -DcacheSpec=concurrencyLevel=10,expireAfterAccess=5m,softValues to use a different kind of cache.

It adds responses to the cache asynchronously.

public class CacheManager implements CacheStatsOpMXBean {
  protected static final Logger logger = LoggerFactory
      .getLogger(CacheManager.class);
  public static final String CACHE_TAG_SOLR_REQUEST = "CACHE_TAG_SOLR_REQUEST";
  @SuppressWarnings("rawtypes")
  private ConcurrentHashMap<SolrCore,ConcurrentHashMap<String,Cache>> cacheMap = new ConcurrentHashMap<SolrCore,ConcurrentHashMap<String,Cache>>();
  
  // volatile is required for safe publication with double-checked locking
  private static volatile CacheManager instance = null;
  private ExecutorService executors;
  
  private static String cacheSpec;
  
  private CacheManager() {
    cacheSpec = System.getProperty("cacheSpec");
    executors = Executors.newCachedThreadPool();
  }
  
  public static CacheManager getInstance() {
    if (instance == null) {
      synchronized (CacheManager.class) {
        if (instance == null) {
          instance = new CacheManager();
        }
      }
    }
    return instance;
  }
  
  private <K,V> Cache<K,V> newCache() {
    Cache<K,V> result = null;
    if (StringUtils.isNotBlank(cacheSpec)) {
      try {
        result = CacheBuilder.from(cacheSpec).build();
      } catch (Exception e) {
        logger.error("Invalid cacheSpec: " + cacheSpec, e);
      }
    }
    if (result == null) {
      // default cache
      result = CacheBuilder.newBuilder().concurrencyLevel(16)
          .expireAfterAccess(10, TimeUnit.MINUTES).softValues()
          .recordStats().build();
    }
    return result;
  }
  
  public <K,V> Cache<K,V> getCache(SolrCore core, String cacheTag) {
    cacheMap.putIfAbsent(core, new ConcurrentHashMap<String,Cache>());
    ConcurrentHashMap<String,Cache> coreCache = cacheMap.get(core);
    coreCache.putIfAbsent(cacheTag, newCache());
    return coreCache.get(cacheTag);
  }
  
  public void invalidateAll(SolrCore core) {
    ConcurrentHashMap<String,Cache> coreCache = cacheMap.get(core);
    if (coreCache != null) {
      for (Cache cache : coreCache.values()) {
        cache.invalidateAll();
      }
    }
  }

  public void addToCache(final SolrCore core, final String cacheTag,
      final CacheKeySolrQueryRequest cacheKey, final Object rspObj) {
    executors.submit(new Runnable() {
      @Override
      public void run() {
        Cache<CacheKeySolrQueryRequest,Object> cache = CacheManager
            .getInstance().getCache(core, cacheTag);
        cache.put(cacheKey, rspObj);
      }
    });
  }
}

CacheKeySolrQueryRequest
We can’t use SolrQueryRequest as the key of the Guava cache, because it doesn’t implement hashCode and equals: the hashCode would differ for two requests with the same Solr query, and equals would return false.
So we extract the params map (Map<String,String[]>) from the SolrQueryRequest and implement hashCode and equals ourselves. The order of entries in the map and of values in the String[] arrays doesn’t matter.

We could also use deepHashCode and deepEquals from the java-util library.

public class CacheKeySolrQueryRequest implements Serializable {
  
  private static final long serialVersionUID = 1L;
  Map<String,String[]> paramsMap;
  String url;
  
  private CacheKeySolrQueryRequest(SolrQueryRequest request) {
    this.paramsMap = SolrParams.toMultiMap(request.getParams().toNamedList());
    // remove unimportant params
    paramsMap.remove(CommonParams.TIME_ALLOWED);
    if (request.getContext().get("url") != null) {
      this.url = request.getContext().get("url").toString();
    }
  }
  
  public static CacheKeySolrQueryRequest create(SolrQueryRequest request) {
    CacheKeySolrQueryRequest result = null;
    if ((request.getContentStreams() == null || !request.getContentStreams()
        .iterator().hasNext())) {
      result = new CacheKeySolrQueryRequest(request);
    }
    return result;    
  }

  public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + ((url == null) ? 0 : url.hashCode());
    // the order in the map doesn't matter
    if (paramsMap != null) {
      int mapHashCode = 0;
      for (Entry<String,String[]> entry : paramsMap.entrySet()) {
        int entryHashCode = (entry.getKey() == null ? 0 : entry.getKey()
            .hashCode());
        int valuesHashCode = 0;
        for (String value : entry.getValue()) {
          // sum the value hashes so the order inside the array doesn't matter,
          // matching the order-insensitive haveSameElements used in equals
          valuesHashCode += (value == null ? 0 : value.hashCode());
        }
        // sum the entry hashes so the map iteration order doesn't matter
        mapHashCode += prime * entryHashCode + valuesHashCode;
      }
      result = prime * result + mapHashCode;
    }
    return result;
  }

  public boolean equals(Object obj) {
    if (this == obj) return true;
    if (obj == null) return false;
    if (getClass() != obj.getClass()) return false;
    CacheKeySolrQueryRequest other = (CacheKeySolrQueryRequest) obj;
    if (url == null) {
      if (other.url != null) return false;
    } else if (!url.equals(other.url)) return false;
    
    if (paramsMap == null) {
      if (other.paramsMap != null) return false;
    } else {
      if (paramsMap.size() != other.paramsMap.size()) return false;
      
      Iterator<Entry<String,String[]>> it = paramsMap.entrySet().iterator();
      while (it.hasNext()) {
        Entry<String,String[]> entry = it.next();
        String[] thisValues = entry.getValue();
        String[] otherValues = other.paramsMap.get(entry.getKey());
        if (!haveSameElements(thisValues, otherValues)) return false;
      }
    }
    return true;
  }
  
  // helper class, so we don't have to do a whole lot of autoboxing
  private static class Count {
    public int count = 0;
  }
  // from: http://ift.tt/1ly6zU9
  public boolean haveSameElements(String[] list1, String[] list2) {
    if (list1 == list2) return true;
    if (list1 == null || list2 == null || list1.length != list2.length) return false;
    HashMap<String,Count> counts = new HashMap<String,Count>();

    for (String item : list1) {
      if (!counts.containsKey(item)) counts.put(item, new Count());
      counts.get(item).count += 1;
    }
    for (String item : list2) {
      // If the map doesn't contain the item here, then this item wasn't in
      // list1
      if (!counts.containsKey(item)) return false;
      counts.get(item).count -= 1;
    }
    for (Map.Entry<String,Count> entry : counts.entrySet()) {
      if (entry.getValue().count != 0) return false;
    }
    return true;
  }  
}

ResponseCachedSearchHandler
If useCache is true, ResponseCachedSearchHandler first tries to load the response from the cache; if the response is already cached, it returns it directly. If this is the first time the request is executed, the handler runs the request and, if the execution time is longer than minExecuteTime, puts the response into the cache. By default minExecuteTime is -1, meaning we always put the response into the cache.
We can change the value of minExecuteTime so Solr only caches responses for requests that take more than the specified minimum time.
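
For example, a handler configuration might look like this (an illustrative sketch; the parameter names match the init() code below, and the values are placeholders):

<requestHandler name="/select" class="ResponseCachedSearchHandler" default="true">
  <lst name="defaults">
    <bool name="useCache">true</bool>
    <!-- only cache responses of requests that took longer than 1000 ms -->
    <int name="minExecuteTime">1000</int>
  </lst>
</requestHandler>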

Before returning a cached response, we have to call oldRsp.setReturnFields(new SolrReturnFields(oldReq)); this sets which fields to return based on the fl parameter in the request. Otherwise, Solr would return all fields, as no fl parameter would be set.

A subclass can extend ResponseCachedSearchHandler: implement isUseCache() to determine whether Solr should cache the response, and implement beforeReturnFromCache() to do something before returning the cached response.

public class ResponseCachedSearchHandler extends SearchHandler {  
  protected static final String PARAM_USE_CACHE = "useCache",
      PARAM_MIN_EXECUTE_TIME = "minExecuteTime";
  
  protected boolean defUseCache = false;
  protected int defMinExecuteTime = -1;
  public void init(NamedList args) {
    super.init(args);
    if (args != null) {
      defUseCache = defaults.getBool(PARAM_USE_CACHE, false);
      defMinExecuteTime = defaults.getInt(PARAM_MIN_EXECUTE_TIME, -1);
    }
  }
  
  public void handleRequestBody(SolrQueryRequest oldReq,
      SolrQueryResponse oldRsp) throws Exception {
    
    boolean useCache = isUseCache(oldReq);
    CacheKeySolrQueryRequest cacheKey = null;
    if (useCache) {
      Cache<CacheKeySolrQueryRequest,Object> cache = CacheManager
          .getInstance().getCache(oldReq.getCore(),
              CacheManager.CACHE_TAG_SOLR_REQUEST);
      
      cacheKey = CacheKeySolrQueryRequest.create(oldReq);
      if (cacheKey != null) {
        Object cachedRsp = cache.getIfPresent(cacheKey);
        if (cachedRsp != null) {
          NamedList<Object> valuesNL = oldRsp.getValues();
          valuesNL.add("response", cachedRsp);
          // SolrReturnFields defines which fields to return.
          oldRsp.setReturnFields(new SolrReturnFields(oldReq));
          beforeReturnFromCache(oldReq, oldRsp);
          return;
        }
      }
    }
    Stopwatch stopwatch = new Stopwatch().start();
    executeRequest(oldReq, oldRsp);
    long executeTime = stopwatch.elapsedTime(TimeUnit.MILLISECONDS);
    stopwatch.stop();
    beforeReturnNoCache(oldReq, oldRsp);
    addRspToCache(oldReq, oldRsp, useCache, cacheKey, executeTime);
  }
  
  protected void addRspToCache(SolrQueryRequest oldReq,
      SolrQueryResponse oldRsp, boolean useCache,
      CacheKeySolrQueryRequest cacheKey, long executeTime) {
    long minExecuteTime = oldReq.getParams().getInt(PARAM_MIN_EXECUTE_TIME,
        defMinExecuteTime);
    if (useCache && cacheKey != null && executeTime > minExecuteTime) {
      NamedList<Object> valuesNL = oldRsp.getValues();
      Object rspObj = (Object) valuesNL.get("response");
      CacheManager.getInstance().addToCache(oldReq.getCore(),
          CacheManager.CACHE_TAG_SOLR_REQUEST, cacheKey, rspObj);      
    }
  }
  
  /**
   * SubClass can extend this to check whether the request is stats query etc.
   */
  protected boolean isUseCache(SolrQueryRequest oldReq) {
    return oldReq.getParams().getBool(PARAM_USE_CACHE, defUseCache);
  }
  
  protected void beforeReturnNoCache(SolrQueryRequest oldReq,
      SolrQueryResponse oldRsp) {}

  protected void beforeReturnFromCache(SolrQueryRequest oldReq,
      SolrQueryResponse oldRsp) {}
      
  /**
   * by default, call searchHander.executeRequest
   */
  protected void executeRequest(SolrQueryRequest oldReq,
      SolrQueryResponse oldRsp) throws Exception {
    super.handleRequestBody(oldReq, oldRsp);
  }
}

CacheStatsFacetRequestHandler
CacheStatsFacetRequestHandler extends ResponseCachedSearchHandler so that Solr only stores the responses of stats and facet requests. We change the default requestHandler to CacheStatsFacetRequestHandler.

<requestHandler name="/select" class="CacheStatsFacetRequestHandler" default="true">
    <!-- omitted -->
  </requestHandler>
public class CacheStatsFacetRequestHandler extends ResponseCachedSearchHandler {
  protected boolean isUseCache(SolrQueryRequest oldReq) {
    boolean useCache = super.isUseCache(oldReq);
    if (useCache) {
      SolrParams params = oldReq.getParams();
      useCache = params.getBool(StatsParams.STATS, false)
          || params.getBool(FacetParams.FACET, false);
    }
    return useCache;
  }
}
InvalidateCacheProcessorFactory
We need to invalidate the caches after a Solr commit, so we add InvalidateCacheProcessorFactory to the default processor chain and to every updateRequestProcessorChain.
<updateRequestProcessorChain name="defaultChain" default="true">
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
  <processor class="InvalidateCacheProcessorFactory" />
  <processor class="AutoRunQueriesProcessorFactory" />
</updateRequestProcessorChain>
public class InvalidateCacheProcessorFactory extends
    UpdateRequestProcessorFactory {
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new InvalidateCacheProcessor(next);
  }  
  private static class InvalidateCacheProcessor extends
      UpdateRequestProcessor {    
    public InvalidateCacheProcessor(UpdateRequestProcessor next) {
      super(next);
    }
    public void processCommit(CommitUpdateCommand cmd) throws IOException {
      super.processCommit(cmd);
      CacheManager.getInstance().invalidateAll(cmd.getReq().getCore());
    }
  }
}

Part1: Run Time-Consuming Solr Query Faster: Auto Run Queries X Minutes after Startup and Commit

The Problem
In our web application, the very first request to the Solr server is a stats query. When there are more than 50 million documents, the first stats query may take one, two, or more minutes, as Solr needs to load millions of documents and terms into memory.
Subsequent stats queries run faster once Solr has loaded the data into its caches, but they still take 5 to 10 or more seconds, as a stats query is a compute-intensive task over a lot of data.

We want these stats queries to run faster to make the web GUI more responsive.
Main Steps
1. Make the first stats query run faster.
This is described in this article: auto-run queries X minutes after the last update following startup or commit.
2. Make subsequent stats queries run faster.
Task: Make the first stats query run faster
The first stats query is like this: q=*&stats=true&stats.field=szkb&stats.pagination=true&f.szkb.stats.query=*&f.szkb.stats.facet=file_type.
Solr firstSearcher and newSearcher

From Solr wiki:
A firstSearcher event is fired whenever a new searcher is being prepared but there is no current registered searcher to handle requests or to gain autowarming data from (ie: on Solr startup). A newSearcher event is fired whenever a new searcher is being prepared and there is a current searcher handling requests (aka registered).

In our application, we can’t use firstSearcher: there is too much data, and with multiple cores on one Solr server, startup would be very slow, possibly taking 3 to 5 minutes.
A commit may also take 1 to 2 minutes. Moreover, during the data-push phase the client pushes a lot of data and commits multiple times; we don’t want to slow down the commit, or run the queries every time after a commit.
Expected Solution
We want to run the defined queries when there has been no update in the last 5 minutes after server startup, and when there has been no update in the last 10 minutes after a commit.
This way we don’t run these queries too often: we only run them when the data is reasonably stable, i.e., there has been no update for X minutes.
The Implementation
QueryAutoRunner
This singleton class maintains the mapping between a SolrCore and its queries, and auto-runs the queries X minutes after the last update following startup or commit.

public class QueryAutoRunner {
  protected static final Logger logger = LoggerFactory
      .getLogger(QueryAutoRunner.class);
  
  public static final long DEFAULT_RUN_AUTO_QUERIES_AFTER_COMMIT = 1000 * 60 * 10;
  public static final long DEFAULT_RUN_AUTO_QUERIES_AFTER_STARTUP = 1000 * 60 * 2;
  
  public static long RUN_AUTO_QUERIES_AFTER_COMMIT = DEFAULT_RUN_AUTO_QUERIES_AFTER_COMMIT;
  public static long RUN_AUTO_QUERIES_AFTER_STARTUP = DEFAULT_RUN_AUTO_QUERIES_AFTER_STARTUP;
  private ConcurrentHashMap<SolrCore,CoreAutoRunnerState> autoRunQueries = new ConcurrentHashMap<SolrCore,CoreAutoRunnerState>();
  
  // volatile is required for safe publication with double-checked locking
  private static volatile QueryAutoRunner instance = null;
  public static QueryAutoRunner getInstance() {
    if (instance == null) {
      synchronized (QueryAutoRunner.class) {
        if (instance == null) {
          instance = new QueryAutoRunner();
        }
      }
    }
    return instance;
  }

  public void scheduleAutoRunnerAfterCommit(SolrCore core) {
    CoreAutoRunnerState autoQueriesState = autoRunQueries.get(core);
    autoQueriesState.setLastUpdateTime(new Date().getTime());
    autoQueriesState.schedule(RUN_AUTO_QUERIES_AFTER_COMMIT,
        RUN_AUTO_QUERIES_AFTER_COMMIT);
  }  
  public void updateLastUpdateTime(SolrCore core) {
    autoRunQueries.get(core).setLastUpdateTime(new Date().getTime());
  }
  
  public synchronized void initQueries(SolrCore core, Set<NamedList> queries) {
    CoreAutoRunnerState autoQueriesState = new CoreAutoRunnerState(core,
        queries);
    autoRunQueries.put(core, autoQueriesState);
    // always run auto queries for first start
    autoQueriesState.schedule(RUN_AUTO_QUERIES_AFTER_STARTUP, -1);
  }
  private QueryAutoRunner() {
    String str = System.getProperty("RUN_AUTO_QUERIES_AFTER_COMMIT");
    if (StringUtils.isNotBlank(str)) {
      try {
        RUN_AUTO_QUERIES_AFTER_COMMIT = Long.parseLong(str);
      } catch (Exception e) {
        logger
            .error("RUN_AUTO_QUERIES_AFTER_COMMIT should be a positive number");
      }
    }
    str = System.getProperty("RUN_AUTO_QUERIES_AFTER_STARTUP");
    if (StringUtils.isNotBlank(str)) {
      try {
        RUN_AUTO_QUERIES_AFTER_STARTUP = Long.parseLong(str);
      } catch (Exception e) {
        logger
            .error("RUN_AUTO_QUERIES_AFTER_STARTUP should be a positive number");
      }
    }
  }
  
  private static class CoreAutoRunnerState {
    protected static final Logger logger = LoggerFactory
        .getLogger(CoreAutoRunnerState.class);
    
    private SolrCore core;
    private AtomicLong lastUpdateTime = new AtomicLong();
    private Set<NamedList> paramsSet = new LinkedHashSet<NamedList>();

    private ScheduledFuture pending;
    private final ScheduledExecutorService scheduler = Executors
        .newScheduledThreadPool(1);

    public CoreAutoRunnerState(SolrCore core, Set<NamedList> queries) {
      this.core = core;
      this.paramsSet = queries;
    }
    
    public void schedule(long withIn, long minTimeNoUpdate) {
      // if there is already one scheduled runner whose remaining time less
      // than withIn (almost always), cancel the old one.
      if (pending != null && pending.getDelay(TimeUnit.MILLISECONDS) < withIn) {
        pending.cancel(false);
        pending = null;
      }
      if (pending == null) {
        pending = scheduler.schedule(new AutoQueriesRunner(minTimeNoUpdate),
            withIn, TimeUnit.MILLISECONDS);
        logger.info("Scheduled to run queries in " + withIn);
      }
    }
    
    private class AutoQueriesRunner implements Runnable {
      private long minTimeNoUpdate;
      
      public AutoQueriesRunner(long minTimeNoUpdate) {
        this.minTimeNoUpdate = minTimeNoUpdate;
      }      
      @Override
      public void run() {
        if (minTimeNoUpdate > 0
            && (new Date().getTime() - lastUpdateTime.get()) < minTimeNoUpdate) {
          long remainingTime = minTimeNoUpdate
              - (new Date().getTime() - lastUpdateTime.get());
          if (remainingTime > 1000) {
            // reschedule auto runner
            pending = scheduler.schedule(
                new AutoQueriesRunner(minTimeNoUpdate), remainingTime,
                TimeUnit.MILLISECONDS);
            return;
          }
        }
        logger.info("Started to execute auto runner for " + core.getName());
        // if there is no update in less than X minutes,
        for (NamedList params : paramsSet) {
          SolrQueryRequest request = null;
          try {
            request = new LocalSolrQueryRequest(core, params);
            
            String qt = request.getParams().get(CommonParams.QT);
            if (StringUtils.isBlank(qt)) {
              qt = "/select";
            }
            request.getContext().put("url", qt);
            core.execute(core.getRequestHandler(request.getParams().get(
                CommonParams.QT)), request, new SolrQueryResponse());
          } catch (Exception e) {
            logger.error("Error happened when run for " + core.getName()
                + " auro query: " + params, e);
          } finally {
            if (request != null) {
              request.close();
            }
          }
        }
        logger.info("Excuted auto runner for " + core.getName());
      }
    }
    public CoreAutoRunnerState setLastUpdateTime(long lastUpdateTime) {
      this.lastUpdateTime.set(lastUpdateTime);
      return this;
    }
  }
}

AutoRunQueriesRequestHandler
This request handler is an abstract handler, not meant to be called via HTTP. It is used to define the list of queries that will be run automatically at some point; it also schedules an AutoRunner 2 minutes after startup.
Its definition in solrconfig.xml looks like this:

<requestHandler name="/abstracthandler_autorunqueries" class="AutoRunQueriesRequestHandler" >
  <lst name="defaults">
    <arr name="autoRunQueries">
      <lst> 
        <str name="q">*</str>
        <str name="rows">0</str>                 
        <str name="stats">true</str>
        <str name="stats.pagination">true</str>
        <str name="f.szkbround1.stats.query">*</str>
        <str name="stats.field">szkbround1</str>
        <str name="f.szkbround1.stats.facet">ext_name</str>
      </lst>
    </arr>
  </lst>
</requestHandler>
public class AutoRunQueriesRequestHandler extends RequestHandlerBase
    implements SolrCoreAware {  
  private Set<NamedList> paramsSet = new LinkedHashSet<NamedList>();
  private static final String PARAM_AUTO_RUN_QUERIES = "autoRunQueries";
  public void init(NamedList args) {
    super.init(args);
    if (args != null) {
      NamedList nl = (NamedList) args.get("defaults");
      List<NamedList> allLists = (List<NamedList>) nl
          .get(PARAM_AUTO_RUN_QUERIES);
      if (allLists == null) return;
      for (NamedList nlst : allLists) {
        if (nlst.get("distrib") == null) {
          nlst.add("distrib", false);
        }
        paramsSet.add(nlst);
      }
    }
  }
  public void inform(SolrCore core) {
    if (!paramsSet.isEmpty()) {
      QueryAutoRunner.getInstance().initQueries(core, paramsSet);
    }
  }
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
      throws Exception {
    throw new SolrServerException("Abstract handler, not meant to be called.");
  }
}

AutoRunQueriesProcessorFactory
This processor factory needs to be added to the default processor chain and to every updateRequestProcessorChain. The InvalidateCacheProcessorFactory shown here is used to invalidate the Solr response cache; it is described in a later post.

<updateRequestProcessorChain name="defaultChain" default="true">
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
  <processor class="InvalidateCacheProcessorFactory" />
  <processor
   class="AutoRunQueriesProcessorFactory"/>      
</updateRequestProcessorChain>

Its processAdd and processDelete methods update the lastUpdateTime of CoreAutoRunnerState; its processCommit method schedules an AutoRunner in 10 minutes.

public class AutoRunQueriesProcessorFactory extends
    UpdateRequestProcessorFactory {
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new AutoRunQueriesProcessor(next);
  }
  
  private static class AutoRunQueriesProcessor extends UpdateRequestProcessor {
    public AutoRunQueriesProcessor(UpdateRequestProcessor next) {
      super(next);
    }
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      updateLastUpdateTime(cmd);
      super.processAdd(cmd);
    }
    public void processDelete(DeleteUpdateCommand cmd) throws IOException {
      updateLastUpdateTime(cmd);
      super.processDelete(cmd);
    }
    public void processCommit(CommitUpdateCommand cmd) throws IOException {
      super.processCommit(cmd);
      QueryAutoRunner.getInstance().scheduleAutoRunnerAfterCommit(
          cmd.getReq().getCore());
    }
    public void updateLastUpdateTime(UpdateCommand cmd) {
      QueryAutoRunner.getInstance().updateLastUpdateTime(
          cmd.getReq().getCore());
    }
  }
}

PowerShell Tips: Get a Random Sample from CSV File

The Problem

I am trying to write and test an R script against some data from a customer. But the data is too big; it would take a lot of time to load the data and run the script. So it would be better to extract a small sample from the original data.


The Solution
First extract the first line (the header) from the original CSV file and write it to the destination file:
Get-Content big.csv -TotalCount 1 | Out-File -Encoding utf8 sample.txt


Notice that by default the Out-File cmdlet and the redirection operator >> use the system default encoding when writing to a file, while most applications by default read data as UTF-8 or UTF-16. Hence we use -Encoding utf8 here.


Then we randomly select 100 lines from all lines except the first one.
First we read all lines except the first line: Get-Content big.csv | where {$_.readcount -gt 1 }


Then we randomly select 100 lines and append them to the destination file:
Get-Content big.csv | where {$_.readcount -gt 1 } | Get-Random -Count 100 | Out-File -Encoding utf8 -Append sample.txt


The Complete Script
Get-Content big.csv -TotalCount 1 | Out-File -Encoding utf8 sample.txt; Get-Content big.csv | where {$_.readcount -gt 1 } | Get-Random -Count 100 | Out-File -Encoding utf8 -Append sample.txt


Related Script: Get default system encoding
[System.Text.Encoding]::Default
[System.Text.Encoding]::Default.EncodingName


Resources
PSTip: Get-Random
Get-Random Cmdlet

Using Solr DocTransformer to Add Anchor Tag and Text into Response

This series talks about how to use Nutch and Solr to implement Google Search’s “Jump to” and anchor links features. This article introduces how to use a Solr DocTransformer to add the anchor tag and text into the response.
The Problem
In the search results, to help users easily jump to the section they may be interested in, we want to add anchor links below the page description, just like Google Search’s “Jump to” and anchor links features.
Main Steps
1. Extract anchor tag, text and content in Nutch.
Please refer to:
Using Nutch to Extract Anchor Tag and Content
Using HTML Parser Jsoup and Regex to Extract Text between Two Tags
Debugging and Optimizing Regular Expression
2. Using UpdateRequestProcessor to Store Anchor Tag and Content into Solr
3. Using Solr DocTransformer to Add Anchor Tag and Content into Response
Step 3 is described in the current article.

Task: Using Solr DocTransformer to Add Anchor Tag and Content into Response
In a previous article, we used Nutch to extract anchor tag, text and content from the web page, and saved the content into Solr as separate documents with docType 1.

To return anchor information for the web pages that match the query, we can use a Solr DocTransformer to add fields into the response.

AnchorTransformerFactory
DocTransformer is very powerful and useful; it allows us to add, remove, or update fields before returning the response. But it has one limit: it can only add one field, and the field name must be [transformer_name].


AnchorTransformer adds two fields, anchorTag and anchorText, into the SolrDocument. If we just use fl=[anchors], the response would not contain these fields. We have to use fl=[anchors],anchorTag,anchorText: listing anchorTag,anchorText tells Solr to add them into SolrReturnFields. Please refer to the code in SolrReturnFields.add(String, NamedList<String>, DocTransformers, SolrQueryRequest).
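
For example, a request using the transformer might look like this (host, core, and the other field names are placeholders):

http://localhost:8983/solr/core1/select?q=test&fl=id,title,[anchors],anchorTag,anchorText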

public class AnchorTransformerFactory extends TransformerFactory {
  
  private String defaultSort;
  private int defaultAnchorRows = 5;
  private static final String SORT_BY_ORDER = "order";
  protected static Logger logger = LoggerFactory
      .getLogger(AnchorTransformerFactory.class);
  public void init(NamedList args) {
    super.init(args);
    Object obj = args.get("sort");
    if (obj != null) {
      defaultSort = (String) obj;
    }
    obj = args.get("anchorRows");
    if (obj != null) {
      defaultAnchorRows = Integer.parseInt(obj.toString());
    }
  }
  @Override
  public DocTransformer create(String field, SolrParams params,
      SolrQueryRequest req) {
    String sort = defaultSort;
    if (!StringUtils.isBlank(params.get("sort"))) {
      sort = params.get("sort");
    }
    int anchorRows = defaultAnchorRows;
    if (StringUtils.isNotBlank(params.get("anchorRows"))) {
      anchorRows = Integer.parseInt(params.get("anchorRows"));
    }
    return new AnchorTransformer(field, req, sort, anchorRows);
  }
  
  private static class AnchorTransformer extends DocTransformer {
    private SolrQueryRequest req;
    private String sort;
    private int anchorRows;
    
    public AnchorTransformer(String field, SolrQueryRequest req, String sort,
        int anchorRows) {
      this.req = req;
      this.sort = sort;
      this.anchorRows = anchorRows;
    }
    
    @Override
    public void transform(SolrDocument doc, int docid) throws IOException {
      String oldQuery = req.getParams().get(CommonParams.Q);
      Object idObj = doc.getFieldValue("contentid");
      
      // idObj may be a plain Field, or a lazy-loaded field such as
      // org.apache.lucene.document.LazyDocument$LazyField (an IndexableField)
      String id;
      if (idObj instanceof org.apache.lucene.document.Field) {
        org.apache.lucene.document.Field field = (Field) idObj;
        id = field.stringValue();
      } else if (idObj instanceof IndexableField) {
        IndexableField field = (IndexableField) idObj;
        id = field.stringValue();
      } else {
        throw new RuntimeException("When this is called? obj.type:"
            + idObj.getClass());
      }
      SolrQuery query = new SolrQuery();
      query
          .setQuery(
              "anchorContent:" + ClientUtils.escapeQueryChars(oldQuery)
                  + " AND url: " + ClientUtils.escapeQueryChars(id))
          .addFilterQuery("docType:1").setRows(anchorRows)
          .setFields("anchorTag", "anchorText");
      if (SORT_BY_ORDER.equals(sort)) {
        query.setSort("anchorOrder", ORDER.asc);
      }
      // else default, sort by score
      List<Map<String,String>> anchorMap = extractSingleFieldValues(
          req.getCore(), "/select", query, "anchorTag", "anchorText");
      for (Map<String,String> map : anchorMap) {
        doc.addField("anchorTag", map.get("anchorTag"));
        doc.addField("anchorText", map.get("anchorText"));
      }
    }
    
  public static List<Map<String,String>> extractSingleFieldValues(
      SolrCore core, String handlerName, SolrQuery query, String... fls)
      throws IOException {
    SolrRequestHandler requestHandler = core.getRequestHandler(handlerName);
    query.setFields(fls);
    SolrQueryRequest newReq = new LocalSolrQueryRequest(core, query);
    try {
      SolrQueryResponse queryRsp = new SolrQueryResponse();
      requestHandler.handleRequest(newReq, queryRsp);
      return extractSingleFieldValues(newReq, queryRsp, fls);
    } finally {
      newReq.close();
    }
  }
  
  @SuppressWarnings("rawtypes")
  public static List<Map<String,String>> extractSingleFieldValues(
      SolrQueryRequest newReq, SolrQueryResponse newRsp, String[] fls)
      throws IOException {
    List<Map<String,String>> rst = new ArrayList<Map<String,String>>();
    NamedList contentIdNL = newRsp.getValues();
    
    Object rspObj = contentIdNL.get("response");
    SolrIndexSearcher searcher = newReq.getSearcher();    
    if (rspObj instanceof ResultContext) {
      ResultContext resultContext = (ResultContext) rspObj;
      DocList doclist = resultContext.docs;
      DocIterator dit = doclist.iterator();
      while (dit.hasNext()) {
        int docid = dit.nextDoc();
        Document doc = searcher.doc(docid, new HashSet<String>());
        Map<String,String> row = new HashMap<String,String>();
        for (String fl : fls) {
          row.put(fl, doc.get(fl));
        }
        rst.add(row);
      }
    } else if (rspObj instanceof SolrDocumentList) {
      SolrDocumentList docList = (SolrDocumentList) rspObj;
      Iterator<SolrDocument> docIt = docList.iterator();
      while (docIt.hasNext()) {
        SolrDocument doc = docIt.next();
        docIt.remove();
        Map<String,String> row = new HashMap<String,String>();
        for (String fl : fls) {
          Object tmp = doc.getFieldValue(fl);
          if (tmp != null) {
            row.put(fl, tmp.toString());
          }
        }
        rst.add(row);
      }
    }
    return rst;
  }    
  } 
}

SolrConfig.xml

  <transformer name="anchors" class="AnchorTransformerFactory" >
    <int name="anchorRows">5</int>
  </transformer>
  <requestHandler name="/select" class="solr.SearchHandler"
		default="true">  
      <lst name="defaults">
          <str name="fl">otherfields,[anchors],anchorTag,anchorText</str>
       </lst>
   </requestHandler>

Resources
Using UpdateRequestProcessor to Store Anchor Tag and Content into Solr
Using Nutch to Extract Anchor Tag and Content
Using HTML Parser Jsoup and Regex to Extract Text between Two Tags
Debugging and Optimizing Regular Expression

Using UpdateRequestProcessor to Store Anchor Tag and Content into Solr

This series talks about how to use Nutch and Solr to implement Google Search’s “Jump to” and anchor links features. This article introduces how to use an UpdateRequestProcessor to store the anchor tag and content into Solr.
The Problem
In the search results, to help users easily jump to the section they may be interested in, we want to add anchor links below the page description, just like Google Search’s “Jump to” and anchor links features.
Main Steps
1. Extract anchor tag, text and content in Nutch.
Also refer to:
Using Nutch to Extract Anchor Tag and Content
Using HTML Parser Jsoup and Regex to Extract Text between Two Tags
Debugging and Optimizing Regular Expression
2. Using UpdateRequestProcessor to Store Anchor Tag and Content into Solr
This is described in this article.
3. Using DocTransformer to add the anchor tag and content into the response.

Task: Using UpdateRequestProcessor to Store Anchor Tag and Content into Solr
In a previous article, we used Nutch to extract anchor tag, text and content from the web page and added them to the Solr document as three multi-valued string fields: anchorTags, anchorTexts, anchorContents.

On the Solr side, an UpdateRequestProcessor removes these three fields and adds a new document for each anchor, setting docType to 1 (docType 0 means the document is a web page; 1 means it is an anchor).
The web page document and its anchor documents form a parent-child relationship.
Code

public class AnchorContentProcessorFactory extends
    UpdateRequestProcessorFactory {
  
  private String fromFlAnchorTags, fromFlAnchorTexts, fromFlAnchorContents;
  private String toFlAnchorTag, toFlAnchorText, toFlAnchorContent,
      toFlAnchorOrder, flForeignKey;
  
  public void init(NamedList args) {
    super.init(args);
    if (args != null) {
      SolrParams params = SolrParams.toSolrParams(args);
      fromFlAnchorTags = checkNotNull(params.get("fromFlAnchorTags"),
          "fromFlAnchorTags can't be null");
      fromFlAnchorTexts = checkNotNull(params.get("fromFlAnchorTexts"),
          "fromFlAnchorTexts can't be null");
      fromFlAnchorContents = checkNotNull(params.get("fromFlAnchorContents"),
          "fromFlAnchorContents can't be null");
      
      toFlAnchorTag = checkNotNull(params.get("toFlAnchorTag"),
          "toFlAnchorTag can't be null");
      toFlAnchorText = checkNotNull(params.get("toFlAnchorText"),
          "toFlAnchorText can't be null");
      toFlAnchorContent = checkNotNull(params.get("toFlAnchorContent"),
          "toFlAnchorContent can't be null");
      toFlAnchorOrder = checkNotNull(params.get("toFlAnchorOrder"),
          "toFlAnchorOrder can't be null");
      flForeignKey = checkNotNull(params.get("flForeignKey"),
          "flForeignKey can't be null");
    }
  }
  
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new AnchorContentProcessor(next);
  }
  
  class AnchorContentProcessor extends UpdateRequestProcessor {
    
    public AnchorContentProcessor(UpdateRequestProcessor next) {
      super(next);
    }
    
    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      
      SolrInputDocument oldDoc = cmd.solrDoc;
      // docType 0 means this item is full web page.
      // docType 1 means this item is anchor.
      oldDoc.setField("docType", 0);
      Collection<Object> fromAnchorTags = oldDoc
          .getFieldValues(fromFlAnchorTags);
      Collection<Object> fromAnchorTexts = oldDoc
          .getFieldValues(fromFlAnchorTexts);
      Collection<Object> fromAnchorContents = oldDoc
          .getFieldValues(fromFlAnchorContents);
      
      if (fromAnchorTags != null && fromAnchorTexts != null
          && fromAnchorContents != null) {
        if (fromAnchorTags.size() != fromAnchorTexts.size()
            || fromAnchorTags.size() != fromAnchorContents.size()) throw new RuntimeException(
            "size doesn't match: size of fromAnchorTags: "
                + fromAnchorTags.size() + ", size of fromAnchorTexts: "
                + fromAnchorTexts.size() + ", size of fromAnchorContents: "
                + fromAnchorContents.size());
        
        // add a new document
        AddUpdateCommand newCmd = new AddUpdateCommand(cmd.getReq());
        SolrInputDocument newDoc = new SolrInputDocument();
        
        Iterator<Object> it1 = fromAnchorTags.iterator(), it2 = fromAnchorTexts
            .iterator(), it3 = fromAnchorContents.iterator();
        int order = 0;
        while (it1.hasNext()) {
          // avoid construct new SolrInputDocument
          newDoc.clear();
          newDoc.addField(toFlAnchorTag, it1.next().toString());
          newDoc.addField(toFlAnchorText, it2.next().toString());
          newDoc.addField(toFlAnchorContent, it3.next().toString());
          newDoc.addField(toFlAnchorOrder, order++);
          
          String uniqueFl = newCmd.getReq().getSchema().getUniqueKeyField()
              .getName();
          newDoc.addField(uniqueFl,
              UUID.randomUUID().toString().toLowerCase(Locale.ROOT));
          newDoc.addField(flForeignKey, oldDoc.getFieldValue(uniqueFl)
              .toString());
          // set docType 1 for the anchor item
          newDoc.addField("docType", 1);
          newCmd.solrDoc = newDoc;
          super.processAdd(newCmd);
        }
      }
      
      oldDoc.removeField(fromFlAnchorTags);
      oldDoc.removeField(fromFlAnchorTexts);
      oldDoc.removeField(fromFlAnchorContents);
      super.processAdd(cmd);
    }
  } 
}

SolrConfig.xml

<processor
   class="AnchorContentProcessorFactory">
      <str name="fromFlAnchorTags">anchorTags</str>
      <str name="fromFlAnchorTexts">anchorTexts</str>
      <str name="fromFlAnchorContents">anchorContents</str>

      <str name="toFlAnchorTag">anchorTag</str>
      <str name="toFlAnchorText">anchorText</str>
      <str name="toFlAnchorContent">anchorContent</str>
      <str name="toFlAnchorOrder">anchorOrder</str>
      <str name="flForeignKey">url</str>
    </processor>  

Schema.xml

<field name="docType" type="tint" indexed="true" stored="true" multiValued="false" /> 
    <field name="anchorTag" type="string" indexed="false" stored="true"  multiValued="false" /> 
    <field name="anchorText" type="string" indexed="false" stored="true" multiValued="false" /> 
    <field name="anchorContent" type="text_rev" indexed="true" stored="false" multiValued="false" /> 
    <field name="anchorOrder" type="tint" indexed="true" stored="true" multiValued="false" /> 

Resources
Using Nutch to Extract Anchor Tag and Content
Using HTML Parser Jsoup and Regex to Extract Text between Two Tags
Debugging and Optimizing Regular Expression

Using Nutch to Extract Anchor Tag and Content

This series talks about how to use Nutch and Solr to implement Google Search’s “Jump to” and Anchor links features.
The Problem
In the search results, to help users easily jump to the section they may be interested in, we want to add anchor links below the page description, just like Google Search’s “Jump to” and anchor links features.
Main Steps
1. Extract anchor tag, text and content in Nutch.
This is described in this article, in Using HTML Parser Jsoup and Regular Expression to Get Text between Two Tags, and in Debugging and Optimizing Regular Expression.
2. Save the anchor information to Solr.
3. Return the anchor tag and text that match the query.

Task: Extract anchor tag, text and content in Nutch
We will write a Nutch plugin named index-anchor-content; it implements the IndexingFilter extension point.

Its getFields method returns a collection that contains the WebPage.Field.CONTENT field. This tells Nutch to read the content field from the underlying data store. Without this step, the WebPage instance passed to the filter(NutchDocument, String, WebPage) method would not have a value for the content field.

In the filter method, we use Jsoup to extract all anchor links in the div[id=toc] ul>li section.

Then we use the regular expression <span[^>]*\bid\s*=\s*(?:"|')?{0}(?:'|")?[^>]*>([^<]*)</span>(.*?)<span[^>]*\bid\s*=\s*(?:"|')?{1}(?:'|")?[^>]*>([^<]*)</span> to extract the tag, text and content for each anchor. At runtime, {0} and {1} are replaced with the tags of two adjacent anchors.

We then add them into NutchDocument fields: anchorTags, anchorTexts, anchorContents.

Please read more in Using HTML Parser Jsoup and Regex to Extract Text between Two Tags and Debugging and Optimizing Regular Expression.

The detailed steps to build a Nutch plugin are omitted; please refer to Writing Nutch Plugin Example.
Code

public class AnchorContentIndexingFilter implements IndexingFilter {

  public static final Logger LOG = LoggerFactory
      .getLogger(AnchorContentIndexingFilter.class);
  private Configuration conf;
  private static final Collection<WebPage.Field> FIELDS = new HashSet<WebPage.Field>();
  static {
    FIELDS.add(WebPage.Field.CONTENT);
  }
  private static final String DEFAULT_REGEX_TOC_ANCHOR = "div[id=toc] ul>li a[href^=#]:not([href=#])";
  private static final String DEFAULT_REGEX_PLAIN_ANCHOR_TAG = "a[href^=#]:not([href=#])";

  private static final int DEFAULT_MAX_ANCHOR_LINKS = 20;
  private static final String DEFAULT_FL_ANCHOR_TAGS = "anchorTags",
      DEFAULT_FL_ANCHOR_TEXTS = "anchorTexts",
      DEFAULT_FL_ANCHOR_CONTENTS = "anchorContents",
      DEFAULT_REGEX_BODY_ROOT = "article[id=sectionContent]",
      DEFAULT_REGEX_EXTRACT_CONTENT = "<span[^>]*?\\bid\\s*=\\s*(?:\"|'')?{0}(?:''|\")?[^>]*>([^<]*)</span>(.*?)<span[^>]*?\\bid\\s*=\\s*(?:\"|'')?{1}(?:''|\")?[^>]*>([^<]*)</span>";

  private String flAnchorTags, flAnchorTexts, flAnchorContents, regexTocAnchor,
      // if can't find tocAnchor in web page, revert to plainAnchorTag
      regexPlainAnchorTag,
      // if exists, only search content in this section
      regexBodyRoot;

  private boolean extractOtherAnchors = false;
  /**
   * the regex to extract content between two tags: <br>
   * 1. The string must have 2 place holders {0}, {1}, it will be replaced by the
   * anchor name at runtime.<br>
   * 2. There must be 3 regex groups: the first group extracts the text
   * of the first anchor, the second group extracts the content between the two
   * anchors, and the third extracts the text of the second anchor.<br>
   * 3. If there is a single quote ' in the regex string, it has to be replaced by
   * doubled single quotes '' due to the usage of MessageFormat. See:
   * http://ift.tt/1c5Njax <br>
   * Check DEFAULT_REGEX_EXTRACT_CONTENT
   */
  private String regexExtractContent = DEFAULT_REGEX_EXTRACT_CONTENT;

  private int maxAnchorLinks = DEFAULT_MAX_ANCHOR_LINKS;
  private MessageFormat MSG_FORMAT;

  @Override
  public NutchDocument filter(NutchDocument doc, String url, WebPage page)
      throws IndexingException {

    ByteBuffer dataBuffer = page.getContent();
    String content = new String(dataBuffer.array());

    Document rootDoc = Jsoup.parse(content);
    try {
      List<Anchor> anchors = parseAnchors(rootDoc);
      for (Anchor anchor : anchors) {
        if (StringUtils.isNotBlank(anchor.getTag())
            && StringUtils.isNotBlank(anchor.getText())
            && StringUtils.isNotBlank(anchor.getContent())) {
          doc.add(flAnchorTags, anchor.getTag());
          doc.add(flAnchorTexts, anchor.getText());
          doc.add(flAnchorContents, anchor.getContent());
        }
      }
    } catch (IOException e) {
      throw new IndexingException(e);
    }
    return doc;
  }

  public List<Anchor> parseAnchors(Document rootDoc) throws IOException {
    List<Anchor> anchorContents = new LinkedList<Anchor>();
    Element rootElement = rootDoc;
    if (regexBodyRoot != null) {
      rootElement = rootDoc.select(regexBodyRoot).first();
    }
    if (rootElement == null)
      return anchorContents;
    Set<String> anchors = getAnchors(rootElement);
    if (anchors.isEmpty())
      return anchorContents;
    StringBuilder remainingTxt = new StringBuilder(rootElement.toString());

    Iterator<String> it = anchors.iterator();
    String curAnchorTag = it.next();
    String lastAnchorTag = null;
    while (it.hasNext() && remainingTxt.length() > 0) {
      String nextAnchorTag = it.next();
      Anchor anchor = getContentBetweenAnchor(remainingTxt, curAnchorTag, nextAnchorTag);
      anchorContents.add(anchor);
      if (!it.hasNext()) {
        // only for last anchor
        lastAnchorTag = anchor.getNextTagText();
      }
      curAnchorTag = nextAnchorTag;
    }
    // Don't forget last tag
    String lastTxt = Jsoup.parse(remainingTxt.toString()).text();
    if (StringUtils.isNotBlank(lastTxt)) {
      anchorContents.add(new Anchor(curAnchorTag, lastAnchorTag, lastTxt));
    }
    return anchorContents;
  }

  public Set<String> getAnchors(Element rootElement) {
    Set<String> anchors = new LinkedHashSet<String>() {
      private static final long serialVersionUID = 1L;

      @Override
      public boolean add(String e) {
        if (size() >= maxAnchorLinks)
          return false;
        return super.add(e);
      }
    };
    getAnchorsImpl(rootElement, regexTocAnchor, anchors);
    if (anchors.isEmpty() && extractOtherAnchors) {
      getAnchorsImpl(rootElement, regexPlainAnchorTag, anchors);
    }
    return anchors;
  }

  public void getAnchorsImpl(Element rootElement, String anchorPattern,
      Set<String> anchors) {
    Elements elements = rootElement.select(anchorPattern);
    if (!elements.isEmpty()) {
      for (Element element : elements) {
        String href = element.attr("href");
        anchors.add(href.substring(1));
      }
    }
  }
  public Anchor getContentBetweenAnchor(StringBuilder remainingTxt,
      String curAnchorTag, String nextAnchorTag) throws IOException {
    Anchor anchor = null;
    String regex = MSG_FORMAT.format(new String[] { curAnchorTag, nextAnchorTag });
    Matcher matcher = Pattern
        .compile(regex, Pattern.DOTALL | Pattern.MULTILINE).matcher(remainingTxt);
    if (matcher.find()) {
      String anchorText = Jsoup.parse(matcher.group(1)).text();
      String anchorContent = anchorText + " "
          + Jsoup.parse(matcher.group(2)).text();
      String nextTagText = matcher.group(3);
      anchor = new Anchor(curAnchorTag, anchorText, anchorContent, nextTagText);

      int g2End = matcher.end(2);
      remainingTxt.delete(0, g2End);
    }
    return anchor;
  }

  @Override
  public Collection<WebPage.Field> getFields() {
    return FIELDS;
  }
  
  private static class Anchor {
    private String tag, text, content,
    // used to get last tag text
    nextTagText;
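    // the constructors and getters (getTag, getText, getContent, getNextTagText)
    // used above are omitted for brevity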
  }
  public void setConf(Configuration conf) {
    this.conf = conf;
  
    flAnchorTags = getValue(conf, "indexer.anchorContent.field.anchorTags",
        DEFAULT_FL_ANCHOR_TAGS, false);
    flAnchorTexts = getValue(conf, "indexer.anchorContent.field.anchorTexts",
        DEFAULT_FL_ANCHOR_TEXTS, false);
    flAnchorContents = getValue(conf,
        "indexer.anchorContent.field.anchorContents",
        DEFAULT_FL_ANCHOR_CONTENTS, false);
    regexTocAnchor = getValue(conf, "indexer.anchorContent.regex.tocAnchor",
        DEFAULT_REGEX_TOC_ANCHOR, false);
    String str = getValue(conf, "indexer.anchorContent.extractOtherAnchors",
        "false", true);
    if (StringUtils.isNotBlank(str)) {
      extractOtherAnchors = Boolean.parseBoolean(str);
    }
    if (extractOtherAnchors) {
      regexPlainAnchorTag = getValue(conf,
          "indexer.anchorContent.regex.plainAnchorTag",
          DEFAULT_REGEX_PLAIN_ANCHOR_TAG, false);
    }
    regexBodyRoot = getValue(conf, "indexer.anchorContent.regex.bodyRoot",
        DEFAULT_REGEX_BODY_ROOT, true);
  
    regexExtractContent = getValue(conf,
        "indexer.anchorContent.regex.extractContent",
        DEFAULT_REGEX_EXTRACT_CONTENT, false);
    MSG_FORMAT = new MessageFormat(regexExtractContent);
  
    str = conf.get("indexer.anchorContent.maxAnchorLinks");
    if (str != null) {
      maxAnachorLinks = Integer.parseInt(str);
    }
  }

  public String getValue(Configuration conf, String param, String oldValue,
      boolean blankable) {
    String newValue = oldValue;
    if (conf.get(param) != null) {
      newValue = conf.get(param);
    }
    if (!blankable && StringUtils.isBlank(newValue)) {
      throw new IllegalArgumentException(param + " is set to empty or null.");
    }
    return newValue;
  }
}
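
To make the MessageFormat-driven extraction in getContentBetweenAnchor concrete, here is a minimal, self-contained sketch. The template below is only a simplified stand-in for DEFAULT_REGEX_EXTRACT_CONTENT (the real default is defined with the plugin's other constants); it shows how group(1), group(2), and group(3) map to anchor text, content, and the next tag's text:

import java.text.MessageFormat;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorRegexDemo {
  // Hypothetical template: {0} = current anchor name, {1} = next anchor name.
  private static final MessageFormat FORMAT = new MessageFormat(
      "<a href=\"#{0}\">(.*?)</a>(.*?)(<a href=\"#{1}\">.*?</a>)");

  public static void main(String[] args) {
    String html = "<a href=\"#intro\">Intro</a><p>Some intro text.</p>"
        + "<a href=\"#usage\">Usage</a><p>How to use it.</p>";
    String regex = FORMAT.format(new String[] { "intro", "usage" });
    Matcher m = Pattern.compile(regex, Pattern.DOTALL | Pattern.MULTILINE).matcher(html);
    if (m.find()) {
      System.out.println(m.group(1)); // anchor text:   Intro
      System.out.println(m.group(2)); // content:       <p>Some intro text.</p>
      System.out.println(m.group(3)); // next tag text: <a href="#usage">Usage</a>
    }
  }
}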

Configuration
We update plugin.includes in nutch-site.xml to include this plugin; a sketch of that property follows the field mappings below. In solrindex-mapping.xml, we map each field in the NutchDocument to the corresponding field in the Solr document:

<field dest="anchorTags" source="anchorTags" />
<field dest="anchorTexts" source="anchorTexts" />
<field dest="anchorContents" source="anchorContents" />

Resources
Using HTML Parser Jsoup and Regex to Extract Text between Two Tags
Debugging and Optimizing Regular Expression
http://ift.tt/1kaBQyz
Writing Nutch Plugin Example

via Blogger http://ift.tt/1kfh7cL

PowerShell in Action: Analyze Log and Interact with Solr- From http://ift.tt/1ajReyV

Tags

,

The Problem
We need to write a program that analyzes the Solr logs to check why some of the items the local Solr server fetches from the remote Solr server are missing.
We suspect it is because of the deduplication configuration: items that have the same values for the signature fields are marked as duplicates and removed by Solr. But we need to analyze the log and find all these items.
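
For reference, a typical Solr deduplication setup of the kind we suspect is an update processor chain in solrconfig.xml like the one below; the signature field names here are illustrative, not our actual schema:

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signatureField</str>
    <!-- overwriteDupes=true makes documents with an identical signature overwrite each other -->
    <bool name="overwriteDupes">true</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>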
Why Use PowerShell?
1. PowerShell is preinstalled on Windows 7, Windows Server 2008 R2, and later Windows releases.
2. It's powerful; we can even call .NET from a PowerShell script.
3. It's an interpreted language, which means we can easily change the script and rerun it; there's no compile-and-package step as with Java or .NET.
4. I have worked as a Java programmer for more than 6 years; it's kind of boring to write this program in Java, so why not try a new tool and learn something new :)
Analyze Log
On Linux, we could use awk and grep to search the log and extract fields.
In PowerShell, we use Get-Content and Foreach-Object. In the Foreach-Object block, we test whether the current log line contains "Got id"; if so, we split it on whitespace, take the token at index 3, strip its trailing character, and write the result to a temporary file.

Get-Content $logs | Foreach-Object{ if($_.Contains("Got id")) {$a=$_.Split()[3]; $a.Substring(0,$a.Length-1); } } | out-file ".\ids.txt"

Interact with Solr
We then read 100 ids at a time from the temp file, construct a query URL, use Net.HttpWebRequest to send the HTTP request, and use Net.HttpWebResponse and IO.StreamReader to read the HTTP response.

In PowerShell 3.0 and newer, we can instead use Invoke-WebRequest to execute the HTTP request and parse the response.
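
A minimal sketch, assuming the same $url that checkSolr below builds:

# Requires PowerShell 3.0+; replaces the HttpWebRequest/StreamReader plumbing shown below.
$response = Invoke-WebRequest -Uri $url -Method Get -TimeoutSec 600
$output = $response.Content   # raw response body, equivalent to ReadToEnd() below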

We then check each id against the response: if an id does not appear in the response, it is missing in Solr, and we save it to the result file.

$count=100
$ids=@()
gc .\ids.txt  | foreach  {$i=0;} {
  $ids+=$_
  $i++
  if($i -eq $count) { checkSolr $ids; $ids=@(); $i=0;}
}
Function checkSolr ($ids)
{
  $url=$solrServer+"/select?fl=contentid&omitHeader=true&q="
  foreach ($id in $ids) {$url+="contentid:$id OR "}
  $url=$url.SubString(0, $url.Length-4) # trim the trailing " OR "
  [Net.HttpWebRequest] $req = [Net.WebRequest]::create($url)
  $req.Method = "GET"
  $req.Timeout = 600000 # = 10 minutes
  [Net.HttpWebResponse] $result = $req.GetResponse()
  [IO.Stream] $stream = $result.GetResponseStream()
  [IO.StreamReader] $reader = New-Object IO.StreamReader($stream)
  [string] $output = $reader.readToEnd()
  $stream.flush()
  $stream.close()
  # A foreach loop doesn't output to the pipeline.
  foreach ($id in $ids) {
    $idx = $output.IndexOf($id)
    if($idx -eq -1)  {
       $notExistStream.WriteLine("$id not in solr");
    }
    else {
    if("$existFile" -ne "" ){ $existStream.WriteLine("$id exist in solr") }
    }
  }
}

Complete Code

[CmdletBinding()]
Param(
   [Parameter(Mandatory=$True,Position=1)]
   [String]$solrServer,
   
   [Parameter(Mandatory=$True,Position=2)]
   [String[]]$logs,
 
   [Parameter(Mandatory=$True)]
   [string]$notExistFile,
   
   [Parameter(Mandatory=$False)]
   [string]$existFile
)
Function checkSolr ($ids)
{
  $url=$solrServer+"/select?fl=contentid&omitHeader=true&q="
  foreach ($id in $ids) {$url+="contentid:$id OR "}
  $url=$url.SubString(0, $url.Length-4) # trim the trailing " OR "
  [Net.HttpWebRequest] $req = [Net.WebRequest]::create($url)
  $req.Method = "GET"
  $req.Timeout = 600000 # = 10 minutes
  [Net.HttpWebResponse] $result = $req.GetResponse()
  [IO.Stream] $stream = $result.GetResponseStream()
  [IO.StreamReader] $reader = New-Object IO.StreamReader($stream)
  [string] $output = $reader.readToEnd()
  $stream.flush()
  $stream.close()
  # A foreach loop doesn't output to the pipeline.
  foreach ($id in $ids) {
    $idx = $output.IndexOf($id)
    if($idx -eq -1)  {
       $notExistStream.WriteLine("$id not in solr");
    }
    else {
    if("$existFile" -ne "" ){ $existStream.WriteLine("$id exist in solr") }
    }
  }
}
function createNewFile($file)
{
  if(Test-Path -Path $file) { Remove-Item $file }
  New-Item $file -ItemType file | Out-Null
  # return the resolved absolute path, so the StreamWriter doesn't depend on the process working directory
  return $(Resolve-Path $file).ToString()
}

Write-Host (Get-Date).ToString(), "script started" -BackgroundColor "Red" -ForegroundColor "Black"

$elapsed = [System.Diagnostics.Stopwatch]::StartNew()

Get-Content $logs | %{ if($_.Contains("Got id")) {$a=$_.Split()[3]; $a.Substring(0,$a.Length-1); } } | out-file ".\ids.txt"
Write-Host (Get-Date).ToString(), "created ids.txt" -BackgroundColor "Red" -ForegroundColor "Black"

# Create the output files and streams before checkSolr is first called;
# otherwise $notExistStream is still null inside checkSolr.
$notExistFile=createNewFile $notExistFile
$notExistStream = [System.IO.StreamWriter] "$notExistFile"
if("$existFile" -ne "") { $existFile=createNewFile $existFile; $existStream = [System.IO.StreamWriter] "$existFile" }

$count=100
$ids=@()
gc .\ids.txt | foreach {$i=0;} {
  $ids+=$_
  $i++
  if($i -eq $count) { checkSolr $ids; $ids=@(); $i=0;}
}
# check for remaining ids
if($ids.Count -gt 0) { checkSolr $ids }


$notExistStream.close()
if($existStream) {$existStream.close()}

Write-Host (Get-Date).ToString(), "script finished" -BackgroundColor "Red" -ForegroundColor "Black"
write-host "Total Elapsed Time: $($elapsed.Elapsed.TotalSeconds )" -BackgroundColor "Red" -ForegroundColor "Black"

PowerShell GUI
PowerGUI

via Blogger http://ift.tt/1ine1zB