Salmon Run: databases

Showing posts with label databases. Show all posts

Thursday, February 06, 2014

Substring matching using Aho-Corasick and Redis

The Aho-Corasick algorithm is a fast string matching algorithm that can match a large set of keywords simultaneously against incoming text. It does this by using a trie like data structure, and attempting to navigate the tree along nodes corresponding to characters in the keywords.

The first phase builds up the trie by reading through each keyword in the dictionary, and building nodes for each character in each keyword if it doesn't already exist. To each node, it attaches a set of "transition" nodes - ie, characters it can traverse to from the current character given the set of keywords. Additionally, the keyword itself is associated with the character that ends it. The tree building complexity is linear, O(m) where m is the number of characters across all keywords.

Once the keyword trie is built, searching it is accomplished in a single pass through the text to be matched. The search process recovers substrings which occur in the keywords from the dictionary. The complexity of search is also linear, O(n) where n is the size of the input text, since all keywords are matched simultaneously.

Even though the build time is linear, it can become significant for large dictionaries. If the dictionary is relatively static, the trie building step can be avoided by storing it in a non-volatile key-value store such as Redis. Since Redis operates on in-memory data, you get the best of both worlds - no startup penalty and search speeds almost as good as using in-memory data structures.

In this post, I describe a small (and incomplete, but sufficient for my purposes) implementation of the Aho-Corasick algorithm in Scala that relies on Redis to store the keyword trie. It is similar to the (also slightly modified) Python implementation described in this Sidelines blog post.

Here is the code for the algorithm. The prepare() method takes in a list of keywords and builds a Redis-backed data structure that of two Maps of Sets keyed by the current character, the first one representing the transitions and the second the results (keywords). The search() method takes a phrase to be searched and returns the List of substrings which matches words in the keyword list.

// Source: src/main/scala/com/mycompany/solr4extras/dictann/AhoCorasickRedis.scala
package com.mycompany.solr4extras.dictann

import com.redis.RedisClient
import scala.collection.mutable.ArrayBuffer

class AhoCorasickRedis {

  val redis = new RedisClient("localhost", 6379)
  
  def prepare(keywords: List[String]): Unit = {
    keywords.foreach(keyword => {
      var prevCh = '\0'
      keyword.foreach(ch => {
        redis.sadd(tkey(prevCh), ch)
        prevCh = ch
      })
      redis.sadd(rkey(prevCh), keyword)
    })
  }
  
  def search(phrase: String): List[String] = {
    val matches = ArrayBuffer[String]()
    var prevCh = '\0'
    phrase.foreach(ch => {
      prevCh = if (redis.sismember(tkey(prevCh), ch)) ch 
               else '\0'
      val cmatches = redis.smembers(rkey(prevCh)) match {
        case Some(results) => results.flatten
        case None => List() 
      }
      matches ++= cmatches
    })
    matches.toSet.toList
  }
  
  def tkey(ch: Char) = "trn:" + ch
  def rkey(ch: Char) = "res:" + ch
}

To test this, we run the following simple JUnit test which loads the trie with three strings and then searches it using a sentence containing these 3 strings. As you can see, the algorithm finds the 3 strings we expect it to find.

// Source: src/test/scala/com/mycompany/solr4extras/dictann/AhoCorasickRedisTest.scala
package com.mycompany.solr4extras.dictann

import org.junit.Test
import org.junit.Assert

class AhoCorasickRedisTest {

  @Test
  def testBuild(): Unit = {
    val aho = new AhoCorasickRedis()
    aho.prepare(List("Seahawks", "Broncos", "Super Bowl"))
    val matches = aho.search(
      "The Seahawks defeated the Broncos at the Super Bowl.")
    Console.println("matches=" + matches)
    Assert.assertEquals(3, matches.size)
    Assert.assertTrue(matches.contains("Seahawks"))
    Assert.assertTrue(matches.contains("Broncos"))
    Assert.assertTrue(matches.contains("Super Bowl"))
  }
}

For more information about this algorithm, Pekka Kilpelainen's lecture slides provides lots of detail, and Ivan Kuckir's Blog post has a nice animation that illustrates how it works. Both links are also listed under the External Links section in the algorithm's Wikipedia page (referenced earlier).

Friday, March 01, 2013

Drools Rules in a Database, Take 2

Quite some time back, I wrote an article describing how I used Drools with database backed rules. Since Drools (version 2) did not support this at the time, I ended up writing some glue code that tied Commons Collections Predicates and Closures into Drools Rule Conditions and Consequences, then built my rules on top of that. It was for a proof of concept project which ultimately never materialized. Over the years, I've had requests for the code backing the article.

Fast forward 6 years, and I find myself trying to do something similar for an upcoming project. Time sure flies when you are having fun. Of course, by this time, Drools is up to version 5, and database backed rules is now natively supported, so its much easier this time round.

What I describe here is a small example I tried out to prove to myself that this will work before putting it into the main application. The example I use is part of a (machine generated) decision tree from Erik Siegel's book "Predictive Analytics", that predicts the risk of mortgage default, given factors such as loan amount, applicant income, etc. Machine generated or not, its a decision tree, and hence can be modeled as a set of rules, which is what I do here.

Rules in Drools (now called JBoss Rules) are written using the "when ${condition} then ${consequence}" pattern. In keeping with the pattern, our rules are also written into a single MySQL table, and the interesting data is contained in two columns: rule_cond (the condition) and rule_cons (the consequence). Here is the dump (edited for readability) of these columns for our toy application.

mysql> select rule_cond, rule_cons from mort_rules order by rule_name;
+---------------------------------------+------------------------------------+
| rule_cond                             | rule_cons                          |
+---------------------------------------+------------------------------------+
| interestRate < 7.94 &&                | setRisk(kcontext, $mortgage, 2.6)  |
|   applicantIncome < 78223             |                                    |
+---------------------------------------+------------------------------------+
| interestRate < 7.19 &&                | setRisk(kcontext, $mortgage, 3.4   |
|   applicantIncome >= 78223            |                                    |
+---------------------------------------+------------------------------------+
| interestRate >= 7.19 &&               | setRisk(kcontext, $mortgage, 9.1)  |
|   interestRate < 7.94 &&              |                                    |
|   applicantIncome >= 78223            |                                    |
+---------------------------------------+------------------------------------+
| interestRate >= 7.94 &&               | setRisk(kcontext, $mortgage, 8.1)  |
|   mortgageAmount < 67751 &&           |                                    |
|   loanToValue < 87.4                  |                                    |
+---------------------------------------+------------------------------------+
| interestRate >= 7.94 &&               | setRisk(kcontext, $mortgage, 8.5)  |
|   mortgageAmount < 182926 &&          |                                    |
|   mortgageAmount >= 67751 &&          |                                    |
|   loanToValue < 87.4 &&               |                                    |
|   interestRate < 8.69 &&              |                                    |
|   isCondo == 1                        |                                    |
+---------------------------------------+------------------------------------+
| interestRate >= 7.94 &&               | setRisk(kcontext, $mortgage, 16.3) |
|   mortgageAmount < 182926 &&          |                                    |
|   mortgageAmount >= 67751 &&          |                                    |
|   loanToValue < 87.4 &&               |                                    |
|   interestRate < 8.69 &&              |                                    |
|   isCondo == 0                        |                                    |
+---------------------------------------+------------------------------------+
| interestRate >= 7.94 &&               | setRisk(kcontext, $mortgage, 25.6) | 
|   mortgageAmount < 182926 &&          |                                    |
|   mortgageAmount >= 67751 &&          |                                    |
|   loanToValue < 87.4 &&               |                                    |
|   interestRate >= 8.69                |                                    |
+---------------------------------------+------------------------------------+
| interestRate >= 7.94 &&               | setRisk(kcontext, $mortgage, 6.4)  | 
|   mortgageAmount < 182926 &&          |                                    |
|   loanToValue >= 87.4                 |                                    |
+---------------------------------------+------------------------------------+
| interestRate >= 7.94 &&               | setRisk(kcontext, $mortgage, 15.2) | 
|   mortgageAmount >= 182926 &&         |                                    |
|   isCondo == 1                        |                                    |
+---------------------------------------+------------------------------------+
| interestRate >= 7.94 &&               | setRisk(kcontext, $mortgage, 40.0) | 
|   mortgageAmount >= 182926 &&         |                                    |
|   isCondo == 0                        |                                    |
+---------------------------------------+------------------------------------+
10 rows in set (0.01 sec)

These rules work are substituted into a DRT (Drools Template) file, which is written in MVEL and in my case looks something like this. Drools provides a ResultSetGenerator object which will use the ResultSet to build a large DRL string.

// Source: src/main/resources/mort_rules.drt
template header
// list of columns that the SELECT will return
rule_cond
rule_cons

// header (will be written out once)
package com.mycompany.mortgage;

dialect "mvel"
import function com.mycompany.mortgage.MortgageFunctions.setRisk;

template "mortgage"
// this section will be repeated for each database row, the value
// of x will be written out for @{x}.

rule "mortgage_@{row.rowNumber}"
no-loop
when
  $mortgage: Mortgage(@{rule_cond})
then
  @{rule_cons}
end

end template

Here is the code to run a bunch of mortgages (represented by the Mortgage object) through the rules thus generated.

// Source: src/main/scala/com/mycompany/mortgage/MortgageRiskCalculator.scala
package com.mycompany.mortgage

import java.io.{FileInputStream, StringReader}
import java.sql.DriverManager

import org.drools.RuleBaseFactory
import org.drools.compiler.PackageBuilder
import org.drools.runtime.rule.RuleContext
import org.drools.template.jdbc.ResultSetGenerator

object MortgageRiskCalculator extends App {
  val mortgages = Seq(
    new Mortgage(120000.00, 2.3, 50000.00, 75.5, 1, -1.0),
    new Mortgage(360000.00, 4.7, 100000.00, 12.0, 0, -1.0),
    new Mortgage(400000.00, 7.5, 250000.00, 20.0, 0, -1.0))
  val rules = new RiskRules("src/main/resources/mort_rules.drt")
  rules.runRules(mortgages)
  mortgages.foreach(x => println("risk=" + x.risk))
}

class Mortgage(
    val mortgageAmount: Double, 
    val interestRate: Double, 
    val applicantIncome: Double, 
    val loanToValue: Double,
    val isCondo: Int,
    var risk: Double)
    
class RiskRules(template: String) {
  // set up database connection
  val jdbcUrl = "jdbc:mysql://localhost:3306/mydb"
  Class.forName("com.mysql.jdbc.Driver")
  val conn = DriverManager.getConnection(jdbcUrl, "myuser", "mypass")
  // build drl with data from database
  val preparedStmt = conn.prepareStatement("""
    select rule_cond, rule_cons 
    from mort_rules""")
  val resultSet = preparedStmt.executeQuery()
  val resultSetGenerator = new ResultSetGenerator()
  val drl = resultSetGenerator.compile(resultSet, 
    new FileInputStream(template))
  resultSet.close()
  preparedStmt.close()
  conn.close()
  // build rule base
  val packageBuilder = new PackageBuilder()
  packageBuilder.addPackageFromDrl(new StringReader(drl))
  val ruleBase = RuleBaseFactory.newRuleBase()
  ruleBase.addPackage(packageBuilder.getPackage())
  val workingMemory = ruleBase.newStatefulSession()
  
  def runRules(facts: Seq[Mortgage]): Unit = {
    facts.foreach(workingMemory.insert(_))
    workingMemory.fireAllRules()
    workingMemory.dispose()
  }
}

object MortgageFunctions {
  def setRisk(ctx: RuleContext, m: Mortgage, r: Double): Unit = {
    m.risk = r
  }
}

The MortgageRiskCalculator is the calling class that instantiates the RiskRules object and runs the Mortgages through the Rule. The RiskRules class is the central class, it reads the rules from the MySQL table, instantiates the RuleSetGenerator, creates the DRL, creates the RuleBase and sets the DRL into it, and finally instantiates a WorkingMemory. Its runRules method inserts all the Mortgage (facts) nto the WorkingMemory and fires all the rules.

In our case, there is only a single Consequence function setRisk that sets the risk value back into the Mortgage object. In practice there can be other such Consequences that are modelled as static methods of MortgageFunctions and are imported into the DRT file so they feel like a DSL to the rule maintainer.

The output of the run is the default risk associated with each mortgage (as a percentage).

1
2
3

risk=2.6
risk=3.4
risk=9.1

In the spirit of full disclosure, in case any of you want to replicate this, I built a standard SBT project with "g8 typesafehub/scala-sbt" and my build.sbt looks like this:

name := "mortgage"

version := "1.0"

scalaVersion := "2.9.2"

resolvers += "Typesafe Repository" at "http://repo.typesafe.com/typesafe/repo/"

// drools

libraryDependencies += "org.drools" % "drools-core" % "5.5.0.Final"

libraryDependencies += "org.drools" % "drools-compiler" % "5.5.0.Final"

libraryDependencies += "org.drools" % "drools-jsr94" % "5.5.0.Final"

libraryDependencies += "org.drools" % "drools-decisiontables" % "5.5.0.Final"

libraryDependencies += "org.drools" % "knowledge-api" % "5.5.0.Final"

// mysql

libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.12"

So now that this functionality (support for database backed rules) is available as part of Drools, I finally have justification for deleting the old code (backing the article) off my hard disk :-).

Saturday, May 14, 2011

Custom Sorting in Solr using External Database Data

Recently, someone (from another project) asked me if I knew how to sort data in Solr by user-entered popularity rating data. The data would be entered as thumbs up/down clicks on the search results page and be captured into a RDBMS. Users would be given the option to sort the data by popularity (as reflected by the percentage of thumbs-up ratings vs thumbs-down ratings).

I knew how to do this with Lucene (using a ScoreDocComparator against an in-memory cache of the database table set to a relatively short expiration time), but I figured that since Solr is used so heavily in situations where user feedback is king, there must be something built into Solr, so I promised to get back to him after I did some research (aka googling, asking on the Solr list, etc).

A quick scan on Google pointed me to this Nabble thread, circa 2007, which seemed reasonable, but still required some customizing Solr (code). So I asked on the Solr list, and as expected, Solr did offer a way to do this (without customizing Solr code), using ExternalFileField field types. The post on tailsweep offers more details about this approach.

We saw some problems with this approach, however... First off, the ExternalFileField is not used for sorting, rather it is used to influence the scores using a function query, so the results did not seem quite deterministic. Second, the process of updating ratings involves writing out the database table into a flat file and copying it into Solr's index directory followed by a commit. Can be scripted of course, but it involves making yet another copy of the database. Third, since the approach does not provide a way to pull the popularity rank information in the results, the client application would have to make a separate lookup against the database (bringing up the issue of stale data, since the database could be ahead of the file version at a point in time).

For these reasons, I decided to explore the programmatic approach outlined in the Nabble page above. The approach involves the following steps.

Create a custom FieldType. The getSortField() method in this class should return a custom SortField using a custom FieldComparatorSource object (more on that couple lines down).
Configure this field type and a field of that type in your Solr schema.xml.
Build a custom FieldComparatorSource that returns a custom FieldComparator in its newComparator() method. The FieldComparator should use data from the database to do its ordering.
Once done, results can be sorted by &sort=rank+desc (or asc) in the request URL.

I had initially thought of using a short-lived cache to proxy the database call through, thus ensuring that any entry is at most, say 5 minutes old. But I was worried about the performance impact of this, so I decided to go with the approach taken by ExternalFileField and use an in-memory map which is refreshed periodically via a commit. Further, since I was writing code in Solr anyway, I decided to provide the thumbs-up/down percentages in the Solr response. Here is what I ended up doing.

Database

Assuming that the data is contained in a table with the following structure. The uid refers to a unique content id assigned by the application that created the content (in our case it is a MD5 hash of the URL).

+-------+----------+------+-----+---------+-------+------------------------+
| Field | Type     | Null | Key | Default | Extra | Comment                |
+-------+----------+------+-----+---------+-------+------------------------+
| uid   | char(32) | NO   | PRI | NULL    |       | unique content id      |
| n_up  | int(11)  | NO   |     | 0       |       | number of thumbs up    |
| n_dn  | int(11)  | NO   |     | 0       |       | number of thumbs down  |
| n_tot | int(11)  | NO   |     | 0       |       | total votes            |
+-------+----------+------+-----+---------+-------+------------------------+

The rank is calculated as shown below. The raw n_up and n_dn data are converted to rounded integer percentages. Then the rank is computed as the difference between the percentages, returning a number between -100 and +100. In order to keep the rank positive for scoring, we rebase the rank by 100.

1
2
3

  thumbs_up = round(n_up * 100 / n_tot)
  thumbs_dn = round(n_dn * 100 / n_tot)
  rank = thumbs_up - thumbs_dn + 100

Custom Field Type

The first step is to create the custom Rank FieldType, which is fairly trivial. I was writing this stuff against a slightly outdated SVN version of the Solr and Lucene code (I already had an index lying around here, so I decided to reuse that instead of building one for Solr 1.4.1/Lucene 2.9.2 version that I use at work. There is a slight difference in API for the Solr 1.4.1 version, you need to implement another write method, but you can just adapt it from an existing FieldType like I did with this one).

Here is the code for the custom FieldType. Of particular interest is the getSortField() method, which returns a RankFieldComparatorSource instance.

// Source: src/java/org/apache/solr/schema/ext/RankFieldType.java
package org.apache.solr.schema.ext;

import java.io.IOException;

import org.apache.lucene.document.Fieldable;
import org.apache.lucene.search.SortField;
import org.apache.solr.response.TextResponseWriter;
import org.apache.solr.schema.FieldType;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.search.ext.RankFieldComparatorSource;

public class RankFieldType extends FieldType {

  @Override
  public SortField getSortField(SchemaField field, boolean top) {
    return new SortField(field.getName(), 
      new RankFieldComparatorSource(), top);
  }

  @Override
  // copied verbatim from GeoHashField method
  public void write(TextResponseWriter writer, String name, Fieldable f)
      throws IOException {
    writer.writeStr(name, f.stringValue(), false);
  }
}

The next step is to configure this field type and a field of this type into the schema.xml file. The relevant snippets of my updated schema.xml file are shown below:

<!-- Source: solr/example/.../conf/schema.xml -->
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="adam" version="1.3">
  <types>
    ...
    <fieldType name="rank_t" class="org.apache.solr.schema.ext.RankFieldType"/>
  </types>
 <fields>
   ...
   <field name="rank" type="rank_t" indexed="true" stored="true"/>
 </fields>
 ...
</schema>

Custom Field Comparator

The third step is to build our custom FieldComparatorSource and FieldComparator classes. Since my custom FieldComparator is unlikely to be called from any place other than my custom FieldComparatorSource, I decided to package them into a single file, with the FieldComparator defined as an inner class.

The custom FieldComparator is copied verbatim from the DocFieldComparator. The only difference is that instead of returning raw docIDs, these methods return the rank at these docIDs. Here is the code:

// $Source$ src/java/org/apache/solr/search/ext/RankFieldComparatorSource.java
package org.apache.solr.search.ext;

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldComparator;
import org.apache.lucene.search.FieldComparatorSource;
import org.apache.solr.util.ext.RankUpdateListener;

/**
 * Called from RankFieldType.getSortField() method.
 */
public class RankFieldComparatorSource extends FieldComparatorSource {

  @Override
  public FieldComparator newComparator(String fieldname, int numHits,
      int sortPos, boolean reversed) throws IOException {
    return new RankFieldComparator(numHits);
  }

  // adapted from DocFieldComparator
  public static final class RankFieldComparator extends FieldComparator {
    private final int[] docIDs;
    private int docBase;
    private int bottom;

    RankFieldComparator(int numHits) {
      docIDs = new int[numHits];
    }

    @Override
    public int compare(int slot1, int slot2) {
      return getRank(docIDs[slot1]) - getRank(docIDs[slot2]);
    }

    @Override
    public int compareBottom(int doc) {
      return getRank(bottom) - getRank(docBase + doc);
    }

    @Override
    public void copy(int slot, int doc) {
      docIDs[slot] = docBase + doc;
    }

    @Override
    public FieldComparator setNextReader(IndexReader reader, int docBase) {
      this.docBase = docBase;
      return this;
    }
    
    @Override
    public void setBottom(final int bottom) {
      this.bottom = docIDs[bottom];
    }

    @Override
    public Comparable<?> value(int slot) {
      return getRank(docIDs[slot]);
    }
    
    private Integer getRank(int docId) {
      return RankUpdateListener.getRank(docId);
    }
  }
}

Database Integration

The ranks themselves come from the RankUpdateListener, which is a SolrEventListener and listens for the firstSearcher and newSearcher events. When these events occur, it reads all the records from the database table and recreates two in-memory maps keyed by docID, one returning the computed rank value and the other returning a 2-element list containing the computed thumbs up and down percentages.

Since I needed to connect to the database, I needed to pass in the database connection properties to this listener. I added this new file to my conf directory.

# Source: solr/example/.../conf/database.properties
database.driverClassName=com.mysql.jdbc.Driver
database.url=jdbc:mysql://localhost:3306/mytestdb
database.username=root
database.password=secret

To get at this file from within my listener, I needed the SolrResourceLoader which looks at the conf directory, and which I could get only from the SolrCore object. I tried implementing SolrCore and getting the SolrCore from the inform(SolrCore) method, but Solr does not allow a listener to implement SolrCore, so I followed the pattern in a few other listeners and just passed SolrCore through the constructor. This works, ie, no extra configuration is needed for the listener in solrconfig.xml to indicate that it takes in SolrCore in its constructor.

I also added the MySQL JAR file (I am using MySQL for my little experiment, in case you haven't guessed from the formatting already) to the solr/lib directory. Doing this ensures that this JAR file will be packaged into the solr.war when I do ant dist-war.

The idea of using a Listener instead of a plain old service was to ensure that the in-memory hashmaps are repopulated every time a commit is sent. This allows for the situations where there have been deletes and the docIDs are no longer pointing to the same documents that they used to before the commit. It also offers a (somewhat crude, IMO) mechanism to trigger ratings refreshes. I think I would prefer to have it repopulate on a timer, say every 5 minutes, and after every commit (which may be less frequent or never). But the code change to do this is quite simple - since the populateRanks() is factored out into a separate method, it can now just be called from two places instead of one. Here is the code for the listener.

// Source: src/java/org/apache/solr/util/ext/RankUpdateListener.java
package org.apache.solr.util.ext;

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.SolrCore;
import org.apache.solr.core.SolrEventListener;
import org.apache.solr.core.SolrResourceLoader;
import org.apache.solr.search.SolrIndexSearcher;

public class RankUpdateListener implements SolrEventListener {

  private static final Integer BASE = 100;
  
  private static final Map<Integer,Integer> ranks = 
    new HashMap<Integer,Integer>();
  private static final Map<Integer,List<Integer>> thumbs = 
    new HashMap<Integer,List<Integer>>();
  
  private SolrCore core;
  private Map<String,String> dbprops;

  public RankUpdateListener(SolrCore core) {
    this.core = core;
    this.dbprops = new HashMap<String,String>();
  }
  
  ////////////// SolrEventListener methods /////////////////
  
  @Override
  public void init(NamedList args) {
    try {
      SolrResourceLoader loader = core.getResourceLoader();
      List<String> lines = loader.getLines("database.properties");
      for (String line : lines) {
        if (StringUtils.isEmpty(line) || line.startsWith("#")) {
          continue;
        }
        String[] kv = StringUtils.split(line, "=");
        dbprops.put(kv[0], kv[1]);
      }
      Class.forName(dbprops.get("database.driverClassName"));
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public void newSearcher(SolrIndexSearcher newSearcher,
      SolrIndexSearcher currentSearcher) {
    try {
      populateRanks(newSearcher);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
  
  @Override
  public void postCommit() { /* NOOP */ }
  
  ////////////// Service methods /////////////////////
  
  public static List<Integer> getThumbs(int docId) {
    if (thumbs.containsKey(docId)) {
      return thumbs.get(docId);
    } else {
      return Arrays.asList(0, 0);
    }
  }
  
  public static Integer getRank(int docId) {
    if (ranks.containsKey(docId)) {
      return ranks.get(docId);
    } else {
      return BASE;
    }
  }
  
  // filling it in from a hardcoded set of terms, but will
  // ultimately be set from a database table
  private synchronized void populateRanks(SolrIndexSearcher searcher) 
      throws IOException {
    Map<Integer,Integer> ranksCopy = new HashMap<Integer,Integer>();
    Map<Integer,List<Integer>> thumbsCopy = 
      new HashMap<Integer,List<Integer>>();
    Connection conn = null;
    PreparedStatement ps = null;
    ResultSet rs = null;
    try {
      conn = DriverManager.getConnection(
        dbprops.get("database.url"), 
        dbprops.get("database.username"), 
        dbprops.get("database.password"));
      ps = conn.prepareStatement(
        "select uid, n_up, n_dn, n_tot from thumbranks");
      rs = ps.executeQuery();
      while (rs.next()) {
        String uid = rs.getString(1);
        int nup = rs.getInt(2);
        int ndn = rs.getInt(3);
        int ntot = rs.getInt(4);
        // calculate thumbs up/down percentages
        int tupPct = Math.round((float) (nup * 100) / (float) ntot);
        int tdnPct = Math.round((float) (ndn * 100) / (float) ntot);
        // calculate score, rebasing by 100 to make positive
        int rank = tupPct - tdnPct + BASE;
        // find Lucene docId for uid
        ScoreDoc[] hits = searcher.search(
          new TermQuery(new Term("uid", uid)), 1).scoreDocs;
        for (int i = 0; i < hits.length; i++) {
          int docId = hits[i].doc;
          ranksCopy.put(docId, rank);
          thumbsCopy.put(docId, Arrays.asList(tupPct, tdnPct));
        }
      }
    } catch (SQLException e) {
      throw new IOException(e);
    } finally {
      if (conn != null) { 
        try { conn.close(); } catch (SQLException e) { /* NOOP */ }
      }
    }
    ranks.clear();
    ranks.putAll(ranksCopy);
    thumbs.clear();
    thumbs.putAll(thumbsCopy);
  }
}

The listener needs to be configured in the solrconfig.xml file, here is the relevant snippet from mine.

<!-- Source: solr/examples/.../conf/solrconfig.xml -->
<?xml version="1.0" encoding="UTF-8" ?>
<config>
  ...
    <listener event="firstSearcher" 
        class="org.apache.solr.util.ext.RankUpdateListener"/>
    <listener event="newSearcher" 
        class="org.apache.solr.util.ext.RankUpdateListener"/>
  ...
</config>

Returning external data in response

At this point, if I rebuild my solr.war with these changes, I can sort the results by adding a &sort=rank+desc (or rank+asc) to my Solr request URL.

However, the caller would have to consult the database to get the thumbs up and down percentages, since the response does not contain this information. Neither can we specify that we want this information by specifying &fl=*,rank.

This is (fairly) easily remedied, however. Simply add a SearchComponent that checks to see if the fl parameter contains "rank", and if so, convert the DocSlice into a SolrDocumentList and plug in the cached values by looking up the service methods in the RankUpdateListener component above.

So, configuration wise, we need to add a last-components entry into the handler that we are using. We are using the "standard" SearchHandler, so we just add our custom component in our solrconfig.xml file as shown in the snippet below:

<!-- Source: solr/example/.../conf/solrconfig.xml -->
<config>
  ...
  <searchComponent name="rank-extract" 
      class="org.apache.solr.handler.component.ext.RankExtractComponent"/>
  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
    </lst>
    <arr name="last-components">
      <str>rank-extract</str>
    </arr>
  </requestHandler>
  ...
</config>

Here is the code for the RankExtractComponent. It only does anything if "rank" is found in the input request's fl (field list) parameter. In case it is found, it will add the rank, thumbs_up and thumbs_down percentages into the document response.

// Source: src/java/org/apache/solr/handler/component/ext/RankExtractComponent.java
package org.apache.solr.handler.component.ext;

import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.commons.lang.StringUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Fieldable;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocSlice;
import org.apache.solr.search.SolrIndexReader;
import org.apache.solr.util.ext.RankUpdateListener;

/**
 * Custom component that extracts rank information out of
 * the database and appends it to the response. This is 
 * configured as a last-components in our search handler
 * chain.
 */
public class RankExtractComponent extends SearchComponent {

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    /* NOOP */
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    Set<String> returnFields = getReturnFields(rb);
    if (returnFields.contains("rank")) {
      // only trigger this code if user explicitly lists rank
      // in the field list. This changes the DocSlice in the 
      // result returned by the standard component and replaces
      // it with a SolrDocumentList (whose attributes are more 
      // amenable to modification).
      DocSlice slice = (DocSlice) rb.rsp.getValues().get("response");
      SolrIndexReader reader = rb.req.getSearcher().getReader();
      SolrDocumentList rl = new SolrDocumentList();
      for (DocIterator it = slice.iterator(); it.hasNext(); ) {
        int docId = it.nextDoc();
        Document doc = reader.document(docId);
        SolrDocument sdoc = new SolrDocument();
        List<Fieldable> fields = doc.getFields();
        for (Fieldable field : fields) {
          String fn = field.name();
          if (returnFields.contains(fn)) {
            sdoc.addField(fn, doc.get(fn));
          }
        }
        if (returnFields.contains("score")) {
          sdoc.addField("score", it.score());
        }
        if (returnFields.contains("rank")) {
          List<Integer> thumbs = RankUpdateListener.getThumbs(docId);
          sdoc.addField("thumbs_up", thumbs.get(0));
          sdoc.addField("thumbs_down", thumbs.get(1));
          sdoc.addField("rank", RankUpdateListener.getRank(docId));
        }
        rl.add(sdoc);
      }
      rl.setMaxScore(slice.maxScore());
      rl.setNumFound(slice.matches());
      rb.rsp.getValues().remove("response");
      rb.rsp.add("response", rl);
    }
  }

  private Set<String> getReturnFields(ResponseBuilder rb) {
    Set<String> fields = new HashSet<String>();
    String flp = rb.req.getParams().get(CommonParams.FL);
    if (StringUtils.isEmpty(flp)) {
      // called on startup with a null ResponseBuilder, so
      // we want to prevent a spurious NPE in the logs...
      return fields; 
    }
    String[] fls = StringUtils.split(flp, ",");
    IndexSchema schema = rb.req.getSchema();
    for (String fl : fls) {
      if ("*".equals(fl)) {
        Map<String,SchemaField> fm = schema.getFields();
        for (String fieldname : fm.keySet()) {
          SchemaField sf = fm.get(fieldname);
          if (sf.stored() && (! "content".equals(fieldname))) {
            fields.add(fieldname);
          }
        }
      } else if ("id".equals(fl)) {
        SchemaField usf = schema.getUniqueKeyField();
        fields.add(usf.getName());
      } else {
        fields.add(fl);
      }
    }
    return fields;
  }

  ///////////////////// SolrInfoMBean methods ///////////////////
  
  @Override
  public String getDescription() {
    return "Rank Extraction Component";
  }

  @Override
  public String getSource() {
    return "$Source$";
  }

  @Override
  public String getSourceId() {
    return "$Id$";
  }

  @Override
  public String getVersion() {
    return "$Revision$";
  }
}

Conclusion

And thats pretty much it! Just 5 new classes and a bit of configuration, and we have a deterministic external field based sort on Solr, with a commit-aware (and optionally timer based) refresh strategy, which returns the additional fields in the Solr search results.

And now for the obligatory screenshots. The one on the left shows the results with rank information (ie, with &fl=*,rank only) sorted by relevance. The one on the right shows the same results, but sorted by rank descending (ie, with &fl=*,rank&sort=rank+desc).

The alternative (ie, going the ExternalFileField route) would require some coding (perhaps outside Solr) to write out the database data into a flat file and push it into Solr's index directory, followed by a commit. And the client would have to make a separate call to get the ranking components for each search result at render time. Also, I don't understand function queries that well (okay, not at all, need to read up on them), but from our limited testing, it did not appear to be too effective at sorting, but its entirely possible that I am missing something that would make it work nicely.

Of course, I have a lot to learn about Solr and Lucene, so if you have solved this differently, or if you feel this could have been done differently/better, would love to hear from you.

Salmon Run

Thursday, February 06, 2014

Substring matching using Aho-Corasick and Redis

Friday, March 01, 2013

Drools Rules in a Database, Take 2

Saturday, May 14, 2011

Custom Sorting in Solr using External Database Data

Database

Custom Field Type

Custom Field Comparator

Database Integration

Returning external data in response

Conclusion

Posts

Labels

Blogs I Read

About me

My Nerd Rating

Visitor Map

Contact Me