Showing posts with label question-answering. Show all posts
Showing posts with label question-answering. Show all posts

Saturday, October 05, 2024

Using Knowledge Graphs to enhance Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) has become a popular approach to harness LLMs for question answering using your own corpus of data. Typically, the context to augment the query that is passed into the Large Language Model (LLM) to generate an answer comes from a database or search index containing your domain data. When it is a search index, the trend is to use Vector search (HNSW ANN based) over Lexical (BM25/TF-IDF based) search, often combining both Lexical and Vector searches into Hybrid search pipelines.

In the past, I have worked on Knowledge Graph (KG) backed entity search platforms, and observed that for certain types of queries, they produce results that are superior / more relevant compared to that produced from a standard lexical search platform. The GraphRAG framework from Microsoft Research describes a comprehensive technique to leverage KG for RAG. GraphRAG helps produce better quality answers in the following two situations.

  • the answer requires synthesizing insights from disparate pieces of information through their shared attributes
  • the answer requires understanding summarized semantic concepts over part of or the entire corpus

The full GraphRAG approach consists of building a KG out of the corpus, and then querying the resulting KG to augment the context in Retrieval Augmented Generation. In my case, I already had access to a medical KG, so I focused on building out the inference side. This post describes what I had to do to get that to work. It is based in large part on the ideas described in this Knowledge Graph RAG Query Engine page from the LlamaIndex documentation.

At a high level, the idea is to extract entities from the question, and then query a KG with these entities to find and extract relationship paths, single or multi-hop, between them. These relationship paths are used, in conjunction with context extracted from the search index, to augment the query for RAG. The relationship paths are the shortest paths between pairs of entities in the KG, and we only consider paths upto 2 hops in length (since longer paths are likely to be less interesting).

Our medical KG is stored in an Ontotext RDF store. I am sure we can compute shortest paths in SPARQL (the standard query language for RDF) but Cypher seems simpler for this use case, so I decided to dump out the nodes and relationships from the RDF store into flat files that look like the following, and then upload them to a Neo4j graph database using neo4j-admin database import full.

1
2
3
4
5
6
7
8
9
# nodes.csv
cid:ID,cfname,stygrp,:LABEL
C8918738,Acholeplasma parvum,organism,Ent
...

# relationships.csv
:START_ID,:END_ID,:TYPE,relname,rank
C2792057,C8429338,Rel,HAS_DRUG,7
...

The first line in both CSV files are the headers that inform Neo4j about the schema. Here our nodes are of type Ent and relationships are of type Rel, cid is an ID attribute that is used to connect nodes, and the other elements are (scalar) attributes of each node. Entities were extracted using our Dictionary-based Named Entity Recognizer (NER) based on the Aho-Corasick algorithm, and shortest paths are computed between each pair of entities (indicated by placeholders _LHS_ and _RHS_) extracted using the following Cypher query.

1
2
MATCH p = allShortestPaths((a:Ent {cid:'_LHS_'})-[*..]-(b:Ent {cid:'_RHS_'}))
RETURN p, length(p)

Shortest paths returned by the Cypher query that are more than 2 hops long are discarded, since these don't indicate strong / useful relationships between the entity pairs. The resulting list of relationship paths are passed into the LLM along with the search result context to produce the answer.

We evaluated this implementation against the baseline RAG pipeline (our pipeline minus the relation paths) using the RAGAS metrics Answer Correctness and Answer Similarity. Answer Correctness measures the factual similarity between the ground truth answer and the generated answer, and Answer Similarity measures the semantic similarity between these two elements. Our evaluation set was a set of 50 queries where the ground truth was assigned by human domain experts. The LLM used to generate the answer was Claude-v2 from Anthropic while the one used for evaluation was Claude-v3 (Sonnet). The table below shows the averaged Answer Correctness and Similarity over all 50 queries, for the Baseline and my GraphRAG pipeline respectively.

Pipeline Answer Correctness Answer Similarity
Baseline 0.417 0.403
GraphRAG (inference) 0.737 0.758

As you can see, the performance gain from using the KG to augment the query for RAG seems to be quite impressive. Since we already have the KG and the NER available from previous projects, it is a very low effort addition to make to our pipeline. Of course, we would need to verify these results using Further human evaluations.

I recently came across the paper Knowledge Graph based Thought: A Knowledge Graph enhanced LLM Framework for pan-cancer Question Answering (Feng et al, 2024). In it, the authors identify four broad classes of triplet patterns that their questions (i.e, in their domain) can be decomposed to, and addressed using reasoning approaches backed by Knowledge Graphs -- One hop, Multi-hop, Intersection and Attribute problems. The idea is to use an LLM prompt to identify the entities and relationships in the question, then use an LLM to determine which of these templates should be used to address the question and produce an answer. Depending on the path chosen, an LLM is used to generate a Cypher query (an industry standard query language for graph databases originally introduced by Neo4j) to extract the missing entities and relationships in the template and answer the question. An interesting future direction for my GraphRAG implementation would be to incorporate some of the ideas from this paper.

Saturday, February 24, 2024

Thoughts on using LangChain LCEL with Claude

I got into Natural Language Processing (NLP) and Machine Learning (ML) through Search. And this led me into Generative AI (GenAI), which led me back to Search via Retrieval Augmented Generation (RAG). RAG started out relatively simple -- take a query, generate search results, use search results as context for a Large Language Model (LLM) to generate an abstractive summary of the results. Back when I started on my first "official" GenAI project middle of last year, there were not too many frameworks to support building GenAI components (at least not the prompt based ones), except maybe LangChain, which was just starting out. But prompting as a concept is not too difficult to understand and implement, so thats what we did at the time.

I did have plans to use LangChain in my project once it became more stable, so I started out building my components to be "langchain compliant". But that turned out to be a bad idea as LangChain continued its exponential (and from the outside at least, somewhat haphazard) growth and showed no signs of stabilizing. At one point, LangChain users were advised to make pip install -U langchain part of their daily morning routine! So anyway, we ended up building up our GenAI application by hooking up third party components with our own (non-framework) code, using Anthropic's Claude-v2 as our LLM, ElasticSearch as our lexical / vector document store and PostgreSQL as our conversational buffer.

While I continue to believe that the decision to go with our own code made more sense than trying to jump on the LangChain (or Semantic Kernel, or Haystack, or some other) train, I do regret it in some ways. A collateral benefit for people who adopted and stuck with LangChain were the ready-to-use implementations of cutting-edge RAG and GenAI techniques that the community implemented at almost the same pace as they were being proposed in academic papers. For the subset of these people that were even slightly curious about how these implementations worked, this offered a ringside view into the latest advances in the field and a chance to stay current with it, with minimal effort.

So anyway, in an attempt to replicate this benefit for myself (going forward at least), I decided to learn LangChain by doing a small side project. Earlier I needed to learn to use Snowflake for something else and had their free O'Reilly book on disk, so I converted it to text, chunked it, and put it into a Chroma vector store. I then tried to implement examples from the DeepLearning.AI courses LangChain: Chat with your Data and LangChain for LLM Application Development. The big difference is that the course examples use OpenAI's GPT-3 as their LLM whereas I use Claude-2 on AWS Bedrock in mine. In this post, I share the issues I faced and my solutions, hopefully this can help guide others in similar situations.

Couple of observations here. First, the granularity of GenAI components is necessarily larger than traditional software components, and this means application details that the developer of the component was working on can leak into the component itself (mostly through the prompt). To a user of the component, this can manifest as subtle bugs. Fortunately, LangChain developers seem to have also noticed this and have come up with the LangChain Expression Language (LCEL), a small set of reusable components that can be composed to create chains from the ground up. They have also marked a large number of Chains as Legacy Chains (to be converted to LCEL chains in the future).

Second, most of the components (or chains, since that is LangChain's central abstraction) are developed against OpenAI GPT-3 (or its chat version GPT-3.5 Turbo) whose strengths and weaknesses may be different from those of your LLM. For example, OpenAI is very good at generating JSON output, whereas Claude is better at generating XML. I have also seen that Claude can terminate XML / JSON output mid-output unless forced to complete using stop_sequences. Yhis doesn't seem to be a problem GPT-3 users have observed -- when I mentioned this problem and the fix, I drew a blank on both counts.

To address the first issue, my general approach in trying to re-implement these examples has been to use LCEL to build my chains from scratch. I attempt to leverage the expertise available in LangChain by looking in the code or running the existing LangChain chain with langchain.debug set to True. Doing this helps me see the prompt being used and the flow, which I can use to adapt the prompt and flow for my LCEL chain. To address the second issue, I play to Claude's strengths by specifying XML output format in my prompts and parsing them as Pydantic objects for data transfer across chains.

The example application I will use to illustrate these techniques here is derived from the Evaluation lesson from the LangChain for LLM Application Development course, and is illustrated in the diagram below. The application takes a chunk of text as input, and uses the Question Generation chain to generate multiple question-answer pairs from it. The questions and the original content are fed into the Question Answering chain, which uses the question to generate additional context from a vector retriever, and uses all three to generate an answer. The answer generated from the Question Generation chain and the answer generated from the Question Answering chain are fed into a Question Generation Evaluation chain, where the LLM grades one against the other, and generates an aggregate score for the questions generated from the chunk.

Each chain in this pipeline is actually quite simple, they take one or more inputs and generates a block of XML. All the chains are structured as follows:

1
2
3
from langchain_core.output_parsers import StrOutputParser

chain = prompt | model | StrOutputParser()

And all our prompts follow the same general format. Here is the prompt for the Evaluation chain (the third one) which I adapted from the QAEvalChain used in the lesson notebook. Developing from scratch using LCEL gives me the chance to use Claude's Human / Assistant format (see LangChain Guidelines for Anthropic) rather than depend on the generic prompt that happens to work well for GPT-3.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
Human: You are a teacher grading a quiz.

You are given a question, the context the question is about, and the student's 
answer.

QUESTION: {question}
CONTEXT: {context}
STUDENT ANSWER: {predicted_answer}
TRUE ANSWER: {generated_answer}

You are to score the student's answer as either CORRECT or INCORRECT, based on the 
context.

Write out in a step by step manner your reasoning to be sure that your conclusion 
is correct. Avoid simply stating the correct answer at the outset.

Please provide your response in the following format:

<result>
    <qa_eval>
        <question>the question here</question>
        <student_answer>the student's answer here</student_answer>
        <true_answer>the true answer here</true_answer>
        <explanation>step by step reasoning here</explanation>
        <grade>CORRECT or INCORRECT here</grade>
    </qa_eval>
</result>

Grade the student answers based ONLY on their factual accuracy. Ignore differences in 
punctuation and phrasing between the student answer and true answer. It is OK if the 
student answer contains more information than the true answer, as long as it does not 
contain any conflicting statements.

Assistant:

In addition, I specify the formatting instructions explicitly in the prompt instead of using the canned ones from XMLOutputParser or PydanticOutputParser via get_formatting_instructions(), which are comparatively quite generic and sub-optimal. By convention, the outermost tag in my format is always <result>...</result>. The qa_eval tag inside result has a corresponding Pydantic class analog declared in the code as follows:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
from pydantic import BaseModel, Field

class QAEval(BaseModel):
    question: str = Field(alias="question", description="question text")
    student_answer: str = Field(alias="student_answer",
                                description="answer predicted by QA chain")
    true_answer: str = Field(alias="true_answer",
                             description="answer generated by QG chain")
    explanation: str = Field(alias="explanation",
                             description="chain of thought for grading")
    grade: str = Field(alias="grade",
                       description="LLM grade CORRECT or INCORRECT")

After the StrOutputParser extracts the LLM output into a string, it is first passed through a regular expression to remove any content outside the <result>...</result>, then convert it into the QAEval Pydantic object using the following code. This allows us to keep object manipulation between chains independent of the output format, as well as negate any need for format specific parsing.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
import re
import xmltodict

from pydantic import Field
from pydantic.generics import GenericModel
from typing import Generic, List, Tuple, TypeVar

T = TypeVar("T")

class Result(GenericModel, Generic[T]):
    value: T = Field(alias="result")

def parse_response(response):
    response = response.strip()
    start_tag, end_tag = "<result>", "</result>"
    is_valid = response.startswith(start_tag) and response.endswith(end_tag)
    if not is_valid:
        pattern = f"(?:{start_tag})(.*)(?:{end_tag})"
        p = re.compile(pattern, re.DOTALL)
        m = p.search(response)
        if m is not None:
            response = start_tag + m.group(1) + end_tag
    resp_dict = xmltodict.parse(response)
    result = Result(**resp_dict)
    return result

# example call
response = chain.invoke(
    "question": "the question",
    "context": "the context",
    "predicted_answer": "the predicted answer",
    "generated_answer": "the generated answer"
})
result = parse_response(response)
qa_eval = result.value["qa_eval"]

One downside to this approach is that it uses the current version of the Pydantic toolkit (v2) whereas LangChain still uses Pydantic V1 internally, as descibed in LangChain's Pydantic compatibility page. This is why this conversion needs to be outside LangChain and in the application code. Ideally, I would like this to be part of a subclass of PydanticOutputParser where the formatting_instructions could be generated from the class definition as a nice side effect, but that would mean more work than I am prepared to do at this point :-). Meanwhile, this seems like a decent compromise.

Thats all I had for today. Thank you for staying with me so far, and hope you found this useful!

Saturday, April 29, 2023

Haystack US 2023: Trip Report

I attended the Haystack US 2023 Search Relevance conference last week. It was a great opportunity to share ideas and techniques around search and search relevance, as well as to catch up with old friends and acquaintances and a chance to make new ones. I was there only for the two days of the actual conference, but there were events before and after the conference as well. The full talk schedule can be found here. The conference was in two tracks and took place at the Violet Crown movie theater in Charlottseville VA. The mall it is in also has a bunch of nice eateries, so if you are a foodie like me, then this may be a chance to expand your gastronomic domain as well. This is the US version; since the last couple of years, they have two Haystack search relevance conferences per year, one in the US and another one in Europe. In this post, I will describe very briefly the talks I attended, with links to the actual abstracts on the Haystack site. The Haystack team is working on releasing the slides and videos, you can find more information on the Relevancy Slack Channel.

Day 1

Opening Keynote

Keynote is titled Relevance in an age of Generative Search and delivered by Trey Grainger. Trey is the main author of AI Powered Search, along with co-authors Doug Turnbull and Max Irwin, a book that has become popular in the search community as the discipline moves to embrace vector search to provide more relevant results for search and recommendation. He talked about the changes in search industry in the context of his book, then mentioned ChatGPT and some popular applications of generative AI, such as search summaries and document exploration.

Metarank

Learning to hybrid search: combining BM25, neural embeddings and customer behavior into an ultimate ranking ensemble was a presentation by the author of Metarank Roman Grebenikkov. He makes the point that lexical (BM25) search is good at a few things and neural search is good at a few other things. Therefore combining the two (or more) searches as an ensemble can address the weaknesses of both systems and improve results. Metarank was used to evaluate this idea using various ensembles of techniques.

Querysets and Offline Evaluation

The Creating Representative Query Sets for Offline Evaluation talk by Karel Bergman deals with the question of how many queries to sample to evaluate an application via offline evaluation so as to achieve the required confidence level. This step is important because it allows us to predict the minimum dataset size using which we can be confident about our results.

Relevant Search at Scale

This talk about Breaking Search Performance Limits with Domain-Specific Computing was delivered by Ohad Levy of Hyperspace, which manufactures a FPGA device that provides functionality similar to a (vector enabled) ElasticSearch instance. He makes the point that in a tradeoff between performance, cost and relevance, one can usually have only 1 or 2 out of 3, and that lower latency implies better customer engagement and hence increased revenue. Their search solution offers an ElasticSearch like JSON API as well as a more Pythonic object-oriented API through which users interact with the device.

EBSCO Case Study

The EBSCO case study Vector Search for Clinical Decisions presentation by Erica Lesyshyn and Max Irwin has a lot of parallels with the search engine platform I work with (ClinicalKey). Like us, they are backed by an ontology is was developed initially using the Unified Medical Language System (UMLS) and additional structures built around that using additional ontologies or internal domain knowledge. They also have a similar concept search platform on top of which they are running various products. They partition their query into 3 intents – simple, specific and complex. Simple is similar to 1 or 2 concept searches and corresponds to their head, the specific ones are simple but qualified so can be handled with BM25 based tricks and their complex is longer queries. Their presentation described how they fixed their bad search performance on their tail queries using vector search, encoding their query and documents using an off-the-shelf Large Language Model (LLM) and doing Approximate Nearest Neighbor (ANN) search using QDrant, a Rust based vector search engine. To serve the model, Max built Mighty a Rust based inference server that packages their embedding model into ONNX and serves it over HTTP. Because Mighty compiles the service down to executable code, there are no (Python / Rust) dependencies and thus very fast and easy to deploy.

Lightning Talks

There were a series of shorter talks in the Lightning Talks section. I did take notes throughout the conference, as well as these talks, but since they were short, it was hard to take adequate notes, so some of what follows is from memory. If you wish to correct them (or indeed, any part of my trip report) please drop me a comment.

Filtered Vector Search – vector search can be difficult to threshold, so suggestion here is to use common-sense facets to build the appropriate thresholds. Another suggestion is to cache vector output for common / repeated queries so model gets invoked only for new queries.

Using search relevance with Observability – advocates for dashboards that extract aggregation metrics from queries that can help with decision making around search relevance

Doug Turnbull came up with the idea for a website nextsearchjob.com to help connect search / search-ML engineers with employers based on the jobs channel on Haystack Slack. I can see it becoming a good niche job recommendation system similar to how Andrej Karpathy's tool arxiv-sanity is for searching the Arxiv website.

Peter Dixon-Moses started the Flying Blind initiative around a shared Google spreadsheet that collects information from the community about good impact metrics, systemic embarrassing moments that could be addressed systemically, etc.

The next lightning talk was a plug for the JesterJ, a document ingestion software, by author Gus Heck. Gus points out that the advertised interfaces for document ingestion are usually for toy setups, and JesterJ provides a robust alternative to production style indexes.

Aruna Lakshmanan gave an awesome Lightning talk with tons of in-depth advice around search signals. I thought it would have been even better as a full size talk or workshop. Here are a list of user signals she spoke about.

  • classify  query term (brand/category/keyword, search vs landing, top product/category, keywords)
  • facets (click order, facets missed)
  • search vs features (don't load features up front) -- what are the top features that are being clicked?
  • click metrics -- not clicked results?
  • zero results and recommendations (should be based on user signals)
  • time per session (longer)
  • drop rate
  • personalization, preference and trending

Explainable recommendation systems with vector search, by Uri Goren, suggests creating mini-embeddings of fixed length for each feature and then concatenating for input matrix, and then densifying them by some means (auto-encoder, matrix factorization), then breaking them apart again into individual features. These features are now explainable since we know what they represent. These ideas have been implemented in Uri's recsplain system.

Lucene 9 vector implementation, by the folks at KMW Technology – Lucene and Solr 9.x support ANN search for vectors, but the index needs to be in a single segment and is loaded into memory in its entirety, making it not very useful for large vector indexes. Large indexes can be supported but at higher cost.

Eric Pugh floated a rating party to build an e-commerce dataset of query document pairs using the Quepid tool for search relevancy tuning.

Day 2

AI Powered Search Panel

Panel discussion / AMA composed of the authors of AI Powered Search – Trey Grainger, Doug Turnbull and Max Irwin – answer questions from the audience about the future of search, hybrid search, generative models, hype cycles, etc.

Citation Network

The Exploiting Citation Networks in Large Corpora to improve relevance on Broad Queries by Marc-Andre Morissette describes a technique to create synonyms using citation networks. Specifically, keywords in citing documents are treated as synonyms or child / meronym of the title of the cited document. Useful in legal situations where keywords in case law refers can be used colloquially to refer to specific legislation. Talk also outlines various statistical measures that tune the importance of such keywords.

Question Answering using Question Generation

I didn't technically attend this talk since this was my presentation, but I was there in the room when it happened, so I figured that counts. In any case, this was my talk, its about the work I did last year with fellow data scientist Sharvari Jadhav to build a FAQ style query pipeline proof of concept using a T5 sequence to sequence model to generate questions from passages, storing both passage and generated questions into the index, and matching incoming questions to stored questions during search, basically an implementation of the doc2query (and subsequently doctT5query) papers. Here are my slides for those interested.

Ref2Vec

Presented as part of Women of Search by Erika Cardenas, the presentation Women of Search present building Recommendation Systems with Vector Search discusses a concept called Ref2Vec to do product recommendations. This is currently a work in progress at Weaviate, and tries to represent a series of user interactions by the centroid of their embeddings in order to recommend them other products they might like.

Knowledge Graphs

The Populating and leveraging semantic knowledge graphs to supercharge search talk by Chris Morley covers a lot of ground around Knowledge Graphs and Semantic Search. I will revisit the presentation once his slides and video are out, but I think the point of the presentation was that he treats his tail queries as a sequence of Knowledge Graph entities and increase relevance.

ChatGPT dangers

The Stop Hallucinations and Half-Truths in Generative Search presentation by Colin Harman has some solid advice based on experience building GPT-3 based products over the last year. The talk basically provides a framework for building Generative AI based systems that are useful, helpful and relatively harmless. However, he stresses that it is not possible to guarantee 100% that such systems won't go off the rails, and to try to work around these limitations to the extent possible.

And thats my trip report. I did have situations where I really wanted to attend both simultaneous presentations, which I will try to address once the slides and videos are out. Hope you found it useful. If you work in search and search relevance and haven't signed up on the Relevancy Slack channel, I urge you to consider doing so -- there are a bunch of very knowledgeable and helpful people in there. And maybe we will see each other at the next Haystack!

Sunday, September 29, 2019

Searching for Answer Candidate Passages with Solr and Anserini


I just got back from our company's internal Search Summit at our offices at Raleigh, NC -- the conference is in its third year, and it has grown quite a bit from its humble beginnings. We even have our own conference sticker! The conference was 1 day of multi-track workshops, and two days of single track presentations. Our Labs team conducted a workshop on Bidirectional Encoder Representations from Transformers (or BERT), and I presented my results on BERT based Open Domain Question Answering.

Our BERT based Question Answering pipeline is inspired by the End-to-end Open-Domain Question Answering with BERTSerini paper from Prof Jimmy Lin's team from the University of Waterloo. We are using our own content from ScienceDirect, and we have been trying BERT, pre-trained variants such as BioBERT and SciBERT, and other models such as XLNet and AllenNLP BiDAF, fine tuned with SQuAD 1.1 and 2.0 datasets, and we are using the pipeline variants to answer our own set of questions in the scientific domain. Overall, we have gotten best results from SciBERT+SQuAD 1.1, but we are looking at fine-tuning with SQuAD 2.0 to see if we can get additional signal from when the model abstains from answering.

The figure below shows the BERTSerini pipeline (as described in the BERTSerini paper). In this post, I want to describe the Anserini Retriever component and our implementation of it as a Solr plugin. The Anserini Retriever is an open source IR toolkit described in Anserini: Enabling the Use of Lucene for Information Retrieval Research. It was originally built as a way to experiment with running things like TREC benchmarks, as described in Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge. It implements a pluggable strategy for handling question style queries against a Lucene index. The code for the Anserini project is available at castorini/anserini on Github.


Functionally, the Anserini retriever component takes as input a string representing a question, and returns a set of results that can be used as candidate passages by the Question Answering module. The pipeline consists of two steps -- query expansion and results reranking. Anserini offers multiple ways to do each step, allowing the caller to mix and match these strategies to create customized search pipelines. It also offers multiple pluggable similarity strategies, most commonly used of which seem to be BM25 (default similarity for Lucene and its derivative platforms nowadays) and QL (Query Likelihood). The question is parsed by the query expansion steps, and sent to the index, which I will call query A. Results from query A are then reranked -- the reranking is really another (filtered) query to the index, which I will call query B.

Query Expansion strategies include the Bag of Words (BoW), and the Sequential Dependency Model (SDM). Bag of words is fairly self explanatory, its just an OR query of all the tokens in the query, after stopwording, and optionally synonym expansion. SDM is only slightly more complex, it is a weighted query with three main clauses. The first clause is a Bag of Words. The second clause is an OR query of neighboring bigram tokens where proximity and order are both important, and the third clause is an OR query of neighboring bigram tokens where proximity is relaxed and order is not important. The three clauses are weighted in a compound OR query, default weights are (0.85, 0.1, 0.05).

The query (Query A) is sent to the index, which will return results. We take the top K results (configurable, we use K=50 as our default) and send it to the result reranking step. Anserini provides three pluggable reranking algorithms. They are RM3 (Relevance Model 3), Axiomatic, and Identity. RM3 computes feature vectors for query terms and the results from query A. Feature vectors for the results come from the top fbTerms (default 10) from each of the top fbDocs (default 10) documents in the result set. Query vectors and result vectors are interpolated using a multiplier alpha (default 0.5), and resulting top scoring terms are used to construct Query B as a weighted OR query, where the weights for each term is the score computed for it. The Axiomatic strategy is similar, except it uses a mix of the top rerankCutoff results from query A, and a random set of non-results to improve recall. It uses Mutual Information (MI) between query terms in the query and results to compute the top results. As with RM3, Query B for Axiomatic is a weighted OR query consisting of terms with highest MI and the corresponding weights are the MI values for the term. The Identity strategy, as the name suggests, is a no-op passthrough, which passes the output of Query A unchanged. It can be useful for debugging (in a sense "turning-off" reranking), or when the results of Query A produce sufficiently good candidates for question answering. Finally, since Query B is its own separate query, in order to ensure that it behaves as a reranker, we want to restrict the documents returned to those returned in the top rerankedCutoff documents from Query A. In the Solr plugin, we have implemented that as a docID filter on top results of Query A that is added to Query B.

Pluggable similarities is probably a bit of a misnomer. Way back, Lucene offered a single similarity implementation -- a variant of TF-IDF. Later they started offering BM25 as an alternative, and since Lucene 7.x (I believe), BM25 has become the default Similarity implementation. However, probably as a result of the changes needed to accommodate BM25, it became easier to add newer similarity implementations, and recent versions of Lucene offer a large variety of them, as you can see from the Javadocs for Lucene 8 Similarity. However, similarities are associated with index fields, so a pluggable similarity will only work if you indexed your field with the appropriate similarity in the first place. Anserini offers quite a few similarity implementations, corresponding to the different similarities available in Lucene. However, we noticed that in our case, we just needed BM25 and QL (Query Likelihood, corresponding to Lucene's LMDirichletSimilarity), so our Solr plugin just offers these two.

When I set out to implement the BERTSerini pipeline, my original thought was to leverage the Lucene code directly. However, I decided against it for a number of reasons. First, the scripts I saw in their repository suggested that the primary use case is running large benchmarks with different parameters in batch mode, whereas my use case (at least initially) was more interactive. Second, our index is fairly large, consisting of 4000 books from Science Direct, which translates to approximately 42 million records (paragraphs), and takes up 150 GB (approx) disk space, so we are constrained to build it on a cloud provider's machine (AWS in our case). With Lucene, the only way to "look inside" is Luke, which is harder to forward to your local machine over SSH, compared to forwarding HTTP. For these reasons I decided on using Solr as my indexing platform, and implementing the necessary search functionality as a Solr plugin.

Once I understood the functionality Anserini offered, it took just 2-3 days to implement the plugin and access it from inside a rudimentary web application. The figure below shows the candidate passages for a question that should be familiar to many readers of this blog -- How is market basket analysis related to collaborative filtering? If you look at the top 3 (visible) paragraphs returned, they seem like pretty good candidate passages. Overall, the (BM25 + BoW + RM3) strategy seems to return good passages for question answering.


While the plugin is currently usable as-is, i.e., it is responsive and produces good results, the code relies exclusively in copying functionality (and sometimes chunks of code) from the Anserini codebase rather than using Anserini as a library. In fact, the initial implementation (in the "master" branch) does not have any dependencies on the Anserini JAR. For long term viability, it makes sense to have the plugin be dependent on Anserini. I am currently working with Prof Lin to make that happen, and some partially working code is available to do this in the branch "b_anserini_deps".

The code for the Solr plugin (and documentation on how to install and use it) can be found in the elsevierlabs-os/anserini-solr-plugin repository on Github. My employer (Elsevier, Inc.) open sourced the software so we could (a) make it more robust as described above, in consultation with Prof Lin's team, and (b) provide a tool for Solr users interested in exposing the awesome candidate passage generation for question answering functionality provided by Anserini.

If you are working in this space and are looking for a good tool to extract candidate passages from questions, I think you will find the Solr plugin very useful. If you end up using it, please let us know what you think, including how it could be improved.