Applied Intro to Search Algorithms

Information Retrieval and Reranking is a wide field with significant nuance and incredible importance to multiple businesses. This post is an applied introduction to building a simple search application, displaying three common approaches to search, one to reranking, and some basic evaluation metrics. This is not intended as a deep dive into each algorithm, but as a high-level display of the basic components required to build a search solution. For more information please review the references at the end of the post.

TL;DR: BM25 is the default for a reason: simple, fast, and accurate. You should start there and use it as the benchmark for other approaches. If you have time and require increased accuracy, Colbert is worth exploring, but focus first on gathering data and building evaluation metrics.

You can run the following code on Google Colab clicking here.

Install required libraries

Tested with python 3.10.1

!pip install numpy panda torch langchain-text-splitters sentence-transformers rank_bm25 faiss-cpu ranx ragatouille==0.0.8 llama-index==0.9.48

Import libraries

import pandas as pd
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
import faiss
from ragatouille import RAGPretrainedModel
from rank_bm25 import BM25Okapi
from ranx import Qrels, Run, fuse, evaluate

/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/threadpoolctl.py:1214: RuntimeWarning: 
Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
    https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md

  warnings.warn(msg, RuntimeWarning)

Define utility functions

We define a simple text processing function. Possible improvements include tokenization, stemming, stopword removal, etc.

def text_preprocess(text):
    text = text.lower()
    return text

def build_run(results_df, doc_id_col="chunk_id"):
    run_df = results_df.copy()
    run = Run.from_df(
        df=run_df,
        q_id_col="query_id",
        doc_id_col=doc_id_col,
        score_col="score",
    )
    return run

Prepare data

Load data

texts_df = pd.read_csv("https://raw.githubusercontent.com/santiagomvc/search_methods_intro/main/data/texts.csv")
queries_df = pd.read_csv("https://raw.githubusercontent.com/santiagomvc/search_methods_intro/main/data/queries.csv")

Chunking configuration

Splits text into smaller chunks for more detailed representations of the text.

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=64,
    length_function=len,
    is_separator_regex=False,
)

Split data and process chunks

doc_ids = []
chunk_ids = []
chunk_texts = []
for _, row in texts_df.iterrows():
    doc_id = str(row["doc_id"])
    doc_chunk_texts = text_splitter.split_text(row["doc_text"])
    n_chunk_texts = len(doc_chunk_texts)
    doc_chunk_ids = [f"{doc_id}-{str(i)}" for i in range(n_chunk_texts)]
    # basic text processing
    doc_chunk_texts = [text_preprocess(chunk) for chunk in doc_chunk_texts]
    # save results
    doc_ids.extend([doc_id] * n_chunk_texts)
    chunk_ids.extend(doc_chunk_ids)
    chunk_texts.extend(doc_chunk_texts)

Save results as df

chunks_df = pd.DataFrame({
    "doc_id": doc_ids,
    "chunk_id": chunk_ids,
    "chunk_text": chunk_texts,
})

Indexing data

Sparse Representations: BM25

BM25 is a ranking algorithm based on multiple statistics calculated using the terms in the Query and Documents, including term frequency in the document, document length, term frequency in all documents, etc.

bm_25_tokenized_corpus = [chunk.split(" ") for chunk in chunk_texts]
bm25_index = BM25Okapi(bm_25_tokenized_corpus)

Basic Semantic Similarity: Sentence Transformers + Faiss Index

Semantic Similarity is the task of determining how similar is the meaning of two or more texts. It uses ML models to convert text into a single dense vector that captures semantic information. Semantic similarity can be used as a ranking function by calculating the similarity between the Query and each Document, retrieving those with the higher similarity.

sentsim_model = SentenceTransformer("all-mpnet-base-v2")
sentsim_embeddings = sentsim_model.encode(chunk_texts)
sentsim_embedding_size = sentsim_embeddings.shape[1]
sentsim_index = faiss.IndexFlatL2(sentsim_embedding_size)
sentsim_index.add(sentsim_embeddings)

Advanced Semantic Similarity: Colbert + RAGatuille

Colbert is a retrieval model built on top of BERT-like Language Models. Colbert uses multiple token-level dense embeddings to calculate the relevance between the Query and Documents, while traditional sentence embeddings collapse query and document information into single embeddings. Take a look at this Vespa Demo for a very interesting visual aid.

if __name__ == "__main__":   # Required so ragatouille runs safely
    try:
        colbert_index = RAGPretrainedModel.from_index(".ragatouille/colbert/indexes/index")
    except:
        colbert_index = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
        colbert_index.index(
            index_name="index", 
            collection=chunk_texts, 
            document_ids=chunk_ids, 
            use_faiss=False,
            max_document_length=1024,
            split_documents=False,
        )

[Jul 02, 18:29:22] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...

/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:126: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
  warnings.warn(

Search Functions

BM25

def bm25_search(query_text, bm25_index=bm25_index, chunks_df=chunks_df):
    # Preprocess query same as docs
    query_text = text_preprocess(query_text)
    # Transform query
    tokenized_query = query_text.split(" ")
    # Search with bm25 index
    doc_scores = bm25_index.get_scores(tokenized_query)
    # Format as dataframe
    bm25_df = chunks_df.copy()
    bm25_df["score"] = doc_scores
    bm25_df = bm25_df.loc[bm25_df["score"] > 0]
    # Drop to get docs, no chunks
    bm25_df = bm25_df.sort_values("score", ascending=False)
    bm25_df = bm25_df.drop_duplicates(subset=["doc_id"], keep="first")
    # Return results
    return bm25_df

Sentece Similarity

def sentsim_search(query_text, sentsim_model=sentsim_model, sentsim_index=sentsim_index, chunks_df=chunks_df, k=5):
    # Preprocess query same as docs
    query_text = text_preprocess(query_text)
    # Encode query
    sentsim_query_emb = sentsim_model.encode(query_text).reshape(1,-1)
    # Search with embedding
    D, I = sentsim_index.search(sentsim_query_emb, k)
    # Format as dataframe
    sentsim_df = chunks_df.copy()
    sentsim_df = sentsim_df.loc[I[0]]
    sentsim_df["score"] = D[0].astype(float)
    # Drop to get docs, no chunks
    sentsim_df = sentsim_df.sort_values("score", ascending=False)
    sentsim_df = sentsim_df.drop_duplicates(subset=["doc_id"], keep="first")
    return sentsim_df

Colbert

def colbert_search(query_text, colbert_index=colbert_index, chunks_df=chunks_df, k=5):
    # Preprocess query same as docs
    query_text = text_preprocess(query_text)
    # Run query
    colbert_results = colbert_index.search(query_text, k=k)
    # Save results as a df
    colbert_df = pd.DataFrame(colbert_results)
    colbert_df = colbert_df.rename({"document_id": "chunk_id"}, axis=1)
    colbert_df = colbert_df.merge(chunks_df, how="left", on="chunk_id")
    colbert_df = colbert_df[["doc_id", "chunk_id", "chunk_text", "score"]]
    # Drop to get docs, no chunks
    colbert_df = colbert_df.sort_values("score", ascending=False)
    colbert_df = colbert_df.drop_duplicates(subset=["doc_id"], keep="first")
    return colbert_df

Rank Fusion: Min-Max Norm, CombMAX fusion

Since retrieval algorithms have different strengths, sometimes it’s useful to combine predictions to maximize users’ expected results. Reranking algorithms receive candidates and scores from different retrieval algorithms, and return a single combined pair of documents and scores. Two important parameters for Reranking are Score Normalization and Fusion Methods. There are multiple methods for Normalization and Fusion, in this case, we use:

Min-Max Norm: Scales the scores of a retriever between 0 and 1, scaling to 0 the minimum score and 1 the maximum score.
CombMax fusion: Combines scores from different sources by taking the maximum score for each item.

def combined_search(query_text, fusion_norm="min-max", fusion_method="max", chunks_df=chunks_df):
    runs = []
    for search_fun in [bm25_search, sentsim_search, colbert_search]:
        # Save results in Run format
        run_df = search_fun(query_text)
        run_df["query_id"] = "0"   # query id is required for the run
        # run_df["chunk_id"] = run_df["chunk_id"].astype(str)
        run = build_run(run_df)
        runs.append(run)
    ## Combining runs
    combined_run = fuse(
        runs=runs,
        norm=fusion_norm,
        method=fusion_method,
    )
    ## Saving as dataframe
    combined_df = combined_run.to_dataframe()
    combined_df = combined_df.drop("q_id", axis=1)
    combined_df = combined_df.rename({"doc_id": "chunk_id"}, axis=1)
    combined_df = combined_df.merge(chunks_df, how="left", on="chunk_id")
    combined_df = combined_df[["doc_id", "chunk_id", "chunk_text", "score"]]
    # Drop to get docs, no chunks
    combined_df = combined_df.sort_values("score", ascending=False)
    combined_df = combined_df.drop_duplicates(subset=["doc_id"], keep="first")
    ## Return similar format to other responses
    return combined_df

Global Search Function

def search(query_text, mode="bm25"):
    if mode=="bm25":
        return bm25_search(query_text)
    elif mode=="sentsim":
        return sentsim_search(query_text)
    elif mode=="colbert":
        return colbert_search(query_text)
    elif mode=="combined":
        return combined_search(query_text)

Evaluates search with labeled queries

Evaluation function

Metrics allow us to evaluate search algorithms performance in a fast and automated way. Though they don’t exactly map to users’ preferences, and usually require manual labor first, they allow for quick iteration and supervision during the experimental and deployment phases. There are multiple metrics to evaluate search algorithms. In this case we use:

F1 Score: Harmonic mean of precision and recall
MRR: Average multiplicative inverse of the rank of the first retrieved relevant document

def evaluate_search(mode="bm25", queries_df=queries_df):
    # Preprocess df
    queries_df["query_id"] = queries_df["query_id"].astype(str)
    queries_df["doc_id"] = queries_df["doc_id"].astype(str)
    queries_df.loc[queries_df["score"] > 0, "score"] = 1   # Replace all positive scores with 1
    # Create Qrel for evaluation
    qrels = Qrels.from_df(
        df=queries_df,
        q_id_col="query_id",
        doc_id_col="doc_id",
        score_col="score",
    )

    # Get search responses
    unique_queries_df = queries_df[["query_id", "query_text"]].drop_duplicates()
    unique_queries_list = unique_queries_df.values.tolist()
    responses_list = []
    for query_id, query_text in unique_queries_list:
        response_df = search(query_text, mode=mode)
        response_df["query_id"] = query_id
        responses_list.append(response_df)

    # Build run dataframe
    run_df = pd.concat(responses_list)
    run_df["doc_id"] = run_df["doc_id"].astype(str)
    run = build_run(run_df, doc_id_col="doc_id")

    ## Evaluate run
    metrics = evaluate(qrels, run, ["f1", "mrr"])
    print(mode, metrics)

Evaluate BM25

evaluate_search(mode="bm25")

bm25 {'f1': 0.6981481481481481, 'mrr': 0.9166666666666666}

Evaluate Single Vector Sentence Similarity

evaluate_search(mode="sentsim")

sentsim {'f1': 0.6555555555555556, 'mrr': 0.8888888888888888}

Evaluate Colbert

evaluate_search(mode="colbert")

Loading searcher for index index for the first time... This may take a few seconds
[Jul 02, 18:29:35] #> Loading codec...
[Jul 02, 18:29:35] #> Loading IVF...
[Jul 02, 18:29:35] Loading segmented_lookup_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...

/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:126: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
  warnings.warn(

[Jul 02, 18:29:36] #> Loading doclens...

100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 603.06it/s]

[Jul 02, 18:29:36] #> Loading codes and residuals...


100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 65.29it/s]

[Jul 02, 18:29:36] Loading filter_pids_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...

[Jul 02, 18:29:36] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Searcher loaded!

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . juneteenth,          True,       None
#> Output IDs: torch.Size([32]), tensor([  101,     1,  2238, 17389,  3372,  2232,   102,   103,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])

/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(

colbert {'f1': 0.7944444444444444, 'mrr': 1.0}

/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(

Evaluate Rank Fusion (Min-Max Norm, CombMAX fusion)

evaluate_search(mode="combined")

/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(

combined {'f1': 0.6814814814814815, 'mrr': 0.8888888888888888}

Try it yourself! (Only on interactive mode)

query_text = input("Enter your query:")
search_mode = input("Enter the search mode (bm25, sentsim, colbert, combined):")
search(query_text, search_mode).to_dict(orient="records")

Enter your query: health
Enter the search mode (bm25, sentsim, colbert, combined): combined

/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(

[{'doc_id': '16',
  'chunk_id': '16-4',
  'chunk_text': "13 vacation rentals or short-term rentals as follows:\n14 1. to protect the public's health and safety, including rules and\n15 regulations related to fire and building codes, health and sanitation,",
  'score': 1.0},
 {'doc_id': '10',
  'chunk_id': '10-48',
  'chunk_text': '110 of social work, psychologist licensed by the board of psychology, or other licensed counseling\n111 professional with appropriate experience and training, provided that any such individual makes progress',
  'score': 1.0},
 {'doc_id': '17',
  'chunk_id': '17-56',
  'chunk_text': '41 (10) "health care provider" or "provider" means any person or entity li-\n42 censed, certified, or otherwise authorized by law to administer health care\n43 in the ordinary course of business or practice of a profession, including',
  'score': 1.0},
 {'doc_id': '14',
  'chunk_id': '14-76',
  'chunk_text': '176 analysts, and other licensed health and behavioral positions, which may either be employed by the\n177 school board or provided through contracted services.',
  'score': 0.439275072516164},
 {'doc_id': '12',
  'chunk_id': '12-42',
  'chunk_text': '100 deduction for such taxable year for long-term health care insurance premiums paid by him.\n101 11. contract payments to a producer of quota tobacco or a tobacco quota holder, or their spouses, as',
  'score': 0.1682393388722695},
 {'doc_id': '13',
  'chunk_id': '13-6',
  'chunk_text': '19 conditions of employment of the workforce \n20 \n21 . however, no locality shall adopt any workplace rule, other than for the purposes of a\n22 community services board or behavioral health authority as defined in § 37.2-100, that prevents an',
  'score': 0.01917413339973471},
 {'doc_id': '5',
  'chunk_id': '5-7',
  'chunk_text': '18 \n19 \n20 \n21 \n22 \n23 \n24 \n25 \n26 \n27 5. the mental and physical health of all individuals involved.\n28 6. which parent is more likely to allow the child frequent,\n29 meaningful and continuing contact with the other parent. this paragraph',
  'score': 0.0}]

Thank you!