!pip install numpy panda torch langchain-text-splitters sentence-transformers rank_bm25 faiss-cpu ranx ragatouille==0.0.8 llama-index==0.9.48
Information Retrieval and Reranking is a wide field with significant nuance and incredible importance to multiple businesses. This post is an applied introduction to building a simple search application, displaying three common approaches to search, one to reranking, and some basic evaluation metrics. This is not intended as a deep dive into each algorithm, but as a high-level display of the basic components required to build a search solution. For more information please review the references at the end of the post.
TL;DR: BM25 is the default for a reason: simple, fast, and accurate. You should start there and use it as the benchmark for other approaches. If you have time and require increased accuracy, Colbert is worth exploring, but focus first on gathering data and building evaluation metrics.
You can run the following code on Google Colab clicking here.
Install required libraries
Tested with python 3.10.1
Import libraries
import pandas as pd
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
import faiss
from ragatouille import RAGPretrainedModel
from rank_bm25 import BM25Okapi
from ranx import Qrels, Run, fuse, evaluate
/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/threadpoolctl.py:1214: RuntimeWarning:
Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md
warnings.warn(msg, RuntimeWarning)
Define utility functions
We define a simple text processing function. Possible improvements include tokenization, stemming, stopword removal, etc.
def text_preprocess(text):
= text.lower()
text return text
def build_run(results_df, doc_id_col="chunk_id"):
= results_df.copy()
run_df = Run.from_df(
run =run_df,
df="query_id",
q_id_col=doc_id_col,
doc_id_col="score",
score_col
)return run
Prepare data
Load data
= pd.read_csv("https://raw.githubusercontent.com/santiagomvc/search_methods_intro/main/data/texts.csv")
texts_df = pd.read_csv("https://raw.githubusercontent.com/santiagomvc/search_methods_intro/main/data/queries.csv") queries_df
Chunking configuration
Splits text into smaller chunks for more detailed representations of the text.
= RecursiveCharacterTextSplitter(
text_splitter =256,
chunk_size=64,
chunk_overlap=len,
length_function=False,
is_separator_regex )
Split data and process chunks
= []
doc_ids = []
chunk_ids = []
chunk_texts for _, row in texts_df.iterrows():
= str(row["doc_id"])
doc_id = text_splitter.split_text(row["doc_text"])
doc_chunk_texts = len(doc_chunk_texts)
n_chunk_texts = [f"{doc_id}-{str(i)}" for i in range(n_chunk_texts)]
doc_chunk_ids # basic text processing
= [text_preprocess(chunk) for chunk in doc_chunk_texts]
doc_chunk_texts # save results
* n_chunk_texts)
doc_ids.extend([doc_id]
chunk_ids.extend(doc_chunk_ids) chunk_texts.extend(doc_chunk_texts)
Save results as df
= pd.DataFrame({
chunks_df "doc_id": doc_ids,
"chunk_id": chunk_ids,
"chunk_text": chunk_texts,
})
Indexing data
Sparse Representations: BM25
BM25 is a ranking algorithm based on multiple statistics calculated using the terms in the Query and Documents, including term frequency in the document, document length, term frequency in all documents, etc.
= [chunk.split(" ") for chunk in chunk_texts]
bm_25_tokenized_corpus = BM25Okapi(bm_25_tokenized_corpus) bm25_index
Basic Semantic Similarity: Sentence Transformers + Faiss Index
Semantic Similarity is the task of determining how similar is the meaning of two or more texts. It uses ML models to convert text into a single dense vector that captures semantic information. Semantic similarity can be used as a ranking function by calculating the similarity between the Query and each Document, retrieving those with the higher similarity.
= SentenceTransformer("all-mpnet-base-v2")
sentsim_model = sentsim_model.encode(chunk_texts)
sentsim_embeddings = sentsim_embeddings.shape[1]
sentsim_embedding_size = faiss.IndexFlatL2(sentsim_embedding_size)
sentsim_index sentsim_index.add(sentsim_embeddings)
Advanced Semantic Similarity: Colbert + RAGatuille
Colbert is a retrieval model built on top of BERT-like Language Models. Colbert uses multiple token-level dense embeddings to calculate the relevance between the Query and Documents, while traditional sentence embeddings collapse query and document information into single embeddings. Take a look at this Vespa Demo for a very interesting visual aid.
if __name__ == "__main__": # Required so ragatouille runs safely
try:
= RAGPretrainedModel.from_index(".ragatouille/colbert/indexes/index")
colbert_index except:
= RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
colbert_index
colbert_index.index(="index",
index_name=chunk_texts,
collection=chunk_ids,
document_ids=False,
use_faiss=1024,
max_document_length=False,
split_documents )
[Jul 02, 18:29:22] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:126: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
warnings.warn(
Search Functions
BM25
def bm25_search(query_text, bm25_index=bm25_index, chunks_df=chunks_df):
# Preprocess query same as docs
= text_preprocess(query_text)
query_text # Transform query
= query_text.split(" ")
tokenized_query # Search with bm25 index
= bm25_index.get_scores(tokenized_query)
doc_scores # Format as dataframe
= chunks_df.copy()
bm25_df "score"] = doc_scores
bm25_df[= bm25_df.loc[bm25_df["score"] > 0]
bm25_df # Drop to get docs, no chunks
= bm25_df.sort_values("score", ascending=False)
bm25_df = bm25_df.drop_duplicates(subset=["doc_id"], keep="first")
bm25_df # Return results
return bm25_df
Sentece Similarity
def sentsim_search(query_text, sentsim_model=sentsim_model, sentsim_index=sentsim_index, chunks_df=chunks_df, k=5):
# Preprocess query same as docs
= text_preprocess(query_text)
query_text # Encode query
= sentsim_model.encode(query_text).reshape(1,-1)
sentsim_query_emb # Search with embedding
= sentsim_index.search(sentsim_query_emb, k)
D, I # Format as dataframe
= chunks_df.copy()
sentsim_df = sentsim_df.loc[I[0]]
sentsim_df "score"] = D[0].astype(float)
sentsim_df[# Drop to get docs, no chunks
= sentsim_df.sort_values("score", ascending=False)
sentsim_df = sentsim_df.drop_duplicates(subset=["doc_id"], keep="first")
sentsim_df return sentsim_df
Colbert
def colbert_search(query_text, colbert_index=colbert_index, chunks_df=chunks_df, k=5):
# Preprocess query same as docs
= text_preprocess(query_text)
query_text # Run query
= colbert_index.search(query_text, k=k)
colbert_results # Save results as a df
= pd.DataFrame(colbert_results)
colbert_df = colbert_df.rename({"document_id": "chunk_id"}, axis=1)
colbert_df = colbert_df.merge(chunks_df, how="left", on="chunk_id")
colbert_df = colbert_df[["doc_id", "chunk_id", "chunk_text", "score"]]
colbert_df # Drop to get docs, no chunks
= colbert_df.sort_values("score", ascending=False)
colbert_df = colbert_df.drop_duplicates(subset=["doc_id"], keep="first")
colbert_df return colbert_df
Rank Fusion: Min-Max Norm, CombMAX fusion
Since retrieval algorithms have different strengths, sometimes it’s useful to combine predictions to maximize users’ expected results. Reranking algorithms receive candidates and scores from different retrieval algorithms, and return a single combined pair of documents and scores. Two important parameters for Reranking are Score Normalization and Fusion Methods. There are multiple methods for Normalization and Fusion, in this case, we use:
- Min-Max Norm: Scales the scores of a retriever between 0 and 1, scaling to 0 the minimum score and 1 the maximum score.
- CombMax fusion: Combines scores from different sources by taking the maximum score for each item.
def combined_search(query_text, fusion_norm="min-max", fusion_method="max", chunks_df=chunks_df):
= []
runs for search_fun in [bm25_search, sentsim_search, colbert_search]:
# Save results in Run format
= search_fun(query_text)
run_df "query_id"] = "0" # query id is required for the run
run_df[# run_df["chunk_id"] = run_df["chunk_id"].astype(str)
= build_run(run_df)
run
runs.append(run)## Combining runs
= fuse(
combined_run =runs,
runs=fusion_norm,
norm=fusion_method,
method
)## Saving as dataframe
= combined_run.to_dataframe()
combined_df = combined_df.drop("q_id", axis=1)
combined_df = combined_df.rename({"doc_id": "chunk_id"}, axis=1)
combined_df = combined_df.merge(chunks_df, how="left", on="chunk_id")
combined_df = combined_df[["doc_id", "chunk_id", "chunk_text", "score"]]
combined_df # Drop to get docs, no chunks
= combined_df.sort_values("score", ascending=False)
combined_df = combined_df.drop_duplicates(subset=["doc_id"], keep="first")
combined_df ## Return similar format to other responses
return combined_df
Global Search Function
def search(query_text, mode="bm25"):
if mode=="bm25":
return bm25_search(query_text)
elif mode=="sentsim":
return sentsim_search(query_text)
elif mode=="colbert":
return colbert_search(query_text)
elif mode=="combined":
return combined_search(query_text)
Evaluates search with labeled queries
Evaluation function
Metrics allow us to evaluate search algorithms performance in a fast and automated way. Though they don’t exactly map to users’ preferences, and usually require manual labor first, they allow for quick iteration and supervision during the experimental and deployment phases. There are multiple metrics to evaluate search algorithms. In this case we use:
- F1 Score: Harmonic mean of precision and recall
- MRR: Average multiplicative inverse of the rank of the first retrieved relevant document
def evaluate_search(mode="bm25", queries_df=queries_df):
# Preprocess df
"query_id"] = queries_df["query_id"].astype(str)
queries_df["doc_id"] = queries_df["doc_id"].astype(str)
queries_df["score"] > 0, "score"] = 1 # Replace all positive scores with 1
queries_df.loc[queries_df[# Create Qrel for evaluation
= Qrels.from_df(
qrels =queries_df,
df="query_id",
q_id_col="doc_id",
doc_id_col="score",
score_col
)
# Get search responses
= queries_df[["query_id", "query_text"]].drop_duplicates()
unique_queries_df = unique_queries_df.values.tolist()
unique_queries_list = []
responses_list for query_id, query_text in unique_queries_list:
= search(query_text, mode=mode)
response_df "query_id"] = query_id
response_df[
responses_list.append(response_df)
# Build run dataframe
= pd.concat(responses_list)
run_df "doc_id"] = run_df["doc_id"].astype(str)
run_df[= build_run(run_df, doc_id_col="doc_id")
run
## Evaluate run
= evaluate(qrels, run, ["f1", "mrr"])
metrics print(mode, metrics)
Evaluate BM25
="bm25") evaluate_search(mode
bm25 {'f1': 0.6981481481481481, 'mrr': 0.9166666666666666}
Evaluate Single Vector Sentence Similarity
="sentsim") evaluate_search(mode
sentsim {'f1': 0.6555555555555556, 'mrr': 0.8888888888888888}
Evaluate Colbert
="colbert") evaluate_search(mode
Loading searcher for index index for the first time... This may take a few seconds
[Jul 02, 18:29:35] #> Loading codec...
[Jul 02, 18:29:35] #> Loading IVF...
[Jul 02, 18:29:35] Loading segmented_lookup_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:126: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
warnings.warn(
[Jul 02, 18:29:36] #> Loading doclens...
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 603.06it/s]
[Jul 02, 18:29:36] #> Loading codes and residuals...
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 65.29it/s]
[Jul 02, 18:29:36] Loading filter_pids_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Jul 02, 18:29:36] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Searcher loaded!
#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . juneteenth, True, None
#> Output IDs: torch.Size([32]), tensor([ 101, 1, 2238, 17389, 3372, 2232, 102, 103, 103, 103,
103, 103, 103, 103, 103, 103, 103, 103, 103, 103,
103, 103, 103, 103, 103, 103, 103, 103, 103, 103,
103, 103])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0])
/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn(
/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn(
/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn(
/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn(
/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn(
colbert {'f1': 0.7944444444444444, 'mrr': 1.0}
/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn(
Evaluate Rank Fusion (Min-Max Norm, CombMAX fusion)
="combined") evaluate_search(mode
/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn(
/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn(
/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn(
/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn(
/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn(
/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn(
combined {'f1': 0.6814814814814815, 'mrr': 0.8888888888888888}
Try it yourself! (Only on interactive mode)
= input("Enter your query:")
query_text = input("Enter the search mode (bm25, sentsim, colbert, combined):")
search_mode ="records") search(query_text, search_mode).to_dict(orient
Enter your query: health
Enter the search mode (bm25, sentsim, colbert, combined): combined
/Users/santiagovelez/anaconda3/envs/temp2/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn(
[{'doc_id': '16',
'chunk_id': '16-4',
'chunk_text': "13 vacation rentals or short-term rentals as follows:\n14 1. to protect the public's health and safety, including rules and\n15 regulations related to fire and building codes, health and sanitation,",
'score': 1.0},
{'doc_id': '10',
'chunk_id': '10-48',
'chunk_text': '110 of social work, psychologist licensed by the board of psychology, or other licensed counseling\n111 professional with appropriate experience and training, provided that any such individual makes progress',
'score': 1.0},
{'doc_id': '17',
'chunk_id': '17-56',
'chunk_text': '41 (10) "health care provider" or "provider" means any person or entity li-\n42 censed, certified, or otherwise authorized by law to administer health care\n43 in the ordinary course of business or practice of a profession, including',
'score': 1.0},
{'doc_id': '14',
'chunk_id': '14-76',
'chunk_text': '176 analysts, and other licensed health and behavioral positions, which may either be employed by the\n177 school board or provided through contracted services.',
'score': 0.439275072516164},
{'doc_id': '12',
'chunk_id': '12-42',
'chunk_text': '100 deduction for such taxable year for long-term health care insurance premiums paid by him.\n101 11. contract payments to a producer of quota tobacco or a tobacco quota holder, or their spouses, as',
'score': 0.1682393388722695},
{'doc_id': '13',
'chunk_id': '13-6',
'chunk_text': '19 conditions of employment of the workforce \n20 \n21 . however, no locality shall adopt any workplace rule, other than for the purposes of a\n22 community services board or behavioral health authority as defined in § 37.2-100, that prevents an',
'score': 0.01917413339973471},
{'doc_id': '5',
'chunk_id': '5-7',
'chunk_text': '18 \n19 \n20 \n21 \n22 \n23 \n24 \n25 \n26 \n27 5. the mental and physical health of all individuals involved.\n28 6. which parent is more likely to allow the child frequent,\n29 meaningful and continuing contact with the other parent. this paragraph',
'score': 0.0}]
References
- https://web.stanford.edu/class/cs276/handouts/lecture12-bm25etc.pdf
- https://zilliz.com/learn/mastering-bm25-a-deep-dive-into-the-algorithm-and-application-in-milvus
- https://huggingface.co/tasks/sentence-similarity
- https://github.com/stanford-futuredata/ColBERT
- https://arxiv.org/abs/2004.12832
- https://til.simonwillison.net/llms/colbert-ragatouille
- https://amenra.github.io/ranx/
- https://trec.nist.gov/pubs/trec2/papers/txt/23.txt