# Downloading required libraries and packages
!pip install transformers spacy numpy scipy py_entitymatching python-dotenv openai
Entity matching is the process of identifying entity descriptions from different sources that refer to the same real-world entity. This post is an applied introduction to some common approaches used for entity matching. Since even for small datasets the number of possible matches can become excessive, the process is usually split into two steps:
Blocking: Generates the possible combinations and removes most of them based on simple metrics (e.g., number of shared letters; see the sketch after this list)
Disambiguation: Evaluates the likelihood that two entity descriptions refer to the same real-world entity. For Entity Disambiguation, the approaches can be split into the following categories:
- Prompting trained Large Language Models
- BERT-like Language Models Fine-Tuned on Entity Tasks
- Similarity Metrics and traditional ML Models
- Rule systems developed by human experts
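To make the blocking step concrete, here is a minimal sketch. The shared-letter metric and the threshold of 8 are arbitrary choices for illustration, not part of any library:

# Minimal blocking sketch: keep only pairs that share enough letters
from itertools import product

left = ["John A. Smith", "Emily J. Clarke"]
right = ["John Smith", "Sarah Marie Johnson"]

def shared_letters(a: str, b: str) -> int:
    # Simple blocking metric: number of distinct characters the strings share
    return len(set(a.lower()) & set(b.lower()))

# Blocking is a coarse filter: clear non-matches are dropped cheaply, while
# surviving pairs (including some false candidates) go on to the more
# expensive disambiguation step.
candidates = [(a, b) for a, b in product(left, right) if shared_letters(a, b) >= 8]
print(candidates)
# [('John A. Smith', 'John Smith'), ('John A. Smith', 'Sarah Marie Johnson')]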
There are multiple tasks related to Entity Matching. Although the tasks differ, some approaches address shared sub-problems.
- Entity Disambiguation: Evaluates the likelihood that two entity descriptions refer to the same real-world entity. Does not include the candidate generation and blocking process.
- Entity Linking: Identifies spans of text that match an entity in an existing Knowledge Base.
- Named Entity Recognition: Identifies spans of text that can be labeled as Named Entities. Some NER systems also use Knowledge Bases or Disambiguation.
TL;DR: This paragraph in “Entity Matching using Large Language Models” is a good summary of the current status of Entity Matching in 2024:
“We can summarize the high-level implications of our findings concerning the selection of matching techniques in the following rules of thumb: For use cases that do not involve many unseen entities and for which a decent amount of training data is available, PLM-based matchers are a suitable option which does not require much compute due to the smaller size of the models. For use cases that involve a relevant amount of unseen entities and for which it is costly to gather and maintain a decent size training set, LLM-based matchers should be preferred due to their high zero-shot performance and ability to generalize to unseen entities. If using the best performing hosted LLMs is not an option due to their high usage costs, fine-tuning a cheaper hosted model is an alternative that can deliver a similar F1 performance. If using hosted models is not an option due to privacy concerns, using an open-source LLM on local hardware can be an alternative providing a slightly lower F1 performance given that some task-specific training data or domain-specific matching rules are available.”
Ralph Peeters & Christian Bizer, “Entity Matching using Large Language Models”, 2024
A more detailed explanation of Entity Matching can be found here.
When to use AI for Entity Matching
- If a human can find a match with the presented data, it is likely that a modern AI approach can too
- If a human struggles to find a match, AI is unlikely to help
- If the model is not given enough information, AI won't be able to find the match, even if a human with additional internal knowledge could
You can run the following code on Google Colab by clicking here.
Setup
# Downloading spacy english package
!python -m spacy download en_core_web_sm
# Creating data for training/few-shot and evaluation
synt_data = [
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "John A. Smith", "entity_2": "John A. Smith", "match": "yes"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "Emily J. Clarke", "entity_2": "Emily T. Clarke", "match": "no"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "Sarah M. Johnson", "entity_2": "Sarah Marie Johnson", "match": "yes"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "James P. Miller", "entity_2": "James P. Miles", "match": "no"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "Michael O'Leary", "entity_2": "Michael OLeary", "match": "yes"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "Nancy L. Wright", "entity_2": "Nancy W. Wright", "match": "no"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "Catherine G. Davis", "entity_2": "Catherine Grace Davis", "match": "yes"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "Richard A. Lee", "entity_2": "Richard A. Lin", "match": "no"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "Robert K. Brown", "entity_2": "Robert K. Brown Jr.", "match": "yes"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "Karen M. Harris", "entity_2": "Karen M. Harrison", "match": "no"},
]
real_data = [
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "Spann-Wilder, Tiffany", "entity_2": "Tiffany Spann-Wilder", "match": "yes"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "Landon C. Dais", "entity_2": "Landon Dais", "match": "yes"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "Giglio JA", "entity_2": "Jodi Giglio", "match": "yes"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "Brown, M", "entity_2": "Marla Gallo Brown", "match": "yes"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "J.T. 'Jabo' Waggoner ", "entity_2": "Jabo Waggoner", "match": "yes"},
]
Language Models
Based on recent research, this is currently the best-performing approach to Entity Disambiguation. It usually consists of building a set of examples and assembling a prompt, then passing an unseen pair to the Language Model and parsing the response. There is also some exploration into using LLMs to enhance existing Entity Linking models.
Advantages
- Easy to use: create examples, compose a prompt, call an API, and parse the results.
- Performs better than older models trained specifically for entity matching
- May not require structuring the data as much as other options; collapsing all the information into a single string can be enough.
- The LM approach can also generate explanations and categories. Categories can be used to find common error cases (year mismatches, swapped words, etc.)
Disadvantages
- Prompting is very sensitive to small changes
- The most powerful models are paid or hard to run
- No candidate generation and blocking (can be done with other tools)
Best LM Configuration - based on “Entity Matching using Large Language Models”:
- Model: GPT-4
- Prompt: Domain specific, complex prompt, free-form response (use regex to find ‘yes’)
- Few Shot: Yes, related examples
- Fine-tuning helps, especially for smaller models, without losing generalization
# Loading libraries
from openai import OpenAI
from dotenv import load_dotenv
# Loading env and openai client
load_dotenv(".env")  # Create an ENV file with OPENAI_API_KEY
client = OpenAI()
# Building prompts for few shot learning
task_description = "Do the two legislator names refer to the same real-world legislator?"
demonstration = "legislator_1: '{entity_1}'\nlegislator_2: '{entity_2}'"
# Creating examples for few shot learning
few_shot_messages = []
# System Prompt
few_shot_messages.append({"role": "system", "content": "You are a helpful assistant."})
# Few shot Examples
for example in synt_data:
    # Adds description
    few_shot_messages.append({"role": "user", "content": task_description})
    # Adds example
    few_shot_messages.append({"role": "user", "content": demonstration.format(entity_1=example["entity_1"], entity_2=example["entity_2"])})
    # Adds correct answer
    few_shot_messages.append({"role": "assistant", "content": example["match"]})
# Running the model for each real example
for sample in real_data:
    messages = few_shot_messages.copy()
    messages.append({"role": "user", "content": task_description})
    messages.append({"role": "user", "content": demonstration.format(entity_1=sample["entity_1"], entity_2=sample["entity_2"])})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )
    print("Entity 1: ", sample["entity_1"])
    print("Entity 2: ", sample["entity_2"])
    print("Same Entity: ", response.choices[0].message.content)
Entity 1: Spann-Wilder, Tiffany
Entity 2: Tiffany Spann-Wilder
Same Entity: yes
Entity 1: Landon C. Dais
Entity 2: Landon Dais
Same Entity: yes
Entity 1: Giglio JA
Entity 2: Jodi Giglio
Same Entity: No, the two names 'Giglio JA' and 'Jodi Giglio' do not seem to refer to the same real-world legislator. 'Giglio JA' is likely an abbreviation or a format that includes initials, and 'Jodi Giglio' is a full name. Without additional context, they are likely different individuals.
Entity 1: Brown, M
Entity 2: Marla Gallo Brown
Same Entity: no
Entity 1: J.T. 'Jabo' Waggoner
Entity 2: Jabo Waggoner
Same Entity: yes
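Note the Giglio response above: the model answered in free form instead of a bare "yes"/"no". As the configuration section suggests, a regex pass over the response keeps parsing robust. A minimal sketch (the parse_match helper below is ours, not part of any library):

import re

def parse_match(response_text: str) -> str:
    # Find a standalone 'yes' or 'no' in the reply; default to 'no'
    m = re.search(r"\b(yes|no)\b", response_text.lower())
    return m.group(1) if m else "no"

print(parse_match("yes"))                                        # yes
print(parse_match("No, the two names do not seem to refer..."))  # no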
BERT-type Models Fine-Tuned for Entity Tasks
An older and smaller generation of Language Models, trained on general data but fine-tuned for entity tasks. Though some are trained for other tasks, most are trained for entity linking, which requires defining a Trie or Knowledge Base (usually WikiData). Changing the KB may require retraining the model.
Advantages
- The Blocking process is performed automatically for the existing Trie/Knowledge Base
- Good performance for data within training distribution
- May not require additional training (depending on the model)
- You can run it locally at no cost
Disadvantages
- Hard to train from scratch, which is recommended if your data is out of distribution
- Limited out-of-distribution performance out of the box
- Need to build a custom Trie/KB or use generic KBs like WikiData (see the constrained-decoding sketch after the GENRE example below)
Recommended Models
Facebook GENRE (using GENRE's default Trie)
# Loading libraries
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Loading model and tokenizers
tokenizer = AutoTokenizer.from_pretrained("facebook/genre-linking-aidayago2")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/genre-linking-aidayago2").eval()
# Processing text
sentences = []
for example in real_data:
    sentences.append(f"[START_ENT] {example['entity_1']} [END_ENT]")
    sentences.append(f"[START_ENT] {example['entity_2']} [END_ENT]")
# Running inference
num_beams = 3
outputs = model.generate(
    **tokenizer(
        sentences,
        return_tensors="pt",
        padding=True,
        truncation=True,
    ),
    num_beams=num_beams,
    num_return_sequences=num_beams,
    # OPTIONAL: use constrained beam search
    # prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
)
preds = tokenizer.batch_decode(outputs, skip_special_tokens=True)
# Printing results
i = 0
for example in real_data:
    print("Entity 1: ", example['entity_1'])
    print("Entity 1 KB Candidates: ", preds[i:i+num_beams])
    print("Entity 2: ", example['entity_2'])
    print("Entity 2 KB Candidates: ", preds[i+num_beams:i+(num_beams*2)])
    print("--")
    i += num_beams * 2
/Users/santiagovelez/anaconda3/envs/exp/lib/python3.10/site-packages/transformers/generation/utils.py:1168: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
warnings.warn(
Entity 1: Spann-Wilder, Tiffany
Entity 1 KB Candidates: ['Tiffany Spann-Wilder', 'Spann-Wilder, Tiffany', 'Tiffany Spann-Wilder, Tiffany']
Entity 2: Tiffany Spann-Wilder
Entity 2 KB Candidates: ['Tiffany Spann-Wilder', 'Tiffany Spann- Wilder', 'Tiffani Spann-Wilder']
--
Entity 1: Landon C. Dais
Entity 1 KB Candidates: ['Landon C. Dais', 'Landon Dais', 'Landon C.Dais']
Entity 2: Landon Dais
Entity 2 KB Candidates: ['Landon Dais', 'LandonDais', 'Landon dais']
--
Entity 1: Giglio JA
Entity 1 KB Candidates: ['Juan Antonio Giglio', 'Antonio Giglio', 'Jorge Antonio Giglio']
Entity 2: Jodi Giglio
Entity 2 KB Candidates: ['Jodi Giglio', 'Jodi Giglio (actress)', 'Jodi Giglio (singer)']
--
Entity 1: Brown, M
Entity 1 KB Candidates: ['Mark Brown (American football)', 'Michael J. Brown', 'Michael Brown (American football)']
Entity 2: Marla Gallo Brown
Entity 2 KB Candidates: ['Marla Gallo Brown', 'Marla Gallo-Brown', 'Marla Gallo']
--
Entity 1: J.T. 'Jabo' Waggoner
Entity 1 KB Candidates: ['J. T. Waggoner', "J. T. 'Jabo' Waggoner", 'Jabo Waggoner']
Entity 2: Jabo Waggoner
Entity 2 KB Candidates: ['Jabo Waggoner', 'Jabo Waggoner', 'Jambo Waggoner']
--
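The commented-out prefix_allowed_tokens_fn above is where GENRE's constrained beam search plugs in: decoding is restricted to names that exist in your KB, so the model can only output valid entities. The GENRE repo ships its own Trie class; the sketch below hand-rolls a token-id trie instead, and the prefix handling (skipping the decoder start token) is an assumption that matches BART-style models like this checkpoint:

# Sketch: constrain generation to a closed set of KB names (assumptions noted above)
kb_names = ["Jodi Giglio", "Jabo Waggoner", "Tiffany Spann-Wilder"]

# Build a token-id trie over the allowed names; tokenizer(name).input_ids
# includes the BOS/EOS tokens for BART-style models.
trie = {}
for name in kb_names:
    node = trie
    for tok in tokenizer(name).input_ids:
        node = node.setdefault(tok, {})

def allowed_tokens(batch_id, sent):
    # Walk the trie along the tokens generated so far
    # (sent[0] is the decoder start token, which we skip).
    node = trie
    for tok in sent.tolist()[1:]:
        if tok not in node:
            return [tokenizer.eos_token_id]
        node = node[tok]
    return list(node.keys()) or [tokenizer.eos_token_id]

inputs = tokenizer(["[START_ENT] Giglio JA [END_ENT]"], return_tensors="pt")
constrained = model.generate(**inputs, num_beams=3,
                             prefix_allowed_tokens_fn=allowed_tokens)
print(tokenizer.batch_decode(constrained, skip_special_tokens=True))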
Matching Systems with Small to No Training
This group encompasses traditional Entity Matching techniques that use similarity metrics or classic ML models to evaluate how likely two entities are to match. Some of these approaches require labeled examples to train models, though training is mostly managed by the library. This approach is popular for tabular datasets with multiple attributes.
Advantages
- Simpler matching algorithms (though libraries can be outdated)
- Usually includes candidate generation and blocking
- Fast
Disadvantages
- Worse performance in Entity Disambiguation compared to LM-based approaches
- Requires training data for some of the algorithms
- Some libraries are old and have not been updated for some time
Matching Algorithms Examples
- Jaccard similarity
- Levenshtein distance
- Cosine similarity of vectors
- Random Forest Similarity Classifier
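Before handing everything to a library, it helps to see two of these metrics hand-rolled on one of our real pairs:

# Hand-rolled sketch of two classic similarity metrics
def jaccard(a: str, b: str) -> float:
    # Jaccard similarity over whitespace tokens: |A & B| / |A | B|
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

print(jaccard("Landon C. Dais", "Landon Dais"))      # 0.667: 2 shared tokens of 3
print(levenshtein("Landon C. Dais", "Landon Dais"))  # 3: delete 'C', '.', ' '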
py_entitymatching + Random Forest
# Loading libraries
import pandas as pd
import py_entitymatching as em
# Formatting data in required library configuration
# Train set
df = pd.DataFrame(synt_data)
df = df.reset_index(names="id").reset_index(names="ltable_id").reset_index(names="rtable_id")
df["match"] = df["match"].apply(lambda x: 1 if x == "yes" else 0)
A = df[['ltable_id', 'chamber_1', 'entity_1']].rename({'ltable_id': 'id', 'chamber_1': 'chamber', 'entity_1': 'entity'}, axis=1)
B = df[['rtable_id', 'chamber_2', 'entity_2']].rename({'rtable_id': 'id', 'chamber_2': 'chamber', 'entity_2': 'entity'}, axis=1)
C = df[["id", "ltable_id", "rtable_id", "match"]]

# Test set
test_df = pd.DataFrame(real_data)
test_df = test_df.reset_index(names="id").reset_index(names="ltable_id").reset_index(names="rtable_id")
test_df["match"] = test_df["match"].apply(lambda x: 1 if x == "yes" else 0)
X = test_df[['ltable_id', 'chamber_1', 'entity_1']].rename({'ltable_id': 'id', 'chamber_1': 'chamber', 'entity_1': 'entity'}, axis=1)
Y = test_df[['rtable_id', 'chamber_2', 'entity_2']].rename({'rtable_id': 'id', 'chamber_2': 'chamber', 'entity_2': 'entity'}, axis=1)
Z = test_df[["id", "ltable_id", "rtable_id", "match"]]
# Registering tables metadata in the library
em.set_key(A, 'id')
em.set_key(B, 'id')
em.set_key(C, 'id')
em.set_ltable(C, A)
em.set_rtable(C, B)

em.set_key(X, 'id')
em.set_key(Y, 'id')
em.set_key(Z, 'id')
em.set_ltable(Z, X)
em.set_rtable(Z, Y)
True
# Creating and blocking candidate matches (just for display)
ob = em.OverlapBlocker()
C = ob.block_tables(A, B, 'entity', 'entity',
                    l_output_attrs=['entity', 'chamber'],
                    r_output_attrs=['entity', 'chamber'],
                    overlap_size=1, show_progress=False)
/Users/santiagovelez/anaconda3/envs/exp/lib/python3.10/site-packages/py_entitymatching/blocker/overlap_blocker.py:258: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
l_df[l_dummy_overlap_attr] = l_df[l_overlap_attr]
/Users/santiagovelez/anaconda3/envs/exp/lib/python3.10/site-packages/py_entitymatching/blocker/overlap_blocker.py:259: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
r_df[r_dummy_overlap_attr] = r_df[r_overlap_attr]
/Users/santiagovelez/anaconda3/envs/exp/lib/python3.10/site-packages/py_entitymatching/blocker/overlap_blocker.py:615: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
table[overlap_attr] = values
# Workaround - Saving labeled data so we are able to load it in the library with the right metadata
df = df.rename({'id': '_id', 'chamber_1': 'ltable_chamber', 'chamber_2': 'rtable_chamber',
                'entity_1': 'ltable_entity', 'entity_2': 'rtable_entity', 'match': 'gold'}, axis=1)
df.to_csv("temp_df.csv")
test_df = test_df.rename({'id': '_id', 'chamber_1': 'ltable_chamber', 'chamber_2': 'rtable_chamber',
                          'entity_1': 'ltable_entity', 'entity_2': 'rtable_entity', 'match': 'gold'}, axis=1)
test_df.to_csv("temp_test_df.csv")
# Load training data with metadata
G = em.read_csv_metadata("temp_df.csv",
                         key='_id',
                         ltable=A, rtable=B,
                         fk_ltable='ltable_id', fk_rtable='rtable_id')
# Load test data with metadata
Z = em.read_csv_metadata("temp_test_df.csv",
                         key='_id',
                         ltable=X, rtable=Y,
                         fk_ltable='ltable_id', fk_rtable='rtable_id')
Metadata file is not present in the given path; proceeding to read the csv file.
Metadata file is not present in the given path; proceeding to read the csv file.
# Selecting features for Entity Matching Model
feature_table = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)
# Building feature vectors for entity matching model
# Select the attrs. to be included in the feature vector table
attrs_from_table = ['ltable_entity', 'ltable_chamber',
                    'rtable_entity', 'rtable_chamber']
# Convert the labeled data to feature vectors using the feature table
H = em.extract_feature_vecs(G,
                            feature_table=feature_table,
                            attrs_before=attrs_from_table,
                            attrs_after='gold',
                            show_progress=False)
/Users/santiagovelez/anaconda3/envs/exp/lib/python3.10/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
return bound(*args, **kwds)
# Training entity matching model with training data
# Instantiate the RF Matcher
rf = em.RFMatcher()
# Get the attributes to be projected while training
attrs_to_be_excluded = []
attrs_to_be_excluded.extend(['_id', 'ltable_id', 'rtable_id', 'gold'])
attrs_to_be_excluded.extend(attrs_from_table)
# Train using feature vectors from the labeled data.
rf.fit(table=H, exclude_attrs=attrs_to_be_excluded, target_attr='gold')
# Prepare test data for inference
# Select the attrs. to be included in the feature vector table
attrs_from_table = ['ltable_entity', 'ltable_chamber',
                    'rtable_entity', 'rtable_chamber']
# Convert the candidate set to feature vectors using the feature table
L = em.extract_feature_vecs(Z, feature_table=feature_table,
                            attrs_before=attrs_from_table,
                            show_progress=False, n_jobs=-1)
# Get the attributes to be excluded while predicting
attrs_to_be_excluded = []
attrs_to_be_excluded.extend(['_id', 'ltable_id', 'rtable_id'])
attrs_to_be_excluded.extend(attrs_from_table)
/Users/santiagovelez/anaconda3/envs/exp/lib/python3.10/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
return bound(*args, **kwds)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
# Predict the matches on inference data
predictions = rf.predict(table=L, exclude_attrs=attrs_to_be_excluded,
                         append=True, target_attr='predicted', inplace=False)
predictions.head()
|   | _id | ltable_id | rtable_id | ltable_entity | ltable_chamber | rtable_entity | rtable_chamber | id_id_exm | id_id_anm | id_id_lev_dist | ... | chamber_chamber_jac_qgm_3_qgm_3 | entity_entity_jac_qgm_3_qgm_3 | entity_entity_cos_dlm_dc0_dlm_dc0 | entity_entity_jac_dlm_dc0_dlm_dc0 | entity_entity_mel | entity_entity_lev_dist | entity_entity_lev_sim | entity_entity_nmw | entity_entity_sw | predicted |
|---|-----|-----------|-----------|---------------|----------------|---------------|----------------|-----------|-----------|----------------|-----|------|------|------|------|------|------|------|------|------|-----------|
| 0 | 0 | 0 | 0 | Spann-Wilder, Tiffany | house | Tiffany Spann-Wilder | house | 1 | 0.0 | 0.0 | ... | 1.0 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.0 | 1.000000 | 13.0 | 13.0 | 1 |
| 1 | 1 | 1 | 1 | Landon C. Dais | house | Landon Dais | house | 1 | 1.0 | 0.0 | ... | 1.0 | 0.700000 | 0.666667 | 0.500000 | 0.973333 | 1.0 | 0.933333 | 14.0 | 14.0 | 0 |
| 2 | 2 | 2 | 2 | Giglio JA | house | Jodi Giglio | house | 1 | 1.0 | 0.0 | ... | 1.0 | 0.625000 | 0.666667 | 0.500000 | 0.945395 | 4.0 | 0.789474 | 12.0 | 12.0 | 1 |
| 3 | 3 | 3 | 3 | Brown, M | house | Marla Gallo Brown | house | 1 | 1.0 | 0.0 | ... | 1.0 | 0.571429 | 0.666667 | 0.500000 | 0.959048 | 2.0 | 0.866667 | 12.0 | 12.0 | 0 |
| 4 | 4 | 4 | 4 | J.T. 'Jabo' Waggoner | house | Jabo Waggoner | house | 1 | 1.0 | 0.0 | ... | 1.0 | 0.736842 | 0.500000 | 0.333333 | 0.986667 | 1.0 | 0.933333 | 13.0 | 13.0 | 1 |

5 rows × 26 columns
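To score these predictions against the gold labels, py_entitymatching's evaluation helpers can be used. A sketch, under the assumption that the 'gold' column was carried into L by passing attrs_after='gold' when extracting the test feature vectors above:

# Sketch: evaluating predictions; requires 'gold' in the predictions table
eval_result = em.eval_matches(predictions, 'gold', 'predicted')
em.print_eval_summary(eval_result)  # precision, recall, F1, false pos/neg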
Sources
Python Libraries
- https://github.com/wbsg-uni-mannheim/MatchGPT/blob/main/LLMForEM
- https://huggingface.co/facebook/mgenre-wiki
- https://github.com/explosion/projects/tree/v3/tutorials/nel_emerson
- https://pypi.org/project/spacy-entity-linker/
- https://huggingface.co/shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher?text=fuzzformer
- https://github.com/megagonlabs/ditto
- https://dedupe.io
- https://github.com/anhaidgroup/deepmatcher
- https://nbviewer.org/github/anhaidgroup/py_entitymatching
- https://github.com/facebookresearch/GENRE
- https://github.com/Babelscape/multinerd?tab=readme-ov-file
- https://github.com/SapienzaNLP/extend?tab=readme-ov-file
- https://github.com/Lucaterre/spacyfishing
Papers
- LLMAEL: Large Language Models are Good Context Augmenters for Entity Linking
- Entity Matching using Large Language Models
- EntGPT: Linking Generative Large Language Models with Knowledge Bases
- On Leveraging Large Language Models for Enhancing Entity Resolution
- Using ChatGPT for Entity Matching
- “Is This You?” Entity Matching in the Modern Data Stack with Large Language Models
- DeepType: Multilingual Entity Linking by Neural Type System Evolution
- MultiNERD: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation)
- Autoregressive Entity Retrieval
- Deep Entity Matching with Pre-Trained Language Models
- https://github.com/sebastianruder/NLP-progress/blob/master/english/entity_linking.md
- https://openai.com/index/discovering-types-for-entity-disambiguation/