# Downloading required libraries and packages
!pip install transformers spacy numpy scipy py_entitymatching python-dotenv openai
Entity matching is the process of identifying entity descriptions from different sources that refer to the same real-world entity. This post is an applied introduction to some common approaches used for entity matching. Since even for small datasets the number of possible matches can become excessive, the process is usually split into two steps:
Blocking: Generates the possible combinations and removes most of them based on simple metrics (e.g., number of shared letters; see the sketch after this list)
Disambiguation: Evaluates the likelihood that two entity descriptions refer to the same real-world entity. For Entity Disambiguation, the approaches can be split into the following categories:
- Prompting trained Large Language Models
- BERT-like Language Models Fine-Tuned on Entity Tasks
- Similarity Metrics and traditional ML Models
- Rule systems developed by human experts
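To make the blocking step concrete, here is a minimal sketch. The shared-letter metric and the threshold of 8 are arbitrary choices for illustration, not part of any library:

# Minimal blocking sketch: keep only pairs that share enough letters
from itertools import product

left = ["John A. Smith", "Emily J. Clarke"]
right = ["John Smith", "Sarah Marie Johnson"]

def shared_letters(a: str, b: str) -> int:
    # Simple blocking metric: number of distinct characters the strings share
    return len(set(a.lower()) & set(b.lower()))

# Blocking is a coarse filter: clear non-matches are dropped cheaply, while
# surviving pairs (including some false candidates) go on to the more
# expensive disambiguation step.
candidates = [(a, b) for a, b in product(left, right) if shared_letters(a, b) >= 8]
print(candidates)
# [('John A. Smith', 'John Smith'), ('John A. Smith', 'Sarah Marie Johnson')]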
There are multiple tasks related to Entity Matching. Although the tasks differ, some approaches address shared sub-problems.
- Entity Disambiguation: Evaluates the likelihood that two entity descriptions refer to the same real-world entity. Does not include the candidate generation and blocking process.
- Entity Linking: Identifies spans of text that match an entity in an existing Knowledge Base.
- Named Entity Recognition: Identifies spans of text that can be labeled as Named Entities. Some NER systems also use Knowledge Bases or Disambiguation.
TL;DR: This paragraph in “Entity Matching using Large Language Models” is a good summary of the current status of Entity Matching in 2024:
“We can summarize the high-level implications of our findings concerning the selection of matching techniques in the following rules of thumb: For use cases that do not involve many unseen entities and for which a decent amount of training data is available, PLM-based matchers are a suitable option which does not require much compute due to the smaller size of the models. For use cases that involve a relevant amount of unseen entities and for which it is costly to gather and maintain a decent size training set, LLM-based matchers should be preferred due to their high zero-shot performance and ability to generalize to unseen entities. If using the best performing hosted LLMs is not an option due to their high usage costs, fine-tuning a cheaper hosted model is an alternative that can deliver a similar F1 performance. If using hosted models is not an option due to privacy concerns, using an open-source LLM on local hardware can be an alternative providing a slightly lower F1 performance given that some task-specific training data or domain-specific matching rules are available.”
Ralph Peeters & Christian Bizer, “Entity Matching using Large Language Models”, 2024
A more detailed explanation of Entity Matching can be found here.
When to use AI for Entity Matching
- If a human can find a match with the presented data, it is likely that a modern AI approach can too
- If a human struggles to find a match, AI is unlikely to help
- If the model is not given enough information, AI won't be able to find the match, even if a human with additional internal knowledge could
You can run the following code on Google Colab by clicking here.
Setup
# Downloading spacy english package
!python -m spacy download en_core_web_sm
# Creating data for training/few-shot and evaluation
synt_data = [
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "John A. Smith", "entity_2": "John A. Smith", "match": "yes"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "Emily J. Clarke", "entity_2": "Emily T. Clarke", "match": "no"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "Sarah M. Johnson", "entity_2": "Sarah Marie Johnson", "match": "yes"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "James P. Miller", "entity_2": "James P. Miles", "match": "no"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "Michael O'Leary", "entity_2": "Michael OLeary", "match": "yes"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "Nancy L. Wright", "entity_2": "Nancy W. Wright", "match": "no"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "Catherine G. Davis", "entity_2": "Catherine Grace Davis", "match": "yes"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "Richard A. Lee", "entity_2": "Richard A. Lin", "match": "no"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "Robert K. Brown", "entity_2": "Robert K. Brown Jr.", "match": "yes"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "Karen M. Harris", "entity_2": "Karen M. Harrison", "match": "no"},
]
real_data = [
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "Spann-Wilder, Tiffany", "entity_2": "Tiffany Spann-Wilder", "match": "yes"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "Landon C. Dais", "entity_2": "Landon Dais", "match": "yes"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "Giglio JA", "entity_2": "Jodi Giglio", "match": "yes"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "Brown, M", "entity_2": "Marla Gallo Brown", "match": "yes"},
    {"chamber_1": "house", "chamber_2": "house", "entity_1": "J.T. 'Jabo' Waggoner ", "entity_2": "Jabo Waggoner", "match": "yes"},
]
Language Models
Based on recent research, this is currently the best-performing approach to Entity Disambiguation. It usually consists of building a set of examples and assembling a prompt, then passing an unseen pair to the Language Model and parsing the response. There is also some exploration into using LLMs to enhance existing Entity Linking models.
Advantages
- Easy to use: create examples, compose a prompt, call an API, and parse the results.
- Performs better than older models trained specifically for entity matching
- May not require structuring the data as much as other options; collapsing all the information into a single string can be enough.
- The LM approach can also generate explanations and categories. Categories can be used to find common error cases (year mismatches, swapped words, etc.)
Disadvantages
- Prompting is very sensitive to small changes
- The most powerful models are paid or hard to run
- No candidate generation and blocking (can be done with other tools)
Best LM Configuration - based on “Entity Matching using Large Language Models”:
- Model: GPT-4
- Prompt: Domain specific, complex prompt, free-form response (use regex to find ‘yes’)
- Few Shot: Yes, related examples
- Fine-tuning helps, especially for smaller models, without losing generalization
# Loading libraries
from openai import OpenAI
from dotenv import load_dotenv
# Loading env and openai client
load_dotenv(".env")  # Create an ENV file with OPENAI_API_KEY
client = OpenAI()
# Building prompts for few shot learning
task_description = "Do the two legislator names refer to the same real-world legislator?"
demonstration = "legislator_1: '{entity_1}'\nlegislator_2: '{entity_2}'"
# Creating examples for few shot learning
few_shot_messages = []
# System Prompt
few_shot_messages.append({"role": "system", "content": "You are a helpful assistant."})
# Few shot Examples
for example in synt_data:
    # Adds description
    few_shot_messages.append({"role": "user", "content": task_description})
    # Adds example
    few_shot_messages.append({"role": "user", "content": demonstration.format(entity_1=example["entity_1"], entity_2=example["entity_2"])})
    # Adds correct answer
    few_shot_messages.append({"role": "assistant", "content": example["match"]})
# Running the model for each real example
for sample in real_data:
    messages = few_shot_messages.copy()
    messages.append({"role": "user", "content": task_description})
    messages.append({"role": "user", "content": demonstration.format(entity_1=sample["entity_1"], entity_2=sample["entity_2"])})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )
    print("Entity 1: ", sample["entity_1"])
    print("Entity 2: ", sample["entity_2"])
    print("Same Entity: ", response.choices[0].message.content)
Entity 1: Spann-Wilder, Tiffany
Entity 2: Tiffany Spann-Wilder
Same Entity: yes
Entity 1: Landon C. Dais
Entity 2: Landon Dais
Same Entity: yes
Entity 1: Giglio JA
Entity 2: Jodi Giglio
Same Entity: No, the two names 'Giglio JA' and 'Jodi Giglio' do not seem to refer to the same real-world legislator. 'Giglio JA' is likely an abbreviation or a format that includes initials, and 'Jodi Giglio' is a full name. Without additional context, they are likely different individuals.
Entity 1: Brown, M
Entity 2: Marla Gallo Brown
Same Entity: no
Entity 1: J.T. 'Jabo' Waggoner
Entity 2: Jabo Waggoner
Same Entity: yes
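Note the Giglio response above: the model answered in free form instead of a bare "yes"/"no". As the configuration section suggests, a regex pass over the response keeps parsing robust. A minimal sketch (the parse_match helper below is ours, not part of any library):

import re

def parse_match(response_text: str) -> str:
    # Find a standalone 'yes' or 'no' in the reply; default to 'no'
    m = re.search(r"\b(yes|no)\b", response_text.lower())
    return m.group(1) if m else "no"

print(parse_match("yes"))                                        # yes
print(parse_match("No, the two names do not seem to refer..."))  # no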
BERT-type Models Fine-Tuned for Entity Tasks
An older and smaller generation of Language Models, trained on general data but fine-tuned for entity tasks. Though some are trained for other tasks, most are trained for entity linking, which requires defining a Trie or Knowledge Base (usually WikiData). Changing the KB may require retraining the model.
Advantages
- The Blocking process is performed automatically for the existing Trie/Knowledge Base
- Good performance for data within training distribution
- May not require additional training (depending on the model)
- You can run it locally at no cost
Disadvantages
- Hard to train from scratch, which is recommended if your data is out of distribution
- Limited out-of-distribution performance out of the box
- Need to build a custom Trie/KB or use generic KBs like WikiData (see the constrained-decoding sketch after the GENRE example below)
Recommended Models
Facebook GENRE (using GENRE's default Trie)
# Loading libraries
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Loading model and tokenizers
tokenizer = AutoTokenizer.from_pretrained("facebook/genre-linking-aidayago2")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/genre-linking-aidayago2").eval()
# Processing text
sentences = []
for example in real_data:
    sentences.append(f"[START_ENT] {example['entity_1']} [END_ENT]")
    sentences.append(f"[START_ENT] {example['entity_2']} [END_ENT]")
# Running inference
num_beams = 3
outputs = model.generate(
    **tokenizer(
        sentences,
        return_tensors="pt",
        padding=True,
        truncation=True,
    ),
    num_beams=num_beams,
    num_return_sequences=num_beams,
    # OPTIONAL: use constrained beam search
    # prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
)
preds = tokenizer.batch_decode(outputs, skip_special_tokens=True)
# Printing results
i = 0
for example in real_data:
    print("Entity 1: ", example['entity_1'])
    print("Entity 1 KB Candidates: ", preds[i:i+num_beams])
    print("Entity 2: ", example['entity_2'])
    print("Entity 2 KB Candidates: ", preds[i+num_beams:i+(num_beams*2)])
    print("--")
    i += num_beams * 2
/Users/santiagovelez/anaconda3/envs/exp/lib/python3.10/site-packages/transformers/generation/utils.py:1168: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
warnings.warn(
Entity 1: Spann-Wilder, Tiffany
Entity 1 KB Candidates: ['Tiffany Spann-Wilder', 'Spann-Wilder, Tiffany', 'Tiffany Spann-Wilder, Tiffany']
Entity 2: Tiffany Spann-Wilder
Entity 2 KB Candidates: ['Tiffany Spann-Wilder', 'Tiffany Spann- Wilder', 'Tiffani Spann-Wilder']
--
Entity 1: Landon C. Dais
Entity 1 KB Candidates: ['Landon C. Dais', 'Landon Dais', 'Landon C.Dais']
Entity 2: Landon Dais
Entity 2 KB Candidates: ['Landon Dais', 'LandonDais', 'Landon dais']
--
Entity 1: Giglio JA
Entity 1 KB Candidates: ['Juan Antonio Giglio', 'Antonio Giglio', 'Jorge Antonio Giglio']
Entity 2: Jodi Giglio
Entity 2 KB Candidates: ['Jodi Giglio', 'Jodi Giglio (actress)', 'Jodi Giglio (singer)']
--
Entity 1: Brown, M
Entity 1 KB Candidates: ['Mark Brown (American football)', 'Michael J. Brown', 'Michael Brown (American football)']
Entity 2: Marla Gallo Brown
Entity 2 KB Candidates: ['Marla Gallo Brown', 'Marla Gallo-Brown', 'Marla Gallo']
--
Entity 1: J.T. 'Jabo' Waggoner
Entity 1 KB Candidates: ['J. T. Waggoner', "J. T. 'Jabo' Waggoner", 'Jabo Waggoner']
Entity 2: Jabo Waggoner
Entity 2 KB Candidates: ['Jabo Waggoner', 'Jabo Waggoner', 'Jambo Waggoner']
--
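The commented-out prefix_allowed_tokens_fn above is where GENRE's constrained beam search plugs in: decoding is restricted to names that exist in your KB, so the model can only output valid entities. The GENRE repo ships its own Trie class; the sketch below hand-rolls a token-id trie instead, and the prefix handling (skipping the decoder start token) is an assumption that matches BART-style models like this checkpoint:

# Sketch: constrain generation to a closed set of KB names (assumptions noted above)
kb_names = ["Jodi Giglio", "Jabo Waggoner", "Tiffany Spann-Wilder"]

# Build a token-id trie over the allowed names; tokenizer(name).input_ids
# includes the BOS/EOS tokens for BART-style models.
trie = {}
for name in kb_names:
    node = trie
    for tok in tokenizer(name).input_ids:
        node = node.setdefault(tok, {})

def allowed_tokens(batch_id, sent):
    # Walk the trie along the tokens generated so far
    # (sent[0] is the decoder start token, which we skip).
    node = trie
    for tok in sent.tolist()[1:]:
        if tok not in node:
            return [tokenizer.eos_token_id]
        node = node[tok]
    return list(node.keys()) or [tokenizer.eos_token_id]

inputs = tokenizer(["[START_ENT] Giglio JA [END_ENT]"], return_tensors="pt")
constrained = model.generate(**inputs, num_beams=3,
                             prefix_allowed_tokens_fn=allowed_tokens)
print(tokenizer.batch_decode(constrained, skip_special_tokens=True))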
Matching Systems with Small to No Training
This group encompasses traditional Entity Matching techniques that use similarity metrics or classic ML models to evaluate how likely two entities are to match. Some of these approaches require labeled examples to train models, though training is mostly managed by the library. This approach is popular for tabular datasets with multiple attributes.
Advantages
- Simpler matching algorithms (though libraries can be outdated)
- Usually includes candidate generation and blocking
- Fast
Disadvantages
- Worse performance in Entity Disambiguation compared to LM-based approaches
- Requires training data for some of the algorithms
- Some libraries are old and have not been updated for some time
Matching Algorithms Examples
- Jaccard similarity
- Levenshtein distance
- Cosine similarity of vectors
- Random Forest Similarity Classifier
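Before handing everything to a library, it helps to see two of these metrics hand-rolled on one of our real pairs:

# Hand-rolled sketch of two classic similarity metrics
def jaccard(a: str, b: str) -> float:
    # Jaccard similarity over whitespace tokens: |A & B| / |A | B|
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

print(jaccard("Landon C. Dais", "Landon Dais"))      # 0.667: 2 shared tokens of 3
print(levenshtein("Landon C. Dais", "Landon Dais"))  # 3: delete 'C', '.', ' '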
py_entitymatching + Random Forest
# Loading libraries
import pandas as pd
import py_entitymatching as em
# Formatting data in required library configuration
# Train set
df = pd.DataFrame(synt_data)
df = df.reset_index(names="id").reset_index(names="ltable_id").reset_index(names="rtable_id")
df["match"] = df["match"].apply(lambda x: 1 if x == "yes" else 0)
A = df[['ltable_id', 'chamber_1', 'entity_1']].rename({'ltable_id': 'id', 'chamber_1': 'chamber', 'entity_1': 'entity'}, axis=1)
B = df[['rtable_id', 'chamber_2', 'entity_2']].rename({'rtable_id': 'id', 'chamber_2': 'chamber', 'entity_2': 'entity'}, axis=1)
C = df[["id", "ltable_id", "rtable_id", "match"]]

# Test set
test_df = pd.DataFrame(real_data)
test_df = test_df.reset_index(names="id").reset_index(names="ltable_id").reset_index(names="rtable_id")
test_df["match"] = test_df["match"].apply(lambda x: 1 if x == "yes" else 0)
X = test_df[['ltable_id', 'chamber_1', 'entity_1']].rename({'ltable_id': 'id', 'chamber_1': 'chamber', 'entity_1': 'entity'}, axis=1)
Y = test_df[['rtable_id', 'chamber_2', 'entity_2']].rename({'rtable_id': 'id', 'chamber_2': 'chamber', 'entity_2': 'entity'}, axis=1)
Z = test_df[["id", "ltable_id", "rtable_id", "match"]]
# Registering tables metadata in the library
em.set_key(A, 'id')
em.set_key(B, 'id')
em.set_key(C, 'id')
em.set_ltable(C, A)
em.set_rtable(C, B)

em.set_key(X, 'id')
em.set_key(Y, 'id')
em.set_key(Z, 'id')
em.set_ltable(Z, X)
em.set_rtable(Z, Y)
True
# Creating and blocking candidate matches (just for display)
ob = em.OverlapBlocker()
C = ob.block_tables(A, B, 'entity', 'entity',
                    l_output_attrs=['entity', 'chamber'],
                    r_output_attrs=['entity', 'chamber'],
                    overlap_size=1, show_progress=False)
/Users/santiagovelez/anaconda3/envs/exp/lib/python3.10/site-packages/py_entitymatching/blocker/overlap_blocker.py:258: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
l_df[l_dummy_overlap_attr] = l_df[l_overlap_attr]
/Users/santiagovelez/anaconda3/envs/exp/lib/python3.10/site-packages/py_entitymatching/blocker/overlap_blocker.py:259: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
r_df[r_dummy_overlap_attr] = r_df[r_overlap_attr]
/Users/santiagovelez/anaconda3/envs/exp/lib/python3.10/site-packages/py_entitymatching/blocker/overlap_blocker.py:615: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
table[overlap_attr] = values
# Workaround - Saving labeled data so we are able to load it in the library with the right metadata
df = df.rename({'id': '_id', 'chamber_1': 'ltable_chamber', 'chamber_2': 'rtable_chamber',
                'entity_1': 'ltable_entity', 'entity_2': 'rtable_entity', 'match': 'gold'}, axis=1)
df.to_csv("temp_df.csv")
test_df = test_df.rename({'id': '_id', 'chamber_1': 'ltable_chamber', 'chamber_2': 'rtable_chamber',
                          'entity_1': 'ltable_entity', 'entity_2': 'rtable_entity', 'match': 'gold'}, axis=1)
test_df.to_csv("temp_test_df.csv")
# Load training data with metadata
G = em.read_csv_metadata("temp_df.csv",
                         key='_id',
                         ltable=A, rtable=B,
                         fk_ltable='ltable_id', fk_rtable='rtable_id')
# Load test data with metadata
Z = em.read_csv_metadata("temp_test_df.csv",
                         key='_id',
                         ltable=X, rtable=Y,
                         fk_ltable='ltable_id', fk_rtable='rtable_id')
Metadata file is not present in the given path; proceeding to read the csv file.
Metadata file is not present in the given path; proceeding to read the csv file.
# Selecting features for Entity Matching Model
feature_table = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)
# Building feature vectors for entity matching model
# Select the attrs. to be included in the feature vector table
attrs_from_table = ['ltable_entity', 'ltable_chamber',
                    'rtable_entity', 'rtable_chamber']
# Convert the labeled data to feature vectors using the feature table
H = em.extract_feature_vecs(G,
                            feature_table=feature_table,
                            attrs_before=attrs_from_table,
                            attrs_after='gold',
                            show_progress=False)
/Users/santiagovelez/anaconda3/envs/exp/lib/python3.10/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
return bound(*args, **kwds)
# Training entity matching model with training data
# Instantiate the RF Matcher
rf = em.RFMatcher()
# Get the attributes to be projected while training
attrs_to_be_excluded = []
attrs_to_be_excluded.extend(['_id', 'ltable_id', 'rtable_id', 'gold'])
attrs_to_be_excluded.extend(attrs_from_table)
# Train using feature vectors from the labeled data.
rf.fit(table=H, exclude_attrs=attrs_to_be_excluded, target_attr='gold')
# Prepare test data for inference
# Select the attrs. to be included in the feature vector table
attrs_from_table = ['ltable_entity', 'ltable_chamber',
                    'rtable_entity', 'rtable_chamber']
# Convert the candidate set to feature vectors using the feature table
L = em.extract_feature_vecs(Z, feature_table=feature_table,
                            attrs_before=attrs_from_table,
                            show_progress=False, n_jobs=-1)
# Get the attributes to be excluded while predicting
attrs_to_be_excluded = []
attrs_to_be_excluded.extend(['_id', 'ltable_id', 'rtable_id'])
attrs_to_be_excluded.extend(attrs_from_table)
/Users/santiagovelez/anaconda3/envs/exp/lib/python3.10/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
return bound(*args, **kwds)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
# Predict the matches on inference data
predictions = rf.predict(table=L, exclude_attrs=attrs_to_be_excluded,
                         append=True, target_attr='predicted', inplace=False)
predictions.head()
|   | _id | ltable_id | rtable_id | ltable_entity | ltable_chamber | rtable_entity | rtable_chamber | id_id_exm | id_id_anm | id_id_lev_dist | ... | chamber_chamber_jac_qgm_3_qgm_3 | entity_entity_jac_qgm_3_qgm_3 | entity_entity_cos_dlm_dc0_dlm_dc0 | entity_entity_jac_dlm_dc0_dlm_dc0 | entity_entity_mel | entity_entity_lev_dist | entity_entity_lev_sim | entity_entity_nmw | entity_entity_sw | predicted |
|---|-----|-----------|-----------|---------------|----------------|---------------|----------------|-----------|-----------|----------------|-----|------|------|------|------|------|------|------|------|------|-----------|
| 0 | 0 | 0 | 0 | Spann-Wilder, Tiffany | house | Tiffany Spann-Wilder | house | 1 | 0.0 | 0.0 | ... | 1.0 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.0 | 1.000000 | 13.0 | 13.0 | 1 |
| 1 | 1 | 1 | 1 | Landon C. Dais | house | Landon Dais | house | 1 | 1.0 | 0.0 | ... | 1.0 | 0.700000 | 0.666667 | 0.500000 | 0.973333 | 1.0 | 0.933333 | 14.0 | 14.0 | 0 |
| 2 | 2 | 2 | 2 | Giglio JA | house | Jodi Giglio | house | 1 | 1.0 | 0.0 | ... | 1.0 | 0.625000 | 0.666667 | 0.500000 | 0.945395 | 4.0 | 0.789474 | 12.0 | 12.0 | 1 |
| 3 | 3 | 3 | 3 | Brown, M | house | Marla Gallo Brown | house | 1 | 1.0 | 0.0 | ... | 1.0 | 0.571429 | 0.666667 | 0.500000 | 0.959048 | 2.0 | 0.866667 | 12.0 | 12.0 | 0 |
| 4 | 4 | 4 | 4 | J.T. 'Jabo' Waggoner | house | Jabo Waggoner | house | 1 | 1.0 | 0.0 | ... | 1.0 | 0.736842 | 0.500000 | 0.333333 | 0.986667 | 1.0 | 0.933333 | 13.0 | 13.0 | 1 |

5 rows × 26 columns
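To score these predictions against the gold labels, py_entitymatching's evaluation helpers can be used. A sketch, under the assumption that the 'gold' column was carried into L by passing attrs_after='gold' when extracting the test feature vectors above:

# Sketch: evaluating predictions; requires 'gold' in the predictions table
eval_result = em.eval_matches(predictions, 'gold', 'predicted')
em.print_eval_summary(eval_result)  # precision, recall, F1, false pos/neg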
Sources
Python Libraries
- https://github.com/wbsg-uni-mannheim/MatchGPT/blob/main/LLMForEM
- https://huggingface.co/facebook/mgenre-wiki
- https://github.com/explosion/projects/tree/v3/tutorials/nel_emerson
- https://pypi.org/project/spacy-entity-linker/
- https://huggingface.co/shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher?text=fuzzformer
- https://github.com/megagonlabs/ditto
- https://dedupe.io
- https://github.com/anhaidgroup/deepmatcher
- https://nbviewer.org/github/anhaidgroup/py_entitymatching
- https://github.com/facebookresearch/GENRE
- https://github.com/Babelscape/multinerd?tab=readme-ov-file
- https://github.com/SapienzaNLP/extend?tab=readme-ov-file
- https://github.com/Lucaterre/spacyfishing
Papers
- LLMAEL: Large Language Models are Good Context Augmenters for Entity Linking
- Entity Matching using Large Language Models
- EntGPT: Linking Generative Large Language Models with Knowledge Bases
- On Leveraging Large Language Models for Enhancing Entity Resolution
- Using ChatGPT for Entity Matching
- “Is This You?” Entity Matching in the Modern Data Stack with Large Language Models
- DeepType: Multilingual Entity Linking by Neural Type System Evolution
- MultiNERD: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation)
- Autoregressive Entity Retrieval
- Deep Entity Matching with Pre-Trained Language Models
- https://github.com/sebastianruder/NLP-progress/blob/master/english/entity_linking.md
- https://openai.com/index/discovering-types-for-entity-disambiguation/