Skip to contents

Probabilistic Record Linkage Using Pretrained Text Embeddings

Usage

fuzzylink(
  dfA,
  dfB,
  by,
  blocking.variables = NULL,
  verbose = TRUE,
  record_type = "entity",
  instructions = NULL,
  model = "gpt-3.5-turbo-instruct",
  openai_api_key = Sys.getenv("OPENAI_API_KEY"),
  embedding_dimensions = 256,
  embedding_model = "text-embedding-3-large",
  fmla = match ~ sim + jw,
  max_validations = 1e+05,
  p = c(0.1, 0.95),
  k = 20,
  parallel = TRUE,
  return_all_pairs = FALSE
)

Arguments

dfA, dfB

A pair of data frames or data frame extensions (e.g. tibbles)

by

A character denoting the name of the variable to use for fuzzy matching

blocking.variables

A character vector of variables that must match exactly in order to match two records

verbose

TRUE to print progress updates, FALSE for no output

record_type

A character describing what type of entity the by variable represents. Should be a singular noun (e.g. "person", "organization", "interest group", "city").

instructions

A string containing additional instructions to include in the LLM prompt during validation.

model

Which LLM to prompt when validating matches; defaults to 'gpt-3.5-turbo-instruct'

openai_api_key

Your OpenAI API key. By default, looks for a system environment variable called "OPENAI_API_KEY" (recommended option). Otherwise, it will prompt you to enter the API key as an argument.

embedding_dimensions

The dimension of the embedding vectors to retrieve. Defaults to 256

embedding_model

Which pretrained embedding model to use; defaults to 'text-embedding-3-large' (OpenAI), but will also accept 'mistral-embed' (Mistral).

fmla

By default, logistic regression model predicts whether two records match as a linear combination of embedding similarity and Jaro-Winkler similarity (match ~ sim + jw). Change this input for alternate specifications.

max_validations

The maximum number of LLM prompts to submit during the validation stage. Defaults to 100,000

p

The range of estimated match probabilities within which fuzzylink() will validate record pairs using an LLM prompt. Defaults to c(0.1, 0.95)

k

Number of nearest neighbors to validate for records in dfA with no identified matches. Higher values may improve recall at expense of precision. Defaults to 20

parallel

TRUE to submit API requests in parallel. Setting to FALSE can reduce rate limit errors at the expense of longer runtime.

return_all_pairs

If TRUE, returns every within-block record pair from dfA and dfB, not just validated pairs. Defaults to FALSE.

Value

A dataframe with all rows of dfA joined with any matches from dfB

Examples

dfA <- data.frame(state.x77)
dfA$name <- rownames(dfA)
dfB <- data.frame(name = state.abb, state.division)
df <- fuzzylink(dfA, dfB,
                by = 'name',
                record_type = 'US state government',
                instructions = 'The first dataset contains full US state names. The second dataset contains US postal codes.')
#> Retrieving 100 embeddings (11:40:02 AM)
#> 
#> Computing similarity matrix (11:40:03 AM)
#> 
#> Labeling training set (11:40:03 AM)
#> 
#> Fitting model (11:40:05 AM)
#> 
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Linking datasets (11:40:05 AM)
#> 
#> Validating 17 matches (11:40:05 AM)
#> 
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Done! (11:40:05 AM)