Probabilistic Record Linkage Using Pretrained Text Embeddings

Usage

fuzzylink(
  dfA,
  dfB,
  by,
  blocking.variables = NULL,
  verbose = TRUE,
  record_type = "entity",
  instructions = NULL,
  model = "gpt-3.5-turbo-instruct",
  openai_api_key = Sys.getenv("OPENAI_API_KEY"),
  embedding_dimensions = 256,
  embedding_model = "text-embedding-3-large",
  fmla = match ~ sim + jw,
  max_validations = 1e+05,
  p = c(0.1, 0.95),
  k = 0,
  parallel = TRUE,
  return_all_pairs = FALSE
)

Arguments

dfA, dfB: A pair of data frames or data frame extensions (e.g. tibbles)
by: A character string giving the name of the variable to use for fuzzy matching
blocking.variables: A character vector of variables that must match exactly for two records to be considered a match
verbose: TRUE to print progress updates, FALSE for no output
record_type: A character describing what type of entity the by variable represents. Should be a singular noun (e.g. "person", "organization", "interest group", "city").
instructions: A string containing additional instructions to include in the LLM prompt during validation.
model: Which LLM to prompt when validating matches; defaults to 'gpt-3.5-turbo-instruct'
openai_api_key: Your OpenAI API key. By default, looks for a system environment variable called "OPENAI_API_KEY" (recommended option). Otherwise, it will prompt you to enter the API key as an argument.
embedding_dimensions: The dimension of the embedding vectors to retrieve. Defaults to 256
embedding_model: Which pretrained embedding model to use; defaults to 'text-embedding-3-large' (OpenAI), but will also accept 'mistral-embed' (Mistral).
fmla: By default, a logistic regression model predicts whether two records match as a linear combination of embedding similarity and Jaro-Winkler similarity (match ~ sim + jw). Change this input for alternate specifications.
max_validations: The maximum number of LLM prompts to submit during the validation stage. Defaults to 100,000
p: The range of estimated match probabilities within which fuzzylink() will validate record pairs using an LLM prompt. Defaults to c(0.1, 0.95)
k: Number of nearest neighbors to validate for records in dfA with no identified matches. Higher values may improve recall at the expense of precision. Defaults to 0
parallel: TRUE to submit API requests in parallel. Setting to FALSE can reduce rate limit errors at the expense of longer runtime.
return_all_pairs: If TRUE, returns every within-block record pair from dfA and dfB, not just validated pairs. Defaults to FALSE.
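
To make the roles of fmla and p more concrete, here is a minimal base R sketch. It is not the package's internal code: the pairs data frame, its simulated sim and jw scores, the match_probability column name, and the inclusive treatment of the bounds in p are all assumptions made for illustration.

```r
# Toy labeled record pairs. The sim and jw values are simulated stand-ins
# for embedding similarity and Jaro-Winkler similarity, not package output.
set.seed(42)
n <- 200
pairs <- data.frame(
  sim = runif(n),
  jw  = runif(n)
)

# Simulate match labels whose log-odds rise with both similarity scores
pairs$match <- rbinom(n, 1, plogis(-6 + 7 * pairs$sim + 4 * pairs$jw))

# The default specification, fmla = match ~ sim + jw, as a logistic regression
fit <- glm(match ~ sim + jw, data = pairs, family = binomial)

# An alternate specification one might pass instead (here, an interaction)
fit_alt <- glm(match ~ sim * jw, data = pairs, family = binomial)

# Estimated match probability for each pair (hypothetical column name)
pairs$match_probability <- predict(fit, type = "response")

# With p = c(0.1, 0.95), only pairs inside this range would be candidates
# for LLM validation (inclusive bounds are an assumption)
p <- c(0.1, 0.95)
in_range <- pairs$match_probability >= p[1] & pairs$match_probability <= p[2]
to_validate <- pairs[in_range, ]
nrow(to_validate)
```

Narrowing p would shrink to_validate and so reduce the number of LLM prompts, which is the same budget that max_validations caps.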

Value

A data frame with all rows of dfA joined with any matches from dfB.
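
The shape of this return value resembles a left join. The base R sketch below illustrates that idea only. The matches table and its column names A and B are hypothetical and are not taken from the package's actual output.

```r
# A small dfA with one record that has no match
dfA <- data.frame(
  name = c("Alabama", "Alaska", "Arizona"),
  population = c(3615, 365, 2212)
)

# Hypothetical table of validated matches (column names are assumptions)
matches <- data.frame(
  A = c("Alabama", "Alaska"),
  B = c("AL", "AK")
)

# all.x = TRUE keeps every row of dfA; unmatched rows get NA for B
result <- merge(dfA, matches, by.x = "name", by.y = "A", all.x = TRUE)
result
```

Rows of dfA without a match are retained with missing values in the columns that would come from dfB.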

Examples

dfA <- data.frame(state.x77)
dfA$name <- rownames(dfA)
dfB <- data.frame(name = state.abb, state.division)
df <- fuzzylink(dfA, dfB,
                by = 'name',
                record_type = 'US state government',
                instructions = 'The first dataset contains full US state names. The second dataset contains US postal codes.')
#> Retrieving 100 embeddings (4:24:20 PM)
#> 
#> Computing similarity matrix (4:24:21 PM)
#> 
#> Labeling training set (4:24:21 PM)
#> 
#> Fitting model (4:24:23 PM)
#> 
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Linking datasets (4:24:23 PM)
#> 
#> Validating 18 matches (4:24:23 PM)
#> 
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Done! (4:24:24 PM)
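
The example above does not use blocking.variables. As a rough illustration of what exact-match blocking does to the candidate set, the base R sketch below builds the within-block pairs for two toy data frames. The data and the use of merge() are assumptions for illustration, not the package's implementation.

```r
# Toy data: organization names with a state column to block on
dfA <- data.frame(
  name  = c("Acme Corp", "Globex", "Initech"),
  state = c("NY", "CA", "TX")
)
dfB <- data.frame(
  name  = c("ACME Corporation", "Globex Inc.", "Initech LLC", "Acme Co"),
  state = c("NY", "CA", "TX", "CA")
)

# Joining on the blocking variable yields every within-block record pair
candidates <- merge(dfA, dfB, by = "state", suffixes = c(".A", ".B"))
candidates

# Compare the number of candidate pairs with and without blocking
nrow(candidates)        # within-block pairs
nrow(dfA) * nrow(dfB)   # all possible pairs
```

Here blocking on state leaves 4 candidate pairs out of 12 possible ones, so "Acme Corp" (NY) is never compared with "Acme Co" (CA). A comparable call would presumably pass blocking.variables = 'state' to fuzzylink().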