Create a training set
get_training_set.Rd
Creates a training set from a list of similarity matrices and labels it using a zero-shot GPT prompt.
Usage
get_training_set(
sim,
num_bins = 50,
samples_per_bin = 10,
n = 500,
record_type = "entity",
instructions = NULL,
model = "gpt-3.5-turbo-instruct",
openai_api_key = Sys.getenv("OPENAI_API_KEY"),
parallel = TRUE
)
Arguments
- sim
A matrix of similarity scores
- num_bins
Number of bins to split similarity scores for stratified random sampling (defaults to 50)
- samples_per_bin
Number of string pairs to sample from each bin (defaults to 5)
- n
Sample size for the training dataset
- record_type
A character describing what type of entity the rows and columns of
sim
represent. Should be a singular noun (e.g. "person", "organization", "interest group", "city").- instructions
A string containing additional instructions to include in the LLM prompt during validation.
- model
Which OpenAI model to prompt; defaults to 'gpt-3.5-turbo-instruct'
- openai_api_key
Your OpenAI API key. By default, looks for a system environment variable called "OPENAI_API_KEY" (recommended option). Otherwise, it will prompt you to enter the API key as an argument.
- parallel
TRUE to submit API requests in parallel. Setting to FALSE can reduce rate limit errors at the expense of longer runtime.