
Measure a latent quantity from text
Runs the full textscale pipeline in a single call: generates pairwise comparisons, retrieves embeddings, annotates pairs with an LLM, fits and validates a model on the train/test split, refits on all comparisons, and returns scores for every document.
Usage
textscale(
  documents,
  prompt,
  prop = 0.8,
  holdout = NULL,
  blocks = NULL,
  n_train = 10000,
  n_test = 5000,
  seed = NULL,
  llm_model = "gpt-4.1-mini",
  embeddings_cache = "textscale_embeddings.rds",
  annotations_cache = "textscale_annotations.rds",
  annotations_path = "textscale_annotations.json",
  parallel = FALSE,
  method = "ridge",
  ci = TRUE,
  ci_method = "laplace",
  validate = TRUE,
  force = FALSE,
  ...
)
Arguments
- documents
A character vector of documents to scale.
- prompt
Instruction text for the LLM annotator. Should be a plain question or directive describing which document should be judged "greater" on the latent dimension; no placeholder syntax is needed. For example: "Which political ad is more negative toward its opponent?". The document text is appended automatically as labelled options A and B.
- prop
Proportion of documents assigned to the training split. Defaults to 0.8. Cannot be used together with holdout.
- holdout
Optional logical vector the same length as documents. TRUE marks a document for the test set, FALSE for the training set. When supplied, overrides prop for train/test assignment. Cannot be used together with prop.
- blocks
Optional vector (character, factor, or integer) the same length as documents. When supplied, only within-block pairs are generated. See generate_comparisons() for details.
- n_train
Maximum number of training comparison pairs to generate. Defaults to 10000.
- n_test
Maximum number of test comparison pairs to generate. Defaults to 5000.
- seed
Integer random seed for reproducible pair sampling. Defaults to NULL.
- llm_model
OpenAI model name used for annotation. Defaults to "gpt-4.1-mini".
- embeddings_cache
File path for caching embeddings as an RDS file. Passed to get_embeddings(). Defaults to "textscale_embeddings.rds" in the current working directory. Set to NULL to disable caching.
- annotations_cache
File path for caching annotations as an RDS file. Passed to annotate_comparisons(). Defaults to "textscale_annotations.rds" in the current working directory. Set to NULL to disable caching.
- annotations_path
File path for checkpointing batch API calls. Passed to annotate_comparisons() as path. Defaults to "textscale_annotations.json" in the current working directory. Set to NULL to disable checkpointing. Ignored when parallel = TRUE.
- parallel
Logical. If FALSE (the default), annotations are submitted via the OpenAI Batch API at 50% of standard prices. Set to TRUE to use ellmer::parallel_chat_text() for immediate results at standard prices.
- method
Fitting method passed to fit_model(). One of "ridge" (default), "lasso", "enet", or "svm".
- ci
Logical. If TRUE, the scores element of the returned object is a tibble with score, lower, and upper columns. Defaults to TRUE.
- ci_method
Method for computing confidence intervals. One of "laplace" (default) or "bootstrap". Passed to score_documents(). Ignored when ci = FALSE.
- validate
Logical. If TRUE (the default), validate_model() is called on the held-out test split and its output is printed.
- force
Logical. If FALSE (the default), the pipeline stops when validation metrics are poor (accuracy < 0.55 or ICI > 0.20) and warns when they are marginal (accuracy < 0.65 or ICI > 0.10). Set force = TRUE to downgrade stops to warnings and continue scoring regardless of validation results. Ignored when validate = FALSE.
- ...
Additional arguments passed to fit_model() (e.g. alpha, nlambda, lambda_min_ratio).
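The stop/warn behaviour described for force can be read as a small gate on two validation metrics. The following is only a sketch of that logic in base R, not the package's own code: the helper name check_validation is hypothetical, and it assumes accuracy and ici are available as single numbers.

```r
# Hypothetical helper mirroring the thresholds documented for `force`.
# Assumes `accuracy` and `ici` are single numeric values from validation.
check_validation <- function(accuracy, ici, force = FALSE) {
  poor     <- accuracy < 0.55 || ici > 0.20
  marginal <- accuracy < 0.65 || ici > 0.10

  # Poor metrics stop the pipeline unless the caller opts in with force = TRUE.
  if (poor && !force) {
    stop("Validation metrics are poor; set force = TRUE to continue.")
  }

  # Marginal metrics (or poor metrics under force = TRUE) only warn.
  if (poor || marginal) {
    warning("Validation metrics are ", if (poor) "poor" else "marginal", ".")
  }

  invisible(if (poor) "poor" else if (marginal) "marginal" else "ok")
}
```

Under this reading, check_validation(0.60, 0.05) warns and continues, while check_validation(0.50, 0.05) stops unless force = TRUE is supplied.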
Value
A textscale_result object (a list) containing:
- scores
Document scores on the latent dimension. A tibble with score, lower, and upper columns when ci = TRUE (the default), or a named numeric vector when ci = FALSE.
- model_final
The textscale_model fit on all comparisons, used to produce scores. Pass to score_documents() to scale new documents.
- model_eval
The textscale_model fit on training comparisons only, used for validation.
- validation
A textscale_validation object from validate_model(), or NULL if validate = FALSE. Call print() on it for metrics and plot() for the calibration plot.
- comparisons
The annotated comparisons tibble.
- embeddings
The document embedding matrix.
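As a rough illustration of how the arguments and return value fit together, the sketch below builds a holdout vector and wraps the documented call. It rests on several assumptions: the package is taken to be named textscale, the helpers make_holdout and scale_ads are hypothetical, and the ads data frame in the commented usage is a placeholder for your own data. Nothing is sent to the API until scale_ads() is actually invoked.

```r
# Hypothetical helper: mark roughly `test_share` of documents for the test set.
# Returns a logical vector (TRUE = test, FALSE = train), as `holdout` expects.
# Note: set.seed() resets the global RNG state.
make_holdout <- function(n, test_share = 0.2, seed = 1L) {
  set.seed(seed)
  test_idx <- sample.int(n, size = round(n * test_share))
  seq_len(n) %in% test_idx
}

# Hypothetical wrapper around the documented call.
# Assumes the package is installed under the name "textscale".
scale_ads <- function(documents, blocks = NULL) {
  textscale::textscale(
    documents,
    prompt  = "Which political ad is more negative toward its opponent?",
    holdout = make_holdout(length(documents)),
    blocks  = blocks,   # only within-block pairs are generated when supplied
    seed    = 42L,      # reproducible pair sampling
    ci      = TRUE      # scores come back with lower/upper columns
  )
}

# Usage (not run; `ads` is a placeholder data frame):
# result <- scale_ads(ads$text, blocks = ads$state)
# head(result$scores)        # score, lower, upper
# print(result$validation)   # held-out metrics
# plot(result$validation)    # calibration plot
```

Supplying holdout without setting prop keeps the call consistent with the note that the two cannot be combined.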