
Generate pairwise comparisons from a set of documents
Samples unique pairs of documents to be used as input to
annotate_comparisons(). Each row of the returned tibble
represents one comparison between two documents.
Usage
generate_comparisons(
documents,
n_train = 10000,
n_test = 5000,
prop = NULL,
holdout = NULL,
blocks = NULL,
seed = NULL
)
Arguments
- documents
A character vector of documents to compare.
- n_train
Maximum number of unique pairs to sample from the training set. Defaults to
10000. Set to NULL to return all possible training pairs. When no split is active, this limits the overall sample.
- n_test
Maximum number of unique pairs to sample from the test set. Defaults to
5000. Set to NULL to return all possible test pairs. Only used when a split is active (prop or holdout).
- prop
Optional proportion of documents assigned to the training set (e.g.
0.8). When supplied, a split column is added to the result. Cannot be used together with holdout.
- holdout
Optional logical vector the same length as
documents. TRUE marks a document for the test set, FALSE for the training set. When supplied, a split column is added to the result. Cannot be used together with prop.
- blocks
Optional vector (character, factor, or integer) the same length as
documents. When supplied, only within-block pairs are generated and a block column is included in the output. Blocks with fewer than 2 documents are skipped.
- seed
Optional integer random seed for reproducibility.
Value
A tibble with columns doc_id_a, doc_id_b, text_a, and
text_b. doc_id_a and doc_id_b are integer row indices into
documents. When a split is active, an additional split column
contains "train" or "test". When blocks is supplied, a
block column identifies which block each pair belongs to.
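To make the shape of the return value concrete, the following base R sketch builds a comparable table of all unique pairs with utils::combn(). It is an illustration only and not the package's implementation: the helper name all_pairs_sketch is hypothetical, it returns a plain data frame rather than a tibble, and its row order may differ from what generate_comparisons() produces.

```r
# Hypothetical helper (not part of the package): enumerate every unique
# unordered pair of documents and look up each text by its index.
all_pairs_sketch <- function(documents) {
  stopifnot(is.character(documents), length(documents) >= 2)
  idx <- utils::combn(length(documents), 2)  # 2 x choose(n, 2) matrix of indices
  data.frame(
    doc_id_a = idx[1, ],
    doc_id_b = idx[2, ],
    text_a = documents[idx[1, ]],
    text_b = documents[idx[2, ]],
    stringsAsFactors = FALSE
  )
}

docs <- c("The quick brown fox", "A lazy dog", "Hello world", "Foo bar")
pairs <- all_pairs_sketch(docs)

# doc_id_a and doc_id_b are row indices into the original vector,
# so the text columns can always be recovered from them.
identical(pairs$text_a, docs[pairs$doc_id_a])
identical(pairs$text_b, docs[pairs$doc_id_b])
```

Because the ids are plain indices, any per-document metadata kept alongside documents can be joined back onto the pairs by position.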
Details
When prop or holdout is supplied, the result includes a split
column ("train" / "test") that
fit_model() and validate_model() respect
automatically.
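If the pairs need to be inspected or processed per split outside of fit_model() and validate_model(), the split column can be filtered like any other column. The sketch below uses a hand-built data frame as a stand-in for a real result (the values are copied from the holdout example on this page), so it runs without the package installed.

```r
# Stand-in for a result with an active split; a real call would return a tibble.
pairs <- data.frame(
  doc_id_a = c(1L, 3L),
  doc_id_b = c(2L, 4L),
  text_a = c("The quick brown fox", "Hello world"),
  text_b = c("A lazy dog", "Foo bar"),
  split = c("train", "test"),
  stringsAsFactors = FALSE
)

# Separate the two sets by the split label
train_pairs <- pairs[pairs$split == "train", ]
test_pairs <- pairs[pairs$split == "test", ]

# Quick check of how many pairs landed in each set
table(pairs$split)
```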
When blocks is supplied, only within-block pairs are generated. Blocks
with fewer than 2 documents are skipped with a message.
When fewer unique pairs exist than the requested n_train or
n_test, all available pairs are returned with a message.
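Whether n_train or n_test will be capped can be estimated in advance, since n documents yield choose(n, 2) unique unordered pairs. The sketch below uses only base R and rests on one assumption inferred from the examples on this page: train pairs are drawn only among training documents and test pairs only among test documents.

```r
docs <- c("The quick brown fox", "A lazy dog", "Hello world", "Foo bar")

# No blocks, no split: every unordered pair of the n documents
n_all <- choose(length(docs), 2)

# With blocks: count pairs within each block, then sum across blocks.
# Blocks of size 1 contribute choose(1, 2) = 0 pairs.
blocks <- c("a", "a", "b", "b")
n_blocked <- sum(choose(as.vector(table(blocks)), 2))

# With a holdout split (assumption: pairs never mix train and test documents)
holdout <- c(FALSE, FALSE, TRUE, TRUE)
n_train_pairs <- choose(sum(!holdout), 2)
n_test_pairs <- choose(sum(holdout), 2)

c(all = n_all, blocked = n_blocked, train = n_train_pairs, test = n_test_pairs)
```

For these four documents the counts are 6, 2, 1, and 1, which agree with the row counts in the Examples. If a count is smaller than the requested n_train or n_test, the function returns all available pairs with a message, as described above.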
Examples
docs <- c("The quick brown fox", "A lazy dog", "Hello world", "Foo bar")
# All unique pairs
generate_comparisons(docs, n_train = NULL)
#> # A tibble: 6 × 4
#>   doc_id_a doc_id_b text_a              text_b
#>      <int>    <int> <chr>               <chr>
#> 1        1        2 The quick brown fox A lazy dog
#> 2        1        3 The quick brown fox Hello world
#> 3        2        3 A lazy dog          Hello world
#> 4        1        4 The quick brown fox Foo bar
#> 5        2        4 A lazy dog          Foo bar
#> 6        3        4 Hello world         Foo bar
# With an 80/20 train/test split, default sample sizes
generate_comparisons(docs, prop = 0.8, seed = 1)
#> `n_train` (10000) exceeds the number of unique train pairs (3). Using all 3.
#> `n_test` (5000) exceeds the number of unique test pairs (0). Using all 0.
#> # A tibble: 3 × 5
#>   doc_id_a doc_id_b text_a              text_b      split
#>      <int>    <int> <chr>               <chr>       <chr>
#> 1        1        3 The quick brown fox Hello world train
#> 2        1        4 The quick brown fox Foo bar     train
#> 3        3        4 Hello world         Foo bar     train
# With blocks
generate_comparisons(docs, blocks = c("a", "a", "b", "b"), n_train = NULL)
#> # A tibble: 2 × 5
#>   doc_id_a doc_id_b text_a              text_b     block
#>      <int>    <int> <chr>               <chr>      <chr>
#> 1        1        2 The quick brown fox A lazy dog a
#> 2        3        4 Hello world         Foo bar    b
# With user-supplied holdout
generate_comparisons(docs, holdout = c(FALSE, FALSE, TRUE, TRUE))
#> `n_train` (10000) exceeds the number of unique train pairs (1). Using all 1.
#> `n_test` (5000) exceeds the number of unique test pairs (1). Using all 1.
#> # A tibble: 2 × 5
#>   doc_id_a doc_id_b text_a              text_b     split
#>      <int>    <int> <chr>               <chr>      <chr>
#> 1        1        2 The quick brown fox A lazy dog train
#> 2        3        4 Hello world         Foo bar    test