Samples unique pairs of documents to be used as input to annotate_comparisons(). Each row of the returned tibble represents one comparison between two documents.

Usage

generate_comparisons(
  documents,
  n_train = 10000,
  n_test = 5000,
  prop = NULL,
  holdout = NULL,
  blocks = NULL,
  seed = NULL
)

Arguments

documents

A character vector of documents to compare.

n_train

Maximum number of unique pairs to sample from the training set. Defaults to 10000. Set to NULL to return all possible training pairs. When no split is active, this limits the overall sample.

n_test

Maximum number of unique pairs to sample from the test set. Defaults to 5000. Set to NULL to return all possible test pairs. Only used when a split is active, that is, when prop or holdout is supplied.

prop

Optional proportion of documents assigned to the training set (e.g. 0.8). When supplied, a split column is added to the result. Cannot be used together with holdout.

holdout

Optional logical vector the same length as documents. TRUE marks a document for the test set, FALSE for the training set. When supplied, a split column is added to the result. Cannot be used together with prop.

blocks

Optional vector (character, factor, or integer) the same length as documents. When supplied, only within-block pairs are generated and a block column is included in the output. Blocks with fewer than 2 documents are skipped.

seed

Optional integer random seed for reproducibility.

Value

A tibble with columns doc_id_a, doc_id_b, text_a, and text_b. doc_id_a and doc_id_b are integer positions in documents, so text_a is documents[doc_id_a] and text_b is documents[doc_id_b]. When a split is active, an additional split column contains "train" or "test". When blocks is supplied, a block column identifies which block each pair belongs to.

Details

When prop or holdout is supplied, the result includes a split column ("train" / "test") that fit_model() and validate_model() respect automatically.
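To control exactly which documents land in the test set, a holdout vector can be built by hand. The following is a minimal base R sketch, not the package's internal logic: the rounding rule used for prop is not documented here, so the floor() call and the helper names (n_test_docs, test_idx) are assumptions made for illustration.

```r
docs <- c("The quick brown fox", "A lazy dog", "Hello world", "Foo bar")

prop <- 0.8
n <- length(docs)

# Assumption: round the training share down; the package may round differently.
n_test_docs <- n - floor(prop * n)

# Draw the test documents reproducibly, then mark them in a logical vector.
set.seed(1)
test_idx <- sample.int(n, n_test_docs)
holdout <- seq_len(n) %in% test_idx

# One document is held out; the other three form the training set.
sum(holdout)
#> [1] 1
```

The resulting vector has the same length as documents and can be passed as generate_comparisons(docs, holdout = holdout).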

When blocks is supplied, only within-block pairs are generated. Blocks with fewer than 2 documents are skipped with a message.
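The within-block restriction can be pictured with a short base R sketch. This only illustrates the idea with split() and combn(); it is not the package's implementation, and the object names (idx_by_block, pairs) are hypothetical.

```r
docs <- c("The quick brown fox", "A lazy dog", "Hello world", "Foo bar")
blocks <- c("a", "a", "b", "b")

# Group document positions by block and drop blocks with fewer than 2 documents.
idx_by_block <- split(seq_along(docs), blocks)
idx_by_block <- idx_by_block[lengths(idx_by_block) >= 2]

# Enumerate every unordered pair inside each remaining block.
pairs <- do.call(rbind, lapply(names(idx_by_block), function(b) {
  p <- combn(idx_by_block[[b]], 2)
  data.frame(
    doc_id_a = p[1, ],
    doc_id_b = p[2, ],
    text_a = docs[p[1, ]],
    text_b = docs[p[2, ]],
    block = b
  )
}))

# Two pairs remain: documents 1 and 2 in block "a", documents 3 and 4 in block "b".
nrow(pairs)
#> [1] 2
```

No pair crosses a block boundary, which is why four documents in two blocks of two yield 2 pairs rather than 6.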

When fewer unique pairs exist than the requested n_train or n_test, all available pairs are returned with a message.
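The number of available pairs follows from the usual combinatorial count: n documents yield n * (n - 1) / 2 unique unordered pairs. The base R sketch below checks a requested sample size against that count before calling the function; n_pairs() is a hypothetical helper, not part of the package.

```r
docs <- c("The quick brown fox", "A lazy dog", "Hello world", "Foo bar")

# Unique unordered pairs among n documents: n * (n - 1) / 2.
n_pairs <- function(n) choose(n, 2)

total <- n_pairs(length(docs))
total
#> [1] 6

# A request larger than the number of available pairs is capped at that number.
n_train <- 10000
min(n_train, total)
#> [1] 6

# With a split, each partition is counted separately.
holdout <- c(FALSE, FALSE, TRUE, TRUE)
n_pairs(sum(!holdout))  # training pairs
#> [1] 1
n_pairs(sum(holdout))   # test pairs
#> [1] 1
```

For large collections the count grows quadratically (1,000 documents give 499,500 pairs), which is why n_train and n_test default to finite caps.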

Examples

docs <- c("The quick brown fox", "A lazy dog", "Hello world", "Foo bar")
# All unique pairs
generate_comparisons(docs, n_train = NULL)
#> # A tibble: 6 × 4
#>   doc_id_a doc_id_b text_a              text_b     
#>      <int>    <int> <chr>               <chr>      
#> 1        1        2 The quick brown fox A lazy dog 
#> 2        1        3 The quick brown fox Hello world
#> 3        2        3 A lazy dog          Hello world
#> 4        1        4 The quick brown fox Foo bar    
#> 5        2        4 A lazy dog          Foo bar    
#> 6        3        4 Hello world         Foo bar    
# With an 80/20 train/test split, default sample sizes
generate_comparisons(docs, prop = 0.8, seed = 1)
#> `n_train` (10000) exceeds the number of unique train pairs (3). Using all 3.
#> `n_test` (5000) exceeds the number of unique test pairs (0). Using all 0.
#> # A tibble: 3 × 5
#>   doc_id_a doc_id_b text_a              text_b      split
#>      <int>    <int> <chr>               <chr>       <chr>
#> 1        1        3 The quick brown fox Hello world train
#> 2        1        4 The quick brown fox Foo bar     train
#> 3        3        4 Hello world         Foo bar     train
# With blocks
generate_comparisons(docs, blocks = c("a", "a", "b", "b"), n_train = NULL)
#> # A tibble: 2 × 5
#>   doc_id_a doc_id_b text_a              text_b     block
#>      <int>    <int> <chr>               <chr>      <chr>
#> 1        1        2 The quick brown fox A lazy dog a    
#> 2        3        4 Hello world         Foo bar    b    
# With user-supplied holdout
generate_comparisons(docs, holdout = c(FALSE, FALSE, TRUE, TRUE))
#> `n_train` (10000) exceeds the number of unique train pairs (1). Using all 1.
#> `n_test` (5000) exceeds the number of unique test pairs (1). Using all 1.
#> # A tibble: 2 × 5
#>   doc_id_a doc_id_b text_a              text_b     split
#>      <int>    <int> <chr>               <chr>      <chr>
#> 1        1        2 The quick brown fox A lazy dog train
#> 2        3        4 Hello world         Foo bar    test