Builds the feature matrix from a vector of texts

build_features(x, term_count_min = 1, mdl = NULL, parallel = TRUE,
  quiet = FALSE)

Arguments

x	a vector of text
term_count_min	a number passed to `prune_vocabulary`. Defaults to 1. When the function is used for training, it can and should be set to a higher value, e.g., 3.
mdl	a list of existing model data (containing the vectorizer, the tfidf, and the lsa object). Defaults to NULL, in which case the model data is rebuilt.
parallel	logical; whether the task should be executed in parallel. Defaults to TRUE.
quiet	logical; whether the function should run silently. Defaults to FALSE.
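
To illustrate what term_count_min controls, the following is a minimal sketch that calls text2vec's `prune_vocabulary` directly. It is not the internals of build_features; the sample texts, the tokenizer choice, and the variable names are assumptions made for illustration only.

```r
library(text2vec)

# Hypothetical sample texts, chosen so that two terms occur twice
texts <- c("finance is power", "finance is greed", "houses are built")

# Tokenize the texts and collect the vocabulary with term counts
it <- itoken(texts, preprocessor = tolower, tokenizer = word_tokenizer,
             progressbar = FALSE)
vocab <- create_vocabulary(it)

# Keep only terms that occur at least twice in the whole corpus
pruned <- prune_vocabulary(vocab, term_count_min = 2)

# Build a document-term matrix restricted to the pruned vocabulary
vectorizer <- vocab_vectorizer(pruned)
dtm <- create_dtm(it, vectorizer)

pruned$term  # only "finance" and "is" remain
dim(dtm)     # one row per text, one column per remaining term
```

A higher threshold drops rare terms, which keeps the document-term matrix small when training on a large corpus.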

Value

A list with two elements: `model_matrix`, a dgCMatrix with one row per text and one column per feature, and `mdl`, a list of model data that can be passed back to the function as mdl.

Examples

text <- c(
  "This is a first text that describes something",
  "A second Text That USES A LOT of CAPITALS",
  "Lastly MANY!!!! (like, really a lot!) punctuations!!!"
)

build_features(text)
#> Calculating Features...
#> Create DTM...
#> Finished in 5.12 seconds
#> $model_matrix
#> 3 x 21 sparse Matrix of class "dgCMatrix"
#>    [[ suppressing 21 column names ‘length’, ‘ncap’, ‘ncap_len’ ... ]]
#>                                                        
#> 1 45  1 0.02222222 . . .  . 8 . . 1 1 1 . . . . . . . .
#> 2 41 19 0.46341463 . . .  . 9 . . . . . 1 1 1 . . . . .
#> 3 53  5 0.09433962 . 8 . 11 7 . . . . . . . . 1 1 1 1 1
#> 
#> $mdl
#> $mdl$vectorizer
#> function (iterator, grow_dtm, skip_grams_window_context, window_size, 
#>     weights) 
#> {
#>     vocab_corpus_ptr = cpp_vocabulary_corpus_create(vocabulary$term, 
#>         attr(vocabulary, "ngram")[[1]], attr(vocabulary, "ngram")[[2]], 
#>         attr(vocabulary, "stopwords"), attr(vocabulary, "sep_ngram"))
#>     setattr(vocab_corpus_ptr, "ids", character(0))
#>     setattr(vocab_corpus_ptr, "class", "VocabCorpus")
#>     corpus_insert(vocab_corpus_ptr, iterator, grow_dtm, skip_grams_window_context, 
#>         window_size, weights)
#> }
#> <environment: 0x000000000652caf0>
#> 
#> 

# a second example: reuse the model data (mdl) from one call when building features for other texts
train <- c("Banking is finance", "flowers are not houses", "finance is power", "houses are build")
test <- c("finance is greed", "flowers belong in the garbage", "houses are build")

a1 <- build_features(test)
#> Calculating Features...
#> Create DTM...
#> Finished in 3.38 seconds
a12 <- build_features(test, mdl = a1$mdl)
#> Calculating Features...
#> Create DTM...
#> Finished in 2.99 seconds

a2 <- build_features(train, mdl = a1$mdl)
#> Calculating Features...
#> Create DTM...
#> Finished in 3.07 seconds
as.matrix(a2$model_matrix)
#>   length ncap   ncap_len nsen nexcl nquest npunct nword nsymb nsmile greed
#> 1     18    1 0.05555556    0     0      0      0     3     0      0     0
#> 2     22    0 0.00000000    0     0      0      0     4     0      0     0
#> 3     16    0 0.00000000    0     0      0      0     3     0      0     0
#> 4     16    0 0.00000000    0     0      0      0     3     0      0     0
#>   financ belong garbag flower build hous
#> 1      1      0      0      0     0    0
#> 2      0      0      0      1     0    1
#> 3      1      0      0      0     0    0
#> 4      0      0      0      0     1    1
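
The leading columns of the first example's output (length, ncap, ncap_len, nexcl, npunct, nword) are simple per-text counts. The base R sketch below reproduces those values for the three example texts. The regular expressions and the helper `count_matches` are assumptions for illustration, not the package's actual implementation.

```r
text <- c(
  "This is a first text that describes something",
  "A second Text That USES A LOT of CAPITALS",
  "Lastly MANY!!!! (like, really a lot!) punctuations!!!"
)

# Hypothetical helper: number of regex matches in each string
count_matches <- function(x, pattern) {
  vapply(gregexpr(pattern, x, perl = TRUE),
         function(m) sum(m > 0), integer(1))
}

meta <- data.frame(
  length = nchar(text),                           # characters per text
  ncap   = count_matches(text, "[A-Z]"),          # capital letters
  nexcl  = count_matches(text, "!"),              # exclamation marks
  npunct = count_matches(text, "[[:punct:]]"),    # punctuation characters
  nword  = count_matches(text, "[A-Za-z]+")       # runs of letters
)

# Share of capital letters relative to the text length
meta$ncap_len <- meta$ncap / meta$length

meta
```

For these texts the counts agree with the values printed in the first example: lengths 45, 41, 53; capitals 1, 19, 5; and 8 exclamation marks among 11 punctuation characters in the third text.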
