Builds the feature matrix from a vector of text

build_features(x, term_count_min = 1, mdl = NULL, parallel = TRUE,
  quiet = FALSE)

Arguments

x

a vector of text

term_count_min

a number passed to prune_vocabulary, defaults to 1. When the function is used for training, it can and should be set to a higher value, e.g., 3.

mdl

a list of existing model data (containing the vectorizer, the tfidf, and the lsa object). Defaults to NULL, in which case the models are rebuilt.

parallel

logical (TRUE/FALSE); whether the task should be executed in parallel. Defaults to TRUE.

quiet

logical (TRUE/FALSE); whether the function should suppress its progress messages. Defaults to FALSE.
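
To illustrate what a minimum term count does, here is a minimal base-R sketch. It is not the package's code and does not call prune_vocabulary; the helper `keep_terms` and the whitespace tokenisation are assumptions made for illustration only. It shows how raising the threshold drops terms that occur too rarely in the corpus.

```r
# Illustrative only: a base-R stand-in for pruning by minimum term count.
docs <- c("Banking is finance", "flowers are not houses",
          "finance is power", "houses are build")

# Simple whitespace tokenisation (an assumption, not the package's tokenizer)
tokens <- strsplit(tolower(docs), "\\s+")

# Number of occurrences of each term across all documents
term_counts <- table(unlist(tokens))

# Hypothetical helper: keep only terms seen at least `term_count_min` times
keep_terms <- function(counts, term_count_min = 1) {
  names(counts)[counts >= term_count_min]
}

all_terms    <- keep_terms(term_counts, term_count_min = 1)  # every term kept
pruned_terms <- keep_terms(term_counts, term_count_min = 2)  # rare terms dropped
```

With a threshold of 1 all nine distinct terms survive; with a threshold of 2 only "are", "finance", "houses", and "is" remain, which is why a higher value yields fewer feature columns during training.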

Value

a list of two elements: model_matrix, a dgCMatrix that contains the features (columns) for each text (row), and mdl, a list of the model objects that can be passed as mdl in a later call

Examples

text <- c(
  "This is a first text that describes something",
  "A second Text That USES A LOT of CAPITALS",
  "Lastly MANY!!!! (like, really a lot!) punctuations!!!"
)
build_features(text)
#> Calculating Features...
#> Create DTM...
#> Finished in 5.12 seconds
#> $model_matrix
#> 3 x 21 sparse Matrix of class "dgCMatrix"
#>    [[ suppressing 21 column names ‘length’, ‘ncap’, ‘ncap_len’ ... ]]
#>
#> 1 45  1 0.02222222 . . .  . 8 . . 1 1 1 . . . . . . . .
#> 2 41 19 0.46341463 . . .  . 9 . . . . . 1 1 1 . . . . .
#> 3 53  5 0.09433962 . 8 . 11 7 . . . . . . . . 1 1 1 1 1
#>
#> $mdl
#> $mdl$vectorizer
#> function (iterator, grow_dtm, skip_grams_window_context, window_size,
#>     weights)
#> {
#>     vocab_corpus_ptr = cpp_vocabulary_corpus_create(vocabulary$term,
#>         attr(vocabulary, "ngram")[[1]], attr(vocabulary, "ngram")[[2]],
#>         attr(vocabulary, "stopwords"), attr(vocabulary, "sep_ngram"))
#>     setattr(vocab_corpus_ptr, "ids", character(0))
#>     setattr(vocab_corpus_ptr, "class", "VocabCorpus")
#>     corpus_insert(vocab_corpus_ptr, iterator, grow_dtm, skip_grams_window_context,
#>         window_size, weights)
#> }
#> <environment: 0x000000000652caf0>
#>
#>
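
The leading columns of the matrix above (length, ncap, ncap_len, ...) look like simple per-text statistics. The following base-R sketch shows one plausible way to compute a few of them. The definitions used here (character count, count of A-Z letters, their ratio, count of "!", whitespace-separated tokens) are assumptions inferred from the column names and the printed values, not the package's actual implementation.

```r
# Illustrative only: hand-computed text statistics under assumed definitions.
text <- c(
  "This is a first text that describes something",
  "A second Text That USES A LOT of CAPITALS",
  "Lastly MANY!!!! (like, really a lot!) punctuations!!!"
)

# Count matches of a pattern in each element of x
count_matches <- function(x, pattern, fixed = FALSE) {
  lengths(regmatches(x, gregexpr(pattern, x, fixed = fixed)))
}

feats <- data.frame(
  length = nchar(text),                            # number of characters
  ncap   = count_matches(text, "[A-Z]"),           # number of capital letters
  nexcl  = count_matches(text, "!", fixed = TRUE), # number of exclamation marks
  nword  = lengths(strsplit(text, "\\s+"))         # whitespace-separated tokens
)
feats$ncap_len <- feats$ncap / feats$length        # share of capital letters

print(feats)
```

Under these assumptions the sketch gives lengths of 45, 41, and 53 and capital counts of 1, 19, and 5, which agree with the first two columns printed above.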
# a second example
train <- c("Banking is finance", "flowers are not houses", "finance is power", "houses are build")
test <- c("finance is greed", "flowers belong in the garbage", "houses are build")
a1 <- build_features(test)
#> Calculating Features...
#> Create DTM...
#> Finished in 3.38 seconds
a12 <- build_features(test, mdl = a1$mdl)
#> Calculating Features...
#> Create DTM...
#> Finished in 2.99 seconds
a2 <- build_features(train, mdl = a1$mdl)
#> Calculating Features...
#> Create DTM...
#> Finished in 3.07 seconds
a2$model_matrix %>% as.matrix()
#>   length ncap   ncap_len nsen nexcl nquest npunct nword nsymb nsmile greed
#> 1     18    1 0.05555556    0     0      0      0     3     0      0     0
#> 2     22    0 0.00000000    0     0      0      0     4     0      0     0
#> 3     16    0 0.00000000    0     0      0      0     3     0      0     0
#> 4     16    0 0.00000000    0     0      0      0     3     0      0     0
#>   financ belong garbag flower build hous
#> 1      1      0      0      0     0    0
#> 2      0      0      0      1     0    1
#> 3      1      0      0      0     0    0
#> 4      0      0      0      0     1    1
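
In the second example the term columns of a2 come from the texts that produced a1$mdl, so terms that only occur in the new texts (such as "banking") get no column. The base-R sketch below mimics that behaviour with a fixed vocabulary. The helper `count_terms` is hypothetical, and unlike the package output above it applies no stemming or stop-word removal.

```r
# Illustrative only: applying a previously fitted vocabulary to new texts.
fit_texts <- c("finance is greed", "flowers belong in the garbage", "houses are build")
new_texts <- c("Banking is finance", "flowers are not houses",
               "finance is power", "houses are build")

# Vocabulary learned once from the fitting texts
vocab <- unique(unlist(strsplit(tolower(fit_texts), "\\s+")))

# Hypothetical helper: one row per text, one column per vocabulary term
count_terms <- function(texts, vocab) {
  tokens <- strsplit(tolower(texts), "\\s+")
  m <- t(vapply(
    tokens,
    function(tok) as.integer(table(factor(tok, levels = vocab))),
    integer(length(vocab))
  ))
  colnames(m) <- vocab
  m
}

m <- count_terms(new_texts, vocab)
print(m)
```

Because the columns are fixed by the fitted vocabulary, matrices built for different text sets with the same model line up column by column, which is what a downstream classifier needs.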