r - How to recreate same DocumentTermMatrix with new (test) data

April 15, 2010

Suppose I have text-based training data and testing data. To be more specific, I have two data sets, training and testing, and both of them have one column that contains text and is of interest for the job at hand.

I used the tm package in R to process the text column in the training data set. After removing white spaces, punctuation, and stop words, I stemmed the corpus and created a document-term matrix (DTM) of 1-grams containing the frequency/count of the words in each document. I then took a pre-determined cut-off of, say, 50 and kept only those terms that have a count greater than 50.
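The cut-off step can be sketched on a plain count matrix in base R. This is only an illustration on made-up numbers; `counts`, `cutoff`, and `kept_terms` are hypothetical names, and a real DTM from tm would first be converted with `as.matrix()`.

```r
# Toy document-term counts (made-up data): rows are documents, columns are terms
counts <- matrix(c(30, 40, 2,
                   25, 20, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"), c("oil", "price", "opec")))

cutoff <- 50

# Total count of each term across all documents
term_totals <- colSums(counts)

# Keep only the terms whose total count is greater than the cut-off
kept_terms <- names(term_totals)[term_totals > cutoff]
train_counts <- counts[, kept_terms, drop = FALSE]
```

The vector of kept terms is worth saving alongside the model, since it defines the columns that any later DTM has to reproduce.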

Following this, I train a, say, glmnet model using the DTM and the dependent variable (which was present in the training data). Everything runs smooth and easy till now.

However, how do I proceed when I want to score/predict the model on the testing data or on any new data that might come in the future?

Specifically, I am trying to find out how to create the exact same DTM on new data.

If the new data set does not have any of the words in the original training data, then all the terms should have a count of zero (which is fine). But I want to be able to replicate the exact same DTM (in terms of structure) on any new corpus.

Any ideas/thoughts?

If I understand correctly, you have made a DTM, and you want to make a new DTM from new documents that has the same columns (i.e. terms) as the first DTM. If that's the case, then it should be a matter of subsetting the second DTM by the terms in the first, perhaps something like this:

First set up some reproducible data...

This is your training data...

    library(tm)
    # make corpus for text mining (data comes from the package, for reproducibility)
    data("crude")
    corpus1 <- Corpus(VectorSource(crude[1:10]))

    # process text (your methods may differ)
    skipWords <- function(x) removeWords(x, stopwords("english"))
    funcs <- list(tolower, removePunctuation, removeNumbers,
                  stripWhitespace, skipWords)
    crude1 <- tm_map(corpus1, FUN = tm_reduce, tmFuns = funcs)
    crude1.dtm <- DocumentTermMatrix(crude1, control = list(wordLengths = c(3,10)))

And this is your testing data...

    corpus2 <- Corpus(VectorSource(crude[15:20]))

    # process text (your methods may differ)
    skipWords <- function(x) removeWords(x, stopwords("english"))
    funcs <- list(tolower, removePunctuation, removeNumbers,
                  stripWhitespace, skipWords)
    crude2 <- tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)
    crude2.dtm <- DocumentTermMatrix(crude2, control = list(wordLengths = c(3,10)))

Here's the bit that does what you want:

Now we keep only the terms in the testing data that are present in the training data...

    # convert to matrices for subsetting
    crude1.dtm.mat <- as.matrix(crude1.dtm) # training
    crude2.dtm.mat <- as.matrix(crude2.dtm) # testing

    # subset testing data by colnames (i.e. terms) of training data
    xx <- data.frame(crude2.dtm.mat[, intersect(colnames(crude2.dtm.mat),
                                                colnames(crude1.dtm.mat))])

Finally, add to the testing data all the empty columns for terms in the training data that are not in the testing data...

    # make an empty data frame with the colnames of the training data
    yy <- read.table(textConnection(""), col.names = colnames(crude1.dtm.mat),
                     colClasses = "integer")

    # add in cols of NAs for terms absent in the
    # testing data but present
    # in the training data
    # following SchaunW's suggestion in the comments above
    library(plyr)
    zz <- rbind.fill(xx, yy)

So zz is a data frame of the testing documents, but it has the same structure as the training documents (i.e. the same columns, though many of them contain NA, as SchaunW notes).
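Since the question asks for counts of zero rather than NA, and a model such as glmnet would normally expect the columns in the training order, the same alignment can also be sketched directly on matrices in base R. This is a minimal sketch on toy data, not part of the original answer; `align_to_training` is a hypothetical helper name.

```r
# Hypothetical helper: rebuild a test count matrix so that it has exactly the
# training terms as columns, in the training order, with 0 for absent terms.
align_to_training <- function(test_mat, train_terms) {
  # Start from an all-zero matrix with the training columns
  aligned <- matrix(0, nrow = nrow(test_mat), ncol = length(train_terms),
                    dimnames = list(rownames(test_mat), train_terms))
  # Copy over counts for terms that occur in both sets;
  # terms seen only in the test data are dropped
  shared <- intersect(train_terms, colnames(test_mat))
  if (length(shared) > 0) {
    aligned[, shared] <- test_mat[, shared, drop = FALSE]
  }
  aligned
}

# Toy example (made-up data)
train_terms <- c("crude", "oil", "price")
test_mat <- matrix(c(2, 1,
                     0, 3),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("doc1", "doc2"), c("oil", "tanker")))
aligned <- align_to_training(test_mat, train_terms)
```

With the objects from the answer, the call would presumably be `align_to_training(crude2.dtm.mat, colnames(crude1.dtm.mat))`.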

Is that along the lines of what you want?
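A different route, not used in the answer above, is tm's `dictionary` control option, which restricts a new DTM to a given set of terms and gives absent terms a count of zero. The sketch below uses made-up sentences and assumes the new text goes through the same preprocessing as the training text; the names `train_dtm` and `test_dtm` are illustrative.

```r
library(tm)

# Toy training and testing text (made-up data)
train_text <- c("oil prices rose", "crude oil output fell")
test_text  <- c("oil demand rose sharply", "new markets opened")

train_corpus <- VCorpus(VectorSource(train_text))
test_corpus  <- VCorpus(VectorSource(test_text))

# Build the training DTM as usual
train_dtm <- DocumentTermMatrix(train_corpus)

# Build the testing DTM restricted to the training terms;
# terms that never occur in the test documents get a count of 0
test_dtm <- DocumentTermMatrix(test_corpus,
                               control = list(dictionary = Terms(train_dtm)))
```

If a frequency cut-off was applied to the training DTM, the vector of retained terms would be passed as the dictionary instead of `Terms(train_dtm)`.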
