When we start a text mining project, we usually think about how to implement three processes: data transformation (text processing), modeling, and evaluation. In the data transformation process our goal is to turn unstructured data into structured data. Usually, we transform each text into a vector, where each position corresponds to a word and holds a value for that word (for example, its frequency in the text). Many transformations can be applied in this process, such as tokenization, stopword filtering, stemming and n-gram generation. It is easy to think that the more transformations we apply to the text, the more accurate the results will be in the modeling process. But sometimes this is not true.
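
To make these steps concrete, here is a minimal sketch using Python with NLTK (one common choice, not necessarily the library used in any given project) that applies tokenization, stopword filtering, stemming and 2-gram generation to a single sentence:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.util import ngrams

# One-time downloads of the tokenizer model and the stopword list
nltk.download("punkt")
nltk.download("stopwords")

text = "Text mining transforms unstructured text into structured vectors"

# Tokenization: split the raw string into individual words
tokens = nltk.word_tokenize(text.lower())

# Stopword filtering: drop words that carry little meaning on their own
stop = set(stopwords.words("english"))
filtered = [t for t in tokens if t not in stop]

# Stemming: reduce each word to its stem (e.g. "transforms" -> "transform")
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in filtered]

# 2-grams: pair adjacent tokens to keep some of the word order
bigrams = list(ngrams(stems, 2))

print(stems)
print(bigrams)
```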

Last year I supervised an undergraduate final project on Opinion Mining in micro-blogs. In this work, we used Twitter as the platform to collect data. We collected 1,300 messages and manually classified them as POSITIVE (650) or NEGATIVE (650). In the modeling process we implemented only the Naive Bayes algorithm. However, we implemented eight different data transformation processes: (i) 1-grams only; (ii) 1-grams with stopword filtering; (iii) 1-grams with stemming; (iv) 1-grams with stopword filtering and stemming; (v) 2-grams only; (vi) 2-grams with stopword filtering; (vii) 2-grams with stemming; and (viii) 2-grams with stopword filtering and stemming.
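
With scikit-learn, those eight configurations could be expressed as eight variants of the same pipeline. This is only a sketch of how such an experiment might be set up, not the code we actually used; the tokenizer and the configuration names are illustrative:

```python
import re

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

STOP = set(stopwords.words("english"))
stemmer = PorterStemmer()

def make_tokenizer(remove_stopwords, stem):
    """Build a tokenizer that optionally filters stopwords and stems."""
    def tokenize(doc):
        tokens = re.findall(r"(?u)\b\w\w+\b", doc.lower())
        if remove_stopwords:
            tokens = [t for t in tokens if t not in STOP]
        if stem:
            tokens = [stemmer.stem(t) for t in tokens]
        return tokens
    return tokenize

# (ngram_range, filter stopwords, stemming) for the eight configurations
configs = {
    "1-gram":                    ((1, 1), False, False),
    "1-gram + stopwords":        ((1, 1), True,  False),
    "1-gram + stemming":         ((1, 1), False, True),
    "1-gram + stopwords + stem": ((1, 1), True,  True),
    "2-gram":                    ((2, 2), False, False),
    "2-gram + stopwords":        ((2, 2), True,  False),
    "2-gram + stemming":         ((2, 2), False, True),
    "2-gram + stopwords + stem": ((2, 2), True,  True),
}

pipelines = {
    name: Pipeline([
        ("vec", CountVectorizer(tokenizer=make_tokenizer(sw, st),
                                ngram_range=ng,
                                lowercase=False,   # the tokenizer lowercases
                                token_pattern=None)),
        ("nb", MultinomialNB()),
    ])
    for name, (ng, sw, st) in configs.items()
}
```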

If someone had asked me which process would produce the best result, I would have said the one using 2-grams, stopword filtering and stemming in the data transformation process. Why? Because it is the most complete! It removes the words that are generally irrelevant, merges words that share the same stem, and captures words that usually appear together.

However, after we applied three-fold cross-validation to our dataset, the result was different. As we can see in the table below, the highest accuracy was obtained with the data transformation that uses only the 2-gram process.
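
Continuing the sketch above, a three-fold cross-validation over the eight pipelines could look like this; `texts` and `labels` are hypothetical placeholders for the 1,300 collected messages and their POSITIVE/NEGATIVE classes, which are not reproduced here:

```python
from sklearn.model_selection import cross_val_score

# texts: list of 1,300 tweet strings; labels: 1 for POSITIVE, 0 for NEGATIVE
# (hypothetical names: the actual dataset is not shown in this post)
for name, pipe in pipelines.items():
    scores = cross_val_score(pipe, texts, labels, cv=3, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```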

Why this result? Maybe because the dataset only has documents with fewer than 140 characters, one of Twitter's features. In texts this short, the stopwords and the full words (not only the word stems) are important to characterize the documents.

Well... what can we learn from this? That it is always important to test all the possibilities in the data transformation process. :)