Below, I will proceed from a simple linear regression to a generalized additive model to an ordered logistic regression analysis, and I will illustrate the results with nice plots along the way. Everything is done in R; you can get the script here.
I start by importing the reviews dataset into WEKA, then I perform some text preprocessing tasks such as word extraction, stop-words removal, stemming and term selection. Finally, I run various classification algorithms (naive Bayes, k-nearest neighbors) and compare the results in terms of classification accuracy.
The movie reviews dataset
The dataset consists of user-created movie reviews archived on the IMDb (Internet Movie Database) web portal at http: Each review consists of a plain text file.
The class attribute has only two values: positive (pos) and negative (neg). For instance, if a scale is used, a vote greater than or equal to 6 is considered positive and anything less than 6 negative.
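The thresholding rule above can be sketched in a few lines of Python. This is only an illustration: the function name `label_from_vote` and the configurable `threshold` argument are assumptions, and only the cut-off of 6 comes from the text.

```python
def label_from_vote(vote, threshold=6):
    """Map a numeric vote to a class value.

    Votes greater than or equal to the threshold are positive,
    everything below it is negative (threshold 6 as in the text).
    """
    return "pos" if vote >= threshold else "neg"
```

For a different vote scale, a similar rule would presumably apply with a different `threshold`.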
The authors also state that similar criteria have been used in the case of different vote scales. Practically speaking, we have an archive containing text files partitioned into two sub-directories, pos and neg (the class values).
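To make the directory layout concrete, here is a minimal Python sketch that reads such an archive into (text, label) pairs, taking the label from the sub-directory name. It is not the WEKA loader; the function name `load_reviews` and the `*.txt` file pattern are assumptions made for illustration.

```python
from pathlib import Path


def load_reviews(root):
    """Return a list of (text, label) pairs from a directory tree.

    Each sub-directory of `root` (e.g. pos, neg) is treated as a class,
    and every *.txt file inside it as one review of that class.
    The *.txt pattern is an assumption about how the files are named.
    """
    samples = []
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files next to the class directories
        for path in sorted(class_dir.glob("*.txt")):
            text = path.read_text(encoding="utf-8", errors="replace")
            samples.append((text, class_dir.name))
    return samples
```

Using the directory name as the label mirrors the idea described above: the partition into pos and neg is itself the class attribute.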
By using this loader, we obtain the relation shown in figure 2.
Text preprocessing and feature extraction in WEKA
For the classification task to be done, a preliminary phase of text preprocessing and feature extraction is essential.
To build the vocabulary, various operations are typically performed, many of which are language-specific:
Word parsing and tokenization
In this phase, each document is analyzed with the purpose of extracting the terms. Separator characters must be defined, along with a tokenization strategy for particular cases such as accented words, hyphenated words, acronyms, etc.
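As a rough illustration of separator-based tokenization, the Python sketch below splits a document on a configurable set of separator characters. The function name `tokenize`, the default separator set and the lowercasing step are assumptions chosen for the example, not a description of any particular tool.

```python
import re


def tokenize(text, separators=" \r\n\t.,;:'\"()?!"):
    """Split text on any run of separator characters and lowercase the terms.

    The hyphen is deliberately not a separator here, so hyphenated
    words such as "well-known" survive as a single term.
    """
    pattern = "[" + re.escape(separators) + "]+"
    return [term.lower() for term in re.split(pattern, text) if term]


print(tokenize("A well-known film, isn't it?"))
```

The apostrophe case shows why the strategy matters: with this separator set, "isn't" is broken into "isn" and "t", which may or may not be what you want.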
Stop-words removal
A very common technique is the elimination of frequently used words. Terms of this kind should be filtered out, as they have poor characterizing power, which makes them useless for text classification.
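Stop-word filtering can be sketched as a simple set lookup. The tiny `STOP_WORDS` set below is a made-up sample for illustration only; a real list would be much longer and language-specific.

```python
# A deliberately tiny, hypothetical stop-word list for illustration.
STOP_WORDS = frozenset({"a", "an", "and", "in", "is", "it", "of", "the", "to"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop-word list (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]


print(remove_stop_words(["the", "plot", "is", "a", "mess"]))
```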
Lemmatization and stemming
The lemmatization of a word is the process of determining its lemma. A simple technique for approximate lemmatization is stemming. Stemming algorithms work by removing the suffix of the word, according to some grammatical rules.
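To show the idea of suffix removal, here is a deliberately naive Python stemmer. It is not the Porter or Snowball algorithm; the suffix list, the `min_stem_len` guard and the function name `naive_stem` are all assumptions made for this sketch.

```python
# Hypothetical suffix rules, tried in order; real stemmers use far richer rules.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word, suffixes=SUFFIXES, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ("acting", "acted", "films", "movies", "is"):
    print(w, "->", naive_stem(w))
```

Note that "movies" becomes "movi" rather than "movie": a stem only approximates the lemma, which is acceptable as long as related word forms collapse to the same string.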
This term selection task also leads to a simpler and more efficient classification.
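One simple way to select terms is to keep only those that occur in enough documents, optionally capped at a maximum vocabulary size. The Python sketch below illustrates that idea under stated assumptions: the function name `select_terms` and the parameters `min_doc_freq` and `max_terms` are hypothetical, and the counting is done over the whole collection rather than per class.

```python
from collections import Counter


def select_terms(documents, min_doc_freq=1, max_terms=None):
    """Select a vocabulary from tokenized documents.

    documents    : list of token lists, one per document
    min_doc_freq : drop terms found in fewer than this many documents
    max_terms    : keep at most this many terms (most frequent first)
    """
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    kept = [(term, n) for term, n in doc_freq.items() if n >= min_doc_freq]
    # Highest document frequency first; ties broken alphabetically.
    kept.sort(key=lambda item: (-item[1], item[0]))
    if max_terms is not None:
        kept = kept[:max_terms]
    return [term for term, _ in kept]


docs = [["good", "film", "good"], ["bad", "film"], ["good", "plot"]]
print(select_terms(docs, min_doc_freq=2))
```

A smaller vocabulary means fewer attributes per document, which is where the gain in simplicity and efficiency comes from.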
With this filter you can control the different stages of term extraction: configure the tokenizer (term separators); specify a stop-words list; choose a stemmer. You can set a stop-words list by clicking on stopwords and setting the useStopList parameter to true. Before you can use the Snowball stemmer, you must download the snowball stemmers library and add it to the classpath.
Furthermore, you can set a maximum limit on the number of words to be extracted by changing the wordsToKeep parameter, and a minimum document frequency for each term by means of the minTermFreq parameter (default is 1).
The latter parameter makes the filter drop the terms that appear in fewer than minTermFreq documents of the same class (see note). After applying the StringToWordVector filter, we get the result shown in figure 4.
I just did a project with movie data and had the exact same question. As other posters stated, RottenTomatoes has a decent API, but it is limited, and IMDB just has really annoying files to download. I eventually found the Open Movie Database, which has a simple API that returns a .
The Data Mining Projects topics list and ideas provided here consist of project reports with source code for final year students.
Project in Mining Massive Data Sets. Spring
When dealing with these datasets please be careful and responsible. The datasets are meant to be used strictly for the purposes of the class project and nothing else.
IMDB database: Everything about every movie ever made.
Movielens: User movie rating data.
Data Mining project using IMDB, Movielens and Wikipedia datasets.
Weka 3: Data Mining Software in Java
Weka is a collection of machine learning algorithms for data mining tasks. It contains tools for data preparation, classification, regression, clustering, association rules mining, and visualization.
to transform the IMDb data into a format suitable for data mining, and provides a selection of information mined from this refined data, in section Experimental results.