Current Issues in Natural Language Processing
In this lab, you’ll explore Weka, a machine learning (ML) toolkit for data mining: http://www.cs.waikato.ac.nz/ml/index.html
Weka is installed in RI105, but for this lab, you will have to download Weka and install it on your personal computers:
Please follow the instructions on the website. Pay attention to the requirements (e.g., Java) and to your operating system, and download the matching version.
Weka is very well documented. There are numerous tutorials (including on YouTube), slides, and exercises online that you can use. There is a book that describes the algorithms, the software, and the data formats, but you can also find slides if you scroll down: http://www.cs.waikato.ac.nz/ml/weka/book.html
In the lab you are going to
• Explore data in the form of an arff dataset;
• Preprocess data using statistical transformations;
• Visualize data in their different formats;
• Experiment with supervised machine learning classification;
• Use J48, Weka’s implementation of the decision tree classifier;
• Use a decision tree classifier on two different datasets;
• Familiarize yourself with the presentation of the results in Weka.
Occasionally you may find that Weka crashes with an out-of-memory exception, particularly for large datasets or extensive preprocessing of data. This is related to the default heap size allocation for the Java Virtual Machine (JVM). To run Weka with a larger memory allocation, download the following script: runweka.sh and execute the sh runweka.sh command. You can control memory allocation with the -Xmx option in the script; for example, -Xmx2048m would allocate up to 2GB of memory to Weka. Please note that a large allocation may slow down your machine.
• Download the iris dataset (iris.arff) and save it in a local folder.
• Load the dataset into the Explorer and start analyzing it. To do this, click the Open file button to bring up a standard dialog through which you can select a file.
• Choose iris.arff from your local folder.
• You’re now in the Preprocess panel.
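To make the ARFF layout concrete before you inspect it in Weka, here is a minimal sketch in Python that reads the header and counts the instances. It is not part of Weka: the function name summarize_arff and the three-row sample are hypothetical and only illustrative, and the parser assumes a simple file (no quoted attribute names, no sparse data).

```python
import io

# A tiny hand-written sample in the style of iris.arff (illustrative values only).
SAMPLE_ARFF = """\
% comment lines start with a percent sign
@relation iris-sample
@attribute sepallength numeric
@attribute petalwidth numeric
@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}
@data
5.1,0.2,Iris-setosa
7.0,1.4,Iris-versicolor
6.3,2.5,Iris-virginica
"""

def summarize_arff(stream):
    """Return (relation, [(name, kind)], n_instances) for a simple ARFF stream."""
    relation, attributes, n_instances, in_data = None, [], 0, False
    for raw in stream:
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            n_instances += 1  # every remaining line is one instance
            continue
        keyword = line.split(None, 1)[0].lower()  # ARFF keywords are case-insensitive
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, spec = line.split(None, 2)
            # A value list in braces marks a nominal attribute.
            kind = "nominal" if spec.startswith("{") else spec.lower()
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, n_instances

relation, attributes, n_instances = summarize_arff(io.StringIO(SAMPLE_ARFF))
print(relation, n_instances)
for name, kind in attributes:
    print(f"{name}: {kind}")
```

To try it on the real file, replace io.StringIO(SAMPLE_ARFF) with open("iris.arff") and compare the counts with what the Preprocess panel reports.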
Answer the following questions:
1. What information do you have about the dataset (e.g., number of instances, attributes, and classes)? What type of attributes does this dataset contain (nominal or numeric)? What are the classes in this dataset? Which attribute has the greatest standard deviation? What does it tell you about that attribute? You might also find it useful to open the iris dataset in your favorite text editor or click the Edit button from the row of buttons at the top of the Preprocess panel.
2. Under Filter, choose the Standardize filter and apply it to all attributes. What does it do? How does it affect the attributes’ statistics? Click Undo, then apply the Normalize filter to all attributes. What does it do? How does it affect the attributes’ statistics? How does it differ from Standardize? Click Undo again to return the data to its original state.
3. Give definitions and examples of the two following graphical representations:
a. Scatter plot
4. At the bottom right of the window there should be a graph that visualizes the dataset. Make sure Class: class (Nom) is selected in the drop-down box, then click Visualize All. What can you interpret from these graphs? Which attribute(s) discriminate best between the classes in the dataset? How do the Standardize and Normalize filters affect these graphs?
5. Give definitions and examples of the following concepts:
a. Standard deviation
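If you want to check your answers to question 2 by hand, the two rescalings can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Weka's code: standardize is assumed to subtract the mean and divide by the sample standard deviation (n-1 denominator), normalize is assumed to map the values linearly onto [0, 1], and the input column is made up.

```python
import statistics

def standardize(values):
    """Z-scores: subtract the mean, divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: n-1 denominator
    return [(v - mean) / sd for v in values]

def normalize(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical numeric column, not taken from iris.arff.
values = [1.0, 2.0, 3.0, 4.0, 5.0]
z = standardize(values)
m = normalize(values)

print("standardized:", [round(v, 3) for v in z])
print("normalized:  ", m)
print("mean/stdev after standardize:", round(statistics.mean(z), 6), round(statistics.stdev(z), 6))
```

Both functions divide by zero on a constant column; a real filter has to handle that case separately.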
1. Start Weka. Launch the Explorer window and select the Preprocess tab. Open the iris dataset. Select the Classify tab. Under Classifier, select J48. What main parameters can be specified for the classifier?
2. Under Test options, select Cross-validation and under More options, check Output predictions. Click Start to start training the model. You should see a stream of output appear in the window named Classifier output. What does each of the following sections tell you about the model?
a. Predictions on …
c. Detailed accuracy by class
d. Confusion matrix
3. Go to the graphical representation of the decision tree. It can be displayed in a pop-up tree visualizer: right-click on the item in the Result list and select Visualize tree.
Which feature is tested at the root node, i.e., which is the most discriminative feature?
4. Once you have finished with the iris dataset, repeat the same steps for the English past tense dataset. What is the performance (accuracy, precision/recall, and F-measure) of the decision tree classifier on this dataset? Try to explain why you get this performance on the past tense dataset.
Suggestion: look at the distribution of the classes and analyze the confusion matrix.
5. What is a loss function? Give an informal definition and example(s).
6. Under Result list you should see the model that is created at each run. Right-click on the model created for the iris dataset and select Visualize classifier error. Points marked with a square are errors, i.e., incorrectly classified instances. How do you think the classifier performed? Once you have finished with the iris dataset, repeat the same with the English past tense dataset. How do you think the classifier performed on this larger dataset?
7. Analyze the graphical representation of the decision trees of both the iris dataset and the English past tense dataset. What can you notice? Describe what you see and interpret the trees.
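When you analyze the confusion matrix and the per-class figures, it can help to recompute them yourself. The sketch below derives accuracy, precision, recall, and F-measure from a confusion matrix laid out as in Weka's output (rows are the actual class, columns are the predicted class). The matrix, the labels, and the function name per_class_metrics are hypothetical; substitute the numbers from your own run.

```python
def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall, f_measure)} from a square confusion matrix.

    matrix[i][j] = number of instances of actual class i predicted as class j.
    """
    results = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                  # rest of the row: missed instances
        fp = sum(row[i] for row in matrix) - tp   # rest of the column: false alarms
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        results[label] = (precision, recall, f_measure)
    return results

# Hypothetical three-class result, not taken from a real Weka run.
labels = ["a", "b", "c"]
matrix = [
    [10, 0, 0],
    [0, 8, 2],
    [0, 1, 9],
]

correct = sum(matrix[i][i] for i in range(len(labels)))
total = sum(sum(row) for row in matrix)
accuracy = correct / total
metrics = per_class_metrics(matrix, labels)

print(f"accuracy: {accuracy:.3f}")
for label, (p, r, f) in metrics.items():
    print(f"{label}: precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

On a dataset with a skewed class distribution, compare the accuracy with the per-class recall: a high accuracy can hide a class that is almost never predicted.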
What to submit:
A written report containing reasoned answers to the tasks and questions.