Please read the entire pdf before starting.
You must do this assignment individually and, unless otherwise specified, you must follow all the general
instructions and regulations for assignments. Graders have the discretion to deduct up to 15% of the value
of this assignment for deviations from the general instructions and regulations. These regulations are posted
on the course website. Be sure to read them before starting.
Question 1: 20 points
Question 2: 80 points
100 points total
It is very important that you follow the directions as closely as possible. The directions, while
perhaps tedious, are designed to make it as easy as possible for the TAs to mark the assignments by letting
them run your assignment through automated tests. While these tests will not determine your entire grade,
it will speed up the process significantly, which will allow the TAs to provide better feedback and not waste
time on administrative details. Plus, if the TA is in a good mood while he or she is grading, then that
increases the chance of them giving out partial marks. Marks can be deducted if comments are missing, if
the code is not well structured, or if your solution does not respect the assignment requirements.
Assignment
Part 1 (0 points): Warm-up
Do NOT submit this part, as it will not be graded. However, doing these exercises might help you to do the
second part of the assignment, which will be graded. If you have difficulties with the questions of Part 1, then
we suggest that you consult the TAs during their office hours; they can help you and work with you through
the warm-up questions.
Warm-up Question 1 (0 points)
Write a program that opens a .txt, reads the contents of the file line by line, and prints the content of
each line. To do this, you should use look up how to use the BufferedReader or FileReader class1.
Remember to use the try and catch statements to handle errors like trying to open an non-existent file.
Warm-up Question 2 (0 points)
Modify the previous program so that it stores every line in an ArrayList of String objects. You have
to properly declare an ArrayList to store the results, and use add to store every line that your program
reads in the ArrayList.
1The documentation of the BufferedReadder class is available at http://docs.oracle.com/javase/7/docs/api/java/
io/BufferedReader.html. You can find an example on how to use it at http://www.tutorialspoint.com/java/io/
bufferedreader_readline.htm
1
Warm-up Question 3 (0 points)
Finally, modify your program so that, after reading all the content in the file, it prints how many words
are inside the text file. To do this, you should use the split method of the String class. Assume the
only character that separates words is whitespace " ".
Part 2
The questions in this part of the assignment will be graded. This assignment asks you to implement methods
that perform some specific function. To obtain full marks, you must adhere to the specified requirements for
the method names and input parameters. You may in addition implement helper methods to perform some
subtask as part of solving the questions below.
Question 1: Recursion vs Iteration: Logical Operators (20 points)
In class, we have seen the logical operators && (conjunction; “and”) and || (disjunction; “or”). These
were defined as binary logical operators that take two input boolean values. Here, we generalize the
definitions to take in any number of boolean values. We define the conjunction of N boolean values
x1, x2, ..., xN to be x1 if N is 1, and (((x1 && x2) && x3)... && xN) if N > 1. We likewise generalize the
definition of disjunction.
Implement the following four static methods in a class called LogicalOperators.java.
conjunctionIter: This method takes a boolean array as input, and returns the conjunction of all of
the values in the array using a loop.
conjunctionRec: This method takes a boolean array as input, and returns the conjunction of all of the
values in the array using recursive method calls.
disjunctionIter: This method takes a boolean array as input, and returns the disjunction of all of the
values in the array using a loop.
disjunctionRec: This method takes a boolean array as input, and returns the disjunction of all of the
values in the array using recursive method calls.
For the methods that use iteration, you may not call any helper methods or standard library methods,
and should instead use loops. For the methods that use recursion, you may define helper methods, but
you may not use loops. Calling a helper method which is recursive counts as using recursion.
Be sure to test your methods appropriately in the main method.
Question 2: Keyword Extraction (80 points)
One way to summarize the contents of an article is by describing it in terms of its keywords. For example,
an article about cats might be described by keywords such as “cats”, “pet”, “domesticated”, and “feline”.
In this question, we will develop a program that will automatically extract keywords from an article,
using techniques from information retrieval and computational linguistics.
The intuition behind the method that we will implement is as follows. First, one way to identify potential
keywords is to simply take words that appear frequently in the article; that is, an article about cats is
likely to contain many occurrences of the word “cat”. So, we can rank words according to how frequent
they are (this is called the term frequency), and select the top ranking words as the keywords.
Using term frequency alone, however, will not work well. In particular, there are many words which are
common across all kinds of articles in general. For examples, words such as “the”, “of”, “a”, “and”,
etc. appear very frequently in English texts, because they are used to help us relate different parts of a
sentence to each other grammatically. These words would rank highly in terms of term frequency, but
are not good keywords.
To adjust for this, we will incorporate a second factor called the inverse document frequency into our
ranking. We will penalize words according to how many documents they appear in, in a collection of
Page 2
texts. So, words that appear in all or nearly all the documents, such as “the”, would be heavily penalized,
and would rank lower than words that are specific to a particular article, such as “cat”.
The product of the two factors above is called TF-IDF (term frequency times inverse document frequency). The equation of the TF-IDF score of a word w in a document d is thus:
tfidf(w, d) = tf(w, d) × idf(w), (1)
where tf(w, d) is the number of times word w occurs in document d, and idf(w) is the IDF of word w
in a collection of texts that we have, to be defined more specifically below.
We will write some code to compute the TF-IDF scores of words in an article, in order to extract
keywords from that article. TF-IDF is a very popular techniques used in information retrieval, and is a
key component of modern search engines.
To begin, download the starter code and the data set that was released along with this document. The
data set consists of 40 arbitrarily selected articles extracted from the English Wikipedia, found in the
docs subdirectory. The articles are in 40 different text files, numbered “1.txt”, “2.txt”, ... “40.txt”.
You will separate your code into two classes: DocumentFrequency.java, and KeywordExtractor.java.
Your job is to implement the following methods in the specified classes. Feel free to define helper methods
to make your code easier to understand.
2.1: DocumentFrequency.java
In DocumentFrequency.java, implement the following methods:
2.1a: extractWordsFromDocument This method takes one argument, a String which is a file name.
It reads the indicated file, and returns the words that are found in the file as a HashSet<String>. You
may use the split() method of String to get the words in the documents. A method, normalize, is
provided in the starter code that will convert words into lowercase, and remove any extra whitespace
or punctuation. You should ensure that the empty string “” is not a vocabulary item in the returned
HashSet<String>.