Here is a program that reads a
collection of ten files, those that are specified
here , and extracts words
from these files. For our purposes, a word is defined as something
that is made up purely of characters. The program considers only words
that are more than four letters long. It then keeps track of the
number of occurances of each such word over the entire collection.
Finally, it prints the words that occur more than 20 times. This
program will be a useful starting point for this homework, to be
described next.
The basic task is to measure the similarity between any two files
in our collection. To do this, we will need a suitable universe
of words. This will consist of all words in the collection that
are (a) more than four letters long, (b) don't occur more than
20 times overall, and (c) don't occur in more than 7 files in the
collection. Now we constructor a vector (in the mathematical sense)
corresponding to each file. The vector will have as many coordinates
as words in the universe -- so there is one coordinate for each word
in the universe. If a word occurs in the file, the corresponding
coordinate is 1, otherwise it is 0.
Let us give an example: suppose the universe consists of the five
words: apple, grapes, banana, doctor, program. Suppose
file1 contains: apple, banana, program. Then the vector
for file1 is (1,0,1,0,1).
We need to normalize each of the vectors so that it has unit length.
So each coordinate in the above vector gets divided by the square
root of 3.
The similarity of two files is defined to be the scalar
product of the corresponding two vectors. The scalar product
of two vectors is obtained by multiplying corresponding components
and adding. For example, the scalar product of (2,1,3) and (0,5,6)
is 2 * 0 + 1 * 5 + 3 * 6.
Your task is to write a program that prints the names of the two
files with the highest similarity among the files in the collection, and
the names of the two files with the lowest similarity. The collection is
the same as the one processed in the above program, and consists of
the files named here . You'll need to
download the stuff in the directory, which is available in zipped
form here .