Homework 7

Here is a program that reads a collection of ten files, those that are specified here , and extracts words from these files. For our purposes, a word is defined as something that is made up purely of characters. The program considers only words that are more than four letters long. It then keeps track of the number of occurances of each such word over the entire collection. Finally, it prints the words that occur more than 20 times. This program will be a useful starting point for this homework, to be described next.

The basic task is to measure the similarity between any two files in our collection. To do this, we will need a suitable universe of words. This will consist of all words in the collection that are (a) more than four letters long, (b) don't occur more than 20 times overall, and (c) don't occur in more than 7 files in the collection. Now we constructor a vector (in the mathematical sense) corresponding to each file. The vector will have as many coordinates as words in the universe -- so there is one coordinate for each word in the universe. If a word occurs in the file, the corresponding coordinate is 1, otherwise it is 0.

Let us give an example: suppose the universe consists of the five words: apple, grapes, banana, doctor, program. Suppose file1 contains: apple, banana, program. Then the vector for file1 is (1,0,1,0,1).

We need to normalize each of the vectors so that it has unit length. So each coordinate in the above vector gets divided by the square root of 3.

The similarity of two files is defined to be the scalar product of the corresponding two vectors. The scalar product of two vectors is obtained by multiplying corresponding components and adding. For example, the scalar product of (2,1,3) and (0,5,6) is 2 * 0 + 1 * 5 + 3 * 6.

Your task is to write a program that prints the names of the two files with the highest similarity among the files in the collection, and the names of the two files with the lowest similarity. The collection is the same as the one processed in the above program, and consists of the files named here . You'll need to download the stuff in the directory, which is available in zipped form here .