Programming Techniques and Data Structures (22C:17, 22C:107)

Project 2: Due partly on 3/21/97 and partly on 3/31/97

Introduction

This project builds on Project 1. There are two main differences between this project and Project 1.

  1. You will now build a larger dictionary and will therefore use more sophisticated data structures to provide efficient insertion and search. Specifically, you will use a hash table data structure in this project.

  2. You will provide a more sophisticated interface to the user, by suggesting possible replacement words for errors discovered by spellCheck.

Overview

As in Project 1, you will write two separate programs: buildDict and spellCheck. The program buildDict builds a dictionary of words, while the program spellCheck uses this dictionary to determine which words in a given document are misspelled. As before, buildDict builds the dictionary by extracting words from a document known to be free of spelling errors. However, in this project buildDict also keeps track of the frequency of occurrence of each word it extracts. When buildDict has finished extracting words from the document, it writes them to a text file called words.dat: it first writes the 500 most frequent words (in any order) and then writes the remaining words in alphabetical order.

The spellCheck program reads words.dat and creates not one, but two dictionaries: a primary dictionary containing the 500 most frequent words and a secondary dictionary containing the remaining words. The primary dictionary will be stored in a hash table data structure, while the secondary dictionary will be stored, as before, in a sorted Vector of string objects. A hash table is used for the primary dictionary because it provides almost instantaneous search. For each word in the document being checked for spelling errors, spellCheck first consults the primary dictionary and, if the word is not found there, consults the secondary dictionary. The motivation for this two-level organization of the dictionary is that many empirical studies have found that 80% of the words in an average document come from a list of about 500 words!
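To make the layout of words.dat concrete, it might look something like the sketch below, assuming one word per line; the exact file format is a design decision the handout leaves to you, and the words shown here are invented purely for illustration.

    the        <- lines 1-500: the 500 most frequent words, in any order
    of
    and
    ...
    abandon    <- remaining lines: all other words, in alphabetical order
    ability
    able
    ...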

Building a Dictionary

The rules that define what a word is are as described in the Project 1 handout, with one change. Now I also want you to throw out capitalized words that do not start a sentence. That is, such strings should no longer be considered valid words and should not be inserted into the dictionary. A word is said to be capitalized if its first letter is in upper case and the rest of its letters are in lower case. A word is said to begin a sentence if the non-whitespace character immediately preceding its first letter is a period. The reason for throwing out capitalized words that do not begin a sentence is that such words are likely to be proper nouns. Throwing out words in this way may not remove all proper nouns, and may in fact remove some words that are not proper nouns.

For this project, buildDict needs to keep track of the frequency of occurrence of the words it extracts from the available texts. This means that the wordList class needs to be modified to keep track of frequency information. Call the modified class fwList (short for "frequency and word list"). You can decide what specific data members and member functions the fwList class needs in order to achieve this, but here are a few of my recommendations (a sketch follows the list below).

  1. Keep the member functions of the wordList class as they are, with two changes.

    1. The insert member function should now increment the frequency of the word being inserted.

    2. The overloaded << operator should produce as output a list of words in which the first 500 are the 500 most frequently occurring words, followed by the remaining words in alphabetical order.

  2. You may also consider adding a couple of other member functions to the class:

    1. int getFrequency(const string & myWord)
      This function takes a word and returns its frequency, as currently recorded. If the word does not exist in the list of words, the function returns 0.

    2. vector getMostFrequent(int k)
      This function returns a vector of the k most frequently occurring words in the list of words.

You are free to add other member functions that make the class more convenient to use.
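To make these recommendations concrete, here is a minimal sketch of the capitalization test and of one possible fwList interface. It is only a sketch: the standard string and vector classes stand in for the course's versions, the private representation (parallel vectors of words and counts) and the slow linear-search bodies are placeholders meant to show the intended behavior rather than an efficient design, and the helper name isCapitalized is mine, not part of the assignment. In your program these operations should grow out of your Project 1 wordList machinery.

    #include <cctype>
    #include <string>
    #include <vector>
    using namespace std;

    // A word is capitalized if its first letter is in upper case and the
    // rest of its letters are in lower case.
    bool isCapitalized(const string& w) {
        if (w.empty() || !isupper((unsigned char)w[0])) return false;
        for (size_t i = 1; i < w.length(); i++)
            if (!islower((unsigned char)w[i])) return false;
        return true;
    }

    // One possible outline of the fwList ("frequency and word list") class.
    class fwList {
    public:
        // Insert a word; if it is already present, just increment its count.
        void insert(const string& word) {
            for (size_t i = 0; i < words.size(); i++)
                if (words[i] == word) { counts[i]++; return; }
            words.push_back(word);
            counts.push_back(1);
        }

        // Return the recorded frequency of word, or 0 if it is absent.
        int getFrequency(const string& word) const {
            for (size_t i = 0; i < words.size(); i++)
                if (words[i] == word) return counts[i];
            return 0;
        }

        // Return (up to) the k most frequently occurring words, found by
        // repeatedly selecting the as-yet-unselected word with the
        // largest count.
        vector<string> getMostFrequent(int k) const {
            vector<string> result;
            vector<bool> taken(words.size(), false);
            while ((int)result.size() < k) {
                int best = -1;
                for (size_t i = 0; i < words.size(); i++)
                    if (!taken[i] && (best < 0 || counts[i] > counts[best]))
                        best = (int)i;
                if (best < 0) break;      // fewer than k distinct words
                taken[best] = true;
                result.push_back(words[best]);
            }
            return result;
        }

        // The overloaded << operator (not shown) would print the 500 most
        // frequent words first, then the remaining words alphabetically.

    private:
        vector<string> words;   // one possible representation:
        vector<int>    counts;  // counts[i] is the frequency of words[i]
    };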

The spellCheck program

After the buildDict program completes its task and creates the dictionary file words.dat, the spellCheck program is ready to take over. This program starts by reading the first 500 words from the file words.dat and storing them in a hash table. This is the primary dictionary. The hash table data structure is described in the next section. It then reads the remaining words from words.dat and stores them in a list sorted in alphabetical order. This is the secondary dictionary.

Then, as in Project 1, spellCheck sorts the words in the document it is spell-checking and removes duplicates. For each word in the resulting list, spellCheck searches the primary dictionary first. If the word is not found in the primary dictionary, spellCheck searches the secondary dictionary. Any word not found in either dictionary gets written to the file errors.dat. So for the program spellCheck you will have to define two classes: hashTable and sortedTable. I will leave the task of designing these classes to you.
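As a rough illustration of this checking loop, here is a sketch only: hashTable and sortedTable, and the member name search used here, are hypothetical stand-ins for the classes you will design, and the code that reads words.dat and the document itself is omitted.

    #include <fstream>
    #include <string>
    #include <vector>
    #include "hashTable.h"     // your class declarations
    #include "sortedTable.h"
    using namespace std;

    // Assumes hashTable and sortedTable each provide a boolean search function.
    void reportErrors(const vector<string>& documentWords,  // sorted, no duplicates
                      const hashTable& primary,             // 500 most frequent words
                      const sortedTable& secondary,         // the remaining words
                      ofstream& errorsFile) {
        for (size_t i = 0; i < documentWords.size(); i++) {
            const string& word = documentWords[i];
            // Consult the primary (hash table) dictionary first; only when the
            // word is missing there do we search the secondary dictionary.
            if (!primary.search(word) && !secondary.search(word))
                errorsFile << word << endl;   // not in either dictionary
        }
    }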

What is a hash table?

The idea of a hash table is simple. Let U be the set of all possible strings of lower case letters. Let h be a function that maps each string in U to a number in the range 0 through M-1. Then any subset S of U can be stored in a Vector called table, of size M, by storing each string s in slot h(s). To determine whether a given string s is in S, we simply compute h(s) and determine if table[h(s)] contains s. The only problem with this method is that there can be collisions between strings. For example, suppose that for two distinct strings s and s' in S, h(s) and h(s') are both equal to some integer j. This means that both s and s' have to be stored in slot j. To handle such a situation, table should be defined not as a Vector of string objects, but as a Vector of string pointers. The only aspect of a hash table left to discuss is the hash function h itself.

The following paragraph contains the definition of a possible hash function h. Any string of lower case letters can be thought of as a number in base 26, by thinking of each letter as the number obtained by subtracting 97 (the ASCII value of 'a') from the ASCII value of the letter. For example, the string next can be thought of as the following base-26 number: 13 4 23 19. Thus a string s = c_0 c_1 ... c_(n-1) has an equivalent decimal value defined as:

    value(s) = x_0 * 26^(n-1) + x_1 * 26^(n-2) + ... + x_(n-1) * 26^0

where x_i equals the ASCII value of c_i minus 97, for each i, 0 <= i <= n-1. Now for any string s, define a function h as follows:

    h(s) = value(s) mod M

Thus the function h maps any string onto the range 0 through M-1. The only thing left to decide is the value of M. It is common practice to use a hash table whose size is approximately 10 times the number of items being stored in it. It is also common practice to choose the size of the hash table to be a prime number. These two considerations led me to pick 5003 (the smallest prime larger than 5000) as the value for M.
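Here is a minimal sketch of this hash function in C++, together with one possible hashTable built on a vector of string pointers. The handout does not fix a collision strategy, so the linear probing used below (step to the next slot until an empty one is found) is just one reasonable choice; it works comfortably here because the table (5003 slots) is far larger than the 500 words stored in it. The standard string and vector classes again stand in for the course's versions, and the names hashValue, insert, and search are illustrative.

    #include <string>
    #include <vector>
    using namespace std;

    const int M = 5003;   // table size: the smallest prime larger than 5000

    // h(s) = (x_0*26^(n-1) + x_1*26^(n-2) + ... + x_(n-1)*26^0) mod M,
    // computed with Horner's rule so intermediate values stay small.
    // Assumes the word contains only lower case letters.
    int hashValue(const string& s) {
        long value = 0;
        for (size_t i = 0; i < s.length(); i++)
            value = (value * 26 + (s[i] - 'a')) % M;
        return (int)value;
    }

    // One possible hashTable: a vector of M string pointers, with empty
    // slots represented by null pointers and collisions resolved by
    // linear probing.
    class hashTable {
    public:
        hashTable() : table(M, (string*)0) {}

        ~hashTable() {
            for (int i = 0; i < M; i++) delete table[i];
        }

        void insert(const string& word) {
            int slot = hashValue(word);
            while (table[slot] != 0 && *table[slot] != word)
                slot = (slot + 1) % M;          // collision: try the next slot
            if (table[slot] == 0)               // not already present
                table[slot] = new string(word);
        }

        bool search(const string& word) const {
            int slot = hashValue(word);
            while (table[slot] != 0) {
                if (*table[slot] == word) return true;
                slot = (slot + 1) % M;          // keep probing past collisions
            }
            return false;                       // empty slot reached: not present
        }

    private:
        vector<string*> table;
    };

For example, hashValue("next") computes (13*26^3 + 4*26^2 + 23*26 + 19) mod 5003 = 231809 mod 5003 = 1671, so the word next would normally be stored in slot 1671.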

What (and when) to submit:

You are required to submit your work for this project in two parts. The first part of the submission contains the files buildDict.C, row.h, row.cc, wordList.h, and wordList.cc; these files essentially constitute the buildDict program. They need to be electronically submitted by 5 pm on Friday, 3/21. The rest of the files (spellCheck.C, hashTable.h, hashTable.cc, sortedTable.h, and sortedTable.cc), which constitute the program spellCheck, are due by 5 pm on Monday, 3/31. As in Project 1, you should submit your files using the submit program.

A final word:

While I have specified quite a few details, I have left enough unspecified so as to leave you with a few design decisions to make. These decisions will affect the efficiency, robustness, and readability of your program. So make your choices with care.

This project is more complicated than Project 1 along many dimensions. I suspect it will take most of you all of three weeks; so please do not wait for two more weeks to start!

In general, the graders will look for several things besides the correctness of your program. The overall organization of your program, the partitioning of the various tasks into functions, and program documentation will all be taken into account while grading.



Sriram Pemmaraju
Mon Mar 3 08:03:11 CST 1997