22C:30, 22C:115 Project 1

Due date: Tuesday, 10/7, 5 pm

Introduction

This is the first in a series of 3 programming projects at the end of which you will have constructed a spell checker -- program that checks and possibly corrects spelling errors in documents. In this project you will take a first step, by building a somewhat primitive spell checker. It is my hope that this exercise will give you a glimpse into some of difficulties involved in constructing good programs for checking and correcting spellings.

Overview

As part of this project you will have to write two separate programs: buildDict and spellCheck. As the name suggests, the program buildDict builds a small dictionary of words, while the program spellCheck uses this dictionary of words to determine which words in a given document are misspelled. The task that buildDict performs can be thought of as a preprocessing step; buildDict will be used rarely as compared to spellCheck. Once a dictionary has been built, then there is no reason to use buildDict again until the performance of spellCheck is found to be poor enough to require a modification of the dictionary. The implication is that buildDict can expend more time and space resources on its task as compared to spellCheck.

Building a Dictionary

Building a good dictionary is a rather hard task and in this first attempt, the dictionary we build, will most likely lead to rather poor performance by spellCheck. However, in subsequent attempts we will improve and enlarge our dictionary. In this project, we will build a dictionary by extracting words from large on-line documents known to be error free. I am making available the following three famous novels in text form:

The above documents can be found in the files: alice.txt, carol.txt, and hyde.txt.

The program buildDict should read each of these documents, extract words from them, and insert them into a dictionary. It is important to have a precise understanding of what we mean by a word before we proceed. As far as this project is concerned, a word is a contiguous sequence of 2 or more letters (lower case or upper case) immediately followed by a non-letter character. For this project we will ignore the difference between upper and lower case letters and therefore the words The and the will be considered identical. To understand the above definition, consider the following text:

The giant telephone company AT&T bought an IBM360 computer last year. Since then the company has had an error-free run in the telecommunications industry.

The distinct words extracted from this text are listed below in alphabetical order:

 
an	at	bought	company	computer	error 		free
giant	had	has 	ibm 	in 		industry	last
run	since	telephone 	telecommunications	 	the 		 
then 	year

Notice that the word at has been extracted from AT&T and the hyphenated word error-free has contributed two words error and free to the above list. Each of the extracted words needs to be inserted into a data structure that can also be searched quickly (so that the same word is not inserted twice). To support this you should define and implement a class called wordList that can be used by the buildDict program. This class should contain a vector of words in alphabetically sorted order. I know that there are many different vector classes. For the sake of consistency and because we discussed this particular vector class in my lectures, I would like you to use the apvector class available on the course page. The class worldList should contain at least the following member functions:

  1. bool search(const string & myWord, int & position)
    This function returns true if myWord is found in the dictionary; otherwise it returns false. When the function returns true, it also returns additionally the position in the vector where the myWord was found.

  2. bool insert(const string & myWord)
    Using this function the user will attempt to insert words into the dictionary. The function returns true if the myWord has been successfully inserted.Otherwise, it returns false. A word may not be successfully inserted, if it is already in the dictionary.

  3. int numberWords()
    This function returns the number of words in the dictionary.

  4. A function that overloads the operator <<
    This function allows the user to write an output statement such as:

    cout << dictionary;

    in her programs. As before, dictionary is an object belonging to the class wordList. The effect of the above output statement should be to insert in alphabetical order the list of words in dictionary into the output stream in a neat and readable format.

The program buildDict should start by asking the user for a list of text files to read from. The user should respond by typing the file names as

[filename1] [white space] [filename2] [white space] [filename3]...
In this version of the project you will be using only three files: alice.txt, carol.txt, and hyde.txt. Then buildDict will read all specified text files and build a dictionary of all distinct words found. It should then write the contents of this dictionary into a file called words.dat. This file constitutes the dictionary that will subsequently be used by the program spellCheck. The file words.dat should contain words in alphabeticallt sorted order.

The spellCheck program

After the buildDict program completes its task and creates a dictionary file words.dat, the spellCheck program is ready to take over. This program should start by reading the list of words in the file words.dat into a data structure that can be quickly searched. In addition to the words in words.dat, the program spellCheck should add valid single letter words such as A and I to the data structure. For this data structure also, you should use the wordList class that you used in buildDict so that you do not have to write fresh code. Overload the extract operator >> so that it can be used in the following manner:

cin >> dictionary;

to extract from the input stream a list of words and store them in the wordList object called dictionary. After spellCheck has read the list of words from words.dat it should prompt the user for the name of a document she wants spell-checked. spellCheck should then read words from this user specified document, one-by-one and test if the words appears in its dictionary. As a running example, suppose that the user wants a file called myLetter to be spell-checked. As spellCheck reads from the file myLetter, it writes into a file called myLetter.out. After spellCheck has completely processed myLetter, the file myLetter.out will be the corrected version of myLetter. Note that myLetter.out is identical to myLetter except for misspelled words being replaced by correctly spelled words. The spellCheck program should make sure that all the white spaces and punctuation marks that appear in myLetter appear exactly in the same manner in myLetter.out. As the program scans the user specified document, it may encounter a word that is not present in its dictionary. It takes such a word to be misspelled and responds by (i) displaying the misspelled word and (ii) producing the following prompt:

Do you want to replace (R), replace all (P), ignore (I), ignore all (N), or exit (E)?
The user is expected to respond by typing an R, P, I, N, or an E. If the user types anything else, the program should simply ask the question again. Here is what your program should do for each of the possible responses from the user. The actions corresponding to the responses R, I, and E are easy to implement. The actions corresponding to the responses P and N require an additional data structure. For example, if the user asks that all occurrances of colour be replaced by color, then the program needs to remember this so that whenever subsequent occurances of the word colour are encountered they are replaced by color. Similarly, for N. Define a map class that consists of two vectors of strings of identical size. Suppose that left and right are the names of the two vectors. Then for each index i, the word in right[i] is a replacement for the word in left[i]. The spellCheck program should define an object, say replacementWords, that belongs to the map class. Suppose the misspelled word colour is encountered and the user responds by asking that all occurances of colour be replaced by color. Then, the pair of words colour and color should be inserted into the vectors left and right. Similarly, if the user responds by asking that all occurances of the word colour be ignored, then the program should insert the word colour into both left and right. So the same word is taken to be its replacement and as a result no change is made to the document. This organization means that when a word is encountered, the spellCheck program should first look into the replacementWords object and only if it does not find the word there, it should look in the dictionary. Like in the case of wordList class, the map class should provide, at least the following member functions. Again, it may help to keep the words in alphabetically sorted order (sorted in the array left).

What to submit:

You are required to electronically submit by 5 pm on Tuesday, 10/7 the following files: buildDict.cxx, spellCheck.cxx, wordList.cxx, wordList.h, map.cxx, and map.h. Make sure that your file names are exactly as above and make sure that you do not submit any additional files. You will submit your project using a program called submit that you can use on computer science departmental machines. I will post more details related to this program on the course page.

Final words:

While I have specified quite a few details, I have left enough unspecified so as to leave you with a few design decisions to make. For example, you should define some number of constructors for the wordList and the map class. You need to decide what the syntax for these constructors should be and what exactly they should accomplish.

You have 3 weeks for this project because I expect that it will take you 3 weeks! So please start right away.

We will soon post a grading scheme for the project that will help you decide what pieces to focus on first.



Sriram Pemmaraju
Tuesday Sept 16 2003