Introduction
This project builds on Project 1. Here are the main differences between this project and Project 1.
Keeping track of frequencies
In this project, buildDict keeps track of the frequency of occurrence of each word that it extracts from the given text documents. As in Project 1, when buildDict completes extracting all the words from the document, it writes the extracted words onto a text file called words.dat. However, in this project, buildDict first writes the 500 most frequent words (in any order) into words.dat and then writes the remaining words (in any order) into words.dat. The spellCheck program reads words.dat and creates not one, but two dictionaries: a primary dictionary containing the 500 most frequent words and a secondary dictionary containing the remaining words. This plan implies that as buildDict is extracting words from the document, it is also updating their frequency of occurance. To be able to do this, modify the Record class to include an integer data member for word frequency, in addition to the two string data members it already has. As in Project 1, buildDict uses a hash table to keep track of the words that it extracts.
Two-level dictionary
The program buildDict separates the words it extracts into two categories of high frequency and low frequency words so as to help spellCheck build a more efficient, two-level dictionary data structure. The spellCheck program stores the 500 high frequency words in a "primary dictionary". Since the number of words being stored in the primary dictionary is small, we can afford to use a hash table, that has a small load factor - say 1/10. This means that the size of the hash table will be at least 5000. So as your table size you might want to pick the smallest prime greater than 5000. The rest of the words (i.e., the low frequency words) are stored in the "secondary dictionary." We may have to store a large number of words in the secondary dictionary and therefore due to memory constraints, we cannot use a hash table with too small a load factor. So for the secondary dictionary, use a hash table with a load factor of about 2 or 3. In other words, use a table whose size is about half or even a third of the number of words that you intend to store in the table.
The motivation for this two-level organization of the dictionary is that many empirical studies have found that 80% of the words in an average document come from a list of about 500 words! For these words, we can afford to build a hash table with small load factor and get, almost instantaneous response to insert and search operations. For the rest of the words, we can afford to provide slightly slower access, because they are accessed less frequently. This two-level organization provides a good balance between time and space efficiency.
While checking for spelling errors, the spellCheck program takes each word and first looks for it in the primary dictionary. If the word does not appear in the primary dictionary, it then looks in the secondary dictionary. This order of looking first in the primary dictionary and then in the secondary dictionary is quite important. If the word is not found in either dictionary, spellCheck will flag the word as being mispelled.
Suggesting Replacements
Like in Project 1, spellCheck responds to a misspelled word by providing the following prompt:
Do you want to replace (R), replace all (P), ignore (I), ignore all (N), or abort (A)?
However, what the program does next is different. If the user responds with an R or a P, then the program responds by producing a list of at most 10 suggested replacement words that are labeled 0 through 9, followed by an option that allows the user to input their own replacement word. So for example, if spellCheck encounters the word ont it might respond as follows:
(0) ant (1) one (2) on (3) opt (4) out (5) Use my replacementNote that the list of suggested replacements may not always contain 10 words, 10 is just the maximum number of replacement words produced by the program. We use the limit 10 because we do not want to inundate the user with too many options. Also note that sometimes, the program may find no replacement words and in this case the user will just see the prompt
(0) Use my replacementThe user responds by typing a number corresponding to the option they want. If, in the above example the user types number between 0 and 4 (inclusive) then the program just uses the corresponding replacement word from the list. Otherwise, if the user types 5, then the program needs to ask the user for a replacement word and it uses whatever the user types in response.
The big question now is: how do we generate reasonable replacement words? We do this by assuming that the most likely errors a user makes while typing are
The actual procedure for finding all correctly spelled words that are at most 2 hops away from w is fairly simple and you are responsible for figuring this out. Problem Set 8 should be helpful, in this regard. Note that your program should make no attempt to explicitly build the graph mentioned above - it is too big! I mention the graph only so that we can precisely define what we mean by good replacements for a mispelled word. The procedure you come up with for generating words that are at most 2 hops away from w, implicitly traverses the edges of the imaginary graph to find reasonable replacements.
Dealing with apostrophes
As you may have already noticed, the spelling checker in Project 1 makes no attempt to correctly deal with words that contain apostrophes such as "I'll" or "don't" or "could've." For example, if a text document used by buildDict contains the sentence:
I could've taken it yesterday, but I guess now I'll have to take it next Sunday.Then ve and ll are flagged as valid words, which they are clearly not. In this project I want you to implement a clean way of dealing with apostrophes. Specifically, if words such as "I'll" and "could've" occur in the text document that buildDict reads, then these words should be considered to be correctly spelled words. However, "ve" and "ll" should not be considered being correct. You should think about modifying our definition of a word from Project 1, in order to cleanly deal with apostrophes. You are responsible for figuring out a simple and clean scheme that can reasonably deal with apostrophes.
What to submit:
You are required to electronically submit by midnight on Friday, 4/14 the following files: BuildDict.java, SpellCheck.java, Dictionary.java, Link.java, LinkList.java, and Record.java. Make sure that your file names are exactly as above and make sure that you do not submit any additional files.
A final word:
While I have specified quite a few details, I have left enough unspecified so as to leave you with a few design decisions to make. These decisions will affect the efficiency, robustness, and readability of your program. So make your choices with care.