22C:21: Computer Science II: Data Structures

Project 1. Posted 9/27. Due 10/18

Introduction

The file miles.dat contains basic geographic data on 128 cities in North America. This project asks you to build a program that reads miles.dat and then efficiently responds to geographic queries about the data. Specifically, miles.dat contains, for each city, the following pieces of information:

Name of the city
State of the city
Latitude and longitude of the city
Population of the city
Distances to all other cities

This is old data - from a 1949 Rand McNally mileage guide, so the population numbers are definitely out of date. The highway mileage may also be out of date, but the data serves our purposes quite nicely. Note that the data file specifies for each city, the distances to all cities before it in the file; this is sufficient to get pairwise distance between any pair of cities. Also note that latitudes and longitudes are specified without decimal points or N, S, E, W indicators: for example Youngstown, OH is at latitude 41.09972 N and longitude 80.64972 W and this is specified as "[4110,8065]." Starting with this data, your program should, eventually, be able to answer queries such as:

"How many cities with at least 100,000 people are there between latitudes 40 N and 50 N and longitudes 85 W and 100 W (i.e., the upper midwest)?"
"Is it possible to get from Yakima, WA to Wilmington, DE in 10 hops, where each hop connects a pair of cities that are at most 200 miles apart?"
"What is the length of the shortest tour that visits each city in the northeast (roughly, 65-80 W, 38-50 N) exactly once?"

Your program will end up using the Graph class that we have constructed, a fair bit, in order to answer queries such as those mentioned above.

The `City` class

Start by designing and implementing a class called City that is reponsible for reading miles.dat, storing all of the information in appropriate data structures, and then answering queries efficiently.

Here is a suggestion for the data members that the class City should maintain:

1-dimensional arrays to maintain the name, state, latitude, longitude, and population of each city.
A 2-dimensional array to maintain pairwise distances between cities. As in the case of the adjacency matrix in the myGraph class, you only need to maintain the lower triangle of the distance matrix.
A boolean 1-dimensional array called selectedCities that keeps track of all currently selected cities. The purpose of selectedCities will become clear shortly.
A boolean 2-dimensional array called selectedEdges that keeps track of all currently selected edges. The purpose of selectedEdges will become clear shortly.

Feel free to use additional data members, if you think these will lead to greater efficiency, ease of programming, etc. Keep in mind that you will be graded on the design of your class as well, so not all choices are equal.

The member functions of the class City should consist of constructors along with a bunch of methods for answering a variety of queries. My suggestion is to get the default constructor City() to read from miles.dat and store information read from the file into the data members mentioned above. The methods that your class is required to support are as follows:

selectCities(attribute, lowerBound, upperBound): This method will be used to "select" a set of cities that satisfy some conditions. For example, this method can be used to select all cities in the upper midwest that have population 100,000 or more. Now I will explain the arguments of this method. The argument attribute is a String that can be one of "name", "state", "latitude", "longitude", and "population." Depending on the value of attribute, the arguments lowerBound and upperBound specify a range. So for example, if attribute is "population" then setting lowerBound to 100,000 and upperBound to 500,000 will select all cities whose population is between one hundred thousand and five hundred thousand (inclusive). Here is another example: if attribute is "name" then setting lowerBound to "R" and upperBound to "T" will select all cities whose names are lexicographically in the range between "R" and "T" (inclusive). These examples should indicate to you that while attribute is guaranteed to be a String, lowerBound and upperBound will have different types, depending on the value of attribute. Thus you will have to write several different versions of the selectCities function.
An important property that selectCities is required to have is that it selects only from those cities that are already selected. This allows us to use selectCities repeatedly to select a set of cities that satisfy several constraints For example, we could first call
```
	selectCities("population", 100000, 500000)
```
to select cities whose population is in the range 100000 to 500000. We could then call
```
	selectCities("name", "R", "T")
```
This latter call would select cities with names between "R" and "T" from only among those cities already selected, i.e., those that have population in the range 100000 to 500000. As a result, the set of cities selected at this point is all cities with names between "R" and "T," whose population is in the range 100000 to 500000. You should be able to see that by calling selectCities repeatedly, one can select a set of cities that satisfy several constraints, e.g., all cities in California whose names start with "S" and which have population more than 200,000.
Note that the 1-dimensional boolean array selectedCities mentioned in the data members is meant to keep track of the set of cities that are currently selected. Assume that initially no cities are selected. Whether selectAllCities should have a void return type or whether it should return the selected cities in some manner, is for you to decide.
selectAllCities() and unselectAllCities(): These methods respectively select all cities and unselect all cities.
selectEdges(lowerBound, upperBound): Here lowerBound and upperBound specify a "distance range." For example, if lowerBound is set to 0 and upperBound is set to 500, this method will select all edges between pairs of cities whose distance (as specified in miles.dat) is at most 500 miles. For now the only criteria we will use to select edges is this distance criterea. Information on which edges are selected is stored in the 2-dimensional boolean array selectedEdges. Assume that initially no edges are selected.
selectAllEdges() and unselectAllEdges(): These methods respectively select all edges and unselect all edges.
makeGraph() This method makes and returns a graph whose vertex set is the set of selected cities and whose edge set is all selected edges connecting pairs of selected cities. For example, suppose you have selected all cities with population 100,000 or more and all edges between pairs of cities whose pairwise distance is at most 200 miles. Then makeGraph() will construct and return a graph whose vertex set is the set of all cities with population 100,000 or more and with edges connecting pairs of such "high population" cities that are at most 200 miles apart. Using algorithms that we will study shortly, we can find out if such a graph has a single connected component. If so, then we know that it is possible to travel from any "high population" city to any other "high population" city, without visiting any "low population" city and furthermore using only hops of length at most 200 miles.
printCities(attribute): As before, attribute can be one of "name", "state", "latitude", "longitude", and "population." This methods prints all selected cities sorted in increasing order by the given attribute. For example, if the attribute is "latitude" then the selected cities will be printed in increasing order of latitude (i.e., south to north). You should also implement printCities(), i.e., with no given attribute, which should behave exactly like printCities("names").
printEdges(): This should print all selected edges, in no particular order.

Feel free to add extra methods beyond those that are listed above. Place yourself in the role of a user of the City class and ask yourself what other methods might be useful. Also, if you notice the same code fragment repeating in several methods, you should probably turn that into a "helper" function that can be called by other methods.

Implementation Details

Here are a couple of implementation issues to pay particular attention to.

Use the adjacency list implementation of the graph class given in myListGraph.java. Note that this depends on the linked list class implemented in StringLinkList.java. Over the next couple of weeks, I'll add the following new methods to the ListGraph class:
1. A method that returns all connected components of a graph.
2. A method that computes a shortest path (in hops) between a given pair of vertices in a graph.
By using the methods in the City class along with the above methods you will be able to answer interesting geographic queries.
To efficiently implement methods such as selectCities and printCities, it seems important to maintain the cities in sorted order. For example, consider the function call printCities("latitude"). This should print selected cities in increasing order of latitude. If the cities were already sorted by latitude, this would just be a matter of scanning the sequence of cities and printing out those that were selected. Otherwise, we would have to copy the selected set of cities into an array, sort these, and then print them out. Sorting in response to each print query seems like a bad idea - it is better to do the sorting in the beginning and try to make the response to queries as efficient as possible. The question then is: how do we maintain the set of cities, so that it is simultaneously sorted by several keys, i.e., sorted by name, as well as by state, as well as by population, etc.? Try to answer this question on your own; we'll provide some help in future labs. Note that maintaining the cities in sorted order helps improve the efficiency of selectCities as well because this method could be implemented by doing a binary search for lowerBound followed by a binary search for upperBound.

Experiments

To test your implementation of the City class, use your program to answer the following question:

Is it possible to travel from Yakima, WA to Wilmington, DE along edges of length at most 200 miles? If yes, then what is the fewest number of hops it takes to do so. If no, then what is the minimum amount by which the threshold of 200 miles has to be increased so as to make this possible.
What is the population distribution of cities between latitudes 40 N and 50 N and longitudes 85 W and 100 W (i.e., the upper midwest)? In other words, how many cities in this region have population in the ranges [0,20,000], [20,000, 40,000], [40,000, 60,000], etc.?
Based on the populations in miles.dat, which is the most populated state?

Put your code for Experiment 1 into a file called CityTester1.java, code for Experiment 2 into a file called CityTester2.java, and code for Experiment 3 into a file called CityTester3.java. These experiments are a good way not only to test the correctness of the City class, but to also get ideas on what other methods the City class might need.

Discussion

This project gives you a glimpse into how a geographic information system (GIS) might work. At the back end of the system there are multiple sources of geographic data (in our case, all our data comes from miles.dat). At the front end we have a "query engine" that efficiently responds to geographic queries from the user. The user of a GIS, typically has no knowledge of the data sources, size and format of the data files, etc. that the back end is dealing with. In our context, this separation translates into the testing programs, CityTester1.java, CityTester2.java, and CityTester3.java having no knowledge of or access to miles.dat. Visualization of data is a critical part of most GIS software - I might try to incorporate some primitive visualization into subsequent projects - we'll see. I'll also try to increase the size of our data set so that efficiency of your implementation becomes a key issue.

What to submit

Submit your java programs: City.java, CityTester1.java, CityTester2.java, and CityTester3.java. You will not be modifying myListGraph.java or the classes that it depends on, so none of these need be submitted. In addition you should submit City.readme. As usual this should mention all known errors, gross inefficiencies, all required methods that are unimplemented, and documentation for all "extra" methods that were implemented. Also, submit Project1.pdf; this should describe the answers you got from running the three experiments.