|
Books Index
Chapter 2: Collections
By Cay S. Horstmann and Gary Cornell
Array Lists
In the preceding section, you saw the List interface and the
LinkedList class
that implements it. The List interface describes an ordered collection in
which the position of elements matters. There are two protocols for visiting
the elements: through an iterator and by random access with methods get and s
et. The latter are not appropriate for linked lists, but of course they make a
lot of sense for arrays. The collections library supplies an ArrayList
class
that implements the List interface. An ArrayList is
similar to a Vector: it
encapsulates a dynamically reallocated Object[] array.
Why use an ArrayList instead of a Vector? There is
one simple reason. All
methods of the Vector class are synchronized. It is safe to
access a Vector
object from two threads. But if you only access a vector from a single
thread--by far the more common case--your code wastes quite a bit of time
with synchronization. In contrast, the ArrayList methods are not synchronized.
We recommend that you use an ArrayList instead of a
Vector whenever you don't
need synchronization.
Using an ArrayList is as simple as using a Vector.
Just keep in mind that you
need to use the short method names get and set instead of the elementAt and
setElementAt methods.
Hash Sets
Linked lists and arrays let you specify in which order you want to arrange
the elements. However, if you are looking for a particular element and you
don't remember its position, then you need to visit all elements until you
find a match. That can be time-consuming if the collection contains many
elements. If you don't care about the ordering of the elements, then there
are data structures that let you find elements much faster. The drawback is
that those data structures give you no control over the order in which the
elements appear. The data structures organize the elements in an order that
is convenient for their own purposes.
A well-known data structure for finding objects quickly is the hash table.
A hash table computes an integer, called the hash code, for each object. We
see in the next section how these hash codes are computed. What's important
for now is that hash codes can be computed quickly and that the computation
only depends on the state of the object that needs to be hashed, and not on
the other objects in the hash table.
A hash table is an array of linked lists. Each list is called a bucket
(see Figure 2-8). To find the place of an object in the table, compute its
hash code and reduce it modulo the total number of buckets. The resulting
number is the index of the bucket that holds the element. For example, if an
object has hash code 345 and there are 101 buckets, then the object is placed
in bucket 42 (because the remainder of the integer division 345/101 is 42).
Perhaps you are lucky and there is no other element in that bucket. Then, you
simply insert the element into that bucket. Of course, it is inevitable that
you sometimes hit a bucket that is already filled. This is called a hash
collision. Then, you need to compare the new object with all objects in that
bucket to see if it is already present. Provided that the hash codes are
reasonably randomly distributed and the number of buckets is large enough,
only a few comparisons should be necessary.
Figure 2-8: A hash table
If you want more control over the performance of the hash table, you can
specify the initial bucket count. The bucket count gives the number of buckets
that are used to collect objects with identical hash values. If too many
elements are inserted into a hash table, the number of collisions increases
and retrieval performance suffers.
If you know approximately how many elements will eventually be in the table,
then you should set the initial bucket count to about 150 percent of the
expected element count. Some researchers believe that it is a good idea to
make the size of the hash table a prime number to prevent a clustering of keys.
The evidence for this isn't conclusive, but it certainly can't hurt. For
example, if you need to store about 100 entries, set the initial bucket size
to 151.
Of course, you do not always know how many elements you need to store, or your
initial guess may be too low. If the hash table gets too full, it needs to be
rehashed. To rehash the table, a table with more buckets is created, all
elements are inserted into the new table, and the original table is discarded.
In the Java programming language, the load factor determines when a hash table
is rehashed. For example, if the load factor is 0.75 (which is the default)
and the hash table becomes more than 75 percent full, then the table is
automatically rehashed, using twice as many buckets. For most applications,
it is reasonable to leave the load factor at 0.75.
Hash tables can be used to implement several important data structures. The
simplest among them is the set type. A set is a collection of elements without
duplicates. The add method of a set first tries to find the object to be added
and only adds it if it is not yet present.
The Java collections library supplies a HashSet class that implements a set
based on a hash table. At the time of this writing, the default constructor
HashSet constructs a hash table with 101 buckets and a load factor of 0.75.
These values may change in future releases. If you at all care about these
values, you should specify your own, with the constructors
HashSet(int initialCapacity)
HashSet(
int initialCapacity, float loadFactor)
You add elements with the add method. The contains
method is redefined to make
a quick lookup to find if an element is already present in the set. It only
checks the elements in one bucket and not all elements in the collection. The
hash set iterator visits all buckets in turn. Since the hashing scatters the
elements around in the table, they are visited in seemingly random order. You
would only use a hash set if you don't care about the ordering of the elements
in the collection.
The sample program at the end of this section (Example 2-2) reads words from
System.in, adds them to a set and finally prints out all words in the set. For
example, you can feed the program the text from Alice in Wonderland (which you
can obtain from www.gutenberg.net) by launching
it from a command shell as
java SetTest < alice30.txt
The program reads all words from the input and adds them to the hash set. It
then iterates through the unique words in the set and finally prints out a
count. (Alice in Wonderland has 5,909 unique words, including the
copyright
notice at the beginning.) The words appear in random order.
Example 2-2: SetTest.java
import java.util.*;
import java.io.*;
public class SetTest
{ public static void main(String[] args)
{ Set words = new HashSet(59999);
// set to HashSet or TreeSet
long totalTime = 0;
try
{ BufferedReader in = new
BufferedReader(
new InputStreamReader(
System.in));
Stringline;
while ((line = in.readLine(
)) != null)
{ StringTokenizer tokenizer =
new StringTokenizer(
line);
while (
tokenizer.hasMoreTokens())
{ Stringword =
tokenizer.nextToken();
long callTime =
System.currentTimeMillis();
words.add(word);
callTime =
System.currentTimeMillis() -
callTime;
totalTime += callTime;
}
}
}
catch (IOException e)
{ System.out.println("Error " + e);
}
Iterator iter = words.iterator();
while (iter.hasNext())
System.out.println(iter.next());
System.out.println(words.size()
+ " distinct words. " + totalTime
+ " milliseconds.");
}
}
|
java.util.HashSet
-
HashSet()
constructs an empty hash set.
-
HashSet(Collection elements)
constructs a hash set and adds all elements from a collection.
Parameters: elements the elements to add
-
HashSet(int initialCapacity)
constructs an empty hash set with the specified capacity.
Parameters: initialCapacity the initial number
of buckets
-
HashSet(int initialCapacity, float loadFactor)
constructs an empty hash set with the specified capacity and load factor.
Parameters: initialCapacity the initial number of buckets
loadFactor a number between 0.0 and 1.0 that determines at what
percentage of fullness the hash table will be rehashed into a larger one
Hash functions
You can insert strings into a hash set because the String class
has a hashCode
method that computes a hash code for a string. A hash code is an integer that
is somehow derived from the characters in the string. Table 2-1 lists a few
examples of hash codes that result from the hashCode function of
the String class.
Table 2-1: Hash codes resulting from the hashCode function
String Hash code
Hello 140207504
Harry 140013338
Hacker 884756206
The hashCode method is defined in the Object class.
Therefore, every object
has a default hash code. That hash code is derived from the object's memory
address. In general, the default hash function is not very useful because
objects with identical contents may yield different hash codes. Consider this
example.
String s = "Ok";
StringBuffer sb = new StringBuffer(s);
System.out.println(
s.hashCode() + " " + sb.hashCode());
String t = "Ok";
StringBuffer tb = new StringBuffer(t);
System.out.println(
t.hashCode() + " " + tb.hashCode());
|
Table 2-2 shows the result.
Table 2-2: Hash codes of objects with identical contents
|
|
"Ok"Stringhash code
3030
3030
|
"Ok" StringBuffer hash code
20526976
20527144
|
|
|
Note that the strings s and t have the same hash value because, for strings,
the hash values are derived from their contents. The string buffers
sb and tb
have different hash values because no special hash function has been defined
for the StringBuffer class and the default hash code function
in the Object
class derives the hash code from the object's memory address.
You should always define the hashCode method for objects that you insert into
a hash table. This method should return an integer (which can be negative).
The hash table code will later reduce the integer by dividing by the bucket
count and taking the remainder. Just scramble up the hash codes of the data
fields in some way that will give the hash codes for different objects a good
chance of being widely scattered.
For example, suppose you have the class Item for inventory items. An item
consists of a description string and a part number.
anItem = new Item("Toaster", 49954);
If you want to construct a hash set of items, you need to define a hash code.
For example,
class Item
{ . . .
public int hashCode()
{ return 13 * description.hashCode(
) +
17 * partNumber;
}
. . .
privateStringdescription;
private int partNumber;
}
|
As a practical matter, if the part number uniquely identifies the item, you
don't need to incorporate the hash code of the description.
Furthermore, you also need to make sure that the equals method is well defined.
The Object class defines equals, but that method only tests whether or not two
objects are identical. If you don't redefine equals, then every new object
that you insert into the table will be considered a different object.
You need to redefine equals to check for equal contents.
class Item
{ . . .
public boolean equals(Object other)
{ if (other != null && getClass() ==
other.getClass())
{ Item otherItem = (Item)other;
return description.equals(
otherItem.description)
&& partNumber ==
otherItem.partNumber;
}
else
return false;
}
. . .
}
|
CAUTION: Your definitions of equals and hashCode must be compatible: if
x.equals(y) is true, then x.hashCode() must be the same value as y.hashCode().
java.lang.Object
- boolean equals(Object obj)
compares two objects for equality; returns true if both objects are
equal; false otherwise.
Parameters: obj the object to compare with the first
object (may be null, in which case the method should return false)
- int hashCode()
returns a hash code for this object. A hash code can be any integer,
positive or negative. Equal objects need to return identical hash codes.
Tree Sets
TheTreeSetclass is similar to the hash set, with one added
improvement.
A tree set is a sorted collection. You insert elements into the collection
in any order. When you iterate through the collection, the values are
automatically presented in sorted order. For example, suppose you insert
three strings and then visit all elements that you added.
TreeSet sorter = new TreeSet();
sorter.add("Bob");
sorter.add("Angela");
sorter.add("Carl");
Iterator iter = sorter.iterator();
while (iter.hasNext(
)) System.println(
iter.next());
|
Then, the values are printed in sorted order: Angela Bob Carl. As the name of
the class suggests, the sorting is accomplished by a tree data structure.
(The current implementation uses a red-black tree. For a detailed description
of red-black trees, see, for example, Introduction to Algorithms by Thomas
Cormen, Charles Leiserson, and Ronald Rivest [The MIT Press 1990].) Every
time an element is added to a tree, it is placed into its proper sorting
position. Therefore, the iterator always visits the elements in sorted order.
Adding an element to a tree is slower than adding it to a hash table, but it
is still much faster than adding it into the right place in an array or linked
list. If the tree containsnelements, then an average of
log2ncomparisons
are required to find the correct position for the new element. For example,
if the tree already contains 1,000 elements, then adding a new element
requires about 10 comparisons.
Thus, adding elements into aTreeSetis somewhat slower than
adding into a
HashSet --see Table 2-3 for a comparison--but
theTreeSetautomatically sorts
the elements.
Table 2-3: Adding elements into hash and tree sets
|
|
|
Document
|
Total number of words |
Number of distinct words
|
HashSet
|
TreeSet
|
|
Alice in Wonderland |
28195 |
5909 |
5 sec |
7 sec |
|
The Count of Monte Cristo |
466300 |
37545 |
75 sec |
98 sec |
|
|
java.util.TreeSet
-
TreeSet()
constructs an empty tree set.
-
TreeSet(Collection elements)
constructs a tree set and adds all elements from a collection.
Parameters: elements the elements to add
Object comparison
How does theTreeSetknow how you want the elements sorted? By default, the
tree set assumes that you insert elements that implement the Comparable
interface. That interface defines a single method:
int compareTo(Object other)
The call a.compareTo(b) must return 0 if a
and b are equal, a negative integer
if a comes before b in the sort order, and a positive integer if a comes after
b. The exact value does not matter; only its sign (>0, 0, or < 0) matters.
Several standard
Java platform classes implement the Comparable interface. One example is the
String class. Its compareTo method compares strings
in dictionary order
(sometimes called lexicographic order).
If you insert your own objects, you need to define a sort order yourself by
implementing the Comparable interface. There is no default implementation of
compareTo in the Object class.
For example, here is how you can sort Item objects by part number.
class Item implements Comparable
{ public int compareTo(Object other)
{ Item otherItem = (Item)other;
return partNumber -
otherItem.partNumber;
}
. . .
}
|
Note that the explicit argument of the compareTo method has type
Object, not
Comparable. If the object is not of the correct type, then this compareTo
method simply throws a ClassCastException. (The compareTo
methods in the
standard library behave in the same way when presented with illegal argument
types.)
If you compare two positive integers, such as part numbers in our example,
then you can simply return their difference--it will be negative if the first
item should come before the second item, zero if the part numbers are
identical, and positive otherwise.
CAUTION: This trick does not work if the integers can be negative. If x is a
large positive integer and y is a large negative integer, then the difference
x - y can overflow.
However, using the Comparable interface for defining the sort order has
obvious limitations. You can only implement the interface once. But what
can you do if you need to sort a bunch of items by part number in one
collection and by description in another? Furthermore, what can you do if
you need to sort objects of a class whose creator didn't bother to implement
the Comparable interface?
In those situations, you tell the tree set to use a different comparison method,
by passing a Comparator object into theTreeSetconstructor. The Comparator
interface has a single method, with two explicit parameters:
int compare(Object a, Object b)
Just like the compareTo method, the compare method returns a negative integer
if a comes before b, zero if they are identical, or a positive integer
otherwise.
To sort items by their description, simply define a class that implements
the Comparator interface:
class ItemComparator
implements Comparator
{ public int compare(Object a, Object b)
{ Item itemA = (Item)a;
Item itemB = (Item)b;
StringdescrA =
itemA.getDescription();
StringdescrB =
itemB.getDescription();
return descrA.compareTo(descrB);
}
}
|
You then pass an object of this class to the tree set constructor:
ItemComparator comp = new ItemComparator();
TreeSet sortByDescription = new TreeSet(
comp);
If you construct a tree with a comparator, it uses this object whenever it
needs to compare two elements.
Note that the item comparator has no data. It is just a holder for the
comparison method. Such an object is sometimes called a function object.
Function objects are commonly defined "on the fly", as instances of anonymous
inner classes:
TreeSet sortByDescription = new TreeSet(
new Comparator()
{ public int compare(Object a, Object b)
{ Item itemA = (Item)a;
Item itemB = (Item)b;
StringdescrA =
itemA.getDescription();
StringdescrB =
itemB.getDescription();
return descrA.compareTo(descrB);
}
});
|
Using comparators, you can sort elements in any way you wish.
If you look back at Table 2-3, you may well wonder if you should always use a
tree set instead of a hash set. After all, adding elements does not seem to
take much longer, and the elements are automatically sorted. The answer
depends on the data that you are collecting. If you don't need the data
sorted, there is no reason to pay for the sorting overhead. More importantly,
with some data it is very difficult to come up with a sort order. Suppose you
collect a bunch of rectangles. How do you sort them? By area? You can have two
different rectangles with different positions but the same area. If you sort
by area, the second one is not inserted into the set. The sort order for a
tree must be a total ordering: Any two elements must be comparable, and the
comparison can only be zero if the elements are equal. There is such a sort
order for rectangles (the lexicographic ordering on its coordinates), but it
is unnatural and cumbersome to compute. In contrast, hash functions are
usually easier to define. They only need to do a reasonably good job of
scrambling the objects, whereas comparison functions must tell objects apart
with complete precision.
The program in Example 2-3 builds two tree sets of Item objects. The first one
is sorted by part number, the default sort order of item objects. The second
set is sorted by description, using a custom comparator.
Example 2-3: TreeSetTest.java
import java.util.*;
public class TreeSetTest
{ public static void main(
String[] args)
{ SortedSet parts = new TreeSet();
parts.add(new Item("Toaster", 1234));
parts.add(new Item("Widget", 4562));
parts.add(new Item("Modem", 9912));
System.out.println(parts);
SortedSet sortByDescription =
new TreeSet(
new Comparator()
{ public int compare(
Object a, Object b)
{ Item itemA = (Item)a;
Item itemB = (Item)b;
StringdescrA =
itemA.getDescription();
StringdescrB =
itemB.getDescription();
return descrA.compareTo(descrB);
}
});
sortByDescription.addAll(parts);
System.out.println(
sortByDescription);
}
}
class Item implements Comparable
{ public Item(
String aDescription, int aPartNumber)
{ description = aDescription;
partNumber = aPartNumber;
}
publicStringgetDescription()
{ return description;
}
publicStringtoString()
{ return "[description=" +
description
+ ", partNumber=" +
partNumber + "]";
}
public boolean equals(Object other)
{ if (getClass() == other.getClass())
{ Item otherItem = (Item)other;
return description.equals(
otherItem.description)
&& partNumber ==
otherItem.partNumber;
}
else
return false;
}
public int hashCode()
{ return 13 * description.hashCode(
) + 17 * partNumber;
}
public int compareTo(Object other)
{ Item otherItem = (Item)other;
return partNumber -
otherItem.partNumber;
}
private String description;
private int partNumber;
}
|
java.lang.Comparable
-
int compareTo(Object other)
compares this object with another object and returns a negative value if
this comes before other, zero if they are considered identical in the sort
order, and a positive value if this comes after other.
Parameters: other the object to compare
java.util.Comparator
-
int compare(Object a, Object b)
compares two objects and returns a negative value if a comes before b, zero
if they are considered identical in the sort order, and a positive value if
a comes after b.
Parameters: a, b the objects to compare
java.util.SortedSet
-
Comparator comparator()
returns the comparator used for sorting the elements, or null if the elements
are compared with the compareTo method of the Comparable interface.
-
Object first()
-
Object last()
return the smallest or largest element in the sorted set.
java.util.TreeSet
-
TreeSet(Comparator c)
constructs a tree set and uses the specified comparator for sorting its
elements.
Parameters: c the comparator to use for sorting
-
TreeSet(SortedSet elements)
constructs a tree set, adds all elements from a sorted set, and uses the same
element comparator as the given sorted set.
Parameters: elements the sorted set with the elements to
add and the comparator to use
Maps
A set is a collection that lets you quickly find an existing element.
However, to look up an element, you need to have an exact copy of the
element to find. That isn't a very common lookup--usually, you have some
key
information, and you want to look up the associated element. The map data
structure serves that purpose. A map stores key/value pairs. You can find a
value if you provide the key. For example, you may store a table of employee
records, where the keys are the employee IDs and the values are
Employee
objects.
The Java library supplies two general purpose implementations for maps:
HashMap and TreeMap. A hash map hashes the keys, and a tree map uses a total
ordering on the keys to organize them in a search tree. The hash or comparison
function is applied only to the keys. The values associated with the keys are
not hashed or compared.
Should you choose a hash map or a tree map? As with sets, hashing is a bit
faster, and it is the preferred choice if you don't need to visit the keys
in sorted order.
Here is how you set up a hash map for storing employees.
HashMap staff = new HashMap();
Employee harry =
new Employee("Harry Hacker");
staff.put("987-98-9996", harry);
. . .
|
BACK | NEXT
|