Sun Java Solaris Communities My SDN Account Join SDN
 
Core Java 2, Volume II

Core Java 2, Volume II - Chapter 2: Collections

 

Books Index



Array Lists

In the preceding section, you saw the List interface and the LinkedList class that implements it. The List interface describes an ordered collection in which the position of elements matters. There are two protocols for visiting the elements: through an iterator and by random access with methods get and s et. The latter are not appropriate for linked lists, but of course they make a lot of sense for arrays. The collections library supplies an ArrayList class that implements the List interface. An ArrayList is similar to a Vector: it encapsulates a dynamically reallocated Object[] array.

Why use an ArrayList instead of a Vector? There is one simple reason. All methods of the Vector class are synchronized. It is safe to access a Vector object from two threads. But if you only access a vector from a single thread--by far the more common case--your code wastes quite a bit of time with synchronization. In contrast, the ArrayList methods are not synchronized. We recommend that you use an ArrayList instead of a Vector whenever you don't need synchronization.

Using an ArrayList is as simple as using a Vector. Just keep in mind that you need to use the short method names get and set instead of the elementAt and setElementAt methods.

Hash Sets

Linked lists and arrays let you specify in which order you want to arrange the elements. However, if you are looking for a particular element and you don't remember its position, then you need to visit all elements until you find a match. That can be time-consuming if the collection contains many elements. If you don't care about the ordering of the elements, then there are data structures that let you find elements much faster. The drawback is that those data structures give you no control over the order in which the elements appear. The data structures organize the elements in an order that is convenient for their own purposes.

A well-known data structure for finding objects quickly is the hash table. A hash table computes an integer, called the hash code, for each object. We see in the next section how these hash codes are computed. What's important for now is that hash codes can be computed quickly and that the computation only depends on the state of the object that needs to be hashed, and not on the other objects in the hash table.

A hash table is an array of linked lists. Each list is called a bucket (see Figure 2-8). To find the place of an object in the table, compute its hash code and reduce it modulo the total number of buckets. The resulting number is the index of the bucket that holds the element. For example, if an object has hash code 345 and there are 101 buckets, then the object is placed in bucket 42 (because the remainder of the integer division 345/101 is 42).

Perhaps you are lucky and there is no other element in that bucket. Then, you simply insert the element into that bucket. Of course, it is inevitable that you sometimes hit a bucket that is already filled. This is called a hash collision. Then, you need to compare the new object with all objects in that bucket to see if it is already present. Provided that the hash codes are reasonably randomly distributed and the number of buckets is large enough, only a few comparisons should be necessary.


Figure 2-8: A hash table

If you want more control over the performance of the hash table, you can specify the initial bucket count. The bucket count gives the number of buckets that are used to collect objects with identical hash values. If too many elements are inserted into a hash table, the number of collisions increases and retrieval performance suffers.

If you know approximately how many elements will eventually be in the table, then you should set the initial bucket count to about 150 percent of the expected element count. Some researchers believe that it is a good idea to make the size of the hash table a prime number to prevent a clustering of keys. The evidence for this isn't conclusive, but it certainly can't hurt. For example, if you need to store about 100 entries, set the initial bucket size to 151.

Of course, you do not always know how many elements you need to store, or your initial guess may be too low. If the hash table gets too full, it needs to be rehashed. To rehash the table, a table with more buckets is created, all elements are inserted into the new table, and the original table is discarded. In the Java programming language, the load factor determines when a hash table is rehashed. For example, if the load factor is 0.75 (which is the default) and the hash table becomes more than 75 percent full, then the table is automatically rehashed, using twice as many buckets. For most applications, it is reasonable to leave the load factor at 0.75.

Hash tables can be used to implement several important data structures. The simplest among them is the set type. A set is a collection of elements without duplicates. The add method of a set first tries to find the object to be added and only adds it if it is not yet present.

The Java collections library supplies a HashSet class that implements a set based on a hash table. At the time of this writing, the default constructor HashSet constructs a hash table with 101 buckets and a load factor of 0.75. These values may change in future releases. If you at all care about these values, you should specify your own, with the constructors

HashSet(int initialCapacity)
HashSet(
   int initialCapacity, float loadFactor)

You add elements with the add method. The contains method is redefined to make a quick lookup to find if an element is already present in the set. It only checks the elements in one bucket and not all elements in the collection. The hash set iterator visits all buckets in turn. Since the hashing scatters the elements around in the table, they are visited in seemingly random order. You would only use a hash set if you don't care about the ordering of the elements in the collection.

The sample program at the end of this section (Example 2-2) reads words from System.in, adds them to a set and finally prints out all words in the set. For example, you can feed the program the text from Alice in Wonderland (which you can obtain from www.gutenberg.net) by launching it from a command shell as

java SetTest < alice30.txt

The program reads all words from the input and adds them to the hash set. It then iterates through the unique words in the set and finally prints out a count. (Alice in Wonderland has 5,909 unique words, including the copyright notice at the beginning.) The words appear in random order.

Example 2-2: SetTest.java

import java.util.*;
import java.io.*;

public class SetTest
{  public static void main(String[] args)
   {  Set words = new HashSet(59999); 
         // set to HashSet or TreeSet
      long totalTime = 0;

      try
      {  BufferedReader in = new 
            BufferedReader(
             new InputStreamReader(
                           System.in));
        Stringline;
         while ((line = in.readLine(
                          )) != null)
         {  StringTokenizer tokenizer = 
                   new StringTokenizer(
                                  line);
            while (
                 tokenizer.hasMoreTokens())
            { Stringword = 
                     tokenizer.nextToken();
               long callTime = 
                  System.currentTimeMillis();
               words.add(word);
               callTime = 
                System.currentTimeMillis() - 
                                    callTime;
               totalTime += callTime;
            }
         }
      }
      catch (IOException e)
      {  System.out.println("Error " + e);
      }

      Iterator iter = words.iterator();
      while (iter.hasNext())
         System.out.println(iter.next());
      System.out.println(words.size() 
         + " distinct words. " + totalTime 
                     + " milliseconds.");
   }
}

java.util.HashSet

  • HashSet()
    constructs an empty hash set.
  • HashSet(Collection elements)
    constructs a hash set and adds all elements from a collection.
    Parameters: elements the elements to add

  • HashSet(int initialCapacity)
    constructs an empty hash set with the specified capacity.
    Parameters: initialCapacity the initial number of buckets

  • HashSet(int initialCapacity, float loadFactor)
    constructs an empty hash set with the specified capacity and load factor.
    Parameters: initialCapacity the initial number of buckets
    loadFactor a number between 0.0 and 1.0 that determines at what percentage of fullness the hash table will be rehashed into a larger one

Hash functions

You can insert strings into a hash set because the String class has a hashCode method that computes a hash code for a string. A hash code is an integer that is somehow derived from the characters in the string. Table 2-1 lists a few examples of hash codes that result from the hashCode function of the String class.

Table 2-1: Hash codes resulting from the hashCode function

String  Hash code
Hello   140207504
Harry   140013338
Hacker  884756206

The hashCode method is defined in the Object class. Therefore, every object has a default hash code. That hash code is derived from the object's memory address. In general, the default hash function is not very useful because objects with identical contents may yield different hash codes. Consider this example.

String s = "Ok";
StringBuffer sb = new StringBuffer(s);
System.out.println(
  s.hashCode() + " " + sb.hashCode());
String t = "Ok";
StringBuffer tb = new StringBuffer(t);
System.out.println(
   t.hashCode() + " " + tb.hashCode());
 

Table 2-2 shows the result.

Table 2-2: Hash codes of objects with identical contents


"Ok"Stringhash code
3030
3030
"Ok" StringBuffer hash code
20526976
20527144

Note that the strings s and t have the same hash value because, for strings, the hash values are derived from their contents. The string buffers sb and tb have different hash values because no special hash function has been defined for the StringBuffer class and the default hash code function in the Object class derives the hash code from the object's memory address.

You should always define the hashCode method for objects that you insert into a hash table. This method should return an integer (which can be negative). The hash table code will later reduce the integer by dividing by the bucket count and taking the remainder. Just scramble up the hash codes of the data fields in some way that will give the hash codes for different objects a good chance of being widely scattered.

For example, suppose you have the class Item for inventory items. An item consists of a description string and a part number.

anItem = new Item("Toaster", 49954);

If you want to construct a hash set of items, you need to define a hash code.

For example,

class Item
{   . . .
   public int hashCode()
   {   return 13 * description.hashCode(
                                     ) + 
                         17 * partNumber;
   }
   . . .
   privateStringdescription;
   private int partNumber;
}

As a practical matter, if the part number uniquely identifies the item, you don't need to incorporate the hash code of the description.

Furthermore, you also need to make sure that the equals method is well defined. The Object class defines equals, but that method only tests whether or not two objects are identical. If you don't redefine equals, then every new object that you insert into the table will be considered a different object.

You need to redefine equals to check for equal contents.

class Item
{  . . .
     public boolean equals(Object other)
     {  if (other != null && getClass() == 
                        other.getClass())
        {  Item otherItem = (Item)other;
           return description.equals(
                     otherItem.description)
               && partNumber == 
                    otherItem.partNumber;
        }
        else
           return false;
     }
     . . .
        }
        

CAUTION: Your definitions of equals and hashCode must be compatible: if x.equals(y) is true, then x.hashCode() must be the same value as y.hashCode().


java.lang.Object

  • boolean equals(Object obj)
    compares two objects for equality; returns true if both objects are equal; false otherwise.
    Parameters: obj the object to compare with the first object (may be null, in which case the method should return false)
  • int hashCode()
    returns a hash code for this object. A hash code can be any integer, positive or negative. Equal objects need to return identical hash codes.

Tree Sets

TheTreeSetclass is similar to the hash set, with one added improvement. A tree set is a sorted collection. You insert elements into the collection in any order. When you iterate through the collection, the values are automatically presented in sorted order. For example, suppose you insert three strings and then visit all elements that you added.

TreeSet sorter = new TreeSet();
sorter.add("Bob");
sorter.add("Angela");
sorter.add("Carl");
Iterator iter = sorter.iterator();
while (iter.hasNext(
              )) System.println(
                    iter.next());

Then, the values are printed in sorted order: Angela Bob Carl. As the name of the class suggests, the sorting is accomplished by a tree data structure. (The current implementation uses a red-black tree. For a detailed description of red-black trees, see, for example, Introduction to Algorithms by Thomas Cormen, Charles Leiserson, and Ronald Rivest [The MIT Press 1990].) Every time an element is added to a tree, it is placed into its proper sorting position. Therefore, the iterator always visits the elements in sorted order.

Adding an element to a tree is slower than adding it to a hash table, but it is still much faster than adding it into the right place in an array or linked list. If the tree containsnelements, then an average of log2ncomparisons are required to find the correct position for the new element. For example, if the tree already contains 1,000 elements, then adding a new element requires about 10 comparisons.

Thus, adding elements into aTreeSetis somewhat slower than adding into a HashSet --see Table 2-3 for a comparison--but theTreeSetautomatically sorts the elements.

Table 2-3: Adding elements into hash and tree sets

Document Total number of words Number of distinct words HashSet TreeSet
Alice in Wonderland 28195 5909 5 sec 7 sec
The Count of Monte Cristo 466300 37545 75 sec 98 sec

java.util.TreeSet

  • TreeSet()
    constructs an empty tree set.
  • TreeSet(Collection elements)
    constructs a tree set and adds all elements from a collection.
    Parameters: elements the elements to add

Object comparison

How does theTreeSetknow how you want the elements sorted? By default, the tree set assumes that you insert elements that implement the Comparable interface. That interface defines a single method:

int compareTo(Object other)

The call a.compareTo(b) must return 0 if a and b are equal, a negative integer if a comes before b in the sort order, and a positive integer if a comes after b. The exact value does not matter; only its sign (>0, 0, or < 0) matters.

Several standard

Java platform classes implement the Comparable interface. One example is the String class. Its compareTo method compares strings in dictionary order (sometimes called lexicographic order).

If you insert your own objects, you need to define a sort order yourself by implementing the Comparable interface. There is no default implementation of compareTo in the Object class.

For example, here is how you can sort Item objects by part number.

 class Item implements Comparable
  {  public int compareTo(Object other)
      {   Item otherItem = (Item)other;
          return partNumber - 
                  otherItem.partNumber;
      }
      . . .
   }
   

Note that the explicit argument of the compareTo method has type Object, not Comparable. If the object is not of the correct type, then this compareTo method simply throws a ClassCastException. (The compareTo methods in the standard library behave in the same way when presented with illegal argument types.)

If you compare two positive integers, such as part numbers in our example, then you can simply return their difference--it will be negative if the first item should come before the second item, zero if the part numbers are identical, and positive otherwise.

CAUTION: This trick does not work if the integers can be negative. If x is a large positive integer and y is a large negative integer, then the difference x - y can overflow.

However, using the Comparable interface for defining the sort order has obvious limitations. You can only implement the interface once. But what can you do if you need to sort a bunch of items by part number in one collection and by description in another? Furthermore, what can you do if you need to sort objects of a class whose creator didn't bother to implement the Comparable interface?

In those situations, you tell the tree set to use a different comparison method, by passing a Comparator object into theTreeSetconstructor. The Comparator interface has a single method, with two explicit parameters:

int compare(Object a, Object b)

Just like the compareTo method, the compare method returns a negative integer if a comes before b, zero if they are identical, or a positive integer otherwise.

To sort items by their description, simply define a class that implements the Comparator interface:

class ItemComparator 
         implements Comparator
{  public int compare(Object a, Object b)
   {  Item itemA = (Item)a;
      Item itemB = (Item)b;
     StringdescrA = 
                  itemA.getDescription();
     StringdescrB = 
                  itemB.getDescription();
      return descrA.compareTo(descrB);
   }
}

You then pass an object of this class to the tree set constructor:

ItemComparator comp = new ItemComparator();
TreeSet sortByDescription = new TreeSet(
                                     comp);

If you construct a tree with a comparator, it uses this object whenever it needs to compare two elements.

Note that the item comparator has no data. It is just a holder for the comparison method. Such an object is sometimes called a function object.

Function objects are commonly defined "on the fly", as instances of anonymous inner classes:

TreeSet sortByDescription = new TreeSet(
   new Comparator()
   {  public int compare(Object a, Object b)
      {  Item itemA = (Item)a;
         Item itemB = (Item)b;
        StringdescrA = 
                itemA.getDescription();
        StringdescrB = 
                itemB.getDescription();
         return descrA.compareTo(descrB);
      }
   });
   

Using comparators, you can sort elements in any way you wish.

If you look back at Table 2-3, you may well wonder if you should always use a tree set instead of a hash set. After all, adding elements does not seem to take much longer, and the elements are automatically sorted. The answer depends on the data that you are collecting. If you don't need the data sorted, there is no reason to pay for the sorting overhead. More importantly, with some data it is very difficult to come up with a sort order. Suppose you collect a bunch of rectangles. How do you sort them? By area? You can have two different rectangles with different positions but the same area. If you sort by area, the second one is not inserted into the set. The sort order for a tree must be a total ordering: Any two elements must be comparable, and the comparison can only be zero if the elements are equal. There is such a sort order for rectangles (the lexicographic ordering on its coordinates), but it is unnatural and cumbersome to compute. In contrast, hash functions are usually easier to define. They only need to do a reasonably good job of scrambling the objects, whereas comparison functions must tell objects apart with complete precision.

The program in Example 2-3 builds two tree sets of Item objects. The first one is sorted by part number, the default sort order of item objects. The second set is sorted by description, using a custom comparator.

Example 2-3: TreeSetTest.java

import java.util.*;

public class TreeSetTest
{  public static void main(
                       String[] args)
   {  SortedSet parts = new TreeSet();
      parts.add(new Item("Toaster", 1234));
      parts.add(new Item("Widget", 4562));
      parts.add(new Item("Modem", 9912));
      System.out.println(parts);
      
      SortedSet sortByDescription = 
            new TreeSet(
         new Comparator()
         {  public int compare(
                         Object a, Object b)
            {  Item itemA = (Item)a;
               Item itemB = (Item)b;
              StringdescrA = 
                      itemA.getDescription();
              StringdescrB = 
                      itemB.getDescription();
             return descrA.compareTo(descrB);
            }
         });  
         
      sortByDescription.addAll(parts);      
      System.out.println(
                   sortByDescription);
   }
}

class Item implements Comparable
{  public Item(
    String aDescription, int aPartNumber)
   {  description = aDescription;
      partNumber = aPartNumber;
   }
   
   publicStringgetDescription()
   {  return description;
   }

   publicStringtoString()
   {  return "[description=" + 
                           description 
         + ", partNumber=" + 
                     partNumber + "]";
   }
   
   public boolean equals(Object other)
   {  if (getClass() == other.getClass())
      {  Item otherItem = (Item)other;
         return description.equals(
                  otherItem.description) 
            && partNumber == 
                    otherItem.partNumber;
      }
      else
         return false;
   }
         
   public int hashCode()
   {   return 13 * description.hashCode(
                    ) + 17 * partNumber;
   }

   public int compareTo(Object other)
   {   Item otherItem = (Item)other;
       return partNumber - 
               otherItem.partNumber;
   }

   private String description;
   private int partNumber;
}

java.lang.Comparable

  • int compareTo(Object other)
    compares this object with another object and returns a negative value if this comes before other, zero if they are considered identical in the sort order, and a positive value if this comes after other.
    Parameters: other the object to compare

java.util.Comparator

  • int compare(Object a, Object b)
    compares two objects and returns a negative value if a comes before b, zero if they are considered identical in the sort order, and a positive value if a comes after b.
    Parameters: a, b the objects to compare

java.util.SortedSet

  • Comparator comparator()
    returns the comparator used for sorting the elements, or null if the elements are compared with the compareTo method of the Comparable interface.
  • Object first()
  • Object last()
    return the smallest or largest element in the sorted set.

java.util.TreeSet

  • TreeSet(Comparator c)
    constructs a tree set and uses the specified comparator for sorting its elements. Parameters: c the comparator to use for sorting
  • TreeSet(SortedSet elements)
    constructs a tree set, adds all elements from a sorted set, and uses the same element comparator as the given sorted set. Parameters: elements the sorted set with the elements to add and the comparator to use

Maps

A set is a collection that lets you quickly find an existing element. However, to look up an element, you need to have an exact copy of the element to find. That isn't a very common lookup--usually, you have some key information, and you want to look up the associated element. The map data structure serves that purpose. A map stores key/value pairs. You can find a value if you provide the key. For example, you may store a table of employee records, where the keys are the employee IDs and the values are Employee objects.

The Java library supplies two general purpose implementations for maps: HashMap and TreeMap. A hash map hashes the keys, and a tree map uses a total ordering on the keys to organize them in a search tree. The hash or comparison function is applied only to the keys. The values associated with the keys are not hashed or compared.

Should you choose a hash map or a tree map? As with sets, hashing is a bit faster, and it is the preferred choice if you don't need to visit the keys in sorted order.

Here is how you set up a hash map for storing employees.

HashMap staff = new HashMap();
Employee harry = 
  new Employee("Harry Hacker");
staff.put("987-98-9996", harry);
. . .

BACK | NEXT