|
Tuning Garbage Collection with the 1.3.1 Java Virtual Machine
|
See also Performance Docs |
Introduction
The Java 2 Platform is increasingly used for large server applications such as web services.
These applications demand scalability, and directly benefit from large numbers
of threads, processors, sockets and memory. Yet 'big iron' performance
has a reputation as an art form, requiring special expertise beyond what
is needed for performance on smaller systems. Fortunately, the Java
Virtual Machine (JVM)* and Solaris operating
environment provide effective implementations
of threads, I/O and memory management. This document addresses a common
speed bump on the road to scalable high performance: poorly tuned garbage
collection (GC).
Amdahl observed that most workloads cannot be perfectly parallelized; some
portion is always sequential and does not benefit from parallelism.
This is also true for the Java 2 Platform. In particular, JVMs up to and including
version 1.3.1 do not have parallel garbage collection, so the impact of GC
on a multiprocessor system grows relative to an otherwise parallel
application.
The graph below models an ideal system that is perfectly scalable with
the exception of GC. The top line (red) is an application spending
only 1% of the time in GC on a uniprocessor; this translates into more than
20% loss in throughput at 32 processors. At 10%, not considered an outrageous
amount of time in GC in uniprocessor applications, more than 75% of throughput
is lost when scaling up.
This demonstrates that issues that appear lost in the noise when developing
on small systems may become principal bottlenecks when scaling up.
The silver lining is that small improvements in such a bottleneck can produce
large gains in performance. For a sufficiently large system it becomes
well worthwhile to tune garbage collection.
This document is written from the perspective of 1.3.1 JVM on the
Solaris (SPARC Platform Edition) operating environment,
because that platform provides the most scalable hardware/software Java 2
platform today. However, the descriptive text applies to other
supported platforms, including Linux, Microsoft Windows, and the
Solaris (Intel Architecture) operating environment, to the extent
that scalable hardware is available. Although command line options
are consistent across platforms, some platforms may have different defaults
than described here.
Generations
One of Java 2 Platform's great strengths is that it shields the substantial complexity
of memory allocation and garbage collection from the developer. However,
once GC has become the principal bottleneck, it becomes worth understanding
aspects of this hidden implementation. Garbage collectors make assumptions
about the way applications use objects, and these are reflected in tunable
parameters that can be adjusted for improved performance without sacrificing
the power of the abstraction.
An object is garbage when it can no longer be reached from any pointer
in the running program. The most straightforward garbage collection
algorithms simply iterate over every reachable object; any objects left
over are then known to be garbage. This approach takes time proportional
to the number of living objects, which is prohibitive for large applications
maintaining lots of living data.
The 1.3 JVM incorporates a number of different garbage collection algorithms
that are combined using generational collection. While naive
garbage collection examines every living object in the heap, generational
collection exploits several empirically observed properties of most applications
to avoid extra work.
The most important of these properties is infant mortality.
The blue area in the diagram below is a typical distribution for
the lifetimes of objects. The sharp peak at the left represents objects
that can be reclaimed shortly after being allocated. Iterator objects,
for example, are often alive for the duration of a single loop.
Some objects do live longer, and so the distribution stretches out to the
the right. For instance, there are typically some objects allocated
at initialization that live until the process exits. Between these
two extremes are objects that live for the duration of some intermediate
computation, seen here as the lump to the right of the infant mortality
peak. Some applications have very different looking distributions,
but a surprisingly large number possess this general shape. Efficient
collection is made possible by focusing on the fact that a majority of objects
die young.
To do this, memory is managed in generations: memory pools holding
objects of different ages. Garbage collection occurs in each generation
when it fills up; these collections are represented on the diagram above
with vertical bars. Objects are allocated in eden, and because
of infant mortality most objects die there. When Eden fills up it causes
a minor collection, in which some surviving objects are moved to
an older generation. When older generations need to be collected there
is a major collection that is often much slower because it involves
all living objects.
The diagram shows a well-tuned system in which most objects die before
they survive to the first garbage collection. The longer an object survives,
the more collections it will endure and the slower GC becomes. By arranging
for most objects to survive less than one collection, garbage collection
can be very efficient. This happy situation can be upset by applications
with unusual lifetime distributions, or by poorly sized generations that
cause collections to be too frequent.
The default garbage collection parameters were designed to be effective
for most small applications. They aren't optimal for many server applications.
This leads to the central tenet of this document:
|
If GC has become a bottleneck, you may wish to customize the generation
sizes. Check the verbose GC output, and then explore the sensitivity
of your individual performance metric to the GC parameters.
|
Types of collection
Each generation has an associated type of garbage collection that
can be configured to make different algorithmic time, space and pause
tradeoffs. In 1.3, the JVM implements three very different garbage
collectors:
- Copying (sometimes called scavenge): this collector
very efficiently moves objects between two or more generations. The
source generations are left empty, allowing remaining dead objects to be
reclaimed quickly. However, since it requires empty space to operate,
copying requires more footprint. In 1.3.1 copying collection is used
for all minor collections.
- Mark-compact: this collector allows generations to be collected
in place without reserving extra memory; however, compaction is significantly
slower than copying. In 1.3.1 mark-compact is used for major collections.
- Incremental (sometimes called train): this collector
is used only if -Xincgc is
passed on the command line. By careful bookkeeping, incremental GC
collects just a portion of the old generation at a time, trying to spread
the large pause of a major collection over many minor collections.
However, it is even slower than mark-compact when considering overall throughput.
Since copying is very fast, a tuning goal is to collect as many objects
as possible by copying rather than by compaction or incremental collection.
The default arrangement of generations looks something like this.
At initialization, a maximum address space is virtually reserved but not
allocated physical memory unless it is needed. The complete address
space reserved for object memory can be divided into the young and old generations.
The young generation consists of eden plus two survivor spaces
. Objects are initially allocated in eden. One survivor space
is empty at any time, and serves as the destination of the next copying
collection of any living objects in eden and the other survivor space.
Objects are copied between survivor spaces in this way until they age enough
to be tenured (copied to the old generation.)
(Other virtual machines, including the production JVM version 1.2 for
the Solaris operating environment,
used two equally sized spaces for copying rather than one large eden plus
two small spaces. This means the options for sizing the young generation
are not directly comparable; see the
Performance FAQ for an example.)
The old generation is collected in place by mark-compact. One portion
called the permanent generation is special because it holds all the
reflective data of the JVM itself, such as class and method objects.
Performance considerations
There are two primary measures of garbage collection performance.
Throughput is the percentage of total time not spent in garbage
collection, considered over long periods of time.
Throughput includes time spent in allocation (but tuning
for speed of allocation is generally not needed.)
Pauses are
the times when an application appears unresponsive because garbage collection
is going on.
Users have different requirements of garbage collection. For example,
some consider the right metric for a web server to be throughput, since
pauses during garbage collection may be tolerable, or simply obscured by
network latencies. But for an interactive graphical program, even
short pauses may upset the user experience.
Some users are sensitive to other considerations. Footprint
is the working set of a process, measured in pages and cache lines.
On systems with limited physical memory or many processes, footprint may
dictate scalability. Promptness is the time between when an
object becomes dead and when the memory becomes available, an important consideration
for distributed systems including RMI.
In general, a particular generation sizing chooses a trade-off between
these considerations. For example, a very large young
generation may maximize throughput, but does so at
the expense of footprint and promptness.
Pauses can be minimized by using a small young generation and incremental
collection, at the expense of throughput.
There is no one right way to size generations; the best choice is determined
by the way the application uses memory as well as user requirements.
For this reason the JVM's default GC choices may not be optimal, and may
be overridden by the user in the form of command line options below.
Measurement
Throughput and footprint are best measured using metrics particular to the
application. For example, throughput of a web server may be tested
using a client load generator, while footprint of the server might be measured
on the Solaris operating environment using the pmap
command. On the other hand, pauses due to GC are easily estimated
by inspecting the diagnostic output of the JVM itself.
The command line argument -verbose:gc
prints information at every collection. For example, here is output
from a large server application:
[GC 325407K->83000K(776768K),
0.2300771 secs]
[GC 325816K->83372K(776768K), 0.2454258 secs]
[Full GC 267628K->83769K(776768K),
1.8479984 secs]
Here we see two minor collections and one major one. The numbers
before and after the arrow indicate the combined size of live objects before
and after the GC. (After minor collections the count includes objects
that aren't necessarily alive but can't be reclaimed, either because they
are directly alive, or because they are within or referenced from the
old generation.) The number in parenthesis is the total available space,
which is the total heap minus one of the survivor spaces.
Sizing the generations
A number of parameters affect generation size. This diagram
illustrates the ones most important to tuning the 1.3.1 JVM.
Many parameters are actually ratios x:y, and these are
depicted with black (representing x) and grey (representing
y) size bars: 
Total heap
Since collections occur when generations fill up, throughput is inversely
proprotional to the amount of memory available. Total available memory
is the most important knob affecting GC performance.
By default, the JVM grows or shrinks the heap at each collection to try to
keep the proportion of free space to living objects at each collection within
a specific range. This target range is set as a percentage by the parameters
-XX:MinHeapFreeRatio=<minimum> and
-XX:MaxHeapFreeRatio=<maximum>, and the total size is bounded below by
-Xms and above by -Xmx
. The default parameters for the Solaris (SPARC Platform Edition)
operating environment are shown in this table:
-XX:MinFreeHeapRatio=
|
40
|
-XX:MaxHeapFreeRatio=
|
70
|
-Xms
|
3584k
|
-Xmx
|
64m
|
Large server apps often experience two problems with these defaults.
One is slow startup, because the initial heap is small and must be resized
over many major collections. A more pressing problem is that the default
maximum heap size is unreasonably small for most server applications.
The rules of thumb for server applications are:
Unless
you have problems with pauses, try granting as much memory as possible to
the JVM. The default size (64MB) is often too small.
Setting -Xms and
-Xmx to the same value increases predictability by removing the
most important sizing decision from the JVM. On the other hand, the
JVM can't compensate if you make a poor choice.
Be sure to increase the memory as you increase the number of processors,
since allocation can be parallelized, but GC is not parallel.
|
The young generation
The second most influential knob is the proportion of the heap dedicated
to the young generation. The bigger the young generation, the less
often minor collections occur. However, for a bounded heap size a larger
young generation implies a smaller old generation, which will increase the
frequency of major collections. The optimal choice depends on the lifetime
distribution of the application.
By default, the young generation size is controlled by NewRatio.
For example, setting -XX:NewRatio=3
means that the ratio between the young and old generation is 1:3; in other
words, the combined size of eden and the survivor spaces will be one fourth
of the heap.
The parameters NewSize and
MaxNewSize bound the young
generation size below and above. Setting these equal to one another
fixes the young generation, just as setting
-Xms and -Xmx equal
fixes the total heap size. This is useful for tuning the young generation
at a finer granularity than the integral multiples allowed by
NewRatio.
Because the young generation uses copying collection, enough free memory
must be reserved in the old generation to ensure that a minor collection
can complete. In the worst case, this reserved memory is equal to the
size of eden plus the objects in non-empty survivor space. When there isn't enough memory
available in the old generation for this worst case, a major collection will
occur instead. This policy is fine for small applications, because
the memory reserved in the old generation is typically only virtually committed
but not actually used. But for applications needing the largest possible
heap, an eden bigger than half the virtually committed size of the heap
is useless: only major collections would occur.
If desired, the parameter SurvivorRatio
can be used to tune the size of the survivor spaces, but this is often
not as important to performance. For example,
-XX:SurvivorRatio=6 sets the ratio between each survivor space
and eden to be 1:6; in other words, each survivor space will be one eighth
of the young generation (not one seventh, because there are two survivor
spaces).
If survivor spaces are too small, copying collection overflows directly
into the old generation. If survivor spaces are too large, they will
be uselessly empty. At each garbage collection the JVM chooses a threshold
number of times an object can be copied before it is tenured. This
threshold is chosen to keep the survivors half full. (For the intrepid,
a 1.3.1 option -XX:+PrintTenuringDistribution
can be used to show this threshold and the ages of objects in the new
generation. It is also useful for observing the lifetime distribution
of an application.)
Here are the default values for the Solaris (SPARC Platform Edition)
operating environment:
NewRatio
|
2 (client JVM:
8)
|
NewSize
|
2172k
|
MaxNewSize
|
32m
|
SurvivorRatio
|
25
|
The rules of thumb for server applications are:
First
decide the total amount of memory you can afford to give the JVM.
Then graph your own performance metric against young generation sizes to
find the best setting.
Unless you find problems with excessive major collection or pause times, grant plenty
of memory to the young generation. The default
MaxNewSize (32MB) is generally too small.
Increasing the young generation becomes counterproductive at half the total
heap or less.
Be sure to increase the young generation as you increase the number of
processors, since allocation can be parallelized, but GC is not parallel.
|
Other considerations
For most applications the permanent generation is not relevant to GC performance.
However, some applications dynamically generate and load many classes.
For instance, some implementations of JSPs do this. If necessary,
the maximum permanent generation size can be increased with
MaxPermSize.
Some applications interact with garbage collection by using finalization
and weak/soft/phantom references. These features can create performance
artifacts at a Java-programming-language level; an example is relying on finalization to close
file descriptors, which makes an external resource (descriptors) dependent
on GC promptness. Relying on GC to manage resources other than memory
is almost always a bad idea.
Another way apps can interact with garbage collection is by invoking GCs
explicitly, such as through the
System.gc() call. These calls force major collection,
and inhibit scalability on large systems. The performance impact of
explicit GCs can be measured using the unsupported flag
-XX:+DisableExplicitGC.
One of the most commonly encountered uses of explicit GC occurs with RMI's
distributed garbage collection (DGC). Applications using RMI refer
to objects in other JVMs. Garbage can't be collected in these distributed
applications without occasional local collection, so RMI forces periodic
full collection. The frequency of these collections can be controlled
with properties. For example,
java -Dsun.rmi.dgc.client.gcInterval=3600000
-Dsun.rmi.dgc.server.gcInterval=3600000 ...
specifies explicit collection once per hour instead of the default rate
of once per minute. However, this may also cause some objects to take much
longer to be reclaimed. These properties can be set as high as Long.MAX_VALUE
to make the time between explicit collections effectively infinite, if there
is no desire for an upper bound on the timeliness of DGC activity.
The Solaris 8 operating environment supports an alternate version of libthread that binds threads to
LWPs directly; this may help avoid starvation of the finalization thread.
To try this, set the environment variable
LD_LIBRARY_PATH to include
/usr/lib/lwp before launching the JVM.
Soft references are cleared less aggressively in the server JVM than the
client. The rate of clearing can be slowed by increasing a parameter
in this way: -XX:SoftRefLRUPolicyMSPerMB=10000. The default is value 1000, or one second per megabyte.
For large dedicated systems, there are other
special options available to boost performance.
Conclusion
Garbage collection can become a bottleneck in highly parallel systems.
By understanding how GC works, it is possible to use a variety of command
line options to minimize that impact.
The demands of large servers are being met with larger hardware configurations
that ever before. For these systems, the 1.4 line of JVMs will provide
additional solutions, including a 64 bit address space for even larger generations
and concurrent collection to hide the pauses associated with major collection.
As used on the web site, the terms "Java Virtual
Machine and "JVM" mean a virtual machine for the Java platform.