Java SE 6 includes several new features and enhancements to improve performance in many areas of the platform. Improvements include synchronization performance optimizations, compiler performance optimizations, the new Parallel Compaction Collector, better ergonomics for the Concurrent Low Pause Collector, and improved application start-up performance.
2.1 Runtime performance optimizations
2.1.1 Biased locking
Biased Locking is a class of optimizations that improves uncontended synchronization
performance by eliminating atomic operations associated with the
Java language’s synchronization primitives. These
optimizations rely on the property that not only are most monitors
uncontended, they are locked by at most one thread during their
lifetime.
An object is "biased"
toward the thread which first acquires its monitor via a
monitorenter bytecode or synchronized method invocation;
subsequent monitor-related operations can be performed by that
thread without using atomic operations resulting in much better
performance, particularly on multiprocessor machines.
Locking
attempts by threads other than the one toward which the object is
"biased" will cause a relatively expensive operation
whereby the bias is revoked. The benefit of the elimination of
atomic operations must exceed the penalty of revocation for this
optimization to be profitable.
Applications with
substantial amounts of uncontended synchronization may attain
significant speedups while others with certain patterns of locking
may see slowdowns.
Biased Locking is enabled by default in
Java SE 6 and later. To disable Biased Locking, add
-XX:-UseBiasedLocking to the command line.
For more on Biased Locking, please refer to the ACM OOPSLA 2006 paper by Kenneth Russell and David Detlefs: "Eliminating Synchronization-Related Atomic Operations with Biased Locking and Bulk Rebiasing".
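As an illustration of the kind of code this optimization targets, the sketch below has a single thread repeatedly entering a monitor that no other thread ever touches. The class and method names are hypothetical and the code only shows the locking pattern; whether a given VM actually biases the lock depends on its version and flags.

```java
// Hypothetical example: a monitor that is only ever locked by one thread.
final class UncontendedCounter {
    private long value;

    // Each call enters and exits the monitor of 'this'.
    synchronized void increment() {
        value++;
    }

    synchronized long get() {
        return value;
    }

    // Runs 'iterations' uncontended increments on the calling thread.
    static long run(int iterations) {
        UncontendedCounter counter = new UncontendedCounter();
        for (int i = 0; i < iterations; i++) {
            counter.increment();
        }
        return counter.get();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long result = run(10_000_000);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("count=" + result + " elapsed=" + elapsedMs + " ms");
    }
}
```

On a Java SE 6 VM, comparing a run with default settings against one started with -XX:-UseBiasedLocking gives a rough impression of the effect, although other JIT optimizations and timer noise can easily dominate a micro-benchmark this small.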
2.1.2 Lock coarsening
In some locking patterns, a lock is released and then
reacquired within a piece of code where no observable operations occur in between.
The lock coarsening optimization implemented in HotSpot
eliminates the unlock and relock operations in those situations. It reduces the amount
of synchronization work by enlarging an existing synchronized region. Doing this around a loop
could cause a lock to be held for long periods of time, so the technique is only
used on non-looping control flow.
This feature is on by
default. To disable it, please add the following option to the
command line: -XX:-EliminateLocks
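The pattern can be pictured with a small source-level sketch. The class below is hypothetical: updateSeparately() shows two adjacent synchronized regions on the same monitor, and updateCoarsened() shows what the enlarged region is conceptually equivalent to. The real transformation happens inside the JIT compiler on compiled code, not on source.

```java
// Hypothetical example of a lock coarsening candidate.
final class CoarseningCandidate {
    private final Object lock = new Object();
    private int a;
    private int b;

    // As written: the lock is released and immediately reacquired,
    // with no observable work between the two regions.
    void updateSeparately() {
        synchronized (lock) {
            a++;
        }
        synchronized (lock) {
            b++;
        }
    }

    // Conceptual result of coarsening: one enlarged synchronized region.
    void updateCoarsened() {
        synchronized (lock) {
            a++;
            b++;
        }
    }

    int sum() {
        synchronized (lock) {
            return a + b;
        }
    }
}
```

Both methods leave the object in the same state; the second simply performs one lock and one unlock instead of two of each.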
2.1.3 Adaptive spinning
Adaptive spinning is an
optimization technique where a two-phase spin-then-block strategy
is used by threads attempting a contended synchronized enter
operation. This technique enables threads to avoid undesirable
effects that impact performance such as context switching and
repopulation of Translation Lookaside Buffers (TLBs). It is
"adaptive" because the duration of the spin is
determined by policy decisions based on factors such as the rate of success
and/or failure of recent spin attempts on the same monitor and
the state of the current lock owner.
For more on Adaptive Spinning, please refer to the presentation by Dave Dice:
"Synchronization in Java SE 6".
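The two-phase idea can be sketched at user level. The class below is a hypothetical illustration built on java.util.concurrent.locks.ReentrantLock: it spins on tryLock() for a bounded number of attempts, then falls back to a blocking lock(), and adjusts the spin budget based on whether the last spin succeeded. It is not HotSpot's monitor implementation, and the constants and the doubling/halving policy are assumptions chosen for brevity.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical user-level lock; it mimics the idea, not HotSpot's monitor code.
final class SpinThenBlockLock {
    private static final int MIN_SPINS = 16;
    private static final int MAX_SPINS = 4096;

    private final ReentrantLock delegate = new ReentrantLock();
    // Spin budget for the next attempt. Updates may race; it is only a hint.
    private volatile int spinLimit = 256;

    void lock() {
        int limit = spinLimit;
        // Phase 1: spin, retrying a non-blocking acquire.
        for (int i = 0; i < limit; i++) {
            if (delegate.tryLock()) {
                // Spinning succeeded, so allow a longer spin next time.
                spinLimit = Math.min(MAX_SPINS, limit * 2);
                return;
            }
        }
        // Phase 2: spinning failed, so shorten future spins and block.
        spinLimit = Math.max(MIN_SPINS, limit / 2);
        delegate.lock();
    }

    void unlock() {
        delegate.unlock();
    }

    // Small contention demo: returns the final counter value.
    static int demo(int threads, int perThread) {
        SpinThenBlockLock lock = new SpinThenBlockLock();
        int[] counter = new int[1];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    lock.lock();
                    try {
                        counter[0]++;
                    } finally {
                        lock.unlock();
                    }
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter[0];
    }
}
```

The VM's policy also considers the state of the current lock owner, which this sketch does not model.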
2.1.4 Support for large page heap on x86 and amd64 platforms
Java SE 6 supports
large page heaps on x86 and amd64 platforms. Large page heaps
help the operating system avoid costly Translation Lookaside
Buffer (TLB) misses so that memory-intensive applications
perform better (a single TLB entry can represent a larger memory
range).
Please note that large page
memory can sometimes negatively impact system performance. For
example, when a large amount of memory is pinned by an
application, it may create a shortage of regular memory and cause
excessive paging in other applications and slow down the entire
system. Also note that on a system that has been up for a long
time, excessive fragmentation can make it impossible to reserve
enough large page memory. When this happens, the OS
may revert to using regular pages. This effect can be minimized
by setting -Xms equal to -Xmx, -XX:PermSize equal to -XX:MaxPermSize, and
-XX:InitialCodeCacheSize equal to -XX:ReservedCodeCacheSize.
Another possible drawback of large pages is that the default sizes of the perm gen
and code cache might be larger as a result of using a large page; this is particularly
noticeable with page sizes that are larger than the default sizes for these
memory areas.
Support for large pages is
enabled by default on Solaris. It is off by default on Windows and
Linux. To enable this feature, add -XX:+UseLargePages to the command
line. Note that operating system configuration
changes may be required to enable large pages. For more information, please
refer to the documentation
on Java Support for Large Memory Pages on Sun Developer Network.
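A quick way to confirm that a running VM was started with the flag is to inspect its input arguments through the standard java.lang.management API. The helper below is a hypothetical sketch: it only detects an explicit -XX:+UseLargePages or -XX:-UseLargePages on the command line (assuming the last occurrence takes effect), and it says nothing about platform defaults or about whether the operating system actually granted large pages.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper: checks the JVM arguments for the large page flag.
final class LargePageFlagCheck {
    // Returns true if the arguments explicitly enable large pages.
    // Assumption: when the flag appears more than once, the last one wins.
    static boolean requestsLargePages(List<String> jvmArgs) {
        boolean enabled = false;
        for (String arg : jvmArgs) {
            if (arg.equals("-XX:+UseLargePages")) {
                enabled = true;
            } else if (arg.equals("-XX:-UseLargePages")) {
                enabled = false;
            }
        }
        return enabled;
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Large pages requested: " + requestsLargePages(jvmArgs));
    }
}
```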
2.1.5 Array Copy Performance Improvements
The System.arraycopy() method was further enhanced in Java SE 6. Hand-coded
assembly stubs are now used for each type size when no overlap
occurs.
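For reference, the two situations look like this at the call site. The class below is a hypothetical example: copyOf() copies between two distinct arrays, so source and destination cannot overlap, while shiftRight() copies within one array, so the ranges do overlap. Which internal stub the VM picks for each call is an implementation detail not shown here.

```java
// Hypothetical example of non-overlapping and overlapping array copies.
final class ArrayCopyExample {
    // Copies 'src' into a fresh array; the two ranges never overlap.
    static int[] copyOf(int[] src) {
        int[] dst = new int[src.length];
        System.arraycopy(src, 0, dst, 0, src.length);
        return dst;
    }

    // Shifts elements one slot to the right within the same array.
    // System.arraycopy handles the overlap as if it copied via a temporary array.
    static void shiftRight(int[] a) {
        if (a.length > 1) {
            System.arraycopy(a, 0, a, 1, a.length - 1);
        }
    }
}
```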
2.1.6 Background Compilation in HotSpot™ Client Compiler
Prior to Java SE 6, the
HotSpot Client compiler did not compile Java methods in the
background by default. As a consequence, Hyperthreaded or
Multi-processing systems couldn't take advantage of spare CPU
cycles to optimize Java code execution speed. Background
compilation is now enabled in the Java SE 6 HotSpot client
compiler.
2.1.7 New Linear Scan Register Allocation Algorithm for the HotSpot™ Client Compiler
The HotSpot client
compiler features a new linear scan register allocation algorithm
that relies on static single assignment (SSA) form. This has the
added advantage of providing a simplified data flow analysis and
shorter live intervals, which yield a better tradeoff between
compilation time and program runtime. This new algorithm has
provided performance improvements of about 10% on many internal
and industry-standard benchmarks.
For more information on
this new feature, please refer to the following paper: "Linear
Scan Register Allocation for the Java HotSpot™ Client Compiler".
2.2 Garbage Collection
2.2.1 Parallel Compaction Collector
Parallel compaction is a
feature that enables the parallel collector to perform major
collections in parallel resulting in lower garbage collection
overhead and better application performance particularly for
applications with large heaps. It is best suited to platforms with
two or more processors or hardware threads.
Prior to
Java SE 6, while the young generation was collected in parallel,
major collections were performed using a single thread. For applications with frequent
major collections, this adversely affected scalability.
Parallel compaction is used by default in JDK 6.
In JDK 5 update 6 and later, it can be enabled by adding the
option -XX:+UseParallelOldGC to the command line.
Please note that parallel
compaction is not available in combination with the concurrent
mark sweep collector; it can only be used with the parallel young
generation collector (-XX:+UseParallelGC). The documents referenced below provide more
information on the available collectors and recommendations for
their use.
For more on the Parallel Compaction Collector,
please refer to the Java
SE 6 release notes. For more information on garbage collection
in general, the HotSpot memory
management whitepaper describes the various collectors
available in HotSpot and includes recommendations on when to use
parallel compaction as well as a high-level description of the
algorithm.
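To see which collectors a particular VM invocation ended up with, the standard java.lang.management API can list them at run time. The class below is a hypothetical sketch; the collector names it prints are chosen by the VM and vary between versions and flag combinations, so no specific names are assumed.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports the collectors registered in this VM.
final class ActiveCollectors {
    // Returns one description per registered collector.
    static List<String> describe() {
        List<String> result = new ArrayList<String>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            result.add(gc.getName() + " (collections so far: " + gc.getCollectionCount() + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        for (String line : describe()) {
            System.out.println(line);
        }
    }
}
```

Running it once with -XX:+UseParallelGC and once with -XX:+UseParallelOldGC is one way to check what each option selects on a given VM.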
2.2.2 Concurrent Low Pause Collector: Concurrent Mark Sweep Collector Enhancements
The Concurrent Mark
Sweep Collector has been enhanced to provide concurrent collection
for the System.gc() and Runtime.getRuntime().gc() methods.
Prior to Java SE 6, these methods stopped all
application threads in order to collect the entire heap, which
sometimes resulted in lengthy pause times in applications with
large heaps. In line with the goals of the Concurrent Mark Sweep
Collector, this new feature enables the collector to keep
pauses as short as possible during full heap collection.
To
enable this feature, add the option
-XX:+ExplicitGCInvokesConcurrent to the Java command line.
The concurrent marking task
in the CMS collector is now performed in parallel on platforms
with multiple processors. This significantly reduces the
duration of the concurrent marking cycle and enables the collector
to better support applications with larger numbers of threads and
high object allocation rates, particularly on large multiprocessor
machines.
For more on these new features, please refer to
the Java
SE 6 release notes.
2.3 Ergonomics in the 6.0 Java Virtual Machine
In Java SE 5, platform-dependent
default selections for the garbage collector, heap size,
and runtime compiler were introduced to better match the needs of
different types of applications while requiring less command-line
tuning. New tuning flags were also introduced to allow users to
specify a desired behavior which in turn enabled the garbage
collector to dynamically tune the size of the heap to meet the
specified behavior. In Java SE 6, the default selections have been
further enhanced to improve application runtime performance and
garbage collector efficiency.
The chart below compares
out-of-the-box SPECjbb2005™ performance between Java SE 5
and Java SE 6 Update 2. This test was conducted on a Sun
Fire V890 with 24 x 1.5 GHz UltraSPARC CPUs and 64 GB RAM
running Solaris 10:
![[specjbb.png]](/performance/reference/whitepapers/charts/jdk5_vs_6_whitepaper_data_jbb2005.jpg)
In each case the benchmarks
were run without any performance flags. Please see the SPECjbb
2005 Benchmark Disclosure.
We also compared I/O
performance between Java SE 5 and Java SE 6 Update 2. This test
was conducted on a Sun
Fire V890 with 24 x 1.5 GHz UltraSPARC CPUs and 64 GB RAM
running Solaris 10:
![[specjbb.png]](/performance/reference/whitepapers/charts/jdk5_vs_6_whitepaper_data_io.jpg)
In each case the
benchmarks were run without any performance flags.
We
compared VolanoMark™ 2.5 performance between Java SE 5 and
Java SE 6. VolanoMark is a pure Java benchmark that measures both
(a) raw server performance and (b) server network scalability
performance. In this benchmark, the client side simulates up to
4,000 concurrent socket connections. Only those VMs that
successfully scale up to 4,000 connections pass the test. In both
the raw performance and network scalability tests, the higher the
score, the better the result.
This test was
conducted on a Sun
Fire V890 with 24 x 1.5 GHz UltraSPARC CPUs and 64 GB RAM
running Solaris 10:
![[volanomark.png]](/performance/reference/whitepapers/charts/jdk5_vs_6_whitepaper_data_volanomark.jpg)
In each
case we ran the benchmark in loopback mode without any performance
flags. The result shown is based upon relative throughput
(messages per second with 400 loopback connections).
The full Java version for Java SE 5 is:
java version "1.5.0"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0-b64)
Java HotSpot(TM) Client VM (build 1.5.0-b64, mixed mode)
The full Java version for Java SE 6 is:
java version "1.6.0_02"
Java(TM) SE Runtime Environment (build 1.6.0_02-b05)
Java HotSpot(TM) Client VM (build 1.6.0_02-b05, mixed mode)
Please see the
VolanoMark™ 2.5 Benchmark Disclosure.
Some other improvements in Java SE 6 include:
On
server-class machines, a specified maximum pause time goal of
less than or equal to 1 second will enable the Concurrent Mark
Sweep Collector.
The
garbage collector is allowed to move the boundary between the
tenured generation and the young generation as needed (within
prescribed limits) to better achieve performance goals. This mechanism
is off by default; to activate it, add -XX:+UseAdaptiveGCBoundary to the command line.
Promotion
failure handling is turned on by default for the serial
(-XX:+UseSerialGC) and Parallel Young Generation (-XX:+UseParNewGC)
collectors. This feature allows the collector to start a minor
collection and then back out of it if there is not enough space
in the tenured generation to promote all the objects that need to
be promoted.
An
alternative order for copying objects from the young to the
tenured generation in the parallel scavenge collector has been
implemented. The intent of this feature is to decrease cache
misses for objects accessed in the tenured generation. This
feature is on by default. To disable it, add
-XX:-UseDepthFirstScavengeOrder to the command line.
The
default young generation size has been increased to 1MB on x86
platforms.
The
Concurrent Mark Sweep Collector's default Young Generation size
has been increased.
The
minimum young generation size was increased from 4MB to 16MB.
The
proportion of the overall heap used for the young generation was
increased from 1/15 to 1/7.
The CMS collector now uses the survivor spaces by default,
and their default size was increased.
The
primary effect of these changes is to improve application
performance by reducing garbage collection overhead. However,
because the default young generation size is larger, applications
may also see larger young generation pause times and a larger
memory footprint.
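To give a feel for the proportion change listed above, the arithmetic below compares 1/15 and 1/7 of a heap. The heap size is a hypothetical value picked so that both divisions come out even, and the calculation covers only the fraction itself; the VM's actual sizing takes more inputs into account (minimum sizes and command-line flags, for example), so the numbers are illustrative.

```java
// Illustrative arithmetic only; real young generation sizing uses more inputs.
final class YoungGenShare {
    // Size in MB of a young generation that takes 1/divisor of the heap.
    static long shareMb(long heapMb, int divisor) {
        return heapMb / divisor;
    }

    public static void main(String[] args) {
        long heapMb = 1050; // hypothetical heap size
        System.out.println("1/15 of heap: " + shareMb(heapMb, 15) + " MB");
        System.out.println("1/7 of heap:  " + shareMb(heapMb, 7) + " MB");
    }
}
```

For the 1050 MB heap assumed here, the two proportions come to 70 MB and 150 MB, which is roughly a doubling of the young generation and consistent with the larger pause times and footprint mentioned above.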
2.4 Client-side Performance Features and Improvements
2.4.1 New class list for Class Data Sharing
To reduce application
startup time and footprint, Java SE 5.0 introduced a feature
called "class data sharing" (CDS). On 32-bit
platforms, this mechanism works as follows: the Sun-provided
installer loads a set of classes from the system jar file (rt.jar, the jar file
containing the entire Java class library) into a
private internal representation, and dumps that representation to
a file, called a "shared archive". On subsequent JVM
invocations, the shared archive is memory-mapped in, saving the
cost of loading those classes and allowing much of the Java
Virtual Machine's metadata for these classes to be shared among
multiple JVM processes.
In Java SE 6.0, the list of classes
in the "shared archive" has been updated to better
reflect the changes to the system jar file.
2.4.2 Improvements to the boot class loader
The Java Virtual
Machine's boot and extension class loaders have been enhanced to
improve the cold-start time of Java applications. Prior to Java SE
6, opening the system jar file caused the Java Virtual Machine to
read a one-megabyte ZIP index file that translated into a lot of
disk seek activity when the file was not in the disk cache. With
"class data sharing" enabled, the Java Virtual Machine
is now provided with a "meta-index" file (located in
jre/lib) that contains high-level information about which packages
(or package prefixes) are contained in which jar files.
This
helps the JVM avoid opening all of the jar files on the boot and
extension class paths when a Java application class is loaded.
Check bug 6278968 for more details.
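The benefit of such an index can be sketched with a small lookup structure. The class below is hypothetical: it maps jar names to the package prefixes they are known to contain and returns only the jars that could hold a given class. It does not reproduce the actual meta-index file format or the VM's lookup code, and the jar names and prefixes used with it are illustrative.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a package-prefix index over jar files.
final class PackagePrefixIndex {
    // Jar name -> package prefixes it contains, kept in insertion order.
    private final Map<String, List<String>> prefixesByJar =
            new LinkedHashMap<String, List<String>>();

    void add(String jar, String... prefixes) {
        List<String> list = new ArrayList<String>();
        for (String prefix : prefixes) {
            list.add(prefix);
        }
        prefixesByJar.put(jar, list);
    }

    // Returns the jars that may contain the class; all others can stay unopened.
    List<String> candidateJars(String className) {
        String resource = className.replace('.', '/');
        List<String> result = new ArrayList<String>();
        for (Map.Entry<String, List<String>> entry : prefixesByJar.entrySet()) {
            for (String prefix : entry.getValue()) {
                if (resource.startsWith(prefix)) {
                    result.add(entry.getKey());
                    break;
                }
            }
        }
        return result;
    }
}
```

With an index like this, a lookup for an application class whose package matches no prefix returns an empty list, so none of the indexed jars needs to be opened.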
Below we show a chart
comparing application start-up time performance between Java SE 5
and Java SE 6 Update 2. This test was conducted on an Intel Core 2
Duo 2.66GHz desktop machine with 1GB of memory:
![[specjbb.png]](/performance/reference/whitepapers/charts/jdk5_vs_6_whitepaper_data_startup.jpg)
The application start-up
comparison above shows relative performance (smaller is better)
and in each case the benchmarks were run without any performance
flags.
We also compared memory
footprint size required between Java SE 5 and Java SE 6 Update 2.
This test was conducted on
an Intel Core 2 Duo 2.66GHz desktop
machine with 1GB of memory:
![[specjbb.png]](/performance/reference/whitepapers/charts/jdk5_vs_6_whitepaper_data_footprint.jpg)
The footprint comparison
above shows relative performance (smaller is better) and in each
case the benchmarks were run without any performance flags.
Despite the addition of
many new features, the Java Virtual Machine's core memory usage
has been pared down to make the actual memory impact on your
system even lower than with Java SE 5.
2.4.3 Splash Screen Functionality
Java SE 6 allows an application to show a splash screen before the virtual
machine starts. The Java application launcher can now
decode an image and display it in a simple undecorated window.
2.4.4 Swing's true double buffering
Swing's true double buffering has now been enabled. Swing used to provide double buffering on a
per-application basis; it now provides it on a per-window basis, and native
expose events are handled by copying directly from the double buffer. This
significantly improves Swing performance, especially on remote servers.
Please
see Scott Violet's blog for full details.
2.4.5 Improving rendering on Windows systems
The UxTheme API, which allows standard Look & Feel rendering of Windows controls on Microsoft Windows systems, has been adopted
to improve the fidelity of Swing's system Look & Feels.