Notes: Java Performance: The Definitive Guide

Chapter 2: Testing Approaches
Microbenchmarks: measure the execution time of small code snippets.
Be aware of compiler optimisation: the JIT can eliminate code whose result is never used.
Be aware of the cold VM: measure only after a warm-up period.
Be aware of thread contention.
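A minimal hand-rolled sketch of the warnings above. The class name, the Fibonacci workload and the iteration counts are all made up for illustration. It is single-threaded, so it sidesteps contention rather than measuring it. A dedicated harness such as JMH handles these pitfalls more rigorously.

```java
import java.util.concurrent.TimeUnit;

class WarmupBenchmark {
    // Volatile sink: storing each result here keeps the JIT from discarding the work as dead code.
    static volatile long sink;

    // Hypothetical workload under test: iterative Fibonacci.
    static long fib(int n) {
        long a = 0, b = 1;
        for (int i = 0; i < n; i++) {
            long t = a + b;
            a = b;
            b = t;
        }
        return a;
    }

    // Runs the workload `iterations` times and returns the elapsed nanoseconds.
    static long timeNanos(int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink = fib(30 + (i & 7)); // varying input makes constant folding less likely
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        timeNanos(200_000);                // warm-up pass: result discarded
        long elapsed = timeNanos(200_000); // measured pass, after the code has had a chance to compile
        System.out.println("measured run: " + TimeUnit.NANOSECONDS.toMicros(elapsed) + " us");
    }
}
```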

Macrobenchmarks: measure the whole system in conjunction with any external resources.
Measure in the production environment: full-system testing with multiple JVMs.
Pro: the real bottleneck can be identified and optimised first.

Mesobenchmark: sits between micro and macro, e.g. benchmarking the response time of rendering a JSP.

Metrics:
Elapsed Time
Throughput
Response Time
* statistical significance
* Test early; Test often.
* Measure; Automate; On target system.
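For the statistical significance point, one common approach is a t-test on two sets of measurements. The sketch below computes only Welch's t statistic. Turning it into a p-value needs a t-distribution table or a statistics library. The class name and the sample timings are invented for illustration.

```java
class WelchT {
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // Unbiased sample variance (divides by n - 1).
    static double variance(double[] xs) {
        double m = mean(xs), ss = 0;
        for (double x : xs) ss += (x - m) * (x - m);
        return ss / (xs.length - 1);
    }

    // Welch's t statistic: difference of means divided by the combined standard error.
    static double tStatistic(double[] a, double[] b) {
        double se = Math.sqrt(variance(a) / a.length + variance(b) / b.length);
        return (mean(a) - mean(b)) / se;
    }

    public static void main(String[] args) {
        double[] baseline = {1.02, 0.98, 1.05, 1.01, 0.99};  // made-up timings in seconds
        double[] candidate = {0.95, 0.97, 0.93, 0.96, 0.98}; // made-up timings in seconds
        System.out.println("t = " + tStatistic(baseline, candidate));
    }
}
```

A large absolute t value suggests the difference between the two runs is unlikely to be noise; a small one means more runs are needed before drawing a conclusion.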

A load generator: Faban.

Chapter 3: Java Performance Toolbox
Tools from the OS: sar, vmstat, iostat (Linux and Solaris); prstat (Solaris)

CPU
Check CPU time first. If CPU usage is low, find out why: the goal is to drive CPU usage up.

Disk usage
iostat
Swapping moves pages of data from RAM to disk and vice versa. This results in bad performance, especially for Java, where a swapped-out heap makes garbage collection very slow.
vmstat reports si and so, for swap-in and swap-out respectively.
Common causes of an I/O bottleneck: inefficient writing; writing too much data.

Network
Standard network monitoring tools usually report packet and byte counts rather than utilization.
netstat is very basic.
nicstat shows utilization statistics. A sustained 40% utilization means the interface is saturated on a LAN.

Java Tools
Working with tuning flags: jcmd, jinfo
Thread information: jstack, jcmd
Heap: jmap
UI: jconsole (classes, GC, etc.); jvisualvm (VM info)

Profiling Tools
Sampling mode: relatively low overhead; fewer measurement artifacts.
Instrumented mode: intrusive, but yields more information.
Thread timelines from profilers show how threads work together.
Native profilers inspect the JVM itself, including the JIT compiler and GC code as well as application code.
Java Mission Control: jmc command.

Chapter 4: “Just in time” compiler
“JIT” sits between interpreted and compiled. The source code is compiled to Java bytecode, which is then interpreted or compiled to native code depending on how hot (frequently executed) that piece of code is. The JVM interprets code first because the longer it watches the code run, the more it learns about how to optimise it when compiling.

Trade-off: when to read values from RAM and when to keep them in registers.
Threads cannot see each other’s registers, so synchronization is needed when values are shared between threads.
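A small sketch of the visibility problem, using a volatile flag as the simplest form of synchronization. The class and method names are hypothetical. Without volatile, the JIT would be free to keep the flag in a register and the loop might never notice the update.

```java
class StopFlag {
    // volatile guarantees that a write by one thread is visible to later reads by other threads.
    private volatile boolean running = true;
    private long iterations; // touched only by the worker thread

    void stop() {
        running = false;
    }

    // Spins until another thread calls stop(); returns how many iterations it ran.
    long spinUntilStopped() {
        while (running) {
            iterations++;
        }
        return iterations;
    }

    // Starts a worker, stops it from the calling thread, and reports whether the worker exited.
    static boolean runDemo() {
        StopFlag flag = new StopFlag();
        Thread worker = new Thread(flag::spinUntilStopped);
        worker.start();
        try {
            Thread.sleep(50);  // let the worker spin for a moment
            flag.stop();
            worker.join(2000); // wait up to two seconds for the worker to see the flag
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker stopped: " + runDemo());
    }
}
```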

JIT compiler versions: 32-bit client; 32-bit server; 64-bit server.
C1 / C2 compiler: client (compilation starts early) / server (compilation starts later but produces faster code).
Tiered compilation: a mix of both of the above; it often performs best for long-running programs. It is enabled by default since Java 8.

32-bit vs. 64-bit: use the 32-bit JIT compiler if the total process size (heap, permgen, native code and native memory) is < 3 GB. The 32-bit JVM has a smaller footprint.

Tuning needed:
JIT version
Code cache, which holds the compiled assembly-language instructions
Compilation Thresholds
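Code cache occupancy can also be read from inside the JVM. The sketch below uses the standard java.lang.management API. Filtering pools by the substring "Code" is an assumption about how HotSpot names them, and the class name is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

class CodeCacheInfo {
    // Returns one line per memory pool that looks like a code cache pool.
    static List<String> report() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // Assumption: code cache pools have "Code" in their name on HotSpot.
            if (!pool.getName().contains("Code")) {
                continue;
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            // getMax() returns -1 when the maximum is undefined.
            lines.add(pool.getName() + ": used=" + usage.getUsed() + " bytes, max=" + usage.getMax());
        }
        return lines;
    }

    public static void main(String[] args) {
        report().forEach(System.out::println);
    }
}
```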

Inspecting the compilation:
1. jstat -compiler [process id]
2. start the program with java -XX:+PrintCompilation
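The two commands above inspect compilation from outside the process. As a complement (not something these notes take from the book), the standard CompilationMXBean exposes the JIT name and accumulated compilation time from inside the JVM. The class name below is hypothetical.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

class CompilationInfo {
    // Describes the JIT compiler of the running JVM.
    static String describe() {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit == null) {
            return "no compilation system in this JVM";
        }
        StringBuilder sb = new StringBuilder("JIT: ").append(jit.getName());
        // Time monitoring is optional, so check before asking for the total.
        if (jit.isCompilationTimeMonitoringSupported()) {
            sb.append(", total compilation time: ")
              .append(jit.getTotalCompilationTime())
              .append(" ms");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```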

Chapter 12: Java SE API

Buffered I/O: always wrap I/O streams in BufferedOutputStream/BufferedInputStream. The exception is ByteArrayInputStream/ByteArrayOutputStream, which are already in-memory buffers.
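A minimal sketch of the wrapping pattern. The class name, temp-file prefix and byte pattern are invented for illustration. Each single-byte write goes into the in-memory buffer instead of becoming its own call to the underlying file stream.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class BufferedWriteDemo {
    // Writes `count` bytes one at a time through a buffer and returns the resulting file size.
    static long writeAndMeasure(int count) {
        Path file = null;
        try {
            file = Files.createTempFile("buffered-demo", ".bin");
            // The buffer collects the single-byte writes and flushes them in large chunks.
            try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(file))) {
                for (int i = 0; i < count; i++) {
                    out.write(i & 0xFF);
                }
            } // close() flushes whatever is still in the buffer
            return Files.size(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (file != null) {
                file.toFile().delete(); // best-effort cleanup of the temp file
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("bytes written: " + writeAndMeasure(100_000));
    }
}
```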

Classloader: Java 7 enables parallel class loading.

Random numbers: java.util.Random is slow when shared between threads; java.util.concurrent.ThreadLocalRandom is much faster in that scenario. In contrast to these two, java.security.SecureRandom draws on system events (entropy-based sources) to generate numbers that are much harder to predict.
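A short sketch of when each generator might be used. The class and method names are hypothetical; the calls themselves are standard JDK API.

```java
import java.security.SecureRandom;
import java.util.concurrent.ThreadLocalRandom;

class RandomDemo {
    // One shared SecureRandom instance; it is safe for use by multiple threads.
    private static final SecureRandom SECURE = new SecureRandom();

    // Fast, contention-free random numbers: each thread uses its own generator.
    static int rollDie() {
        return ThreadLocalRandom.current().nextInt(1, 7); // lower bound inclusive, upper bound exclusive
    }

    // Security-sensitive values (tokens, keys) should come from SecureRandom instead.
    static byte[] newToken(int length) {
        byte[] token = new byte[length];
        SECURE.nextBytes(token);
        return token;
    }

    public static void main(String[] args) {
        System.out.println("die roll: " + rollDie());
        System.out.println("token length: " + newToken(16).length);
    }
}
```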

JNI: avoid it. A “Java calls C” invocation is slower than a “C calls Java” invocation because a this object is passed.

Exception: they are not that slow. Exceptions thrown by the JVM itself are better optimised than exceptions created in user code. Use defensive coding for good performance. Disabling stack traces can speed up class loading because of the hierarchical structure of class loaders (app class loader -> system -> bootstrap (root) loader): each loader that fails to find a class throws an exception that the next one catches.
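Two of the ideas above in a sketch. The names are hypothetical. The stack-trace part shows a per-class way to skip stack trace capture, using the four-argument RuntimeException constructor available since Java 7. That is an assumption about one way to apply the advice, not necessarily the mechanism the book has in mind.

```java
class ExceptionDemo {
    // An exception that never captures a stack trace, which makes it cheaper to create.
    static final class LightweightException extends RuntimeException {
        LightweightException(String message) {
            // cause = null, enableSuppression = false, writableStackTrace = false
            super(message, null, false, false);
        }
    }

    // Defensive coding: test for null up front instead of catching NullPointerException.
    static int lengthOrZero(String s) {
        return s == null ? 0 : s.length();
    }

    public static void main(String[] args) {
        LightweightException e = new LightweightException("no trace");
        System.out.println("stack frames recorded: " + e.getStackTrace().length);
        System.out.println("length of null: " + lengthOrZero(null));
    }
}
```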

String: reuse strings; string encoding and decoding (encode()/decode()) are slow; use StringBuilder; single-line concatenation is optimised by javac.
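A sketch of the StringBuilder advice; the class and method names are hypothetical. Concatenating with += inside a loop creates a new intermediate String on every iteration, whereas a single builder appends in place.

```java
class ConcatDemo {
    // Preferred in loops: one builder, appended to in place.
    static String joinWithBuilder(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }

    // Fine as written: a single-expression concatenation is handled efficiently by the compiler.
    static String greeting(String name) {
        return "Hello, " + name + "!";
    }

    public static void main(String[] args) {
        System.out.println(joinWithBuilder(new String[] {"a", "b", "c"}));
        System.out.println(greeting("world"));
    }
}
```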

Logging: apply minimum logging that meets business needs.

Collection: choose the right data structure; the concurrent collections (the Concurrent* classes) are much faster than they used to be, with little extra memory overhead compared with the unsynchronised versions; initialise collections with the right size / capacity.
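A sketch of sizing collections up front; the class and method names are hypothetical. For HashMap the requested capacity has to account for the default load factor of 0.75, otherwise the map still resizes before reaching the expected size.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PresizedCollections {
    // The backing array is allocated once at the final size, so it never has to grow.
    static List<Integer> squares(int n) {
        List<Integer> out = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            out.add(i * i);
        }
        return out;
    }

    // HashMap resizes once size exceeds capacity * load factor, so request n / 0.75 plus one.
    static Map<Integer, Integer> squareMap(int n) {
        Map<Integer, Integer> out = new HashMap<>((int) (n / 0.75f) + 1);
        for (int i = 0; i < n; i++) {
            out.put(i, i * i);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(squares(5));
        System.out.println(squareMap(5));
    }
}
```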

Lambda: a lambda behaves like an anonymous class, but it is not one: it is compiled to a static method that is called through a helper class in the JDK. The performance of the two is the same in typical cases. Lambdas are faster to create because anonymous classes involve the class loader.
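The two forms side by side, as a sketch with hypothetical names. Both implement the same functional interface and give the same result; only the way the compiler and the JVM materialise them differs.

```java
import java.util.function.IntBinaryOperator;

class LambdaDemo {
    // Anonymous class: compiled to its own class file and loaded by the class loader.
    static final IntBinaryOperator ANONYMOUS = new IntBinaryOperator() {
        @Override
        public int applyAsInt(int a, int b) {
            return a + b;
        }
    };

    // Lambda: same behaviour, but no separate class file is generated at compile time.
    static final IntBinaryOperator LAMBDA = (a, b) -> a + b;

    public static void main(String[] args) {
        System.out.println(ANONYMOUS.applyAsInt(2, 3));
        System.out.println(LAMBDA.applyAsInt(2, 3));
    }
}
```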

Streams: Parallelised code and lazy evaluation.
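A sketch of both properties, with hypothetical names. The first method counts how many elements a sequential pipeline actually pulls before findFirst() is satisfied. The second runs the same kind of pipeline in parallel just by adding parallel().

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

class LazyStreamDemo {
    // Lazy evaluation: the pipeline stops pulling elements once the first match is found.
    static int elementsExamined() {
        AtomicInteger examined = new AtomicInteger();
        IntStream.rangeClosed(1, 1_000_000)
                .peek(i -> examined.incrementAndGet()) // counts every element that enters the pipeline
                .filter(i -> i % 7 == 0)
                .findFirst();
        return examined.get();
    }

    // Parallelised code: the range is split and summed across the common fork/join pool.
    static int parallelSum(int n) {
        return IntStream.rangeClosed(1, n).parallel().sum();
    }

    public static void main(String[] args) {
        System.out.println("elements examined: " + elementsExamined());
        System.out.println("sum 1..100: " + parallelSum(100));
    }
}
```

Although the source range holds a million elements, only those up to the first multiple of 7 are examined.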

Filter: a single stream filter is faster than an iterator.