Rational Java: escape analysis

This post looks at escape analysis, in particular how long it takes for the jvm to effect escape analysis in a running program. I make some observations but don't have all the explanation at this point.

By way of introduction let's take a detour to look at a little known and even less used flag (which we'll see is a good thing) in the jvm, -Xcomp.

The behaviour for this flag is defined in the jvm documentation as:

-Xcomp

Forces compilation of methods on first invocation. By default, the Client VM (-client) performs 1,000 interpreted method invocations and the Server VM (-server) performs 10,000 interpreted method invocations to gather information for efficient compilation. Specifying the -Xcomp option disables interpreted method invocations to increase compilation performance at the expense of efficiency.

At first sight this seems to be an excellent option. A shortcut to warming up the jvm through 10,000 cycles - we can get the code to compile straight away. Shouldn't we always enable this option by default?

But the documentation does warn that this will be 'at the expense of efficiency'.

The jvm learns about code behaviour in the 10,000 warmup cycles so that when it comes to compiling it compiles in the most efficient way possible. Compiling the code right away will mean that yes the code is indeed compiled but that the compiled code may not be the most efficient. You can read more about it in this blogpost - but that's not really the subject of this post.

Something else that doesn't happen if you use -Xcomp is escape analysis. This is actually rather surprising as the jvm shouldn't need to learn about whether escape analysis is possible by running the program. This should be evident by a static analysis of the code.

Have a look at this code (I was inspired by the ideas in this blog):

We need to make sure that the program runs without a gc (I suggest these flags):

-verbosegc -Xmx4g -Xms4g

When the program waits for input carry out a heap dump to see how many Optional object have been created. Then hit any key to resume the program.

To perform a heap dump first run jps to identify the pid of the program then run:

jmap -histo pid | head

Do this once without the -Xcomp flag and once with -Xcomp flag.

Without the -Xcomp flag:

After first iteration:

num #instances #bytes class name

----------------------------------------------

1: 644 123241360 [I

2: 234496 3751936 java.util.Optional

3: 6582 791968 [C

4: 2717 540712 [B

5: 4786 114864 java.lang.String

6: 662 75144 java.lang.Class

7: 1349 63912 [Ljava.lang.Object;

After second iteration:

num #instances #bytes class name

----------------------------------------------

1: 644 116687472 [I

2: 644095 10305520 java.util.Optional

3: 6583 792056 [C

4: 2717 540712 [B

5: 4787 114888 java.lang.String

6: 662 75144 java.lang.Class

7: 1349 63912 [Ljava.lang.Object;

All subsequent iterations are the same no further objects are created:

There's clearly escape analysis kicking in after 234k iterations - not sure why it should take so long, usually (for example with compiling code) 10k iterations is enough? Also in the second iteration it creates another ~400k objects before escape analysis kick in which is also a bit mysterious.

With the -Xcomp flag

After the first iteration:

num #instances #bytes class name

----------------------------------------------

1: 653 153880352 [I

2: 1000001 16000016 java.util.Optional

3: 7397 834744 [C

4: 2717 540728 [B

5: 5685 136440 java.lang.String

6: 672 76208 java.lang.Class

7: 1349 63912 [Ljava.lang.Object;

After the second iteration:

num #instances #bytes class name

----------------------------------------------

1: 654 159354896 [I

2: 2000001 32000016 java.util.Optional

3: 7398 834832 [C

4: 2717 540728 [B

5: 5686 136464 java.lang.String

6: 672 76208 java.lang.Class

7: 1349 63912 [Ljava.lang.Object;

After each iteration the number of Optional objects goes up by 1m.

Summary

-Xcomp is a switch that should almost certainly never be used in production. I can imagine some scenarios where you might want to play around with disabling the interpreter but those would be very specific edge cases.
It seems to take at least 200K iteration for escape analysis to be effective. So you need to allow longer than the 10k iterations for a full warm up.
There is also another phase where after escaping out objects it seems to need to do this again. This needs further understanding.
If you slow the program down a bit by doing some work in between the calls to create the Optional the number of objects reduces. For example I found that a call to Math.sin reduces the Optional objects by about 50%.

In an earlier post I wrote about how to start optimising for performance. If you haven't read the post I would advise you do have a look but the long and short of it is that it's pretty much always worth reducing allocations as a first step to improving performance. But understanding exactly where your program actually allocates is not as easy as you might think...

Whilst reviewing an article on a similar topic it occurred to me that it would be interesting to try and measure exactly how much longer it takes to allocate as opposed to reuse an existing object. We know that allocation is really fast in Java (ignore the side effects of object allocation for now) so let's see how allocation compares to object re-use.

This is the first attempt at the code I wrote to try and measure that change:

The output form the program was as below:

No allocation took 15 with result 100000000
Allocations took 11 with result 100000000
No allocation took 12 with result 100000000
Allocations took 13 with result 100000000
No allocation took 0 with result 100000000
Allocations took 1 with result 100000000

Hardly anything in it - but then I got suspicious....

Was the testAllocations method actually allocating or was escape analysis being applied? Escape analysis allows the JIT to bypass the object allocation and use the stack instead of the heap. See here for more details.

Well one way to find out is to run the program with the command line setting -verbosegc and see if the gc was being invoked. As I suspected there was no gc, which, given the fact I had not set the -Xms and -Xmx flags should not have been possible.

So I wrote another little program forcing the testAllocate method to 'escape' and thereby allocate:

The output for this was very different, see below:

I've run with -verbose:gc - I've picked out the System.out in red so that the times can be easily identified.

*** No allocation took 43 with result 100000000
[GC (Allocation Failure) 111405K->108693K(142336K), 0.1243020 secs]
[Full GC (Ergonomics) 108693K->69499K(190976K), 0.3276730 secs]
[GC (Allocation Failure) 102779K->102923K(204288K), 0.1275237 secs]
[GC (Allocation Failure) 149515K->149731K(224256K), 0.2172119 secs]
[Full GC (Ergonomics) 149731K->149326K(359936K), 1.1698787 secs]
[GC (Allocation Failure) 215886K->215502K(365056K), 0.2680804 secs]
**** Allocations took 2321 with result 100000000
*** No allocation took 37 with result 100000000
[Full GC (Ergonomics) 365307K->52521K(314368K), 0.1279458 secs]
[GC (Allocation Failure) 124201K->124497K(314368K), 0.2493061 secs]
[GC (Allocation Failure) 196177K->196401K(419840K), 0.3004662 secs]
**** Allocations took 733 with result 100000000
[GC (Allocation Failure) 276094K->274825K(429056K), 0.5655052 secs]
[Full GC (Ergonomics) 274825K->1024K(427520K), 0.0042224 secs]
*** No allocation took 583 with result 100000000
[GC (Allocation Failure) 98304K->59382K(527872K), 0.0727687 secs]
[GC (Allocation Failure) 224246K->224582K(534016K), 0.1465164 secs]
**** Allocations took 263 with result 100000000
*** No allocation took 15 with result 100000000
[GC (Allocation Failure) 389446K->167046K(643584K), 0.0831585 secs]
**** Allocations took 119 with result 100000000
[GC (Allocation Failure) 366101K->91504K(653312K), 0.0006495 secs]
*** No allocation took 15 with result 100000000
[GC (Allocation Failure) 329072K->290206K(681472K), 0.1650293 secs]
**** Allocations took 202 with result 100000000

As you can see this time the program is definitely allocating and the time for the allocated test is considerably slower.

Let's see if we can eliminate the need for GC pauses by setting a high -Xms value.

This time the GC is almost eliminated and we get these results:

*** No allocation took 48 with result 100000000
**** Allocations took 162 with result 100000000
*** No allocation took 44 with result 100000000
**** Allocations took 172 with result 100000000
*** No allocation took 28 with result 100000000
**** Allocations took 179 with result 100000000
*** No allocation took 40 with result 100000000
**** Allocations took 365 with result 100000000
*** No allocation took 73 with result 100000000
[GC (Allocation Failure) 1572864K->188862K(6029312K), 0.2602120 secs]
**** Allocations took 499 with result 100000000
*** No allocation took 43 with result 100000000
**** Allocations took 62 with result 100000000
*** No allocation took 17 with result 100000000
**** Allocations took 228 with result 100000000
*** No allocation took 95 with result 100000000
**** Allocations took 364 with result 100000000
*** No allocation took 46 with result 100000000
**** Allocations took 194 with result 100000000
*** No allocation took 14 with result 100000000
[GC (Allocation Failure) 1761726K->199238K(6029312K), 0.2232670 secs]

**** Allocations took 282 with result 100000000

So even without GC, allocating is considerably slower.

The next step was to examine what was happening by looking at my programs with a profiler.

I ran the first program, Allocation1, with my favourite profiler, YourKit and started seeing very different behaviour. The program which we had previously shown not be allocating, was now allocating (I could see this my running with verbose GC below).

This is the output:

[YourKit Java Profiler 2014 build 14114] Log file: /Users/daniel/.yjp/log/Allocation1-11980.log
No allocation took 20 with result 100000000
[GC (Allocation Failure) 33280K->1320K(125952K), 0.0032859 secs]
[GC (Allocation Failure) 34600K->1232K(125952K), 0.0028476 secs]
[GC (Allocation Failure) 34512K->1136K(125952K), 0.0020955 secs]
[GC (Allocation Failure) 34416K->1168K(159232K), 0.0024346 secs]
[GC (Allocation Failure) 67728K->1240K(159232K), 0.0021334 secs]
Allocations took 109 with result 100000000
No allocation took 26 with result 100000000
[GC (Allocation Failure) 67800K->1376K(222208K), 0.0036706 secs]
[GC (Allocation Failure) 134496K->1508K(222208K), 0.0024567 secs]
Allocations took 98 with result 100000000
No allocation took 1 with result 100000000
[GC (Allocation Failure) 134628K->1516K(355840K), 0.0008581 secs]

Allocations took 56 with result 100000000

And in YourKit we see the creation of those objects:

Running with YourKit had interfered with the escape analysis and caused the objects to be allocated!!

The advice I gave in my previous post 'First Rule of Performance Optimisation' was to look to eliminate allocation. This program (Allocation1) looks (from YourKit's view) like it is allocating and a developer, following my advice, might spend considerable effort reducing that allocation. The developer would be surprised and disappointed when they found that on rerunning the program (without YourKit) that nothing at all had been achieved. Of course now we know why that is, allocation were only a side-effect of running the profiler! Without the profiler the allocations were being 'escaped-anaysised' away.

So how are we to know whether allocations we see in the profiler are real or just caused by the profiler interfering with escape analysis.

The answer is to use Flight Recorder. See my previous post on the differences between Flight Recorder and other profilers.

This is the Flight Recording when we run Allocation1.

As you can see no allocations have been recorded. Contrast this with the Flight Recording we get from running Allocation2.

Conclusions:

It is always better not to allocate objects
But escape analysis means that calling new does not always lead to an allocation!
In general, profilers interfere with escape analysis so be careful of fixing an allocation that, without the profiler might be escaped away.
Using Flight Recorder should not interfere with escape analysis and give you a more accurate reflection as to your real allocation.

Further Reading on the subject:

I subsequently came across this excellent (as usual) post from Nitsan Wakart
http://psy-lob-saw.blogspot.co.uk/2014/12/the-escape-of-arraylistiterator.html

Rational Java

Pages

Tuesday, 22 December 2015

How long does it take the jvm to effect escape analysis? Maybe longer than you think.

Monday, 18 May 2015

Revisiting the First Rule of Performance Optimisation: Effects of Escape Analysis