Fast Allocations in Ruby 3.5

admin

May 22, 2025 - 15:15

Many Ruby applications allocate objects. What if we could make allocating objects six times faster? We can! Read on to learn more!

Speeding up allocations in Ruby

Object allocation in Ruby 3.5 will be much faster than in previous versions of Ruby. I want to start this article with benchmarks and graphs, but if you stick around I’ll also explain how we achieved this speedup.

For allocation benchmarks, we’ll compare types of parameters (positional and keyword) with and without YJIT enabled. We’ll also vary the number of parameters we pass to initialize so that we can see how performance changes as the number of parameters increases.

The full benchmark code is included further down, but it’s basically as follows:

class Foo
  # Measure performance as parameters increase
  def initialize(a1, a2, aN)
  end
end

def test
  i = 0
  while i < 5_000_000
    Foo.new(1, 2, N)
    Foo.new(1, 2, N)
    Foo.new(1, 2, N)
    Foo.new(1, 2, N)
    Foo.new(1, 2, N)
    i += 1
  end
end

test

Full Benchmark Code

Positional parameters benchmark:

N = (ARGV[0] || 0).to_i

class Foo
  class_eval <<-eorb
  def initialize(#{N.times.map { "a#{_1}" }.join(", ") })
  end
  eorb
end

eval <<-eorb
def test
  i = 0
  while i < 5_000_000
    Foo.new(#{N.times.map { _1.to_s }.join(", ") })
    Foo.new(#{N.times.map { _1.to_s }.join(", ") })
    Foo.new(#{N.times.map { _1.to_s }.join(", ") })
    Foo.new(#{N.times.map { _1.to_s }.join(", ") })
    Foo.new(#{N.times.map { _1.to_s }.join(", ") })
    i += 1
  end
end
eorb

test

Keyword parameters benchmark:

N = (ARGV[0] || 0).to_i

class Foo
  class_eval <<-eorb
  def initialize(#{N.times.map { "a#{_1}:" }.join(", ") })
  end
  eorb
end

eval <<-eorb
def test
  i = 0
  while i < 5_000_000
    Foo.new(#{N.times.map { "a#{_1}: #{_1}" }.join(", ") })
    Foo.new(#{N.times.map { "a#{_1}: #{_1}" }.join(", ") })
    Foo.new(#{N.times.map { "a#{_1}: #{_1}" }.join(", ") })
    Foo.new(#{N.times.map { "a#{_1}: #{_1}" }.join(", ") })
    Foo.new(#{N.times.map { "a#{_1}: #{_1}" }.join(", ") })
    i += 1
  end
end
eorb

test

We want to measure how long this script will take, but change the number and type of parameters we pass. To emphasize the cost of object allocation while minimizing the impact of loop execution, the benchmark allocates several objects per iteration.
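The article doesn't say how the script was timed, so here is a minimal sketch of one way to do it from inside Ruby, assuming a monotonic clock is good enough. `elapsed_seconds` is a hypothetical helper, and the iteration count is deliberately much smaller than the benchmark's so it finishes quickly.

```ruby
class Foo
  def initialize(a, b, c)
  end
end

# Hypothetical helper: wall-clock seconds spent in the block,
# measured with the monotonic clock so system time changes don't matter.
def elapsed_seconds
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

iterations = 100_000 # far smaller than the benchmark's 5_000_000

seconds = elapsed_seconds do
  i = 0
  while i < iterations
    Foo.new(1, 2, 3)
    i += 1
  end
end

puts format("%d allocations in %.4fs", iterations, seconds)
```

Running the same file under two Ruby builds and dividing the two times gives the kind of ratio plotted below.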

Running the benchmark code with 0 to 8 parameters, varying parameter type and whether or not YJIT is enabled will produce the following graph:

The graph illustrates the speedup ratio, calculated by dividing the time spent on Ruby 3.4.2 by the time spent on Ruby 3.5. That means any value below 1 represents a slowdown, while any value above 1 represents a speedup. When we compare Ruby 3.5 to Ruby 3.4.2, we either disable YJIT on both versions or enable YJIT on both versions. In other words, we compare Ruby 3.5 with Ruby 3.4.2, and Ruby 3.5+YJIT with Ruby 3.4.2+YJIT.
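As a small sketch of that calculation: `speedup_ratio` is a hypothetical helper, and the two timings are placeholder values chosen for easy arithmetic, not measurements from the benchmark.

```ruby
# Speedup ratio as defined above: baseline time divided by new time.
def speedup_ratio(baseline_seconds, candidate_seconds)
  baseline_seconds.to_f / candidate_seconds
end

# Placeholder timings for illustration only, not measured results.
baseline  = 3.0 # seconds on the older Ruby
candidate = 1.5 # seconds on the newer Ruby

ratio = speedup_ratio(baseline, candidate)
puts format("%.2fx", ratio)
puts ratio > 1 ? "speedup" : "no speedup"
```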

The X axis shows the number of parameters passed to initialize, and the Y axis is the speedup ratio. The blue bars are positional parameters without YJIT, the green bars are positional parameters with YJIT. The grey bars are keyword parameters without YJIT, and the yellow bars are keyword parameters with YJIT.

First, we can see that all bars are above 1, meaning that every allocation type is faster on Ruby 3.5 than on Ruby 3.4.2. Positional parameters have a constant speedup ratio regardless of the number of parameters.

Positional Parameter Comparison

For positional parameters the speedup ratio remains constant regardless of the number of parameters. Without YJIT, Ruby 3.5 is always about 1.8x faster than Ruby 3.4.2. When we enable YJIT, Ruby 3.5 is always about 2.3x faster.

Keyword Parameter Comparison

Keyword parameters are a little more interesting. For both the interpreter and YJIT, as the number of keyword parameters increases, the speedup ratio also increases. In other words, the more keyword parameters used, the more effective this change is.

With just 3 keyword parameters passed to initialize, Ruby 3.5 is 3x faster than Ruby 3.4.2, and if we enable YJIT it’s over 6.5x faster.

Bottlenecks in Class#new

I’ve been interested in speeding up allocations, and thus Class#new, for a while. But what made it slow?

Class#new is a very simple method. All it does is allocate an instance, pass all parameters to initialize, and then return the instance. If we were to implement Class#new in Ruby, it would look something like this:

class Class
  def new(...)
    instance = allocate
    # initialize is private, so real code would need send(:initialize, ...)
    instance.initialize(...)
    instance
  end
end

The implementation has two main parts. First, it allocates a bare object with allocate, and second it calls the initialize method, forwarding all parameters new received. So to speed up this method, we can either speed up object allocation, or speed up calling out to the initialize method.
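Those two steps can be reproduced by hand. This is only a sketch using a made-up `Point` class; because `initialize` is private, the manual call goes through `send`.

```ruby
class Point
  attr_reader :x, :y

  def initialize(x, y)
    @x = x
    @y = y
  end
end

# Step 1: allocate a bare instance. initialize has not run yet,
# so no instance variables are set.
bare = Point.allocate
p bare.x # => nil

# Step 2: run initialize by hand. It is private, so we go through send.
bare.send(:initialize, 1, 2)

# The result matches what Point.new(1, 2) builds in one step.
built = Point.new(1, 2)
p [bare.x, bare.y] == [built.x, built.y] # => true
```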

Speeding up allocate means speeding up the garbage collector, and while there are merits to doing that, I wanted to focus on the runtime side of the equation. That means trying to decrease the overhead of calling out to another method. So what makes a method call slow?

Calling Ruby methods from Ruby

Ruby’s virtual machine, YARV, uses a stack as a scratch space for processing values. We can think of this stack as a really large heap allocated array. Every time we process a YARV instruction, we’ll read or write to this heap allocated array. This is also true for passing parameters between functions.

When we call a function in Ruby, the caller pushes parameters to the stack before the call is made to the callee. The callee then reads its parameters from the stack, does any processing it needs, and returns.

def add(a, b)
  a + b
end

def call_add
  add(1, 2)
end

For example in the above code, the caller call_add will push the arguments 1 and 2 to the stack before calling the add function. When the add function reads its parameters in order to perform the +, it reads a and b from the stack. The values pushed by the caller become the parameters for the callee. You can see this in action in our recent post about Launching ZJIT.

This “calling convention” is convenient because the arguments pushed to the stack don’t need to be copied anywhere when they become the parameters to the callee. If you examine the memory addresses for where 1 and 2 are stored, you’ll see that they are the same addresses used for the values of a and b.
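If you want to see the pushes and reads yourself, CRuby can print the bytecode for both methods. This sketch assumes CRuby, since `RubyVM::InstructionSequence` is not available on other implementations, and the exact instruction names vary between versions.

```ruby
def add(a, b)
  a + b
end

def call_add
  add(1, 2)
end

# RubyVM::InstructionSequence is specific to CRuby.
caller_insns = RubyVM::InstructionSequence.of(method(:call_add)).disasm
callee_insns = RubyVM::InstructionSequence.of(method(:add)).disasm

# The caller pushes 1 and 2 with putobject instructions before the send.
puts caller_insns

# The callee reads a and b back with getlocal instructions.
puts callee_insns
```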

Calling C methods from Ruby

Unfortunately C functions do not use the same calling convention as Ruby functions. That means when we call a C function from Ruby, or a Ruby function from C, we must convert method parameters to their respective calling convention.

In C, parameters are passed via registers or the machine stack. This means that when we call a C function from Ruby, we need to copy values from the Ruby stack into registers. Likewise, when we call a Ruby function from C, we must copy register values to the Ruby stack.

This conversion between calling conventions takes some time, so this is a place we can target for optimization.

When calling a C function from Ruby, positional parameters can be directly copied to registers.

static VALUE
foo(VALUE a, VALUE b)
{
  return INT2NUM(NUM2INT(a) + NUM2INT(b));
}

# calls the `foo` C function
foo(1, 2)

In the above example, on ARM64, the parameters a and b will be in the X0 and X1 registers respectively. When we call the foo function from Ruby, the parameters can be copied directly to the X0 and X1 registers from the Ruby stack.

Unfortunately the conversion isn’t so simple for keyword parameters. Since C doesn’t support keyword parameters, we have to pass the keyword parameters as a hash to the C function. This means allocating a new hash, iterating over the parameters, and setting them in the hash.

We can see this in action with the following program when run on Ruby 3.4.2:

class Foo
  def initialize(a:)
  end
end

def measure_allocations
  x = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - x
end

def test
  measure_allocations { Foo.new(a: 1) }
end

# We need to warm the callsite before measurement because inline caches are Ruby
# objects, so they will skew our results
test # warmup
test # warmup
p test

If we run the above program with Ruby 3.4.2, we’ll see that the test method allocates 2 objects: an instance of Foo, and a hash for passing the keyword parameters to the C implementation of Class#new.

Achieving an allocation speedup

I want to start first with a little bit of history.

I’ve been interested in speeding up allocations for quite some time. We know that calling a C function from Ruby incurs some overhead, and that the overhead depends on the type of parameters we pass. So my initial inclination was to rewrite Class#new in Ruby. Since Class#new just forwards all of its parameters to initialize, it seemed quite natural to use the triple-dot forwarding syntax (...). You can find remnants of my initial implementation here. Unfortunately I found that using ... was quite expensive because at the time, it was syntactic sugar for *, **, &, and Ruby would allocate extra objects to represent these splat parameters.

This led me to implement an optimization for .... The optimization for ... allowed us to use parameter forwarding without allocating any extra objects. I think this optimization is useful in general, but what I had in mind was using it for Class#new. Fast forward some months, and I was able to implement Class#new in Ruby with this new optimization. The initial benchmarks were decent: it eliminated allocations and decreased the cost of passing parameters from new to initialize. But I was somewhat worried about inline cache misses at this call site.
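To make the "syntactic sugar" point concrete, here is a sketch showing that the two spellings forward the same arguments. The method names are made up for illustration, and the snippet only checks behavior; it does not measure allocations, which depend on the Ruby version.

```ruby
def target(*args, **kwargs, &block)
  [args, kwargs, block&.call]
end

# Forwarding with the triple-dot syntax.
def forward_dots(...)
  target(...)
end

# The explicit spelling that ... used to be sugar for.
def forward_splats(*args, **kwargs, &block)
  target(*args, **kwargs, &block)
end

a = forward_dots(1, 2, key: :value) { :from_block }
b = forward_splats(1, 2, key: :value) { :from_block }
p a == b # => true
```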

The Class#new implementation linked to above is a little complex, but if we boil it down, it’s essentially the same as the Class#new implementation we saw at the beginning of the post:

class Class
  def new(...)
    instance = allocate
    # initialize is private, so real code would need send(:initialize, ...)
    instance.initialize(...)
    instance
  end
end

The problem with the above code is the inline cache at the initialize call site. When we make method calls, Ruby will try to cache the destination of that call. That way we can speed up subsequent calls on the same type at that call site.

CRuby only has a monomorphic inline cache, meaning it can only store one cache entry at any particular call site. The inline cache is used to help look up the method we will call, and the key to the cache is the class of the receiver (in this case, the class of the instance local variable). Each time the type of the receiver changes, the cache misses, and we have to do a slow path lookup of the method.

It’s very rare for code to allocate exactly the same type of object many times in a row, so the class of the instance local variable will change quite frequently, which means we could potentially have very poor cache hit rates. Even if the call site could support multiple cache entries (a “polymorphic” inline cache), the cardinality at this particular call site would be so high that cache hit rates would still be quite poor.
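A toy model can make the miss behavior easier to picture. `ToyCallSite` below is a hypothetical simulation written for this post's explanation, not how CRuby implements its caches: it keeps a single entry keyed by the receiver's class and counts hits and misses.

```ruby
# Toy model of a monomorphic inline cache: one entry keyed by receiver class.
# ToyCallSite is a hypothetical name, not CRuby's implementation.
class ToyCallSite
  attr_reader :hits, :misses

  def initialize(method_name)
    @method_name = method_name
    @cached_class = nil
    @cached_method = nil
    @hits = 0
    @misses = 0
  end

  def call(receiver, *args)
    if receiver.class.equal?(@cached_class)
      @hits += 1
    else
      # Slow path: full method lookup, then overwrite the single entry.
      @misses += 1
      @cached_class = receiver.class
      @cached_method = receiver.class.instance_method(@method_name)
    end
    @cached_method.bind_call(receiver, *args)
  end
end

class Cat
  def speak
    "meow"
  end
end

class Dog
  def speak
    "woof"
  end
end

# Same receiver class every time: one miss to fill the cache, then hits.
same = ToyCallSite.new(:speak)
4.times { same.call(Cat.new) }
p [same.hits, same.misses] # => [3, 1]

# Alternating receiver classes: every call evicts the previous entry.
mixed = ToyCallSite.new(:speak)
[Cat.new, Dog.new, Cat.new, Dog.new].each { |pet| mixed.call(pet) }
p [mixed.hits, mixed.misses] # => [0, 4]
```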

I showed this PR to Koichi Sasada (author of YARV), and he suggested that instead of implementing Class#new in Ruby, we add a new YARV instruction and “inline” the implementation of Class#new. I worked with John Hawthorn on it, and we had a prototype implementation done within a week. Fortunately (or unfortunately) this prototype turned out to be much faster than the Ruby implementation of Class#new, so I decided to abandon the Ruby version.

Inlining Class#new

So what is inlining? Inlining is pretty much just copy / pasting code from the callee to the caller.

Foo.new

Any time the compiler sees code like the above, instead of generating a simple method call to new, it generates the instructions that new would have used but at the call site of new.

To make this more concrete, let’s look at the instructions for the above code before and after inlining.

Here is the bytecode for Foo.new before inlining:

> ruby -v --dump=insns -e'Foo.new'
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +PRISM [arm64-darwin24]
== disasm: #<ISeq:<main>@-e:1 (1,0)-(1,7)>
0000 opt_getconstant_path                   <ic:0 Foo>                (   1)[Li]
0002 opt_send_without_block                 <calldata!mid:new, argc:0, ARGS_SIMPLE>
0004 leave

Here is the bytecode for Foo.new after inlining:

> ./ruby -v --dump=insns -e'Foo.new'
ruby 3.5.0dev (2025-04-29T20:36:06Z master b5426826f9) +PRISM [arm64-darwin24]
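== disasm: #<ISeq:<main>@-e:1 (1,0)-(1,7)>
0000 opt_getconstant_path                   <ic:0 Foo>                (   1)[Li]
0002 putnil
0003 swap
0004 opt_new                                <calldata!mid:new, argc:0, ARGS_SIMPLE>, 11
0007 opt_send_without_block                 <calldata!mid:initialize, argc:0, FCALL|ARGS_SIMPLE>
0009 jump                                   14
0011 opt_send_without_block                 <calldata!mid:new, argc:0, ARGS_SIMPLE>
0013 swap
0014 pop
0015 leave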

Before inlining, the instructions look up the constant Foo, then call the new method. After inlining, we still look up the constant Foo, but instead of calling the new method, there are a bunch of other instructions.

The most important of these new instructions is the opt_new instruction which allocates a new instance and writes that instance to the stack. Immediately after the opt_new instruction we see a method call to initialize. These instructions effectively allocate a new instance and call initialize on that instance, the same thing that Class#new would have done, but without actually calling Class#new.
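You can check from Ruby itself whether the interpreter you are running emits this instruction. This is a small sketch that assumes CRuby (`RubyVM::InstructionSequence` is CRuby-specific); compiling does not evaluate the code, so `Foo` does not need to exist.

```ruby
# RubyVM::InstructionSequence is specific to CRuby.
# compile only builds bytecode; it does not run it.
insns = RubyVM::InstructionSequence.compile("Foo.new").disasm
puts insns

if insns.include?("opt_new")
  puts "this Ruby inlines Class#new at the call site"
else
  puts "this Ruby sends new as an ordinary method call"
end
```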

What’s really nice about this is that any parameters pushed onto the stack are left on the stack for the initialize method to consume. Where we had to do copies in the C implementation, there are no longer any copies! Additionally, we no longer push and pop a stack frame for Class#new which further speeds up our code.

Finally, since every call to new includes another call to initialize we have very good cache hit rates compared to the pure Ruby implementation of Class#new. Rather than one initialize call site, we have an initialize call site at every call to new.
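A rough simulation shows why this matters. It assumes a made-up program with three call sites that each allocate a different class, and a one-entry cache per call site; `misses_for` is a hypothetical helper, and the numbers describe the simulation only.

```ruby
# Count misses for a one-entry cache, given the sequence of receiver
# classes seen at a single call site.
def misses_for(classes_seen)
  cached = nil
  classes_seen.count do |klass|
    miss = !klass.equal?(cached)
    cached = klass
    miss
  end
end

# Hypothetical program: three call sites, each allocating one class.
allocation_sites = { site_a: String, site_b: Array, site_c: Hash }
trace = allocation_sites.keys.cycle.first(9) # a, b, c, a, b, c, ...

# One shared initialize call site (Class#new written in Ruby).
shared = misses_for(trace.map { |site| allocation_sites[site] })

# One initialize call site per call to new (inlined).
per_site = trace.group_by(&:itself).sum do |site, visits|
  misses_for(visits.map { allocation_sites[site] })
end

p shared   # => 9
p per_site # => 3
```

With a shared call site every allocation evicts the previous entry, while per-site caches only miss the first time each site runs.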

Eliminating a stack frame, eliminating parameter copies, and improving inline cache hits are the major advantages of this optimization.

Downsides to Inlining

Of course this optimization is not without downsides.

First, there are more instructions, so it requires more memory usage. However, this memory increase only grows in proportion to the number of call sites that use new. We measured this in our monolith and only saw a 0.5% growth in instruction sequence size, which is an even smaller percentage of overall heap size.

Second, this optimization introduces a small backwards incompatibility. Consider the following code:

class Foo
  def initialize
    puts caller
  end
end

def test
  Foo.new
end

test

If we run this code with Ruby 3.4, the output is like this:

> ruby -v test.rb
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +PRISM [arm64-darwin24]
test.rb:8:in 'Class#new'
test.rb:8:in 'Object#test'
test.rb:11:in '<main>'

If we run this code with Ruby 3.5, the output is like this:

> ./ruby -v test.rb
ruby 3.5.0dev (2025-04-29T20:36:06Z master b5426826f9) +PRISM [arm64-darwin24]
test.rb:8:in 'Object#test'
test.rb:11:in '<main>'

The Class#new frame is missing from Ruby 3.5 and that is because the frame has been eliminated.
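If your code inspects backtraces, a quick probe can tell you which behavior you are running under. This is a sketch with made-up names (`FrameProbe`, `build_probe`); it relies on frame labels, whose exact text differs between Ruby versions, so it only checks whether the nearest caller frame ends in "new".

```ruby
class FrameProbe
  attr_reader :labels

  def initialize
    # Labels of the frames that led to this initialize call.
    @labels = caller_locations.map(&:label)
  end
end

def build_probe
  FrameProbe.new
end

labels = build_probe.labels

# Without inlining, the nearest caller frame is labeled "new" or
# "Class#new". With inlining, it is build_probe itself.
has_new_frame = labels.first.end_with?("new")
puts has_new_frame ? "Class#new frame present" : "Class#new frame eliminated"
```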

Conclusion

If you’ve made it this far, I hope you found the topic interesting. I’m really excited for Ruby 3.5 to be released later this year, and I hope you are too! I want to thank Koichi Sasada for suggesting inlining (and the opt_new instruction) and John Hawthorn for helping me with the implementation.

If you’re curious, take a look at the implementation in the pull request and the discussion in the Redmine ticket. I didn’t explain every detail of this patch (for example, what happens if you’re calling new on something that isn’t a class?), so if you have questions don’t hesitate to email me or ask on social media.

Have a good day!