Vector API in Java 25
Last Updated: December 16, 2025
By: javahandson
Vector API in Java 25 allows developers to write high-performance SIMD code (Single Instruction, Multiple Data) directly in Java. Learn how vectors, vector species, loopBound, and masked operations work, with clear explanations and practical code examples for modern CPUs.
As modern CPUs continue to evolve, performance gains no longer come primarily from higher clock speeds. Instead, processors achieve better performance by executing multiple data operations in parallel using SIMD (Single Instruction, Multiple Data) instructions. While Java has long benefited from JVM and JIT optimizations, developers have traditionally had no explicit control over how and when such vectorized execution takes place.
Without the Vector API, we largely depend on the JVM’s auto-vectorization, where the JIT compiler attempts to optimize simple loops into vector instructions. Although this works in some cases, it is not guaranteed, transparent, or consistent, especially for complex logic or performance-critical workloads. This unpredictability makes it difficult to write Java code that reliably takes advantage of modern CPU capabilities.
The Vector API in Java 25 addresses this limitation by allowing us to express vector computations directly and explicitly in Java code. Instead of hoping the JVM recognizes a vectorization opportunity, we can clearly state our intent to operate on multiple data elements in parallel. The JVM then maps these operations to the most efficient SIMD instructions supported by the underlying hardware, such as AVX on x86 or NEON on ARM.
Most importantly, the Vector API preserves Java’s core strengths—portability, safety, and readability. We write the same Java code regardless of the CPU architecture, and the JVM dynamically chooses the optimal vector size at runtime. This makes it possible to achieve predictable high performance without sacrificing maintainability or writing platform-specific native code.
In short, the Vector API matters in Java 25 because it bridges the gap between modern hardware capabilities and high-level Java programming, enabling us to build faster, more scalable, and future-ready applications.
The Vector API is an incubating Java library API (module jdk.incubator.vector), not a change to the language itself, that allows us to perform data-parallel operations directly in Java code. It enables a single operation to be applied to multiple data elements at the same time, using the SIMD (Single Instruction, Multiple Data) capabilities of modern CPUs. Instead of processing values one by one, we can express computations that work on entire vectors of data in parallel.
Delivered in Java 25 as JEP 508 (the tenth incubation round of an API that first incubated in JDK 16), the Vector API provides a set of specialized classes for working with vectors of primitive types such as int, float, and double. These classes allow us to load data from arrays into vectors, perform arithmetic or logical operations on all elements simultaneously, and store the results back into memory. The JVM then translates these high-level vector operations into the most efficient hardware instructions available on the running system.
A key aspect of the Vector API is its platform independence. We do not write CPU-specific code or worry about instruction sets like AVX or NEON. Instead, we describe what computation should happen, and the JVM decides how to execute it optimally for the current hardware. This ensures that the same Java code can run efficiently across different processors without modification.
Unlike JVM auto-vectorization, which is implicit and unpredictable, the Vector API makes vectorization explicit and intentional. This gives us better control, more predictable performance, and clearer reasoning about how our code uses modern hardware capabilities. At the same time, the API remains safe, readable, and consistent with Java’s design principles.
In summary, the Vector API brings explicit SIMD programming to Java in a portable and high-level way, making it possible to write high-performance numerical code without leaving the Java ecosystem.
To use the Vector API effectively in Java 25, we must first understand three fundamental concepts: VectorSpecies, lanes, and vectors. These concepts define how data is grouped, how many elements are processed in parallel, and how computations are executed using SIMD instructions. Once these ideas are clear, the rest of the Vector API becomes much easier to reason about.
A VectorSpecies describes the shape of a vector. It answers a very important question:
How many elements of a particular type can be processed in parallel on this machine?
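A species defines the following:
Element type – the primitive type stored in each lane (for example, int)
Vector shape – the size of the vector in bits (for example, 512 bits)
Lane count – the number of elements that fit in one vector (for example, 16 int values in 512 bits)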
In most cases, we do not hardcode a specific size. Instead, we ask the JVM to choose the most efficient vector shape supported by the current CPU.
package com.javahandson.jep508;

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

public class Main {

    private static final VectorSpecies<Integer> SPECIES =
            IntVector.SPECIES_PREFERRED;

    public static void main(String[] args) {
        System.out.println("Vector species: " + SPECIES);
        System.out.println("Lane count: " + SPECIES.length());
    }
}
Output:
Vector species: Species[int, 16, S_512_BIT]
Lane count: 16
This tells us that the JVM selected a 512-bit vector, capable of processing 16 integers in parallel. On a different machine, the same code might use 8 or 4 lanes instead. This adaptability is what makes the Vector API portable across platforms.
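The relationship between vector size and lane count is simple arithmetic, and it can be checked in plain Java without the incubator module. The sketch below is only an illustration: the class LaneCountSketch and the helper laneCount are hypothetical names, not part of the Vector API.

```java
public class LaneCountSketch {

    // Hypothetical helper (not a Vector API method):
    // lanes = vector size in bits / element size in bits.
    public static int laneCount(int vectorBits, int elementBits) {
        return vectorBits / elementBits;
    }

    public static void main(String[] args) {
        int[] vectorSizes = {128, 256, 512};
        for (int bits : vectorSizes) {
            // Byte.SIZE, Integer.SIZE and Double.SIZE are the element widths in bits
            System.out.println(bits + "-bit vector: "
                    + laneCount(bits, Byte.SIZE) + " byte lanes, "
                    + laneCount(bits, Integer.SIZE) + " int lanes, "
                    + laneCount(bits, Double.SIZE) + " double lanes");
        }
    }
}
```

For int (32 bits), this gives 16 lanes at 512 bits, 8 at 256 bits, and 4 at 128 bits, which matches the lane counts mentioned above.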
A lane is a single slot inside a vector. Each lane holds exactly one value of the vector’s element type.
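For example, a 512-bit vector of int values (32 bits each) has 16 lanes, a 256-bit vector has 8 lanes, and a 128-bit vector has 4 lanes.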
Conceptually, a vector can be visualized like this:
Lane index:   0    1    2    3   ...   15
Values:       a0   a1   a2   a3  ...   a15
When we perform a vector operation, such as addition, the operation is applied lane by lane:
  [ a0  a1  a2  ...  a15 ]
+ [ b0  b1  b2  ...  b15 ]
--------------------------
  [ c0  c1  c2  ...  c15 ]
Each lane is independent, but all lanes are processed in parallel by the CPU. This is the essence of SIMD execution.
A vector is the actual object that holds lane values and supports vectorized operations. In Java, vectors are represented by classes such as IntVector, FloatVector, and DoubleVector.
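The typical lifecycle of a vector operation looks like this:
1. Load values from an array into a vector (fromArray)
2. Apply a lane-wise operation to the vector (for example, add)
3. Store the result back into an array (intoArray)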
Here is a small example demonstrating this flow:
package com.javahandson.jep508;

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;
import java.util.Arrays;

public class Main {

    private static final VectorSpecies<Integer> SPECIES =
            IntVector.SPECIES_PREFERRED;

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
        int[] b = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160};
        int[] c = new int[a.length];

        // Load array data into vectors
        IntVector va = IntVector.fromArray(SPECIES, a, 0);
        IntVector vb = IntVector.fromArray(SPECIES, b, 0);

        // Perform lane-wise addition
        IntVector vc = va.add(vb);

        // Store result back to array
        vc.intoArray(c, 0);

        System.out.println(Arrays.toString(c));
    }
}
Output: [11, 22, 33, 44, 55, 66, 77, 88, 99, 110, 121, 132, 143, 154, 165, 176]
In this example, we intentionally used 16 elements in the input arrays because the selected vector species on this machine is:
Species[int, 16, S_512_BIT]
This means the Vector API is working with 16 lanes, and each unmasked vector load (fromArray) expects exactly one full vector’s worth of data starting at the given index. When we call:
IntVector.fromArray(SPECIES, a, 0);
the JVM attempts to load 16 integers at once (a[0] through a[15]) into a single vector register. Since our array contains exactly 16 elements, the bounds check passes and the load succeeds.
Q. What happens if the array has fewer elements?
If the array contains fewer than 16 elements, the same code would fail at runtime with an IndexOutOfBoundsException. This is because an unmasked vector load always assumes that all lanes are valid, and the JVM cannot read past the end of the array.
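For example, with 16 lanes and an array of only 10 elements, a full load at index 0 would need a[0] through a[15], but only a[0] through a[9] exist, so the load is rejected.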
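The failing case can be modeled in plain Java, without the incubator module. The sketch below assumes that a full load at a given offset needs exactly one lane count's worth of elements inside the array, and uses the standard java.util.Objects.checkFromIndexSize to express that condition. The class FullLoadBoundsSketch and the method fullLoadFits are hypothetical names for illustration; this is not how the JDK implements the check internally.

```java
import java.util.Objects;

public class FullLoadBoundsSketch {

    // Hypothetical model: a full (unmasked) load starting at 'offset'
    // needs 'lanes' consecutive elements inside an array of 'length' elements.
    public static boolean fullLoadFits(int offset, int lanes, int length) {
        try {
            // Throws IndexOutOfBoundsException if [offset, offset + lanes)
            // is not completely inside [0, length)
            Objects.checkFromIndexSize(offset, lanes, length);
            return true;
        } catch (IndexOutOfBoundsException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        int lanes = 16;
        System.out.println("16 elements, offset 0: " + fullLoadFits(0, lanes, 16));
        System.out.println("10 elements, offset 0: " + fullLoadFits(0, lanes, 10));
        System.out.println("20 elements, offset 16: " + fullLoadFits(16, lanes, 20));
    }
}
```

Only the first call fits: the other two would need elements beyond the end of the array, which is the situation in which an unmasked fromArray throws.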
Q. What about arrays with more elements?
If the array contains more than 16 elements, this code would process only the first 16 values and ignore the remaining elements. To handle larger arrays correctly, we must iterate in vector-sized chunks and process any remaining elements separately.
Q. How do we handle variable-sized arrays?
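In real-world code, we never assume that the array size matches the lane count. Instead, we use:
loopBound() – to find how far the loop can go using only full vectors
VectorMask – to process the leftover elements (the tail) safely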
Masking allows us to disable unused lanes so that vector operations remain safe even when the array length is not a perfect multiple of the vector size. We will learn more about this in the sections below.
One important thing to understand is that vector operations always work with a fixed number of lanes, determined by the species. Even if the array contains fewer elements than the lane count, the vector size does not shrink. This is why the Vector API provides masking, which allows us to safely operate on partial vectors when dealing with small arrays or leftover elements. We will explore masking in detail in the tail-processing section.
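As a rough mental model of why the vector size does not shrink, a masked load can be imitated in plain Java. The sketch below assumes a non-negative offset and relies on the documented behavior that lanes switched off by the mask receive the default value (zero for int). The names MaskedLoadSketch and maskedLoad are hypothetical and only illustrate the idea; they are not Vector API calls.

```java
import java.util.Arrays;

public class MaskedLoadSketch {

    // Hypothetical scalar model of a masked load: the result always has
    // 'lanes' slots; a lane is filled only if its index is inside the array.
    public static int[] maskedLoad(int[] src, int offset, int lanes) {
        int[] vector = new int[lanes]; // lane count is fixed, never smaller
        for (int lane = 0; lane < lanes; lane++) {
            int index = offset + lane;
            if (index < src.length) {
                vector[lane] = src[index]; // active lane: copy the element
            }
            // inactive lane: keeps the default value 0
        }
        return vector;
    }

    public static void main(String[] args) {
        int[] small = {1, 2, 3};
        // Three active lanes, one inactive lane left at zero
        System.out.println(Arrays.toString(maskedLoad(small, 0, 4)));
    }
}
```

A masked store works the other way around: only the active lanes are written back, so the inactive lanes never touch memory outside the array.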
The separation of species and vector operations gives us two major benefits:
Predictable performance – we know exactly how many elements are processed per iteration.
Portability – the same code adapts to different CPUs automatically.
To understand why the Vector API in Java 25 is useful, we should compare it with the traditional way we process arrays in Java. In most applications, we write loops that handle one element at a time. This is called scalar computation. The Vector API allows us to process multiple elements in one step using SIMD instructions, which is known as vector computation.
In scalar code, every loop iteration processes exactly one value:
package com.javahandson.jep508;

import java.util.Arrays;

public class Main {

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6};
        int[] b = {10, 20, 30, 40, 50, 60};
        int[] c = new int[a.length];

        // Scalar: one addition per loop iteration
        for (int i = 0; i < a.length; i++) {
            c[i] = a[i] + b[i];
        }

        System.out.println(Arrays.toString(c));
    }
}
Output: [11, 22, 33, 44, 55, 66]
Here, the CPU effectively performs:
1 + 10
2 + 20
3 + 30
… and so on
Even though the loop is fast, it still performs the operation value by value.
In scalar computation, we add numbers one by one inside a loop. With the Vector API in Java 25, we can pack multiple values into a vector and perform the operation lane-wise (in parallel). Below is a simple example where we add two arrays using one vector operation.
Note: On my system, IntVector.SPECIES_PREFERRED reports Species[int, 16, S_512_BIT]. That means we have 16 lanes, so we intentionally used 16 elements in the arrays to match the lane count and keep the demo simple.
package com.javahandson.jep508;

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;
import java.util.Arrays;

public class Main {

    private static final VectorSpecies<Integer> SPECIES =
            IntVector.SPECIES_PREFERRED;

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
        int[] b = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160};
        int[] c = new int[a.length];

        // Load array data into vectors
        IntVector va = IntVector.fromArray(SPECIES, a, 0);
        IntVector vb = IntVector.fromArray(SPECIES, b, 0);

        // Perform lane-wise addition
        IntVector vc = va.add(vb);

        // Store result back to array
        vc.intoArray(c, 0);

        System.out.println(Arrays.toString(c));
    }
}
Output: [11, 22, 33, 44, 55, 66, 77, 88, 99, 110, 121, 132, 143, 154, 165, 176]
In this example, we use the Vector API in Java 25 to add two integer arrays using a single vector operation. The VectorSpecies defines the vector shape chosen by the JVM based on the CPU. On this system, the preferred species has 16 lanes, so each vector can process 16 integers in parallel.
Because the array length exactly matches the lane count, the vector computation runs only once and adds all 16 integers in a single operation. The fromArray calls load the entire arrays into vector registers, the add operation performs lane-wise addition in parallel, and intoArray stores the result back into the output array.
This example works only when the array size matches the vector width. For smaller or larger arrays, we must use looping and masked vector operations to handle the data safely.
Q. Why is vector code faster (for big arrays)?
When arrays are large (thousands or millions of elements), vector code drastically reduces the number of loop iterations:
Example (16 lanes):
Scalar: 1,000,000 iterations
Vector: 1,000,000 / 16 = 62,500 iterations (no tail here, because 1,000,000 is a multiple of 16; other lengths leave a tail of at most 15 elements)
Fewer iterations combined with SIMD execution usually mean better performance, especially for compute-heavy workloads.
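The iteration counts above are plain integer arithmetic, so they are easy to reproduce for other array sizes and lane counts. The following sketch uses hypothetical names (IterationCountSketch, fullVectorIterations, tailElements) and does not touch the Vector API; it only counts iterations and says nothing about actual run time.

```java
public class IterationCountSketch {

    // Number of loop iterations that process one full vector each
    public static long fullVectorIterations(long elements, int lanes) {
        return elements / lanes;
    }

    // Number of leftover elements that do not fill a whole vector
    public static long tailElements(long elements, int lanes) {
        return elements % lanes;
    }

    public static void main(String[] args) {
        long[] sizes = {1_000_000L, 1_000_003L};
        int[] laneCounts = {4, 8, 16};
        for (long size : sizes) {
            for (int lanes : laneCounts) {
                System.out.println(size + " elements, " + lanes + " lanes: "
                        + fullVectorIterations(size, lanes) + " full iterations, "
                        + tailElements(size, lanes) + " tail elements");
            }
        }
    }
}
```

With 16 lanes, 1,000,000 elements need 62,500 full iterations and no tail, while 1,000,003 elements need the same 62,500 iterations plus a 3-element tail.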
With the Vector API, a key idea is that the JVM executes computations in fixed-size vector chunks, not one element at a time. The size of that chunk is decided by the species (VectorSpecies). For example, if our machine reports Species[int, 16, S_512_BIT], then one IntVector holds 16 integers (16 lanes). A full (unmasked) vector load, such as IntVector.fromArray(SPECIES, a, i), always tries to read all 16 lanes starting at index i. That means the JVM must be sure a[i] through a[i+15] exists. If they don’t, Java throws an out-of-bounds exception.
This is exactly why we use loopBound(). It gives us the largest index up to which we can safely execute the full-vector loop without crossing the array boundary. After that, if there are leftover elements (the tail), we process them using a mask. A VectorMask simply tells the JVM which lanes are valid (inside the array length) and which lanes must be ignored.
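To make these two ideas concrete, here is a scalar model of what loopBound() and an index-in-range mask compute. It assumes non-negative lengths and follows the documented contracts (loopBound returns the largest multiple of the lane count that is less than or equal to the length; a lane of the mask is set when its index falls inside the array). The class LoopBoundSketch and its methods are hypothetical stand-ins written in plain Java, not the JDK implementation.

```java
import java.util.Arrays;

public class LoopBoundSketch {

    // Model of loopBound for non-negative lengths:
    // the largest multiple of 'lanes' that is <= 'length'.
    public static int loopBound(int length, int lanes) {
        return length - (length % lanes);
    }

    // Model of an index-in-range mask: lane n is set
    // when 0 <= offset + n < limit.
    public static boolean[] indexInRange(int offset, int limit, int lanes) {
        boolean[] mask = new boolean[lanes];
        for (int lane = 0; lane < lanes; lane++) {
            int index = offset + lane;
            mask[lane] = index >= 0 && index < limit;
        }
        return mask;
    }

    public static void main(String[] args) {
        int lanes = 4;
        int[] lengths = {3, 4, 6, 8};
        for (int length : lengths) {
            int bound = loopBound(length, lanes);
            // The tail starts where the full-vector loop stops
            boolean[] tailMask = indexInRange(bound, length, lanes);
            System.out.println("length " + length + ": loopBound = " + bound
                    + ", tail mask = " + Arrays.toString(tailMask));
        }
    }
}
```

A tail mask with no lanes set means there is nothing left to process after the full-vector loop.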
Below are the four common “array length vs lane count” cases, each explained with a short code example. To keep the behavior identical across machines, we will use IntVector.SPECIES_128 (always 4 lanes for int, since 128 / 32 = 4), so the behavior is easy to see. The concept is identical when your machine uses 8 or 16 lanes.
Case 1: Array length is less than lanes (example: length 3, lanes 4)
When the array is smaller than the lane count, there is no place where a full vector fits. That means loopBound() becomes 0, so the vector loop runs zero times. In this case, we must handle the entire array using a masked vector operation (tail path). Masking lets us safely operate only on the available elements.
package com.javahandson.jep508;

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;
import java.util.Arrays;

public class Main {

    private static final VectorSpecies<Integer> SPECIES =
            IntVector.SPECIES_128; // lanes = 4

    public static void main(String[] args) {
        System.out.println("SPECIES : " + SPECIES);

        int[] a = {1, 2, 3};
        int[] b = {10, 20, 30};
        int[] c = new int[a.length];

        int i = 0;
        int upperBound = SPECIES.loopBound(a.length); // 0
        System.out.println("upper bound : " + upperBound);

        // upperBound is 0, so a full-vector loop would never run and is omitted;
        // the whole array is processed as a masked tail instead
        VectorMask<Integer> mask = SPECIES.indexInRange(i, a.length);
        IntVector va = IntVector.fromArray(SPECIES, a, i, mask);
        IntVector vb = IntVector.fromArray(SPECIES, b, i, mask);
        va.add(vb).intoArray(c, i, mask);

        System.out.println(Arrays.toString(c));
    }
}
Output:
SPECIES : Species[int, 4, S_128_BIT]
upper bound : 0
[11, 22, 33]
Case 2: Array length is exactly equal to lanes (example: length 4, lanes 4)
When the array length matches the lane count exactly, we get the cleanest case: loopBound() equals the length, so the vector loop runs exactly once, and there is no tail. We can safely use unmasked loads and stores because a full vector fits perfectly.
package com.javahandson.jep508;

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;
import java.util.Arrays;

public class Main {

    private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128; // lanes = 4

    public static void main(String[] args) {
        System.out.println("SPECIES : " + SPECIES);

        int[] a = {1, 2, 3, 4};
        int[] b = {10, 20, 30, 40};
        int[] c = new int[a.length];

        int i = 0;
        int upperBound = SPECIES.loopBound(a.length); // 4
        System.out.println("upper bound : " + upperBound);

        for (; i < upperBound; i += SPECIES.length()) {
            IntVector va = IntVector.fromArray(SPECIES, a, i);
            IntVector vb = IntVector.fromArray(SPECIES, b, i);
            va.add(vb).intoArray(c, i);
        }
        // no tail because i == a.length

        System.out.println(Arrays.toString(c));
    }
}
Output:
SPECIES : Species[int, 4, S_128_BIT]
upper bound : 4
[11, 22, 33, 44]
Case 3: Array length is greater than lanes but not a multiple (example: length 6, lanes 4)
This is the most common real-world scenario. Here, loopBound() gives the boundary for full vectors. With length 6 and lanes 4, loopBound() becomes 4. That means the vector loop processes indices 0–3 as one full vector, and then indices 4–5 remain as the tail. For the tail, we must use a mask, because we cannot safely load a full 4-lane vector starting at index 4.
package com.javahandson.jep508;

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;
import java.util.Arrays;

public class Main {

    private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128; // lanes = 4

    public static void main(String[] args) {
        System.out.println("SPECIES : " + SPECIES);

        int[] a = {1, 2, 3, 4, 5, 6};
        int[] b = {10, 20, 30, 40, 50, 60};
        int[] c = new int[a.length];

        int i = 0;
        int upperBound = SPECIES.loopBound(a.length); // 4
        System.out.println("upper bound : " + upperBound);

        // full vector chunk (0..3)
        for (; i < upperBound; i += SPECIES.length()) {
            IntVector va = IntVector.fromArray(SPECIES, a, i);
            IntVector vb = IntVector.fromArray(SPECIES, b, i);
            va.add(vb).intoArray(c, i);
        }
        System.out.println("Before mask runs : " + Arrays.toString(c));

        // tail chunk (4..5) masked
        if (i < a.length) {
            VectorMask<Integer> mask = SPECIES.indexInRange(i, a.length);
            IntVector va = IntVector.fromArray(SPECIES, a, i, mask);
            IntVector vb = IntVector.fromArray(SPECIES, b, i, mask);
            va.add(vb).intoArray(c, i, mask);
        }
        System.out.println("After mask runs : " + Arrays.toString(c));
    }
}
Output:
SPECIES : Species[int, 4, S_128_BIT]
upper bound : 4
Before mask runs : [11, 22, 33, 44, 0, 0]
After mask runs : [11, 22, 33, 44, 55, 66]
Case 4: Array length is a perfect multiple of lanes (example: length 8, lanes 4)
When the length is an exact multiple of the lane count, loopBound() equals the full length, and the loop runs multiple times with full vectors. There is no tail, so we never need masking. This is the best case for performance because every iteration is a full vector operation.
package com.javahandson.jep508;

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;
import java.util.Arrays;

public class Main {

    private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128; // lanes = 4

    public static void main(String[] args) {
        System.out.println("SPECIES : " + SPECIES);

        int[] a = {1, 2, 3, 4, 5, 6, 7, 8};
        int[] b = {10, 20, 30, 40, 50, 60, 70, 80};
        int[] c = new int[a.length];

        int i = 0;
        int upperBound = SPECIES.loopBound(a.length); // 8
        System.out.println("upper bound : " + upperBound);

        int iteration = 1;
        // two full vectors: (0..3) and (4..7)
        for (; i < upperBound; i += SPECIES.length()) {
            IntVector va = IntVector.fromArray(SPECIES, a, i);
            IntVector vb = IntVector.fromArray(SPECIES, b, i);
            va.add(vb).intoArray(c, i);
            System.out.println("Iteration " + iteration++ + " Summed " + Arrays.toString(c));
        }
        // no tail because i == a.length
    }
}
Output:
SPECIES : Species[int, 4, S_128_BIT]
upper bound : 8
Iteration 1 Summed [11, 22, 33, 44, 0, 0, 0, 0]
Iteration 2 Summed [11, 22, 33, 44, 55, 66, 77, 88]
When using the Vector API, we should think in vector-sized chunks. loopBound() tells us how many elements can be safely processed using full vectors, and masking handles whatever remains. If the array is smaller than one vector, the vector loop does not run at all, and the entire work is done via a masked tail. If the array aligns perfectly with vector width, we get full-vector execution only, with no tail overhead.
1. Parallel data processing with fewer instructions – The Vector API allows a single vector instruction to operate on multiple data elements at once (for example, 8 or 16 integers in parallel). This significantly reduces the number of CPU instructions compared to scalar loops, especially when working with large arrays.
2. Predictable and explicit performance – Unlike JVM auto-vectorization, which is implicit and may or may not happen, the Vector API makes vectorization explicit. This gives us more predictable performance characteristics for compute-intensive code.
3. Best suited for large, compute-heavy workloads – The Vector API shines when processing large arrays and performing numeric computations such as mathematical calculations, image processing, signal processing, financial calculations, or data analytics. The larger the dataset, the greater the performance benefit.
4. Portable across different CPU architectures – We write the code once using SPECIES_PREFERRED, and the JVM maps it to the most efficient SIMD instructions available on the underlying hardware (AVX, NEON, etc.), without requiring platform-specific code.
5. Not ideal for small arrays or business logic – For small data sizes, I/O-bound tasks, or typical business logic, the overhead of vector setup and masking may outweigh the benefits. In such cases, simple scalar loops are often sufficient and easier to maintain.
The Vector API lives in the incubator module jdk.incubator.vector. Because of this, it is not enabled automatically, even if you are already using JDK 25. To compile and run any code that imports jdk.incubator.vector.*, we must explicitly add this module.
1. Compile and Run from Command Line
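The examples declare the package com.javahandson.jep508, so we compile into an output directory and run the class by its fully qualified name:
Compile: javac --add-modules jdk.incubator.vector -d out Main.java
Run: java --add-modules jdk.incubator.vector -cp out com.javahandson.jep508.Main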
2. If We’re Using Preview Features Too
If our project also uses Java 25 preview features (for example, other preview JEPs), include --enable-preview along with the vector module.
Compile: javac --enable-preview --release 25 --add-modules jdk.incubator.vector -d out Main.java
Run: java --enable-preview --add-modules jdk.incubator.vector -cp out com.javahandson.jep508.Main
3. IntelliJ IDEA (Common Setup)
In IntelliJ, add the module option in Run/Debug Configuration → VM options:
--add-modules jdk.incubator.vector
If we are using preview features in our project, add:
--enable-preview --add-modules jdk.incubator.vector
Important: keep a space between options (don’t write --enable-preview--add-modules).
4. Quick Verification
To verify it’s working, print the species:
System.out.println(IntVector.SPECIES_PREFERRED);
We may see output like: Species[int, 16, S_512_BIT]
This confirms the Vector API module is loaded and the JVM is selecting the best SIMD shape for our CPU.
When working with the Vector API in Java 25, understanding a few common mistakes and recommended practices can save a lot of debugging time and help you get the best performance.
Common mistakes:
1. Assuming vector code works like scalar code – Vector operations always work on a fixed number of lanes. Unmasked loads require a full vector, so using fromArray without checking array length can lead to an IndexOutOfBoundsException.
2. Forgetting loopBound() and tail handling – Skipping loopBound() and masking is a common error. Arrays whose lengths are not exact multiples of the lane count will either produce incorrect results or fail at runtime.
3. Expecting performance gains for small arrays – For small datasets, the overhead of vector setup and masking may outweigh the benefits. The Vector API is designed for large, compute-heavy workloads.
4. Hardcoding a specific vector size – Choosing a fixed species (for example, always using 256-bit vectors) reduces portability. The recommended approach is to use SPECIES_PREFERRED so the JVM can adapt to the hardware.
5. Forgetting to enable the Vector API module – Since the Vector API is in the incubator module, forgetting to add --add-modules jdk.incubator.vector will result in runtime or compilation errors.
Best practices:
1. Use SPECIES_PREFERRED for portability and performance – Let the JVM select the optimal vector size for the current CPU instead of hardcoding lane counts.
2. Follow the standard pattern: vector loop + masked tail – Always combine loopBound() for full vectors with masked operations for the remaining elements. This ensures both performance and safety.
3. Keep vector code focused and simple – Vector code is most effective when it operates on simple, tight loops with minimal branching and predictable memory access.
4. Use vectorization only where it matters – Apply the Vector API to performance-critical sections identified through profiling, not across the entire codebase.
5. Test on different data sizes and platforms – Validate behavior with small arrays, large arrays, and non-aligned lengths to ensure correctness and portability across CPUs.
The Vector API in Java 25 brings explicit SIMD programming to the Java ecosystem in a safe, portable, and predictable way. It allows us to express data-parallel computations directly in Java, enabling the JVM to map those operations to the most efficient vector instructions supported by the underlying hardware.
By understanding core concepts such as vector species, lanes, loopBound, and masked operations, we can write high-performance code that scales well for large, compute-intensive workloads. At the same time, Java’s design principles—portability, safety, and readability—remain intact, as the same code adapts automatically to different CPU architectures.
While the Vector API is not meant for everyday business logic or small datasets, it is a powerful tool for scenarios where performance truly matters. Used thoughtfully and in the right places, it helps bridge the gap between modern CPU capabilities and high-level Java programming, making Java 25 an even stronger choice for performance-critical applications.