> it's probably because you are trying to write Java in C++ or Rust

Well, sure. In principle, we know that for every Java program there exists a C++ program that performs at least as well because HotSpot is such a program (i.e. the Java program itself can be seen as a C++ program with some data as input). The question is can you match Java's performance without significantly increasing the cost of development and especially evolution in a way that makes the tradeoff worthwhile? That is quite hard to do, and gets harder and harder the bigger the program gets.

> I was not familiar with the term "object flattening", but apparently it just means storing data by value inside a struct. But data layout is exactly the thing you should be thinking about when you are trying to write performant code.

Of course, but that's why Java is getting flattened objects.

> As a first approximation, performance means taking advantage of throughput and avoiding latency, and low-level languages give you more tools for that

Only at the margins. These benefits are small and they're getting smaller. More significant performance benefits can only be had if virtually all objects in the program have very regular lifetimes - in other words, can be allocated in arenas - which is why I think it's Zig that's particularly suited to squeezing out the last drops of performance that are still left on the table.

Other than that, there's not much left to gain in performance (at least after Java gets flattened objects), which is why the use of low-level languages has been shrinking for a couple of decades now and continues to shrink. Perhaps that would change once AI agents can actually code everything, but then they might as well program in machine code.

What low-level languages really give you through better hardware control is not performance, but the ability to target very restricted environments without much memory (one of Java's greatest performance tricks is converting spare RAM into CPU savings on memory management), assuming you're willing to put in the effort. For the same reason, they're also useful for things that are supposed to sit in the background, such as kernels and drivers.

> The question is can you match Java's performance without significantly increasing the cost of development and especially evolution in a way that makes the tradeoff worthwhile?

This question is mostly about the person and their way of thinking.

If you have a system optimized for frequent memory allocations, it encourages you to think in terms of small independently allocated objects. Repeat that for a decade or two, and it shapes you as a person.

If you, on the other hand, have a system that always exposes the raw bytes underlying the abstractions, it encourages you to consider the arrays of raw data you are manipulating. Repeat that long enough, and it shapes you as a person.

There are some performance gains from the latter approach. The gains are effectively free, if the approach is natural for you and appropriate to the problem at hand. Because you are processing arrays of data instead of chasing pointers, you benefit from memory locality. And because you are storing fewer pointers and have less memory management overhead, your working set is smaller.
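As a toy illustration of the layout difference being described (class and method names are mine, purely for illustration): summing a primitive `int[]` walks one contiguous block of memory, while summing an `Integer[]` dereferences a separate heap object per element. The results are identical; only the memory-access pattern differs, and any speedup depends on heap layout and is not guaranteed.

```java
public class LocalityDemo {
    // Boxed path: each element is a pointer to a separately allocated Integer.
    static long sumBoxed(Integer[] xs) {
        long s = 0;
        for (Integer x : xs) s += x; // pointer chase + unbox per element
        return s;
    }

    // Flat path: one contiguous block of ints, cache- and prefetcher-friendly.
    static long sumFlat(int[] xs) {
        long s = 0;
        for (int x : xs) s += x;
        return s;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        Integer[] boxed = new Integer[n];
        int[] flat = new int[n];
        for (int i = 0; i < n; i++) { boxed[i] = i; flat[i] = i; }
        // Same answer either way; only the data layout differs.
        System.out.println(sumBoxed(boxed) == sumFlat(flat));
    }
}
```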

What you're saying may (sometimes) be true, but that's not why Java's performance is hard to beat, especially as programs evolve (I've been programming in C and C++ since before Java even existed).

In a low-level language, you pay a higher performance cost for a more general (abstract) construct, e.g. static vs. dynamic dispatch, or the Box/Rc/Arc progression in Rust. If a certain subroutine or object requires the more general access even once, you pay the higher price almost everywhere. In Java, the situation is the opposite: you use the more general construct, and the compiler picks an appropriate implementation per use site. E.g. dispatch is always logically dynamic, but if at a specific use site the compiler sees that the target is known, the call will be inlined (C++ compilers sometimes do that too, but not nearly to the same extent, because a JIT can perform speculative optimisations without proving they're correct); if a specific `new Integer...` doesn't escape, it will be "allocated" in a register, and if it does escape it will be allocated on the heap.
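A minimal sketch of what that looks like in code (names are mine, for illustration only): the call through `Op` below is logically a virtual dispatch, but when HotSpot observes a single receiver class, it typically devirtualizes and inlines the call speculatively, deoptimizing if a second implementation ever shows up. Whether the optimisation fires is, as noted, not guaranteed.

```java
public class DispatchDemo {
    interface Op { long apply(long x); }

    static final class Inc implements Op {
        public long apply(long x) { return x + 1; }
    }

    // The op.apply() call site is logically dynamic dispatch. With only Inc
    // loaded as a receiver, the JIT can speculate, devirtualize, and inline
    // the body into the loop - without proving no other Op can ever appear.
    static long run(Op op, long n) {
        long acc = 0;
        for (long i = 0; i < n; i++) acc += op.apply(i);
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(run(new Inc(), 1_000));
    }
}
```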

The problem with Java's approach is that optimisations aren't guaranteed, and sometimes an optimisation can be missed. But on average they work really well.

The problem with a low-level language is that over time, as the program evolves and features (and maintainers) are added, things tend to go in one direction: more generality. So over time, the low-level program's performance degrades and/or you have to rethink and rearchitect to get good performance back.

As to memory locality, there's no issue with Java's approach, only a missing feature: flattening objects into arrays. That feature is now being added (also in a general way: a class can declare that it doesn't depend on identity, and the compiler then transparently decides when to flatten it and when to box it).
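The feature being described is Project Valhalla's value classes (JEP 401). The sketch below uses the proposed preview syntax, which does not compile on any released JDK; it's an illustration of the declaration, not working code:

```java
// Proposed JEP 401 syntax - a preview-stage sketch, not runnable on released JDKs.
value class Point {          // "value": instances have no identity
    private final int x, y;  // fields of a value class are implicitly final
    Point(int x, int y) { this.x = x; this.y = y; }
    int x() { return x; }
    int y() { return y; }
}
// Given that declaration, a Point[] may be flattened into one contiguous
// block of x,y pairs, or Points may be boxed on the heap - the JVM decides
// per context, which is the "general" part of the approach described above.
```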

Anyway, this is why it's hard, even for experts, to match Java's performance without a significantly higher effort that isn't a one-time thing but persists (in fact, grows) over the software's lifetime. It can be manageable and maybe worthwhile for smaller programs, but cost, performance, or both suffer more and more as programs grow bigger over time.

From my perspective, the problem with Java's approach is memory, not computation. For example, low-level languages treat types as convenient lies you can choose to ignore at your own peril. If it's more convenient to treat your objects as arrays of bytes/integers (maybe to make certain forms of serialization faster), or the other way around (maybe for direct access to data in a memory-mapped file), you can choose to do that. Java tends to make solutions like that harder.

Java's performance may be hard to beat in the same task. But with low-level languages, you can often beat it by doing something else due to having fewer constraints and more control over the environment.

> or the other way around (maybe for direct access to data in a memory-mapped file), you can choose to do that. Java tends to make solutions like that harder.

Not so much anymore, thanks to the new FFM API (https://openjdk.org/jeps/454). The verbose code you see is all compiler intrinsics, and thanks to Java's aggressive inlining, intrinsics can be wrapped and encapsulated in a clean API (i.e. if you use an intrinsic in method `bar`, which you call from method `foo`, usually it's as if you've used the intrinsic directly in `foo`, even though the call to `bar` is virtual). So you can efficiently and safely map a data interface type to chunks of memory in a memory-mapped file.
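For instance, here's a rough sketch using the final FFM API (Java 22+; the file name, size, and method name are mine): it maps a file into memory and reads and writes it through a bounds-checked `MemorySegment`, with the mapping's lifetime tied to an `Arena`.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedDemo {
    // Writes an int at offset 0 of a memory-mapped temp file, then reads it back.
    static int roundTrip(int value) throws Exception {
        Path tmp = Files.createTempFile("ffm-demo", ".bin");
        try (Arena arena = Arena.ofConfined();
             FileChannel ch = FileChannel.open(tmp,
                     StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // map() grows the file to the requested size; the segment is
            // unmapped deterministically when the arena closes.
            MemorySegment seg = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096, arena);
            seg.set(ValueLayout.JAVA_INT, 0, value); // bounds- and lifetime-checked
            return seg.get(ValueLayout.JAVA_INT, 0);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip(42));
    }
}
```

The `set`/`get` calls are the intrinsics in question; wrapping them in accessor methods of your own data-interface type typically inlines down to the same raw memory access.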

> But with low-level languages, you can often beat it by doing something else due to having fewer constraints and more control over the environment.

You can, but it's never free, rarely cheap (and the costs are paid throughout the software's lifetime), and the gains aren't all that large (on average). The question isn't "is it possible to write something faster" but "can you get sufficient gains at a justifiable cost", and that's already hard and getting harder.