If you get the same speeds for C++ and Java, I'd point out that the C++ implementation is likely far from optimal.

Equal speeds can obviously happen for toy problems, but that result tends not to generalize.