I'm pretty sure you can get rid of the 0xFFFFFFFF / p and get some more speedup by manually implementing the bitarray ops. You can get another boost by using BSF instruction [1] to quickly scan for the next set bit. And you really only need to store odd numbers; storing the even numbers is just wasteful.
You can get even more speedup by taking into account cache effects. When you cross out all the multiples of 3 you use 512MB of bandwidth. Then when you cross out all multiples of 5 you use 512MB more. Then 512MB again when you cross out all multiples of 7. The fundamental problem is that you have many partially generated cache-sized chunks and you cycle through them in order with each prime. I'm pretty sure it's faster if you instead fully generate each chunk and then never access it again. So e.g. if your cache is 128k you create a 128k chunk and cross out multiples of 3, 5, 7, etc. for that 128k chunk. Then you do the next 128k chunk again crossing out multiples of 3, 5, 7, etc. That way you only use ~512MB of memory bandwidth in total instead of 512MB per prime number. (Actually it's only really that high for small primes, it starts becoming less once your primes get bigger than the number of bits in a cache line.)