Unfortunately, the Mersenne Twister is a huge bottleneck in our code. Meanwhile, `torch.rand` gives [different results on CPU and GPU](https://discuss.pytorch.org/t/deterministic-prng-across-cpu-cuda/116275), even with the same seed. Any chance of including the Warp Generator PRNG in torchcsprng? [This blog](http://cas.ee.ic.ac.uk/people/dt10/research/rngs-gpu-warp_generator.html) offers code and benchmarks against the Mersenne Twister.