I tried profiling the core _bincount function and then accelerating it with numba. A naive application only made it about 30% faster overall. Notebook here (and gist here in case the notebook cache hasn't updated yet).
However, I've never used numba before and haven't done much profiling in Python, so there may be something else I should have tried.
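For anyone who wants to poke at this without opening the notebook, here is a minimal sketch of what a "naive" numba application to a bincount-style loop can look like. It is not the code from the notebook: `bincount_loop` and the test data are made up for illustration, and the `try`/`except` fallback (an assumption of mine) just lets the snippet run as plain Python when numba isn't installed.

```python
import time

import numpy as np

try:
    from numba import njit  # optional dependency
except ImportError:
    def njit(func):
        # Fallback so the sketch still runs (uncompiled) without numba.
        return func


@njit
def bincount_loop(bin_indices, weights, nbins):
    # Hypothetical stand-in for a weighted bincount, written as an explicit loop.
    out = np.zeros(nbins, dtype=np.float64)
    for i in range(bin_indices.shape[0]):
        out[bin_indices[i]] += weights[i]
    return out


rng = np.random.default_rng(0)
nbins = 50
bin_indices = rng.integers(0, nbins, size=20_000)
weights = rng.random(20_000)

# Warm-up call so that JIT compilation time is not included in the timing.
bincount_loop(bin_indices, weights, nbins)

start = time.perf_counter()
result = bincount_loop(bin_indices, weights, nbins)
elapsed_loop = time.perf_counter() - start

start = time.perf_counter()
expected = np.bincount(bin_indices, weights=weights, minlength=nbins)
elapsed_numpy = time.perf_counter() - start

print(f"loop: {elapsed_loop:.6f} s, np.bincount: {elapsed_numpy:.6f} s")
```

The warm-up call matters: without it the first timed call includes compilation, which can easily hide any speedup. The timings from a toy like this say nothing about the real function, so treat it only as a template for the measurement.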
I also profiled to see whether the block_size argument has much effect.
I also noticed that #49 introduced a regression where the block_size argument is no longer passed down to _bincount, so it doesn't actually do anything. Fixed in #62.
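As a rough illustration of the kind of check involved, the sketch below times a blocked bincount at a few block sizes and confirms the counts don't change. Both functions are hypothetical stand-ins, not the library's actual code; the only point is that the wrapper forwards `block_size` explicitly to the inner function.

```python
import time

import numpy as np


def _bincount_blocked(bin_indices, nbins, block_size):
    # Hypothetical stand-in for the inner function: count one chunk at a time.
    out = np.zeros(nbins, dtype=np.int64)
    for start in range(0, bin_indices.size, block_size):
        chunk = bin_indices[start:start + block_size]
        out += np.bincount(chunk, minlength=nbins)
    return out


def histogram_counts(data, bin_edges, block_size=None):
    # Hypothetical wrapper. The important part is that block_size is
    # forwarded to the inner function instead of being silently dropped.
    bin_indices = np.digitize(data, bin_edges)
    nbins = len(bin_edges) + 1  # includes the two out-of-range bins
    if block_size is None:
        block_size = max(bin_indices.size, 1)
    return _bincount_blocked(bin_indices, nbins, block_size=block_size)


rng = np.random.default_rng(0)
data = rng.normal(size=200_000)
bin_edges = np.linspace(-4, 4, 41)

reference = histogram_counts(data, bin_edges)

timings = {}
results = {}
for block_size in (1_000, 10_000, 100_000):
    start = time.perf_counter()
    results[block_size] = histogram_counts(data, bin_edges, block_size=block_size)
    timings[block_size] = time.perf_counter() - start

for block_size, elapsed in timings.items():
    print(f"block_size={block_size}: {elapsed:.6f} s")
```

If the timings come out essentially identical across very different block sizes, that can be a hint that the argument never reaches the inner function, although on small inputs the difference may simply be lost in noise.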
@rabernat @gjoseph92 @jrbourbeau