Add checkbounds for gather and support empty source array #51
| @mcabbott would you please review this? | 
Update src/gather.jl Co-authored-by: Peter <[email protected]>
I don't think I have the permission, so you would need someone else to review
| @CarloLucibello would you please review this? | 
| We can make the bounds checking much faster. In the simplest case, with integer indices:
a = axes(src, ndims(src))
checkindex(Bool, a, collect(extrema(idx))) | 
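The idea in the comment above can be tried on the CPU with plain arrays. This is a minimal sketch, assuming integer indices into the last axis of `src`; the helper name `in_bounds_by_extrema` is hypothetical and not part of NNlib.

```julia
# Hypothetical helper, not part of NNlib: checks integer indices against the
# last axis of `src` by looking only at the smallest and largest index.
function in_bounds_by_extrema(src::AbstractArray, idx::AbstractArray{<:Integer})
    isempty(idx) && return true            # nothing to check
    lo, hi = extrema(idx)                  # one reduction over idx
    # A non-empty range is in bounds when both of its endpoints are.
    return checkindex(Bool, axes(src, ndims(src)), lo:hi)
end

src = Float32[3, 4, 5, 6, 7]
idx = [1 2 3 4;
       4 2 1 3;
       3 5 5 3]

in_bounds_by_extrema(src, idx)       # all indices lie in 1:5
in_bounds_by_extrema(src, [0, 2])    # 0 is below the first valid index
```

The check costs a single `extrema` reduction, but it reports only whether any index is out of bounds, not which one.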
| I just tried it. It seems that  | 
| I think there won't be much difference in speed since they are all calling the  | 
| We can check just the  | 
| I think @chengchingwen is right. They all pass through the same GPU kernel, so computing over an array costs about the same time as computing a single value. Since 
using CUDA
using BenchmarkTools
T = Float32
CT = CuArray{Float32}
src = CT([3, 4, 5, 6, 7])
idx = cu([1 2 3 4;
            4 2 1 3;
            3 5 5 3])
function checkbounds_src(src, dims::Union{Int, Val}, ::Type{<:Any})
    return i -> checkbounds(Bool, src, ntuple(x -> Colon(), dims)..., i...)
end
function checkbounds_src(src, dims::Union{Int, Val}, ::Type{<:CartesianIndex})
    return i -> checkbounds(Bool, src, ntuple(x -> Colon(), dims)..., i)
end
function checkbounds1(src, idx, dims)
    return map(checkbounds_src(src, Val(dims), eltype(idx)), idx)
end
function checkbounds2(src, idx, dims)
    a = axes(src, ndims(src))
    return checkindex(Bool, a, minimum(idx):maximum(idx))
end
checkbounds1(src, idx, 1)
checkbounds2(src, idx, 1)
julia> @benchmark CUDA.@sync checkbounds1($src, $idx, 1)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.406 μs … 73.799 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     11.080 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.074 μs ±  3.770 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
  ▇█▆▅▆▅▄▃▂▁                                                  ▂
  █████████████▇▇▇▇▆▇▆▅▄▄▄▄▄▃▄▄▂▃▅▄▄▂▄▄▆▅▄▅▅▄▅▅▄▄▄▅▂▄▄▅▅▄▅▄▃▃ █
  10.4 μs      Histogram: log(frequency) by time      32.7 μs <
 Memory estimate: 3.12 KiB, allocs estimate: 56.
julia> @benchmark CUDA.@sync checkbounds2($src, $idx, 1)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  31.644 μs …  39.208 ms  ┊ GC (min … max): 0.00% … 44.45%
 Time  (median):     33.708 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   40.350 μs ± 391.981 μs  ┊ GC (mean ± σ):  4.32% ±  0.44%
  ▅██▇▆▄▂▁          ▁▁▁▁                                       ▂
  █████████████▇█▇▇▇█████▇▇▇▇▇▇▆▅▆▅▆▅▇▆▆▆▆▇▆▆▆▆▅▅▅▄▄▅▅▄▄▅▄▅▅▅▄ █
  31.6 μs       Histogram: log(frequency) by time      79.8 μs <
 Memory estimate: 4.88 KiB, allocs estimate: 88.
The original approach ( | 
| Any updates? | 
| Wait for review. | 
| @yuehhua can you benchmark this PR vs master for a few input sizes? Just to make sure that boundschecking doesn't take more than the real computation | 
| Benchmark code: This PR: master branch: It seems to be around 5 times slower than the master branch. | 
| Mhmh. That is a very small size, though. Can you check with 10x or 100x bigger arrays? | 
| PR: master branch:  | 
| It seems to take too much GC time. Maybe the closure causes this. | 
| is  | 
        
          
src/gather.jl (Outdated)
# check bounds
in_bnd = map(checkbounds_src(src, Val(dims), eltype(idx)), idx)
if !all(in_bnd)
This line accounts for a large share of the slowdown.
| I have done some tests, the closure here is fine. It won't help to use  | 
| The closure is resolved and benchmarked below: @MilkshakeForReal  It is still 4 times slower. So we could come up with a more efficient CUDA kernel, find another efficient way to check bounds, or even skip bounds checks. | 
| For the last version,  commenting out  | 
| Drop   | 
| @MilkshakeForReal For the last version,  | 
| MRE:
function NNlib.gather!(dst::AnyCuArray, src::AnyCuArray, idx::AnyCuArray)
    # check dims
    dims = gather_check_dims(src, dst, idx)
    dims_size = size(src)[1:dims]
    max_dims_idx = prod(dims_size)
    max_idx = max_dims_idx * length(idx)
    # check bounds
    idx_bounds = size(src, ndims(src))#[dims+1:end]
    in_bnd = map(i -> i <= idx_bounds, idx)
    isempty(src) && return dst
    # cuda kernel
    args = dst, src, idx, max_idx, max_dims_idx, dims_size
    kernel = @cuda launch=false gather_kernel!(args...)
    config = launch_configuration(kernel.fun; max_threads=256)
    threads = min(max_idx, config.threads)
    blocks = cld(max_idx, threads)
    kernel(args...; threads=threads, blocks=blocks)
    return dst
end
julia> @benchmark CUDA.@sync NNlib.gather($src, $idx)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.900 μs …  2.720 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     21.700 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.564 μs ± 71.784 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
   ▆██▇▆▅▄▃▃▃▂▂▁▁▁▁▁▁▂▁▁▁                                     ▂
  ███████████████████████████▇▇██▇▆▇▇▇▇▆▆▇▇▆▆▆▆▅▅▆▄▄▄▄▅▃▅▂▃▄▃ █
  18.9 μs      Histogram: log(frequency) by time      60.7 μs <
 Memory estimate: 1.33 KiB, allocs estimate: 31.
With 
julia> @benchmark CUDA.@sync NNlib.gather($src, $idx)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  55.200 μs …   2.510 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     69.300 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   79.856 μs ± 102.221 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
    ▂█▇▃▂▁
  ▁▃██████▆▅▆▄▅▇▇▆▄▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  55.2 μs         Histogram: frequency by time          174 μs <
 Memory estimate: 3.72 KiB, allocs estimate: 72.
Without any bounds checking:  | 
| 
 I believe you can achieve an almost identical speedup by writing in_bnd = mapreduce(checkbounds_src(src, Val(dims), eltype(idx)), &, idx) in the last version. Most of the speedup does not come from resolving the closure. | 
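The difference between the two forms discussed here can be sketched on the CPU. This is a minimal illustration, assuming integer indices into the last axis; `check_two_pass` and `check_fused` are hypothetical names, and the snippet makes no claim about GPU timings.

```julia
# Two-pass form: `map` materialises a Bool array, then `all` reduces it.
function check_two_pass(src::AbstractArray, idx::AbstractArray{<:Integer})
    last_axis = axes(src, ndims(src))
    in_bnd = map(i -> checkindex(Bool, last_axis, i), idx)
    return all(in_bnd)
end

# Fused form: a single `mapreduce` with `&`, no intermediate array.
# `init = true` is a defensive choice so that an empty `idx` returns true
# instead of raising an error about reducing over an empty collection.
function check_fused(src::AbstractArray, idx::AbstractArray{<:Integer})
    last_axis = axes(src, ndims(src))
    return mapreduce(i -> checkindex(Bool, last_axis, i), &, idx; init = true)
end

src = rand(Float32, 2, 5)
idx = [1 2 3;
       5 4 1]

check_two_pass(src, idx) == check_fused(src, idx)   # both forms agree
```

Both functions answer the same question; the fused form simply avoids allocating the array of per-index results.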
| This is my implementation based on yours. It seems to work fine. I still don't know why we need to dispatch 
function _checkbounds_indices(i::Tuple, idx_bounds::Tuple)
    return Base.checkbounds_indices(Bool, idx_bounds, i) 
end
function _checkbounds_indices(i::CartesianIndex, idx_bounds::Tuple)
    return Base.checkbounds_indices(Bool, idx_bounds, Tuple(i))
end
function _checkbounds_indices(i::Int, idx_bounds::Tuple)
    return Base.checkbounds_indices(Bool, idx_bounds, (i,))
end
function NNlib.gather!(dst::AnyCuArray, src::AnyCuArray, idx::AnyCuArray)
    # check dims
    dims = gather_check_dims(src, dst, idx)
    dims_size = size(src)[1:dims]
    max_dims_idx = prod(dims_size)
    max_idx = max_dims_idx * length(idx)
    # check bounds
    idx_bounds = axes(src)[dims+1:end]
    in_bnd = mapreduce(Base.Fix2(_checkbounds_indices,idx_bounds), &, idx)
    if !in_bnd
        #whatever is here, we don't need to care about the speed when something is wrong.
    end
    isempty(src) && return dst
    # cuda kernel
    args = dst, src, idx, max_idx, max_dims_idx, dims_size
    kernel = @cuda launch=false gather_kernel!(args...)
    config = launch_configuration(kernel.fun; max_threads=256)
    threads = min(max_idx, config.threads)
    blocks = cld(max_idx, threads)
    kernel(args...; threads=threads, blocks=blocks)
    return dst
end | 
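The three index kinds handled above (integer, tuple, `CartesianIndex`) can also be illustrated on the CPU with the public `checkbounds(Bool, A, I...)` method instead of `Base.checkbounds_indices`. This is a sketch under that assumption; `_as_tuple` and `index_in_bounds` are hypothetical names, and `dims` is taken to be the number of leading dimensions that are gathered whole.

```julia
# Hypothetical helpers: normalise every supported index kind to a tuple.
_as_tuple(i::Integer) = (i,)
_as_tuple(i::Tuple) = i
_as_tuple(i::CartesianIndex) = Tuple(i)

# Check one index against the trailing dimensions of `src`;
# the first `dims` dimensions are covered by colons.
function index_in_bounds(src::AbstractArray, dims::Integer, i)
    colons = ntuple(_ -> Colon(), dims)
    return checkbounds(Bool, src, colons..., _as_tuple(i)...)
end

src = rand(Float32, 2, 3, 4)

index_in_bounds(src, 1, (3, 4))                 # inside the 3×4 trailing dims
index_in_bounds(src, 1, CartesianIndex(3, 5))   # 5 exceeds the last dimension
index_in_bounds(src, 2, 4)                      # plain integer into the last dim
```

Normalising to a tuple first keeps a single bounds-checking code path, at the cost of one extra small method per index kind.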
| 
 The size of the improvement by removing  | 
| 
 We still have to raise a bounds-check error to users; otherwise there is no point in doing the bounds check. You could have other ways to replace  | 
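One way to combine a cheap check with a user-facing error is to keep the fast path as a single reduction and only search for the offending index once the check has failed. This is a CPU sketch; the function name and the choice of `ArgumentError` are assumptions, not what the PR does.

```julia
# Hypothetical helper: fast reduction on the success path, slow search only on failure.
function assert_indices_in_bounds(src::AbstractArray, idx::AbstractArray{<:Integer})
    last_axis = axes(src, ndims(src))
    ok = mapreduce(i -> checkindex(Bool, last_axis, i), &, idx; init = true)
    if !ok
        # Speed no longer matters here, so locate the first bad index for the message.
        bad = findfirst(i -> !checkindex(Bool, last_axis, i), idx)
        throw(ArgumentError("index $(idx[bad]) at position $(bad) is outside $(last_axis)"))
    end
    return nothing
end

src = Float32[3, 4, 5, 6, 7]
assert_indices_in_bounds(src, [1 2; 5 4])   # passes silently
```

Calling it with an index such as `9` raises the error and names the first out-of-range entry.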
| It's just to demonstrate we don't really need to avoid closure in this case. I just tested the latest version and the performance was almost identical to mine, with closure or not. Both of them use  | 
| @MilkshakeForReal avoiding the closure is meant to reduce GC time and memory allocation. | 
| Not much improvement for the latest proposal.  | 
| Do you mean my proposal? There isn't any improvement. Just similar performance. | 
| @MilkshakeForReal We still need   | 
| Where do we have this issue? In my code it's already reduced. The closure is also avoided if we don't want it. | 
| 
 I don't know much about the gc time, just speaking from my test results | 
| Could we isolate the support for empty arrays and leave bounds checking to further discussion? | 
Closes FluxML/NNlib.jl#416, FluxML/NNlib.jl#411