Add vectorization in elementwise_util #9432

Draft · wants to merge 7 commits into base: gh/swolchok/385/head
Conversation


@swolchok swolchok commented Mar 20, 2025

This is a first cut at #9241. In this PR I've vectorized op_mul to make sure that vectorization doesn't break tests; a follow-up PR will make all existing portable ops vectorization-capable. I've left covering ops that use the unary_ufunc_* utilities in pattern.h for a follow-up, because pattern.h and elementwise_util need some work before we can migrate pattern.h's utilities to be backed by elementwise_util.
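For illustration, a minimal sketch of the mechanism (simplified names; the real entry points live in elementwise_util.h): the op author writes one generic lambda, and the utility instantiates it with at::vec::Vectorized arguments when the callable supports them, falling back to scalars otherwise.

    #include <ATen/cpu/vec/vec.h>
    #include <cstddef>
    #include <type_traits>

    // Hypothetical, simplified sketch -- not the actual elementwise_util API.
    template <typename CTYPE, typename Op>
    void apply_elementwise(const Op& op, const CTYPE* a, const CTYPE* b, CTYPE* out, size_t n) {
      using Vec = at::vec::Vectorized<CTYPE>;
      size_t i = 0;
      // If the lambda is invocable with Vectorized arguments, use SIMD for the bulk...
      if constexpr (std::is_invocable_v<Op, Vec, Vec>) {
        for (; i + Vec::size() <= n; i += Vec::size()) {
          op(Vec::loadu(a + i), Vec::loadu(b + i)).store(out + i);
        }
      }
      // ...and handle the remainder (or everything) with scalars.
      for (; i < n; ++i) {
        out[i] = op(a[i], b[i]);
      }
    }

    // The op_mul lambda works unchanged in both modes:
    //   [](auto x, auto y) { return x * y; }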


pytorch-bot bot commented Mar 20, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9432

Note: Links to docs will display an error until the docs builds have been completed.

❌ 70 New Failures

As of commit 0beabbb with merge base 1572381:

NEW FAILURES - The following jobs have failed. The 70 failures reduce to a handful of distinct compile errors:

  • /pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double'). Hit by Build documentation / build (buck2), Lint / lintrunner, and the majority of the pull jobs (test-custom-ops-linux, most test-llama-runner-linux and test-models-linux configs, test-moshi-linux, test-pybind-build-linux, test-quantized-aot-lib-linux, test-static-llama-qnn-linux, unittest / linux, among others). The macOS jobs (unittest / macos, unittest-editable / macos) report the same error with 'CTYPE_COMPUTE' (aka 'double') as the scalar operand.
  • /pytorch/executorch/kernels/portable/cpu/op_sub.cpp:120:24: the same error for Vectorized<double>. Hit by test-eval_llama-mmlu-linux, test-eval_llama-wikitext-linux, test-llava-runner-linux, and test-phi-3-mini-runner-linux.
  • /pytorch/executorch/kernels/portable/cpu/op_mul.cpp:102:50: the same error for 'const at::vec::Vectorized<bool>' and 'const CTYPE_COMPUTE' (aka 'const bool'). Hit by test-llama_runner_eager-linux.
  • /pytorch/executorch/kernels/portable/cpu/op_div.cpp:198:50: the same error for 'const at::vec::Vectorized<float>' and 'const CTYPE_COMPUTE' (aka 'const float'). Hit by test-models-linux (emformer_join, ic4, llama3_2_vision_encoder, phi_4_mini, w2l).
  • /pytorch/executorch/kernels/portable/cpu/op_addmm.cpp:96:26: the same error for 'const at::vec::Vectorized<c10::BFloat16>' and 'const CTYPE' (aka 'const c10::BFloat16'). Hit by pull / android / build-llm-demo, the bf16 test-llama-runner-linux config, test-llama-runner-qnn-linux (fp32, qnn_8a8w), several test-models-linux and test-models-linux-basic configs, test-selective-build-linux, unittest-arm, unittest-buck (linux and macos), and unittest-editable / linux, among others.
  • The gcc builds (the linux.arm64.2xlarge gcc11 jobs, plus test-openvino-linux and test-setup-linux-gcc on gcc9) fail in /usr/include/c++/*/bits/std_function.h instead: error: 'std::function<_Res(_ArgTypes ...)>::function(_Functor&&)', instantiated with the local lambda type produced by torch::executor::native::utils::internal::dtype_specialized_elementwise_fn_impl (for addmm_out and add_scalar_out), is 'used but never defined [-fpermissive]'.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
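For context, the dominant error above is a binary expression mixing a Vectorized operand with a bare scalar of its element type. A minimal illustration of the pattern, assuming no Vectorized-with-scalar operator overload is in scope (which is what these builds report), and the usual workaround of broadcasting the scalar:

    #include <ATen/cpu/vec/vec.h>

    at::vec::Vectorized<double> scale(at::vec::Vectorized<double> vec, double alpha) {
      // With no operator*(Vectorized<double>, double) visible, this fails with
      // "invalid operands to binary expression
      //  ('const at::vec::Vectorized<double>' and 'double')":
      //   return vec * alpha;
      // Broadcasting the scalar into a full vector sidesteps the missing overload:
      return vec * at::vec::Vectorized<double>(alpha);
    }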

swolchok added a commit that referenced this pull request Mar 20, 2025
this works with op_mul, which is vectorized-friendly, but doesn't work
when we roll out to pattern.h because those ops will not work with
Vectorized yet. See TODO in elementwise_util.h

ghstack-source-id: 30d2311bed080c3a5390ab00ca20a1e33563f077
ghstack-comment-id: 2738665976
Pull Request resolved: #9432
@facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Mar 20, 2025
@swolchok swolchok marked this pull request as draft March 20, 2025 02:48
swolchok added a commit that referenced this pull request Mar 20, 2025
this works with op_mul, which is vectorized-friendly, but doesn't work
when we roll out to pattern.h because those ops will not work with
Vectorized yet. See TODO in elementwise_util.h

ghstack-source-id: d546d5d595929e84814aa38833c8a07bf3cf6ec5
ghstack-comment-id: 2738665976
Pull Request resolved: #9432
[ghstack-poisoned]
@swolchok swolchok changed the title Add vectorization in elementwise_util (not working yet) Add vectorization in elementwise_util Mar 26, 2025
[ghstack-poisoned]
@swolchok
Contributor Author

This is in draft status because:

  • we need to see the PR that migrates ops to vectorization
  • we need to see the size impact of that PR and confirm it's acceptable

I have confirmed from local measurements with op_mul that this does in fact cause vectorization and it is a significant perf win (matches optimized op_mul for the treat-as-1d case, not too surprisingly).

@kimishpatel
Contributor

I can't find the PR where we were discussing using the Vectorized class for scalar compute, but if we fix the issue in vec_base.h that you highlighted (that is, https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec_base.h#L157, where size() > 1), would that make a better solution? Then we could just use the Vectorized class everywhere.

@swolchok
Contributor Author

we can just use Vectorized class everywhere.

We can't use Vectorized everywhere unless we want to commit to keeping the ExecuTorch copy/paste of at::vec::Vectorized in sync and up to date. I would very much like to delete it instead.

It is also probably going to end up being worse code to use it everywhere, though I admit that the cost of the machinery that lets us avoid writing Vectorized everywhere may end up being too high.

@kimishpatel
Contributor

We can't use Vectorized everywhere unless we want to commit to keeping the ExecuTorch copy/paste of at::vec::Vectorized in sync and up to date. I would very much like to delete it instead.

I'm not suggesting copy/pasting it everywhere, but I am not sure what you mean.

It is also probably going to end up being worse code to use it everywhere

Worse from a readability perspective?

@swolchok
Contributor Author

swolchok commented Apr 1, 2025

I'm not suggesting copy/pasting it everywhere, but I am not sure what you mean.

https://github.com/pytorch/executorch/tree/main/kernels/optimized/vec is a (partial, out-of-sync) copy/paste of https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/cpu/vec . I would like to delete the ExecuTorch version, not further cement its use.

Worse from a readability perspective?

Yes, and from being able to cleanly control when vectorization is used. It's not necessary to pollute everything with Vectorized; this diff shows how to control vectorization centrally, and I am working on making the rollout to operators clean as well.

@kimishpatel
Contributor

from being able to cleanly control when vectorization is used

Where do you expect we would have to say we don't want to vectorize something?

@swolchok
Contributor Author

swolchok commented Apr 2, 2025

Where do you expect we would have to say we don't want to vectorize something?

We should not bother generating vectorized loops, at least when at::vec::Vectorized doesn't have accelerated support for the architecture we are targeting (I need to update this PR). It is also something we may want to let the user disable to save on binary size, though I think this is a weaker reason.
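A sketch of what such a guard could look like (the ET_USE_VECTORIZED_LOOPS knob here is an illustrative assumption, not ExecuTorch's actual configuration surface; the CPU_CAPABILITY_* macros are ATen's):

    // Illustrative only: gate the vectorized path on an opt-out knob and on
    // the target actually having an accelerated at::vec backend.
    #if defined(ET_USE_VECTORIZED_LOOPS) && \
        (defined(CPU_CAPABILITY_AVX2) || defined(CPU_CAPABILITY_AVX512) || \
         defined(__aarch64__))
    inline constexpr bool kCanUseVectorized = true;
    #else
    inline constexpr bool kCanUseVectorized = false;  // emit scalar loops only
    #endif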

[ghstack-poisoned]
swolchok added a commit that referenced this pull request Apr 2, 2025
this works with op_mul, which is vectorized-friendly, but doesn't work
when we roll out to pattern.h because those ops will not work with
Vectorized yet. See TODO in elementwise_util.h

ghstack-source-id: 8d76653f819dc58a0c93540f3d71a89bfdb7cd26
ghstack-comment-id: 2738665976
Pull Request resolved: #9432
@swolchok
Contributor Author

swolchok commented Apr 2, 2025

This PR is not yet complete (I need to go through and make sure no ops are needlessly held back from vectorization, and I need to overload unary - on at::vec::Vectorized so that op_sigmoid.cpp can vectorize nicely), but it builds and should vectorize a large chunk of elementwise portable ops. Hopefully this makes the direction/vision clearer, @kimishpatel.

(CI failures are expected; the builds are contingent on pytorch/pytorch#150380 which is not yet actually merged into PyTorch core. We'll need a pin bump.)

Comment on lines +50 to 52
[](const CTYPE_COMPUTE val_a, const CTYPE_COMPUTE val_b, const CTYPE_COMPUTE val_c) {
return val_c ? val_a : val_b;
},
Contributor Author

(ATen, and therefore our own optimized op_where, doesn't vectorize this)
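For reference, vectorizing this would require spelling the select explicitly rather than using ?:, which only accepts a scalar condition. A sketch against at::vec's blendv primitive (where blendv(a, b, mask) yields b in lanes whose mask bits are set):

    #include <ATen/cpu/vec/vec.h>

    // Scalar form: val_c ? val_a : val_b. The Vectorized form needs an
    // explicit per-lane select; blendv(a, b, mask) picks b where the mask
    // lanes are set, so the argument order is swapped relative to ?:.
    template <typename T>
    at::vec::Vectorized<T> where_vec(at::vec::Vectorized<T> val_a,
                                     at::vec::Vectorized<T> val_b,
                                     at::vec::Vectorized<T> mask) {
      return at::vec::Vectorized<T>::blendv(val_b, val_a, mask);
    }

This is also why the scalar lambda can't simply be reused here, consistent with pinning the lambda arguments to CTYPE_COMPUTE in this diff.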

@kimishpatel
Contributor

Where do you expect we would have to say we don't want to vectorize something?

We should not bother generating vectorized loops, at least when at::vec::Vectorized doesn't have accelerated support for the architecture we are targeting (I need to update this PR). It is also something we may want to let the user disable to save on binary size, though I think this is a weaker reason.

I am not fully sold, but I haven't looked at the PR in detail, so I will comment further after that. Maybe you are right that not everything will fit the Vectorized pattern, and for those cases your pattern of checking whether the callable accepts Vectorized is probably a better way to do it.

swolchok added a commit that referenced this pull request Apr 2, 2025
this works with op_mul, which is vectorized-friendly, but doesn't work
when we roll out to pattern.h because those ops will not work with
Vectorized yet. See TODO in elementwise_util.h

ghstack-source-id: 8d76653f819dc58a0c93540f3d71a89bfdb7cd26
ghstack-comment-id: 2738665976
Pull Request resolved: #9432
[ghstack-poisoned]
swolchok added a commit that referenced this pull request Apr 2, 2025
this works with op_mul, which is vectorized-friendly, but doesn't work
when we roll out to pattern.h because those ops will not work with
Vectorized yet. See TODO in elementwise_util.h

ghstack-source-id: 033b63ce3bee8a0136efdab3e03905cafb79b915
ghstack-comment-id: 2738665976
Pull Request resolved: #9432
@kimishpatel kimishpatel left a comment

Overall this makes sense, but I left some questions around why we need the can_use_vectorized-based approach.

template <typename T> \
auto func_name(at::vec::Vectorized<T> vec) { \
  if constexpr (!::executorch::runtime::is_floating_point<T>::value) { \
    return at::vec::convert<float>(vec).func_name(); \
Contributor

Is this a valid thing to do? That is, convert, say, an instance of at::vec::Vectorized<int8_t> to float and apply func_name? Maybe I am misunderstanding how this works.

*/
#define ET_INTERNAL_VECTORIZED_FLOAT_UNARY_FUNC(func_name) \
namespace executorch { \
inline namespace math { \
Contributor

What does marking the namespace inline do?
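For what it's worth, an inline namespace makes its members visible from the enclosing namespace as if they were declared there directly. A standalone illustration:

    namespace executorch {
    inline namespace math {
    float twice(float x) { return 2 * x; }
    }  // namespace math
    }  // namespace executorch

    int main() {
      // Both spellings name the same function, because members of an inline
      // namespace are also members of the enclosing namespace.
      float a = executorch::math::twice(1.0f);
      float b = executorch::twice(1.0f);
      return (a == b) ? 0 : 1;
    }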

Comment on lines +62 to +63
* corresponding operator is a "float op" in TensorIterator parlance
* (i.e., uses something like build_borrowing_binary_float_op()),
Contributor

I don't know if anyone reading this code would understand what this means.

Contributor

But this provides the answer to my earlier question.

@@ -47,7 +47,7 @@ Tensor& where_out(
CTYPE_COMPUTE,
op_name,
utils::SupportedTensorDtypes::SAME_AS_COMMON>(
[](const auto val_a, const auto val_b, const auto val_c) {
[](const CTYPE_COMPUTE val_a, const CTYPE_COMPUTE val_b, const CTYPE_COMPUTE val_c) {
Contributor

Is this an unrelated change?

Contributor Author

No, we can't vectorize this op.

@@ -51,6 +56,34 @@ inline int64_t scalar_to<int64_t>(const Scalar& s) {
}

namespace internal {
template <typename Ignore, typename T>
using ignore_first_yield_second = T;
Contributor

I like these names

if constexpr (std::is_invocable_v<
Op,
ignore_first_yield_second<Args, Vec>...>) {
// For bool, we will get a false positive if we rely on only the
Contributor

You mean a bool return type?


@swolchok swolchok Apr 2, 2025


No, bool CTYPE_COMPUTE.

Comment on lines +103 to +105
((inputs.first->scalar_type() ==
CppTypeToScalarType<CTYPE_COMPUTE>::value) &&
...));
Contributor

Man, this is a good reminder of all the template metaprogramming magic.
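For readers less steeped in the magic, the (... && ...) above is a C++17 fold expression over the parameter pack; a standalone equivalent:

    // C++17 unary right fold: expands to pred(a0) && pred(a1) && ... && pred(aN),
    // and is vacuously true for an empty pack.
    template <typename Pred, typename... Args>
    constexpr bool all_of_pack(Pred pred, const Args&... args) {
      return (pred(args) && ...);
    }

    static_assert(all_of_pack([](int x) { return x > 0; }, 1, 2, 3));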

...);
if (!any_is_broadcasted) {
using Vec = at::vec::Vectorized<CTYPE_COMPUTE>;
::executorch::extension::parallel_for(
Contributor

I think doing this blindly for each op is a bit risky in that not all multithreading is always better: some ops benefit from a smaller grain size while others benefit from a larger one.

Contributor Author

IIRC xnnpack blindly parallelizes absolutely everything; we're doing strictly better here

Contributor

Yeah, I am not comparing with XNNPACK. In fact, a big part of the reason we ended up leveraging the optimized op lib for some of the llama stuff was exactly that: it blindly parallelized everything, and that actually hurt perf.

Comment on lines +125 to +126
const auto vectorized_begin =
begin + (Vec::size() - begin % Vec::size()) % Vec::size();
Contributor

Feels like something that has a good chance of harboring a bug. I hope we test this enough; I doubt our test cases will exercise this code path.

Contributor

although I do see you treat the scalar leftovers at both the head and the tail separately
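To make the arithmetic concrete: with Vec::size() == 8 and begin == 13, begin % Vec::size() is 5, so vectorized_begin = 13 + (8 - 5) % 8 = 16, the next multiple of 8; when begin is already a multiple, the outer % Vec::size() wraps the adjustment to 0. A sketch of the resulting head/middle/tail structure, with hypothetical scalar_body/vector_body callbacks standing in for the real loop bodies:

    #include <algorithm>
    #include <cstddef>

    // Worked sketch only; VecSize stands in for Vec::size().
    template <size_t VecSize, typename Scalar, typename Vector>
    void split_range(size_t begin, size_t end, Scalar scalar_body, Vector vector_body) {
      // First multiple of VecSize at or after begin (e.g. 13 -> 16 for VecSize 8).
      const size_t vectorized_begin = begin + (VecSize - begin % VecSize) % VecSize;
      // Last multiple of VecSize at or before end, clamped so it never precedes
      // vectorized_begin when the range is too short to hold a full vector.
      const size_t vectorized_end = std::max(vectorized_begin, end - end % VecSize);
      for (size_t i = begin; i < std::min(vectorized_begin, end); ++i)
        scalar_body(i);  // unaligned head, one element at a time
      for (size_t i = vectorized_begin; i < vectorized_end; i += VecSize)
        vector_body(i);  // aligned middle, VecSize elements at a time
      for (size_t i = vectorized_end; i < end; ++i)
        scalar_body(i);  // leftover tail
    }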

inputs.first->sizes(), out.sizes()) &&
...);
if (!any_is_broadcasted) {
using Vec = at::vec::Vectorized<CTYPE_COMPUTE>;
Contributor

So the one point of contention for me is why we need vectorized_math.h, which is largely a trampoline to the underlying Vectorized methods. Mainly, you don't even need to use can_use_vectorized, because on non-accelerated platforms Vectorized falls back to a scalar impl even though Vec::size() != 1. Maybe you said that the generated code would be worse if we forced Vectorized, but I am not sure why. The rest makes sense.

However, the place where I can potentially see that being useful is for dtypes that Vectorized doesn't support, but for float I am not sure. So if you could clarify that, it would help.

Contributor Author

why do we need vectorized_math.h, which is largely a trampoline to the underlying Vectorized methods

Without it, you can't take the same lambda you already wrote for scalars and reuse it for Vectorized. (The change isn't zero because you have to point at executorch::math, but crucially it doesn't require separate code.)
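A minimal sketch of that trampoline idea (function names assumed, not quoted from vectorized_math.h): each executorch::math function gets a scalar overload deferring to <cmath> and a Vectorized overload deferring to the member function, so one generic lambda covers both instantiations.

    #include <ATen/cpu/vec/vec.h>
    #include <cmath>

    namespace executorch {
    inline namespace math {
    inline float exp(float x) { return std::exp(x); }  // scalar path
    inline at::vec::Vectorized<float> exp(at::vec::Vectorized<float> x) {
      return x.exp();                                  // SIMD path
    }
    }  // namespace math
    }  // namespace executorch

    // Written once, instantiable with float or Vectorized<float>. (Spelled
    // as T(0) - x because, per the discussion above, Vectorized lacked a
    // unary operator- at the time.)
    auto sigmoid_ish = [](auto x) {
      using T = decltype(x);
      return T(1) / (T(1) + executorch::math::exp(T(0) - x));
    };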

@kimishpatel
Contributor

OK, we synced offline.

So the major value prop as I understood it: you can write your lambdas without having to explicitly use Vectorized. Why might explicitly using Vectorized not be good? Because Vectorized may not have everything you need to implement your lambda. So as an op author you don't have to worry about Vectorized when writing your lambda (although you do have to use the executorch::math ops). And later, if Vectorized adds support for the executorch::math ops, you get vectorization for free without having to go back and rewrite your lambda with Vectorized. So this is nice.
