RISC-V Vector 1.0 Support: If and where to start? #772

Closed
JakeSaphhire opened this issue Sep 17, 2024 · 12 comments · Fixed by #774

Comments

@JakeSaphhire

Hello,

RISC-V's Vector extension was ratified a few years back, and vector-capable boards have recently come out, many based on the octa-core Spacemit K1/M1. I have been using one such board for a while now, with gnuradio and more, but the performance is quite lacking: when profiling with volk_profile, the generic is around an order of magnitude faster than the alternatives.

All that to say, I think RVV 1.0 has many instructions useful for volk and I am willing to help but I have no idea where to start.
If this is in the project's plans, is there a roadmap?

@jdemel
Contributor

jdemel commented Sep 18, 2024

Thanks for your interest in the topic.

We don't have a fixed roadmap. That said, we're interested in adding support for as many platforms as possible, as long as they are supportable. RISC-V checks all those boxes.

We build and test on RISC-V already. Next steps that would be great to tackle are:

  • Add another RISC-V machine with the vector extension enabled, something like the -mavx flags for x86.
  • Add checks to dynamically detect the presence of the vector extension, preferably via cpu_features.
  • Start adding kernels that use RISC-V vector intrinsics.

In particular, the infrastructure together with a first kernel for RISC-V would be great. Adding more optimized kernels should be much easier afterwards. Depending on the compiler, just having the extension and a vector machine available may already be very beneficial. That needs benchmarking, of course.
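For illustration, runtime detection can be sketched without cpu_features by reading the Linux auxiliary vector directly. This is a swapped-in technique, not the cpu_features route suggested above. It assumes that on Linux/RISC-V the `AT_HWCAP` word carries one bit per single-letter extension ('a' at bit 0 through 'z' at bit 25) and that the kernel is recent enough to report V. The function names are hypothetical.

```c
#include <stdbool.h>

#if defined(__linux__)
#include <sys/auxv.h> /* getauxval, AT_HWCAP */
#endif

/* Assumed layout of the Linux RISC-V hwcap word:
 * single-letter extension 'a'..'z' maps to bit 0..25. */
static unsigned long example_hwcap_bit(char ext)
{
    return 1UL << (ext - 'a');
}

/* Hypothetical helper: true if the running kernel reports the V extension.
 * On anything other than Linux/RISC-V this simply reports false. */
static bool example_runtime_has_rvv(void)
{
#if defined(__linux__) && defined(__riscv)
    return (getauxval(AT_HWCAP) & example_hwcap_bit('v')) != 0;
#else
    return false;
#endif
}
```

A dispatcher would call `example_runtime_has_rvv()` once at startup and fall back to the generic kernels when it returns false.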

@balister
Contributor

What compiler are you using? I'm told gcc 14 has support for riscv vector instructions.

@camel-cdr
Contributor

FYI, to avoid duplicating work: I'm starting to implement some of the kernels (starting from the alphabetically first).

I haven't worked with the project before, so I'm unfamiliar with the build structure and CI.

@camel-cdr
Contributor

Quick update, I'm now about halfway through the kernels.
Should they be optimized for smaller input sizes, <1000 elements, or does the gnuradio use case usually use very large chunks?
I'm using the overloaded v1.0 intrinsics, which are supported in gcc >=14, and clang >=18.

@jdemel
Contributor

jdemel commented Oct 9, 2024

The GR use case is probably mostly in the 1k-10k element range. Obviously, this might vary. Further, our default benchmark uses 2^17-1 elements. This is typically too large, but it is a historical artefact. Since your changes would require a rather recent compiler, I suggest guarding your contributions with #ifdef so that older compilers don't try to compile what they can't.
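A minimal sketch of such a guard follows. It assumes the compiler advertises its intrinsics version through the predefined `__riscv_v_intrinsic` macro, and it uses 12000 (v0.12) as the threshold purely as a placeholder. The threshold the actual kernels need may differ. The kernel name and the `EXAMPLE_HAVE_RVV` macro are hypothetical, not VOLK symbols.

```c
#include <stddef.h>

/* Assumption: __riscv_v_intrinsic encodes the intrinsics version as
 * major * 1000000 + minor * 1000 + revision. The 12000 threshold is a
 * placeholder; raise it if the kernels need a newer intrinsics API. */
#if defined(__riscv_v_intrinsic) && __riscv_v_intrinsic >= 12000
#include <riscv_vector.h>
#define EXAMPLE_HAVE_RVV 1
#else
#define EXAMPLE_HAVE_RVV 0
#endif

/* Hypothetical element-wise float add with an RVV path and a generic path. */
static void example_add_32f(float* out, const float* a, const float* b, size_t n)
{
#if EXAMPLE_HAVE_RVV
    /* Strip-mined loop: vsetvl decides how many elements each pass handles. */
    for (size_t vl; n > 0; n -= vl, a += vl, b += vl, out += vl) {
        vl = __riscv_vsetvl_e32m8(n);
        vfloat32m8_t va = __riscv_vle32_v_f32m8(a, vl);
        vfloat32m8_t vb = __riscv_vle32_v_f32m8(b, vl);
        __riscv_vse32(out, __riscv_vfadd(va, vb, vl), vl);
    }
#else
    /* Generic fallback: compilers without RVV intrinsics only see this. */
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
#endif
}
```

With this layout, a compiler that lacks the intrinsics never parses the vector code path, so the build does not break on older toolchains.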

@drmpeg
Member

drmpeg commented Oct 10, 2024

Good to see @camel-cdr here. The DVB-T2 transmitter in GNU Radio uses quite a few kernels with fairly large vectors. I also have "bit perfect" test files for the example flow graphs (although for floating point, you have to compare with some margin).

UPDATE: The DVB-T2 flow graph I'm considering uses pretty big vectors. 32768 * 19 = 622,592 complex elements (1,245,184 floats).

Let me know if you want to use that strategy for testing, and I'll set you up with a set of test files.

Also, there's some discussion about infrastructure in #625

DVB-T2 transmitter kernels
volk_32fc_32f_multiply_32fc
volk_32fc_x2_add_32fc
volk_32f_s32f_multiply_32f
volk_32fc_magnitude_32f
volk_32fc_s32fc_multiply2_32fc
volk_32fc_s32fc_multiply_32fc
volk_32f_x2_subtract_32f
volk_32fc_x2_multiply_conjugate_32fc
volk_32f_x2_add_32f
volk_32fc_x2_multiply_32fc

@jdemel
Contributor

jdemel commented Oct 10, 2024

Obviously, there's no one size fits all. @drmpeg these are quite large, but typical values for DVB. I hope that most kernels perform comparably well. I suppose testing short inputs, inputs that nearly fill the L1 cache, etc. makes the most sense.

@drmpeg
Member

drmpeg commented Oct 10, 2024

As it turns out, I was in error. After remembering what I implemented, the vector size is only 32768 complex elements.

@camel-cdr
Contributor

camel-cdr commented Oct 10, 2024

I've asked regarding the input size because I'm writing the kernels to maximize LMUL without causing spills.
This means most things are implicitly unrolled 8 times, and the loop is, for N!=0, always traversed once.
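To make the input-size question concrete, here is a small portable model of how many passes a strip-mined loop needs. It relies on the standard relation VLMAX = (VLEN / SEW) * LMUL and on a simplified `vsetvl` that grants min(remaining, VLMAX). Real hardware is allowed to split the last two passes differently. All function names are hypothetical, and the VLEN value in the usage note is an assumption rather than a statement about a specific board.

```c
#include <stddef.h>

/* Largest element count one vector instruction can process:
 * VLMAX = (VLEN / SEW) * LMUL, with VLEN and SEW given in bits. */
static size_t example_vlmax(size_t vlen_bits, size_t sew_bits, size_t lmul)
{
    return vlen_bits / sew_bits * lmul;
}

/* Simplified model of vsetvl: grant as many elements as fit. */
static size_t example_vsetvl(size_t remaining, size_t vlmax)
{
    return remaining < vlmax ? remaining : vlmax;
}

/* Number of passes a strip-mined kernel makes over n elements. */
static size_t example_strip_mine_iterations(size_t n, size_t vlmax)
{
    size_t iterations = 0;
    while (n > 0) {
        size_t vl = example_vsetvl(n, vlmax);
        n -= vl;
        iterations++;
    }
    return iterations;
}
```

Assuming VLEN = 256 and 32-bit elements, LMUL = 8 gives `example_vlmax(256, 32, 8) == 64` elements per pass, compared with 8 at LMUL = 1. In this model, any input of up to 64 elements is handled in a single pass and a 1000-element input takes 16 passes.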

For benchmarking I was just planning to run volk_profile, but if there is something else I can easily test I'd also be interested.

One annoyance is that the RISC-V toolchain doesn't provide a way to add single extensions with a command-line argument; you can only set -march to a fixed ISA string.
The best way I could think of for solving this is by always making sure the last arch of a machine sets all previous extensions.

Something like this:

<machine name="rv64gcv">
<archs>generic riscv64 rvv orc|</archs>
</machine>

<!--machine name="rva22v">
<archs>generic riscv64 rvv rvb rva22v orc|</archs>
</machine>

<machine name="rva23">
<archs>generic riscv64 rvv rvb rva22v rva23 orc|</archs>
</machine-->

RVA22 and RVA23 are profiles, but google/cpu_features doesn't support them or their extensions currently.
google/cpu_features's RISC-V extension parsing is fundamentally broken at the moment, but this is unlikely to affect anything with just rv64gcv. (It would parse rv64gc_xmycustomextensionwithavsomewhere as rv64gcv)
I've created a fix, but IDK how to sign the CLA, so who knows when this will be fixed: google/cpu_features#368
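The failure mode described above can be illustrated with a small sketch. This is not the cpu_features code or the proposed fix, only a hypothetical model of the difference between searching the whole ISA string for a letter and scanning just the leading run of single-letter extensions. It assumes a lowercase string starting with `rv32` or `rv64`, treats `_`, `z`, `s` and `x` as the start of multi-letter extension names, and ignores version suffixes.

```c
#include <stdbool.h>
#include <string.h>

/* Naive check: looks for the letter anywhere after the 4-character prefix. */
static bool example_naive_has_ext(const char* isa, char ext)
{
    return strlen(isa) >= 4 && strchr(isa + 4, ext) != NULL;
}

/* Stricter check: scan only the single-letter extensions that directly
 * follow "rv32"/"rv64", and stop at the first multi-letter extension. */
static bool example_has_single_letter_ext(const char* isa, char ext)
{
    if (strncmp(isa, "rv32", 4) != 0 && strncmp(isa, "rv64", 4) != 0) {
        return false;
    }
    for (const char* p = isa + 4; *p != '\0'; p++) {
        /* Simplification: these characters begin multi-letter names. */
        if (*p == '_' || *p == 'z' || *p == 's' || *p == 'x') {
            return false;
        }
        if (*p == ext) {
            return true;
        }
    }
    return false;
}
```

For `rv64gc_xmycustomextensionwithavsomewhere`, the naive check reports V as present, while the stricter check stops at the underscore and reports it as absent.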

@jdemel
Contributor

jdemel commented Oct 11, 2024

For x86, we do -mavx, -mavx2, etc. Does that work for RiscV? I know some of the compiler flags in this realm behave differently depending on the ISA.

Is rva22v strictly < rva23? I'm glad they introduced profiles. Everything else is hard to keep track of.

volk_profile is our long term tool. Another option would be google/benchmark to implement micro benchmarks.

Your machine definitions look sane to me. My gut feeling is that we need to get started with RISC-V kernels, and we may need to reorganize (or extend) our support code if we realize that our approach doesn't work long-term. At the moment, I'd like to encourage you to do what you think makes the most sense.

@michael-roe
Contributor

I’ll just add that the rva22u64 profile also includes bit-manipulation instructions, which some of the kernels might be able to use (e.g. there’s a popcount instruction).

@camel-cdr
Contributor

Yeah, I didn't want to create too many different targets, so I chose base rvv, rva22+v, and rva23, which also includes Zvbb.

I've also created a pseudo target, rvvseg, that uses segmented loads/stores when dealing with complex numbers. It is a separate target because segmented accesses aren't fast on all current hardware (C910). (The regular rvv target uses vnsrl to deinterleave the complex number components.)
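As a rough scalar model of the vnsrl-based deinterleave: each interleaved (re, im) pair of 32-bit floats can be viewed as one 64-bit element, and a narrowing right shift by 0 or by 32 bits then yields the real or the imaginary half. The sketch below mimics that per element in portable C. It assumes a little-endian layout, the function name is hypothetical, and it is not taken from the actual kernels.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of deinterleaving via a narrowing shift.
 * Assumes little-endian memory, so the real part sits in the low 32 bits. */
static void example_deinterleave_32fc(float* re,
                                      float* im,
                                      const float* interleaved,
                                      size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t wide;
        /* View one (re, im) pair as a single 64-bit element. */
        memcpy(&wide, interleaved + 2 * i, sizeof wide);
        uint32_t lo = (uint32_t)(wide >> 0);  /* narrowing shift by 0  */
        uint32_t hi = (uint32_t)(wide >> 32); /* narrowing shift by 32 */
        memcpy(&re[i], &lo, sizeof lo);
        memcpy(&im[i], &hi, sizeof hi);
    }
}
```

A vectorized version would do the same with one wide load and two narrowing shifts per pass, whereas the rvvseg variant would let a segmented load split the components directly.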

I'll try to get it ready for a PR this weekend.


6 participants