This is the Gidel and Quartus project for the Block Aggregation SpMM core. Use this project to generate the bitstream for the SpMM core.
- Quartus 21.4
- Gidel ProcWizard (If you want to modify the peripherals like HBM)
- Open Gidel project in Gidel ProcWizard
- Click "Generate" -> "Generate HDL code"
- Move the generated directory under this repo, rename it as you prefer (e.g., quartus_proj_singlecore)
A pre-generated Quartus project is provided here for convenience.
Specify the path to the SpMM core design generated by SpinalHDL in Block Aggregation SpMM core.
If you cloned this repo as a submodule of Block Aggregation SpMM core,
the path should be ../../src/generated_spmm_core_6x12x8/.
SpinalHDL will generate a list file (usually named generated_spmm_core_<core_config>tensor_core_array_wrapper.lst) that includes all generated Verilog source files.
Make sure every file listed in the lst file is added to the Quartus project file src/ic_tcorearray.qsf, for example:
set_global_assignment -name SYSTEMVERILOG_FILE ../../src/main/sverilog/out_asym_fifo.sv
set_global_assignment -name SYSTEMVERILOG_FILE ../../src/main/sverilog/blk_delay_core.sv
set_global_assignment -name VERILOG_FILE ../../src/generated_spmm_core_6x12x8/enumdefine.v
set_global_assignment -name VERILOG_FILE ../../src/generated_spmm_core_6x12x8/StreamFifoIp_252.v
set_global_assignment -name VERILOG_FILE ../../src/generated_spmm_core_6x12x8/StreamFifoIp_240.v
set_global_assignment -name VERILOG_FILE ../../src/generated_spmm_core_6x12x8/StreamFifoIp_229.v
set_global_assignment -name VERILOG_FILE ../../src/generated_spmm_core_6x12x8/StreamFifoIp_231.v
set_global_assignment -name VERILOG_FILE ../../src/generated_spmm_core_6x12x8/StreamFifoIp_49.v
set_global_assignment -name VERILOG_FILE ../../src/generated_spmm_core_6x12x8/spram_megafunc_583.v
set_global_assignment -name VERILOG_FILE ../../src/generated_spmm_core_6x12x8/MergeSortRedundancyRemoverUnit.v
set_global_assignment -name VERILOG_FILE ../../src/generated_spmm_core_6x12x8/MergeSortRedundancyRemoverUnit_6.v
set_global_assignment -name VERILOG_FILE ../../src/generated_spmm_core_6x12x8/TensorCoreChainBf12.v
set_global_assignment -name VERILOG_FILE ../../src/generated_spmm_core_6x12x8/TensorCoreChainBf12_60.v
set_global_assignment -name VERILOG_FILE ../../src/generated_spmm_core_6x12x8/StreamFifoIp_4.v
set_global_assignment -name VERILOG_FILE ../../src/generated_spmm_core_6x12x8/spram_megafunc.v
set_global_assignment -name VERILOG_FILE ../../src/generated_spmm_core_6x12x8/CasLoadBubbleInsert.v
set_global_assignment -name VERILOG_FILE ../../src/generated_spmm_core_6x12x8/AsymBufferN2One.v
set_global_assignment -name VERILOG_FILE ../../src/generated_spmm_core_6x12x8/FixedBfp12Converter.v
set_global_assignment -name VERILOG_FILE ../../src/generated_spmm_core_6x12x8/FixedBfp12Converter_1.v
set_global_assignment -name VERILOG_FILE ../../src/generated_spmm_core_6x12x8/StreamOutAsymFifo.v
set_global_assignment -name VERILOG_FILE ../../src/generated_spmm_core_6x12x8/StreamFifoIp.v
set_global_assignment -name VERILOG_FILE ../../src/generated_spmm_core_6x12x8/IndexGenerator.v
set_global_assignment -name VERILOG_FILE ../../src/generated_spmm_core_6x12x8/StreamFifoIp_2.v
set_global_assignment -name VERILOG_FILE ../../src/generated_spmm_core_6x12x8/StreamFifoIp_3.v
set_global_assignment -name VERILOG_FILE ../../src/generated_spmm_core_6x12x8/TensorCoreChainArray.v
set_global_assignment -name VERILOG_FILE ../../src/generated_spmm_core_6x12x8/tensor_core_array_wrapper.v
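Comparing the lst file against the qsf by hand is error-prone, so a small shell helper can do the check. This is a minimal sketch and not part of this repo: the function name missing_from_qsf is hypothetical, and it assumes the lst file holds one source path per line and that matching by file name alone is enough.

```shell
#!/bin/bash
# Hypothetical helper: print every file named in the lst file that has
# no matching entry in the qsf file.
# Assumption: the lst file lists one source path per line.
missing_from_qsf() {
    local lst="$1" qsf="$2" path name
    while IFS= read -r path || [ -n "$path" ]; do
        # Skip blank lines.
        [ -z "$path" ] && continue
        # Compare by file name only, since the qsf uses relative paths.
        name="$(basename "$path")"
        # Take the last field of each qsf line, strip its directory part,
        # and look for an exact file-name match.
        if ! awk -v n="$name" '
                { f = $NF; sub(/.*\//, "", f); if (f == n) found = 1 }
                END { exit !found }' "$qsf"; then
            echo "$name"
        fi
    done < "$lst"
}
```

Called as missing_from_qsf <path to lst file> src/ic_tcorearray.qsf, it prints one file name per missing entry and prints nothing when the qsf is complete.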
Then copy the qsf file and the top-level sv file to the Quartus project directory.
cp src/ic_tcorearray.qsf quartus_proj_singlecore/.
cp src/ic_tcorearray.sv quartus_proj_singlecore/.
Using quartus_sh is recommended since it supports both local compilation and sbatch submission.
To run a full compilation:
cd quartus_proj_singlecore
quartus_sh --flow compile ic_tcorearray
For Slurm-based remote compilation, a sample sbatch script is provided below:
#!/bin/bash
#
#SBATCH --job-name spmm_core
#SBATCH -p <your_partition>
#SBATCH --nodelist=<your_node>
#SBATCH --output sbatch.out
#SBATCH --error sbatch.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
# Note: running a full compilation involves IP generation,
# which requires a quartus GUI process and X server support
quartus_sh --flow compile ic_tcorearray
# if you want to skip ip_generation phase, use this instead:
# quartus_sh --flow compile ic_tcorearray -start synthesis
# if you only want to rerun P&R, use this instead:
# quartus_sh --flow compile ic_tcorearray -start fitter
A testing script, gen_onchip_file_and_runtest.sh, is provided to generate the on-chip SpMM test files and run the test, using the sparse attention values generated by the sparse attention analyzer.
A set of extracted sparse attention values in BFP format is provided to simplify the simulation process. It was extracted from chatglm2-6b-32k on LongBench's vcsum test. Download the partial BFP-format attention values from COMPAS NFS and extract them.
A copy of the extracted data is located at
/compas-old/projects/sparse-attention
in COMPAS NFS.
To run the test:
- Program FPGA using Gidel SofLoader:
SofLoader quartus_files/ic_tcorearray.sof
- Reboot the host machine.
- Modify attn_dir and onchip_test_dir in gen_onchip_file_and_runtest.sh to point to the extracted data directory:
# specify the path to the sparse attention values
attn_dir="/compas-old/projects/sparse-attention/chatglm2-6b-32k-attn-bfp20-vcsum"
# intermediate result directory should follow this format:
# attn_dir/../onchip/<modelname-taskname>
onchip_test_dir="/compas-old/projects/sparse-attention/onchip/chatglm2-6b-32k-attn-bfp20-vcsum"
- Run the test script:
cd ../sw/scripts
./gen_onchip_file_and_runtest.sh
For each head, the result is written to ${onchip_test_dir}/{inst_id}/hw_config_{head_id}.json.
The following is a sample of the result file:
{
...
"mat a size": 27456,
"mat a vec size": 208,
"mat b size": 27456,
"mat b vec size": 208,
"total_lat_counter_res": 25048, // total latency in number of clock cycles
"compute_lat_counter_res": 24212, // compute latency in number of clock cycles
"mat_b_load_counter_res": 27456 // mat b load latency in number of clock cycles
}
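To compare heads quickly, the counters can be pulled out of a result file with standard shell tools. The following is a minimal sketch: the function names get_counter and compute_share are hypothetical, and it assumes each counter sits on its own line in the form "key": value, as in the sample above, and that the total latency is non-zero.

```shell
#!/bin/bash
# Hypothetical helper: print the integer stored under a key in a result file.
# Assumption: each counter sits on its own line as  "key": value
get_counter() {
    sed -n "s/^[[:space:]]*\"$2\"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p" "$1"
}

# Hypothetical helper: print the compute latency as a percentage of the
# total latency, rounded to one decimal place.
compute_share() {
    local total compute
    total="$(get_counter "$1" total_lat_counter_res)"
    compute="$(get_counter "$1" compute_lat_counter_res)"
    awk -v t="$total" -v c="$compute" 'BEGIN { printf "%.1f\n", 100 * c / t }'
}
```

For the sample values above, compute_share reports 96.7, meaning that compute accounts for roughly 96.7% of the total cycle count. Calling it in a loop over the hw_config_{head_id}.json files of one {inst_id} directory gives a per-head view.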