Skip to content

FlatRecon: Flat Placement Reconstruction Full Legalizer #3193

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 82 commits into
base: master
Choose a base branch
from

Conversation

haydar-c
Copy link
Contributor

@haydar-c haydar-c commented Jul 8, 2025

This PR brings the FlatRecon full legalizer. It aims to reconstruct a given flat placement with minimum disturbance to the given solution. It can be used with a solution read from an '.fplace' file or with Global Placement output. However, it expects the given solution to be close to legal in both cases.

It has beed run on the MCNC, VTR Chain, Koios, and Titan benchmarks. It also been tested on the elfPlace placements generated from the aug-elfPlace on Titan benchmarks. [Results to be appended]

I have added 2 regression tests for that legalizer. The first into the vtr_reg_strong with fast checks on MCNC benchmarks and the second one into the vtr_reg_nightly_test7 with some of the Titan benchmarks.

The logged output of that legalizer looks like below for an example of the LU32PEEng.v from VTR Chain:
LU32PEEng_FlatRecon

haydar-c added 30 commits March 28, 2025 16:08
This works well for the reconstruction of provided
flat placement. However, if used directly after GP
output, if we have too many orphan molecules the
runtime goes really high.
The legalizations strategy of cluster_legalizer is set to
SKIP_LB_ROUTING for the first passs. After first pass, each
checked and illegal ones destroyed and their molecules are
passed to second pass.

The legalization strategy is set to FULL after first pass
in that commit.

Be careful with the place where you compress the legalizer
and extract the atom lookup for placement.
Before that, the strategy was converted to full after the
first reconstruction pass. After that, I was doing a neighbour
search with full strategy. Now, I am doing neighbour search
with SKIP_LB_ROUTING as well. After that, it requires last
neighbour pass with FULL strategy.

Also added ugly timer prints to be cleaned.
Changed the way of processing mainly in the first pass. In this
version, they are first sorted by external pin number and then
grouped into tiles. Then processed similar to naive by tiles but
this one still checks the compatibility and capacity. Cleaning
after each tile clusters created in that pass. Then, checking
each cluster legality after they are created in the first
neighbour pass. Second is already done with FULL strategy. The
memory usage is decreased by nearly by %35. We lost runtime but it
seems near 1-2%.

TODO for that one: I added an if in neighbour pass that checks if
the molecule is already clustered before checking its compatibility.
This works now but ideally my algorithm should not try to add an
already clustered molecule to another one. When removed, it failed
on vtr_chain_largest for only LU32PEEng.v.
This is the version that results are presented in 2 May
Friday meeting. The memory improved version is presented with master
merged. We have things to try on top of that version.

Handling the illegal clusters right away. Also added a 0.5f buffer
into fixed blocks in partial placement verification to pass the
assert. This should be handled more explicitely later.
In this version, after doing my packing for reconstruction, I am
calling the initial_placement for ap flow. The results seems
promising for this version and being added to the presentation.
Especially reduces the max displacement as expected.
The displacement for each atom was being calculated as distance
to the head of the tile. Now it considers the offset of the tile
as well. That affects the percent of atoms displaced, max atom
displacemetn and average atom displacement.

Did no touched the cluster error for now. This can also be
rewritten. Instead of being calculated solely on cluster location
and centroid, we can try to use cluster information.
first pass if any failing cluster occurs

In the first pass of reconstruction, creating clusters at a tile
and checking whether they are legal or not. If illegal, try to
cluster with FULL strategy there before going to next tile.
Molecules being still unclustered passed to neighbour pass.
Prioritizing chained molecules in the reconstruction pass.
Ensuring initial placer placing the clusters created in
reconstruction pass first. Its current sorting also results
in the same ordering due to standart deviation and size ordering.
However, if used in reconstruction legalizer, I want to ensure that
clusters created in reconstruction pass processed first.
Prioritizing the long carry chains in the reconstruction pass.
@github-actions github-actions bot added VPR VPR FPGA Placement & Routing Tool libarchfpga Library for handling FPGA Architecture descriptions docs Documentation lang-cpp C/C++ code lang-python Python code labels Jul 8, 2025
@haydar-c haydar-c marked this pull request as ready for review July 8, 2025 11:37
Copy link
Contributor

@AlexandreSinger AlexandreSinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @haydar-c

Thank you so much for these changes. I have some comments on the code. Two major things that must be resolved before this can come in is a quadratic run time loop in your algorithm which can be resolved and an issue with how you are compressing the cluster legalizer.

g_vpr_ctx.mutable_atom().mutable_flat_placement_info().blk_x_pos[atom_blk_id] = g_vpr_ctx.device().grid.width() / 2.0f;
g_vpr_ctx.mutable_atom().mutable_flat_placement_info().blk_y_pos[atom_blk_id] = g_vpr_ctx.device().grid.height() / 2.0f;
g_vpr_ctx.mutable_atom().mutable_flat_placement_info().blk_layer[atom_blk_id] = 0;
g_vpr_ctx.mutable_atom().mutable_flat_placement_info().blk_sub_tile[atom_blk_id] = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mutating the global flat placement info here is odd to me. Why does it need to be updated? Can we not modify the flat placement object coming in? Or even better leave the flat placement object as it is and only use the partial placement object?

@@ -245,6 +264,15 @@ void run_analytical_placement_flow(t_vpr_setup& vpr_setup) {
// Print the device utilization
print_device_utilization(target_device_utilization);

// Write out a flat placement file at the end of Full Legalization if the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By moving this to here, you have changed the meaning of the write_flat_place option:
Screen Shot 2025-07-08 at 5 01 46 PM

I am ok with this change; in the future, if we want the post-placement flat placement, we can create another option. You should update the documentation for this option accordingly.

Since this has become AP-specific though, you should probably change this option to something like "--ap_write_post_fl_flat_place" or something to make this more clear.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems strange to not be able to write out the final placement; maybe we need another command line option to specify this.

atom_loc_layer = current_loc_layer;
atom_loc_sub_tile = current_loc_sub_tile;
found_valid_atom = true;
if (current_loc_x != -1 && current_loc_y != -1 && current_loc_layer != -1 && current_loc_sub_tile != -1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line concerns me a little. Here, -1 means that the user has left that part of the location unspecified.

For example, it is common for people to leave the layer or sub_tile unspecified instead of forcing it to 0. I think that there should be special handling for that. If not, this code will put a lot of blocks in the center of the device!

* both cases it expects the given flat placement to be close to legal.
*
* Before packing, it sorts the molecules such that long carry chain elements
* are being first, and then molecules with highest external input pin numbers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"are being first" -> "are packed first"

*
* Before packing, it sorts the molecules such that long carry chain elements
* are being first, and then molecules with highest external input pin numbers
* becoming first among long carry chain elements and the others. It then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this mean? Which is packed first, long carry chains or high external input pin? I think this sentence needs to be cleaned up.

// how reconstruction pass clusters created, so there is no need to explicitely
// prioritize these clusters. However, if it changes in time, the atoms clustered
// in neighbor pass and atoms misplaced might not match exactly.
if (g_vpr_ctx.atom().flat_placement_info().valid) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why even use the global flat placement info here? Why not just always just cast the p_placement object into the flat placement info? This would simplify the code and it would also remove your reliance on this global variable.

for (AtomBlockId atom_blk_id : atom_netlist.blocks()) {
if (!atom_blk_id.is_valid()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this check needed? I do not think the atom blocks can ever be invalid...

// Then compute the L1 distance from the block location to the projected
// position. This will be the minimum distance this point needs to move.
float dx = std::abs(proj_x - blk_x);
float dy = std::abs(proj_y - blk_y);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this change. I would call the method though instead of inlining the code:

/**
* @brief Returns the L1 distance something at the given flat location would
* need to move to be within the bounds of a tile at the given tile loc.
*/
inline float get_manhattan_distance_to_tile(const t_flat_pl_loc& src_flat_loc,
const t_physical_tile_loc& tile_loc,
const DeviceGrid& device_grid) {
// Get the bounds of the tile.
// Note: The get_tile_bb function will not work in this case since it
// subtracts 1 from the width and height.
auto tile_type = device_grid.get_physical_type(tile_loc);
float tile_xmin = tile_loc.x - device_grid.get_width_offset(tile_loc);
float tile_xmax = tile_xmin + tile_type->width;
float tile_ymin = tile_loc.y - device_grid.get_height_offset(tile_loc);
float tile_ymax = tile_ymin + tile_type->height;
// Get the closest point in the bounding box (including the edges) to
// the src_flat_loc. To do this, we project the point in L1 space.
float proj_x = std::clamp(src_flat_loc.x, tile_xmin, tile_xmax);
float proj_y = std::clamp(src_flat_loc.y, tile_ymin, tile_ymax);
// Then compute the L1 distance from the src_flat_loc to the projected
// position. This will be the minimum distance this point needs to move.
float dx = std::abs(proj_x - src_flat_loc.x);
float dy = std::abs(proj_y - src_flat_loc.y);
return dx + dy;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How big are these files? These files look massive. I do not recommend commiting them into the git repostory if they are large.

I would recommend compressing them, and then in the CI test uncompressing them. See what I did for the Z1000 architecture for example. The issue is that these files add a lot of bloat to the repository which we do not want.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this blocks you, I would recommend keeping these tests behind for now and adding them later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as we don't do enormous circuits these files shouldn't be too big. One line per primitive, so 100,000 lines for neuron; if the primitive names are long that could be 10 MB. Not 0, but should be OK as long as we don't do the biggest designs or too many. If we do those we probably have to bite the bullet and compress them as @AlexandreSinger suggests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is fine in size; but we want to keep the file sizes for these to on the order of MBs if possible.

Copy link
Contributor

@vaughnbetz vaughnbetz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just did a quick review since it looks like there will be some changes due to @AlexandreSinger 's comments.
The output messages look good, except I suggest adding something to make clear what reconstruction pass and neighbour pass mean (or change the terminology to something that's more obvious to the reader). You want to make clear that 'reconstruction pass' means it went where you said it should go, and 'neighbour pass' means it was modified (moved and/or grouped with other primitives than specified).

@@ -1286,6 +1286,12 @@ Analytical Placement is generally split into three stages:

* ``appack`` Use APPack, which takes the Packer in VPR and uses the flat atom placement to create better clusters.

* ``flat-recon`` Use the Flat Placement Reconstruction Full Legalizer which tries to reconstruct a clustered placement
that is as close to the incoming flat placement as possible. It can be used to read flat placement from an '.fplace' file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read flat -> read a flat
from an '.fplace' -> from a '.fplace'

that is as close to the incoming flat placement as possible. It can be used to read flat placement from an '.fplace' file
or with Global Placement output. In both cases, it expects the given solution to be close to legal. If used with
an '.fplace' file, each atom of a molecule should share same location information. If none of the atoms in a molecule
have a specified location in the file, the molecule will be assigned to the center of the device.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have to explain what this means. You won't really leave it in the center of the device. I suggest just saying it is legal to leave some molecules unconstrained; the reconstruction phase will choose where to place them but does not attempt to optimize those locations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need a link to the file format (which has to be described somewhere).

@@ -245,6 +264,15 @@ void run_analytical_placement_flow(t_vpr_setup& vpr_setup) {
// Print the device utilization
print_device_utilization(target_device_utilization);

// Write out a flat placement file at the end of Full Legalization if the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems strange to not be able to write out the final placement; maybe we need another command line option to specify this.

* In gen_ap_netlist_from_atoms, we add 0.5f to each atom but do not consider
* it here. Adding these buffers here as well if the flat placement provided.
* TODO: This should be handled more explicitely.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be consistent with the coding style guide, use // for implementation-related comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs Documentation lang-cpp C/C++ code lang-python Python code libarchfpga Library for handling FPGA Architecture descriptions VPR VPR FPGA Placement & Routing Tool
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants