-
Notifications
You must be signed in to change notification settings - Fork 422
FlatRecon: Flat Placement Reconstruction Full Legalizer #3193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
This works well for the reconstruction of provided flat placement. However, if used directly after GP output, if we have too many orphan molecules the runtime goes really high.
The legalizations strategy of cluster_legalizer is set to SKIP_LB_ROUTING for the first passs. After first pass, each checked and illegal ones destroyed and their molecules are passed to second pass. The legalization strategy is set to FULL after first pass in that commit. Be careful with the place where you compress the legalizer and extract the atom lookup for placement.
Before that, the strategy was converted to full after the first reconstruction pass. After that, I was doing a neighbour search with full strategy. Now, I am doing neighbour search with SKIP_LB_ROUTING as well. After that, it requires last neighbour pass with FULL strategy. Also added ugly timer prints to be cleaned.
Changed the way of processing mainly in the first pass. In this version, they are first sorted by external pin number and then grouped into tiles. Then processed similar to naive by tiles but this one still checks the compatibility and capacity. Cleaning after each tile clusters created in that pass. Then, checking each cluster legality after they are created in the first neighbour pass. Second is already done with FULL strategy. The memory usage is decreased by nearly by %35. We lost runtime but it seems near 1-2%. TODO for that one: I added an if in neighbour pass that checks if the molecule is already clustered before checking its compatibility. This works now but ideally my algorithm should not try to add an already clustered molecule to another one. When removed, it failed on vtr_chain_largest for only LU32PEEng.v.
This is the version that results are presented in 2 May Friday meeting. The memory improved version is presented with master merged. We have things to try on top of that version. Handling the illegal clusters right away. Also added a 0.5f buffer into fixed blocks in partial placement verification to pass the assert. This should be handled more explicitely later.
In this version, after doing my packing for reconstruction, I am calling the initial_placement for ap flow. The results seems promising for this version and being added to the presentation. Especially reduces the max displacement as expected.
The displacement for each atom was being calculated as distance to the head of the tile. Now it considers the offset of the tile as well. That affects the percent of atoms displaced, max atom displacemetn and average atom displacement. Did no touched the cluster error for now. This can also be rewritten. Instead of being calculated solely on cluster location and centroid, we can try to use cluster information.
first pass if any failing cluster occurs In the first pass of reconstruction, creating clusters at a tile and checking whether they are legal or not. If illegal, try to cluster with FULL strategy there before going to next tile. Molecules being still unclustered passed to neighbour pass.
Prioritizing chained molecules in the reconstruction pass. Ensuring initial placer placing the clusters created in reconstruction pass first. Its current sorting also results in the same ordering due to standart deviation and size ordering. However, if used in reconstruction legalizer, I want to ensure that clusters created in reconstruction pass processed first.
Prioritizing the long carry chains in the reconstruction pass.
…_legalizer and its dependances
… update documentations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @haydar-c
Thank you so much for these changes. I have some comments on the code. Two major things that must be resolved before this can come in is a quadratic run time loop in your algorithm which can be resolved and an issue with how you are compressing the cluster legalizer.
g_vpr_ctx.mutable_atom().mutable_flat_placement_info().blk_x_pos[atom_blk_id] = g_vpr_ctx.device().grid.width() / 2.0f; | ||
g_vpr_ctx.mutable_atom().mutable_flat_placement_info().blk_y_pos[atom_blk_id] = g_vpr_ctx.device().grid.height() / 2.0f; | ||
g_vpr_ctx.mutable_atom().mutable_flat_placement_info().blk_layer[atom_blk_id] = 0; | ||
g_vpr_ctx.mutable_atom().mutable_flat_placement_info().blk_sub_tile[atom_blk_id] = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mutating the global flat placement info here is odd to me. Why does it need to be updated? Can we not modify the flat placement object coming in? Or even better leave the flat placement object as it is and only use the partial placement object?
@@ -245,6 +264,15 @@ void run_analytical_placement_flow(t_vpr_setup& vpr_setup) { | |||
// Print the device utilization | |||
print_device_utilization(target_device_utilization); | |||
|
|||
// Write out a flat placement file at the end of Full Legalization if the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By moving this to here, you have changed the meaning of the write_flat_place option:
I am ok with this change; in the future, if we want the post-placement flat placement, we can create another option. You should update the documentation for this option accordingly.
Since this has become AP-specific though, you should probably change this option to something like "--ap_write_post_fl_flat_place" or something to make this more clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems strange to not be able to write out the final placement; maybe we need another command line option to specify this.
atom_loc_layer = current_loc_layer; | ||
atom_loc_sub_tile = current_loc_sub_tile; | ||
found_valid_atom = true; | ||
if (current_loc_x != -1 && current_loc_y != -1 && current_loc_layer != -1 && current_loc_sub_tile != -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line concerns me a little. Here, -1 means that the user has left that part of the location unspecified.
For example, it is common for people to leave the layer or sub_tile unspecified instead of forcing it to 0. I think that there should be special handling for that. If not, this code will put a lot of blocks in the center of the device!
* both cases it expects the given flat placement to be close to legal. | ||
* | ||
* Before packing, it sorts the molecules such that long carry chain elements | ||
* are being first, and then molecules with highest external input pin numbers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"are being first" -> "are packed first"
* | ||
* Before packing, it sorts the molecules such that long carry chain elements | ||
* are being first, and then molecules with highest external input pin numbers | ||
* becoming first among long carry chain elements and the others. It then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this mean? Which is packed first, long carry chains or high external input pin? I think this sentence needs to be cleaned up.
// how reconstruction pass clusters created, so there is no need to explicitely | ||
// prioritize these clusters. However, if it changes in time, the atoms clustered | ||
// in neighbor pass and atoms misplaced might not match exactly. | ||
if (g_vpr_ctx.atom().flat_placement_info().valid) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why even use the global flat placement info here? Why not just always just cast the p_placement object into the flat placement info? This would simplify the code and it would also remove your reliance on this global variable.
for (AtomBlockId atom_blk_id : atom_netlist.blocks()) { | ||
if (!atom_blk_id.is_valid()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this check needed? I do not think the atom blocks can ever be invalid...
// Then compute the L1 distance from the block location to the projected | ||
// position. This will be the minimum distance this point needs to move. | ||
float dx = std::abs(proj_x - blk_x); | ||
float dy = std::abs(proj_y - blk_y); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like this change. I would call the method though instead of inlining the code:
vtr-verilog-to-routing/vpr/src/base/flat_placement_utils.h
Lines 24 to 50 in 1ed4a94
/** | |
* @brief Returns the L1 distance something at the given flat location would | |
* need to move to be within the bounds of a tile at the given tile loc. | |
*/ | |
inline float get_manhattan_distance_to_tile(const t_flat_pl_loc& src_flat_loc, | |
const t_physical_tile_loc& tile_loc, | |
const DeviceGrid& device_grid) { | |
// Get the bounds of the tile. | |
// Note: The get_tile_bb function will not work in this case since it | |
// subtracts 1 from the width and height. | |
auto tile_type = device_grid.get_physical_type(tile_loc); | |
float tile_xmin = tile_loc.x - device_grid.get_width_offset(tile_loc); | |
float tile_xmax = tile_xmin + tile_type->width; | |
float tile_ymin = tile_loc.y - device_grid.get_height_offset(tile_loc); | |
float tile_ymax = tile_ymin + tile_type->height; | |
// Get the closest point in the bounding box (including the edges) to | |
// the src_flat_loc. To do this, we project the point in L1 space. | |
float proj_x = std::clamp(src_flat_loc.x, tile_xmin, tile_xmax); | |
float proj_y = std::clamp(src_flat_loc.y, tile_ymin, tile_ymax); | |
// Then compute the L1 distance from the src_flat_loc to the projected | |
// position. This will be the minimum distance this point needs to move. | |
float dx = std::abs(proj_x - src_flat_loc.x); | |
float dy = std::abs(proj_y - src_flat_loc.y); | |
return dx + dy; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How big are these files? These files look massive. I do not recommend commiting them into the git repostory if they are large.
I would recommend compressing them, and then in the CI test uncompressing them. See what I did for the Z1000 architecture for example. The issue is that these files add a lot of bloat to the repository which we do not want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this blocks you, I would recommend keeping these tests behind for now and adding them later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as we don't do enormous circuits these files shouldn't be too big. One line per primitive, so 100,000 lines for neuron; if the primitive names are long that could be 10 MB. Not 0, but should be OK as long as we don't do the biggest designs or too many. If we do those we probably have to bite the bullet and compress them as @AlexandreSinger suggests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is fine in size; but we want to keep the file sizes for these to on the order of MBs if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just did a quick review since it looks like there will be some changes due to @AlexandreSinger 's comments.
The output messages look good, except I suggest adding something to make clear what reconstruction pass and neighbour pass mean (or change the terminology to something that's more obvious to the reader). You want to make clear that 'reconstruction pass' means it went where you said it should go, and 'neighbour pass' means it was modified (moved and/or grouped with other primitives than specified).
@@ -1286,6 +1286,12 @@ Analytical Placement is generally split into three stages: | |||
|
|||
* ``appack`` Use APPack, which takes the Packer in VPR and uses the flat atom placement to create better clusters. | |||
|
|||
* ``flat-recon`` Use the Flat Placement Reconstruction Full Legalizer which tries to reconstruct a clustered placement | |||
that is as close to the incoming flat placement as possible. It can be used to read flat placement from an '.fplace' file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
read flat -> read a flat
from an '.fplace' -> from a '.fplace'
that is as close to the incoming flat placement as possible. It can be used to read flat placement from an '.fplace' file | ||
or with Global Placement output. In both cases, it expects the given solution to be close to legal. If used with | ||
an '.fplace' file, each atom of a molecule should share same location information. If none of the atoms in a molecule | ||
have a specified location in the file, the molecule will be assigned to the center of the device. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have to explain what this means. You won't really leave it in the center of the device. I suggest just saying it is legal to leave some molecules unconstrained; the reconstruction phase will choose where to place them but does not attempt to optimize those locations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need a link to the file format (which has to be described somewhere).
@@ -245,6 +264,15 @@ void run_analytical_placement_flow(t_vpr_setup& vpr_setup) { | |||
// Print the device utilization | |||
print_device_utilization(target_device_utilization); | |||
|
|||
// Write out a flat placement file at the end of Full Legalization if the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems strange to not be able to write out the final placement; maybe we need another command line option to specify this.
* In gen_ap_netlist_from_atoms, we add 0.5f to each atom but do not consider | ||
* it here. Adding these buffers here as well if the flat placement provided. | ||
* TODO: This should be handled more explicitely. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be consistent with the coding style guide, use //
for implementation-related comments.
This PR brings the FlatRecon full legalizer. It aims to reconstruct a given flat placement with minimum disturbance to the given solution. It can be used with a solution read from an '.fplace' file or with Global Placement output. However, it expects the given solution to be close to legal in both cases.
It has beed run on the MCNC, VTR Chain, Koios, and Titan benchmarks. It also been tested on the elfPlace placements generated from the aug-elfPlace on Titan benchmarks. [Results to be appended]
I have added 2 regression tests for that legalizer. The first into the vtr_reg_strong with fast checks on MCNC benchmarks and the second one into the vtr_reg_nightly_test7 with some of the Titan benchmarks.
The logged output of that legalizer looks like below for an example of the LU32PEEng.v from VTR Chain:
