[#2083] improvement: Quickly delete local or HDFS data at the shuffleId level. #2084
base: master
Could you share which cases need a shorter deletion time?
During a stage retry, when the shuffle data blocks are deleted from local disk or HDFS.
@zuston Please help re-trigger the failing module; I see no error locally.
Thanks for your effort on this feature for stage retry. One comment: how is this enabled by config? I hope this feature can be scoped behind an explicit config option and be disabled by default.
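As a rough illustration of the "explicit option, disabled by default" request, the sketch below gates the choice of deletion path on a boolean flag. It uses plain java.util.Properties rather than ShuffleServerConf, and the key name and class name are hypothetical, not taken from this PR.

```java
import java.util.Properties;

// Hypothetical sketch: java.util.Properties stands in for the server config.
class DeletionConfigSketch {
  // Hypothetical key; the real option name is not given in this thread.
  static final String TWO_PHASE_DELETION_ENABLED = "example.server.twoPhaseDeletion.enabled";

  // The feature stays off unless the option is explicitly set to "true".
  static boolean isTwoPhaseDeletionEnabled(Properties conf) {
    return Boolean.parseBoolean(conf.getProperty(TWO_PHASE_DELETION_ENABLED, "false"));
  }

  // Picks the deletion path to use when a stage retry re-registers a shuffle.
  static String chooseRemoval(Properties conf) {
    return isTwoPhaseDeletionEnabled(conf) ? "rename-then-async-delete" : "synchronous-delete";
  }
}
```

With an empty configuration the existing synchronous path is chosen, so current deployments would see no behavior change.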
}

@Override
public void removeResources(PurgeEvent event, boolean isQuick) {
Could we make softDeletion/isQuick an internal variable of the PurgeEvent?
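One way to read this suggestion is sketched below: the event itself carries the deletion mode, so removeResources keeps a single-argument signature. The class names and fields here are hypothetical stand-ins, not the PurgeEvent from the codebase.

```java
// Hypothetical stand-in for a purge event that carries its own deletion mode.
class ShufflePurgeEventSketch {
  private final String appId;
  private final int shuffleId;
  private final boolean twoPhaseDeletion;

  ShufflePurgeEventSketch(String appId, int shuffleId, boolean twoPhaseDeletion) {
    this.appId = appId;
    this.shuffleId = shuffleId;
    this.twoPhaseDeletion = twoPhaseDeletion;
  }

  String getAppId() {
    return appId;
  }

  int getShuffleId() {
    return shuffleId;
  }

  boolean isTwoPhaseDeletion() {
    return twoPhaseDeletion;
  }
}

// Hypothetical storage manager: no extra boolean parameter is needed.
class StorageManagerSketch {
  // Returns the chosen path as a string so the behavior is easy to observe.
  String removeResources(ShufflePurgeEventSketch event) {
    return event.isTwoPhaseDeletion() ? "rename-then-async-delete" : "synchronous-delete";
  }
}
```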
deleteHandler.quickDelete(asynchronousDeleteEvent);
boolean isSucess = quickNeedDeletePaths.offer(asynchronousDeleteEvent);
if (!isSucess) {
LOG.warn(
This is abnormal and will leak data. For this case, a metric should be added for better observability.
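A minimal sketch of that idea, assuming a bounded queue of renamed paths: every rejected offer is counted so the leak becomes visible. The AtomicLong stands in for whatever counter type the server's metrics system provides, and all names are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: counts paths that could not be queued for async deletion.
class DeleteQueueSketch {
  private final BlockingQueue<String> pendingPaths;
  // Stand-in for a real metrics counter.
  private final AtomicLong enqueueFailures = new AtomicLong();

  DeleteQueueSketch(int capacity) {
    this.pendingPaths = new ArrayBlockingQueue<>(capacity);
  }

  // offer() returns false instead of blocking when the queue is full.
  boolean enqueue(String renamedPath) {
    boolean accepted = pendingPaths.offer(renamedPath);
    if (!accepted) {
      // The renamed data stays on disk with nothing scheduled to delete it.
      enqueueFailures.incrementAndGet();
    }
    return accepted;
  }

  long failureCount() {
    return enqueueFailures.get();
  }
}
```

A nonzero failure count would then indicate renamed directories that nothing is scheduled to remove.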
HadoopStorageManager(ShuffleServerConf conf) {
super(conf);
hadoopConf = conf.getHadoopConf();
shuffleServerId = conf.getString(ShuffleServerConf.SHUFFLE_SERVER_ID, "shuffleServerId");
isStorageAuditLogEnabled = conf.getBoolean(ShuffleServerConf.SERVER_STORAGE_AUDIT_LOG_ENABLED);
Runnable clearNeedDeletePathTask =
Why not integrate this async deletion into an underlying class such as SingleStorageManager, so that the localfile and hadoop storage types can share it?
All comments have been addressed.
Got it. Let me take another look.
@@ -227,7 +227,7 @@ public void registerShuffle(
 taskInfo.refreshLatestStageAttemptNumber(shuffleId, stageAttemptNumber);
 try {
 long start = System.currentTimeMillis();
-shuffleServer.getShuffleTaskManager().removeShuffleDataSync(appId, shuffleId);
+shuffleServer.getShuffleTaskManager().softRemoveShuffleDataSync(appId, shuffleId);
I hope this can be enabled through an extra config option.
Sorry for my previous comment. I think this deletion could be named TwoPhaseDeletion, consisting of two phases:
- Soft deletion
- Hard deletion
The original way of deleting is the hard deletion; we could extract an abstract class to get a good abstraction.
Could you use another concept? Hard deletion makes me think of rm --force. Maybe you can use rename directly.
@@ -157,6 +157,9 @@ public class ShuffleServerMetrics {
 public static final String TOPN_OF_ON_HADOOP_DATA_SIZE_FOR_APP =
 "topN_of_on_hadoop_data_size_for_app";

+private static final String TOTAL_HADOOP_SOFT_DELETE_FAILED = "total_hadoop_soft_delete_failed";
total_hadoop_two_phases_deletion_failed
@@ -157,6 +157,9 @@ public class ShuffleServerMetrics {
 public static final String TOPN_OF_ON_HADOOP_DATA_SIZE_FOR_APP =
 "topN_of_on_hadoop_data_size_for_app";

+private static final String TOTAL_HADOOP_SOFT_DELETE_FAILED = "total_hadoop_soft_delete_failed";
+private static final String TOTAL_LOCAL_SOFT_DELETE_FAILED = "total_local_soft_delete_failed";
ditto
}
} else {
deleteHandler.delete(deletePaths.toArray(new String[deletePaths.size()]), appId, user);
}
removeAppStorageInfo(event);
Will this affect the metrics analysis when using the two-phase deletion?
Overwrite delete logic.
What changes were proposed in this pull request?
At the shuffleId level, data on local disk or HDFS currently has to be deleted synchronously. In some scenarios the deletion time needs to be shortened, so this PR renames the folders first and deletes them asynchronously.
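The rename-then-delete idea can be sketched on a local filesystem as follows. This is only an illustration using java.nio.file, not the handler code from this PR; the class name, the "_deleted_" suffix, and the unbounded queue are assumptions, and an HDFS variant would go through the Hadoop FileSystem API instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.stream.Stream;

// Hypothetical sketch of rename first, delete later.
class RenameThenDeleteSketch {
  private final BlockingQueue<Path> pendingDeletes = new LinkedBlockingQueue<>();

  // Phase 1: rename the directory so the original path is free right away.
  Path softDelete(Path shuffleDir) throws IOException {
    String renamedName = shuffleDir.getFileName() + "_deleted_" + System.nanoTime();
    Path renamed = shuffleDir.resolveSibling(renamedName);
    Files.move(shuffleDir, renamed);
    pendingDeletes.add(renamed);
    return renamed;
  }

  // Phase 2: remove everything queued so far; meant to run on a background thread.
  int drainAndDelete() throws IOException {
    int deleted = 0;
    Path next;
    while ((next = pendingDeletes.poll()) != null) {
      deleteRecursively(next);
      deleted++;
    }
    return deleted;
  }

  private static void deleteRecursively(Path root) throws IOException {
    try (Stream<Path> walk = Files.walk(root)) {
      // Reverse order visits children before their parent directories.
      for (Path p : (Iterable<Path>) walk.sorted(Comparator.reverseOrder())::iterator) {
        Files.delete(p);
      }
    }
  }
}
```

The caller only waits for the rename, which on the same filesystem does not copy the directory contents; the recursive delete can then run from a scheduled background task.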
Why are the changes needed?
Fix: #2083
Does this PR introduce any user-facing change?
No.
How was this patch tested?
UT.