Compression codec interface and Snappy support to reduce buffer size and improve scalability #685
The biggest scalability issue we observed in our tests is that, to scale to bigger data sizes, we need more reduce splits, so when we scale to massive data we end up with too many reduce splits. This is similar to what's described in https://spark-project.atlassian.net/browse/SPARK-751, which is why we believe #669 is very important.
Even with #669, though, we would still have many reduce splits; in our case, processing 10 TB of data needs 1 million reduce splits. One scalability bottleneck here is that with many splits, there are as many open writers as there are splits. Each writer holds one compression stream and one buffered stream, both of which allocate internal buffers. The current LZF compression library uses a fixed 64 KB buffer (https://github.com/ning/compress/blob/master/src/main/java/com/ning/compress/lzf/LZFOutputStream.java), so with too many reduce splits the buffers alone can easily cause OOM: a single map task writing to 1 million splits would hold roughly 1,000,000 × 64 KB ≈ 64 GB of compression buffers.
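To make the buffer math concrete, here is a rough sketch of the pattern that blows up; the function and file names are illustrative, not the actual shuffle writer code:

```scala
import java.io.{BufferedOutputStream, FileOutputStream, OutputStream}
import com.ning.compress.lzf.LZFOutputStream

// Illustrative sketch: a map task keeps one open writer per reduce
// split, and every LZFOutputStream allocates its own fixed ~64 KB
// internal buffer on top of the BufferedOutputStream's buffer.
def openShuffleWriters(numReduceSplits: Int): Seq[OutputStream] =
  (0 until numReduceSplits).map { reduceId =>
    val file = new FileOutputStream(s"shuffle_map0_reduce$reduceId")
    // Buffer memory grows linearly with the number of splits:
    // 1,000,000 splits * 64 KB ≈ 64 GB of compression buffers alone.
    new LZFOutputStream(new BufferedOutputStream(file))
  }
```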
This change introduces a compression codec interface, along with Snappy compression, which supports a configurable buffer size. (We could also rewrite the LZF output stream implementation to support a configurable buffer size, but Snappy already supports this, so it is more straightforward to use Snappy.) LZF and Snappy can be selected via system properties, and other compression codec implementations can be defined in the driver program as well.
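A minimal sketch of what such a codec interface and Snappy codec could look like, assuming snappy-java's `SnappyOutputStream`; the trait, class, and property names here are illustrative rather than necessarily the exact ones in this patch:

```scala
import java.io.{InputStream, OutputStream}
import org.xerial.snappy.{SnappyInputStream, SnappyOutputStream}

// Hypothetical codec interface: a codec wraps the raw shuffle streams.
trait CompressionCodec {
  def compressedOutputStream(s: OutputStream): OutputStream
  def compressedInputStream(s: InputStream): InputStream
}

// Snappy codec whose block (buffer) size is read from a system property,
// so per-writer memory can be tuned down from LZF's fixed 64 KB.
class SnappyCompressionCodec extends CompressionCodec {
  private val blockSize =
    System.getProperty("spark.io.compression.snappy.block.size", "32768").toInt

  override def compressedOutputStream(s: OutputStream): OutputStream =
    new SnappyOutputStream(s, blockSize) // snappy-java takes a block size arg

  override def compressedInputStream(s: InputStream): InputStream =
    new SnappyInputStream(s)
}
```

A driver program could then pick a codec (or plug in its own implementation) by setting a system property before creating the SparkContext, e.g. `System.setProperty("spark.io.compression.codec", "spark.io.SnappyCompressionCodec")`, with the property and class names again assumed for illustration.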
We ran some tests: Snappy is slightly faster and has a slightly better compression ratio than LZF at the same buffer size. We also measured Snappy's compressed output size at different buffer sizes:
| Codec  | Buffer size | Compressed size |
|--------|-------------|-----------------|
| LZF    | 64 KB       | 15496           |
| Snappy | 64 KB       | 15434           |
| Snappy | 32 KB       | 16594           |
| Snappy | 16 KB       | 18028           |
So by moving from a 64 KB to a 16 KB buffer we can reduce the per-writer memory footprint by 4x, at the cost of a slightly worse compression ratio (roughly 16% larger output than LZF with a 64 KB buffer).
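Using the illustrative block-size property from the sketch above, getting that 4x reduction would be a one-line change in the driver:

```scala
// Assumed property name; shrinks each writer's buffer from 64 KB to 16 KB.
System.setProperty("spark.io.compression.snappy.block.size", "16384")
```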