Pcompress is a utility to do compression/decompression and deduplication in parallel by splitting input data into chunks. It has a modular structure and includes support for multiple algorithms like LZMA, Bzip2, PPMD, LZ4, etc., with KECCAK/BLAKE2/SHA-256/512 chunk checksums. SSE optimizations for the bundled LZMA are included. It also implements chunk-level Content-Aware Deduplication and Delta Compression features based on a Polynomial Fingerprinting scheme. It has low metadata overhead and overlaps I/O and compression to achieve maximum parallelism. It has AES encryption capability and uses Scrypt from Tarsnap to generate per-session unique keys from passwords. It can work in pipe mode, reading from stdin and writing to stdout. It also provides some adaptive compression modes in which a suitable algorithm is chosen per chunk based on heuristics.
Tags: data compression, deduplication
Operating Systems: UNIX / Linux
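The chunk-level Content-Aware Deduplication described above hinges on a polynomial (Rabin-style) rolling fingerprint that places chunk boundaries at content-defined positions, so inserted or shifted data only perturbs nearby chunks. The sketch below illustrates the general technique only; the window size, polynomial base, boundary mask, and minimum chunk size are assumed values chosen for the example, not Pcompress's actual parameters.

```c
/*
 * Minimal sketch of content-defined chunking with a polynomial rolling
 * fingerprint (Rabin-Karp style). Illustrative only: the window size,
 * polynomial base, boundary mask, and minimum chunk size are assumed
 * values, not the parameters Pcompress uses.
 */
#include <stdint.h>
#include <stdio.h>

#define WINDOW    16          /* bytes in the rolling window (assumed)  */
#define POLY      153191ULL   /* polynomial base (assumed)              */
#define AVG_MASK  0x0FFFULL   /* ~4KB average chunk: low 12 bits        */
#define MIN_CHUNK 2048        /* smallest chunk emitted (2KB)           */

/* Print content-defined chunk boundaries for buf[0..len). */
static void find_boundaries(const uint8_t *buf, size_t len)
{
    uint64_t hash = 0, pow = 1;
    size_t i, last = 0;

    /* POLY^(WINDOW-1): weight of the byte that leaves the window. */
    for (i = 1; i < WINDOW; i++)
        pow *= POLY;

    for (i = 0; i < len; i++) {
        if (i >= WINDOW)
            hash -= (uint64_t)buf[i - WINDOW] * pow;   /* drop old byte */
        hash = hash * POLY + buf[i];                   /* add new byte  */

        /* Declare a boundary when the low fingerprint bits hit a
         * fixed pattern, but never before MIN_CHUNK bytes. */
        if (i + 1 - last >= MIN_CHUNK && (hash & AVG_MASK) == AVG_MASK) {
            printf("chunk: offset %zu length %zu\n", last, i + 1 - last);
            last = i + 1;
        }
    }
    if (last < len)
        printf("chunk: offset %zu length %zu\n", last, len - last);
}

int main(void)
{
    uint8_t data[64 * 1024];
    uint64_t x = 88172645463325252ULL;   /* xorshift64 PRNG state */
    size_t i;

    for (i = 0; i < sizeof(data); i++) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
        data[i] = (uint8_t)x;
    }
    find_boundaries(data, sizeof(data));
    return 0;
}
```

Because boundaries depend on content rather than fixed offsets, identical regions produce identical chunks even when their absolute positions differ, which is what makes the chunk hashes usable for deduplication.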
Release Notes: This release fixes several issues, including some corner-case crashes and a couple of buffer overflows. Data Deduplication can now be done using blocks as small as 2KB, providing a much higher dedupe ratio than virtually any other deduplication software. Similarity-based deduplication performance has been improved. Free memory detection accuracy has also been improved.
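The effect of the smaller 2KB dedupe block is easy to see with a toy experiment: the same buffer deduplicated with 2KB and with 8KB fixed blocks, where a few repeated 4KB regions are only caught at the finer granularity. This is purely an illustration with hypothetical FNV-1a block hashes and fixed-size blocks; Pcompress itself uses variable-size chunks and stronger hashes.

```c
/*
 * Toy illustration: duplicate block detection at 2KB vs. 8KB block
 * size. Hypothetical and simplified; Pcompress uses variable-size
 * chunks and cryptographic block hashes, not fixed blocks and FNV-1a.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DATA_SIZE (1u << 20)   /* 1MB of synthetic data */

static uint64_t fnv1a(const uint8_t *p, size_t n)
{
    uint64_t h = 1469598103934665603ULL;
    while (n--) {
        h ^= *p++;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Count fixed-size blocks whose hash was already seen earlier. */
static size_t count_dupes(const uint8_t *buf, size_t len, size_t bsize)
{
    size_t nblocks = len / bsize, nseen = 0, dupes = 0, i, j;
    uint64_t *seen = malloc(nblocks * sizeof(*seen));

    for (i = 0; i < nblocks; i++) {
        uint64_t h = fnv1a(buf + i * bsize, bsize);
        int found = 0;

        for (j = 0; j < nseen; j++)
            if (seen[j] == h) { found = 1; break; }
        if (found)
            dupes++;
        else
            seen[nseen++] = h;
    }
    free(seen);
    return dupes;
}

int main(void)
{
    uint8_t *buf = malloc(DATA_SIZE);
    uint64_t x = 88172645463325252ULL;   /* xorshift64 PRNG state */
    size_t i;

    for (i = 0; i < DATA_SIZE; i++) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
        buf[i] = (uint8_t)x;
    }
    /* Plant sixteen 4KB repeats of the start of the buffer, aligned to
     * 2KB boundaries: visible to 2KB blocks, invisible to 8KB blocks. */
    for (i = 0; i < 16; i++)
        memcpy(buf + 2048 * (50 + 3 * i), buf, 4096);

    printf("2KB blocks: %zu duplicates\n", count_dupes(buf, DATA_SIZE, 2048));
    printf("8KB blocks: %zu duplicates\n", count_dupes(buf, DATA_SIZE, 8192));
    free(buf);
    return 0;
}
```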
Release Notes: This release fixes a few bugs and provides several improvements in efficiency and performance. The effectiveness of similarity detection for similarity-based near-exact deduplication has been improved, while memory requirements for the index have been reduced. Accuracy of data partitioning between threads has been improved. Chunking and indexing performance have been improved, and the KMV Sketch computation is now more accurate. This release also moves all the core functionality into a shared library in preparation for an API that will be introduced in future releases.
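The KMV Sketch mentioned above summarizes the set of chunk hashes in a segment by keeping only its k numerically smallest values, so two segments that share many chunks also share many sketch entries. The following is a minimal illustration of that idea; the sketch size, stand-in hashes, and comparison step are assumptions for the example, not taken from Pcompress's code.

```c
/*
 * Minimal KMV (k-minimum-values) sketch: keep the k smallest 64-bit
 * chunk hashes of a segment and compare two segments by the overlap of
 * their sketches. The sketch size and the stand-in hashes below are
 * assumptions for the illustration, not Pcompress's parameters.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define K 16   /* sketch size (assumed) */

typedef struct {
    uint64_t v[K];   /* k smallest hashes seen so far, ascending */
    int n;
} kmv_t;

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Feed one chunk hash into the sketch. */
static void kmv_add(kmv_t *s, uint64_t h)
{
    int i;

    for (i = 0; i < s->n; i++)
        if (s->v[i] == h)
            return;                       /* already in the sketch */
    if (s->n < K)
        s->v[s->n++] = h;                 /* still filling up */
    else if (h < s->v[K - 1])
        s->v[K - 1] = h;                  /* replace current maximum */
    else
        return;
    qsort(s->v, s->n, sizeof(uint64_t), cmp_u64);
}

/* Fraction of a's sketch entries that also appear in b's sketch. */
static double kmv_overlap(const kmv_t *a, const kmv_t *b)
{
    int i, j, common = 0;

    for (i = 0; i < a->n; i++)
        for (j = 0; j < b->n; j++)
            if (a->v[i] == b->v[j])
                common++;
    return a->n ? (double)common / a->n : 0.0;
}

int main(void)
{
    kmv_t s1 = { { 0 }, 0 }, s2 = { { 0 }, 0 };
    int i;

    /* Two synthetic segments sharing half of their chunk hashes. */
    for (i = 0; i < 200; i++) {
        uint64_t h = (uint64_t)i * 0x9E3779B97F4A7C15ULL;
        kmv_add(&s1, h);
        kmv_add(&s2, i < 100 ? h : h ^ 0xDEADBEEFULL);
    }
    printf("sketch overlap: %.2f\n", kmv_overlap(&s1, &s2));
    return 0;
}
```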
Release Notes: This is primarily a bugfix release. It fixes some crashes with invalid input and build problems on Debian 6 and older non-SSE4 processors. The Min-heap based Similarity matching for Delta Encoding has been made faster and more accurate. Accuracy of scalable Segmented Global Deduplication has been further improved to greater than 95%. More test cases have been added.
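For the min-heap based similarity matching, the general idea is that a chunk's smallest fingerprint values can be pulled out efficiently with a binary min-heap and used as a compact signature; chunks whose signatures overlap strongly are good candidates for delta encoding against one another. The following sketch assumes a hypothetical signature size, fingerprint set, and matching threshold; it is not Pcompress's implementation.

```c
/*
 * Sketch of using a binary min-heap to pull out the k smallest
 * fingerprints of a chunk and compare two chunks' signatures. The
 * signature size, fingerprint values, and matching threshold are all
 * hypothetical; this is not Pcompress's implementation.
 */
#include <stdint.h>
#include <stdio.h>

#define NFP 64   /* fingerprints per chunk (assumed) */
#define K    8   /* signature size (assumed)         */

/* Restore the min-heap property below node i of heap h[0..n). */
static void sift_down(uint64_t *h, size_t n, size_t i)
{
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, m = i;
        uint64_t tmp;

        if (l < n && h[l] < h[m]) m = l;
        if (r < n && h[r] < h[m]) m = r;
        if (m == i)
            break;
        tmp = h[i]; h[i] = h[m]; h[m] = tmp;
        i = m;
    }
}

/* Extract the k smallest values of fp[0..n) into out[0..k), ascending.
 * The input array is used as heap storage and gets reordered. */
static size_t k_smallest(uint64_t *fp, size_t n, uint64_t *out, size_t k)
{
    size_t i, got = 0;

    for (i = n / 2; i-- > 0; )        /* bottom-up heapify, O(n) */
        sift_down(fp, n, i);
    while (got < k && n > 0) {        /* pop the minimum k times */
        out[got++] = fp[0];
        fp[0] = fp[--n];
        sift_down(fp, n, 0);
    }
    return got;
}

int main(void)
{
    uint64_t a[NFP], b[NFP], sa[K], sb[K];
    size_t i, j, match = 0;

    /* Two chunks whose contents partially overlap: half of the
     * fingerprints are identical, half differ. */
    for (i = 0; i < NFP; i++) {
        a[i] = (uint64_t)(i + 1) * 0x9E3779B97F4A7C15ULL;
        b[i] = i < NFP / 2 ? a[i] : a[i] ^ 0x5BD1E995ULL;
    }
    k_smallest(a, NFP, sa, K);
    k_smallest(b, NFP, sb, K);

    for (i = 0; i < K; i++)
        for (j = 0; j < K; j++)
            if (sa[i] == sb[j])
                match++;
    printf("signature overlap %zu/%d: %s delta-encoding candidate\n",
           match, K, 2 * match >= K ? "good" : "poor");
    return 0;
}
```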
Release Notes: This release adds many bugfixes and performance improvements. Accuracy in finding duplicates in Global Dedupe has been improved. SHA256 is now the default block hash algorithm for dedupe, with the ability to change it separately from the chunk verification hash. Overall performance has been improved through better parallelism, more SSE vectorization, faster sorting, and improved segment hash list file handling, resulting in less I/O and fewer random accesses. Bugs in calculating the in-memory index size have been fixed to avoid overflowing free RAM and swapping to disk.
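For reference, this is roughly what computing a SHA-256 digest over a single dedupe block looks like, shown here with OpenSSL's EVP interface purely for illustration; Pcompress bundles its own optimized hash code, and the 2KB block size below is just the minimum mentioned earlier. Selecting a different EVP_MD in this style of API is the analogue of switching the block hash independently of the chunk verification hash.

```c
/*
 * Illustration only: computing a SHA-256 digest over one dedupe block
 * with OpenSSL's EVP API. Pcompress bundles its own optimized hash
 * implementations; this merely shows the kind of per-block digest a
 * dedupe index stores. Compile with: cc demo.c -lcrypto
 */
#include <openssl/evp.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned char block[2048];           /* one 2KB dedupe block */
    unsigned char md[EVP_MAX_MD_SIZE];
    unsigned int mdlen = 0;
    unsigned int i;

    memset(block, 0xA5, sizeof(block));  /* stand-in block contents */

    /* One-shot digest of the block; EVP_sha256() selects the algorithm,
     * so changing the block hash is just a matter of passing a
     * different EVP_MD (e.g. EVP_sha512()). */
    if (!EVP_Digest(block, sizeof(block), md, &mdlen, EVP_sha256(), NULL)) {
        fprintf(stderr, "digest failed\n");
        return 1;
    }

    for (i = 0; i < mdlen; i++)
        printf("%02x", md[i]);
    printf("\n");
    return 0;
}
```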
Release Notes: This release introduces the capability to do Global Deduplication, which deduplicates across the entire dataset using an in-memory index, as opposed to deduplicating only within segments. Two kinds of indexes are used, depending on the dataset size: a full chunk hash index for small datasets, and a special segmented similarity-based index when the dataset is very large. The latter index occupies just 0.002% of the dataset size while achieving more than 90% of the efficiency of exact dedupe with a full chunk index using 4KB chunks. Streaming support allows optimized network transfer of large datasets.
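A back-of-the-envelope comparison shows why the segmented similarity index matters for large datasets. Every constant below (per-entry size, chunk size, RAM budget, the index-selection threshold) is a hypothetical value for the illustration; only the 4KB chunk size and the 0.002% figure come from the release notes.

```c
/*
 * Back-of-the-envelope index sizing for Global Deduplication. All
 * constants except the 4KB chunk size and the 0.002% ratio are
 * assumptions made for this illustration, not values from Pcompress.
 */
#include <stdio.h>
#include <stdint.h>

#define CHUNK_SIZE        4096ULL    /* average dedupe chunk           */
#define CHUNK_ENTRY_SIZE  64ULL      /* hash + offset + refs (assumed) */
#define SEG_INDEX_RATIO   0.00002    /* 0.002% of the dataset size     */

int main(void)
{
    uint64_t dataset = 1ULL << 40;       /* 1TB example dataset   */
    uint64_t ram_budget = 4ULL << 30;    /* 4GB usable RAM budget */

    /* Exact dedupe: one index entry per chunk across the whole dataset. */
    uint64_t full_index = (dataset / CHUNK_SIZE) * CHUNK_ENTRY_SIZE;

    /* Segmented similarity index: a tiny fraction of the dataset size. */
    uint64_t seg_index = (uint64_t)(dataset * SEG_INDEX_RATIO);

    printf("full chunk index : %10.1f MB\n", full_index / 1048576.0);
    printf("segmented index  : %10.1f MB\n", seg_index / 1048576.0);
    printf("chosen index     : %s\n",
           full_index <= ram_budget ? "full chunk hash index"
                                    : "segmented similarity index");
    return 0;
}
```

With these assumed numbers, a full 4KB-chunk index for a 1TB dataset needs about 16GB of RAM at 64 bytes per entry, whereas an index sized at 0.002% of the dataset is only around 21MB, which is why the segmented similarity index is the one that scales to very large datasets.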