# Save space with deduplication

As of 2024, an estimated [90% of global data is redundant](https://www.businesswire.com/news/home/20200508005025/en/IDCs-Global-DataSphere-Forecast-Shows-Continued-Steady-Growth-in-the-Creation-and-Consumption-of-Data), underscoring the need for efficient storage management. UltiHash tackles this challenge with a byte-level deduplication algorithm designed to minimize storage volumes by identifying and eliminating redundant data across all objects, regardless of format. This approach can reduce overall storage needs by up to 60%, enabling organizations to scale their data without a proportional increase in capacity requirements.

<figure><img src="/files/SrgBChMSle7MjodPPFnD" alt=""><figcaption></figcaption></figure>

> **Want to see how much deduplication you could achieve? Test your data with our 5-minute terminal demo ↓**
>
> ```
> curl -fsSL https://ultihash.io/1line | bash
> ```

***

### How does UltiHash's deduplication work?

The deduplication process works by splitting objects into fragments of varying sizes depending on the dataset. If a fragment already exists within the system, it isn’t stored again, eliminating unnecessary duplication across datasets. This ongoing comparison ensures that storage resources are utilized efficiently while maintaining data integrity.

Unlike traditional compression techniques that often decrease performance, UltiHash’s deduplication runs continuously and is data-type-agnostic, supporting structured, unstructured, and even compressed data. In certain cases, such as with RAW files, tests have shown volume reductions of up to 74%. This makes UltiHash ideal for environments handling large quantities of redundant data, particularly in AI, machine learning, and media-heavy applications.
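The core idea can be illustrated with a short sketch. Note this is a simplified model, not UltiHash's actual implementation: it uses fixed-size chunks and SHA-256 fingerprints for clarity, whereas UltiHash splits objects into fragments of varying sizes at the byte level.

```python
import hashlib

def dedup_store(data: bytes, chunk_size: int, store: dict) -> int:
    """Split data into fixed-size chunks and keep each unique chunk once.
    Returns the number of bytes actually written to the store."""
    written = 0
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint not in store:  # fragment not seen before: store it
            store[fingerprint] = chunk
            written += len(chunk)
    return written

store = {}
payload = b"ABCD" * 1024            # 4 KiB of highly redundant data
written = dedup_store(payload, 1024, store)
print(written, len(payload))        # far fewer bytes stored than received
```

Because every 1024-byte chunk of the payload is identical, only one chunk is physically stored; subsequent identical fragments are recorded as references to it.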

***

### What kinds of data deduplicate well?

UltiHash optimizes data volume out of the box through a built-in deduplication algorithm that eliminates redundancies at a byte level, regardless of data type or format. This results in significant space savings of up to 60% on the entire data volume, depending on various factors including:

* compressed vs uncompressed data format: UltiHash generates up to 75% space savings on uncompressed formats (e.g. RAW, TIFF) and up to 51% on compressed formats (e.g. JPG, PNG)
* similarity between the objects: the higher the similarity, the more space saved

This section documents the space savings generated by UltiHash on different datasets, giving a fair demonstration of UltiHash’s capabilities. The results can be reproduced on any UltiHash cluster.

<table><thead><tr><th width="467">Dataset + link</th><th width="129" align="right">Dataset size</th><th width="144" align="right">Space savings</th></tr></thead><tbody><tr><td><a href="https://www.kaggle.com/datasets/amritpal333/adni4dicomnano10514/">DICOM files of brain MRI scans</a></td><td align="right">1.51 GB</td><td align="right">67.13 %</td></tr><tr><td><a href="https://www.kaggle.com/datasets/zaynena/selfdriving-car-simulator">JPGs of driving scenarios</a></td><td align="right">2.41 GB</td><td align="right">52.32 %</td></tr><tr><td><a href="https://www.kaggle.com/datasets/mhskjelvareid/dagm-2007-competition-dataset-optical-inspection">PNGs of synthetic textures with defects</a></td><td align="right">5.89 GB</td><td align="right">50 %</td></tr><tr><td><a href="https://www.kaggle.com/datasets/barelydedicated/savee-database">WAVs of human speech for emotion recognition</a></td><td align="right">0.33 GB</td><td align="right">50 %</td></tr><tr><td><a href="https://www.kaggle.com/datasets/abireltaief/highresolution-geotiff-images-of-climatic-data">TIFF images of climate data</a></td><td align="right">16.28 GB</td><td align="right">45.84 %</td></tr><tr><td><a href="https://www.kaggle.com/datasets/brsdincer/fossil-segmentation-image-set-microfossil?select=Fossil_Segmentation">TIFFs of fossil segmentations</a></td><td align="right">8.26 GB</td><td align="right">44.64 %</td></tr><tr><td><a href="https://www.kaggle.com/datasets/kaushil268/disease-prediction-using-machine-learning">CSV tables of symptoms</a></td><td align="right">0.0014 GB</td><td align="right">42 %</td></tr><tr><td><a href="https://www.kaggle.com/datasets/kmader/3d-dinosaur-teeth">Models of dinosaur teeth</a></td><td align="right">1.87 GB</td><td align="right">33.09 %</td></tr><tr><td><a href="https://www.kaggle.com/datasets/zilikons/2014-2017-athens-center-cop-data">Parquet files with temperature, humidity, wind and land uses</a></td><td align="right">1.81 GB</td><td align="right">21.59 %</td></tr><tr><td><a href="https://www.kaggle.com/datasets/brsdincer/july-2531-2021-climate-data-nasa">NetCDF climatic and atmospheric data</a></td><td align="right">0.68 GB</td><td align="right">18.65 %</td></tr><tr><td><a href="https://www.kaggle.com/datasets/usharengaraju/pandaset-dataset">LIDAR data of driving scenarios</a></td><td align="right">33.26 GB</td><td align="right">18.31 %</td></tr><tr><td><a href="https://www.kaggle.com/datasets/vangap/indian-supreme-court-judgments">PDFs of Indian supreme court judgements</a></td><td align="right">5.57 GB</td><td align="right">11.95 %</td></tr></tbody></table>
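The figures above were measured on UltiHash clusters. As a rough local proxy, a fixed-size chunk fingerprint scan can estimate how much redundancy a dataset carries before uploading it. This sketch is illustrative only: UltiHash deduplicates at the byte level with variable-size fragments, so its actual savings will differ.

```python
import hashlib
from pathlib import Path

def estimate_savings(directory: str, chunk_size: int = 4096) -> float:
    """Estimate deduplication potential: the fraction of chunk bytes
    that are duplicates of a previously seen chunk."""
    seen = set()
    total = unique = 0
    for path in Path(directory).rglob("*"):
        if not path.is_file():
            continue
        data = path.read_bytes()
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            total += len(chunk)
            fingerprint = hashlib.sha256(chunk).digest()
            if fingerprint not in seen:  # first occurrence of this chunk
                seen.add(fingerprint)
                unique += len(chunk)
    return 1 - unique / total if total else 0.0
```

For example, a directory containing two identical files yields an estimate of 0.5, since half of all chunk bytes are duplicates.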

***

### How to disable deduplication <a href="#disable-deduplication" id="disable-deduplication"></a>

UltiHash includes an integrated deduplication service that is enabled by default. It is recommended for most workloads, especially read-intensive ones, as it does not introduce latency during read operations. However, for write-intensive workloads where the overhead of deduplication is not needed, you can disable the service for the entire storage cluster.

To disable deduplication, update the Helm chart values by setting the `deduplicator.enabled` flag to `false`, then apply the change to the Helm release:

```
helm upgrade <release_name> oci://registry.ultihash.io/stable/ultihash-cluster \
  -n <namespace> \
  --set deduplicator.enabled=false
```

You can also make this change by editing `values.yaml` directly:

```
deduplicator:
  enabled: false
```

Then reapply the release:

```
helm upgrade <release_name> oci://registry.ultihash.io/stable/ultihash-cluster \
  -n <namespace> \
  --values values.yaml
```

The deduplicator service can be disabled or re-enabled at any time by updating the Helm chart values and upgrading the release.
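To verify which value the deployed release is currently using, you can inspect the computed chart values. This is a sketch: `<release_name>` and `<namespace>` are placeholders for your own release and namespace, and the grep pattern assumes the flag appears under a `deduplicator` key as in the example above.

```shell
# Show the effective values of the release and filter for the deduplicator flag
helm get values <release_name> -n <namespace> --all | grep -A1 "deduplicator"
```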

***

### How to access deduplication metrics <a href="#deduplication-metrics" id="deduplication-metrics"></a>

UltiHash extends the standard S3 API with features like deduplication metrics, which let you query the effective size of your data after deduplication. This is useful for optimizing storage and understanding your actual storage usage.

`get_effective_size.py` : <https://github.com/UltiHash/scripts/tree/main/boto3/ultihash_info>

```
# Fetch the script and retrieve the deduplicated (effective) data size
git clone https://github.com/UltiHash/scripts.git
cd scripts/boto3/ultihash_info
python3 get_effective_size.py --url <cluster-url>
```

