# Save space with deduplication

As of 2024, an estimated [90% of global data is redundant](https://www.businesswire.com/news/home/20200508005025/en/IDCs-Global-DataSphere-Forecast-Shows-Continued-Steady-Growth-in-the-Creation-and-Consumption-of-Data), underscoring the need for efficient storage management. UltiHash tackles this challenge with a byte-level deduplication algorithm designed to minimize storage volumes by identifying and eliminating redundant data across all objects, regardless of format. This approach can reduce overall storage needs by up to 60%, enabling organizations to scale their data without a proportional increase in capacity requirements.

<figure><img src="/files/SrgBChMSle7MjodPPFnD" alt=""><figcaption></figcaption></figure>

> **Want to see how much deduplication you could achieve? Test your data with our 5-minute terminal demo ↓**
>
> ```
> curl -fsSL https://ultihash.io/1line | bash
> ```

***

### How does UltiHash's deduplication work?

The deduplication process works by splitting objects into fragments of varying sizes depending on the dataset. If a fragment already exists within the system, it isn’t stored again, eliminating unnecessary duplication across datasets. This ongoing comparison ensures that storage resources are utilized efficiently while maintaining data integrity.

Unlike traditional compression techniques that often decrease performance, UltiHash’s deduplication runs continuously and is data-type-agnostic, supporting structured, unstructured, and even compressed data. In certain cases, such as with RAW files, tests have shown volume reductions of up to 74%. This makes UltiHash ideal for environments handling large quantities of redundant data, particularly in AI, machine learning, and media-heavy applications.
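The core idea can be illustrated with a short sketch. Note this is a simplified model, not UltiHash's actual implementation: it uses fixed-size chunks and SHA-256 fingerprints for clarity, whereas UltiHash splits objects into fragments of varying sizes at the byte level.

```python
import hashlib

def dedup_store(data: bytes, chunk_size: int, store: dict) -> int:
    """Split data into fixed-size chunks and keep each unique chunk once.
    Returns the number of bytes actually written to the store."""
    written = 0
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint not in store:  # fragment not seen before: store it
            store[fingerprint] = chunk
            written += len(chunk)
    return written

store = {}
payload = b"ABCD" * 1024            # 4 KiB of highly redundant data
written = dedup_store(payload, 1024, store)
print(written, len(payload))        # far fewer bytes stored than received
```

Because every 1024-byte chunk of the payload is identical, only one chunk is physically stored; subsequent identical fragments are recorded as references to it.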

***

### What kinds of data deduplicate well?

UltiHash optimizes data volume out of the box through a built-in deduplication algorithm that eliminates redundancies at a byte level, regardless of data type or format. This results in significant space savings of up to 60% on the entire data volume, depending on various factors including:

* compressed vs uncompressed data format: UltiHash generates up to 75% space savings on uncompressed formats (e.g. RAW, TIFF) and up to 51% on compressed formats (e.g. JPG, PNG)
* similarity between the objects: the higher the similarity, the more space saved

This section documents the space savings generated by UltiHash on different datasets, giving a fair demonstration of UltiHash’s capabilities. The results can be reproduced on any UltiHash cluster.

<table><thead><tr><th width="467">Dataset + link</th><th width="129" align="right">Dataset size</th><th width="144" align="right">Space savings</th></tr></thead><tbody><tr><td><a href="https://www.kaggle.com/datasets/amritpal333/adni4dicomnano10514/">DICOM files of brain MRI scans</a></td><td align="right">1.51 GB</td><td align="right">67.13 %</td></tr><tr><td><a href="https://www.kaggle.com/datasets/zaynena/selfdriving-car-simulator">JPGs of driving scenarios</a></td><td align="right">2.41 GB</td><td align="right">52.32 %</td></tr><tr><td><a href="https://www.kaggle.com/datasets/mhskjelvareid/dagm-2007-competition-dataset-optical-inspection">PNGs of synthetic textures with defects</a></td><td align="right">5.89 GB</td><td align="right">50 %</td></tr><tr><td><a href="https://www.kaggle.com/datasets/barelydedicated/savee-database">WAVs of human speech for emotion recognition</a></td><td align="right">0.33 GB</td><td align="right">50 %</td></tr><tr><td><a href="https://www.kaggle.com/datasets/abireltaief/highresolution-geotiff-images-of-climatic-data">TIFF images of climate data</a></td><td align="right">16.28 GB</td><td align="right">45.84 %</td></tr><tr><td><a href="https://www.kaggle.com/datasets/brsdincer/fossil-segmentation-image-set-microfossil?select=Fossil_Segmentation">TIFFs of fossil segmentations</a></td><td align="right">8.26 GB</td><td align="right">44.64 %</td></tr><tr><td><a href="https://www.kaggle.com/datasets/kaushil268/disease-prediction-using-machine-learning">CSV tables of symptoms</a></td><td align="right">0.0014 GB</td><td align="right">42 %</td></tr><tr><td><a href="https://www.kaggle.com/datasets/kmader/3d-dinosaur-teeth">Models of dinosaur teeth</a></td><td align="right">1.87 GB</td><td align="right">33.09 %</td></tr><tr><td><a href="https://www.kaggle.com/datasets/zilikons/2014-2017-athens-center-cop-data">Parquet files with temperature, humidity, wind and land uses</a></td><td align="right">1.81 GB</td><td align="right">21.59 %</td></tr><tr><td><a href="https://www.kaggle.com/datasets/brsdincer/july-2531-2021-climate-data-nasa">NetCDF climatic and atmospheric data</a></td><td align="right">0.68 GB</td><td align="right">18.65 %</td></tr><tr><td><a href="https://www.kaggle.com/datasets/usharengaraju/pandaset-dataset">LIDAR data of driving scenarios</a></td><td align="right">33.26 GB</td><td align="right">18.31 %</td></tr><tr><td><a href="https://www.kaggle.com/datasets/vangap/indian-supreme-court-judgments">PDFs of Indian supreme court judgements</a></td><td align="right">5.57 GB</td><td align="right">11.95 %</td></tr></tbody></table>
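The figures above were measured on UltiHash clusters. As a rough local proxy, a fixed-size chunk fingerprint scan can estimate how much redundancy a dataset carries before uploading it. This sketch is illustrative only: UltiHash deduplicates at the byte level with variable-size fragments, so its actual savings will differ.

```python
import hashlib
from pathlib import Path

def estimate_savings(directory: str, chunk_size: int = 4096) -> float:
    """Estimate deduplication potential: the fraction of chunk bytes
    that are duplicates of a previously seen chunk."""
    seen = set()
    total = unique = 0
    for path in Path(directory).rglob("*"):
        if not path.is_file():
            continue
        data = path.read_bytes()
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            total += len(chunk)
            fingerprint = hashlib.sha256(chunk).digest()
            if fingerprint not in seen:  # first occurrence of this chunk
                seen.add(fingerprint)
                unique += len(chunk)
    return 1 - unique / total if total else 0.0
```

For example, a directory containing two identical files yields an estimate of 0.5, since half of all chunk bytes are duplicates.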

***

### How to disable deduplication <a href="#disable-deduplication" id="disable-deduplication"></a>

UltiHash includes an integrated deduplication service that is enabled by default. It is recommended for most workloads, especially read-intensive ones, as it does not introduce latency during read operations. However, for write-intensive workloads where the overhead of deduplication is not needed, you can disable the service for the entire storage cluster.

To disable deduplication, update the Helm chart values by setting the `deduplicator.enabled` flag to `false`, then apply the change to the Helm release:

```
helm upgrade <release_name> oci://registry.ultihash.io/stable/ultihash-cluster \
  -n <namespace> \
  --set deduplicator.enabled=false
```

You can also make this change by editing `values.yaml` directly:

```
deduplicator:
  enabled: false
```

Then reapply the release:

```
helm upgrade <release_name> oci://registry.ultihash.io/stable/ultihash-cluster \
  -n <namespace> \
  --values values.yaml
```

The deduplicator service can be disabled or re-enabled at any time by updating the Helm chart values and upgrading the release.
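To verify which value the deployed release is currently using, you can inspect the computed chart values. This is a sketch: `<release_name>` and `<namespace>` are placeholders for your own release and namespace, and the grep pattern assumes the flag appears under a `deduplicator` key as in the example above.

```shell
# Show the effective values of the release and filter for the deduplicator flag
helm get values <release_name> -n <namespace> --all | grep -A1 "deduplicator"
```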

***

### How to access deduplication metrics <a href="#deduplication-metrics" id="deduplication-metrics"></a>

UltiHash extends the standard S3 API with features like deduplication metrics, which let you query the effective size of your data after deduplication. This is useful for optimizing storage and understanding your actual storage usage.

`get_effective_size.py` : <https://github.com/UltiHash/scripts/tree/main/boto3/ultihash_info>

```
# Fetch the script and retrieve the deduplicated (effective) data size
git clone https://github.com/UltiHash/scripts.git
cd scripts/boto3/ultihash_info
python3 get_effective_size.py --url <cluster-url>
```

