arcserve-KB : How does the arcserve Backup Deduplication option work?

Last Update: 2015-12-15 22:08:57 UTC


Last Modified Date:    06/16/2009
Document ID:    TEC490803
Tech Document
Title:  How does the CA ARCserve Backup Deduplication option work?




How Data Deduplication Works

Data deduplication is a technology that lets you fit more backups on the same physical media, retain backups for longer periods of time, and speed up data recovery. Deduplication analyzes the data streams sent for backup, looking for duplicate 'chunks.' Only unique chunks are saved to disk; duplicates are tracked in special index files.

In CA ARCserve Backup, deduplication is an in-line process that occurs at the backup server, within a single session.

During the first backup:

  • CA ARCserve Backup scans incoming data and segments it into chunks. This process occurs in the SIS layer of the Tape Engine.

  • CA ARCserve Backup executes a hashing algorithm that assigns a unique value to each chunk of data and saves those values to a hash file.

  • CA ARCserve Backup compares hash values. When duplicates are found, data is written to disk only once, and a reference is added to a reference file pointing back to the storage location of the first identified instance of that data chunk.
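The three steps above can be sketched in Python. This is a minimal illustration, not the actual SIS-layer implementation: the fixed 4 KB chunk size and the SHA-256 hash are assumptions, since the article does not name the chunking scheme or hashing algorithm.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size; the real SIS-layer chunking is internal


def dedup_backup(stream: bytes, hash_index: dict, data_file: bytearray, references: list):
    """Split a stream into chunks, hash each one, and store only unique chunks.

    hash_index maps each chunk's hash to its offset in data_file (the "hash file").
    references records, per chunk, the offset where its data lives (the "reference file").
    """
    for start in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[start:start + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()   # hashing step: assign a value to the chunk
        if digest not in hash_index:                 # first occurrence: write the data once
            hash_index[digest] = len(data_file)
            data_file.extend(chunk)
        references.append(hash_index[digest])        # duplicates only add a reference entry
```

Feeding this function a stream with repeated chunks shows the effect: the data file grows only for unique chunks, while the reference file gains one entry per chunk regardless.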

In the diagram below, the disk space needed to back up this data stream is smaller for a deduplication backup job than for a regular backup job.

Figure 1: Disk space used by a deduplication backup job compared to a regular backup job.

With deduplication, three files are created for every backup session:

  • Index Files (Metadata files)

  • Hash files -- store the hash values assigned to each chunk of data.

  • Reference files -- count hash occurrences and store the addresses in the data files that correspond to each hash.

  • Data files -- store the unique instances of the data you backed up. The two index files together consume a small percentage of the total data store, so the speed of the drive that stores these files matters more than its size. Consider a solid-state disk or similar device with excellent seek times for this purpose.

During subsequent backups:

  • CA ARCserve Backup scans incoming data and breaks it into chunks.

  • CA ARCserve Backup executes the hashing algorithm to assign hash values.

  • CA ARCserve Backup compares new hash values to previous values, looking for duplicates. When duplicates are found, data is not written to disk. Instead, the reference file is updated with the storage location of the original instance of the data chunk.
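The subsequent-backup path differs from the first backup only in that the hash index already holds entries, so duplicate chunks add reference entries without writing data. A minimal sketch, again assuming fixed-size chunks and SHA-256 (neither is specified by the article):

```python
import hashlib

CHUNK_SIZE = 4096  # assumed chunk size for illustration


def backup(stream: bytes, hash_index: dict, data_file: bytearray, references: list):
    """Same chunk-and-hash pass as the first backup; the index persists between runs."""
    for start in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[start:start + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest in hash_index:                    # duplicate found: no data is written,
            references.append(hash_index[digest])   # only the reference file is updated
        else:
            hash_index[digest] = len(data_file)     # new chunk: store it and index it
            data_file.extend(chunk)
            references.append(hash_index[digest])
```

Running this twice on the same stream with a shared index demonstrates the point: the second run adds nothing to the data file, because every chunk is already indexed.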

When you restore deduplicated data, CA ARCserve Backup refers to the index files to identify and then locate each chunk of data needed to reassemble the original data stream.
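The restore path can be sketched as the inverse of the backup pass: walk the reference file in order and pull each chunk from the data file at the recorded offset. This round-trip sketch reuses the same assumed fixed-size chunking and SHA-256 hashing as above; the real product's file formats are not documented here.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed; streams here are chunk-aligned so restore can slice fixed lengths


def backup(stream: bytes, hash_index: dict, data_file: bytearray, references: list):
    """Minimal dedup pass: store unique chunks once, record an offset per chunk."""
    for start in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[start:start + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in hash_index:
            hash_index[digest] = len(data_file)
            data_file.extend(chunk)
        references.append(hash_index[digest])


def restore(references: list, data_file: bytearray) -> bytes:
    """Reassemble the original stream: each reference locates one chunk in the data file."""
    return b"".join(bytes(data_file[off:off + CHUNK_SIZE]) for off in references)
```

Because the reference file preserves chunk order, the restored stream is byte-identical to the original even though duplicate chunks were stored only once.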
