
In order to better support our growing community, we've set up a new, more powerful forum.

The new forum is at: http://community.covecube.com

The new forum is running IP.Board and will be our primary forum from now on.

This forum is being retired, but will remain online indefinitely in order to preserve its contents. This forum is now read-only.

Thank you,

Server 2012 Data deduplication support?

edited May 2013 in DrivePool
Hello there,

I am asking about Microsoft's technology for reducing disk usage by finding duplicated data, not the folder duplication feature of DrivePool.

If this is supported, I am wondering whether it can be activated either on the physical drives pooled in DrivePool, or directly on the pool drive...

Comments

  • edited May 2013 Member
    For those who are interested, I am currently testing this.
    I created a pool composed of two drives (D: and E:).
    D: is used as a landing zone with the Archive Optimizer plugin.
    E: is used as an archive disk.
    I activated deduplication and BitLocker on E: only.
    I will keep you posted on whether it works correctly...
  • Covecube
    It is possible on the pool parts, but not on the actual pool.

    However, by doing so, you defeat the entire point of duplication, as the "duplicates" are "hard linked" (i.e., file-system links) to the other file.
  • Member
    I created a thread on May 5th with the same question :)
    It never got an answer.


    What I saw when testing was that it doesn't work: any file that has been de-duplicated becomes unreadable through the pool (CoveFS).


    I don't quite understand what you mean, Drashna.
    Duplication and de-duplication serve different purposes, and one does not exclude the other.

    Duplication copies a file so that it exists on more than one physical disk.
    De-duplication removes duplicate copies of the same data within a single physical disk.


    So essentially, instead of having 100 copies of the same data on Disk A, I could have one copy on Disk A and one copy on Disk B, saving 98 copies' worth of disk space and increasing reliability at the same time.
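The arithmetic above can be sketched in a few lines of Python. This is a hypothetical illustration (made-up file names and contents, not DrivePool or Windows internals): per-disk dedup collapses the 100 identical files on Disk A to one stored copy, while pool duplication keeps a second copy on Disk B.

```python
import hashlib

# Hypothetical sketch, not DrivePool/Windows internals: 100 identical
# files on Disk A, deduped within the disk by content hash, plus one
# duplicated copy kept on Disk B for reliability.
content = b"same data" * 1024
disk_a = {f"copy_{i}.bin": content for i in range(100)}

def unique_copies(disk):
    """Per-disk dedup: each distinct content hash is stored once."""
    return len({hashlib.sha256(data).hexdigest() for data in disk.values()})

stored_on_a = unique_copies(disk_a)  # 100 identical files -> 1 stored copy
stored_on_b = 1                      # pool duplication: one copy on Disk B
saved = len(disk_a) - (stored_on_a + stored_on_b)
print(stored_on_a, stored_on_b, saved)  # 1 1 98
```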
  • Covecube
    Ah, I actually wasn't aware it was per disk. Most of the references I've seen refer to the "system" in general and not to individual disks.

    Though you answered the other part: they don't play well together.


    That, and I'm kind of OCD about organizing things, so I've never really needed it.
    Also, if you want, there is the "Snoop-de-Dupe" add-in for WHS2011/SBS2011E that also supports Server 2012 Essentials. It will scan whatever folder/drive you specify, find duplicate files, and let you manage them (delete/move/ignore/etc.). It's a paid add-in with a trial, and may be worth at least checking out.
  • Member
    @Henrik: So in your tests, de-duplicating files onto the archive disks (not the pool directly) makes them disappear from the pool drive?

    @Drashna: Snoop-de-Dupe does not serve the same purpose. I do not want to remove duplicates on a single HDD; I want them to be handled automatically. When you do software development, you can have a lot of duplicated files, libraries, and sources, even the build system, present in many places intentionally. I do not want to get rid of them; I want the file system to reduce their storage impact. For my documents it is another story: I want the FS to handle my mess for me ;)
  • edited May 2013 Member
    @Henrik

    Just performed my own test.
    I have one pool with one feeder disk (without data dedup) and one archive disk (with dedup activated).
    I copied one MP3 folder onto my pool, containing 16 files for 74.5MB.
    Then I copied the same folder 3 more times to various places on my pool, for a total of 298MB.

    After that, I forced the balancing in DrivePool.
    After waiting a little, I checked that all 4 copies had been moved to my archive disk.
    Once done, I forced the deduplication.

    Checking the dedup stats, they indicate that 224MB was saved on the disk (which corresponds to 3x74.5MB).
    So dedup works on my archive disk.
    Now let us check that the files are still accessible on the pool drive.
    Going there, I copied each of the 4 copies to one more place.
    Everything went fine, no errors, so I assume that the files are all accessible.

    Voila, it seems that dedup for archive disks works, doesn't it?
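The reported saving is consistent with keeping a single copy of the folder's data: a quick check in Python, using the figures from the test above.

```python
# Figures from the test above: one 74.5MB folder copied 4 times onto the pool.
copy_mb = 74.5
copies = 4
total_mb = copies * copy_mb         # 298.0 MB placed on the pool
saved_mb = (copies - 1) * copy_mb   # 223.5 MB, shown rounded as 224MB
print(total_mb, saved_mb)           # 298.0 223.5
```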
  • Member
    @Drashna
    De-duplication and single-instance storage are not the same thing. Single-instance storage only looks for identical files, while de-duplication looks at the data within files to find similar data. NetApp, for example, uses a 4KB block size, and Windows Server 2012 uses a variable block size with an average of about 80KB.

    For example, I have a lot of ISOs containing Office installation media. Each ISO is unique, but there is a lot of duplicated data within those ISOs.

    Regarding the other part: yes, Server 2012 de-duplication operates per NTFS volume. So if you have several NTFS volumes, they will de-duplicate individually.

    What I found out was that once a file had been de-duplicated it was visible just fine, but when trying to read the file through the pool drive I got a read error. Reading it directly from the NTFS drive still worked, of course, and running an unoptimization job to turn it back into a "normal" file made it readable through the pool drive as well.

    Are you saying that on your machine files were still readable through the pool after being de-duplicated?
    (And did you check, using the Measure-DedupFileMetadata cmdlet, that the files were indeed de-duplicated? By default, as you probably know, only files older than 5 days are deduped.)
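The distinction Henrik draws can be sketched in Python. This is a hypothetical illustration with made-up file contents and a fixed 4KB chunk size for simplicity (Server 2012 actually uses variable-size chunks averaging ~80KB, as noted above): two ISOs that differ as whole files still share most of their chunks, so single-instance storage saves nothing while block-level dedup does.

```python
import hashlib

CHUNK = 4096  # fixed-size chunks for this sketch only; Server 2012 uses
              # variable-size chunks averaging ~80KB

def chunks(data, size=CHUNK):
    """Split data into fixed-size chunks (last chunk may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

# Two hypothetical "ISOs": distinct files sharing a large common region.
shared = b"office setup files " * 2048
iso_a = shared + b"A-only payload " * 512
iso_b = shared + b"B-only payload " * 512

# Single-instance storage compares whole files: these differ, no savings.
assert hashlib.sha256(iso_a).digest() != hashlib.sha256(iso_b).digest()

# Block-level dedup stores each distinct chunk only once.
all_chunks = chunks(iso_a) + chunks(iso_b)
unique_chunks = {hashlib.sha256(c).digest() for c in all_chunks}
print(len(all_chunks), len(unique_chunks))  # far fewer unique chunks than total
```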
  • edited May 2013 Member
    @Henrik

    I did the test by copying the de-duplicated files again, so I am assuming that the copy would have failed with an error if the files were unreadable.
    To force deduplication, I set the "older than" parameter to 0 and scheduled a dedup operation 1 minute later. Checking Task Manager, I verified that the dedup process had been activated. Once done, the amount of saved space went from 0 to 224MB, so I assumed that the files I wanted to dedup had been correctly processed (note that the drive was empty apart from those files).

    What is the cmdlet you are talking about? I am interested but do not know how to use it.
  • Member
    I tried it again and I came up with the same thing. Any files on the underlying NTFS volume that have been de-duplicated are unreadable through the pool drive.

    I don't know if this is because the actual data is no longer under the PoolPart folder but under the System Volume Information folder, and DrivePool can't read it there, or if it's some other limitation.

    Of course if it does work for you that would be very interesting since it doesn't for me :)

    Try this in a PowerShell window:
    get-help measure-dedupfilemetadata -full

    A file that has been de-duplicated will have a SizeOnDisk equal to your cluster size (normally 4KB); the actual amount of storage used is DedupSize, and DedupDistinctSize is the unique data in this file that is not shared with any other file (i.e., how much you would recover if you deleted the file).

    If you need more info on the de-duplication cmdlets, then Google is your friend :)
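As a worked example of reading the fields described above, here is a small Python sketch. The numbers are hypothetical, invented for illustration, not real Measure-DedupFileMetadata output.

```python
# Hypothetical values modeling the fields described above; NOT output
# from the real Measure-DedupFileMetadata cmdlet.
cluster_size = 4 * 1024                # SizeOnDisk: one cluster for the deduped file stub
dedup_size = 60 * 1024 * 1024          # DedupSize: chunk-store data this file references
dedup_distinct_size = 5 * 1024 * 1024  # DedupDistinctSize: chunks only this file uses

# Data shared with other files vs. space freed if this file were deleted,
# per the field descriptions in the post above.
shared_with_other_files = dedup_size - dedup_distinct_size
reclaimed_if_deleted = dedup_distinct_size
print(cluster_size, shared_with_other_files, reclaimed_if_deleted)
```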
  • Member
    Strange: retesting this now, once I have removed all the duplicates, it fails.
    So apparently it is the same for me as for you, Henrik (but in the meantime I have reinstalled my whole server and do not have the same accounts as before, so file permissions may play a role).

    For now I will deactivate this dedup thing... too bad, it would be a soooo great addition to DrivePool!