It wasn't me. You can't prove anything.


2011-07-26

Shrinking data

When you dd a partition into gzip and out to a file, the blank space still has to be compressed, and it still takes up room in the image. You could use bzip2 instead. It takes something like 10 times as long to compress, but the blank space gets compressed much more efficiently.
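To put rough numbers on the blank-space point, here is a small Python sketch. It assumes the blank space really is all zero bytes, which is true of a freshly zeroed disk but not necessarily of free space that once held files. The function name compressed_sizes and the 16 MiB buffer are made up for illustration; zlib is used because it implements the same DEFLATE method gzip uses.

```python
import bz2
import zlib


def compressed_sizes(data: bytes) -> dict:
    """Return the size of data raw, under DEFLATE (gzip's method), and under bzip2."""
    return {
        "raw": len(data),
        "deflate": len(zlib.compress(data, 9)),  # level 9 = best compression
        "bzip2": len(bz2.compress(data, 9)),     # level 9 = 900 KB blocks
    }


# Assumption: 16 MiB of zero bytes stands in for the blank part of a partition.
blank = bytes(16 * 1024 * 1024)
sizes = compressed_sizes(blank)

for name, size in sizes.items():
    print(f"{name:8s} {size} bytes")
```

On all-zero input both tools shrink the data enormously, but the bzip2 result comes out far smaller than the DEFLATE one. Real partitions with leftover junk in the free space will not compress nearly this well with either tool.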

The gzip program uses the same DEFLATE method that most zip files use, so the math is the same even though the file formats differ. It looks back over a fixed-size window of recent data (32 KB), replaces repeats with short back-references, and then Huffman-codes the result. The same math and window are applied to the whole stream you feed the program. It has been working for a couple of decades.

The bzip2 program takes a different approach. It works on much larger blocks, anywhere from 100 KB to 900 KB depending on the compression level, and runs each block through several stages (a Burrows-Wheeler sort, a move-to-front pass, and Huffman coding) to squeeze the air out of the data. This means the program must go over the data several times, so at compression time you are doing far more work per byte than gzip does. Sometimes it pays off. I'm not sure how long bzip2 has been around. It is a nice tool in the compression toolbox.

I've done this with hard drives, flash drives, RAM and just about every other thing that counts as computer storage. The dd command is probably the most exceptional Linux command ever. It just moves plain ones and zeros from here to there. It is about the lowest-level command you have at the command line. You just about have to write machine code to get lower. Swaths of the operating system are laid out so that dd can work on them. That is why there is a /dev after all, so you can use dd and related commands to move things around.
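For anyone who wants to see what "ones and zeros from here to there" looks like, here is a rough Python sketch of the same idea as something like dd if=/dev/sdX bs=1M | gzip > image.gz (the device name is a placeholder). The function image_to_gzip and the file names are hypothetical, and the demo runs on a throwaway file rather than a real device, so treat it as an illustration and not a replacement for dd.

```python
import gzip
import os
import tempfile


def image_to_gzip(src_path, dst_path, block_size=1024 * 1024):
    """Copy src_path into a gzip file one block at a time; return bytes copied."""
    copied = 0
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        while True:
            block = src.read(block_size)  # like dd's bs=1M
            if not block:                 # empty read means end of input
                break
            dst.write(block)
            copied += len(block)
    return copied


# Demo on a fake "partition": a little data followed by 4 MiB of blank space.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "fake_partition.img")
    image = os.path.join(tmp, "fake_partition.img.gz")
    with open(source, "wb") as f:
        f.write(b"some data" + bytes(4 * 1024 * 1024))

    copied = image_to_gzip(source, image)
    image_size = os.path.getsize(image)

    # Read the image back and confirm it matches the source byte for byte.
    with open(source, "rb") as f:
        original = f.read()
    with gzip.open(image, "rb") as f:
        restored_ok = f.read() == original

    print(f"copied {copied} bytes into a {image_size} byte image")
```

Reading in fixed-size blocks keeps memory use flat no matter how big the source is, which is the same reason dd takes a block size.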

If you understand any part of what I've just said, congratulations. You are a Linux geek.

What I'm talking about is not just adding files to a compressed folder or downloading a file and then copying files out of it for use. It turns out there is far more to data compression than this. I remember the days of ARC. Then came the days of "Down with ARC" because the guy who wrote ARC got nothing for it. The licensing was nuts too. This led directly to ZIP. That lasted for a million years. Now, it is anything goes. BZ2 is just one of a million different compression schemes. Some are designed specifically for compressing whole mounted drives. Some are just for compressing streamed data. I haven't even gotten to the idea of lossy compression. That is another universe beyond what we are talking about.
