Introduction
Data in a computer are stored as bytes. The 26 letters of the Roman
alphabet, A to Z, are stored as ASCII characters. ASCII is a
fixed-length encoding method: 8 bits (1 byte) are used for each
character. Such files can be further compressed to save storage space.
C++ Data Types
char : 1 byte
unsigned __int8: 1 byte
short : 2 bytes
int : 4 bytes
long : 4 bytes
float : 4 bytes
double : 8 bytes
Custom non-Huffman Coding Compression - Sorting by Frequency of Occurrence
This compression method uses the variable-length code table shown
below to encode the alphabet source file. The table is derived from
the estimated probability, or frequency of occurrence, of each letter.
The method is simple and direct: more frequently occurring letters are
represented with fewer bits, while less frequent letters are encoded
with longer bit strings.
The letters are sorted by their frequency of occurrence and are
separated by an empty bit, '0', added as a delimiter between two
consecutive letters during encoding. The delimiter is shown as item
no. 1 in the table below.
No. | Letter         | Frequency (%) | Encoding bits (binary)
 1  | '' (delimiter) |               | 0
 2  | T              | 12.8          | 1
 3  | E              | 12.8          | 11
 4  | A              |  8.1          | 111
 5  | O              |  7.6          | 1111
 6  | I              |  7.1          | 1 1111
 7  | N              |  6.9          | 11 1111
 8  | S              |  6.4          | 111 1111
 9  | H              |  6.1          | 1111 1111
10  | R              |  6.1          | 1 1111 1111
11  | D              |  4.3          | 11 1111 1111
12  | L              |  4.0          | 111 1111 1111
13  | C              |  2.8          | 1111 1111 1111
14  | U              |  2.8          | 1 1111 1111 1111
15  | M              |  2.4          | 11 1111 1111 1111
16  | W              |  2.4          | 111 1111 1111 1111
17  | F              |  2.3          | 1111 1111 1111 1111
18  | G              |  2.0          | 1 1111 1111 1111 1111
19  | P              |  2.0          | 11 1111 1111 1111 1111
20  | Y              |  2.0          | 111 1111 1111 1111 1111
21  | B              |  1.5          | 1111 1111 1111 1111 1111
22  | V              |  1.0          | 1 1111 1111 1111 1111 1111
23  | J              |  0.2          | 11 1111 1111 1111 1111 1111
24  | K              |  0.1          | 111 1111 1111 1111 1111 1111
25  | Q              |  0.1          | 1111 1111 1111 1111 1111 1111
26  | X              |  0.1          | 1 1111 1111 1111 1111 1111 1111
27  | Z              |  0.1          | 11 1111 1111 1111 1111 1111 1111
As we can see from the table above, this method has disadvantages. The
longest code is 26 bits, for the letter 'Z', and a dummy '0' bit is
added between every two letters as a delimiter, so 1 bit is spent per
letter on separation alone. If the file contains many infrequent
letters, the compressed output file will be bigger than the source
file. Overall, Huffman coding is therefore better than this customized
compression method.
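The scheme above can be sketched as a short encoder. This is a minimal illustration, not the article's `non-Huffman Zipping.cpp`: each letter of rank r (T = 1, E = 2, ..., Z = 26) is written as r '1' bits followed by a '0' delimiter. The function and constant names are my own, and the input is assumed to contain only the capital letters from the table.

```cpp
#include <string>

// Letters ordered by descending frequency, as in the table:
// rank 1 = 'T' -> "1", rank 2 = 'E' -> "11", ... rank 26 = 'Z' -> 26 ones.
const std::string kOrder = "TEAOINSHRDLCUMWFGPYBVJKQXZ";

// Encode a string of capital letters A-Z; a '0' delimiter follows
// each letter's run of '1' bits. Input is assumed valid (A-Z only).
std::string encode(const std::string& text) {
    std::string out;
    for (char c : text) {
        std::size_t rank = kOrder.find(c) + 1;  // 1-based rank in the table
        out.append(rank, '1');                  // 'rank' ones
        out.push_back('0');                     // delimiter bit
    }
    return out;
}
```

For example, "TEA" encodes to "1" + "0" + "11" + "0" + "111" + "0", and a lone 'Z' costs 26 '1' bits plus the delimiter, matching the worst case noted above.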
Bitwise Operators
1) When we want to set specific bits in a byte, we can use the
bitwise OR operator ('|'):
//C++ bitwise OR operator
0x11 == 0x01 | 0x10
2) When we want to check whether specific bits in a byte are ON, we
can use the bitwise AND operator ('&'):
//C++ bitwise AND operator
0x04 == 0x07 & 0x04
3) The left shift operator ('<<') moves all the bits of a number a
specified number of places to the left:
//C++ bitwise left shift operator
00100000 == 00001000 << 2
4) The right shift operator ('>>') moves all the bits of a number a
specified number of places to the right:
//C++ bitwise right shift operator
00000010 == 00001000 >> 2
Sample Visual C++ programming code:
- Compression C++ programming source code: non-Huffman Zipping.cpp
- Decompression C++ programming source code: non-Huffman unzipping.cpp
Huffman Coding Compression
Huffman code is a type of optimal prefix code that is commonly used
for lossless data compression. The output of Huffman's algorithm can
be viewed as a variable-length code table for encoding a source symbol
(such as a character in a file). Huffman coding derives this table
from the estimated probability or frequency of occurrence (weight) of
each possible value of the source symbol. More common symbols are
represented using fewer bits than less common symbols.
Because frequently occurring letters occupy fewer bits, the file is
expected to take less space overall.
Huffman decoding can also be efficient: since codes are assigned by
character frequency, frequently used source symbols decode faster than
rarely occurring ones.
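The table-building step can be sketched as follows. This is a minimal illustration of the standard Huffman construction, not the article's HuffZipping-10.cpp: the two lowest-weight nodes are repeatedly merged until one tree remains, then the tree is walked to assign '0' to left edges and '1' to right edges. The names (`Node`, `buildCodes`) are my own; it assumes at least two distinct symbols and, being a sketch, does not free the nodes.

```cpp
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct Node {
    char symbol;            // meaningful only for leaves
    int weight;
    Node* left = nullptr;
    Node* right = nullptr;
};

struct Cmp {                // min-heap ordering on weight
    bool operator()(const Node* a, const Node* b) const {
        return a->weight > b->weight;
    }
};

// Build the Huffman tree from symbol weights and collect the code
// table: left edges emit '0', right edges emit '1'.
// Assumes freq contains at least two symbols.
std::map<char, std::string> buildCodes(const std::map<char, int>& freq) {
    std::priority_queue<Node*, std::vector<Node*>, Cmp> pq;
    for (const auto& kv : freq) pq.push(new Node{kv.first, kv.second});
    while (pq.size() > 1) {
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{0, a->weight + b->weight, a, b});  // internal node
    }
    std::map<char, std::string> codes;
    std::vector<std::pair<Node*, std::string>> stack{{pq.top(), ""}};
    while (!stack.empty()) {
        auto [n, code] = stack.back(); stack.pop_back();
        if (!n->left && !n->right) { codes[n->symbol] = code; continue; }
        stack.push_back({n->left, code + "0"});
        stack.push_back({n->right, code + "1"});
    }
    return codes;
}
```

With weights {'A': 5, 'B': 2, 'C': 1}, the frequent 'A' receives a 1-bit code while 'B' and 'C' receive 2-bit codes, which is exactly the "more common symbols use fewer bits" property described above.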
Things to take note of during Huffman coding programming
- Huffman codes whose bits sum to 0 (e.g. '00', '0000', '0000000') are
not feasible in this implementation, although they are theoretically
possible; they cause a programming error when decoding. However, if
such a code's run of '0' bits is longer than or equal to 1 byte
(e.g. '0 0000 0000'), the implementation becomes feasible again.
- The line feed '\n' is actually an ASCII code, hexadecimal 0x0A.
Therefore '\n' should be encoded like any other character during
programming.
- When reading data from the file, the encoder and decoder software
should read all the data from the file at once instead of line by
line. As mentioned before, the line feed '\n' is itself an ASCII
character, and a programming error will occur if the software is
written to read line by line.
The points above detoured me for a while during code writing.
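The read-all-at-once approach can be sketched like this, assuming a helper of my own naming. Opening the stream in binary mode ensures '\n' (0x0A) bytes pass through unmodified and are encoded like any other character:

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Read the entire file in one go, in binary mode, so '\n' bytes are
// kept as ordinary data instead of acting as line terminators.
std::string readAll(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::ostringstream buf;
    buf << in.rdbuf();      // stream the whole file buffer at once
    return buf.str();
}
```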
By implementing Huffman coding, we can also encrypt data files: only
we can decode them, as long as we keep the specific Huffman tree we
are using secret.
1) Sample Visual C++ programming code #1 - for txt files containing
only letters and line feeds
Huffman Coding Compression source code:
Encoding source code: HuffZipping-10.cpp, Huff_Zipping-10.exe
Decoding source code: Huff_Unzipping-4.cpp, Huff_Unzipping-4.exe
The executable 'Huff_Zipping-10.exe' encodes files containing letters
and line feeds, while the executable 'Huff_Unzipping-4.exe' decodes
them and restores the original files.
Only the capital letters below can be used for encoding in this
example code #1: 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J',
'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
'Y', 'Z' and the line feed '\n'.

Huffman Tree graph
- Alphabet only for sample programming code #1
Pros and Cons of Huffman Coding
1) If the file's character frequencies do not follow the standard
distribution, the Huffman-coded output file might be larger than the
original file, which defeats the purpose of compressing it. Therefore
compression using the Huffman coding method might not be workable
every time.
2) Huffman coding can encode any type of character, in ASCII and other
encodings.
3) Unlike other encryption methods, which only encrypt characters down
to the byte level, Huffman coding encodes characters down to the bit
level, which makes the encoded output files highly secure.
4) By making the Huffman encoding tree not follow the standard,
predictable character frequency distribution, we can make the output
encryption file very difficult to decode. For instance, the character
'E' has a very high frequency of occurrence (12.8); we can break it
into 10 separate tree nodes that all refer to the character 'E'. This
way, people will not be able to decode the encrypted file based on
character frequencies.