C++: ASCII File Encryption using Huffman Coding

 

Introduction

Data stored in a computer is organised in bytes. The 26 letters of the Roman alphabet, A, B, C through Z, are stored as ASCII characters. ASCII is a fixed-length encoding: 8 bits (1 byte) are used for each character.

These files can be further compressed to save storage space.
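
For illustration only (this snippet is not part of the sample programs below), a character and its 1-byte ASCII value can be inspected like this:

//C++  Each character is stored as one 8-bit ASCII value
#include <iostream>

int main()
{
    char c = 'A';
    std::cout << "sizeof(char) = " << sizeof(char) << " byte" << std::endl;        // always 1
    std::cout << "'A' is stored as " << static_cast<int>(c) << " (0x41)" << std::endl;
    return 0;
}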

 

C++ Data Types
char            : 1 byte
unsigned __int8 : 1 byte
short           : 2 bytes
int             : 4 bytes
long            : 4 bytes
float           : 4 bytes
double          : 8 bytes
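
The sizes listed above match a typical Visual C++ build on Windows; on other platforms some of them (notably long) may differ. A quick check, written here purely as an illustration:

//C++  Checking the sizes of the data types listed above
#include <iostream>

int main()
{
    std::cout << "char   : " << sizeof(char)   << std::endl;   // 1
    std::cout << "short  : " << sizeof(short)  << std::endl;   // 2
    std::cout << "int    : " << sizeof(int)    << std::endl;   // 4
    std::cout << "long   : " << sizeof(long)   << std::endl;   // 4 on Windows
    std::cout << "float  : " << sizeof(float)  << std::endl;   // 4
    std::cout << "double : " << sizeof(double) << std::endl;   // 8
    return 0;
}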

 

Custom non-Huffman Coding compression - Sorting by frequency of occurrence

This compression method uses the variable-length code table shown below to encode an alphabet-only source file. The table is derived from the estimated probability or frequency of occurrence of each letter. The method is simple and direct: more common letters are represented with fewer bits, while less frequently used letters are encoded with longer bit strings.

The letters are sorted by their frequency of occurrence, and a single '0' bit is inserted as a delimiter between two encoded letters. The delimiter is shown as item no. 1 in the table below.

No. | Alphabet       | Frequency | Encoding bits (binary)
----+----------------+-----------+---------------------------------
  1 | '' (delimiter) |           | 0
  2 | T              | 12.8      | 1
  3 | E              | 12.8      | 11
  4 | A              | 8.1       | 111
  5 | O              | 7.6       | 1111
  6 | I              | 7.1       | 1 1111
  7 | N              | 6.9       | 11 1111
  8 | S              | 6.4       | 111 1111
  9 | H              | 6.1       | 1111 1111
 10 | R              | 6.1       | 1 1111 1111
 11 | D              | 4.3       | 11 1111 1111
 12 | L              | 4.0       | 111 1111 1111
 13 | C              | 2.8       | 1111 1111 1111
 14 | U              | 2.8       | 1 1111 1111 1111
 15 | M              | 2.4       | 11 1111 1111 1111
 16 | W              | 2.4       | 111 1111 1111 1111
 17 | F              | 2.3       | 1111 1111 1111 1111
 18 | G              | 2.0       | 1 1111 1111 1111 1111
 19 | P              | 2.0       | 11 1111 1111 1111 1111
 20 | Y              | 2.0       | 111 1111 1111 1111 1111
 21 | B              | 1.5       | 1111 1111 1111 1111 1111
 22 | V              | 1.0       | 1 1111 1111 1111 1111 1111
 23 | J              | 0.2       | 11 1111 1111 1111 1111 1111
 24 | K              | 0.1       | 111 1111 1111 1111 1111 1111
 25 | Q              | 0.1       | 1111 1111 1111 1111 1111 1111
 26 | X              | 0.1       | 1 1111 1111 1111 1111 1111 1111
 27 | Z              | 0.1       | 11 1111 1111 1111 1111 1111 1111

As the table shows, this method has disadvantages: the longest code is 26 bits, for the letter 'Z', and one dummy '0' bit is spent as a delimiter between every two letters, so 1 bit is wasted per letter. If the file contains many infrequently used letters, the compressed output will be bigger than the source file. Huffman coding is therefore better overall than this customised compression method.
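
Below is a minimal sketch of the idea (my own illustration; it is not the code from non-Huffman Zipping.cpp): each letter is written as a run of '1' bits equal to its position in the frequency-sorted table, followed by the single '0' delimiter.

//C++  Sketch of the frequency-sorted, delimiter-based encoding
#include <iostream>
#include <string>

// Letters ordered as in the frequency table above, most frequent first.
const std::string kSortedLetters = "TEAOINSHRDLCUMWFGPYBVJKQXZ";

// Encode one capital letter as a run of '1' bits followed by the '0' delimiter.
std::string EncodeLetter(char c)
{
    std::size_t rank = kSortedLetters.find(c);     // 0 for 'T', 1 for 'E', ...
    if (rank == std::string::npos) return "";      // skip unsupported characters
    return std::string(rank + 1, '1') + '0';       // 'T' -> "10", 'E' -> "110", 'A' -> "1110"
}

int main()
{
    std::string text = "TEA";
    std::string bits;
    for (char c : text)
        bits += EncodeLetter(c);
    std::cout << bits << std::endl;                // prints 101101110
    return 0;
}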

 

Bitwise Operators

1) To set specific bits in a byte, we can use the bitwise OR operator ('|')

//C++  Bitwise OR operator
0x01 | 0x10    // = 0x11

2) To check whether a specific bit in a byte is ON, we can use the bitwise AND operator ('&')

//C++  Bitwise AND operator
0x07 & 0x04    // = 0x04, so the tested bit is ON

3) The left shift operator ('<<') moves all the bits of a number a specified number of places to the left

//C++  Bitwise left shift operator
0b00001000 << 2    // = 0b00100000

4) The right shift operator ('>>') moves all the bits of a number a specified number of places to the right

//C++  Bitwise right shift operator
0b00001000 >> 2    // = 0b00000010
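
Both compression programs rely on these operators to pack variable-length codes into whole bytes before writing them to the output file. A small stand-alone illustration (my own sketch, not taken from the sample code):

//C++  Setting, testing and shifting bits in a byte
#include <iostream>

int main()
{
    unsigned char b = 0;

    b |= (1 << 7);                           // set bit 7 -> 1000 0000
    b |= (1 << 2);                           // set bit 2 -> 1000 0100

    bool bit7 = (b & (1 << 7)) != 0;         // is bit 7 set?  yes
    bool bit0 = (b & (1 << 0)) != 0;         // is bit 0 set?  no

    std::cout << std::hex << static_cast<int>(b) << std::endl;   // prints 84
    std::cout << bit7 << " " << bit0 << std::endl;               // prints 1 0
    return 0;
}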

 

Sample Visual C++ programming code:

  1. Compression source code: non-Huffman Zipping.cpp
  2. Decompression source code: non-Huffman unzipping.cpp

 


Huffman Coding Compression

A Huffman code is a type of optimal prefix code that is commonly used for lossless data compression. The output of Huffman's algorithm can be viewed as a variable-length code table for encoding a source symbol (such as a character in a file). The code is derived from the estimated probability or frequency of occurrence (weight) of each possible value of the source symbol. More common symbols are represented using fewer bits than less common symbols.

Frequently occurring letters occupy fewer bits, so overall the file should take less space.

Huffman decoding can also be efficient: because codes are assigned by frequency of occurrence, frequently used source symbols are decoded faster than rarely used ones.
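
The following is a compact, generic sketch of how such a tree and code table can be built with a priority queue. It illustrates the algorithm only; it is not the exact structure used in HuffZipping-10.cpp, and the frequencies are just a few values taken from the table above.

//C++  Building a Huffman tree and code table from symbol weights (illustrative sketch)
#include <iostream>
#include <map>
#include <queue>
#include <string>
#include <vector>

struct Node
{
    char   symbol;   // meaningful only for leaf nodes
    double weight;   // frequency of occurrence
    Node*  left;
    Node*  right;
};

struct MinWeightOnTop    // comparison for a min-heap ordered by weight
{
    bool operator()(const Node* a, const Node* b) const { return a->weight > b->weight; }
};

// Walk the finished tree and record the bit string of every leaf.
void BuildTable(const Node* n, const std::string& bits, std::map<char, std::string>& table)
{
    if (n == nullptr) return;
    if (n->left == nullptr && n->right == nullptr) { table[n->symbol] = bits; return; }
    BuildTable(n->left,  bits + '0', table);
    BuildTable(n->right, bits + '1', table);
}

int main()
{
    std::map<char, double> freq = { {'E', 12.8}, {'T', 12.8}, {'A', 8.1}, {'Z', 0.1} };

    std::priority_queue<Node*, std::vector<Node*>, MinWeightOnTop> pq;
    for (const auto& f : freq)
        pq.push(new Node{ f.first, f.second, nullptr, nullptr });

    while (pq.size() > 1)                    // repeatedly merge the two lightest nodes
    {
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{ 0, a->weight + b->weight, a, b });
    }

    std::map<char, std::string> table;
    BuildTable(pq.top(), "", table);
    for (const auto& e : table)
        std::cout << e.first << " -> " << e.second << std::endl;
    return 0;
}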

Things to take note of when programming Huffman coding

  1. Huffman codes made up entirely of '0' bits (e.g. '00', '0000', '0000000') are theoretically possible but were not feasible in this program: they caused errors when decoding. However, if such an all-'0' code is at least one byte long (e.g. '0 0000 0000'), the program works correctly.
  2. The line feed '\n' is itself an ASCII code, hexadecimal 0x0A. Therefore '\n' should be encoded like any other character.
  3. The encoder and decoder should read all the data from the file at once rather than line by line. As mentioned above, the line feed '\n' is itself an ASCII character, and reading line by line will cause errors.

The points above detoured me for a while during code writing.

By implementing Huffman coding, we can also encrypt data files: as long as we keep the specific Huffman tree private, only we can decode them.
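
Decoding amounts to walking the tree one bit at a time from the root, which is why the file stays unreadable without that tree. A hedged sketch of the decode loop (it reuses the Node structure from the tree-building sketch above and assumes the packed bytes have already been unpacked into a string of '0'/'1' characters):

//C++  Decoding a Huffman bit stream by walking the tree (illustrative sketch)
#include <string>

struct Node                 // same shape as in the tree-building sketch above
{
    char   symbol;
    double weight;
    Node*  left;
    Node*  right;
};

std::string Decode(const Node* root, const std::string& bits)
{
    std::string out;
    const Node* n = root;
    for (char b : bits)
    {
        n = (b == '0') ? n->left : n->right;            // follow one edge per bit
        if (n->left == nullptr && n->right == nullptr)
        {
            out += n->symbol;                           // reached a leaf: emit its symbol
            n = root;                                   // and restart from the root
        }
    }
    return out;
}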

 

1) Sample Visual C++ programming code #1 - for txt files containing only alphabet letters and line feeds

Huffman Coding Compression source code:

Encoding source code: HuffZipping-10.cpp, Huff_Zipping-10.exe

Decoding source code: Huff_Unzipping-4.cpp, Huff_Unzipping-4.exe

The executable 'Huff_Zipping-10.exe' encodes files containing letters and line feeds, while 'Huff_Unzipping-4.exe' decodes them back to the original files.

Only the capital letters below can be used for encoding in this example code #1: 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', plus the line feed '\n'.

Huffman Tree graph (alphabet only) for sample programming code #1

 

Pros and Cons of Huffman Coding

1) If the frequency of occurrence of characters in the file does not follow the standard distribution, the Huffman-coded output might be larger than the original file, which defeats the purpose of compression. Compression using Huffman coding therefore may not work well all the time.

2) Huffman coding can encode any type of character, ASCII or otherwise.

3) Unlike other encryption methods that only work down to the byte level, Huffman coding encodes characters down to the bit level, which makes the encoded output files highly secure.

4) By making the Huffman encoding tree not follow the standard, predictable frequency of occurrence of characters, we can make the encrypted output file very difficult to decode. For instance, the character 'E' has a very high frequency of occurrence (12.8); we can break it down into 10 separate tree nodes that all refer to 'E'. That way, people will not be able to decode the encrypted file based on character frequencies.
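
A hedged sketch of that idea (my own illustration, not code from the samples): split 'E' into several alias symbols before the tree is built, divide its weight among them, and pick an alias at random whenever 'E' is encoded; every alias decodes back to 'E'.

//C++  Splitting a frequent character into several alias symbols (illustrative sketch)
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <map>
#include <vector>

int main()
{
    std::srand(static_cast<unsigned>(std::time(nullptr)));

    // Alias symbol ids for 'E', chosen outside the ASCII range so they cannot clash.
    std::vector<int> eAliases = { 256, 257, 258, 259 };

    // Weights used to build the tree: 'E' disappears and its 12.8 is spread over the aliases.
    std::map<int, double> weights = { {'T', 12.8}, {'A', 8.1} };
    for (int a : eAliases)
        weights[a] = 12.8 / eAliases.size();            // 3.2 each

    // At encoding time, 'E' maps to a randomly chosen alias; all aliases decode back to 'E'.
    char c = 'E';
    int symbol = (c == 'E') ? eAliases[std::rand() % eAliases.size()] : c;
    std::cout << "symbol id used for 'E': " << symbol << std::endl;
    return 0;
}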