Huffman Codes


7-9 Huffman Codes (30 points)

In 1953, David A. Huffman published his paper "A Method for the Construction of Minimum-Redundancy Codes", and hence printed his name in the history of computer science. As a professor who gives the final exam problem on Huffman codes, I am encountering a big problem: the Huffman codes are NOT unique. For example, given a string "aaaxuaxz", we can observe that the frequencies of the characters 'a', 'x', 'u' and 'z' are 4, 2, 1 and 1, respectively. We may either encode the symbols as {'a'=0, 'x'=10, 'u'=110, 'z'=111}, or in another way as {'a'=1, 'x'=01, 'u'=001, 'z'=000}, both compress the string into 14 bits. Another set of code can be given as {'a'=0, 'x'=11, 'u'=100, 'z'=101}, but {'a'=0, 'x'=01, 'u'=011, 'z'=001} is NOT correct since "aaaxuaxz" and "aazuaxax" can both be decoded from the code 00001011001001. The students are submitting all kinds of codes, and I need a computer program to help me determine which ones are correct and which ones are not.

Input Specification:

Each input file contains one test case. For each case, the first line gives an integer N (2≤N≤63), then followed by a line that contains all the N distinct characters and their frequencies in the following format:

c[1] f[1] c[2] f[2] ... c[N] f[N]

where c[i] is a character chosen from {'0' - '9', 'a' - 'z', 'A' - 'Z', '_'}, and f[i] is the frequency of c[i] and is an integer no more than 1000. The next line gives a positive integer M (≤1000), then followed by M student submissions. Each student submission consists of N lines, each in the format:

c[i] code[i]

where c[i] is the i-th character and code[i] is a non-empty string of no more than 63 '0's and '1's.

Output Specification:

For each test case, print in each line either "Yes" if the student's submission is correct, or "No" if not.

Note: The optimal solution is not necessarily generated by Huffman algorithm. Any prefix code with code length being optimal is considered correct.

7-3 Huffman Codes (20 points)

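Given a piece of text, once we have counted how often each letter occurs, the Huffman algorithm yields a code that compresses the text to the shortest possible total encoded length. Huffman codes are not unique, however. For the string "aaaxuaxz", for example, the letters 'a', 'x', 'u' and 'z' occur 4, 2, 1 and 1 times respectively. We can use the code {'a'=0, 'x'=10, 'u'=110, 'z'=111}, or {'a'=1, 'x'=01, 'u'=001, 'z'=000}, or {'a'=0, 'x'=11, 'u'=100, 'z'=101}; all three compress the text to 14 bits. But {'a'=0, 'x'=01, 'u'=011, 'z'=001} is not a Huffman code, because the compressed result 00001011001001 cannot be decoded uniquely: both "aaaxuaxz" and "aazuaxax" decode from it. Your task is to decide whether any given code is a Huffman code.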

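Input format:

The first line gives a positive integer N (2≤N≤63). The second line gives N distinct characters and their frequencies in the following format:

c[1] f[1] c[2] f[2] ... c[N] f[N]

where c[i] is a character from the set {'0' - '9', 'a' - 'z', 'A' - 'Z', '_'} and f[i] is the frequency of c[i], an integer no more than 1000. The next line gives a positive integer M (≤1000), followed by M code sets to be checked. Each code set takes N lines, each in the format:

c[i] code[i]

where c[i] is the i-th character and code[i] is a non-empty string of no more than 63 '0's and '1's.

Output format:

For each code set to be checked, print "Yes" on one line if it is a correct Huffman code, or "No" otherwise.

Note: an optimal code is not necessarily produced by the Huffman algorithm. Any prefix code that compresses to the optimal length must be judged correct.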

Sample input:

7
A 1 B 1 C 1 D 3 E 3 F 6 G 6
4
A 00000
B 00001
C 0001
D 001
E 01
F 10
G 11
A 01010
B 01011
C 0100
D 011
E 10
F 11
G 00
A 000
B 001
C 010
D 011
E 100
F 101
G 110
A 00000
B 00001
C 0001
D 001
E 00
F 10
G 11

Sample output:

Yes
Yes
No
No

Submission result

Submitted at	Status	Score	Problem	Compiler	Time	User
2018/5/24 23:50:17	Accepted	30	7-9	C++ (g++)	72 ms	copper

Test case	Result	Time	Memory
0	Accepted	3 ms	240KB
1	Accepted	3 ms	240KB
2	Accepted	3 ms	240KB
3	Accepted	72 ms	6220KB
4	Accepted	3 ms	236KB
5	Accepted	3 ms	236KB
6	Accepted	3 ms	316KB

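This problem is arguably the finale of the tree section: in both code size and conceptual difficulty it is in a different league from the other problems. The task is this: given a sequence of weighted characters and M submitted code sets over the same characters, output Yes for each submission that is an optimal code and No for each one that is not.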

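By asking about the optimal code length, the problem is really testing Huffman trees and Huffman codes. In case that does not occur to you, the problem setter even opens the statement by introducing David A. Huffman and his paper "A Method for the Construction of Minimum-Redundancy Codes", that is, Huffman codes.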

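Every given sequence of weighted characters can be built into a Huffman tree. Different tie-breaking choices give trees of different shapes, but the optimal code length for the sequence is fixed: it is the total encoded length (the sum of weight times depth over all leaves) of any Huffman tree built from it. (To build a Huffman tree, repeatedly take the two nodes of smallest weight and merge them into a binary tree whose root weight is the sum of the two subtree weights, then treat that tree as a node in later merges. Repeat until every node has been used and a single tree remains; N nodes take N-1 merges.) Note, however, that a Huffman code always achieves the optimal length, but a code that achieves the optimal length need not be a Huffman code (the last sentence of the statement points this out explicitly, so the setter is being rather kind). Validating a submission therefore comes down to two points: 1. it achieves the optimal code length; 2. it can be decoded unambiguously.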
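The merging procedure above also gives the optimal length directly, without keeping the tree. The following is a minimal sketch that swaps the hand-written heap of this solution for `std::priority_queue`; the name `optimalLength` is hypothetical and not part of the original code. It relies on the fact that every merge adds one bit to each character below the new root, so the merged weights sum to the total encoded length.

```cpp
#include <functional>
#include <queue>
#include <vector>

// Hypothetical helper (not in the original solution): optimal total encoded
// length for the given frequencies, computed without building a tree.
long long optimalLength(const std::vector<int>& freqs) {
    // Min-heap of weights: std::greater puts the smallest weight on top.
    std::priority_queue<long long, std::vector<long long>, std::greater<long long>>
        heap(freqs.begin(), freqs.end());
    long long total = 0;
    while (heap.size() > 1) {
        long long a = heap.top(); heap.pop();
        long long b = heap.top(); heap.pop();
        total += a + b;    // each merge costs one extra bit per occurrence below it
        heap.push(a + b);  // the merged tree goes back in as a single node
    }
    return total;
}
```

For the frequencies 4, 2, 1, 1 of "aaaxuaxz" this returns 14, matching the 14 bits in the statement; for the sample input (1, 1, 1, 3, 3, 6, 6) it returns 53.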

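That gives the approach. First build a Huffman tree from the input and obtain the optimal code length. Then check each submission: 1. whether it achieves the optimal code length; 2. whether it can be decoded unambiguously (it is a prefix code, so characters sit only at leaves of the binary tree).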
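To see why both checks are needed, it helps to compute the cost of the sample submissions by hand. The sketch below uses a hypothetical helper `encodedLength` (not a function of the original solution) and assumes `freqs[i]` and `codes[i]` describe the same character.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: total encoded length of one submission,
// i.e. the sum of frequency * code length over all characters.
int encodedLength(const std::vector<int>& freqs,
                  const std::vector<std::string>& codes) {
    int total = 0;
    for (std::size_t i = 0; i < freqs.size(); ++i) {
        total += freqs[i] * static_cast<int>(codes[i].size());
    }
    return total;
}
```

With the sample frequencies 1, 1, 1, 3, 3, 6, 6, the third submission (seven 3-bit codes) costs 63 rather than 53, so it fails the length check. The fourth submission costs 53, yet E = 00 is a prefix of A = 00000, so it is rejected only by the prefix check.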

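To obtain the optimal code length we first need a Huffman tree, which in turn requires putting the weighted sequence into a min-heap. Each round pops the top of the min-heap twice, merges the two nodes as the left and right subtrees of a new binary tree, sets the weight of the new root, and pushes the tree back into the heap for further merging. After N-1 merges the heap holds one element, and the last pop returns the complete Huffman tree. (For how to build a min-heap, see: #Data Structures and Algorithms Study Notes# PTA14: Min-Heap and Max-Heap (C/C++).) Then compute the encoded length of each submission and compare it with the length of the standard Huffman code. Every insertion into or deletion from the min-heap adjusts the heap along a single path between the root and a leaf; see the code comments for the details.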
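The heap mechanics are easier to follow on plain integers. The following is an illustrative sketch only: `IntMinHeap` is a hypothetical type that mirrors the 1-indexed layout with a sentinel at index 0 used in this solution, but stores `int` keys instead of node pointers.

```cpp
#include <climits>
#include <cstddef>
#include <vector>

// Illustrative sketch: 1-indexed min-heap of ints with a sentinel at index 0.
struct IntMinHeap {
    std::vector<int> data{INT_MIN};  // data[0] is the sentinel, no key is smaller

    bool empty() const { return data.size() == 1; }

    void insert(int key) {
        std::size_t i = data.size();
        data.push_back(key);
        // Percolate up: shift larger parents down one level.
        // The sentinel guarantees the loop stops at the root.
        for (; data[i / 2] > key; i /= 2) {
            data[i] = data[i / 2];
        }
        data[i] = key;
    }

    // Precondition: !empty().
    int deleteMin() {
        int minKey = data[1];
        int last = data.back();
        data.pop_back();
        std::size_t n = data.size() - 1;  // number of keys still stored
        std::size_t parent = 1, child;
        // Percolate down: move the smaller child up until `last` fits.
        for (; parent * 2 <= n; parent = child) {
            child = parent * 2;
            if (child < n && data[child + 1] < data[child]) ++child;
            if (last <= data[child]) break;
            data[parent] = data[child];
        }
        if (n >= 1) data[parent] = last;
        return minKey;
    }
};
```

Both operations touch only one root-to-leaf path, so each costs O(log n). Inserting the sample weights in any order and calling `deleteMin` repeatedly returns them in non-decreasing order.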

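Checking the prefix property means replaying each character's code as a path in a tree (each code traces one path in a binary tree, where 0 means the left child and 1 means the right child). Two situations during the replay show that the prefix requirement is violated (see the figure below): 1. a branch created later passes through or goes beyond a leaf that has already been defined; 2. a branch created later ends before it reaches a leaf, that is, on a node that already has children. See the code comments for the details.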
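For comparison, the same property can be checked without building a tree. The sketch below is an alternative technique, not the tree walk used in this solution, and `isPrefixFree` is a hypothetical name. After sorting the codes lexicographically, any code that is a prefix of another code is also a prefix of its immediate successor, so comparing neighbours is enough.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Alternative sketch (not the tree walk of the original solution).
// Taken by value so that the caller's ordering is left untouched.
bool isPrefixFree(std::vector<std::string> codes) {
    std::sort(codes.begin(), codes.end());
    for (std::size_t i = 0; i + 1 < codes.size(); ++i) {
        const std::string& a = codes[i];
        const std::string& b = codes[i + 1];
        // b starts with a exactly when a is a prefix of b (or equal to it).
        if (b.compare(0, a.size(), a) == 0) return false;
    }
    return true;
}
```

On the statement's examples, {0, 10, 110, 111} passes while {0, 01, 011, 001} fails, because 0 is a prefix of 001. Duplicate codes are rejected as well, since equal strings end up next to each other after sorting.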

// HuffmanCodes.cpp : entry point of the console application.
//

//#include <stdafx.h>
#include <vector>
#include <iostream>
#include <cstring>

using namespace std;

//Huffman tree node class
class Node {
public:
	Node() : element('\0'), weight(0) {}
	Node(char element, int weight)
		:element(element), weight(weight), left(NULL), right(NULL) {}

	char element;
	int weight;
	Node* left = NULL;
	Node* right = NULL;
	bool isleave = false;	//true when a submitted code ends at this node (a leaf)
};
typedef Node* HFMTree;

//One line of a submission: a character and its code
class Case {
public:
	char element;
	char route[1000];
	int length;

	int getlength() {
		return strlen(this->route);
	}
};

void Read(int num, vector<HFMTree>& minHeap, vector<HFMTree>& inputlist);
void Insert(vector<HFMTree>& minHeap, HFMTree node);		//insert a node into the min-heap
HFMTree CreateHFMT(vector<HFMTree>& minHeap);			//build the Huffman tree from the min-heap
HFMTree DeleteMinHeap(vector<HFMTree>& minHeap);		//remove the smallest node from the min-heap, re-adjust the heap and return that node
int getHFMLength(HFMTree hfmtree, int depth);						//total encoded length of the tree

void Input(vector<Case>& testcase, int num);
bool isOptimalLen(vector<Case>& testcase, vector<HFMTree>& inputlist, int weight);	//check whether the submission achieves the optimal code length
bool isPrefixCode(vector<Case>& testcase);				//check whether the submission is a prefix code


int main()
{
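	/*Build a Huffman tree from the input and obtain the optimal code length*/
	int num;
	cin >> num;

	vector<HFMTree> minHeap;		//min-heap that stores the input nodes
	vector<HFMTree> inputlist;		//keeps the input order and the weights
	HFMTree flag = new Node('-', -1);	//sentinel at index 0; assumes frequencies are non-negative
	minHeap.push_back(flag);
	Read(num, minHeap, inputlist);

	HFMTree hfmtree;				//build the Huffman tree from the min-heap
	hfmtree = CreateHFMT(minHeap);
	int optcodelength = getHFMLength(hfmtree, 0);	//optimal code length, taken from the Huffman tree


	/*Check each submission: 1. does it achieve the optimal code length, 2. can it be decoded unambiguously (prefix code: characters sit only at leaves of the binary tree)*/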
	int count;
	cin >> count;

	for (int i = 0;i < count;i++) {
		vector<Case> testcase;
		Input(testcase, num);
		bool isoptimallen = isOptimalLen(testcase, inputlist, optcodelength);
		bool isprefixcode = isPrefixCode(testcase);
		if (isoptimallen && isprefixcode) {
			cout << "Yes" << endl;
		}
		else {
			cout << "No" << endl;
		}
	}

	return 0;
}

void Read(int num, vector<HFMTree>& minHeap, vector<HFMTree>& inputlist) {
	char element;
	int weight;
	for (int i = 0; i < num; i++) {
		cin >> element >> weight;
		HFMTree node = new Node(element, weight);
		inputlist.push_back(node);
		Insert(minHeap, node);
	}
	//minHeap.erase(minHeap.begin());
}

void Insert(vector<HFMTree>& minHeap, HFMTree node) {
	int index = minHeap.size();
	minHeap.push_back(node);

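	//After each insertion, percolate the new node up from the bottom
	while ((*minHeap[index / 2]).weight > (*node).weight) {
		//Move the node pointer itself; copying only element and weight would lose the subtree links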
		//(*minHeap[index]).element = (*minHeap[index / 2]).element;
		//(*minHeap[index]).weight = (*minHeap[index / 2]).weight;
		minHeap[index] = minHeap[index / 2];
		index /= 2;
	}
	minHeap[index] = node;
}

HFMTree CreateHFMT(vector<HFMTree>& minHeap) {

	int size = minHeap.size() - 1;	//number of leaves (index 0 holds the sentinel)
	//Perform size-1 merges
	for (int i = 1; i < size; i++) {
		HFMTree node = new Node();
		//Each round pops the two smallest nodes from the min-heap as the left and right children of the new node
		node->left = DeleteMinHeap(minHeap);
		node->right = DeleteMinHeap(minHeap);
		node->weight = node->left->weight + node->right->weight;
		//Push the binary tree rooted at this node back into the min-heap
		Insert(minHeap, node);
	}

	//Pop the finished Huffman tree from the min-heap
	HFMTree hfmtree = DeleteMinHeap(minHeap);

	return hfmtree;
}

HFMTree DeleteMinHeap(vector<HFMTree>& minHeap) {
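	//Return NULL if the heap is empty (only the sentinel is left)
	if (minHeap.size() == 1) {
		return NULL;
	}

	//The smallest element sits at index 1; keep it so it can be returned
	HFMTree node = minHeap[1];

	//Re-adjust the heap
	int size = minHeap.size();
	int parent, child;
	//Percolate the last element of the min-heap down from the root
	HFMTree cmp = minHeap[size - 1];

	//Start at the root: parent is the current position and child its smaller child; each iteration moves parent to the previous child
	//The loop ends when parent has no child (in a lopsided heap parent may not end on the bottom level, which does not affect the result)
	for (parent = 1; 2 * parent < size; parent = child) {
		child = parent * 2;
		//If child is not the last node of the heap, point child at the smaller of the two children
		if ((child != size - 1) && ((*minHeap[child]).weight > (*minHeap[child + 1]).weight)) {
			child++;
		}
		//Stop once the last element is no larger than that child (it then takes the parent slot, not the child slot)
		if (cmp->weight <= (*minHeap[child]).weight) {
			break;
		}
		else {
			minHeap[parent] = minHeap[child];
		}
	}
	//Place the last element at the final parent position
	minHeap[parent] = cmp;

	//Remove the last slot of the heap
	//Do not use minHeap.erase(minHeap.end()) here: end() points one past the last element, so erasing it is undefined behavior
	minHeap.pop_back();

	//Return the removed minimum
	return node;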
}

int getHFMLength(HFMTree hfmtree, int depth) {
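	//A leaf contributes weight * depth, i.e. frequency times code length
	if (!hfmtree->left && !hfmtree->right) {
		return hfmtree->weight * depth;
	}
	//Every other node has two subtrees: return the sum of their lengths, one level deeper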
	else {
		return getHFMLength(hfmtree->left, depth + 1) + getHFMLength(hfmtree->right, depth + 1);
	}
}

void Input(vector<Case>& testcase, int num) {
	for (int i = 0;i < num;i++) {
		Case inputcase;
		cin >> inputcase.element >> inputcase.route;
		inputcase.length = inputcase.getlength();
		testcase.push_back(inputcase);
	}
}

bool isOptimalLen(vector<Case>& testcase, vector<HFMTree>& inputlist, int weight) {
	//Total encoded length of this submission: sum of code length * frequency
	//Relies on the submission listing the characters in the same order as the input
	int testweight = 0;
	for (size_t i = 0;i < testcase.size();i++) {
		testweight += (testcase[i].length * (*inputlist[i]).weight);
	}
	return testweight == weight;
}

bool isPrefixCode(vector<Case>& testcase) {
	bool isprefixcode = true;
	HFMTree newtree = new Node();

	//Two situations violate the prefix requirement: 1. a later branch passes through or goes beyond an already defined leaf, 2. a later branch ends before reaching a leaf
	for (int i = 0;i < testcase.size();i++) {
		HFMTree point = newtree;
		if (isprefixcode == false)break;

		for (int j = 0;j < testcase[i].length;j++) {

			if (isprefixcode == false)break;

			if (testcase[i].route[j] == '0') {
				//Check whether the left child exists; if not, create it
				if (!point->left) {
					HFMTree newnode = new Node();
					point->left = newnode;
					point = point->left;
					//If this is the last step of the branch, mark the node as a leaf
					if (j == testcase[i].length - 1) {
						point->isleave = true;
					}
				}
				//If the left child exists, move the pointer to it first
				else {
					point = point->left;
					//If the left child is already a leaf, an earlier code is a prefix of this one
					if (point->isleave) {
						isprefixcode = false;
						break;
					}
					//If this is the last step and the node still has children, this code is a prefix of an earlier one
					if ((j == testcase[i].length - 1) && (point->left || point->right)) {
						isprefixcode = false;
						break;
					}
				}
			}
			else if (testcase[i].route[j] == '1') {
				//Check whether the right child exists; if not, create it
				if (!point->right) {
					HFMTree newnode = new Node();
					point->right = newnode;
					point = point->right;
					//If this is the last step of the branch, mark the node as a leaf
					if (j == testcase[i].length - 1) {
						point->isleave = true;
					}
				}
				//If the right child exists, move the pointer to it first
				else {
					point = point->right;
					//If the right child is already a leaf, an earlier code is a prefix of this one
					if (point->isleave) {
						isprefixcode = false;
						break;
					}
					//If this is the last step and the node still has children, this code is a prefix of an earlier one
					if ((j == testcase[i].length - 1) && (point->left || point->right)) {
						isprefixcode = false;
						break;
					}
				}
			}
		}
	}

	return isprefixcode;
}
