+==========================================================================================+ |Keen 4-6 DICT format: | +==========================================================================================+ The 'DICT' of huffman dictionary files are the (Usually internal) huffman binary trees that are used to decompress individual files from group files. There are three maind DICTs, AUDIODICT (Sound), EGADICT (Graphics) and MAPDIC (Game maps) They are the first component of the 'group file trilogy' (Dictionary, header, data) Each DICT is exactly 1024 bytes long, consisting of 255 'nodes' and 4 blank bytes at the file end. The 255 nodes, numbered 0-254, are connected together to form a 'binary tree'; this is covered more fully below. ------------------------------------------------------------------------------- FILE STRUCTURE: NODES ... .. . PADDING ------------------------------------------------------------------------------- ------------------------------------------------------------------------------- STRUCTURE: 0 4 Node Huffman node 1020 4 Padding Blank ------------------------------------------------------------------------------- ------------------------------------------------ NODE STRUCTURE: ? 1 Left val Left branch character or node value +1 1 Val type Left branch type; $00 = character, $01 = node +2 1 Right val Right branch character or node value +3 1 Val type Right branch type; $00 = character, $01 = node ------------------------------------------------ ------------------------------------------------------------------------------- HUFFMAN COMPRESSION: Huffman compression compresses data by representing characters as strings of bits, (1s and 0s) with commonly used characters having shorter strings. This is just what normal data is, (For example the space character is represented by $00100000) You can think of normal data then as 256 different characters each with its own individual 8-bit string. We can organize these strings into a neat 'tree' shape, with left being 0 and right being 1, we'll start off with 1 trunk and branch in two over and over until we have 256 tips. To find the character we want, we start at the trunk and move onto right or left branches as we read the bit string for a character. In the example below a 4-character tree has been built for the numbers 0-3. To find, say, the number two, we take its bit string (In this case only two bits long, 10) and move as you can see: 0 1 2 3 \ / \ / * * \ / \ / * ROOT Each of the '*' is called a NODE, and there are n-1 of them, where n is the number of types of characters you have in your data. The trunk node is called the ROOT, and is always the highest numbered node (In this case node 3) Right, so how does Huffman compress? If I have the string '03323330', this tree doesn't really let me compress it at all! But what Huffman can do is 'rearrange' the tree, so that more common characters (Like 3) get shorter 'branches' and less common ones get longer. There is only one other possible tree and it is shown below. Notice that '3' now only needs one bit while 0 and 1 need 3. BUT, because 3 is more common we'll have a lot more 1 bit strings than 3 bits, so our data will shrink: 0 1 \ / * 2 \ / * 3 \ / * ROOT Our string '0133233310' is 00,11,11,10,11,11,11,00 in 2-binary Using the first tree we don't get any change, the output is 0011111011111100, 16 bits of data. Using our second tree however we get 000,1,1,01,1,1,1,000 or 0001101111000 13 bits, and almost 50% reduction in size. Since Huffman relies on rearrangement, it can 'only' work with data where some characters occur a lot more than others. Fortunately, this is almost always the case, indeed you can often identify a huffman tree's end by the string $00 $00 $FD $01 which is usually node 254 (The root) which has $00 (Blank) as its character and node 253 as a branch for the whole rest of the tree. You may have already been able to guess how the nodes in the DICT are joined together, each node has two branches, leading either to another node or to a character. The second and fourth bytes of each node determine that, 1 for a node, 0 for a character. From the nodes then you can build up the binary tree you need to decompress huffman. NOTE: Our example string of 13 bits would be encoded as two bytes, with the extra bit space being filled with zeroes. All huffman programs need to be told when to stop decoding, in case they come across 'spare bits' This is usually done by telling the program how long the decompressed file is. -------------------------------------------------------------------------------