By Tiger Hwang


2017-01-21 23:30:21 8 Comments

I am going to prepare some test data for a compression test. One of the them is the 'worst case test', which should make the compressor work worst. Use random number to generate such a file is an idea, but there still contain some kind of patterns in the data. Using 7zip to compress such a file, I get a output file which is a little bit smaller than input file.

So I make a small piece of code to generate a file, which does not contain any repeated two bytes. To make life difficult, I shuffle those byte pairs in special order, hope the compressor will have more difficulty to find any match, even with predication.

#include <iostream>
#include <fstream>

int main(int argc, char* argv[])
{
    const char* filename = "c:\\_Test\\hard.dat";

    std::ofstream ofs(filename, std::ios::binary | std::ios::out);
    if (ofs.bad())
    {
        std::cerr << "fail to open file\n";
        return -1;
    }

    for (unsigned int i = 0; i < 0x10000; ++i)
    {
        unsigned int t = (i * 0xc369) & 0xFFFF;
        ofs.write((char*)&t, 2);
    }
    std::cout << "job done\n";

    return 0;
}

It's running in windows, and I think it's easy to change filename to make it work on other system, or even use command line argument as filename, but that's not the point.

I tried use 7zip to compress it, both format (7z and zip) created file bigger than original file. I have some other compressors, will test them later.

Any suggestions and help are appreciated.

2 comments

@Tiger Hwang 2017-01-22 20:52:48

Here is the new version, with one fix and some small fix. I did wrong calculation in the beginning, the max possible none repeated bytes pair is 64K+1, not 128K.

It cannot be calculate with one single lineal equation. To get the result, I have to use a 2D array to keep trace which pair exists already. Each time calculate a new byte with equation 1, when it was marked as used, use equation 2 to get a new byte. By set up different multipilier, recursive result is avoid.

I had some code to check if the output is correct use simple loop to compare all bytes pairs. It was so slow so I removed it.

#include <iostream>
#include <fstream>
#include <vector>

const unsigned int DATA_COUNT = 65537;    // max possible bytes serials without two bytes
const unsigned int DATA_OFFSET = 4357;    // just a primitive number
const unsigned int MUL1 = 37;             // mul1 to get next index  
const unsigned int MUL2 = 13;             // mul2 to get next index

int main(int argc, char* argv[])
{
    const char* filename = "c:\\_Test\\hard.dat";

    std::ofstream ofs(filename, std::ios::binary);
    if (ofs.bad() || !ofs.is_open())
    {
        std::cerr << "fail to open file\n";
        return -1;
    }

    bool mask[256][256] = { false };
    std::vector<unsigned char> data;
    data.reserve(DATA_COUNT);

    unsigned int index = 3; // start with any value between 0 and 255
    data.push_back(index);
    while (data.size() < DATA_COUNT)
    {
        unsigned int next = (index * MUL1) & 0xff;
        while (mask[index][next])
        {
            next = (next * MUL2 + 1) & 0xff;
        }
        mask[index][next] = true;
        data.push_back(next);
        index = next;
    }

    ofs.write((char*)data.data(), data.size());
    std::cout << "job done\n";

    return 0;
}

After generate the file, I use 7zip again to compress it with 7z and zip format. What interesting is with 7z format, the file is 'compressed' to a bigger size. With zip format, the file is not compressed but stored directly in the zip file, with additional head information.

@user1118321 2017-01-22 03:26:18

This is a cool idea! Here are some thoughts I had:

What are these magic numbers?

You have 4 magic numbers in your main loop: 0x10000, 0xc369, 0xFFFF, and 2. What do they mean? It looks like 0x10000 is the number of 2-byte words you're writing to the file. It's also the limit of a 16-bit number. It would be nice if there were a named constant for that. Perhaps something like:

const int kWordsPerFile = 0x10000;

or

const int kMax16BitValue = 0x10000;

Or something along those lines. (Technically it's 1 more than the max, so maybe something more descriptive.)

Honestly, I can live with 0xFFFF but it wouldn't hurt to give it a name like kLSWMask (where LSW is least significant word) or something similar.

You could get rid of the 2 by making t be a uint16_t and then using sizeof(t).

That leaves 0xc369. I have no idea what it is. I assume this is some sort of linear congruential pseudo-random number generator, but I'm not really well versed in such things. How is it derived? What significance does it have? Give it a name so it's understandable to someone else 6 months from now.

Comments

I'm a big fan of self-documenting code. However, in this case, it would be nice to have at least a sentence explaining what the loop does. Coming across this code in the code base I work on would leave me scratching my head. I'd have to go back through source control comments to see if there was any clue of what it was about. Maybe just a comment like:

// Generate some pseudo-random numbers where no 2 bytes are repeated

would really help a lot! And if you have a link or a short sentence to explain the algorithm, that would be nice, too.

@Tiger Hwang 2017-01-22 11:32:08

Nice suggeation. I will update it soon. The number 0xc396 was with a ^ operator, but it did not work well. I changed it to * and got a better output. I think it should even better to use a primirive number instead.

@Rakete1111 2017-01-22 15:26:20

@TigerHwang Also, the std::ios::out flag is redundant, as when you create a std::ofstream, it is already specified.

Related Questions

Sponsored Content

2 Answered Questions

[SOLVED] C++ LZ77 compression algorithm

  • 2017-05-24 00:03:40
  • Jeremy
  • 2267 View
  • 3 Score
  • 2 Answer
  • Tags:   c++ compression

1 Answered Questions

[SOLVED] Compression unit test data "easy case"

2 Answered Questions

[SOLVED] Analyzer for compression algorithms (or tools)

  • 2016-08-31 13:59:07
  • user116360
  • 150 View
  • 3 Score
  • 2 Answer
  • Tags:   c file compression

2 Answered Questions

[SOLVED] First time writing test case

0 Answered Questions

MSTest Data Driven Test Inline Data

  • 2016-04-21 16:01:37
  • Nkosi
  • 1543 View
  • 6 Score
  • 0 Answer
  • Tags:   c# unit-testing

1 Answered Questions

[SOLVED] Custom compression tool

4 Answered Questions

[SOLVED] Test case for a caching library

2 Answered Questions

[SOLVED] Create better Unit test

2 Answered Questions

[SOLVED] Create and destroy object in Unit Test

1 Answered Questions

[SOLVED] htaccess for URL remapping, caching, and compression

Sponsored Content