By Jonathan


2015-12-31 20:59:20 8 Comments

I attempted a problem from the Cracking the Coding Interview book. The following input: aabcccaaa should give the following output: a2b1c3a3. Any advice is much appreciated!

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char* compress(char*);

int main(int argc, char** argv)
{   
    if(argc==2)
    {
        printf("The compression of %s is %s \n", argv[1],
            compress(argv[1]));
    }
    else
        printf("Not correct number of inputs.");
}

char* compress(char* str)
{
    const size_t len = strlen(str);
    char result[len];

    //initialized to 0, so length check can be done later
    memset(result, 0, len); 

    int counter;
    int j = 0;

    if(str==NULL || len == 0)
        return 0; 

    result[0] = str[0];
    for(size_t i = 0; i < len; i++)
    {   
        counter = 1;//reset counter

        //to make sure no array out of bounds exception occurs
        if(result[j+1]==0)
            result[++j] = counter + '0';
        else
            return str;

        while(str[i] == str[i+1])
        {
            //to store number in char array, add the asci value 
            // of 0.
            result[j] = (++counter + '0'); 
            i++;
        }
        if(result[j+1]==0)
            result[++j] = str[i+1];
        else 
            return str;
    }

    result[++j] = '0';

    //to avoid "function returns address of local variable" error
    //to also avoid altering original argument
    char* str1 = (char*)malloc(sizeof(result));
    strcpy(str1, result);
    return str1;
}

4 comments

@SirPython 2016-01-01 04:29:11

Don't speak of one

I know it is against what the problem expects, but an improvement in your compression algorithm would be to not print any number if there is only one occurrence of the letter.

This way, it would impossible to encounter that first bug mentioned in ChrisWue's answer. And, in files that don't have that many consecutive letters, they will have less added bytes from the compression.

However, as noted in two comments, if the digits are numbers, you can run into some issues. However, after reading the next section, you can see that, if you have consecutive numbers, they will be ASCII numbers, where the values accompanying the characters will not be ASCII.


ASCII numbers

This is also against the challenge output.

Most compression services don't leave an intentionally readable output file in the end. In that case, there is not much need to have the numbers accompanying the letters be ASCII. That way, when it comes to large repetitions of characters (at least > 10), you will not be spending an extra byte or two.

Now, you may think: if the number got high enough, it might interfere with the file content. That is true; but your code will do this now too. However, when decompressing, it can be kept in mind that the compression will result in two byte pairs: one byte for the character, one byte the character occurrences.


UTF

Right now, your compression algorithm won't do so well when it comes to UTF files. Since some UTF characters actually exceed 1 byte, this algorithm would probably end up not compressing anything at all.

Now, this may be out of the scope of your problem, but it would be a nice addition. If you look at the UTF-8 wiki page, you can see that it is somewhat possible to detect if a character is UTF, judging by the binary digits at the beginning of the digit.

This, of course, would raise complications as UTF-8 characters are variable length and you would have to determine the length in bytes of the character by the first binary digits.

To ease things up, you could have the user specify whether or not the file is UTF encoded (or, you might be able to find out by reading the file).


Misc

  • Always use braces.

@Jonathan 2016-01-01 06:39:03

Hello, I am just doing this problem to practice coding interview questions, but thank you for all the useful information!

@David Richerby 2016-01-01 13:01:26

If you "don't speak of 1", then the string "23" is ambiguous. Does it mean three twos or a two and a three?

@Jonathan Leffler 2016-01-01 15:48:47

Note that not including the repeat when it is 1 makes it harder to decode accurately, especially if the input contains digits: b2211z etc.

@SirPython 2016-01-01 22:20:45

@DavidRicherby and JonathanLeffler These are good points. I've edited my post.

@Jamal 2015-12-31 21:12:41

  • You can eliminate that function prototype by defining main() last, but it's up to you.

  • It's more common to first check the command line usage and then exit with a non-zero value if it's invalid. You should also print all errors to stderr, not stdout (the latter is always used with printf()).

    if (argc != 2)
    {
        fprintf(stderr, "Not correct number of inputs.");
        return -1;
    }
    
  • You can probably check str right at the start, especially with the len assignment. If there is an error, then it may be clearer to return NULL instead of 0. At least that's what many library functions return specifically.

  • The while loop can instead be a for loop:

    for (; str[i] == str[i+1]; i++)
    {
        //to store number in char array, add the asci value 
        // of 0.
        result[j] = (++counter + '0');
    }
    
  • The NULL character for result is wrong; it should be '\0'.

  • There's actually no need to cast the return type of malloc() in C. Although malloc() does return a void*, it's not required because any pointer type can be assigned to a void* and is done "automatically" here.

    Moreover, you can avoid that aforementioned error by allocating result after making sure that the original string is valid. This will also avoid the need to call strcpy() at the end and risk other bugs. Since result will not be local this way, you can return it from the function without losing the data.

    char* compress(char* str)
    {
        // check to make sure str is valid...
    
        // assuming len is also valid
        char* result = malloc(len);
    
        // OR, use calloc() instead to get the memory zeroed
        //   thus no need for memset() or anything else
        char* result = calloc(len, sizeof(char));
    
        // probably also good to check for sufficient memory
        // if insufficient, print error and abort
        if (!result)
        {
            perror("malloc"); // or "calloc"
            abort();
        }
    
        // do the compression...
    
        // return it right away; no need to copy anything
        return result;
    }
    
  • You never call free() on str1, so this function leaks memory. You would have to do this in main() with the returned string, and it may be necessary to assign compress() to a const char* to display before freeing afterward. This also means that compress() should return a const char*, not a char*, preventing it from being modified further.

    It may work to have something like this in main():

    int main(int argc, char** argv)
    {
        if (argc != 2)
        {
            // let the user know of the correct input
            fprintf(stderr, "usage: %s string\n", argv[0]);
            return -1;
        }
    
        const char* original = argv[1];
        const char* compressed = compress(original);
    
        printf("The compression of %s is %s \n", original, compressed);
    
        free(compressed);
    }
    

@Jonathan 2015-12-31 21:30:36

Yes after you pointed out it leaks memory, i immediately did that. Thank you. I've added in the following lines of code. char* result; if(argc==2) { result = compress(argv[1]); free(compress(argv[1])); printf("The compression of %s is %s \n", argv[1], result); } Is that better?

@Jamal 2015-12-31 21:35:56

@JonathanSmit: That would be ugly anyway. You can always create two separate const char*s for the original and compressed strings respectively. It would still be better than a memory leak. I can add something like that to my answer.

@David Richerby 2016-01-01 12:57:20

What's the advantage of rewriting while (cond) {stmt; incr;} as for (; cond; incr) {stmt;}? The former seems much more natural and much clearer.

@Jonathan Leffler 2016-01-01 15:40:15

Don't forget that messages in general and error messages in particular should end with a newline '\n' (usually as the last character in the format string).

@Corbin 2016-01-02 01:53:54

@DavidRicherby Jamal likely has his own reasons for this, but considering I recently posted similar advice on one of Jamal's answers, I figure I might go ahead and add my own reasons. 1) It makes it trivial to glance at the loop and understand the flow of it (i.e. you can immediately tell how and when it increments -- at least assuming no shenanigans inside the loop). 2) It keeps the loop control around the loop instead of inside of it. 3) It keeps the same noisy ++i; boilerplate line out of 80% of loops.

@Corbin 2016-01-02 01:54:27

(And yeah, these do essentially all come down to "it looks better to me" and "my brain parses it more easily" just as the while version looks more natural and clear to you... :/)

@greybeard 2016-01-10 16:47:04

@DavidRicherby using the for-variant, I promise incr is about all there is to advance to next iteration, in addition, I can reasonably use if (done with this iteration) continue; to keep nesting/indentation(/bother about else) in check.

@David Richerby 2016-01-01 13:07:14

Bug. What happens if your input contains more than nine consecutive instances of the same character? What if it contains a lot more than nine? You don't do any bounds checking before assigning ++counter + '0' into a char variable and then printf-ing it back in main.

@ChrisWue 2016-01-01 02:10:19

  1. Bug: you assume that the result will never be longer than the original string while in fact it could be twice as long. For example abcd would yield a1b1c1d1.

  2. Instead of using memset to initialize result you can also initialize it to 0 like this:

    char result[len] = { 0 };
    
  3. You don't have to assign the counter on every iteration to the result. Simply increment the counter while the letter doesn't change and then add the count afterwards.

  4. Another bug: the way you convert the counter into a string will stop working for any number larger than 9. You want sprintf.

  5. Also strlen returns the length of the string excluding the terminating NULL character so your result is short by 1 in any case.

  6. Given the fairly severe bugs you really ought to test your code better.

@Jonathan 2016-01-01 03:07:08

Hello, for the bug, I believe I had an else statement, that will catch that and return the original string if the compressed string is actually longer. Does the code not do that?

@ChrisWue 2016-01-01 03:19:31

@JonathanSmit You have some if else statements to return the original string if result is not 0 but that's just wrong since it implies an out of bound access which invokes undefined behavior. It may just happen to occasionally work but it's still broken.

@Jamal 2016-01-01 04:31:06

@ChrisWue: Might as well make it snprintf() (add the size protection). As for point #2, I'm not sure it always works that way with C, or maybe that's just me. At least memset() can be something to fall back on.

@Jonathan 2016-01-01 04:35:47

@ChrisWue I added if((j+1) < len condition before each time i increment j and deference result. I also removed all the out of bounds access. Does that work better? Or should I be going about this differently?

@ChrisWue 2016-01-01 05:05:36

@JonathanSmit: Well, you should fix all the bugs, run some test cases - preferably with valgrind and then resubmit your code for review in a follow up question. I updated the question with another bug I found.

Related Questions

Sponsored Content

3 Answered Questions

[SOLVED] Javascript run length encoding algorithm

2 Answered Questions

[SOLVED] Basic string compression implementation in C

  • 2017-11-12 17:39:34
  • Emre Can Kucukoglu
  • 1313 View
  • -1 Score
  • 2 Answer
  • Tags:   c strings compression

2 Answered Questions

[SOLVED] Simple string compression reloaded

  • 2016-01-03 00:04:46
  • ChrisWue
  • 152 View
  • 4 Score
  • 2 Answer
  • Tags:   c compression

1 Answered Questions

[SOLVED] String compression implementation in C

2 Answered Questions

[SOLVED] Run-length encoding using C

1 Answered Questions

[SOLVED] Run-length encoding in Haskell

2 Answered Questions

[SOLVED] Run-length decoder

2 Answered Questions

[SOLVED] Run Length Encoding

2 Answered Questions

[SOLVED] Using reduce twice in a row for Run Length Encoding

2 Answered Questions

[SOLVED] Evaluation of a solution for Run-length encoding algorithm

Sponsored Content