Friday, January 9, 2009

Error prone constructs

Unlike some other languages, C retains a number of error-prone constructs.
For example, Java does not pose problems with explicit casts, array sizes, buffer sizes or macros, and poses fewer problems with error checking (through the use of exceptions). Lisp, for instance, does not pose problems with number ranges.

Explicit casts
Compilers do not generate errors or warnings for semantically wrong explicit casts. Explicit casts are accepted as is [38].
Use explicit casts as rarely as possible [39]. It is worth considering whether an explicit cast is necessary and what the compiler will do with it.
The C language allows implicit conversions from T* to void* and vice versa [40]. No explicit cast is needed in C to convert from void*. malloc() is an example of a frequently used function that returns void*.
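Besides malloc(), comparison callbacks for qsort() are another place where void* conversions occur. The following is a minimal sketch of the cast-free style; the names compare_ints and sort_ints are made up for this illustration.

```c
#include <stdlib.h>

/* Hypothetical comparison callback for qsort(). */
static int compare_ints(const void *a, const void *b)
{
    const int *x = a;   /* implicit conversion from const void*, no cast */
    const int *y = b;
    return (*x > *y) - (*x < *y);
}

/* Sorts n ints in ascending order. */
void sort_ints(int *v, size_t n)
{
    /* v (an int*) converts to void* implicitly as well. */
    qsort(v, n, sizeof *v, compare_ints);
}
```

A call such as sort_ints(values, count) needs no cast at any point.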

Type size
Know what your integer size is. Is it 16, 32, 64 or 128 bits?
The size of int limits the range of numbers you can use [41]. Check whether, e.g., 31 bits (the value bits of a 32-bit signed int) are enough for your problem domain.
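One way to make such a check explicit is to let the preprocessor compare INT_MAX from <limits.h> with the largest value the problem domain needs. This is only a sketch: MAX_DOMAIN_VALUE and int_value_bits() are assumed names, and the limit of one million is an arbitrary example.

```c
#include <limits.h>

/* Hypothetical requirement: the problem domain needs values up to 1000000. */
#define MAX_DOMAIN_VALUE 1000000L

#if INT_MAX < MAX_DOMAIN_VALUE
#error "int is too small for this problem domain"
#endif

/* Returns the number of value bits of an int (31 for a 32-bit int). */
int int_value_bits(void)
{
    int bits = 0;
    unsigned int v = INT_MAX;

    while (v != 0) {
        bits++;
        v >>= 1;
    }
    return bits;
}
```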

Array size
There are several methods to refer to an array size (sample array being int a[32];):
• 32
• SIZE (a macro used for the array definition)
• sizeof(a)/sizeof(*a)
The first one is the worst: it becomes invalid if the array definition is changed without tracking down the other occurrences of the size. The second one is better, but the third, using sizeof, is the preferred one.
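The sizeof idiom is often wrapped in a macro. The sketch below assumes a macro name ARRAY_LEN and a sample table squares, neither of which comes from the text. Note that the idiom only works on a true array; applied to a pointer (such as an array passed as a function parameter) it silently yields a wrong result.

```c
#include <stddef.h>

/* Hypothetical convenience macro; valid only for true arrays, not pointers. */
#define ARRAY_LEN(a) (sizeof(a) / sizeof(*(a)))

int squares[32];

/* Fills the table; the loop bound follows the array definition automatically.
   Returns the number of elements written. */
size_t fill_squares(void)
{
    size_t i;

    for (i = 0; i < ARRAY_LEN(squares); i++)
        squares[i] = (int)(i * i);
    return ARRAY_LEN(squares);
}
```

If the definition of squares is later changed to a different size, no other line needs to be touched.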

Buffer sizes
It is strongly discouraged to implement or use functions that require a buffer as an argument, without also requiring the buffer size.
This rule should be strictly followed if the input size is an external (and hence uncontrolled) property (e.g. a line length with gets()).
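A function that follows this rule could look like the following sketch. The name copy_bounded and its return convention are assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src into dst without ever writing more than
   dstsize bytes. Returns 0 on success, -1 if src had to be truncated. */
int copy_bounded(char *dst, size_t dstsize, const char *src)
{
    size_t len = strlen(src);

    if (dstsize == 0)
        return -1;                      /* no room even for the NUL */
    if (len >= dstsize) {
        memcpy(dst, src, dstsize - 1);  /* keep what fits */
        dst[dstsize - 1] = '\0';
        return -1;
    }
    memcpy(dst, src, len + 1);          /* includes the terminating NUL */
    return 0;
}
```

The caller passes sizeof(buf) together with buf, so the function never has to guess how large the buffer is.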
Buffer overflows can
• corrupt adjacent data
• corrupt the stack frame (if on the stack)
• corrupt malloc internal data (if on the heap)
The first is hard to find because it can subtly change the program logic. The last is also hard to find, since the program often crashes at some later point in a call to malloc() or free(). In that case, often only a malloc debugging package helps.
An example of a corrupted stack frame is a program that crashed (on a little-endian system), leaving a core [42]. Because of the overwritten stack, the debugger used to examine the core (sdb) was unable to display a backtrace and just displayed the message "cannot get_text from 0x63697245". This was confusing at first glance, but it was a good hint that, upon returning from the corrupting function, the program tried to jump to an invalid address: the bytes 0x45 0x72 0x69 0x63, read in little-endian memory order, are the ASCII characters "Eric", i.e. string data that had overwritten the return address.
One problem is guessing how much buffer space a sprintf() call will require. However, sprintf() lets you specify maximum lengths for inserted string fields to limit the size of the output string (e.g. sprintf(buf, "...%.*s...", (int)(sizeof(buf)-1-...), p)). Note that the precision argument for %.*s must be an int, hence the cast.
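A complete instance of this technique might look as follows. The buffer size MSG_SIZE, the function format_user() and the "user=" prefix are invented for the sketch.

```c
#include <stdio.h>

#define MSG_SIZE 32   /* hypothetical buffer size for this sketch */

/* Formats "user=<name>" into out, which must hold MSG_SIZE bytes.
   The name field is cut so that prefix, name and NUL always fit. */
void format_user(char *out, const char *name)
{
    static const char prefix[] = "user=";
    /* sizeof(prefix) counts the NUL, so room for it is already reserved. */
    int max_name = (int)(MSG_SIZE - sizeof(prefix));

    sprintf(out, "%s%.*s", prefix, max_name, name);
}
```

Where C99 is available, snprintf() is an alternative that takes the buffer size directly.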
Note that buffer overflows are security problems. Overwriting stack-based buffers (with very good knowledge of the affected program and the system it runs on) can be used to insert manipulated function return addresses and hence execute malicious code [43].

Macro parameters
Macro parameters must be parenthesized to preserve the intended operator precedence.
#define sqr(x) x*x
sqr(a+b)
expands to a+b*a+b, which is not the intended square, whereas
#define sqr(x) ((x)*(x))
sqr(a+b)
expands to ((a+b)*(a+b)) as intended. Note the parentheses around each parameter and around the whole result.
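The difference can be checked with concrete numbers. In this sketch the macros are renamed SQR_BAD and SQR so that both can coexist; the wrapper functions are hypothetical.

```c
#define SQR_BAD(x) x*x          /* unprotected parameters */
#define SQR(x) ((x)*(x))        /* protected parameters and result */

/* Expands to a + b*a + b because of operator precedence. */
int sqr_bad_sum(int a, int b) { return SQR_BAD(a+b); }

/* Expands to ((a+b)*(a+b)). */
int sqr_sum(int a, int b) { return SQR(a+b); }
```

With a = 1 and b = 2, the unprotected version computes 1 + 2*1 + 2 = 5, while the protected version computes 3*3 = 9.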

Macro side effects
Repeated side effects are inevitable if a macro parameter appears more than once in the expansion, e.g.
#define max(a,b) ((a)>(b)?(a):(b))
k = max(i++,j++);
Here the argument that is selected as the result is incremented twice. The task of a macro can be implemented in a function instead, unless it sits in a performance-critical part of the code. Looking at the compiler output will show whether function inlining (as a compiler optimization step) produces the same result that would be expected from macro expansion.
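The following sketch contrasts the macro with a function replacement. MAX is the macro from above in upper case; max_int() and the two wrapper functions are hypothetical names used only to make the effect observable.

```c
#define MAX(a,b) ((a)>(b)?(a):(b))

/* Function version: each argument is evaluated exactly once. */
int max_int(int a, int b) { return a > b ? a : b; }

/* Applies the macro to i++ and j++ and reports the final i and j. */
int max_with_macro(int i, int j, int *i_out, int *j_out)
{
    int k = MAX(i++, j++);   /* the selected operand is incremented twice */

    *i_out = i;
    *j_out = j;
    return k;
}

/* Same call through the function. */
int max_with_function(int i, int j, int *i_out, int *j_out)
{
    int k = max_int(i++, j++);

    *i_out = i;
    *j_out = j;
    return k;
}
```

Starting from i = 1 and j = 2, the macro version yields k = 3, i = 2, j = 4, whereas the function version yields k = 2, i = 2, j = 3.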

Sign extension
Sign extension must be considered if a signed char is converted to a short or long or a signed short is converted to a long.
A typical example is the sign extension from character to integer:
char* p; ...
printf("0x%02x", *p);
may print 0xffffffe4 rather than the intended 0xe4 (if char is signed and int has 32 bits). The character needs a cast to unsigned char before the conversion:
printf("0x%02x", (unsigned char)*p);
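Wrapped into a small helper, the cast is applied in one place only. The function name format_byte() and the 5-byte output buffer are assumptions of this sketch.

```c
#include <stdio.h>

/* Writes e.g. "0xe4" into out, which must hold at least 5 bytes. */
void format_byte(char *out, char c)
{
    /* Go through unsigned char first so that no sign extension occurs. */
    sprintf(out, "0x%02x", (unsigned int)(unsigned char)c);
}
```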

Error checking
Missing error checks may lead to bugs.
Lint's warning "return value sometimes ignored" may help to identify offending code locations.
A classic C programming error is not checking malloc() for a null pointer return [44].
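A checked allocation could be sketched as follows. dup_string() is a hypothetical helper; how a failure is reported (here by returning a null pointer to the caller) is a design choice, not something prescribed by the text.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a heap copy of s, or NULL if malloc() fails.
   The caller checks the result and later calls free(). */
char *dup_string(const char *s)
{
    size_t n = strlen(s) + 1;   /* include the terminating NUL */
    char *copy = malloc(n);

    if (copy == NULL)
        return NULL;            /* propagate the error instead of crashing */
    memcpy(copy, s, n);
    return copy;
}
```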

Sequence points
A statement such as
*p++ = *p++ = 0;
invokes undefined behavior, i.e. is either
*p = 0;
p++;
*p = 0;
p++;
(intended) or
*p = 0;
*p = 0;
p++;
p++;
(not intended) or something worse (even less intended), because C does not define a sequence point between the assignments. Note that the first step will (probably) be an assignment and the last an increment, but the order in between is not determined.
It would be nice if undefined behavior caused by missing sequence points were generally diagnosed by compilers [45].
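The intended effect can be written in a well-defined way by giving each assignment its own statement, since the end of every full expression is a sequence point. clear_two() is a made-up name for this sketch.

```c
/* Stores two zeros starting at *pp and advances *pp by two elements. */
void clear_two(int **pp)
{
    int *p = *pp;

    *p++ = 0;   /* p is modified once per statement */
    *p++ = 0;
    *pp = p;
}
```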

Optimizer errors
Sometimes hard-to-track errors originate from bugs in the compiler's optimization step. The optimizer may, e.g., treat a variable as invariant and produce erroneous code.
There are two considerations:
• Do most of the development without optimization.
• Look at the assembly output if you suspect errors.
When (and if) switching to optimized release code, test cases must be run to check integrity.

Style
Numbers
Numbers (numeric constants) should not appear in the code. Explicitly used numbers should be limited to 0, 1 and -1.
Define the numbers as constants, macros [46] or enums outside the functions.
They should especially not appear in the code if they're meaningful for limitations or performance of an algorithm (e.g. if they limit some input size).
Exceptions are
• buffers that start at some size and grow if needed
• values that are encapsulated deep in some implementation of an algorithm
Use as few hardcoded values as possible. Don't use statically sized tables of data, since they are almost never appropriate.
Don't generate any hidden dependencies among constants. Define constants by means of the constants they derive from.
Numeric constants are hard to understand if they are neither commented nor composed of other named constants. Compilers are quite able to do arithmetic at compile time, so use them.
Example: if you need a buffer to hold the string representation of an integer, define its size in terms of INT_MAX or sizeof(int), e.g. sizeof(int)*5/2+3 [47] (assuming 8 bits per byte).
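As a sketch of this rule, the buffer size below is derived from sizeof(int) rather than written as a bare number. The names INT_STR_SIZE, int_to_str() and longest_int_len() are assumptions made for the example.

```c
#include <limits.h>
#include <stdio.h>

/* Room for sign, digits and NUL of any int, assuming 8-bit bytes. */
#define INT_STR_SIZE (sizeof(int)*5/2+3)

/* Writes the decimal form of v into buf, which must hold INT_STR_SIZE bytes.
   Returns the number of characters written (excluding the NUL). */
int int_to_str(char *buf, int v)
{
    return sprintf(buf, "%d", v);
}

/* Length of the longest possible output, produced by INT_MIN. */
int longest_int_len(void)
{
    char buf[INT_STR_SIZE];

    return int_to_str(buf, INT_MIN);
}
```

For a 32-bit int the macro evaluates to 13, while INT_MIN needs 11 characters plus the terminating NUL.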

Unsigned numbers
You may consider not using unsigned values at all in application programming [48].
Typically you gain only one bit out of 32 or 64, which can often be neglected. Again: know your problem domain. If you need more than 31 bits in an application, you may want to switch to 63 bits or bignums.
The C language also does not raise an exception if you subtract a larger unsigned number from a smaller one (the result silently wraps around), so you cannot make your programs more robust by using unsigned values.
Mixing signed and unsigned values leads to surprises when comparing or adding them.
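Both effects can be shown in a few lines. The function names are hypothetical, and a compiler may warn about the mixed comparison, which is exactly the point.

```c
/* Unsigned subtraction wraps around modulo UINT_MAX+1; no error is raised. */
unsigned int unsigned_diff(unsigned int a, unsigned int b)
{
    return a - b;
}

/* In the comparison, s is converted to unsigned int first,
   so a negative s becomes a very large value. */
int signed_less_than_unsigned(int s, unsigned int u)
{
    return s < u;
}
```

unsigned_diff(1, 2) returns the largest unsigned int value instead of -1, and signed_less_than_unsigned(-1, 1) returns 0, although -1 is mathematically less than 1.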

Longs, shorts
Using longs is an issue on 16-bit systems (whether you develop for 16-bit systems directly or plan to port your products to them at some point) [49].
Traditionally, long and short (or unsigned long, unsigned short) were used (together with htonl(), ntohl(), htons(), ntohs()) in implementing low-level network protocols, such as UDP-based application protocols [50]. The assumption was that C implementations define a long to be exactly 32 bits, which is, however, not guaranteed by the C language standard.
Use ASCII representations of numbers when you write them to a file or the network, in order to be independent of the system architecture (size, byte order, padding) [51].
Besides htonl() and ASCII, there are architecture-independent data representation libraries such as XDR [52]. However, ASCII representations seem easier to debug because they are human readable.
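A minimal sketch of an ASCII representation: the number is written as decimal text and parsed back with strtol(). The function names and the newline-terminated format are assumptions of this example, and snprintf() requires C99.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Writes v as decimal text plus newline into buf of the given size.
   Returns 0 on success, -1 if the text did not fit. */
int encode_number(char *buf, size_t size, long v)
{
    int n = snprintf(buf, size, "%ld\n", v);

    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}

/* Parses the text back. Returns 0 on success, -1 if no digits were found. */
int decode_number(const char *text, long *out)
{
    char *end;
    long v = strtol(text, &end, 10);

    if (end == text)
        return -1;
    *out = v;
    return 0;
}
```

The text form is the same regardless of the integer size and byte order of the machines that write and read it.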
Using shorts may save significant space in large arrays. However, if the problem domain changes, shorts may become too small. Conversions from shorts to ints and back may also bring some computational overhead.
Also be aware that unexpected alignment padding may occur if you mix shorts and longs. Example: struct {short a; long b;}; will most probably have a size of eight bytes, not six (with 16-bit shorts and 32-bit longs) [53].

Floats
Avoid floating point numbers (double, float) if possible.
The reasons are:
• Integers may be better suited to the discrete nature of a problem.
• Integer arithmetic is faster than floating point arithmetic (if that matters).
• Not using floating point numbers results in smaller executables on systems that require floating point handling routines [54] and link them statically.
Many problems can be solved without floats. E.g. a typical hash table high-water mark of 0.75 may be expressed as a ratio and handled with integer arithmetic: if(4*items > 3*size) ...
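The high-water-mark test could be wrapped into a function like the following sketch; above_high_water_mark() is a made-up name.

```c
#include <stddef.h>

/* True if the table is more than 75% full. 4*items > 3*size is the integer
   form of items/size > 0.75. Assumes 4*items does not overflow size_t. */
int above_high_water_mark(size_t items, size_t size)
{
    return 4 * items > 3 * size;
}
```

With size = 100, the function returns 0 for 75 items and 1 for 76 items, with no rounding involved.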
Avoid single precision float. Use double.
If you have to deal with single precision floats on file, then encapsulate the code that deserializes (reads them back).
If space counts, you may consider using scaled integers adapted to the problem domain (e.g. shorts representing thousandths).

Parameter types
Express arrays as pointers in the function parameters.
Use int main(int argc, char** argv) instead of int main(int argc, char* argv[]). The internal semantics of a parameter are that of a variable declared as char** argv, not char* argv[].

Variable arguments
Variable argument functions [55] don't let the compiler check the number and types of the arguments. For this reason you may choose to use them sparingly.
Some compilers (and e.g. lint) warn of wrong arguments supplied to the variable argument function families printf() and scanf(), which are part of the standard C library.
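The missing checks become visible in a self-written variable argument function. sum_ints() below is a hypothetical example built on <stdarg.h>.

```c
#include <stdarg.h>

/* Sums count arguments of type int. The compiler cannot verify that the
   caller really passes count values, or that they are ints. */
int sum_ints(int count, ...)
{
    va_list ap;
    int i, total = 0;

    va_start(ap, count);
    for (i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);
    return total;
}
```

sum_ints(3, 1, 2, 3) works as expected, but a call such as sum_ints(3, 1, 2) or sum_ints(2, 1.5, 2) compiles just as well and invokes undefined behavior at run time.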

Portability types
If a simple type (e.g. some kind of identification number) is expected to change at some point (e.g. from short to long), then introduce a type synonym for it using typedef.

Standard library
Use the standard library functions where possible. They are portable and usually optimized.
Some standard library functions might even get inline expanded (memcpy()), so there's probably no performance problem.
You should use stuff that's offered. E.g. strerror() will tell a lot about the origin of an error reported by the operating system. Not using it will leave the user and support group clueless.
Don't use gets() and the scanf() family, for safety reasons (buffer overflow crashes or program corruption) and security reasons (buffer overflow exploits). Use fgets() instead of gets(), and fgets() followed by strtok(), atoi() etc. instead of scanf().
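The fgets() and strtok() combination might be sketched like this. The function names are invented, and strtol() is used in place of atoi() here because it has defined behavior on out-of-range input.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sums the whitespace-separated integers in line.
   The line is modified, because strtok() inserts NUL characters. */
long sum_fields(char *line)
{
    long total = 0;
    char *tok;

    for (tok = strtok(line, " \t\n"); tok != NULL; tok = strtok(NULL, " \t\n"))
        total += strtol(tok, NULL, 10);
    return total;
}

/* Reads one line of at most size-1 characters from fp and sums its fields.
   Returns 0 on success, -1 on end of file or read error. */
int sum_line(FILE *fp, char *buf, int size, long *sum)
{
    if (fgets(buf, size, fp) == NULL)
        return -1;
    *sum = sum_fields(buf);
    return 0;
}
```

Because fgets() receives the buffer size, an overlong input line is split rather than written past the end of buf.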

NULL macro
Null pointer comparisons can be expressed by
• if(!p) ...
• if(p == 0) ...
• if(p == NULL) ...
all three being perfectly valid in C [56].

Register
Don't use register.
One could assume that compilers know the CPU registers better than the C programmer does, since they are the interfaces to the register-using assembly languages.
Also, compilers are free to ignore the register keyword (and often will [57]).

Auto
Don't use auto.
Rarely, auto was used to emphasize that a function variable explicitly needs to be automatic, e.g. in a recursive function in which some variables may be made static (to save stack space). However, the latter is bad practice since it is not thread-safe [58].

Goto
Don't use goto. Gotos lead to a confusing program flow.
Most of the control flow problems can be solved by using additional layers of local functions (that need not imply overhead). Use return to jump out of them. Introducing function layers may enhance modularity and code encapsulation.
Also prefer break and continue to goto.
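Leaving a nested loop is a common reason for reaching for goto. A sketch of the function-layer alternative follows; find_in_grid() and its parameters are hypothetical.

```c
/* Searches a rows x cols grid (stored row by row) for target.
   Returning from this helper leaves both loops at once, without goto.
   Returns 1 and sets the position if found, 0 otherwise. */
int find_in_grid(const int *grid, int rows, int cols, int target,
                 int *row_out, int *col_out)
{
    int r, c;

    for (r = 0; r < rows; r++) {
        for (c = 0; c < cols; c++) {
            if (grid[r * cols + c] == target) {
                *row_out = r;
                *col_out = c;
                return 1;
            }
        }
    }
    return 0;
}
```

The caller's own control flow stays flat: it only tests the return value.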

Multiple returns
Opinions differ about using multiple return statements in a function.
I see multiple returns as a good micro design construct. They allow function code to be less deeply nested.

Obscurities
Avoid the use of the logical operators && and || as standalone statements.
Don't use
f() && g();
instead of
if(f()) g();
Don't overuse the comma operator.

Topics left out
The style topics intentionally left out in this paper are
• the pros and cons of the ternary operator ?: [59]
• the pros and cons of polymorphic function arguments (e.g. declaring a parameter as a long and casting all possible arguments to it)
• the use of a function parameter to hold separate information (e.g. putting two shorts into a long parameter)
• the pros and cons of code optimization and when and where to deploy it [60]
• recommendations on how many lines of code per function and module [61]
