On header files and dependencies

Dirk van Deun, dirk at dinf.vub.ac.be
March 2005, June 2005

Motivation

Earlier I developed some simple tools for what I call aspect colouring, or what you could also call concern colouring or API colouring, and which is a technique to make your editor highlight interesting aspects of program code to help you understand its overall structure, just like syntax colouring and automatic indentation help you see its local structure. My tools work best for projects with many separate and reasonably well-organised header files: the idea is that the programmer selects header files or sets of header files which correspond to interesting aspects of the project, and that the editor will then automatically colour anything related to these in separate colours.

The next problem to solve is of course how to identify header files which correspond to interesting aspects in a project that you have never worked on before, and how to do it simple and fast.

I would propose that interesting aspects are aspects that turn up in many places all over the project. Sometimes these are interesting because their dispersal makes them hard to understand; and sometimes they are interesting because they provide the structure that holds the project together. So the interesting header files should be among the ones that many other source files depend on. These dependencies can easily be abstracted and presented graphically by a tool, so that it will become evident which header files are central to the project. Selecting just the right header files will still be up to you, and should be done based on their names, some browsing, and lots of common sense.

This is all very well for smallish projects, but there is a limit to what one graph can represent in a readable way. So we also need to be able to do a rough first selection of which header files may be important and which are not. I will propose a simple way of handling this too.

Drawing graphs

I wrote a shell script, cdeps, which takes a number of C source and header files as arguments, counts dependencies of function calls and macro expansions on function declarations and macro definitions among them, and presents these to the user in one big graph. Its output is written in the graph definition language of dot, one of the tools in the graphviz package, so the simplest use of the tool is as follows:

  cdeps *.c *.h > mygraph.dot
  dot -Tps mygraph.dot -o mygraph.ps

This will generate a text file mygraph.dot describing the graph, and a PostScript file mygraph.ps of the drawing itself. Dot can generate many other output formats, like .jpeg and .png, and even has facilities for generating clickable maps. More about dot can be read in the excellent graphviz documentation.

Note that the graph made in the above example will contain dependencies between source and header files, but also dependencies among header files. That may be interesting, but it can also generate lots of unwanted information. You might for instance be interested in which C library functions are used in your project, but you are probably not interested in the internal dependencies of the C library header files, so this is not the way to go:

  cdeps *.c *.h /usr/include/*.h > mygraph.dot

That is why cdeps has the option -p, which stands for "prepare". You use cdeps -p on a set of header files to generate a text database cdeps.db about their function declarations and macro definitions. This database will be used to locate function declarations and macro definitions during future runs of cdeps in the same directory. So if you do want to see internal dependencies among the header files in the current directory, but not among header files outside the current directory, you might do:

  cdeps -p *.h /usr/include/*.h
  cdeps *.c *.h > mygraph.dot
Expert users of gcc will be glad to hear that cdeps can handle the output of gcc -M and gcc -MM among its parameters, by simply ignoring doubles and the names of object files. So the following commands will achieve the same effect as above (or better, if the example above missed some relevant header files in other directories):
  cdeps -p `gcc -M *.c`
  cdeps *.c *.h > mygraph.dot

To conclude I should mention the option -z, which stands for "zoom", and which adds the functions and macros as nodes to the graph. (Nodes with the same incoming and outgoing edges are combined.) This can be interesting, but it can also generate very big and very messy graphs if used on more than 20 files or so.

Making a rough first selection

A header file will probably not be very important for the overall structure of a project if its functions and macros are used only in a few source files. Similarly header files that declare functions and macros that are used throughout the project but only very sparsely do not seem very interesting. So a first rough estimate of the importance of a header file could be computed from the amount of source files that contain references to it, and from the amount of references itself. To make these two values fit for meaningful addition we convert them into percentages: a percentage of the total amount of source files and a percentage of the total amount of references. These we add, resulting in a "weight" value.

This weight tends to favour some very boring header files, however: for instance header files that declare only one function "log" that turns up all over the project. It would certainly seem that header files that declare a set of functions and macros that are often used together might be more interesting candidates for aspect colouring: so a better way to compute weights should give extra marks for clusters of functions and macros. I propose to count the different functions and macros from a header file in all the source files that contain at least one such reference, and multiply the weight value from the previous paragraph with the median of these.

The cdeps tool computes such weights when used with the option -w and uses these to make a rough ranking of the header files, more or less from the most interesting to the most boring:

  cdeps -p *.h
  cdeps -w *.c

Implementation

Cdeps depends heavily on exuberant ctags, which is a great improvement on the old standard ctags. As exuberant ctags can extract lots of useful information from possibly syntactically incorrect code in many programming languages, it is the best tool available for quick-and-dirty source code analysis. Even exuberant ctags does not extract much structural information, however, so cdeps also depends on cscope, which is universally available, rather dirty, and occasionally very useful.

The basic algorithm to create a graph can be written in 10 lines, using ctags, cscope and some standard unix tools like sort, grep and join. This basic algorithm can still be extracted easily from the code of cdeps 1.1. Although the code of cdeps is pretty dense, it is also quite short and therefore not too hard to reuse or extend. As an example can serve the cgi-bin script cdeps.cgi, which creates clickable maps of cdeps -z.

The cdeps tool should work on any unix-like operating system, and has been tested on Slackware Linux, OpenBSD and Apple's OS/X. Installing it is a simple matter of downloading and making it executable.