Size Trees: augmenting hierarchies to aid size comparisons between categories and individuals

John D. Corbett
18 October 2003

Abstract

A Size Tree is an interactive diagram design for presenting hierarchies that highlights size relationships across levels of hierarchy. It combines a tree drawn as nested boxes and a stacked bar chart. An efficient algorithm has been developed for drawing size trees. Interactive behavior is demonstrated with a Java applet.

Size trees are useful for visualizing file systems amd many aspects of software development such as subroutine size, code performance and coding activity. A plug-in for the Eclipse IDE is under development.

Using Size Trees to visualize US population data

The population of the United States circa 2000 is presented below as a size tree. Height represents population. Boxes represent regions and states. Items to small to be rendered at the current scale are indicated with horizontal lines. Boxes that continue beyond the top or bottom of the view are drawn without borders and with small arrows to indicate more above or below. The color coding is designed to provide secondary indicators of the hierarchical relationships, the position of the view within the entire data set, and the scale of the view.

Several observations are easy to make that would not be as easy if the data were displayed in tabular form. California and Texas have the tallest boxes because they are the most populous states. The population of the South about the same as the Northeast and Midwest combined.

Size Tree showing US population by state and region

Fig. 1. Size Tree diagram showing US population for states and regions.

In the Java applet, clicking on the West, would scale the data so that the West filled the available height. In this view, it is apparent that California is more populous than the entire Mountain region.

Size Tree showing population of Western United States

Fig. 2. Size Tree diagram focused on the Western United States.

Click here to view the Java Size Tree applet.

In the Java applet, you can click on any state or region to change the scale of the chart. The fraction of the size in view is displayed at the upper left corner.

Algorithms

Simple tree layout algorithms like this one are typically require time and space linearly proportional to the number of nodes. Layout can be accomplished in a single pass over the data, positioning parents relative to already positioned children.

The storage used for the tree and the running times for layout and hit detection are all proportional to the number of nodes. Although hit detection could be further optimized, tests with millions of nodes run at interactive speeds.

Input consists of a list of expressions. Terminals (leaf nodes) are written as a label and size separated by a colon, e.g. foo:42. Non-terminals (tree nodes) are written as a label and a parenthesized list of expressions, e.g. a(foo:42, bar(), baz:0).

Layout consists of three steps:

  1. A stacked bar chart of the leaves is created, scaled to fit the available height. The leaf column width is set to accommodate the widest label, whether displayed or not.
  2. Tops and bottoms of subtrees are chosen to coincide with the first and last leaf they span respectively.
  3. Subtrees are arranged in columns. The tree column width is set to accommodate the widest label, whether displayed or not. When there is a choice between drawing a tree closer to parent or closer to descendents, the tree is drawn closer to the children.

After layout parameters are calculated, drawing proceeds in two passes:

  1. In the first pass, boxes are drawn for all leaf nodes and subtrees that are large enough to be drawn as boxes at the current scale.
  2. In the second pass, horizontal lines are drawn to highlight leaf nodes to small to be drawn at the current scale. Finally, a heavy line is drawn at the top of each tree, whether it is large enough to have a box or not. No special accommodation is made for multiple items too small to draw at the current scale.

In the demonstration system, after the tree is initially constructed, no nodes are added or deleted. Thus it is efficient and convenient to store the tree nodes in an array. Nodes are stored in preorder. There are two classes of nodes: Leaf and Tree. A node stores a label, top coordinate, bottom coordinate and a reference to its parent. A leaf has a size in abstract units. A trees has a pointer to the next node below in the tree's column, a depth, and a width.

If storage needed to be further optimized, there are several opportunities.

Leaves are densely packed, therefore it is redundant to store both the top and the bottom of every leaf.

The parent of a node is always the nearest preceeding tree node in the array, since the nodes are arranged in preorder. Therefore, it is not necessary for each node to store it's parents. Care would be required to keep the running time linear.

Since the layout is constructed from abstract sizes, parents, and abstract sizes, it would be possible to store only the labels and the abstract sizes and recalculate layout parameters as needed.

In the current demonstration, no provision is made for deep hierarchies or long labels, either of which could overflow the available width. Several schemes are under consideration. See Future directions below.

Interactive behavior

Clicking on any node (box) scales that node to fill the available height. Clicking outside the boxes scales the entire data set to fill the available height. Clicking on the copyright notice takes you to my home page. :-)

Conclusion

Information is well organized if you can quickly find the answers to your questions and spot important relationships. By annotating the stacked bar chart with the hierarchy, it becomes possible to see at a glance the relative magnitude of collections as well as individuals.

Prior Art

The idea of displaying a tree with heights representing file sizes was independently invented by P. Dykstra in 1991. See XDU - A Disk Usage (DU) or other tree data display program.

Querying Web Site Visitor Trend Data with Coordinated Nested Bar and Pie Charts

Mark Sifer

http://crpit.com/confpapers/CRPITV11Sifer1.pdf

Also does the stacked bar chart + tree.

Future Directions

Hold column size constant so area comparisons are not potentially misleading. Create a standalone utility. Scrolling. Split pane. Locate highlighting. Mouse press highlighting. Call outs to make better use of space in the diagram. Ability to hide or contract nodes. Non-linear scales, esp. qualitative scaling using cluster analysis. Animation. Scale indicated by texture mapping. Use color to show an additional property of the data such as file last modification date. Horizontal scrolling.  A splitter bar to allow different regions to be compared side by side (what of scale? vertical scroll bars would definitely be needed since scales would have to be same to make comparison meaningful). Collapsing subtrees suffers the limitation that if the source is badly organized, there may not be a handle to collapse the stuff you need to filter out. Filters based on other criteria, such as file name wild cards or additional properties such as age