John D. Corbett
18 October 2003
A Size Tree is an interactive diagram design for presenting hierarchies that highlights size relationships across levels of hierarchy. It combines a tree drawn as nested boxes and a stacked bar chart. An efficient algorithm has been developed for drawing size trees. Interactive behavior is demonstrated with a Java applet.
Size trees are useful for visualizing file systems amd many aspects of software development such as subroutine size, code performance and coding activity. A plug-in for the Eclipse IDE is under development.
The population of the United States circa 2000 is presented below as a size tree. Height represents population. Boxes represent regions and states. Items to small to be rendered at the current scale are indicated with horizontal lines. Boxes that continue beyond the top or bottom of the view are drawn without borders and with small arrows to indicate more above or below. The color coding is designed to provide secondary indicators of the hierarchical relationships, the position of the view within the entire data set, and the scale of the view.
Several observations are easy to make that would not be as easy if the data were displayed in tabular form. California and Texas have the tallest boxes because they are the most populous states. The population of the South about the same as the Northeast and Midwest combined.
In the Java applet, clicking on the West, would scale the data so that the West filled the available height. In this view, it is apparent that California is more populous than the entire Mountain region.
Click here to view the Java Size Tree applet.
In the Java applet, you can click on any state or region to change the scale of the chart. The fraction of the size in view is displayed at the upper left corner.
Simple tree layout algorithms like this one are typically require time and space linearly proportional to the number of nodes. Layout can be accomplished in a single pass over the data, positioning parents relative to already positioned children.
The storage used for the tree and the running times for layout and hit detection are all proportional to the number of nodes. Although hit detection could be further optimized, tests with millions of nodes run at interactive speeds.
Input consists of a list of expressions. Terminals (leaf nodes) are written as a label and size separated by a colon, e.g. foo:42. Non-terminals (tree nodes) are written as a label and a parenthesized list of expressions, e.g. a(foo:42, bar(), baz:0).
Layout consists of three steps:
After layout parameters are calculated, drawing proceeds in two passes:
In the demonstration system, after the tree is initially constructed, no nodes are added or deleted. Thus it is efficient and convenient to store the tree nodes in an array. Nodes are stored in preorder. There are two classes of nodes: Leaf and Tree. A node stores a label, top coordinate, bottom coordinate and a reference to its parent. A leaf has a size in abstract units. A trees has a pointer to the next node below in the tree's column, a depth, and a width.
If storage needed to be further optimized, there are several opportunities.
Leaves are densely packed, therefore it is redundant to store both the top and the bottom of every leaf.
The parent of a node is always the nearest preceeding tree node in the array, since the nodes are arranged in preorder. Therefore, it is not necessary for each node to store it's parents. Care would be required to keep the running time linear.
Since the layout is constructed from abstract sizes, parents, and abstract sizes, it would be possible to store only the labels and the abstract sizes and recalculate layout parameters as needed.
In the current demonstration, no provision is made for deep hierarchies or long labels, either of which could overflow the available width. Several schemes are under consideration. See Future directions below.
Clicking on any node (box) scales that node to fill the available height. Clicking outside the boxes scales the entire data set to fill the available height. Clicking on the copyright notice takes you to my home page. :-)
Information is well organized if you can quickly find the answers to your questions and spot important relationships. By annotating the stacked bar chart with the hierarchy, it becomes possible to see at a glance the relative magnitude of collections as well as individuals.
The idea of displaying a tree with heights representing file sizes was independently invented by P. Dykstra in 1991. See XDU - A Disk Usage (DU) or other tree data display program.
Querying Web Site Visitor Trend Data with Coordinated Nested Bar and Pie Charts
Also does the stacked bar chart + tree.
Hold column size constant so area comparisons are not potentially misleading. Create a standalone utility. Scrolling. Split pane. Locate highlighting. Mouse press highlighting. Call outs to make better use of space in the diagram. Ability to hide or contract nodes. Non-linear scales, esp. qualitative scaling using cluster analysis. Animation. Scale indicated by texture mapping. Use color to show an additional property of the data such as file last modification date. Horizontal scrolling. A splitter bar to allow different regions to be compared side by side (what of scale? vertical scroll bars would definitely be needed since scales would have to be same to make comparison meaningful). Collapsing subtrees suffers the limitation that if the source is badly organized, there may not be a handle to collapse the stuff you need to filter out. Filters based on other criteria, such as file name wild cards or additional properties such as age