Binary Search Trees

Steven J. Zeil

Old Dominion University, Dept. of Computer Science

Table of Contents

1. Definition: Binary Search Trees
1.1. The Binary Search Tree ADT
2. Implementing Binary Search Trees
2.1. Searching a Binary Tree
2.2. Inserting into Binary Search Trees
2.3. Deletion
3. How Fast Are Binary Search Trees?
3.1. Balancing
3.2. Performance
3.3. Balanced Best Case
3.4. Balanced Worst-Case
3.5. Degenerate Best-Case
3.6. Degenerate Worst-Case
3.7. Average-Case
3.8. Can We Avoid the Worst Case?

A tree in which every parent has at most 2 children is a binary tree.

The most common use of binary trees is for ADTs that require frequent searches for arbitrary keys.

For this we use a special form of binary tree, the binary search tree.

1. Definition: Binary Search Trees

A binary tree T is a binary search tree if for each node n with children TL and TR:

  • The value in n is greater than the values in every node in TL.

  • The value in n is less than the values in every node in TR.

  • Both TL and TR are binary search trees.
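The three conditions above can be checked mechanically. Here is a minimal sketch, assuming a hypothetical bare-bones `Node` struct holding `int` values (a simplified stand-in, not the textbook's node template). The names `isBST` and `isBSTBetween` are invented for this illustration.

```cpp
// Hypothetical minimal node type, used only for this illustration.
struct Node {
    int value;
    Node* left;
    Node* right;
};

// True if the subtree rooted at t is a BST whose values all lie strictly
// between *lo and *hi. A null bound means "no limit on that side".
bool isBSTBetween(const Node* t, const int* lo, const int* hi) {
    if (t == nullptr) return true;                       // an empty tree is a BST
    if (lo != nullptr && !(*lo < t->value)) return false;
    if (hi != nullptr && !(t->value < *hi)) return false;
    // Everything on the left must stay below t->value,
    // everything on the right must stay above it.
    return isBSTBetween(t->left, lo, &t->value)
        && isBSTBetween(t->right, &t->value, hi);
}

bool isBST(const Node* root) {
    return isBSTBetween(root, nullptr, nullptr);
}
```

Note that it is not enough to compare each node with its immediate children. The bounds passed down the recursion are what enforce the "every node in TL" and "every node in TR" parts of the definition.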

Question: Is this a BST?

1.1. The Binary Search Tree ADT

Yes, this is a binary search tree. Each node is greater than or equal to all of its left descendants, and is less than or equal to all of its right descendants.

Let's look at the basic interface for a binary search tree.

This code is taken from your textbook and is the same code used in our prior discussion of tree iterators.

Some points of note:

  • The stnode template implements individual tree nodes.

  • The stree template represents the entire tree, with functions for searching, insertion, iteration, etc.

  • Our primary focus in this lecture will be on the find, insert and erase functions.

2. Implementing Binary Search Trees

Since you have, presumably, read your text's discussion of how to implement BSTs, I'm mainly going to hit the high points.

2.1. Searching a Binary Tree

We'll start by reviewing the basic searching algorithm.

The tree's find operation works by using a private utility function, findNode, to find a pointer to the node containing the desired data. It then uses that pointer to construct an iterator representing the position of that node.

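We search a tree by comparing the value we're searching for to the current node, t. If the value we want is smaller, we look in the left subtree. If the value we want is larger, we look in the right subtree. If it is neither, we have found it. If we run out of tree (reach a null pointer), the value is not in the tree.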
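As a rough sketch of that search loop (not the textbook's code), here is an iterative version over a hypothetical simplified `Node` struct with `int` values. The function name echoes findNode, but the signature is an assumption made for this example.

```cpp
// Hypothetical minimal node type, used only for this illustration.
struct Node {
    int value;
    Node* left;
    Node* right;
};

// Returns a pointer to the node holding key, or nullptr if key is absent.
const Node* findNode(const Node* t, int key) {
    while (t != nullptr) {
        if (key < t->value)
            t = t->left;        // wanted value is smaller: go left
        else if (t->value < key)
            t = t->right;       // wanted value is larger: go right
        else
            return t;           // neither smaller nor larger: found it
    }
    return nullptr;             // ran off the bottom of the tree
}
```

Each pass through the loop moves one level down the tree, which is the fact we will lean on later when we ask how fast this is.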

You may note that this algorithm bears a certain resemblance to the binary search algorithm we studied earlier in the semester. We shall see shortly that, when the tree is reasonably balanced, the performance of both search algorithms on a collection of N items is O(log N). Binary search trees also support faster insertion operations, allowing us to build the searchable collection in less time than when using binary search over sorted arrays.

You can run this algorithm to see how it works.

2.2. Inserting into Binary Search Trees

The first part of the insertion function is closely related to the recursive form of the search. In fact, we are searching for the place where the new data would reside, if it were in the tree.

We know we have not found it when we reach a null pointer. Since that pointer (as either the left or right child of some parent node) was found by asking "where would this data go if it were in the tree?", we know that we can, in fact, insert the data here.
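The idea of "search until you hit a null pointer, then put the new node there" can be sketched as follows. This is an illustration under assumptions, not the textbook's insert: the `Node` struct is a hypothetical simplified type, and duplicates are assumed to be rejected.

```cpp
// Hypothetical minimal node type, used only for this illustration.
struct Node {
    int value;
    Node* left = nullptr;
    Node* right = nullptr;
    explicit Node(int v) : value(v) {}
};

// t is a reference to the pointer we followed to get here (the root pointer,
// or some parent's left or right field), so assigning to t links the new
// node into the tree. Returns false if key was already present.
bool insert(Node*& t, int key) {
    if (t == nullptr) {
        t = new Node(key);      // this is where key would have been found
        return true;
    }
    if (key < t->value) return insert(t->left, key);
    if (t->value < key) return insert(t->right, key);
    return false;               // assumption: duplicates are not stored
}

// Frees every node in the subtree rooted at t.
void destroy(Node* t) {
    if (t == nullptr) return;
    destroy(t->left);
    destroy(t->right);
    delete t;
}
```

Passing the pointer by reference is one design choice among several. It avoids having to remember the parent and which side we came from.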

You might want to run this algorithm and experiment with inserting nodes into binary search trees. Take particular note of what happens if you insert data in ascending or descending order, as opposed to inserting randomly ordered data.

2.3. Deletion

Our tree class actually provides two distinct approaches to erasing. We can erase the data at a given position (iterator) or erase a given value, if it exists.

Deleting a value is shown here. We simply do a conventional binary search tree findNode call and, if the value actually exists in the tree, erase the node at the position where we found the data.

In essence, this passes the buck to the "erase at a position" function, which we will look at next.

Here is the erase algorithm. For the moment, concentrate on the code for replacing the node we want to erase, dNodePtr, by a replacement node rNodePtr. You can see that it is careful to place the address of the replacement into either the tree root, the left child of the erased node's parent, or the right child of the erased node's parent, depending on the data value in the parent.

Most of the code in this function is actually concerned with finding that replacement node. We can break down the problem of finding a suitable replacement when removing a given node from a BST into cases:

  1. Removing a leaf

  2. Removing a node that has only one child

    • only a left child

    • only a right child

  3. Removing a node that has two children

Removing a Leaf

Case 1: Suppose we wanted to remove the 40 from this tree. What would we have to do so that the remaining nodes would still be a valid BST?

Nothing at all!

If we simply delete this node (setting the pointer to it from its parent to 0), what's left would still be a perfectly good binary search tree: it would still satisfy all the BST ordering requirements.

Now, take a look at this code for removing a node, pointed at by dNodePtr, from a BST. Find the leaf case, and you can see that all we do is to delete the node. (Note that the code assigns dNodePtr->left to rNodePtr, and that in the leaf case dNodePtr->left is null.)

So if we are removing a tree leaf, we "replace" it by a null pointer.

Removing a Node with Only One Child

Case 2: Suppose we wanted to remove the 20 or the 70 from this tree. What would we have to do so that the remaining nodes would still be a valid BST?

There is one pointer to the node being deleted, and one pointer from that node to its only child. So this is actually a bit like deleting a node from the middle of a linked list. All we need to do is to reroute the pointer that leads from the parent (30) to the node we want to remove, making that pointer point directly to the only child of the node we are going to remove.

For example, starting from this:

Verify for yourself if we remove 20:

or 70:

in this manner, that the results are still valid BSTs.

Again, take a look at this code for the case when the node being erased has exactly one child. Notice that its non-null child is chosen as the replacement node, rNodePtr.
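The rerouting step for the leaf and one-child cases can be sketched in a few lines. This is a simplified illustration, not the textbook's code: `Node` is a hypothetical minimal struct, and `spliceOut` is a name made up for this example. It takes a reference to whichever pointer currently leads to the doomed node.

```cpp
// Hypothetical minimal node type, used only for this illustration.
struct Node {
    int value;
    Node* left = nullptr;
    Node* right = nullptr;
    explicit Node(int v) : value(v) {}
};

// link is the pointer (root, parent->left, or parent->right) that currently
// points at the node to remove.
// Precondition: link is non-null and the node has at most one child.
void spliceOut(Node*& link) {
    Node* doomed = link;
    // Choose the non-null child as the replacement. For a leaf, both
    // children are null, so the replacement is simply a null pointer.
    Node* replacement = (doomed->left != nullptr) ? doomed->left
                                                  : doomed->right;
    link = replacement;         // reroute the parent's pointer past the node
    delete doomed;
}
```

Notice that the leaf case needs no special treatment here: it is just the one-child case where the "only child" happens to be null.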

Removing a Node with Two Non-Null Children

Case 3: Suppose we wanted to remove the 50 or the 30 from this tree. What would we have to do so that the remaining nodes would still be a valid BST?

This is a hard case. Clearly, if we remove either the "50" or "30" nodes, we break the tree into pieces, with no obvious place to put the now-detached subtrees.

So let's take a different tack. Instead of deleting this node, is there some other data value that we could put into that node that would preserve the BST ordering (all nodes to the left must be less, all nodes to the right must be greater or equal)?

There are, in fact, two values that we could safely put in there: the smallest value from the right subtree, or the largest value from the left subtree.

We can find the largest value on the left by

  • taking one step to the left

  • then running as far down to the right as we can go

We can find the smallest value on the right by

  • taking one step to the right

  • then running as far down to the left as we can go
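Both of these "one step, then run" walks are short loops. The following sketch assumes a hypothetical minimal `Node` struct. The function names `largestOnLeft` and `smallestOnRight` are invented for this example and do not come from the textbook.

```cpp
// Hypothetical minimal node type, used only for this illustration.
struct Node {
    int value;
    Node* left;
    Node* right;
};

// Precondition: t->left is non-null.
Node* largestOnLeft(Node* t) {
    Node* p = t->left;                  // one step to the left
    while (p->right != nullptr)         // then as far right as we can go
        p = p->right;
    return p;
}

// Precondition: t->right is non-null.
Node* smallestOnRight(Node* t) {
    Node* p = t->right;                 // one step to the right
    while (p->left != nullptr)          // then as far left as we can go
        p = p->left;
    return p;
}
```

Each loop stops exactly when the next pointer in its direction is null, so the node returned always has at least one null child.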

Now, if we replace 30 by the largest value from the left:

or by the smallest value from the right:

the results are properly ordered for a BST, except possibly for the node we just copied the value from. But since that node is now redundant, we can delete it from the tree.

And here's the best part. Since we find the node to copy from by running as far as we can go in one direction or the other, we know that the node we copied from has at least 1 null child pointer (otherwise we would have kept running past it). So removing it from the tree will always fall into one of the earlier, simpler cases (leaf or only one child).

Again, take a look at the code for removing a node. This code does the "step to the right, then run to the left" behavior we have just described in order to find the replacement node. The remaining code is concerned with removing that replacement node from where it currently resides so that we can then link it in to the parent of the node being erased.
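To see all three cases working together, here is a compact sketch of erase-by-value. Be aware that it is a deliberately simplified variant: it copies the replacement value into the node, as in the discussion above, rather than relinking the replacement node as the textbook code does. The `Node` struct and the signature of `erase` are assumptions made for this example.

```cpp
// Hypothetical minimal node type, used only for this illustration.
struct Node {
    int value;
    Node* left = nullptr;
    Node* right = nullptr;
    explicit Node(int v) : value(v) {}
};

// t is a reference to the pointer that leads to the current subtree.
// Returns true if key was found and removed. Assumes no duplicate values.
bool erase(Node*& t, int key) {
    if (t == nullptr) return false;                 // key is not in the tree
    if (key < t->value) return erase(t->left, key);
    if (t->value < key) return erase(t->right, key);

    if (t->left != nullptr && t->right != nullptr) {
        // Case 3: two children. Step to the right, then run to the left.
        Node* r = t->right;
        while (r->left != nullptr) r = r->left;
        t->value = r->value;                        // copy the value up
        // The node we copied from is now redundant. It has no left child,
        // so removing it falls into case 1 or case 2.
        return erase(t->right, r->value);
    }

    // Cases 1 and 2: at most one child. Replace the node by its non-null
    // child, or by a null pointer if it is a leaf.
    Node* doomed = t;
    t = (t->left != nullptr) ? t->left : t->right;
    delete doomed;
    return true;
}
```

Copying values is easy to follow, but it changes which node holds which value, so it would not suit a tree whose iterators hold pointers to nodes. Relinking the replacement node avoids that problem at the cost of more pointer bookkeeping.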

Finally, try running this algorithm, available as erase from a position. Try to observe each of the major cases, as outlined here, in action.

3. How Fast Are Binary Search Trees?

Each step in the BST insert and findNode algorithms moves one level deeper in the tree. Similarly, in erase, the only part that is not constant time is running down the tree to find the smallest value to the right.

The number of recursive calls/loop iterations in all these algorithms is therefore no greater than the height of the tree.

But how high can a BST be?

That depends on how well the tree is balanced.

3.1. Balancing

A binary tree is balanced if, for every interior node, the heights of its two children differ by at most 1.

Unbalanced trees are easy to obtain.

This is a BST.

But, so is this!

The shape of the tree depends upon the order of insertions.

The worst case is when the data being inserted is already in order (or in reverse order). In that case, the tree degenerates into a sorted linked list, as shown here.
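You can check the effect of insertion order with a small experiment. The sketch below uses a hypothetical minimal `Node` struct and a plain, non-rebalancing insert (not the textbook's classes). Height is counted in edges here, so a single node has height 0. That convention is an assumption of this example.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical minimal node type, used only for this illustration.
struct Node {
    int value;
    Node* left = nullptr;
    Node* right = nullptr;
    explicit Node(int v) : value(v) {}
};

// Ordinary BST insertion with no rebalancing; duplicates are ignored.
void insert(Node*& t, int key) {
    if (t == nullptr)          t = new Node(key);
    else if (key < t->value)   insert(t->left, key);
    else if (t->value < key)   insert(t->right, key);
}

// Height counted in edges: a single node has height 0, an empty tree -1.
int height(const Node* t) {
    if (t == nullptr) return -1;
    return 1 + std::max(height(t->left), height(t->right));
}

// Frees every node in the subtree rooted at t.
void destroy(Node* t) {
    if (t == nullptr) return;
    destroy(t->left);
    destroy(t->right);
    delete t;
}

// Builds a tree by inserting keys in the order given; returns its height.
int heightAfterInserting(const std::vector<int>& keys) {
    Node* root = nullptr;
    for (int k : keys) insert(root, k);
    int h = height(root);
    destroy(root);
    return h;
}
```

With the seven keys 1 through 7, inserting in ascending or descending order gives a chain of height 6, while the order 4, 2, 6, 1, 3, 5, 7 gives a perfectly balanced tree of height 2.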

The best case is when the tree is balanced, meaning that, for each node, the heights of the node's children are nearly the same.

3.2. Performance

Consider the findNode operation on a nearly balanced tree with N nodes.

Question: What is the complexity of the best case?

  • O(1)

  • O(log N)

  • O(N)

  • O(N log N)

  • O(N^2)

3.3. Balanced Best Case

In the best case, we find what we're looking for in the root of the tree. That's O(1) time.

Question: Consider the findNode operation on a nearly balanced tree with N nodes.

What is the complexity of the worst case?

  • O(1)

  • O(log N)

  • O(N)

  • O(N log N)

  • O(N^2)

3.4. Balanced Worst-Case

The findNode operation starts at the root and moves down one level with each recursive call. So it is, in the worst case, O(h) where h is the height of the tree.

But how high is a balanced tree?

A nearly balanced tree has a height of about log N (base 2). Consider a tree that is completely balanced and has its lowest level full. Since every node on the lowest level shares a parent with one other, there will be exactly half as many nodes on the next-to-lowest level as on the lowest. And, by the same reasoning, each level will have half as many nodes as the one below it, until we finally get to the single root at the top of the tree. Read from the top down, the level sizes are 1, 2, 4, 8, and so on, so the number of levels needed to hold N nodes grows only as log N.

So a balanced tree has height O(log N).
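For readers who want the arithmetic behind that claim, here is the calculation for the completely balanced tree with a full lowest level. The symbol h (height counted in edges, so the levels are numbered 0 through h) is a convention assumed for this derivation.

```latex
\[
N \;=\; \sum_{i=0}^{h} 2^{i} \;=\; 2^{h+1} - 1
\qquad\Longrightarrow\qquad
h \;=\; \log_2 (N+1) - 1 \;=\; O(\log N)
\]
```

For example, a full tree of height 9 holds 2^10 - 1 = 1023 nodes, so a search in it examines at most 10 nodes.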

Question: Now, consider the findNode operation on a degenerate tree with N nodes.

What is the complexity of the best case?

  • O(1)

  • O(log N)

  • O(N)

  • O(N log N)

  • O(N^2)

3.5. Degenerate Best-Case

In the best case, we find what we're looking for in the root of the tree. That's O(1) time.

Question: Consider the findNode operation on a degenerate tree with N nodes.

What is the complexity of the worst case?

  • O(1)

  • O(log N)

  • O(N)

  • O(N log N)

  • O(N^2)

3.6. Degenerate Worst-Case

A degenerate tree looks like a linked list. In the worst case, the value we're looking for is at the end of the list, so we have to search through all N nodes to get there. Thus the worst case is O(N).

There's quite a difference, then, between the worst case behavior of trees, depending upon the tree's shape.

3.7. Average-Case

So the question is, does the "average" binary tree look more like the balanced or the degenerate case?

An intuitive argument is:

  • No tree with n nodes has height `< log(n)`

  • No tree with n nodes has height `> n`

  • Average depth of all nodes is therefore bounded roughly between `(log n)/2` and `n/2`.

  • The more unbalanced a tree is, the less likely that a random insertion would increase the tree height.

    For example, if we are inserting into this tree, then any insertion will increase the tree's height.

    But if we were inserting a randomly selected value into this one, then there is only a `2/8` chance that we will increase the height of the tree.

    For trees that are somewhere between those two extremes, the chances of a random insertion actually increasing the height of the tree will fall somewhere between those two probability extremes.

  • Insertions that don't increase the tree height make the tree more balanced.

So, the more unbalanced a tree is, the more likely that a random insertion will actually tend to increase the balance of the tree. This suggests (but does not prove) that randomly constructed binary search trees tend to be reasonably balanced.

It is possible to prove this claim, but the proof is beyond the scope of this class.
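Short of a proof, you can get a feel for this claim by experiment. The sketch below builds a tree from the keys 1 to n inserted in shuffled order and reports its height. Everything in it is an assumption made for illustration: the minimal `Node` struct, the edge-counting height convention, and the function name `heightOfRandomTree`. The exact height you see will depend on the seed and on your library's implementation of std::shuffle.

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical minimal node type, used only for this illustration.
struct Node {
    int value;
    Node* left = nullptr;
    Node* right = nullptr;
    explicit Node(int v) : value(v) {}
};

// Iterative BST insertion with no rebalancing; duplicates are ignored.
void insert(Node*& root, int key) {
    Node** link = &root;
    while (*link != nullptr) {
        if (key < (*link)->value)      link = &(*link)->left;
        else if ((*link)->value < key) link = &(*link)->right;
        else return;
    }
    *link = new Node(key);
}

// Height counted in edges: a single node has height 0, an empty tree -1.
int height(const Node* t) {
    if (t == nullptr) return -1;
    return 1 + std::max(height(t->left), height(t->right));
}

// Frees every node in the subtree rooted at t.
void destroy(Node* t) {
    if (t == nullptr) return;
    destroy(t->left);
    destroy(t->right);
    delete t;
}

// Inserts the keys 1..n in a pseudo-random order and returns the height.
int heightOfRandomTree(int n, unsigned seed) {
    std::vector<int> keys(n);
    std::iota(keys.begin(), keys.end(), 1);         // 1, 2, ..., n
    std::mt19937 gen(seed);
    std::shuffle(keys.begin(), keys.end(), gen);
    Node* root = nullptr;
    for (int k : keys) insert(root, k);
    int h = height(root);
    destroy(root);
    return h;
}
```

For n = 1023, the smallest possible height is 9 and the largest is 1022. Try a few seeds and see which end of that range the shuffled insertions land near.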

But, it's not safe to be too sanguine about the height of binary search trees. Although random construction tends to yield reasonable balance, in real applications we often do not get random values.

Question: Which of the following data would, if inserted into an initially empty binary search tree, yield a degenerate tree?

  • data that is in ascending order

  • data that is in descending order

  • both of the above

  • none of the above

3.8. Can We Avoid the Worst Case?

Data in either ascending or descending order results in a degenerate tree. (Try it if you are not convinced.)

It's very common to get data that is in sorted or almost sorted order, so degenerate behavior turns out to be more common than we might expect.

Also, the arguments made so far don't take deletions into account, which tend to unbalance trees.

Later, we'll look at variants of the binary search tree that use more elaborate insertion and deletion algorithms to maintain tree balance.

