Better histograms with Python

Histograms are frequently used to visualize the distribution of a data set or to compare several distributions. Python's matplotlib.pyplot provides convenient functions for plotting histograms, but the default plots it generates leave much to be desired in visual appeal and clarity.

The two code blocks below plot histograms of two normally distributed data sets. The first uses the default matplotlib.pyplot.hist settings; the second adds a few lines to improve the presentation. See the comments to find out what each line does.

## DEFAULT HISTOGRAMS
import matplotlib.pyplot as plt
import numpy as np

# set random seed so we get reproducible behavior
np.random.seed(1)

# generate two data series each containing 1,000 normally distributed values
d1 = np.random.normal(5.0, 2.0, 1000)
d2 = np.random.normal(6.0, 2.0, 1000)

# make the plot with default settings
plt.clf()
plt.hist(d1)
plt.hist(d2)
plt.savefig('default_hist.png', dpi=300)

This program saves its plot to default_hist.png.

And now for the slightly longer but much improved histogram code:

## BETTER HISTOGRAMS
import matplotlib.pyplot as plt
import numpy as np

# set random seed so we get reproducible behavior
np.random.seed(1)

# generate two data series each containing 1,000 normally distributed values
d1 = np.random.normal(5.0, 2.0, 1000)
d2 = np.random.normal(6.0, 2.0, 1000)

# make the plot
plt.clf()

# generate subplot object so we can modify axis lines easily
ax = plt.subplot(111)

# updated histogram commands
# use colors from Paul Tol's notes that colorblind readers can tell apart
# do not use "filled" histograms so all bin heights can be seen clearly
plt.hist(d1, histtype='step', color='#EE8026', label='Data Set 1', alpha=0.7)
plt.hist(d2, histtype='step', color='#BA8DB4', label='Data Set 2', alpha=0.7)

# new things
ax.spines['top'].set_visible(False)   # turn off top line
ax.spines['right'].set_visible(False) # turn off right line
plt.ylabel('Counts')                  # label the y axis
plt.xlabel('Values')                  # label the x axis
plt.xlim(-2, 14)                      # set x limits that span full data range
plt.ylim(-10, 300)                    # set y limits so that full range can be seen
plt.legend(loc='best', fancybox=True) # add a legend

plt.savefig('better_hist.png', dpi=300)

This program saves its plot to better_hist.png.

This second plot is easier to read: removing the “filled” histograms cuts visual clutter, and the axes are now labeled. The choice of histogram bins is an important consideration that I am not going to cover here. You can experiment yourself by adding, for example, bins='fd' to the plt.hist calls in the second program above and seeing how the depiction of the results changes with all else held constant.
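As one possible starting point for that experiment, here is a minimal sketch. It departs slightly from simply adding bins='fd' to each call: passing the string to each plt.hist call would compute the bin edges separately for each data set, so this version computes the Freedman-Diaconis edges once over the combined data with np.histogram_bin_edges (available in NumPy 1.15 and later) and hands the same edges to both calls. The shared-edges approach, the variable name edges, the Agg backend line, and the output file name fd_hist.png are my assumptions, not part of the programs above. The fixed y limits are left out because the bin heights change with the number of bins.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# same seed and data as the programs above
np.random.seed(1)
d1 = np.random.normal(5.0, 2.0, 1000)
d2 = np.random.normal(6.0, 2.0, 1000)

# compute Freedman-Diaconis bin edges once over both data sets so the
# two histograms share identical bins and can be compared bin by bin
edges = np.histogram_bin_edges(np.concatenate([d1, d2]), bins='fd')

plt.clf()
ax = plt.subplot(111)

# plt.hist returns the bin counts, the bin edges, and the drawn patches
counts1, _, _ = plt.hist(d1, bins=edges, histtype='step', color='#EE8026',
                         label='Data Set 1', alpha=0.7)
counts2, _, _ = plt.hist(d2, bins=edges, histtype='step', color='#BA8DB4',
                         label='Data Set 2', alpha=0.7)

ax.spines['top'].set_visible(False)   # turn off top line
ax.spines['right'].set_visible(False) # turn off right line
plt.ylabel('Counts')                  # label the y axis
plt.xlabel('Values')                  # label the x axis
plt.legend(loc='best', fancybox=True) # add a legend

plt.savefig('fd_hist.png', dpi=300)   # hypothetical output file name
```

Because the edges span the full range of both data sets, every value lands in a bin, and the step outlines line up on the x axis.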
