Tales of Huffman: An Exercise in Dealing with Messy Data

  • Robert H. Carver


Statistics education reformers have for years called for the use of real data in teaching introductory statistics (Ballman, 1997; Garfield et al., 2004; Hogg, 1991). Instructors now have ready access to cases, textbook problems and other exercises with accompanying well-documented sets of real or realistic data. On-line portals and data libraries provide a huge array of real data sets keyed variously to substantive topics and statistical techniques suitable for introductory students.

The vast majority of these real datasets have already been cleaned up by their preparers. As enriching as these resources are, relatively few of them offer students first-hand experience with the essential messiness of “real” real data. There is a good case to be made that data cleaning and preparation belong in introductory courses (Burger & Leopold, 2001). Certainly, problems of missing, dirty, and incomplete data are important topics within the field (Hoyle, 1971; Rubin, 1976; Wagner, 2002).

Using field data from the Wright Brothers’ 1904 experiments, this case leads introductory or intermediate students through a process of data preparation, illustrating five common steps in data preparation and cleaning: standardizing the format of data records, deciding how to treat ambiguously recorded data, converting measurements to a single standard unit, detecting and resolving issues with outliers, and imputing missing data.
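
For readers who want a concrete picture of these five steps, the following is a minimal Python sketch, not the procedure used in the case itself. The records are invented for illustration and are not the Wright Brothers' data. The function names, the handling of the word "about", the median-absolute-deviation outlier rule with its cutoff `k`, and median imputation are all assumptions chosen for brevity.

```python
from statistics import median

FEET_PER_METER = 3.28084

# Hypothetical records, invented for illustration: (distance as written, unit)
RAW = [("120", "ft"), ("about 40", "m"), ("", "ft"),
       ("150", "ft"), ("4000", "ft"), ("130", "ft")]


def parse_distance(text):
    """Steps 1-2: standardize the record format and resolve ambiguous
    entries. Here "about 40" is read as 40 (an assumed policy)."""
    cleaned = text.lower().replace("about", "").strip()
    return float(cleaned) if cleaned else None


def to_feet(value, unit):
    """Step 3: convert every measurement to a single unit (feet)."""
    if value is None:
        return None
    return value * FEET_PER_METER if unit == "m" else value


def outlier_indexes(values, k=5.0):
    """Step 4: flag values more than k median absolute deviations
    from the median (an assumed rule; many others are possible)."""
    known = [v for v in values if v is not None]
    center = median(known)
    mad = median([abs(v - center) for v in known])
    return [i for i, v in enumerate(values)
            if v is not None and abs(v - center) > k * mad]


def impute_median(values):
    """Step 5: replace missing values with the median of the known ones."""
    fill = median([v for v in values if v is not None])
    return [fill if v is None else v for v in values]


def clean(raw):
    """Run the five steps in order; flagged outliers are treated as missing."""
    feet = [to_feet(parse_distance(text), unit) for text, unit in raw]
    flagged = outlier_indexes(feet)
    for i in flagged:
        feet[i] = None
    return impute_median(feet), flagged


cleaned, flagged = clean(RAW)
print(flagged)
print([round(v, 1) for v in cleaned])
```

Treating a flagged outlier as missing and then imputing it is only one possible resolution. In a classroom setting the more instructive choice is often to return to the original record and ask whether the value is a transcription error or a genuine observation.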