Abstract: Human learners appreciate that observations usually form hierarchies of regularities and sub-regularities.
For example, English verbs have irregular cases that must be memorized (e.g., go ↦ went) and regular cases that generalize well
(e.g., kiss ↦ kissed, miss ↦ missed). Likewise, deep neural networks have the capacity to memorize rare or irregular forms but
nonetheless generalize across instances that share common patterns or structures. We analyze how individual instances are treated
by a model via a consistency score: the expected accuracy on a particular held-out instance for a given architecture trained on datasets of a given size sampled from the data distribution. We obtain empirical estimates of this score
for individual instances in multiple data sets, and we show that the score identifies out-of-distribution and mislabeled examples
at one end of the continuum and regular examples at the other end. We explore two categories of proxies for the consistency score (C-score):
a pairwise-distance-based proxy and proxies based on training statistics. We conclude with two applications that use C-scores to help
understand the dynamics of representation learning and to filter out outliers, and with a discussion of other potential applications such as curriculum learning and active data collection.
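The definition above can be made concrete with a small sketch. This is not the paper's exact estimation procedure (which trains many deep networks); it is a minimal illustration where `train_and_eval` is a hypothetical, user-supplied routine that trains any classifier on a subset and predicts the label of one held-out instance:

```python
import numpy as np

def empirical_c_score(train_and_eval, X, y, idx,
                      n_subsets=10, subset_frac=0.5, seed=0):
    """Estimate the consistency score of example `idx`: the expected
    accuracy on that example when a model is trained on random subsets
    of a fixed size that exclude it.

    `train_and_eval(X_train, y_train, x)` is a hypothetical callable
    that trains a fresh model and returns the predicted label of `x`.
    """
    rng = np.random.default_rng(seed)
    others = np.setdiff1d(np.arange(len(X)), [idx])  # exclude the held-out example
    subset_size = int(subset_frac * len(X))
    correct = []
    for _ in range(n_subsets):
        subset = rng.choice(others, size=subset_size, replace=False)
        pred = train_and_eval(X[subset], y[subset], X[idx])
        correct.append(pred == y[idx])
    # Average over sampled training sets approximates the expectation.
    return float(np.mean(correct))
```

Averaging over more subsets (and over multiple training-set sizes, as in the paper) gives a more stable estimate; the single fixed `subset_frac` here is a simplification.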

Pre-computed C-scores: We provide pre-computed C-scores for download below. The files are in NumPy's data format, exported via numpy.savez.
For CIFAR-10 and CIFAR-100, each exported file contains two arrays, labels and scores. Both arrays are stored in the order of the training
examples as defined by the original datasets. The data loading tools provided by some deep learning
libraries might not follow the original example order, so we provide the labels array for an easy sanity check of the data ordering.
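A loading helper along these lines performs the suggested sanity check (the file path is hypothetical; substitute the name of the archive you downloaded):

```python
import numpy as np

def load_cscores(path, my_labels=None):
    """Load pre-computed C-scores exported via numpy.savez.

    The archive holds two arrays, `labels` and `scores`, stored in the
    original dataset's training-example order. If `my_labels` is given,
    verify that your own data loader yields examples in that same order.
    """
    data = np.load(path)
    labels, scores = data["labels"], data["scores"]
    if my_labels is not None and not np.array_equal(labels, np.asarray(my_labels)):
        raise ValueError("Label mismatch: your loader's example order "
                         "differs from the original dataset order.")
    return labels, scores
```

If the check fails, re-index the scores to match your loader's ordering rather than assuming index i corresponds to your i-th batch element.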

MNIST

We show top-ranking (top row) and bottom-ranking (bottom row) examples from MNIST by C-scores computed via multi-layer perceptrons.
Use the dropdown menu to select the class to show.

CIFAR-10

We show top-ranking (top row) and bottom-ranking (bottom row) examples from CIFAR-10 by C-scores computed via Inception models.
Use the dropdown menu to select the class to show. The pre-computed C-scores can be downloaded from here.

CIFAR-100

We show top-ranking (top row) and bottom-ranking (bottom row) examples from CIFAR-100 by C-scores computed via Inception models.
Use the dropdown menu to select the class to show. The pre-computed C-scores can be downloaded from here.

ImageNet

We show examples from ImageNet by C-scores computed via ResNet50 models. For each class, the top 2 rows
show the top ranking examples, and the bottom 2 rows show the bottom ranking examples. In the middle, a
histogram of the C-scores of all the training examples in this class is shown, in both log scale and linear
scale.

Because ImageNet contains 1000 classes, we select a subset to visualize.
The first subset contains a few representative classes, indicated
by the ★ in the figure here. yellow lady's slipper is a typical regular class:
most of the instances are highly regular, and even the
bottom-ranking examples show some color consistency. oscilloscope, green snake, Norwich terrier, and weasel,
ordered by the average C-score in each class, represent most of the classes in the ImageNet dataset: they contain both highly regular
top-ranking examples and highly irregular bottom-ranking examples. Finally, projectile is a typical irregular class, whose
instances are extremely diverse.
The second subset contains 100 randomly sampled classes.

The pre-computed C-scores can be downloaded from here.
Since there is no well-defined example ordering, we order the exported scores arbitrarily and include the filename of each example to help identify the example-score mapping.
More specifically, the exported file for ImageNet contains three arrays: labels, scores, and filenames. Again, we include labels for easy sanity checking.
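Since the ImageNet export is keyed by filename rather than by position, a small helper can turn it into a lookup table (the archive path is hypothetical):

```python
import numpy as np

def cscore_by_filename(path):
    """Build a filename -> C-score dict from the ImageNet export, whose
    `scores` and `filenames` arrays share an arbitrary but consistent order."""
    # allow_pickle=True in case the filenames were saved as an object array.
    data = np.load(path, allow_pickle=True)
    return {str(fn): float(s)
            for fn, s in zip(data["filenames"], data["scores"])}
```

With this mapping, each example seen by your own data loader can be joined to its score via its filename, regardless of iteration order.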