Characterizing Structural Regularities of Labeled Data in Overparameterized Models

Abstract: Human learners appreciate that observations usually form hierarchies of regularities and sub-regularities. For example, English verbs have irregular cases that must be memorized (e.g., go ↦ went) and regular cases that generalize well (e.g., kiss ↦ kissed, miss ↦ missed). Likewise, deep neural networks have the capacity to memorize rare or irregular forms but nonetheless generalize across instances that share common patterns or structures. We analyze how individual instances are treated by a model via a consistency score (C-score). The score is the expected accuracy of a particular architecture on a held-out instance when trained on a training set of a given size sampled from the data distribution. We obtain empirical estimates of this score for individual instances in multiple data sets, and we show that the score identifies out-of-distribution and mislabeled examples at one end of the continuum and regular examples at the other end. We explore two categories of proxies to the consistency score: pairwise-distance-based proxies and training-statistics-based proxies. We conclude with two applications that use C-scores to help understand the dynamics of representation learning and to filter out outliers, and we discuss other potential applications such as curriculum learning and active data collection.
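To make the definition of the score concrete, here is a minimal sketch of a holdout estimate. It is not the procedure used in the paper: a 1-nearest-neighbor classifier stands in for "a particular architecture", the data are synthetic, and the names estimate_cscores, subset_size, and num_runs are assumptions made for illustration. For each example, the estimate averages correctness over the runs in which that example was left out of the sampled training subset.

```python
import numpy as np


def nearest_neighbor_predict(train_x, train_y, query_x):
    """Toy stand-in for a trained model: 1-nearest-neighbor prediction."""
    # Squared Euclidean distance from every query point to every training point.
    dists = ((query_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=-1)
    return train_y[dists.argmin(axis=1)]


def estimate_cscores(x, y, subset_size, num_runs, seed=0):
    """Average held-out accuracy per example over random training subsets."""
    rng = np.random.default_rng(seed)
    n = len(x)
    correct = np.zeros(n)
    heldout_count = np.zeros(n)
    for _ in range(num_runs):
        # Sample a training subset of fixed size; everything else is held out.
        train_idx = rng.choice(n, size=subset_size, replace=False)
        heldout = np.ones(n, dtype=bool)
        heldout[train_idx] = False
        preds = nearest_neighbor_predict(x[train_idx], y[train_idx], x[heldout])
        correct[heldout] += preds == y[heldout]
        heldout_count[heldout] += 1
    # Examples that were never held out get NaN instead of a made-up score.
    return np.where(heldout_count > 0,
                    correct / np.maximum(heldout_count, 1),
                    np.nan)


# Synthetic data: two well-separated clusters, one deliberately mislabeled point.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 0.5, size=(50, 2)),
                    rng.normal(5.0, 0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
y[0] = 1  # example 0 sits in the first cluster but carries the wrong label

scores = estimate_cscores(x, y, subset_size=50, num_runs=200)
print("mislabeled example:", scores[0])
print("mean over the rest:", np.nanmean(scores[1:]))
```

In this toy setting the mislabeled point ends up at the low end of the score range and the regular points near the high end, which mirrors the continuum described in the abstract.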

Pre-computed C-scores: We provide pre-computed C-scores for download below. The files are in NumPy's data format, exported via numpy.savez. For CIFAR-10 and CIFAR-100, the exported file contains two arrays, labels and scores. Both arrays are stored in the order of the training examples as defined by the original datasets. The data loading tools provided by some deep learning libraries might not follow the original example order, so we provide the labels array as an easy sanity check of the data ordering. For ImageNet, please refer to the ImageNet section below.
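As a rough sketch of how such a file can be read and checked, the example below writes a tiny stand-in archive with the same two array names and then loads it back. The file name, the array values, and loader_labels are all made up for illustration; a real download and the labels from your own data loader would take their place.

```python
import os
import tempfile

import numpy as np

# Build a small stand-in file holding the two arrays described above.
# The values are invented; they are not real C-scores.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo-cscores-orig-order.npz")
np.savez(demo_path,
         labels=np.array([3, 8, 8, 0]),
         scores=np.array([0.91, 0.12, 0.77, 0.98]))

# np.load on a .npz file returns a dict-like archive keyed by array name.
with np.load(demo_path) as archive:
    labels = archive["labels"]
    scores = archive["scores"]

# Labels as produced by whatever data loader is in use (hypothetical here).
loader_labels = np.array([3, 8, 8, 0])

# If the loader follows the original example order, the labels agree
# element-wise, and scores[i] belongs to the loader's i-th example.
order_matches = (labels.shape == loader_labels.shape
                 and np.array_equal(labels, loader_labels))
print("label order matches:", order_matches)
```

Matching labels is a necessary check rather than a proof of identical ordering, since two examples of the same class could be swapped without changing the labels array.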

For TFDS users: because TFDS saves the example id when preparing the dataset (at least for CIFAR), it is possible to remap the exported C-scores to TFDS ordering with the following code snippet:

import numpy as np
import tensorflow_datasets as tfds

# load the full cifar10 dataset into memory to get the example ids
data_name = 'cifar10:3.0.2'
raw_data, info = tfds.load(name=data_name, batch_size=-1, with_info=True,
                           as_dataset_kwargs={'shuffle_files': False})
raw_data = tfds.as_numpy(raw_data)
trainset_np, testset_np = raw_data['train'], raw_data['test']

# load c-scores in original data order
# (np.load reads the .npz archive written by numpy.savez)
cscore_fn = '/path/to/cifar10-cscores-orig-order.npz'
cscore_arrays = np.load(cscore_fn)

# get example index
def _id_to_idx(str_id):
  return int(str_id.split(b'_')[1])
vec_id_to_idx = np.vectorize(_id_to_idx)
trainset_orig_idx = vec_id_to_idx(trainset_np['id'])

# sanity check with labels to make sure that data order is correct
assert np.all(trainset_np['label'] == cscore_arrays['labels'][trainset_orig_idx])

# now this is c-scores in TFDS order
ordered_cscores = cscore_arrays['scores'][trainset_orig_idx]
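Once the scores are aligned with the data order, one possible use is the outlier filtering mentioned in the abstract. The sketch below is only an illustration under stated assumptions: the arrays are small made-up stand-ins for ordered_cscores and the matching labels, and the 0.5 threshold is an arbitrary choice, not a recommended value.

```python
import numpy as np

# Hypothetical stand-ins for the aligned arrays; the values are invented.
ordered_cscores = np.array([0.95, 0.10, 0.80, 0.40, 0.99, 0.05])
labels = np.array([0, 0, 1, 1, 2, 2])

# Assumed cutoff, chosen only for illustration.
threshold = 0.5

# Keep examples whose score is at or above the cutoff.
keep_mask = ordered_cscores >= threshold
kept_indices = np.flatnonzero(keep_mask)
kept_labels = labels[kept_indices]

# Indices sorted from lowest to highest score, e.g. for manual inspection
# of likely mislabeled or atypical examples.
lowest_first = np.argsort(ordered_cscores)

print("kept:", kept_indices)
print("lowest-scoring first:", lowest_first)
```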

Ziheng Jiang♮, Chiyuan Zhang♮, Kunal Talwar, Michael C. Mozer
♮ Equal contribution

MNIST

CIFAR-10

CIFAR-100

ImageNet

A few representative classes

100 random classes
