
Active Learning Literature Survey

Burr Settles

Computer Sciences Technical Report 1648

University of Wisconsin–Madison

Updated on: January 26, 2010







The key idea behind active learning is that a machine learning algorithm can achieve greater accuracy with fewer training labels if it is allowed to choose the data from which it learns. Active learning is well-motivated in many modern machine learning problems, where unlabeled data may be abundant or easily obtained, but labels are difficult, time-consuming, or expensive to obtain.



This report provides a general introduction to active learning and a survey of the literature. This includes a discussion of the scenarios in which queries can be formulated, and an overview of the query strategy frameworks proposed in the literature to date. An analysis of the empirical and theoretical evidence for successful active learning, a summary of problem setting variants and practical issues, and a discussion of related topics in machine learning research are also presented.


1 Introduction

This report provides a general review of the literature on active learning. There have been a host of algorithms and applications for learning with queries over the years, and this document is an attempt to distill the core ideas, methods, and applications that have been considered by the machine learning community. To make this survey more useful in the long term, an online version will be updated and maintained indefinitely at:





When referring to this document, I recommend using the following citation:

Burr Settles. Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison. 2009.



An appropriate BIBTEX entry is:


@techreport{settles.tr09,
  Author = {Burr Settles},
  Institution = {University of Wisconsin--Madison},
  Number = {1648},
  Title = {Active Learning Literature Survey},
  Type = {Computer Sciences Technical Report},
  Year = {2009},
}


This document is written for a machine learning audience, and assumes the reader has a working knowledge of supervised learning algorithms (particularly statistical methods). For a good introduction to general machine learning, I recommend Mitchell (1997) or Duda et al. (2001). I have strived to make this review as comprehensive as possible, but it is by no means complete. My own research deals primarily with applications in natural language processing and bioinformatics, thus much of the empirical active learning work I am familiar with is in these areas. Active learning (like so many subfields in computer science) is rapidly growing and evolving in a myriad of directions, so it is difficult for one person to provide an exhaustive summary. I apologize for any oversights or inaccuracies, and encourage interested readers to submit additions, comments, and corrections to me at: [email protected].


1.1 What is Active Learning?

Active learning (sometimes called “query learning” or “optimal experimental design” in the statistics literature) is a subfield of machine learning and, more generally, artificial intelligence. The key hypothesis is that if the learning algorithm is allowed to choose the data from which it learns (to be “curious,” if you will), it will perform better with less training. Why is this a desirable property for learning algorithms to have? Consider that, for any supervised learning system to perform well, it must often be trained on hundreds (even thousands) of labeled instances. Sometimes these labels come at little or no cost, such as the “spam” flag you mark on unwanted email messages, or the five-star rating you might give to films on a social networking website. Learning systems use these flags and ratings to better filter your junk email and suggest movies you might enjoy. In these cases you provide such labels for free, but for many other more sophisticated supervised learning tasks, labeled instances are very difficult, time-consuming, or expensive to obtain. Here are a few examples:



• Speech recognition. Accurate labeling of speech utterances is extremely time consuming and requires trained linguists. Zhu (2005a) reports that annotation at the word level can take ten times longer than the actual audio (e.g., one minute of speech takes ten minutes to label), and annotating phonemes can take 400 times as long (e.g., nearly seven hours). The problem is compounded for rare languages or dialects.

• Information extraction. Good information extraction systems must be trained using labeled documents with detailed annotations. Users highlight entities or relations of interest in text, such as person and organization names, or whether a person works for a particular organization. Locating entities and relations can take a half-hour or more for even simple newswire stories (Settles et al., 2008a). Annotations for other knowledge domains may require additional expertise, e.g., annotating gene and disease mentions for biomedical information extraction usually requires PhD-level biologists.

• Classification and filtering. Learning to classify documents (e.g., articles or web pages) or any other kind of media (e.g., image, audio, and video files) requires that users label each document or media file with particular labels, like “relevant” or “not relevant.” Having to annotate thousands of these instances can be tedious and even redundant.


Active learning systems attempt to overcome the labeling bottleneck by asking queries in the form of unlabeled instances to be labeled by an oracle (e.g., a human annotator). In this way, the active learner aims to achieve high accuracy using as few labeled instances as possible, thereby minimizing the cost of obtaining labeled data. Active learning is well-motivated in many modern machine learning problems where data may be abundant but labels are scarce or expensive to obtain. Note that this kind of active learning is related in spirit to, though not to be confused with, the family of instructional techniques of the same name in the education literature (Bonwell and Eison, 1991).
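The query loop described above can be illustrated with a small Python sketch. This is a toy example under stated assumptions, not a method taken from this section: instances are numbers in (0, 1), the true concept is a single threshold, and the learner queries the unlabeled instance closest to its current boundary estimate, which is only one of many possible selection rules. The names make_oracle and active_learn_threshold are hypothetical and exist only for illustration.

```python
def make_oracle(true_threshold):
    """Stand-in for a human annotator: returns the true label of an instance."""
    return lambda x: 1 if x >= true_threshold else 0


def active_learn_threshold(pool, oracle, budget):
    """Learn a 1-D threshold by querying at most `budget` labels.

    Returns (estimated_threshold, labeled), where `labeled` maps each
    queried instance to the label the oracle gave it.
    """
    lo, hi = 0.0, 1.0   # assumption: all instances lie strictly inside (0, 1)
    labeled = {}
    unlabeled = list(pool)
    for _ in range(budget):
        estimate = (lo + hi) / 2
        # Only instances strictly between lo and hi can still change the estimate.
        candidates = [x for x in unlabeled if lo < x < hi]
        if not candidates:
            break
        # Query the instance the current model is least sure about:
        # the one closest to the estimated decision boundary.
        query = min(candidates, key=lambda x: abs(x - estimate))
        label = oracle(query)          # the expensive labeling step
        labeled[query] = label
        unlabeled.remove(query)
        if label == 1:
            hi = query                 # boundary is at or below this instance
        else:
            lo = query                 # boundary is above this instance
    return (lo + hi) / 2, labeled


pool = [i / 1000 for i in range(1, 1000)]      # 999 unlabeled instances
oracle = make_oracle(true_threshold=0.3705)    # hypothetical ground truth
estimate, labeled = active_learn_threshold(pool, oracle, budget=50)
errors = sum((x >= estimate) != bool(oracle(x)) for x in pool)
print(f"labels requested: {len(labeled)}, pool errors: {errors}")
```

On this evenly spaced pool the loop stops after at most 10 queries, because each answer rules out roughly half of the remaining candidates, whereas labeling every instance would cost 999 labels. Real tasks are rarely this clean, so the sketch should be read as an illustration of the query-and-update cycle rather than as a practical learner.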


