[Paper] Active Learning Literature Survey: Introduction and Section 1.1

Active Learning Literature Survey

Burr Settles

Computer Sciences Technical Report 1648

University of Wisconsin–Madison

Updated on: January 26, 2010


Abstract

The key idea behind active learning is that a machine learning algorithm can achieve greater accuracy with fewer training labels if it is allowed to choose the data from which it learns. Active learning is well-motivated in many modern machine learning problems, where unlabeled data may be abundant or easily obtained, but labels are difficult, time-consuming, or expensive to obtain.

This report provides a general introduction to active learning and a survey of the literature. This includes a discussion of the scenarios in which queries can be formulated, and an overview of the query strategy frameworks proposed in the literature to date. An analysis of the empirical and theoretical evidence for successful active learning, a summary of problem setting variants and practical issues, and a discussion of related topics in machine learning research are also presented.

1 Introduction

This report provides a general review of the literature on active learning. There have been a host of algorithms and applications for learning with queries over the years, and this document is an attempt to distill the core ideas, methods, and applications that have been considered by the machine learning community. To make this survey more useful in the long term, an online version will be updated and maintained indefinitely at:

http://active-learning.net/

When referring to this document, I recommend using the following citation:

Burr Settles. Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison. 2009.

An appropriate BibTeX entry is:

@techreport{settles.tr09,
  Author = {Burr Settles},
  Institution = {University of Wisconsin--Madison},
  Number = {1648},
  Title = {Active Learning Literature Survey},
  Type = {Computer Sciences Technical Report},
  Year = {2009},
}

This document is written for a machine learning audience, and assumes the reader has a working knowledge of supervised learning algorithms (particularly statistical methods). For a good introduction to general machine learning, I recommend Mitchell (1997) or Duda et al. (2001). I have strived to make this review as comprehensive as possible, but it is by no means complete. My own research deals primarily with applications in natural language processing and bioinformatics, thus much of the empirical active learning work I am familiar with is in these areas. Active learning (like so many subfields in computer science) is rapidly growing and evolving in a myriad of directions, so it is difficult for one person to provide an exhaustive summary. I apologize for any oversights or inaccuracies, and encourage interested readers to submit additions, comments, and corrections to me at: [email protected].

1.1 What is Active Learning?

Active learning (sometimes called “query learning” or “optimal experimental design” in the statistics literature) is a subfield of machine learning and, more generally, artificial intelligence. The key hypothesis is that if the learning algorithm is allowed to choose the data from which it learns (to be “curious,” if you will), it will perform better with less training. Why is this a desirable property for learning algorithms to have? Consider that, for any supervised learning system to perform well, it must often be trained on hundreds (even thousands) of labeled instances. Sometimes these labels come at little or no cost, such as the “spam” flag you mark on unwanted email messages, or the five-star rating you might give to films on a social networking website. Learning systems use these flags and ratings to better filter your junk email and suggest movies you might enjoy. In these cases you provide such labels for free, but for many other more sophisticated supervised learning tasks, labeled instances are very difficult, time-consuming, or expensive to obtain. Here are a few examples:

- Speech recognition. Accurate labeling of speech utterances is extremely time-consuming and requires trained linguists. Zhu (2005a) reports that annotation at the word level can take ten times longer than the actual audio (e.g., one minute of speech takes ten minutes to label), and annotating phonemes can take 400 times as long (e.g., nearly seven hours). The problem is compounded for rare languages or dialects.

- Information extraction. Good information extraction systems must be trained using labeled documents with detailed annotations. Users highlight entities or relations of interest in text, such as person and organization names, or whether a person works for a particular organization. Locating entities and relations can take a half-hour or more for even simple newswire stories (Settles et al., 2008a). Annotations for other knowledge domains may require additional expertise, e.g., annotating gene and disease mentions for biomedical information extraction usually requires PhD-level biologists.

- Classification and filtering. Learning to classify documents (e.g., articles or web pages) or any other kind of media (e.g., image, audio, and video files) requires that users label each document or media file with particular labels, like “relevant” or “not relevant.” Having to annotate thousands of these instances can be tedious and even redundant.

Active learning systems attempt to overcome the labeling bottleneck by asking queries in the form of unlabeled instances to be labeled by an oracle (e.g., a human annotator). In this way, the active learner aims to achieve high accuracy using as few labeled instances as possible, thereby minimizing the cost of obtaining labeled data. Active learning is well-motivated in many modern machine learning problems where data may be abundant but labels are scarce or expensive to obtain. Note that this kind of active learning is related in spirit to, though not to be confused with, the family of instructional techniques by the same name in the education literature (Bonwell and Eison, 1991).
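
The query loop described above (pick an unlabeled instance, ask the oracle for its label, update the model, repeat) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not an implementation from the report: the one-dimensional threshold task, the `oracle` function standing in for a human annotator, the hidden value `TRUE_THRESHOLD`, and the rule of querying the pool instance nearest the current decision boundary are all hypothetical choices made for the example.

```python
TRUE_THRESHOLD = 0.37  # hidden from the learner; hypothetical value for this example


def oracle(x):
    """Hypothetical stand-in for a human annotator: return the true label of x."""
    return int(x >= TRUE_THRESHOLD)


def fit_threshold(labeled):
    """Put the boundary midway between the largest negative and smallest positive seen."""
    lo = max((x for x, y in labeled if y == 0), default=0.0)
    hi = min((x for x, y in labeled if y == 1), default=1.0)
    return (lo + hi) / 2.0


def active_learn(pool, budget):
    """Spend `budget` oracle queries (budget <= len(pool)) and return the fitted boundary."""
    unlabeled = list(pool)
    labeled = []
    for _ in range(budget):
        boundary = fit_threshold(labeled)
        # Assumed selection rule: query the unlabeled instance the current
        # model is least sure about, i.e. the one closest to its boundary.
        query = min(unlabeled, key=lambda x: abs(x - boundary))
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))
    return fit_threshold(labeled), labeled


if __name__ == "__main__":
    pool = [i / 1000 for i in range(1001)]  # abundant unlabeled data
    boundary, labeled = active_learn(pool, budget=12)
    print(f"estimated boundary: {boundary:.4f} after {len(labeled)} labels")
```

In this toy setting the loop behaves much like a binary search over the pool, so a dozen oracle calls are enough to locate the boundary among 1,001 candidate instances. Real tasks need a real model and a selection rule suited to it.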


