[Paper] Active Learning Literature Survey: Introduction and Section 1.1

Active Learning Literature Survey

Burr Settles

Computer Sciences Technical Report 1648

University of Wisconsin–Madison

Updated on: January 26, 2010


Abstract

The key idea behind active learning is that a machine learning algorithm can achieve greater accuracy with fewer training labels if it is allowed to choose the data from which it learns. Active learning is well-motivated in many modern machine learning problems, where unlabeled data may be abundant or easily obtained, but labels are difficult, time-consuming, or expensive to obtain.

This report provides a general introduction to active learning and a survey of the literature. This includes a discussion of the scenarios in which queries can be formulated, and an overview of the query strategy frameworks proposed in the literature to date. An analysis of the empirical and theoretical evidence for successful active learning, a summary of problem setting variants and practical issues, and a discussion of related topics in machine learning research are also presented.

1 Introduction

This report provides a general review of the literature on active learning. There have been a host of algorithms and applications for learning with queries over the years, and this document is an attempt to distill the core ideas, methods, and applications that have been considered by the machine learning community. To make this survey more useful in the long term, an online version will be updated and maintained indefinitely at:

http://active-learning.net/

When referring to this document, I recommend using the following citation:

Burr Settles. Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison. 2009.

An appropriate BIBTEX entry is:

@techreport{settles.tr09,
  Author = {Burr Settles},
  Institution = {University of Wisconsin--Madison},
  Number = {1648},
  Title = {Active Learning Literature Survey},
  Type = {Computer Sciences Technical Report},
  Year = {2009},
}

This document is written for a machine learning audience, and assumes the reader has a working knowledge of supervised learning algorithms (particularly statistical methods). For a good introduction to general machine learning, I recommend Mitchell (1997) or Duda et al. (2001). I have strived to make this review as comprehensive as possible, but it is by no means complete. My own research deals primarily with applications in natural language processing and bioinformatics, thus much of the empirical active learning work I am familiar with is in these areas. Active learning (like so many subfields in computer science) is rapidly growing and evolving in a myriad of directions, so it is difficult for one person to provide an exhaustive summary. I apologize for any oversights or inaccuracies, and encourage interested readers to submit additions, comments, and corrections to me at: [email protected].

1.1 What is Active Learning?

Active learning (sometimes called “query learning” or “optimal experimental design” in the statistics literature) is a subfield of machine learning and, more generally, artificial intelligence. The key hypothesis is that if the learning algorithm is allowed to choose the data from which it learns—to be “curious,” if you will—it will perform better with less training. Why is this a desirable property for learning algorithms to have? Consider that, for any supervised learning system to perform well, it must often be trained on hundreds (even thousands) of labeled instances. Sometimes these labels come at little or no cost, such as the “spam” flag you mark on unwanted email messages, or the five-star rating you might give to films on a social networking website. Learning systems use these flags and ratings to better filter your junk email and suggest movies you might enjoy. In these cases you provide such labels for free, but for many other more sophisticated supervised learning tasks, labeled instances are very difficult, time-consuming, or expensive to obtain. Here are a few examples:

• Speech recognition. Accurate labeling of speech utterances is extremely time-consuming and requires trained linguists. Zhu (2005a) reports that annotation at the word level can take ten times longer than the actual audio (e.g., one minute of speech takes ten minutes to label), and annotating phonemes can take 400 times as long (e.g., nearly seven hours). The problem is compounded for rare languages or dialects.

• Information extraction. Good information extraction systems must be trained using labeled documents with detailed annotations. Users highlight entities or relations of interest in text, such as person and organization names, or whether a person works for a particular organization. Locating entities and relations can take a half-hour or more for even simple newswire stories (Settles et al., 2008a). Annotations for other knowledge domains may require additional expertise, e.g., annotating gene and disease mentions for biomedical information extraction usually requires PhD-level biologists.

• Classification and filtering. Learning to classify documents (e.g., articles or web pages) or any other kind of media (e.g., image, audio, and video files) requires that users label each document or media file with particular labels, like “relevant” or “not relevant.” Having to annotate thousands of these instances can be tedious and even redundant.

Active learning systems attempt to overcome the labeling bottleneck by asking queries in the form of unlabeled instances to be labeled by an oracle (e.g., a human annotator). In this way, the active learner aims to achieve high accuracy using as few labeled instances as possible, thereby minimizing the cost of obtaining labeled data. Active learning is well-motivated in many modern machine learning problems where data may be abundant but labels are scarce or expensive to obtain. Note that this kind of active learning is related in spirit to, though not to be confused with, the family of instructional techniques by the same name in the education literature (Bonwell and Eison, 1991).
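The query loop described here can be sketched on a toy problem. The following is a minimal illustration, not a method taken from the report: it assumes a one-dimensional pool of instances, a noise-free oracle, and a learner whose only job is to find the point where labels flip from 0 to 1. The names `oracle` and `active_learn_boundary`, the threshold value 0.62, and the binary-search selection rule are all hypothetical choices made for this example.

```python
import random


def oracle(x, true_threshold=0.62):
    """Hypothetical stand-in for a human annotator: returns the true label of x.

    The threshold 0.62 is an arbitrary value chosen for this illustration.
    """
    return 1 if x >= true_threshold else 0


def active_learn_boundary(pool, ask):
    """Find where labels flip from 0 to 1 by querying only selected instances.

    pool: unlabeled instances (floats). ask: callable that labels one instance.
    Returns (boundary, n_queries); every instance at sorted position
    >= boundary is predicted positive.
    """
    points = sorted(pool)
    lo, hi = 0, len(points)  # points[:lo] must be 0, points[hi:] must be 1
    n_queries = 0
    while lo < hi:
        mid = (lo + hi) // 2      # middle of the still-unresolved region
        label = ask(points[mid])  # one query to the oracle
        n_queries += 1
        if label == 1:
            hi = mid              # everything from mid upward must be positive
        else:
            lo = mid + 1          # everything up to mid must be negative
    return lo, n_queries


rng = random.Random(0)  # fixed seed so the run is repeatable
pool = [rng.random() for _ in range(1000)]
boundary, n_queries = active_learn_boundary(pool, oracle)
predictions = [1 if i >= boundary else 0 for i in range(len(pool))]
truth = [oracle(x) for x in sorted(pool)]
print("labels requested:", n_queries, "of", len(pool))
print("all pool instances classified correctly:", predictions == truth)
```

Under these idealized assumptions the learner halves the unresolved region with every query, so a pool of 1,000 instances needs at most 10 labels, whereas labeling the pool exhaustively would need 1,000. Real annotators are noisy and real models are rarely this simple, so the sketch only conveys why choosing the queries can reduce labeling cost.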
