「論文」主動學習文獻綜述(中英文對照)-1.2節、1.3節、第二章

1.2 Active Learning Examples

1.2主動學習示例

「論文」主動學習文獻綜述(中英文對照)-1.2節、1.3節、第二章


There are several scenarios in which active learners may pose queries, and there are also several different query strategies that have been used to decide which instances are most informative. In this section, I present two illustrative examples in the pool-based active learning setting (in which queries are selected from a large pool of unlabeled instances u) using an uncertainty sampling query strategy (which selects the instance in the pool about which the model is least certain how to label). Sections 2 and 3 describe all the active learning scenarios and query strategy frameworks in more detail.

有幾種情況可供主動學習者提出查詢,並且還有幾種不同的查詢策略可用於確定哪些實例所含信息更豐富。 在本節中,我將使用不確定性抽樣查詢策略(在池中選擇最不確定如何標記的實例),基於池的主動學習設置(其中從大量未標記實例u中選擇查詢)提供兩個說明性示例。在第2節和第3節中將更詳細地描述所有主動學習場景和查詢策略框架。

「論文」主動學習文獻綜述(中英文對照)-1.2節、1.3節、第二章

Figure 1 illustrates the pool-based active learning cycle. A learner may begin with a small number of instances in the labeled training set L, request labels for one or more carefully selected instances, learn from the query results, and then leverage its new knowledge to choose which instances to query next. Once a query has been made, there are usually no additional assumptions on the part of the learning algorithm. The new labeled instance is simply added to the labeled set L, and the learner proceeds from there in a standard supervised way. There are a few exceptions to this, such as when the learner is allowed to make alternative types of queries (Section 6.4), or when active learning is combined with semisupervised learning (Section 7.1).

圖1說明了基於池的主動學習流程。 學習者可以從有標籤的訓練集L中的少量實例開始,為一個或多個精心選擇的實例請求標籤,從查詢結果中學習,然後利用其新知識來選擇接下來要查詢的實例。 一旦進行了查詢,學習算法通常不存在其他假設。 新標記的實例被簡單地添加到標記集合L中,學習者以標準監督學習方法對標記集合L進行學習。 此外,有一些特例,例如允許學習者進行其他類型的查詢(第6.4節),或主動學習與半監督學習相結合(第7.1節)。

Figure 2 shows the potential of active learning in a way that is easy to visualize. This is a toy data set generated from two Gaussians centered at (-2,0) and (2,0) with standard deviation σ = 1, each representing a different class distribution. Figure 2(a) shows the resulting data set after 400 instances are sampled (200 from each class); instances are represented as points in a 2D feature space. In a real-world setting these instances may be available, but their labels usually are not. Figure 2(b) illustrates the traditional supervised learning approach after randomly selecting 30 instances for labeling, drawn i.i.d. from the unlabeled pool U. The line shows the linear decision boundary of a logistic regression model (i.e., where the posterior equals 0.5) trained using these 30 points. Notice that most of the labeled instances in this training set are far from zero on the horizontal axis, which is where the Bayes optimal decision boundary should probably be. As a result, this classifier only achieves 70% accuracy on the remaining unlabeled points. Figure 2(c), however, tells a different story. The active learner uses uncertainty sampling to focus on instances closest to its decision boundary, assuming it can adequately explain those in other parts of the input space characterized by U. As a result, it avoids requesting labels for redundant or irrelevant instances, and achieves 90% accuracy with a mere 30 labeled instances.

圖2以可視化的方式顯示了主動學習的潛力。這是一個玩具數據集,由兩個分別以(-2,0)和(2,0)為中心,標準差σ= 1的高斯分佈生成,每個代表不同的類別分佈。圖2(a)顯示了400個實例被採樣後的結果數據集(每個類200個);實例表示為2D要素空間中的點。在實際環境中,這些實例可能是可獲得的,但它們的標籤通常不可獲取。圖2(b)說明了隨機選擇30個標記實例後的傳統監督學習方法效果.圖中藍線表示使用這30個點訓練的邏輯迴歸模型(即,後驗等於0.5)的線性決策邊界。請注意,此訓練集中的大多數標記實例在水平軸上遠離零,這是貝葉斯最優決策邊界應該可能的位置。該分類器僅在剩餘的未標記點上實現70%的準確度。然而,圖2(c)講述了一個不同的故事。主動學習者使用不確定性採樣來關注最接近其決策邊界的實例,並假設它可以充分解釋以

u為特徵的輸入空間的其他部分中的那些實例。因此,它避免了為冗餘或不相關實例請求標籤,並實現在僅有30個標記實例的情況下準確率為90%。

「論文」主動學習文獻綜述(中英文對照)-1.2節、1.3節、第二章

Now let us consider active learning for a real-world learning task: text classification. In this example, a learner must distinguish between baseball and hockey documents from the 20 Newsgroups corpus (Lang, 1995), which consists of 2,000 Usenet documents evenly divided between the two classes. Active learning algorithms are generally evaluated by constructing learning curves, which plot the evaluation measure of interest (e.g., accuracy) as a function of the number of new instance queries that are labeled and added to L. Figure 3 presents learning curves for the first 100 instances labeled using uncertainty sampling and random sampling. The reported results are for a logistic regression model averaged over ten folds using cross-validation. After labeling 30 new instances, the accuracy of uncertainty sampling is 81%, while the random baseline is only 73%. As can be seen, the active learning curve dominates the baseline curve for all of the points shown in this figure. We can conclude that an active learning algorithm is superior to some other approach (e.g., a random baseline like traditional passive supervised learning) if it dominates the other for most or all of the points along their learning curves.

現在讓我們考慮用主動學習來解決現實世界中的學習任務:文本分類。在這個例子中,學習者必須從20個新聞組語料庫(Lang,1995)中區分棒球和曲棍球文件,其中包括在兩個類之間平均劃分的2,000個Usenet文檔。主動學習算法通常通過構建學習曲線來評估,該學習曲線將感興趣的評估度量(例如,準確度)繪製為標記並添加到

L中新實例查詢數量的函數。圖3呈現了前100個用不確定性抽樣和隨機抽樣標記實例的學習曲線。模型採用邏輯迴歸模型,將樣本分為10份,結果是採用交叉驗證的方法得到的。標記30個新實例後,不確定性抽樣的準確率為81%,而隨機基線僅為73%。可以看出,主動學習曲線遠高於該圖中所示的基線曲線。我們可以得出結論,如果主動學習算法的學習曲線在大部分時間內是高於其他算法的,那麼,我們可以說主動學習算法優於其他一些方法(例如,像傳統的被動監督學習一樣的隨機基線)。

1.3 Further Reading

This is the first large-scale survey of the active learning literature. One way to view this document is as a heavily annotated bibliography of the field, and the citations within a particular section or subsection of interest serve as good starting points for further investigation. There have also been a few PhD theses over the years dedicated to active learning, with rich related work sections. In fact, this report originated as a chapter in my PhD thesis (Settles, 2008), which focuses on active learning with structured instances and potentially varied annotation costs. Also of interest may be the related work chapters of Tong (2001), which considers active learning for support vector machines and Bayesian networks, Monteleoni (2006), which considers more theoretical aspects of active learning for classification, and Olsson (2008), which focuses on active learning for named entity recognition (a type of information extraction). Fredrick Olsson has also written a survey of active learning specifically within the scope of the natural language processing (NLP) literature (Olsson, 2009).

1.3進一步閱讀

這是對主動學習文獻的第一次大規模綜述。可將該文檔作為該領域的重要註釋書目,並且特定部分或感興趣的子部分的引用可作為進一步研究的良好起點。多年來,還有一些博士論文致力於主動學習,並有豐富的研究成果。事實上,這份報告起源於我的博士論文(Settles,2008)中的一章,該論文側重於結構化實例的主動學習和可能不同的註釋成本。同樣感興趣的可能是Tong(2001)的相關工作,其中考慮了支持向量機和貝葉斯網絡的主動學習,Monteleoni(2006),其考慮了主動學習用於分類的理論研究,以及Olsson(2008),其專注於主動學習識別命名實體(一種信息提取)。 Fredrick Olsson還撰寫了一份關於主動學習的調查,特別是關於自然語言處理(NLP)的相關文獻(Olsson,2009)。

2 Scenarios

There are several different problem scenarios in which the learner may be able to ask queries. The three main settings that have been considered in the literature are (i) membership query synthesis, (ii) stream-based selective sampling, and (iii) pool-based sampling. Figure 4 illustrates the differences among these three scenarios, which are explained in more detail in this section. Note that all these scenarios (and the lion’s share of active learning work to date) assume that queries take the form of unlabeled instances to be labeled by the oracle. Sections 6 and 5 discuss some alternatives to this setting.

2 場景

有幾種關於主動學習的不同場景。 文獻中考慮的三個主要設置是(i)綜合關係詢問,(ii)基於流的選擇性採樣,以及(iii)基於池的採樣。 圖4說明了這三種情況之間的差異,本節將對此進行更詳細的說明。 請注意,所有這些場景(以及迄今為止主動學習工作的大部分)都假定查詢選擇出未標記的實例形式,然後交給oracle標記。 第6節和第5節討論了這種設置的一些替代方案。

「論文」主動學習文獻綜述(中英文對照)-1.2節、1.3節、第二章

2.1 Membership Query Synthesis

One of the first active learning scenarios to be investigated is learning with membership queries (Angluin, 1988). In this setting, the learner may request labels for any unlabeled instance in the input space, including (and typically assuming) queries that the learner generates de novo, rather than those sampled from some underlying natural distribution. Efficient query synthesis is often tractable and efficient for finite problem domains (Angluin, 2001). The idea of synthesizing queries has also been extended to regression learning tasks, such as learning to predict the absolute coordinates of a robot hand given the joint angles of its mechanical arm as inputs (Cohn et al., 1996).

2.1綜合關係詢問

要研究的第一個主動學習場景是關係查詢(Angluin,1988)。 在此設置中,學習者可以為輸入空間中的任何未標記實例請求標籤,包括(並且通常假設)學習者從頭生成的查詢,而不是從某些基礎自然分佈中採樣的查詢。 對於有限問題域,高效的綜合查詢通常易於處理和有效(Angluin,2001)。 綜合查詢的想法也已擴展到迴歸學習任務,例如通過使用機器人手的關節角度作為輸入量來學習預測機器人手的絕對座標(Cohn等,1996)。

Query synthesis is reasonable for many problems, but labeling such arbitrary instances can be awkward if the oracle is a human annotator. For example, Lang and Baum (1992) employed membership query learning with human oracles to train a neural network to classify handwritten characters. They encountered an unexpected problem: many of the query images generated by the learner contained no recognizable symbols, only artificial hybrid characters that had no natural semantic meaning. Similarly, one could imagine that membership queries for natural language processing tasks might create streams of text or speech that amount to gibberish. The stream-based and pool-based scenarios (described in the next sections) have been proposed to address these limitations.

綜合查詢對於許多問題都能合理的應用,但是如果oracle是人類註釋器,則標記任意實例可能是不方便的。 例如,Lang和Baum(1992)利用基於人類註釋器的關係查詢方法訓練神經網絡來對手寫字符進行分類。 他們遇到了一個意想不到的問題:學習者生成的許多查詢圖像不包含可識別的符號,只包含沒有自然語義含義的人工混合字符。 同樣,人們可以想象,對自然語言處理的關係查詢可能會產生相當於胡言亂語的文本或語音流。基於流和基於池的場景(在下面的部分中描述)已經被提出,用來解決這些不足。

However, King et al. (2004, 2009) describe an innovative and promising real world application of the membership query scenario. They employ a “robot scientist” which can execute a series of autonomous biological experiments to discover metabolic pathways in the yeast Saccharomyces cerevisiae. Here, an instance is a mixture of chemical solutions that constitute a growth medium, as well as a particular yeast mutant. A label, then, is whether or not the mutant thrived in the growth medium. All experiments are autonomously synthesized using an active learning approach based on inductive logic programming, and physically performed using a laboratory robot. This active method results in a three-fold decrease in the cost of experimental materials compared to naively running the least expensive experiment, and a 100-fold decrease in cost compared to randomly generated experiments. In domains where labels come not from human annotators, but from experiments such as this, query synthesis may be a promising direction for automated scientific discovery.

但是,King等人(2004年,2009年)描述了關係查詢場景的創新和可能的實際應用。 他們僱用了一名可以執行一系列自主生物實驗的“機器人科學家”,用以觀察酵母釀酒中酵母的代謝途徑。 這裡,實例是構成生長培養基的化學溶液,以及特定的酵母突變體。 因此,標籤是突變體是否在生長培養基中繁殖。 使用基於歸納邏輯編程的主動學習方法自主合成所有實驗,並使用實驗室機器人物理地執行。 與實施最便宜的實驗相比,這種主動方法使實驗材料的成本降低了三倍,與隨機生成的實驗相比,成本降低了100倍。在標籤不是來自人類註釋器的領域,而是來自諸如此類的實驗,綜合查詢可能是自動化科學一個有前景的研究方向。

2.2 Stream-Based Selective Sampling

An alternative to synthesizing queries is selective sampling (Cohn et al., 1990, 1994). The key assumption is that obtaining an unlabeled instance is free (or inexpensive), so it can first be sampled from the actual distribution, and then the learner can decide whether or not to request its label. This approach is sometimes called stream-based or sequential active learning, as each unlabeled instance is typically drawn one at a time from the data source, and the learner must decide whether to query or discard it. If the input distribution is uniform, selective sampling may well behave like membership query learning. However, if the distribution is non-uniform and (more importantly) unknown, we are guaranteed that queries will still be sensible, since they come from a real underlying distribution.

2.2基於流的選擇性採樣

綜合查詢的替代方案是選擇性採樣(Cohn等,1990,1994)。 關鍵假設是獲得未標記的實例是免費的(或廉價的),因此可以首先從實際分佈中對其進行採樣,然後學習者可以決定是否請求其標籤。 這種方法有時被稱為基於流或順序的主動學習,因為每個未標記的實例通常一次從數據源中抽取一個,並且學習者必須決定是查詢還是丟棄它。 如果輸入分佈是均勻的,則選擇性採樣可能表現得像關係查詢。 但是,如果分佈不均勻且(更重要的是)未知,我們保證查詢仍然是合理的,因為它們來自真實的底層分佈。

The decision whether or not to query an instance can be framed several ways. One approach is to evaluate samples using some “informativeness measure” or “query strategy” (see Section 3 for examples) and make a biased random decision, such that more informative instances are more likely to be queried (Dagan and Engelson, 1995). Another approach is to compute an explicit region of uncertainty (Cohn et al., 1994), i.e., the part of the instance space that is still ambiguous to the learner, and only query instances that fall within it. A naive way of doing this is to set a minimum threshold on an informativeness measure which defines the region. Instances whose evaluation is above this threshold are then queried. Another more principled approach is to define the region that is still unknown to the overall model class, i.e., to the set of hypotheses consistent with the current labeled training set called the version space (Mitchell, 1982). In other words, if any two models of the same model class (but different parameter settings) agree on all the labeled data, but disagree on some unlabeled instance, then that instance lies within the region of uncertainty. Calculating this region completely and explicitly is computationally expensive, however, and it must be maintained after each new query. As a result, approximations are used in practice (Seung et al., 1992; Cohn et al., 1994; Dasgupta et al., 2008).

是否查詢實例的決定可以通過多種方式構建。一種方法是使用一些“信息量度量”或“查詢策略”(參見第3節中的示例)來評估樣本,並做出有偏差的隨機決策,以便更有可能查詢更多信息實例(Dagan和Engelson,1995)。另一種方法是計算一個明確的不確定區域(Cohn等,1994),即實例空間中仍然對學習者不明確的部分,並且只查詢屬於它的實例。一種簡單的方法是在定義區域的信息量度量上設置最小閾值。然後查詢評估高於此閾值的實例。另一個更有原則的方法是定義整個模型類仍然未知的區域,即,與當前標記的訓練集(稱為版本空間)一致的假設集(Mitchell,1982)。換句話說,如果相同模型類的任何兩個模型(但不同的參數設置)對所有標記數據達成一致,但在某些未標記的實例上不一致,則該實例位於不確定區域內。然而,完全且明確地計算該區域在計算上是昂貴的,並且必須在每次新查詢之後保持該區域。為解決這個問題,在實踐中一般使用近似值(Seung等人,1992; Cohn等人,1994; Dasgupta等人,2008)。

The stream-based scenario has been studied in several real-world tasks, including part-of-speech tagging (Dagan and Engelson, 1995), sensor scheduling (Krishnamurthy, 2002), and learning ranking functions for information retrieval (Yu, 2005). Fujii et al. (1998) employ selective sampling for active learning in word sense disambiguation, e.g., determining if the word “bank” means land alongside a river or a financial institution in a given context (only they study Japanese words in their work). The approach not only reduces annotation effort, but also limits the size of the database used in nearest-neighbor learning, which in turn expedites the classification algorithm.

基於流的場景已經在幾個實際任務中得到了研究,包括詞性標註(Dagan和Engelson,1995),傳感器調度(Krishnamurthy,2002),以及用於信息檢索的學習排名函數(Yu,2005)。 Fujii等人(1998)在詞義消歧中採用選擇性抽樣進行主動學習,例如,判斷單詞“銀行”在給定環境中是否與河流或金融機構並行(只有他們的工作是研究日語單詞)。 該方法不僅減少了註釋工作量,而且還限制了最近鄰算法學習中使用的數據庫大小,這反過來加速了分類算法。

It is worth noting that some authors (e.g., Thompson et al., 1999; Moskovitch et al., 2007) use “selective sampling” to refer to the pool-based scenario described in the next section. Under this interpretation, the term merely signifies that queries are made with a select set of instances sampled from a real data distribution. However, in most of the literature selective sampling refers to the stream-based scenario described here.

值得注意的是,一些作者(例如,Thompson等,1999; Moskovitch等,2007)使用“選擇性採樣”來指代下一節中描述的基於池的場景。 在此解釋下,該術語僅表示使用從實際數據分佈中採樣的一組選定實例進行查詢。 然而,在大多數文獻中,選擇性採樣是指這裡描述的基於流的場景。

2.3 Pool-Based Sampling

For many real-world learning problems, large collections of unlabeled data can be gathered at once. This motivates pool-based sampling (Lewis and Gale, 1994), which assumes that there is a small set of labeled data L and a large pool of unlabeled data U available. Queries are selectively drawn from the pool, which is usually assumed to be closed (i.e., static or non-changing), although this is not strictly necessary. Typically, instances are queried in a greedy fashion, according to an informativeness measure used to evaluate all instances in the pool (or, perhaps if U is very large, some subsample thereof). The examples from Section 1.2 use this active learning setting.

2.3基於池的抽樣

對於許多現實世界的學習問題,可以一次收集到大量未標記數據。 這推動了基於池的抽樣的產生(Lewis和Gale,1994),其假定存在一小組標記數據L和大量未標記數據u可用。 從池中選擇性地抽取查詢,這通常被認為是閉合的(即,靜態的或不變的),儘管這不是嚴格必需的。 通常,實例以貪婪的方式查詢,並通過信息量度量評估池中所有實例。 第1.2節中的示例使用此主動學習設置。

The pool-based scenario has been studied for many real-world problem domains in machine learning, such as text classification (Lewis and Gale, 1994; McCallum and Nigam, 1998; Tong and Koller, 2000; Hoi et al., 2006a), information extraction (Thompson et al., 1999; Settles and Craven, 2008), image classification and retrieval (Tong and Chang, 2001; Zhang and Chen, 2002), video classification and retrieval (Yan et al., 2003; Hauptmann et al., 2006), speech recognition (Tur et al., 2005), and cancer diagnosis (Liu, 2004) to name a few.

基於池的場景在機器學習中的許多現實問題領域得到了研究,例如文本分類(Lewis和Gale,1994; McCallum和Nigam,1998; Tong和Koller,2000; Hoi等,2006a), 信息提取(Thompson等,1999; Settles和Craven,2008),圖像分類和檢索(Tong和Chang,2001; Zhang和Chen,2002),視頻分類和檢索(Yan et al。,2003; Hauptmann et al 。,2006),語音識別(Tur et al。,2005),癌症診斷(Liu,2004)等等。

The main difference between stream-based and pool-based active learning is that the former scans through the data sequentially and makes query decisions individually, whereas the latter evaluates and ranks the entire collection before selecting the best query. While the pool-based scenario appears to be much more common among application papers, one can imagine settings where the stream-based approach is more appropriate. For example, when memory or processing power may be limited, as with mobile and embedded devices.

基於流和基於池的主動學習之間的主要區別在於,前者按順序掃描數據並單獨進行查詢決策,而後者在選擇最佳查詢之前對整個集合進行評估和排序。 雖然基於池的場景似乎在應用程序文件中更為常見,但可以想象基於流的場景更實用。 例如,當移動和嵌入式設備的存儲器或處理能力可能受限時。


分享到:


相關文章: