Java Learning - Day 33

Active Learning: ALEC

I. Semi-supervised Learning

Day 28 discussed supervised and unsupervised learning for classification. The advantage of unsupervised learning over supervised learning is that it needs no labels, but the correctness of its results is harder to verify. Supervised learning requires a labeled dataset, which makes it much easier to confirm whether a classification is correct.

In the real world, however, datasets are rarely fully labeled; often only a small fraction of the samples carry labels. Semi-supervised learning was proposed for exactly this situation.

For example, in stage A we are given a dataset S1 of 1000 samples, of which 100 are labeled and the remaining 900 are not. We use all 1000 samples in S1 to build a classifier. In stage B we are given a dataset S2 of 500 new samples, disjoint from S1, and classify them with that classifier.

The key question here is how the 900 unlabeled samples in S1 affect the accuracy of the classifier.

II. Active Learning

1. Dataset Settings

Before discussing active learning, two concepts are needed: closed world and open world.

Closed world, as the name suggests, means operating on a single fixed dataset. For example, we are given 1000 samples in the morning and may query the labels of 100 of them; after building a classifier, we classify the remaining 900.

Open world, in contrast, involves two sets of samples: one for building the classifier and one for testing. For example, we are given 1000 samples in the morning, may query the labels of 100 of them, and build a classifier; in the afternoon we are given 500 new samples to classify.

Why may we query only part of the samples? This matches exactly the partially labeled setting of semi-supervised learning.

2. The Scenario

First, the machine learning algorithm learns a classifier from the labeled dataset. It then selects some samples from the unlabeled dataset and hands them to a human to be labeled. These newly labeled samples are added to the labeled dataset, and the cycle repeats. The process is sketched in Figure 1.

Figure 1. The active learning scenario

In the figure, the labeled training set is the labeled data; the Machine Learning Model is the classifier learned from it; the unlabeled pool is the unlabeled data; and the oracle is the human annotator, whose role in the loop is to label the selected unlabeled samples.
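The loop in Figure 1 can be sketched in a few lines of Java. Everything in this sketch is an illustrative assumption, not part of ALEC or any real library: the "model" is a toy 1-nearest-neighbor classifier on one-dimensional data, the query strategy simply asks for the unlabeled point farthest from every labeled point, and the oracle is simulated by an array of true labels.

```java
import java.util.*;

public class ActiveLearningLoop {
    public static void main(String[] args) {
        double[] pool = {0.1, 0.2, 0.3, 5.1, 5.2, 5.3}; // the unlabeled pool
        int[] trueLabels = {0, 0, 0, 1, 1, 1};          // the simulated oracle

        List<Integer> labeled = new ArrayList<>();
        labeled.add(0);                                  // one seed label
        int budget = 2;                                  // queries we may ask

        while (budget-- > 0) {
            // Query strategy: pick the unlabeled point farthest from any
            // labeled point (a crude stand-in for "most informative").
            int best = -1;
            double bestDist = -1;
            for (int i = 0; i < pool.length; i++) {
                if (labeled.contains(i)) continue;
                double d = Double.MAX_VALUE;
                for (int j : labeled) d = Math.min(d, Math.abs(pool[i] - pool[j]));
                if (d > bestDist) { bestDist = d; best = i; }
            }
            labeled.add(best);                           // the oracle labels it
        }

        // Classify everything by 1-NN over the labeled set.
        int[] predicted = new int[pool.length];
        for (int i = 0; i < pool.length; i++) {
            int nearest = labeled.get(0);
            for (int j : labeled)
                if (Math.abs(pool[i] - pool[j]) < Math.abs(pool[i] - pool[nearest]))
                    nearest = j;
            predicted[i] = trueLabels[nearest];
        }
        System.out.println(Arrays.toString(predicted));
    }
}
```

With one seed label and a budget of two queries, the strategy ends up querying a point in the far cluster, after which 1-NN classifies every sample correctly.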

III. Three-way Active Learning

Clustering-based active learning is shown in Figure 2.

Figure 2. Three-way active learning

In the figure, samples 7 and 11 are underlined, meaning their labels have been queried. The + and - signs after these two samples denote their classes, which guide the clustering. The two samples serve as the roots of two subtrees that partition the whole tree, and the two resulting subtrees are two clusters.

This process repeats until every cluster is pure, i.e., contains only + samples or only - samples, at which point the algorithm stops.

In the extreme case, every sample would end up as its own cluster. To prevent this, we set a minimum cluster size, and when a block of that size is still not pure, we let the minority follow the majority.
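The "minority follows majority" rule is just a class count over the queried members of a block. Here is a minimal sketch with made-up toy data (block, labels, and queried flags are all illustrative assumptions); the full ALEC implementation below does the same job in its vote method:

```java
public class BlockVote {
    // Return the majority label among the queried members of a block.
    static int vote(int[] block, int[] labels, boolean[] queried, int numClasses) {
        int[] counts = new int[numClasses];
        for (int i : block)
            if (queried[i]) counts[labels[i]]++;   // only queried samples vote
        int best = 0;
        for (int c = 1; c < numClasses; c++)
            if (counts[c] > counts[best]) best = c;
        return best;
    }

    public static void main(String[] args) {
        int[] block = {0, 1, 2, 3, 4};             // an impure small block
        int[] labels = {1, 1, 0, 1, 0};            // the oracle's answers
        boolean[] queried = {true, true, true, false, false}; // 3 of 5 queried
        int majority = vote(block, labels, queried, 2);
        System.out.println("majority = " + majority);
    }
}
```

The two unqueried samples would then be assigned the majority label, exactly the "minority follows majority" behavior described above.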

The ALEC algorithm is built on this idea.

IV. The ALEC Algorithm

1. Idea

The key of ALEC is choosing which samples to query.

The choice relies on a value called density, used together with independence to evaluate how important each sample is.

  1. Take sample X as the center and dc as the radius: in one dimension this gives an interval, in two a circle, in three a ball, and in four a hypersphere. The density of X is the number of samples falling inside this region; the larger the value, the more important the sample.

  2. The independence of sample X is the distance from X to the nearest sample with a higher density; the larger the value, the more independent the sample.

The product of a sample's density and independence is called its representativeness, and samples are selected by this value.
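These three quantities can be sketched on a toy one-dimensional dataset. The cutoff-style density here follows item 1 above; note that the ALEC code below uses a Gaussian kernel instead, and the data and the value of dc are arbitrary assumptions for illustration.

```java
import java.util.*;

public class Representativeness {
    public static void main(String[] args) {
        double[] x = {1.0, 1.2, 1.4, 5.0, 5.2};   // toy 1-D samples
        double dc = 0.3;                           // assumed radius dc
        int n = x.length;

        // Density (item 1): count the samples within radius dc of each point.
        double[] rho = new double[n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (i != j && Math.abs(x[i] - x[j]) < dc)
                    rho[i]++;

        // Maximal pairwise distance, used for the global density peak.
        double maxDist = 0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                maxDist = Math.max(maxDist, Math.abs(x[i] - x[j]));

        // Independence (item 2): distance to the nearest higher-density sample.
        double[] delta = new double[n];
        for (int i = 0; i < n; i++) {
            delta[i] = maxDist;                    // default for the peak
            for (int j = 0; j < n; j++)
                if (rho[j] > rho[i])
                    delta[i] = Math.min(delta[i], Math.abs(x[i] - x[j]));
        }

        // Representativeness: the product of density and independence.
        double[] priority = new double[n];
        for (int i = 0; i < n; i++)
            priority[i] = rho[i] * delta[i];
        System.out.println(Arrays.toString(priority));
    }
}
```

The densest point of the left cluster gets the highest representativeness, and the right cluster's points come next because they are far from it; sorting by this value is exactly how ALEC picks samples to query.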

2. Code

package activelearning;

import java.io.FileReader;
import java.util.*;

import weka.core.Instances;

/**
 * Active learning through density clustering.
 *
 * @author Shi-Huai Wen Email: shihuaiwen@outlook.com.
 */
public class Alec {
    /**
     * The whole dataset.
     */
    Instances dataset;

    /**
     * The maximal number of queries that can be provided.
     */
    int maxNumQuery;

    /**
     * The actual number of queries.
     */
    int numQuery;

    /**
     * The radius, also dc in the paper. It is employed for density computation.
     */
    double radius;

    /**
     * The densities of instances, also rho in the paper.
     */
    double[] densities;

    /**
     * The distance of each instance to its master.
     */
    double[] distanceToMaster;

    /**
     * Sorted indices, where the first element indicates the instance with the
     * biggest density.
     */
    int[] descendantDensities;

    /**
     * The priority of each instance.
     */
    double[] priority;

    /**
     * The maximal distance between any pair of points.
     */
    double maximalDistance;

    /**
     * Who is my master?
     */
    int[] masters;

    /**
     * Predicted labels.
     */
    int[] predictedLabels;

    /**
     * Instance status. 0 for unprocessed, 1 for queried, 2 for classified.
     */
    int[] instanceStatusArray;

    /**
     * The descendant indices to show the representativeness of instances in a
     * descendant order.
     */
    int[] descendantRepresentatives;

    /**
     * Indicate the cluster of each instance. It is only used in
     * clusterInTwo(int[]).
     */
    int[] clusterIndices;

    /**
     * Blocks with size no more than this threshold should not be split further.
     */
    int smallBlockThreshold = 3;

    /**
     *********************************
     * The constructor.
     *
     * @param paraFilename The data filename.
     *********************************
     */
    public Alec(String paraFilename) {
        try {
            FileReader tempReader = new FileReader(paraFilename);
            dataset = new Instances(tempReader);
            dataset.setClassIndex(dataset.numAttributes() - 1);
            tempReader.close();
        } catch (Exception ee) {
            System.out.println("Alec Error: " + ee);
            System.exit(0);
        } // Of try
        computeMaximalDistance();
        clusterIndices = new int[dataset.numInstances()];
    }// Of the constructor

    /**
     *********************************
     * Merge sort in descendant order to obtain an index array. The original
     * array is unchanged. The method should be tested further. <br>
     * Examples: input [1.2, 2.3, 0.4, 0.5], output [1, 0, 3, 2]. <br>
     * input [3.1, 5.2, 6.3, 2.1, 4.4], output [2, 1, 4, 0, 3].
     *
     * @param paraArray the original array
     * @return The sorted indices.
     *********************************
     */
    public static int[] mergeSortToIndices(double[] paraArray) {
        int tempLength = paraArray.length;
        int[][] resultMatrix = new int[2][tempLength];// For merge sort.

        // Initialize
        int tempIndex = 0;
        for (int i = 0; i < tempLength; i++) {
            resultMatrix[tempIndex][i] = i;
        } // Of for i

        // Merge
        int tempCurrentLength = 1;
        // The indices for current merged groups.
        int tempFirstStart, tempSecondStart, tempSecondEnd;

        while (tempCurrentLength < tempLength) {
            // Divide into a number of groups.
            // Here the boundary is adaptive to array length not equal to 2^k.
            for (int i = 0; i < Math.ceil((tempLength + 0.0) / tempCurrentLength / 2); i++) {
                // Boundaries of the group
                tempFirstStart = i * tempCurrentLength * 2;

                tempSecondStart = tempFirstStart + tempCurrentLength;

                tempSecondEnd = tempSecondStart + tempCurrentLength - 1;
                if (tempSecondEnd >= tempLength) {
                    tempSecondEnd = tempLength - 1;
                } // Of if

                // Merge this group
                int tempFirstIndex = tempFirstStart;
                int tempSecondIndex = tempSecondStart;
                int tempCurrentIndex = tempFirstStart;

                if (tempSecondStart >= tempLength) {
                    for (int j = tempFirstIndex; j < tempLength; j++) {
                        resultMatrix[(tempIndex + 1) % 2][tempCurrentIndex] = resultMatrix[tempIndex % 2][j];
                        tempFirstIndex++;
                        tempCurrentIndex++;
                    } // Of for j
                    break;
                } // Of if

                while ((tempFirstIndex <= tempSecondStart - 1)
                        && (tempSecondIndex <= tempSecondEnd)) {
                    if (paraArray[resultMatrix[tempIndex % 2][tempFirstIndex]] >= paraArray[resultMatrix[tempIndex % 2][tempSecondIndex]]) {
                        resultMatrix[(tempIndex + 1) % 2][tempCurrentIndex] = resultMatrix[tempIndex % 2][tempFirstIndex];
                        tempFirstIndex++;
                    } else {
                        resultMatrix[(tempIndex + 1) % 2][tempCurrentIndex] = resultMatrix[tempIndex % 2][tempSecondIndex];
                        tempSecondIndex++;
                    } // Of if
                    tempCurrentIndex++;
                } // Of while

                // Remaining part
                for (int j = tempFirstIndex; j < tempSecondStart; j++) {
                    resultMatrix[(tempIndex + 1) % 2][tempCurrentIndex] = resultMatrix[tempIndex % 2][j];
                    tempCurrentIndex++;
                } // Of for j
                for (int j = tempSecondIndex; j <= tempSecondEnd; j++) {
                    resultMatrix[(tempIndex + 1) % 2][tempCurrentIndex] = resultMatrix[tempIndex % 2][j];
                    tempCurrentIndex++;
                } // Of for j
            } // Of for i

            tempCurrentLength *= 2;
            tempIndex++;
        } // Of while

        return resultMatrix[tempIndex % 2];
    }// Of mergeSortToIndices

    /**
     ********************
     * The Euclidean distance between two instances. Other distance measures
     * unsupported for simplicity.
     *
     * @param paraI The index of the first instance.
     * @param paraJ The index of the second instance.
     * @return The distance.
     ********************
     */
    public double distance(int paraI, int paraJ) {
        double resultDistance = 0;
        double tempDifference;
        for (int i = 0; i < dataset.numAttributes() - 1; i++) {
            tempDifference = dataset.instance(paraI).value(i) - dataset.instance(paraJ).value(i);
            resultDistance += tempDifference * tempDifference;
        } // Of for i
        resultDistance = Math.sqrt(resultDistance);

        return resultDistance;
    }// Of distance

    /**
     *********************************
     * Compute the maximal distance. The result is stored in a member variable.
     *********************************
     */
    public void computeMaximalDistance() {
        maximalDistance = 0;
        double tempDistance;
        for (int i = 0; i < dataset.numInstances(); i++) {
            for (int j = 0; j < dataset.numInstances(); j++) {
                tempDistance = distance(i, j);
                if (maximalDistance < tempDistance) {
                    maximalDistance = tempDistance;
                } // Of if
            } // Of for j
        } // Of for i

        System.out.println("maximalDistance = " + maximalDistance);
    }// Of computeMaximalDistance

    /**
     *****************
     * Compute the densities using Gaussian kernel.
     *****************
     */
    public void computeDensitiesGaussian() {
        System.out.println("radius = " + radius);
        densities = new double[dataset.numInstances()];
        double tempDistance;

        for (int i = 0; i < dataset.numInstances(); i++) {
            for (int j = 0; j < dataset.numInstances(); j++) {
                tempDistance = distance(i, j);
                densities[i] += Math.exp(-tempDistance * tempDistance / radius / radius);
            } // Of for j
        } // Of for i

        System.out.println("The densities are " + Arrays.toString(densities) + "\r\n");
    }// Of computeDensitiesGaussian

    /**
     *********************************
     * Compute distanceToMaster, the distance to its master.
     *********************************
     */
    public void computeDistanceToMaster() {
        distanceToMaster = new double[dataset.numInstances()];
        masters = new int[dataset.numInstances()];
        descendantDensities = new int[dataset.numInstances()];
        instanceStatusArray = new int[dataset.numInstances()];

        descendantDensities = mergeSortToIndices(densities);
        distanceToMaster[descendantDensities[0]] = maximalDistance;

        double tempDistance;
        for (int i = 1; i < dataset.numInstances(); i++) {
            // Initialize.
            distanceToMaster[descendantDensities[i]] = maximalDistance;
            for (int j = 0; j <= i - 1; j++) {
                tempDistance = distance(descendantDensities[i], descendantDensities[j]);
                if (distanceToMaster[descendantDensities[i]] > tempDistance) {
                    distanceToMaster[descendantDensities[i]] = tempDistance;
                    masters[descendantDensities[i]] = descendantDensities[j];
                } // Of if
            } // Of for j
        } // Of for i
        System.out.println("First compute, masters = " + Arrays.toString(masters));
        System.out.println("descendantDensities = " + Arrays.toString(descendantDensities));
    }// Of computeDistanceToMaster

    /**
     *********************************
     * Compute priority. Element with higher priority is more likely to be
     * selected as a cluster center. Now it is rho * distanceToMaster. It can
     * also be rho^alpha * distanceToMaster.
     *********************************
     */
    public void computePriority() {
        priority = new double[dataset.numInstances()];
        for (int i = 0; i < dataset.numInstances(); i++) {
            priority[i] = densities[i] * distanceToMaster[i];
        } // Of for i
    }// Of computePriority

    /**
     ************************
     * The block of a node should be same as its master. This recursive method
     * is efficient.
     *
     * @param paraIndex The index of the given node.
     * @return The cluster index of the current node.
     ************************
     */
    public int coincideWithMaster(int paraIndex) {
        if (clusterIndices[paraIndex] == -1) {
            int tempMaster = masters[paraIndex];
            clusterIndices[paraIndex] = coincideWithMaster(tempMaster);
        } // Of if

        return clusterIndices[paraIndex];
    }// Of coincideWithMaster

    /**
     ************************
     * Cluster a block in two according to the master tree.
     *
     * @param paraBlock The given block.
     * @return The new blocks where the two most representative instances serve as the roots.
     ************************
     */
    public int[][] clusterInTwo(int[] paraBlock) {
        // Reinitialize. In fact, only instances in the given block are
        // considered.
        Arrays.fill(clusterIndices, -1);

        // Initialize the cluster number of the two roots.
        for (int i = 0; i < 2; i++) {
            clusterIndices[paraBlock[i]] = i;
        } // Of for i

        for (int j : paraBlock) {
            if (clusterIndices[j] != -1) {
                // Already have a cluster number.
                continue;
            } // Of if

            clusterIndices[j] = coincideWithMaster(masters[j]);
        } // Of for j

        // The sub blocks.
        int[][] resultBlocks = new int[2][];
        int tempFistBlockCount = 0;
        for (int clusterIndex : clusterIndices) {
            if (clusterIndex == 0) {
                tempFistBlockCount++;
            } // Of if
        } // Of for clusterIndex
        resultBlocks[0] = new int[tempFistBlockCount];
        resultBlocks[1] = new int[paraBlock.length - tempFistBlockCount];

        // Copy. You can design shorter code when the number of clusters is
        // greater than 2.
        int tempFirstIndex = 0;
        int tempSecondIndex = 0;
        for (int j : paraBlock) {
            if (clusterIndices[j] == 0) {
                resultBlocks[0][tempFirstIndex] = j;
                tempFirstIndex++;
            } else {
                resultBlocks[1][tempSecondIndex] = j;
                tempSecondIndex++;
            } // Of if
        } // Of for j

        System.out.println("Split (" + paraBlock.length + ") instances "
                + Arrays.toString(paraBlock) + "\r\nto (" + resultBlocks[0].length + ") instances "
                + Arrays.toString(resultBlocks[0]) + "\r\nand (" + resultBlocks[1].length
                + ") instances " + Arrays.toString(resultBlocks[1]));
        return resultBlocks;
    }// Of clusterInTwo

    /**
     *********************************
     * Classify instances in the block by simple voting.
     *
     * @param paraBlock The given block.
     *********************************
     */
    public void vote(int[] paraBlock) {
        int[] tempClassCounts = new int[dataset.numClasses()];
        for (int j : paraBlock) {
            if (instanceStatusArray[j] == 1) {
                tempClassCounts[(int) dataset.instance(j).classValue()]++;
            } // Of if
        } // Of for j

        int tempMaxClass = -1;
        int tempMaxCount = -1;
        for (int i = 0; i < tempClassCounts.length; i++) {
            if (tempMaxCount < tempClassCounts[i]) {
                tempMaxClass = i;
                tempMaxCount = tempClassCounts[i];
            } // Of if
        } // Of for i

        // Classify unprocessed instances.
        for (int j : paraBlock) {
            if (instanceStatusArray[j] == 0) {
                predictedLabels[j] = tempMaxClass;
                instanceStatusArray[j] = 2;
            } // Of if
        } // Of for j
    }// Of vote

    /**
     *********************************
     * Cluster based active learning. The driver method that prepares the data
     * structures.
     *
     * @param paraRatio The ratio of the maximal distance as the dc.
     * @param paraMaxNumQuery The maximal number of queries for the whole dataset.
     * @param paraSmallBlockThreshold The small block threshold.
     *********************************
     */
    public void clusterBasedActiveLearning(double paraRatio, int paraMaxNumQuery,
            int paraSmallBlockThreshold) {
        radius = maximalDistance * paraRatio;
        smallBlockThreshold = paraSmallBlockThreshold;

        maxNumQuery = paraMaxNumQuery;
        predictedLabels = new int[dataset.numInstances()];

        for (int i = 0; i < dataset.numInstances(); i++) {
            predictedLabels[i] = -1;
        } // Of for i

        computeDensitiesGaussian();
        computeDistanceToMaster();
        computePriority();
        descendantRepresentatives = mergeSortToIndices(priority);
        System.out.println(
                "descendantRepresentatives = " + Arrays.toString(descendantRepresentatives));

        numQuery = 0;
        clusterBasedActiveLearning(descendantRepresentatives);
    }// Of clusterBasedActiveLearning

    /**
     *********************************
     * Cluster based active learning.
     *
     * @param paraBlock The given block. This block must be sorted according to the priority in descendant order.
     *********************************
     */
    public void clusterBasedActiveLearning(int[] paraBlock) {
        System.out.println("clusterBasedActiveLearning for block " + Arrays.toString(paraBlock));

        // Step 1. How many labels are queried for this block.
        int tempExpectedQueries = (int) Math.sqrt(paraBlock.length);
        int tempNumQuery = 0;
        for (int j : paraBlock) {
            if (instanceStatusArray[j] == 1) {
                tempNumQuery++;
            } // Of if
        } // Of for j

        // Step 2. Vote for small blocks.
        if ((tempNumQuery >= tempExpectedQueries) && (paraBlock.length <= smallBlockThreshold)) {
            System.out.println("" + tempNumQuery + " instances are queried, vote for block: \r\n"
                    + Arrays.toString(paraBlock));
            vote(paraBlock);

            return;
        } // Of if

        // Step 3. Query enough labels.
        for (int i = 0; i < tempExpectedQueries; i++) {
            if (numQuery >= maxNumQuery) {
                System.out.println("No more queries are provided, numQuery = " + numQuery + ".");
                vote(paraBlock);
                return;
            } // Of if

            if (instanceStatusArray[paraBlock[i]] == 0) {
                instanceStatusArray[paraBlock[i]] = 1;
                predictedLabels[paraBlock[i]] = (int) dataset.instance(paraBlock[i]).classValue();
                numQuery++;
            } // Of if
        } // Of for i

        // Step 4. Pure?
        int tempFirstLabel = predictedLabels[paraBlock[0]];
        boolean tempPure = true;
        for (int i = 1; i < tempExpectedQueries; i++) {
            if (predictedLabels[paraBlock[i]] != tempFirstLabel) {
                tempPure = false;
                break;
            } // Of if
        } // Of for i
        if (tempPure) {
            System.out.println("Classify for pure block: " + Arrays.toString(paraBlock));
            for (int i = tempExpectedQueries; i < paraBlock.length; i++) {
                if (instanceStatusArray[paraBlock[i]] == 0) {
                    predictedLabels[paraBlock[i]] = tempFirstLabel;
                    instanceStatusArray[paraBlock[i]] = 2;
                } // Of if
            } // Of for i
            return;
        } // Of if

        // Step 5. Split in two and process them independently.
        int[][] tempBlocks = clusterInTwo(paraBlock);
        for (int i = 0; i < 2; i++) {
            // Attention: recursive invoking here.
            clusterBasedActiveLearning(tempBlocks[i]);
        } // Of for i
    }// Of clusterBasedActiveLearning

    /**
     ******************
     * Show the statistics information.
     ******************
     */
    public String toString() {
        int[] tempStatusCounts = new int[3];
        double tempCorrect = 0;
        for (int i = 0; i < dataset.numInstances(); i++) {
            tempStatusCounts[instanceStatusArray[i]]++;
            if (predictedLabels[i] == (int) dataset.instance(i).classValue()) {
                tempCorrect++;
            } // Of if
        } // Of for i

        String resultString = "(unhandled, queried, classified) = "
                + Arrays.toString(tempStatusCounts);
        resultString += "\r\nCorrect = " + tempCorrect + ", accuracy = "
                + (tempCorrect / dataset.numInstances());

        return resultString;
    }// Of toString

    /**
     *********************************
     * The entrance of the program.
     *
     * @param args Not used now.
     *********************************
     */
    public static void main(String[] args) {
        long tempStart = System.currentTimeMillis();

        System.out.println("Starting ALEC.");
        String arffFilename = "D:/Work/sampledata/iris.arff";

        Alec tempAlec = new Alec(arffFilename);
        // The settings for iris
        tempAlec.clusterBasedActiveLearning(0.15, 30, 3);

        System.out.println(tempAlec);

        long tempEnd = System.currentTimeMillis();
        System.out.println("Runtime: " + (tempEnd - tempStart) + "ms.");
    }// Of main
} // Of class Alec

3. Run Screenshot

Summary

The code is quite long, and this post only covers the basic idea; fully understanding ALEC will require working through the code in more depth, so it looks like another post is needed to finish this topic. There really is a huge gulf between an idea and its implementation. Keep going.