Java学习-Day27

基于 M-distance 的推荐

一、算法概述

所谓 M-distance, 就是根据平均分来计算两个用户 (或项目) 之间的距离.

令项目的 \(j\) 的平均分为 \(x_{.j}\)

采用 item-based recommendation, 则第 \(j\) 个项目关于第 \(i\) 个用户的邻居项目集合为

\[ N_{ij} = \{ 1 \le {j}' \le m | {j}' \neq j, p_{i{j}'} \neq 0, \left | \overline {r_{.j}} - \overline {r_{.{j}'}} \right | < \delta \} \]

\(i\) 个用户对 \(j\) 个项目的评分预测为

\[ p_{ij} = \frac{\textstyle \sum_{j' \in N_{ij} } r_{i{j}'} }{\left | N_{ij} \right |} \]

二、算法特点

邻居不用 \(k\) 控制. 距离小于 radius ( 即 \(\delta\) ) 的都是邻居. 使用 M-distance 时, 这种方式效果更好.

使用 leave-one-out 的测试方式, 很高效的算法才能使用这种方式. 这种算法的特点就是只用一个做为测试, 其他全部用于训练. 所以本文代码就没有对缺失的评分进行预测, 仅仅只是通过已有数据来测试算法的准确性.

最后求平均绝对误差 MAE 和 均方根误差 RMSE 来对算法的优劣性进行判断.

三、算法图解

\(u_i (i=0,1,2,3,4)\) 表示不同用户, \(m_i (i=0,1,2,3,4)\) 表示不同电影.

表格中的数字表示不同用户对不同电影的评分.

\(num\) 表示对于某一个电影已有数据的个数, \(sum\) 表示对某个电影已有数据的总和.

\(\overline{r}\) 表示平均评分, 即 (\(\frac{sum}{num}\))

图中 0 占大多数, 表示还未能看过电影. 我们就是需要对这些数据进行预测.

在预测时只取差值小于 \(\delta\) 的那一列数据. 如果列中对需推荐用户的值为 0 则不带入计算.

图中的问号在实际中是知道并存在的, 只是我们故意装作不知道它.

四、代码分析

1. 数据集分析

评分表 (用户, 项目, 评分) 的压缩方式给出 前几行数据为:

1
2
3
4
5
6
7
8
9
10
11
0,0,5
0,1,3
0,2,4
0,3,3
0,4,3
0,5,5
0,6,4

1,0,4
1,9,2
1,12,4
其中, "0,2,4" 表示用户 0 对项目 2 的评分为 4. 用户 1 对项目 1、2 等的评分没有, 表示没看过该电影. 在用户数、项目数很多时, 必须使用压缩存储.

2. 完整代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
package knn;

import java.io.*;

/**
* Recommendation with M-distance.
*
* @author Shi-Huai Wen Email: shihuaiwen@outlook.com.
*/
public class MBR {
/**
* Default rating for 1-5 points.
*/
public static final double DEFAULT_RATING = 3.0;

/**
* The total number of users.
* The number of row in matrix.
*/
public int numUsers;

/**
* The total number of items.
* The number of column in matrix.
*/
public int numItems;

/**
* The total number of ratings (non-zero values)
* The number of Rating in matrix.
*/
public int numRatings;

/**
* The predictions.
*/
public double[] predictions;

/**
* Compressed rating matrix. User-item-rating triples.
* Just list 0:0:1 = user0:movie0:rating1.
* In the variable maybe not continuous.
*/
public int[][] compressedRatingMatrix;

/**
* The degree of users (how many item he has rated).
* if userDegrees[0] = 10, so user0 has 10 item what rated.
*/
public int[] userDegrees;

/**
* The average rating of the current user.
*/
public double[] userAverageRatings;

/**
* The degree of items (how many item has rated).
*/
public int[] itemDegrees;

/**
* The average rating of the current item.
*/
public double[] itemAverageRatings;

/**
* The first user start from 0. Let the first user has x ratings, the second
* user will start from x. Also, we can use userDegrees to add.
*/
public int[] userStartingIndices;

/**
* Number of non-neighbor objects.
*/
private int numNonNeighbors;

/**
* The radius (delta) for determining the neighborhood.
*/
private double radius;

/**
* ************************
* Construct the rating matrix.
*
* @param paraFilename the rating filename.
* @param paraNumUsers number of users
* @param paraNumItems number of items
* @param paraNumRatings number of ratings
* ************************
*/
public MBR(String paraFilename, int paraNumUsers, int paraNumItems, int paraNumRatings) throws Exception {
// Step 1. Initialize these arrays
numItems = paraNumItems;
numUsers = paraNumUsers;
numRatings = paraNumRatings;

userDegrees = new int[numUsers];
userAverageRatings = new double[numUsers];
// The last one in userStartingIndices record the border index of the matrix.
userStartingIndices = new int[numUsers + 1];

itemDegrees = new int[numItems];
itemAverageRatings = new double[numItems];
compressedRatingMatrix = new int[numRatings][3];

// use all ratings
predictions = new double[numRatings];

System.out.println("Reading " + paraFilename);

// Step 2. Read the data file.
File tempFile = new File(paraFilename);
if (!tempFile.exists()) {
System.out.println("File " + paraFilename + " does not exists.");
System.exit(0);
} // Of if
BufferedReader tempBufReader = new BufferedReader(new FileReader(tempFile));

String tempString;
String[] tempStrArray;

int tempIndex = 0;
userStartingIndices[0] = 0;
userStartingIndices[numUsers] = numRatings;

while ((tempString = tempBufReader.readLine()) != null) {
// Each line has three values
tempStrArray = tempString.split(",");
compressedRatingMatrix[tempIndex][0] = Integer.parseInt(tempStrArray[0]);
compressedRatingMatrix[tempIndex][1] = Integer.parseInt(tempStrArray[1]);
compressedRatingMatrix[tempIndex][2] = Integer.parseInt(tempStrArray[2]);

userDegrees[compressedRatingMatrix[tempIndex][0]]++;
itemDegrees[compressedRatingMatrix[tempIndex][1]]++;

if (tempIndex > 0) {
// Starting to read the data of a new user.
if (compressedRatingMatrix[tempIndex][0] != compressedRatingMatrix[tempIndex - 1][0]) {
userStartingIndices[compressedRatingMatrix[tempIndex][0]] = tempIndex;
} // Of if
} // Of if
tempIndex++;
} // Of while
tempBufReader.close();

double[] tempUserTotalScore = new double[numUsers];
double[] tempItemTotalScore = new double[numItems];
for (int i = 0; i < numRatings; i++) {
tempUserTotalScore[compressedRatingMatrix[i][0]] += compressedRatingMatrix[i][2];
tempItemTotalScore[compressedRatingMatrix[i][1]] += compressedRatingMatrix[i][2];
} // Of for i

for (int i = 0; i < numUsers; i++) {
userAverageRatings[i] = tempUserTotalScore[i] / userDegrees[i];
} // Of for i
for (int i = 0; i < numItems; i++) {
itemAverageRatings[i] = tempItemTotalScore[i] / itemDegrees[i];
} // Of for i
}// Of the first constructor

/**
* ************************
* Set the radius (delta).
*
* @param paraRadius The given radius.
* ************************
*/
public void setRadius(double paraRadius) {
if (paraRadius > 0) {
radius = paraRadius;
} else {
radius = 0.1;
} // Of if
}// Of setRadius

/**
* ************************
* Leave-one-out prediction. The predicted values are stored in predictions.
*
* @see #predictions
* ************************
*/
public void leaveOneOutPrediction() {
double tempItemAverageRating;
// Make each line of the code shorter.
int tempUser, tempItem, tempRating;
System.out.println("\r\nLeaveOneOutPrediction for radius " + radius);

numNonNeighbors = 0;
for (int i = 0; i < numRatings; i++) {
tempUser = compressedRatingMatrix[i][0];
tempItem = compressedRatingMatrix[i][1];
tempRating = compressedRatingMatrix[i][2];

// Step 1. Recompute average rating of the current item.
// Use one for test, so we should cut the one.
tempItemAverageRating = (itemAverageRatings[tempItem] * itemDegrees[tempItem] - tempRating)
/ (itemDegrees[tempItem] - 1);

// Step 2. Recompute neighbors, at the same time obtain the ratings of neighbors.
// The same user with his different movies.
int tempNeighbors = 0;
double tempTotal = 0;
int tempComparedItem;
for (int j = userStartingIndices[tempUser]; j < userStartingIndices[tempUser + 1]; j++) {
tempComparedItem = compressedRatingMatrix[j][1];
if (tempItem == tempComparedItem) {
continue;// Ignore itself.
} // Of if

if (Math.abs(tempItemAverageRating - itemAverageRatings[tempComparedItem]) < radius) {
tempTotal += compressedRatingMatrix[j][2];
tempNeighbors++;
} // Of if
} // Of for j

// Step 3. Predict as the average value of neighbors.
if (tempNeighbors > 0) {
predictions[i] = tempTotal / tempNeighbors;
} else {
predictions[i] = DEFAULT_RATING;
numNonNeighbors++;
} // Of if
} // Of for i
}// Of leaveOneOutPrediction

/**
* ************************
* Compute the MAE based on the deviation of each leave-one-out.
*
* @author Shi-Huai Wen
* ************************
*/
public double computeMAE() {
double tempTotalError = 0;
for (int i = 0; i < predictions.length; i++) {
tempTotalError += Math.abs(predictions[i] - compressedRatingMatrix[i][2]);
} // Of for i

return tempTotalError / predictions.length;
}// Of computeMAE

/**
* ************************
* Compute the MAE based on the deviation of each leave-one-out.
*
* @author Shi-Huai Wen
* ************************
*/
public double computeRSME() {
double tempTotalError = 0;
for (int i = 0; i < predictions.length; i++) {
tempTotalError += (predictions[i] - compressedRatingMatrix[i][2]) * (predictions[i] - compressedRatingMatrix[i][2]);
} // Of for i

double tempAverage = tempTotalError / predictions.length;

return Math.sqrt(tempAverage);
}// Of computeRSME

/**
* ************************
* The entrance of the program.
*
* @param args Not used now.
* ************************
*/
public static void main(String[] args) {
try {
MBR tempRecommender = new MBR("D:/Work/sampledata/movielens-943u1682m.txt", 943, 1682, 100000);

for (double tempRadius = 0.2; tempRadius < 0.6; tempRadius += 0.1) {
tempRecommender.setRadius(tempRadius);

tempRecommender.leaveOneOutPrediction();
double tempMAE = tempRecommender.computeMAE();
double tempRSME = tempRecommender.computeRSME();

System.out.println("Radius = " + tempRadius + ", MAE = " + tempMAE + ", RSME = " + tempRSME
+ ", numNonNeighbors = " + tempRecommender.numNonNeighbors);
} // Of for tempRadius
} catch (Exception ee) {
System.out.println("Error occurred in main \r\n" + ee);
} // Of try
}// Of main
}// Of class MBR

3. 运行截图

总结

大道至简, 越是简单的算法可能会解决更多的问题.

不过代码中的数组太过于多, 阅读起来存在一定的障碍.

运行结果中出现了 0.1 + 0.2 = 0.30000000000000004 , 浮点数的奥秘.