Data Mining in Practice: Part 1


I audited the Data Mining course at HKUST.
Dr. Lei Chen was kind enough to make the assignments and slides publicly available at http://home.cse.ust.hk/~leichen/courses/mscbd-5002/ so it would be a shame not to do them. (The link may stop working after the semester ends.)
So my first hands-on data mining exercise was Assignment 2 of this course.


The task description is as follows:


Task Description
The dataset comes from the 1994 Census database. The prediction task is to determine whether a person makes over 50K a year.
Files Description
1. trainFeatures.csv: 34,189 individuals' basic information with 14 attributes, for training.
2. testFeatures.csv: 14,653 individuals' basic information with 14 attributes, for testing.
3. trainLabels.csv: 34,189 individuals' incomes; 0: <=50K, 1: >50K.
4. sampleSubmission.csv: A sample submission file you may refer to.
5. dataDescription.pdf: Information about the 14 attributes.


Data Description
trainFeatures.csv & testFeatures.csv
age: The age of the individual; this attribute is continuous.
work-class: The type of the employer that the individual has, involving Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked; this attribute is nominal.
fnlwgt: Final weight; this is the number of people the census believes the entry represents; this attribute is continuous.
education: The highest level of education achieved by the individual, involving Preschool, 1st-4th, 5th-6th, 7th-8th, 9th, 10th, 11th, 12th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, Some-college, Bachelors, Masters, Doctorate; this attribute is nominal.
education-num: Highest level of education in numerical form; this attribute is continuous.
marital-status: Marital status of the individual, involving Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, Separated, Widowed; this attribute is nominal.
occupation: The occupation of the individual, involving Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces; this attribute is nominal.
relationship: The family relationship of the individual, involving Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried; this attribute is nominal.
race: The race of the individual, involving White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black; this attribute is nominal.
sex: Female, Male; This attribute is nominal.
capital-gain: capital gains recorded; This attribute is continuous.
capital-loss: capital losses recorded; this attribute is continuous.
hours-per-week: Hours worked per week; this attribute is continuous.
native-country: The individual's native country, involving United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands; this attribute is nominal.
trainLabels.csv
0: means a person makes no more than 50K a year, i.e. <=50K
1: means a person makes over 50K a year, i.e. >50K

As in my previous post, the first thing to do after getting the data is to look at the label distribution.
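A quick way to check this with pandas (a minimal sketch; I'm assuming the label file has a single label column, which may not match the actual header):

```python
import pandas as pd

# Load the training features and labels (file names from the task description)
X = pd.read_csv('trainFeatures.csv')
y = pd.read_csv('trainLabels.csv').iloc[:, 0]

# Absolute and relative class frequencies
print(y.value_counts())
print(y.value_counts(normalize=True))
```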

Missing Value Handling


In fact only 5.8% of the values are missing; leaving them unfilled has little impact on the final result once the dummy matrix transformation is applied.
I still experimented with a few ways of handling the missing values:

  • No handling, feed the data straight into the dummy matrix -> LightGBM accuracy 0.8736 (0.0053)
  • Drop every row that contains a missing value -> performance drops noticeably, LightGBM accuracy 0.76
  • Fill with the mode -> no significant change, LightGBM accuracy 0.8730 (0.0067)
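The two simpler strategies can be written in pandas roughly as follows (a sketch assuming `X` is the training feature DataFrame, not the post's actual code):

```python
# Strategy 2: drop every row that contains a missing value
# (the corresponding rows of the label vector must be dropped as well)
X_dropped = X.dropna()

# Strategy 3: fill each column's missing values with that column's mode
X_filled = X.fillna(X.mode().iloc[0])
```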

Here I just record how to fill in missing values with a RandomForest (using occupation as an example):
ref: http://www.cnblogs.com/north-north/p/4353365.html
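A minimal sketch of the idea, assuming the predictor columns are already numeric (e.g. label-encoded as in the next section); the function name and parameters are mine, not from the referenced post:

```python
from sklearn.ensemble import RandomForestClassifier

def rf_impute(df, target_col):
    """Fill missing values of target_col by predicting them from the other columns."""
    known = df[df[target_col].notnull()]
    unknown = df[df[target_col].isnull()]
    if unknown.empty:
        return df

    # Use only columns with no missing values of their own as predictors
    features = [c for c in df.columns
                if c != target_col and df[c].notnull().all()]

    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(known[features], known[target_col])
    df.loc[df[target_col].isnull(), target_col] = rf.predict(unknown[features])
    return df

# e.g. X = rf_impute(X, 'occupation')
```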

Label Encoding: 
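A minimal sketch of label-encoding the nominal attributes with scikit-learn, assuming `X` is the feature DataFrame and using the column names from the data description (the actual CSV headers may differ slightly):

```python
from sklearn.preprocessing import LabelEncoder

nominal_cols = ['work-class', 'education', 'marital-status', 'occupation',
                'relationship', 'race', 'sex', 'native-country']

for col in nominal_cols:
    # astype(str) lets any remaining NaN values become their own category
    X[col] = LabelEncoder().fit_transform(X[col].astype(str))
```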

After handling the missing values, I checked whether any attributes could be processed further (merged / dropped / added).
It turns out fnlwgt has essentially no relationship with the prediction target,
so I dropped this column.

Then I subtracted capital-loss from capital-gain to get capital-net-gain, and replaced the original two columns with this net gain.
// In hindsight this may be questionable: even with a negative net gain, someone who has money to invest is quite likely to be a high earner. After building the model I redid this step, and it made no real difference to the result.
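Both tweaks in pandas (a sketch, assuming the column names from the data description):

```python
# fnlwgt shows essentially no relationship with the label, so drop it
X = X.drop(columns=['fnlwgt'])

# Replace capital-gain / capital-loss with a single net-gain feature
X['capital-net-gain'] = X['capital-gain'] - X['capital-loss']
X = X.drop(columns=['capital-gain', 'capital-loss'])
```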

Dummy Matrix Transformation


The dummy matrix transformation was also covered in my previous post.
Using it here improves the accuracy by roughly one percentage point: 84 -> 85.
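A sketch with pandas' get_dummies, reusing the nominal_cols list from the label-encoding sketch above; concatenating train and test first keeps the dummy columns aligned between the two sets (the variable names are mine):

```python
import pandas as pd

# X_train / X_test are the processed train and test feature frames
all_X = pd.concat([X_train, X_test], keys=['train', 'test'])
all_X = pd.get_dummies(all_X, columns=nominal_cols)

X_train = all_X.loc['train']
X_test = all_X.loc['test']
```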

Standardize & PCA

Standardize the data and reduce its dimensionality with principal component analysis.
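A sketch of this step with scikit-learn; the post does not say how many components were kept, so the 95%-variance setting below is just an assumption:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Fit on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

pca = PCA(n_components=0.95)   # keep enough components to explain 95% of the variance
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
```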

Base Model

Here I tried a few basic classification algorithms:

  • Stochastic Gradient Descent
  • Decision Tree
  • Random Forest
  • KNN
  • AdaBoost
  • XGBoost (I don't know its implementation well, so I wasn't sure how to set the parameters and just used the defaults)
  • LightGBM

Among these base algorithms LightGBM wins, and it also runs remarkably fast.
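A sketch of how such a comparison can be run with cross-validation; the fold count and all hyperparameters here are my assumptions, not the settings actually used:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

models = {
    'SGD': SGDClassifier(),
    'DecisionTree': DecisionTreeClassifier(),
    'RandomForest': RandomForestClassifier(n_estimators=100),
    'KNN': KNeighborsClassifier(),
    'AdaBoost': AdaBoostClassifier(),
    'XGBoost': XGBClassifier(),    # default parameters
    'LightGBM': LGBMClassifier(),  # default parameters
}

# Report mean accuracy and its spread across the folds for each model
for name, model in models.items():
    scores = cross_val_score(model, X_train_pca, y, cv=5, scoring='accuracy')
    print('%s: %.4f (%.4f)' % (name, scores.mean(), scores.std()))
```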


-End-

Comments

  1. In the first step I noticed that the label distribution is imbalanced: in the sample, the ratio of people earning <=50K to people earning >50K is 26:8.
    There are a few common ways to deal with this (see the sketch below):
    1. Oversampling (synthesize more samples of the minority class, e.g. with the SMOTE algorithm)
    2. Undersampling (drop samples from the majority class when training the model)
    3. Adjust the model's misclassification penalty (give the minority class a higher penalty weight)
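    A sketch of options 1 and 3, assuming the imbalanced-learn package for SMOTE; none of this code is from the original post:

```python
from imblearn.over_sampling import SMOTE
from lightgbm import LGBMClassifier

# Option 1: oversample the minority class with SMOTE
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train_pca, y)

# Option 3: weight the minority class more heavily in the model instead
clf = LGBMClassifier(class_weight='balanced')
```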

  2. Dropping the fnlwgt column is not really justified. According to the attribute description, it is the number of people the census takers estimate a record of this kind represents, even though correlation analysis shows no direct relationship with the label (which is easy to understand).

    Perhaps fnlwgt should instead be reflected as a per-sample weight when computing the loss! (A sketch of the idea follows.)
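    Most scikit-learn-style classifiers accept per-sample weights at fit time; a tiny sketch of the idea, where fnlwgt_train is a hypothetical variable holding the fnlwgt column saved before it was dropped:

```python
from lightgbm import LGBMClassifier

# Weight each training example by its fnlwgt value (fnlwgt_train is hypothetical)
clf = LGBMClassifier()
clf.fit(X_train_pca, y, sample_weight=fnlwgt_train)
```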

  3. Tip:
    The number reported for each model is the mean and variance of its accuracy under k-fold cross-validation.
    accuracy = (TP + TN) / (TP + FP + TN + FN)

