Data Mining in Practice: Part 1


I audited the Data Mining course at HKUST.
Dr. Lei Chen was kind enough to make the assignments and slides publicly available at http://home.cse.ust.hk/~leichen/courses/mscbd-5002/ so it would be a shame not to do them. (The link may stop working after the semester ends.)
So my first hands-on data mining exercise was Assignment 2 of this course.


The task description is as follows:


Task Description
The dataset comes from the 1994 Census database. The prediction task is to determine whether a person makes over 50K a year.
Files Description
1. trainFeatures.csv: 34,189 individuals' basic information with 14 attributes, for training.
2. testFeatures.csv: 14,653 individuals' basic information with 14 attributes, for testing.
3. trainLabels.csv: 34,189 individuals' incomes; 0: <=50K, 1: >50K.
4. sampleSubmission.csv: A sample submission file you may refer to.
5. dataDescription.pdf: Information about the 14 attributes.


Data Description
trainFeatures.csv & testFeatures.csv
age: The age of the individual; this attribute is continuous.
work-class: The type of the employer that the individual has, involving Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked; this attribute is nominal.
fnlwgt: Final weight; this is the number of people the census believes the entry represents; this attribute is continuous.
education: The highest level of education achieved by the individual, involving Preschool, 1st-4th, 5th-6th, 7th-8th, 9th, 10th, 11th, 12th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, Some-college, Bachelors, Masters, Doctorate; this attribute is nominal.
education-num: Highest level of education in numerical form; this attribute is continuous.
marital-status: Marital status of the individual, involving Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, Separated, Widowed; this attribute is nominal.
occupation: The occupation of the individual, involving Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces; this attribute is nominal.
relationship: The family relationship of the individual, involving Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried; this attribute is nominal.
race: The race of the individual, involving White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black; this attribute is nominal.
sex: Female, Male; This attribute is nominal.
capital-gain: capital gains recorded; This attribute is continuous.
capital-loss: capital losses recorded; this attribute is continuous.
hours-per-week: Hours worked per week; this attribute is continuous.
native-country: The individual's native country, involving United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands; this attribute is nominal.
trainLabels.csv
0: means a person makes no more than 50K a year, i.e. <=50K
1: means a person makes over 50K a year, i.e. >50K

As in my previous post, the first thing to do after getting the data is to look at the label distribution.
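A quick way to check this with pandas (a minimal sketch; I'm assuming the label file has a single label column, which may not match the actual header):

```python
import pandas as pd

# Load the training features and labels (file names from the task description)
X = pd.read_csv('trainFeatures.csv')
y = pd.read_csv('trainLabels.csv').iloc[:, 0]

# Absolute and relative class frequencies
print(y.value_counts())
print(y.value_counts(normalize=True))
```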

Missing Value Handling


In fact only 5.8% of the values are missing; leaving them unfilled has little impact on the final result once the dummy matrix transformation is applied.
I still experimented with a few ways of handling the missing values:

  • No handling, feed the data straight into the dummy matrix -> LightGBM accuracy 0.8736 (0.0053)
  • Drop every row that contains a missing value -> performance drops noticeably, LightGBM accuracy 0.76
  • Fill with the mode -> no significant change, LightGBM accuracy 0.8730 (0.0067)
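The two simpler strategies can be written in pandas roughly as follows (a sketch assuming `X` is the training feature DataFrame, not the post's actual code):

```python
# Strategy 2: drop every row that contains a missing value
# (the corresponding rows of the label vector must be dropped as well)
X_dropped = X.dropna()

# Strategy 3: fill each column's missing values with that column's mode
X_filled = X.fillna(X.mode().iloc[0])
```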

Here I just record how to fill in missing values with a RandomForest (using occupation as an example):
ref: http://www.cnblogs.com/north-north/p/4353365.html
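A minimal sketch of the idea, assuming the predictor columns are already numeric (e.g. label-encoded as in the next section); the function name and parameters are mine, not from the referenced post:

```python
from sklearn.ensemble import RandomForestClassifier

def rf_impute(df, target_col):
    """Fill missing values of target_col by predicting them from the other columns."""
    known = df[df[target_col].notnull()]
    unknown = df[df[target_col].isnull()]
    if unknown.empty:
        return df

    # Use only columns with no missing values of their own as predictors
    features = [c for c in df.columns
                if c != target_col and df[c].notnull().all()]

    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(known[features], known[target_col])
    df.loc[df[target_col].isnull(), target_col] = rf.predict(unknown[features])
    return df

# e.g. X = rf_impute(X, 'occupation')
```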

Label Encoding: 
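A minimal sketch of label-encoding the nominal attributes with scikit-learn, assuming `X` is the feature DataFrame and using the column names from the data description (the actual CSV headers may differ slightly):

```python
from sklearn.preprocessing import LabelEncoder

nominal_cols = ['work-class', 'education', 'marital-status', 'occupation',
                'relationship', 'race', 'sex', 'native-country']

for col in nominal_cols:
    # astype(str) lets any remaining NaN values become their own category
    X[col] = LabelEncoder().fit_transform(X[col].astype(str))
```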

After handling the missing values, I checked whether any attributes could be processed further (merged / dropped / added).
It turns out fnlwgt has essentially no relationship with the prediction target,
so I dropped this column.

Then I subtracted capital-loss from capital-gain to get capital-net-gain, and replaced the original two columns with this net gain.
// In hindsight this may be questionable: even with a negative net gain, someone who has money to invest is quite likely to be a high earner. After building the model I redid this step, and it made no real difference to the result.
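Both tweaks in pandas (a sketch, assuming the column names from the data description):

```python
# fnlwgt shows essentially no relationship with the label, so drop it
X = X.drop(columns=['fnlwgt'])

# Replace capital-gain / capital-loss with a single net-gain feature
X['capital-net-gain'] = X['capital-gain'] - X['capital-loss']
X = X.drop(columns=['capital-gain', 'capital-loss'])
```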

Dummy Matrix Transformation


The dummy matrix transformation was also covered in my previous post.
Using it here improves the accuracy by roughly one percentage point: 84 -> 85.
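A sketch with pandas' get_dummies, reusing the nominal_cols list from the label-encoding sketch above; concatenating train and test first keeps the dummy columns aligned between the two sets (the variable names are mine):

```python
import pandas as pd

# X_train / X_test are the processed train and test feature frames
all_X = pd.concat([X_train, X_test], keys=['train', 'test'])
all_X = pd.get_dummies(all_X, columns=nominal_cols)

X_train = all_X.loc['train']
X_test = all_X.loc['test']
```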

Standardize & PCA

Standardize the data and reduce its dimensionality with principal component analysis.
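A sketch of this step with scikit-learn; the post does not say how many components were kept, so the 95%-variance setting below is just an assumption:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Fit on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

pca = PCA(n_components=0.95)   # keep enough components to explain 95% of the variance
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
```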

Base Model

Here I tried a few basic classification algorithms:

  • Stochastic Gradient Descent
  • Decision Tree
  • Random Forest
  • KNN
  • AdaBoost
  • XGBoost (I don't know its implementation well, so I wasn't sure how to set the parameters and just used the defaults)
  • LightGBM

Among these base algorithms LightGBM wins, and it also runs remarkably fast.
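A sketch of how such a comparison can be run with cross-validation; the fold count and all hyperparameters here are my assumptions, not the settings actually used:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

models = {
    'SGD': SGDClassifier(),
    'DecisionTree': DecisionTreeClassifier(),
    'RandomForest': RandomForestClassifier(n_estimators=100),
    'KNN': KNeighborsClassifier(),
    'AdaBoost': AdaBoostClassifier(),
    'XGBoost': XGBClassifier(),    # default parameters
    'LightGBM': LGBMClassifier(),  # default parameters
}

# Report mean accuracy and its spread across the folds for each model
for name, model in models.items():
    scores = cross_val_score(model, X_train_pca, y, cv=5, scoring='accuracy')
    print('%s: %.4f (%.4f)' % (name, scores.mean(), scores.std()))
```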


-End-

Comments

  1. In the first step I noticed that the label distribution is imbalanced: in the sample, the ratio of people earning <=50K to people earning >50K is 26:8.
    There are a few common ways to deal with this (see the sketch below):
    1. Oversampling (synthesize more samples of the minority class, e.g. with the SMOTE algorithm)
    2. Undersampling (drop samples from the majority class when training the model)
    3. Adjust the model's misclassification penalty (give the minority class a higher penalty weight)
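    A sketch of options 1 and 3, assuming the imbalanced-learn package for SMOTE; none of this code is from the original post:

```python
from imblearn.over_sampling import SMOTE
from lightgbm import LGBMClassifier

# Option 1: oversample the minority class with SMOTE
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train_pca, y)

# Option 3: weight the minority class more heavily in the model instead
clf = LGBMClassifier(class_weight='balanced')
```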

  2. Dropping the fnlwgt column is not really justified. According to the attribute description, it is the number of people the census takers estimate a record of this kind represents, even though correlation analysis shows no direct relationship with the label (which is easy to understand).

    Perhaps fnlwgt should instead be reflected as a per-sample weight when computing the loss! (A sketch of the idea follows.)
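    Most scikit-learn-style classifiers accept per-sample weights at fit time; a tiny sketch of the idea, where fnlwgt_train is a hypothetical variable holding the fnlwgt column saved before it was dropped:

```python
from lightgbm import LGBMClassifier

# Weight each training example by its fnlwgt value (fnlwgt_train is hypothetical)
clf = LGBMClassifier()
clf.fit(X_train_pca, y, sample_weight=fnlwgt_train)
```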

  3. Tip:
    The number reported for each model is the mean and variance of its accuracy under k-fold cross-validation.
    accuracy = (TP + TN) / (TP + FP + TN + FN)

