Statistical Learning for Data Processing (scikit-learn Tutorial)

Published: 2020-12-24 15:05:33  Category: Big Data  Source: compiled from the web

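Introduction: Data Mining: Getting Started and Practice (WeChat public account: datadw). Scikit-learn is a Python module that works closely with the Python scientific computing libraries (NumPy, SciPy, matplotlib) and brings together classic machine learning algorithms. 1. Statistical learning: the setting and the estimator object in scikit-learn. (1) Datasets: scikit-learn, from 2D arrays, …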

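Exercise:
On the digits dataset, plot the cross-validation score of an SVC with a linear kernel as a function of the parameter C (use a logarithmic grid of points, from 1 to 10).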

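import numpy as np
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn import model_selection, datasets, svm
digits = datasets.load_digits()
X = digits.data
y = digits.target
svc = svm.SVC(kernel='linear')
# Candidate values of C on a logarithmic grid
C_s = np.logspace(-10, 10)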

Full code:
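The full solution did not survive on this page, so the following is only one possible sketch of the exercise. It assumes a grid of 10 values of C between 1e-10 and 1 and 3-fold cross-validation. Both choices are assumptions, not taken from the text above. It prints the scores instead of plotting them.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

X, y = datasets.load_digits(return_X_y=True)
svc = svm.SVC(kernel="linear")
C_s = np.logspace(-10, 0, 10)  # assumed grid: 10 values from 1e-10 to 1

scores_mean, scores_std = [], []
for C in C_s:
    svc.C = C  # cross_val_score clones the estimator, so C is picked up
    fold_scores = cross_val_score(svc, X, y, cv=3)  # cv=3 is an assumption
    scores_mean.append(fold_scores.mean())
    scores_std.append(fold_scores.std())

for C, mean, std in zip(C_s, scores_mean, scores_std):
    print(f"C={C:.1e}  mean score={mean:.3f}  std={std:.3f}")
```

To draw the curve on a logarithmic axis, the two lists can be passed to matplotlib, for example with `plt.semilogx(C_s, scores_mean)`.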

(3) Grid search and cross-validated estimators

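Grid search:
scikit-learn provides an object that, given data, computes the score while fitting an estimator over a parameter grid and picks the parameters that maximize the cross-validation score. The object's constructor takes an estimator as an argument: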

# sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV now lives in model_selection
from sklearn.model_selection import GridSearchCV
X_digits, y_digits = digits.data, digits.target
Cs = np.logspace(-6, 10)
clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs), n_jobs=-1)
clf.fit(X_digits[:1000], y_digits[:1000])
clf.best_score_
clf.best_estimator_.C
# Prediction performance on test set is not as good as on train set
clf.score(X_digits[1000:], y_digits[1000:])

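By default, GridSearchCV uses 3-fold cross-validation (5-fold in scikit-learn 0.22 and later). However, if it detects that a classifier is passed rather than a regressor, it uses stratified folds.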
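Because the default number of folds has differed between scikit-learn releases, it can help to pass the splitter explicitly. The sketch below assumes a grid of 10 values of C between 1e-6 and 1e-1 (an illustrative choice) and a `StratifiedKFold` with 3 splits.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X_digits, y_digits = datasets.load_digits(return_X_y=True)
Cs = np.logspace(-6, -1, 10)      # assumed grid, for illustration only
cv = StratifiedKFold(n_splits=3)  # explicit stratified 3-fold splitter

clf = GridSearchCV(svm.SVC(kernel="linear"), param_grid={"C": Cs}, cv=cv)
clf.fit(X_digits[:1000], y_digits[:1000])

print("best C:", clf.best_params_["C"])
print("best cross-validation score:", clf.best_score_)
print("score on held-out samples:", clf.score(X_digits[1000:], y_digits[1000:]))
```

The per-candidate scores are available in `clf.cv_results_["mean_test_score"]`, one entry per value in the grid.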
Nested cross-validation

from sklearn.model_selection import cross_val_score
cross_val_score(clf, X_digits, y_digits)

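Two cross-validation loops run here, one nested inside the other: the GridSearchCV estimator sets the parameter C, and cross_val_score measures the prediction performance of the resulting estimator. The resulting scores are unbiased estimates of the prediction score on new data.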

[Warning] You cannot nest objects with parallel computing (n_jobs different from 1).
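A self-contained sketch of the nested loops is given below. To keep the run short it assumes a subset of 600 samples and a grid of only 5 values of C. Both are illustrative assumptions. `n_jobs` is left at its default so that no parallel objects are nested.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, cross_val_score

X_digits, y_digits = datasets.load_digits(return_X_y=True)
X_small, y_small = X_digits[:600], y_digits[:600]  # assumed subset, for speed

# Inner loop: choose C by 3-fold cross-validation on each outer training split
inner = GridSearchCV(
    svm.SVC(kernel="linear"),
    param_grid={"C": np.logspace(-6, -1, 5)},  # assumed grid
    cv=3,
)

# Outer loop: score the tuned estimator on data the inner search never saw
outer_scores = cross_val_score(inner, X_small, y_small, cv=3)
print(outer_scores, outer_scores.mean())
```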

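Cross-validated estimators:
Cross-validation to set a parameter can be done more efficiently on an algorithm-by-algorithm basis. This is why, for certain estimators, scikit-learn provides cross-validated variants (see "Cross-validation: evaluating estimator performance") that set their parameter automatically by cross-validation.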

from sklearn import linear_model, datasets
lasso = linear_model.LassoCV()
diabetes = datasets.load_diabetes()
X_diabetes = diabetes.data
y_diabetes = diabetes.target
lasso.fit(X_diabetes, y_diabetes)
# The estimator chose automatically its lambda:
lasso.alpha_

These estimators are named like their counterparts, with 'CV' appended to the name.
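The naming pattern can be checked on a few linear models. The sketch below pairs `Ridge`, `Lasso` and `ElasticNet` with their 'CV' variants and prints the regularization strength each variant selects on the diabetes data. The list of alphas passed to `RidgeCV` is an arbitrary illustrative choice.

```python
from sklearn import datasets, linear_model

X, y = datasets.load_diabetes(return_X_y=True)

# (plain estimator, cross-validated counterpart)
pairs = [
    (linear_model.Ridge(), linear_model.RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0])),
    (linear_model.Lasso(), linear_model.LassoCV()),
    (linear_model.ElasticNet(), linear_model.ElasticNetCV()),
]

chosen = {}
for plain, cv_model in pairs:
    cv_model.fit(X, y)  # the CV variant selects alpha internally
    chosen[type(cv_model).__name__] = cv_model.alpha_
    print(type(plain).__name__, "->", type(cv_model).__name__, "alpha_ =", cv_model.alpha_)
```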

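Exercise:
On the diabetes dataset, find the optimal regularization parameter alpha.

Bonus: how much can you trust the selected value of alpha?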

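import numpy as np
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn import datasets, linear_model, model_selection
diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
lasso = linear_model.Lasso()
alphas = np.logspace(-4, -.5, 30)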

Full code:
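The solution code is missing from this page. One possible approach, offered as a sketch and not as the author's solution, is to grid-search alpha for `Lasso` and then, for the bonus question, refit `LassoCV` on different subsets of the data and compare the alphas it picks. The 5-fold and 3-fold settings and `max_iter=10000` are assumptions.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import GridSearchCV, KFold

X, y = datasets.load_diabetes(return_X_y=True)
X, y = X[:150], y[:150]
alphas = np.logspace(-4, -0.5, 30)

# Part 1: pick alpha by cross-validated grid search (cv=5 is an assumption)
search = GridSearchCV(Lasso(max_iter=10000), {"alpha": alphas}, cv=5)
search.fit(X, y)
best_alpha = search.best_params_["alpha"]
print("alpha selected on all 150 samples:", best_alpha)

# Bonus: let LassoCV choose alpha on each training split and compare
lasso_cv = LassoCV(alphas=alphas, max_iter=10000)
fold_alphas = []
for train_idx, test_idx in KFold(n_splits=3).split(X, y):
    lasso_cv.fit(X[train_idx], y[train_idx])
    fold_alphas.append(lasso_cv.alpha_)
    print("alpha:", lasso_cv.alpha_, "held-out R^2:", lasso_cv.score(X[test_idx], y[test_idx]))
```

If the alphas selected on the different splits differ widely, the selected value should not be trusted too much, even when the held-out scores are similar.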

4. Unsupervised learning: seeking representations of the data

(1) Clustering: grouping observations together

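The problem solved in clustering:
Take the iris dataset: if we knew that there were three types of iris but had no labels for them, we could try a clustering task, that is, split the observations into well-separated groups called clusters.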

K-means clustering
Note that many different clustering criteria and associated algorithms exist. The simplest clustering algorithm is K-means.

from sklearn import cluster,datasets
iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(X_iris) 
print(k_means.labels_[::10])
print(y_iris[::10])

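Note: there is absolutely no guarantee of recovering the ground truth. First, choosing the right number of clusters is hard. Second, the algorithm is sensitive to initialization and can fall into local minima, although scikit-learn uses several tricks to mitigate this issue.
[Figures: bad initialization; 8 clusters; ground truth]

Don't over-interpret clustering results.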
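The sensitivity to initialization can be observed through the `inertia_` attribute (the within-cluster sum of squared distances). The sketch below runs K-means on iris with a single random initialization for several seeds, then once with 10 initializations. The seeds and `init='random'` are illustrative assumptions.

```python
from sklearn import cluster, datasets

X_iris = datasets.load_iris().data

# One random initialization per run: results may differ from seed to seed
single = [
    cluster.KMeans(n_clusters=3, n_init=1, init="random", random_state=seed)
    .fit(X_iris)
    .inertia_
    for seed in range(10)
]

# Ten initializations in one run: the best of the ten is kept
multi = (
    cluster.KMeans(n_clusters=3, n_init=10, init="random", random_state=0)
    .fit(X_iris)
    .inertia_
)

print("single-init inertias:", [round(v, 2) for v in single])
print("best of 10 inits:", round(multi, 2))
```

A run that ends with a noticeably higher inertia than the others has stopped in a poorer local minimum.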

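Application example: vector quantization
Clustering in general, and K-means in particular, can be seen as a way of choosing a small number of exemplars to compress the information. This problem is sometimes known as vector quantization. For instance, it can be used to posterize an image: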

import numpy as np
from sklearn import cluster
# lena() was removed from SciPy; face(gray=True) stands in for it here
try:
    from scipy.datasets import face  # SciPy >= 1.10
except ImportError:
    from scipy.misc import face  # older SciPy releases
img = face(gray=True)
X = img.reshape((-1, 1))  # We need an (n_sample, n_feature) array
k_means = cluster.KMeans(n_clusters=5, n_init=1)
k_means.fit(X)
values = k_means.cluster_centers_.squeeze()
labels = k_means.labels_
img_compressed = np.choose(labels, values)
img_compressed.shape = img.shape

[Figures: original image; K-means quantization; equal bins; image histogram]
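The quantization step can be tried without downloading an image. The sketch below uses a small random array as a hypothetical stand-in for a grayscale image and replaces each pixel with the center of its cluster, so that the result holds at most 5 distinct gray levels.

```python
import numpy as np
from sklearn import cluster

rng = np.random.RandomState(0)
# Hypothetical 32x32 "image" with gray levels in [0, 255]
image = rng.randint(0, 256, size=(32, 32)).astype(float)

X = image.reshape(-1, 1)  # (n_samples, n_features) with one feature per pixel
k_means = cluster.KMeans(n_clusters=5, n_init=1, random_state=0).fit(X)

values = k_means.cluster_centers_.squeeze()  # the 5 representative gray levels
labels = k_means.labels_                     # index of the level for each pixel
compressed = values[labels].reshape(image.shape)

print("distinct levels before:", np.unique(image).size)
print("distinct levels after:", np.unique(compressed).size)
```

Indexing with `values[labels]` does the same lookup as `np.choose(labels, values)` in the code above.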

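Hierarchical agglomerative clustering: Ward
A hierarchical clustering method is a type of cluster analysis that aims to build a hierarchy of clusters. In general, it is implemented in one of two ways:

Agglomerative: bottom-up approaches. Each observation starts in its own cluster, and clusters are merged iteratively so as to minimize a linkage criterion. This approach is particularly interesting when the clusters of interest contain only a few observations. When the number of clusters is large, it is much more computationally efficient than K-means.
Divisive: top-down approaches. All observations start in one cluster, which is split iteratively as one moves down the hierarchy. For estimating a large number of clusters, this approach is both slow (all observations start as one cluster, which is split recursively) and statistically ill-posed.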
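As a minimal sketch of the agglomerative (bottom-up) approach, the following applies `AgglomerativeClustering` with Ward linkage to the iris data. The choice of 3 clusters mirrors the K-means example and is an assumption here.

```python
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering

X_iris = datasets.load_iris().data

# Ward linkage merges, at each step, the pair of clusters whose merge
# increases the within-cluster variance the least
ward = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X_iris)

print(ward.labels_[::10])
```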

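Connectivity-constrained clustering
With agglomerative clustering, a connectivity graph can be given to specify which samples may be clustered together. Graphs in scikit-learn are represented by their adjacency matrix, which is usually a sparse matrix. This is useful, for instance, to retrieve connected regions (sometimes also called connected components) when clustering an image: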

import time
import numpy as np
from sklearn.feature_extraction.image import grid_to_graph
from sklearn.cluster import AgglomerativeClustering
###############################################################################
# Generate data
# lena() was removed from SciPy; face(gray=True) stands in for it here
try:
    from scipy.datasets import face  # SciPy >= 1.10
except ImportError:
    from scipy.misc import face  # older SciPy releases
img = face(gray=True).astype(np.float64)  # float avoids uint8 overflow in the sum below
# Downsample the image by a factor of 4
img = img[::2, ::2] + img[1::2, ::2] + img[::2, 1::2] + img[1::2, 1::2]
X = np.reshape(img, (-1, 1))
###############################################################################
# Define the structure A of the data. Pixels connected to their neighbors.
connectivity = grid_to_graph(*img.shape)
###############################################################################
# Compute clustering
print("Compute structured hierarchical clustering...")
st = time.time()
n_clusters = 15  # number of regions
ward = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward', connectivity=connectivity).fit(X)
label = np.reshape(ward.labels_, img.shape)
print("Elapsed time: ", time.time() - st)
print("Number of pixels: ", label.size)
print("Number of clusters: ", np.unique(label).size)
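To see what the connectivity graph looks like, it helps to build one for a tiny grid. The sketch below calls `grid_to_graph` on a hypothetical 2x3 image and converts the sparse result to a dense array. Pixels are numbered row by row, so pixel 0 is adjacent to pixel 1 (to its right) and pixel 3 (below it), but not to pixel 4 (diagonal).

```python
from sklearn.feature_extraction.image import grid_to_graph

# Adjacency matrix for a 2-row, 3-column grid of pixels (6 pixels in total)
connectivity = grid_to_graph(2, 3)
adjacency = connectivity.toarray()

print(connectivity.shape)
print(adjacency)
```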

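Feature agglomeration:
We have seen that sparsity can be used to mitigate the curse of dimensionality, i.e. an insufficient number of observations compared to the number of features. Another approach is to merge similar features together: feature agglomeration. This is done by clustering in the feature direction, in other words by clustering the transposed data.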

import numpy as np
from sklearn import cluster, datasets
from sklearn.feature_extraction.image import grid_to_graph
digits = datasets.load_digits()
images = digits.images
X = np.reshape(images, (len(images), -1))
connectivity = grid_to_graph(*images[0].shape)
agglo = cluster.FeatureAgglomeration(connectivity=connectivity, n_clusters=32)
agglo.fit(X)
X_reduced = agglo.transform(X)
X_approx = agglo.inverse_transform(X_reduced)
images_approx = np.reshape(X_approx, images.shape)

transform and inverse_transform methods
Some estimators expose a transform method, for instance to reduce the dimensionality of the dataset.
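The effect of the two methods is easiest to see from the array shapes. In the sketch below, `transform` pools the 64 pixel features of the digits data into 32 merged features, and `inverse_transform` maps them back to 64 features, each pixel taking the value of the merged feature it belongs to.

```python
from sklearn import cluster, datasets
from sklearn.feature_extraction.image import grid_to_graph

images = datasets.load_digits().images
X = images.reshape(len(images), -1)  # one row per image, 64 pixel features

agglo = cluster.FeatureAgglomeration(
    connectivity=grid_to_graph(*images[0].shape), n_clusters=32
)
X_reduced = agglo.fit_transform(X)              # 64 features -> 32 features
X_approx = agglo.inverse_transform(X_reduced)   # back to 64 features

print(X.shape, X_reduced.shape, X_approx.shape)
```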

(2) Decompositions: from a signal to components and loadings

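Components and loadings:
If X is our multivariate data, the problem we are trying to solve is to rewrite it on a different observational basis: we want to learn loadings L and a set of components C such that X = LC. Different criteria exist for choosing the components.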
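As a concrete instance of X = LC, the sketch below uses PCA, where the components C are the rows of `components_` and the loadings L are the output of `transform`. PCA centers the data first, so the mean is added back when reconstructing. The random data is only an illustration.

```python
import numpy as np
from sklearn import decomposition

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))  # illustrative data: 100 observations, 3 features

pca = decomposition.PCA().fit(X)  # keep all 3 components
C = pca.components_               # components, shape (3, 3)
L = pca.transform(X)              # loadings, shape (100, 3)

# With all components kept, L @ C recovers the centered data exactly
reconstruction = L @ C + pca.mean_
print(np.allclose(reconstruction, X))
```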

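Principal component analysis: PCA
Principal component analysis (PCA) selects the successive components that explain the maximum variance in the signal.

In the example that follows, the point cloud spanned by the observations is very flat in one direction: one of the three univariate features can be computed exactly from the other two. PCA finds the directions in which the data is not flat.

When used to transform data, PCA can reduce the dimensionality of the data by projecting it onto a principal subspace: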

import numpy as np
# Create a signal with only 2 useful dimensions
x1 = np.random.normal(size=100)
x2 = np.random.normal(size=100)
x3 = x1 + x2
X = np.c_[x1, x2, x3]
from sklearn import decomposition
pca = decomposition.PCA()
pca.fit(X)
print(pca.explained_variance_)
# As we can see, only the 2 first components are useful
pca.n_components = 2
X_reduced = pca.fit_transform(X)
X_reduced.shape
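Because the third feature is an exact sum of the first two, projecting onto two components should lose nothing. The sketch below checks this by mapping the reduced data back with `inverse_transform` and comparing it with the original. The fixed random seed is an assumption added for repeatability.

```python
import numpy as np
from sklearn import decomposition

rng = np.random.RandomState(0)  # fixed seed, for repeatability
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
X = np.c_[x1, x2, x1 + x2]      # third column carries no new information

pca = decomposition.PCA(n_components=2)
X_reduced = pca.fit_transform(X)
X_back = pca.inverse_transform(X_reduced)

print("variance explained by 2 components:", pca.explained_variance_ratio_.sum())
print("largest reconstruction error:", np.abs(X - X_back).max())
```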

(Editor: 西安站长网)
