
[CNN & Unsupervised Learning] SK infosec Cloud AI Expert Training Course Practice

gogoriver 2020. 9. 8. 20:18

 

 

 

In [1]:
%matplotlib inline
import mglearn
 
(Output: a series of Matplotlib 3.3 deprecation warnings triggered by mglearn loading _classic_test.mplstyle, covering text.latex.preview, mathtext.fallback_to_cm, savefig.jpeg_quality, keymap.all_axes, and animation.avconv_path/args; they are harmless here.)
In [2]:
mglearn.plots.plot_scaling()
 
 
  1. MinMaxScaler
    • Rescales each feature so both x and y end up in the 0~1 range.
  2. StandardScaler
    • Standardizes each feature to zero mean and unit variance.
  3. RobustScaler
    • Uses the median and quartiles: (x - median) / (Q3 - Q1).
    • It is computed from the quartiles of the values above and below a given point, so in a way it resembles StandardScaler, but it is much less sensitive to outliers.
  4. Normalizer
    • Rescales each data point (a row, i.e. the features of one sample) to unit length, so only the direction of its feature vector matters.

A small comparison of the four scalers is sketched below.
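To make the differences concrete, here is a minimal sketch (my addition, not part of the lecture notebook) that runs all four scalers on the same tiny matrix; note how much less RobustScaler is distorted by the outlier than MinMaxScaler.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, Normalizer

X = np.array([[1.0, -2.0],
              [2.0,  0.0],
              [3.0,  2.0],
              [100.0, 4.0]])  # the 100 acts as an outlier in the first column

for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler(), Normalizer()):
    print(type(scaler).__name__)
    print(scaler.fit_transform(X))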
In [3]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
cancer = load_breast_cancer()

x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=1)

print(x_train.shape)
print(x_test.shape)
 
(426, 30)
(143, 30)
In [4]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(x_train)
Out[4]:
MinMaxScaler(copy=True, feature_range=(0, 1))
In [5]:
x_train_scaled = scaler.transform(x_train) # transform applies the learned scaling to the data
# the combined fit_transform() function can also be used
print("Shape after transformation : ", x_train_scaled.shape)
print("Per-feature minimum before scaling : ", x_train.min(axis =0))
print("Per-feature maximum before scaling : ", x_train.max(axis =0))
print("Per-feature minimum after scaling : ", x_train_scaled.min(axis =0))
print("Per-feature maximum after scaling : ", x_train_scaled.max(axis =0))
 
Shape after transformation :  (426, 30)
Per-feature minimum before scaling :  [6.981e+00 9.710e+00 4.379e+01 1.435e+02 5.263e-02 1.938e-02 0.000e+00
 0.000e+00 1.060e-01 5.024e-02 1.153e-01 3.602e-01 7.570e-01 6.802e+00
 1.713e-03 2.252e-03 0.000e+00 0.000e+00 9.539e-03 8.948e-04 7.930e+00
 1.202e+01 5.041e+01 1.852e+02 7.117e-02 2.729e-02 0.000e+00 0.000e+00
 1.566e-01 5.521e-02]
Per-feature maximum before scaling :  [2.811e+01 3.928e+01 1.885e+02 2.501e+03 1.634e-01 2.867e-01 4.268e-01
 2.012e-01 3.040e-01 9.575e-02 2.873e+00 4.885e+00 2.198e+01 5.422e+02
 3.113e-02 1.354e-01 3.960e-01 5.279e-02 6.146e-02 2.984e-02 3.604e+01
 4.954e+01 2.512e+02 4.254e+03 2.226e-01 9.379e-01 1.170e+00 2.910e-01
 5.774e-01 1.486e-01]
Per-feature minimum after scaling :  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0.]
Per-feature maximum after scaling :  [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]
In [6]:
x_test_scaled = scaler.transform(x_test) # note: transform, not fit_transform(), on the test set

print("Per-feature minimum after scaling : ", x_test_scaled.min(axis =0))
print("Per-feature maximum after scaling : ", x_test_scaled.max(axis =0))
 
Per-feature minimum after scaling :  [ 0.0336031   0.0226581   0.03144219  0.01141039  0.14128374  0.04406704
  0.          0.          0.1540404  -0.00615249 -0.00137796  0.00594501
  0.00430665  0.00079567  0.03919502  0.0112206   0.          0.
 -0.03191387  0.00664013  0.02660975  0.05810235  0.02031974  0.00943767
  0.1094235   0.02637792  0.          0.         -0.00023764 -0.00182032]
Per-feature maximum after scaling :  [0.9578778  0.81501522 0.95577362 0.89353128 0.81132075 1.21958701
 0.87956888 0.9333996  0.93232323 1.0371347  0.42669616 0.49765736
 0.44117231 0.28371044 0.48703131 0.73863671 0.76717172 0.62928585
 1.33685792 0.39057253 0.89612238 0.79317697 0.84859804 0.74488793
 0.9154725  1.13188961 1.07008547 0.92371134 1.20532319 1.63068851]
 
  • train : min 0, max 1 / test : values such as -0.xxx and 1.xxxx
  • This is not because the transformation formula is broken; it follows directly from how the formula is defined.

  • (x - x_min) / (x_max - x_min) = the basic formula

  • Here x is a test value, but x_max and x_min are the statistics learned from the training data, so scaled test features can fall slightly outside the 0~1 range.
  • That is intentional: the test set must be transformed with exactly the same statistics as the training set. Fitting a separate scaler on the test set is the real mistake, as the three-panel plot below demonstrates.

  • Textbook, p. 181 (a tiny numeric illustration follows)
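A one-line sketch (my addition) of why a test value can map outside [0, 1]: it only has to be larger than anything seen during training.

x_train_min, x_train_max = 10.0, 20.0  # statistics learned from the training data
x_test_value = 22.0                    # larger than any training value

print((x_test_value - x_train_min) / (x_train_max - x_train_min))  # 1.2, outside [0, 1] as expected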

 

QuantileTransformer

  • A preprocessing module that adjusts scale by mapping each feature onto its quantiles (a uniform distribution by default).
In [7]:
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import make_blobs
from sklearn.preprocessing import QuantileTransformer, StandardScaler, PowerTransformer
In [8]:
x, y = make_blobs(n_samples=50, centers=2, random_state=4, cluster_std=1)

print(x.shape)
 
(50, 2)
In [9]:
plt.scatter(x[:,0],x[:,1],c=y, s=30, edgecolors='black')
plt.xlim(0,16)
plt.ylim(0,10)
plt.xlabel('x0')
plt.ylabel('x1')
plt.title("Original Data")
plt.show()
 
In [10]:
scaler = QuantileTransformer() # n_quantiles defaults to 1000
x_trans = scaler.fit_transform(x)

plt.scatter(x_trans[:,0],x_trans[:,1],c=y, s=30, edgecolors='black')
plt.xlim(0,5) # with the original axis ranges the points would be completely bunched up, so adjust them
plt.ylim(0,5)
plt.xlabel('x0')
plt.ylabel('x1')
plt.title("QuantileTransformer") # the scaled data
# plt.title(type(scaler).__name__) # this works too
plt.show()
 
C:\Users\ka030\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py:2239: UserWarning: n_quantiles (1000) is greater than the total number of samples (50). n_quantiles is set to n_samples.
  % (self.n_quantiles, n_samples))
 
 
  • Unlike the original data above, the transformed points are now distributed over values between 0 and 1.
 

What if there are 1,000 data points?

In [11]:
x, y = make_blobs(n_samples=1000, centers=2, random_state=4, cluster_std=1)

print(x.shape)

plt.scatter(x[:,0],x[:,1],c=y, s=30, edgecolors='black')
plt.xlim(0,16)
plt.ylim(0,10)
plt.xlabel('x0')
plt.ylabel('x1')
plt.title("Original Data")
plt.show()
 
(1000, 2)
 
In [12]:
scaler = QuantileTransformer() # n_quantiles defaults to 1000
x_trans = scaler.fit_transform(x)

plt.scatter(x_trans[:,0],x_trans[:,1],c=y, s=30, edgecolors='black')
plt.xlim(0,5) # with the original axis ranges the points would be completely bunched up, so adjust them
plt.ylim(0,5)
plt.xlabel('x0')
plt.ylabel('x1')
plt.title("QuantileTransformer") # the scaled data
# plt.title(type(scaler).__name__) # this works too
plt.show()
 
 

Normal distribution

  • Mapping the uniformly distributed values onto a normal distribution instead.
In [13]:
scaler = QuantileTransformer(output_distribution='normal') # n_quantiles defaults to 1000
x_trans = scaler.fit_transform(x)

print(x_trans.shape)

plt.scatter(x_trans[:,0],x_trans[:,1],c=y, s=30, edgecolors='black') # plot the transformed data, not x
plt.xlim(-5,5)
plt.ylim(-5,5)
plt.xlabel('x0')
plt.ylabel('x1')
plt.title(type(scaler).__name__)
plt.show()
 
(1000, 2)
 
In [14]:
# from sklearn.datasets import make_blobs (already imported above)
# generate a synthetic dataset
x, _ = make_blobs(n_samples=50, centers=2, random_state=4, cluster_std=2)
x_train, x_test = train_test_split(x, random_state=1, test_size=0.1)

fig, axes = plt.subplots(1,3,figsize=(13,4))
axes[0].scatter(x_train[:,0], x_train[:,1], c=mglearn.cm2.colors[0], label="training set", s=60)
axes[0].scatter(x_test[:,0], x_test[:,1], c=mglearn.cm2.colors[1], label="test set", s=60)
axes[0].legend(loc="upper left")
axes[0].set_title("Original data")


# scale correctly: fit on the training set only, then transform both sets
scaler = MinMaxScaler()
scaler.fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

axes[1].scatter(x_train_scaled[:,0], x_train_scaled[:,1],
                c=mglearn.cm2.colors[0], label="training set", s=60)
axes[1].scatter(x_test_scaled[:,0], x_test_scaled[:,1], marker = "^",
                c=mglearn.cm2.colors[1], label="test set", s=60)
axes[1].set_title("Scaled data")

# fitting a separate scaler on the test set (don't do this!)
test_scaler = MinMaxScaler()
test_scaler.fit(x_test)
x_test_scaled_badly = test_scaler.transform(x_test)

axes[2].scatter(x_train_scaled[:,0], x_train_scaled[:,1],
                c=mglearn.cm2.colors[0], label="training set", s=60)
axes[2].scatter(x_test_scaled_badly[:,0], x_test_scaled_badly[:,1],
                marker = '^',c=mglearn.cm2.colors[1], label="test set", s=60)
axes[2].set_title("wrong scaled data")
Out[14]:
Text(0.5, 1.0, 'wrong scaled data')
 
 

Trying it on the cancer data

In [15]:
print(cancer)
 
{'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       ...,
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]]),
 'target': array([0, 0, 0, ..., 0, 0, 1]),
 'target_names': array(['malignant', 'benign'], dtype='<U9'),
 'DESCR': '.. _breast_cancer_dataset: ... (full description of the Breast Cancer Wisconsin (Diagnostic) dataset: 569 instances, 30 numeric predictive features, class distribution 212 malignant / 357 benign) ...',
 'feature_names': array(['mean radius', 'mean texture', ..., 'worst fractal dimension'], dtype='<U23'),
 'filename': 'C:\\Users\\ka030\\Anaconda3\\lib\\site-packages\\sklearn\\datasets\\data\\breast_cancer.csv'}
(output abridged for readability)
In [16]:
x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
In [17]:
from sklearn.svm import SVC

svm = SVC(C=100)
svm.fit(x_train, y_train)
print("Test set accuracy : {:.6f}".format(svm.score(x_test, y_test)))
 
Test set accuracy : 0.629371
 
C:\Users\ka030\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
In [18]:
# MinMaxScaler -> rescale to the 0~1 range

scaler = MinMaxScaler()
scaler.fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

svm.fit(x_train_scaled,y_train)
print("Scaled test set accuracy : {:.6f}".format(svm.score(x_test_scaled, y_test)))
 
Scaled test set accuracy : 0.965035
 
C:\Users\ka030\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
In [19]:
# scale to mean 0 and variance 1 -> (x - mean) / standard deviation
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

svm.fit(x_train_scaled,y_train)
print("SVM(StandardScaler) test set accuracy : {:.6f}".format(svm.score(x_test_scaled, y_test)))
 
SVM(StandardScaler) test set accuracy : 0.958042
 

A quick aside: what is principal component analysis (PCA)?

  • A dimensionality-reduction method that rotates the data and then drops the directions that carry little information.
  • We will use it now, while studying unsupervised learning; a from-scratch sketch of the idea follows.
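A from-scratch sketch (my addition, following the usual definition of PCA) of what the "rotation" means: center the data, find the directions of maximal variance with an SVD, and project onto the first k of them.

import numpy as np

def pca_project(X, k):
    X_centered = X - X.mean(axis=0)                # PCA works on centered data
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:k].T                   # rotate, then keep only k directions

rng = np.random.RandomState(0)
print(pca_project(rng.randn(200, 5), 2).shape)     # (200, 2)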
In [20]:
import mglearn
import numpy as np
mglearn.plots.plot_pca_illustration()
plt.show()
 
In [21]:
x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=1)
fig, axes = plt.subplots(15, 2, figsize=(10,20)) # remember: figsize is in inches!
malignant = cancer.data[cancer.target==0]
benign = cancer.data[cancer.target==1]

ax = axes.ravel() # ravel flattens the 15x2 grid of axes into a 1-D array

for i in range(30):
    _, bins = np.histogram(cancer.data[:,i], bins=50) # compute the histogram bin edges with numpy
    ax[i].hist(malignant[:,i], bins=bins, color = mglearn.cm3(0), alpha=0.5)
    ax[i].hist(benign[:,i], bins=bins, color = mglearn.cm2(2), alpha=0.5)
    ax[i].set_title(cancer.feature_names[i])
    ax[i].set_yticks(())
ax[0].set_xlabel("Feature size")
ax[0].set_ylabel("Frequency")
ax[0].legend(['Malignant','Benign'], loc="best")
fig.tight_layout()
 
In [22]:
print(cancer.data.shape)
 
(569, 30)
In [23]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(cancer.data)
x_scaled= scaler.transform(cancer.data)
In [24]:
from sklearn.decomposition import PCA
# keep only the first two principal components of the data and discard the rest

pca = PCA(n_components=2)
pca.fit(x_scaled)

# transform the data using those first two principal components
x_pca = pca.transform(x_scaled)
print("Original data shape : ", str(x_scaled.shape))
print("Reduced data shape  : ", str(x_pca.shape)) 
# dimensionality reduction does not change the samples themselves; each sample simply keeps two derived features instead of 30
 
Original data shape :  (569, 30)
Reduced data shape  :  (569, 2)
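To see what was thrown away, a short sketch (my addition) maps the two-component representation back to the original 30 dimensions with PCA's inverse_transform and measures the reconstruction error.

x_back = pca.inverse_transform(x_pca)   # back to shape (569, 30)
print("reconstructed shape :", x_back.shape)
print("mean squared reconstruction error :", np.mean((x_scaled - x_back) ** 2))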
In [25]:
plt.figure(figsize=(8,8))
mglearn.discrete_scatter(x_pca[:,0],x_pca[:,1], cancer.target)
# plt.legend(["malignant","benign"], loc="best")
plt.gca().set_aspect("equal")
plt.xlabel("first principal component")
plt.ylabel("second principal component")

plt.legend(['Malignant','Benign'], loc="best")
Out[25]:
<matplotlib.legend.Legend at 0x1b187ad4e88>
 
In [26]:
# to inspect the shape of the principal components:
print("PCA components shape : ", pca.components_.shape)
 
PCA components shape :  (2, 30)
In [27]:
pca.components_
Out[27]:
array([[ 0.21890244,  0.10372458,  0.22753729,  0.22099499,  0.14258969,
         0.23928535,  0.25840048,  0.26085376,  0.13816696,  0.06436335,
         0.20597878,  0.01742803,  0.21132592,  0.20286964,  0.01453145,
         0.17039345,  0.15358979,  0.1834174 ,  0.04249842,  0.10256832,
         0.22799663,  0.10446933,  0.23663968,  0.22487053,  0.12795256,
         0.21009588,  0.22876753,  0.25088597,  0.12290456,  0.13178394],
       [-0.23385713, -0.05970609, -0.21518136, -0.23107671,  0.18611302,
         0.15189161,  0.06016536, -0.0347675 ,  0.19034877,  0.36657547,
        -0.10555215,  0.08997968, -0.08945723, -0.15229263,  0.20443045,
         0.2327159 ,  0.19720728,  0.13032156,  0.183848  ,  0.28009203,
        -0.21986638, -0.0454673 , -0.19987843, -0.21935186,  0.17230435,
         0.14359317,  0.09796411, -0.00825724,  0.14188335,  0.27533947]])
 

Eigenface feature extraction

In [28]:
from sklearn.datasets import fetch_lfw_people

people= fetch_lfw_people(min_faces_per_person=20, resize=0.7, color = False)
image_shape = people.images[0].shape
In [29]:
print(image_shape)

fig, axes = plt.subplots(3, 5, figsize=(15,8),subplot_kw={"xticks" : (), 'yticks':()})
for target, image, ax in zip(people.target, people.images, axes.ravel()):
    ax.imshow(image)
    ax.set_title(people.target_names[target])
 
(87, 65)
 
In [30]:
print("people.images.shape", people.images.shape)
print("클래스 개수", len(people.target_names))
 
people.images.shape (3023, 87, 65)
number of classes 62
In [31]:
people.target[0:10], people.target_names[people.target[0:10]]
Out[31]:
(array([61, 25,  9,  5,  1, 10, 48, 17, 13, 54], dtype=int64),
 array(['Winona Ryder', 'Jean Chretien', 'Carlos Menem', 'Ariel Sharon',
        'Alvaro Uribe', 'Colin Powell', 'Recep Tayyip Erdogan',
        'Gray Davis', 'George Robertson', 'Silvio Berlusconi'],
       dtype='<U25'))
In [32]:
counts = np.bincount(people.target)
for i, (count, name) in enumerate(zip(counts, people.target_names)):
    print("{0:25}{1:3}".format(name, count), end='   ')
    if (i+1) % 3 ==0:
        print()
 
Alejandro Toledo          39   Alvaro Uribe              35   Amelie Mauresmo           21   
Andre Agassi              36   Angelina Jolie            20   Ariel Sharon              77   
Arnold Schwarzenegger     42   Atal Bihari Vajpayee      24   Bill Clinton              29   
Carlos Menem              21   Colin Powell             236   David Beckham             31   
Donald Rumsfeld          121   George Robertson          22   George W Bush            530   
Gerhard Schroeder        109   Gloria Macapagal Arroyo   44   Gray Davis                26   
Guillermo Coria           30   Hamid Karzai              22   Hans Blix                 39   
Hugo Chavez               71   Igor Ivanov               20   Jack Straw                28   
Jacques Chirac            52   Jean Chretien             55   Jennifer Aniston          21   
Jennifer Capriati         42   Jennifer Lopez            21   Jeremy Greenstock         24   
Jiang Zemin               20   John Ashcroft             53   John Negroponte           31   
Jose Maria Aznar          23   Juan Carlos Ferrero       28   Junichiro Koizumi         60   
Kofi Annan                32   Laura Bush                41   Lindsay Davenport         22   
Lleyton Hewitt            41   Luiz Inacio Lula da Silva 48   Mahmoud Abbas             29   
Megawati Sukarnoputri     33   Michael Bloomberg         20   Naomi Watts               22   
Nestor Kirchner           37   Paul Bremer               20   Pete Sampras              22   
Recep Tayyip Erdogan      30   Ricardo Lagos             27   Roh Moo-hyun              32   
Rudolph Giuliani          26   Saddam Hussein            23   Serena Williams           52   
Silvio Berlusconi         33   Tiger Woods               23   Tom Daschle               25   
Tom Ridge                 33   Tony Blair               144   Vicente Fox               32   
Vladimir Putin            49   Winona Ryder              24   
In [33]:
mask = np.zeros(people.target.shape, dtype=np.bool)
for target in np.unique(people.target):
    mask[np.where(people.target == target)[0][:50]] = 1  # keep at most 50 images per person
    
x_people = people.data[mask]
y_people = people.target[mask]

# scale by dividing by the maximum pixel value
x_people = x_people/255.

# applying MinMaxScaler instead also works

scaler = MinMaxScaler()
scaler.fit(x_people)
x_people_scaled = scaler.transform(x_people)
 
  • Dividing the 0~255 pixel values by 255 and applying a scaler produce essentially the same 0~1 result; a quick check follows the table below.
  • Using MinMaxScaler is equally fine.
  • Review
    • What is MinMaxScaler?
      • It rescales each feature so its maximum and minimum become 1 and 0.

Scaler           Description
StandardScaler   The default scaling: uses the mean and standard deviation
MinMaxScaler     Rescales so the maximum and minimum become 1 and 0
MaxAbsScaler     Rescales so the maximum absolute value becomes 1 (0 stays 0)
RobustScaler     Uses the median and IQR (interquartile range), minimizing the influence of outliers
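A quick check (my addition): the two scaled versions printed in the next cell are nearly, but not exactly, identical, because MinMaxScaler uses each pixel's own observed minimum and maximum rather than the fixed values 0 and 255.

print("largest per-pixel difference :", np.abs(x_people - x_people_scaled).max())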
In [34]:
print(x_people)
print(x_people_scaled)
 
[[0.22352941 0.23660131 0.30588236 ... 0.06797386 0.06535947 0.08888888]
 [0.2614379  0.31633985 0.3477124  ... 0.03398693 0.03267974 0.03660131]
 [0.07320261 0.05620915 0.05882353 ... 0.08888888 0.08888888 0.10065359]
 ...
 [0.14248365 0.0875817  0.10980392 ... 0.05620915 0.02614379 0.02091503]
 [0.21176471 0.25620916 0.22091503 ... 0.82222223 0.8235294  0.8326797 ]
 [0.43398693 0.50326794 0.5699346  ... 0.05490196 0.05490196 0.05359477]]
[[0.22440945 0.23722148 0.30749017 ... 0.06797386 0.06535947 0.08888888]
 [0.26246718 0.31716904 0.34954008 ... 0.03398693 0.03267974 0.03660131]
 [0.07349081 0.05635649 0.05913273 ... 0.08888888 0.08888888 0.10065359]
 ...
 [0.1430446  0.08781127 0.11038108 ... 0.05620915 0.02614379 0.02091503]
 [0.21259843 0.25688073 0.22207622 ... 0.82222223 0.8235294  0.8326797 ]
 [0.43569553 0.5045871  0.57293034 ... 0.05490196 0.05490196 0.05359477]]
In [35]:
# let's train on this data with the KNN we covered yesterday!

from sklearn.neighbors import KNeighborsClassifier

x_train, x_test, y_train, y_test = train_test_split(
    x_people, y_people, stratify = y_people, random_state=0)
knn = KNeighborsClassifier(n_neighbors = 1)
knn.fit(x_train, y_train)

print("1-nearest-neighbor test score : {:.6f}".format(knn.score(x_test, y_test)))
 
1-nearest-neighbor test score : 0.232558
 
  • The accuracy is clearly low. So what can we do?
  1. Change the algorithm.
  2. Change its parameters.
  3. Change the data, for example through dimensionality reduction.
In [36]:
# option 3: let's try dimensionality reduction!

print("x_train.shape", x_train.shape)
pca = PCA(n_components = 100, whiten=True, random_state=0)
# whiten: rotate the inputs so the components are uncorrelated and have unit variance


pca.fit(x_train)

x_train_pca = pca.transform(x_train)
x_test_pca = pca.transform(x_test)

print("x_train.pca.shape", x_train_pca.shape) # the dimensionality has dropped to 100
 
x_train.shape (1547, 5655)
x_train.pca.shape (1547, 100)
 
  • The dimensionality dropped from 5655 to 100.
In [37]:
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(x_train_pca, y_train)
print("1-최근접 이웃의 테스트 점수 : {:3f}".format(knn.score(x_test_pca, y_test)))
 
1-nearest-neighbor test score : 0.312016
 

The score was initially 0.232558; after applying PCA it rose to 0.312016. (A short tuning sketch follows.)
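Since n_components = 100 was an arbitrary choice, here is a small sketch (my addition; the exact scores will vary) that tries a few values to see how the 1-NN score responds.

for n in (50, 100, 200):
    pca_n = PCA(n_components=n, whiten=True, random_state=0).fit(x_train)
    knn_n = KNeighborsClassifier(n_neighbors=1).fit(pca_n.transform(x_train), y_train)
    print(n, knn_n.score(pca_n.transform(x_test), y_test))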

In [38]:
print("pca.components_.shape", pca.components_.shape)
 
pca.components_.shape (100, 5655)
 
  • Let's visualize the principal components themselves as images!
In [39]:
fig , axes = plt.subplots(3,5,figsize=(15,12) , subplot_kw={'xticks':(),'yticks':()})
for i, (component, ax) in enumerate(zip(pca.components_, axes.ravel())):
    ax.imshow(component.reshape(image_shape),cmap='viridis')
    ax.set_title('PCA {}'.format((i+1)))
 
In [40]:
from matplotlib.offsetbox import OffsetImage, AnnotationBbox

image_shape = people.images[0].shape
plt.figure(figsize=(20, 3))
ax = plt.gca()

imagebox = OffsetImage(people.images[0], zoom=2, cmap="gray")
ab = AnnotationBbox(imagebox, (0.05, 0.4), pad=0.0, xycoords='data')
ax.add_artist(ab)

for i in range(4):
    imagebox = OffsetImage(pca.components_[i].reshape(image_shape), zoom=2, 
                           cmap="viridis")

    ab = AnnotationBbox(imagebox, (0.285 + 0.2 * i, 0.4),
                        pad=0.0,xycoords='data')
    ax.add_artist(ab)
    if i == 0:
        plt.text(0.155, .3, 'x_{} *'.format(i), fontdict={'fontsize': 30})
    else:
        plt.text(0.145 + .2 * i, .3, '+ x_{} *'.format(i), fontdict={'fontsize': 30})

plt.text(.95, .3, '+ ...', fontdict={'fontsize': 30})

plt.rc('text')
plt.text(.12, .3, '=', fontdict={'fontsize': 50})
plt.axis("off")
Out[40]:
(0.0, 1.0, 0.0, 1.0)
 
In [46]:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
 

Applying PCA to iris

In [47]:
x= iris.data
y = iris.target
target_names = iris.target_names
print(x.shape)
print(y.shape)
print(target_names)
 
(150, 4)
(150,)
['setosa' 'versicolor' 'virginica']
In [50]:
pca = PCA(n_components =2)
x_pca = pca.fit(x).transform(x)

print("First tow componetns : %s" % str(pca.explained_variance_ratio_))
 
First two components : [0.92461872 0.05306648]
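A follow-up line (my addition): the cumulative sum shows that the first two components together already explain about 97.8% of the variance in iris.

print(np.cumsum(pca.explained_variance_ratio_))  # [0.92461872 0.9776852 ]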
In [54]:
plt.figure()
colors = ['navy','orange','green']
lw = 2

for color, i, target_name in zip(colors, [0,1,2], target_names):
    plt.scatter(x_pca[y==i, 0], x_pca[y==i,1], color=color, alpha=.8, lw=lw, # lw is short for line width
                label=target_name)
plt.legend(loc='best')
plt.title("PCA of iris dataset")
plt.show()
 
 

Practice problem

ML dataset

In [55]:
import pandas as pd

bank_data = pd.read_csv("C:/Users/ka030/Documents/GitHub/ai/bank-additional/bank-additional/bank-additional.csv", delimiter=';')
bank_data.head()
Out[55]:
  age job marital education default housing loan contact month day_of_week ... campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 30 blue-collar married basic.9y no yes no cellular may fri ... 2 999 0 nonexistent -1.8 92.893 -46.2 1.313 5099.1 no
1 39 services single high.school no no no telephone may fri ... 4 999 0 nonexistent 1.1 93.994 -36.4 4.855 5191.0 no
2 25 services married high.school no yes no telephone jun wed ... 1 999 0 nonexistent 1.4 94.465 -41.8 4.962 5228.1 no
3 38 services married basic.9y no unknown unknown telephone jun fri ... 3 999 0 nonexistent 1.4 94.465 -41.8 4.959 5228.1 no
4 47 admin. married university.degree no yes no cellular nov mon ... 1 999 0 nonexistent -0.1 93.200 -42.0 4.191 5195.8 no

5 rows × 21 columns
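One possible starting point for the exercise (my sketch, not a prescribed solution; n_components=10 and C=100 are arbitrary values to experiment with): one-hot encode the categorical columns, scale, reduce with PCA, and reuse the same kind of SVC as earlier.

X = pd.get_dummies(bank_data.drop(columns='y'))  # expand the categorical columns
y = (bank_data['y'] == 'yes').astype(int)        # encode the label as 0/1

x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(x_train)           # fit on training data only
pca = PCA(n_components=10, random_state=0).fit(scaler.transform(x_train))

svm = SVC(C=100)
svm.fit(pca.transform(scaler.transform(x_train)), y_train)
print("test set accuracy :", svm.score(pca.transform(scaler.transform(x_test)), y_test))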