
[CNN & Unsupervised Learning] SK infosec Cloud AI Expert Training Course Practice

gogoriver 2020. 9. 8. 20:18

 

 

 

In [1]:
%matplotlib inline
import mglearn
 
(Output: a series of Matplotlib 3.3 deprecation warnings triggered by mglearn loading _classic_test.mplstyle, covering text.latex.preview, mathtext.fallback_to_cm, savefig.jpeg_quality, keymap.all_axes, and animation.avconv_path/args; they are harmless here.)
In [2]:
mglearn.plots.plot_scaling()
 
 
  1. MinMaxScaler
    • Rescales each feature so both x and y end up in the 0~1 range.
  2. StandardScaler
    • Standardizes each feature to zero mean and unit variance.
  3. RobustScaler
    • Uses the median and quartiles: (x - median) / (Q3 - Q1).
    • It is computed from the quartiles of the values above and below a given point, so in a way it resembles StandardScaler, but it is much less sensitive to outliers.
  4. Normalizer
    • Rescales each data point (a row, i.e. the features of one sample) to unit length, so only the direction of its feature vector matters.

A small comparison of the four scalers is sketched below.
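To make the differences concrete, here is a minimal sketch (my addition, not part of the lecture notebook) that runs all four scalers on the same tiny matrix; note how much less RobustScaler is distorted by the outlier than MinMaxScaler.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, Normalizer

X = np.array([[1.0, -2.0],
              [2.0,  0.0],
              [3.0,  2.0],
              [100.0, 4.0]])  # the 100 acts as an outlier in the first column

for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler(), Normalizer()):
    print(type(scaler).__name__)
    print(scaler.fit_transform(X))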
In [3]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
cancer = load_breast_cancer()

x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=1)

print(x_train.shape)
print(x_test.shape)
 
(426, 30)
(143, 30)
In [4]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(x_train)
Out[4]:
MinMaxScaler(copy=True, feature_range=(0, 1))
In [5]:
x_train_scaled = scaler.transform(x_train) # transform applies the learned scaling to the data
# the combined fit_transform() function can also be used
print("Shape after transformation : ", x_train_scaled.shape)
print("Per-feature minimum before scaling : ", x_train.min(axis =0))
print("Per-feature maximum before scaling : ", x_train.max(axis =0))
print("Per-feature minimum after scaling : ", x_train_scaled.min(axis =0))
print("Per-feature maximum after scaling : ", x_train_scaled.max(axis =0))
 
Shape after transformation :  (426, 30)
Per-feature minimum before scaling :  [6.981e+00 9.710e+00 4.379e+01 1.435e+02 5.263e-02 1.938e-02 0.000e+00
 0.000e+00 1.060e-01 5.024e-02 1.153e-01 3.602e-01 7.570e-01 6.802e+00
 1.713e-03 2.252e-03 0.000e+00 0.000e+00 9.539e-03 8.948e-04 7.930e+00
 1.202e+01 5.041e+01 1.852e+02 7.117e-02 2.729e-02 0.000e+00 0.000e+00
 1.566e-01 5.521e-02]
Per-feature maximum before scaling :  [2.811e+01 3.928e+01 1.885e+02 2.501e+03 1.634e-01 2.867e-01 4.268e-01
 2.012e-01 3.040e-01 9.575e-02 2.873e+00 4.885e+00 2.198e+01 5.422e+02
 3.113e-02 1.354e-01 3.960e-01 5.279e-02 6.146e-02 2.984e-02 3.604e+01
 4.954e+01 2.512e+02 4.254e+03 2.226e-01 9.379e-01 1.170e+00 2.910e-01
 5.774e-01 1.486e-01]
Per-feature minimum after scaling :  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0.]
Per-feature maximum after scaling :  [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]
In [6]:
x_test_scaled = scaler.transform(x_test) # note: transform, not fit_transform(), on the test set

print("Per-feature minimum after scaling : ", x_test_scaled.min(axis =0))
print("Per-feature maximum after scaling : ", x_test_scaled.max(axis =0))
 
Per-feature minimum after scaling :  [ 0.0336031   0.0226581   0.03144219  0.01141039  0.14128374  0.04406704
  0.          0.          0.1540404  -0.00615249 -0.00137796  0.00594501
  0.00430665  0.00079567  0.03919502  0.0112206   0.          0.
 -0.03191387  0.00664013  0.02660975  0.05810235  0.02031974  0.00943767
  0.1094235   0.02637792  0.          0.         -0.00023764 -0.00182032]
Per-feature maximum after scaling :  [0.9578778  0.81501522 0.95577362 0.89353128 0.81132075 1.21958701
 0.87956888 0.9333996  0.93232323 1.0371347  0.42669616 0.49765736
 0.44117231 0.28371044 0.48703131 0.73863671 0.76717172 0.62928585
 1.33685792 0.39057253 0.89612238 0.79317697 0.84859804 0.74488793
 0.9154725  1.13188961 1.07008547 0.92371134 1.20532319 1.63068851]
 
  • train : min 0, max 1 / test : values such as -0.xxx and 1.xxxx
  • This is not because the transformation formula is broken; it follows directly from how the formula is defined.

  • (x - x_min) / (x_max - x_min) = the basic formula

  • Here x is a test value, but x_max and x_min are the statistics learned from the training data, so scaled test features can fall slightly outside the 0~1 range.
  • That is intentional: the test set must be transformed with exactly the same statistics as the training set. Fitting a separate scaler on the test set is the real mistake, as the three-panel plot below demonstrates.

  • Textbook, p. 181 (a tiny numeric illustration follows)
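A one-line sketch (my addition) of why a test value can map outside [0, 1]: it only has to be larger than anything seen during training.

x_train_min, x_train_max = 10.0, 20.0  # statistics learned from the training data
x_test_value = 22.0                    # larger than any training value

print((x_test_value - x_train_min) / (x_train_max - x_train_min))  # 1.2, outside [0, 1] as expected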

 

QuantileTransformer

  • A preprocessing module that adjusts scale by mapping each feature onto its quantiles (a uniform distribution by default).
In [7]:
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import make_blobs
from sklearn.preprocessing import QuantileTransformer, StandardScaler, PowerTransformer
In [8]:
x, y = make_blobs(n_samples=50, centers=2, random_state=4, cluster_std=1)

print(x.shape)
 
(50, 2)
In [9]:
plt.scatter(x[:,0],x[:,1],c=y, s=30, edgecolors='black')
plt.xlim(0,16)
plt.ylim(0,10)
plt.xlabel('x0')
plt.ylabel('x1')
plt.title("Original Data")
plt.show()
 
In [10]:
scaler = QuantileTransformer() # n_quantiles defaults to 1000
x_trans = scaler.fit_transform(x)

plt.scatter(x_trans[:,0],x_trans[:,1],c=y, s=30, edgecolors='black')
plt.xlim(0,5) # with the original axis ranges the points would be completely bunched up, so adjust them
plt.ylim(0,5)
plt.xlabel('x0')
plt.ylabel('x1')
plt.title("QuantileTransformer") # the scaled data
# plt.title(type(scaler).__name__) # this works too
plt.show()
 
C:\Users\ka030\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py:2239: UserWarning: n_quantiles (1000) is greater than the total number of samples (50). n_quantiles is set to n_samples.
  % (self.n_quantiles, n_samples))
 
 
  • Unlike the original data above, the transformed points are now distributed over values between 0 and 1.
 

What if there are 1,000 data points?

In [11]:
x, y = make_blobs(n_samples=1000, centers=2, random_state=4, cluster_std=1)

print(x.shape)

plt.scatter(x[:,0],x[:,1],c=y, s=30, edgecolors='black')
plt.xlim(0,16)
plt.ylim(0,10)
plt.xlabel('x0')
plt.ylabel('x1')
plt.title("Original Data")
plt.show()
 
(1000, 2)
 
In [12]:
scaler = QuantileTransformer() # n_quantiles defaults to 1000
x_trans = scaler.fit_transform(x)

plt.scatter(x_trans[:,0],x_trans[:,1],c=y, s=30, edgecolors='black')
plt.xlim(0,5) # with the original axis ranges the points would be completely bunched up, so adjust them
plt.ylim(0,5)
plt.xlabel('x0')
plt.ylabel('x1')
plt.title("QuantileTransformer") # the scaled data
# plt.title(type(scaler).__name__) # this works too
plt.show()
 
 

Normal distribution

  • Mapping the uniformly distributed values onto a normal distribution instead.
In [13]:
scaler = QuantileTransformer(output_distribution='normal') # n_quantiles defaults to 1000
x_trans = scaler.fit_transform(x)

print(x_trans.shape)

plt.scatter(x_trans[:,0],x_trans[:,1],c=y, s=30, edgecolors='black') # plot the transformed data, not x
plt.xlim(-5,5)
plt.ylim(-5,5)
plt.xlabel('x0')
plt.ylabel('x1')
plt.title(type(scaler).__name__)
plt.show()
 
(1000, 2)
 
In [14]:
# from sklearn.datasets import make_blobs (already imported above)
# generate a synthetic dataset
x, _ = make_blobs(n_samples=50, centers=2, random_state=4, cluster_std=2)
x_train, x_test = train_test_split(x, random_state=1, test_size=0.1)

fig, axes = plt.subplots(1,3,figsize=(13,4))
axes[0].scatter(x_train[:,0], x_train[:,1], c=mglearn.cm2.colors[0], label="training set", s=60)
axes[0].scatter(x_test[:,0], x_test[:,1], c=mglearn.cm2.colors[1], label="test set", s=60)
axes[0].legend(loc="upper left")
axes[0].set_title("Original data")


# scale correctly: fit on the training set only, then transform both sets
scaler = MinMaxScaler()
scaler.fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

axes[1].scatter(x_train_scaled[:,0], x_train_scaled[:,1],
                c=mglearn.cm2.colors[0], label="training set", s=60)
axes[1].scatter(x_test_scaled[:,0], x_test_scaled[:,1], marker = "^",
                c=mglearn.cm2.colors[1], label="test set", s=60)
axes[1].set_title("Scaled data")

# fitting a separate scaler on the test set (don't do this!)
test_scaler = MinMaxScaler()
test_scaler.fit(x_test)
x_test_scaled_badly = test_scaler.transform(x_test)

axes[2].scatter(x_train_scaled[:,0], x_train_scaled[:,1],
                c=mglearn.cm2.colors[0], label="training set", s=60)
axes[2].scatter(x_test_scaled_badly[:,0], x_test_scaled_badly[:,1],
                marker = '^',c=mglearn.cm2.colors[1], label="test set", s=60)
axes[2].set_title("wrong scaled data")
Out[14]:
Text(0.5, 1.0, 'wrong scaled data')
 
 

Trying it on the cancer data

In [15]:
print(cancer)
 
{'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       ...,
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]]),
 'target': array([0, 0, 0, ..., 0, 0, 1]),
 'target_names': array(['malignant', 'benign'], dtype='<U9'),
 'DESCR': '.. _breast_cancer_dataset: ... (full description of the Breast Cancer Wisconsin (Diagnostic) dataset: 569 instances, 30 numeric predictive features, class distribution 212 malignant / 357 benign) ...',
 'feature_names': array(['mean radius', 'mean texture', ..., 'worst fractal dimension'], dtype='<U23'),
 'filename': 'C:\\Users\\ka030\\Anaconda3\\lib\\site-packages\\sklearn\\datasets\\data\\breast_cancer.csv'}
(output abridged for readability)
In [16]:
x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
In [17]:
from sklearn.svm import SVC

svm = SVC(C=100)
svm.fit(x_train, y_train)
print("Test set accuracy : {:.6f}".format(svm.score(x_test, y_test)))
 
Test set accuracy : 0.629371
 
C:\Users\ka030\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
In [18]:
# MinMaxScaler -> rescale to the 0~1 range

scaler = MinMaxScaler()
scaler.fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

svm.fit(x_train_scaled,y_train)
print("Scaled test set accuracy : {:.6f}".format(svm.score(x_test_scaled, y_test)))
 
Scaled test set accuracy : 0.965035
 
C:\Users\ka030\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
In [19]:
# scale to mean 0 and variance 1 -> (x - mean) / standard deviation
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

svm.fit(x_train_scaled,y_train)
print("SVM(StandardScaler) test set accuracy : {:.6f}".format(svm.score(x_test_scaled, y_test)))
 
SVM(StandardScaler) test set accuracy : 0.958042
 

A quick aside: what is principal component analysis (PCA)?

  • A dimensionality-reduction method that rotates the data and then drops the directions that carry little information.
  • We will use it now, while studying unsupervised learning; a from-scratch sketch of the idea follows.
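A from-scratch sketch (my addition, following the usual definition of PCA) of what the "rotation" means: center the data, find the directions of maximal variance with an SVD, and project onto the first k of them.

import numpy as np

def pca_project(X, k):
    X_centered = X - X.mean(axis=0)                # PCA works on centered data
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:k].T                   # rotate, then keep only k directions

rng = np.random.RandomState(0)
print(pca_project(rng.randn(200, 5), 2).shape)     # (200, 2)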
In [20]:
import mglearn
import numpy as np
mglearn.plots.plot_pca_illustration()
plt.show()
 
In [21]:
x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=1)
fig, axes = plt.subplots(15, 2, figsize=(10,20)) # remember: figsize is in inches!
malignant = cancer.data[cancer.target==0]
benign = cancer.data[cancer.target==1]

ax = axes.ravel() # ravel flattens the 15x2 grid of axes into a 1-D array

for i in range(30):
    _, bins = np.histogram(cancer.data[:,i], bins=50) # compute the histogram bin edges with numpy
    ax[i].hist(malignant[:,i], bins=bins, color = mglearn.cm3(0), alpha=0.5)
    ax[i].hist(benign[:,i], bins=bins, color = mglearn.cm2(2), alpha=0.5)
    ax[i].set_title(cancer.feature_names[i])
    ax[i].set_yticks(())
ax[0].set_xlabel("Feature size")
ax[0].set_ylabel("Frequency")
ax[0].legend(['Malignant','Benign'], loc="best")
fig.tight_layout()
 
In [22]:
print(cancer.data.shape)
 
(569, 30)
In [23]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(cancer.data)
x_scaled= scaler.transform(cancer.data)
In [24]:
from sklearn.decomposition import PCA
# keep only the first two principal components of the data and discard the rest

pca = PCA(n_components=2)
pca.fit(x_scaled)

# transform the data using those first two principal components
x_pca = pca.transform(x_scaled)
print("Original data shape : ", str(x_scaled.shape))
print("Reduced data shape  : ", str(x_pca.shape)) 
# dimensionality reduction does not change the samples themselves; each sample simply keeps two derived features instead of 30
 
Original data shape :  (569, 30)
Reduced data shape  :  (569, 2)
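To see what was thrown away, a short sketch (my addition) maps the two-component representation back to the original 30 dimensions with PCA's inverse_transform and measures the reconstruction error.

x_back = pca.inverse_transform(x_pca)   # back to shape (569, 30)
print("reconstructed shape :", x_back.shape)
print("mean squared reconstruction error :", np.mean((x_scaled - x_back) ** 2))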
In [25]:
plt.figure(figsize=(8,8))
mglearn.discrete_scatter(x_pca[:,0],x_pca[:,1], cancer.target)
# plt.legend(["malignant","benign"], loc="best")
plt.gca().set_aspect("equal")
plt.xlabel("first principal component")
plt.ylabel("second principal component")

plt.legend(['Malignant','Benign'], loc="best")
Out[25]:
<matplotlib.legend.Legend at 0x1b187ad4e88>
 
In [26]:
# to inspect the shape of the principal components:
print("PCA components shape : ", pca.components_.shape)
 
PCA components shape :  (2, 30)
In [27]:
pca.components_
Out[27]:
array([[ 0.21890244,  0.10372458,  0.22753729,  0.22099499,  0.14258969,
         0.23928535,  0.25840048,  0.26085376,  0.13816696,  0.06436335,
         0.20597878,  0.01742803,  0.21132592,  0.20286964,  0.01453145,
         0.17039345,  0.15358979,  0.1834174 ,  0.04249842,  0.10256832,
         0.22799663,  0.10446933,  0.23663968,  0.22487053,  0.12795256,
         0.21009588,  0.22876753,  0.25088597,  0.12290456,  0.13178394],
       [-0.23385713, -0.05970609, -0.21518136, -0.23107671,  0.18611302,
         0.15189161,  0.06016536, -0.0347675 ,  0.19034877,  0.36657547,
        -0.10555215,  0.08997968, -0.08945723, -0.15229263,  0.20443045,
         0.2327159 ,  0.19720728,  0.13032156,  0.183848  ,  0.28009203,
        -0.21986638, -0.0454673 , -0.19987843, -0.21935186,  0.17230435,
         0.14359317,  0.09796411, -0.00825724,  0.14188335,  0.27533947]])
 

Eigenface feature extraction

In [28]:
from sklearn.datasets import fetch_lfw_people

people= fetch_lfw_people(min_faces_per_person=20, resize=0.7, color = False)
image_shape = people.images[0].shape
In [29]:
print(image_shape)

fig, axes = plt.subplots(3, 5, figsize=(15,8),subplot_kw={"xticks" : (), 'yticks':()})
for target, image, ax in zip(people.target, people.images, axes.ravel()):
    ax.imshow(image)
    ax.set_title(people.target_names[target])
 
(87, 65)
 
In [30]:
print("people.images.shape", people.images.shape)
print("클래스 개수", len(people.target_names))
 
people.images.shape (3023, 87, 65)
number of classes 62
In [31]:
people.target[0:10], people.target_names[people.target[0:10]]
Out[31]:
(array([61, 25,  9,  5,  1, 10, 48, 17, 13, 54], dtype=int64),
 array(['Winona Ryder', 'Jean Chretien', 'Carlos Menem', 'Ariel Sharon',
        'Alvaro Uribe', 'Colin Powell', 'Recep Tayyip Erdogan',
        'Gray Davis', 'George Robertson', 'Silvio Berlusconi'],
       dtype='<U25'))
In [32]:
counts = np.bincount(people.target)
for i, (count, name) in enumerate(zip(counts, people.target_names)):
    print("{0:25}{1:3}".format(name, count), end='   ')
    if (i+1) % 3 ==0:
        print()
 
Alejandro Toledo          39   Alvaro Uribe              35   Amelie Mauresmo           21   
Andre Agassi              36   Angelina Jolie            20   Ariel Sharon              77   
Arnold Schwarzenegger     42   Atal Bihari Vajpayee      24   Bill Clinton              29   
Carlos Menem              21   Colin Powell             236   David Beckham             31   
Donald Rumsfeld          121   George Robertson          22   George W Bush            530   
Gerhard Schroeder        109   Gloria Macapagal Arroyo   44   Gray Davis                26   
Guillermo Coria           30   Hamid Karzai              22   Hans Blix                 39   
Hugo Chavez               71   Igor Ivanov               20   Jack Straw                28   
Jacques Chirac            52   Jean Chretien             55   Jennifer Aniston          21   
Jennifer Capriati         42   Jennifer Lopez            21   Jeremy Greenstock         24   
Jiang Zemin               20   John Ashcroft             53   John Negroponte           31   
Jose Maria Aznar          23   Juan Carlos Ferrero       28   Junichiro Koizumi         60   
Kofi Annan                32   Laura Bush                41   Lindsay Davenport         22   
Lleyton Hewitt            41   Luiz Inacio Lula da Silva 48   Mahmoud Abbas             29   
Megawati Sukarnoputri     33   Michael Bloomberg         20   Naomi Watts               22   
Nestor Kirchner           37   Paul Bremer               20   Pete Sampras              22   
Recep Tayyip Erdogan      30   Ricardo Lagos             27   Roh Moo-hyun              32   
Rudolph Giuliani          26   Saddam Hussein            23   Serena Williams           52   
Silvio Berlusconi         33   Tiger Woods               23   Tom Daschle               25   
Tom Ridge                 33   Tony Blair               144   Vicente Fox               32   
Vladimir Putin            49   Winona Ryder              24   
In [33]:
mask = np.zeros(people.target.shape, dtype=np.bool)
for target in np.unique(people.target):
    mask[np.where(people.target == target)[0][:50]] = 1  # keep at most 50 images per person
    
x_people = people.data[mask]
y_people = people.target[mask]

# scale by dividing by the maximum pixel value
x_people = x_people/255.

# applying MinMaxScaler instead also works

scaler = MinMaxScaler()
scaler.fit(x_people)
x_people_scaled = scaler.transform(x_people)
 
  • Dividing the 0~255 pixel values by 255 and applying a scaler produce essentially the same 0~1 result; a quick check follows the table below.
  • Using MinMaxScaler is equally fine.
  • Review
    • What is MinMaxScaler?
      • It rescales each feature so its maximum and minimum become 1 and 0.

Scaler           Description
StandardScaler   The default scaling: uses the mean and standard deviation
MinMaxScaler     Rescales so the maximum and minimum become 1 and 0
MaxAbsScaler     Rescales so the maximum absolute value becomes 1 (0 stays 0)
RobustScaler     Uses the median and IQR (interquartile range), minimizing the influence of outliers
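A quick check (my addition): the two scaled versions printed in the next cell are nearly, but not exactly, identical, because MinMaxScaler uses each pixel's own observed minimum and maximum rather than the fixed values 0 and 255.

print("largest per-pixel difference :", np.abs(x_people - x_people_scaled).max())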
In [34]:
print(x_people)
print(x_people_scaled)
 
[[0.22352941 0.23660131 0.30588236 ... 0.06797386 0.06535947 0.08888888]
 [0.2614379  0.31633985 0.3477124  ... 0.03398693 0.03267974 0.03660131]
 [0.07320261 0.05620915 0.05882353 ... 0.08888888 0.08888888 0.10065359]
 ...
 [0.14248365 0.0875817  0.10980392 ... 0.05620915 0.02614379 0.02091503]
 [0.21176471 0.25620916 0.22091503 ... 0.82222223 0.8235294  0.8326797 ]
 [0.43398693 0.50326794 0.5699346  ... 0.05490196 0.05490196 0.05359477]]
[[0.22440945 0.23722148 0.30749017 ... 0.06797386 0.06535947 0.08888888]
 [0.26246718 0.31716904 0.34954008 ... 0.03398693 0.03267974 0.03660131]
 [0.07349081 0.05635649 0.05913273 ... 0.08888888 0.08888888 0.10065359]
 ...
 [0.1430446  0.08781127 0.11038108 ... 0.05620915 0.02614379 0.02091503]
 [0.21259843 0.25688073 0.22207622 ... 0.82222223 0.8235294  0.8326797 ]
 [0.43569553 0.5045871  0.57293034 ... 0.05490196 0.05490196 0.05359477]]
In [35]:
# let's train on this data with the KNN we covered yesterday!

from sklearn.neighbors import KNeighborsClassifier

x_train, x_test, y_train, y_test = train_test_split(
    x_people, y_people, stratify = y_people, random_state=0)
knn = KNeighborsClassifier(n_neighbors = 1)
knn.fit(x_train, y_train)

print("1-nearest-neighbor test score : {:.6f}".format(knn.score(x_test, y_test)))
 
1-nearest-neighbor test score : 0.232558
 
  • The accuracy is clearly low. So what can we do?
  1. Change the algorithm.
  2. Change its parameters.
  3. Change the data, for example through dimensionality reduction.
In [36]:
# option 3: let's try dimensionality reduction!

print("x_train.shape", x_train.shape)
pca = PCA(n_components = 100, whiten=True, random_state=0)
# whiten: rotate the inputs so the components are uncorrelated and have unit variance


pca.fit(x_train)

x_train_pca = pca.transform(x_train)
x_test_pca = pca.transform(x_test)

print("x_train.pca.shape", x_train_pca.shape) # the dimensionality has dropped to 100
 
x_train.shape (1547, 5655)
x_train.pca.shape (1547, 100)
 
  • The dimensionality dropped from 5655 to 100.
In [37]:
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(x_train_pca, y_train)
print("1-최근접 이웃의 테스트 점수 : {:3f}".format(knn.score(x_test_pca, y_test)))
 
1-nearest-neighbor test score : 0.312016
 

The score was initially 0.232558; after applying PCA it rose to 0.312016. (A short tuning sketch follows.)
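Since n_components = 100 was an arbitrary choice, here is a small sketch (my addition; the exact scores will vary) that tries a few values to see how the 1-NN score responds.

for n in (50, 100, 200):
    pca_n = PCA(n_components=n, whiten=True, random_state=0).fit(x_train)
    knn_n = KNeighborsClassifier(n_neighbors=1).fit(pca_n.transform(x_train), y_train)
    print(n, knn_n.score(pca_n.transform(x_test), y_test))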

In [38]:
print("pca.components_.shape", pca.components_.shape)
 
pca.components_.shape (100, 5655)
 
  • Let's visualize the principal components themselves as images!
In [39]:
fig , axes = plt.subplots(3,5,figsize=(15,12) , subplot_kw={'xticks':(),'yticks':()})
for i, (component, ax) in enumerate(zip(pca.components_, axes.ravel())):
    ax.imshow(component.reshape(image_shape),cmap='viridis')
    ax.set_title('PCA {}'.format((i+1)))
 
In [40]:
from matplotlib.offsetbox import OffsetImage, AnnotationBbox

image_shape = people.images[0].shape
plt.figure(figsize=(20, 3))
ax = plt.gca()

imagebox = OffsetImage(people.images[0], zoom=2, cmap="gray")
ab = AnnotationBbox(imagebox, (0.05, 0.4), pad=0.0, xycoords='data')
ax.add_artist(ab)

for i in range(4):
    imagebox = OffsetImage(pca.components_[i].reshape(image_shape), zoom=2, 
                           cmap="viridis")

    ab = AnnotationBbox(imagebox, (0.285 + 0.2 * i, 0.4),
                        pad=0.0,xycoords='data')
    ax.add_artist(ab)
    if i == 0:
        plt.text(0.155, .3, 'x_{} *'.format(i), fontdict={'fontsize': 30})
    else:
        plt.text(0.145 + .2 * i, .3, '+ x_{} *'.format(i), fontdict={'fontsize': 30})

plt.text(.95, .3, '+ ...', fontdict={'fontsize': 30})

plt.rc('text')
plt.text(.12, .3, '=', fontdict={'fontsize': 50})
plt.axis("off")
Out[40]:
(0.0, 1.0, 0.0, 1.0)
 
In [46]:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
 

Applying PCA to iris

In [47]:
x= iris.data
y = iris.target
target_names = iris.target_names
print(x.shape)
print(y.shape)
print(target_names)
 
(150, 4)
(150,)
['setosa' 'versicolor' 'virginica']
In [50]:
pca = PCA(n_components =2)
x_pca = pca.fit(x).transform(x)

print("First tow componetns : %s" % str(pca.explained_variance_ratio_))
 
First two components : [0.92461872 0.05306648]
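A follow-up line (my addition): the cumulative sum shows that the first two components together already explain about 97.8% of the variance in iris.

print(np.cumsum(pca.explained_variance_ratio_))  # [0.92461872 0.9776852 ]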
In [54]:
plt.figure()
colors = ['navy','orange','green']
lw = 2

for color, i, target_name in zip(colors, [0,1,2], target_names):
    plt.scatter(x_pca[y==i, 0], x_pca[y==i,1], color=color, alpha=.8, lw=lw, # lw is short for line width
                label=target_name)
plt.legend(loc='best')
plt.title("PCA of iris dataset")
plt.show()
 
 

Practice problem

ML dataset

In [55]:
import pandas as pd

bank_data = pd.read_csv("C:/Users/ka030/Documents/GitHub/ai/bank-additional/bank-additional/bank-additional.csv", delimiter=';')
bank_data.head()
Out[55]:
  age job marital education default housing loan contact month day_of_week ... campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 30 blue-collar married basic.9y no yes no cellular may fri ... 2 999 0 nonexistent -1.8 92.893 -46.2 1.313 5099.1 no
1 39 services single high.school no no no telephone may fri ... 4 999 0 nonexistent 1.1 93.994 -36.4 4.855 5191.0 no
2 25 services married high.school no yes no telephone jun wed ... 1 999 0 nonexistent 1.4 94.465 -41.8 4.962 5228.1 no
3 38 services married basic.9y no unknown unknown telephone jun fri ... 3 999 0 nonexistent 1.4 94.465 -41.8 4.959 5228.1 no
4 47 admin. married university.degree no yes no cellular nov mon ... 1 999 0 nonexistent -0.1 93.200 -42.0 4.191 5195.8 no

5 rows × 21 columns
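One possible starting point for the exercise (my sketch, not a prescribed solution; n_components=10 and C=100 are arbitrary values to experiment with): one-hot encode the categorical columns, scale, reduce with PCA, and reuse the same kind of SVC as earlier.

X = pd.get_dummies(bank_data.drop(columns='y'))  # expand the categorical columns
y = (bank_data['y'] == 'yes').astype(int)        # encode the label as 0/1

x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(x_train)           # fit on training data only
pca = PCA(n_components=10, random_state=0).fit(scaler.transform(x_train))

svm = SVC(C=100)
svm.fit(pca.transform(scaler.transform(x_train)), y_train)
print("test set accuracy :", svm.score(pca.transform(scaler.transform(x_test)), y_test))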