Fashion-MNIST¶
Display the first record of the test data as an image => submit the image
Fashion-mnist_train.csv (60,000 rows), fashion-mnist_test.csv (10,000 rows)
- RandomForestClassifier
- GradientBoostingClassifier
- MLPClassifier
- SVC
- Find the best algorithm and parameters among these with GridSearchCV
- Check the accuracy (with code)
- Submit as Jupyter notebook code
1. Display the first record of the test data as an image¶
In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
In [ ]:
# Read the training CSV as plain text and take the first data record
data_file = open("/content/drive/My Drive/Colab Notebooks/ml/Day04/fashion-mnist_train.csv", 'r')
data_list = data_file.readlines()
data_file.close()
all_values = data_list[1].split(',')   # data_list[0] is the header row, so [1] is the first record
image_array = np.asfarray(all_values[1:]).reshape((28,28))   # drop the label, reshape 784 pixels to 28x28
plt.imshow(image_array, cmap='Greys', interpolation='None')
plt.show()
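The first element of all_values is the class label. As a quick check (a minimal sketch not in the original submission; the name list below follows the standard Fashion-MNIST label mapping), the label can be printed alongside its class name:
In [ ]:
# Standard Fashion-MNIST label names for labels 0-9 (assumed mapping, not defined elsewhere in this notebook)
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
label = int(all_values[0])
print(label, class_names[label])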
In [ ]:
# Rescale the 0-255 pixel values into the range 0.01-1.00
scaled_input = np.asfarray(all_values[1:])/255.0*0.99+0.01
print(scaled_input)
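As a quick sanity check (a minimal sketch), dividing by 255.0, multiplying by 0.99, and adding 0.01 should map every pixel into the interval [0.01, 1.00]:
In [ ]:
# The rescaled pixel values should all lie between 0.01 and 1.00
print(scaled_input.min(), scaled_input.max())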
2. Fashion-mnist_train.csv (60,000 rows), fashion-mnist_test.csv (10,000 rows)¶
In [ ]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
train = pd.read_csv('/content/drive/My Drive/Colab Notebooks/ml/Day04/fashion-mnist_train.csv')
test = pd.read_csv('/content/drive/My Drive/Colab Notebooks/ml/Day04/fashion-mnist_test.csv')
In [ ]:
print("train data frame")
print(train)
print("test data frame")
print(test)
In [ ]:
# Check both data frames for missing values
datas = [train, test]
for data in datas:
    print(data.isnull().sum())
The missing-value check shows no missing values, so the missing-value handling step is skipped.
Every column holds a pixel value and is therefore already numeric (int64), so there are no columns whose values need to be converted.
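As a quick confirmation of this (a minimal sketch, not part of the original submission), the column dtypes can be counted directly:
In [ ]:
# Confirm that every column is numeric (int64) in both data frames
print(train.dtypes.value_counts())
print(test.dtypes.value_counts())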
Training the models¶
In [ ]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
In [ ]:
X_train = train[train.columns[1:]]
Y_train = train['label']
print(X_train)
print(Y_train)
In [ ]:
X_test = test[test.columns[1:]]
Y_test = test['label']
print(X_test)
print(Y_test)
Random Forest¶
In [ ]:
# Baseline random forest with 20 trees
model1 = RandomForestClassifier(n_estimators=20)
rdc = model1.fit(X_train, Y_train)
output1 = rdc.predict(X_test)
In [ ]:
print("RandomForest with n_estimators = 100")
print(accuracy_score(Y_test, output1))
Random Forest (with GridSearchCV)¶
In [ ]:
rf = RandomForestClassifier()
## Grid Search
# Limit the grid search to the first ~20,000 rows to keep the runtime manageable
X_train = X_train.loc[:20000]
Y_train = Y_train.loc[:20000]
rf_param_grid = {
    "max_depth": [None],
    "max_features": [1, 3, 10],
    "min_samples_split": [2, 3, 10],
    "min_samples_leaf": [1, 3, 10],
    "bootstrap": [False],
    "n_estimators": [10, 20]
}
rf_grid = GridSearchCV(rf, param_grid=rf_param_grid, scoring="accuracy", n_jobs=4, verbose=1)
rf_grid.fit(X_train, Y_train)
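After the search finishes, every parameter combination that was tried can be inspected through cv_results_. A minimal sketch (not part of the original submission) that ranks the combinations by mean cross-validated accuracy:
In [ ]:
# Summarize every parameter combination tried by the grid search, best first
cv_results = pd.DataFrame(rf_grid.cv_results_)
print(cv_results[['params', 'mean_test_score', 'rank_test_score']]
      .sort_values('rank_test_score').head())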
In [ ]:
rf_best = rf_grid.best_estimator_
print(rf_grid.best_score_)
print(rf_best)
In [ ]:
rf_best = rf_grid.best_estimator_
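The assignment asks for the accuracy to be checked with code, so as a minimal sketch (using the X_test and Y_test frames defined earlier) the tuned forest can be scored on the held-out test set:
In [ ]:
# Evaluate the tuned random forest on the 10,000-row test set
rf_test_pred = rf_best.predict(X_test)
print("RandomForest (GridSearchCV best) test accuracy:", accuracy_score(Y_test, rf_test_pred))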
GradientBoostingClassifier¶
In [ ]:
gb = GradientBoostingClassifier()
## Grid Search
# Again restrict the search to the first ~20,000 rows; gradient boosting is slow on the full set
X_train = X_train.loc[:20000]
Y_train = Y_train.loc[:20000]
gb_param_grid = {
    "loss": ["deviance"],
    "n_estimators": [3, 4],
    "learning_rate": [0.1, 0.01],
    "max_depth": [4, 8],
    "max_features": [0.3, 0.1],
    "min_samples_leaf": [10, 15]
}
gb_grid = GridSearchCV(gb, param_grid=gb_param_grid, scoring="accuracy", n_jobs=4, verbose=1)
In [ ]:
gb_grid.fit(X_train,Y_train)
gb_best = gb_grid.best_estimator_
print(gb_grid.best_score_)
print(gb_best)
MLPClassifier¶
In [ ]:
# ===================sgd====================================
mlp_sgd = MLPClassifier(solver='sgd', hidden_layer_sizes=(100,), random_state=1)
X_train = X_train.loc[:20000]
Y_train = Y_train.loc[:20000]
# Grid-search the SGD-based MLP over its initial learning rate
mlp_param_grid = {
    "learning_rate_init": [0.1, 0.01]
}
mlp_grid = GridSearchCV(mlp_sgd, param_grid=mlp_param_grid, scoring="accuracy", n_jobs=4, verbose=1)
mlp_grid.fit(X_train, Y_train)
mlp_sgd_best = mlp_grid.best_score_
print(mlp_grid)
print(mlp_grid.best_score_)
In [ ]:
# ===================adam====================================
print("===================adam====================================")
mlp_adam = MLPClassifier(solver='adam', hidden_layer_sizes=(100,), random_state=1)
X_train = X_train.loc[:20000]
Y_train = Y_train.loc[:20000]
mlp_param_grid_adam = {
    "activation": ["relu"]
}
mlp_a_grid = GridSearchCV(mlp_adam, param_grid=mlp_param_grid_adam, n_jobs=-1, verbose=1)
mlp_a_grid.fit(X_train,Y_train)
mlp_a_grid_best= mlp_a_grid.best_score_
print(mlp_a_grid)
print(mlp_a_grid.best_score_)
In [ ]:
mlp_a_grid_best= mlp_a_grid.best_score_
SVC¶
In [ ]:
svc = SVC()  # training is very slow
# From what I looked up, SVC becomes very slow on datasets with more than 20,000 samples,
# so the training data for SVC will be reduced.
X_train_m = X_train.loc[:10000]
Y_train_m = Y_train.loc[:10000]
Applying a scaler¶
In [ ]:
# Apply a scaler
# Use MinMaxScaler to bring the pixel values into the 0-1 range
scaler = MinMaxScaler()
# Fit the scaler on the training pixels only, then apply the same transform to the test pixels
x_scaled = scaler.fit_transform(train.iloc[:, 1:].astype(np.float32))
x_test_scaled = scaler.transform(test.iloc[:, 1:].astype(np.float32))
y_scaled = train.iloc[:, 0].values
# Even with scaling the training takes too long, so reduce to the first ~10,000 rows
x_scaled = pd.DataFrame(x_scaled).loc[:10000]
y_scaled = pd.DataFrame(y_scaled).loc[:10000]
x_scaled
In [ ]:
svc = SVC()
# gamma has no effect with a linear kernel, so only C is searched here
svc_param_grid = {'kernel': ['linear'],
                  'C': [0.01, 0.1, 1, 10, 50, 100, 200, 300]}
svc_grid = GridSearchCV(svc, param_grid=svc_param_grid, scoring="accuracy", n_jobs=-1, verbose=1)
In [ ]:
svc_grid.fit(x_scaled,y_scaled)
svc_best = svc_grid.best_estimator_
print(svc_grid.best_score_)
print(svc_grid.best_params_)
In [ ]:
svc = SVC()
# For the rbf kernel both gamma and C matter, so search over both
svc_param_grid = {'kernel': ['rbf'],
                  'gamma': [0.001, 0.01, 0.1, 0.5, 1],
                  'C': [0.01, 0.1, 1, 10, 50, 100, 200, 300]}
svc_grid = GridSearchCV(svc, param_grid=svc_param_grid, scoring="accuracy", n_jobs=-1, verbose=1)
In [ ]:
svc_grid.fit(x_scaled,y_scaled)
svc_best = svc_grid.best_estimator_
print(svc_grid.best_score_)
print(svc_grid.best_params_)
Checking the GridSearchCV results¶
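A minimal sketch (not part of the original run) that gathers the best cross-validation accuracies stored by the searches above; it assumes the grid-search objects from the previous cells (rf_grid, gb_grid, mlp_grid, mlp_a_grid, svc_grid) are still in memory, and note that these are cross-validation scores on the reduced training subsets, not test-set accuracy.
In [ ]:
# Collect the best cross-validation accuracy of each tuned model for a side-by-side comparison
results = {
    'RandomForest': rf_grid.best_score_,
    'GradientBoosting': gb_grid.best_score_,
    'MLP (sgd)': mlp_grid.best_score_,
    'MLP (adam)': mlp_a_grid.best_score_,
    'SVC (rbf)': svc_grid.best_score_,   # svc_grid currently holds the rbf search
}
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.4f}")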
'개발 > sk infosec cloud ai 전문가 양성과정' 카테고리의 다른 글
[pandas를 활용한 데이터분석]SK infosec 클라우드 AI 전문가 양성과정 실습과제 (0) | 2020.09.08 |
---|---|
[pandas를 활용한 데이터분석]SK infosec 클라우드 AI 전문가 양성과정 수업필기본 (0) | 2020.09.08 |
[PYTHON데이터분석 2020/09/07-2] SK infosec 클라우드 AI 전문가 양성과정 수업필기본 (0) | 2020.09.07 |
[PYTHON데이터분석 2020/09/07-1] SK infosec 클라우드 AI 전문가 양성과정 수업필기본 (0) | 2020.09.07 |
[PYTHON데이터분석 2020/09/01] SK infosec 클라우드 AI 전문가 양성과정 수업필기본 (0) | 2020.09.07 |