Fetching Naver movie ratings¶

Fields to try fetching:
- Title
- Rating
- Author
- Date written

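Data types in linear algebra
- Scalar: a single value
- Vector: a 1-D sequence of values
- Matrix: a 2-D table of values
- Tensor: anything of order 3 or higher
- Vectors and matrices are the ones used most often.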

NumPy offers a large number of functions, and there are plenty of learning resources for it.

Axes of multidimensional arrays
- Very important. Be sure to remember this!

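y corresponds to the columns.
With 3-D data a further dimension, z, is added behind the first two, and that is what makes the data three-dimensional.
When handling matrix data in NumPy, the result depends on the axis you choose: whether you rearrange the data around an axis, sum it, or compute along the x axis. 2-D data only has axis 0 and axis 1. From 3-D on a depth appears, and a third axis (z, axis 2) can be used.
In the slide figure there are six bundles in total: each green dotted box is a 1-D array, and three sets of them make up the 3-D array. NumPy builds this structure easily. The real question is which axis to use when transforming the values.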
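
A small sketch of how the axis number relates to the shape. The array `cube` and its (3, 2, 4) shape are made up for illustration and are not taken from the lecture slides. Summing over an axis collapses it, so that axis disappears from the resulting shape.

```python
import numpy as np

# Illustrative 3-D array: 3 blocks, each with 2 rows and 4 columns
cube = np.arange(24).reshape(3, 2, 4)

print(cube.ndim)    # number of axes
print(cube.shape)   # sizes along axis 0, axis 1, axis 2

# Summing over an axis removes that axis from the shape
shape_after_axis0 = cube.sum(axis=0).shape
shape_after_axis1 = cube.sum(axis=1).shape
shape_after_axis2 = cube.sum(axis=2).shape
print(shape_after_axis0, shape_after_axis1, shape_after_axis2)
```

Using three different sizes makes it easy to see which axis was collapsed in each case.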

NumPy practice¶

import numpy as np

# Purpose: NumPy arrays support matrix (element-wise) operations, unlike plain lists
my_arr = np.arange(100)
print(my_arr)
my_list = list(range(100))
print(my_list)

data = np.random.randn(2,3)
print(data*10)
print(data.shape)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
 96 97 98 99]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
[[ 13.86811227   4.72760426  10.31796135]
 [ -4.98163013 -10.21773601  -2.70206206]]
(2, 3)

data2 = [[1,2,3,4,5],[11,12,13,14,15]]
arr2 = np.array(data2)
print(arr2)

print(arr2.shape,',',arr2.ndim)

# Note: despite the name, this is a 2-D array of shape (4, 3), as the output shows
arr3d = np.array([[1,2,3],[4,5,6],[11,12,13],[14,15,16]])
print(arr3d)

[[ 1  2  3  4  5]
 [11 12 13 14 15]]
(2, 5) , 2
[[ 1  2  3]
 [ 4  5  6]
 [11 12 13]
 [14 15 16]]

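NumPy array operation functions¶

SUM
MEAN
and many others are available.

Key point
1. axis=0 : operates down the rows, giving one result per column
2. axis=1 : operates across the columns, giving one result per row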

NumPy examples¶

import numpy as np

matrix = np.arange(1,7).reshape(2,3)
print(matrix)

# shape describes the layout of an array; reshape gives the same data a new layout
# [1,2,3,4,5,6] -> [[1 2 3]
#                  [4 5 6]]

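# Total of all elements
print("sum (all):",matrix.sum(axis = None))
# axis=0: add down the rows, one result per column
print("sum, axis=0: ",matrix.sum(axis = 0))
# axis=1: add across the columns, one result per row
print("sum, axis=1: ",matrix.sum(axis = 1))

[[1 2 3]
 [4 5 6]]
sum (all): 21
sum, axis=0:  [5 7 9]
sum, axis=1:  [ 6 15]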

# 3-D data: tensor

tensor = np.arange(1,19).reshape(3,2,3)
print(tensor)

print("sum (all):",tensor.sum(axis = None))
print("sum, axis=0: ",tensor.sum(axis = 0))   # adds the 3 blocks together
print("sum, axis=1: ",tensor.sum(axis = 1))   # adds the 2 rows inside each block
print("sum, axis=2: ",tensor.sum(axis = 2))   # adds the 3 columns inside each row

[[[ 1  2  3]
  [ 4  5  6]]

 [[ 7  8  9]
  [10 11 12]]

 [[13 14 15]
  [16 17 18]]]
sum (all): 171
sum, axis=0:  [[21 24 27]
 [30 33 36]]
sum, axis=1:  [[ 5  7  9]
 [17 19 21]
 [29 31 33]]
sum, axis=2:  [[ 6 15]
 [24 33]
 [42 51]]

3-D data
- Sketch the diagram carefully to see which axis is which.
- Review the slides (ppt) closely.

matrix = np.arange(1,7).reshape(2,3)
print(matrix.mean()) # mean of all elements: 3.5 (21 is the sum, not the mean)
print(matrix.mean(axis = 0)) # mean of each column: [2.5 3.5 4.5]
print(matrix.mean(axis =1)) # mean of each row: [2. 5.]
matrix

3.5
[2.5 3.5 4.5]
[2. 5.]

array([[1, 2, 3],
       [4, 5, 6]])

Stacking vertically and horizontally¶

vstack : vertical stack (adds rows)
hstack : horizontal stack (adds columns)

import numpy as np
# Vertical stack (vstack)
print("=======vstack=============")
matrix1 = np.array([1,2,3]) # 1-D, shape (3,); vstack treats it as one row
matrix2 = np.array([4,5,6]) # 1-D, shape (3,)
print(matrix1)
print(matrix2)

print(np.vstack((matrix1, matrix2)))


# Horizontal stack (hstack)
print("=======hstack=============")
matrix1 = np.array([1,2,3]).reshape(3,1)
matrix2 = np.array([4,5,6]).reshape(3,1)
print(matrix1)
print(matrix2)
print(np.hstack((matrix1, matrix2)))

=======vstack=============
[1 2 3]
[4 5 6]
[[1 2 3]
 [4 5 6]]
=======hstack=============
[[1]
 [2]
 [3]]
[[4]
 [5]
 [6]]
[[1 4]
 [2 5]
 [3 6]]

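Concatenating arrays¶

concatenate (vertical, horizontal)
- For 2-D arrays, axis=0 joins vertically and axis=1 joins horizontally. 1-D arrays only have axis 0 and are joined end to end.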

import numpy as np

print("======= concatenate > axis=0=============")
matrix1 = np.array([1,2,3]) # 1-D, shape (3,)
matrix2 = np.array([4,5,6]) # 1-D, shape (3,)
print(np.concatenate((matrix1, matrix2), axis=0))
print("======= concatenate > axis=1=============")
matrix1 = np.array([1,2,3]).reshape(3,1)
matrix2 = np.array([4,5,6]).reshape(3,1)
print(np.concatenate((matrix1, matrix2), axis = 1))

======= concatenate > axis=0=============
[1 2 3 4 5 6]
======= concatenate > axis=1=============
[[1 4]
 [2 5]
 [3 6]]
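
As a quick check, assuming two small 2-D arrays (`top` and `bottom` are made-up names), `np.concatenate` along axis 0 and axis 1 should give the same result as `np.vstack` and `np.hstack`.

```python
import numpy as np

top = np.array([[1, 2, 3]])      # shape (1, 3)
bottom = np.array([[4, 5, 6]])   # shape (1, 3)

stacked_rows = np.concatenate((top, bottom), axis=0)   # joins vertically
stacked_cols = np.concatenate((top, bottom), axis=1)   # joins horizontally

# Compare with the dedicated stacking helpers
same_as_vstack = np.array_equal(stacked_rows, np.vstack((top, bottom)))
same_as_hstack = np.array_equal(stacked_cols, np.hstack((top, bottom)))
print(stacked_rows.shape, stacked_cols.shape, same_as_vstack, same_as_hstack)
```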

Problem 1

operation functions

import numpy as np

arr = np.arange(31)
# np.float was removed in NumPy 1.24; the built-in float gives the same float64 dtype
arr_by_teacher = np.arange(1,31, dtype = float).reshape(5,6)
matrix = np.array(arr[1:31])
matrix = matrix.reshape(5,6)
print("Build the matrix \n",arr_by_teacher)

# Maximum of all elements
print("Max", np.max(matrix))
# Sum of each row
print("Sum of each row", matrix.sum(axis = 1))
# Mean of each column
print("Mean of each column (average)", np.average((matrix),axis = 0))
print("Mean of each column (mean)",matrix.mean(axis = 0))

Build the matrix 
 [[ 1.  2.  3.  4.  5.  6.]
 [ 7.  8.  9. 10. 11. 12.]
 [13. 14. 15. 16. 17. 18.]
 [19. 20. 21. 22. 23. 24.]
 [25. 26. 27. 28. 29. 30.]]
Max 30
Sum of each row [ 21  57  93 129 165]
Mean of each column (average) [13. 14. 15. 16. 17. 18.]
Mean of each column (mean) [13. 14. 15. 16. 17. 18.]

Problem 2-(1)

import numpy as np

a = np.array([[1,2,3],[4,5,6]]).reshape(2,3)
b = np.array([[10],[20]])

print(np.hstack((a,b)))

[[ 1  2  3 10]
 [ 4  5  6 20]]

Problem 2-(2)

import numpy as np

a = np.array([[1,2,3,],[4,5,6,]])
b = np.array([10,20,30])

print(np.vstack((a,b)))

[[ 1  2  3]
 [ 4  5  6]
 [10 20 30]]

Operations between arrays¶

dot
- matrix multiplication

import numpy as np

a = np.arange(1,7).reshape(2,3)
b = np.arange(11,17).reshape(2,3)

# print(a.dot(b))  # raises ValueError: (2,3) and (2,3) cannot be multiplied, so one side is transposed below
print(a.T)
print(a.dot(a.T))

[[1 4]
 [2 5]
 [3 6]]
[[14 32]
 [32 77]]
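
The shape rule behind `dot` can be sketched as follows: an (m, n) array can only be multiplied by an (n, k) array, giving (m, k). This example reuses the same `a` and `b` as above. The `@` operator is Python's standard matrix-multiplication operator and behaves like `dot` for 2-D arrays.

```python
import numpy as np

a = np.arange(1, 7).reshape(2, 3)     # shape (2, 3)
b = np.arange(11, 17).reshape(2, 3)   # shape (2, 3)

try:
    a.dot(b)                          # inner sizes 3 and 2 differ
    dot_failed = False
except ValueError:
    dot_failed = True

product = a @ b.T                     # (2, 3) times (3, 2) gives (2, 2)
print(dot_failed, product.shape)
print(np.array_equal(product, a.dot(b.T)))
```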

scalar = 10
tensor = np.array(
    [ [
       [1,2,3],[4,5,6]
    ],[
       [7,8,9],[10,11,12]
    ],[
       [13,14,15],[16,17,18]
    ]
     
    ]
)

scalar + tensor

array([[[11, 12, 13],
        [14, 15, 16]],

       [[17, 18, 19],
        [20, 21, 22]],

       [[23, 24, 25],
        [26, 27, 28]]])

Problem 3-(1)

import numpy as np

matrix1 = np.arange(1,13).reshape(3,4)
matrix2 = np.arange(12,0,-1).reshape(3,4)

print(matrix1.dot(matrix2.T))

# T is the transpose. It turns the (3,4) matrix into (4,3) so that the product is defined.

[[100  60  20]
 [268 164  60]
 [436 268 100]]

import numpy as np

vector = np.array([10,20,30])
print(vector*scalar*2/10)

[20. 40. 60.]

import numpy as np

matrix = np.array([10])

import numpy as np

vector = np.array([10,20,30])
matrix = np.array([[1],[2],[3]])
print(vector+matrix)

[[11 21 31]
 [12 22 32]
 [13 23 33]]

import numpy as np
matrix = np.array([
                   [10,20,30],
                   [40,50,60],
                   [70,80,90]
])
vector = np.array([[1],[2],[3]])
print(matrix+vector)
print(vector+matrix)

[[11 21 31]
 [42 52 62]
 [73 83 93]]
[[11 21 31]
 [42 52 62]
 [73 83 93]]
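
The additions above rely on NumPy broadcasting: a dimension of size 1 (or a missing leading dimension) is stretched to match the other operand. A minimal sketch, using `np.broadcast_to` only to make the implicit stretching visible; the names `row` and `col` are illustrative.

```python
import numpy as np

row = np.array([10, 20, 30])       # shape (3,)
col = np.array([[1], [2], [3]])    # shape (3, 1)

# What broadcasting does implicitly: view both operands as (3, 3)
row_expanded = np.broadcast_to(row, (3, 3))
col_expanded = np.broadcast_to(col, (3, 3))

result = row + col
print(result.shape)
print(np.array_equal(result, row_expanded + col_expanded))
```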

pandas¶

import pandas as pd


series1 = pd.Series(data =[25,5,5,15,2])
print(series1)
print("="*15)
series2 = pd.Series(data =[25,5,5,15,2], index = ["서울","대전","광주","부산","제주"])
print(series2)

0    25
1     5
2     5
3    15
4     2
dtype: int64
===============
서울    25
대전     5
광주     5
부산    15
제주     2
dtype: int64

data_frame = pd.DataFrame(
    data = [
            {"구":25,"국번":"02"},
            {"구":5,"국번":"042"},
            {"구":5,"국번":"062"},
            {"구":15,"국번":"051"},
            {"구":2,"국번":"064"}
    ], index= ["서울","대전","광주","부산","제주"]
)
data_frame

`.index` is a property. Think of it as the place where the row labels (the index) are specified.
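
A short sketch of reading and using the index, assuming a Series shaped like `series2` above (the name `cities` is made up). `.loc` looks up by label and `.iloc` by position.

```python
import pandas as pd

cities = pd.Series([25, 5, 5, 15, 2],
                   index=["서울", "대전", "광주", "부산", "제주"])

labels = list(cities.index)        # the row labels
seoul_value = cities.loc["서울"]    # lookup by label
first_value = cities.iloc[0]       # lookup by position
print(labels, seoul_value, first_value)
```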

Example 1¶

s1 = pd.Series(['100','90','80'], index = ['kor','math','eng'])
s1

s2 = pd.Series(['80','90','100'], index = ['kor','math','eng'])
s2

s3 = pd.Series(['90','100','80'], index = ['kor','math','eng'])
s3

kor      90
math    100
eng      80
dtype: object

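# Pass a dict of Series so that each Series becomes one column
# (a list holding one dict would put whole Series objects into a single row)
data_frame = pd.DataFrame(data={"1번":s1, "2번":s2,"3번":s3 })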
data_frame

import pandas as pd
import numpy as np
import random

s1 = pd.Series(np.random.randint(10, size = 5))
print(s1)

s2 = pd.Series(data = [random.randint(1,11) for n in range(5)], dtype = float)
print(s2)

# Four labels need four values, and dtype belongs outside the list:
# s3 = pd.Series(data = [random.randint(20,40) for n in range(4)], dtype = float, index = ["A형","B형","O형","AB형"])
# print(s3)

0    4
1    8
2    6
3    8
4    1
dtype: int64
0    2.0
1    5.0
2    1.0
3    3.0
4    7.0
dtype: float64

Problem 3¶

df = pd.DataFrame(data = [
    {"번호":1,"이름":'a',"점수":100,"나이":15},
    {"번호":2,"이름":'b',"점수":90,"나이":16},
    {"번호":3,"이름":'c',"점수":80,"나이":15},
    {"번호":4,"이름":'d',"점수":70,"나이":14}]
)
df

df[['번호','점수']] # the double brackets pass a list of column names as one argument, which selects several columns at once
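
A minimal sketch of the difference between single and double brackets, using a cut-down version of the frame above (the variable names are illustrative).

```python
import pandas as pd

df = pd.DataFrame({"번호": [1, 2], "점수": [100, 90], "나이": [15, 16]})

one_column = df["점수"]             # a single label returns a Series
sub_frame = df[["번호", "점수"]]     # a list of labels returns a DataFrame
single_as_frame = df[["점수"]]      # a one-element list still returns a DataFrame

print(type(one_column).__name__,
      type(sub_frame).__name__,
      type(single_as_frame).__name__)
```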

Building a DataFrame from lists¶

data = [
        ['kim',20,'designer'],
        ['lee',21,'programmer'],
        ['park',22,'dba']
]
column_name = ["name","age","job"]
pd.DataFrame(data, columns= column_name)

from collections import OrderedDict

data = OrderedDict(
    [
            ['name',['kim','lee','park']],
            ['age',[20,21,22]],
            ['job',['designer','designer','dba']]
    ])
# column_name = ["name","age","job"]
pd.DataFrame(data)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   번호      4 non-null      int64 
 1   이름      4 non-null      object
 2   점수      4 non-null      int64 
 3   나이      4 non-null      int64 
dtypes: int64(3), object(1)
memory usage: 256.0+ bytes

print(df.head())
print(df.tail(3))
print(df.index)
print(df.columns)
print(df.values)
print(df.describe())

   번호 이름   점수  나이
0   1  a  100  15
1   2  b   90  16
2   3  c   80  15
3   4  d   70  14
   번호 이름  점수  나이
1   2  b  90  16
2   3  c  80  15
3   4  d  70  14
RangeIndex(start=0, stop=4, step=1)
Index(['번호', '이름', '점수', '나이'], dtype='object')
[[1 'a' 100 15]
 [2 'b' 90 16]
 [3 'c' 80 15]
 [4 'd' 70 14]]
             번호          점수         나이
count  4.000000    4.000000   4.000000
mean   2.500000   85.000000  15.000000
std    1.290994   12.909944   0.816497
min    1.000000   70.000000  14.000000
25%    1.750000   77.500000  14.750000
50%    2.500000   85.000000  15.000000
75%    3.250000   92.500000  15.250000
max    4.000000  100.000000  16.000000

Adding rows/columns¶

product_list = [
                {'name':'MOUSE','price':100,'company':'A'},
                {'name':'SSD','price':200,'company':'B'},
                {'name':'CPU','price':300,'company':'C'}
]
df = pd.DataFrame(product_list , columns=['name','price','company'])
df

df['expires'] = 5
df

series = pd.Series([10,2,2], index=[0,1,2])
df['count'] = series
df

df['total'] = df['price']*df['count']
df

print(df.info())
print(df.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   name     3 non-null      object
 1   price    3 non-null      int64 
 2   company  3 non-null      object
 3   expires  3 non-null      int64 
 4   count    3 non-null      int64 
 5   total    3 non-null      int64 
dtypes: int64(4), object(2)
memory usage: 272.0+ bytes
None
       price  expires      count        total
count    3.0      3.0   3.000000     3.000000
mean   200.0      5.0   4.666667   666.666667
std    100.0      0.0   4.618802   305.505046
min    100.0      5.0   2.000000   400.000000
25%    150.0      5.0   2.000000   500.000000
50%    200.0      5.0   2.000000   600.000000
75%    250.0      5.0   6.000000   800.000000
max    300.0      5.0  10.000000  1000.000000

discount = []

for row in df['total']:
    if row>=1000:
        discount.append('10%')
    elif row >=500:
        discount.append('5%')
    else:
        discount.append('0%')

df['discount'] = discount
df

# User-defined function (apply)

def is_wireless(row):
    if row == 'MOUSE':
        return True
    else:
        return False

df['wireless'] = df.name.apply(is_wireless) # df.name refers to the column called name in df
df

created_date = ['2004-08-09','2009-08-31','2020-09-01']
df['created_date'] = created_date
df

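Problem: add only the year as the last column.
apply can be called on df.column_name (attribute access) or on df['column_name'] (bracket access).
The difference is whether the column name can be supplied dynamically or not.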

def extract_year(row):
    year = row.split('-')[0]
    return year

df['year'] = df['created_date'].apply(extract_year)
df
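
The two access styles can be sketched as follows, assuming a small frame with the same `created_date` column. Bracket access also works when the column name is held in a variable. The `.str` accessor is shown as a vectorized alternative to `apply`; it is not part of the original notes.

```python
import pandas as pd

df = pd.DataFrame({"created_date": ["2004-08-09", "2009-08-31", "2020-09-01"]})

def extract_year(value):
    # "2004-08-09" becomes "2004"
    return value.split("-")[0]

by_attribute = df.created_date.apply(extract_year)        # fixed name, attribute style
column_name = "created_date"                              # name chosen at run time
by_bracket = df[column_name].apply(extract_year)          # bracket style
by_str_accessor = df[column_name].str.split("-").str[0]   # vectorized alternative

print(by_bracket.tolist())
```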

using numpy¶

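* What is the ternary operator?

a = 10
b = 10
result = a == b and a-b or a+b    # result is a+b = 20: a-b is 0 (falsy), so `or` falls through to a+b
# Python's own conditional expression is: a-b if a == b else a+b   (which gives 0 here)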

import numpy as np

df['price'] = np.where(df['year']<'2010', df['price'] * 0.95, df['price'])
# np.where works like a ternary operator: where the first argument (the condition) holds, the second is used; otherwise the third
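
A minimal sketch comparing `np.where` with Python's conditional expression, using made-up prices and years in a hypothetical `items` frame. It also shows why the `and`/`or` idiom is fragile: it falls through to the last operand whenever the middle value is falsy.

```python
import numpy as np
import pandas as pd

items = pd.DataFrame({"price": [100.0, 200.0, 300.0],
                      "year": ["2004", "2009", "2020"]})

# Element-wise: where the condition holds take the 2nd argument, else the 3rd
discounted = np.where(items["year"] < "2010", items["price"] * 0.95, items["price"])
print(discounted)

# Scalar case: and/or idiom versus the conditional expression
a, b = 10, 10
with_and_or = a == b and a - b or a + b         # a - b is 0 (falsy), so a + b is returned
with_ternary = (a - b) if a == b else (a + b)   # returns a - b as intended
print(with_and_or, with_ternary)
```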

drop¶

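# To delete a column, specify its exact name.

df.drop([2])    # drops the row with index label 2; returns a new DataFrame, df itself is unchanged

df.drop('year', axis =1)
# The default axis=0 (index) must be changed to axis=1,
# i.e. drop by column instead of by index.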

df.drop(['year','discount'], axis =1, inplace = True)
df
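
A short sketch of how `drop` behaves with and without `inplace`, on a made-up three-row frame. The `columns=` keyword is a standard alternative to `axis=1`.

```python
import pandas as pd

df = pd.DataFrame({"name": ["MOUSE", "SSD", "CPU"],
                   "price": [100, 200, 300],
                   "year": ["2004", "2009", "2020"]})

without_row = df.drop([2])               # drops the row labeled 2; df is unchanged
without_col = df.drop("year", axis=1)    # drops a column; df is unchanged
same_as_kw = df.drop(columns=["year"])   # keyword form of the same column drop

print(df.shape, without_row.shape, without_col.shape)

df.drop(["year"], axis=1, inplace=True)  # modifies df itself and returns None
print(list(df.columns))
```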

Querying data¶

df[df['wireless'] ==False]

df[df['total']>500]
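
The filters above work by building a boolean mask and indexing with it. A minimal sketch on made-up data, also showing how two conditions might be combined; each condition needs its own parentheses because `&` binds tighter than the comparison.

```python
import pandas as pd

df = pd.DataFrame({"name": ["MOUSE", "SSD", "CPU"],
                   "total": [1000, 400, 600],
                   "wireless": [True, False, False]})

mask = df["total"] > 500             # boolean Series, one value per row
print(mask.tolist())

# Combine conditions with & (and) or | (or); ~ negates a boolean Series
wired_and_large = df[(~df["wireless"]) & (df["total"] > 500)]
print(wired_and_large["name"].tolist())
```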

pandas04¶

ex1¶

import pandas as pd

df = pd.DataFrame([
    {'번호':1, '이름':'a', '국어':100, '나이':15},
    {'번호':2, '이름':'b', '국어':90, '나이':16},
    {'번호':3, '이름':'c', '국어':80, '나이':15},
    {'번호':4, '이름':'d', '국어':70, '나이':14},
], columns=['번호', '이름', '국어', '나이'])

# 1. Add English scores

df['영어'] =[100,80,70,60]
df

# .copy() gives df1 its own data; without it, adding columns to df1 below
# triggers pandas' SettingWithCopyWarning
df1 = df[['번호','이름','국어','영어','나이']].copy()
df1

df1['합격여부'] = (df['국어']+df['영어']) >150
df1

def change_value(value):
    return '합격' if value else '불합격'

df1['합격여부'] = df1.합격여부.apply(change_value)

df
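
pandas can emit a SettingWithCopyWarning when a column is assigned on a frame that was itself selected from another frame. A minimal sketch of the usual remedy, assuming the same score columns: take an explicit `.copy()` and then check that the original frame is untouched. `map` with a dict is used here as an alternative to `apply`; the name `subset` is made up.

```python
import pandas as pd

df = pd.DataFrame({"국어": [100, 90, 80, 70], "영어": [100, 80, 70, 60]})

subset = df[["국어", "영어"]].copy()            # independent copy, safe to modify
subset["합격여부"] = (subset["국어"] + subset["영어"]) > 150
subset["합격여부"] = subset["합격여부"].map({True: "합격", False: "불합격"})

print(subset["합격여부"].tolist())
print("합격여부" in df.columns)                 # the original frame has no new column
```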

ex2¶

df = pd.DataFrame([
    {'매장명':'대남포차', '주메뉴':'문어', '가격':40000, '휴무일':'매주 일요일'},
    {'매장명':'재민국밥', '주메뉴':'돼지국밥', '가격':6000, '휴무일':'연중무휴'},
    {'매장명':'구서칼국수', '주메뉴':'칼국수', '가격':4000, '휴무일':'연중무휴'},
    {'매장명':'해진아나고', '주메뉴':'아나고', '가격':60000, '휴무일':'2/4주 일요일'},
], columns=['매장명', '주메뉴', '가격', '휴무일'])
df

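df1 = df    # note: this is another name for the same DataFrame, not a copy (use df.copy() for an independent frame)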
df1['고가/저가'] = np.where(df1['가격'] >10000, '고가','저가')
df1

df2 = df    # again an alias of df, not a copy
# Approach using an explicit loop
price_list = []

for row in df2['가격']:
    if row <= 10000:
        price_list.append('저가')
    else:
        price_list.append('고가')

df2['고가/저가'] = price_list
df2

def input_price(value):
    return '고가' if value >10000 else '저가'

df2['고가/저가'] = df2.가격.apply(input_price)
df2

# lambda
f = lambda x: "고가" if x>10000  else '저가'

df2['고가/저가'] = df2.가격.apply(f)
df2

# Problem 4

def input_kind(value):
    if value in ['문어','아나고']:
        return "해산물"
    else:
        return "기타"
df2['종류'] = df2.주메뉴.apply(input_kind)
df2
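
The same two labeling tasks can also be sketched without a Python-level function, assuming the menu data above: `Series.isin` tests membership in a list and `np.where` picks the label. The frame name `menu` and the list `seafood` are illustrative.

```python
import numpy as np
import pandas as pd

menu = pd.DataFrame({"주메뉴": ["문어", "돼지국밥", "칼국수", "아나고"],
                     "가격": [40000, 6000, 4000, 60000]})

seafood = ["문어", "아나고"]
# isin gives a boolean Series; np.where turns it into labels
menu["종류"] = np.where(menu["주메뉴"].isin(seafood), "해산물", "기타")
menu["고가/저가"] = np.where(menu["가격"] > 10000, "고가", "저가")
print(menu)
```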

Print 합격 (pass) or 불합격 (fail) instead of True/False

import pandas as pd

df = pd.DataFrame([
    {'번호':1, '이름':'a', '국어':100, '나이':15},
    {'번호':2, '이름':'b', '국어':90, '나이':16},
    {'번호':3, '이름':'c', '국어':80, '나이':15},
    {'번호':4, '이름':'d', '국어':70, '나이':14},
], columns=['번호', '이름', '국어', '나이'])


[Python Data Analysis 2020/09/01] SK infosec Cloud AI Expert Training Course class notes


	구	국번
서울	25	02
대전	5	042
광주	5	062
부산	15	051
제주	2	064

	name	age	job
0	kim	20	designer
1	lee	21	programmer
2	park	22	dba

	name	age	job
0	kim	20	designer
1	lee	21	designer
2	park	22	dba

	name	price	company
0	MOUSE	100	A
1	SSD	200	B
2	CPU	300	C
