728x90

차원 축소:

원본 데이터의 특성을 적은 수의 새로운 특성으로 변환하는 비지도 학습. 저장 공간을 줄이고 시각화 하기 좋음, 다른 알고리즘의 성능 향상

주성분 분석(Principal Component Analysis : PCA):

데이터에서 가장 분산이 큰 방향을 찾는 방법.( 이 방향을 주성분이라 한다) 일반적으로 주성분은 원본 데이터에 있는 특성 개수보다 작다.

설명된 분산:

주성분 분석에서 주성분이 얼마나 원본 데이터의 분산을 잘 나타내는지 기록한 것. 사이킷런의 PCA class는 주성분 개수나 설명된 분산의 비율을 지정하여 주성분 분석을 수행할 수 있다.

* 이미지를 예로 들면 모든 픽셀이 특성이 되는데 이를 모두 기록하기는 어렵다. 이들 특성 중 주성분을 분석하여 분산에 맞게 저장하는 방법을 학습한다. 또한 이렇게 저장된 내용으로 다시 원본 데이터에 가깝게 복구가 가능하다.

* 분산이 크다는 것은 data가 가장 멀리 퍼진 방향을 나타낸다. 점들이 직선에 가깝게 퍼져 있다면 이러한 직선이 data의 특성을 잘 나타낸다고 볼 수 있다.

PCA Class

!wget https://bit.ly/fruits_300_data -O fruits_300.npy
import numpy as np

fruits = np.load('fruits_300.npy')
fruits_2d = fruits.reshape(-1, 100*100)

from sklearn.decomposition import PCA

pca = PCA(n_components=50)
pca.fit(fruits_2d)

PCA(n_components=50)

PCA가 찾은 주성분이 componenctes_에 들어가 있다.

print(pca.components_.shape)

(50, 10000)

배열의 차원이 50이다.(주성분 50)

import matplotlib.pyplot as plt

def draw_fruits(arr, ratio=1):
    n = len(arr)    # n은 샘플 개수입니다
    # 한 줄에 10개씩 이미지를 그립니다. 샘플 개수를 10으로 나누어 전체 행 개수를 계산합니다. 
    rows = int(np.ceil(n/10))
    # 행이 1개 이면 열 개수는 샘플 개수입니다. 그렇지 않으면 10개입니다.
    cols = n if rows < 2 else 10
    fig, axs = plt.subplots(rows, cols, 
                            figsize=(cols*ratio, rows*ratio), squeeze=False)
    for i in range(rows):
        for j in range(cols):
            if i*10 + j < n:    # n 개까지만 그립니다.
                axs[i, j].imshow(arr[i*10 + j], cmap='gray_r')
            axs[i, j].axis('off')
    plt.show()
draw_fruits(pca.components_.reshape(-1, 100, 100))

주성분을 50으로 잡았으니 특성의 개수를 10,000에서 50으로 줄일 수 있다.

transform()을 이용하여 원본 데이터의 차원을 50으로 줄일 수 있다.

print(fruits_2d.shape)
fruits_pca = pca.transform(fruits_2d)
print(fruits_pca.shape)

(300, 10000)

(300, 50)

원본 데이터 재구성

fruits_inverse = pca.inverse_transform(fruits_pca)
print(fruits_inverse.shape)
fruits_reconstruct = fruits_inverse.reshape(-1, 100, 100)

for start in [0, 100, 200]:
    draw_fruits(fruits_reconstruct[start:start+100])
    print("\n")

줄어든 특성을 이용하여 다시 원본 데이터를 복구하였는데 거의 비슷한 결과를 볼 수 있다.

미분 후 적분한 것처럼 손실은 있다.

설명된 분산

print(np.sum(pca.explained_variance_ratio_))
plt.plot(pca.explained_variance_ratio_)
plt.show()

0.9214965686302707

50개의 주성분이 있으나 처음 10개의 주성분이 대부분의 분산을 표현하고 있다.

다른 알고리즘과 함께 사용하기

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

target = np.array([0] * 100 + [1] * 100 + [2] * 100)

from sklearn.model_selection import cross_validate

scores = cross_validate(lr, fruits_2d, target)
print(np.mean(scores['test_score']))
print(np.mean(scores['fit_time']))

0.9966666666666667 : test score

1.5750249862670898 : fit time

scores = cross_validate(lr, fruits_pca, target)
print(np.mean(scores['test_score']))
print(np.mean(scores['fit_time']))

1.0 : 50개의 주성분으로 분석했을 때 정확도가 100%로 나왔다

0.025466299057006835 : fit time 역시 엄청 줄었다.

pca = PCA(n_components=0.5)
pca.fit(fruits_2d)

PCA(n_components=0.5)

처음에는 주성분의 개수를 50개로 지정했는데 분산의 50%가 채워질 때까지 주성분의 개수를 늘리도록 변경하였다.

이렇게 변경한 다음 주성분의 개수가 몇개나 필요한 지 확인해보자.

print(pca.n_components_)

2

두개의 주성분이면 충분하다고 한다.

fruits_pca = pca.transform(fruits_2d)
print(fruits_pca.shape)

(300, 2)

scores = cross_validate(lr, fruits_pca, target)
print(np.mean(scores['test_score']))
print(np.mean(scores['fit_time']))

0.9933333333333334 : test score

0.036731719970703125 : fit time

특성을 두개만 사용해도 99.3%의 정확도를 나타낸다.

from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, random_state=42)
km.fit(fruits_pca)

KMeans(n_clusters=3, random_state=42)

print(np.unique(km.labels_, return_counts=True))

(array([0, 1, 2], dtype=int32), array([110, 99, 91]))

for label in range(0, 3):
    draw_fruits(fruits[km.labels_ == label])
    print("\n")

for label in range(0, 3):
    data = fruits_pca[km.labels_ == label]
    plt.scatter(data[:,0], data[:,1])
plt.legend(['apple', 'banana', 'pineapple'])
plt.show()

두개만 가지고 분류했는데 분산이 잘 이루어져 있다.

파인애플과 사과는 경계가 거의 붙어있어 혼동이 있을 수 있다.

(그런데 바나나는 왜 섞여 들어간거지?)

728x90

저작자표시 비영리 동일조건

'Programming > Machine Learning' 카테고리의 다른 글

[혼공머신] 07-2 심층 신경망 (0)	2022.04.24
[혼공머신] 07-1 인공 신경망 Cont. (0)	2022.04.09
[혼공머신] 07-1 인공 신경망 (0)	2022.04.09
[혼공머신] 용어 05장 (0)	2022.04.09
[혼공머신] 용어 04장 (0)	2022.02.28
[혼공머신] 06-2 k-평균 (0)	2022.02.20
[혼공머신] 06-1 군집 알고리즘(비지도학습) (0)	2022.02.19
[혼공머신] 05-3 트리의 앙상블 (0)	2022.02.13
[혼공머신] 05-2 교차 검증과 그리드 서치 (0)	2022.02.12
[혼공머신] 05-1 결정트리 (0)	2022.02.06

728x90

inertia = []
for k in range(2, 7):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(fruits_2d)
    inertia.append(km.inertia_)

plt.plot(range(2, 7), inertia)
plt.xlabel('k')
plt.ylabel('inertia')
plt.show()

k-평균(k-means): 처음에 랜덤하게 클러스터 중심을 정하고 클러스트를 만든다. 클러스터 중심을 이동하고 다시 클러스트 만들기를 반복하여 최적의 클러스터 구성하는 알고리즘

클러스터 중심(cluster center, centroid): k-평균 알고리즘이 만든 클러스터에 속한 샘플의 특성 평균 값. 가장 가까운 클러스터 중심을 샘플의 또 다른 특성으로 사용하거나 새로운 샘플에 대한 예측으로 활용 가능

엘보우 방법(elbow): 최적의 클러스터 개수를 정하는 방법 중 하나. 이너셔(inertia)는클러스터 중심과 샘플 사이 거리의 제곱 합. 클러스터 개수에 따라 이너셔 감소가 꺽이는 지점이 적절한 클러스터 개수 k가 될 수 있음

k-평균 알고리즘 방식

1. 무작위로 k개의 클러스터 중심을 정함

2. 각 샘플에서 가장 가까운 클러스터 중심을 찾아 해당 클러스터의 샘플로 지정

3. 클러스터에 속한 샘플의 평균값으로 클러스터 중심을 변경

4. 클러스터 중심에 변화가 없을 때까지 2~3 반복

k-평균

KMeans 클래스

!wget https://bit.ly/fruits_300_data -O fruits_300.npy

import numpy as np

fruits = np.load('fruits_300.npy')
fruits_2d = fruits.reshape(-1, 100*100)

from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, random_state=42)
km.fit(fruits_2d)

KMeans(n_clusters=3, random_state=42)

print(km.labels_)

[2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 0 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 0 0 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

print(np.unique(km.labels_, return_counts=True))

(array([0, 1, 2], dtype=int32), array([111, 98, 91]))

import matplotlib.pyplot as plt

def draw_fruits(arr, ratio=1):
    n = len(arr)    # n은 샘플 개수입니다
    # 한 줄에 10개씩 이미지를 그립니다. 샘플 개수를 10으로 나누어 전체 행 개수를 계산합니다. 
    rows = int(np.ceil(n/10))
    # 행이 1개 이면 열 개수는 샘플 개수입니다. 그렇지 않으면 10개입니다.
    cols = n if rows < 2 else 10
    fig, axs = plt.subplots(rows, cols, 
                            figsize=(cols*ratio, rows*ratio), squeeze=False)
    for i in range(rows):
        for j in range(cols):
            if i*10 + j < n:    # n 개까지만 그립니다.
                axs[i, j].imshow(arr[i*10 + j], cmap='gray_r')
            axs[i, j].axis('off')
    plt.show()

draw_fruits(fruits[km.labels_==0])

draw_fruits(fruits[km.labels_==1])

draw_fruits(fruits[km.labels_==2])

클러스터 중심

draw_fruits(km.cluster_centers_.reshape(-1, 100, 100), ratio=3)

print(km.transform(fruits_2d[100:101]))
print(km.predict(fruits_2d[100:101]))
draw_fruits(fruits[100:101])
print(km.n_iter_)

[[3393.8136117 8837.37750892 5267.70439881]]

[0]

4

최적의 k 찾기

inertia = []
for k in range(2, 7):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(fruits_2d)
    inertia.append(km.inertia_)

plt.plot(range(2, 7), inertia)
plt.xlabel('k')
plt.ylabel('inertia')
plt.show()

728x90

저작자표시 비영리 동일조건

'Programming > Machine Learning' 카테고리의 다른 글

[혼공머신] 07-1 인공 신경망 Cont. (0)	2022.04.09
[혼공머신] 07-1 인공 신경망 (0)	2022.04.09
[혼공머신] 용어 05장 (0)	2022.04.09
[혼공머신] 06-3 주성분 분석 (0)	2022.03.27
[혼공머신] 용어 04장 (0)	2022.02.28
[혼공머신] 06-1 군집 알고리즘(비지도학습) (0)	2022.02.19
[혼공머신] 05-3 트리의 앙상블 (0)	2022.02.13
[혼공머신] 05-2 교차 검증과 그리드 서치 (0)	2022.02.12
[혼공머신] 05-1 결정트리 (0)	2022.02.06
[혼공머신] 04-2 확률적 경사 하강법 (0)	2022.02.05

728x90

비지도 학습 : 훈련 데이터에 타깃이 없다. 때문에 외부의 도움 없이 스스로 유용한 무언가를 학습해야 함. 대표적인 비지도 학습은 군집, 차원축소 등

히스토그램 : 구간별로 값이 발생한 빈도를 그래프로 표시한 것. x축이 구간(계급), y축이 발생 빈도(도수)

군집 : 비슷한 샘플끼리 하나의 그룹으로 모은다. 군집으로 모은 샘플 그룹을 쿨러스터라고 함

Unsupervised Learning : target이 없어 스스로 학습해야 함

군집 알고리즘

과일 사진 데이터 준비하기

!wget https://bit.ly/fruits_300_data -O fruits_300.npy

import numpy as np
import matplotlib.pyplot as plt

fruits = np.load('fruits_300.npy')

print(fruits.shape)
print(fruits[0, 0, :])
plt.imshow(fruits[0], cmap='gray')
plt.show()
plt.imshow(fruits[0], cmap='gray_r')
plt.show()
plt.imshow(fruits[0])
plt.show()

(300, 100, 100)

[ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 2 2 1 1 1 1 1 1 1 1 2 3 2 1 2 1 1 1 1 2 1 3 2 1 3 1 4 1 2 5 5 5 19 148 192 117 28 1 1 2 1 4 1 1 3 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

fig, axs = plt.subplots(1, 2)
axs[0].imshow(fruits[100], cmap='gray_r')
axs[1].imshow(fruits[200], cmap='gray_r')
plt.show()

픽셀 값 분석하기

apple = fruits[0:100].reshape(-1, 100*100)
pineapple = fruits[100:200].reshape(-1, 100*100)
banana = fruits[200:300].reshape(-1, 100*100)

print(apple.shape)
print(apple.mean(axis=1))

(100, 10000)

[ 88.3346 97.9249 87.3709 98.3703 92.8705 82.6439 94.4244 95.5999 90.681 81.6226 87.0578 95.0745 93.8416 87.017 97.5078 87.2019 88.9827 100.9158 92.7823 100.9184 104.9854 88.674 99.5643 97.2495 94.1179 92.1935 95.1671 93.3322 102.8967 94.6695 90.5285 89.0744 97.7641 97.2938 100.7564 90.5236 100.2542 85.8452 96.4615 97.1492 90.711 102.3193 87.1629 89.8751 86.7327 86.3991 95.2865 89.1709 96.8163 91.6604 96.1065 99.6829 94.9718 87.4812 89.2596 89.5268 93.799 97.3983 87.151 97.825 103.22 94.4239 83.6657 83.5159 102.8453 87.0379 91.2742 100.4848 93.8388 90.8568 97.4616 97.5022 82.446 87.1789 96.9206 90.3135 90.565 97.6538 98.0919 93.6252 87.3867 84.7073 89.1135 86.7646 88.7301 86.643 96.7323 97.2604 81.9424 87.1687 97.2066 83.4712 95.9781 91.8096 98.4086 100.7823 101.556 100.7027 91.6098 88.8976]

100*100짜리 이미지를 10000의 1차 배열로 생각하여 처리한다.

plt.hist(np.mean(apple, axis=1), alpha=0.8)
plt.hist(np.mean(pineapple, axis=1), alpha=0.8)
plt.hist(np.mean(banana, axis=1), alpha=0.8)
plt.legend(['apple', 'pineapple', 'banana'])
plt.show()

fig, axs = plt.subplots(1, 3, figsize=(20, 5))
axs[0].bar(range(10000), np.mean(apple, axis=0))
axs[1].bar(range(10000), np.mean(pineapple, axis=0))
axs[2].bar(range(10000), np.mean(banana, axis=0))
plt.show()

apple_mean = np.mean(apple, axis=0).reshape(100, 100)
pineapple_mean = np.mean(pineapple, axis=0).reshape(100, 100)
banana_mean = np.mean(banana, axis=0).reshape(100, 100)

fig, axs = plt.subplots(1, 3, figsize=(20, 5))
axs[0].imshow(apple_mean, cmap='gray_r')
axs[1].imshow(pineapple_mean, cmap='gray_r')
axs[2].imshow(banana_mean, cmap='gray_r')
plt.show()

평균값과 가까운 사진 고르기

abs_diff = np.abs(fruits - apple_mean)
abs_mean = np.mean(abs_diff, axis=(1,2))
print(abs_mean.shape)

apple_index = np.argsort(abs_mean)[:100]
fig, axs = plt.subplots(10, 10, figsize=(10,10))
for i in range(10):
    for j in range(10):
        axs[i, j].imshow(fruits[apple_index[i*10 + j]], cmap='gray_r')
        axs[i, j].axis('off')
plt.show()

(300,)

여기서 의문은 비지도 학습인데도 불구하고 사과 이미지들의 평균을 구하고 여기에 비슷한 사과들을 추출하였다.

과연 이것을 비지도 학습이라고 볼 수 있을까?

해당 학습장 마지막에 그에 대해 똑같은 질문이 있다. 바로 다음 장에서 정말 target이 없는 경우 어떻게 하는지 알려 주겠다고 한다.

728x90

저작자표시 비영리 동일조건

'Programming > Machine Learning' 카테고리의 다른 글

[혼공머신] 07-1 인공 신경망 (0)	2022.04.09
[혼공머신] 용어 05장 (0)	2022.04.09
[혼공머신] 06-3 주성분 분석 (0)	2022.03.27
[혼공머신] 용어 04장 (0)	2022.02.28
[혼공머신] 06-2 k-평균 (0)	2022.02.20
[혼공머신] 05-3 트리의 앙상블 (0)	2022.02.13
[혼공머신] 05-2 교차 검증과 그리드 서치 (0)	2022.02.12
[혼공머신] 05-1 결정트리 (0)	2022.02.06
[혼공머신] 04-2 확률적 경사 하강법 (0)	2022.02.05
[혼공머신] 용어 03장 (0)	2022.02.03

728x90

* 앙상블 학습 : 더 좋은 예측 결과를 만들기 위해 여러 개의 모델을 훈련하는 러신머닝 알고리즘

- 정형 데이터를 다루는 데 가장 뛰어난 성과를 내고 있다. 대부분 결정 트리 기반

* 랜덤 포레스트 : 대표적인 결정 트리 기반 앙상블. bootstrap sample을 사용하고 랜덤하게 일부 특성을 선택하여 트리를 만드는 것이 특징

* 엑스트라 트리 : bootstrap sample을 만들지 않고 훈련 데이터를 사용. 대신 랜덤하게 노드를 분할해 과대적합을 감소

* 그레이디언트 부스팅 : 결정트리를 연속적으로 추가하여 손실 함수를 최소화. 훈련 속도가 느리지만 더 나은 성능을 기대할 수 있다. 이의 속도를 개선한 것이 히스토그램 기반 그레이디언트 부스팅이며 안정적인 결과와 높은 성능으로 인기

정형데이터 : 일정한 구조로 되어 있는 데이터 -> 앙상블 학습

비정형데이터 : 사진, 음악, 문학 등 일정하게 처리하기 어려운 데이터 -> 신경망 알고리즘

Random Forest

결정 트리를 랜덤하게 만들어 결정트리의 숲을 만든다.

bootstrap sample : train data에서 random으로 추출하여 sample을 만든다. 이 때 size는 train data와 같고, 중복 추출을 허용한다. -> train data가 1000개면 중복 추출 허용 방식으로 1000개를 뽑아 bootstrap sample을 만든다.

사이킷런의 random forest는 기본적으로 100개의 결정 트리를 훈련한 다음 각 트리의 클래스별 확률을 평균하여 가장 높은 확률을 가진 클래스를 예측으로 삼는다. (회귀일 때는 단순히 각 트리의 예측을 평균한다고 한다)

Extra Trees

random forest와 매우 비슷하게 동작. 기본 100개의 결정 트리를 훈련. parameter도 동일.

bootstrap sample 대신 전체 훈련 세트를 사용. 대신 노드를 분할할 때 랜덤으로 분할.(splitter='random')

성능이 떨어질 수 있으나 과대적합을 막는데 효과가 있다.

Gradient boosting

깊이 얕은 결정 트리를 사용하여 이전 트리의 오차를 보완하는 방식.

사이킷런의 GradientBoostingClassifier는 기본 깊이 3인 결정 트리 100개 사용.

깊이가 얕기 때문에 과대적합에 강함. 높은 일반화 성능 기대.

4장에서 다루었던 하강법을 사용하여 트리를 앙상블에 추가. 분류에서는 로지스틱 손실 함수를 사용하고 회귀에서는 평균 제곱 오차 함수 사용.

Histogram-based Gradient Boosting

입력 특성을 256개의 구간으로 나눈다. -> 노드를 분할할 때 최적의 분할을 빠르게 찾음

256개의 구간 중 하나를 떼어 놓고 누락된 값을 위해 사용한다. -> 누락된 특성이 있어도 이를 전처리 할 필요 없음

HistGradientBoostingClassifier는 트리 개수를 정하는데 n_estimators대신 부스팅 반복 횟수를 지정하는 max_iter를 사용. max_iter를 조정하여 성능을 조율.

사이킷런 외의 library

* XGBoost

* LightGBM

랜덤포레스트

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

wine = pd.read_csv('https://bit.ly/wine_csv_data')

data = wine[['alcohol', 'sugar', 'pH']].to_numpy()
target = wine['class'].to_numpy()

train_input, test_input, train_target, test_target = train_test_split(data, target, test_size=0.2, random_state=42)

from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_jobs=-1, random_state=42)
scores = cross_validate(rf, train_input, train_target, return_train_score=True, n_jobs=-1)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))

0.9973541965122431 0.8905151032797809

return_train_score의 기본값은 false : train socre를 같이 확인하면 과대적합을 파악하는데 용이.

훈련세트에 과대적합 되어 있다.

rf.fit(train_input, train_target)
print(rf.feature_importances_)

[0.23167441 0.50039841 0.26792718]

feature는 알코올 도수, 당도, pH순

5-1 결정트리에서는 아래와 같았다.

[0.12345626 0.86862934 0.0079144 ]
출처: https://bluelimn.tistory.com/entry/혼공머신-05-1-결정트리 [ANMIAN]

당도의 중요도가 내려가고 pH의 중요도가 높아졌다.

rf = RandomForestClassifier(oob_score=True, n_jobs=-1, random_state=42)

rf.fit(train_input, train_target)
print(rf.oob_score_)

0.8934000384837406

자체 알고리즘을 평가하는 기능이 있다. oob_score=True로 해주어야 한다.

엑스트라트리

from sklearn.ensemble import ExtraTreesClassifier

et = ExtraTreesClassifier(n_jobs=-1, random_state=42)
scores = cross_validate(et, train_input, train_target, return_train_score=True, n_jobs=-1)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))

0.9974503966084433 0.8887848893166506

et.fit(train_input, train_target)
print(et.feature_importances_)

[0.20183568 0.52242907 0.27573525]

그레이디언트 부스팅

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(random_state=42)
scores = cross_validate(gb, train_input, train_target, return_train_score=True, n_jobs=-1)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))

0.8881086892152563 0.8720430147331015

gb = GradientBoostingClassifier(n_estimators=500, learning_rate=0.2, random_state=42)
scores = cross_validate(gb, train_input, train_target, return_train_score=True, n_jobs=-1)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))

0.9464595437171814 0.8780082549788999

결정트리의 개수(default:100)를 500으로 늘려도 과대적합에 비교적 안전함

gb.fit(train_input, train_target)
print(gb.feature_importances_)

[0.15872278 0.68010884 0.16116839]

히스토그램 기반 부스팅

# 사이킷런 1.0 버전 아래에서는 다음 라인의 주석을 해제하고 실행하세요.
# from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier

hgb = HistGradientBoostingClassifier(random_state=42)
scores = cross_validate(hgb, train_input, train_target, return_train_score=True, n_jobs=-1)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))

0.9321723946453317 0.8801241948619236

from sklearn.inspection import permutation_importance

hgb.fit(train_input, train_target)
result = permutation_importance(hgb, train_input, train_target, n_repeats=10,
                                random_state=42, n_jobs=-1)
print(result.importances_mean)

[0.08876275 0.23438522 0.08027708]

result = permutation_importance(hgb, test_input, test_target, n_repeats=10,
                                random_state=42, n_jobs=-1)
print(result.importances_mean)

[0.05969231 0.20238462 0.049 ]

hgb.score(test_input, test_target)

0.8723076923076923

XGBoost

from xgboost import XGBClassifier

xgb = XGBClassifier(tree_method='hist', random_state=42)
scores = cross_validate(xgb, train_input, train_target, return_train_score=True, n_jobs=-1)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))

0.8824322471423747 0.8726214185237284

LightGBM

from lightgbm import LGBMClassifier

lgb = LGBMClassifier(random_state=42)
scores = cross_validate(lgb, train_input, train_target, return_train_score=True, n_jobs=-1)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))

0.9338079582727165 0.8789710890649293

결정트리
[0.12345626 0.86862934 0.0079144 ]
Random Forest
[0.23167441 0.50039841 0.26792718]
Extra Trees
[0.20183568 0.52242907 0.27573525]
Gradient Boosting
[0.15872278 0.68010884 0.16116839]
Histogram-based Gradient Boosting
[0.08876275 0.23438522 0.08027708]

728x90

저작자표시 비영리 동일조건

'Programming > Machine Learning' 카테고리의 다른 글

[혼공머신] 용어 05장 (0)	2022.04.09
[혼공머신] 06-3 주성분 분석 (0)	2022.03.27
[혼공머신] 용어 04장 (0)	2022.02.28
[혼공머신] 06-2 k-평균 (0)	2022.02.20
[혼공머신] 06-1 군집 알고리즘(비지도학습) (0)	2022.02.19
[혼공머신] 05-2 교차 검증과 그리드 서치 (0)	2022.02.12
[혼공머신] 05-1 결정트리 (0)	2022.02.06
[혼공머신] 04-2 확률적 경사 하강법 (0)	2022.02.05
[혼공머신] 용어 03장 (0)	2022.02.03
[혼공머신] 04-1 로지스틱 회귀 (0)	2022.01.23

ANMIAN

machine-learning

[혼공머신] 06-3 주성분 분석

차원 축소:

주성분 분석(Principal Component Analysis : PCA):

설명된 분산:

PCA Class

원본 데이터 재구성

설명된 분산

다른 알고리즘과 함께 사용하기

'Programming > Machine Learning' 카테고리의 다른 글

[혼공머신] 06-2 k-평균

k-평균

KMeans 클래스

클러스터 중심

최적의 k 찾기

'Programming > Machine Learning' 카테고리의 다른 글

[혼공머신] 06-1 군집 알고리즘(비지도학습)

군집 알고리즘

과일 사진 데이터 준비하기

픽셀 값 분석하기

평균값과 가까운 사진 고르기

'Programming > Machine Learning' 카테고리의 다른 글

[혼공머신] 05-3 트리의 앙상블

Random Forest

Extra Trees

Gradient boosting

Histogram-based Gradient Boosting

랜덤포레스트

엑스트라트리

그레이디언트 부스팅

히스토그램 기반 부스팅

XGBoost

LightGBM

'Programming > Machine Learning' 카테고리의 다른 글

+ Recent posts

티스토리툴바