Python Machine Learning - 12. Classifying User Activity with a Random Forest

2020. 11. 26. 21:09


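Random Forest

- One of the bagging algorithms; it is relatively fast and performs well.

- It uses decision trees as its base learners and, by combining many trees, balances the bias-variance trade-off well.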
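
The bagging idea can be sketched in a few lines: each tree trains on a bootstrap sample (rows drawn with replacement), and the forest predicts by majority vote. This is only a minimal standard-library sketch; the helper names `bootstrap_indices` and `majority_vote` and the activity labels are made up for illustration and are not part of scikit-learn.

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def majority_vote(predictions):
    """Return the label predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(100)
n_samples = 1000

# Each tree sees a different resampled view of the training rows.
sample = bootstrap_indices(n_samples, rng)
unique_fraction = len(set(sample)) / n_samples
print(f"unique rows in one bootstrap sample: {unique_fraction:.3f}")

# Hypothetical votes from five trees for a single test row.
votes = ["WALKING", "SITTING", "WALKING", "WALKING", "STANDING"]
print("forest prediction:", majority_vote(votes))
```

On average a bootstrap sample contains roughly 63% of the distinct rows, so every tree is trained on slightly different data, which is where the variance reduction comes from.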

With a random forest model, let's work on the user activity recognition dataset that we previously tackled with a decision tree.

ref : throwexception.tistory.com/1039

First, let's simply check the performance.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

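path = "./res/UCI HAR Dataset/features.txt"

# Raw string so that \s reaches the regex engine unescaped.
features_df = pd.read_csv(path, sep=r"\s+",
                         header=None, names=["column_index", "column_name"])

# features.txt repeats some names, and recent pandas versions reject duplicate
# column names in read_csv, so suffix each repeat with its occurrence count.
dup_count = features_df.groupby("column_name").cumcount()
feature_names = [name if n == 0 else f"{name}_{n}"
                 for name, n in zip(features_df["column_name"], dup_count)]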

path = "./res/UCI HAR Dataset/"

# The data files have no header row, so say so explicitly.
X_train = pd.read_csv(path + "train/X_train.txt", sep=r"\s+", header=None, names=feature_names)
X_test = pd.read_csv(path + "test/X_test.txt", sep=r"\s+", header=None, names=feature_names)

y_train = pd.read_csv(path + "train/y_train.txt", sep=r"\s+", header=None,
                      names=["action"])

y_test = pd.read_csv(path + "test/y_test.txt", sep=r"\s+", header=None,
                     names=["action"])



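rf = RandomForestClassifier(random_state=100)
# Pass the labels as a 1-D array; a one-column DataFrame triggers a DataConversionWarning.
rf.fit(X_train, y_train["action"].values)
y_pred = rf.predict(X_test)
print("rf acc : {}".format(accuracy_score(y_test, y_pred)))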

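With default settings, the random forest reaches an accuracy of 0.9015948422124194.

Let's take a look at the important features.

Perhaps because there are so many features, each one's importance is a very small value.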

ftr_importance_val = rf.feature_importances_
ftr_importances = pd.Series(ftr_importance_val, index=X_train.columns)
ftr_top30 = ftr_importances.sort_values(ascending=False)[:30]

sns.barplot(x=ftr_top30, y=ftr_top30.index)
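
The small values are expected: scikit-learn normalizes `feature_importances_` so that they sum to 1, so with hundreds of features each one gets only a small share. Here is a minimal sketch on synthetic data; the `make_classification` settings and the name `rf_demo` are arbitrary assumptions for illustration, not the HAR dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a wide dataset: 300 rows, 200 features.
X, y = make_classification(n_samples=300, n_features=200,
                           n_informative=10, random_state=100)

rf_demo = RandomForestClassifier(n_estimators=20, random_state=100)
rf_demo.fit(X, y)

importances = rf_demo.feature_importances_
print("sum of importances:", importances.sum())
print("largest importance:", importances.max())
print("uniform share (1 / n_features):", 1 / X.shape[1])
```

What matters is therefore the ranking and the relative gaps between features, not the absolute size of each value.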

To improve performance, let's look for the best parameters with GridSearchCV.

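The grid search results show that with

max_depth: 10

min_samples_leaf: 18

min_samples_split: 12

the mean cross-validation score is 0.910092, a very slight improvement over the earlier accuracy of 0.9015948422124194.

Searching further around this region should turn up even better hyperparameters.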

* Setting n_jobs=-1 on the model or on GridSearchCV makes it use all CPU cores.
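
To get a feel for why parallelism helps here: a grid search fits one model per parameter combination per fold. The sketch below counts the fits for a grid that mirrors the first one used in this post, using only the standard library (the variable names are illustrative).

```python
from itertools import product

# Grid with the same values as the first search in this post.
param_grid = {
    "min_samples_leaf": [4, 12, 18],
    "min_samples_split": [6, 12],
    "max_depth": [10, None],
}
cv_folds = 3

# Every combination of values is one candidate model.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]

# Each candidate is fit once per fold; with the default refit=True,
# GridSearchCV fits the best candidate one more time on all the data.
n_fits = len(candidates) * cv_folds + 1
print(len(candidates), "candidates ->", n_fits, "fits")
```

That comes to 12 candidates and 37 random forest fits, so spreading the work across cores with n_jobs=-1 on GridSearchCV is worthwhile.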

from sklearn.model_selection import GridSearchCV

param = {
    "min_samples_leaf": [4, 12, 18],
    "min_samples_split": [6, 12],
    "max_depth" :[10, None]
}

rf = RandomForestClassifier(random_state=100, n_jobs=1)
gs = GridSearchCV(rf, param_grid=param, cv=3, n_jobs=-1)
gs.fit(X_train, y_train["action"].values)  # 1-D labels avoid a DataConversionWarning on every fit


res_df = pd.DataFrame.from_dict(gs.cv_results_)
res_df[['param_max_depth', 'param_min_samples_leaf', 'param_min_samples_split' ,'mean_test_score']]

The table above shows that performance improved as min_samples_leaf grew, while min_samples_split made little difference whether it was 6 or 12.
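
One likely reason min_samples_split hardly matters here: a split has to leave at least min_samples_leaf rows on each side, so a node needs at least twice that many rows before it can be split at all. The helper `effective_min_split` below is a hypothetical name used only to illustrate that arithmetic.

```python
def effective_min_split(min_samples_split, min_samples_leaf):
    """Smallest node size that can still be split under both constraints.

    A split must leave at least min_samples_leaf rows on each side, so a
    node needs at least 2 * min_samples_leaf rows before it can be split.
    """
    return max(min_samples_split, 2 * min_samples_leaf)


for leaf in [4, 12, 18]:
    for split in [6, 12]:
        print(f"leaf={leaf:2d} split={split:2d} -> "
              f"effective split threshold {effective_min_split(split, leaf)}")
```

With min_samples_leaf at 12 or 18, both min_samples_split values give the same threshold (24 or 36), so they should produce identical trees; only at min_samples_leaf=4 can the two settings differ.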

max_depth, however, scored better at 10 than at None with the other parameters held equal.

So I dropped min_samples_split, added n_estimators instead, and reran the search with the parameters adjusted as follows.

param = {
    "min_samples_leaf": [18, 25, 30],
    "max_depth" :[10, 15],
    "n_estimators" :[50, 100, 200]
}

rf = RandomForestClassifier(random_state=100, n_jobs=1)
gs = GridSearchCV(rf, param_grid=param, cv=3, n_jobs=-1)
gs.fit(X_train, y_train["action"].values)  # 1-D labels avoid a DataConversionWarning on every fit


res_df = pd.DataFrame.from_dict(gs.cv_results_)
res_df[['param_min_samples_leaf','param_max_depth', 'param_n_estimators' ,'mean_test_score']]

The best result came with

min_samples_leaf: 25

max_depth: 10

n_estimators: 100

where the top score was 0.917982.

That is about 0.008 better than the previous 0.910092.

Overall, the good scores clustered around n_estimators = 100,

and the best min_samples_leaf seems to shift a little depending on the other parameters.

For max_depth, the deeper setting performed worse.
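
Instead of reading the best row off the table by eye, the fitted search object can report it directly. The sketch below uses a small synthetic dataset and a toy grid (both are assumptions chosen so it runs quickly) to show `best_params_`, `best_score_`, and sorting `cv_results_` by rank.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset so the sketch runs quickly (not the HAR data).
X, y = make_classification(n_samples=200, n_features=20, random_state=100)

demo_grid = {"max_depth": [3, 5], "n_estimators": [10, 20]}
gs_demo = GridSearchCV(RandomForestClassifier(random_state=100),
                       param_grid=demo_grid, cv=3)
gs_demo.fit(X, y)

# Best combination and its mean cross-validated accuracy.
print(gs_demo.best_params_, gs_demo.best_score_)

# Full table, best candidates first.
res_demo = pd.DataFrame(gs_demo.cv_results_).sort_values("rank_test_score")
print(res_demo[["params", "mean_test_score", "rank_test_score"]])
```

The same three lines at the end should work on the `gs` object from this post once it has been fitted.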

I adjusted the hyperparameters once more as follows and ran the search again.

param = {
    "min_samples_leaf": [21, 24, 27],
    "max_depth" :[4, 6, 8],
    "n_estimators" :[80, 100, 120]
}

rf = RandomForestClassifier(random_state=100, n_jobs=1)
gs = GridSearchCV(rf, param_grid=param, cv=3, n_jobs=-1)
gs.fit(X_train, y_train["action"].values)  # 1-D labels avoid a DataConversionWarning on every fit


res_df = pd.DataFrame.from_dict(gs.cv_results_)
res_df[['param_min_samples_leaf','param_max_depth', 'param_n_estimators' ,'mean_test_score']]

The top score was 0.914, roughly 0.003 to 0.004 lower than before.

This time I added the maximum number of features (max_features) to the grid.

param = {
    "min_samples_leaf": [25, 27],
    "max_depth" :[9, 10, 11],
    "n_estimators" :[100],
    # "auto" meant "sqrt" for classifiers and was removed in scikit-learn 1.3.
    "max_features": [200, "sqrt"]
}

rf = RandomForestClassifier(random_state=100, n_jobs=1)
gs = GridSearchCV(rf, param_grid=param, cv=3, n_jobs=-1)
gs.fit(X_train, y_train["action"].values)  # 1-D labels avoid a DataConversionWarning on every fit


res_df = pd.DataFrame.from_dict(gs.cv_results_)
res_df[['param_min_samples_leaf','param_max_depth', 'param_max_features' ,'mean_test_score']]
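
For context on the two max_features settings: for a classifier, the square-root setting (which is what "auto" meant in older scikit-learn versions) considers about the square root of the feature count at each split. The sketch below assumes the 561 features of the UCI HAR dataset, a number not stated in this post.

```python
import math

n_features = 561  # assumption: feature count of the UCI HAR dataset

# For a classifier, max_features="sqrt" considers about sqrt(n_features)
# candidate features at each split.
sqrt_features = max(1, int(math.sqrt(n_features)))

print("sqrt setting:", sqrt_features, "features per split")
print("max_features=200:", 200, "features per split")
```

So the grid compares 23 candidate features per split against 200. The larger setting makes each split more expensive to compute and the individual trees more alike.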

As before, the combination

min_samples_leaf: 25

max_depth: 10

n_estimators: 100

gave the best score, 0.917982.

Compared with the 0.86425 obtained earlier with a decision tree, the random forest model improved classification performance considerably.
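
One caveat when comparing these numbers: the grid search scores are mean cross-validation scores computed on the training set, while the first 0.9016 figure is accuracy on the test set. For a like-for-like comparison, the tuned model would also need to be scored on the test set. Here is a minimal sketch of that step on synthetic data; the dataset, the split, and the toy grid are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for a train/test split (not the HAR data).
X, y = make_classification(n_samples=400, n_features=20, random_state=100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=100)

gs_demo = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=100),
    param_grid={"max_depth": [3, None], "min_samples_leaf": [1, 5]},
    cv=3,
)
gs_demo.fit(X_tr, y_tr)

# best_estimator_ is refit on the full training split (refit=True by default).
best_rf = gs_demo.best_estimator_
test_acc = accuracy_score(y_te, best_rf.predict(X_te))

print("mean CV score :", gs_demo.best_score_)
print("test accuracy :", test_acc)
```

Applied to this post, that would mean calling `gs.best_estimator_.predict(X_test)` and passing the result to `accuracy_score` together with `y_test`.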
