PyTorch Course - 5. Dataset Splitting / Evaluation Metrics / Ensembles

2020. 11. 21. 11:43

To do

- train/test split

- differences between voting, bagging, boosting, and stacking

- understand xgboost and lightgbm

Assignments

- analyze the seaborn datasets anagrams, attention, car_crashes, diamonds, dots

- train/test split, ensembles

- compare train and test loss while changing hyperparameters

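Practice

- kaggle bike sharing demand dataset: split into train/validation, predict with a model, then save the predictions as a csv that matches the submission format

- kaggle San Francisco crime dataset: split into train/validation, predict with a model, then save a csv in the same way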

train_test_split

- splits a dataset into a training set and a test set
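
A minimal sketch of the split, using scikit-learn's train_test_split on made-up toy arrays. The array contents, test_size=0.3, and random_state=0 are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each (made up for illustration).
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 30% of the rows as the test set; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)
```

Every row ends up in exactly one of the two sets, so the test score is measured on data the model never saw during training.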

metric, score

- measures used to evaluate a model

scikit-learn.org/stable/modules/model_evaluation.html

- classification : accuracy, f1, roc auc, etc.

- clustering : mutual info score, v measure score, etc.

- regression : mean squared error, r2, etc.
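
To make the three metric families concrete, here is a small sketch with functions from sklearn.metrics. All label and prediction values are made up for illustration.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_squared_error,
    r2_score,
    roc_auc_score,
    v_measure_score,
)

# Classification: true labels, hard predictions, predicted probabilities.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
y_prob = [0.1, 0.9, 0.4, 0.2, 0.8]
acc = accuracy_score(y_true, y_pred)   # 4 of 5 correct -> 0.8
f1 = f1_score(y_true, y_pred)          # precision 1, recall 2/3 -> 0.8
auc = roc_auc_score(y_true, y_prob)    # every positive outranks every negative -> 1.0

# Clustering: the score ignores how the cluster ids are named.
v = v_measure_score([0, 0, 1, 1], [1, 1, 0, 0])

# Regression: squared error and explained variance.
y_true_r = [3.0, 5.0, 2.0]
y_pred_r = [2.5, 5.0, 3.0]
mse = mean_squared_error(y_true_r, y_pred_r)
r2 = r2_score(y_true_r, y_pred_r)

print(acc, f1, auc, v, mse, r2)
```

Note that roc_auc_score takes scores or probabilities, while accuracy and f1 take hard class predictions.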

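Ensemble overview

- many different models exist for regression, classification, and clustering

- an ensemble combines several models to obtain a better model than any single one

- voting, bagging, boosting, stacking, etc.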

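Ensemble techniques

- voting : train several different models on the same data and combine their predictions, by majority vote for classification or by averaging for regression

- bagging (bootstrap aggregating) : draw bootstrap samples from the training data, train one model per sample, and aggregate their predictions. the models sit side by side (horizontal)

- boosting : chain weak learners one after another (vertical), with each learner giving more weight to the samples the previous ones got wrong

-> adaboost, xgboost, lightgbm, etc.

- stacking : the predictions of the base models become the input of a next-stage model. it is staged like boosting, but the base models are trained independently of each other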
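
The four techniques can be tried side by side with scikit-learn's ensemble module. This is only a sketch under assumptions: the data is synthetic (make_regression), AdaBoostRegressor stands in for the boosting family, and the base models and hyperparameters are arbitrary choices, not the ones used in the course.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    AdaBoostRegressor,
    BaggingRegressor,
    StackingRegressor,
    VotingRegressor,
)
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption, not the course dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two different base models shared by voting and stacking.
base = [
    ("lr", LinearRegression()),
    ("tree", DecisionTreeRegressor(max_depth=4, random_state=0)),
]

models = {
    # voting: average the predictions of the base models
    "voting": VotingRegressor(estimators=base),
    # bagging: the same model type trained on bootstrap samples
    "bagging": BaggingRegressor(DecisionTreeRegressor(), n_estimators=20, random_state=0),
    # boosting: weak learners trained one after another
    "boosting": AdaBoostRegressor(n_estimators=50, random_state=0),
    # stacking: base predictions feed a final Ridge model
    "stacking": StackingRegressor(estimators=base, final_estimator=Ridge()),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on the held-out set
    print(f"{name}: R^2 = {scores[name]:.3f}")
```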

seaborn car_crashes (voting, bagging, xgboost)

1. Load the data

- total, speeding, alcohol, ...

- some variables such as speeding and alcohol are easy to interpret, but the meaning of others is hard to tell

2. Visualization

- plot of the data frame

- alcohol by region

- speeding by region

3. Data split and decision tree regression

- label encoding
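
A sketch of this step: label-encode the string column, split, and fit a decision tree regressor. The frame below is a synthetic stand-in that only borrows a few column names from car_crashes; its values, the region codes, and the tree depth are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the car_crashes table (all values are made up).
rng = np.random.default_rng(0)
n = 40
df = pd.DataFrame({
    "speeding": rng.uniform(1, 10, n),
    "alcohol": rng.uniform(1, 10, n),
    "abbrev": [f"S{i:02d}" for i in range(n)],  # hypothetical region codes
})
df["total"] = 2.0 * df["speeding"] + df["alcohol"] + rng.normal(0, 0.5, n)

# Label encoding: map each region string to an integer id.
encoder = LabelEncoder()
df["abbrev"] = encoder.fit_transform(df["abbrev"])

X = df[["speeding", "alcohol", "abbrev"]]
y = df["total"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A shallow tree keeps the model from memorizing the small training set.
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
train_r2 = tree.score(X_train, y_train)
test_r2 = tree.score(X_test, y_test)
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

The integer ids produced by LabelEncoder carry no real ordering, so a tree can split on them but the splits should not be read as meaningful thresholds.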

4. voting

- set up and train the individual estimators

- scores on the train and test data

- plot of the predictions on the train data

- plot of the score for each value of the n_estimators hyperparameter
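
The notes do not say which estimator n_estimators belongs to. The sketch below assumes a VotingRegressor that averages a linear model and a random forest, and varies the forest's n_estimators. The data is synthetic and the candidate values (5, 20, 50) are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption, not the course dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

voting_scores = {}
for n in (5, 20, 50):
    voter = VotingRegressor(estimators=[
        ("lr", LinearRegression()),
        ("rf", RandomForestRegressor(n_estimators=n, random_state=0)),
    ])
    voter.fit(X_train, y_train)
    # Store (train R^2, test R^2) for this setting.
    voting_scores[n] = (
        voter.score(X_train, y_train),
        voter.score(X_test, y_test),
    )

for n, (train_r2, test_r2) in voting_scores.items():
    print(f"n_estimators={n}: train={train_r2:.3f}, test={test_r2:.3f}")
```

The stored pairs can be passed straight to a line plot to reproduce a score-by-n_estimators figure.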

5. bagging - score by number of estimators
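
To show what the number of estimators changes, here is bagging written out by hand instead of through BaggingRegressor: bootstrap-sample the training rows, fit one tree per sample, and average the first k trees. The synthetic data and the k values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption, not the course dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
max_estimators = 30
test_preds = []
for _ in range(max_estimators):
    # Bootstrap: draw training rows with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeRegressor(random_state=0).fit(X_train[idx], y_train[idx])
    test_preds.append(tree.predict(X_test))

# Aggregate: average the first k trees' predictions and score each k.
bagging_scores = {
    k: r2_score(y_test, np.mean(test_preds[:k], axis=0))
    for k in (1, 5, 10, 30)
}

for k, score in bagging_scores.items():
    print(f"{k} estimators: test R^2 = {score:.3f}")
```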

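6. xgboost regression

- score by number of estimators

- feature importance plot : f0 is the most important variable

- speeding is f0 (the first feature column), so it is the most important variable for regressing total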
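
When xgboost is given plain arrays without column names, its importance plot labels features f0, f1, ... by column position, which is why f0 has to be mapped back to a name. The sketch below swaps in scikit-learn's GradientBoostingRegressor rather than xgboost so that it needs no extra package. The data is synthetic and built so that column 0 dominates, so it illustrates the index-to-name lookup and does not reproduce the result on the real data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# Column order defines the positions f0, f1, f2.
feature_names = ["speeding", "alcohol", "not_distracted"]
X = rng.uniform(0, 10, size=(300, 3))
# Synthetic target that depends mostly on column 0 (made up, not the real data).
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 300)

# Training score for several numbers of boosting stages.
scores_by_n = {}
for n in (10, 50, 100):
    model = GradientBoostingRegressor(n_estimators=n, random_state=0).fit(X, y)
    scores_by_n[n] = model.score(X, y)

# Importances of the last fitted model (100 stages); they sum to 1.
importances = model.feature_importances_
top = int(np.argmax(importances))
print(f"most important: f{top} = {feature_names[top]}")
```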

santander product recommendation

- which products will this customer use in the future?

www.kaggle.com/c/santander-product-recommendation/overview

Data description

www.kaggle.com/c/santander-product-recommendation/data
