[Code Analysis Study] Porto Seguro's Safe Driver Prediction: Reproducing the Kaggle Kernels

by 머랭 버핏 2021. 9. 16. 18:30

Reference kernel URLs

https://www.kaggle.com/bertcarremans/data-preparation-exploration

Data Preparation & Exploration

https://www.kaggle.com/aharless/xgboost-cv-lb-284

XGBoost CV (LB .284)

Model structure

Meta data creation
Imbalanced data adjustment
Missing-value handling
Feature Engineering
- Categorical variable encoding
  - Mean Encoding
  - One-hot Encoding
- PolynomialFeatures
Feature Selection
- VarianceThreshold
- SelectFromModel
Feature Scaling
Modeling

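<Competition overview>

Porto Seguro is a Brazilian auto insurance company.

<Competition goal>

Predict the probability that a driver will file an insurance claim in the next year.

<Data characteristics>

The test set is larger than the training set.
Missing values are encoded as -1.
The meaning of each feature is not disclosed.
target = 0 : no claim filed / target = 1 : claim filed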

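<Evaluation>

Predictions are scored with the Gini coefficient.

Gini coefficient : an index mainly used to quantify economic inequality

The raw coefficient ranges from about 0 (random guessing) to about 0.5 (perfect ranking); the competition scores the normalized version, whose maximum is 1.
The larger the value, the better the predictions.

In the figure, the Gini coefficient is A/(A+B), where A is the area between the diagonal and the curve.

Normalized Gini = (inequality of the actual values when ordered by the predictions) / (inequality of the actual values when ordered by themselves)

<Why the Gini coefficient is used>

The data are imbalanced, so the predicted labels change with whichever classification threshold is chosen. Scoring by area depends only on how the predictions rank the records, which removes the need to pick a threshold.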

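Meta Data

Distinctive point of this kernel : it builds meta data

The data set has 58 columns, which is a lot to keep track of in a machine learning workflow, so tabulating each variable's role, level, and data type makes the variables easier to manage.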
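A minimal sketch of such a meta data table, assuming the competition's naming convention (`_bin` suffix for binary, `_cat` suffix for categorical variables). The helper name `build_meta` and the toy data frame are made up for illustration; the kernel's own code may differ in detail.

```python
import pandas as pd

def build_meta(df):
    """Return one row per column describing its role, level and dtype."""
    rows = []
    for col in df.columns:
        # Role: what the column is used for
        if col == "target":
            role = "target"
        elif col == "id":
            role = "id"
        else:
            role = "input"
        # Level: measurement level, guessed from the name suffix and dtype
        if col.endswith("_bin") or col == "target":
            level = "binary"
        elif col.endswith("_cat") or col == "id":
            level = "nominal"
        elif pd.api.types.is_float_dtype(df[col]):
            level = "interval"
        else:
            level = "ordinal"
        rows.append({"varname": col, "role": role, "level": level,
                     "keep": col != "id", "dtype": df[col].dtype})
    return pd.DataFrame(rows).set_index("varname")

# Toy data frame standing in for the real training data (assumption)
toy = pd.DataFrame({
    "id": [1, 2], "target": [0, 1],
    "ps_ind_02_cat": [1, 2], "ps_ind_06_bin": [0, 1],
    "ps_reg_03": [0.7, 1.1], "ps_car_11": [2, 3],
})
meta = build_meta(toy)
nominal_vars = meta[(meta.level == "nominal") & (meta.keep)].index
```

With the table in place, a group of variables can be selected by filtering on `level` and `keep` instead of typing column names by hand.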

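Adjusting Imbalanced Data

Target mean 0.0365 : only about 3.65% of the records have target = 1, so the data are heavily imbalanced

-> Possible remedies

oversample the target = 1 records
undersample the target = 0 records

Undersampling rate

$$ {(1-\alpha)\,\beta_1 \over \beta_2\,\alpha} $$

Here α is the desired share of target = 1 records after resampling (desired_apriori), β_1 is the number of target = 1 records (nb_1), and β_2 is the number of target = 0 records (nb_0).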

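from sklearn.utils import shuffle

desired_apriori = 0.1  # desired share of target=1 after undersampling

# Indices of the records for each target value
idx_0 = train[train.target == 0].index
idx_1 = train[train.target == 1].index

# Original number of records for each target value
nb_0 = len(train.loc[idx_0])
nb_1 = len(train.loc[idx_1])

# Undersampling rate and the resulting number of target=0 records
undersampling_rate = ((1-desired_apriori)*nb_1)/(nb_0*desired_apriori)
undersampled_nb_0 = int(undersampling_rate*nb_0)
print('Rate to undersample records with target=0 : {}'.format(undersampling_rate))
print('Number of records with target=0 after undersampling : {}'.format(undersampled_nb_0))

# Randomly choose which target=0 records to keep
undersampled_idx = shuffle(idx_0, random_state = 37,
                            n_samples =  undersampled_nb_0)

# Kept target=0 indices plus all target=1 indices
idx_list = list(undersampled_idx) + list(idx_1)

# Undersampled data frame
train = train.loc[idx_list].reset_index(drop=True)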

Rate to undersample records with target=0 : 0.34043569687437886

Number of records with target=0 after undersampling : 195246
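As a quick arithmetic check of the printed output, the sketch below recovers the number of target = 1 records from the formula and confirms that the resampled data reach the desired 10% prior. Only `desired_apriori` and the printed count of 195246 come from the text above; the variable `prior_after` is introduced here for illustration.

```python
desired_apriori = 0.1
undersampled_nb_0 = 195246  # printed above

# From rate * nb_0 = (1 - a) * nb_1 / a, the number of target=1 records follows
nb_1 = round(undersampled_nb_0 * desired_apriori / (1 - desired_apriori))

# Share of target=1 in the undersampled training set
prior_after = nb_1 / (nb_1 + undersampled_nb_0)
print(nb_1, prior_after)
```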

Imputation

This kernel uses simple imputation (replacement by the mean or the mode)

Missing values per variable (after undersampling)

Variable ps_ind_02_cat has 103 records (0.05%) with missing values
Variable ps_ind_04_cat has 51 records (0.02%) with missing values
Variable ps_ind_05_cat has 2256 records (1.04%) with missing values
Variable ps_reg_03 has 38580 records (17.78%) with missing values
Variable ps_car_01_cat has 62 records (0.03%) with missing values
Variable ps_car_02_cat has 2 records (0.00%) with missing values
Variable ps_car_03_cat has 148367 records (68.39%) with missing values
Variable ps_car_05_cat has 96026 records (44.26%) with missing values
Variable ps_car_07_cat has 4431 records (2.04%) with missing values
Variable ps_car_09_cat has 230 records (0.11%) with missing values
Variable ps_car_11 has 1 records (0.00%) with missing values
Variable ps_car_14 has 15726 records (7.25%) with missing values
In total, there are 12 variables with missing values

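ps_car_03_cat, ps_car_05_cat : missing values make up a large share of the records (68.39% and 44.26%) -> drop the variables
ps_reg_03 : about 18% missing -> replace with the mean
ps_car_12 & ps_car_14 : replace with the mean
ps_car_11 : an ordinal variable (not the same as ps_car_11_cat), so replace with the mode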
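The imputation rules above can be sketched with scikit-learn's `SimpleImputer`, treating -1 as the missing-value marker. The toy data frame is made up, and using `SimpleImputer` is an assumption of this sketch (the kernel may have used an older imputer class).

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the training data; -1 marks a missing value
train = pd.DataFrame({
    "ps_car_03_cat": [-1, -1, 1, -1],
    "ps_car_05_cat": [-1, 0, -1, 1],
    "ps_reg_03": [0.5, -1.0, 1.5, 1.0],
    "ps_car_14": [0.2, 0.4, -1.0, 0.6],
    "ps_car_11": [3, 3, -1, 2],
})

# Drop the variables dominated by missing values
train = train.drop(columns=["ps_car_03_cat", "ps_car_05_cat"])

# Mean for the continuous variables, mode for the ordinal one
mean_imp = SimpleImputer(missing_values=-1, strategy="mean")
mode_imp = SimpleImputer(missing_values=-1, strategy="most_frequent")
mean_cols = ["ps_reg_03", "ps_car_14"]
train[mean_cols] = mean_imp.fit_transform(train[mean_cols])
train[["ps_car_11"]] = mode_imp.fit_transform(train[["ps_car_11"]])
```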

Categorical Variable Encoding

The kernel applies mean encoding to ps_car_11_cat, which has many distinct values, and one-hot encoding (dummy variables) to the remaining categorical variables.
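One-hot encoding of the remaining categorical variables can be sketched with `pandas.get_dummies`. The toy columns are made up, and `drop_first=True` (dropping one dummy per variable to avoid redundancy) is an assumption of this sketch.

```python
import pandas as pd

# Toy stand-in for the training data
demo = pd.DataFrame({
    "ps_car_02_cat": [0, 1, 1],
    "ps_ind_04_cat": [0, 1, 0],
    "ps_reg_01": [0.7, 0.8, 0.0],
})

# Encode every column whose name marks it as categorical
cat_cols = [c for c in demo.columns if c.endswith("_cat")]
encoded = pd.get_dummies(demo, columns=cat_cols, drop_first=True)
print(list(encoded.columns))
```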

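What is mean encoding?

<Goal>

Give each label of a categorical variable (here ps_car_11_cat, which has 104 categories) a value that carries information about the 0/1 target

For each label, take the mean of the target over the records with that label; the result expresses how strongly the label is associated with the target

<Problems>

1. The encoding is computed from the target itself, so it can lead to overfitting -> this kernel mitigates this by adding noise

2. When the train and test distributions differ (e.g. one of them is imbalanced), overfitting can occur as well -> mitigated with smoothing

Smoothing formula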
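One common form of smoothing, assumed here because it matches the `min_samples_leaf` and `smoothing` arguments passed to `target_encode` in the modeling code, blends the overall target mean (the prior) with the per-category mean using a sigmoid weight on the category count: weight = 1 / (1 + exp(-(count - min_samples_leaf) / smoothing)), encoded value = prior * (1 - weight) + category mean * weight. The helper name `smoothed_target_mean` and the toy data are illustrative only.

```python
import numpy as np
import pandas as pd

def smoothed_target_mean(categories, target, min_samples_leaf=200, smoothing=10):
    """Per-category target mean, shrunk toward the prior for rare categories."""
    prior = target.mean()
    stats = target.groupby(categories).agg(["mean", "count"])
    # Weight approaches 1 for frequent categories and 0 for rare ones
    weight = 1.0 / (1.0 + np.exp(-(stats["count"] - min_samples_leaf) / smoothing))
    return prior * (1.0 - weight) + stats["mean"] * weight

# Toy example: category "a" appears three times, "b" only once
cats = pd.Series(["a", "a", "a", "b"])
y_toy = pd.Series([1, 1, 0, 1])
enc = smoothed_target_mean(cats, y_toy, min_samples_leaf=2, smoothing=1)
print(enc)
```

The rare category "b" ends up closer to the prior (0.75) than to its own mean (1.0), which is the intended effect of the smoothing.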

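Feature selection

Variables with zero or very low variance are better removed (a variable whose values barely change contributes little to prediction)
VarianceThreshold removes the features whose variance falls below a given threshold.
SelectFromModel can also be used to choose which variables to keep.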
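A minimal sketch of `VarianceThreshold` on made-up data; the threshold of 0.01 is an assumption chosen for the example.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Column 0 is constant, column 2 barely varies, column 1 varies clearly
X_toy = np.array([[0.0, 1.0, 0.1],
                  [0.0, 2.0, 0.1],
                  [0.0, 3.0, 0.2]])

selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X_toy)
kept = selector.get_support()  # boolean mask of the columns that survive
print(kept)
```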

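This kernel uses SelectFromModel with the median as the threshold, keeping only the features whose importance (from the random forest fitted here) is at or above the median, i.e. roughly the top 50%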

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_train = train.drop(['id','target'],axis=1)
y_train = train['target']

feat_labels = X_train.columns

# Fit a random forest to obtain feature importances
rf = RandomForestClassifier(n_estimators=1000, random_state=0, n_jobs=-1)
rf.fit(X_train,y_train)
importances = rf.feature_importances_

indices = np.argsort(importances)[::-1] # feature indices sorted by importance, descending

# Print rank, feature name and importance
for f in range(X_train.shape[1]) :
  print("%2d) %-*s %f" % (f+1, 30, feat_labels[indices[f]],importances[indices[f]]))
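The selection step itself can be sketched as follows, with synthetic data and a small forest standing in for the real training set (both are assumptions of this sketch). `threshold="median"` keeps the features whose importance is at or above the median importance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=4, random_state=0)
rf_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Reuse the already fitted forest and keep features at or above the median importance
sfm = SelectFromModel(rf_demo, threshold="median", prefit=True)
X_selected = sfm.transform(X_demo)
mask = sfm.get_support()
print(X_demo.shape[1], "features before,", X_selected.shape[1], "after")
```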

Feature Scaling

StandardScaler is used (each feature is rescaled to zero mean and unit variance)
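A minimal sketch of the scaling step on made-up numbers; in the kernel the scaler would be applied to the training features rather than to `X_raw`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales
X_raw = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)  # subtract the mean, divide by the std
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```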

Modeling

- Module Numba

A module that speeds up code execution (it affects run time, not the model's predictive quality)

<Official documentation>

"Numba makes Python code fast.

Numba is an open source JIT compiler that translates a subset of Python and NumPy code into fast machine code."

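-> A JIT compiler that makes Python and NumPy code run faster

The Gini coefficient used for evaluation is not available as a built-in Python function, so a function is written to compute it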

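import numpy as np
from numba import jit

@jit
def eval_gini(y_true, y_prob) :
  y_true = np.asarray(y_true) # convert to an array type numba understands
  y_true = y_true[np.argsort(y_prob)] # order the labels by predicted probability, ascending
  ntrue= 0 # positives seen so far
  gini =0  # positives ranked below a negative (misordered pairs)
  delta = 0 # negatives seen so far
  n=len(y_true)
  for i in range(n-1,-1,-1) : # walk from the highest prediction down
    y_i = y_true[i]
    ntrue += y_i
    gini += y_i * delta
    delta += 1-y_i
  gini = 1-2*gini/(ntrue*(n-ntrue)) # normalize by the number of positive/negative pairs
  return gini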
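For a binary target, the value computed by this loop equals 2 * AUC - 1 when there are no tied predictions. The sketch below checks that on a four-record example, using a plain-Python copy of the loop (named `eval_gini_plain` here, without the numba decorator) and scikit-learn's `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def eval_gini_plain(y_true, y_prob):
    """Same logic as eval_gini, without numba."""
    y_true = np.asarray(y_true)
    y_sorted = y_true[np.argsort(y_prob)]
    ntrue, misordered, negatives_seen = 0, 0, 0
    for y_i in y_sorted[::-1]:            # highest prediction first
        ntrue += y_i
        misordered += y_i * negatives_seen
        negatives_seen += 1 - y_i
    n = len(y_sorted)
    return 1 - 2 * misordered / (ntrue * (n - ntrue))

y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
gini_loop = eval_gini_plain(y_true, y_prob)
gini_auc = 2 * roc_auc_score(y_true, y_prob) - 1
print(gini_loop, gini_auc)
```

Both values are 0.5 here: one of the four positive/negative pairs is ranked in the wrong order, so AUC is 0.75.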

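XGBClassifier, the classifier used in this kernel, treats its evaluation metric as an error measure such as rmse -> that is, it looks for the smallest value

The Gini coefficient, the competition's metric, is better the larger it is, so a wrapper function that puts a minus sign in front of it is defined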

def gini_xgb(preds, dtrain) :
  labels = dtrain.get_label()
  gini_score = -eval_gini(labels,preds)
  return [('gini',gini_score)]

<Modeling approach>

KFold cross-validation is run with XGBClassifier

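import numpy as np
from sklearn.model_selection import KFold
from xgboost import XGBClassifier

# MAX_ROUNDS and LEARNING_RATE are constants defined earlier in the kernel
K = 5
kf = KFold(n_splits= K, random_state=1, shuffle = True)
np.random.seed(0)

model = XGBClassifier(
    n_estimators = MAX_ROUNDS, 
    max_depth = 4,
    objective = 'binary:logistic', # the target is binary, so state the learning objective explicitly
    learning_rate = LEARNING_RATE,
    subsample = .8, # use 80% of the rows to build each tree
    min_child_weight = 6,
    colsample_bytree = .8, # share of columns sampled for each tree
    scale_pos_weight = 1.6,
    gamma = 10,
    reg_alpha=10,
    reg_lambda=1.3)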

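# Defined earlier in the kernel: X, y, test, f_cats, target_encode,
# OPTIMIZE_ROUNDS, EARLY_STOPPING_ROUNDS, y_valid_pred (one slot per training row)
# and y_test_pred (initialised to 0).
# Passing eval_metric / early_stopping_rounds to fit() and reading best_ntree_limit
# follow the older XGBoost API; recent releases may need these adapted.
for i, (train_index, test_index) in enumerate(kf.split(train)) :

  # Split into training and validation folds
  y_train,y_valid = y.iloc[train_index].copy(), y.iloc[test_index]
  X_train,X_valid = X.iloc[train_index,:].copy(), X.iloc[test_index, :].copy()
  X_test = test.copy()
  print("\nFold", i )

  # Mean-encode the categorical features using this fold's training targets only
  for f in f_cats :
    X_train[f + "_avg"], X_valid[f+'_avg'], X_test[f+"_avg"] = target_encode(
      trn_series= X_train[f],
      val_series=X_valid[f],
      tst_series=X_test[f],
      target=y_train,
      min_samples_leaf=200,
      smoothing=10,
      noise_level=0)
  if OPTIMIZE_ROUNDS : 
    eval_set = [(X_valid,y_valid)]
    fit_model = model.fit(X_train,y_train, eval_set=eval_set,
                          eval_metric=gini_xgb,
                          early_stopping_rounds = EARLY_STOPPING_ROUNDS,
                          verbose=False)
    print("Best N trees = ", model.best_ntree_limit)
    print("Best gini = ",model.best_score)
  else :
    fit_model = model.fit(X_train,y_train)

  pred = fit_model.predict_proba(X_valid)[:,1] # column with the probability of class 1
  print("Gini = ", eval_gini(y_valid,pred))
  y_valid_pred.iloc[test_index] = pred # store out-of-fold predictions

  y_test_pred += fit_model.predict_proba(X_test)[:,1] # accumulate test predictions

  del X_test, X_train, X_valid, y_train

y_test_pred /= K # average the test predictions over the K folds

print("\nGini for full training set : ")
print(eval_gini(y,y_valid_pred))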
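The pattern used in the loop above (out-of-fold predictions for the training rows, test predictions averaged over the folds) can be sketched in a self-contained way. To keep the example light, scikit-learn's `LogisticRegression` and synthetic data are swapped in for XGBClassifier and the competition data; both are assumptions of this sketch, not what the kernel uses.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split

# Synthetic stand-in for the train and test sets
X_all, y_all = make_classification(n_samples=300, n_features=8, random_state=0)
X, X_test, y, _ = train_test_split(X_all, y_all, test_size=0.25, random_state=0)

K = 5
kf = KFold(n_splits=K, shuffle=True, random_state=1)
oof_pred = np.zeros(len(X))        # one out-of-fold prediction per training row
test_pred = np.zeros(len(X_test))  # running average over the K fold models

for train_idx, valid_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # Each training row is predicted by the one model that never saw it
    oof_pred[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
    # Every fold model contributes 1/K of the final test prediction
    test_pred += model.predict_proba(X_test)[:, 1] / K
```

Scoring `oof_pred` against `y` gives a cross-validated estimate over the whole training set, which is what the final Gini print in the kernel reports.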
