Author: Jeon Bomin (13th cohort)
Reference kernel: https://www.kaggle.com/mostafaalaa123/simple-house-prediction/notebook#Outliers-!
# Data Analysis
import numpy as np
import pandas as pd
import random
# Statistics
from scipy.stats import norm
from scipy import stats
# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Preprocessing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
# ML
from sklearn.metrics import mean_squared_error, mean_squared_log_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import xgboost as xg
# Another
import warnings
warnings.filterwarnings('ignore')
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
print(train_data.head())
print('-'*20)
print(train_data.info())
We will organize this large set of variables in the following two ways.
-> Split the variables by their number of missing values
full = pd.DataFrame()
medium = pd.DataFrame()
remove_me = pd.DataFrame()
features = train_data.columns.values
number_of_houses = 1460 # or train_data.shape[0]
for feature in features:
    if train_data[feature].count() == number_of_houses:
        full[feature] = train_data[feature]
    elif train_data[feature].count() > number_of_houses*0.5:  # i.e. the feature is more than 50% non-null
        medium[feature] = train_data[feature]
    else:
        remove_me[feature] = train_data[feature]
-> Split the variables into numerical and categorical
Tip) select_dtypes lets you select the columns of a desired data type
Tip) describe(include=['O']) shows summary statistics for the object (categorical) variables
Numerical
print('Number of numerical features: ', end='')
print(len(train_data.select_dtypes(include=['number']).columns.values))
train_data.describe(exclude=['O'])
Categorical
print('Number of categorical features: ', end='')
print(len(train_data.select_dtypes(include=['O']).columns.values))
train_data.describe(include=['O'])
Now we drop the following variables: (1) the Id column, (2) the remove_me columns, which have too many missing values, and (3) numerical columns dominated by zeros.
Tip) df.loc[condition, column] extracts the rows that satisfy the condition
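A minimal illustration of the df.loc[condition, column] pattern (the 200,000 threshold here is only for illustration):
# living area of the houses that sold for more than 200,000
expensive_living_area = train_data.loc[train_data['SalePrice'] > 200000, 'GrLivArea']
print(expensive_living_area.head())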
#1
train_data = train_data.drop(['Id'], axis=1)
#2
train_data = train_data.drop(remove_me.columns.values, axis=1)
#3
#First let's create the important data we will use
numerical_data = train_data.select_dtypes(include=['number'])
categorical_data = train_data.select_dtypes(include=['object'])
# for each feature, compute the ratio of zero values out of the 1460 houses;
# features where this ratio is too high (above 0.3) are dropped below
feature_zero_ratio = {feature:numerical_data.loc[numerical_data[feature]==0, feature].count() / 1460 for feature in numerical_data.columns.values}
feature_zero_ratio
Remove the variables whose zero ratio exceeds 0.3
for feature in numerical_data:
    if feature_zero_ratio[feature] > 0.3:
        numerical_data = numerical_data.drop([feature], axis=1)
        train_data = train_data.drop([feature], axis=1)
        if feature in medium:
            medium = medium.drop([feature], axis=1)
Check the correlations between the numerical variables and the target with seaborn's heatmap
corrmat = numerical_data.corr()
fig, ax = plt.subplots(figsize=(12,12))
sns.set(font_scale=1.25)
sns.heatmap(corrmat, vmax=.8, annot=True, square=True, annot_kws={'size':8}, fmt='.2f')
plt.show()
Redraw the heatmap with only the 10 variables most strongly correlated with the target variable 'SalePrice'.
Tip) pandas.DataFrame.nlargest(n, columns) : Return the first n rows ordered by columns in descending order
It does the same thing as df.sort_values(columns, ascending=False).head(n)!
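A quick check of that equivalence (purely illustrative):
# both expressions pick the same 10 features
by_nlargest = corrmat.nlargest(10, 'SalePrice')['SalePrice'].index
by_sort = corrmat.sort_values('SalePrice', ascending=False).head(10).index
print((by_nlargest == by_sort).all())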
n = 10
most_largest_features = corrmat.nlargest(n, 'SalePrice')['SalePrice'].index
zoomed_corrmat = np.corrcoef(numerical_data[most_largest_features].values.T)
fig, ax = plt.subplots(figsize=(6,6))
sns.set(font_scale=1)
sns.heatmap(zoomed_corrmat, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=most_largest_features.values, xticklabels=most_largest_features.values)
print(most_largest_features.values)
Look at the relationships among the 7 variables most correlated with the target using scatter plots.
sns.set()
most_largest_features = corrmat.nlargest(7, 'SalePrice')['SalePrice'].index
sns.pairplot(numerical_data[most_largest_features.values], height=1.5)
plt.show()
Where two predictors are strongly linearly related to each other, keep the one more correlated with the target and drop the other.
e.g. GrLivArea and 1stFlrSF --> drop 1stFlrSF
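That choice can be read directly off the correlation matrix computed above; for the GrLivArea / 1stFlrSF pair, for example:
# correlation of the pair with each other, and of each member with the target
print(corrmat.loc['GrLivArea', '1stFlrSF'])
print(corrmat.loc[['GrLivArea', '1stFlrSF'], 'SalePrice'])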
numerical_data = numerical_data.drop(['1stFlrSF', 'TotalBsmtSF', 'GarageArea', 'GarageYrBlt'],axis=1)
train_data = train_data.drop(['1stFlrSF', 'TotalBsmtSF', 'GarageArea', 'GarageYrBlt'],axis=1)
Remove the 'neutral' variables whose correlation with the target lies in [-0.1, 0.2]
corr_with_price = numerical_data.corr()
corr_with_price = corr_with_price.sort_values(by='SalePrice', ascending=False)
corr_with_price['SalePrice']
numerical_data = numerical_data.drop(['MSSubClass', 'OverallCond', 'YrSold', 'MoSold', 'BedroomAbvGr'],axis=1)
train_data = train_data.drop(['MSSubClass', 'OverallCond', 'YrSold', 'MoSold', 'BedroomAbvGr'],axis=1)
Next, handle the missing values in two groups:
-> numerical variables
-> categorical variables
Check the missing values in the numerical variables
# select the numerical columns that still contain missing values
numerical_have_missing = train_data.select_dtypes(include=['number'])
numerical_have_missing = numerical_have_missing[numerical_have_missing.columns[numerical_have_missing.isnull().any()]]
print(numerical_have_missing.columns.values)
print('-'*30)
print(numerical_have_missing.info())
sns.histplot(numerical_have_missing['LotFrontage'])
plt.title('LotFrontage')
plt.show()
Replace the missing values with random values between 60 and 80.
Tip) A list comprehension turns a loop into a single line of code
old_LotFrontage = list(numerical_have_missing['LotFrontage'].values)
missing_indices = list(numerical_have_missing.loc[numerical_have_missing['LotFrontage'].isnull(), 'LotFrontage'].index)
random_values = [random.randint(60,80) for _ in range(1460 - numerical_have_missing['LotFrontage'].count())]
random_values_idx = 0
for missing_idx in missing_indices:
    old_LotFrontage[missing_idx] = random_values[random_values_idx]
    random_values_idx += 1
numerical_have_missing['LotFrontage'] = pd.Series(old_LotFrontage)
train_data['LotFrontage'] = pd.Series(old_LotFrontage)
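Following the tip above, the same imputation could also be written as a single list comprehension (an equivalent alternative; the column has already been filled by the loop above, so this is only an illustration):
# keep each value if present, otherwise draw a random value in [60, 80]
filled_LotFrontage = [v if pd.notnull(v) else random.randint(60, 80) for v in numerical_have_missing['LotFrontage']]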
Check the missing values in the categorical variables
# select the categorical columns that still contain missing values
categorical_have_missing = train_data.select_dtypes(include=['object'])
categorical_have_missing = categorical_have_missing[categorical_have_missing.columns[categorical_have_missing.isnull().any()]]
print(len(categorical_have_missing.columns.values))
print('-'*30)
print(categorical_have_missing.columns.values)
print('-'*30)
print(categorical_have_missing.count())
Drop FireplaceQu, which has a very large number of missing values.
For the remaining variables, impute the missing values with the mode using SimpleImputer.
Tip) SimpleImputer's strategy option chooses the fill value: 'mean', 'median', 'most_frequent', or 'constant'
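A tiny illustration of how the strategy option changes the filled value (toy data, not from this dataset):
toy = np.array([[1.0], [2.0], [2.0], [np.nan]])
print(SimpleImputer(strategy='mean').fit_transform(toy).ravel())           # NaN -> 1.666...
print(SimpleImputer(strategy='median').fit_transform(toy).ravel())         # NaN -> 2.0
print(SimpleImputer(strategy='most_frequent').fit_transform(toy).ravel())  # NaN -> 2.0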
train_data = train_data.drop(['FireplaceQu'], axis=1)
categorical_have_missing = categorical_have_missing.drop(['FireplaceQu'], axis=1)
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
for feature in categorical_have_missing:
    categorical_have_missing[feature] = imputer.fit_transform(categorical_have_missing[feature].values.reshape((-1,1)))
    train_data[feature] = imputer.fit_transform(train_data[feature].values.reshape((-1,1)))
plt.scatter(train_data['GrLivArea'], train_data['SalePrice'])
plt.show()
Check the indices of the outliers and then drop them from the train data
train_data[ (train_data['GrLivArea'] > 4000) & (train_data['SalePrice'] < 200000)].index
train_data['Id'] = pd.Series(train_data.index)
train_data = train_data.drop( train_data[ (train_data['Id'] == 1298) | (train_data['Id'] == 523) ].index)
# Delete Id again
train_data = train_data.drop(['Id'], axis=1)
One-hot encode the categorical variables with get_dummies
train_data = pd.get_dummies(train_data)
sns.distplot(train_data['SalePrice'], fit=norm)
fig = plt.figure()
res = stats.probplot(train_data['SalePrice'], plot=plt)
SalePrice does not appear to follow a normal distribution, so apply a log transform to address this.
train_data['SalePrice'] = np.log(train_data['SalePrice'])
sns.distplot(train_data['SalePrice'], fit=norm)
fig = plt.figure()
res = stats.probplot(train_data['SalePrice'], plot=plt)
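Note that the target is now log(SalePrice), so model predictions are also on the log scale and need np.exp to be read as prices again; a quick check on the first row:
example_log_price = train_data['SalePrice'].iloc[0]
print(example_log_price, np.exp(example_log_price))  # log-scale value and the corresponding price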
target = train_data['SalePrice']
train_data = train_data.drop(['SalePrice'], axis=1)
X, y = train_data, target
LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
pred = lin_reg.predict(X)
print(lin_reg.score(X,y))
np.sqrt(mean_squared_log_error(pred,y))
RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(X, y)
pred = forest_reg.predict(X)
print(forest_reg.score(X, y))
np.sqrt(mean_squared_log_error(pred, y))
XGBRegressor
xg_reg = xg.XGBRegressor(objective='reg:linear', n_estimators=300, seed=123)
xg_reg.fit(X, y)
pred = xg_reg.predict(X)
print(xg_reg.score(X, y))
np.sqrt(mean_squared_log_error(pred, y))
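All three scores above are computed on the same rows the models were fit on, so they are optimistic. train_test_split is imported at the top but never used; a minimal held-out evaluation sketch (the 20% split and random_state are arbitrary choices, reusing the XGBRegressor settings above):
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
held_out_model = xg.XGBRegressor(n_estimators=300, random_state=123)
held_out_model.fit(X_train, y_train)
valid_pred = held_out_model.predict(X_valid)
print(held_out_model.score(X_valid, y_valid))                # R^2 on the unseen 20%
print(np.sqrt(mean_squared_log_error(y_valid, valid_pred)))  # RMSLE on the unseen 20%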