[Python] ANOVA F-Test, Multinomial Logit Regression

이언배 2024. 12. 10. 14:46

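The dataset is finally built.

What remains is to run it through the same cooking process I went through last time.

I honestly don't know how the results will turn out.

Let's just work through what needs to be done, one step at a time.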

1. Preprocessing

import numpy as np
import pandas as pd

# Columns to load.
col_names = ['pmid', 'title', 'cls_main', 'digi_type', 'gro_flo_co', 'und_flo_co', 'total_area', 'bdtyp_cd',
             'roa_cls_se', 'rds_dpn_se', 'road_bt', 'buld_age', 'dist_tord', 'dist_toapt', 'dist_tocbd', 'dist_tostation', 'mlsfc_cl']

# Build the query from those columns.
# (fetch_table and conn are defined elsewhere and are not shown in this post.)
col_txt = ', '.join(col_names)
query0 = 'SELECT ' + col_txt + ' FROM dtp_data_2024_final;'
df = fetch_table(query0, conn)

# I wrote fetch_table back when I didn't know what I was doing, so I added a step that renames the columns.
rename_dict = dict(enumerate(col_names))  # column position -> column name
df.rename(columns=rename_dict, inplace=True)

Then on to the preprocessing itself.

################################### Code that classifies building use
# Process Building Type Code
df['bdtyp_cd'].value_counts()
df['bdtyp_cd'] = df['bdtyp_cd'].astype(int)

#df['geunsang'] = np.where((df['bdtyp_cd'] == 4999) | (df['bdtyp_cd'] == 3999), 1, 0)
df['dandok'] = np.where(df['bdtyp_cd'] == 1001, 1, 0)
df['apt'] = np.where(df['bdtyp_cd'] == 2001, 1, 0)
# The thousands part of the code (bdtyp_cd // 1000) identifies the broad use category.
df['residential'] = np.where((df['bdtyp_cd']//1000 == 1) | (df['bdtyp_cd']//1000 == 2), 1, 0)
df['neighborhood'] = np.where((df['bdtyp_cd']//1000 == 3) | (df['bdtyp_cd']//1000 == 4), 1, 0)
df['commercial'] = np.where(df['bdtyp_cd']//1000 == 6, 1, 0)
df['work'] = np.where(df['bdtyp_cd']//1000 == 10, 1, 0)
df['factory'] = np.where((df['bdtyp_cd']//1000 == 13) | (df['bdtyp_cd']//1000 == 14), 1, 0)
df['etc'] = np.where(~(df['bdtyp_cd']//1000).isin([1, 2, 3, 4, 6, 10, 13, 14]), 1, 0)


################################### Code that handles building age (in days)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()  # must exist before the first fit_transform call
np_buld_age = np.array(df['buld_age']).reshape(-1, 1)
normed_buld_age = scaler.fit_transform(np_buld_age)
#np_open_age = np.array(df['open_age']).reshape(-1, 1)
#normed_open_age = scaler.fit_transform(np_open_age)

# Replace the original column with the standardized values
#df['open_age'] = normed_open_age
df['buld_age'] = normed_buld_age

################################### Code that classifies land use
# Process Land use Code
df['mlsfc_cl'].value_counts()

onlyResi_list = ['UQA110', 'UQA112']
Resi_list = ['UQA121', 'UQA122', 'UQA124']
joonResi_list = ['UQA130', 'UQA190']

df['lu_onlyResi'] = np.where(df['mlsfc_cl'].isin(onlyResi_list), 1, 0)
df['lu_Resi'] = np.where(df['mlsfc_cl'].isin(Resi_list), 1, 0)
df['lu_joonResi'] = np.where(df['mlsfc_cl'].isin(joonResi_list), 1, 0)

comm_list = ['UQA200', 'UQA210', 'UQA220', 'UQA230', 'UQA240']
indu_list = ['UQA330']
gree_list = ['UQA410', 'UQA420', 'UQA430']

df['lu_comm'] = np.where(df['mlsfc_cl'].isin(comm_list), 1, 0)
df['lu_indu'] = np.where(df['mlsfc_cl'].isin(indu_list), 1, 0)
df['lu_gree'] = np.where(df['mlsfc_cl'].isin(gree_list), 1, 0)

2. ANOVA F-test

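The ANOVA F-test checks whether the mean of a variable differs significantly between the groups defined by a class.

If you average the heights of the kids in class 1 and the kids in class 2, is the difference in height actually significant?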
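To make the idea concrete, here is a minimal sketch of the F statistic computed by hand. The heights are made-up numbers and `one_way_f` is a hypothetical helper, not something from the analysis in this post. It only shows what the test compares: the spread between the group means against the spread inside each group.

```python
from statistics import mean

def one_way_f(groups):
    """Return the one-way ANOVA F statistic for a list of samples."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand = mean(x for g in groups for x in g)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Variation of each observation around its own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical heights (cm) for two classes
class_1 = [150, 152, 154]
class_2 = [160, 162, 164]
f_stat = one_way_f([class_1, class_2])
print(f_stat)
```

A large F means the group means are far apart relative to the noise within each group.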

############################ ANOVA F-test for Total
# .copy() so the type conversions below do not trigger SettingWithCopyWarning
df_anova = df.loc[:, ['roa_cls_se', 'rds_dpn_se', 'road_bt', 'dist_tord', 'dist_toapt','dist_tocbd', 'dist_tostation', 'gro_flo_co', 'und_flo_co', 'total_area', 
       'residential', 'neighborhood', 'commercial', 'work', 'factory', 'buld_age', 'lu_onlyResi', 'lu_Resi', 'lu_joonResi', 'lu_comm', 'lu_indu', 'lu_gree', 'digi_type']].copy()
df_anova = df_anova.astype({
    'roa_cls_se': int, 'rds_dpn_se': int,
    'dist_tord': float, 'dist_toapt': float, 'dist_tocbd': float, 'dist_tostation': float,
    'gro_flo_co': float, 'und_flo_co': float, 'total_area': float,
})

########################### New index for the ANOVA F-test. Each x is paired with grp, the class variable,
########################### so the mean of each x can be compared across the classes of grp.
mult_idx = pd.MultiIndex.from_product([['roa_cls_se', 'rds_dpn_se', 'road_bt', 'dist_tord', 'dist_toapt','dist_tocbd', 'dist_tostation', 'gro_flo_co', 'und_flo_co', 'total_area', 
       'residential', 'neighborhood', 'commercial', 'work', 'factory', 'buld_age', 'lu_onlyResi', 'lu_Resi', 'lu_joonResi', 'lu_comm', 'lu_indu', 'lu_gree'], ['digi_type']],
                                      names = ['x', 'grp'])
                                      
########################### Modules used for the ANOVA F-test.
import statsmodels.api as sm
from statsmodels.formula.api import ols

anova_tables = []
for x, grp in mult_idx:
    # C() makes the formula treat grp as categorical even if it is stored as numbers
    model = ols('{} ~ C({})'.format(x, grp), data=df_anova).fit()
    anova_table = sm.stats.anova_lm(model, typ=1)
    anova_tables.append(anova_table)
df_anova_tables = pd.concat(anova_tables, keys=mult_idx, axis=0)
df_anova_tables

The result looks like this.

To interpret it: when the p-value is below 0.05, the mean of that variable differs by class (here, digi_type) in a statistically significant way.
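As a small illustration of that reading rule, the sketch below runs a one-way ANOVA on made-up heights with `scipy.stats.f_oneway` and applies the 0.05 cutoff. The data and variable names are assumptions for the example only; the analysis in this post uses statsmodels, and this is just a quick way to see an F statistic and a p-value side by side.

```python
from scipy.stats import f_oneway

# Hypothetical heights (cm) for two classes
class_1 = [150, 152, 154]
class_2 = [160, 162, 164]

f_stat, p_value = f_oneway(class_1, class_2)

# Same decision rule as in the text: below 0.05 counts as significant
significant = p_value < 0.05
print(f_stat, p_value, significant)
```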

3. Multinomial Logit Regression

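Logit: a bunch of variables do their thing together, and the question is whether the outcome is 0 or 1.

It is a regression for categorical classification, not one that produces a continuous value.

Multinomial logit regression asks the same question, except the outcome is one of several categories: 0, 1, 2, ...

Looking at height, weight, and math scores this way and that, is this kid in class 1, class 2, or class 3?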
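Under the hood, a multinomial logit gives each category a linear score and turns the scores into probabilities with a softmax, with one category acting as the reference (score fixed at 0). The sketch below shows just that last step. The scores are invented numbers and `class_probabilities` is a hypothetical helper, so take it as an illustration rather than what statsmodels does internally.

```python
import math

def class_probabilities(scores):
    """Softmax: turn one linear score per class into probabilities."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for classes 0, 1, 2; class 0 is the reference (score 0)
scores = [0.0, 1.2, -0.4]
probs = class_probabilities(scores)

# The predicted category is the one with the highest probability
predicted = probs.index(max(probs))
print(probs, predicted)
```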

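First, check for multicollinearity.

Height and weight are so closely related that running them together distorts the analysis, so in order to drop one of the two,

we need to find the variables that are highly correlated.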

############################ Preprocessing
df_logit = df.loc[:, ['roa_cls_se', 'rds_dpn_se', 'road_bt', 'dist_tord', 'dist_toapt','dist_tocbd', 'dist_tostation', 'gro_flo_co', 'und_flo_co', 'total_area', 
       'residential', 'commercial', 'work', 'factory', 'buld_age', 'lu_onlyResi', 'lu_Resi', 'lu_joonResi', 'lu_comm', 'lu_indu', 'lu_gree', 'digi_type']].copy()
# Str to int
df_logit['roa_cls_se'] = df_logit['roa_cls_se'].astype(int)
df_logit['rds_dpn_se'] = df_logit['rds_dpn_se'].astype(int)

# Fit the scaler to each column and replace the column with the standardized values
for col in ['dist_tord', 'dist_tocbd', 'dist_toapt', 'dist_tostation', 'total_area']:
    df_logit[col] = scaler.fit_transform(df_logit[[col]]).ravel()

############################## Multicollinearity check (VIF)
from statsmodels.stats.outliers_influence import variance_inflation_factor

df_logit = df_logit.fillna(0)
X_vif = df_logit.drop(columns=['digi_type'])  # predictors only
vif = pd.DataFrame()
vif["Feature"] = X_vif.columns
vif["VIF"] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]

The usual advice, apparently, is to drop factors whose VIF is above 5.

I dropped neighborhood.
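For intuition about where that number comes from: the VIF of a variable is 1 / (1 - R²), where R² comes from regressing that variable on the other predictors. With only two predictors, R² is simply the squared correlation between them, so the sketch below can compute it by hand. The height and weight values are made up and both helper functions are hypothetical.

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def vif_two_predictors(xs, ys):
    """VIF when there are exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical, strongly related predictors
height = [150, 160, 170, 180, 190]
weight = [45, 58, 63, 79, 85]

vif_value = vif_two_predictors(height, weight)
print(vif_value)  # far above the rule-of-thumb cutoff of 5
```

The closer the correlation gets to 1, the closer the denominator gets to 0 and the faster the VIF blows up.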

#################### Load the data
from sklearn.preprocessing import StandardScaler

df_logit = df.loc[:, ['roa_cls_se', 'rds_dpn_se', 'road_bt', 'dist_tord', 'dist_toapt','dist_tocbd', 'dist_tostation', 'gro_flo_co', 'und_flo_co', 'total_area', 
       'residential', 'commercial', 'work', 'factory', 'buld_age', 'lu_onlyResi', 'lu_Resi', 'lu_joonResi', 'lu_comm', 'lu_indu', 'lu_gree', 'digi_type']].copy()

#################### Fix the data types
df_logit['roa_cls_se'] = df_logit['roa_cls_se'].astype(int)
df_logit['rds_dpn_se'] = df_logit['rds_dpn_se'].astype(int)

#################### Standardize the columns whose values look too large
scaler = StandardScaler()

# Fit the scaler to each column and replace the column with the standardized values
for col in ['dist_tord', 'dist_tocbd', 'dist_toapt', 'dist_tostation', 'total_area']:
    df_logit[col] = scaler.fit_transform(df_logit[[col]]).ravel()

#################### I can't see any null or NaN values no matter how I look, but it kept complaining, so just fillna(0) and split into X and Y
# 'neighborhood' was dropped after the VIF check and is not in df_logit, so it is not listed here.
X = np.array(df_logit.loc[:, ['roa_cls_se', 'rds_dpn_se', 'road_bt', 'dist_tord', 'dist_tocbd', 'dist_toapt', 'dist_tostation', 'gro_flo_co', 'und_flo_co', 'total_area', 'residential', 'commercial', 'work', 'factory', 'buld_age', 'lu_onlyResi', 'lu_Resi', 'lu_joonResi', 'lu_comm', 'lu_indu', 'lu_gree']].fillna(0))
Y = np.array(df_logit.loc[:, 'digi_type'])


#################### The actual multinomial regression
import statsmodels.api as sm

X = sm.add_constant(X)  # add constant term for intercept
model = sm.MNLogit(Y, X)
result = model.fit()
#margeff = result.get_margeff()

#################### Print the results
print(result.summary())

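What a truly "pseudo" explanation that is.

If anyone knows this well, please explain in the comments.

Pseudo R-squ. puts a number on how well the multinomial logit fits, but what matters for a classification model is its classification performance. That should be evaluated with the F1 score, the confusion matrix, accuracy, and so on.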
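As far as I can tell, the Pseudo R-squ. in the statsmodels summary is McFadden's pseudo R-squared, 1 - (llf / llnull): it compares the log-likelihood of the fitted model with that of an intercept-only model that just predicts the class shares. Here is a minimal sketch with made-up numbers. The labels, the fitted probabilities, and the helper name are all hypothetical.

```python
import math
from collections import Counter

def log_likelihood(probs_of_true_class):
    """Sum of log probabilities the model gave to the observed classes."""
    return sum(math.log(p) for p in probs_of_true_class)

# Hypothetical observed classes
y = [0, 0, 1, 2]

# Hypothetical probabilities a fitted model assigned to each observed class
fitted_probs = [0.7, 0.6, 0.5, 0.6]

# Intercept-only model: every row gets the overall share of its class
shares = {cls: count / len(y) for cls, count in Counter(y).items()}
null_probs = [shares[cls] for cls in y]

pseudo_r2 = 1.0 - log_likelihood(fitted_probs) / log_likelihood(null_probs)
print(pseudo_r2)
```

A value near 0 means the predictors barely improve on guessing the class shares. It does not say how many rows end up classified correctly, which is why the metrics mentioned above are still needed.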

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Train a multinomial logistic regression model
# (with lbfgs, a multiclass target is fit as a multinomial model by default)
model = LogisticRegression(solver='lbfgs')
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Compute confusion matrix (rows: true class, columns: predicted class)
cm = confusion_matrix(y_test, y_pred)

print("Confusion Matrix:")
print(cm)

# Correct predictions sit on the diagonal, whatever the number of classes
correct_predictions = np.trace(cm)
total_predictions = cm.sum()

accuracy = correct_predictions / total_predictions
print("Accuracy:", accuracy)

# average=None returns one score per class; index 0 picks the first class
precision_class_0 = precision_score(y_test, y_pred, average=None)[0]
recall_class_0 = recall_score(y_test, y_pred, average=None)[0]
f1_class_0 = f1_score(y_test, y_pred, average=None)[0]

print("F1-score for class 0:", f1_class_0)
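To see where those numbers come from, here is a small sketch that derives accuracy, precision, recall, and F1 directly from a confusion matrix, using the same layout as above (rows are true classes, columns are predicted classes). The matrix is made up and `per_class_metrics` is a hypothetical helper, not part of scikit-learn.

```python
def per_class_metrics(cm, k):
    """Precision, recall, and F1 for class k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that truly is k
    precision = tp / predicted_k if predicted_k else 0.0
    recall = tp / actual_k if actual_k else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical 3-class confusion matrix
cm_example = [[8, 1, 1],
              [2, 6, 2],
              [0, 3, 7]]

# Accuracy: diagonal (correct predictions) over all predictions
accuracy_example = sum(cm_example[i][i] for i in range(len(cm_example))) / sum(map(sum, cm_example))
precision_0, recall_0, f1_0 = per_class_metrics(cm_example, 0)
print(accuracy_example, precision_0, recall_0, f1_0)
```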
