When Label Encoding raises "ValueError: y contains previously unseen labels:"

Label encoding with scikit-learn's LabelEncoder sometimes fails with "ValueError: y contains previously unseen labels:".

This happens when you call fit on the training data and then transform on the test data, and the test data contains a category value that does not exist in the training data.
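As a minimal sketch of the situation described above, the snippet below fits on the training data only and then calls transform on test data that contains one unseen value. The variable name error_message is only for illustration, and the exact wording after the colon in the message may differ between scikit-learn versions.

```python
from sklearn.preprocessing import LabelEncoder

train_data = ["Red", "Green", "Blue"]
test_data = ["Red", "Green", "Yello"]  # "Yello" never appears in train_data

label_encoder = LabelEncoder()
label_encoder.fit(train_data)  # the encoder only knows Blue, Green, Red

try:
    label_encoder.transform(test_data)
    error_message = None
except ValueError as exc:
    # transform refuses labels that were not present during fit
    error_message = str(exc)

print(error_message)
```

The message names the offending label, so it also tells you which category is missing from the training data.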

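Beginners sometimes call fit_transform on both the training data and the test data, or concatenate the two, fit on the combined data, and then transform each part. In principle, however, the training data and the test data must stay independent, so neither approach is recommended in practice. (Fitting on data that includes the test set is data leakage, and in competitions and similar settings it is grounds for disqualification.)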

Below, we look at how to add categories by using LabelEncoder.classes_.

1. Calling fit_transform on both the training data and the test data (the problem this causes)

from sklearn.preprocessing import LabelEncoder

# Training data
train_data = ["Red", "Green", "Blue"]

# Test data ("Yello" does not appear in the training data)
test_data = ["Red", "Green", "Yello"]

# Initialize the label encoder
label_encoder = LabelEncoder()

# Fit and transform the training data
train_encoded = label_encoder.fit_transform(train_data)

# Fit again on the test data and transform it
# (this discards the mapping learned from the training data)
test_encoded = label_encoder.fit_transform(test_data)

print("Training data:")
for color, encoded in zip(train_data, train_encoded):
    print(f"Color: {color}, Encoded value: {encoded}")

print("\nTest data:")
for color, encoded in zip(test_data, test_encoded):
    print(f"Color: {color}, Encoded value: {encoded}")

Output: the same value can end up encoded as a different integer (Red is 2 in the training data but 1 in the test data).

Training data:
Color: Red, Encoded value: 2
Color: Green, Encoded value: 1
Color: Blue, Encoded value: 0

Test data:
Color: Red, Encoded value: 1
Color: Green, Encoded value: 0
Color: Yello, Encoded value: 2
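To see why the integers differ, it can help to look at the label-to-integer mapping stored in classes_ after each fit. The sketch below is only an illustration; the names train_mapping and test_mapping are made up for this example.

```python
from sklearn.preprocessing import LabelEncoder

train_data = ["Red", "Green", "Blue"]
test_data = ["Red", "Green", "Yello"]

label_encoder = LabelEncoder()

# Mapping learned from the training data (classes_ is kept in sorted order)
label_encoder.fit(train_data)
train_mapping = {str(label): index for index, label in enumerate(label_encoder.classes_)}

# Fitting again replaces classes_ with the labels found in the test data
label_encoder.fit(test_data)
test_mapping = {str(label): index for index, label in enumerate(label_encoder.classes_)}

print(train_mapping)  # {'Blue': 0, 'Green': 1, 'Red': 2}
print(test_mapping)   # {'Green': 0, 'Red': 1, 'Yello': 2}
```

Each fit sorts the labels it sees and numbers them from 0, so dropping "Blue" and adding "Yello" shifts the codes for "Green" and "Red".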

2. Fitting on the training data only and adding the test data's categories to LabelEncoder.classes_

import numpy as np
from sklearn.preprocessing import LabelEncoder

# Training data
train_data = ["Red", "Green", "Blue"]

# Test data
test_data = ["Red", "Green", "Yello"]

# Initialize the label encoder
label_encoder = LabelEncoder()

# Fit and transform the training data
train_encoded = label_encoder.fit_transform(train_data)

# Add every label the encoder has not seen to classes_, then transform the test data
for label in test_data:  # for real data, change this to something like np.unique(test_data[column])
    if label not in label_encoder.classes_:
        label_encoder.classes_ = np.append(label_encoder.classes_, label)
        print(label_encoder.classes_)  # show the extended classes_

test_encoded = label_encoder.transform(test_data)

print("Training data:")
for color, encoded in zip(train_data, train_encoded):
    print(f"Color: {color}, Encoded value: {encoded}")

print("\nTest data:")
for color, encoded in zip(test_data, test_encoded):
    print(f"Color: {color}, Encoded value: {encoded}")

Output (the classes_ array printed inside the loop is omitted): identical values are encoded as identical integers.

Training data:
Color: Red, Encoded value: 2
Color: Green, Encoded value: 1
Color: Blue, Encoded value: 0

Test data:
Color: Red, Encoded value: 2
Color: Green, Encoded value: 1
Color: Yello, Encoded value: 3
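For real data the loop above can be wrapped in a small helper. The sketch below is one possible way to do that, not an official scikit-learn API: extend_classes is a hypothetical name, and it assumes the values are strings. One caveat: new labels are appended at the end, so classes_ may no longer be sorted. Whether transform copes with an unsorted classes_ may depend on the scikit-learn version and the dtype of the labels, so it is worth checking the result on your own setup.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder


def extend_classes(encoder, values):
    """Append labels in `values` that the fitted encoder has not seen (hypothetical helper)."""
    known = set(encoder.classes_.tolist())
    # np.unique removes duplicates, so each new label is appended only once
    new_labels = [label for label in np.unique(values).tolist() if label not in known]
    if new_labels:
        encoder.classes_ = np.append(encoder.classes_, new_labels)
    return new_labels


train_data = ["Red", "Green", "Blue"]
test_data = ["Red", "Green", "Yello", "Yello"]

label_encoder = LabelEncoder()
train_encoded = label_encoder.fit_transform(train_data)

added = extend_classes(label_encoder, test_data)
test_encoded = label_encoder.transform(test_data)

print(added)                  # ['Yello']
print(test_encoded.tolist())  # [2, 1, 3, 3]
```

The codes for "Red" and "Green" stay the same as in the training data, and the unseen label receives the next free integer.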

