📓 Notebook: Iris Classification

In this notebook, we build a first classification model, step by step, on the Iris dataset.

Objectives

Load and explore the Iris dataset
Visualize the data with Matplotlib and Seaborn
Train a Random Forest model
Evaluate the model with standard metrics

Step 1: Import Libraries

# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Display configuration
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

Step 2: Load Dataset

# Load the Iris dataset
iris = load_iris()

# Build a DataFrame with one column per feature
df = pd.DataFrame(
    data=iris.data,
    columns=iris.feature_names
)
df['target'] = iris.target
df['species'] = df['target'].map({
    0: 'setosa',
    1: 'versicolor', 
    2: 'virginica'
})

# Preview the data
print(f"Shape: {df.shape}")
df.head(10)
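Before plotting anything, it can help to confirm that the table is complete and the classes are balanced. The following is a minimal, self-contained sketch of such a check; the names `missing_per_column` and `class_counts` are illustrative and not part of the original notebook. It derives the species labels from `iris.target_names` instead of a hand-written mapping.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Look up the species name for each integer label via NumPy indexing
df["species"] = iris.target_names[iris.target]

# Number of missing values in each column
missing_per_column = df.isna().sum()

# Number of samples in each class
class_counts = df["species"].value_counts()

print(missing_per_column)
print(class_counts)
```

If every count in `missing_per_column` is zero and the three classes have the same size, no imputation or class re-balancing is needed for the later steps.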

Step 3: Exploratory Data Analysis

3.1 Descriptive statistics

# Basic summary statistics
df.describe()

3.2 Class distribution

# Count the samples per species and plot the counts side by side
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Pie chart
df['species'].value_counts().plot.pie(
    autopct='%1.1f%%',
    ax=axes[0],
    colors=['#ff9999','#66b3ff','#99ff99']
)
axes[0].set_title('Species distribution')
axes[0].set_ylabel('')

# Bar chart
df['species'].value_counts().plot.bar(
    ax=axes[1],
    color=['#ff9999','#66b3ff','#99ff99']
)
axes[1].set_title('Samples per species')
axes[1].set_xlabel('Species')
axes[1].set_ylabel('Count')

plt.tight_layout()
plt.show()

3.3 Feature distributions

# Boxplot of every feature, grouped by species
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

for idx, column in enumerate(iris.feature_names):
    ax = axes[idx // 2, idx % 2]
    df.boxplot(column=column, by='species', ax=ax)
    ax.set_title(column)
    ax.set_xlabel('')

plt.suptitle('Feature Distributions by Species', fontsize=14)
plt.tight_layout()
plt.show()

3.4 Pair Plot

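# Pair plot to inspect the relationships between features.
# Drop the numeric 'target' column so it is not plotted as if it were a feature.
sns.pairplot(df.drop(columns='target'), hue='species', palette='husl', markers=['o', 's', 'D'])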
plt.suptitle('Pair Plot - Iris Dataset', y=1.02)
plt.show()

3.5 Correlation Matrix

# Heatmap correlation
plt.figure(figsize=(8, 6))
correlation = df[iris.feature_names].corr()
sns.heatmap(
    correlation, 
    annot=True, 
    cmap='coolwarm',
    center=0,
    square=True,
    linewidths=0.5
)
plt.title('Correlation Matrix')
plt.show()
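The heatmap shows the correlations visually; the same matrix can also be queried directly. The sketch below, which is an illustration rather than part of the original notebook, finds the feature pair with the largest absolute correlation. The names `pair_correlations` and `strongest_pair` are hypothetical.

```python
from itertools import combinations

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)
correlation = features.corr()

# Collect the correlation of every unordered pair of distinct features
pair_correlations = {
    (first, second): correlation.loc[first, second]
    for first, second in combinations(iris.feature_names, 2)
}

# Pick the pair whose correlation has the largest magnitude
strongest_pair = max(pair_correlations, key=lambda pair: abs(pair_correlations[pair]))

print(strongest_pair, round(pair_correlations[strongest_pair], 3))
```

Strongly correlated features carry overlapping information, which is worth keeping in mind when reading feature importance scores later on.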

Step 4: Data Preprocessing

# Separate features and target
X = df[iris.feature_names].values
y = df['target'].values

# Split train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42,
    stratify=y  # Keep the class proportions the same in train and test
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

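# Scale the features, fitting the scaler on the training set only to avoid data leakage.
# Tree-based models such as Random Forest do not require scaling; it is kept here
# so that scale-sensitive models (SVM, KNN) can be swapped in later.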
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
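To confirm what the split and the scaler actually did, one option is to inspect the class counts and the scaled statistics. This is a small self-contained sketch that repeats the same split settings as above; `train_class_counts` and `test_class_counts` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training data only, then reuse it for the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# With stratify=y, every class keeps the same share in both subsets
train_class_counts = np.bincount(y_train)
test_class_counts = np.bincount(y_test)
print("Train class counts:", train_class_counts)
print("Test class counts:", test_class_counts)

# Training features are centered and have unit variance by construction;
# test features are only approximately so, because they reuse training statistics
print("Train mean:", X_train_scaled.mean(axis=0).round(3))
print("Train std:", X_train_scaled.std(axis=0).round(3))
print("Test mean:", X_test_scaled.mean(axis=0).round(3))
```

The test-set means are not exactly zero, and that is expected: the scaler must never see test data during fitting.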

Step 5: Model Training

# Initialize and train the model
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    random_state=42
)

# Fit model
model.fit(X_train_scaled, y_train)

# Predictions
y_pred = model.predict(X_test_scaled)
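An alternative way to organize the same steps is to bundle the scaler and the classifier into a single scikit-learn pipeline, so that the scaler is always fitted on training data only. The sketch below assumes the same hyperparameters and split as above; the pipeline itself is not part of the original notebook, and `pipeline` and `test_accuracy` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Chain preprocessing and model; fit() scales and trains in one call
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
)
pipeline.fit(X_train, y_train)

# score() applies the fitted scaler to X_test before predicting
test_accuracy = pipeline.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2%}")
```

A pipeline is also convenient for cross-validation, because the scaler is refitted inside each fold automatically.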

Step 6: Model Evaluation

6.1 Accuracy

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}")

6.2 Classification Report

print("Classification Report:")
print("=" * 60)
print(classification_report(
    y_test, 
    y_pred, 
    target_names=iris.target_names
))

6.3 Confusion Matrix

# Confusion Matrix
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(
    cm, 
    annot=True, 
    fmt='d', 
    cmap='Blues',
    xticklabels=iris.target_names,
    yticklabels=iris.target_names
)
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
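The confusion matrix contains everything needed to recompute the main metrics by hand, which can make the classification report easier to interpret. The following sketch retrains the same model on unscaled features (an assumption made here to keep the example short; Random Forest is insensitive to feature scaling) and derives per-class precision and recall from the matrix.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
true_positives = np.diag(cm)

# Recall: correct predictions divided by the number of actual samples per class
recall = true_positives / cm.sum(axis=1)

# Precision: correct predictions divided by the number of predictions per class
precision = true_positives / cm.sum(axis=0)

# Accuracy: all correct predictions divided by all samples
accuracy = true_positives.sum() / cm.sum()

print("Precision:", precision.round(3))
print("Recall:", recall.round(3))
print(f"Accuracy: {accuracy:.2%}")
```

These values should match what `classification_report` prints; `precision_score` and `recall_score` with `average=None` return the same per-class arrays.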

6.4 Feature Importance

# Feature Importance
importance = pd.DataFrame({
    'feature': iris.feature_names,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(8, 5))
sns.barplot(data=importance, x='importance', y='feature', palette='viridis')
plt.title('Feature Importance')
plt.xlabel('Importance Score')
plt.ylabel('Feature')
plt.show()
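The same ranking can be read numerically, which is handy when the bars are close in length. The sketch below is an illustration, not part of the original notebook: it trains the same model on unscaled features (an assumption for brevity) and sums the importance held by the two petal features. Impurity-based importances in scikit-learn are normalized, so they add up to 1.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)

# Series indexed by feature name, sorted from most to least important
importance = pd.Series(
    model.feature_importances_, index=iris.feature_names
).sort_values(ascending=False)

# Combined share of the two petal measurements
petal_share = importance[importance.index.str.startswith("petal")].sum()

print(importance.round(3))
print(f"Petal features combined: {petal_share:.1%}")
```

Because petal length and petal width are strongly correlated, the forest may split the credit between them differently from run to run; their combined share is the more stable number.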

Conclusion

Results

The Random Forest model reaches high accuracy on the test set (about 97%)
Petal length and petal width are the most important features
Setosa is the easiest species to separate, while versicolor and virginica overlap
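The separability claim can be checked directly on a single feature. The sketch below compares the petal length range of each species; the names `petal_length_range`, `setosa_max`, and `others_min` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target_names[iris.target]

# Minimum and maximum petal length for each species
petal_length_range = df.groupby("species")["petal length (cm)"].agg(["min", "max"])
print(petal_length_range)

# Setosa is separable on this feature if its largest value is below
# the smallest value of every other species
setosa_max = petal_length_range.loc["setosa", "max"]
others_min = petal_length_range.drop(index="setosa")["min"].min()
print("Setosa separable by petal length:", setosa_max < others_min)

# Versicolor and virginica overlap if their ranges intersect
overlap = (
    petal_length_range.loc["versicolor", "max"]
    > petal_length_range.loc["virginica", "min"]
)
print("Versicolor/virginica overlap:", overlap)
```

A single threshold on petal length is enough to isolate setosa, which is why the errors of most classifiers on this dataset fall between the other two species.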

Practice Tasks

Change n_estimators and max_depth and observe how the accuracy changes
Try other algorithms (SVM, KNN) and compare the results
Add cross-validation to get a more reliable estimate of model performance
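As a possible starting point for the last two tasks, the sketch below compares three classifiers with 5-fold stratified cross-validation. The choice of models, their hyperparameters, and the names `candidates` and `cv_scores` are assumptions made for illustration, not a reference solution. The scale-sensitive models are wrapped in a pipeline so that scaling is refitted inside each fold.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Shuffle before splitting so the folds do not depend on row order
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "random_forest": RandomForestClassifier(
        n_estimators=100, max_depth=5, random_state=42
    ),
    "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
    "knn_5": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# One accuracy value per fold for every candidate model
cv_scores = {
    name: cross_val_score(estimator, X, y, cv=cv, scoring="accuracy")
    for name, estimator in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

With only 30 test samples, a single train/test split changes by more than three percentage points when one prediction flips, so the mean and spread over folds give a steadier basis for comparing models.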

➡️ Next: work through the practice tasks to consolidate what you have learned!