🤖 Introduction to Machine Learning

What is Machine Learning?

Machine Learning (ML) is a branch of Artificial Intelligence (AI) that lets computers learn from data without being explicitly programmed.
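To make "learning from data" concrete, the sketch below trains the same model on growing slices of a dataset and measures test accuracy each time. The dataset (scikit-learn's bundled digits), the slice sizes, and the name `accuracy_by_size` are illustrative choices, not part of the lesson itself.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Train on progressively more examples and record test accuracy (assumed sizes)
accuracy_by_size = {}
for n in (20, 100, 500, len(X_train)):
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train[:n], y_train[:n])
    accuracy_by_size[n] = model.score(X_test, y_test)
    print(f"n={n:4d}  accuracy={accuracy_by_size[n]:.2%}")
```

No rule for recognizing digits is written by hand here; accuracy should rise simply because the model sees more examples.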

Formal definition

"A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E." — Tom Mitchell, 1997

Types of Machine Learning

graph TB
    ML[Machine Learning]
    ML --> SL[Supervised Learning]
    ML --> UL[Unsupervised Learning]
    ML --> RL[Reinforcement Learning]
    
    SL --> CL[Classification]
    SL --> RG[Regression]
    
    UL --> CLU[Clustering]
    UL --> DR[Dimensionality Reduction]
    
    RL --> MDP[Markov Decision Process]
    
    style ML fill:#e3f2fd
    style SL fill:#bbdefb
    style UL fill:#bbdefb
    style RL fill:#bbdefb

1. Supervised Learning

The model learns from labeled data: each training example pairs an input with the correct output.

Classification

Task: predict a discrete class label.

Examples:

Spam detection (spam / not spam)
Image classification (cat / dog)
Disease diagnosis (positive / negative)

Algorithms: Logistic Regression, Decision Tree, SVM, Random Forest
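As a minimal sketch of a classification task, the example below fits Logistic Regression to a binary diagnosis dataset. The dataset (scikit-learn's bundled breast cancer data), the scaling step, and the variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Binary labels: 0 = malignant, 1 = benign
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale the features, then fit a logistic regression classifier
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

pred_labels = clf.predict(X_test)        # discrete class labels
pred_proba = clf.predict_proba(X_test)   # per-class probabilities
clf_accuracy = clf.score(X_test, y_test)
print(f"Test accuracy: {clf_accuracy:.2%}")
```

The output of `predict` is always one of the known classes, which is what separates classification from regression.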

Regression

Task: predict a continuous value.

Examples:

House price prediction
Stock price forecasting
Temperature prediction

Algorithms: Linear Regression, Polynomial Regression, Ridge, Lasso
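A regression task can be sketched with synthetic data. Everything about the data below is made up for illustration: the "area" feature, the price formula `50 + 3 * area`, and the noise level are assumptions, not real housing figures.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
area = rng.uniform(30, 150, size=200)                  # hypothetical area in m^2
price = 50 + 3 * area + rng.normal(0, 10, size=200)    # synthetic prices with noise

X = area.reshape(-1, 1)   # scikit-learn expects a 2D feature matrix
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=0
)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)   # continuous predictions

slope = reg.coef_[0]
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"slope={slope:.2f}  R^2={r2:.3f}  MAE={mae:.1f}")
```

The fitted slope should land close to the value used to generate the data, and the metrics (R², MAE) measure distance from the true values rather than the share of correct labels.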

2. Unsupervised Learning

The model learns from unlabeled data to discover patterns or hidden structure.

Type	Description	Examples
Clustering	Group similar data points together	Customer segmentation
Dimensionality Reduction	Reduce the number of feature dimensions	PCA, t-SNE
Association	Find association rules between items	Market basket analysis
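The first two rows of the table can be sketched on the iris measurements with the labels deliberately ignored. The choice of KMeans with three clusters and PCA with two components is an assumption made for this illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data   # features only; the labels are not used

# Clustering: assign each sample to one of 3 groups
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)

# Dimensionality reduction: project 4 features down to 2
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()

print(f"Cluster sizes: {[int((cluster_ids == k).sum()) for k in range(3)]}")
print(f"Variance kept by 2 components: {explained:.1%}")
```

Cluster ids are arbitrary integers, so they carry no meaning beyond "same group" or "different group".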

3. Reinforcement Learning

An agent learns how to act in an environment so as to maximize its cumulative reward.

Reinforcement Learning is applied in game AI, robotics, self-driving cars, and trading bots.
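The agent, environment, and reward loop can be sketched with tabular Q-learning on a toy problem. The five-cell corridor environment, the hyperparameters, and the helper names (`step`, `choose_action`, `train`) are all hypothetical and exist only for this illustration.

```python
import random

N_STATES = 5            # corridor cells 0..4; cell 4 is the goal
ACTIONS = (0, 1)        # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1   # learning rate, discount, exploration rate

def step(state, action):
    """Move one cell; reward is 1.0 only when the goal cell is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def choose_action(q_row, rng):
    """Epsilon-greedy choice with random tie-breaking."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(q_row)
    return rng.choice([a for a in ACTIONS if q_row[a] == best])

def train(episodes=200, max_steps=1000, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            action = choose_action(q[state], rng)
            next_state, reward, done = step(state, action)
            # Q-learning update: move the estimate toward
            # reward + discounted value of the best next action
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
            if done:
                break
    return q

q_table = train()
greedy_policy = [1 if q_table[s][1] > q_table[s][0] else 0 for s in range(N_STATES - 1)]
print(greedy_policy)
```

After training, the greedy policy should pick "right" (1) in every non-goal cell, since that is the shortest route to the only reward.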

ML Workflow

flowchart LR
    A[1. Problem Definition] --> B[2. Data Collection]
    B --> C[3. Data Preprocessing]
    C --> D[4. EDA]
    D --> E[5. Feature Engineering]
    E --> F[6. Model Selection]
    F --> G[7. Training]
    G --> H[8. Evaluation]
    H --> I{Good enough?}
    I -->|No| E
    I -->|Yes| J[9. Deployment]
    J --> K[10. Monitoring]
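Steps 6 to 8 of the workflow, including the "Good enough?" check, can be sketched as cross-validated model comparison. The wine dataset, the two candidate models, and the names `candidates`, `cv_scores`, and `best_name` are assumptions chosen for illustration; a real project would define its own acceptance threshold.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 6: candidate models (preprocessing lives inside the pipeline)
candidates = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Steps 7-8: 5-fold cross-validation on the training split only
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)

# Final check on the held-out test set before moving toward deployment
best_model = candidates[best_name].fit(X_train, y_train)
test_accuracy = best_model.score(X_test, y_test)
print(cv_scores, best_name, f"{test_accuracy:.2%}")
```

Keeping the test set out of the comparison loop means the final number is an honest estimate rather than one the model selection has already been tuned against.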

Key concepts

Overfitting vs Underfitting

                    Model Complexity
        Low ◄─────────────────────────────► High
        
Error   │
  ▲     │    ╭─────╮
  │     │   ╱       ╲
  │     │  ╱         ╲   Test Error
  │     │ ╱           ╲─────────────
  │     │╱             ╲
  │     ├───────────────────────────
  │    ╱│               Training Error
  │   ╱ │
  │  ╱  │
  │ ╱   │
  └─────┴───────────────────────────►
        │       │       │
   Underfitting │   Overfitting
              Optimal

Underfitting: the model is too simple to capture the patterns in the data. Both training error and test error are high.

Overfitting: the model is too complex and learns the noise in the training data as well. Training error is low but test error is high.

To guard against overfitting, use regularization (L1, L2), cross-validation, early stopping, data augmentation, or ensemble methods.
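The gap between training and test performance can be observed directly by varying model complexity. The sketch below uses decision trees of different depths on synthetic, deliberately noisy data; the dataset parameters (including `flip_y=0.3`) and the depths tried are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data where 30% of samples get a randomly assigned label (noise)
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5,
    flip_y=0.3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

fit_results = {}
for depth in (1, 4, None):   # None lets the tree grow until it fits every sample
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    fit_results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    train_acc, test_acc = fit_results[depth]
    print(f"max_depth={depth}: train={train_acc:.2%}  test={test_acc:.2%}")
```

The unrestricted tree memorizes the training set, noise included, so its training accuracy is perfect while its test accuracy lags well behind. Limiting `max_depth` acts as a simple form of regularization.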

Bias-Variance Tradeoff

Total Error = Bias² + Variance + Irreducible Error

Term	Description
Bias	Error from overly simple model assumptions
Variance	Error from the model being too sensitive to the training data
Irreducible Error	Inherent noise in the data
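The decomposition can be estimated empirically by refitting a model on many resampled training sets. This is a sketch under assumptions: the true function `sin(2πx)`, the noise level, the polynomial degrees, and the helper `bias_variance` are all illustrative choices.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
NOISE_STD = 0.3                        # irreducible noise level (assumed)
x_grid = np.linspace(0.05, 0.95, 50)   # fixed evaluation points

def true_fn(x):
    return np.sin(2 * np.pi * x)

def bias_variance(degree, n_trials=200, n_train=40):
    """Estimate bias^2 and variance of a polynomial fit by resampling training sets."""
    preds = np.empty((n_trials, x_grid.size))
    for t in range(n_trials):
        x = rng.uniform(0, 1, n_train)
        y = true_fn(x) + rng.normal(0, NOISE_STD, n_train)
        preds[t] = Polynomial.fit(x, y, degree)(x_grid)
    # Bias^2: squared gap between the average prediction and the truth
    bias_sq = np.mean((preds.mean(axis=0) - true_fn(x_grid)) ** 2)
    # Variance: how much predictions move between training sets
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

bv_results = {d: bias_variance(d) for d in (1, 6)}
for d, (b2, var) in bv_results.items():
    total = b2 + var + NOISE_STD ** 2
    print(f"degree={d}: bias^2={b2:.3f}  variance={var:.3f}  total={total:.3f}")
```

The straight line (degree 1) should show high bias and low variance, while the degree-6 polynomial trades most of that bias for higher variance.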

Code Example

# A simple example of the ML workflow
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 1. Load data
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 4. Evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}")

Summary

Key takeaways

ML lets computers learn from data instead of being hard-coded
There are 3 main types: Supervised, Unsupervised, Reinforcement
Standard workflow: Data → Preprocess → Train → Evaluate → Deploy
Balancing bias and variance is the key to a good model

➡️ Next: Hands-on practice with your first Jupyter Notebook!