Feature Engineering

June 25, 2024

Pipeline of maching learning

Machine learning algorithm extracts features from input and model classifies or clusters data according to the features

Types of features

Numerical feature
- continuous
- discrete
Categorical feature
- Ordinal
- Nominal
Array feature
- example
  - image
  - sound
  - …

Numerical feature

Numerical feature
- continuous
- discrete
Range can be different between numerical features
$\theta$’s range difference can lead to slow learning rate
$h(\boldsymbol{x}) = \sigma(\theta_1x_1 + \theta_2x_2 + \cdots\theta_nx_n + \theta_0)$

Normalizing numerical feature

Numerical feature scaling
- Min-Max feature scaling
- Mean normalization
- Standardization
Bucketing

Min-Max scaling

#### $x_{new} = \frac{x - x_{min}}{x_{max} - x_{min}}, 0\leq x_{new}\leq 1$

Mean Normalization

#### $x_{new} = \frac{x - x_{mean}}{x_{max} - x_{min}}, -0.5\leq x_{new}\leq 0.5$

Problem of min-max, mean normalization

min, max can be set too large if outlier exists
Resolution of general values can fall down by outlier values

Standardization

Use distribution of variable $x$ to get less affected by outlier values
Normalize by how many times of standard diviation is $x$ seperated from other average $x$s
$x_{new} = \frac{x-\mu}{\sigma}$

Bucketing

Extract generalized and abstracted features by bucketing numeric variables in certain range

Quantile Bucketing

Same range bucketing might not reflect data distribution well depends on data type
Quantile Bucketing bucket variables in a way that same number of variables exist in one bucket

Categorical feature

Ordinal
- Can be replaced with numbers ex) Score: low, middle, high => 1, 2, 3
Nominal
- Should not be replaced with numbers ex) Port of Embarkation: Cherbourg, Queenstown, Southampton =/=> 1, 2, 3

One-hot encoding

Allows nominal features to be presented as linear model
Reform each category with seperate binary variables

Bucketing for categorical feature

Similar categories can be grouped with bucketing

Polynomial feature

What if decision border is non-linear?
Input variables can be powered into Polynominal features and non-linear decision borders can be defined
$h(x) = \sigma(\theta_1x_1 + \theta_2x_2 + \theta_3x_1^2 +\theta_0)$

$h(x) = \sigma(\theta_1x_1 + \theta_2x_2 + \theta_1x_1^2 +\theta_2x_2^2 + \theta_0)$ -> Ellipse

$…$

Leave a comment

You may also enjoy

Eigenvalue and Eigenvector

May 7, 2025

About Eigenvalue and Eigenvector

Ensemble Method

June 30, 2024

About Ensemble Method

Deep Learning Outline

June 30, 2024

About Deep Learning

K-Means Clusturing

June 29, 2024

About K-Means Clusturing