Data Preprocessing
Data preprocessing is a crucial step in any machine learning pipeline. NovaML.jl provides a range of tools for cleaning, transforming, and preparing your data for model training. This page covers the main preprocessing techniques available in NovaML.
Scaling
Scaling features is often necessary to ensure that all features contribute equally to the model training process. NovaML offers several scaling methods:
StandardScaler
Standardizes features by removing the mean and scaling to unit variance.
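Concretely, each feature is rescaled as x_std = (x - μ) / σ, where μ and σ are the feature's mean and standard deviation computed on the training data.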
using NovaML.Datasets
iris = load_iris()
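# keep only petal length and petal width (columns 3 and 4)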
X = iris["data"][:, 3:4]
y = iris["target"]
Split the data into training and test sets.
using NovaML.ModelSelection
Xtrn, Xtst, ytrn, ytst = train_test_split(X, y, test_size=0.3, stratify=y)
Fit and transform with StandardScaler.
using NovaML.PreProcessing
stdscaler = StandardScaler()
# fit and transform
Xtrnstd = stdscaler(Xtrn)
# transform
Xtststd = stdscaler(Xtst)
# inverse transform recovers the original features
Xtrn_orig = stdscaler(Xtrnstd, type=:inverse)
MinMaxScaler
Scales each feature to the range [0, 1].
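Each feature is rescaled as x_scaled = (x - x_min) / (x_max - x_min), with x_min and x_max taken from the training data.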
minmax = MinMaxScaler()
# fit & transform
Xtrn_mm = minmax(Xtrn)
# transform
Xtst_mm = minmax(Xtst)
# inverse transform
Xtrn_orig = minmax(Xtrn_mm, type=:inverse)
Encoding
Categorical variables often need to be encoded into numerical form for machine learning algorithms.
LabelEncoder
Encodes target labels with values between 0 and n_classes-1.
using NovaML.PreProcessing
lblencode = LabelEncoder()
labels = ["M", "L", "XL", "M", "L", "M"]
# encode the labels as integers
encoded = lblencode(labels)
# recover the original labels
lblencode(encoded, :inverse)
OneHotEncoder
Encodes categorical features as a one-hot numeric array.
using NovaML.PreProcessing
labels = ["M", "L", "XL", "M", "L", "M"]
ohe = OneHotEncoder()
onehot = ohe(labels)
ohe(onehot, :inverse)
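For the six labels above, which contain three unique values, onehot is a 6×3 one-hot matrix with a single 1 in each row; the :inverse call recovers the original label vector.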
PolynomialFeatures
Generates polynomial and interaction features.
using NovaML.PreProcessing
X = rand(5, 2)
# 5×2 Matrix{Float64}:
# 0.85245 0.405935
# 0.139957 0.380467
# 0.730332 0.0418465
# 0.051091 0.570372
# 0.730245 0.128763
poly = PolynomialFeatures(
    degree=2,
    interaction_only=false,
    include_bias=true)
Xnew = poly(X)
# 5×6 Matrix{Float64}:
# 1.0 0.85245 0.405935 0.72667 0.346039 0.164783
# 1.0 0.139957 0.380467 0.0195881 0.0532491 0.144755
# 1.0 0.730332 0.0418465 0.533384 0.0305618 0.00175113
# 1.0 0.051091 0.570372 0.00261029 0.0291409 0.325324
# 1.0 0.730245 0.128763 0.533258 0.0940282 0.0165798
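Setting interaction_only=true keeps only the interaction terms and drops the pure powers. A minimal sketch, assuming NovaML follows the usual convention of producing the columns [1, x₁, x₂, x₁x₂]:
poly_int = PolynomialFeatures(degree=2, interaction_only=true, include_bias=true)
Xint = poly_int(X)
# expected: a 5×4 matrix with columns [1, x₁, x₂, x₁x₂]; x₁² and x₂² are omitted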
Imputation
Missing data is a common issue in real-world datasets. NovaML provides tools for handling missing values. The strategy argument must be one of :mean, :median, :most_frequent, or :constant.
SimpleImputer
Imputes missing values using a variety of strategies.
using NovaML.Impute
X = [1.0   2.0   3.0      4.0
     5.0   6.0   missing  8.0
     10.0  11.0  12.0     missing]
imputer = SimpleImputer(strategy=:mean)
Ximp = imputer(X)
# 3×4 Matrix{Union{Missing, Float64}}:
# 1.0 2.0 3.0 4.0
# 5.0 6.0 7.5 8.0
# 10.0 11.0 12.0 6.0
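The other strategies work analogously. A minimal sketch with :most_frequent, assuming each missing entry is replaced by the most frequent value in its column:
X2 = [1.0  2.0
      1.0  missing
      3.0  2.0
      1.0  2.0]
imputer_mf = SimpleImputer(strategy=:most_frequent)
X2imp = imputer_mf(X2)
# expected: the missing entry is filled with 2.0, the modal value of column 2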
Pipelines
You can combine multiple preprocessing steps into a single pipeline for easier management and application.
using NovaML.PreProcessing: StandardScaler
using NovaML.Decomposition: PCA
using NovaML.LinearModel: LogisticRegression
sc = StandardScaler()
pca = PCA(n_components=2)
lr = LogisticRegression()
# transform the data and fit the model
Xtrn |> sc |> pca |> X -> lr(X, ytrn)
# make predictions
ŷtst = Xtst |> sc |> pca |> lr
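Because sc and pca were fitted during the training pass, applying the same chain to Xtst only transforms the data with the learned parameters; see the note on the functor pattern at the end of this page.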
It is also possible to create pipelines using NovaML's pipe constructor:
using NovaML.Pipelines: pipe
# create a pipeline
pipeline = pipe(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression())
# fit the pipeline
pipeline(Xtrn, ytrn)
# make predictions
ŷ = pipeline(Xtst)
# make probability predictions
ŷprobs = pipeline(Xtst, type=:probs)
Text Preprocessing
For text data, NovaML offers vectorization techniques:
CountVectorizer
Converts a collection of text documents to a matrix of token counts.
docs = [
"Julia was designed for high performance",
"Julia uses multiple dispatch as a paradigm",
"Julia is dynamically typed, feels like a scripting language",
"But can also optionally be separately compiled",
"Julia is an open source project"];
using NovaML.FeatureExtraction
countvec = CountVectorizer();
bag = countvec(docs);
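# bag holds the token counts: one row per document, one column per vocabulary term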
countvec.vocabulary
# Dict{String, Int64} with 30 entries:
# "scripting" => 25
# "high" => 14
# "feels" => 12
# "is" => 15
# "separately" => 26
# "language" => 17
# "typed" => 28
# "but" => 6
# "a" => 1
# "for" => 13
# "optionally" => 21
# "paradigm" => 22
# "was" => 30
# "dynamically" => 11
# "also" => 2
# "an" => 3
# "multiple" => 19
# "be" => 5
# "julia" => 16
# "project" => 24
# "uses" => 29
# "source" => 27
# "open" => 20
# "performance" => 23
# "compiled" => 8
# "designed" => 9
# "as" => 4
# "can" => 7
# "like" => 18
# "dispatch" => 10
countvec(bag, type=:inverse)
# 5-element Vector{String}:
# "designed for high julia performance was"
# "a as dispatch julia multiple paradigm uses"
# "a dynamically feels is julia language like scripting typed"
# "also be but can compiled optionally separately"
# "an is julia open project source"
TfidfVectorizer
Converts a collection of raw documents to a matrix of TF-IDF features.
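TF-IDF weighs how often a term appears in a document against how rare it is across the corpus. A common formulation is tfidf(t, d) = tf(t, d) · (log(n / df(t)) + 1), where n is the number of documents and df(t) is the number of documents containing t; the exact smoothing NovaML applies may differ.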
tfidf = TfidfVectorizer()
result = tfidf(docs)
tfidf.vocabulary
tfidf(result, type=:inverse)
new_docs = [" The talk on the Unreasonable Effectiveness of Multiple Dispatch explains why it works so well."]
Xnew = tfidf(new_docs)
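Since tfidf has already been fitted on docs, this call only transforms new_docs using the learned vocabulary; terms that were not seen during fitting are typically ignored.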
Most preprocessing transformers in NovaML follow the functor pattern: the first call, transformer(X), both fits the transformer to the data and applies the transformation, and subsequent calls apply the already-fitted transformation. To apply the same transformation to new data (e.g., a test set), simply call the fitted transformer on it.
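For example, the scaling workflow at the top of this page follows exactly this pattern:
scaler = StandardScaler()
Xtrn_scaled = scaler(Xtrn)  # first call: fits the scaler, then transforms
Xtst_scaled = scaler(Xtst)  # later calls: transform only, using the fitted parameters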