Feature Engineering
Feature engineering is the process of using domain knowledge to create new features or transform existing ones to improve machine learning model performance. NovaML provides several tools to help you effectively engineer and select features for your models.
Polynomial Features
One common technique for capturing non-linear relationships in your data is to create polynomial and interaction features. NovaML's PolynomialFeatures transformer can generate these higher-order features automatically.
using NovaML.PreProcessing
# Create a transformer that will generate polynomial features up to degree 2
poly = PolynomialFeatures(degree=2)
# Example data with two features
X = [1 2; 3 4; 5 6]
# Generate polynomial features
X_poly = poly(X)
# The output will include:
# - Original features (x₁, x₂)
# - Squared terms (x₁², x₂²)
# - Interaction terms (x₁x₂)
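To make the expansion concrete, here is a minimal hand-written sketch in plain Julia that builds the same kinds of terms for a two-column matrix. The helper name poly2 and the column order are assumptions chosen for illustration. They are not NovaML's implementation, and the transformer's actual output layout may differ.

```julia
# Hand-rolled degree-2 expansion for a two-column matrix.
# poly2 is a hypothetical helper; the column order is an arbitrary choice.
function poly2(X::AbstractMatrix)
    x1 = X[:, 1]
    x2 = X[:, 2]
    # Columns: x₁, x₂, x₁², x₁x₂, x₂²
    return hcat(x1, x2, x1 .^ 2, x1 .* x2, x2 .^ 2)
end

X = [1 2; 3 4; 5 6]
X_manual = poly2(X)
# X_manual == [1 2 1 2 4; 3 4 9 12 16; 5 6 25 30 36]
```

Each row keeps its original values and gains the squares and the product, so two input columns become five.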
You can control the complexity of the generated features with various parameters:
# Generate only interaction terms, without higher-order terms
poly_interact = PolynomialFeatures(
    degree=2,
    interaction_only=true
)
# Generate features up to degree 3, including bias term
poly_cubic = PolynomialFeatures(
    degree=3,
    include_bias=true
)
Text Feature Extraction
NovaML provides tools for converting text data into numerical features that can be used by machine learning algorithms.
Count Vectorization
The CountVectorizer transforms text documents into a matrix of token counts:
using NovaML.FeatureExtraction
# Example text documents
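docs = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "The mat was red"
]
# Fit the vectorizer on the documents and build the count matrix
countvec = CountVectorizer();
bag = countvec(docs);
# Inspect the learned vocabulary
countvec.vocabulary
# Map the count matrix back to tokens (inverse transform)
countvec(bag, type=:inverse)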
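As a rough illustration of what a token-count matrix contains, the following plain-Julia sketch builds one by hand for the same documents. The whitespace tokenizer, the sorted vocabulary, and the names tokenize, vocab, index, and counts are assumptions for this example. NovaML's tokenization and column order may differ.

```julia
docs = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "The mat was red"
]

# Lowercase and split on whitespace; real tokenizers are more careful.
tokenize(doc) = String.(split(lowercase(doc)))

# A sorted vocabulary gives each token a stable column index.
vocab = sort(unique(vcat(tokenize.(docs)...)))
index = Dict(tok => j for (j, tok) in enumerate(vocab))

# counts[i, j] = number of times vocab[j] occurs in docs[i]
counts = zeros(Int, length(docs), length(vocab))
for (i, doc) in enumerate(docs)
    for tok in tokenize(doc)
        counts[i, index[tok]] += 1
    end
end
```

Here vocab holds nine distinct tokens, and the entry for "the" in the first row is 2 because the word occurs twice in that document.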
You can customize the vectorization process with various parameters:
vectorizer = CountVectorizer(
    min_df=2, # Ignore terms that appear in fewer than 2 documents
    max_df=0.95, # Ignore terms that appear in more than 95% of documents
    stop_words="english", # Remove common English stop words
    ngram_range=(1, 2) # Include both unigrams and bigrams
)
TF-IDF Vectorization
The TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features:
# Initialize and fit TF-IDF vectorizer
tfidf = TfidfVectorizer()
X_tfidf = tfidf(docs)
# Transform new documents using the fitted vectorizer
new_docs = ["A cat and dog play", "The red mat"]
X_new = tfidf(new_docs)
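The weighting itself can be sketched in a few lines of plain Julia. This example assumes the unsmoothed formula idf = log(N / df) and a made-up count matrix. Many libraries add smoothing or normalization on top of this, so the numbers will not necessarily match what TfidfVectorizer returns.

```julia
# Token counts for three documents over a four-term vocabulary
# (made-up numbers for illustration).
counts = [2 1 0 0;
          2 0 1 0;
          1 0 0 1]

n_docs = size(counts, 1)

# Document frequency: the number of documents each term appears in.
df = vec(sum(counts .> 0, dims=1))

# Plain inverse document frequency (an assumed formula, without smoothing).
idf = log.(n_docs ./ df)

# Weight each count by its term's IDF, broadcasting across rows.
tfidf_manual = counts .* idf'
```

The first term appears in every document, so its IDF is log(1) = 0 and its column is zeroed out, while terms confined to a single document receive the largest weight.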
Feature Selection
NovaML helps you identify and select the most important features for your models. Feature selection can help improve model performance, reduce overfitting, and speed up training.
Using Model-Based Feature Importance
Many models in NovaML provide feature importance scores that you can use for feature selection:
using NovaML.Tree
# Train a decision tree (X_train, y_train, and feature_names are assumed to be defined)
dt = DecisionTreeClassifier(max_depth=5)
dt(X_train, y_train)
# Get feature importances
importances = dt.feature_importances_
# Print feature importances with their names
for (name, importance) in zip(feature_names, importances)
    println("$name: $importance")
end
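Once you have importance scores, one simple way to act on them is to keep only the top k features. The sketch below uses plain Julia with hypothetical feature names, made-up scores, and random data. It is one possible selection rule, not a NovaML API.

```julia
# Hypothetical names and importance scores, for illustration only.
feature_names = ["age", "income", "height", "weight"]
importances = [0.10, 0.55, 0.05, 0.30]

# Indices of features sorted from most to least important.
order = sortperm(importances, rev=true)

# Keep the k highest-scoring features, restoring original column order.
k = 2
selected = sort(order[1:k])
selected_names = feature_names[selected]

# Take the matching columns from a (random, stand-in) feature matrix.
X_train = rand(5, 4)
X_selected = X_train[:, selected]
```

With these scores, selected_names is ["income", "weight"] and X_selected keeps two of the four columns.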
Combining Feature Engineering Steps
You can combine multiple feature engineering steps using NovaML's pipeline functionality:
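using NovaML.Pipelines: pipe
# StandardScaler, PCA, and LogisticRegression must also be in scope
# Build a pipeline: scale, reduce to two principal components, then classify
p = pipe(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression())
# Fit every step on the training data
p(Xtrn, ytrn)
# Predict on the test data
ŷ = p(Xtst)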
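Conceptually, a pipeline feeds the output of each step into the next. The following plain-Julia sketch shows only that chaining idea, using two stateless stand-in functions. The names standardize, add_squares, and run_steps are hypothetical, and the sketch omits the fitting and prediction logic that a real pipeline handles.

```julia
using Statistics: mean, std

# Two stand-in transformation steps written as plain functions.
standardize(X) = (X .- mean(X, dims=1)) ./ std(X, dims=1)
add_squares(X) = hcat(X, X .^ 2)

# Apply a sequence of steps left to right.
run_steps(X, steps...) = foldl((acc, step) -> step(acc), steps; init=X)

X = [1.0 2.0; 3.0 4.0; 5.0 6.0]
X_out = run_steps(X, standardize, add_squares)
```

Order matters here: the squares are computed from the standardized columns, which would not be the case if the two steps were swapped.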