How does a decision tree split data to make predictions?
A decision tree splits data based on features that produce the highest information gain or lowest impurity. Each internal node represents a condition, dividing the dataset into subsets. Gini impurity or entropy helps determine which split is most informative. The algorithm repeats the splitting process until a stopping condition is reached. Leaves represent final predicted outcomes, making trees easy to interpret.
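As a minimal sketch of how a split is scored, the NumPy snippet below computes Gini impurity before and after a candidate threshold; the toy feature values, labels, and thresholds are invented purely for illustration.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(feature, labels, threshold):
    """Weighted Gini impurity after splitting on feature <= threshold."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Toy data: splitting at 2.5 separates the two classes perfectly.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0, 0, 1, 1])
print(split_impurity(x, y, 2.5))   # 0.0 -> pure children, best split
print(split_impurity(x, y, 1.5))   # ~0.33 -> less informative split
```

The tree-building algorithm evaluates many candidate thresholds like this and keeps the one with the lowest weighted impurity (equivalently, the highest information gain).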
Explain how k-means clustering groups unlabeled data.
K-means begins by selecting K random centroids and assigning each data point to the nearest one. After assignment, centroids are recalculated as the mean of all assigned points.
This process iterates until centroids stabilize or the algorithm reaches a maximum number of iterations. The final clusters represent groups with similar characteristics. K-means works best with spherical, evenly sized clusters.
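A bare-bones NumPy version of the assignment/update loop described above; empty clusters and multiple restarts are deliberately ignored, and the two-blob data is made up for illustration.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain NumPy k-means: assign points to the nearest centroid, then re-average."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for each point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (empty clusters are not handled in this sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids have stabilized
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated blobs.
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
labels, centroids = kmeans(X, k=2)
```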
How does linear regression fit the best line for prediction?
Linear regression finds the optimal line by minimizing the sum of squared residuals between predictions and true values. The slope and intercept are calculated using analytical formulas or optimization techniques.
The model assumes a linear relationship between input features and output. Residual plots help evaluate goodness of fit. Regularization variants such as Ridge and Lasso add a penalty on coefficient size, which helps when the data is noisy or features are correlated.
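A small worked example of the closed-form fit via the normal equation on synthetic data; the true slope, intercept, and noise level are arbitrary choices for illustration.

```python
import numpy as np

# Toy data: y is roughly 2*x + 1 plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)

# Ordinary least squares via the normal equation: beta = (X^T X)^-1 X^T y.
X = np.column_stack([np.ones_like(x), x])          # add an intercept column
beta = np.linalg.solve(X.T @ X, X.T @ y)
intercept, slope = beta

residuals = y - X @ beta
print(slope, intercept)             # recovered parameters, close to 2 and 1
print(np.sum(residuals ** 2))       # the quantity the fit minimizes
```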
Explain how Support Vector Machines (SVM) find the optimal hyperplane.
SVMs try to find the hyperplane that maximizes the margin between two classes. Support vectors are the closest data points to the boundary, controlling its position. A larger margin usually leads to better generalization. Kernel functions help SVMs classify non-linear data by projecting it into higher-dimensional space. SVMs are powerful but computationally heavy for very large datasets.
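A short scikit-learn sketch on two synthetic blobs; the dataset, `C` value, and kernel choice are illustrative, not a recommended configuration.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two separable blobs; a linear kernel is enough here.
X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_vectors_.shape)   # the few points that pin down the margin
print(clf.score(X, y))

# For non-linear boundaries, swap in a kernel, e.g. SVC(kernel="rbf", gamma="scale").
```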
Compare supervised and unsupervised learning.
| Aspect | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Data Type | Labeled | Unlabeled |
| Goal | Predict outputs | Find patterns |
| Algorithms | Regression, SVM, Decision Trees | Clustering, PCA |
| Evaluation | Accuracy, RMSE | Silhouette Score |
Supervised learning works with labeled data to predict outcomes, while unsupervised learning focuses on discovering hidden patterns in unlabeled data. Both approaches are essential in machine learning pipelines and are used based on the problem type.
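A side-by-side sketch using scikit-learn's Iris dataset: the supervised classifier consumes the labels, while k-means only sees the features. The specific models chosen here are just examples.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the labels y are used during fitting and evaluation.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("supervised accuracy:", clf.score(X, y))

# Unsupervised: only X is used; the algorithm discovers its own grouping.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((km.labels_ == c).sum()) for c in range(3)])
```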
What are the differences between bagging and boosting?
| Feature | Bagging | Boosting |
|---|---|---|
| Approach | Parallel training | Sequential training |
| Error Reduction | Reduces variance | Reduces bias |
| Model Dependence | Independent of each other | Each model corrects errors of the previous |
| Examples | Random Forest | AdaBoost, XGBoost |
Bagging trains multiple models independently and combines their results, while boosting builds models sequentially to correct previous errors. Both ensemble techniques improve prediction accuracy and model performance.
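One way to see the difference in code, assuming scikit-learn: a bagged ensemble of independent trees versus AdaBoost's sequence of weak learners on a synthetic dataset (the dataset and estimator counts are arbitrary).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: many trees trained independently on bootstrap samples.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Boosting: shallow learners trained sequentially, each focusing on earlier errors.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)

print("bagging :", cross_val_score(bag, X, y, cv=5).mean())
print("boosting:", cross_val_score(boost, X, y, cv=5).mean())
```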
Compare L1 and L2 regularization.
| Feature | L1 Regularization | L2 Regularization |
|---|---|---|
| Penalty | Sum of absolute weights | Sum of squared weights |
| Effect | Sparse weights | Smooth weights |
| Removes Features | Yes | Rarely |
| Algorithms | Lasso | Ridge |
L1 regularization pushes some coefficients to zero, making it useful for feature selection, while L2 regularization reduces the overall magnitude of weights to improve model stability.
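A quick sketch of the sparsity effect, assuming scikit-learn: Lasso (L1) typically zeroes out many of the uninformative coefficients, while Ridge (L2) shrinks them without removing any. The dataset and `alpha` values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only 5 of them actually informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("L1 zeroed-out coefficients:", int(np.sum(lasso.coef_ == 0)))   # usually many
print("L2 zeroed-out coefficients:", int(np.sum(ridge.coef_ == 0)))   # typically 0
```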
Compare Random Forests and Gradient Boosting Machines (GBM).
| Metric | Random Forest | Gradient Boosting |
|---|---|---|
| Training | Parallel trees | Sequential trees |
| Overfitting | Low | Higher (needs tuning) |
| Speed | Fast | Slower |
| Performance | Good | Best with tuning |
Random Forest models are fast and robust, making them a strong default choice, while Gradient Boosting methods can deliver higher accuracy on complex datasets when carefully tuned.
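A minimal comparison sketch, assuming scikit-learn; the synthetic dataset and hyperparameters are illustrative defaults, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                 max_depth=3, random_state=0)

print("random forest    :", cross_val_score(rf, X, y, cv=5).mean())
print("gradient boosting:", cross_val_score(gbm, X, y, cv=5).mean())
```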
What is the bias-variance tradeoff in machine learning?
The bias-variance tradeoff describes how overly simple models underfit (high bias) and overly complex models overfit (high variance). Achieving optimal generalization requires balancing the two. Techniques like cross-validation and regularization help tune complexity. Ensemble models often reduce variance effectively. Understanding this tradeoff is essential for building stable ML systems.
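One way to observe the tradeoff, assuming scikit-learn: fit polynomial regressions of increasing degree to noisy sine data and compare cross-validated scores. The degrees, noise level, and data are illustrative.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=100)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"degree {degree:2d}: CV R^2 = {score:.2f}")
# degree 1 underfits (high bias); degree 15 overfits (high variance).
```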
How does cross-validation improve model evaluation?
Cross-validation splits data into multiple folds to train and test the model multiple times. This reduces bias introduced by any single train-test split. It also provides a more reliable estimate of performance on unseen data. K-fold cross-validation is widely used because it balances accuracy and computation. The method helps detect overfitting early.
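A short scikit-learn sketch of 5-fold cross-validation; the dataset and model are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold CV: the model is trained and scored on 5 different train/test splits.
scores = cross_val_score(clf, X, y, cv=5)
print(scores)                        # one accuracy value per fold
print(scores.mean(), scores.std())   # average performance and its spread
```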
What is feature scaling, and why is it important?
Feature scaling ensures that all numeric features contribute proportionally during training. Algorithms such as SVM and KNN, and models trained with gradient descent, work better with standardized inputs. Common techniques include normalization (min-max) and standardization (z-score). Proper scaling improves convergence speed and model stability. It is essential when features differ significantly in range.
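A small sketch of both techniques, assuming scikit-learn; the income/age values are invented for illustration, and in practice the scaler should be fit on the training split only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Toy data: two features on wildly different scales (income vs. age).
X = np.array([[30_000.0, 25.0],
              [60_000.0, 40.0],
              [120_000.0, 55.0]])

print(StandardScaler().fit_transform(X))   # z-score: mean 0, std 1 per column
print(MinMaxScaler().fit_transform(X))     # min-max: rescaled to [0, 1] per column
```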
What are outliers, and how do they affect ML models?
Outliers are data points that deviate significantly from normal patterns. They can distort model parameters, especially in algorithms like linear regression. Detecting them requires statistical, visual, or clustering techniques. Options include removing, transforming, or capping outliers. Proper handling ensures accurate predictions and more reliable models.
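One common statistical detector is Tukey's IQR rule, sketched below in NumPy; the toy values and the 1.5 multiplier are conventional but illustrative choices.

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (x < lower) | (x > upper)

x = np.array([10, 12, 11, 13, 12, 95, 11, 10])   # 95 is an obvious outlier
mask = iqr_outliers(x)
print(x[mask])    # detected outliers
print(x[~mask])   # data with outliers removed (one handling option among several)
```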
Explain the role of the cost function in machine learning.
The cost function measures how far model predictions deviate from true values. Optimization algorithms aim to minimize this function during training. Different tasks use different cost functions, such as MSE for regression and cross-entropy for classification. A well-chosen cost function improves convergence behavior. It plays a direct role in determining model performance.
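The two cost functions mentioned above, written out in NumPy as a minimal sketch; the example predictions are made up.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, the usual regression cost."""
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Log loss for binary classification; penalizes confident wrong predictions."""
    p = np.clip(p_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(mse(np.array([3.0, 5.0]), np.array([2.5, 5.5])))              # 0.25
print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.2])))  # ~0.16
```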
What is dimensionality reduction, and when should it be used?
Dimensionality reduction reduces the number of input variables while preserving essential information. Techniques like PCA and t-SNE identify underlying structure in high-dimensional data. It helps simplify models, speed up computation, and reduce overfitting. It is especially useful when datasets contain hundreds of features. Visualization becomes easier after reduction.
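A minimal PCA sketch, assuming scikit-learn: the 64-dimensional digits dataset is projected down to two components (the dataset and component count are illustrative).

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)        # 64-dimensional pixel features
pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)

print(X.shape, "->", X_2d.shape)           # (1797, 64) -> (1797, 2)
print(pca.explained_variance_ratio_)       # variance captured by each component
```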
What is the purpose of a confusion matrix?
A confusion matrix summarizes classification performance by showing correct and incorrect predictions. It contains true positives, false positives, true negatives, and false negatives. Metrics like precision, recall, and F1-score are derived from it. This matrix helps diagnose model weaknesses, such as class imbalance issues. It provides insight beyond simple accuracy.
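A small worked example with scikit-learn; the true and predicted labels are made up to show how the derived metrics fall out of the matrix.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
print(confusion_matrix(y_true, y_pred))   # [[3 1] [1 3]]
print(precision_score(y_true, y_pred))    # TP / (TP + FP) = 0.75
print(recall_score(y_true, y_pred))       # TP / (TP + FN) = 0.75
print(f1_score(y_true, y_pred))           # harmonic mean of the two
```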
How does logistic regression perform classification?
Logistic regression uses the sigmoid function to map predictions to probabilities. Instead of predicting continuous values, it models the likelihood of classes. The decision boundary is determined by thresholding these probabilities. Training uses maximum likelihood estimation to fit parameters. Despite its simplicity, logistic regression is powerful for linearly separable problems.
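A sketch, assuming scikit-learn, that recomputes the model's probabilities by hand with the sigmoid to show the mechanism; the synthetic dataset and 0.5 threshold are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

# Under the hood: sigmoid(w.x + b) gives P(class = 1); predictions threshold at 0.5.
z = X @ clf.coef_.ravel() + clf.intercept_[0]
probs = 1.0 / (1.0 + np.exp(-z))
manual_pred = (probs >= 0.5).astype(int)

print(np.allclose(probs, clf.predict_proba(X)[:, 1]))   # True
print((manual_pred == clf.predict(X)).all())             # True
```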
What is the purpose of evaluation metrics beyond accuracy?
Accuracy alone can be misleading, especially with imbalanced datasets. Metrics like precision, recall, and F1-score provide deeper insights. ROC-AUC evaluates model quality across thresholds. These metrics help understand false positives and false negatives. Proper metric selection ensures fair evaluation aligned with business needs.
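A sketch of why this matters, assuming scikit-learn: on a heavily imbalanced synthetic dataset, accuracy looks flattering while F1 and ROC-AUC tell a fuller story. The class weights and model are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced data: roughly 95% negatives, 5% positives.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
pred = clf.predict(X_te)

print("accuracy:", accuracy_score(y_te, pred))   # high even for weak models
print("F1      :", f1_score(y_te, pred))         # sensitive to the minority class
print("ROC-AUC :", roc_auc_score(y_te, proba))   # threshold-independent ranking quality
```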
What is the ML pipeline, and why is it important?
An ML pipeline automates data loading, preprocessing, model training, validation, and deployment. It ensures reproducibility and consistency across large data workflows. Pipelines help manage transformations in the correct order. Tools like Scikit-Learn, Airflow, and MLflow streamline execution. Pipelines reduce errors and make models easier to maintain.
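A minimal scikit-learn Pipeline sketch; the steps and dataset are illustrative, but the key point is that each transformation is fit only on training data and applied in order.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Steps run in order; the scaler is fit on the training split only, avoiding leakage.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)
print(pipe.score(X_te, y_te))
```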
What is the purpose of hyperparameter tuning?
Hyperparameter tuning searches for the best model configuration that maximizes performance. Methods include grid search, random search, and Bayesian optimization. Tuning controls depth, learning rate, regularization, and more. Well-tuned models generalize better and reduce overfitting. It is an essential step before deploying ML systems.
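A grid-search sketch, assuming scikit-learn; the parameter grid here is a small illustrative example, not a recommended search space.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Every combination in the grid is evaluated with 5-fold cross-validation.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)   # mean cross-validated accuracy of the best combination
```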





