Support Vector Machines
Support Vector Machines (SVMs) are powerful algorithms used for both classification and regression tasks. Their core idea is to identify an optimal hyperplane that best separates data points into different categories. By maximizing the distance (margin) between the hyperplane and the nearest data points, SVMs achieve robust and accurate decision boundaries.
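As a quick illustration, the following minimal sketch fits a linear SVM on a small synthetic two-class dataset; the use of scikit-learn and the toy data are assumptions made purely for illustration.

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data purely for illustration
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# A linear SVM looks for the maximum-margin separating hyperplane
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.predict(X[:5]))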
Types of SVM
1. Linear SVM: Finds a straight line (in 2D) or a flat hyperplane (in higher dimensions) that divides the classes.
2. Non-linear (Kernel) SVM: Uses kernel functions to transform the data into higher dimensions, enabling the model to capture complex boundaries.
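In scikit-learn terms (assuming that library), the two types differ only in the kernel argument:

from sklearn.svm import SVC

# Linear SVM: a flat separating hyperplane in the original feature space
linear_svm = SVC(kernel="linear")

# Non-linear SVM: the RBF kernel implicitly maps the data into a
# higher-dimensional space where a linear separator may exist
nonlinear_svm = SVC(kernel="rbf", gamma="scale")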
Concepts in SVM
1. Hyperplane:
→In 2D, the hyperplane is simply a line.
→In 3D, it becomes a plane.
→In higher dimensions, it generalizes to an n-dimensional flat surface.
→The general equation is w ⋅ x + b = 0.
2. Margin:
→The margin is the distance between the separating hyperplane and the nearest data points from each class.
→Maximizing the margin ensures better generalization (the standard formulation is given after this list).
3. Hard Margin vs. Soft Margin:
→Hard margin: Strictly separates data with no misclassification; works only when data are linearly separable.
→Soft margin: Allows some misclassification for noisy or overlapping datasets, improving flexibility.
4. Support Vectors:
→The training data points that lie closest to the hyperplane.
→They are critical in defining the position and orientation of the decision boundary.
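Putting these concepts together, the standard maximum-margin formulation can be stated as follows:
→Hyperplane: w ⋅ x + b = 0, with correctly classified points satisfying yᵢ(w ⋅ xᵢ + b) ≥ 1 in the hard-margin case.
→Margin width: 2 / ‖w‖, so maximizing the margin is equivalent to minimizing ‖w‖.
→Soft margin: minimize (1/2)‖w‖² + C Σᵢ ξᵢ subject to yᵢ(w ⋅ xᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0, where the slack variables ξᵢ absorb violations and C controls the trade-off between a wide margin and misclassification.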
How SVM Works
1. The algorithm searches for the hyperplane w⋅x+b=0 that separates classes.
2. Support vectors (the boundary data points) are identified.
3. The hyperplane is selected such that the margin is maximized, ensuring a clear separation between classes.
4. For non-linear problems, the kernel trick (with polynomial, RBF, or other kernels) is applied to project the data into higher dimensions where separation is possible, as in the sketch below.
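A minimal end-to-end sketch of these steps, again assuming scikit-learn and a synthetic non-linear dataset:

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-moons: not separable by a straight line
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)

# Step 4: the RBF kernel projects the data implicitly, so the learned
# boundary is curved in the original feature space
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

# Steps 2-3: the fitted model keeps only the support vectors that pin
# down the maximum-margin boundary
print("support vectors per class:", clf.n_support_)
print("total support vectors:", clf.support_vectors_.shape[0])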
Assumptions of SVM
1. Data is (approximately) separable: SVM assumes that a decision boundary exists—linear or transformed via kernels—that can separate the classes.
2. Independent and Identically Distributed (i.i.d.) Samples: Training data points are assumed to be drawn independently from the same underlying distribution.
3. Relevant Features Provided: SVM assumes that the input features contain enough information to distinguish between classes.
4. Margin Maximization Improves Generalization: The algorithm assumes that maximizing the margin leads to better prediction accuracy on unseen data.
5. Kernel Appropriateness: When using kernel SVM, it is assumed that the chosen kernel function corresponds well to the structure of the data (a simple cross-validation check is sketched below).
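One simple way to sanity-check the kernel assumption is to compare kernels by cross-validated accuracy. The sketch below does this on a synthetic dataset; scikit-learn and the toy data are assumptions for illustration only.

from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

# The kernel whose geometry matches the data should score higher
for kernel in ("linear", "rbf"):
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    print(kernel, round(scores.mean(), 3))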
Advantages of SVM
1. High Accuracy: SVMs often achieve strong predictive performance, especially in binary classification tasks.
2. Effective in High Dimensions: They work well when the number of features is very large compared to the number of samples (e.g., text classification, gene expression data).
3. Robust to Overfitting (with proper regularization): By maximizing the margin between classes, SVMs provide good generalization capability.
4. Flexibility with Kernels: The kernel trick allows SVMs to handle non-linear decision boundaries effectively (e.g., polynomial, radial basis function kernels).
5. Sparse Solution: Only a subset of the training points (the support vectors) is used in decision making, which makes the model relatively compact (illustrated after this list).
6. Clear Geometric Interpretation: The decision boundary and margin provide intuitive visualization in lower dimensions.
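To illustrate the sparsity point (advantage 5), a fitted SVM typically stores only a subset of the training points; a rough sketch, again assuming scikit-learn and synthetic data:

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
clf = SVC(kernel="rbf").fit(X, y)

# Predictions depend only on the stored support vectors
print("training samples :", X.shape[0])
print("support vectors  :", clf.support_vectors_.shape[0])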
Disadvantages of SVM
1. Computationally Intensive: Training becomes slow and memory-heavy for very large datasets (millions of samples).
2. Choice of Kernel and Parameters: Performance depends heavily on selecting the right kernel function and tuning its parameters (C, gamma); a small tuning sketch follows this list.
3. Not Naturally Probabilistic: Unlike logistic regression or Bayesian models, SVMs don’t directly give probability estimates (though extensions exist).
4. Difficult to Interpret in Non-linear Cases: With kernel functions, the resulting model is harder to interpret compared to linear models.
5. Sensitivity to Noise and Overlapping Classes: If classes overlap significantly or data contains outliers, SVM performance may degrade.
6. Limited Scalability for Real-time Predictions: Prediction time increases with the number of support vectors, which may be problematic in large datasets.
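For disadvantages 2 and 3, a common workflow is to tune C and gamma with cross-validated grid search and, if probabilities are needed, to enable Platt-scaling calibration. The sketch below assumes scikit-learn; the grid values are illustrative, not recommendations.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Small illustrative grid; real searches are usually wider and log-spaced
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)

# probability=True adds Platt scaling on top of the margin-based scores
prob_clf = SVC(kernel="rbf", probability=True, **search.best_params_).fit(X, y)
print(prob_clf.predict_proba(X[:3]))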