Conference Presentation · NIDS Survey

From Data
to Defense

An analytical overview of machine and deep learning models for Network Intrusion Detection Systems, evaluated across two benchmark datasets under a unified experimental framework.

7
Models Evaluated
2
Benchmark Datasets
8
Evaluation Metrics
4M+
Records Processed

Benchmark Datasets

Two widely used intrusion detection datasets selected to represent both classical and modern network traffic environments.

Dataset 01

KDD Cup 99

One of the earliest and most widely used IDS benchmarks. Contains diverse attack types across 41 features, enabling systematic ML evaluation despite its age.

Total Records: ~4,000,000
Features: 41
Attack Types: 10 (after filtering)
Largest Class: smurf · 2,807,886
Normal Traffic: 972,781
Dataset 02

UNSW-NB15

A modern dataset with realistic traffic patterns and diverse attack categories. Reflects contemporary network environments with significant class imbalance.

Total Records: ~2,367,624
Attack Categories: 9
Benign Traffic: 2,237,731
Rarest Class: Worms · 158
Imbalance Ratio: High

Model Performance Results

Comprehensive metrics, including balanced accuracy, F1-macro, G-Mean, precision, and recall, evaluated for all models on both datasets, with and without SMOTE oversampling.
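As a reference for how these scores relate, here is a minimal pure-Python sketch of the three imbalance-aware metrics used throughout (balanced accuracy, F1-macro, G-Mean). It is illustrative only; the study presumably relied on library implementations such as scikit-learn's.

```python
import math

def multiclass_metrics(y_true, y_pred):
    """Return (balanced accuracy, macro-F1, G-Mean) for multiclass labels."""
    labels = sorted(set(y_true))
    recalls, f1s = [], []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        rec = tp / (tp + fn) if tp + fn else 0.0
        prec = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    balanced_acc = sum(recalls) / n          # mean of per-class recalls
    f1_macro = sum(f1s) / n                  # unweighted mean of per-class F1
    g_mean = math.prod(recalls) ** (1 / n)   # geometric mean of recalls
    return balanced_acc, f1_macro, g_mean
```

Note how a single missed minority class drives the G-Mean to zero while accuracy barely moves, which is exactly the pattern visible in the UNSW-NB15 tables below.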

// Model Performance — UNSW-NB15 (Raw)
Model           Type  Accuracy  Bal. Acc.  Precision  Recall  F1-Macro  G-Mean
Random Forest   ML    0.9757    0.5657     0.6197     0.5657  0.5810    0.4822
XGBoost         ML    0.9738    0.4247     0.6435     0.4247  0.4510    0.0180
Decision Tree   ML    0.9753    0.4703     0.5760     0.4703  0.4850    0.0238
LSTM-CNN        DL    0.9729    0.3823     0.4673     0.3823  0.3950    0.0001
ANN             DL    0.9589    0.3744     0.3568     0.3744  0.2990    ≈0
LSTM            DL    0.9673    0.3287     0.3546     0.3287  0.3250    ≈0
CNN             DL    0.9524    0.2781     0.3386     0.2781  0.2730    ≈0
// Model Performance — UNSW-NB15 (After SMOTE) — Degradation observed
Model (SMOTE)   Accuracy  Bal. Acc.  Precision  Recall  F1-Macro  G-Mean
LSTM-CNN        0.0112    0.1241     0.0918     0.1241  0.0077    0.0017
Random Forest   0.9194    0.1001     0.1919     0.1001  0.0961    ≈0
ANN             0.9194    0.1000     0.0919     0.1000  0.0958    ≈0
CNN             0.9194    0.1000     0.0919     0.1000  0.0958    ≈0
XGBoost         0.1104    0.1108     0.1044     0.1108  0.0248    ≈0
Decision Tree   0.8633    0.0960     0.0930     0.0960  0.0937    ≈0
LSTM            0.0352    0.0683     0.0814     0.0683  0.0097    ≈0
// ⚠ SMOTE degraded ALL metrics on UNSW-NB15 — Not recommended for high-dimensional imbalanced data

Applying SMOTE to the UNSW-NB15 dataset caused substantial deterioration across all models: balanced accuracy, F1-macro, and G-Mean dropped sharply relative to the raw-data setting, indicating that sample-level oversampling is ineffective for high-dimensional, complex datasets. Model robustness and algorithm-level imbalance handling matter more here than data-level resampling.
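For context on why interpolation-based oversampling can misfire, here is a minimal SMOTE-style sketch (nearest-neighbour interpolation in feature space). Real experiments would use imbalanced-learn's SMOTE; this simplified function only illustrates the mechanism. In high-dimensional spaces, points interpolated between minority samples can easily land in regions occupied by other classes, which is one plausible driver of the degradation above.

```python
import math
import random

def smote_like(minority, n_new, seed=0):
    """Generate n_new synthetic points by interpolating each sampled
    minority point toward its nearest minority-class neighbour.
    Requires at least 2 minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # nearest neighbour among the *other* minority samples
        nn = min((m for m in minority if m is not x),
                 key=lambda m: math.dist(x, m))
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + lam * (ni - xi)
                               for xi, ni in zip(x, nn)))
    return synthetic
```

Each synthetic point lies on the line segment between two real minority samples, so in 41-dimensional feature space (KDD) or higher, nothing guarantees it stays inside the minority class's true decision region.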

// Model Performance — KDD Cup 99 (Raw)
Model           Type  Accuracy  Bal. Acc.  Precision  Recall  F1-Macro  G-Mean
Random Forest   ML    0.9999    0.9940     0.9969     0.9940  0.9955    0.9939
XGBoost         ML    0.9997    0.9909     0.9941     0.9909  0.9925    0.9908
LSTM-CNN        DL    0.9992    0.9848     0.9687     0.9848  0.9758    0.9846
Decision Tree   ML    0.9988    0.9614     0.9659     0.9614  0.9622    0.9586
LSTM            DL    0.9985    0.9673     0.9401     0.9673  0.9517    0.9663
ANN             DL    0.9987    0.9427     0.9622     0.9427  0.9519    0.9362
CNN             DL    0.9899    0.6256     0.7080     0.6256  0.6446    0.0051
// Model Performance — KDD Cup 99 (After SMOTE) — Selective improvements
Model (SMOTE)   Accuracy  Bal. Acc.  Precision  Recall  F1-Macro  G-Mean
Random Forest   0.9999    0.9922     0.9978     0.9922  0.9949    0.9920
XGBoost         0.9997    0.9971     0.9864     0.9971  0.9916    0.9971
ANN             0.9926    0.9953     0.7993     0.9954  0.8655    0.9953
LSTM-CNN        0.9992    0.9907     0.9589     0.9908  0.9737    0.9906
Decision Tree   0.9970    0.9931     0.9252     0.9931  0.9386    0.9930
LSTM            0.9981    0.9980     0.9111     0.9912  0.9431    0.9910
CNN             0.9494    0.9721     0.5231     0.9721  0.6199    0.9717
// ✓ SMOTE on KDD Cup 99 — Improvements for DL models, marginal effect on tree-based

SMOTE notably improved minority-class recall for the deep learning models (ANN: 0.94 → 0.99, CNN: 0.63 → 0.97). However, precision dropped, indicating more false positives. Tree-based models (RF, XGBoost) were largely unaffected: they inherently handle class imbalance well and gain little from data-level oversampling.
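A common algorithm-level alternative to oversampling is class weighting. The sketch below reproduces the "balanced" weighting heuristic popularized by scikit-learn's class_weight="balanced" option (weight_c = n_samples / (n_classes × count_c)); whether the study used it is not stated, so treat this as a general illustration of imbalance handling without resampling.

```python
from collections import Counter

def balanced_weights(y):
    """Inverse-frequency class weights: rare classes get larger weights,
    so their misclassification costs more during training."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}
```

With a 2,237,731:158 ratio like benign vs. Worms in UNSW-NB15, this assigns the minority class a weight several orders of magnitude larger, without fabricating any synthetic samples.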


Computational Cost Analysis

Inference time and memory footprint are critical factors for real-world NIDS deployment. Deep learning architectures show significantly higher latency and resource consumption.

// Inference Time (s) and Memory Usage (MB) — UNSW-NB15 and KDD Cup 99 (note: RF reaches 22,424 MB on UNSW-NB15)
Model           Type  UNSW Time (s)  UNSW Mem (MB)  KDD Time (s)  KDD Mem (MB)
Decision Tree   ML    0.025          1,579          0.026         4,959
XGBoost         ML    0.333          1,571          0.226         4,959
Random Forest   ML    0.571          22,424         0.373         4,989
ANN             DL    1.622          5,267          1.410         4,503
CNN             DL    1.649          4,375          1.314         5,096
LSTM            DL    1.902          4,710          1.475         5,104
LSTM-CNN        DL    3.721          4,720          2.995         5,112
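Latency figures like those above can be reproduced with a simple wall-clock benchmark. The sketch below is a generic stdlib-only harness; the predict function and batch are placeholders, not artifacts from the study.

```python
import statistics
import time

def bench_inference(predict_fn, batch, repeats=10):
    """Median wall-clock latency (seconds) of predict_fn over a batch.
    The median damps warm-up and scheduler noise better than the mean."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        predict_fn(batch)
        times.append(time.perf_counter() - t0)
    return statistics.median(times)
```

In practice the first call should also be discarded for JIT-compiled or GPU-backed models, where warm-up dominates.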

Key Findings

Six critical insights derived from unified experimental evaluation across models and datasets.

🌲

Tree Models Dominate on Structured Data

Random Forest and XGBoost achieved near-perfect F1-macro (>0.99) on KDD Cup 99. Ensemble methods consistently outperformed all other approaches on structured, tabular datasets.

⚖️

Accuracy is a Misleading Metric

Multiple DL models reported >96% accuracy on UNSW-NB15, yet their balanced accuracy and G-Mean revealed near-zero minority class detection, exposing the danger of relying on a single metric.
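A toy example (hypothetical numbers, not drawn from the study) makes this failure mode concrete: a degenerate classifier that always predicts the majority class scores high accuracy while detecting nothing.

```python
# 97:3 benign/attack split; the model never flags an attack.
y_true = ['benign'] * 97 + ['attack'] * 3
y_pred = ['benign'] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall_benign = sum(t == p == 'benign' for t, p in zip(y_true, y_pred)) / 97
recall_attack = sum(t == p == 'attack' for t, p in zip(y_true, y_pred)) / 3
balanced_acc = (recall_benign + recall_attack) / 2

print(accuracy)      # 0.97 — looks excellent
print(balanced_acc)  # 0.5  — reveals zero attack detection
```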

🧠

Deep Learning: High Cost, Mixed Returns

The LSTM-CNN hybrid achieved strong KDD Cup 99 performance (F1-macro 0.976) but requires up to 3.7 s per inference pass and roughly 5 GB of memory. This computational cost significantly limits real-time applicability.

🚫

SMOTE Harmful on High-Dimensional Data

Applying SMOTE to UNSW-NB15 degraded all metrics across all models substantially. Simple sample-level balancing cannot compensate for feature complexity and distribution shifts.

SMOTE Helps DL on KDD Cup 99

ANN balanced accuracy improved from 0.943 to 0.995, and CNN from 0.626 to 0.972 after SMOTE on KDD Cup 99. However, this came at the cost of precision degradation.

Decision Tree Best for Real-Time

With ~0.025 s inference time and a small memory footprint, Decision Trees are optimal for latency-sensitive deployments where sub-second detection is a hard requirement.


Dataset–Model Selection Guidelines

Practical decision framework for selecting the right model and strategy based on deployment requirements and dataset characteristics.

Deployment Scenario | Recommended Model | Dataset | Key Evidence
Structured, low-complexity traffic | RF / XGBoost | KDD Cup 99 | F1-macro >0.99; near-perfect balanced accuracy
High-dimensional, imbalanced traffic | RF | UNSW-NB15 | Best balanced accuracy (0.57) and G-Mean (0.48) among all tested models
Real-time / latency-sensitive deployment | DT | Both | Fastest inference (~0.025 s); smallest UNSW-NB15 memory footprint
Sequential / temporal attack patterns | LSTM / LSTM-CNN | KDD Cup 99 | Strong recall on ordered flow attacks; temporal dependency modeling
Imbalanced data + deep learning | SMOTE + ANN/LSTM | KDD Cup 99 | Balanced accuracy improved from 0.94 to 0.99; recall gains outweigh precision drop
High-dimensional data + imbalance | RF (no SMOTE) | UNSW-NB15 | SMOTE degraded all metrics on high-dimensional data; avoid sample-level balancing
Conclusion & Future Directions

Ensemble Methods Offer the Best Production Trade-off

Decision Trees achieve inference times as low as 0.025 s for real-time monitoring, while the hybrid LSTM-CNN model exceeds 3 s latency with roughly 5 GB of memory, making it suitable only for offline analysis. RF and XGBoost provide the best balance of strong detection performance and manageable inference cost, making them the most practical choice for production NIDS environments.

Future research should explore transformer-based architectures for NIDS, leveraging self-attention for parallel processing and better capture of global traffic patterns, potentially overcoming the computational bottlenecks observed in the LSTM and LSTM-CNN models.