Network Intrusion Detection Systems (NIDS) play a critical role in identifying and mitigating malicious activity within computer networks. With the rapid evolution of natural language processing (NLP), Large Language Models (LLMs) have emerged as transformative tools across a variety of domains. LLMs such as OpenAI's GPT series and Meta's LLaMA models have demonstrated remarkable performance on tasks such as language generation, reasoning, and classification. Their ability to understand and process vast amounts of data has enabled significant advances in areas such as healthcare, finance, and cybersecurity. Recent trends highlight their potential to handle unstructured data, perform complex reasoning, and adapt to a wide range of applications, making them a promising technology for enhancing NIDS.
This thesis explores the application of advanced NLP techniques, particularly LLMs, to enhance NIDS performance. We investigate multiple approaches, including Masked Language Models (MLMs) such as BERT, RoBERTa, and DistilBERT, as well as large-scale generative models such as Gemma (in its 2B, 9B, and 27B parameter variants), for intrusion detection tasks.
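As a concrete illustration of the MLM-based setup, the sketch below fine-tunes DistilBERT as a standalone binary traffic classifier on textified flow records. It is a minimal example rather than the thesis implementation: the CSV path, column names, label mapping, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (not the thesis code): fine-tune DistilBERT as a standalone
# binary traffic classifier on textified NSL-KDD-style records. The CSV path,
# column names, label mapping, and hyperparameters are illustrative assumptions.
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

df = pd.read_csv("nsl_kdd_train.csv")                      # hypothetical path
# Serialize every feature column of a record into one whitespace-joined "sentence".
df["text"] = df.drop(columns=["label"]).astype(str).agg(" ".join, axis=1)
df["labels"] = (df["label"] != "normal").astype(int)       # 0 = benign, 1 = attack

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

ds = Dataset.from_pandas(df[["text", "labels"]]).map(
    lambda batch: tok(batch["text"], truncation=True, max_length=128),
    batched=True)
split = ds.train_test_split(test_size=0.2, seed=42)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilbert-nids",
                           num_train_epochs=1,
                           per_device_train_batch_size=32),
    train_dataset=split["train"],
    eval_dataset=split["test"],
    tokenizer=tok,                                         # enables dynamic padding
)
trainer.train()
print(trainer.evaluate())
```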
Our study begins by implementing standard machine learning models, namely k-Nearest Neighbors (kNN), Random Forest, XGBoost, Support Vector Machines (SVM), and Logistic Regression, on two benchmark datasets: NSL-KDD and CICIoT2023. These models establish a performance baseline for NIDS tasks. Subsequently, we apply MLMs to classify network traffic, both as standalone classifiers and as feature extractors that convert the datasets into dense embeddings, which are then fed to the standard models. To further analyze the efficacy of LLMs, we conduct experiments in which Gemma models are prompted with sampled datasets of varying sizes (1,000, 5,000, and 10,000 rows). The experiments encompass different prompting strategies, including Zero-Shot, One-Shot, In-Context Learning, In-Context Learning with a Coverage-Based selection algorithm, and Chain-of-Thought reasoning. Each experiment is conducted in two variants: one using a selected subset of features and another using the entire feature set. Our analysis focuses on how model size, data representation, and prompting method affect classification performance.
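The feature-extractor variant can be pictured with the following sketch, which mean-pools frozen DistilBERT hidden states into dense embeddings and trains a standard Logistic Regression classifier on them. The file path, column names, pooling choice, and batch size are assumptions made for illustration, not the thesis configuration.

```python
# Minimal sketch of the "MLM as feature extractor" setup: serialize each record
# as text, mean-pool the frozen DistilBERT hidden states into a dense embedding,
# and train a standard classifier (here Logistic Regression) on the embeddings.
# Path, column names, pooling choice, and batch size are illustrative assumptions.
import pandas as pd
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from transformers import AutoModel, AutoTokenizer

df = pd.read_csv("nsl_kdd_sample.csv")                     # hypothetical path
texts = df.drop(columns=["label"]).astype(str).agg(" ".join, axis=1).tolist()
y = (df["label"] != "normal").astype(int).to_numpy()       # 0 = benign, 1 = attack

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased").eval()

@torch.no_grad()
def embed(batch):
    """Mean-pool the last hidden state over non-padding tokens."""
    enc = tok(batch, truncation=True, padding=True, max_length=128,
              return_tensors="pt")
    hidden = encoder(**enc).last_hidden_state              # (B, T, 768)
    mask = enc["attention_mask"].unsqueeze(-1)             # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)            # (B, 768)

X = torch.cat([embed(texts[i:i + 64]) for i in range(0, len(texts), 64)]).numpy()

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```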
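Likewise, the prompting experiments can be sketched as below: a small In-Context Learning (few-shot) prompt is assembled and sent to an instruction-tuned Gemma checkpoint through Hugging Face transformers. The prompt wording, the hard-coded demonstrations, and the google/gemma-2-2b-it checkpoint are illustrative assumptions; in the thesis, demonstrations are drawn from the sampled datasets (for example, via the Coverage-Based selection algorithm).

```python
# Minimal sketch (not the thesis prompts) of a few-shot / In-Context Learning
# query against an instruction-tuned Gemma checkpoint. The demonstrations, label
# names, prompt wording, and the "google/gemma-2-2b-it" checkpoint are
# illustrative assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="google/gemma-2-2b-it")

# Labelled records would be selected from the sampled dataset (e.g. by the
# coverage-based strategy); they are hard-coded here for brevity.
examples = [
    ("duration=0 protocol=tcp service=http src_bytes=215 dst_bytes=45076", "Benign"),
    ("duration=0 protocol=tcp service=private src_bytes=0 dst_bytes=0", "Attack"),
]
query = "duration=0 protocol=icmp service=ecr_i src_bytes=1032 dst_bytes=0"

prompt = "Classify each network flow as Benign or Attack.\n\n"
for record, label in examples:                    # in-context demonstrations
    prompt += f"Flow: {record}\nLabel: {label}\n\n"
prompt += f"Flow: {query}\nLabel:"                # the instance to classify

out = generator(prompt, max_new_tokens=5, do_sample=False)
print(out[0]["generated_text"][len(prompt):].strip())
```

Under this setup, the larger Gemma variants would simply swap in a different checkpoint name, trading memory and latency for potentially better in-context reasoning.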