Summary:
Developed a hybrid intelligent log-classification pipeline that combines regex-based pattern matching, transformer embeddings (MiniLM/Sentence-BERT), traditional ML classification, and LLM-based fallback routing to automate categorization of heterogeneous system logs. The system uses clustering-assisted pattern discovery for regex generation, semantic embedding models for unstructured logs, and source-aware routing for legacy-system-specific classification.
Detailed:
I started with a log dataset in Dataset/synthetic_logs.csv and loaded it into a notebook using pandas. The dataset contained 2410 log entries with fields including timestamp, source, log_message, target_label, complexity, clusters, and regex_label. I analyzed the dataset distribution across 6 log sources (ModernCRM, AnalyticsEngine, ModernHR, BillingSystem, ThirdPartyAPI, LegacyCRM) and 9 target classes (HTTP Status, Critical Error, Security Alert, Error, System Notification, Resource Usage, User Action, Workflow Error, Deprecation Warning).
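A minimal sketch of this loading and exploration step. The inline rows below are illustrative stand-ins for Dataset/synthetic_logs.csv, not actual dataset entries:

```python
import io
import pandas as pd

# Illustrative sample standing in for Dataset/synthetic_logs.csv
csv_data = io.StringIO(
    "timestamp,source,log_message,target_label\n"
    "2025-01-01 10:00:00,ModernCRM,User User123 logged in.,User Action\n"
    "2025-01-01 10:05:00,LegacyCRM,Lead conversion failed for prospect ID 7842.,Workflow Error\n"
    "2025-01-01 10:10:00,AnalyticsEngine,File data_6957.csv uploaded successfully.,System Notification\n"
)

df = pd.read_csv(csv_data)

# Inspect the distribution across sources and target classes
print(df["source"].value_counts())
print(df["target_label"].value_counts())
```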
To identify recurring structural patterns in the logs, I performed exploratory clustering analysis and grouped semantically similar messages into clusters. This allowed me to discover repetitive and highly structured log formats—such as login/logout events and email-related failures—that were suitable for deterministic regex-based classification.
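The clustering step can be sketched with TF-IDF vectors and DBSCAN as a lightweight stand-in for the embedding-based grouping; the messages and parameters below are illustrative, not the actual analysis:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

messages = [
    "User User123 logged in.",
    "User User456 logged in.",
    "User User789 logged out.",
    "Email service experiencing issues with sending.",
    "Critical system failure detected in module.",
]

# Vectorize messages and group near-duplicates; a tight eps keeps clusters
# limited to highly structured, repetitive formats suited to regex rules
vecs = TfidfVectorizer().fit_transform(messages)
labels = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit_predict(vecs)

for msg, lab in zip(messages, labels):
    print(lab, msg)  # label -1 marks one-off messages left for semantic classification
```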
Using these cluster insights, I manually engineered regex rules for structured and repetitive log patterns. I labeled logs matching these deterministic patterns via a regex_label field and separated the remaining unmatched logs into a non-regex subset for semantic classification.
I implemented the rule-based classification module in processor_regex.py, where structured log patterns such as user login/logout events, backup notifications, reboot messages, upload confirmations, and account creation logs are directly mapped to their respective categories. This serves as the system’s fast-path deterministic classifier.
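A condensed sketch of this fast path; the specific patterns below are hypothetical examples, not the actual rule set in processor_regex.py:

```python
import re

# Hypothetical subset of the rules; the real module covers the full set of
# patterns discovered during clustering analysis
REGEX_RULES = {
    r"User User\d+ logged (in|out)\.": "User Action",
    r"Backup (started|completed) at .*": "System Notification",
    r"System reboot initiated by user .*": "System Notification",
    r"File .* uploaded successfully": "System Notification",
    r"Account with ID .* created by .*": "User Action",
}

def classify_with_regex(log_message):
    """Return the category of the first matching rule, else None."""
    for pattern, label in REGEX_RULES.items():
        if re.search(pattern, log_message, re.IGNORECASE):
            return label
    return None
```

Returning None for unmatched logs lets the caller fall through to the semantic classifier.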
For logs not handled by regex, I built a semantic classification pipeline in processor_bert.py using Sentence Transformers (all-MiniLM-L6-v2) to generate dense semantic embeddings. These embeddings are passed into a trained supervised classifier stored as log_classifier.joblib to predict the final label. I also implemented confidence thresholding, returning Unclassified when prediction confidence falls below 0.5.
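The flow of processor_bert.py can be sketched as follows. The model and file names come from the write-up; the helper function names are assumptions, and the encoder/classifier loading is shown commented out since it needs the trained artifacts:

```python
# from sentence_transformers import SentenceTransformer
# import joblib
# encoder = SentenceTransformer("all-MiniLM-L6-v2")
# clf = joblib.load("log_classifier.joblib")

CONF_THRESHOLD = 0.5

def pick_label(probabilities, classes, threshold=CONF_THRESHOLD):
    """Return the top class, or 'Unclassified' below the confidence threshold."""
    best = max(range(len(classes)), key=lambda i: probabilities[i])
    return classes[best] if probabilities[best] >= threshold else "Unclassified"

def classify_with_bert(log_message, encoder, clf):
    # Dense semantic embedding -> supervised classifier -> thresholded label
    emb = encoder.encode([log_message])
    probs = clf.predict_proba(emb)[0]
    return pick_label(probs, clf.classes_)
```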
To handle legacy-system logs originating from LegacyCRM, I designed a dedicated LLM-based classifier in processor_llm.py. Using the Groq API with Llama-3.3-70B-Versatile, I prompt the model to classify legacy logs into Workflow Error, Deprecation Warning, or Unclassified, addressing cases where traditional regex/ML approaches underperform.
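A sketch of this LLM fallback, assuming the Groq Python client. The prompt wording and helper names are illustrative, and the client construction is commented out since it requires an API key:

```python
# from groq import Groq
# client = Groq()  # reads GROQ_API_KEY from the environment

LEGACY_LABELS = ("Workflow Error", "Deprecation Warning")

def build_prompt(log_message):
    # Constrain the model to the two legacy categories plus Unclassified
    return (
        "Classify the log message into one of these categories: "
        f"{', '.join(LEGACY_LABELS)}. "
        "If it fits neither, answer Unclassified. "
        "Answer with only the category name.\n"
        f"Log message: {log_message}"
    )

def classify_with_llm(log_message, client, model="llama-3.3-70b-versatile"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(log_message)}],
    )
    answer = resp.choices[0].message.content.strip()
    # Guard against free-form replies from the model
    return answer if answer in LEGACY_LABELS else "Unclassified"
```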
Finally, I integrated all components into a unified routing pipeline in app.py. The routing logic first checks the log source:
- If the source is LegacyCRM, the log is routed to the LLM classifier.
- Otherwise, regex-based classification is attempted first.
- If no regex rule matches, the semantic embedding + ML classifier is used as a fallback.
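This routing logic can be sketched end to end; the stub classifiers below stand in for the real processor_regex/processor_bert/processor_llm modules so the example runs on its own:

```python
import re

# Stand-ins for the real processor modules, kept trivial for illustration
def classify_with_regex(msg):
    return "User Action" if re.search(r"logged (in|out)", msg) else None

def classify_with_bert(msg):
    return "Unclassified"  # placeholder for the embedding + ML classifier

def classify_with_llm(msg):
    return "Workflow Error"  # placeholder for the Groq-backed classifier

def classify_log(source, log_message):
    """Source-aware routing: LLM for LegacyCRM, regex fast path otherwise,
    semantic classifier as fallback when no regex rule matches."""
    if source == "LegacyCRM":
        return classify_with_llm(log_message)
    label = classify_with_regex(log_message)
    return label if label is not None else classify_with_bert(log_message)

print(classify_log("ModernCRM", "User User123 logged in."))  # regex fast path
print(classify_log("LegacyCRM", "Lead conversion failed."))  # LLM route
```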
I also implemented batch CSV processing support through classify_csv, which processes an input CSV of logs, classifies each entry, and outputs predictions to output.csv.
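A minimal sketch of such a batch step using the standard csv module; the real classify_csv in app.py may differ in signature and I/O handling:

```python
import csv
import io

def classify_csv(infile, outfile, classify):
    """Read logs, append a predicted label per row, and write the results."""
    reader = csv.DictReader(infile)
    fieldnames = reader.fieldnames + ["predicted_label"]
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        row["predicted_label"] = classify(row["source"], row["log_message"])
        writer.writerow(row)

# Demo on an in-memory CSV with a trivial classifier
src = io.StringIO("source,log_message\nModernCRM,User User123 logged in.\n")
out = io.StringIO()
classify_csv(src, out, lambda s, m: "User Action" if "logged" in m else "Unclassified")
print(out.getvalue())
```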
The final system is a hybrid multi-stage intelligent log classification architecture that combines:
- Clustering-assisted regex rule discovery
- Deterministic rule-based classification
- Transformer embedding-based semantic classification
- LLM-based source-aware fallback routing
This architecture enables scalable, robust classification of heterogeneous structured and unstructured logs across modern and legacy enterprise systems.