Phishing and pharming are very common types of online fraud that deceive people through malicious URLs and links. In the current Covid-19 context, IT organizations are also struggling to secure corporate networks against all sorts of malware, such as ransomware, viruses, and worms. Accordingly, enterprises see that AI/ML-based solutions have the potential to address phishing-related threats much more efficiently. Machine learning and deep learning solutions need large, labeled datasets to flag suspicious URLs effectively. Although advanced deep learning solutions are now used more often than traditional rule-based or machine learning approaches, we start with a machine learning approach to flag malicious URL samples. We’ll try the same problem again with deep learning later.
The goal is to predict malicious URLs from a dataset containing legitimate and malicious samples.
The dataset contains both good and bad URLs.
Once loaded, the dataset comprises 420K rows and 2 columns (URL and label).
The target variable is “label”; let’s look at its distribution.
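As a rough sketch of this step, the snippet below loads a tiny in-memory CSV and prints its shape and label distribution. The column names (url, label), the label values (good, bad) and the sample rows are assumptions made for illustration; the real file has about 420K rows and may be laid out differently.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: four made-up rows with the
# same two columns described above (assumed names: url, label).
sample_csv = io.StringIO(
    "url,label\n"
    "google.com/search,good\n"
    "en.wikipedia.org/wiki/Phishing,good\n"
    "paypa1-login.example.ru/verify,bad\n"
    "github.com/python,good\n"
)

df = pd.read_csv(sample_csv)
print(df.shape)  # (rows, columns)

# Absolute counts and proportions per class of the target variable.
label_counts = df["label"].value_counts()
print(label_counts)
print(df["label"].value_counts(normalize=True))
```

With the real dataset, you would pass the file path to pd.read_csv instead of the in-memory buffer. Checking the proportions early is useful because a heavily imbalanced target affects how accuracy should be read later.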
Feature extraction: this step extracts the domain information from each URL.
The Python tldextract package is used to fetch the domain, subdomain, and TLD information:
Data preparation is performed using scikit-learn: the categorical values are label encoded.
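Label encoding could be sketched roughly as follows. The DataFrame contents and column names are made up for illustration and are not taken from the actual dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame: column names and values are illustrative only.
df = pd.DataFrame({
    "domain": ["google", "wikipedia", "example", "google"],
    "suffix": ["com", "org", "ru", "com"],
    "label": ["good", "good", "bad", "good"],
})

encoders = {}
for column in ["domain", "suffix", "label"]:
    encoder = LabelEncoder()
    # Each distinct string is replaced by an integer code (sorted order).
    df[column] = encoder.fit_transform(df[column])
    encoders[column] = encoder  # kept so codes can be mapped back later

print(df)
print(list(encoders["label"].classes_))
```

Note that scikit-learn documents LabelEncoder as a tool for target labels; OrdinalEncoder is its counterpart for feature columns and would be a reasonable alternative for the domain-related features.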
No feature selection is done here because we use all the features, but SelectKBest could be used for this.
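For reference, here is a hedged sketch of what SelectKBest might look like on label-encoded features. The toy matrix, the choice of k=2 and the use of mutual information as the scoring function are all assumptions; the walkthrough itself skips this step.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Hypothetical label-encoded features (columns: subdomain, domain, suffix).
X = np.array([
    [0, 1, 0],
    [1, 2, 0],
    [0, 0, 1],
    [2, 3, 0],
    [0, 4, 1],
    [1, 5, 0],
])
y = np.array([1, 1, 0, 1, 0, 1])  # assumed coding: 1 = good, 0 = bad

# Treat the integer codes as discrete categories when scoring.
score_func = partial(mutual_info_classif, discrete_features=True)

selector = SelectKBest(score_func=score_func, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)        # one score per input column
print(selector.get_support())  # boolean mask of the columns that were kept
print(X_selected.shape)
```

Mutual information is used here rather than chi2 because the integer codes are arbitrary category IDs, not counts.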
Classifiers used: Decision Tree and Random Forest.
Random Forest Classification:
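A minimal sketch of training a Random Forest with scikit-learn. The data is synthetic (generated with make_classification) and only stands in for the encoded URL features; the split ratio and hyperparameters are assumptions, and the printed accuracy says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded URL features; not the article's data.
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)

# Hold out 20% of the rows for evaluation, preserving the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

rf_predictions = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_predictions)
print("Accuracy on synthetic data:", rf_accuracy)
```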
Decision Tree Classification:
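A comparable sketch for a single Decision Tree, again on synthetic data rather than the URL dataset. The split ratio and the default, unpruned tree settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded URL features; not the article's data.
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)

# Hold out 20% of the rows for evaluation, preserving the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

dt_predictions = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_predictions)
print("Accuracy on synthetic data:", dt_accuracy)
print("Tree depth:", dt_model.get_depth())
```

An unconstrained tree tends to grow deep and overfit, so parameters such as max_depth or min_samples_leaf are natural candidates when tuning.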
Both Random Forest and Decision Tree work fine here, although we can tune these models further. We will see how deep learning performs in our next assignment.
For the detailed code, visit: