Phishing Classification using Ensemble model
Last Updated :
11 Mar, 2024
With the rise of digital technology usage, it is becoming easier for attackers to steal personal information from users by committing phishing, one of the most common and dangerous cybercrimes. In this context, our exploration is related to phishing classification using an ensemble model. In this article, by leveraging a curated dataset, we will train and evaluate a robust model capable of distinguishing between legitimate and phishing URLs.
What is Phishing?
Phishing is a type of cyberattack that tricks people into revealing sensitive information, such as passwords or financial details. Attackers often use emails, messages, or websites that look legitimate to deceive victims. Phishing campaigns aim to create a false sense of security and can involve techniques like using fake links, infected attachments, or fake login pages to trick people into giving away their information.
How to be safe from Phishing?
Some of the countermeasures of Phishing are discussed below:
- Security training: Stay aware of the dangers of phishing through regular training. Learn to spot suspicious emails or messages so you can avoid falling prey to phishing. Awareness allows you to make smart decisions online.
- Web filtering and URL analysis: Apply filters that scan websites in real time to block access to known phishing sites. It’s like having a security alarm that warns you before you access a potentially dangerous Internet site.
- Advanced Threat Protection (ATP): Use advanced security tools that learn from patterns and behaviors to detect phishing attempts. These tools can catch suspicious emails or files before they reach you, acting like a vigilant watchdog against cyber threats.
- Endpoint security solutions: Equip your devices with security tools that specifically protect against phishing. It’s like your personal computer security guard, ready to block any phishing attempts that target your device.
Implementation: Phishing Classification using Ensemble Model
Importing required modules
First, we import the required Python modules: Pandas, scikit-learn, XGBoost, Matplotlib and Seaborn.
Python3
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, roc_auc_score
import xgboost as xgb
import matplotlib.pyplot as plt
import seaborn as sns
Dataset loading and Preview
Now we will load the phishing dataset and map its target column to numeric labels (1 for phishing, 0 for legitimate). After that, we will preview the first few rows of the raw dataset for better understanding.
Python3
# Load the dataset (file name as used in this article)
df = pd.read_csv('phishing_data')
# Encode the target column: 1 = phishing, 0 = legitimate
df['status'] = df['status'].map({'phishing': 1, 'legitimate': 0})
print("Dataset Preview:")
df.head()
Output:
Note: The screenshot is for demonstration purposes only, as all 89 columns cannot be visualized at once.
Exploratory Data Analysis
- In exploratory data analysis (EDA), we will generate a pie chart to depict the distribution of classes in the ‘status’ column of the dataset. The ‘status’ column represents whether a URL is classified as ‘Legitimate’ or ‘Phishing’.
- The pie chart is created using the counts of each class, showing the proportion of ‘Legitimate’ and ‘Phishing’ instances. The chart is styled with custom colors and labels, making it visually informative.
- This visualization helps us in quickly understanding the balance or imbalance between the two classes in the dataset.
Python3
plt.figure(figsize=(6, 6))
# sort_index() orders the counts as 0, 1 so the labels line up with the right class
df['status'].value_counts().sort_index().plot(kind='pie', autopct='%1.1f%%', colors=['lightcoral', 'lightblue'], labels=['Legitimate', 'Phishing'])
plt.title('Distribution of Target Classes')
plt.show()
Output:
Distribution of the target classes
The output above shows that the dataset is balanced, so no extra resampling method is required.
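The balance can also be checked numerically rather than read off the chart. Below is a minimal sketch, assuming a small hand-made `status` series that stands in for the real column (the values are invented for illustration, not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for df['status'] after mapping: 1 = phishing, 0 = legitimate
status = pd.Series([0, 1, 1, 0, 1, 0, 0, 1, 1, 0], name="status")

# Absolute counts and relative share of each class, ordered by label
counts = status.value_counts().sort_index()
shares = status.value_counts(normalize=True).sort_index()

# Ratio of the smaller class to the larger one; 1.0 means perfectly balanced
balance_ratio = counts.min() / counts.max()

print(counts.to_dict())
print(shares.to_dict())
print(f"Minority/majority ratio: {balance_ratio:.2f}")
```

A ratio close to 1 suggests resampling is unnecessary. How low the ratio can fall before resampling becomes worthwhile depends on the problem.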
Next, we will explore the data with a pair plot. We select a few relevant columns (‘length_url’, ‘nb_dots’, ‘nb_hyphens’, ‘nb_subdomains’, ‘web_traffic’) together with the target column ‘status’ and plot them in pairwise comparison. This helps us understand how the selected features relate to each other and to the target. Including more features is recommended for a fuller picture.
Python3
selected_features = ['length_url', 'nb_dots', 'nb_hyphens', 'nb_subdomains', 'web_traffic', 'status']
sns.pairplot(df[selected_features], hue='status', palette='husl')
plt.tight_layout()
plt.show()
Output:
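Features such as ‘length_url’, ‘nb_dots’ and ‘nb_hyphens’ are simple lexical properties of the URL string. The sketch below shows how comparable values could be derived from a raw URL. The function name `lexical_features` and the `approx_subdomains` rule are assumptions made for illustration, and the dataset may compute its columns differently (‘web_traffic’ in particular needs external data and is not covered here).

```python
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Compute a few simple lexical features from a raw URL string."""
    hostname = urlparse(url).hostname or ""
    return {
        "length_url": len(url),          # total number of characters
        "nb_dots": url.count("."),       # dots anywhere in the URL
        "nb_hyphens": url.count("-"),    # hyphens anywhere in the URL
        # Assumption: host labels beyond a two-label domain such as example.com
        "approx_subdomains": max(hostname.count(".") - 1, 0),
    }


example = lexical_features("http://secure-login.example.com/account")
print(example)
```

Long URLs with many dots, hyphens or nested subdomains are commonly treated as suspicious, which is why features of this kind are useful inputs to the classifier.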
Data preprocessing and splitting
- Now, the dataset is prepared for training a phishing classification model.
- The features and the target variable are separated into X (features) and y (target). Next, the features are categorized into numerical and categorical types.
- One-hot encoding is applied to the categorical features using the pd.get_dummies function. The numerical features and the one-hot encoded categorical features are then concatenated into a processed feature set, denoted as X_processed.
- To ensure compatibility with the model, the characters ‘[’, ‘]’ and ‘<’ are removed from feature names, since XGBoost does not accept them. If a particular column name still causes problems, that column can be dropped before training.
- Finally, the dataset is split into training and testing sets using the train_test_split function, with 80% for training and 20% for testing, and a random seed for reproducibility.
Python3
# Separate the features from the target
X = df.drop('status', axis=1)
y = df['status']
numerical_features = X.select_dtypes(include=['float64', 'int64']).columns
categorical_features = X.select_dtypes(include=['object']).columns
# One-hot encode the categorical columns and rejoin them with the numerical ones
X_categorical = pd.get_dummies(X[categorical_features])
X_numerical = X[numerical_features]
X_processed = pd.concat([X_numerical, X_categorical], axis=1)
# Remove characters that XGBoost does not allow in feature names
cleaned_columns = [col.replace('[', '').replace(']', '').replace('<', '') for col in X_processed.columns]
X_processed.columns = cleaned_columns
# 80/20 train/test split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=42)
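To see what the encoding and name-cleaning steps do, here is a minimal sketch on a toy frame. The column names (`length_url`, `tld[raw]`) and values are hypothetical and chosen only so that one name contains the characters being stripped.

```python
import pandas as pd

# Hypothetical toy frame; names and values are invented for illustration
toy = pd.DataFrame({
    "length_url": [23, 87, 41],
    "tld[raw]": ["com", "xyz", "com"],
})

numeric = toy.select_dtypes(include=["number"])
categorical = toy.select_dtypes(exclude=["number"])

# One-hot encode: each distinct category becomes its own indicator column
encoded = pd.get_dummies(categorical)
processed = pd.concat([numeric, encoded], axis=1)

# Strip the characters that XGBoost rejects in feature names
processed.columns = [
    col.replace("[", "").replace("]", "").replace("<", "")
    for col in processed.columns
]

print(list(processed.columns))
```

Note that one-hot encoding a column with many distinct values (for example, a raw URL string) creates one column per value, so such columns are usually better dropped or turned into summary features first.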
Defining an Ensemble model with instance of Hyperparameter tuning
- In this stage, we set up XGBoost, a popular gradient-boosted tree ensemble for classification tasks. However, you can use any other ensemble model with the same data.
- We tune hyperparameters with RandomizedSearchCV, searching over how many trees to include (`n_estimators`), how quickly the model should learn from the data (`learning_rate`), and the maximum depth of each tree (`max_depth`). With `n_iter = 10`, only 10 of the 27 possible combinations are evaluated.
- Additionally, we set a `random_state` to ensure that the process is reproducible.
- The model is then trained with the best hyperparameter set found, where it learns to make predictions by analyzing patterns in the features (X_train) and corresponding target labels (y_train). The goal is to minimize errors in predicting whether a URL is legitimate or phishing. The search selects the combination with the best cross-validated F1-score.
Python3
# Candidate values for each hyperparameter
param_dist = {
    'n_estimators': [100, 150, 200],
    'learning_rate': [0.05, 0.1, 0.2],
    'max_depth': [3, 4, 5]
}
xgb_model = xgb.XGBClassifier(random_state=42)
# Try 10 random combinations with 3-fold cross-validation, scored by F1
random_search = RandomizedSearchCV(estimator=xgb_model, param_distributions=param_dist, n_iter=10, cv=3, scoring='f1', random_state=42, verbose=1, n_jobs=-1)
random_search.fit(X_train, y_train)
best_params = random_search.best_params_
# Train a fresh model on the full training set with the best parameters
xgb_model_tuned = xgb.XGBClassifier(**best_params, random_state=42)
xgb_model_tuned.fit(X_train, y_train)
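To make the difference between a full grid and a randomized search concrete, the sketch below uses scikit-learn's `ParameterGrid` and `ParameterSampler` on the same search space. It only enumerates candidate settings and trains no model. The assumption is that it mirrors the candidates drawn by the search above, since it uses the same `n_iter` and `random_state`.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_dist = {
    "n_estimators": [100, 150, 200],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [3, 4, 5],
}

# Every possible combination of the listed values (3 * 3 * 3)
total_combinations = len(ParameterGrid(param_dist))

# Draw 10 candidate settings, as a randomized search with n_iter=10 would
sampled = list(ParameterSampler(param_dist, n_iter=10, random_state=42))

print(f"{len(sampled)} of {total_combinations} combinations would be tried")
for params in sampled[:3]:
    print(params)
```

Because every parameter is given as a list, the candidates are drawn without replacement, so no combination is evaluated twice. Raising `n_iter` trades extra training time for a more thorough search.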
Model evaluation
Now, we will evaluate our model using several performance metrics: accuracy, precision, recall, F1-score, the classification report and the AUC-ROC score.
Python3
# Hard class predictions for the threshold-based metrics
y_pred_xgb_tuned = xgb_model_tuned.predict(X_test)
accuracy_xgb_tuned = accuracy_score(y_test, y_pred_xgb_tuned)
precision_tuned = precision_score(y_test, y_pred_xgb_tuned)
recall_tuned = recall_score(y_test, y_pred_xgb_tuned)
f1_tuned = f1_score(y_test, y_pred_xgb_tuned)
# AUC-ROC uses the predicted probability of the positive (phishing) class
auc_roc_value_tuned = roc_auc_score(y_test, xgb_model_tuned.predict_proba(X_test)[:, 1])
print("Classification Report:")
print(classification_report(y_test, y_pred_xgb_tuned))
print(f"AUC-ROC Value: {auc_roc_value_tuned}")
print(f"Tuned XGBoost Accuracy: {accuracy_xgb_tuned}")
print(f"Tuned XGBoost Precision: {precision_tuned}")
print(f"Tuned XGBoost Recall: {recall_tuned}")
print(f"Tuned XGBoost F1-score: {f1_tuned}")
Output:
Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.94      0.93       101
           1       0.94      0.92      0.93        99

    accuracy                           0.93       200
   macro avg       0.93      0.93      0.93       200
weighted avg       0.93      0.93      0.93       200

AUC-ROC Value: 0.9820982098209821
Tuned XGBoost Accuracy: 0.93
Tuned XGBoost Precision: 0.9381443298969072
Tuned XGBoost Recall: 0.9191919191919192
Tuned XGBoost F1-score: 0.9285714285714285
So, our model performs well on the test set, with an AUC-ROC of about 98.2% and accuracy, precision, recall and F1-score all in the range of roughly 92 to 94%.
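The printed metrics can be reproduced by hand from confusion-matrix counts. The counts below are not program output: they are inferred from the supports (101 and 99) and the reported precision and recall, so treat them as an assumption made for this worked example.

```python
# Counts inferred from the printed report (an assumption, not program output)
tp, fn = 91, 8   # phishing URLs: correctly flagged vs. missed (support 99)
tn, fp = 95, 6   # legitimate URLs: correctly passed vs. wrongly flagged (support 101)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of all URLs flagged as phishing, how many were phishing
recall = tp / (tp + fn)      # of all phishing URLs, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-score:  {f1:.4f}")
```

Under these counts, 8 phishing URLs slip through and 6 legitimate URLs are wrongly flagged. In a phishing filter, missed attacks are often the costlier error, so recall deserves particular attention.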