MALICIOUS URL DETECTION WITH MACHINE LEARNING

Written by Jad Hamdouch & Ismael Bouarfa.

URLs allow Internet users to navigate from one website to another. Each URL is the address of content stored on a server somewhere in the world. A URL can be reached by clicking a link or an image, or simply by typing it into a browser. Here is the anatomy of a URL:
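To make the anatomy concrete, here is a minimal sketch that splits a URL into its parts with Python's standard `urllib.parse` module. The URL itself is a hypothetical example chosen for illustration, not one taken from our dataset.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical URL, used only to illustrate the components.
url = "https://www.example.com:8080/path/to/page.html?user=alice&lang=en#section2"

parts = urlparse(url)

anatomy = {
    "scheme": parts.scheme,          # protocol, e.g. https
    "hostname": parts.hostname,      # domain name, without the port
    "port": parts.port,              # integer port, or None if absent
    "path": parts.path,              # location of the resource on the server
    "query": parse_qs(parts.query),  # parameters, as a dict of lists
    "fragment": parts.fragment,      # anchor inside the page
}

for name, value in anatomy.items():
    print(f"{name:>8}: {value}")
```

Each of these components can later serve as a source of features, since malicious URLs often differ from legitimate ones in their hostnames, paths, or query strings.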

We decided to give it a try: we followed some tutorials, read some blogs, and tried a few machine learning algorithms to predict whether URLs were malicious or legitimate. The outcome was very interesting and the results were pretty good. Here is how we did it:

Our objective is to classify input URLs as dangerous or harmless. We use “good” as the label for legitimate URLs and “bad” for malicious ones. We'll train our model on a dataset of many already-labeled URLs (as text), stored in a CSV file.
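As a rough sketch of this loading step, the snippet below reads labeled URLs with Python's standard `csv` module. The column names `url` and `label`, the `load_urls` helper, and the three sample rows are assumptions made for illustration; the real file may be laid out differently.

```python
import csv
import io

# Hypothetical stand-in for the labeled CSV file. The column names
# `url` and `label` are assumptions, as are the rows themselves.
SAMPLE_CSV = """url,label
example.com/index.html,good
docs.example.org/guide/install,good
login-update.example.net/verify.php?id=123,bad
"""

def load_urls(handle):
    """Return (urls, labels) read from a CSV with `url` and `label` columns."""
    urls, labels = [], []
    for row in csv.DictReader(handle):
        urls.append(row["url"].strip())
        labels.append(row["label"].strip().lower())
    return urls, labels

urls, labels = load_urls(io.StringIO(SAMPLE_CSV))
print(len(urls), "URLs loaded; labels:", labels)
```

With a real file, the same function could be called on `open("your_file.csv", newline="", encoding="utf-8")` instead of the in-memory sample.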

As we can see, the dataset is composed of good and bad URLs. Now, how can we check whether good URLs outnumber bad ones? This code creates a simple bar chart of the class proportions.
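The original plotting code is not reproduced here. As a minimal stand-in, this sketch computes the class proportions and prints them as a text bar chart using only the standard library. The toy labels and the helper names `class_proportions` and `text_bar_chart` are assumptions for illustration, not our actual data.

```python
from collections import Counter

# Toy labels for illustration only; not the article's dataset.
labels = ["good"] * 7 + ["bad"] * 3

def class_proportions(labels):
    """Map each label to its share of the dataset (shares sum to 1)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def text_bar_chart(proportions, width=40):
    """Render one bar per class, largest class first, scaled to `width` characters."""
    lines = []
    for label, share in sorted(proportions.items(), key=lambda kv: -kv[1]):
        bar = "#" * round(share * width)
        lines.append(f"{label:<5} {bar} {share:.0%}")
    return "\n".join(lines)

proportions = class_proportions(labels)
print(text_bar_chart(proportions))
```

A plotting library would give a nicer figure, but the proportions themselves are all we need to see which class dominates.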

We have more good URLs, as we expected. With more legitimate examples, the model can learn more about what good URLs look like, which helps it avoid over-generalizing from words that also appear in bad URLs and so reduces false positives.

Now that we have our dataset we’ll choose an approach:

With Bag of Words, a good classifier detects patterns in the distribution of words: which words occur, and how many times, in each kind of text. However, raw word counts are not always the best idea. A URL can be very long, which inflates its counts and can bias our predictions.
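To illustrate the problem, here is a minimal Bag of Words sketch. The tokenization rule (splitting on anything that is not a letter or digit) and both URLs are assumptions chosen for illustration.

```python
import re
from collections import Counter

def bag_of_words(url):
    """Count how many times each alphanumeric token occurs in a URL."""
    tokens = [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]
    return Counter(tokens)

# Hypothetical URLs: same site, but the second is much longer.
short_url = "example.com/login"
long_url = "example.com/login/login/redirect/login?next=login"

print(bag_of_words(short_url))
print(bag_of_words(long_url))
```

The token "login" is counted once in the short URL and four times in the long one, so the longer URL weighs more heavily in a count-based model even though both point to the same kind of page.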

The following code shows the number of URLs for each URL length.
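The original code is not included here, so this is a minimal sketch of the idea: count how many URLs share each character length. The sample URLs and the `length_distribution` helper are hypothetical.

```python
from collections import Counter

# Hypothetical URLs for illustration only.
urls = [
    "example.com",
    "example.org",
    "example.com/login",
    "example.net/a/very/long/path/with/many/segments?session=abcdef0123456789",
]

def length_distribution(urls):
    """Count how many URLs share each character length."""
    return Counter(len(url) for url in urls)

dist = length_distribution(urls)
for length in sorted(dist):
    print(f"{length:>4} chars: {dist[length]} URL(s)")
```

On a real dataset, plotting these counts as a histogram makes it easy to spot the long tail of very long URLs.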

We chose to avoid Bag of Words and use another technique. We can improve our model by using Term Frequency (TF) and Inverse Document Frequency (IDF):

This technique downscales the influence of words that recur across many URLs:
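As a sketch of how this weighting behaves, the snippet below implements the textbook definition of TF-IDF in plain Python: the term frequency is a token's count divided by the number of tokens in the URL, and the inverse document frequency is log(N / df), where N is the number of URLs and df the number of URLs containing the token. Library implementations usually apply smoothing and normalization, so their numbers will differ. The tokenizer and the sample URLs are assumptions for illustration.

```python
import math
import re
from collections import Counter

def tokenize(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

def tf_idf(documents):
    """Return one {token: weight} dict per document.

    tf(t, d) = count of t in d / number of tokens in d
    idf(t)   = log(N / number of documents containing t)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(tok for tokens in tokenized for tok in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            tok: (count / len(tokens)) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        })
    return weights

# Hypothetical URLs for illustration only.
urls = [
    "example.com/index.html",
    "example.com/login.php",
    "shop.example.com/cart",
]

weights = tf_idf(urls)
for url, w in zip(urls, weights):
    print(url, {tok: round(val, 3) for tok, val in w.items()})
```

Tokens that appear in every URL, such as "example" and "com" here, receive a weight of zero, while tokens specific to one URL, such as "login" or "cart", keep a positive weight.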

As shown, 1652 bad URLs have been correctly predicted as malicious. However, 261 legitimate URLs have been identified as malicious. These “false alarms” are known as Type I errors. Errors of this kind are one reason why monitoring and supervision are necessary when applying machine learning to cybersecurity.
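For readers who want to compute these quantities themselves, here is a minimal sketch that counts true positives, false positives (Type I errors), false negatives, and true negatives, treating “bad” as the positive class. The `confusion_counts` helper and the six toy labels are illustrative assumptions and are not the predictions behind the figures above.

```python
def confusion_counts(y_true, y_pred, positive="bad"):
    """Count the four confusion-matrix cells, with `positive` as the alarm class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # malicious URL correctly flagged
            else:
                fp += 1  # legitimate URL flagged: a false alarm (Type I error)
        else:
            if truth == positive:
                fn += 1  # malicious URL missed (Type II error)
            else:
                tn += 1  # legitimate URL correctly passed
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}

# Toy labels for illustration only.
y_true = ["bad", "bad", "good", "good", "good", "bad"]
y_pred = ["bad", "good", "bad", "good", "good", "bad"]

cells = confusion_counts(y_true, y_pred)
false_positive_rate = cells["fp"] / (cells["fp"] + cells["tn"])
print(cells, f"false positive rate = {false_positive_rate:.2f}")
```

The false positive rate is the share of legitimate URLs that trigger an alarm, which is the number a security team feels most directly in its daily workload.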

And then we made our predictions:

As we can see in our tutorial, we still have 261 false alarms that should be checked by security professionals. The idea is not to replace security professionals with AI but to improve performance in many respects:

And to make all of that work in harmony, it is essential to have a good monitoring approach.

AI and machine learning techniques are now used in finance, psychology, and economics, and will soon be present in many infosec processes. Cybersecurity and the various subfields of AI and computer science can work together to create intelligent and effective solutions to the new threats and issues that hinder innovation. Automated detection of bad URLs, based on machine learning rather than human instructions, can be one small piece of this puzzle. However, machine learning is not a magic solution and is not without its own threats.
