🎣 Using deep learning (word level and character level embeddings combined with GRU layers) to detect Phishing using the URL

MJafarMashhadi/Haplophysh

Haplophysh

What is it?

Experimenting with some convolutional and recurrent neural net architectures, using word and character embeddings, to detect phishing URLs. This is the project I did for a Network Security course in winter 2020. You can check out a one-page extended abstract that summarises the key points of this project here.

What did you learn?

The first thing I learned is that it works. Phishing pages DO give themselves away in their URLs alone.

I also learned how important it is for the model size (the number of trainable parameters, i.e. its learning capacity) and the data set size to be proportionate. One of the models, which combines character-level and word-level embeddings, has almost the same architecture that URLNet proposed, but it is way smaller because my data volume was orders of magnitude smaller. It still manages to perform pretty well, sometimes surpassing URLNet.
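To get a feel for what "model size" means here, trainable parameter counts can be estimated by hand. The sketch below is not this repository's model: the vocabulary sizes and layer widths are made-up values, and the GRU formula assumes the TensorFlow 2 Keras default (`reset_after=True`), which keeps two bias vectors per gate.

```python
def embedding_params(vocab_size, dim):
    # One trainable vector of length `dim` per vocabulary entry.
    return vocab_size * dim


def gru_params(input_dim, units):
    # Three gates, each with an input kernel, a recurrent kernel and
    # two bias vectors (assumes Keras' reset_after=True variant).
    return 3 * (input_dim * units + units * units + 2 * units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units


# Hypothetical sizes, chosen only for illustration.
char_branch = embedding_params(100, 16) + gru_params(16, 32)
word_branch = embedding_params(5000, 16) + gru_params(16, 32)
head = dense_params(32 + 32, 1)  # the two branch outputs concatenated

total = char_branch + word_branch + head
print(char_branch, word_branch, head, total)
```

With these made-up sizes, the word embedding table accounts for most of the parameters, so shrinking the word vocabulary is one obvious way to scale a model down for a smaller data set.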

These two were the highlights for me, but there are many other things to learn from doing such a project, listed below. I had already learned many of them while doing my Master's thesis, yet they are still worth mentioning as takeaways of this project. If anyone wants to see some examples of how to do these things in Keras and TensorFlow 2, they can take a look at this repository!

  • How embeddings work
  • How to deal with 1-D convolutions for sequence processing
  • How to use LSTM/GRU recurrent layers for sequence processing
  • How to tune hyperparameters such as the layer sizes or batch size
  • How to get the most out of the GPU in training
  • How to use dataset generators to deal with larger datasets more easily
  • How to deal with model architectures with multiple inputs
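As a rough illustration of the last two points, here is a minimal, dependency-free sketch of how URLs could be turned into the two integer sequences (characters and words) that a two-input model consumes, and fed in batches from a generator. It is not the code from this repository: the example URLs, labels, sequence lengths and all function names are hypothetical.

```python
import re

PAD, UNK = 0, 1  # reserved ids: padding and out-of-vocabulary tokens


def split_words(url):
    # Word-level tokens: lowercase runs of letters and digits.
    return [w for w in re.split(r"[^A-Za-z0-9]+", url.lower()) if w]


def build_vocab(token_lists):
    # Assign ids in order of first appearance, starting after PAD and UNK.
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 2
    return vocab


def encode(tokens, vocab, max_len):
    # Map tokens to ids, truncate to max_len, then right-pad with PAD.
    ids = [vocab.get(t, UNK) for t in tokens[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def batches(urls, labels, char_vocab, word_vocab,
            batch_size=2, max_chars=40, max_words=8):
    # Yield ((char_ids, word_ids), labels) one batch at a time, so the
    # whole encoded data set never has to sit in memory at once.
    for start in range(0, len(urls), batch_size):
        chunk = urls[start:start + batch_size]
        chars = [encode(list(u), char_vocab, max_chars) for u in chunk]
        words = [encode(split_words(u), word_vocab, max_words) for u in chunk]
        yield (chars, words), labels[start:start + batch_size]


# Made-up data for illustration only (1 = phishing, 0 = benign).
URLS = [
    "http://example.com/login",
    "http://paypa1-secure.example.net/verify",
    "https://example.org/",
]
LABELS = [0, 1, 0]

char_vocab = build_vocab(list(u) for u in URLS)
word_vocab = build_vocab(split_words(u) for u in URLS)

for (chars, words), y in batches(URLS, LABELS, char_vocab, word_vocab):
    print(len(chars), len(chars[0]), len(words[0]), y)
```

In TensorFlow 2, a generator like this could be wrapped with `tf.data.Dataset.from_generator`, with each of the two sequences going to its own input and embedding layer.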

Where to get the data from?

If you want to do large-scale research on this, you'd be better off collaborating with a large corporation or security group. But if that is not possible in your situation, I'd recommend these data sources, which I used myself, in no particular order:

I cannot share them on GitHub. First, because they get out of date quickly (in a matter of hours!). Second, there wouldn't be any point in mirroring them in a repository, since most of them are publicly available. Also, I might not have permission to share some of them (I'm sure about the first one; redistribution is a no-no). And lastly, for now I have no plans for keeping this repository up to date in the long term; I have completed a project and had my takeaways, and I'm done with it for now.

Related Work

Misc.

What does this name mean anyways? It's a dull play on words. Haplophryne is an [ugly] fish living in the deep ocean. physh, phish, deep, deep learning, get it? Okay I'll stop. Sorry.

Why is everything on the master branch? Do YoU EvEn GIT BrUh? Sorry, I do. I was just lazy here, don't judge.

Are you publishing it? I don't know. I wrote a 10-page conference paper as a deliverable for the course, but it needs more work before it is ready for publication. A summary of that paper, as an extended abstract, is now available here.

I read the abstract, who is "We" exactly? I did the project by myself. I used "we" throughout the paper and the EA out of habit.
