
Let's Hack Some LLMs (Large Language Models)

I have learnt a lot about machine learning and LLMs over the past year. I have taken a handful of courses and I think I have a good grasp of how LLMs work and how they can be manipulated. So I am going to spend some time doing just that: manipulating LLMs to see whether or not I can circumvent the 'guard-rails' these models have.


I am going to be testing GPT-3.5 Turbo and Meta's Llama side by side for comparison.


Asking an LLM directly to write anything illegal, including code, will trigger those guard-rails and the LLM will tell you it can't perform your request:



Let's not forget, though, that LLMs are not human. An LLM is computer code, a clever language algorithm that, to all intents and purposes, guesses the next probable word in a sentence or line of code. GPT doesn't actually 'know' what it is writing; it's not a human.


This makes it possible to trick an LLM into writing nefarious code, like malware, by using carefully built-up, layered prompts and giving the LLM false context.


Let's see if I can prompt GPT and Llama to write a keylogger script in Python and have them suggest how I could hide such a script and install it on a Windows system!


I am going to prompt the LLMs as if I were an inexperienced script kiddie with nefarious intentions.




I was able to successfully prompt GPT-3.5 and Llama to write a keylogger script in Python, suggest how I could transfer the captured text document to a remote server, and advise on how to conceal it. Anyone with some basic cyber and coding skills will be able to tell that there are multiple technical issues with the suggestions and that the script would need some tinkering; however, it wouldn't take too much effort to 'improve' the script and have a working keylogger.


I know this is a crude, simple example of circumventing the guard-rails in LLMs, and I will be sure to post more research over the coming months.
