Hacking the Gandalf Large Language Model

Esky Man
Mar 10, 2024
2 min read

Updated: Mar 10, 2024

My interest in hacking LLM's is literally taking over most of my spare time right now, conducting research, watching youtube videos and taking part in LLM hacking challenges.

In today's blog post I am going to tackle the Gandalf LLM hacking challenge which is a free CTF whereby you have to progress through 7 levels of increasing difficulty at each stage prompting the LLM to reveal a secret password.

I will post screenshots of each successful prompt rather than every prompt I used (otherwise this will be a massive post!)

Level 1

he first level here is very easy, you just need a little creativity.

Level 2

This next level does not have any output or input guard rails, however there is a system prompt in place to prevent the LLM from revealing the secret password.

Level 3

Level 3 sees the introduction of an output guardrail, the guardrail checks the password is not included in the output from the LLM. This can be circumnavigated by telling the LLM to encrypt the password.

Level 4

Now both the input and output guardrails are securing the LM from revealing the secret password. The prompt is checked by a second LLM for leaked passwords. By manipulating its output we can have the LLM reveal the password, without actually revealing the password verbatim. This took me quite a while to work out!

Level 5

In this level there is an input guardrail but no output guardrail. The input protection is looking for the word 'password'. I can assume that LLM only knows one password, so in having it suggest one to me, without using the word 'password' it forces the LLM to reveal the only 1 password it knows.

Level 6

A seperate LLM checks the input on this level but there is no output guardrail. After a number of failed attempts I thought about returning to the idea from the last level about how the password could be 'hidden' - ENCRYPTION !

Level 7

In the final level the LLM introduces both input and output guardrails and a deny-list.

FINISHED !

You can check out the LLM TF game at https://gandalf.lakera.ai/