As more and more tools for pen testing large language models appear on GitHub, I thought it would be worth trying some of them out to see how they work and what their outputs look like.
To kick things off, let's take a look at Giskard, which can be found here on GitHub:
This Python tool can be run in a Jupyter Notebook instance and can test for a range of LLM vulnerabilities such as bias, hallucinations, prompt injection attacks, harmful outputs and more.
We begin by running Giskard in a Google Colab environment and installing the required dependencies:
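A minimal install cell might look like the one below. Beyond "giskard[llm]" itself, the extra packages are my assumptions for a typical LangChain/FAISS retrieval setup like the one used later, so adjust to your own stack:

```python
# Install Giskard with its LLM extras, plus the packages assumed below for
# building a small retrieval-augmented QA chain (LangChain, FAISS, PDF loading).
%pip install --upgrade "giskard[llm]" langchain openai faiss-cpu pypdf tiktoken
```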
Set up your OpenAI API key (as we are using GPT-3.5):
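Something along these lines works in Colab; both the GPT-3.5 model under test and Giskard's LLM-assisted detectors pick the key up from the standard OpenAI environment variable:

```python
import os
from getpass import getpass

# Prompt for the key rather than hard-coding it into the notebook.
# Giskard's detectors and the target model both read OPENAI_API_KEY.
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
```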
Now let's set up the model and test it.
We are using the IPCC's climate change report as the 'knowledge base'.
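Here is a rough sketch of that setup, assuming an older LangChain-style RetrievalQA chain over the IPCC report (the report URL and chunking parameters are illustrative, not taken from the tool's docs). The important part is the last step: wrapping the chain in giskard.Model so the scanner knows how to feed it questions and read its answers.

```python
import giskard
import pandas as pd
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

# Build a small retrieval index over the IPCC report (illustrative URL).
IPCC_REPORT_URL = "https://www.ipcc.ch/report/ar6/syr/downloads/report/IPCC_AR6_SYR_LongerReport.pdf"
documents = PyPDFLoader(IPCC_REPORT_URL).load_and_split(
    RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
)
vector_store = FAISS.from_documents(documents, OpenAIEmbeddings())

# A simple question-answering chain backed by GPT-3.5.
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
chain = RetrievalQA.from_chain_type(llm=llm, retriever=vector_store.as_retriever())

# Wrap the chain for Giskard: it will pass a DataFrame with a "question"
# column and expects one generated answer per row.
def model_predict(df: pd.DataFrame) -> list:
    return [chain.run(question) for question in df["question"]]

giskard_model = giskard.Model(
    model=model_predict,
    model_type="text_generation",
    name="IPCC Climate Change QA",
    description="Answers questions about climate change based on the IPCC report.",
    feature_names=["question"],
)
```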
Now we can scan the LLM:
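A scan can be as simple as the snippet below. For a quick first pass I restrict it to the hallucination detectors via the `only` argument (my choice here, not a requirement):

```python
# Quick scan limited to the hallucination detector group.
results = giskard.scan(giskard_model, only=["hallucination"])

# In a notebook the report renders as an interactive widget.
display(results)
```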
The initial scan results show one hallucination issue, and what's really cool is that Giskard shows you the prompt it used and explains the hallucination - ideal information for a pentest report, I think!
Now we run the full scan to produce the complete report:
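The full report is just a scan with no detector filter, and the results can be exported to HTML, which is handy for attaching to a pentest deliverable (the output filename is my own choice):

```python
# Full scan: runs all available LLM detectors (injection, harmfulness,
# stereotypes, hallucination, and so on). This can take a while and
# consumes OpenAI API credits.
full_results = giskard.scan(giskard_model)

# Save the findings as a standalone HTML report.
full_results.to_html("giskard_scan_report.html")
```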
The full scan reports a number of issues, including prompt injection vulnerabilities, stereotype issues, harmful information and more.
Giskard can also be run from a Hugging Face Spaces page, where it has a GUI. I have not investigated this yet, as I prefer interacting with the code directly. Either way, the test outputs are really interesting from a penetration testing perspective.