Prompt Injection

Prompt Injection is the attempt by a malicious user to manipulate the behavior of an LLM or an LLM application using strategic prompting techniques to produce undesirable responses.

Types of Prompt Injections

We can further classify prompt injections into sub-categories:

  1. Jailbreaking
    i. Attempting to override the LLM’s system prompts (i.e., underlying instructions) to illicit inaccurate, biased, or forbidden responses. ii. This can be further categorized according to the different mechanisms of attack: Role Play (also referred to as Double Character or Virtualization), Obfuscation, Payload Splitting, and Adversarial Suffix.
  2. Instruction Manipulation
    i. Attempting to leak or ignore the LLM’s system prompt or the application’s prompt template, which can reveal sensitive information or inform future attacks.

There is a growing range of types of prompt injections as bad actors continue to explore and expand on their current attack techniques. As a result, these definitions are continuously evolving in the space.

The Arthur Approach**

Arthur's prompt injection detection model is a binary classification model fine-tuned on a prompt injection dataset. We currently focus primarily on Role Play and Instruction Manipulation attacks.

Our prompt injection approach was developed by scouring the internet for prompt injection examples from discussion threads on Reddit to more formal top academic research in prompt injections. For this reason, this model is planned to be consistently updated as bad actors begin to explore and expand on their current attack techniques. Our method truncates texts from the middle after 512 WordPiece tokens. This roughly corresponds to ~2000 characters or ~400 words.

Requirements

Arthur Shield validates prompt injections with the Validate Prompt endpoint. You only need to pass in the user prompt to that endpoint to run prompt injections.

Benchmarks

Benchmark

Accuracy

F1-Score

Precision

Recall

Confusion Matrix

Prompt Injection Benchmark Dataset

86.84%

85.71%

100%

75%

[[18 0] [5 15]]

Benchmarks leverage these resources: HackaPrompt, DeepSet