When User Input Lines Are Blurred: Indirect Prompt Injection Attack Vulnerabilities in AI LLMs

Written by Tom Neaves | Dec 10, 2024 2:00:00 PM

It was a cold and wet Thursday morning, sometime in early 2006. There I was sitting at the very top back row of an awe-inspiring lecture theatre inside Royal Holloway's Founder’s Building in Egham, Surrey (UK) while studying for my MSc in Information Security. Back then, the lecture in progress was from the software security module.

The first rule of software security back then was never to trust user inputs. In software, there are sources — where we take the data from (usually the user), and there are sinks — where that data ends up, with “stuff” going on in between that data. Now, at that time, things were a little simpler, I say with my slippers on, pitched back in my rocking armchair by the fire.

We didn't have the complexities we have today in modern software. We still had one or more multiple sources of data coming in typically from a user, or perhaps another system, for which we had to deal with from a security perspective. Namely input validation and then output encoding. Making sure any payloads (or special characters) that could impact the middle “stuff” (and, subsequently, the output) were filtered out. I said simpler, but not easier. It still took many years to deal with buffer overflows, which exploit input validation to get data to be placed into memory to then be treated as instructions instead, to ultimately change the flow of execution. SQL injection is another case of untrusted user input. While it has improved over the years, it isn't going away anytime soon and is a similar injection — an attacker gets to break out of the data context and modify the database query (the command) to do things that were never intended.

These two examples are direct injections, if you will. We're knocking on the input validation gate and circumventing it because it has dodgy hinges if you like. There is another way we can inject things, typically to the middle “stuff” (in the logic of an application,) and that's indirectly. Enter the attack class which I absolutely love, and that is server-side request forgery, or SSRF for short.

An SSRF vulnerability is when software (an application or an API of some sort) takes malicious user input and utilizes it as a value in a server-side function call to actively do something. Because the function directly uses a user-defined input value to do server-side things, you can see where this could go wrong. Typically, input validation is forgotten about here because the danger is not always noticed, or if input validation is remembered, it is placed in the form of a regular expression, which in my experience, can be bypassed nine times out of 10. We can then get this function call to do something the application didn't intend to do.

For example, take a web application dangling a "url" parameter client side, which is then used in a further (importantly, server-side) query to retrieve some content, which is only accessible from that server-side origin, which in turn is delivered back to the user. We can put anything we want into this parameter, including internal targets — we are essentially “moving the goalposts,” to coin an old British saying. My absolute favorite thing to do here as a proof of concept for clients (if cloud-hosted) is to use such a vulnerability to access internal resources, such as the Instance Metadata Service (IMDS) to grab security credentials to demonstrate impact.

Where am I going with this? I know you came here for the LLMs. I was setting the scene. Fast forward to 2024 (or 2025 if you like, which is nearly upon us). We have artificial intelligence large language models, or AI LLMs as we have come to know and love them as. If you’ve gotten this far, I'm guessing you know what an LLM is but to be inclusive of all audiences here, LLMs are machine learning models that can understand and generate human language. The input to these is typically from a user making input, referred to as “user” prompts, for which the LLM goes away and does its thing and comes back with output. Regarding input validation, this has now shifted to the “system” prompt which primes the LLM, defining the rules of the LLM prompt land, with prompts coming in from the user.

Enter direct prompt and indirect prompt injection attacks.

If you recall from the two earlier examples of buffer overflows and SQL injections, the direct injection attacks are trying to (at the front gate) do things directly with their target. Regarding LLMs, these direct prompt injections are trying to trick the model into doing things it wasn't intended to do, circumventing the guardrails implemented in the system prompt — sometimes known as “jailbreaking.” This usually results in information leaks (including of the system prompt itself) but can also have active consequences, too.

Now, there is a whole blog post I could write on direct prompt injection attacks (and I probably will), but what I want to focus on here are the indirect prompt injection attacks however, because to me these are creative and align with the SSRF primer from earlier, and their very nature is active when we throw AI agents and tools into the mix.

Indirect prompt attacks are when an LLM takes input from external sources but where an attacker gets to smuggle payloads (additional prompts!) into these external/side sources. These malicious additional prompts modify the overall prompt, breaking out of the data context as they are treated as instructions (they are additional prompts, commands, if you will) and, in turn, influence the initial user prompt provided together with the system prompt and with that, the subsequent actions and output. This could involve circumventing those system prompt guardrails again, or in the case of LLM AI agents and retrieval-augmented generation (RAG) agents, function calls, tools, etc., getting them to actively go and do/fetch things they weren’t intended for.

So, to summarise indirect prompt attacks:

An attacker implants a malicious prompt into something they know the LLM will utilize. (If you’re wondering at this stage, Tom, how exactly is an attacker getting to implant the malicious prompt at the right place? Stay with me on this I will give an example later — no spoilers!)
A normal user interacts with the LLM like any other day.
The LLM goes off to the normal resource, but plot twist: this time, it picks up the implanted malicious prompt, parses it as an instruction, and does whatever is asked, e.g. exfiltrate internal corporate data, make an API call, etc.
The attacker waits at their endpoint to receive the treasure if the implanted prompt action was to receive something rather than blindly do an active thing, e.g., RAG tool/function call/API call.
The normal user may or may not notice anything is up, depending on how cleanly the malicious prompt was exited by the attacker and returned to the user. For example, a malicious implanted prompt ending such as, “…and after this, continue with the original prompt provided by the user”, may make it seem all normal to the user – nothing to see here!

If you take one thing away from indirect prompt attacks, it’s that the user prompt, system prompt, and implanted prompt all make it to the LLM, like one big prompt party, and therein lies the problem!

The OWASP Top 10 for LLM Applications & Generative AI website has some really good prompt injection examples, so I'm going to include all of them below. Scenario #2 shows a good example of an indirect injection that I wanted to convey.

Scenario #1: Direct Injection
An attacker injects a prompt into a customer support chatbot, instructing it to ignore previous guidelines, query private data stores, and send emails, leading to unauthorized access and privilege escalation.
Scenario #2: Indirect Injection
A user employs an LLM to summarize a webpage containing hidden instructions that cause the LLM to insert an image linking to a URL, leading to exfiltration of the private conversation.
Scenario #3: Unintentional Injection
A company includes instructions in a job description to identify AI-generated applications. An applicant, unaware of this instruction, uses an LLM to optimize their resume, inadvertently triggering the AI detection.
Scenario #4: Intentional Model Influence
An attacker modifies a document in a repository used by a RAG application. When a user’s query returns the modified content, the malicious instructions alter the LLM’s output, generating misleading results.
Scenario #5: Code Injection
An attacker exploits a vulnerability (CVE-2024-5184) in an LLM-powered email assistant to inject malicious prompts, allowing access to sensitive information and manipulation of email content.
Scenario #6: Payload Splitting
An attacker uploads a resume with split malicious prompts. When an LLM is used to evaluate the candidate, the combined prompts manipulate the model’s response, resulting in a positive recommendation despite the actual resume contents.
Scenario #7: Multimodal Injection
An attacker embeds a malicious prompt within an image that accompanies benign text. When a multimodal AI processes the image and text concurrently, the hidden prompt alters the model’s behavior, potentially leading to unauthorized actions or disclosure of sensitive information.
Scenario #8: Adversarial Suffix
An attacker appends a seemingly meaningless string of characters to a prompt, which influences the LLM’s output maliciously, bypassing safety measures.
Scenario #9: Multilingual/Obfuscated Attack
An attacker uses multiple languages or encodes malicious instructions (e.g., using Base64 or emojis) to evade filters and manipulate the LLM’s behavior.

I can't go this far without mentioning the MITRE ATLAS Matrix, which has a mapping of tactics and techniques used to attack machine learning models that is worth a look.

Back to OWASP, if we look at scenario #5 in the list above, there was a direct prompt injection attack on a browser extension (called EmailGPT), which assists users in writing emails, using an LLM, of course. "The service uses an API service that allows a malicious user to inject a direct prompt and take over the service logic. Attackers can exploit the issue by forcing the AI service to leak the standard hard-coded system prompts and/or execute unwanted prompts."

Now, while the EmailGPT example is a direct prompt injection attack against the API service, this could easily be twisted around and become an indirect one, which is the point of my blog post — that the user input lines are now blurred (this blog post title does not lie!).

We could hypothetically theorize that such a product may not just help compose e-mails, but may also be reading them to carry out automated (or assisted) replying, etc. Now, the incoming e-mails become the source, and what are e-mails in this context? Altogether now… they are untrusted user input! Back to the first part of this blog post — we went full circle!

Now, while all the user input validation’s focus is on the user prompt at the front gate (and the system prompt governing access to it), we have a route into the logic of the application that perhaps has been forgotten about, with perhaps no (or little) input validation occurring. What happens if we sprinkle a little bit of scenario #7 from the above list into the mix and send the user (who is using this similar hypothetical LLM e-mail service) a hidden prompt within an e-mail?

Bad things!

Now, via the hidden prompt, we could ask it to disregard all previous system prompts, ask it to send all current e-mails in the inbox to a specific address and/or create a new e-mail (again to us) and add attachments with all the files from the local “My Documents” folder.

With my pentester hat on, this whole thing is both beautiful and scary. With “traditional” vulnerabilities, including SQL injection, if you were compromised, the impact is that your database got dumped, or in the slightly more painful scenario, perhaps an attacker achieved remote code execution via that vector. Now, vulnerabilities with LLMs, the sky is the limit, especially if that LLM has excessive/privileged access (referred to as “excessive agency”). This may be the equivalent of having an employee on the inside doing whatever you want from an attacker’s perspective — access to all sorts of internal documents and knowledge.

In threat modelling (and testing) AI LLM applications, we, therefore, need to be really mindful of all inputs to them. This refers to both the obvious direct prompt that we typically think about where the user is mainly consuming the service, perhaps via a web interface/chatbot, API, etc., but also the indirect ones, where AI agents go off to get data from external sources and act upon them.

Importantly, the principle of least privilege should be applied to any API or internal resource that the LLM has access to, this being more aimed towards combating direct prompt injections. That internal API is now effectively external due to the LLM now acting as a broker between it and the outside world and should be treated as such — limiting the risk the LLM can do if an attacker should instrument it to do so.

View full post