It was a cold and wet Thursday morning, sometime in early 2006. There I was sitting at the very top back row of an awe-inspiring lecture theatre inside Royal Holloway's Founder’s Building in Egham, Surrey (UK) while studying for my MSc in Information Security. Back then, the lecture in progress was from the software security module.
The first rule of software security back then was never to trust user inputs. In software, there are sources — where we take the data from (usually the user), and there are sinks — where that data ends up, with “stuff” going on in between that data. Now, at that time, things were a little simpler, I say with my slippers on, pitched back in my rocking armchair by the fire.
We didn't have the complexities we have today in modern software. We still had one or more multiple sources of data coming in typically from a user, or perhaps another system, for which we had to deal with from a security perspective. Namely input validation and then output encoding. Making sure any payloads (or special characters) that could impact the middle “stuff” (and, subsequently, the output) were filtered out. I said simpler, but not easier. It still took many years to deal with buffer overflows, which exploit input validation to get data to be placed into memory to then be treated as instructions instead, to ultimately change the flow of execution. SQL injection is another case of untrusted user input. While it has improved over the years, it isn't going away anytime soon and is a similar injection — an attacker gets to break out of the data context and modify the database query (the command) to do things that were never intended.
These two examples are direct injections, if you will. We're knocking on the input validation gate and circumventing it because it has dodgy hinges if you like. There is another way we can inject things, typically to the middle “stuff” (in the logic of an application,) and that's indirectly. Enter the attack class which I absolutely love, and that is server-side request forgery, or SSRF for short.
An SSRF vulnerability is when software (an application or an API of some sort) takes malicious user input and utilizes it as a value in a server-side function call to actively do something. Because the function directly uses a user-defined input value to do server-side things, you can see where this could go wrong. Typically, input validation is forgotten about here because the danger is not always noticed, or if input validation is remembered, it is placed in the form of a regular expression, which in my experience, can be bypassed nine times out of 10. We can then get this function call to do something the application didn't intend to do.
For example, take a web application dangling a "url" parameter client side, which is then used in a further (importantly, server-side) query to retrieve some content, which is only accessible from that server-side origin, which in turn is delivered back to the user. We can put anything we want into this parameter, including internal targets — we are essentially “moving the goalposts,” to coin an old British saying. My absolute favorite thing to do here as a proof of concept for clients (if cloud-hosted) is to use such a vulnerability to access internal resources, such as the Instance Metadata Service (IMDS) to grab security credentials to demonstrate impact.
Where am I going with this? I know you came here for the LLMs. I was setting the scene. Fast forward to 2024 (or 2025 if you like, which is nearly upon us). We have artificial intelligence large language models, or AI LLMs as we have come to know and love them as. If you’ve gotten this far, I'm guessing you know what an LLM is but to be inclusive of all audiences here, LLMs are machine learning models that can understand and generate human language. The input to these is typically from a user making input, referred to as “user” prompts, for which the LLM goes away and does its thing and comes back with output. Regarding input validation, this has now shifted to the “system” prompt which primes the LLM, defining the rules of the LLM prompt land, with prompts coming in from the user.
Enter direct prompt and indirect prompt injection attacks.
If you recall from the two earlier examples of buffer overflows and SQL injections, the direct injection attacks are trying to (at the front gate) do things directly with their target. Regarding LLMs, these direct prompt injections are trying to trick the model into doing things it wasn't intended to do, circumventing the guardrails implemented in the system prompt — sometimes known as “jailbreaking.” This usually results in information leaks (including of the system prompt itself) but can also have active consequences, too.
Now, there is a whole blog post I could write on direct prompt injection attacks (and I probably will), but what I want to focus on here are the indirect prompt injection attacks however, because to me these are creative and align with the SSRF primer from earlier, and their very nature is active when we throw AI agents and tools into the mix.
Indirect prompt attacks are when an LLM takes input from external sources but where an attacker gets to smuggle payloads (additional prompts!) into these external/side sources. These malicious additional prompts modify the overall prompt, breaking out of the data context as they are treated as instructions (they are additional prompts, commands, if you will) and, in turn, influence the initial user prompt provided together with the system prompt and with that, the subsequent actions and output. This could involve circumventing those system prompt guardrails again, or in the case of LLM AI agents and retrieval-augmented generation (RAG) agents, function calls, tools, etc., getting them to actively go and do/fetch things they weren’t intended for.
So, to summarise indirect prompt attacks:
If you take one thing away from indirect prompt attacks, it’s that the user prompt, system prompt, and implanted prompt all make it to the LLM, like one big prompt party, and therein lies the problem!
The OWASP Top 10 for LLM Applications & Generative AI website has some really good prompt injection examples, so I'm going to include all of them below. Scenario #2 shows a good example of an indirect injection that I wanted to convey.
I can't go this far without mentioning the MITRE ATLAS Matrix, which has a mapping of tactics and techniques used to attack machine learning models that is worth a look.
Back to OWASP, if we look at scenario #5 in the list above, there was a direct prompt injection attack on a browser extension (called EmailGPT), which assists users in writing emails, using an LLM, of course. "The service uses an API service that allows a malicious user to inject a direct prompt and take over the service logic. Attackers can exploit the issue by forcing the AI service to leak the standard hard-coded system prompts and/or execute unwanted prompts."
Now, while the EmailGPT example is a direct prompt injection attack against the API service, this could easily be twisted around and become an indirect one, which is the point of my blog post — that the user input lines are now blurred (this blog post title does not lie!).
We could hypothetically theorize that such a product may not just help compose e-mails, but may also be reading them to carry out automated (or assisted) replying, etc. Now, the incoming e-mails become the source, and what are e-mails in this context? Altogether now… they are untrusted user input! Back to the first part of this blog post — we went full circle!
Now, while all the user input validation’s focus is on the user prompt at the front gate (and the system prompt governing access to it), we have a route into the logic of the application that perhaps has been forgotten about, with perhaps no (or little) input validation occurring. What happens if we sprinkle a little bit of scenario #7 from the above list into the mix and send the user (who is using this similar hypothetical LLM e-mail service) a hidden prompt within an e-mail?
Bad things!
Now, via the hidden prompt, we could ask it to disregard all previous system prompts, ask it to send all current e-mails in the inbox to a specific address and/or create a new e-mail (again to us) and add attachments with all the files from the local “My Documents” folder.
With my pentester hat on, this whole thing is both beautiful and scary. With “traditional” vulnerabilities, including SQL injection, if you were compromised, the impact is that your database got dumped, or in the slightly more painful scenario, perhaps an attacker achieved remote code execution via that vector. Now, vulnerabilities with LLMs, the sky is the limit, especially if that LLM has excessive/privileged access (referred to as “excessive agency”). This may be the equivalent of having an employee on the inside doing whatever you want from an attacker’s perspective — access to all sorts of internal documents and knowledge.
In threat modelling (and testing) AI LLM applications, we, therefore, need to be really mindful of all inputs to them. This refers to both the obvious direct prompt that we typically think about where the user is mainly consuming the service, perhaps via a web interface/chatbot, API, etc., but also the indirect ones, where AI agents go off to get data from external sources and act upon them.
Importantly, the principle of least privilege should be applied to any API or internal resource that the LLM has access to, this being more aimed towards combating direct prompt injections. That internal API is now effectively external due to the LLM now acting as a broker between it and the outside world and should be treated as such — limiting the risk the LLM can do if an attacker should instrument it to do so.