
Guardrails to save the day
We’re building more and more AI agents that process real company data. That data often includes PII — email addresses, phone numbers, dates of birth, and customer case descriptions. Even with an enterprise LLM license, one uncomfortable question remains:
Are we legally and technically allowed to send that data to an external LLM provider?
Many teams assume that “enterprise license = safe.” But compliance, logging, trace storage, and data residency tell a more complicated story.
A week ago, I watched a launch presentation from Aimable — a product that promises a safe AI workspace. It triggered a question I had been postponing:
How should we implement guardrails properly in the applications we build ourselves?
Not just moderation. Not just jailbreak protection. Real guardrails also include PII masking, hallucination detection, tool safety, and content filtering.
In this post, I explore what guardrails should actually do — and how different frameworks approach the problem.
What do guardrails do? #
Not every framework operates at the same level, so let me first describe what I think a guardrail should do. Its primary responsibility is to make working with an LLM safe. That means users cannot manipulate the prompt to make the system do something the agent’s developer did not intend. Guardrails also ensure that we use appropriate language, that we do not ask the LLM the wrong things, and that the LLM’s responses remain appropriate.
Guardrails come in different flavours. Some act as a proxy between your application and the LLM: instead of pointing to OpenAI, you point to the proxy. During development, this can be a Docker image; for production, you can run the service in Kubernetes or use the SaaS version. Companies that back open-source products often provide a paid SaaS version with more capabilities. Especially for PII masking, I prefer a local solution; it feels weird to send your data to a remote service so that it can search for and mask PII. Using a proxy is easy; however, it gives you less control. Examples of such offerings are LLM Guard and Guardrails AI.
Another approach is to use an API to verify or mask content. Before giving content to your agent or LLM, you run it through the moderation and masking service. And after receiving the agent’s response, you do the same. This gives you more control, and with that control, you get more responsibility. You have to arrange this guarding behaviour yourself for each interaction with the agent.
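To make the pattern concrete, here is a minimal sketch of this second approach in Python. The `moderate` and `mask_pii` functions are stand-ins I made up for real service calls; only the wrapping pattern matters:

```python
import re

def moderate(text: str) -> bool:
    """Stand-in for a moderation API call; returns True when the content is allowed."""
    blocked_terms = {"ignore all previous instructions"}  # toy rule, not a real service
    return not any(term in text.lower() for term in blocked_terms)

def mask_pii(text: str) -> str:
    """Stand-in for a masking API call; replaces email addresses with a placeholder."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "<EMAIL_ADDRESS>", text)

def guarded_call(user_input: str, llm) -> str:
    """Wrap one interaction with the agent: guard the input, call the LLM, guard the output."""
    if not moderate(user_input):
        raise ValueError("input blocked by moderation guardrail")
    response = llm(mask_pii(user_input))  # the LLM never sees the raw PII
    if not moderate(response):
        raise ValueError("response blocked by moderation guardrail")
    return response
```

Because you wrap every interaction yourself, nothing is guarded unless you remember to call the wrapper; that is the responsibility that comes with the extra control.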
The third approach is to use the available frameworks. Most frameworks, such as SpringAI, Langchain4j, and Embabel, include implementations of guardrails. They aim to give you the same level of control without the burden of having to do everything yourself, often wrapping specific implementations behind a generic interface. Naming is not consistent: some frameworks speak of guardrails, others of moderation, and others offer a callback mechanism for functionality like PII masking.
Blocking or not #
A guardrail can block progress to the next step, thereby preventing the LLM from being called. A typical use case is the jailbreak guardrail, whose purpose is to prevent users from circumventing the prompt’s rules to make the agent do something it is not allowed to do. Another example is the moderation guardrail, which detects content that may be harassment, hate, self-harm, or violence; this type of guardrail is used for user requests as well as assistant responses. A third example is the hallucination guardrail, which checks the content returned by an LLM to ensure it comes from the provided context and is not something the LLM just made up.
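To illustrate the blocking behaviour, here is a toy sketch. The heuristic is my own and far simpler than a real hallucination detector: it raises when the response mentions a number that never appears in the provided context:

```python
import re

class GuardrailTripped(Exception):
    """Raised when a blocking guardrail rejects the content."""

def check_grounding(response: str, context: str) -> None:
    """Toy hallucination check: every number in the response must occur in the context."""
    numbers_in_context = set(re.findall(r"\d+(?:\.\d+)?", context))
    for number in re.findall(r"\d+(?:\.\d+)?", response):
        if number not in numbers_in_context:
            raise GuardrailTripped(f"ungrounded number in response: {number}")
```

A real implementation would use an LLM-as-judge or an NLI model, but the control flow is the same: the exception stops the pipeline before the response reaches the user.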
A guardrail can also have non-blocking features. PII detection is the best example here. If the content contains information such as an email address or date of birth, we want to mask it before sending it to the LLM. The best way to do this is to replace the content with a placeholder. You can add functionality to retain the original value and put it back when the response is returned.
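A minimal sketch of this non-blocking flavour, assuming a simple regex detector for email addresses (a real implementation would use a proper PII engine, but the placeholder bookkeeping works the same way):

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def mask(text: str) -> tuple[str, dict[str, str]]:
    """Replace each email with a numbered placeholder and remember the original value."""
    mapping: dict[str, str] = {}

    def _sub(match: re.Match) -> str:
        placeholder = f"<EMAIL_ADDRESS_{len(mapping) + 1}>"
        mapping[placeholder] = match.group(0)
        return placeholder

    return EMAIL_RE.sub(_sub, text), mapping

def unmask(text: str, mapping: dict[str, str]) -> str:
    """Put the original values back into the LLM response."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text
```

The LLM only ever sees `<EMAIL_ADDRESS_1>`; calling `unmask` on its response puts the real address back for the user.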
Tool arguments and responses #
The user and the assistant are the main contributors to the conversation; however, tools and components such as skills and MCP servers also add content to the prompt. It is important to apply guardrails to information entering tools as well as to the values tools return. One known problem is a malicious MCP server that tries to trick your agent into disclosing information it should not.
One or two endpoints? #
My personal preference is to use two mechanisms: one that moderates the content and returns a true or false value without changing the content, and another that transforms content, for tasks like PII masking, translation, or trimming down large prompts.
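Sketched as code, this preference boils down to two small interfaces; the names and the toy implementations below are mine, not from any framework:

```python
from typing import Protocol

class Moderator(Protocol):
    """Pass/fail check; never changes the content."""
    def allows(self, text: str) -> bool: ...

class Transformer(Protocol):
    """Content-changing step: PII masking, translation, prompt trimming."""
    def transform(self, text: str) -> str: ...

class LengthModerator:
    """Toy moderator: reject prompts over a given size."""
    def __init__(self, max_chars: int) -> None:
        self.max_chars = max_chars

    def allows(self, text: str) -> bool:
        return len(text) <= self.max_chars

class WhitespaceTrimmer:
    """Toy transformer: collapse runs of whitespace to shrink the prompt."""
    def transform(self, text: str) -> str:
        return " ".join(text.split())
```

A moderation result can only block or pass; a transformer returns new content. Keeping them separate makes it obvious which guardrails may alter the prompt.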
Time to have a look at some examples.
Guardrails AI #
I focus on the open-source version of Guardrails AI. Before jumping into the code, visit the Guardrails AI Hub to see which guardrails are available. First, I run the service locally and experiment with a guardrail through its API; after that, I build a Docker image to run the service.
Interact with the API #
Most of the steps I took are from the Guardrails AI deployment page. It all starts with creating an account and obtaining a key.
https://guardrailsai.com/hub/keys
Next, we create a project with uv, add some dependencies, and activate the Python virtual environment to use the guardrails CLI.
uv init guardrails-ai-tryout
cd guardrails-ai-tryout
uv add "guardrails-ai[api]"
source .venv/bin/activate
With the guardrails CLI available, we configure the project, add a guardrail, and configure config.py. The guardrail is taken from the hub I mentioned before.
guardrails configure
guardrails hub install hub://guardrails/guardrails_pii
# config.py
from guardrails import Guard
from guardrails.hub import GuardrailsPII
pii_check = Guard(
    name="pii",
    description="Checks if the string contains specific PII data."
).use(
    GuardrailsPII(entities=["EMAIL_ADDRESS", "PERSON", "DATE_TIME"], on_fail="fix")
)
That is all you need to run the server locally and check if it works.
guardrails start --config=./config.py
# Use the swagger-ui to send a message
http://localhost:8000/docs
If the startup of the guardrails application is successful, you should see something similar to this.
🚀 Guardrails API is available at http://localhost:8000
📖 Visit http://localhost:8000/docs to see available API endpoints.
🟢 Active guards and OpenAI compatible endpoints:
- Guard: pii http://localhost:8000/guards/pii/openai/v1
========================================================================================= Server Logs ==========================================================================================
INFO: Started server process [31365]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
Notice the OpenAI-compatible proxy URL. For now, I demo the validation endpoint with curl. You have to pass the OpenAI API key; the next code block shows the curl request with the specific header.
curl -X 'POST' \
'http://localhost:8000/guards/pii/validate' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-H "x-openai-api-key: $GUARDRAILS_API_KEY" \
-d '{
"llmOutput": "Ik ben op 1 November 2014 gestart bij Luminis. Mijn eerste taak was het ontwikkelen een PII systeem om email adressen als jettro@donotuse.eu te identificeren.",
"numReasks": 0,
"llmApi": "openai.Completion.create"
}'
The response is the following JSON message. Notice the placeholders. It would be nice if we got an array with the replaced values as well.
{
"callId":"281469803946992",
"rawLlmOutput":"Ik ben op 1 November 2014 gestart bij Luminis. Mijn eerste
taak was het ontwikkelen een PII systeem om email adressen als
jettro@donotuse.eu te identificeren.",
"validatedOutput":"Ik ben op <DATE_TIME> gestart bij Luminis. Mijn eerste
taak was het ontwikkelen een PII systeem om email adressen als
<EMAIL_ADDRESS> te identificeren.",
"validationPassed":true
}
You can also use the proxy for OpenAI.
curl -X 'POST' \
'http://localhost:8000/guards/pii/openai/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "gpt-4o-mini",
"messages": [
{
"role": "user",
"content": "This is a test message for Jettro Coenradie, how are you doing?"
}
],
"max_tokens": 1000,
"temperature": 1
}'
There is a problem with the response: it does not return an id field. The OpenAI API specification marks this field as mandatory, yet the actual response omits it. Therefore, working with Spring AI through the proxy does not work at the moment.
{
"choices": [
{
"message": {
"content": "<PERSON>'m just a computer program, so <PERSON> don't have feelings, but <PERSON>'m here to help you! If you have any questions or need assistance, feel free to ask. How can <PERSON> help you today?"
}
}
],
"guardrails": {
"reask": null,
"validation_passed": true,
"error": null
}
}
I created an issue and will update here if something changes.
Notice that the result contains a field validation_passed. They use one endpoint for both types of guardrails.
Build the Docker image #
I want to have a Docker image to run the Guardrails service. A Dockerfile is available for download. It requires a requirements-lock.txt file, which you generate using uv.
uv export --format requirements-txt --hashes --output-file requirements-lock.txt
Then build the image with the following command, and run it.
docker build -f Dockerfile -t "guardrails-server:latest" \
  --build-arg GUARDRAILS_TOKEN=<your-token> .
# Run the Docker container
docker run -p 8000:8000 -d guardrails-server:latest
OpenAI #
OpenAI offers multiple flavours of guardrails. One of them is the Moderation API, a server-side way to moderate content. It is useful for checking both input and output for harmful content: the service classifies content against categories such as violence, hate, and self-harm. The API is well integrated with the SDKs, so the Java SDK can access it too.
Another flavour is the Python library OpenAI Guardrails. This library uses Presidio, the open-source framework from Microsoft. It provides a drop-in replacement for the chat client that enforces all the guardrails defined in a configuration file. Examples are PII, Jailbreak, and NSFW.
A disadvantage is that it is only available for Python. If you are in the JVM domain, as I am most of the time, that is a pity.
Let’s give the framework a go. Create a new project, add the required dependencies, generate the config and try it out.
uv init guardrails-openai-tryout
cd guardrails-openai-tryout
uv add openai-guardrails
uv add python-dotenv
Add a .env file with a property for OPENAI_API_KEY.
Next, go to https://guardrails.openai.com/ and select the guardrails you want to apply. I choose:
- for Input: Mask PII, Moderation, Jailbreak
- for Output: NSFW
Next, you can download the config, and you get a default block of implementation code. Below are the config, the code, and the output. Notice how the guardrails are configured, and check the masking in the end result.
{
"version": 1,
"pre_flight": {
"version": 1,
"guardrails": [
{
"name": "Contains PII",
"config": {
"entities": [
"CREDIT_CARD",
"CVV",
"CRYPTO",
"DATE_TIME",
"EMAIL_ADDRESS",
"IBAN_CODE",
"BIC_SWIFT",
"IP_ADDRESS",
"LOCATION",
"MEDICAL_LICENSE",
"NRP",
"PERSON",
"PHONE_NUMBER",
"URL"
]
}
},
{
"name": "Moderation",
"config": {
"categories": [
"sexual",
"sexual/minors",
"hate",
"hate/threatening",
"harassment",
"harassment/threatening",
"self-harm",
"self-harm/intent",
"self-harm/instructions",
"violence",
"violence/graphic",
"illicit",
"illicit/violent"
]
}
}
]
},
"input": {
"version": 1,
"guardrails": [
{
"name": "Jailbreak",
"config": {
"confidence_threshold": 0.7,
"model": "gpt-4.1-mini",
"include_reasoning": false
}
}
]
},
"output": {
"version": 1,
"guardrails": [
{
"name": "NSFW Text",
"config": {
"confidence_threshold": 0.7,
"model": "gpt-4.1-mini",
"include_reasoning": false
}
}
]
}
}
import asyncio
from pathlib import Path
from dotenv import load_dotenv
from guardrails import GuardrailsAsyncOpenAI, GuardrailTripwireTriggered
from openai.types.chat import ChatCompletionUserMessageParam
async def call_llm(user_input):
    try:
        # GuardrailsAsyncOpenAI is a drop-in replacement for the AsyncOpenAI client
        response = await guardrails_client.chat.completions.create(
            messages=[ChatCompletionUserMessageParam(content=user_input, role="user")],
            model="gpt-4.1-mini",
        )
        print(f"Assistant: {response.choices[0].message.content}")
    except GuardrailTripwireTriggered:
        # If a guardrail is triggered, an exception is raised
        raise

if __name__ == "__main__":
    load_dotenv()
    # Initialize GuardrailsAsyncOpenAI with the config file
    guardrails_client = GuardrailsAsyncOpenAI(config=Path("guardrails_config.json"))
    asyncio.run(call_llm("I need a bit formal message to Jettro Coenradie, I want to congratulate him for his anniversary since 1 november 2014."))
Assistant: Certainly! Here's a formal congratulatory message you can use:
---
Dear <PERSON>,
I hope this message finds you well. I would like to extend my sincere
congratulations on your anniversary since <DATE_TIME>. Your dedication and
commitment are truly commendable, and I wish you continued success in your
endeavors.
Warm regards,
[Your Name]
---
Let me know if you'd like it tailored further!
I like that the response contains a lot of information about the applied guardrails. With a few simple print statements, the GuardrailsResponse can be inspected to obtain information about the PII found and the result of the Jailbreak guardrail.
def print_guardrail_results(results):
    for guardrail_result in results:
        print(f"Guardrail: {guardrail_result.info['guardrail_name']}")
        if guardrail_result.info["guardrail_name"] == 'Contains PII':
            for key, value in guardrail_result.info["detected_entities"].items():
                print(f"  {key}: {value}")
        if guardrail_result.info["guardrail_name"] == 'Jailbreak':
            print(f"  Flagged: {guardrail_result.info['flagged']}")
            print(f"  Reason: {guardrail_result.info['reason']}")

print("--- Preflight Guardrails ---")
print_guardrail_results(response.guardrail_results.preflight)
print("--- Input flight Guardrails ---")
print_guardrail_results(response.guardrail_results.input)
--- Preflight Guardrails ---
Guardrail: Contains PII
PERSON: ['Jettro Coenradie']
DATE_TIME: ['1 november 2014']
Guardrail: Moderation
--- Input flight Guardrails ---
Guardrail: Jailbreak
Flagged: False
Reason: The user is requesting assistance in crafting a formal congratulatory message for an anniversary, which is a legitimate and benign request. There are no signs of deception or manipulation in the input.
Presidio #
Microsoft offers a Data Protection and De-identification SDK called Presidio. This project is referenced by most other projects; the two I discussed in this blog, OpenAI Guardrails and Guardrails AI, are no exception. Presidio is a framework for detecting and masking PII data.
Presidio consists of two parts. The first part detects PII data using the analyser; its goal is to find PII entities and mark where they are located. The second part anonymises the entities you found. You can replace them with *** or use a named mask. The named mask is perfect for the LLM, because the model still understands the context of the masked text.
Presidio is a Python library that uses spaCy for some of its entity detection. If you are interested in the Python approach, have a look at Presidio’s tutorial.
I try the Docker approach. First, I download the containers. Next, I start the containers and expose the service through a local port.
docker pull mcr.microsoft.com/presidio-analyzer
docker pull mcr.microsoft.com/presidio-anonymizer
docker run -d -p 5002:3000 mcr.microsoft.com/presidio-analyzer:latest
docker run -d -p 5001:3000 mcr.microsoft.com/presidio-anonymizer:latest
With the containers running, I can send the data from the previous section.
curl -X POST http://localhost:5002/analyze \
-H "Content-Type: application/json" \
-d '{
"text": "I need a bit formal message to Jettro Coenradie, I want to
congratulate him for his anniversary since 1 november 2014.",
"language": "en"
}' | jq .
The response from the Presidio analyser is:
[
{
"analysis_explanation": null,
"end": 47,
"entity_type": "PERSON",
"score": 0.85,
"start": 31
},
{
"analysis_explanation": null,
"end": 117,
"entity_type": "DATE_TIME",
"score": 0.85,
"start": 102
}
]
I use the response with the original text to mask PII data with the Presidio anonymiser.
curl -X POST http://localhost:5001/anonymize \
-H "Content-Type: application/json" \
-d '{
"text": "I need a bit formal message to Jettro Coenradie, I want to congratulate him for his anniversary since 1 november 2014.",
"analyzer_results": [
{
"analysis_explanation": null,
"end": 47,
"entity_type": "PERSON",
"score": 0.85,
"start": 31
},
{
"analysis_explanation": null,
"end": 117,
"entity_type": "DATE_TIME",
"score": 0.85,
"start": 102
}
]}'
The response now contains the masked values.
{
"text": "I need a bit formal message to <PERSON>, I want to congratulate him for his anniversary since <DATE_TIME>.",
"items": [
{
"start": 94,
"end": 105,
"entity_type": "DATE_TIME",
"text": "<DATE_TIME>",
"operator": "replace"
},
{
"start": 31,
"end": 39,
"entity_type": "PERSON",
"text": "<PERSON>",
"operator": "replace"
}
]
}
The second response is not too hard to create yourself, and Presidio does offer other masking options. To me, the first service, which detects PII data, is the most interesting.
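As a sketch of that claim: given the analyser spans from above, a few lines of Python rebuild the anonymised text, assuming the same start/end offsets Presidio returns (replacing from the back so earlier offsets stay valid):

```python
def anonymize(text: str, analyzer_results: list[dict]) -> str:
    """Replace each detected span with a named placeholder such as <PERSON>."""
    # Sort by start offset, descending, so replacements do not shift later offsets.
    for result in sorted(analyzer_results, key=lambda r: r["start"], reverse=True):
        placeholder = f"<{result['entity_type']}>"
        text = text[:result["start"]] + placeholder + text[result["end"]:]
    return text

message = ("I need a bit formal message to Jettro Coenradie, I want to "
           "congratulate him for his anniversary since 1 november 2014.")
spans = [
    {"entity_type": "PERSON", "start": 31, "end": 47},
    {"entity_type": "DATE_TIME", "start": 102, "end": 117},
]
print(anonymize(message, spans))
# → I need a bit formal message to <PERSON>, I want to congratulate him for his anniversary since <DATE_TIME>.
```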
Concluding #
I like the OpenAI Guardrails approach, which uses a pipeline with three elements: pre-flight, input, and output. The drop-in replacement for the LLM client and agents is, of course, very handy when you remain in the OpenAI space. The wizard for creating the guardrails config is great, and new features are coming; at the time of writing, a tool-use guardrail was in beta. For me, this is definitely a framework to watch. Do you hear the but? But it is not available for the JVM.
Presidio is only for PII detection, which is not necessarily a problem. Guardrails AI provides PII detection through Presidio and extends it with many other guardrails. Their hub for downloading existing guardrails is a great addition. For me, the proxy for connecting to the OpenAI API did not work because the id field was missing. Also, the response does not include the values of the extracted entities, although with the input and the output at hand, that information can of course be extracted.
Presidio and Guardrails AI are both Python environments that expose a service variant through a Docker container. For now, this is a fine integration path for a JVM language. In a follow-up blog, I will look at integrating these guardrails into Spring AI and Embabel. Both frameworks are in active development, and I expect easy integration to become available in the coming weeks.