The RAG pipeline most teams build first
It looks like this. A drag-and-drop uploader sends PDFs to an S3 bucket. A Lambda triggers on upload and calls a hosted embeddings API. The embeddings land in a hosted vector database. A chat interface sends user queries to the same embeddings API, runs a vector search, and feeds the top-K chunks plus the query into a hosted LLM.
This works for a demo. For regulated documents, it is a mess.
- Document content leaves your environment at embed time and again at inference time.
- A third party vector service holds vector representations of your regulated data.
- There is no audit trail that survives a real compliance review. HIPAA applies to RAG retrievals the same way it applies to a human accessing a record, and regulators have been explicit about this.
- The model or embedding provider could change retention policies next quarter.
If you are building on regulated data (health records, privileged filings, government information, financial filings), you need a different architecture from day one. Not because the demo architecture is dangerous in some abstract way, but because retrofitting security into a RAG pipeline is harder than building it in.
This is the architecture that actually holds up.
The boundary principle
The single most important design decision: data does not leave your environment. Not the raw documents. Not the chunks. Not the embeddings. Not the prompts that reference them.
Every other decision follows from this.
In practice, this means:
- Embedding model runs inside your cloud account. Bedrock, Vertex AI, or a self-hosted embedding model.
- Vector storage is inside your cloud account. Postgres with pgvector, OpenSearch with kNN, or Aurora.
- Inference runs through Bedrock, Vertex, or a private model deployment.
- All traffic between layers stays on private networking. VPC endpoints, not public internet.
When a compliance reviewer asks "where does this data go," the answer is a diagram with one boundary and nothing leaving it.
Ingestion without surprises
The document ingestion layer is where most teams accidentally leak data before the LLM ever sees it.
Upload path. Direct to a storage bucket with presigned URLs. Object-level encryption at rest with your own KMS keys. Bucket policies that block public access and require TLS. Logging on every upload and download.
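A minimal sketch of the presigned upload path with boto3, assuming illustrative bucket and KMS key names. Because the encryption parameters are part of the signature, the uploader has to send the matching headers when it uses the URL.

```python
import boto3

# Illustrative names; substitute your own bucket and KMS key.
BUCKET = "regulated-docs-ingest"
KMS_KEY_ARN = "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE"

s3 = boto3.client("s3")

def presigned_upload_url(object_key: str, expires_in: int = 300) -> str:
    """Short-lived presigned PUT that requires SSE-KMS with your own key."""
    return s3.generate_presigned_url(
        "put_object",
        Params={
            "Bucket": BUCKET,
            "Key": object_key,
            "ServerSideEncryption": "aws:kms",
            "SSEKMSKeyId": KMS_KEY_ARN,
        },
        ExpiresIn=expires_in,
    )
```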
Parsing. For PDFs and Office documents, use a parser that runs inside your environment. Do not send documents to a third party parsing API. Libraries like pdfplumber, unstructured, or a Textract endpoint inside your AWS account all work.
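A minimal sketch of local parsing with pdfplumber, extracting text per page so page numbers survive into chunk metadata:

```python
import pdfplumber

def extract_pages(path: str) -> list[dict]:
    """Per-page text extraction, run entirely inside your environment."""
    pages = []
    with pdfplumber.open(path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            pages.append({"page": number, "text": page.extract_text() or ""})
    return pages
```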
Chunking. Standard recursive character splitting works for most content. For long-form documents with clear section structure, semantic chunking at section boundaries gives better retrieval. This is a tuning decision, not a security decision, but it affects output quality significantly.
Metadata. Every chunk gets metadata: source document, section, page, access control group, classification level. This metadata drives retrieval filtering and audit later.
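A minimal sketch of the chunking step with metadata attached at the moment the chunk is created, rather than bolted on later. Field names and sizes are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source_doc: str
    page: int
    section: str
    tenant_id: str
    access_group: str
    classification: str

def chunk_text(text: str, *, source_doc: str, page: int, section: str,
               tenant_id: str, access_group: str, classification: str,
               size: int = 1000, overlap: int = 200) -> list[Chunk]:
    """Fixed-size character chunking with overlap; metadata travels with every chunk."""
    chunks, start = [], 0
    while start < len(text):
        piece = text[start:start + size]
        chunks.append(Chunk(piece, source_doc, page, section,
                            tenant_id, access_group, classification))
        start += size - overlap
    return chunks
```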
Embeddings inside the boundary
The embedding model choice is about three things: quality, cost, and where it runs.
AWS Bedrock. Cohere Embed and Titan Embed both run as managed Bedrock endpoints. Data never leaves AWS. Supports VPC endpoints. The default choice for teams on AWS.
GCP Vertex AI. text-embedding-005 and similar. Same properties on GCP.
Self-hosted. BGE, E5, or Nomic Embed on a GPU instance. Highest control. Most operational overhead. Worth it for FedRAMP High or similar environments where managed services are out of scope.
Whichever you pick, document it and pin the model version. Embedding quality changes between versions and a silent upgrade can tank your retrieval quality.
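A minimal sketch of the embedding call through Bedrock with boto3. The model ID shown is one option; whichever you choose, pin it and treat any change as a deliberate re-embedding event.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Pinned model version. A silent upgrade here silently changes your vector space.
EMBED_MODEL_ID = "amazon.titan-embed-text-v2:0"

def embed(text: str) -> list[float]:
    """Embed a chunk through a Bedrock endpoint; the call stays inside AWS."""
    response = bedrock.invoke_model(
        modelId=EMBED_MODEL_ID,
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]
```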
Vector storage that audits cleanly
Skip the dedicated vector databases for regulated workloads. Use primitives you already have.
Postgres with pgvector. The most common choice. Already in your compliance scope if you use RDS. Supports row-level access control, which matters when different users should retrieve different subsets. Handles tens of millions of vectors well.
OpenSearch with kNN. Better for very high scale. Native to AWS. Good if you are also doing keyword search alongside vector search.
Aurora. If you are already on Aurora PostgreSQL, pgvector is supported natively in recent engine versions.
The dedicated vector databases are fast and feature rich, but they add a new third party to your compliance scope, a new auth surface, a new audit trail to maintain. For most regulated workloads the tradeoff is not worth it.
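A minimal sketch of the pgvector schema with access control metadata and row-level security wired in, using psycopg. Table and column names are illustrative; the vector dimension should match the embedding model you pinned, and the policy reads per-request settings that the retrieval layer declares (shown in the next section). The application role must not own the table, or RLS is bypassed.

```python
import psycopg  # psycopg 3; the DSN below is a placeholder

DDL_STATEMENTS = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    """
    CREATE TABLE IF NOT EXISTS chunks (
        id             BIGSERIAL PRIMARY KEY,
        source_doc     TEXT NOT NULL,
        page           INT,
        section        TEXT,
        tenant_id      TEXT NOT NULL,
        access_group   TEXT NOT NULL,
        classification TEXT NOT NULL,
        content        TEXT NOT NULL,
        embedding      vector(1024)
    )
    """,
    "ALTER TABLE chunks ENABLE ROW LEVEL SECURITY",
    # Rows are visible only when they match the settings the application declares per request.
    """
    CREATE POLICY chunk_access ON chunks USING (
        tenant_id = current_setting('app.tenant_id')
        AND access_group = ANY (string_to_array(current_setting('app.access_groups'), ','))
    )
    """,
]

with psycopg.connect("postgresql://app@rag-db/rag") as conn:
    for statement in DDL_STATEMENTS:
        conn.execute(statement)
```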
Retrieval with access control
This is where teams get burned.
A user queries the system. The vector search returns the top 10 most semantically similar chunks. If those chunks include documents the user is not authorized to see, you have a data leak through retrieval.
Current guidance for regulated workloads is per-request authorization at the retrieval layer, not just at connection time. An authenticated connection to the database is not the same thing as an authenticated retrieval decision. Use RBAC or ABAC controls that evaluate on every query against the requesting user's context.
The fix:
- Attach access control metadata to every chunk at ingestion. Source document, access group, classification level, owning tenant.
- At query time, pass the authenticated user's access groups and tenant to the vector search.
- Postgres row-level security or a strict WHERE clause on the access group column handles this cleanly.
- Log the authorization decision for every retrieval, not just the retrieval itself.
Never retrieve everything and filter in the application layer. That pattern has sent cross-tenant data to users more times than anyone wants to admit.
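A minimal sketch of query-time retrieval against that schema, assuming the pgvector Python adapter. The caller's tenant and access groups are set as transaction-local settings, the RLS policy does the filtering inside Postgres, and the authorization context is logged with the result.

```python
import logging

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

log = logging.getLogger("retrieval")

def retrieve_chunks(conn: psycopg.Connection, query_embedding: list[float],
                    tenant_id: str, access_groups: list[str], k: int = 10):
    """Top-k similarity search with authorization enforced in the database, not the app layer."""
    register_vector(conn)
    # Declare the caller's context for this transaction; the RLS policy reads these settings.
    conn.execute("SELECT set_config('app.tenant_id', %s, true)", (tenant_id,))
    conn.execute("SELECT set_config('app.access_groups', %s, true)",
                 (",".join(access_groups),))
    rows = conn.execute(
        """
        SELECT id, source_doc, page, content
        FROM chunks
        ORDER BY embedding <=> %s
        LIMIT %s
        """,
        (np.array(query_embedding), k),
    ).fetchall()
    # Log the authorization decision, not just the retrieval.
    log.info("retrieval tenant=%s groups=%s chunks=%s",
             tenant_id, access_groups, [row[0] for row in rows])
    return rows
```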
Inference inside the boundary
The retrieved chunks plus the user query go to the model. For regulated data, the model call runs through Bedrock, Vertex, or a self-hosted deployment.
Bedrock specifically is worth calling out. Anthropic's Claude models are available through Bedrock, as are Llama, Mistral, and Cohere models. The inference call stays inside AWS, can be routed through VPC endpoints, and falls under AWS's compliance programs, including HIPAA eligibility under the AWS BAA and FedRAMP High authorization.
From the user's perspective the experience is identical to calling Anthropic's API directly. From a compliance perspective the data boundary is dramatically different.
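A minimal sketch of the inference call through the Bedrock Converse API. The model ID and system prompt are illustrative; the point is that the retrieved context never leaves the account.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Illustrative model ID; pin whichever Bedrock-hosted model you use.
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"

def answer(query: str, chunks: list[str]) -> str:
    """Grounded generation: retrieved chunks plus the query, sent to a model inside the boundary."""
    context = "\n\n".join(chunks)
    response = bedrock.converse(
        modelId=MODEL_ID,
        system=[{"text": "Answer only from the provided context. "
                         "If the answer is not in the context, say so."}],
        messages=[{
            "role": "user",
            "content": [{"text": f"Context:\n{context}\n\nQuestion: {query}"}],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]
```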
Evals that keep it honest
A RAG pipeline in production is a moving target. Documents get added. Models get updated. Chunking strategies get tuned. Without evals, quality silently degrades and nobody notices until a user complains.
A minimum viable eval suite:
- Retrieval evals. A fixed set of queries with known-correct chunks. Measure recall at K. Run on every ingestion pipeline change and every embedding model change.
- Generation evals. A fixed set of queries with known-correct answers. Measure answer quality with both string-match heuristics and LLM-as-judge grading.
- Refusal evals. Queries that should be refused. Questions outside the document scope. Prompt injection attempts. Measure that the system refuses appropriately.
- Access control evals. Queries from users with limited access. Verify no unauthorized chunks appear in retrieval.
Run these on every significant change. Store the results. When quality regresses, you will know within minutes rather than weeks.
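A minimal sketch of the retrieval eval, the simplest of the four: recall at K over a fixed query set. The eval-case shape is an assumption.

```python
def recall_at_k(eval_cases: list[dict], retrieve, k: int = 10) -> float:
    """Average fraction of known-relevant chunks that appear in the top-k results.

    Each case looks like {"query": str, "relevant_chunk_ids": set}; `retrieve` is the
    retrieval function under test and returns ranked chunk ids for a query.
    """
    scores = []
    for case in eval_cases:
        retrieved = set(retrieve(case["query"], k=k))
        relevant = case["relevant_chunk_ids"]
        scores.append(len(retrieved & relevant) / len(relevant))
    return sum(scores) / len(scores)
```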
Observability and audit logging in production
For regulated workloads, the audit log is not the same as the application log. Current guidance is to maintain immutable, time-stamped records that link the user identity, the specific query, the retrieved sources, the model version served, and the returned output. Store these in WORM or append-only systems with verified integrity controls and defined retention. On AWS, S3 with object lock and KMS CMKs is the common pattern. On GCP, Cloud Storage with object versioning and bucket lock. Under HIPAA, retention is typically six to seven years.
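A minimal sketch of writing one audit record to an S3 Object Lock bucket with boto3, assuming the bucket was created with Object Lock enabled and illustrative names throughout.

```python
import json
import uuid
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")
AUDIT_BUCKET = "rag-audit-log"  # illustrative; must be created with Object Lock enabled

def write_audit_record(user_id: str, query: str, chunk_ids: list,
                       model_id: str, output: str) -> None:
    """Immutable, time-stamped record linking identity, query, sources, model version, and output."""
    timestamp = datetime.now(timezone.utc)
    record = {
        "timestamp": timestamp.isoformat(),
        "user_id": user_id,
        "query": query,
        "retrieved_chunk_ids": chunk_ids,
        "model_id": model_id,
        "output": output,
    }
    s3.put_object(
        Bucket=AUDIT_BUCKET,
        Key=f"audit/{timestamp.date()}/{uuid.uuid4()}.json",
        Body=json.dumps(record).encode(),
        ObjectLockMode="COMPLIANCE",
        # Retention is a policy decision; seven years shown as one common HIPAA answer.
        ObjectLockRetainUntilDate=timestamp + timedelta(days=365 * 7),
    )
```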
The minimum operations dashboard, on top of the audit log:
- Query volume, by user group and tenant.
- Retrieval latency and hit rate.
- Inference latency and token cost.
- Refusal rate. Unexpected spikes mean something changed.
- User feedback, thumbs up and down on responses.
Flagged responses feed back into the eval suite. Over time your evals become a regression test against real user-observed failure modes. This is how a RAG pipeline gets better in production instead of drifting.
A reference stack
For a team on AWS building RAG on HIPAA-scope documents, a reasonable starting stack:
- S3 with KMS customer-managed keys and VPC endpoints for document storage.
- A document parser running inside the VPC (self-hosted or a Textract endpoint).
- A Bedrock-hosted embedding model, called through a Bedrock VPC endpoint.
- Aurora Postgres with pgvector for vector storage, with row-level security on tenancy and access group columns.
- Per-request authorization evaluated on every query.
- A foundation model on Bedrock for inference, called through a Bedrock VPC endpoint.
- AES-256 at rest, TLS 1.2 or higher in transit, S3 object lock for audit logs.
- CloudWatch, CloudTrail, Bedrock invocation logging, and a custom eval harness for observability.
Every layer stays inside the AWS boundary. Every component is in scope of your existing HIPAA controls. Every data flow logs to CloudTrail and your SIEM.
This is not the fastest RAG pipeline to ship. It is the one that does not blow up in a compliance review and does not require a rewrite six months in when the scope expands.
Kai Token leads AI engineering at Fraktional. Works on secure RAG architectures for teams with data that cannot leave the environment. Believes the boundary is the architecture.