Pfula — an isiZulu AI assistant for South African government services on Azure AI Foundry
A bilingual (isiZulu / English) government-services assistant built on Azure OpenAI and Foundry Agent Service — where the knowledge base is the tool-set and citizen-facing streaming is first-class.
- Azure AI Foundry
- Foundry Agent Service
- Azure OpenAI
- isiZulu
- Social good
- Government services
Pfula is a bilingual (isiZulu / English) assistant that walks South Africans through SASSA, Home Affairs, SARS, UIF, municipal services, the Deeds Office, and CIPC processes — in their own language, on a phone, with escalation letters generated on request. The name is Xitsonga for “to open”, because government services should be open to everyone.
Who Pfula is for
Every South African has a government-services story. You wait four hours at Home Affairs for an ID that never arrives. Your SASSA grant is rejected with no reason given. Your eThekwini municipal bill triples overnight. You phone the call centre and the call centre phones back a different number three days later. The people most likely to need government services — the unemployed, the elderly, the first-time taxpayer, the new small-business owner — are the least likely to have the time, the airtime, or the literacy in bureaucracy-English to navigate it unaided.
Pfula’s target user is that person. The shape is deliberately familiar: a WhatsApp-style chat, answered in isiZulu when the user writes in isiZulu and in English when they don’t. The assistant knows the real office addresses, the real form numbers, the 0800 numbers that actually pick up, the appeal deadlines that actually apply, and the specific South African legislation that a well-written complaint letter should cite. When the conversation escalates from “help me understand this” to “help me do something about it,” Pfula can generate a formal complaint letter — correctly addressed, with the right Act cited, with a 14-business-day deadline, ready to print or email.
Pfula was publicly demoed at the Data & AI Community Day Durban: AI Unplugged event on 14 March 2026. This post is about the architecture underneath the demo — how Pfula is built on Microsoft Azure AI Foundry, what that shape buys, and where it is going next.
Why Foundry is the right shape
Four things about Pfula’s problem shape make Azure AI Foundry the right fit rather than an alternative inference stack.
First, data residency. Pfula handles real South African citizen queries — even without persisting sensitive data, the inference path should run in a region the South African government recognises as local. Azure OpenAI through Foundry runs in South Africa North. That is a narrative consideration more than a strict legal one, but narrative matters when the target user is government-adjacent and the funders most likely to back the work are state or state-aligned.
Second, identity posture. Managed identity is Azure’s killer feature for public-sector workloads: long-lived API keys never have to appear in the running surface, and the hosting environment’s system-assigned identity authenticates to the model endpoint directly. For a project that has to pass due-diligence conversations about how services authenticate to each other, “the application carries no long-lived credentials in production” is a materially shorter answer than any alternative.
Third, the knowledge base fits Foundry’s mental model. Pfula’s knowledge base is seven JSON files, one per government service. Foundry Agent Service has a first-class notion of function tools: the agent decides which function to call, the function returns the data, and the agent reasons over the result. That shape — “which of seven service playbooks is this complaint about?” — is exactly what agent-service function tools are for.
Fourth, the MVP angle. Building on Microsoft AI rather than around it is directly aligned with the AI Services (Foundry) contribution area and with the Microsoft-aligned funder pipeline that a social-good pilot like Pfula lives inside.
How the conversational path is built
The conversational path — /api/chat for the request/response UI and /ws/chat/{conv_id} for the WebSocket-streaming UI — is the hot path. It needs token-level streaming and low latency. It does not need tool calls; the system prompt carries the persona and the detected service’s knowledge section, and the model emits plain text.
The Azure OpenAI client is constructed once and shared across requests:
import os

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AsyncAzureOpenAI

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default",
)

client = AsyncAzureOpenAI(
    azure_ad_token_provider=token_provider,
    api_version="2024-10-21",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)
The non-obvious piece here is DefaultAzureCredential. It is a credential chain — in production it resolves to the App Service or Container App’s system-assigned identity; on a developer laptop it resolves to az login; inside a CI pipeline it resolves to workload identity federation. One credential construction, three environments, no if-branches in application code. That is the shape that makes managed identity practical.
Request assembly folds the system prompt into the messages array as the first message with role="system", followed by the rolling conversation history:
openai_messages = [
    {"role": "system", "content": request_system_prompt},
    *messages,
]

response = await client.chat.completions.create(
    model=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT"],
    max_tokens=1024,
    messages=openai_messages,
)
assistant_message = response.choices[0].message.content or ""
Streaming is where the citizen-facing latency story lives. On a 3G connection with a 400 ms round-trip, a user who sees the first words of a response within 600 ms experiences a conversation; a user who waits three seconds for a complete response experiences a form submission. The streaming handler unpacks server-sent-event chunks and forwards each delta to the WebSocket:
stream = await client.chat.completions.create(
    model=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT"],
    max_tokens=1024,
    messages=openai_messages,
    stream=True,
)

full_response = ""
async for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if not delta:
        continue
    full_response += delta
    await websocket.send_json({"type": "stream", "content": delta})
Two small shapes are worth calling out. The if not chunk.choices guard catches chunks that legitimately arrive with an empty choices array, such as the content-filter and metadata chunks the service can emit. The if not delta guard catches the role-assignment chunk, whose delta carries a role but no content. Both are normal parts of the SSE stream; missing either guard crashes the stream intermittently.
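Those two guards are easy to exercise offline with stand-in chunk objects. A minimal sketch — the Chunk, Choice, and Delta dataclasses below are hypothetical test doubles, not the openai SDK types:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Delta:
    content: Optional[str] = None


@dataclass
class Choice:
    delta: Delta = field(default_factory=Delta)


@dataclass
class Chunk:
    choices: List[Choice] = field(default_factory=list)


def collect_stream(chunks) -> str:
    """Apply the two guards from the streaming handler and join the deltas."""
    full_response = ""
    for chunk in chunks:
        if not chunk.choices:  # metadata/content-filter chunk: empty choices
            continue
        delta = chunk.choices[0].delta.content
        if not delta:  # role-assignment chunk: delta with no content
            continue
        full_response += delta
    return full_response


chunks = [
    Chunk(),                               # no choices at all
    Chunk([Choice(Delta(None))]),          # role chunk, content is None
    Chunk([Choice(Delta("Sawubona"))]),
    Chunk([Choice(Delta(", unjani?"))]),
]
print(collect_stream(chunks))  # → Sawubona, unjani?
```

Dropping either guard turns the first two chunks into an IndexError or a TypeError, which is exactly the intermittent crash described above.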
Managed-identity-first auth
The committed default for authentication is managed identity. DefaultAzureCredential resolves to the hosting environment’s system-assigned identity in production and to az login credentials on a developer laptop. The App Service or Container App’s identity gets two role assignments — Cognitive Services OpenAI User on the Azure OpenAI resource and Azure AI Developer on the Foundry project — and that is the entire auth story. No key rotations to schedule, no secret-scanning false positives, no “oops, this leaked in a screen-share.”
An AZURE_OPENAI_API_KEY environment variable is supported as a documented fallback for local development, because az login inside Docker Desktop can be finicky. The .env.example is explicit: do not set it in production. The production posture is that there is no long-lived credential between the application and the model endpoint — the bearer token is minted on demand from the DefaultAzureCredential token provider, and rotates automatically.
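The key-versus-identity selection can be sketched as a small factory. The auth_kwargs helper below is hypothetical, not Pfula’s actual code, and the token provider is elided to a placeholder string:

```python
import os


def auth_kwargs() -> dict:
    """Choose client auth: an explicit API key only if one is set
    (the documented local-dev fallback), otherwise the managed-identity
    token path."""
    key = os.getenv("AZURE_OPENAI_API_KEY")
    if key:
        # Local-dev fallback -- never set this in production.
        return {"api_key": key}
    # Production posture: tokens minted on demand, no long-lived secret.
    # The placeholder stands in for the real token-provider callable.
    return {"azure_ad_token_provider": "<token-provider>"}
```

In real client construction the placeholder would be the get_bearer_token_provider(...) callable shown earlier; the point of the sketch is that the branch lives in one factory rather than scattered if-statements.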
This is the single most load-bearing operational decision in the project. Every due-diligence conversation with a potential funder starts with some version of “how do your services authenticate to each other?” A one-sentence answer — managed identity, no long-lived credentials — closes that part of the conversation and opens the next one.
Escalation letters on Foundry Agent Service
The /api/escalation-letter endpoint is the piece that justifies the rest of the architecture.
Letter generation is inherently agentic: decide which service the complaint belongs to, fetch that service’s canonical knowledge, emit a structured letter. Each step is a tool call. The model is the orchestrator.
The agent wiring lives in backend/agent.py. The tool surface is seven function tools, one per government service:
def lookup_sassa() -> str:
    """Return the SASSA knowledge section — grants, appeals, escalation bodies."""
    return json.dumps(_load_kb("sassa"), ensure_ascii=False)

def lookup_home_affairs() -> str:
    """Return the Home Affairs knowledge section — IDs, passports, civil records."""
    return json.dumps(_load_kb("home_affairs"), ensure_ascii=False)
# ...and five more: UIF, SARS, eThekwini Municipality, Deeds Office, CIPC.
Each function is a plain synchronous callable with a descriptive docstring. The docstring is not documentation for humans — it is the tool description the model reads when choosing which tool to call. That is why the docstrings are written in the tool-selection register (“Return the SASSA knowledge section — grants, appeals, escalation bodies”) rather than the function-documentation register (“Loads the SASSA JSON from disk and returns it.”). The model’s decision quality is bounded by the docstring quality.
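The docstring-as-tool-description mechanic can be made concrete with a hand-rolled sketch of the derivation. The tool_spec helper below is hypothetical; the SDK’s FunctionTool wrapper does a richer version of the same thing, including a parameter schema:

```python
import inspect


def lookup_sassa() -> str:
    """Return the SASSA knowledge section — grants, appeals, escalation bodies."""
    return "{}"  # stand-in for the real KB JSON


def tool_spec(fn) -> dict:
    """Derive the description the model reads when choosing a tool:
    name from __name__, description from the docstring."""
    return {"name": fn.__name__, "description": inspect.getdoc(fn) or ""}


spec = tool_spec(lookup_sassa)
print(spec["name"])  # lookup_sassa
```

Seen this way, a vague docstring literally degrades the routing signal the model gets, which is why the tool-selection register matters.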
The agent is created once per process and cached:
# Imports assuming the azure-ai-projects SDK.
from azure.ai.projects.models import FunctionTool, ToolSet

toolset = ToolSet()
toolset.add(FunctionTool(functions=_SERVICE_TOOLS))

agent = project.agents.create_agent(
    model=os.getenv("FOUNDRY_AGENT_MODEL", "gpt-4o"),
    name="pfula-escalation-agent",
    instructions=_AGENT_INSTRUCTIONS,
    toolset=toolset,
)
…and the endpoint collapses to a single delegation:
letter = await generate_escalation_letter_via_agent(
    service_type=service_type,
    problem_description=problem_description,
    citizen_name=citizen_name,
    citizen_id=citizen_id,
    reference_number=reference_number,
)
The agent reads the complaint, decides whether this is a SASSA problem or a Home Affairs problem or a municipal problem, calls the correct lookup_* tool, reads the KB section it just fetched, and returns the structured letter. The four-key contract — recipient, subject, body, legislation_cited — is enforced by the agent’s instructions, not by a fragile single-shot prompt.
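A minimal sketch of enforcing that contract at the application boundary — the validate_letter helper is hypothetical, not the project’s code:

```python
REQUIRED_KEYS = {"recipient", "subject", "body", "legislation_cited"}


def validate_letter(payload: dict) -> dict:
    """Reject any agent output that does not carry the four-key contract,
    so a malformed run fails loudly at the boundary instead of producing
    a half-formed letter."""
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        raise ValueError(f"letter missing keys: {sorted(missing)}")
    return payload
```

Instruction-level enforcement handles the common case; a cheap boundary check like this catches the rare run that drifts.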
Knowledge-base-as-tools: the central design choice
The most important design decision in Pfula’s architecture is where the knowledge base lives in the control flow.
The conversational path detects the service in Python (a keyword-matching function), fetches that service’s JSON, and injects it into the system prompt. That is a deliberate cost choice — seven services at ~7–10k tokens each would blow out the per-request input budget, so injecting only one section keeps the rate-limit footprint reasonable. Keyword detection on a chat turn is acceptable because the downstream generation is additional context for a persona, not a decision the model needs to explain.
Escalation-letter generation is different. There, the agent chooses which tool to call, and if it isn’t sure it can call two. The agent also commits to the choice in a way that injected prompt context never does — because a tool call is an explicit artefact visible in the run log, not an opaque attention pattern. When a complaint touches Home Affairs, the Deeds Office, and potentially SARS — for example, “can I get a certificate of my late father’s estate?” — the agent run log shows which tools got called. If the produced letter cites the wrong legislation, the run log reveals which KB section the model grounded against.
Put differently: injecting context into the prompt is how you hand a model knowledge; tool calls are how you give it affordances. Context is fine when the work is generative. Affordances are better when the work is decision-making.
Operating notes from building on GPT-4o
Three things are worth flagging for anyone building a similar bilingual agent on Foundry.
GPT-4o’s isiZulu has a specific register. The model leans formal and slightly literal in isiZulu, more than the warmth a citizen assistant needs. The fix is tuning the persona examples in the system prompt — small, colloquial turns of phrase that steer the model toward the warmth the user experience needs. The lesson is that persona tuning is a first-class step for any non-English agent, not a cosmetic one.
Strict schemas at the boundary beat strict schemas in the middle. The agent instructions specify legislation_cited as an array of strings. Most runs comply; occasionally the model returns a single string. The backend normalises at the boundary: if the field is a string, wrap it in a list. Two lines of code, and a reminder that boundary-layer resilience is cheaper than prompt discipline in the long run.
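Those two lines, roughly — the normalise_legislation helper is a hypothetical name for the backend’s normalisation:

```python
def normalise_legislation(value) -> list[str]:
    """Boundary normalisation: the agent usually returns a list of
    strings, occasionally a bare string -- accept both."""
    return [value] if isinstance(value, str) else list(value)
```

Every downstream consumer can then assume a list, and the occasional non-compliant run costs nothing.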
The agent shape keeps the application small. The escalation-letter endpoint is around twenty lines of delegation code. The system prompt no longer needs to say “return ONLY JSON, no markdown fences, no preamble” — the agent handles structured output. The caller-side parsing is a single json.loads with a defensive markdown-fence strip, and even that only runs occasionally. Less code because the SDK does more of the work. That is usually a good trade, and it is an especially good trade when the extra SDK work is grounded, auditable, and agentic rather than magical.
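That defensive parse can be sketched as follows. The parse_agent_json helper is hypothetical; the FENCE constant exists only to avoid embedding a literal markdown fence in the source:

```python
import json

FENCE = "`" * 3  # a markdown code fence, built indirectly


def parse_agent_json(text: str) -> dict:
    """Strip a markdown fence if the model added one, then json.loads
    the remainder -- the whole caller-side parse in one place."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        lines = cleaned.splitlines()
        # Drop the opening fence line (and the closing one if present).
        lines = lines[1:-1] if lines[-1].strip() == FENCE else lines[1:]
        cleaned = "\n".join(lines)
    return json.loads(cleaned)
```

On a compliant run the fence branch never fires and the function is just json.loads; the extra lines only earn their keep on the rare fenced response.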
What comes next
Pfula is a pilot, not a product. The next milestones, roughly in order:
- Voice input. Azure Speech recognition in both English and isiZulu, for users who can speak more comfortably than they can type. This is a natural pairing with the Azure Neural TTS voice (en-ZA-LeahNeural) that already ships.
- A WhatsApp Business entry point. The UI simulates WhatsApp; the obvious next step is to hang the agent off an actual WhatsApp Business number so users don’t have to visit a website at all.
- An evaluation harness. There is qualitative confidence that the current build does what it should. There is no quantitative data yet. Before any claim about impact can be made honestly, there has to be a tracked set of held-out citizen queries and human-rated responses, across both languages and all seven services.
- Deployment into a real citizen context. A stage demo is not an app. The harder work — the partnership with an NGO, the consented telemetry, the moderation, the fallback to a human — is the work that turns Pfula into something that actually opens doors.
- MCP as the knowledge-base update seam. The KB currently lives as JSON files in the repo. Exposing those files as a Model Context Protocol server lets the policy team (paralegals, social workers — the people who actually understand whether the SRD threshold has changed) edit the knowledge base without a git commit and redeploy.
Closing
The architectural posture of Pfula — Foundry-native, managed-identity-first, agent-orchestrated, knowledge-base-as-tools — is the posture every small public-interest AI project should start from. Managed identity instead of long-lived keys. Data residency in a region the target user recognises as local. Tools as affordances for decision-making, prompt injection only for generative context. A consolidated Microsoft estate that the MVP and funder surfaces already speak to.
Pfula is small, it is a pilot, and it may or may not become the thing that helps the next generation of South Africans navigate the system that wasn’t designed for them. But the shape is the shape that gives it the best chance of surviving contact with the real world — and, if this post gives another builder a shortcut to that shape, it has done its job.
Pfula was demoed at Data & AI Community Day Durban: AI Unplugged on 14 March 2026. It is built on Azure AI Foundry as part of the MVP 2026 submission under AI Services (Microsoft Foundry).