The Challenge: The “Impossible” Triangle
The client approached us with a paralyzed initiative. They sought to deploy a natural language interface allowing Field Engineers and NOC analysts to query massive datasets (700M+ records) using plain English (e.g., “Show sites with high congestion”).
However, the project was trapped in an “impossible triangle” of constraints:
- Capability: They demanded “Agentic” reasoning to handle complex, multi-step troubleshooting.
- Infrastructure: Strict data sovereignty required air-gapped, on-premise deployment on standard CPU servers.
- Performance: Users expected near-instant “Google-like” responsiveness.
Their initial prototype attempted to solve this by throwing a monolithic LLM agent at every query. The result was a system that took up to 10 minutes to process a single request on a CPU, with no quantitative way to measure if the answer was even correct.
Our Approach: Engineering over “Magic”
We diagnosed the failure as a classic case of “Premature Complexity.” The team was attempting to use a Level 3 Agent to solve Level 1 Retrieval problems. We dismantled the monolith and implemented a Dual-Path Architecture.
1. Evaluation-Driven Development (EDD)
Before writing code, we established a Golden Dataset. We discovered that the previous system’s accuracy was anecdotal rather than based on benchmarks.
- We curated a benchmark of Question-SQL pairs.
- We established a strict metric: Execution Match Accuracy. It does not matter if the SQL looks right; it matters if it returns the correct data.
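The metric above can be sketched in a few lines. This is a minimal illustration, not the client implementation: it runs the predicted and gold SQL against the same database and compares result sets order-insensitively, with an in-memory SQLite connection standing in for the production warehouse. All table and function names here are hypothetical.

```python
import sqlite3

def execution_match(conn, predicted_sql, gold_sql):
    """True if both queries return the same result set (order-insensitive)."""
    try:
        pred = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # SQL that fails to execute counts as a miss
    gold = conn.execute(gold_sql).fetchall()
    return sorted(pred) == sorted(gold)

def execution_match_accuracy(conn, benchmark):
    """benchmark: list of (question, predicted_sql, gold_sql) triples."""
    hits = sum(execution_match(conn, p, g) for _, p, g in benchmark)
    return hits / len(benchmark)
```

The point of comparing fetched rows rather than SQL strings is exactly the one made above: two syntactically different queries that return identical data both count as correct, while a plausible-looking query that returns the wrong rows counts as a failure.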
2. The Dual-Path Architecture
We recognized that not all queries require “reasoning.” A query like “What is the RRC success rate for Site X?” is a retrieval task, not a planning task.
- Path 1 (Fast Path): For 80% of queries (Simple Analytics), we bypassed the agent entirely. We used a streamlined Retrieval-Augmented Generation (RAG) pipeline to map questions directly to validated SQL patterns. This reduced latency to <2 seconds.
- Path 2 (Complex Path): Only when the Triage Agent detected multi-step logic (e.g., “Find sites with SINR < 5dB for more than 6 hours”) did we escalate to the “Decompose-Plan-Execute” framework. This path allows for higher latency (Tier 3) but ensures correctness for complex temporal reasoning.
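A triage step like the one described above can be as simple as a deterministic classifier in front of the two paths. The sketch below uses a keyword heuristic purely for illustration; the marker list and function names are assumptions, not the actual Triage Agent logic.

```python
import re

# Hypothetical markers of multi-step or temporal logic. A question that
# matches any of them is escalated; everything else takes the fast path.
COMPLEX_MARKERS = [
    r"\bfor more than\b", r"\bover the last\b", r"\bcompare\b",
    r"\btrend\b", r"\bconsecutive\b",
]

def triage(question: str) -> str:
    q = question.lower()
    if any(re.search(p, q) for p in COMPLEX_MARKERS):
        return "complex"  # escalate to Decompose-Plan-Execute
    return "fast"         # RAG lookup against validated SQL patterns
```

The design choice worth noting is that the router itself is cheap and deterministic: it must not add meaningful latency to the 80% of queries it is meant to accelerate.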
3. Intermediate Representation
The client’s previous “black box” model generated SQL directly, so when it failed, engineers had no way to tell whether the error lay in table selection, column matching, or filtering. We implemented an Intermediate Representation (IR) pipeline: instead of generating SQL directly, the system first generates a JSON “Plan”.
- Schema Pruning: A dedicated ML model reduces the search space from hundreds of tables to the specific few relevant to the query.
- Deterministic Validation: By validating the JSON plan before SQL generation, we eliminated hallucinations like `SELECT *` or invented columns.
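To make the validation step concrete, here is a minimal sketch of checking a JSON plan against a pruned schema before any SQL is emitted. The schema, field names, and plan shape are all illustrative assumptions, not the client's actual IR format.

```python
# Hypothetical pruned schema: table -> allowed columns.
SCHEMA = {
    "cell_kpi": {"site_id", "sinr_db", "timestamp"},
}

def validate_plan(plan: dict) -> list[str]:
    """Return a list of errors; an empty list means the plan may proceed
    to SQL generation. Rejects unknown tables, invented columns, and
    empty projections (which would otherwise become SELECT *)."""
    errors = []
    table = plan.get("table")
    if table not in SCHEMA:
        return [f"unknown table: {table!r}"]
    allowed = SCHEMA[table]
    if not plan.get("columns"):
        errors.append("empty column list (would become SELECT *)")
    for col in plan.get("columns", []):
        if col not in allowed:
            errors.append(f"invented column: {col!r}")
    for f in plan.get("filters", []):
        if f.get("column") not in allowed:
            errors.append(f"filter on unknown column: {f.get('column')!r}")
    return errors
```

Because the check is plain set membership rather than another model call, it is deterministic: a hallucinated column is caught every time, before it can reach the database.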
Business Impact
By treating Intelligence as a Cost rather than a Virtue, we delivered a system that was viable for production deployment.
- Operational Viability: Latency dropped from ~10 minutes to an average of 6-8 seconds, falling within the acceptable “Tier 2/3” window for complex analytics.
- Trust: The shift to an evaluation-based workflow raised accuracy on the Golden Set to over 80%, recovering the previously failing project.
- Cost Efficiency: By optimizing for CPU inference with quantized models, we kept the system on the client's existing standard servers, avoiding GPU procurement and delivering substantial cost and time savings.
Technology Stack
- Orchestration: Custom
- Text-to-SQL: Custom Implementation
- Vector Store: ChromaDB (Local/On-Prem)
- Inference: vLLM
- Caching: Redis (L1 Query Cache & L2 KV Cache)
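One detail of the L1 query cache worth illustrating is key normalization: trivially rephrased questions should hit the same entry. The sketch below uses an in-memory dict as a stand-in for Redis so it runs self-contained; in production the same `get`/`set` pair would map onto Redis GET/SETEX with a TTL. Class and method names are illustrative.

```python
import hashlib

class QueryCache:
    """L1 query cache sketch (dict standing in for Redis)."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def key(question: str) -> str:
        # Normalize casing and whitespace so trivial rephrasings
        # of the same question share a cache key.
        norm = " ".join(question.lower().split())
        return hashlib.sha256(norm.encode()).hexdigest()

    def get(self, question):
        return self._store.get(self.key(question))

    def set(self, question, result):
        self._store[self.key(question)] = result
```

Caching at the question level short-circuits the entire pipeline (triage, retrieval, generation, execution) for repeated queries, which matters most on the latency-sensitive fast path.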