Build reliable agentic AI solutions with Amazon Bedrock: Learn from Pushpay's journey in GenAI evaluation
- Pushpay achieves 95% accuracy for its agentic AI search using a custom Amazon Bedrock evaluation framework.
- The new AI search feature reduces time-to-insight from 120 seconds to under 4 seconds for ministry leaders.
- Strategic domain-level metrics and dynamic prompt construction enable targeted performance optimization and feature rollouts.
Pushpay, a digital engagement platform for faith-based organizations, has successfully transitioned its generative AI search tool from prototype to production-ready solution by leveraging Amazon Bedrock. Initially, the development team hit a performance plateau: their agentic AI (a system capable of reasoning through steps and executing tasks) achieved only 60-70% accuracy.

To overcome this, they developed a sophisticated evaluation framework built on a "golden dataset" of over 300 validated query-response pairs. By comparing the agent's output against this benchmark using another language model as a reviewer (an LLM-as-a-judge approach), Pushpay gained the granular, domain-level insights needed to refine specific problem areas.

A standout technical improvement involved moving away from static instructions. Instead, the team implemented dynamic prompt construction: the system builds customized instructions on the fly based on user context. Combined with prompt caching to reduce costs and speed up response times, this allowed the tool to deliver actionable insights 15 times faster than manual navigation.

"Strategic suppression" also played a vital role. By identifying which categories the agent struggled with, the team could temporarily disable those specific functions until they met a 95% accuracy threshold. This data-driven approach ensures that users interact only with high-performing features while maintaining long-term trust in the system's reliability.
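The evaluation-and-gating loop described above can be sketched roughly as follows. This is a minimal illustration, not Pushpay's actual implementation: the judge would in practice be another Bedrock-hosted LLM call, and the field names (`domain`, `query`, `expected`) are assumptions made for the example.

```python
from collections import defaultdict

# Features whose domain accuracy falls below this are "strategically suppressed".
ACCURACY_THRESHOLD = 0.95

def judge(expected: str, actual: str) -> bool:
    """Stand-in for an LLM-as-a-judge call: returns True when the agent's
    answer matches the validated golden response. A real judge would send
    both texts to a reviewer model and parse its verdict."""
    return expected.strip().lower() == actual.strip().lower()

def evaluate(golden_dataset, agent):
    """Score the agent per domain against the golden dataset of
    validated query-response pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for item in golden_dataset:
        totals[item["domain"]] += 1
        if judge(item["expected"], agent(item["query"])):
            hits[item["domain"]] += 1
    return {domain: hits[domain] / totals[domain] for domain in totals}

def enabled_domains(scores):
    """Strategic suppression: expose only domains meeting the threshold."""
    return {domain for domain, acc in scores.items() if acc >= ACCURACY_THRESHOLD}
```

For example, if the "giving" domain scores 1.0 and "attendance" scores 0.5, only "giving" would be rolled out to users until the attendance queries are improved.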
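The dynamic prompt construction idea, replacing one static system prompt with instructions assembled from user context, might look like the sketch below. The role and domain names are hypothetical; the key design point, keeping the stable prefix first so a provider-side prompt cache (such as Bedrock prompt caching) can reuse it across requests, reflects the cost and latency benefit mentioned above.

```python
# Stable instructions shared by every request; placing them first keeps the
# cacheable prefix identical across users.
BASE_INSTRUCTIONS = "You are a search assistant for ministry leaders."

# Illustrative per-domain guidance; the real system would derive this from
# the feature set currently enabled for the user.
DOMAIN_GUIDANCE = {
    "giving": "Summarize donation trends with exact dollar figures.",
    "attendance": "Report attendance counts by service and date range.",
}

def build_prompt(user_context: dict) -> str:
    """Assemble a system prompt tailored to the caller's role and the
    domains enabled for them, rather than shipping one static prompt."""
    parts = [BASE_INSTRUCTIONS, f"The user's role is {user_context['role']}."]
    for domain in user_context.get("enabled_domains", []):
        if domain in DOMAIN_GUIDANCE:
            parts.append(DOMAIN_GUIDANCE[domain])
    return "\n".join(parts)
```

Because suppressed domains are simply omitted from `enabled_domains`, the prompt never instructs the model about features the user cannot access, which also keeps the prompt shorter.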