PFN Builds Secure Sandbox for AI Code Evaluation
- Preferred Networks developed a specialized sandbox environment to safely execute and test AI-generated code.
- The system leverages AWS Lambda to provide a low-cost, isolated environment with completely blocked external communication.
- Data compression techniques were implemented to bypass Lambda's payload limits, enabling large-scale, complex evaluations.
Preferred Networks (PFN), a leader in AI development, has created a dedicated sandbox environment to safely test programs generated by its large language model, PLaMo. Standard benchmarks such as HumanEval and LiveCodeBench require executing AI-generated code alongside test scripts to verify correctness. Running this code unprotected, however, is inherently risky: a model can occasionally output commands that damage systems or attempt unauthorized network access. Executing these programs in an isolated sandbox is therefore essential to protect the surrounding infrastructure.
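The verification loop these benchmarks require can be sketched roughly as follows. This is a simplified illustration, not PFN's actual harness: the function name, the candidate solution, and the test script are hypothetical, and a bare subprocess with a timeout only limits runaway execution rather than providing real isolation.

```python
import os
import subprocess
import sys
import tempfile

def run_candidate(candidate_code: str, test_script: str, timeout_s: float = 5.0) -> bool:
    """Run AI-generated code together with its test script in a child
    process. Note: a subprocess timeout only bounds execution time;
    genuine isolation needs an OS- or service-level sandbox."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n" + test_script)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout_s,
        )
        return result.returncode == 0  # all asserts passed iff exit code 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

# Hypothetical HumanEval-style task: a candidate solution plus asserts.
candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
```

Because the candidate code runs with the full privileges of the child process, this loop is exactly what motivates moving execution into a locked-down environment.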
PFN built this isolated environment using AWS Lambda, a serverless computing service. While the company previously relied on Kubernetes for container management, it required a more versatile solution that could be used consistently across various cloud services and supercomputers. The new system accepts only execution requests authenticated via IAM, and outbound communication from Lambda is completely blocked through its VPC configuration. This setup ensures that even if dangerous code is generated, it has no impact on internal systems or the public internet.
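The execution path inside the sandbox might take a shape like the handler below. This is a hypothetical sketch, not PFN's published code: it assumes the request payload carries the source code to run, and relies on the VPC configuration described above (not shown here) to block all outbound traffic.

```python
import json
import subprocess
import sys
import tempfile

def handler(event, context=None):
    """Hypothetical Lambda entry point: receives source code in the
    request payload, executes it, and returns stdout plus the exit
    code. With outbound traffic blocked at the VPC level, even
    malicious code cannot reach internal systems or the internet."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(event["code"])
        path = f.name
    result = subprocess.run(
        [sys.executable, path],
        capture_output=True,
        text=True,
        timeout=30,  # bound runaway programs within Lambda's own limit
    )
    return {
        "statusCode": 200,
        "body": json.dumps({
            "stdout": result.stdout,
            "exit_code": result.returncode,
        }),
    }
```

IAM authentication is enforced before the handler is ever invoked, so the function body itself only needs to worry about running the code and reporting the result.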
The system is also highly cost-effective, utilizing the AWS free tier to process tens of thousands of benchmark tests monthly at minimal cost. To overcome Lambda's 6 MB request payload limit, PFN compresses data before sending it and decompresses it inside the sandbox. This enables continuous, accurate evaluation of data-heavy tasks, such as complex competitive programming problems. Building such a robust evaluation framework is a critical step toward the safe deployment of AI in society, and proves just as important as training the model itself.
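One plausible realization of this workaround, using only gzip and base64 (the article does not specify PFN's exact encoding, so treat this as an illustrative assumption): the client packs the payload before invoking the function, and the sandbox unpacks it on arrival.

```python
import base64
import gzip
import json

PAYLOAD_LIMIT = 6 * 1024 * 1024  # Lambda's synchronous request limit (~6 MB)

def pack(payload: dict) -> str:
    """Client side: gzip the JSON payload and base64-encode it so the
    request body fits within Lambda's size limit."""
    raw = json.dumps(payload).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")

def unpack(packed: str) -> dict:
    """Sandbox side: reverse the encoding to recover the payload."""
    return json.loads(gzip.decompress(base64.b64decode(packed)))

# Large, repetitive test inputs (common in competitive-programming
# cases) compress well, shrinking far below the 6 MB ceiling.
big_tests = {"stdin": "1 2 3\n" * 2_000_000}  # ~12 MB uncompressed
packed = pack(big_tests)
```

Highly structured benchmark data tends to compress by an order of magnitude or more, which is why this simple scheme is enough to push large competitive-programming test sets through the payload limit.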