Sandboxed Code Execution for Kids: How Judge0 and Python sys.settrace Power FireFly

Thu, 15 Jan 2026 09:00:00 +0000

When you build a platform for kids to learn to code, like FireFly, the hard problem is safety.

Letting a developer run arbitrary code in a container is one thing. Letting a 7-year-old, who might accidentally or intentionally write an infinite loop or a memory-hogging script, run code on your servers is another.

For FireFly, I needed a solution that was:

Secure: No “breakouts” to the host machine.
Fast: Near-instant execution so the learning flow isn’t broken.
Traceable: I needed to know exactly which line was running at any moment so the AI tutor could give grounded feedback.

The solution ended up being a combination of Judge0 and Python’s sys.settrace().

Layer 1: The Hard Sandbox (Judge0)

The first line of defense is Judge0, an open-source online code execution system. I run Judge0 in a set of Docker containers. When a student in FireFly clicks “Run,” their code is sent to the Judge0 API, which:

Creates a temporary, isolated worker.
Enforces strict CPU and memory limits.
Limits the execution time to a few seconds.
Returns the output (or the error).
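
The request itself is small. Here is a minimal sketch of what a submission could look like, not FireFly's actual code. The base URL, the specific limit values, and the helper names are assumptions for illustration. Language ID 71 is Python 3 on a stock Judge0 CE install, but it is worth confirming against your own instance.

```python
import json
import urllib.request

# Assumptions: Judge0 is reachable at this URL, and language_id 71 maps to
# Python 3 (check GET /languages on your own instance).
JUDGE0_URL = "http://localhost:2358"
PYTHON_LANGUAGE_ID = 71


def build_submission(source_code: str, stdin: str = "") -> dict:
    """Build the JSON body for a Judge0 submission with tight limits."""
    return {
        "source_code": source_code,
        "language_id": PYTHON_LANGUAGE_ID,
        "stdin": stdin,
        "cpu_time_limit": 2,     # seconds of CPU time (illustrative value)
        "wall_time_limit": 5,    # seconds of wall-clock time (illustrative)
        "memory_limit": 64000,   # kilobytes (illustrative)
        "enable_network": False,
    }


def run_on_judge0(source_code: str, base_url: str = JUDGE0_URL) -> dict:
    """Submit code and wait for the result in a single request."""
    request = urllib.request.Request(
        f"{base_url}/submissions?base64_encoded=false&wait=true",
        data=json.dumps(build_submission(source_code)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=15) as response:
        result = json.load(response)
    return {
        "status": result["status"]["description"],
        "stdout": result.get("stdout"),
        "stderr": result.get("stderr"),
    }
```

Using wait=true keeps the sketch short. A production setup would more likely submit, then poll for the result by token.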

This handles the outer safety boundary. Even if a student tries to import os; os.system('rm -rf /'), Judge0 either blocks the call or confines the damage to a disposable container.

Layer 2: The Soft Sandbox (Python sys.settrace)

Judge0 keeps the system safe, but it does not tell me why a student got stuck. To power a Socratic AI Tutor, the system needs to see the internal state of execution: which variables change, and which lines get hit.

To do that, I wrap the student’s Python code in a tracer script that uses sys.settrace(). This built-in hook registers a trace function that the interpreter calls on events such as function calls and each executed line.

How the Tracer Works:

Line-by-line tracking: As the code runs, the tracer records the current line number and the values of local variables.
Step limit: If the code executes too many lines, as in an infinite loop, the tracer raises a custom exception and stops execution before Judge0’s time limit has to step in.
State snapshot: At the end of the run, the tracer returns a breadcrumb trail of the execution.
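
The three behaviors above can be sketched in a few dozen lines. This is a simplified illustration, not FireFly's production tracer. The names trace_student_code and StepLimitExceeded, the "<student>" filename, and the default of 1000 steps are all assumptions.

```python
import sys


class StepLimitExceeded(Exception):
    """Raised when student code executes more traced lines than allowed."""


def trace_student_code(source, max_steps=1000):
    trail = []   # breadcrumb trail: one entry per executed line
    steps = 0

    def local_tracer(frame, event, arg):
        nonlocal steps
        if event == "line":
            steps += 1
            if steps > max_steps:
                raise StepLimitExceeded(f"stopped after {max_steps} steps")
            # Store repr() strings so later mutations cannot rewrite history.
            snapshot = {
                name: repr(value)
                for name, value in frame.f_locals.items()
                if not name.startswith("__")
            }
            trail.append({"line": frame.f_lineno, "locals": snapshot})
        return local_tracer

    def global_tracer(frame, event, arg):
        # Only trace frames that were compiled from the student's source.
        if frame.f_code.co_filename == "<student>":
            return local_tracer
        return None

    code = compile(source, "<student>", "exec")
    error = None
    sys.settrace(global_tracer)
    try:
        exec(code, {"__name__": "__main__"})
    except StepLimitExceeded as exc:
        error = str(exc)
    except Exception as exc:
        error = f"{type(exc).__name__}: {exc}"
    finally:
        sys.settrace(None)
    return {"trail": trail, "error": error}


if __name__ == "__main__":
    result = trace_student_code("i = 0\nwhile True:\n    i += 1\n", max_steps=20)
    print(result["error"], len(result["trail"]))
```

One caveat with this approach: student code that wraps its loop in a broad try/except can swallow the step-limit exception, which is one more reason the Judge0 limits stay in place as the outer boundary.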

Layer 3: The AI Tutor Feedback Loop

The “breadcrumb” from the tracer is what makes the FireFly AI Tutor so effective. Instead of just seeing “Error: NameError: name ‘x’ is not defined,” the AI can see: “The student defined x inside a function on line 2, but they are trying to use it on line 5, outside that function, where it’s not in scope.”

This level of detail allows the AI to ask much better Socratic questions.
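
To give a sense of how a trail might reach the model, here is a small sketch that turns the last few breadcrumbs into plain text for a prompt. The breadcrumb format (a dict with "line" and "locals" keys) and the function name summarize_trail are assumptions for illustration.

```python
def summarize_trail(trail, error, last_n=5):
    """Turn the tail of an execution trail into plain text for a prompt."""
    lines = []
    for step in trail[-last_n:]:
        variables = ", ".join(
            f"{name}={value}" for name, value in step["locals"].items()
        )
        lines.append(f"line {step['line']}: {variables or 'no variables yet'}")
    if error:
        lines.append(f"error: {error}")
    return "\n".join(lines)


if __name__ == "__main__":
    example_trail = [
        {"line": 1, "locals": {}},
        {"line": 2, "locals": {"x": "1"}},
    ]
    print(summarize_trail(example_trail, "NameError: name 'y' is not defined"))
```

Keeping only the last few steps is a deliberate choice in this sketch: the lines just before the failure usually matter most, and a short context leaves more room for the conversation itself.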

Why This Matters for EdTech

We often think of “sandboxing” as a security feature for protecting servers. In EdTech, it is also a pedagogical feature. A safe, observable environment gives kids room to experiment, break things, and learn from their mistakes without real-world consequences.

Building this part of FireFly has been one of the most satisfying engineering problems in the project. It sits right where security requirements and teaching goals meet.

Related reading:

How I Wired Up an AI Tutor to Teach Like a Socratic Mentor — Not a Cheater

Tue, 05 Aug 2025 09:00:00 +0000

The problem with most “AI tutors” today is simple: they are too helpful. If a student asks “How do I solve for x?”, a generic LLM will often just show the steps and give the answer. In an educational context, that isn’t teaching; it is a shortcut to a finished worksheet with zero retention.

When I started building FireFly, my goal was the opposite. I wanted an AI that would act like a Socratic Mentor. It should never give the answer directly. It should only ask the next right question to help the student find the answer themselves.

That turned out to be a surprisingly hard technical challenge.

The “Helpful Assistant” Bias

Large Language Models (LLMs) are trained to be helpful assistants. Their default behavior is to minimize the “effort” for the user. In education, you want the opposite: keep the student’s cognitive effort as high as they can manage with guidance (the Zone of Proximal Development).

To break the LLM’s habit of just giving the answer, I had to move beyond simple system prompts and build a more structured interaction loop.

Three Layers of a Socratic AI

In FireFly, the “AI Tutor” isn’t just one prompt. It is three distinct layers working together.

1. The Knowledge Layer (BKT)

Before the AI says a word, the system checks the student’s current mastery using Bayesian Knowledge Tracing (BKT). If BKT estimates a 90% probability that the student has mastered loops but only 10% for nested loops, the system passes that context to the LLM.

The prompt becomes: “The student understands X, but is struggling with Y. Ask a question that bridges the gap.”
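
For readers who have not met BKT before, the core update is short. The sketch below is the standard textbook form, with slip, guess, and transit probabilities picked purely for illustration. It is not FireFly's tuned model, and the function name bkt_update is an assumption.

```python
def bkt_update(p_mastery, correct, p_slip=0.1, p_guess=0.2, p_transit=0.1):
    """Return the updated mastery probability after one observed answer.

    p_slip:    chance a student who knows the skill still answers wrong
    p_guess:   chance a student who does not know it answers right
    p_transit: chance the student learns the skill during this attempt
    """
    if correct:
        evidence = p_mastery * (1 - p_slip)
        posterior = evidence / (evidence + (1 - p_mastery) * p_guess)
    else:
        evidence = p_mastery * p_slip
        posterior = evidence / (evidence + (1 - p_mastery) * (1 - p_guess))
    # Account for learning that may have happened on this attempt.
    return posterior + (1 - posterior) * p_transit


if __name__ == "__main__":
    p = 0.5
    for answer in [True, True, False, True]:
        p = bkt_update(p, answer)
        print(f"correct={answer} -> mastery={p:.3f}")
```

Each skill (loops, nested loops, and so on) would carry its own probability, updated after every attempt that exercises it.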

2. The Socratic Constraint

The core prompt for the FireFly tutor is built around four strict constraints, two prohibitions and two requirements:

NEVER provide the full solution.
NEVER point out the exact line of the error.
ALWAYS ask a leading question.
ALWAYS validate the student’s process, not just their output.

If the student is stuck on a syntax error, the AI might say: “I see you’re trying to repeat a block of code. Have you looked at where your curly braces are starting and ending?”

3. The Age-Adapted Tone

Teaching a 7-year-old is different from teaching a 15-year-old. FireFly uses different “personas” depending on the user’s profile. For younger kids, the tone is more encouraging and uses metaphors (like “the computer is a very literal robot”). For older students, the tone is more technical and precise.
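
Put together, the three layers end up in a single system prompt. Here is a rough sketch of that assembly. The age cutoff of 10, the mastery thresholds, and the function names are assumptions for illustration, and the real prompt wording will differ.

```python
SOCRATIC_RULES = [
    "NEVER provide the full solution.",
    "NEVER point out the exact line of the error.",
    "ALWAYS ask a leading question.",
    "ALWAYS validate the student's process, not just their output.",
]


def pick_persona(age):
    # The cutoff of 10 is a hypothetical value, not FireFly's actual rule.
    if age < 10:
        return ("Be warm and encouraging. Use simple metaphors, such as "
                "the computer being a very literal robot.")
    return "Be technical and precise."


def build_system_prompt(age, mastery):
    """Combine persona, Socratic rules, and mastery context into one prompt.

    mastery maps a skill name to a probability between 0 and 1.
    """
    strong = [skill for skill, p in mastery.items() if p >= 0.8]
    weak = [skill for skill, p in mastery.items() if p < 0.5]
    parts = [
        "You are a Socratic coding tutor.",
        pick_persona(age),
        *SOCRATIC_RULES,
        f"The student understands: {', '.join(strong) or 'nothing confirmed yet'}.",
        f"The student is struggling with: {', '.join(weak) or 'nothing specific'}.",
        "Ask a question that bridges the gap.",
    ]
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_system_prompt(7, {"loops": 0.9, "nested loops": 0.1}))
```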

Dealing with the “Just Tell Me” Frustration

One of the biggest challenges in Socratic teaching is student frustration. When an AI keeps asking questions instead of giving answers, some students will simply repeat “Just tell me the answer.”

To handle this, FireFly has a “Frustration Fuse.” If the student asks for the answer three times in a row, the AI is allowed to provide a hint that is slightly more direct, or it can offer to “reset” the problem to an easier version. This keeps the student engaged without breaking the pedagogical goal.
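
The fuse itself is just a counter over consecutive requests. A minimal sketch, assuming some upstream check has already decided whether a message is a plea for the answer (the class and method names are hypothetical):

```python
class FrustrationFuse:
    """Trips after a run of consecutive 'just tell me' requests."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_requests = 0

    def register(self, asked_for_answer):
        """Record one student message; return True when the fuse trips."""
        if not asked_for_answer:
            # Any other message breaks the streak.
            self.consecutive_requests = 0
            return False
        self.consecutive_requests += 1
        if self.consecutive_requests >= self.threshold:
            self.consecutive_requests = 0  # start fresh after tripping
            return True
        return False


if __name__ == "__main__":
    fuse = FrustrationFuse()
    for message_is_plea in [True, True, False, True, True, True]:
        if fuse.register(message_is_plea):
            print("Fuse tripped: offer a more direct hint or an easier problem.")
```

When register returns True, the tutor would switch to the more direct hint or offer the easier version of the problem.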

Why This Matters

We are entering an era where AI-powered personalized learning will be the norm. But if we just build “answer machines,” we are doing a disservice to the next generation of learners.

Building FireFly taught me that the most powerful use of AI in education isn’t knowing everything. It’s being a patient, persistent, and occasionally annoying mentor who refuses to let the student take the easy way out.