
Shoto

Full Stack Engineer

What makes the 'Behind the Scenes of the Gemini App' session compelling today?

Posted at 2026/01/16 · 6 min read

Generative AI is no longer judged only by how smart the model is.
What increasingly matters is everything around the model: product design, UX decisions, evaluation methods, guardrails, latency constraints, and the everyday trade-offs made by the people building real products.

That is why the “behind the scenes” of AI products has become so compelling.

I attended a Google event at the Tokyo office, where this session explored the internal thinking behind the Gemini App. It was led by Adachi (GDE, AI/ML), Ota (GDE, AI), and Keith Stevens (Gemini App team, Google).


Slide Generation: the feature that exposes product taste

One of the most striking features in the Gemini App is slide generation.
You type a prompt, select the canvas mode, and within seconds a structured slide deck appears on the right side of the screen. For many users, the experience feels almost magical.

And yet, behind the scenes, the team does not experience it as “magic” at all.

Keith described a constant underlying anxiety during development: even if people say the feature is impactful, is the quality truly high enough? Will users actually rely on this for real work, or is it just an impressive demo? That tension—between perceived impact and internal quality standards—shaped many of the design decisions.

At the heart of this feature lies a simple but demanding idea: slide generation is not a demo of LLM capability, but a reflection of product taste.


Plenty of tools can already “generate slides.” The problem is that many of them look like they were generated by AI. The structure feels generic, the visuals feel mismatched, and the overall result lacks conviction. For the Gemini App team, that bar was unacceptable. Internally, Google already has a strong culture around slides—the expectation for clarity, design, and narrative is high. “It works” is not enough if it does not look credible. Because of that, slide quality is judged on more than correctness. A good deck needs:

  • clear structure and flow
  • thoughtful use of whitespace
  • consistent tone
  • visual hierarchy that guides the reader’s attention

Interestingly, the biggest challenges were not only conceptual but also technical.

Early on, the team explored generating code that calls external APIs to construct slides programmatically. In practice, this approach struggled to achieve the level of visual and structural control the team wanted, so they moved to HTML-based slide generation instead.
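
The session did not go into implementation detail, so the sketch below is purely my own mental model of what an HTML-first pipeline can look like: ask the model for a self-contained HTML deck and render it directly in the canvas. `llm_generate`, the prompt, and the CSS class name are illustrative placeholders, not the Gemini App's actual code.

```python
# Minimal sketch of HTML-based slide generation (illustrative only).
# `llm_generate` is a placeholder for whichever LLM client you use.

SLIDE_PROMPT = """You are a presentation designer.
Turn the brief below into a slide deck as a single self-contained HTML file.
Rules:
- One <section class="slide"> per slide.
- One headline and at most three supporting points per slide.
- Generous whitespace, consistent tone, clear visual hierarchy.

Brief:
{brief}
"""

def llm_generate(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. a Gemini model)."""
    raise NotImplementedError

def generate_slide_deck(brief: str) -> str:
    """Return an HTML string that the app can render directly as slides."""
    html = llm_generate(SLIDE_PROMPT.format(brief=brief))
    # Rendering HTML directly keeps layout, spacing, and hierarchy
    # under the model's control, rather than driving an external
    # slide API call by call.
    return html
```

The point is less the code than the output format: HTML puts visual structure where the model can actually shape it.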

Evaluation: why it became a “vibe check” (and who decides)

In many areas of machine learning, evaluation is already well-established.
We have benchmarks, metrics, automated graders, and even techniques like “LLM-as-a-judge” to score model outputs at scale. On paper, it sounds like slide generation should be evaluable in the same way.

In practice, it isn’t.

When the output is a slide deck, the dominant question is not “Is this correct?” but simply “Does this feel right?”

That subtle shift changes everything. A slide can be factually accurate and still be unusable—confusing layout, awkward tone, inconsistent visuals, or a narrative that just does not land. No single metric captures that.
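
That is not for lack of tooling. A bare-bones LLM-as-a-judge harness that scores a deck against the criteria listed earlier is easy to sketch; the prompt, rubric wording, and helper names below are mine, not the team's evaluation setup.

```python
# Bare-bones LLM-as-a-judge sketch: ask a model to score a deck on a rubric.
# Everything here is illustrative, not the Gemini team's eval harness.
import json

RUBRIC = [
    "clear structure and flow",
    "thoughtful use of whitespace",
    "consistent tone",
    "visual hierarchy that guides the reader's attention",
]

JUDGE_PROMPT = """Score the slide deck below from 1 to 5 on each criterion:
{criteria}

Respond only with JSON of the form {{"scores": {{"<criterion>": <score>, ...}}}}.

Slide deck (HTML):
{deck}
"""

def llm_generate(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    raise NotImplementedError

def judge_deck(deck_html: str) -> dict:
    prompt = JUDGE_PROMPT.format(
        criteria="\n".join(f"- {c}" for c in RUBRIC),
        deck=deck_html,
    )
    return json.loads(llm_generate(prompt))
```

The scores come back neatly, but they only loosely track whether someone would actually put the deck in front of an audience.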

Because of this, the real judges for slide quality ended up being people, not models.

Keith explained that the most trusted evaluators were product managers and UX designers—people with strong opinions about what good slides look like and how users perceive them. These are individuals who care deeply about spacing, hierarchy, tone, and narrative flow. Their feedback often outweighed any automated signal.

One particularly revealing comment from Keith was his admission that his own opinion on slide quality was not especially reliable. That humility is not a weakness. It highlights something essential about product development: quality is not defined by the builder, but by the people closest to the user experience.

This leads to a deeper insight: evaluation itself must be designed.
It is not enough to build a model and then “measure quality.” You must decide who gets to define quality.

What Keith calls an “agent”: LLM + tools + runtime + orchestrator

Keith began this part of the conversation by stepping back and answering a deceptively simple question: what is an agent, really? That definition alone reframes how you think about systems like the Gemini data analysis agent. An agent is not just a “smart model.” It is a system made of multiple moving parts working together:

  • a large language model (such as Gemini)
  • a set of tools the model can call
  • an execution environment where those tools actually run (for example, a Docker runtime)
  • and an orchestrator that manages the loop: calling the model, selecting tools, running them, and feeding the results back to the model

An agent is not intelligence in isolation. It is intelligence plus execution. Without the runtime and orchestration layer, the model may generate good ideas—but nothing actually happens in the world.
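
In code, that decomposition maps to a surprisingly small loop. The sketch below is my own simplification, with illustrative names and `llm_decide` standing in for a real model call, but it captures the shape Keith described: the model proposes, the tools execute, and the results feed back.

```python
# Sketch of an agent loop: LLM + tools + runtime + orchestrator.
# All names are illustrative, not the Gemini App's internals.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Action:
    tool: str        # which tool the model wants to call ("finish" to stop)
    arguments: dict  # arguments for that tool

def llm_decide(history: List[dict]) -> Action:
    """Placeholder: ask the model for its next action given the history."""
    raise NotImplementedError

def run_agent(task: str, tools: Dict[str, Callable], max_steps: int = 10) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = llm_decide(history)                      # model proposes
        if action.tool == "finish":
            return action.arguments.get("answer", "")
        result = tools[action.tool](**action.arguments)   # runtime executes
        history.append({"role": "tool", "content": str(result)})  # feed back
    return "stopped: step budget exhausted"               # orchestrator guardrail
```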

One of the most persistent problems in agent development is the infinite loop: the model writes code, the code fails, the model tries again, the code fails again… and the system can spiral without ever converging. From the outside, this looks like a technical reliability issue. From the inside, it feels more like a product design problem.

What Keith emphasized instead was a more nuanced approach. Rather than relying only on external rules, the goal is to help the model itself learn how to recover. That recovery behavior comes from a combination of prompt design and improvements to the underlying model: teaching it to read errors, adjust its strategy, and make more informed second attempts.
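
One way to picture the difference is a loop that surfaces the error back to the model so the next attempt is informed, while still capping attempts at the orchestrator level. Again, this is a toy illustration: `llm_write_code` is a placeholder, and a subprocess stands in for a real sandboxed runtime.

```python
# Toy recovery loop: feed the runtime error back to the model so the next
# attempt is informed, and cap attempts so a failing task cannot spiral forever.
import subprocess
import sys
import tempfile
from typing import Optional

def llm_write_code(task: str, last_error: Optional[str]) -> str:
    """Placeholder: ask the model for code, including the previous error if any."""
    raise NotImplementedError

def run_in_sandbox(code: str) -> subprocess.CompletedProcess:
    """Run generated code in a subprocess (a stand-in for a real Docker runtime)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
    return subprocess.run([sys.executable, f.name],
                          capture_output=True, text=True, timeout=30)

def solve_with_recovery(task: str, max_attempts: int = 3) -> str:
    error = None
    for _ in range(max_attempts):
        code = llm_write_code(task, last_error=error)  # model sees the last error
        result = run_in_sandbox(code)
        if result.returncode == 0:
            return result.stdout
        error = result.stderr                          # adjust strategy next time
    return f"gave up after {max_attempts} attempts:\n{error}"
```

The attempt cap is the external rule; the error feedback is the part that lets the model get genuinely better at recovering.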

This is why guardrails are not just a safety mechanism.
They are a form of UX design.

Where do you want recovery to happen?
At the prompt layer, by shaping behavior?
At the model layer, by improving generalization?
Or at the runtime layer, by controlling execution?

The answers to those questions define the experience users have with agents—whether they feel brittle and frustrating, or resilient and trustworthy.