Why We Switched to Gemma 4 — And What Our Tests Actually Show

CommunityOne Spark AI

Our Spark AI chatbot answers questions across thousands of Discord servers and website widgets, around the clock. For many communities, it is the only "staff member" available at 3am. That context shapes every infrastructure decision we make — reliability is not optional.

Over the past year, we have run Spark on three different model families — OpenAI, then Gemini, and now Gemma 4 — each time with a concrete reason backed by data. This post covers the full arc: what drove each switch, how we evaluate model candidates before touching production, and what our internal test results actually show.

TL;DR

  • Gemma 4 31B scores a 100% pass rate on our internal test suite, versus ~96% for Gemini 2.5 Flash.
  • Gemma 4 26B A4B hits a 98% pass rate at nearly the same speed as Gemini, the best accuracy-per-compute tradeoff.
  • Running through OpenRouter means no single API is a point of failure. If one provider goes down, requests automatically route to the next.
  • Nothing has changed for your community's data. We do not train on it, and you can delete it anytime from your dashboard.

Where We Started

Spark launched on OpenAI. At the time, it was the obvious choice — GPT-3.5 had the broadest capability, the best tooling ecosystem, and strong enough instruction-following to make a Discord bot feel coherent.

What pushed us off OpenAI and onto Gemini was a pair of concrete problems we kept hitting in production.

First, response styling. Discord conversations have a distinct rhythm: shorter messages, a more casual tone, and heavy use of formatting like bullet points and bold. Getting GPT-4 to reliably match the styling expected by community members required increasingly complex prompt engineering that didn't hold across all server configurations. Gemini handled this more naturally out of the box.

Second, cost at context length. Discord bots carry chat history across a conversation window. As that window grew, OpenAI's per-token pricing made deep conversations expensive to run at scale. Gemini gave us a better cost profile for the same context depth.

So we moved to Gemini 2.5 Flash — and it was a genuine improvement. Better styling, better context management, lower cost per conversation.

The problem was not quality. It was single-provider dependency.

When Google's Gemini API has an outage — even a partial one — every bot we power goes dark at the same moment. No fallback, no graceful degradation. For a community that relies on Spark to handle onboarding questions, moderation context, or customer support, a 15-minute outage at peak hours is a real failure.

That is what sent us looking for Gemma 4.

Why Gemma 4

Gemma 4 is Google DeepMind's open-weight model family, built directly from the same research and technology behind Gemini 3. Critically for us: it is open-weight, which means any inference provider can serve it.

A few things made it stand out for our use case:

Same Google AI research lineage. Gemma 4 is not a separate research direction — it is built from Gemini 3 technology. That meant we were not switching families, we were moving to an open distribution of equivalent-generation research.

Genuine multilingual capability. Our servers span gaming, crypto, and creator communities across many regions. Gemma 4's 140-language support with cultural context understanding (not just translation) matters in practice.

Native function calling. Spark's agent uses tool calls for knowledge retrieval. Gemma 4 supports agentic workflows with function calling natively — we did not need to change our tooling architecture.

Multi-provider via OpenRouter. By routing through OpenRouter and selecting multiple backend providers, Spark now goes dark only if every selected provider is down at the same time, which is a dramatically smaller risk than any single API dependency.
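To make that risk reduction concrete, here is a minimal sketch of the arithmetic. It assumes provider outages are independent, which real outages often are not, and the availability figures are hypothetical placeholders rather than measured numbers for any provider.

```python
from math import prod


def combined_availability(provider_availabilities):
    """Probability that at least one provider is up.

    Assumes outages are independent, so the chance that every
    provider is down at once is the product of the individual
    downtime probabilities.
    """
    all_down = prod(1.0 - a for a in provider_availabilities)
    return 1.0 - all_down


# Hypothetical availability figures, for illustration only.
single = combined_availability([0.999])
triple = combined_availability([0.999, 0.995, 0.99])

print(f"one provider:    {single:.6f}")
print(f"three providers: {triple:.9f}")
```

Correlated failures (for example, several providers sharing one cloud region) would make the real figure lower than this estimate, so it is best read as an upper bound.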

Our Evaluation Setup

The evaluation framework is built on deepeval, an open-source LLM evaluation library. We run five test suites, each targeting a specific capability:

1. rag_retrieval_full: 100 synthetic Q&A pairs generated from real knowledge base chunks using a Gemini 3 Flash synthesizer. Distributed across four difficulty tiers: 30 single-turn questions, 30 conversational multi-turn, 20 two-node (answer spans two documents), and 20 three-node. The metric is whether the correct source document appears in the top search results. This tests the retrieval pipeline end-to-end, independent of how good the LLM's prose is.

2. knowledge_gap_detection: Given a user query and available context, does the model correctly identify when it does not have enough information to answer confidently? False negatives, where the model answers confidently when it should not, are what we care most about eliminating. We measure both precision and recall separately.

3. safe_for_kids: Structured output task checking that the model correctly flags content that is inappropriate for family-friendly servers. This is particularly important for the gaming communities we serve.

4. rule_evaluation: Each server can configure custom AI rules, which are behavioral guardrails the bot is expected to follow (for example, "never mention competitor products" or "always respond in Spanish"). This test suite verifies the model correctly identifies when a rule applies and what action to take.

5. bot_info_tool: Tests whether the model correctly calls the right tool to retrieve server-specific information when asked. Tool use accuracy is a separate concern from general reasoning quality.
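The retrieval metric in the first suite can be sketched as a top-k hit rate. This is an illustration only: the cutoff `k`, the document IDs, and the case layout are assumptions, since the suite description says "top search results" without giving a number, and the sketch handles a single expected document per question rather than the two-node and three-node cases.

```python
def hit_at_k(retrieved_ids, expected_id, k=5):
    """True if the expected source document is among the first k results."""
    return expected_id in retrieved_ids[:k]


def hit_rate(cases, k=5):
    """Fraction of cases whose expected document appears in the top k."""
    hits = sum(hit_at_k(c["retrieved"], c["expected"], k) for c in cases)
    return hits / len(cases)


# Hypothetical cases: each has the expected source and the ranked search results.
cases = [
    {"expected": "doc-7", "retrieved": ["doc-7", "doc-2", "doc-9"]},
    {"expected": "doc-3", "retrieved": ["doc-1", "doc-3", "doc-4"]},
    {"expected": "doc-5", "retrieved": ["doc-8", "doc-6", "doc-5"]},
]

rate_top2 = hit_rate(cases, k=2)
rate_top3 = hit_rate(cases, k=3)
print(rate_top2, rate_top3)
```

Scoring on document IDs rather than on the generated answer is what keeps this metric independent of the model's prose.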

The synthetic golden sets are cached with a fingerprint of the knowledge base. If the knowledge base changes, the cache is automatically invalidated and new goldens are generated — meaning we are always testing against the current knowledge state, not a stale snapshot.
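One way such a fingerprinted cache could work is sketched below. The function names, the JSON cache layout, and the choice of SHA-256 are assumptions for illustration, not a description of the actual implementation, and `fake_generate` stands in for the real synthesizer.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def fingerprint(chunks):
    """Order-insensitive SHA-256 digest of the knowledge base chunks."""
    h = hashlib.sha256()
    for chunk in sorted(chunks):
        h.update(chunk.encode("utf-8"))
        h.update(b"\x00")  # separator so adjacent chunks cannot blur together
    return h.hexdigest()


def load_or_generate_goldens(chunks, cache_path, generate):
    """Reuse cached goldens only if the knowledge base fingerprint matches."""
    fp = fingerprint(chunks)
    path = Path(cache_path)
    if path.exists():
        cached = json.loads(path.read_text(encoding="utf-8"))
        if cached.get("fingerprint") == fp:
            return cached["goldens"]
    goldens = generate(chunks)  # knowledge base changed, or no cache yet
    path.write_text(
        json.dumps({"fingerprint": fp, "goldens": goldens}), encoding="utf-8"
    )
    return goldens


# Demo with a stand-in generator that records how often it is called.
calls = []


def fake_generate(chunks):
    calls.append(len(chunks))
    return [{"question": f"About: {c}"} for c in chunks]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "goldens.json"
    kb = ["Quests reward members.", "Hype Engine boosts events."]
    first = load_or_generate_goldens(kb, cache, fake_generate)
    second = load_or_generate_goldens(kb, cache, fake_generate)  # cache hit
    third = load_or_generate_goldens(kb + ["New doc."], cache, fake_generate)  # invalidated
```

In this demo the generator runs once for the original knowledge base and once more after a document is added; the second call is served from the cache.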

The Results

Pass Rate by Model

Figure: Average pass rate by model. Gemma 4 31B 100%, Gemma 4 26B 98%, Gemini 2.5 Flash 96.2% (CommunityOne internal eval).

Gemma 4 31B achieves a 100% average pass rate across all test suites. Gemma 4 26B A4B (the efficient architecture variant) comes in at 98%. Our previous production model, Gemini 2.5 Flash, averaged ~96%.

That 4-point gap is not noise. A 4% failure rate across thousands of daily interactions is a real volume of wrong answers, missed tool calls, and guardrail bypasses.

Test Suite Duration by Model

Figure: Average test suite duration by model. Gemma 4 31B 94.8s, Gemma 4 26B 57.4s, Gemini 2.5 Flash 53.2s (CommunityOne benchmark).

Gemma 4 31B is the slowest to run through our full test suite (~95 seconds), which reflects its larger parameter count. Gemma 4 26B A4B runs in ~57 seconds, only 4 seconds slower than Gemini 2.5 Flash (~53 seconds).

These are test-suite durations, not per-response latencies. In production, Gemma 4 26B A4B's inference time through OpenRouter providers is competitive with the direct Gemini API call we ran before.

The practical takeaway: Gemma 4 26B A4B gives you a 98% pass rate at essentially the same operating cost as Gemini, while Gemma 4 31B trades speed for the last two percentage points at the top end. Both are now live: we route premium tiers to 31B and use 26B A4B for higher-volume workloads where the speed difference matters.

Per-Test Breakdown — Where Each Model Stands

The charts above show aggregate pass rates, but the per-case detail tells a more specific story.

Figure: Gemini 2.5 Flash pass rate by test. rule_evaluation 90%, bot_info_tool 90.9%, knowledge_gap_detection 100%, rag_retrieval 100%, safe_for_kids 100%.
Figure: Gemma 4 26B A4B pass rate by test. rule_evaluation 90%, all other tests 100%.
Figure: Gemma 4 31B pass rate by test. 100% across all 5 test categories.
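The charted aggregates are consistent with an unweighted mean over the five suites. The short check below assumes that is how the average is computed (an inference from the figures, not something stated outright) and uses only the per-suite percentages given in the chart descriptions.

```python
# Per-suite pass rates (percent), as reported in the per-test charts.
suite_pass_rates = {
    "Gemini 2.5 Flash": {
        "rule_evaluation": 90.0,
        "bot_info_tool": 90.9,
        "knowledge_gap_detection": 100.0,
        "rag_retrieval": 100.0,
        "safe_for_kids": 100.0,
    },
    "Gemma 4 26B A4B": {
        "rule_evaluation": 90.0,
        "bot_info_tool": 100.0,
        "knowledge_gap_detection": 100.0,
        "rag_retrieval": 100.0,
        "safe_for_kids": 100.0,
    },
    "Gemma 4 31B": {
        "rule_evaluation": 100.0,
        "bot_info_tool": 100.0,
        "knowledge_gap_detection": 100.0,
        "rag_retrieval": 100.0,
        "safe_for_kids": 100.0,
    },
}


def mean_pass_rate(rates):
    """Unweighted mean across suites: each suite counts equally."""
    return sum(rates.values()) / len(rates)


averages = {
    model: round(mean_pass_rate(rates), 1)
    for model, rates in suite_pass_rates.items()
}
for model, avg in averages.items():
    print(f"{model}: {avg}%")
```

Because each suite counts equally, one failed case in a 10-case suite moves the average by two points, while one failure among the 100 retrieval cases would move it by only 0.2.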

Rule Evaluation (10 cases)

Each case tests whether the model correctly identifies which custom rules apply to a given conversation turn.

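| Test case | Gemini 2.5 Flash | Gemma 4 26B | Gemma 4 31B |
| --- | --- | --- | --- |
| Single rule fires on greeting | Pass | Pass | Pass |
| Single rule does not fire | Pass | Pass | Pass |
| Multiple rules, only one fires | Pass | Pass | Pass |
| Multiple rules, both fire | Pass | Pass | Pass |
| Rule uses context from history | Pass | Pass | Pass |
| No rules match | Pass | Pass | Pass |
| Rule fires on profanity | Pass | Pass | Pass |
| Rule fires on question about fees | Pass | Pass | Pass |
| Only the last message matters | Fail | Fail | Pass |
| Semantic match (not keyword) | Pass | Pass | Pass |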

The one failure shared by Gemini and Gemma 26B is a nuanced case: a rule that should only trigger based on the most recent message, not prior conversation history. Gemma 31B handles this correctly.

Bot Info Tool (11 cases)

Tests whether the model calls the right tool when a user asks about the bot's capabilities or the platform.

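| Test case | Gemini 2.5 Flash | Gemma 4 26B | Gemma 4 31B |
| --- | --- | --- | --- |
| What can this bot do? | Pass | Pass | Pass |
| How to configure the bot | Pass | Pass | Pass |
| What is CommunityOne? | Pass | Pass | Pass |
| Quests feature | Fail | Pass | Pass |
| Hype Engine | Pass | Pass | Pass |
| Spark AI | Pass | Pass | Pass |
| Community question with context | Pass | Pass | Pass |
| General knowledge | Pass | Pass | Pass |
| Greeting | Pass | Pass | Pass |
| Community event | Pass | Pass | Pass |
| Server-specific question | Pass | Pass | Pass |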

Gemini failed to correctly invoke the tool for the Quests feature, returning a generic response instead of fetching bot-specific information. Both Gemma models handle it correctly.

Knowledge Gap Detection (17 cases)

Tests whether the model correctly flags when it lacks sufficient context to answer a question confidently.

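| Test case | Gemini 2.5 Flash | Gemma 4 26B | Gemma 4 31B |
| --- | --- | --- | --- |
| Community question, no context | Pass | Pass | Pass |
| Community question, irrelevant context | Pass | Pass | Pass |
| Community question, outdated context | Pass | Pass | Pass |
| CommunityOne server help question | Pass | Pass | Pass |
| Hype Engine question | Pass | Pass | Pass |
| Spark AI question | Pass | Pass | Pass |
| Event details missing | Pass | Pass | Pass |
| General knowledge, answerable | Pass | Pass | Pass |
| General knowledge, unanswerable | Pass | Pass | Pass |
| Greeting / small talk, no results | Pass | Pass | Pass |
| Small talk, irrelevant results | Pass | Pass | Pass |
| Bot answered from knowledge base | Pass | Pass | Pass |
| Bot help on third-party server | Pass | Pass | Pass |
| Hype Engine on third-party server | Pass | Pass | Pass |
| Spark AI on third-party server | Pass | Pass | Pass |
| Server listing on third-party server | Pass | Pass | Pass |
| Social / rhetorical message | Pass | Pass | Pass |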

All three models handle every knowledge gap scenario correctly. This is the most critical category from a trust perspective — a model that claims to know something it does not is far more damaging than one that admits uncertainty.

Safe for Kids (14 cases)

Tests whether the model correctly identifies and modifies content that is inappropriate for family-friendly servers, without over-flagging harmless content.

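| Test case | Gemini 2.5 Flash | Gemma 4 26B | Gemma 4 31B |
| --- | --- | --- | --- |
| Mild profanity | Pass | Pass | Pass |
| Strong profanity | Pass | Pass | Pass |
| Sexual content | Pass | Pass | Pass |
| Graphic violence | Pass | Pass | Pass |
| Drug references | Pass | Pass | Pass |
| Hate speech | Pass | Pass | Pass |
| Profanity with useful info | Pass | Pass | Pass |
| Clean helpful answer (true negative) | Pass | Pass | Pass |
| Factual history mention (true negative) | Pass | Pass | Pass |
| Neutral greeting (true negative) | Pass | Pass | Pass |
| Technical explanation (true negative) | Pass | Pass | Pass |
| Polite decline (true negative) | Pass | Pass | Pass |
| Educational biology (true negative) | Pass | Pass | Pass |
| Medical info, neutral (true negative) | Pass | Pass | Pass |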

Perfect across the board. Notably, 7 of the 14 cases are true negatives: content that looks borderline but should not be modified. All models correctly leave these untouched.

RAG Retrieval (100 cases)

All three models returned the correct source document in the top results for all 100 synthetic questions — 70 single-turn and 30 multi-turn conversational queries generated from real knowledge base content.

| Question type | Cases | Gemini 2.5 Flash | Gemma 4 26B | Gemma 4 31B |
| --- | --- | --- | --- | --- |
| Single-turn | 70 | 70/70 | 70/70 | 70/70 |
| Multi-turn (conversational) | 30 | 30/30 | 30/30 | 30/30 |
| Total | 100 | 100/100 | 100/100 | 100/100 |

A Note on Data Privacy

Switching AI models is a good moment to be explicit: nothing about your data has changed.

We do not train on your users' conversations. Messages handled by Spark are never used to train our models or improve anyone else's models.

Google does not train on your data either. Gemma 4 is an open-weight model deployed through inference providers. No data is sent back to Google for training at any point in this stack. OpenRouter's providers operate on standard inference contracts — your data is not a training resource.

Your data stays yours. We store conversation logs only to power your server's analytics and to help Spark answer questions better within your own knowledge base. We do not use it to profile users, build advertising segments, or share it with third parties.

You can delete your data anytime. Every server admin can request a full data deletion directly from the CommunityOne dashboard — no email required, no waiting period, no friction.

What's Next

Premium model selection — with tested guarantees

We are building the ability for premium users to select higher-tier models, with a credit-based system to match. But this is not a simple plug-and-play toggle.

Every model we make available to users goes through the same internal test suite you saw in this post. The reason is that a chat model for Discord and one for a website widget genuinely have different requirements. On Discord, style matters enormously — your community has an established tone, and a bot that responds in flat corporate prose breaks the vibe. On a website widget, retrieval precision is the priority — users are typically asking specific product questions and expect an accurate, direct answer, not a conversational riff.

We have known for a long time that different models excel at different things. Our role is not just to expose that choice to you — it is to do the evaluation work first, build platform-specific system prompts that make each model behave correctly on each surface, and only then give you access to it. That evaluation work takes time to do properly. We would rather be slow and right than fast and ship you a model that sounds good in isolation but fails in your actual use case.

Evaluations as a premium feature

The internal tooling we built to run and track these tests is coming to premium users. You will be able to run the same evaluation suite against your own knowledge base and see exactly where your bot fell short — which questions it answered wrong, which gaps in your documentation caused retrieval to fail, and what you can do to fix it. Instead of guessing why users are not getting good answers, you will have data.

Reliability infrastructure

The move to OpenRouter is one layer of our reliability work, but it is not the whole picture. Internally, we have been building a broader set of fallback systems. If a provider goes down or a model endpoint degrades, our routing logic detects the failure automatically and shifts traffic — without you noticing. The goal is that even a multi-hour outage at a single provider is invisible to your community. We are still building this out, but the foundation is in place and improving with every deployment.
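The general shape of that kind of failover can be sketched in a few lines. This is a simplified illustration, not our routing code or OpenRouter's API: `ProviderError`, `complete_with_fallback`, and the two stand-in providers are hypothetical names, and a production version would also need timeouts, health checks, and retry budgets.

```python
class ProviderError(Exception):
    """Raised when a provider call fails (hypothetical error type)."""


def complete_with_fallback(prompt, providers):
    """Try each (name, call) pair in order and return the first success.

    Returns a (provider_name, reply) tuple. Raises ProviderError with the
    collected failure messages if every provider fails.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Stand-in providers for the demo: one that is down, one that is healthy.
def flaky_provider(prompt):
    raise ProviderError("503 service unavailable")


def healthy_provider(prompt):
    return f"echo: {prompt}"


used, reply = complete_with_fallback(
    "How do quests work?",
    [("provider-a", flaky_provider), ("provider-b", healthy_provider)],
)
print(used, reply)
```

The caller only ever sees a reply or a single aggregated error, which is what lets a single-provider outage stay invisible to the community.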

If you want to see Spark in action in your own community, the setup takes about five minutes.