Can Gemma 4 Beat Gemini 3.1 Pro at Coding?

Is a $20/month Google AI Pro account worth it versus running Gemma 4 31B on OpenRouter pay-as-you-go? This Ship-Bench run was designed to answer that question across a realistic coding workflow rather than a single coding prompt.
Hypothesis: Gemini's larger model size would show clear advantages over Gemma's smaller 31B parameters especially when it comes to working through problems.
Key Insights
Gemini finished with an 86.6 average across the five roles and passed 4 of 5 gates, while Gemma finished at 72.4 and only passed 2 of 5.
Gemma actually led the raw Architect and UX scores, but still failed the Architect gate because exact versions were not pinned to the latest frameworks.
The biggest separation showed up in execution and verification: Gemini scored 93.3 in Developer versus Gemma's 58, and 72 versus 37 in Reviewer.
Gemini is currently an unusually strong value on AI Pro, but the more durable market-rate comparison is roughly $5.05 for Gemini versus $0.85 for Gemma on OpenRouter-equivalent pricing.
Setup
Both runs used the same machine, the same runtime family, the same benchmark task, and the same Ship-Bench version (v1). The main difference was the harness and provider setup, which matters because operator experience and tool behavior can shape outcomes even when the benchmark target stays constant.
Environment
| Item | Value |
|---|---|
| Machine | Windows 11 |
| Runtime | Node v24 |
| Ship-Bench repo | ship-bench v1 |
| Benchmark task | Simplified knowledge base app |
Run configuration
| Item | Gemini run | Gemma run |
|---|---|---|
| Harness | Gemini CLI 0.38.2 | GitHub Copilot CLI 1.0.34 |
| Model | Gemini 3.1 Pro | Gemma 4 31B |
| Backend | Google AI Pro account | OpenRouter |
| Run repo | Gemini branch | Gemma branch |
Judge configuration
| Item | Value |
|---|---|
| Judge harness | Claude Code |
| Judge model | Opus 4.7 Medium |
| Evaluation mode | LLM judge plus human review |
Ship-Bench Context
Ship-Bench evaluates models across five SDLC roles: Architect, UX Designer, Planner, Developer, and Reviewer. Each role produces artifacts that feed the next stage, making the benchmark useful for measuring both isolated output quality and handoff quality across a realistic workflow.
This run used the standard simplified knowledge base app task. That task is large enough to expose differences in architecture, planning, implementation, and QA without becoming too open-ended to compare cleanly across runs.
Overall Results
| Metric | Gemini 3.1 Pro | Gemma 4 31B |
|---|---|---|
| Architect | 87.2 | 92.2 (FAIL gate) |
| UX Designer | 89.5 | 94.6 |
| Planner | 91.1 | 80.0 (FAIL gate) |
| Developer | 93.3 | 58.0 (FAIL) |
| Reviewer | 72.0 (FAIL gate) | 37.0 (FAIL) |
| Average score | 86.6 | 72.4 |
| Passes | 3/5 | 2/5 |
Gemini was more dependable across the full workflow. Gemma looked competitive early, but the later-stage failures were severe enough to erase that advantage in practical terms.
Architect
The architect stage tests whether the model can turn the product brief into a concrete technical plan with clear decisions and minimal unresolved ambiguity.
| Metric | Gemini 3.1 Pro | Gemma 4 31B |
|---|---|---|
| Score | 87.2 | 92.2 |
| Pass | Yes | No |
| Output | Gemini architecture | Gemma architecture |
| Eval | Gemini eval | Gemma eval |
LLM judge summary: Gemma scored higher on design quality and ergonomics, but failed the mandatory Frameworks gate because it used generic “Latest” placeholders instead of exact version pins. Gemini passed with slightly lower raw score because of some nitpicking of the LLM judge.
Human notes: Both chose SQLite plus Prisma for a good local-first developer experience, but neither specified what a deployed database path should look like, so both would have needed follow-up prompting there. Testing strategies were broadly similar, backend and data choices were nearly identical, but the front-end architecture showed a real difference: Gemma defaulted to a standard Next.js plus Tailwind stack, while Gemini simplified to vanilla CSS in a way that felt more thought-through for the actual backlog. Gemma's outdated framework assumptions are also a meaningful practical issue, especially if version drift is already a known complaint with LLMs.
UX Designer
The UX stage evaluates whether the design direction is specific enough to guide implementation, including flows, states, layout decisions, and interaction details.
| Metric | Gemini 3.1 Pro | Gemma 4 31B |
|---|---|---|
| Score | 89.5 | 94.6 |
| Pass | Yes | Yes |
| Output | Gemini UX spec | Gemma UX spec |
| Eval | Gemini eval | Gemma eval |
LLM judge summary: Both passed. Gemma scored slightly higher because it was a bit more complete on states and accessibility detail, while Gemini was still fully usable and implementable.
Human notes: Gemma did a bit better describing screen routes by user flow, but Gemini's version was still perfectly functional. Gemini also put more thought into the interactions themselves, even if both specs largely covered the same interaction set.
Planner
The planner stage tests whether the model can convert prior artifacts into an executable delivery sequence with sensible task sizing and dependency order.
| Metric | Gemini 3.1 Pro | Gemma 4 31B |
|---|---|---|
| Score | 91.1 | 80.0 |
| Pass | Yes | No |
| Output | Gemini backlog | Gemma backlog |
| Eval | Gemini eval | Gemma eval |
LLM judge summary: Gemini produced better-scoped vertical slices and passed the planner gates. Gemma failed because its task structure relied too much on horizontal slicing and deferred testing until the end and some imbalance in the iterations.
Human notes: This is where Gemini's stronger reasoning started to matter more. Both understood scope and dependencies well, but Gemma's sequence of Foundation → Browsing → Editing → Testing left both unit and end-to-end testing to the final iteration, which created imbalanced iterations and caused rework in iteration 4. Gemini's sequence of Base/Foundation → Browsing/Viewing → Editing → Searching felt more realistic and better balanced.
Developer
The developer stage measures whether the model can implement the assigned backlog into a working MVP while staying aligned to the earlier artifacts.
| Metric | Gemini 3.1 Pro | Gemma 4 31B |
|---|---|---|
| Score | 93.3 | 58.0 |
| Pass | Yes | No |
| Output | Gemini source | Gemma source |
| Eval | Gemini eval | Gemma eval |
LLM judge summary: Gemini delivered a working MVP with verified browse, search, and edit flows. Gemma's implementation failed on a broken Prisma import that caused 500 errors and prevented the write path from functioning correctly.
Human notes: Both models needed some operator intervention around interactive commands like create-react-app and Playwright setup. The practical difference is that Gemini mostly sailed through implementation after that, while Gemma could not get the newer Prisma version working, downgraded it, never got Playwright green, and left a critical bug on the edit article page that required manual fixing.
Reviewer
The reviewer stage closes the loop by checking whether the built MVP actually satisfies the brief, specs, and implementation plan.
| Metric | Gemini 3.1 Pro | Gemma 4 31B |
|---|---|---|
| Score | 72.0 | 37.0 |
| Pass | No | No |
| Output | Gemini QA report | Gemma QA report |
| Eval | Gemini eval | Gemma eval |
LLM judge summary: Both reviewer runs failed gates, but in very different ways. Gemini's failure was relatively minor and came from missing screenshots, attached evidence, and other verification artifacts despite catching real defects. Gemma's reviewer missed the app-crashing Prisma import entirely, marked broken flows as PASS without browser verification, and made a ship recommendation on a non-functional app.
Human notes: Gemini's stronger reasoning showed up again here: it found one major issue and several minor ones, but none blocked primary functionality. Gemma never got the Playwright tests running, did not work around that limitation, and missed the critical showstopping bugs altogether.
Gate Failures
| Model | Role | Gate failure |
|---|---|---|
| Gemini 3.1 Pro | Reviewer | Evidence gate — no screenshots, coverage report, or attached logs despite otherwise sound defect detection. |
| Gemma 4 31B | Architect | Frameworks gate — no exact versions, “Latest” placeholders, and outdated assumptions on version currency. |
| Gemma 4 31B | Planner | 70% good chunks gate — horizontal slicing and late testing caused poor iteration quality. |
| Gemma 4 31B | Developer | MVP flows and critical bugs gates — broken Prisma import caused 500s and blocked key flows. |
| Gemma 4 31B | Reviewer | Flows, Defects, and Evidence gates — the reviewer missed critical failures and did not verify runtime behavior. |
Token and Cost Analysis
The quality difference matters, but cost is the practical question behind this comparison.
| Metric | Gemini AI Pro (effective) | Gemini OpenRouter equivalent | Gemma OpenRouter |
|---|---|---|---|
| Total tokens | 2.35M | 2.35M | 6.43M |
| Estimated cost | ~$0.13 | $5.05 | $0.85 |
| Cost per average point | $0.0015 | $0.058 | $0.012 |
Gemini is currently a great value on AI Pro at roughly \(0.13 effective for this run based on the observed request budget, but that pricing environment should not be assumed to last as providers reduce quotas and raise prices. The more durable comparison is the retail-style one: about \)5.05 for Gemini versus $0.85 for Gemma, which makes Gemma far cheaper but also much weaker once the workflow reaches implementation and QA.
App Comparison
The benchmark scores matter most, but screenshots still help reveal polish and coherence that score tables do not fully capture.
Screenshots
Gemini 3.1 Pro
Gemma 4 31B
| View | Gemini app | Gemma app |
|---|---|---|
| Home page | article_list.png | articles.png |
| Search results | search.png | search.png |
| Article detail | article.png | article.png |
| Article editor | article_edit.png | article_edit.png |
Subjective UX review
Both models produced broadly similar flows, which is expected given the task and specs. The main visual difference is that Gemini went very lean and content-forward, while Gemma inherited baseline Tailwind styling that felt slightly less aesthetic in practice.
Both apps would have benefited from wireframes earlier in the process. There were also some obvious missed touches on both sides, such as stronger search calls to action, although Gemma at least added a “Clear search” option that Gemini lacked.
Interpretation
This run suggests that Gemini's deeper reasoning matters most once the workflow stops being about drafting and starts being about sequencing, implementation, recovery, and verification. Gemma stayed competitive in the earlier specification-heavy stages, but the later breakdowns show that a cheaper model can still become expensive if it burns cycles on rework or misses critical issues.
That does not mean Gemma has no place. With tighter task definitions and more explicit setup constraints, it could still make sense as a lower-cost option for spec-heavy work or coding loops where the operator is willing to be more hands-on.
Verdict: Gemini 3.1 Pro
Gemini showed that deeper thinking is vital for coding workflows in this benchmark. It produced the more reliable end-to-end result and delivered a working MVP across the SDLC handoffs that matter most.
Gemma was much cheaper on a market-rate basis and looked competitive in the early roles, but it broke down where the benchmark became most operationally demanding. With more upfront work to make task definitions crisper, Gemma may still be a sensible way to save money on coding loops, but this run did not show it as the better full-workflow option.


