Do Open Frontier Models Have a Chance Against Closed Models?

Which of the new open-ish frontier models has the best chance to stand up against closed-source models on both cost and quality?
I ran Ship-Bench against Kimi K2.6, Qwen 3.6 Plus, and DeepSeek v4 Pro to find out.
Hypothesis: All three models will live up to the hype, delivering good-enough output quality while undercutting closed frontier models on price. Kimi is rumored to have "Opus-like" quality, with Qwen and DeepSeek standing as long-time competitors.
Key Insights (tl;dr)
DeepSeek v4 Pro finished first with a 94.2 average and 5/5 gate passes, ahead of Kimi K2.6 at 94.0 with 5/5 passes, and Qwen 3.6 Plus at 90.7 with 4/5 passes.
All three produced strong-looking apps and much better visual results than the earlier Gemini and Gemma runs.
Token usage is the clearest economic indicator: Kimi used an astounding 64.1 million tokens, Qwen a similar 63.3 million, and DeepSeek "just" 26.3 million.
Qwen's planning left much to be desired, while Kimi and DeepSeek both cleared all five SDLC roles.
DeepSeek made the best overall case because it combined top-end quality with much better token efficiency. Kimi and Qwen were less compelling on cost once their heavy reasoning usage was included.
Cost is the catch. I will need a sponsor if this trend continues; read on to find out why.
Setup
All three runs used the same benchmark task and the same general operator setup. The important differences were the target model and, in DeepSeek's case, a slightly newer Copilot CLI build.
Environment
| Item | Value |
|---|---|
| Machine | Windows 11 |
| Runtime | Node v24 |
| Ship-Bench repo | ship-bench v1 |
| Benchmark task | Simplified knowledge base app |
Run configuration
| Item | Kimi K2.6 | Qwen 3.6 Plus | DeepSeek v4 Pro |
|---|---|---|---|
| Harness | Copilot CLI 1.0.37 | Copilot CLI 1.0.37 | Copilot CLI 1.0.43 |
| Model | kimi-k2.6 | qwen-3.6-plus | deepseek-v4-pro |
| Run branch | Kimi K2.6 | Qwen 3.6 Plus | DeepSeek v4 Pro |
Judge configuration
| Item | Value |
|---|---|
| Judge harness | Claude Code |
| Judge model | Opus 4.7 Medium |
| Evaluation mode | LLM judge plus human review |
Ship-Bench Context
Ship-Bench evaluates models across five SDLC roles: Architect, UX Designer, Planner, Developer, and Reviewer. Each phase produces artifacts that feed the next one, which makes the benchmark useful for testing not just isolated quality but handoff quality across a realistic software workflow.
This run used the Simplified Knowledge Base App task. It is large enough to expose differences in architecture, planning, implementation, and QA, while still being constrained enough to compare across runs.
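The role handoff described above can be sketched as a simple pipeline where each phase sees the brief plus every artifact produced before it. This is an illustrative stand-in, not the real Ship-Bench harness; the role names come from the benchmark, but the function and artifact shapes here are assumptions.

```python
# Hypothetical sketch of Ship-Bench's role pipeline: each phase consumes the
# artifacts produced by the phases before it. Role names come from the post;
# the run_role callable and artifact shape are illustrative assumptions.

ROLES = ["Architect", "UX Designer", "Planner", "Developer", "Reviewer"]

def run_pipeline(brief, run_role):
    """Feed each role the brief plus every artifact produced so far."""
    artifacts = {}
    for role in ROLES:
        artifacts[role] = run_role(role, brief, dict(artifacts))
    return artifacts

# Example with a stub role runner standing in for a model call:
stub = lambda role, brief, prior: f"{role} artifact (saw {len(prior)} prior)"
result = run_pipeline("simplified knowledge base app", stub)
print(result["Reviewer"])  # the Reviewer sees all four upstream artifacts
```

The point of the structure is the one the benchmark is built around: a weak artifact early in the chain (say, a bad backlog from the Planner) degrades everything downstream.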
Overall Results
| Metric | Kimi K2.6 | Qwen 3.6 Plus | DeepSeek v4 Pro |
|---|---|---|---|
| Architect | 93.89 | 92.78 | 95.56 |
| UX Designer | 98.57 | 98.60 | 98.60 |
| Planner | 98.33 | 87.30 | 93.00 |
| Developer | 97.00 | 92.00 | 98.75 |
| Reviewer | 82.00 | 83.00 | 85.00 |
| Average score | 93.96 | 90.74 | 94.18 |
| Passes | 5/5 | 4/5 | 5/5 |
The top-level story is straightforward. All three models were strong enough to look credible on quality, but DeepSeek delivered the cleanest balance of score, pass rate, and efficiency. Kimi stayed close on quality, while Qwen was still good overall but took the biggest hit from planning and execution friction.
Gate Failures
| Model | Role | Gate failure |
|---|---|---|
| Qwen 3.6 Plus | Planner | Failed the ≥70% good-chunk gate; the plan landed around 20% good chunks and mixed oversized iterations with undersized sub-tasks. |
That matters in practice because planning quality affects the entire downstream workflow. Qwen's raw planner score was still respectable, but the gate failure matched the real-world churn that showed up later in development.
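The good-chunk gate can be sketched as a simple ratio check. The sizing window and the example backlog below are assumptions for illustration; the real rubric's definition of a "good" chunk may differ.

```python
# Hypothetical sketch of the >=70% good-chunk planner gate described above.
# The lo/hi sizing window is an assumption, not the actual Ship-Bench rubric.

def good_chunk_ratio(chunk_sizes, lo=2, hi=8):
    """Fraction of backlog chunks whose task count falls in the target window."""
    good = sum(1 for n in chunk_sizes if lo <= n <= hi)
    return good / len(chunk_sizes)

def passes_gate(chunk_sizes, threshold=0.70):
    return good_chunk_ratio(chunk_sizes) >= threshold

# A plan shaped like Qwen's: oversized iterations mixed with one-task sub-tasks.
qwen_like = [1, 1, 12, 15, 1, 3, 20, 1, 1, 5]
print(good_chunk_ratio(qwen_like))  # 0.2 -> roughly the 20% reported above
print(passes_gate(qwen_like))       # False
```

A gate like this is deliberately binary: a plan can score well on content and still fail if its work breakdown is unexecutable at the chunk level.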
Architect
The architect stage tests whether the model can turn the product brief into a concrete technical plan with clear decisions and minimal unresolved ambiguity.
| Metric | Kimi K2.6 | Qwen 3.6 Plus | DeepSeek v4 Pro |
|---|---|---|---|
| Score | 93.89 | 92.78 | 95.56 |
| Pass | Yes | Yes | Yes |
| Output | architecture.md | architecture.md | architecture.md |
| Eval | eval | eval | eval |
LLM judge summary: All three architecture specs were concrete and implementation-ready. DeepSeek scored highest on completeness and organization, Kimi was close behind, and Qwen remained solid but had more version drift and a slightly weaker maintainability story.
Human notes: Kimi got a bonus for asking clarifying questions even after being told to choose based on requirements, and it was the only one of the three to propose a totally separate API server rather than keeping everything inside Next.js. Qwen's assumptions section was a nice touch and helped readability, but Kimi still had the edge. DeepSeek landed between them on raw architecture quality, though its organization was especially strong from a human-review perspective.
Practical takeaway: All three were viable architects, but Kimi and DeepSeek felt stronger in practice than Qwen.
UX Designer
The UX stage evaluates whether the design direction is specific enough to guide implementation, including flows, states, layout decisions, and interaction details.
| Metric | Kimi K2.6 | Qwen 3.6 Plus | DeepSeek v4 Pro |
|---|---|---|---|
| Score | 98.57 | 98.60 | 98.60 |
| Pass | Yes | Yes | Yes |
| Output | design-spec.md | design-spec.md | design-spec.md |
| Eval | eval | eval | eval |
LLM judge summary: This role was extremely close. All three specs were dev-ready, state-rich, and unusually detailed, with the only consistent deduction being the lack of actual rendered mockups.
Human notes: Kimi was the first model in these runs to include text wireframes, which was a meaningful improvement over prior benchmark posts. Qwen also included text wires and felt roughly on par with Kimi from the spec alone. DeepSeek got the edge here because it was the most detailed of the three while still staying coherent.
Practical takeaway: UX was a strength for all three, with a slight edge to DeepSeek on spec quality and a slight edge to Qwen on final app aesthetics.
Planner
The planner stage tests whether the model can convert the prior artifacts into an executable delivery sequence with sensible task sizing and dependency order.
| Metric | Kimi K2.6 | Qwen 3.6 Plus | DeepSeek v4 Pro |
|---|---|---|---|
| Score | 98.33 | 87.30 | 93.00 |
| Pass | Yes | No | Yes |
| Output | backlog.md | backlog.md | backlog.md |
| Eval | eval | eval | eval |
LLM judge summary: Kimi scored best numerically, DeepSeek was still strong, and Qwen failed on granularity. DeepSeek's plan balanced actionability with broader lifecycle thinking, while Qwen's chunking missed the rubric's target window.
Human notes: All three leaned too horizontal and all three deferred meaningful E2E testing until later, which caused churn in the final implementation stretch. That was the biggest shared planning weakness in the whole comparison. DeepSeek still came out best overall here because it combined strong planning with explicit stretch goals and documentation work, while Qwen's organization felt cleaner than Kimi's even though it failed the gate.
Gate failure note: Qwen's failure was not just a paperwork problem. The chunking issue lined up with the practical development friction later in the run.
Practical takeaway: DeepSeek had the best planning story overall, even though all three would have benefited from earlier vertical slices and earlier E2E verification.
Developer
The developer stage measures whether the model can implement the assigned backlog into a working MVP while staying aligned to the prior artifacts.
| Metric | Kimi K2.6 | Qwen 3.6 Plus | DeepSeek v4 Pro |
|---|---|---|---|
| Score | 97.00 | 92.00 | 98.75 |
| Pass | Yes | Yes | Yes |
| Output | source code | source code | source code |
| Eval | eval | eval | eval |
LLM judge summary: DeepSeek was the strongest implementer, Kimi was close behind, and Qwen was clearly the most troublesome of the three despite still shipping a passable result.
Human notes: Kimi's developer was thorough and produced a nicer-looking app than the earlier Gemini and Gemma runs, but OpenRouter performance for Kimi was slow and it burned a huge number of thinking tokens. Qwen was much harder to operate: it hit CLI compatibility problems, copied into the wrong folder after a create-react-app naming issue, removed the .git folder in that location, and killed all Node processes on the machine when trying to stop a dev server, including the CLI itself. DeepSeek was faster, cleaner, and more token-efficient, though it still hit some churn in the testing iteration like the others.
Practical takeaway: DeepSeek was the best developer in both output quality and operator experience.
Reviewer
The reviewer stage closes the loop by checking whether the built MVP actually satisfies the brief, the specs, and the implementation plan.
| Metric | Kimi K2.6 | Qwen 3.6 Plus | DeepSeek v4 Pro |
|---|---|---|---|
| Score | 82.00 | 83.00 | 85.00 |
| Pass | Yes | Yes | Yes |
| Output | qa-report.md | qa-report.md | qa-report.md |
| Eval | eval | eval | eval |
LLM judge summary: Reviewer was the weakest role for all three, mostly due to evidence and performance-measurement gaps rather than completely bad QA logic. DeepSeek scored highest, but the margin was small.
Human notes: Kimi's QA agent got bonus points for identifying security issues. Qwen's reviewer was thorough. DeepSeek's reviewer was solid, but Kimi and Qwen may have had a slight edge in practical QA sharpness despite the final numeric ordering.
Practical takeaway: All three reviewers were useful, but none of them fully closed the loop as cleanly as the design and development stages did.
Token and Cost Analysis
The quality differences were not huge, so token usage matters a lot here.
Primary cost view
| Metric | Kimi K2.6 | Qwen 3.6 Plus | DeepSeek v4 Pro |
|---|---|---|---|
| Total requests | 761 | 1060 | 426 |
| Total tokens | 64.1M | 63.3M | 26.3M |
| Estimated total cost | $25.84 | $21.16 | $13.07 |
| Cost per average point | $0.27 | $0.23 | $0.14 |
This is where the story changes. Kimi and Qwen did not really behave like bargain options in this setup because both burned so many reasoning tokens that they gave back much of their nominal pricing advantage. DeepSeek still used substantial tokens, but it was dramatically more efficient, and that made its quality result much easier to justify economically. Compare this to the earlier Gemma and Gemini runs, which used a fraction of the tokens.
If this trend keeps up, I'll need a benefactor to keep my OpenRouter account stocked up. For now, go have some fun on dumbquestion.ai, maybe buy some merch (I hear the mugs are pretty neat).
App Comparison
Screenshots matter here because all three models produced apps that are close enough in score that visual polish and interaction quality become part of the practical comparison.
Screenshots
Kimi
Qwen
DeepSeek
| View | Kimi K2.6 app | Qwen 3.6 Plus app | DeepSeek v4 Pro app |
|---|---|---|---|
| Articles list | articles.png | articles.png | articles.png |
| Article detail | details.png | details.png | details.png |
| Article editor | edit.png | edit.png | edit.png |
Subjective UX review
All three apps were aesthetically pleasing, and all three looked better than the earlier Gemini and Gemma runs. Qwen gets a slight edge on overall visual feel, but it was a close call.
DeepSeek stood out most clearly in search. Its search behavior felt meaningfully better than the others, with proper debounced and deferred search, accurate FTS behavior, and cleaner result presentation. Qwen's search was a little pickier, and Kimi's was competent but less polished visually.
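The debounce pattern behind that search behavior is simple: only the last query typed within a short delay window actually fires a request. A minimal framework-free sketch, in Python for consistency with the other snippets here; the real app presumably implements this client-side in JavaScript.

```python
import threading

# Minimal debounce sketch: rapid keystrokes cancel the pending search, so only
# the final query within the delay window actually runs. Illustrative only.

class DebouncedSearch:
    def __init__(self, search_fn, delay=0.05):
        self.search_fn = search_fn
        self.delay = delay
        self._timer = None

    def type(self, query):
        """Call on every keystroke; cancels any pending search first."""
        if self._timer is not None:
            self._timer.cancel()
        self._timer = threading.Timer(self.delay, self.search_fn, args=(query,))
        self._timer.start()

executed = []
search = DebouncedSearch(executed.append, delay=0.05)
for q in ("k", "kn", "kno", "knowledge"):
    search.type(q)           # simulated rapid keystrokes
threading.Event().wait(0.2)  # let the debounce window elapse
print(executed)              # only the final query ran
```

Pairing this with deferred rendering of results is what made DeepSeek's search feel responsive rather than jittery.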
Interpretation
From a quality perspective, all three models made a legitimate case for themselves. None of these runs looked like a cheap imitation of a frontier workflow. DeepSeek, Kimi, and Qwen all produced strong architecture, detailed UX specs, and working MVPs that would have been hard to dismiss outright if they had been evaluated without model names attached.
But the economics split them apart. DeepSeek had the best chance to stand up against closed-source models because it combined top-tier quality with much better token efficiency. Kimi and Qwen still looked competitive on quality, but their reasoning-heavy behavior made them less compelling as cost challengers in this specific setup.
Verdict - DeepSeek v4 Pro
This run showed that all three open-ish frontier models have a real chance to compete with closed-source models on quality.
But if the question is which one currently has the best chance to compete on both cost and quality, the answer here is DeepSeek v4 Pro. Kimi K2.6 and Qwen 3.6 Plus stayed in the quality conversation, but their token inefficiency made them more expensive in practice than their model positioning might suggest.
What's Next?
What should the next Ship-Bench matchup test?
Are there two models or tools you want compared head to head?
Are you more interested in raw quality, cost efficiency, or open-vs-closed performance?
Do you want to know which setup is best for end-to-end autonomous runs, or which one is good enough for specific roles?
Are you more interested in planning quality, implementation reliability, or QA accuracy?



