
Claude Opus 4.5 Scores 80.9% on SWE-bench

+ Extend MCP servers with interactive UIs, Deep Agents with Google ADK



Read time: 3 mins

AI Tutorial

Imagine uploading a photo of your outdated kitchen and instantly getting a photorealistic rendering of what it could look like after renovation, complete with budget breakdowns, timelines, and contractor recommendations. That's exactly what we're building today.

In this tutorial, you'll create a sophisticated multi-agent home renovation planner using Google's Agent Development Kit (ADK) and Gemini 2.5 Flash Image (aka Nano Banana).

It analyzes photos of your current space, understands your style preferences from inspiration images, and generates stunning visualizations of your renovated room while keeping your budget in mind.

We share hands-on tutorials like this every week, designed to help you stay ahead in the world of AI. If you're serious about leveling up your AI skills and staying ahead of the curve, subscribe now and be the first to access our latest tutorials.

Don’t forget to share this newsletter on your social channels and tag Unwind AI (X, LinkedIn, Threads) to support us!

Latest Developments

MCP servers just learned how to speak in pictures and buttons, not just text.

The Anthropic team is adding support for interactive user interfaces in MCP through a new standardized extension called MCP Apps, meaning AI agents can now show you rich visualizations, complex forms, and interactive tools instead of dumping raw JSON data in your chat.

The extension introduces a structured way for servers to declare UI resources (using the ui:// URI scheme), link them to specific tools, and enable two-way communication between embedded interfaces and host applications. Everything runs in sandboxed iframes with auditable messaging to keep things secure.

Key Highlights:

  1. Pre-declared UI resources - UI templates are registered as resources with the ui:// scheme and referenced in tool metadata, allowing hosts to prefetch and review templates before execution for better performance and security.

  2. Native MCP communication - UI components communicate with hosts using the existing MCP JSON-RPC protocol over postMessage, meaning developers can use the standard MCP SDK to build applications with structured, auditable messages that automatically benefit from future protocol features.

  3. Layered security model - Protection comes from multiple angles, including mandatory iframe sandboxing with restricted permissions, predeclared templates that hosts can review before rendering, loggable JSON-RPC for all UI-to-host communication, and optional user consent requirements for UI-initiated tool calls.

  4. HTML-first - The initial specification supports only text/html content rendered in sandboxed iframes, providing universal browser support, well-understood security patterns, and screenshot generation capabilities while deferring external URLs and native widgets to future iterations.

You can review the full specification here: SEP-1865
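To make the moving parts concrete, here is a rough sketch of the three pieces described above: a UI template declared under the ui:// scheme, a tool that points at it through its metadata, and a JSON-RPC 2.0 message that an embedded UI could post to its host. The field names, the `ui_resource_uri` metadata key, and the `visualize_sales` tool are illustrative assumptions, not copied from SEP-1865, so check the specification before relying on any of them.

```python
import json

# Hypothetical UI resource a server might declare. The ui:// scheme and the
# text/html content type come from the summary above; the field names are
# illustrative and not taken from SEP-1865.
ui_resource = {
    "uri": "ui://charts/bar-chart",
    "name": "Bar Chart Viewer",
    "mimeType": "text/html",
    "text": "<!doctype html><html><body><div id='chart'></div></body></html>",
}

# Hypothetical tool that links to the template through its metadata.
tool = {
    "name": "visualize_sales",
    "description": "Render sales data as an interactive bar chart",
    "inputSchema": {"type": "object", "properties": {"region": {"type": "string"}}},
    "_meta": {"ui_resource_uri": ui_resource["uri"]},  # key name is an assumption
}


def templates_to_prefetch(tools, resources):
    """Return the ui:// templates referenced by the given tools, keyed by URI."""
    by_uri = {r["uri"]: r for r in resources}
    wanted = {t.get("_meta", {}).get("ui_resource_uri") for t in tools}
    return {
        uri: by_uri[uri]
        for uri in wanted
        if uri and uri.startswith("ui://") and uri in by_uri
    }


def ui_tool_call(request_id, name, arguments):
    """Build a JSON-RPC 2.0 request that an embedded UI could send to its host."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


# A host can review templates before rendering, then log every UI message.
prefetched = templates_to_prefetch([tool], [ui_resource])
message = ui_tool_call(1, "visualize_sales", {"region": "EMEA"})
```

Because every UI-to-host message is plain JSON-RPC, a host can log or gate each one (for example, by asking for user consent before forwarding a UI-initiated tool call).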

The AI Agent Shopify Brands Trust for Q4

Generic chatbots don’t work in ecommerce. They frustrate shoppers, waste traffic, and fail to drive real revenue.

Zipchat.ai is the AI Sales Agent built for Shopify brands like Police, TropicFeel, and Jackery. Designed to sell, Zipchat also:

  • Answers product questions instantly and recommends upsells

  • Converts hesitant shoppers into buyers before they bounce

  • Recovers abandoned carts automatically across web and WhatsApp

  • Automates support 24/7 at scale, cutting tickets and saving money

From 10,000 visitors/month to millions, Zipchat scales with your store — boosting sales and margins while reducing costs. That’s why fast-growing DTC brands and established enterprises alike trust it to handle their busiest season and fully embrace Agentic Commerce.

Setup takes less than 20 minutes with our success manager. And you’re fully covered with 37 days risk-free (7-day free trial + 30-day money-back guarantee).

On top, use the NEWSLETTER10 coupon for 10% off forever.

Claude Opus 4.5 scores 80.9% on SWE-bench Verified, outperforming every other frontier model in real-world software engineering tasks.

On Anthropic's take-home exam for prospective performance engineering candidates, it scored higher within the 2-hour time limit than any human candidate has.

There’s a lot to be excited about!

Claude Opus 4.5 is now available via the API, the Claude app, Claude Code, and major platforms like Cursor and Lovable. API pricing is $5 per million input tokens and $25 per million output tokens, which brings top-tier AI capabilities within reach for more developers and teams. The model delivers meaningful improvements in coding, agentic behavior, computer use, and everyday tasks like deep research and working with spreadsheets.
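For a feel of what that pricing means per request, here is a small back-of-the-envelope calculation. The prices are the ones quoted above; the token counts are a made-up example, and the sketch ignores any discounts such as caching or batching.

```python
# Prices quoted above, in USD per million tokens.
INPUT_PRICE_PER_MTOK = 5.00
OUTPUT_PRICE_PER_MTOK = 25.00


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its input and output token counts."""
    total = input_tokens * INPUT_PRICE_PER_MTOK + output_tokens * OUTPUT_PRICE_PER_MTOK
    return total / 1_000_000


# Hypothetical request: 20,000 tokens of context in, 2,000 tokens out.
cost = request_cost_usd(20_000, 2_000)
print(f"${cost:.2f}")  # prints $0.15
```

Output tokens cost five times as much as input tokens, so long generations dominate the bill faster than long prompts do.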

Opus 4.5 is also very effective at managing a team of subagents, which makes it practical to build complex, well-coordinated multi-agent systems. Claude’s Context Editing capabilities and Memory Tool can take this performance up a notch and help keep conversations going longer.

Key Highlights:

  1. Best-in-class coding performance - Leads across 7 out of 8 programming languages on SWE-bench Multilingual and scores 89.4% on Aider Polyglot coding problems, beating Sonnet 4.5's 78.8%. Handles ambiguity and reasons about tradeoffs without hand-holding.

  2. Effort control - At medium effort, matches Sonnet 4.5's best performance while using 76% fewer output tokens. At high effort, it exceeds Sonnet 4.5 by 4.3 percentage points while still using 48% fewer tokens.

  3. Industry-leading security against prompt injection - Most robust defense against prompt injection attacks among all frontier models, with significantly lower susceptibility rates. Also achieves the lowest "concerning behavior" score, measuring resistance to both human misuse and undesirable autonomous actions.

  4. Plan Mode in Claude Code - Claude Code now includes Plan Mode that asks clarifying questions upfront and builds editable plan.md files before execution.
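To put the effort-control numbers in perspective, the arithmetic below applies the quoted reductions to a hypothetical baseline. The 10,000-token figure is an assumption chosen for easy math, and the percentages are reported benchmark results, so real workloads may see different savings.

```python
def tokens_after_reduction(baseline_tokens: int, percent_fewer: float) -> int:
    """Output tokens left after using `percent_fewer` percent fewer than a baseline."""
    return round(baseline_tokens * (1 - percent_fewer / 100))


# Hypothetical baseline: Sonnet 4.5 spends 10,000 output tokens on a task.
baseline = 10_000
medium_effort = tokens_after_reduction(baseline, 76)  # 2400
high_effort = tokens_after_reduction(baseline, 48)    # 5200

print(medium_effort, high_effort)  # prints 2400 5200
```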

Quick Bites

Prompting Playbook for Gemini 3
Your Gemini 2.5 prompts probably won't work well on Gemini 3 Pro. A Google team member posted their Gemini 3 Pro prompting guide based on real usage, and the main takeaway is that the model actually prefers less verbose prompts than 2.5. It defaults to concise answers unless you explicitly request otherwise, responds better to structured XML formatting, and performs best when you place specific instructions after large context blocks rather than before them. The guide is a great starting point to help you refine your own strategies. Take what works, tweak what doesn't, and keep iterating.
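Here is one way those tips could look in practice: a small helper that wraps context in XML-style tags and puts the instructions after the context block. The tag names (`context`, `document`, `instructions`) are our own choice for illustration, not something the guide prescribes.

```python
def build_prompt(context_docs: list[str], instructions: str) -> str:
    """Assemble a prompt with tagged context first and the instructions last."""
    docs = "\n".join(
        f'<document index="{i}">\n{doc.strip()}\n</document>'
        for i, doc in enumerate(context_docs, start=1)
    )
    return (
        f"<context>\n{docs}\n</context>\n\n"
        f"<instructions>\n{instructions.strip()}\n</instructions>"
    )


# Hypothetical usage: two documents, one short instruction placed at the end.
prompt = build_prompt(
    ["Q3 revenue grew 12% year over year.", "Q3 churn fell to 2.1%."],
    "Summarize the quarter in two sentences.",
)
print(prompt)
```

The instruction stays short and comes last, which matches the guide's two main points: less verbose prompts and instructions after large context blocks.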

AI2’s suite of reasoning models with “open model flow”
AI2 released OLMo 3, a leading fully open LM suite built for reasoning, chat, and tool use. It comes with an “open model flow”: the complete lifecycle is open-sourced, including every checkpoint, dataset, and training decision. The 7B and 32B models come in multiple variants (Base, Think, Instruct, RL Zero), with OLMo 3-Think (32B) leading as the strongest fully open reasoning model at this scale. The release includes the new 5.9-trillion-token Dolma 3 pretraining mix, the Dolci post-training suite, and integration with OlmoTrace for tracing model outputs back to training data in real time.

Build "Deep Agents" with Google ADK and Gemini 3 Pro
Google's Agent Development Kit now ships with "Deep Search," a full-stack reference implementation showing how to build agents that actually think through problems recursively. The workflow splits into human-collaborative planning (where you refine research goals together), followed by autonomous execution - the agent loops through searching, self-critiquing for gaps, and refining until it has enough data, then composes a comprehensive report with linked citations to all source material. Use this agent as a starting point for your own full-stack agent.
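The autonomous half of that workflow boils down to a search, critique, refine loop. The sketch below shows the control flow in plain Python. It does not use the ADK API, and the `search` and `find_gaps` callables are hypothetical stand-ins for the model-backed steps; report writing is left out.

```python
from typing import Callable


def deep_research(
    goal: str,
    search: Callable[[str], list[str]],
    find_gaps: Callable[[str, list[str]], list[str]],
    max_rounds: int = 5,
) -> list[str]:
    """Search, self-critique for gaps, and refine until no gaps remain."""
    findings: list[str] = []
    queries = [goal]
    for _ in range(max_rounds):
        for query in queries:
            findings.extend(search(query))
        # Self-critique: ask what is still missing, given what was found so far.
        queries = find_gaps(goal, findings)
        if not queries:
            break  # enough data collected
    return findings


# Stub implementations so the loop can run without any model or network.
def fake_search(query: str) -> list[str]:
    return [f"note on {query}"]


def fake_find_gaps(goal: str, findings: list[str]) -> list[str]:
    return ["pricing"] if len(findings) < 2 else []


notes = deep_research("vector databases", fake_search, fake_find_gaps)
print(notes)  # prints ['note on vector databases', 'note on pricing']
```

The `max_rounds` cap is a design choice worth keeping in any real version, since a critic that always finds gaps would otherwise loop forever.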

Parallel launches Extract API for LLM-ready web scraping
Parallel just launched their Extract API in beta, which pulls clean markdown from any URL, including JavaScript-heavy sites and multi-page PDFs that typically break scrapers. It works in two modes: compressed excerpts based on semantic objectives, or full content extraction. Pair it with their Search API, and you've got agents that can discover relevant pages, then dive into complete documentation, research papers, or financial filings without wrestling with paywalls or rendering issues.

Exa 2.1: Fast Search < 500ms, Deep Search beats the market
Exa just shipped version 2.1 with a 10x scale-up in pre-training and test-time compute, bringing real improvements to their search API. Their Fast endpoint now returns results in under 500ms while beating Google-wrapped alternatives, and their new Deep search mode uses agentic multi-query strategies to become the highest-accuracy search API available. They're one of the few companies actually building search infrastructure from scratch rather than wrapping existing engines, and it shows in the benchmarks.

Tools of the Trade

  1. OCR Arena - A free playground for testing and evaluating leading foundation VLMs and open source OCR models on document parsing tasks. Upload a document, measure accuracy, and vote for the best models on a public leaderboard.

  2. Claude-agent-server - Wraps the Claude Agent harness (the framework powering Claude Code) in a WebSocket server and deploys it on E2B cloud sandboxes, letting you run Claude's agentic capabilities remotely instead of just locally.

  3. Sourcewizard - A CLI tool that uses AI agents to install and configure SDKs in your codebase, handling everything from middleware to environment variables with package-specific prompts. Connects via MCP server.

  4. Erdos - An AI IDE built specifically for data scientists to create and edit Jupyter notebooks with AI assistance. It focuses on speed and accuracy for notebook-based workflows rather than traditional code files.

  5. Awesome LLM Apps - A curated collection of LLM apps with RAG, AI Agents, multi-agent teams, MCP, voice agents, and more. The apps use models from OpenAI, Anthropic, Google, and open-source models like DeepSeek, Qwen, and Llama that you can run locally on your computer.
    (Now accepting GitHub sponsorships)

Hot Takes

  1. I believe this new model in Claude Code is a glimpse of the future we're hurtling towards, maybe as soon as the first half of next year: software engineering is done.

    Soon, we won't bother to check generated code, for the same reasons we don't check compiler output.

    ~ Adam Wolff


  2. In school my kids are told “AI is cheating, don’t use it!”

    In real life I tell my employees “If you’re not using AI, you’re fired!”

    ~ Ian Andrews

That’s all for today! See you tomorrow with more such AI-filled content.


PS: We curate this AI newsletter every day for FREE, your support is what keeps us going. If you find value in what you read, share it with at least one, two (or 20) of your friends 😉 
