Apple Says Reasoning LLMs Are Just Bluffing

PLUS: GitHub makes Copilot project-specific; run Codex and Claude Code in secure sandboxes

Today’s top AI Highlights:

  1. Are reasoning models like Claude Sonnet and DeepSeek R1 faking it?

  2. Chain any Agent Framework into a multi-agent team

  3. Build AI agents in just 2 minutes with GitHub Copilot Spaces

  4. India’s Sarvam AI goes full-stack with voice AI agent platform

  5. Run OpenAI Codex or Claude Code in secure sandboxes

& so much more!

Read time: 3 mins

AI Tutorial

Traditional RAG has served us well, but it's becoming outdated for complex use cases. While vanilla RAG can retrieve and generate responses, agentic RAG adds a layer of intelligence and adaptability that transforms how we build AI applications. Also, most RAG implementations are still black boxes - you ask a question, get an answer, but have no idea how the system arrived at that conclusion.

In this tutorial, we'll build a multi-agent RAG system with transparent reasoning using Claude 4 Sonnet and OpenAI. You'll create a system where you can literally watch the AI agent think through problems, search for information, analyze results, and formulate answers - all in real-time.

We share hands-on tutorials like this every week, designed to help you stay ahead in the world of AI. If you're serious about leveling up your AI skills and staying ahead of the curve, subscribe now and be the first to access our latest tutorials.

Don’t forget to share this newsletter on your social channels and tag Unwind AI (X, LinkedIn, Threads) to support us!

Latest Developments

Picture this: You want three AI agents to have a simple conversation. Agent A analyzes data, Agent B makes a decision, Agent C executes it. Should be easy, right?

Fast forward two hours…

You're drowning in 70+ lines of messy code, debugging mysterious handoff failures, and questioning your life choices. Your "simple" three-agent workflow now looks like you're trying to choreograph a dance between drunk robots.

Water solves that for you…

Water is like having a universal translator for AI agents – except instead of languages, it speaks fluent LangChain, CrewAI, and whatever other framework you threw into your project last week.

Key Highlights:

  1. Framework agnostic - Drop in any agent framework or custom agents without getting locked into specific APIs, so you can mix and match different agents in the same workflow.

  2. Clean Workflow Design - Chain tasks with simple .then(), run parallel execution with .parallel(), and handle conditional logic with .branch() using intuitive Python syntax (a rough sketch follows this list).

  3. Context Management - Each task receives execution context with access to previous outputs, flow metadata, and step tracking, eliminating the manual state management that usually complicates multi-agent systems.

  4. Built-in Playground - A FastAPI playground automatically generates endpoints for your flows with interactive documentation, making testing and iteration much faster.
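
To make the chaining style in point 2 concrete, here's a rough sketch of what that three-agent flow could look like. Only the .then() / .parallel() / .branch() chaining is taken from the description above; the import path, the Flow and create_task names, and the exact signatures are assumptions, so treat this as pseudocode and check Water's docs for the real API.

```python
# Hypothetical sketch only: the import path, Flow/create_task names, and
# signatures below are assumptions, not Water's confirmed API.
from water import Flow, create_task  # assumed import

# The "agents" can be LangChain chains, CrewAI crews, or plain Python callables.
def analyze(data):     # Agent A: summarize the incoming data
    return {"summary": f"analyzed: {data}"}

def decide(summary):   # Agent B: turn the analysis into a decision
    return {"action": "archive"}

def act(decision):     # Agent C: carry out the decision
    return {"status": "done"}

flow = (
    Flow(id="three_agent_pipeline")
    .then(create_task(id="analyze", execute=analyze))  # A runs first
    .then(create_task(id="decide", execute=decide))    # B sees A's output via the flow context
    .then(create_task(id="act", execute=act))          # C executes B's decision
)
```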

You know the drill. Open ChatGPT. Copy-paste your API structure. Explain your database schema. Describe your authentication flow. Ask your question. Get a response that's almost helpful but misses three crucial details about how your specific project works.

Tomorrow? Rinse and repeat. Because AI has the memory of a goldfish with commitment issues.

Imagine if ChatGPT or Claude actually remembered that you use FastAPI with SQLAlchemy, that your user model has that weird composite key, and that yes, you did decide to store timestamps in UTC because of that incident we don't talk about.

That's Copilot Spaces by GitHub. It's like giving your AI a permanent desk in your office where it can keep all your project's sticky notes, coffee-stained documentation, and "why did we build it this way?" explanations.

Key Highlights:

  1. Expert-level understanding - Ground Copilot's responses in your specific code, docs, and notes to get answers that actually make sense for your project instead of generic suggestions.

  2. Effortless team scaling - Share spaces across your organization so junior developers can instantly access senior-level expertise without interrupting busy teammates.

  3. Zero maintenance context - Attach repositories directly from GitHub—no copy-pasting required—and watch as your space automatically updates when code changes.

  4. Quality over quantity - Intentionally limited context size ensures higher response quality compared to unlimited knowledge bases that dilute accuracy.

Here's how to build AI agents in under 2 minutes: 

1️⃣ Start by cloning the Awesome LLM Apps repository, which contains 75+ ready-to-use AI agents and RAG tutorials.

2️⃣ Head to GitHub Copilot Spaces and create a new workspace, then add the repository along with custom instructions telling Copilot to act like a senior AI engineer.

3️⃣ Now you can prompt it with any AI agent idea and watch it generate complete Python code, requirements files, and documentation instantly.

Pro Tip: Use this with Google Gemini 2.5 Pro or Claude 3.7 Sonnet thinking models for best results. Check out the demo here.

Claude Sonnet Thinking and DeepSeek R1 might not be reasoning at all.
They're just really good at faking it.

Apple's new research tested the world's best "reasoning" models (Claude 3.7 Sonnet Thinking, DeepSeek R1, and OpenAI's o3-mini) and found something nobody expected.

All these top models show the exact same broken pattern: As problems get harder, they initially think more (use more tokens). But right before they completely fail, they suddenly START THINKING LESS, even though they still have plenty of token budget left.

Somewhere, Yann LeCun is probably doing a victory lap. We can hear him saying "I told you so."

Key Highlights:

  1. Complete accuracy collapse - All frontier reasoning models hit specific complexity thresholds where they drop to 0% accuracy. They're actually memorizing and recombining training data patterns rather than developing genuine problem-solving abilities.

  2. Three-regime pattern - Standard models actually outperform "reasoning" models on simple tasks, thinking models show advantages on medium complexity problems, but both completely fail on high-complexity tasks, where most real-world problems fall.

  3. Counterintuitive scaling failure - As problems approach critical difficulty, reasoning models reduce their thinking effort and use fewer tokens despite having ample token budget remaining, revealing fundamental compute scaling limitations.

  4. Algorithm-following incompetence - Even when given step-by-step solution algorithms (like cake-baking instructions), models still failed at exactly the same complexity points, proving they can't follow explicit directions consistently.

  5. Pattern matching - Models could handle 100+ moves in Tower of Hanoi puzzles but failed after just 4 moves in River Crossing puzzles, suggesting they memorized solutions during training rather than developing actual reasoning capabilities.
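
For a sense of scale here: an n-disk Tower of Hanoi needs a minimum of 2^n - 1 moves, so "100+ correct moves" corresponds to only about 7 disks, while the River Crossing instances that broke the models have far shorter solutions. A quick way to see how fast the move count grows:

```python
# Minimum number of moves to solve Tower of Hanoi with n disks is 2**n - 1,
# so solution length explodes exponentially with puzzle size.
for n in range(1, 11):
    print(f"{n:2d} disks -> {2**n - 1:5d} moves")
# 7 disks already needs 127 correct moves in a row; 10 disks needs 1023.
```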

Quick Bites

Build voice AI agents that know the difference between Mumbai slang and Chennai’s lilt. India’s Sarvam AI has released the Samvaad platform for Indian businesses to build, test, and deploy full-stack voice AI agents that can speak 11 Indian languages, sound natural, and follow the local accent. You can deploy these across channels like phone, WhatsApp, and web.

  • Agents are fully customizable - voice, instructions, tools, and knowledge base.

  • Go live in under a week by simply writing instructions and connecting your tools.

  • Built for India, with pricing that fits local needs.

Google has released an open-source DeepSearch template using Gemini 2.5 with LangGraph to build full-stack AI research agents. The system iteratively searches the web, reflects on the gathered information to identify gaps, then refines its queries until it can deliver comprehensive answers with proper citations through a React frontend. It is modular and flexible, and serves as a solid starting point for building full-stack AI agents with Gemini 2.5 and the LangGraph orchestration framework.
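
The skeleton below is not the template's actual code; it's a minimal LangGraph sketch of the search → reflect → refine loop described above, with stub nodes standing in for the Gemini 2.5 calls and the web-search tool.

```python
# Minimal LangGraph skeleton of a search -> reflect -> refine loop.
# The node bodies are stubs; in the real template they would call Gemini 2.5
# and a web search tool, and the state would also carry citations.
from typing import List, TypedDict
from langgraph.graph import StateGraph, END

class ResearchState(TypedDict):
    question: str
    findings: List[str]
    needs_more: bool
    answer: str

def search(state: ResearchState) -> dict:
    # Stub: issue a (refined) query and collect results.
    return {"findings": state["findings"] + [f"result for: {state['question']}"]}

def reflect(state: ResearchState) -> dict:
    # Stub: decide whether the findings still have gaps.
    return {"needs_more": len(state["findings"]) < 2}

def write_answer(state: ResearchState) -> dict:
    # Stub: synthesize a final answer from the findings.
    return {"answer": " / ".join(state["findings"])}

graph = StateGraph(ResearchState)
graph.add_node("search", search)
graph.add_node("reflect", reflect)
graph.add_node("answer", write_answer)
graph.set_entry_point("search")
graph.add_edge("search", "reflect")
# Loop back to search while gaps remain, otherwise write the final answer.
graph.add_conditional_edges("reflect", lambda s: "search" if s["needs_more"] else "answer")
graph.add_edge("answer", END)

app = graph.compile()
result = app.invoke({"question": "What is DeepSearch?", "findings": [], "needs_more": True, "answer": ""})
print(result["answer"])
```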

Cloudflare's engineers just did something wild: they open-sourced a production-ready OAuth 2.1 library where Claude wrote 95% of the code, and they documented every single prompt in their git commits. The lead engineer went from AI skeptic to believer after discovering Claude's sweet spot - feed it concrete examples instead of abstract requirements, then give it conversational feedback. Claude shines at churning out comprehensive code and documentation, but face-plants at something as simple as moving a class declaration, which needed manual intervention.

Tools of the Trade

  1. VibeKit: Open-source SDK for running coding agents like OpenAI Codex or Claude Code in secure, customizable sandboxes. You can generate and execute real code safely, stream output to your UI, and run everything in the cloud with full isolation and flexibility. Local execution is coming soon.

  2. MCP-Use: Open-source Python package for connecting any LLM to MCP tools in just 6 lines of code, without requiring desktop applications. It provides a straightforward client-agent structure for accessing MCP server capabilities from Python environments (see the sketch after this list).

  3. mcp-hacker-news: An MCP server for Hacker News. It acts as a bridge between the Hacker News API and MCP clients like Claude and Cursor. With this, you can fetch and interact with live Hacker News data - posts, comments, or users.

  4. Awesome LLM Apps: Build awesome LLM apps with RAG, AI agents, MCP, and more to interact with data sources like GitHub, Gmail, PDFs, and YouTube videos, and automate complex work.
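
For MCP-Use (item 2 above), those half-dozen lines look roughly like the snippet below, modeled on the project's README; the config file path and the choice of LangChain's ChatOpenAI as the LLM are placeholders, and exact parameter names may differ between versions.

```python
# Rough sketch of the MCP-Use client-agent pattern described above.
# "mcp_config.json" is a placeholder path to an MCP server config; any
# LangChain-compatible chat model can stand in for ChatOpenAI here.
import asyncio
from langchain_openai import ChatOpenAI
from mcp_use import MCPAgent, MCPClient

async def main():
    client = MCPClient.from_config_file("mcp_config.json")   # MCP servers to connect to
    agent = MCPAgent(llm=ChatOpenAI(model="gpt-4o"), client=client)
    print(await agent.run("Summarize the newest issues in my repo"))

asyncio.run(main())
```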

Hot Takes

  1. Everyone: AI!

    Literally everyone: AI!

    Apple: look at this new font that looks like glass for your Home Screen how magnificent and incredible only Apple can deliver this reimagined experience for you. ~
    Santiago

  2. Hot take 🌶️: using Cursor to vibe code a Python script hosted on Cloudflare Workers is 100x easier than trying to build an n8n workflow ~
    Ian Nuttall

That’s all for today! See you tomorrow with more such AI-filled content.

Don’t forget to share this newsletter on your social channels and tag Unwind AI to support us!

Unwind AI - X | LinkedIn | Threads

PS: We curate this AI newsletter every day for FREE, your support is what keeps us going. If you find value in what you read, share it with at least one, two (or 20) of your friends 😉 
