Stop giving agents the whole computer

+ GitHub Spec Kit, Qwen3.7-Max

Shubham Saboo & Gargi Gupta
May 22, 2026

I’ve been thinking a lot about how much room we should actually give coding agents to work.

Qwen3.7-Max running for 35 hours with 1,000+ tool calls makes long-horizon agents feel a lot more real. But today’s npm compromise is the less fun side of the same story: attackers are now targeting Claude Code and Codex hooks directly.

So the takeaway is pretty simple. Agents are getting better at doing real work, but the workflows around them need stricter specs, better memory, and tighter boundaries before we hand them bigger jobs.

Today’s top AI Highlights:

Qwen3.7-Max: 35 hours, 1,000+ tool calls, zero human intervention
GitHub Spec Kit forces AI to spec before it codes
314 npm packages compromised to hijack your coding agent
Google’s own version of hosted Hermes/OpenClaw
A design skill that refuses to look AI-generated

& so much more!

Read time: 3 mins

AI Tutorial

The Ultimate Guide to /goal

HTTP is a primitive. JSON is a primitive. /goal is becoming one for coding agents.

A few weeks ago, OpenAI's Codex CLI added /goal as a way to give the coding worker a job with a defined done state. Claude Code added it this week.

Hermes Agent, the orchestrator I run on a Mac Mini to coordinate work between coding workers, has had /goal built in for a while.

This guide walks through what /goal actually is, the three roles in a multi-agent setup, a real end-to-end run, the verification rule, and how to run goals in parallel without workers stepping on each other.

Read The Ultimate Guide to /goal

Don’t forget to share this newsletter on your social channels and tag Unwind AI (X, LinkedIn, Threads) to support us!

Latest Developments

Qwen3.7-Max: 35 Hours, 1,000+ Tool Calls, Zero Human Intervention

What’s the most steps and tool calls you’ve seen an LLM make without you babysitting it? 20? 50? Maybe 100?

Qwen3.7-Max just ran a fully autonomous kernel optimization session for 35 hours straight, making over 1,000 tool calls.

Alibaba's Qwen team released its latest model, Qwen3.7-Max, built specifically for the agent era. It tops SWE-Pro at 60.6% (vs Opus 4.6's 48.2%), leads TerminalBench, and takes the crown on MCP-Mark. It also leads the pure-reasoning benchmarks.

What makes it genuinely different is scaffold generalisation. You can plug it into Claude Code, OpenClaw, Hermes Agent, or Qwen Code and get consistent results without prompt gymnastics.

Key Highlights:

Reward hacking defense built in: During 80+ hours of RL training on SWE tasks, the model's monitoring system autonomously caught 1,618 reward hacking attempts and generated 13 new heuristic rules to block them. The model is training itself to be honest.
1M token context, 65K output: Scores 90.4% on MRCR-v2 128K, far ahead of every competitor on long-context retrieval.
48 languages natively: Leads multilingual benchmarks across the board, including WMT24++ translation and MMLU-ProX.
Pricing: Qwen3.7-Max is roughly half the price of GPT-5.4 and less than a third of Claude Opus 4.6, while matching or beating both on SWE-Pro and TerminalBench.
Closed source: Qwen3.7-Max is proprietary and will be available via Alibaba Cloud Model Studio API. Open-weight variants at smaller sizes are expected to follow.

GitHub Spec Kit forces AI to spec before it codes

Still throwing vague prompts at your coding agent and hoping it doesn't torch your project?

GitHub just open-sourced Spec Kit, a toolkit that makes the AI create a structured specification before it writes a single line of code. The agent figures out what you want, asks clarifying questions, plans the architecture, generates a task list, then implements. All structured, all inspectable, all before any code exists.
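To make "spec before code" concrete, here is a minimal Python sketch of a gate that only lets phases complete in order. This is not Spec Kit's implementation: the `Phase` names and the `SpecGate` class are hypothetical, chosen only to mirror the sequence described above (spec, clarifying questions, plan, task list, implement).

```python
from enum import IntEnum


class Phase(IntEnum):
    """Hypothetical phase names mirroring the sequence described above."""
    SPECIFY = 1
    CLARIFY = 2
    PLAN = 3
    TASKS = 4
    IMPLEMENT = 5


class SpecGate:
    """Illustrative gate: a phase can only complete if it is the next one in order."""

    def __init__(self) -> None:
        self.completed: list[Phase] = []

    def complete(self, phase: Phase) -> None:
        if len(self.completed) == len(Phase):
            raise RuntimeError("all phases are already complete")
        expected = Phase(len(self.completed) + 1)
        if phase != expected:
            raise RuntimeError(f"cannot run {phase.name} before {expected.name}")
        self.completed.append(phase)

    def can_implement(self) -> bool:
        # Implementation unlocks only after spec, clarification, plan, and tasks exist.
        return len(self.completed) >= Phase.TASKS
```

Calling `gate.complete(Phase.IMPLEMENT)` on a fresh gate raises, which is the whole point: the agent has nothing to implement until the earlier artifacts exist and can be inspected.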

103K stars already. Works with 30+ coding agents out of the box: Claude Code, Cursor, Codex, Gemini CLI, Junie, and more. And it's completely stack-agnostic, so it doesn't care if you're writing Rust or Rails.

Key Highlights:

Structured before creative: The agent can't start coding until it's written a spec, asked clarifying questions, and planned the architecture. The sequence is enforced, not optional.
30+ agent integrations: Works with Claude Code, Copilot, Gemini, Codex, Cursor, and pretty much every coding agent you're already using. Same spec, any agent.
Extensible via presets and extensions: Customize the workflow with your own templates. Runtime resolution follows a priority chain: project-local overrides beat presets beat extensions beat core.
MIT-licensed: Install via uv tool install from the GitHub repo. Full greenfield, exploration, and brownfield workflows supported out of the box.
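The priority chain in the highlights above (project-local overrides beat presets beat extensions beat core) is easy to picture as a first-match lookup. A rough Python sketch, assuming each layer is just a mapping from template name to content; the layer keys and the `resolve_template` function are made up for illustration and are not Spec Kit's actual API.

```python
from typing import Optional

# Highest priority first, following the order described above.
LAYERS = ["project", "preset", "extension", "core"]


def resolve_template(
    name: str, layers: dict[str, dict[str, str]]
) -> Optional[tuple[str, str]]:
    """Return (layer, content) from the highest-priority layer that defines `name`."""
    for layer in LAYERS:
        templates = layers.get(layer, {})
        if name in templates:
            return layer, templates[name]
    return None  # no layer defines this template
```

With this shape, dropping a file into the project-local layer shadows everything below it without touching the preset, extension, or core copies.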

314 npm Packages Compromised to Hijack Your Coding Agent

22 minutes. That's how long it took an attacker to publish 637 malicious versions across 317 npm packages with a combined 11+ million monthly downloads.

The compromised account "atool" pushed a payload from the "Mini Shai-Hulud" toolkit, the same one behind the SAP compromise three weeks ago. But the interesting bit is that the malware specifically targets AI coding agents. It injects Claude Code SessionStart hooks, Codex hooks, and VS Code "runOn: folderOpen" tasks. It harvests AWS credentials, Kubernetes tokens, SSH keys, GitHub PATs, and even 1Password and Bitwarden vaults. Exfiltration is disguised as OpenTelemetry traces to blend in with your existing observability stack.
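If you want a quick look at whether anything has registered itself in those two places, here is a small Python sketch. It assumes Claude Code settings live in `.claude/settings.json` (project or home directory) with hooks under a `hooks.SessionStart` key, and that VS Code auto-run tasks use `runOptions.runOn` in `.vscode/tasks.json`. Those paths are assumptions about a typical setup, Codex hooks are not covered, and a `tasks.json` containing comments will simply be skipped. A hit is not proof of compromise, only something to review by hand.

```python
import json
from pathlib import Path


def load_json(path: Path) -> dict:
    """Load a JSON file, returning {} if it is missing or not plain JSON."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return {}
    return data if isinstance(data, dict) else {}


def session_start_hooks(settings: dict) -> list:
    """Return the entries registered under hooks.SessionStart, if any."""
    hooks = settings.get("hooks")
    if not isinstance(hooks, dict):
        return []
    entries = hooks.get("SessionStart", [])
    return entries if isinstance(entries, list) else [entries]


def folder_open_tasks(tasks_file: dict) -> list[str]:
    """Return labels of tasks configured to run when the folder opens."""
    flagged = []
    for task in tasks_file.get("tasks", []):
        if not isinstance(task, dict):
            continue
        run_options = task.get("runOptions")
        if isinstance(run_options, dict) and run_options.get("runOn") == "folderOpen":
            flagged.append(task.get("label", "<unlabeled>"))
    return flagged


def audit(project: Path) -> None:
    """Print anything that runs automatically at session start or folder open."""
    settings_paths = [
        project / ".claude" / "settings.json",
        project / ".claude" / "settings.local.json",
        Path.home() / ".claude" / "settings.json",
    ]
    for path in settings_paths:
        entries = session_start_hooks(load_json(path))
        if entries:
            print(f"{path}: {len(entries)} SessionStart hook entries, review each command")
    labels = folder_open_tasks(load_json(project / ".vscode" / "tasks.json"))
    if labels:
        print(f"{project / '.vscode' / 'tasks.json'}: tasks run on folder open: {labels}")
```

Run `audit(Path.cwd())` from a repo root. Plenty of legitimate setups use both features, so the output is a review list, not a verdict.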

Key Highlights:

Your coding agent is a vector: The payload hooks into Claude Code and Codex session startup, silently piping every credential it finds to a C2 server. If you're running these agents in environments with cloud access, this is as bad as it sounds.
Packages you probably use: size-sensor (4.2M downloads/month), echarts-for-react (3.8M), @antv/scale (2.2M), timeago.js (1.15M). Check your lockfile.
Persistent and stealthy: A LaunchAgent/systemd service called "kitty-monitor" survives reboots and uses GitHub commit search as a dead-drop C2 channel, polling for RSA-PSS signed commands.
Full advisory and IoCs available: SafeDep published the complete list of all 317 compromised packages with deobfuscated payloads and remediation steps.
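"Check your lockfile" can be a ten-line script. The Python sketch below assumes an npm `package-lock.json` in the newer format with a top-level `packages` map (older lockfiles and other package managers are not handled), and the watchlist holds only the four package names mentioned above. It reports which versions you have installed; whether a given version is one of the malicious ones has to be checked against the advisory, since no version numbers are assumed here.

```python
import json
from pathlib import Path

# Only the four names called out above; extend from the full advisory list.
WATCHLIST = {"size-sensor", "echarts-for-react", "@antv/scale", "timeago.js"}


def find_watchlisted(lock: dict, watchlist: set[str] = WATCHLIST) -> dict[str, set[str]]:
    """Map each watchlisted package found in the lockfile to its installed versions."""
    hits: dict[str, set[str]] = {}
    for path, meta in lock.get("packages", {}).items():
        # Keys look like "node_modules/a/node_modules/@scope/b"; keep the last package name.
        name = path.rpartition("node_modules/")[2]
        if name in watchlist:
            hits.setdefault(name, set()).add(meta.get("version", "unknown"))
    return hits


def check_lockfile(path: Path) -> dict[str, set[str]]:
    """Read a package-lock.json from disk and return watchlisted packages with versions."""
    return find_watchlisted(json.loads(path.read_text(encoding="utf-8")))
```

`check_lockfile(Path("package-lock.json"))` returns an empty dict when none of the listed packages appear, including as transitive dependencies.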

Quick Bites

Andrej Karpathy joins Anthropic
Karpathy announced yesterday that he has joined Anthropic. "The next few years at the frontier of LLMs will be especially formative," he wrote. He shaped the early GPT era at OpenAI, then left to build Eureka Labs for AI education. Now he's back in the lab at what might be the most interesting research org in the field right now. One to watch.

Google releases a managed personal 24/7 agent
The Gemini app now comes with Spark, an always-on personal AI agent running on Gemini 3.5 and built on Antigravity. It navigates your digital life and takes actions on your behalf, even when you close your laptop. You can set up cron jobs (scheduled tasks), teach it new Skills, and create end-to-end workflows. Rolling out to trusted testers now, with beta access for Google AI Ultra subscribers next week.

Turn any chat agent into a voice agent with one prompt
ElevenLabs just shipped Speech Engine, and it’s pretty straightforward: keep your existing chat agent exactly as it is, add Speech Engine on top, and now it talks. You don’t need to rearchitect your LLM stack or swap out your RAG pipeline. It bundles speech-to-text, turn detection, interrupt handling, TTS, and audio orchestration into a single pipeline with ultra-low latency. Works with any LLM that produces text, has built-in stream extraction, and covers 70+ languages.

Google releases the Nano Banana of video-gen
Gemini Omni is Google's new any-input-to-any-output model. Feed it images, text, video, audio, or any combination, and it generates or edits video through conversation. Multi-turn editing keeps scenes consistent across back-and-forth iterations, and it applies real-world physics to generated content. Available in the Gemini app, Google Flow, and YouTube Shorts.

Multi-agent orchestration comes to Warp Oz
Warp just shipped multi-agent orchestration in Oz with support for Claude Code, Codex, and the Warp Agent. Use /orchestrate to delegate complex tasks across a team of agents running locally or in the cloud. If you're already in the Warp terminal, this makes it your control plane for parallel agent work.

Google Antigravity with Gemini 3.5 Flash now in your Terminal
Google Antigravity just shipped a CLI written in Go, powered by Gemini 3.5 Flash, and built for async workflows where agents run tasks in the background and report back when done. It shares the same tool and app server as Antigravity 2.0, so anything you build on the platform also works inside Google Search, where Antigravity powers the new agentic coding features: custom generative UIs, dashboards, and "mini apps" spun up from natural language.

Google Search gets its biggest overhaul in 25 years
The search box itself is being rebuilt: AI-powered, dynamically expanding, with multimodal inputs (text, images, files, videos, even Chrome tabs). New "search agents" will monitor the web 24/7 for specific criteria like apartment listings or sneaker drops, and agentic booking lets you complete local service bookings right from Search. Launching for Google AI Pro and Ultra subscribers this summer.

Tools of the Trade

Hallmark: Design skill by Hassan El Mghari that encodes anti-slop rules into Claude Code, Cursor, and Codex. Has four modes: build (generates pages that refuse to repeat the same structure twice), study (extracts a design's DNA from a URL or screenshot without copying pixels), audit (scores existing pages against its anti-pattern catalogue), and redesign (same content, deliberately different bones).
CodeGraph: Pre-indexed code knowledge graph for Claude Code, Codex, Cursor, and OpenCode that cuts tool calls by 92% and speeds up tasks by 71%. 100% local, MIT-licensed.
AgentMemory: Persistent memory MCP server for coding agents with 4-tier consolidation inspired by how the brain organizes memory during sleep. Works with every major coding agent.
Awesome LLM Apps (111k+ 🌟): A curated collection of LLM apps with RAG, AI Agents, multi-agent teams, MCP, voice agents, and more. The apps use models from OpenAI, Anthropic, Google, and open-source models like DeepSeek, Qwen, and Llama that you can run locally on your computer.
(Now accepting GitHub sponsorships)

That’s all for today! See you tomorrow with more such AI-filled content.


Unwind AI - X | LinkedIn | Threads

Awesome LLM Apps | Sponsor Us

PS: We curate this AI newsletter every day for FREE, your support is what keeps us going. If you find value in what you read, share it with at least one, two (or 20) of your friends 😉
