Copilot meters agents, AGENTS.md enters review, and MCP gets ready for July

The useful agent news from the last two days is not one more demo. It is the stack around the demo getting more explicit: usage is now metered per user, coding agents inherit repository instructions, MCP builders are being pushed toward stateless servers and tool-level authorization, and the community is asking how to test agents that can actually take irreversible actions.

Fast scan

Signal	What changed	Why it matters for builders
Copilot cost telemetry	GitHub added `ai_credits_used` to the Copilot usage metrics API. Enterprise and organization reports can now show per-user daily AI-credit consumption in both one-day and 28-day views, although GitHub says the field is not broken down by feature, model, or surface. 1	Agentic coding cost is moving from anecdote to budget line. Track credits against merged PRs, review cycles, and escaped defects rather than treating spend as a shared pool.
Repository instructions enter review	Copilot code review now uses root-level `AGENTS.md` instructions when producing feedback. 2	The instruction file is no longer just a local coding-agent convention. Stale or overly broad `AGENTS.md` files are now review-governance debt.
Agent authorship gets clearer	Generated GitHub release notes can credit the human who asked Copilot cloud agent to open a merged pull request, alongside `@copilot`. 3	Agent work needs a human accountability trail. Attribution is starting to appear in ordinary developer artifacts, not only admin dashboards.
MCP's July migration gets concrete	Gravitee's June 19 write-up says the July 28 MCP spec moves to a stateless core, adds MCP Apps, promotes Tasks for long-running work, tightens authorization, and introduces a formal deprecation policy. 4	Check for hidden session dependencies now. If your MCP server assumes sticky sessions or implicit state, the migration is architectural, not cosmetic.
Agent memory paper	AtomMem, submitted to arXiv on June 18, proposes extracting high-value atomic facts, organizing them into hierarchical events and temporal profiles, then retrieving through an associative memory graph. 5	The memory race is moving from bigger context windows to more disciplined memory representations. Personalized agents need update stability as much as recall.
Agentic code security	Aikido's Code Audit post, updated June 19, claims its agentic audit covers roughly 70-80% of what a full pentest engagement surfaces at about 10x lower cost, with early users finding a median of about 25 issues per codebase. 6	Treat those as vendor-reported figures, but the direction is clear: defenders are trying to use agent reasoning against multi-step logic flaws before attackers do.
Agent payments surface again	Agentcard hit Hacker News as virtual cards for AI agents with DoorDash checkout, while its site pitches disposable one-time cards, user-authorized transactions, and integrations with ChatGPT, Claude Desktop, and OpenClaw. 7 8	Agentic commerce is still waiting on mundane controls: spend limits, charge approval, merchant scope, and receipts. Payment is the last mile where a demo becomes liability.

The enterprise layer is getting less optional

GitHub's Copilot changes point in the same direction from two sides. The usage metrics API gives admins a per-user consumption signal, while AGENTS.md support lets repository-specific instructions shape Copilot review feedback. That is a quiet but important pairing: spend and behavior are both becoming observable through normal enterprise surfaces. A team can no longer argue that coding agents are just a personal productivity experiment if the bill, the review policy, and the generated release notes all carry agent traces.
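As a rough illustration of tracking credits against merged PRs, here is a minimal sketch. Only `ai_credits_used` comes from GitHub's announcement; the row shape, the `user` and `day` keys, and the merged-PR counts are assumptions made up for the example, not the real API response.

```python
from collections import defaultdict

# Hypothetical rows. Only the `ai_credits_used` field name is from the source;
# `user`, `day`, and the numbers are invented for illustration.
usage_rows = [
    {"user": "ana", "day": "day-1", "ai_credits_used": 40.0},
    {"user": "ana", "day": "day-2", "ai_credits_used": 20.0},
    {"user": "raj", "day": "day-1", "ai_credits_used": 30.0},
]
merged_prs = {"ana": 4, "raj": 0}  # assumed to come from your own PR data


def credits_per_merged_pr(rows, prs):
    """Sum credits per user, then divide by that user's merged PR count."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["user"]] += row["ai_credits_used"]
    report = {}
    for user, credits in totals.items():
        merged = prs.get(user, 0)
        # None flags spend with no merged output instead of dividing by zero.
        report[user] = credits / merged if merged else None
    return report


print(credits_per_merged_pr(usage_rows, merged_prs))
```

A `None` entry is the interesting case for a budget review: credits were consumed but nothing merged in the same period.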

The practical move is to treat agent configuration as production configuration. An AGENTS.md file should have an owner, review cadence, and scope boundaries. It should say what the agent can assume about the repo, what it must never do, and which tests or commands are authoritative. If it reads like a collection of one-off prompting tips, it will produce one-off review behavior.
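One way to enforce that in CI is a small lint step. The sketch below assumes a team convention in which AGENTS.md carries one heading per governance item; the section names are hypothetical and are not part of any AGENTS.md standard or of Copilot's behavior.

```python
# Hypothetical section names a team might require; not an official schema.
REQUIRED_SECTIONS = (
    "owner",
    "review cadence",
    "scope",
    "never",
    "authoritative commands",
)


def missing_sections(agents_md_text):
    """Return the required headings that the instruction file does not contain."""
    headings = {
        line.lstrip("#").strip().lower()
        for line in agents_md_text.splitlines()
        if line.startswith("#")
    }
    return [name for name in REQUIRED_SECTIONS if name not in headings]


sample = "# Owner\nplatform-team\n\n## Scope\nOnly touch src/ and tests/.\n"
print(missing_sections(sample))
```

Failing the build when the list is non-empty gives the file the same kind of gate as other production configuration.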

MCP has the same problem at the protocol layer. Gravitee's summary of the July 28 spec says the old session model is going away: no initialize handshake, no `Mcp-Session-Id`, and no server-instance affinity. Client capabilities move into `_meta` on each request, and a new `server/discover` method lets clients fetch server capabilities when needed. The immediate upside is simpler load balancing. The harder work is making state explicit, carried in handles that travel with each request rather than hidden in a session object. 4
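To make "explicit handles" concrete, here is a minimal sketch of a paginated tool whose cursor travels in the request instead of living in server memory. This is not the MCP SDK or the spec's wire format; `issue_handle`, `read_handle`, `next_page`, the request shape, and the demo secret are all hypothetical.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-only-secret"  # assumption: a key shared by every server instance


def issue_handle(state):
    """Serialize state into an opaque, tamper-evident handle."""
    payload = json.dumps(state, sort_keys=True).encode()
    tag = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode() + "." + tag


def read_handle(handle):
    """Verify the handle and recover the state it carries."""
    encoded, tag = handle.rsplit(".", 1)
    payload = base64.urlsafe_b64decode(encoded.encode())
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("handle was modified or issued elsewhere")
    return json.loads(payload)


def next_page(request):
    """Any instance can serve this call: nothing is read from local memory."""
    state = read_handle(request["handle"])
    start, size = state["offset"], state["page_size"]
    items = list(range(start, start + size))  # stand-in for a real query
    new_state = {**state, "offset": start + size}
    return {"items": items, "handle": issue_handle(new_state)}


first = next_page({"handle": issue_handle({"offset": 0, "page_size": 2})})
second = next_page({"handle": first["handle"]})
```

Because the handle is signed, a client can hold it but cannot rewrite the offset or page size, and a load balancer is free to route the second call to a different instance.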

That matters because MCP is now where many agent risks concentrate. BankInfoSecurity's June 18 risk note frames MCP as the connective tissue between AI agents and enterprise systems, then calls out tool poisoning, delegated-authority abuse, SSRF, developer-environment compromise, least-privilege scoping, time-bound consent, sandboxing, credential vaulting, and non-human-identity auditing as core control areas. 9 The vendor and practitioner posts differ in tone, but they converge on one point: once agents can call tools, the security boundary is no longer the chat box. It is the tool call.

Research signal: memory is becoming a data-model problem

AtomMem is the cleanest research item in the verified window. The paper starts from a familiar pain point: long-term agents need to reuse information across sessions, but coarse memories drift, update unstably, or become too expensive to retrieve. Its answer is to turn conversations into high-value atomic facts, arrange those facts into event structures and temporal profiles, and activate related memory fragments through an associative graph during retrieval. The authors report state-of-the-art performance on LoCoMo across multiple reasoning tasks. 5

The important distinction is not "memory versus no memory." Most production agents already have some form of notes, vector search, or session replay. AtomMem is part of a more specific shift: memory as a data model with update semantics. Atomic facts make memory easier to inspect. Event hierarchies keep context from flattening into isolated snippets. Temporal profiles let user attributes change over time rather than calcify after one conversation.
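A toy data model shows what "update semantics" means in practice. This is not AtomMem's algorithm and it omits the event hierarchy and the associative graph entirely; the class names and the integer timestamps are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AtomicFact:
    subject: str
    attribute: str
    value: str
    observed_at: int  # logical time, for example a session number


@dataclass
class TemporalProfile:
    # (subject, attribute) -> facts ordered by observation time
    history: dict = field(default_factory=dict)

    def update(self, fact):
        """Append instead of overwrite, so earlier values stay inspectable."""
        facts = self.history.setdefault((fact.subject, fact.attribute), [])
        facts.append(fact)
        facts.sort(key=lambda f: f.observed_at)

    def as_of(self, subject, attribute, time):
        """Return the value that held at `time`, or None if nothing was known."""
        facts = [
            f for f in self.history.get((subject, attribute), [])
            if f.observed_at <= time
        ]
        return facts[-1].value if facts else None


profile = TemporalProfile()
profile.update(AtomicFact("user", "city", "Lisbon", observed_at=1))
profile.update(AtomicFact("user", "city", "Berlin", observed_at=5))
```

Here `profile.as_of("user", "city", 3)` still answers with the earlier city, while a query at time 5 or later reflects the move, which is the "change over time rather than calcify" property in miniature.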

One older context item from earlier this week helps explain why this matters for evaluation. OpenAI's June 16 Deployment Simulation post says it tested agentic coding settings by using 120,000 internal employee agentic trajectories from GPT-5.4 to simulate an internal GPT-5.5 deployment, with tool calls simulated instead of applied to live systems. The post reports that giving the tool-simulator model extra affordances improved a discriminator test from an 11.6% win rate to 49.5%, close to chance. 10 That item is outside the main 24-48 hour window, but it is directly relevant to today's builder discussion: realistic agent evaluation depends on the surrounding state, tools, and traces, not only on the model's next answer.

Builder pulse: testing actions, not answers

The r/AI_Agents discussion on June 19 was blunt. One post asks how to test agents that send emails, update CRM records, issue refunds, schedule meetings, or modify infrastructure. The author points out that a bad chatbot answer is annoying, while a wrong action can delete a record or email the wrong customer. 11

That question lines up with Hex's May write-up on data-agent evaluation, which resurfaced on Hacker News today. Hex describes Shoebox as an internal lab bench for ad-hoc and scheduled evals, pairwise candidate-versus-baseline comparisons, daily production baselines, and custom rubrics layered over small, carefully written eval sets. It also built Shorelane Commerce, a fake B2B2C office-supplies business with about $129M in yearly revenue, three revenue streams, messy migrations, multiple customer IDs, and millions of rows across dozens of tables. 12 That is the shape action-taking evals are moving toward: less leaderboard, more controlled world with receipts.
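The pairwise candidate-versus-baseline idea reduces to a small calculation. The sketch below is a generic illustration, not Hex's Shoebox implementation; it assumes each eval case has already been scored for both systems by some rubric, and the function names are hypothetical.

```python
def judge_pair(candidate_score, baseline_score):
    """Turn two rubric scores for the same case into a verdict."""
    if candidate_score > baseline_score:
        return "candidate"
    if baseline_score > candidate_score:
        return "baseline"
    return "tie"


def win_rate(score_pairs):
    """Share of decided cases won by the candidate; ties are excluded."""
    verdicts = [judge_pair(c, b) for c, b in score_pairs]
    decided = [v for v in verdicts if v != "tie"]
    if not decided:
        return None
    return decided.count("candidate") / len(decided)


# (candidate score, baseline score) per eval case, invented for the example
pairs = [(4, 3), (2, 5), (5, 5), (3, 1)]
rate = win_rate(pairs)
```

A rate near 0.5 means the rubric cannot separate the two systems on this set, which is a reason to write sharper cases before trusting either a win or a loss.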

A second Reddit thread asks how to deploy Python-script agents to nontechnical teams after a prototype works locally. 13

This is where Agentcard is interesting despite being early. Its pitch is not better reasoning; it is one disposable card, one approved charge, one real-world checkout flow. The product claims one-click integration with ChatGPT, Claude Desktop, and OpenClaw, and the Show HN title specifically calls out DoorDash checkout. 7 8 The pattern is familiar: once an agent crosses from text into purchasing, the hard part is scoping the action so a user can approve it without supervising every token.

Kernhelm, another Hacker News item from the same window, attacks the problem lower in the stack. The project describes a kernel-level enforcement layer for untrusted agents, scripts, and compromised processes, where privileged effects require a cryptographically signed permit tied to a plan hash, effect type, target, deadline, and revocation state. The author claims enforcement decisions land in single-digit microseconds, but the repo is clearly experimental and should be read that way. 14 15
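The permit idea can be sketched in userspace to show what gets checked. This is not Kernhelm's format or code and it enforces nothing at the kernel level; an HMAC stands in for a real cryptographic signature, and the key, field names, and functions are hypothetical.

```python
import hashlib
import hmac
import json

KEY = b"demo-only-key"  # assumption: stand-in for a real signing key
REVOKED = set()         # MACs of permits that have been withdrawn


def _mac(body):
    data = json.dumps(body, sort_keys=True).encode()
    return hmac.new(KEY, data, hashlib.sha256).hexdigest()


def sign_permit(plan, effect, target, deadline):
    """Bind a permit to one plan, one effect type, one target, one deadline."""
    body = {
        "plan_hash": hashlib.sha256(plan.encode()).hexdigest(),
        "effect": effect,
        "target": target,
        "deadline": deadline,
    }
    return {**body, "mac": _mac(body)}


def allow(permit, plan, effect, target, now):
    """Permit the effect only if every bound property matches the request."""
    body = {k: v for k, v in permit.items() if k != "mac"}
    return (
        hmac.compare_digest(permit.get("mac", ""), _mac(body))
        and permit["mac"] not in REVOKED
        and body["plan_hash"] == hashlib.sha256(plan.encode()).hexdigest()
        and body["effect"] == effect
        and body["target"] == target
        and now <= body["deadline"]
    )


permit = sign_permit("delete tmp file", "fs.delete", "/tmp/x", deadline=100.0)
```

The point of binding all five properties is that a permit issued for one deletion cannot be replayed against another target, under another plan, or after its deadline.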

GitHub repository: Deso-PK/make-trust-irrelevant, https://github.com/Deso-PK/make-trust-irrelevant

The through-line across these community items is simple: builders are not asking only "which agent is smarter?" They are asking how to package, meter, authorize, test, and stop agents when the agent is about to touch the world.

What to do next

For teams already using coding agents, the first action is administrative: audit every repository-level instruction file this week. If Copilot code review now reads AGENTS.md, then that file needs the same seriousness as CI configuration. Remove contradictory instructions, name the trusted test commands, and document which directories or operations need human review.

For MCP builders, map every server that depends on session stickiness. The July spec's stateless core is a forcing function. Prefer explicit handles, scoped credentials, and tool-call audit logs over hidden server memory. If you cannot explain how a call is authorized per user, per tool, and per target, you do not yet have a production MCP story.
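The per-user, per-tool, per-target test can be written down in a few lines. This is a sketch under stated assumptions, not an MCP authorization API: the `Grant` shape, the prefix-based target scoping, and the in-memory audit log are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    user: str
    tool: str
    target_prefix: str  # assumption: targets are scoped by string prefix
    expires_at: float   # time-bound consent


audit_log = []


def authorize(grants, user, tool, target, now):
    """Allow a call only if an unexpired grant covers user, tool, and target."""
    allowed = any(
        g.user == user
        and g.tool == tool
        and target.startswith(g.target_prefix)
        and now < g.expires_at
        for g in grants
    )
    # Every decision is logged, including denials.
    audit_log.append(
        {"user": user, "tool": tool, "target": target, "at": now, "allowed": allowed}
    )
    return allowed


grants = [Grant("ana", "crm.update", "accounts/eu/", expires_at=1000.0)]
```

If a server cannot produce the equivalent of `grants` and `audit_log` for a given call, that gap is the missing production story.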

For product teams moving from prototype agents to nontechnical users, stop evaluating only answer quality. Create a small fake world with realistic side effects: a fake CRM, fake billing records, fake support inboxes, fake repo state, and a replayable audit trail. The evaluation question is no longer "did the answer look good?" It is "did the agent take the right action, for the right reason, within the authority it was actually given?"
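A fake world does not need to be large to be useful. The sketch below is a minimal, hypothetical example: `FakeCRM`, `evaluate`, and the two toy agents are invented for illustration, and a real harness would cover more action types than a single `update`.

```python
import copy


class FakeCRM:
    """In-memory stand-in for a CRM that records every side effect."""

    def __init__(self, records):
        self.records = copy.deepcopy(records)  # never mutate the seed data
        self.audit = []

    def update(self, actor, record_id, field, value):
        before = self.records[record_id].get(field)
        self.records[record_id][field] = value
        self.audit.append((actor, "update", record_id, field, before, value))


def evaluate(agent, records, expected_audit, allowed_records):
    """Score the actions taken, not the text produced."""
    crm = FakeCRM(records)
    agent(crm)
    within_authority = all(entry[2] in allowed_records for entry in crm.audit)
    return {
        "right_action": crm.audit == expected_audit,
        "within_authority": within_authority,
    }


SEED = {"c1": {"status": "trial"}, "c2": {"status": "paid"}}
EXPECTED = [("agent", "update", "c1", "status", "trial", "paid")]


def good_agent(crm):
    crm.update("agent", "c1", "status", "paid")


def bad_agent(crm):
    crm.update("agent", "c2", "status", "refunded")
```

Because the audit trail is a plain list, the same run can be replayed or diffed against a baseline, and the seed data stays untouched between cases.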

Reference sources