Step 12 · The toolbelt

Computer Use CLI: non-blocking macOS automation

Some things the loop needs to verify do not live on the web or in a file — they live in a native macOS app: a toggle in System Settings, a value in a Numbers cell, a window that should have moved. The computer-use CLI (alias cu) drives those apps from the shell through the Accessibility API — so an agent can operate an app that has no MCP, or prove a GUI change really landed at the desktop boundary. Its defining promise: it is non-blocking — it never grabs your mouse, never types on your keyboard, never raises the app, and when it cannot do something honestly it says so instead of faking success.

Read the plain layer top-to-bottom; open a panel when you want the exact command, flag, or AX method.

The big idea: drive an app without stealing the desk

Most of the toolbelt reaches the world through text — the shell, files, the web. But a lot of real software is a window with buttons: System Settings, Calculator, Notes, Finder, Mail. To check that a change you made actually shows up there, or to operate an app that exposes no API at all, you need a hand that can reach into the GUI.

computer-use is that hand — but a careful one. It talks to apps through macOS's built-in Accessibility API: the very same channel a screen reader uses to know "there is a button here labelled Save". Through that channel it can read what's on screen (every element, its role, its label, its position), press a button, and set the text of a field.

What it pointedly does not do is move your physical mouse or tap your keyboard. A normal automation tool fakes input: it warps the cursor to a spot and synthesises a click, so your pointer jumps and the app you were typing in loses focus. computer-use refuses that. It asks the app directly — "press your Save button" — and the app does it, while your cursor sits exactly where you left it. You can keep working in another window the entire time.

The catch is the honest part. The Accessibility channel can't express everything a real mouse can — there is no way to drag-select a region of empty canvas, for instance. When you ask for something AX can't do, the CLI returns a plain error rather than pretending it worked. For the loop, that honesty is the whole value: a tool that never lies about what happened is a tool you can build a Proof Gate on.

Think of it like… a building with an intercom at every desk. A pushy courier would walk in, grab your hand, and make you sign for a package — that's synthetic input. computer-use instead presses the intercom and asks the person at the desk to sign. You never feel a thing, your own pen stays in your hand — and if there's no intercom at a particular desk, the courier tells you "can't reach that one" rather than forging a signature. Where the analogy breaks: there's no human in the loop here; the "person at the desk" is the app itself answering an accessibility request.

What it is, concretely

The package is @appfyai/computer-use-cli (repo github.com/appfyai/computer-use-mcp). It drives apps through the macOS Accessibility API via a small Swift native bridge. It is a one-shot CLI any agent can call from the shell: computer-use <cmd> or the alias cu <cmd>. As of 1.0.7 the native bridge ships bundled in the package, so a global install runs from any directory with no checkout.

Where it sits in the toolbelt

Three tiers, by target. A web page → the browser tools (claude-in-chrome / agent-browser), which are DOM-aware. An app with its own dedicated MCP or API (Slack, Linear, Calendar) → use that. A native macOS app with no MCP, or a GUI change you must verify by actually driving it → that is exactly this CLI's lane. It is macOS-only and requires the system Accessibility permission plus a per-app grant.

Why "no synthetic input" is a design choice, not a limitation to route around

There are zero synthetic CGEvent mouse/keyboard injections anywhere in the tool. That is deliberate and permanent: the operator can keep using their machine while an agent drives other apps in the background. The cost — AX can only do what it exposes — is paid back as trust: a non-ok result genuinely means the action did not happen, which is precisely what a verification gate needs.

The non-blocking guarantee, in one picture

Here is the single idea the rest of the lesson hangs on. Both a normal automation tool and computer-use can end up pressing the same Save button. The difference is the path they take to get there — and what that path costs you.

Same destination, two routes. Synthetic input goes through your cursor and hijacks it; the Accessibility API goes around it, straight to the button.

The three promises, stated precisely

(1) No cursor movement. An AX press targets an element, not a screen point — your physical pointer never moves. (2) No keystrokes. type sets an element's AXValue; it does not synthesise key events, so it never lands in whatever field you happen to have focused. (3) No raise. The target app is not activated or brought to the front; it can act while minimized or behind other windows. Together these are why the operator can keep working uninterrupted.

The honest-failure corollary

Because there is no synthetic fallback, anything AX cannot express simply fails. A drag across empty space, a key-combo, sending Enter — none have an AX expression, so the CLI returns an explicit error string. There is no silent best-effort. This is the property the Proof Gate relies on.

AX press or honest error? Walk the decision

When you ask computer-use to do something, it runs a short series of yes/no questions before it touches the app. If every gate is a "yes", it performs the action through Accessibility and reports ok. The first "no" stops it — and instead of faking a result, it returns an honest error.

Pick an action below, then press Next to follow it through the gates. Watch a clean scroll request refuse at the last gate rather than move your real cursor.

Read top → bottom. Each diamond is a gate; the first "no" branches right to an honest error, a clean run reaches an AX action that returns ok.


The gates, mapped to real behavior

Granted? Until you run grant --app <Name>, commands return {"permission_required": true}.

Element in tree? click and type resolve a target from the accessibility tree (by --element-index, or by --x/--y mapped to the AX element at that point); a non-writable or missing target fails.

AX can express it? This is where the honest error lives. click = ax_press; type = ax_value; window-title-bar drag = ax_window_position; slider drag = ax_slider_value. A free-form scroll or marquee drag has no AX expression, so the CLI returns an explicit error such as "Could not perform drag via accessibility API" rather than moving your real cursor to fake it.
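The last gate amounts to a lookup from the requested action to an AX method. A minimal shell sketch of that mapping follows. The function name and the action keys window-drag and slider-drag are hypothetical labels chosen for illustration; the CLI's real internal dispatch may look quite different.

```shell
# Hypothetical helper: map a requested action to the AX method the lesson
# names for it. Prints the method and returns 0, or returns 1 when AX has
# no way to express the action (the "honest error" branch).
ax_method_for() {
  case "$1" in
    click)        echo "ax_press" ;;
    type)         echo "ax_value" ;;
    window-drag)  echo "ax_window_position" ;;   # title-bar drag
    slider-drag)  echo "ax_slider_value" ;;
    *)            return 1 ;;                    # e.g. free scroll, marquee drag, key combo
  esac
}

for action in click slider-drag scroll; do
  if method=$(ax_method_for "$action"); then
    echo "$action -> $method"
  else
    echo "$action -> no AX expression, refuse"
  fi
done
```

The point of the default branch is that there is no fallback: an unknown action is refused instead of being routed to synthetic input.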

Cursor freedom, side by side

The flowchart told you AX goes around your cursor. This lets you feel it. Below is a tiny stand-in for your screen: your own pointer (the arrow) rests in a text note while an agent presses Save in another window.

Try both tools. With synthetic input, every action yanks the pointer across the screen and your note loses focus. With computer-use (AX), the same actions land — and the pointer never twitches. Watch the readout and the arrow.

The black arrow is your physical pointer. Under AX it stays put; under synthetic input it jumps to the button.

Active tool: computer-use (AX). Actions go through the Accessibility API. Your pointer and keyboard are yours; the app responds without being raised.

Why focus survives

A synthetic click is a real CGEvent delivered to the window under the cursor — which means the OS first moves the cursor there and changes the key window, stealing focus from wherever you were typing. An AX press is a message to a specific AXUIElement ("perform your default action"); the windowing system's pointer and key-focus are never consulted, so your caret keeps blinking in your note. The same holds for type: setting AXValue writes straight into the target element without a focus change or a single key event.

The trade you are accepting

You give up the ability to do things only a real pointer can (drag-select empty canvas, hover tooltips, OS-level shortcuts). In exchange the tool is safe to run unattended on a machine someone else is using — the defining requirement for an AFK loop that verifies GUIs in the background.

The per-app grant model

Reaching into apps is powerful, so authorization is deliberately narrow. There are two locks, not one. First the system-wide macOS Accessibility permission (granted once, in System Settings). Then — separately — a per-app grant: you explicitly authorize each app you want to drive. Until an app is granted, every command against it returns permission_required and does nothing.

So an agent that has been allowed to drive Calculator cannot suddenly start poking at Mail. Each app is its own door with its own key.

Think of it like… a building keycard. Getting hired (the system permission) lets you into the lobby. But your card is then programmed for specific rooms — the second-floor lab, not the server cage. A new room means a new authorization, granted on purpose. Where it breaks: there's no central security desk watching; the grant list is just local config you control.

One system permission unlocks the CLI; each app is then granted individually. An ungranted app refuses every command.

The grant commands

computer-use grant --app <Name|bundleId> authorizes one app (returns {"status":"granted"}). computer-use permissions lists what's granted. computer-use revoke --app <Name> takes it back. A permission_required response is the cue to grant and retry.
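The grant-and-retry cue can be sketched as a small wrapper. This is an illustration under assumptions: the helper names needs_grant and state_with_grant are hypothetical, the CU variable is introduced here so the binary can be swapped for a stub, and the response is matched as text rather than parsed, to keep the sketch dependency-free.

```shell
# Assumption: CU names the CLI binary; a dry run can point it at a stub.
CU="${CU:-computer-use}"

# True when a response carries the permission_required flag.
needs_grant() {
  compact=$(printf '%s' "$1" | tr -d '[:space:]')
  case "$compact" in
    *'"permission_required":true'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper: read state; on permission_required, grant once and retry.
state_with_grant() {
  app="$1"
  out=$("$CU" state --app "$app")
  if needs_grant "$out"; then
    "$CU" grant --app "$app" >/dev/null
    out=$("$CU" state --app "$app")
  fi
  printf '%s\n' "$out"
}
```

Usage would look like state_with_grant Calculator. Whether an agent should grant automatically is a policy decision; an unattended run may prefer to stop and report the permission_required response instead.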

Two independent authorizations

The system Accessibility grant (System Settings → Privacy & Security → Accessibility) is what macOS itself enforces for any AX client. The CLI's per-app grant is a second, finer layer the tool maintains on top: it scopes which apps this CLI will act on, so a broad system permission doesn't translate into "drive everything". The two are separate — having the system permission is necessary but not sufficient; you still grant each app.

Naming gotcha

Use the EXACT app name macOS reports — it may be localized (Calculadora vs Calculator). computer-use list-apps shows the controllable running apps under the names you must pass. You can also grant by bundle identifier when the display name is ambiguous.

Why this matters for the loop's authorization tiers

Because each app is opt-in, the grant list is a natural safety boundary: an AFK run can be allowed to verify a sandbox app while every sensitive app stays ungranted and therefore untouchable. Destructive in-app actions (delete, send, purchase, sign-out) remain user-gated regardless of grant.
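One way to make that boundary explicit in an unattended run is to guard every grant behind an allowlist. The sketch below is an assumption about how a loop might do this, not something the CLI provides: may_grant and the two sandbox app names are hypothetical.

```shell
# Hypothetical allowlist for an AFK run: only these apps may ever be granted.
# A case pattern is used so names containing spaces would also match cleanly.
may_grant() {
  case "$1" in
    Calculator|TextEdit) return 0 ;;
    *) return 1 ;;
  esac
}

for app in Calculator Mail; do
  if may_grant "$app"; then
    echo "$app: ok to grant"        # here the run would call: computer-use grant --app "$app"
  else
    echo "$app: stays ungranted"
  fi
done
```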

Reading the accessibility tree to find an element

Before computer-use can press a button, it has to find it. It does that by reading the app's accessibility tree — a structured list of every element on screen, each with a role (AXButton, AXTextField), a label, a position, and a numeric index. You ask for that snapshot with state, locate your target, and use its index (or its x/y) to act.

Here is that whole pipeline, from "which apps can I even touch" to "the click landed", as five hand-offs.

Read left → right: list-apps → grant → state → find index → click, then a second state read confirms it landed.

1. List the apps you can touch (list-apps)

First, see what's controllable. computer-use list-apps returns the running apps you can drive, under the exact names macOS reports. You'll pass one of those names to every later command — copy it verbatim, because it can be localized.

2. Grant the one you want (grant --app)

Authorize that single app: computer-use grant --app Calculator. Now — and only now — commands against it will run. Skip this and state comes back as permission_required and nothing happens.

3. Read the accessibility tree (state --app)

computer-use state --app Calculator returns a JSON snapshot: a nested tree of nodes, each with a role, label, frame (its on-screen rectangle), and an index. This is your map of the window — the same map a screen reader sees.

4. Find your element's index (role · label)

Scan the tree for the node you want — say the AXButton labelled "=" — and note its index. That number is the handle you'll act on. Remember: indices drift as the UI changes, so read state fresh right before you click.

5. Click it, then read back (click → state)

computer-use click --app Calculator --element-index <n> performs an AX press. Then read state a second time and check the display element changed. That second read is the proof — you never assume; you confirm.

The shape of a state snapshot

computer-use state --app Calculator (abridged JSON)

# the tree is a JSON string of nested AX nodes
{
  "tree": {
    "role": "AXWindow", "label": "Calculator",
    "children": [
      { "role": "AXStaticText", "index": 3,  "value": "0" },        # the display
      { "role": "AXButton",     "index": 17, "label": "7" },
      { "role": "AXButton",     "index": 24, "label": "=" },     # ← the target
      ...
    ]
  },
  "screenshot": "<base64 …>"
}

Access it

Run computer-use state --app Calculator and pipe it through a JSON tool to scan roles: computer-use state --app Calculator | jq '.tree'. Each node carries role, index, frame, and a label/value; you act on the index (or --x/--y within its frame).

The full command surface is documented in the skill: open it with grep -n "Commands" ~/.claude/skills/computer-use-cli/SKILL.md.

Verify a GUI change at the real boundary

This is where computer-use earns its place in the loop. The loop's rule is: never claim something works — prove it at the real boundary. For a code change, the boundary is a test run. For a GUI change, the boundary is the app itself. computer-use lets the VERIFY step actually drive the app and read the result back, instead of trusting a screenshot or a "should be fine".

The pattern is a tiny loop of its own: read the tree to capture the before-state, act through AX, read again, and compare. If the second read matches what you intended, the gate passes. If the action returned an error, the gate fails — honestly.

The same read → act → read → compare cycle the loop uses everywhere — here the boundary is the live app, read through Accessibility.

Why a screenshot alone isn't a gate

A screenshot proves pixels rendered, not that state is correct — and a model reading its own screenshot can hallucinate the value it hoped to see. Reading the accessibility tree returns the element's actual AXValue as a string you can assert on programmatically. That is a real-boundary check: the claim "the toggle is on" is backed by the OS's own accessibility data, not by a confident description.

Trust the honest error

If the act step returns a non-ok result, the gate fails immediately — no "compare" needed. Because the tool never fakes success, a failed action can never masquerade as a passed gate. This is exactly the property a verifier needs: silence is never mistaken for success.
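The two conditions, "the action reported ok" and "the read-back matches", can be folded into one verdict function. A minimal sketch follows; gate is a hypothetical name, and the result is matched as text rather than parsed as JSON, so the example needs nothing beyond a POSIX shell.

```shell
# Hypothetical gate: passes only when the action reported ok AND the
# read-back value equals the intended one.
#   $1 = result of the act step, $2 = value read back, $3 = expected value
gate() {
  compact=$(printf '%s' "$1" | tr -d '[:space:]')
  case "$compact" in
    *'"result":"ok"'*) ;;                                  # reported ok: go on to compare
    *) echo "FAIL: action did not report ok"; return 1 ;;  # honest error: no compare needed
  esac
  if [ "$2" = "$3" ]; then
    echo "PASS: read-back is '$2'"
  else
    echo "FAIL: read-back is '$2', expected '$3'"
    return 1
  fi
}

gate '{"method":"ax_press","result":"ok"}' "42" "42"
gate '{"method":"ax_press","result":"ok"}' "7"  "42" || true
gate 'Could not perform drag via accessibility API' "" "42" || true
```

The ordering is deliberate: a non-ok result short-circuits before any comparison, and an ok result is still only provisional until the read-back agrees.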

State readback: proving the click landed

This is the VERIFY step made concrete. Below is a live readout of an app's accessibility state — four key elements and a per-element table. It starts in the before state. Press Run AX click to perform an ax_press on the "=" button, then watch the readout re-read the tree: the values that changed light up, and each row flips to match once its readback confirms the intended value.

That second read — not the click — is the proof. Hit Reset to return to the before-state and run it again.

[Interactive readout: Accessibility state for Calculator (app granted; read via computer-use state --app Calculator). Tiles: AXStaticText display, last AXButton pressed, tree nodes, last result. Per-element readback table columns: Element | AXValue / label | Readback.]

One model, two reads

A single object holds the app's element values. The first render is the before snapshot. Pressing the button mutates the model the way an ax_press on "=" would (the display recomputes, the "last pressed" element updates), then the readout simulates a second state read and diffs it against the intended result. Rows turn to match only when the read-back value equals what the action was supposed to produce — that is the gate, not the click.

Mapping to the real CLI

In a real run: state (capture before) → click --element-index <n> (the press, returns {"method":"ax_press","result":"ok"}) → state (capture after) → assert the display element's value. The "last result" tile is the CLI's own ok/error; the table is your own comparison. Both must agree for the gate to pass.

The honest-error report: when it refuses rather than lie

Here is a run where an agent asked computer-use to do four things — and one of them was a scroll across an empty canvas, which Accessibility cannot express. A lesser tool would have moved your real cursor to fake a scroll. computer-use refused and returned an error. Read this like an incident report: a timeline of what it did, the blast radius (zero — your desk was never touched), and the findings to acknowledge.

The point isn't that something "failed". The point is the failure was honest: the tool told the truth so the loop could route around it (use the browser tools for that scroll, or pick an AX-expressible action) instead of building on a lie.

NON-BLOCKING

Run report: a scroll that refused to move your cursor

App: Preview (granted)
Actions requested: 4
Cursor moved: never
Keys synthesized: 0
Honest errors: 1

What the tool did, in order

t+0.0s  state --app Preview  [ax · ok]
Read the accessibility tree. Found the toolbar, the page view, and a zoom slider. No cursor moved; Preview was not raised.

t+0.3s  click --element-index 8 (Zoom In button)  [ax · ok]
ax_press on the toolbar button. Returned {"result":"ok"}. A follow-up state confirmed the zoom value rose.

t+0.7s  drag --element-index 12 (zoom slider) → 150%  [ax · ok]
Slider drag resolves to ax_slider_value, an expressible AX action. Set the value; read-back confirmed 150%.

t+1.1s  scroll the page canvas down 400px  [honest error]
A free scroll over empty canvas has no Accessibility expression. To do it, the tool would have had to move your real cursor onto the canvas and synthesize a wheel event, which is exactly what it must never do. It refused.

t+1.1s  returned: "Could not perform scroll via accessibility API"  [routed around]
A plain, non-ok result. The agent now knows to route this step elsewhere, not to assume it happened.

Three AX-expressible actions succeeded; the one that would have required moving your cursor refused — honestly.

Blast radius (the non-blocking proof)

Cursor moves: 0
Keystrokes sent: 0
Apps raised: 0
Honest errors: 1

Findings to acknowledge

The refused scroll is a true AX limit, not a bug to file. [finding · expected behavior · honest]
Route the scroll elsewhere: browser tools for a web view, or act on an AX-expressible element. [action · re-plan the step · re-plan]
Treat each ok as provisional until a state read confirms it; never assume. [discipline · Proof Gate · honest]

The exact failure surfaces

Two documented honest errors: a free-form / marquee drag returns Could not perform drag via accessibility API (AX can't express a region drag), and a type against a non-writable target — e.g. Calculator's display, which is AXStaticText, not a text field — fails rather than pretending to set it. type also replaces a field's whole value (it's ax_value, not keystroke entry): it can't append, and there's no key-combo / Enter / Tab command at all.

A known caveat to verify against

Honesty has one rough edge worth knowing: some sliders can report ok for a drag without the underlying state actually changing (observed on a System Settings volume slider, non-deterministic). The tool does check that the AX value changed, but a slider whose AXValue isn't bound to observable state can read as "changed" anyway. The lesson: when a result matters, confirm it against an independent signal, which is exactly the read-back gate from "State readback: proving the click landed".

In the code: one verified GUI check, end to end

Putting it together: here's a complete VERIFY step that uses computer-use to prove a Calculator computation lands — grant, read, act, read back, assert. Read it top to bottom and you'll recognize every command from the lesson. Notice the last line: it doesn't trust the ok; it re-reads state and asserts the display.

verify/calc-gui-check.sh (a Proof-Gate step for the loop)

#!/usr/bin/env bash
# PROVE: pressing "=" after "7 × 6" shows 42 in Calculator's display.
# Non-blocking: the operator can keep working the whole time.
set -u

computer-use grant --app Calculator                 # 1 · per-app authorization

before=$(computer-use state --app Calculator)         # 2 · read the tree (before); keep it to diff against "after" when debugging

# 3 · drive it: each click is an ax_press, no cursor moves
computer-use click --app Calculator --element-index 17   # 7
computer-use click --app Calculator --element-index 22   # ×
computer-use click --app Calculator --element-index 14   # 6
res=$(computer-use click --app Calculator --element-index 24) # =  → {"method":"ax_press","result":"ok"}

# A non-ok result fails the gate immediately; no compare needed.
if ! printf '%s' "$res" | jq -e '.result == "ok"' >/dev/null 2>&1; then
  echo "FAIL · click on = returned: $res"
  exit 1
fi

after=$(computer-use state --app Calculator)          # 4 · read the tree (after)

# 5 · the GATE: don't trust "ok"; assert the real AXValue
display=$(printf '%s' "$after" | jq -r '.tree.children[] | select(.role=="AXStaticText") | .value')
if [ "$display" = "42" ]; then
  echo "PASS · display reads 42 at the real boundary"
else
  echo "FAIL · display reads '$display' (expected 42)"    # honest, not assumed
  exit 1                                                  # non-zero exit so the loop sees the failure
fi

Run it yourself

Element indices are illustrative — read your own with computer-use state --app Calculator | jq '.tree' and substitute, because indices drift between UI states. Re-read state right before each click if the layout might change.

The full command reference

Every command, flag, and AX method lives in the skill file. Open it with grep -nE "Commands|Known issues|Rules" ~/.claude/skills/computer-use-cli/SKILL.md. Local-checkout invocation, the bundled-bridge path, and the registry install are all documented there.

Quick check: did the model land?

Three quick questions. Pick one answer in each — it grades on click, and tells you why.

What makes computer-use "non-blocking"?

You ask it to scroll an empty canvas. What happens?

How do you prove an AX click actually landed?

You are not done learning here — I am your teacher for this. Ask me to trace a different action through the gate flow, to map state → click → state onto an app you actually use, or to wire this CLI into a Proof-Gate step in your own loop. Next up: visual-teach — the engine that built this very course.