AI Scientists Are Here

§ 01 Existing

The Existing Workflow

I was investigating how neural networks represent features—whether they exist as directions in latent space or live on higher-dimensional manifolds. Twelve experiments, serious mechanistic interpretability work, all requiring GPU compute.

Claude Opus 4.6 and I spent time co-designing the experimental spec together. We iterated on the methodology, defined the experiment parameters, worked through the analysis pipeline. Once the plan was solid, I handed implementation off to a multi-agent orchestration rig called Gas Town. Agents run as crew members in tmux panes, coordinated by a central dispatcher called the mayor, with many other roles maintaining order and efficiency in the town of agents.

The code got written. It also had bugs—and these bugs were invisible locally because the experiments required an A100 to run at any reasonable speed. So the debug cycle looked like this: run on Modal, see the traceback, copy the error back to Claude, wait for a fix, commit, push, pull on the remote, rerun. And many of the runs took minutes before they would throw an error. This would have been a full day of me acting as a human clipboard between the agent and the compute.

§ 02 Change

The Change

I decided to automate the bottleneck. Instead of mediating between Claude and the GPU, I gave Claude direct access.

The setup was four tmux panes managed by a single Claude Code session running Opus 4.6. One pane launched a Modal instance—an A100 with 40GB VRAM. The rest SSH'd into that instance. Claude could now write code locally, commit and push fixes, pull them on the Modal box, and rerun experiments—all without waiting for me. The feedback cycle collapsed from minutes to seconds.

This is not a novel architecture. It is Claude Code, tmux, Modal, and SSH. The key move was removing myself from sitting between the agent and its compute.

§ 03 Nine

Nine Hours

Claude ran for nine hours nearly continuously, and it managed three parallel experiment streams simultaneously across separate tmux panes on the Modal instance.

Some processes took seconds. Others took minutes. Others took hours. Claude set its own timeouts and queues, managing the parallelism without being told how. When an experiment threw an error, Claude diagnosed the bug, implemented a fix with access to a full Gas Town rig behind it, committed and pushed the changes, pulled them on the Modal instance, and reran the experiment. Then it moved on to the next one. It did not need external prompting to keep going—it just ran.

The only time I intervened was when one experiment's estimated runtime came in at around eleven hours, which would have run all night. Claude flagged it before I did; this was the only time I stepped in from my observational role. We talked through it and decided on subsampling to fit within my ad hoc compute budget. I had not specified runtime constraints upfront—that was on me. One scope decision in nine hours.

§ 04 What

What It Found

I am not going to detail the specific findings here—that is a separate writeup. What I will say is that the results are genuinely novel and nontrivial. They hint at something unexpected about how neural networks represent abstract concepts, and they represent meaningful progress: a stepping stone toward a meaningful thesis. In other words, it was progress. And I didn't need to be there.

The point of this essay is not the content of the discoveries. It is that an AI agent ran a real research program autonomously and produced results worth building on.

§ 05 Intervention

The Intervention as Signal

I want to investigate what that single scope decision means, because it cuts both ways.

On one hand, the collaboration model works well. I manage constraints—session duration, compute budget, scope—while the agent does the science. I did not need to understand every tangent-space nuance to know that subsampling was the right tradeoff. Constraint management and scientific reasoning are genuinely separable.

On the other hand, it is a current limit. Claude identified the runtime and was happy to run it for eleven hours. I had underspecified my constraints—something I've since adapted. Whether that is a fundamental limitation (in the same way we never really know just how much we have to enumerate to fully specify a halting condition) or just a trust calibration issue, I am not sure yet. It is a solvable problem either way.

One way to frame the paradigm shift is that I have promoted myself to PI. I design the research program, set the constraints, make scope calls when they come up. The agents are my grad students—tireless, fast, and surprisingly capable, if occasionally in need of a nudge on resource management. The difference is that these particular cough indentured servants cough research assistants do not need sleep, do not context-switch, and can run three experiment streams—since then, more—in parallel without complaining. It is a better lab than most PIs get.

(If you're a PI and want to chat, find me at @foundinrome.)

Nine hours of autonomous research with one scope decision is not "AI-assisted research." It is AI research with human oversight. The emphasis belongs on the first term.

§ 06 How

How to Set This Up

A few assumptions before the prompts. This worked because the codebase was well-spec'd—Opus and I had written a thorough experimental spec with justifications for each experiment, and Gas Town one-shotted the implementation from that spec. I also had a tested Modal launch script (modal run main.py) from prior work, so the remote compute side was already proven. Claude can probably build that script for you too, but it helps to have the plumbing sorted before you ask it to do science. If you are running vanilla Claude Code rather than a multi-agent rig like Gas Town, the same approach applies—you just need Claude Code, tmux, and outbound egress.

Here are the prompts I used, roughly in order. Your mileage may vary. These certainly aren't optimal, but they just might work.

Getting started (first time only—SSH key setup and Modal auth):

This runs once to configure SSH keys and authenticate with Modal. Claude handles the setup end-to-end; you just log in when prompted.

Starting each session (launch instance and SSH in):

This is the session-start prompt. Claude spins up the Modal instance, then SSH's into it from the remaining panes—ready to run code.

Key reminders (scope and resource constraints):

This is the operating brief. It tells Claude how to prioritize, when to keep moving, and when to escalate. The runtime budget and check-in condition are the two most important constraints to set explicitly.

The architecture boils down to: Claude Code in one pane, Modal instance launched from another, SSH session in a third (or more). Claude manages all of them. The critical insight is that the agent needs direct access to the compute environment where its code actually runs—and even if it's on a remote instance, that's totally possible. Everything else is infra.

§ 07 What

What This Means

You can do this now. Claude Code, tmux, Modal—that is the stack. No proprietary infrastructure, no custom frameworks, no research lab required. The key architectural decision is giving the agent direct access to compute, not mediating between the agent and the hardware. If your agent can write code but cannot run it, you are the bottleneck.

The research paradigm is shifting. A single researcher with the right rig can run systematic experimental campaigns that previously required a team and weeks of manual iteration. Twelve experiments across an entire network's geometry, ablation studies, follow-up probes—this is a research agenda, not a weekend project. It ran in one session.

A word on cost. Both Opus API credits and Modal A100 time burn through money fast. I do not have exact numbers for this session, but nine hours of Opus plus GPU compute is not trivial. This is a tool for researchers who are willing to trade dollars for time compression. If your time is the scarce resource, the economics probably work. If you are exploring casually, they probably do not.

The capability exists right now, with tools you can set up this afternoon. The question is what research program you hand off next. Karpathy's autoresearch—a similar approach for overnight training experiment loops—just dropped, and I'm looking forward to experimenting with it.