March 19, 2025

How We 10x’d Our Agent in a Week

Recently, we were approached by a customer with a bold request: generate a massive dataset. At first glance, it seemed almost comically out of reach. Based on our infrastructure at the time, some quick back-of-the-napkin math estimated the inference cost alone would land somewhere around $500,000. Not exactly budget-friendly.

But instead of turning the project down, we took it as a challenge. And it paid off, literally. After a week of intense engineering, we managed to speed up our agent by 10x while cutting our costs by roughly 16x. Here's how we did it.

1. Rethinking the Problem Space

We started by revisiting the nature of the task. This wasn’t a general-purpose language or vision task that required the full capability of a massive LLM or VLM. In fact, the problem was relatively constrained at inference time. That gave us the confidence to aggressively distill and quantize our model without losing meaningful performance. With a smaller, faster model tailored to the specific task, we unlocked significant efficiency gains right out of the gate.
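
As a rough illustration of the quantization half of that change, here's a minimal sketch using PyTorch's post-training dynamic quantization. The model name is just a placeholder, not our production model, and our actual distillation pipeline isn't shown.

```python
# Minimal sketch: post-training dynamic quantization in PyTorch.
# "distilbert-base-uncased" is a stand-in; the real model is a smaller,
# task-specific distillation of our agent's model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Quantize the linear layers to int8 weights; activations stay in fp32.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("example page text to classify", return_tensors="pt")
with torch.no_grad():
    logits = quantized(**inputs).logits
print(logits.shape)
```

The win comes from running a model sized to the task, then shrinking its weights further; the quantized linear layers are both smaller in memory and faster on CPU inference.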

2. Bringing the Infrastructure Closer to the Metal

The milliseconds spent sending signals across the country add up in a significant way. Our agent uses a browser to navigate and extract data, but the distance between our browsers hosted in California and our inference servers in Virginia caused real latency issues. To counter that, we moved our browser infrastructure closer to our inference servers.
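
If you want to see what that distance costs, here's a minimal sketch of how you might measure it: median round-trip latency from an inference host to browser endpoints in different regions. The hostnames are placeholders, not our actual infrastructure.

```python
# Minimal sketch: compare round-trip latency to browser endpoints in
# different regions. Endpoints below are placeholders.
import time
import statistics
import requests

ENDPOINTS = {
    "us-west (California)": "https://browsers-us-west.example.com/health",
    "us-east (Virginia)": "https://browsers-us-east.example.com/health",
}

def round_trip_ms(url: str, samples: int = 10) -> float:
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        requests.get(url, timeout=5)
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)

for region, url in ENDPOINTS.items():
    print(f"{region}: {round_trip_ms(url):.1f} ms median round trip")
```

Each agent step involves multiple browser-to-inference round trips, so shaving tens of milliseconds per hop compounds quickly across millions of pages.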

3. Caching Where It Counts

Another major unlock: caching our embedding lookups. When you're running hundreds of H100s across millions of websites, small delays compound into massive ones at scale. Precomputing and caching repeated operations let us reclaim a ton of compute and cut redundant work across the system.
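
Here's a minimal sketch of the pattern, with `embed_remote` standing in for whatever embedding call your pipeline actually makes. The cache is keyed on a hash of the input text so identical text is never embedded twice; an in-process dict is shown, but the same idea applies to Redis or any shared key-value store.

```python
# Minimal sketch: cache embedding lookups so repeated text is never re-embedded.
import hashlib

_cache = {}  # maps sha256(text) -> embedding vector

def embed_remote(text: str):
    # Stand-in for the real (expensive) embedding call to the model server.
    return [float(b) / 255 for b in hashlib.sha256(text.encode("utf-8")).digest()[:8]]

def cached_embedding(text: str):
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = embed_remote(text)  # only pay for the first lookup
    return _cache[key]

# Second call for the same text hits the cache instead of the model server.
print(cached_embedding("pricing page for Acme Corp"))
print(cached_embedding("pricing page for Acme Corp"))
```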

The Result

Between model compression, smarter infrastructure, and targeted caching, we drove our estimated dataset cost down from $500K to ~$30K. That’s not just a cost reduction—it’s a fundamental shift in how we think about large-scale agent deployment.

Big shout out to Taichi Kato and Alex Goldstein for absolutely crushing the optimization work. The team really went deep here, and it shows.

What’s Next?

Well... we’re now running into new bottlenecks—namely, Google Search. We’ve hit rate limits so hard that searching itself is now our slowest step. So we’re implementing spillover to alternate providers to keep the pipeline humming.
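
For the curious, here's a minimal sketch of that spillover pattern, with hypothetical provider stubs; the real version calls actual search APIs and treats HTTP 429 responses as the rate-limit signal.

```python
# Minimal sketch: spill search traffic over to alternate providers when the
# primary is rate limited. Provider functions are hypothetical stand-ins.
class RateLimited(Exception):
    pass

def primary_search(query: str):
    # Stand-in for the primary provider; would raise RateLimited on HTTP 429.
    raise RateLimited

def fallback_search(query: str):
    # Stand-in for an alternate provider.
    return [f"result for {query!r} from fallback provider"]

PROVIDERS = [primary_search, fallback_search]

def search_with_spillover(query: str):
    last_error = None
    for provider in PROVIDERS:
        try:
            return provider(query)
        except RateLimited as err:
            last_error = err  # try the next provider
    raise RuntimeError("all search providers are rate limited") from last_error

print(search_with_spillover("structured data extraction"))
```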

Want to dive deeper into the technical details? We’re happy to chat more—reach out to me at alex@structify.ai.

-Alex Reichenbach, Founder