Are AI web agents a gimmick?

Using large language models (LLMs) to brute-force their way through web applications is the wrong approach. At this stage in the AI game, it feels like trying to cram a square peg into a round hole. Yet it also happens to be exactly what web AI agents these days seem absolutely determined to do 🤷🏻.

I know, because I tried it too 🫣.

Don't get me wrong, today's web-based AI agents are cool for putting together quick demos that look extremely impressive on Twitter. Agents are also awesome for getting VC juices flowing 🤑.

But, as of today, available agents (that is, the ones you can sign up for and use) are more of a gimmick than anything else. They barely get the job done.

Sure, if you happen to be on the exact page you are supposed to be on…

And you craft your prompt precisely how the developer intended you to…

And if all the stars are otherwise in perfect alignment…

Then, maybe they’ll do what you asked them to. Maybe.

AI web agents aren’t in this state because the idea is bad or because the people trying to make it work aren't smart enough. Quite the contrary: they are some of the best and brightest!

No, the problems lie elsewhere.

The problem(s) with AI web agents

The overall approach is flawed on a lot of levels:

User experience ≠ agent experience

Websites are designed for humans (not 🤖). And because we’re easily overwhelmed, functionality is often buried under layers of user interface (UI) hierarchy. That layering makes intuitive sense to people, not so much to robots, and it makes it borderline impossible for an agent to navigate any given UI consistently and correctly.

You could always hack the agent by prompting your way out of any particular UI idiosyncrasy, adding the precise skills for a specific app, etc., but then what’s the point? You completely lose the generality you were looking for and start facing an entirely different problem: building and maintaining an extremely long tail of prompt-hacking skills ☭.

Too fast, too slow

Web UIs are, in general, SLOW (did I mention that they are designed for humans?). How long does it take to click through and populate 15 columns in a spreadsheet row? What if you have to switch tabs in the process? What if you have to repeat for 700 rows?

Also, driving a page through its Document Object Model (DOM) is very brittle 🔨. Markup changes all the time, and web-based AI agents end up triggering bot protections and CAPTCHAs. Many web applications are specifically designed to prevent this kind of automation 🛑.

So you're very much going against the grain.
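Here is a rough sketch of what "brittle" means in practice. The two page snippets and the `btn-save` class name are made up for illustration; the point is only that a locator tied to markup stops matching after a routine front-end release, even though the button a human sees has not changed.

```python
from html.parser import HTMLParser


class ButtonFinder(HTMLParser):
    """Collects (class attribute, visible text) for every <button>."""

    def __init__(self):
        super().__init__()
        self.buttons = []
        self._in_button = False
        self._current_class = ""
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "button":
            self._in_button = True
            self._current_class = dict(attrs).get("class") or ""
            self._text = []

    def handle_data(self, data):
        if self._in_button:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "button" and self._in_button:
            label = "".join(self._text).strip()
            self.buttons.append((self._current_class, label))
            self._in_button = False


def find_buttons(html):
    finder = ButtonFinder()
    finder.feed(html)
    finder.close()
    return finder.buttons


def has_button_with_class(html, class_name):
    """The kind of markup-based locator an agent or script might rely on."""
    return any(class_name in cls.split() for cls, _ in find_buttons(html))


# Hypothetical markup before and after a front-end release.
PAGE_V1 = '<div><button class="btn-save primary">Save</button></div>'
PAGE_V2 = '<div><span><button class="css-1x9ab">Save</button></span></div>'

print(has_button_with_class(PAGE_V1, "btn-save"))
print(has_button_with_class(PAGE_V2, "btn-save"))
print(find_buttons(PAGE_V2))
```

The first check finds the button and the second does not, while the visible label is still "Save" in both versions.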

Like watching paint dry

Who wants to sit in front of a computer and watch 🧐 it slowly do its work? But but but, you can run the agent in the background, RIGHT?

You kinda can't, because web applications behave differently when running or loading in the background. Some may flat-out refuse to load. Others may trigger bot detection measures.

So, unless now is a good time to go get a soda, you’re stuck watching the agent do its thing while you wait to get your computer back.

Reading is not seeing

Last but not least, LLMs are great with the llllllanguage, not so much with cccccclicking. By design, LLMs are great at language-y things (translation, summarization, classification, meaning extraction). In other words, they’re some of the best readers you can find.

“Seeing” and finding the right element in the DOM to click? Not so much.

Sure, there are vision models now. But, again, web sites are designed for humans—not robots. These models aren't all that great at navigating UIs that were built specifically for people.

It would be unfair to say the models are blind, but their vision is a lot more blurry than the industry is willing to admit.

Now what?

Does this mean that LLMs are useless for automation? Does it mean that natural language to action is a stupid idea that’ll never work?

In other words: IS THERE HOPE or are we doomed?

LLMs are amazingly great and incredibly useful for automation when it comes to classification (insert shameless plug of my own work on few-shot text classification), summarization, meaning extraction, and (in general) making automated workflows more intelligent.

Go check out our new Research page while you’re at it.

For example, an LLM works best for telling you whether a person would be a good fit for a job or whether a prospect is right for outreach. It’s not-so-best for moving that person’s data from System A to System B.

Natural language to action is a beautiful idea. It promises to increase productivity and make applications more accessible to a plethora of new users. Don’t get me wrong: I'd love to FINALLY be amazing at Photoshop, AutoCAD or, dare I say, Salesforce.

BUT, there has to be a way to verify the correctness of actions before execution, and it has to be consistent. In other words, you don't want any magic (colloquially referred to as AI) when it comes to doing the actual thing.

The Lollapalooza we all deserve

Instead of magic, what you DO want is to tell the computer what to do, verify that it understood you exactly right 🫡, correct it if you have to, and then have a way to RELIABLY run that instruction again and again ⚙️.
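One way to picture that loop is a minimal sketch like the one below. It assumes the natural-language request has already been turned into a structured plan (by an LLM or by hand), and that the plan is checked against an allow-list before anything runs. The action names, the `Step` type, and the in-memory `store` are all hypothetical stand-ins, not any real product's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str
    params: dict


# Hypothetical allow-list: action name -> exact set of parameter names.
ALLOWED_ACTIONS = {
    "open_record": {"record_id"},
    "set_field": {"record_id", "field", "value"},
}


def validate(plan):
    """Return a list of problems; an empty list means the plan may run."""
    problems = []
    for i, step in enumerate(plan):
        required = ALLOWED_ACTIONS.get(step.action)
        if required is None:
            problems.append(f"step {i}: unknown action {step.action!r}")
        elif set(step.params) != required:
            problems.append(f"step {i}: expected params {sorted(required)}")
    return problems


def run(plan, handlers):
    """Execute a plan deterministically, refusing anything unverified."""
    problems = validate(plan)
    if problems:
        raise ValueError("; ".join(problems))
    return [handlers[step.action](**step.params) for step in plan]


# In-memory stand-in for the target application.
store = {}


def open_record(record_id):
    store.setdefault(record_id, {})
    return f"opened {record_id}"


def set_field(record_id, field, value):
    store.setdefault(record_id, {})[field] = value
    return f"set {field} on {record_id}"


HANDLERS = {"open_record": open_record, "set_field": set_field}

plan = [
    Step("open_record", {"record_id": "A-1"}),
    Step("set_field", {"record_id": "A-1", "field": "status", "value": "qualified"}),
]

print(validate(plan))
print(run(plan, HANDLERS))
print(store)
```

The plan is plain data, so a person can read it, fix it, save it, and run the exact same thing again tomorrow. No model is involved at execution time.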

Web agents are a great idea, but you have to use them for the right task. Are agents great for populating the values of a 23 x 157 Excel sheet? Umm, no! A single API call can do it in under 200 milliseconds with 100% accuracy and 0% drama. That’s the right tool for the job. Not an agent.
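To put rough numbers on the spreadsheet example, here is a back-of-the-envelope sketch. It assumes the sheet is 23 columns by 157 rows, that a UI-driven agent needs about 2 seconds per cell (a made-up figure), and that the target exposes some kind of batch update. The payload shape is hypothetical and does not follow any real spreadsheet API.

```python
import json

COLS, ROWS = 23, 157          # assumed orientation of the 23 x 157 sheet
SECONDS_PER_UI_EDIT = 2.0     # assumed cost of one click-and-type cycle


def build_batch_update(rows, cols, value_for):
    """Build one request body covering the whole grid (hypothetical shape)."""
    return {
        "range": {"start_row": 0, "start_col": 0, "rows": rows, "cols": cols},
        "values": [[value_for(r, c) for c in range(cols)] for r in range(rows)],
    }


payload = build_batch_update(ROWS, COLS, lambda r, c: r * COLS + c)

cell_count = sum(len(row) for row in payload["values"])
ui_minutes = cell_count * SECONDS_PER_UI_EDIT / 60

print(f"cells to fill: {cell_count}")
print(f"UI route: about {ui_minutes:.1f} minutes at {SECONDS_PER_UI_EDIT}s per cell")
print(f"API route: 1 request, {len(json.dumps(payload))} bytes of JSON")
```

That is 3,611 cells. Under the 2-second assumption the click-through route takes about two hours, versus one request for the batch route.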

Does this mean we have given up on the idea and quit working on our own AI web agents? Of course not! We even shipped a new version of our browser agent recently. Do we think it's a one-size-fits-all panacea that's going to solve automation? Not quite 😅.

We believe the winning approach to web-based automation is going to combine a 'traditional' RPA/API/automation approach with LLM-powered agents within the same workflow (but that deserves a thread of its own).
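A toy sketch of what that combination could look like: the judgment call goes to a model, and the data movement goes to plain deterministic code. Everything here is an assumption for illustration. `keyword_classifier` is a stand-in for a real LLM call, and `sync_to_crm` is a stand-in for a regular API call to "System B".

```python
def keyword_classifier(text):
    """Stand-in for an LLM call; a real workflow would query a model here."""
    return "fit" if "python" in text.lower() else "no_fit"


def sync_to_crm(crm, candidate, label):
    """Deterministic step: stands in for an ordinary API call to System B."""
    crm[candidate["id"]] = {"name": candidate["name"], "label": label}


def run_workflow(candidates, classify, crm):
    """LLM-style step decides, traditional automation step does the moving."""
    for candidate in candidates:
        label = classify(candidate["summary"])
        if label == "fit":
            sync_to_crm(crm, candidate, label)
    return crm


candidates = [
    {"id": 1, "name": "Ada", "summary": "Ten years of Python and data pipelines"},
    {"id": 2, "name": "Bob", "summary": "Pastry chef with a love of sourdough"},
]

crm = run_workflow(candidates, keyword_classifier, {})
print(crm)
```

Swapping the classifier for a real model changes one function. The part that writes data stays boring and repeatable.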

Either that, or GPT-N just comes out and solves software and automation all at once 🦄.