Using LLMs as a bolt-on to brute-force automation onto applications is not right. It feels like forcing a square peg through a round hole. It also happens to be exactly what today's web AI agents seem determined to do 🤷🏻♂️. I know, because I tried it too.
Don't get me wrong, today's web-based AI agents are cool for putting together a quick demo that looks extremely impressive on Twitter. Agents are also awesome for getting VC juices flowing 🤑. But, as of today, available agents (that is, the ones that one can sign up for and use) are more of a gimmick than anything else.
Sure, if you happen to be on exactly the right page, and craft your prompt exactly how the developer intended, and if all the stars otherwise align, they sometimes work.
It's this way not because it's a bad idea, or because the people trying to make it work aren't smart enough. Quite the contrary: they are some of the best and brightest!
The problem is that the overall approach is flawed on a lot of levels:
- UIs are designed for humans (not 🤖). And because we humans are very easily overwhelmed, functionality is often buried under layers of UI hierarchy. This makes it borderline impossible for an agent to navigate UIs consistently and correctly.
Now, you can always hack around any particular UI idiosyncrasy by prompting, adding skills for a specific app, and so on, but at that point you completely lose the generality and start facing a different problem: building and maintaining an extremely long tail of those skills.
- Web UIs are in general SLOW (did I mention that they are designed for humans?). How long does it take to click through and populate 15 columns in a spreadsheet row? What if you have to switch tabs in the process? What if you have to repeat for 700 rows?
Also, the DOM manipulation approach is very brittle 🔨. Markup changes all the time, you end up triggering bot protections and captchas, and a lot of web applications are specifically designed to prevent this kind of approach 🛑. So you're very much going against the grain.
- Another thing: everyone gets very tired very quickly of sitting in front of a computer and watching 🧐 it slowly do work (while it can't be used for anything else). What's the point if things aren't happening any quicker, and you can't do anything else in the meantime?
But but but, you can run it in the background, RIGHT? Wrong!
You kinda can't, because web applications behave differently when running or loading in the background; some may outright refuse to load, or trigger bot detection measures.
- Last but not least, LLMs are great with the LLLLanguage, not so much with the CCCCClicking: by design, LLMs excel at language-y things (translation, summarization, classification, meaning extraction).
Finding the right element in the DOM to click? Not so much. Sure, there are vision models now, but those aren't necessarily great at navigating UIs designed specifically for humans, either.
Combine vision models with the poor-at-DOM-navigation clicking models, and errors start to compound quicker than anyone in the industry would be willing to admit.
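The brittleness point above is easy to demonstrate. A minimal sketch (the markup and class names are made up for illustration): an agent that anchors on a CSS class finds the button today, then silently loses it after a routine front-end refactor regenerates the class names.

```python
# Hypothetical markup: the same "submit" button before and after
# a front-end refactor that regenerates CSS class names.
from xml.etree import ElementTree as ET

before = '<div><button class="btn-submit">Send</button></div>'
after = '<div><button class="css-1x9k2p4">Send</button></div>'

def find_submit(markup: str):
    root = ET.fromstring(markup)
    # A DOM-driven agent typically anchors on attributes like this:
    return root.find(".//button[@class='btn-submit']")

print(find_submit(before) is not None)  # found today...
print(find_submit(after) is not None)   # ...gone tomorrow
```

Nothing in the page "broke" from a human's point of view; only the selector did, which is exactly why DOM-keyed automation rots.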
Does this mean that LLMs are useless for automation? Does it mean that natural language to action is a stupid idea? In other words, IS THERE HOPE or are we doomed?
LLMs are incredibly useful for automation when it comes to classification (of course I have to plug https://arxiv.org/abs/2310.06111), summarization, meaning extraction, and generally making automated workflows more intelligent (go check out our new Research page while you're at it!).
For example, telling whether a person would be a good fit for a job, or whether a prospect would be a good fit for outreach, as opposed to (or rather, in addition to) moving data about that person from system A to system B.
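Used this way, the LLM's role shrinks to something it's genuinely good at: producing one label from a fixed set. A minimal sketch (the `call_llm` function and label names are placeholders, not any specific API):

```python
# Constrain the LLM to a classification label and validate the output,
# rather than trusting free-form text (or letting the model drive a UI).
LABELS = {"good_fit", "bad_fit", "needs_review"}

def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call your completion API.
    return "good_fit"

def classify_candidate(profile: str, job: str) -> str:
    prompt = (
        "Answer with exactly one of: good_fit, bad_fit, needs_review.\n"
        f"Job description:\n{job}\n\nCandidate profile:\n{profile}"
    )
    answer = call_llm(prompt).strip()
    # Anything outside the allowed set falls back to human review.
    return answer if answer in LABELS else "needs_review"
```

The validation step is the point: a malformed answer degrades gracefully into "needs_review" instead of propagating into the workflow.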
Natural language to action is a beautiful idea that increases productivity and makes applications accessible to a plethora of new users (I'd FINALLY love to be great at Photoshop, AutoCAD or, dare I say, Salesforce).
BUT, there has to be a way to verify the correctness of actions before execution, and it has to be consistent. In other words, you don't want any magic (colloquially referred to as AI) when it comes to doing the actual thing.
The Lollapalooza we all deserve
What you DO want is to tell the computer what to do, verify that it understood you exactly right, correct it if you have to, and have a way to RELIABLY run that instruction over and over again ⚙️.
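One way to sketch this split (all names here are illustrative, not a real product's API): the LLM's job ends at producing a structured action, a human verifies the plan once, and execution is deterministic so the same action can be replayed forever.

```python
# Verify-before-execute: structured action in, deterministic effect out.
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    verb: str    # e.g. "update_field"
    target: str  # e.g. "crm.contact.email"
    value: str

def describe(action: Action) -> str:
    # Human-readable preview the user confirms (or corrects) up front.
    return f"{action.verb} {action.target} -> {action.value!r}"

def run(action: Action, store: dict) -> None:
    # Deterministic executor: same action in, same result out, every time.
    if action.verb == "update_field":
        store[action.target] = action.value
    else:
        raise ValueError(f"unknown verb: {action.verb}")

store = {}
a = Action("update_field", "crm.contact.email", "ada@example.com")
print(describe(a))   # the user verifies this exact plan once
for _ in range(3):   # ...then it can be re-run reliably, unattended
    run(a, store)
```

The magic stays on the "understand me" side; the "do the thing" side is boring on purpose.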
Web agents are a great idea, but you have to use them for the right task. Are agents great for populating the values of a 23 × 157 Excel sheet? Umm, no! A single API call can do it in under 200 milliseconds with 100% accuracy and 0% drama.
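For a feel of the difference, here's a sketch of the batch-write style using an in-memory SQLite table as a stand-in for the spreadsheet backend (real spreadsheet APIs offer similar batch-update calls):

```python
# Write all 157 rows x 23 columns in one batch instead of
# thousands of simulated clicks and tab switches.
import sqlite3

ROWS, COLS = 157, 23
data = [tuple(f"r{r}c{c}" for c in range(COLS)) for r in range(ROWS)]

conn = sqlite3.connect(":memory:")
conn.execute(
    f"CREATE TABLE sheet ({', '.join(f'col{i} TEXT' for i in range(COLS))})"
)
placeholders = ", ".join("?" * COLS)
conn.executemany(f"INSERT INTO sheet VALUES ({placeholders})", data)  # one call

count = conn.execute("SELECT COUNT(*) FROM sheet").fetchone()[0]
print(count)  # all 157 rows landed, no UI involved
```

Same data, one round trip, fully verifiable, trivially repeatable.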
Does this mean we have given up on the idea and are not working on web agents? Of course not! We even shipped a new version of our browser agent recently. Do we think it's a one-size-fits-all panacea that's going to solve automation? Not quite 😅.
We believe the winning approach will combine the 'traditional' RPA/API automation approach with LLM-powered agents within the same workflow (but that deserves a thread of its own).
Either that, or GPT-N just comes out and solves software and automation all at once 🦄.