Skyvern Browser Agent 2.0: How We Reached State of the Art in Evals

(blog.skyvern.com)

49 points by suchintan 17 January 2025 | 31 comments

Comments

happyopossum 17 January 2025

Many of the examples given for agents such as this are things I just flat wouldn’t trust an LLM to do - buying something on Amazon for example: Will it pick new or ‘renewed’? Will it select an item that is from a janky looking vendor and may be counterfeit? Will it pick the cheapest option for me? What if multiple colors are offered?

This one example alone has so many branches that would require knowing what’s in my head.

On the flip side, it’s a ridiculously simple task for a human to do for themselves, so what am I truly saving?

Call me when I can ask it to check the professional reviews of X category on N websites (plus YouTube), summarize them for me, and find the cheapest source for the top 2 options in the category that will arrive in Y days or sooner.

That would be useful.

mkagenius 18 January 2025

Pre-planned steps by the Planner will go wrong more often than not, as it will try to guess the UI layers from its memory/training data. It's better to just ask for the "next step" by giving it the current state of the UI.

I have built a similar project for mobile automation [1], and the validator phase is not separate; rather, it's inherently baked into each step, since we only ask for the next step based on the current UI and previous actions.

My Planner sometimes goes "Oh, we are still on home screen, let's find the Uber app icon". This sort of self-correcting behaviour was not programmed but the LLM does it on its own.

1. https://github.com/BandarLabs/ClickClickClick - A framework to automate mobile use via any LLM (local/remote)
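
A minimal sketch of the "ask only for the next step" loop described in the comment above. Everything here is an assumption made for illustration: `run_next_step_loop`, `FakePhone`, and `stub_policy` are hypothetical names and are not taken from ClickClickClick or Skyvern, and `stub_policy` is a hard-coded stand-in for where an LLM call would go.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class StepLoopResult:
    done: bool                      # True if the policy signalled completion
    actions: List[str] = field(default_factory=list)


def run_next_step_loop(
    goal: str,
    observe: Callable[[], str],
    choose_next_action: Callable[[str, str, List[str]], Optional[str]],
    execute: Callable[[str], None],
    max_steps: int = 10,
) -> StepLoopResult:
    """Ask for one action at a time, always from the freshly observed UI state."""
    history: List[str] = []
    for _ in range(max_steps):
        ui_state = observe()  # re-read the UI before every decision
        action = choose_next_action(goal, ui_state, history)
        if action is None:    # hypothetical convention: None means "goal reached"
            return StepLoopResult(done=True, actions=history)
        execute(action)
        history.append(action)
    return StepLoopResult(done=False, actions=history)


class FakePhone:
    """Toy stand-in for a device: one home screen and one app screen."""

    def __init__(self) -> None:
        self.screen = "home"

    def observe(self) -> str:
        return self.screen

    def execute(self, action: str) -> None:
        if action == "tap:uber_icon" and self.screen == "home":
            self.screen = "uber"
        # any other action leaves the screen unchanged


def stub_policy(goal: str, ui_state: str, history: List[str]) -> Optional[str]:
    """Hard-coded placeholder for the LLM call (assumption, not a real model)."""
    if ui_state == "home":
        return "tap:uber_icon"  # still on the home screen, so find the app icon
    if ui_state == "uber":
        return None             # goal screen reached
    return "press:home"


phone = FakePhone()
result = run_next_step_loop("open Uber", phone.observe, stub_policy, phone.execute)
print(result.done, result.actions)
```

Because the policy sees the observed state on every iteration, checking whether the previous action worked happens as part of choosing the next one, which is one way to read the "validator baked into each step" point.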

lyime 17 January 2025

This is an impressive tool. I especially like the observability around the workflow and the steps it takes to achieve the outcome. We are potentially interested in exploring this if we can get the cost down at scale.

wejick 18 January 2025

UI is the most common interface but not particularly AI-friendly. I'll wait for a more standardized interface that's both human- and AI-friendly. Hoping it will still be browser-based.

skull8888888 17 January 2025

Isn't Browser Use SOTA on WebVoyager? At this point WebVoyager seems to be outdated; there's definitely a need for a new, harder benchmark.

govindsb 17 January 2025

congrats Suchintan! huge achievement!