Using the UIA tree as the currency for LLMs to reason over always made more sense to me than computer-vision, screenshot-based approaches. It’s true that not all software exposes itself correctly via UIA, but almost all the important stuff does. VS Code is one notable exception (though you can turn on accessibility support in the settings).
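To make the idea concrete, here is a minimal sketch of what "the UIA tree as currency" can mean: flatten the control tree into indented text that an LLM can read. This is a mock, not real UIA code; on Windows you would build the same structure by walking the live tree (e.g. with a COM wrapper), and the `UiNode`/`serialize` names and the Notepad example are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class UiNode:
    """Mock stand-in for a UI Automation element (type + name + children)."""
    control_type: str
    name: str
    children: List["UiNode"] = field(default_factory=list)

def serialize(node: UiNode, depth: int = 0) -> str:
    """Flatten the tree into indented text suitable for an LLM prompt."""
    lines = [f'{"  " * depth}{node.control_type} "{node.name}"']
    for child in node.children:
        lines.append(serialize(child, depth + 1))
    return "\n".join(lines)

# Hypothetical slice of what Notepad's tree might look like:
tree = UiNode("Window", "Untitled - Notepad", [
    UiNode("MenuBar", "Application", [UiNode("MenuItem", "File")]),
    UiNode("Edit", "Text editor"),
])
print(serialize(tree))
```

The text form is compact and deterministic, which is part of why it tends to beat raw screenshots for reasoning: the model sees names and roles directly instead of inferring them from pixels.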
Looks awesome. I've attempted my own implementation, but I never got it to work particularly well: "Open Notepad and type Hello World" was a triumph for me. I landed on the UIA tree + annotated screenshot combination too, but mine was too primitive, and I tried to use GPT, which isn't as good at image tasks as the Gemini model used here. Great job!
Working on something very similar in Rust. It's quite magical when it works (that's a big caveat, as I'm trying to make it work with local LLMs). Very cool implementation, and imo, this is the future of computing.
Windows-Use: an AI agent that interacts with Windows at GUI layer
(github.com) | 106 points by djhu9 | 9 September 2025 | 20 comments
Comments
https://learn.microsoft.com/en-us/dotnet/api/microsoft.visua...
Preferably one that is similarly able to understand and interact with web page elements, in addition to app elements and system elements.
I guess I can answer: "Yes, I think so."