Autoresearch on an old research idea

(ykumar.me)

Comments

the_arun 23 March 2026
Try this if the main link is not responsive - https://archive.is/6xLiU
datsci_est_2015 23 March 2026
I often use LLMs to explore prior art and maybe find some alternative ways of thinking of problems. About 90% of what it tells me is useless or inapplicable to my domain due to a technicality it could not have known, but the other 10% is nice and has helped me learn some great new things.

I can’t imagine letting an agent try everything that the LLM chatbot had recommended ($$$). Its recommendations often include very poorly maintained / niche libraries that have quite a lot of content written about them but, I can only imagine, very limited use in real production environments.

On the other hand, we have domain expert “consultants” in our leadership’s ears making equally absurd recommendations that we constantly have to disprove. Maybe an agent can keep those consultants occupied and let us do our work in peace.

Xx_crazy420_xX 14 hours ago
Autoresearch is nothing new, big players are already in the game with more sophisticated solutions:

  - https://arxiv.org/abs/2602.02660 (MARS)
  - https://arxiv.org/abs/2601.14525 (Execution-grounded automated AI research)
  - https://arxiv.org/abs/2601.10402 (ML-Master 2.0)
The most widely used benchmark for automated AI engineering/research is: https://github.com/openai/mle-bench
carlsborg 23 March 2026
> “ The agent acted like a hyperparameter optimization algorithm with some basic reasoning baked in.”

Good lens.

The crux of the autoresearch repo is basically one file, program.md, a system prompt that can be summarized as “do this in a loop: improve train.py, run the training, run evals, record the result. Favor simplicity”. The other files are an arbitrary ML model that is being trained.
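A minimal sketch of that loop in Python; the function names and record format here are illustrative stand-ins, not anything from the actual repo:

```python
# Hypothetical sketch of the loop program.md describes: improve train.py,
# train, evaluate, record the result, repeat. Favor simplicity.
def autoresearch_loop(n_iters, propose_edit, run_training, run_evals):
    history = []
    for i in range(n_iters):
        change = propose_edit(history)  # agent edits train.py, seeing past results
        run_training(change)            # e.g. invoke run.sh
        score = run_evals(change)
        history.append({"iter": i, "change": change, "score": score})
    best = max(history, key=lambda h: h["score"])
    return best, history
```

The only state that carries between iterations is the recorded history, which is why the loop behaves so much like hyperparameter search with light reasoning on top.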

_pdp_ 23 hours ago
Take some working code. Ask an LLM to fix bugs. Measure performance and test coverage. Feed the results back into the LLM. Repeat.

This has been the standard approach for more complex LLM deployments for a while now in our shop.

Using different models across iterations is also something I've found useful in my own experiments. It's like getting a fresh pair of eyes.
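A toy sketch of that feedback loop with per-iteration model rotation; `ask_model` and `measure` are stand-ins for a real LLM call and a real perf/coverage benchmark:

```python
# Sketch of the fix/measure/feed-back loop, cycling through different
# models across iterations ("a fresh pair of eyes").
from itertools import cycle

def fix_loop(code, models, ask_model, measure, rounds=3):
    rotation = cycle(models)
    score = measure(code)
    for _ in range(rounds):
        model = next(rotation)
        code = ask_model(model, code, feedback=score)  # feed results back in
        score = measure(code)                          # perf + test coverage
    return code, score
```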

dvt 23 March 2026
Ok, so looking at the commit log[1], I was mostly interested in seeing what the "moonshot ideas" implementations looked like, but basically everything is just hyperparameter tuning. Which is nice, but likely not worth the $$$ spent on the tokens. Am I missing something here?

[1] https://github.com/ykumards/eCLIP/commits/main/autoresearch

jpcompartir 23 hours ago
There are better techniques for hyper-parameter optimisation, right? I fear I have missed something important: why has Autoresearch blown up so much?

The bottleneck in AI/ML/DL is always data (volume & quality) or compute.

Does/can Autoresearch help improve large-scale datasets? Is it more compute efficient than humans?
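For comparison, one of the cheap baselines alluded to above, sketched in pure Python: plain random search over the search space. The `objective` here is a toy stand-in for a real train-and-eval run:

```python
# Plain random search: a strong, cheap hyper-parameter baseline.
import random

def random_search(objective, space, n_trials, seed=0):
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        score = objective(params)  # e.g. validation loss
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# toy objective with its minimum at lr=0.1, wd=0.01
space = {"lr": (1e-4, 1.0), "wd": (0.0, 0.1)}
obj = lambda p: (p["lr"] - 0.1) ** 2 + (p["wd"] - 0.01) ** 2
params, score = random_search(obj, space, n_trials=200)
```

No tokens spent, and for pure tuning problems methods like this (or Optuna-style Bayesian search) are the usual point of comparison.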

1970-01-01 23 hours ago
> The original paper used several medical X-ray datasets which I don’t have access to anymore, so I needed a new dataset with spatial annotations to test the expert attention mechanism. I picked the Ukiyo-eVG dataset: ~11K Japanese woodblock prints

That's such a weird switch. There's lots of free medical imaging online. Example: https://www.cancerimagingarchive.net/

love2read 23 March 2026
So... It did work. It found bugs (that he didn't know about) and it did optimization (that he hadn't done).
pu_pe 9 hours ago
> Like with any LLM project, the first 90% of the work was super smooth and barely needed my intervention. The last 10% was a slog.

The author doesn't really describe which part was a slog; I thought autoresearch was supposed to be pretty much set and forget.

lucasay 23 hours ago
This feels less like automated research and more like structured trial and error with a decent feedback loop. Still useful, but I think the real bottleneck is how good your eval metric is. If that’s weak, the whole loop just optimizes for the wrong thing faster.
ide0666 9 hours ago
The scratchpad.md for agent working memory is a nice touch. Having a persistent record of what was tried and why matters more than most people realize when debugging automated experiment loops.
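The pattern is simple enough to sketch: append one markdown entry per experiment so there's a persistent record of what was tried and why. The file name matches the comment above, but the entry format here is my assumption, not the repo's:

```python
# Append-only experiment log ("working memory" for the agent).
def log_experiment(path, change, hypothesis, result):
    entry = (
        f"## {change}\n"
        f"- hypothesis: {hypothesis}\n"
        f"- result: {result}\n\n"
    )
    with open(path, "a") as f:  # append-only: never lose history
        f.write(entry)

log_experiment("scratchpad.md", "batch 32 -> 128",
               "larger batch stabilizes loss", "val loss 0.41 -> 0.38")
```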
BrokenCogs 23 March 2026
Does autoresearch work for projects that are not LLM-based? E.g., in Karpathy's example he is optimizing nanoGPT. What if I wanted to improve a U-Net for image segmentation?
ricksunny 14 hours ago
With all the posts lately about Karpathy's autoresearch, it remains unclear to me whether the name is meant to convey that this LLM codebase should be useful for research across all domains (molecular biology, aircraft control, sociology, WW2 history, etc.) or only for discovering new LLM capabilities.
mlmonkey 22 hours ago
> Then I lock down Claude Code’s permissions to only edit these two files and run run.sh. No direct Python execution, no pip installs, no network access, no git push, etc.

How does one run Claude Code without network access?

n_bhavikatti 22 hours ago
The temperature clamp fix and "Optuna++" actions by the agents (the cause of basically all improvement to eCLIP) indicate they are good at finding bugs and hyper-parameter tuning. But when it comes to anything beyond that, such as novel architectural shifts, agents aren't good enough. With no clear path forward they tend to randomly change things, which is a poor approach. Agents: Optimization >> innovation
saidnooneever 21 hours ago
pretty cool experiment, i thought about someone maybe doing this and am happy you did it in this way. nice writeup too. this made me giggle a bit: "At one point it got tired of waiting for training to finish and just ended the conversation. I wouldn’t give it full autonomy just yet :)"

thanks for sharing your results and the road to them!

endymion-light 9 hours ago
This is really cool - I'm going to try it on my old dissertation.
lamroger 23 March 2026
Awesome breakdown! It really feels like a hyper-hyper parameter search + bug fixer.

I started looking at Kaggle again and autoresearch seems to converge to many of the solution vibes there.

Wild ensembles, squeezing a bit of loss out. More engineering than research IMO

motbus3 23 hours ago
I've done something with a small project I have and I had very similar results overall.
pikachu0625 22 hours ago
It's better to outsource the optimization phases; our own ideas should go into the constraints, the assumptions, etc., where the breakthroughs are. Boyd often argues that once you can express a problem in a standard mathematical form, the implementation becomes a commodity that software can handle automatically.
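The Boyd argument refers to the convex optimization standard form: once a problem is written as

```latex
\begin{aligned}
\text{minimize}\quad & f_0(x) \\
\text{subject to}\quad & f_i(x) \le 0, \qquad i = 1, \dots, m \\
& Ax = b
\end{aligned}
```

with convex $f_0, \dots, f_m$, an off-the-shelf solver handles the rest, and the modeling choices (which constraints, which objective) are where the human judgment lives.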
SebastianSosa 15 hours ago
Autoresearch is a trivial research idea: "ablate through experiments with knowledge of previous experiments."