I get so confused on this. I play around, test, and mess with LLMs all the time and they are miraculous. Just amazing, doing things we dreamed about for decades. I mean, I can ask for obscure things with subtle nuance where I misspell words and mess up my question and it figures it out. It talks to me like a person. It generates really cool images. It helps me write code. And just tons of other stuff that astounds me.
And people just sit around, unimpressed, and complain that ... what ... it isn't a perfect superintelligence that understands everything perfectly? This is the most amazing technology I've experienced as a 50+ year old nerd that has been sitting deep in tech for basically my whole life. This is the stuff of science fiction, and while there totally are limitations, the speed at which it is progressing is insane. And people are like, "Wah, it can't write code like a Senior engineer with 20 years of experience!"
My experience (almost exclusively with Claude) has just been so different that I don't know what to say. Some of the examples are the kinds of things I explicitly wouldn't expect LLMs to be particularly good at, so I wouldn't use them for those; for others, she says it just doesn't work for her, and that experience is so different from mine that I don't know how to respond.
I think that there are two kinds of people who use AI: people who are looking for the ways in which AIs fail (of which there are still many) and people who are looking for the ways in which AIs succeed (of which there are also many).
A lot of what I do is relatively simple one-off scripting. Code that doesn't need to deal with edge cases, won't be widely deployed, and whose outputs are very quickly and easily verifiable.
LLMs are almost perfect for this. They're generally faster than me looking up syntax/documentation, and when they're wrong it's easy to tell and correct.
Look for the ways that AI works, and it can be a powerful tool. Try and figure out where it still fails, and you will see nothing but hype and hot air.
Not every use case is like this, but there are many.
-edit- Also, when she says "none of my students has ever invented references that just don't exist"...all I can say is "press X to doubt"
The most interesting thing about this post is how it reinforces how terrible the usability of LLMs still is today:
"I ask them to give me a source for an alleged quote, I click on the link, it returns a 404 error. I Google for the alleged quote, it doesn't exist. They reference a scientific publication, I look it up, it doesn't exist."
To experienced LLM users that's not surprising at all - providing citations, sources for quotes, useful URLs are all things that they are demonstrably terrible at.
But it's a computer! Telling people "this advanced computer system cannot reliably look up facts" goes against everything computers have been good at for the last 40+ years.
I become more and more convinced with each of these tweets/blogs/threads that using LLMs well is a skill set akin to using Search well.
It’s been a common mantra - at least in my bubble of technologists - that a good majority of the software engineering skill set is knowing how to search well. Knowing when search is the right tool, how to format a query, how to peruse the results and find the useful ones, what results indicate a bad query you should adjust… these all sort of become second nature the longer you’ve been using Search, but I also have noticed them as an obvious difference between people that are tech-adept vs not.
LLMs seem to have a very similar usability pattern. They’re not always the right tool, and are crippled by bad prompting. Even with good prompting, you need to know how to notice good results vs bad, how to cherry-pick and refine the useful bits, and have a sense for when to start over with a fresh prompt. And none of this is really _hard_ - just like Search, none of us need to go take a course on prompting - IMO folks just need to engage with LLMs as a non-perfect tool they are learning how to wield.
The fact that we have to learn a tool doesn’t make it a bad one. The fact that a tool doesn’t always get it 100% on the first try doesn’t make it useless. I strip a lot of screws with my screwdriver, but I don’t blame the screwdriver.
You're using them wrong. Everyone is, though, so I can't fault you specifically. Chatbot is like the worst possible application of these technologies.
Of late, deaf tech forums have been taken over by language model debates over which works best for speech transcription. (Multimodal language models are the state of the art in machine transcription. Everyone seems to forget that when complaining they can't cite sources for scientific papers yet.) The debates are sort of to the point that it's become annoying how they have taken over so much space, just like here on HN.
But then I remember, oh yeah, there was no such thing as live machine transcription ten years ago. And now there is. And it's going to continue to get better. It's already good enough to be very useful in many situations. I have elsewhere complained about the faults of AI models for machine transcription - in particular, when they make mistakes they tend to hallucinate something that is superficially grammatical and coherent instead - but for the occasional single phrase in an audio transcription that's sometimes tolerable. In many cases you still want a human transcriber, but the cost of that means the amount of transcription needed can never be satisfied.
It's a revolutionary technology. I think in a few years I'm going have glasses that continuously narrate the sounds around me and transcribe speech and it's going to be so good I can probably "pass" as a hearing person in some contexts. It's hard not to get a bit giddy and carried away sometimes.
If there's one common thread across LLM criticisms, it's that they're not perfect.
These critics don't seem to have learned the lesson that the perfect is the enemy of the good.
I use ChatGPT all the time for academic research. Does it fabricate references? Absolutely, maybe about a third of the time. But has it pointed me to important research papers I might never have found otherwise? Absolutely.
The rate of inaccuracies and falsehoods doesn't matter. What matters is whether it is saving you time and increasing your productivity. Verifying the accuracy of its statements is easy, while finding the knowledge it spits out in the first place is hard. The net balance is a huge positive.
People are bullish on LLM's because they can save you days' worth of work, like every day. My research productivity has gone way up with ChatGPT -- asking it to explain ideas, related concepts, relevant papers, and so forth. It's amazing.
I think many people are just not really good at dealing with "imperfect" tools. Different tools have different success probabilities; let's call that probability p here. People typically use tools that have p=100%, or at least very close to it. But an LLM is a tool that is far from that, so making use of it takes a different approach.
Imagine there is a probabilistic oracle that can answer any yes/no question with success probability p. If p=100% or p=0% then it is obviously very useful (at p=0% you simply invert every answer). If p=50% then it is absolutely worthless. In the other cases, such an oracle can be utilized in different ways to get the answer we want, and it is still a useful thing.
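A minimal sketch of that intuition (a toy example of my own, not from the comment): as long as p is meaningfully away from 50%, repeated queries plus a majority vote recover the right answer almost every time, and below 50% you would simply invert the vote.

    import random

    def oracle(truth, p=0.75):
        # Hypothetical unreliable oracle: returns the true answer with probability p.
        return truth if random.random() < p else not truth

    def majority_vote(truth, p=0.75, n=51):
        # Ask the oracle n times and go with the majority answer.
        votes = sum(oracle(truth, p) for _ in range(n))
        return votes > n / 2

    trials = 1000
    correct = sum(majority_vote(True) for _ in range(trials))
    print(f"majority-vote accuracy with p=0.75: {correct / trials:.3f}")  # close to 1.0
    # With p=0.5 no amount of repetition helps; the vote stays a coin flip.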
I think people confuse the power of the technology with the very real bubble we’re living in.
There’s no question that we’re in a bubble which will eventually subside, probably in a “dot com” bust kind of way.
But let me tell you…last month I sent several hundred million requests to AI, as a single developer, and got exactly what I needed.
Three things are happening at once in this industry…
(1) executives are over-promising a literal unicorn with AGI, which is totally unnecessary for the ongoing viability of LLMs and is pumping the bubble.
(2) the technology is improving and delivery costs are changing as we figure out what works and who will pay.
(3) the industry’s instincts are developing, so it’s common for people to think “AI” can do something it absolutely cannot do today.
But again…as one guy, for a few thousand dollars, I sent hundreds of millions of requests to AI that are generating a lot of value for me and my team.
Our instincts have a long way to go before we’ve collectively internalized the fact that one person can do that.
My experience is starkly different. Today I used LLMs to:
1. Write python code for a new type of loss function I was considering
2. Perform lots of annoying CSV munging ("split this CSV into 4 equal parts", "convert paths in this column into absolute paths", "combine these and then split into 4 distinct subsets based on this field.." - they're great for that)
3. Expedite some basic shell operations like "generate softlinks for 100 randomly selected files in this directory"
4. Generate some summary plots of the data in the files I was working with
5. Not to mention extensive use in Cursor & GH Copilot
The tool (Claude 3.7 mostly, integrated with my shell so it can execute shell commands and run python locally) worked great in all cases. Yes, I could've done most of it myself, but I personally hate CSV munging and bulk file manipulations, and it's super nice to delegate that stuff to an LLM agent.
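For illustration, the CSV-splitting request typically comes back as something close to a one-liner; a rough sketch of that sort of code (file names are made up):

    import numpy as np
    import pandas as pd

    # Split a CSV into 4 roughly equal parts (hypothetical file name).
    df = pd.read_csv("data.csv")
    for i, chunk in enumerate(np.array_split(df, 4)):
        chunk.to_csv(f"data_part{i + 1}.csv", index=False)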
So many people are putting up expectations about models just to knock them down. There are infinite reasons to critique them.
Please dispense with anyone's "expectations" when critiquing things! (Expectations are not a fault or property of the object of the expectations.)
Today's models (1) do things that are unprecedented. Their generality of knowledge, and ability to weave completely disparate subjects together sensibly, in real time (and faster if we want), is beyond any other artifact in existence. Including humans.
They are (2) progressing quickly. AI has been an active field (even through its famous "winters") for several decades, and they have never moved forward this fast.
Finally and most importantly (3), many people, including myself, continue to find serious new uses for them in daily work, that no other tech or sea of human assistants could replace cost effectively.
The only way I can make sense out of anyone's disappointment is to assume they simply haven't found the right way to use them for themselves. Or are unable to fathom that what is not useful for them is useful for others.
They are incredibly flexible tools, which means a lot of value, idiosyncratic to each user, only gets discovered over time with use and exploration.
That they have many limits isn't surprising. What doesn't? Who doesn't? Zeus help us the day AI doesn't have obvious limits to complain about.
I’ve been using Claude a lot lately, and I must say I very much disagree.
For example, the other day I was chatting with it about the health risks associated with my high consumption of farmed salmon. It then generated a small program to simulate the accumulation of PCBs in my body. I could review the program, ask questions about the assumptions, etc. It all seemed very reasonable. It called it a toxicokinetic analysis.
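Not the program Claude produced, but a minimal sketch of what such a one-compartment toxicokinetic simulation might look like (all parameter values are placeholders, not real toxicology):

    # One-compartment model: weekly PCB intake from salmon minus first-order elimination.
    weekly_intake_ug = 1.2          # placeholder intake per week, in micrograms
    half_life_weeks = 5 * 52        # placeholder elimination half-life (~5 years)
    elimination_rate = 0.693 / half_life_weeks

    body_burden = 0.0
    for week in range(20 * 52):     # simulate 20 years, week by week
        body_burden += weekly_intake_ug
        body_burden -= elimination_rate * body_burden

    print(f"approximate body burden after 20 years: {body_burden:.1f} ug")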
It then struck me how immensely valuable this is to a curious and inquisitive mind. This is essentially my gold standard of intelligence: take a complex question and break it down in a logical way, explaining every step of the reasoning process to me, and be willing to revise the analysis if I point out errors / weaknesses.
Now try that with your doctor. ;)
Can it make mistakes? Sure, but so can your doctor. The main difference is that here the responsibility is clearly on you. If you do not feel comfortable reviewing the reasoning then you shouldn’t trust it.
My anecdotal experience is similar. For any important or hard technical questions relevant to anything I do, the LLM results are consistently trash. And if you are an expert in the domain you can’t not notice this.
On the other hand, for trivial technical problems with well known solutions, LLMs are great. But those are in many senses the low value problems; you can throw human bodies against that question cheaply. And honestly, before Google results became total rubbish, you could just Google it.
I try to use LLMs for various purposes. In almost all cases where I bother to use them, which are usually subject matters I care about, the results are poorer than I can quickly produce myself because I care enough to be semi-competent at it.
I can sort of understand the kinds of roles that LLMs might replace in the next few years, but there are many roles where it isn’t even close. They are useless in domains with minimal training data.
Because it’s not a scientific research tool; it’s a most-likely-next-text generator. It doesn’t keep a database of ingested information with source URLs. There are plenty of scientific research tools, but something that just outputs text based on your input is no good for that.
I’m sure that in the future there will be a really good search tool that utilises an LLM but for now a plain model just isn’t designed for that. There are a ton of other uses for them, so I don’t think that we should discount them entirely based on their ability to output citations.
In general, LLMs have made many areas worse. Now you see people writing content using LLMs without understanding the content itself. It becomes really annoying, especially if you don't know this, ask "did you perhaps write this using an LLM?", and get a "yes" answer.
In programming circles it's also annoying when you try to help and you get fed garbage outputted by LLMs.
I believe models for generating visuals (image, video, sound generation) are much more interesting, as that's an area where errors don't matter as much. Though the ethics of how these models have been trained is another matter.
I think that to understand the diversity of opinions, we have to recognize a few different categories of users:
Category 1: people who don't like to admit that anything trendy can also be good at what it does.
Category 2: people who don't like to admit that anything made by for-profit tech companies can also be good at what it does.
Category 3: people who don't like to admit that anything can write code better than them.
Category 4: people who don't like to admit that anything which may put people out of work who didn't deserve to be put out of work (and who already earn less than the people creating the thing) can also be good at what it does
Category 5: people who aren't using llms for things they are good at
Category 6: people who can't bring themselves to communicate with AIs with any degree of humility
Category 7: people to whom none of the above applies
I have a bunch of friends who don't get along well with each other, but I tend to get along with all of them. I believe this is about focusing on the good in people, and being able to ignore the bad. I think it's the same with tools. To me AI is an out-of-this-world, OP tool. Is it perfect? No. But it's amazing! The good I get out of it far surpasses its mistakes. Almost like people. People "hallucinate" and say wrong things all the time. But that doesn't make them useless or bad. So, whoever is having issues with AIs is probably having an issue dealing with people as well :) Learn how to deal with people, and learn how to deal with AI -- the single biggest skill you'll need in the 21st century.
I think the disconnect is that people who produce serious, researched and referenced work tend to believe that most work is like that. It is not. The majority of content created by humans is not referenced, it's not deeply researched, and a lot of it isn't even read, at least not closely. It sends a message just by existing in a particular place at a particular time. And its creation gainfully employs millions of people, which of course costs millions of dollars. That's why, warts and all, people are bullish on LLMs.
Write code to pull down a significant amount of public data using an open API. (That took about 30 seconds - I just gave it the swagger file and said “here’s what I want”)
Get the data (an hour or so), clean the data (barely any time, gave it some samples, it wrote the code), used the cleaned data to query another API, combined the data sources, pulled down a bunch of PDFs relating to the data, had the AI write code to use tesseract to extract data from the PDFs, and used that to build a dashboard. That’s a mini product for my users.
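The PDF-to-text step is the kind of glue code this works well for; a rough sketch of that tesseract pipeline (assumes poppler and tesseract are installed, and the file name is made up):

    from pdf2image import convert_from_path
    import pytesseract

    # Render each PDF page to an image, then OCR it with tesseract.
    pages = convert_from_path("report.pdf", dpi=300)
    text = "\n".join(pytesseract.image_to_string(page) for page in pages)
    print(text[:500])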
I also had a play with Mistral’s OCR and have tested a few things using that against the data. When I was out walking my dogs I thought about that more, and have come up with a nice workflow for a problem I had, which I’ll test in more detail next week.
That was all while doing an entirely different series of tasks, on calls, in meetings. I literally checked the progress a few times and wrote a new prompt or copy/pasted some stuff in from dev tools.
For the calls I was on, I took the recordings, passed them into my local instance of whisper, fed the transcript into Claude with a prompt I use to extract action points, pasted those into a Google doc, and circulated them.
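The transcription side of that is straightforward with the open-source whisper package; a minimal sketch (the recording file name and prompt wording here are illustrative, not my exact ones):

    import whisper

    # Transcribe a call recording locally, then assemble the action-point prompt.
    model = whisper.load_model("base")
    result = model.transcribe("call_recording.mp3")
    transcript = result["text"]

    prompt = (
        "Extract the action points from this meeting transcript as a bulleted list, "
        "with an owner for each item where one is mentioned:\n\n" + transcript
    )
    # `prompt` is then pasted into Claude (or sent via its API).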
One of the calls was an interview with an expert. The transcript + another prompt has given me the basis for an article (bulleted narrative + key quotes) - I will refine that tomorrow, and write the article, using a detailed prompt based on my own writing style and tone.
I needed to gather data for a project I’m involved in, so had Claude write a handful of scrapers for me (HTML source > here is what I need).
I downloaded two podcasts I need to listen to - but only need to listen to five minutes of each - and fed them into whisper then found the exact bits I needed and read the extracts rather than listening to tedious podcast waffle.
I turned an article I’d written into an audio file using elevenlabs, as a test for something a client asked me about earlier this week.
I achieved about three times as much today as I would have done a year ago. And finished work at 3pm.
So yeah, I don’t understand why people are so bullish about LLMs. Who knows?
They are bullish for the right reasons. Those reasons are not the ones you think of. They are betting that humans will keep getting addicted to more and more technological crutches and assistants as they keep inviting more workload onto their minds and bodies. There is no going back with this trend.
Why do we burden ourselves with such expectations? Look at cities like Dallas. It is designed for cars, not for humans walking. The buildings are far apart, workplaces are far away from homes, and everything looks like it was designed for some King Kong-like creatures.
The burden of expectations on humans is driven by technology. Technology makes you work harder than before. It didn't make your life easier. Check how hectic life has become for you now vs a laid-back village peasant a century back.
The bullishness on LLMs is a bet on this trend of self-inflicted human agony and dependency on tech. Man is going back to the cradle. LLMs are the milk feeder.
I feel like we'll laugh at posts like this in 5 years. It's not inaccurate in any way, it just misses the wood for the trees. Any new technology is always worse in some ways. Smart phones still have much worse battery life and are harder to type on than Blackberries. But imagine not understanding why people are bullish about Smartphones.
It's 100x easier to see how LLMs change everything. It takes very little vision to see what an advancement they are. I don't understand how you can NOT be bullish about LLMs (whether you happen to like them or not is a different question).
Think about the early days of digital photography. When digital cameras first emerged, expert critics from the photography field were quick to point out issues like low resolution, significant noise, and poor color reproduction—imperfections that many felt made them inferior to film. Yet those early digital cameras represented a breakthrough: they enabled immediate image review, easy sharing, and rapid technological improvements that soon eclipsed film in many areas. Just as people eventually recognized that the early “flaws” of digital photography were a natural part of a revolutionary leap forward, so too should we view the occasional hallucinations in modern LLMs as a byproduct of rapidly evolving technology rather than a fundamental flaw.
Or how about computer graphics? Early efforts to move 3D graphics hardware into the PC realm were met with extreme skepticism by my colleagues who were “computer graphics researchers” armed with the latest Silicon Graphics hardware. One researcher I was doing some work with in the mid-nineties remarked about PC graphics at the time: “It doesn’t even have a frame buffer. Look how terrible the refresh rate is. It flickers in a nauseating way.” Etc.
It’s interesting how people who are actual experts in a field where there is a major disruption going on often take a negative view of the remarkable new innovation simply because it isn’t perfect yet. One day, they all end up eating their words. I don’t think it’s any different with LLMs. The progress is nothing short of astonishing, yet very smart people continue to complain about this one issue of hallucination as if it’s the “missing framebuffer” of 1990s PC graphics…
I think there are multiple conversations happening that are trying to converge into one.
On one hand, LLMs are overhyped and not delivering on promises made by their biggest advocates.
On the other hand, any other type of technology (not so overhyped) would be massively celebrated in significantly improving a subset of niche problems.
It’s worth acknowledging that LLMs do solve a good set of problems well, while also being overhyped as a silver bullet by folks who are generally really excited about its potential.
The reality is that none of us knows what the future holds, and whether LLMs will have enough breakthroughs to solve more problems than they do today, but what they do solve today is still very impressive as is.
I've been saying the same thing, though in less detail. AI is so driven by hype at the moment that it's unavoidable that it's going to collapse at some point. I'm not saying the current crop of AI is useless; there are plenty of useful applications, but it's clear lots of people expect more from it than it's capable of, and everybody is investing in it just because everybody else is.
But even if it does work, you still need to double-check everything it does.
Anyway, my RPG group is going to try roleplaying with AI generated content (not yet as GM). We'll see how it goes.
> Eisegesis is "the process of interpreting text in such a way as to introduce one's own presuppositions, agendas or biases". LLMs feel very smart when you do the work of making them sound smart on your own end: when the interpretation of their output has a free parameter which you can mentally set to some value which makes it sensible/useful to you.
> This includes e. g. philosophical babbling or brainstorming. You do the work of picking good interpretations/directions to explore, you impute the coherent personality to the LLM. And you inject very few bits of steering by doing so, but those bits are load-bearing. If left to their own devices, LLMs won't pick those obviously correct ideas any more often than chance.
I wrote an AI assistant which generates working spreadsheets with formulas and working presentations with neatly laid out elements and styles. It's a huge productivity gain relative to starting from a blank page.
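As a minimal illustration of the "working formulas" part (not my actual code, just a made-up toy example using openpyxl):

    from openpyxl import Workbook

    wb = Workbook()
    ws = wb.active
    ws.append(["Item", "Units", "Unit price", "Total"])
    ws.append(["Widgets", 12, 3.5, "=B2*C2"])
    ws.append(["Gadgets", 4, 9.0, "=B3*C3"])
    ws["D4"] = "=SUM(D2:D3)"   # grand-total formula
    wb.save("quote.xlsx")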
I think LLMs work best when they are used as a "creative" tool. They're good for the brainstorming part of a task, not for the finishing touches.
They are too unreliable to be put in front of your users. People don't want to talk to unpredictable chatbots. Yes, they can be useful in customer service chats because you can put them on rails and map natural language to predetermined actions. But generally speaking I think LLMs are most effective when used _by_ someone who's piloting them instead of wrapped in a service offered _to_ someone.
I do think we've squeezed 90%+ of what we could from current models. Throwing more dollars of compute at training or inference won't make much difference. The next "GPT moment" will come from some sufficiently novel approach.
Some web development advice I remember from long ago is to do most of the styling after the functionality is implemented. If it looks done, people will think it is done.
LLMs did the "styling" first. They generate high-quality language output of the sort most of us would take as a sign of high intelligence and education in a human. A human who can write well can probably reason well, and probably even has some knowledge of facts they write confidently about.
I don't trust Sabine Hossenfelder on this. Even as a computer scientist/programmer with just some old experience from my master's courses in AI and ML, I know much more than she does about how these things work.
She has become more of an influencer than a scientist. And there is nothing wrong with that, unless she tries to pose as an authority on subjects she doesn't have a clue about. It's OK to have an opinion as an outsider, but it's not OK to pretend you are right and that you are an expert on every scientific or technical subject that happens to prompt a tweet from you.
Like others here, I use it to code (no longer a professional engineer, but keep side projects).
As soon as LLMs were introduced into the IDE, it began to feel like LLM autocomplete was almost reading my mind. With some context built up over a few hundred lines of initial architecture, autocomplete now sees around the same corners I am. It’s more than just “solve this contrived puzzle” or “write snake”. It combines the subject-matter use case (informed by variable and type naming) with the underlying architecture, and sometimes produces really breathtaking and productive results. Like I said, it took some time, but when it happened, it was pretty shocking.
I'm also not bullish on this. In the sense that I don't think LLMs are going to get 10x better, but they are useful for what they can do already.
If I see what Copilot suggests most of the time, I would be very uncomfortable using it for vibe coding though. I think it's going to be... entertaining watching this trend take off. I don't really fear I'm going to lose my job soon.
I'm skeptical that you can build a business on a calculator that's wrong 10% of the time when you're using it 24/7. You're gonna need a human who can do the math.
When digital cameras first appeared, their initial generations produced low-resolution, poor-quality images, leading many to dismiss them as passing gimmicks. This skepticism caused prominent companies, notably Kodak, to overlook the significance of digital photography entirely. Today, however, film photography is largely reserved for niche professionals and specialized use-cases.
New technologies typically require multiple generations of refinement—iterations that optimize hardware, software, cost-efficiency, and performance—to reach mainstream adoption. Similarly, AI, Large Language Models (LLMs), and Machine Learning (ML) technologies are poised to become permanent fixtures across industries, influencing everything from automotive systems and robotics to software automation, content creation, document review, and broader business operations.
Considering the immense volume of new information generated and delivered to us constantly, it becomes evident that we will increasingly depend on automated systems to effectively process and analyze this data. Current challenges—such as inaccuracies and fabrications in AI-generated content—parallel the early imperfections of digital photography. These issues, while significant today, represent evolutionary hurdles rather than permanent limitations, suggesting that patience and continuous improvement will ultimately transform these AI systems into indispensable tools.
Walk into any coffee shop or office and I can guarantee that you'll see several people actively typing into ChatGPT or Claude. If it was so useless, four years on, why would people be bothering with it?
I don't think you can even be bullish or bearish about this tech. It's here and it's changing pretty much every sector you can think of. It would be like saying you're not bullish about the Internet.
I honestly can't imagine life without one of these tools. I have a subscription to pretty much all of them because I get so excited to try out new models.
She's a scientist. Most of the people on here are writing software which is essentially reinventing the wheel over and over. Of course you have a different experience of LLMs.
I don't know.. I've maintained skepticism, but recently AI has enabled solutions for client problems that would have been intractable with conventional coding.
A team was migrating a years-old Excel-based workflow where no fewer than 3 spreadsheets contained thousands of call notes, often with multiple notes stuffed into the same column, separated inconsistently by a shorthand date and the initials of whoever was on the call. Sometimes with text arrows or other meta descriptions like (all calls after 3/5 were handled by Tim). They want to move all of this into structured Jira tickets and child tickets.
Joining the mess of freeform, redundant, and sometimes self contradicting data into JSON lines, and feeding it into AI with a big explicit prompt containing example conversions and corrections for possible pitfalls has resulted in almost magically good output. I added a 'notes' field to the output and instructed the model to call out anything unusual and it caught lots of date typos by context, ambiguously attributed notes, and more.
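Roughly the shape of that pipeline, heavily simplified (the field names, prompt wording, and the call_llm stub are all invented for illustration, not the real code):

    import json

    # Hypothetical rows pulled from the three spreadsheets.
    rows = [
        {"client": "Acme", "raw_notes": "3/5 JD: discussed renewal -> 4/1 TK follow-up ..."},
        {"client": "Globex", "raw_notes": "(all calls after 3/5 were handled by Tim) ..."},
    ]

    with open("call_notes.jsonl", "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

    PROMPT = """Convert each JSON line of free-form call notes into structured records:
    {"client": ..., "date": "YYYY-MM-DD", "author": ..., "note": ..., "notes": <anything unusual>}
    Example conversions and corrections for common pitfalls follow:
    ...
    """

    def call_llm(prompt, payload):
        # Placeholder for the actual model call (Gemini, in our case).
        raise NotImplementedError

    # structured = [call_llm(PROMPT, line) for line in open("call_notes.jsonl")]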
It would have been a man month or so of soul drowningly tedious and error prone intern level work, but now it was 40 minutes and $15 of Gemini usage.
So, even if it's not a galaxy brained super intelligence yet, it is a massive change to be able to automate what was once exclusively 'people' work.
The author of the tweet is a physicist. Her work is at the edge of the boundary of human knowledge. LLMs are useless in this domain, at least when applied directly.
When I use LLM’s to explore applications of cutting edge nonlinear optics, I too am appalled about the quality of the output. When I use an LLM to implement a React program, something that has been done hundreds of times before by others, I find it performs well.
I think there is a lot of denial going around right now.
The present path of AI is nothing short of revolutionary; a lot of jobs and industries are going to suffer a major upheaval, and a lot of people are just living in some wishful-thinking moment where it will all go away.
I see people complaining it gives them bad results. Sure it does; so does all the other information we consume. It’s our job to check it ourselves. Still, the amount of time it saves me, even if I have to correct it, is huge.
I can give an example that has nothing to do with work. I was searching for the smallest miniATX computer cases that would accept at least 3 HDDs (3.5”). The amount of time LLMs saved me is staggering.
Sure, there was one wrong result in the mix, and sure, I had to double-check all the cases myself, but just not having to go through dozens of cases, find the dimensions, calculate the volume, and check the HDD fit in difficult-to-read (and sometimes hard-to-obtain) pages saved days of work - yes, I had done a similar search completely manually about 5 years ago.
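The tedious part being automated there is just arithmetic over spec sheets; a tiny sketch of the filter, with made-up case data:

    # Hypothetical case specs: external dimensions in mm plus number of 3.5" bays.
    cases = [
        {"name": "Case A", "dims_mm": (330, 210, 380), "bays": 4},
        {"name": "Case B", "dims_mm": (270, 190, 320), "bays": 2},
        {"name": "Case C", "dims_mm": (350, 230, 400), "bays": 3},
    ]

    def volume_litres(dims_mm):
        w, h, d = dims_mm
        return w * h * d / 1_000_000  # mm^3 -> litres

    for c in sorted((c for c in cases if c["bays"] >= 3),
                    key=lambda c: volume_litres(c["dims_mm"])):
        print(f"{c['name']}: {volume_litres(c['dims_mm']):.1f} L, {c['bays']} bays")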
This is a personal example, I also have others at work.
We've had the opposite experience, especially with o3-mini using Deep Research for market research & topic deep-dive tasks. The sources that are pulled have never been 404 for us, and typically have been highly relevant to the search prompt. It's been a huge time-saver. We are just scratching the surface of how good these LLMs will become at research tasks.
It's a common and fair criticism. LLM-based products promise to save time, but for many complex, day-to-day tasks - adding a feature to a 50M LOC codebase, writing a magazine-quality article, properly summarizing 5 SEC filings - they often don't. They require careful re-validation, and once you find a few obvious issues, trust erodes and the whole thing gets tossed.
This isn't a technology problem, it's a product problem - and one that may not be solvable with better models alone.
Another issue: people communicate uncertainty naturally. We say "maybe", "it seems", "I'm not sure, but...". LLMs suppress that entirely, for structural reasons. The output sounds confident and polished, which warps perception - especially when the content is wrong.
I used to be skeptical of the hype but it's hard to deny that they are incredible tools. For coding, they save me a few hours a week and this is just the beginning. A few months ago I would use them to generate simple pieces of code, but now they can handle refactoring across several files. Even if LLMs don't get smarter, the tooling around them will improve and they'll be even more useful. We'll also learn to use them better.
Also my gf who's not particularly tech savvy relies heavily on ChatGPT for her work. It's very useful for a variety of text (translation, summaries, answering some emails).
Maybe Sabine Hossenfelder tries to use them for things they can't do well and she's not aware that they work for other use cases.
Should we listen to Sabine in this case? Isn't this another manifestation of a generally intelligent person, who happens to be an expert in her field weighing in on something she's not an expert on, thinking her expertise transfers?
This is the most common 'smart person' fallacy out there.
As for my 2 cents, LLMs can do sequence modeling and prediction tasks, so as long as a problem can be reduced to sequence modeling (which is a lot of them!), they can do the job.
This is like saying that the Fourier Transform is played out because you can only do so much with manipulating signal frequencies.
LLMs are an incredible technological breakthrough which even their creators are using completely incorrectly - in my view, half because they want to make a lot of money, and half because they are enchanted by the allure of their own achievement and are desperate to generalize it. The ability to generate human-style language, art, and other media dynamically on demand, based on a prompt communicated in natural language, is an astonishing feat. It's immensely useful on its own.
But even its creators, who acknowledge it is not AGI, are trying to use it as if it were. They want to sell you LLMs as "AI" writ large; that is, they want you to use it as your research assistant, your secretary, your lawyer, your doctor, and so on and so forth. LLMs on their own simply cannot do those tasks. They are great for other uses: troubleshooting, assisting with creativity and ideation, prototyping concepts, and correlating lots of information, so long as a human then verifies the results.
LLMs right now are flour, sugar, and salt, mixed in a bowl and sold as a cake. Because they have no reasoning capability, only rote generation via prediction, LLMs cannot process contextual information the way required for them to be trustworthy or reliable for the tasks people are trying to use them for. No amount of creative prompting can resolve this totally. (I'll note that I just read the recent Anthropic paper, which uses terms like "AI biology" and "concept" to imply that the AI has reasoning capacity - but I think these are misused terms. An LLM's "concept" of something bears no referent to the real world, only a set of weights to other related concepts.)
What LLMs need is some sort of intelligent data store, tuned for their intended purpose, that can generate programmatic answers for the LLMs to decipher and present. Even then, their tendency to hallucinate makes things tough - they might imagine the user requested something they didn't, for instance. I don't have a clear solution to this problem. I suspect that whoever does will have solved a much bigger, more complex problem than the already massive one LLMs have solved, and, if they are able to do so, will have brought us much, much closer to AGI.
I am tired of seeing every company under the sun claim otherwise to make a buck.
I am making an effort to use LLMs at work, but in my workflow it's basically just a fancy auto complete. Having a more AI centric workflow could be interesting, but I haven't thought of a good way to rig that up. I'm also not really itching for something to do my puzzles for me. They're what gets me out of bed in the morning.
I haven't tried using LLMs for much else, but I am curious as long as I can run it on my own hardware.
I also totally get having a problem with the massive environmental impact of the technology. That's not AI's fault per se, but it's a valid objection.
The author mentioned Gemini sometimes refusing to do something.
I’ve recently been using Gemini (mostly 2.0 flash) a lot and I’ve noticed it sometimes will challenge me to try doing something by myself. Maybe it’s something in my system prompt or the way I worded the request itself. I am a long time user of 4o so it felt annoying at first.
Since my purpose was to learn how to do something, being open-minded I tried to comply with the request, and I can say that it's been a really great experience in terms of retention of knowledge. Even when I'm making mistakes, Gemini will point them out and explain them nicely.
People have different opinions about this, but I think one problem is there are different questions.
One is this: Google, Facebook, OpenAI, Anthropic, Deepseek etc. have put a lot of capital expenditure into training frontier large language models, and are continuing to do so. There is a current bet that growing the size of LLMs, with more or maybe even synthetic data, with some minor breakthroughs (nothing as big as the Alexnet deep learning breakthrough, or transformers), will have a payoff for at least the leading frontier model. Similar to Moore's law for ICs, the bet is that more data and more parameters will yield a more powerful LLM - without that much more innovation needed. So the question for this is whether the capital expenditure for this bet will pay off.
Then there's the question of how useful current LLMs are, whether we expect to see breakthroughs at the level of Alexnet or transformers in the coming decades, whether non-LLM neural networks will become useful - text-to-image, image-to-text, text-to-video, video-to-text, image-to-video, text-to-audio and so on.
So there's the business side question, of whether the bet that spending a lot of capital expenditure training a frontier model will be worth it for the winner in the next few years - with the method being an increase in data, perhaps synthetic data, and increasing the parameter numbers - without much major innovation expected. Then there's every other question around this. All questions may seem important but the first one is what seems important to business, and is connected to a lot of the capital spending being done on all of this.
I love this. The more people that say "I don't get it" or "it's a stochastic parrot", the more time I get to build products rapidly without the competition that there would be if everyone was effectively using AI. Effectively is the key.
It's cliche at this point to say "you're using it wrong" but damn... it really is a thing. It's kind of like how some people can find something online in one Google query and others somehow manage to phrase things just wrong enough that they struggle. It really is two worlds. I can have AI pump out 100k tokens with a nearly 0% error rate, meanwhile my friends with equally high engineering skill struggle to get AI to edit 2 classes in their codebase.
There are a lot of critical skills and a lot of fluff out there. I think the fluff confuses things further. The variety of models and model versions confuses things EVEN MORE! When someone says "I tried LLMs and they failed at task xyz" ... what version was it? How long was the session? How did they prompt it? Did they provide sufficient context around what they wanted performed or answered? Did they have the LLM use tools if that is appropriate (web/deepresearch)?
It's never a like-for-like comparison. Today's cutting-edge models are nothing like even 6-months ago.
Honestly, with models like Claude 3.7 Sonnet (thinking mode) and OpenAI o3-mini-high, I'm not sure how people fail so hard at prompting and getting quality answers. The models practically predict your thoughts.
Maybe that's the problem, poor specifications in (prompt), expecting magic that conforms to their every specification (out).
I genuinely don't understand why some people are still pessimistic about LLMs.
Can our brains recall precise citations to tens of papers we read a while ago? For the vast majority, no. LLMs function somewhat similarly to our brains in many ways, as opposed to classical computers.
Their strengths and flaws differ from our brains, to be sure, but some of these flaws are being mitigated and improved on by the month. Similarly, unaided humans cannot operate successfully in many situations. We build tools, teams, and institutions to help us deal with them.
I think LLMs have value, but what I'm really looking forward to is the day when everyone can just quietly use (or not use) LLMs and move on with their lives. It's like that one friend who started a new diet and can't shut up about it every time you see them, except instead of that one friend it's seemingly the majority of participants in tech forums. It's getting so old.
I think the author is being overly pessimistic with this. The positives of an LLM agent outweigh the negatives when used with a Human-in-the-loop.
For people interested in understanding the possibilities of LLM for use in a specific domain see The AI Revolution in Medicine: GPT-4 and Beyond by Peter Lee (Microsoft Research VP), Isaac Kohane (Harvard Biomedical Informatics MD) et al. It is an easy read showing the authors systematic experiments with using the OpenAI models via the ChatGPT interface for the medical/healthcare domain.
While I am amazed at the technology, at the same time I hate it. First, 90% of people misinterpret it and cite its output as fact. Second, it needs too much energy. Third, energy consumption is rarely mentioned in nerdy discussions about LLMs.
'But you're such a killjoy.'
Yes, it is an evil technology in its current shape. So we should focus on fixing it, instead of making it worse.
I understand how developers can come to this conclusion if they're only using local models that can run on consumer GPUs since there's a time cost to prompting and the output is fairly low quality with a higher probability of errors and hallucinations.
But I don't understand how you can come to this conclusion when using SOTA models like Claude Sonnet 3.7. Its responses have always been useful, and when it doesn't get it right the first time you can keep prompting it with clarifications and error output. On the rare occasion it's unable to get it right, I'm still left with a bulk of useful code that I can manually fix and refactor.
Either way, my interactions with Sonnet are always beneficial. Maybe it's a prompt issue? I only ask it to perform small, specific, deterministic tasks and provide the necessary context (with examples when possible) to achieve them.
I don't vibe code or unleash an LLM on an entire code base since the context is not large enough and I don't want it to refactor/break working code.
The key for LLM productivity, it seems to me, is grounding. Let me give you my last example, from something I've been working on.
I just updated my company commercial PPT. ChatGPT helped me with:
- Deep Research great examples and references of such presentations.
- Restructure my argument and slides according to some articles I found in the previous step and thought were pretty good.
- Come up with copy for each slide.
- Iterate new ideas as I was progressing.
Now, without proper context and grounding, LLMs wouldn't be so helpful at this task, because they don't know my company, clients, product and strategy, and would be generic at best. The key: I provided it with my support portal documentation and a brain dump I recorded to text on ChatGPT with key strategic information about my company. Those are two bits of info I keep always around, so ChatGPT can help me with many tasks in the company.
From that grounding to the final PPT, it's pretty much a trivial and boring transformation task that would have cost me many, many hours to do.
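Mechanically, the grounding isn't anything fancier than keeping those two documents handy and prepending them to each request; roughly (the file names here are invented for the sketch):

    from pathlib import Path

    # Reusable grounding context: support-portal docs plus the strategy brain dump.
    grounding = "\n\n".join(
        Path(p).read_text() for p in ["support_portal_docs.md", "strategy_brain_dump.md"]
    )

    task = "Restructure my commercial deck along the argument structure of the attached articles."
    prompt = f"Company context:\n{grounding}\n\nTask:\n{task}"
    # `prompt` is what actually goes to the model, task after task.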
An LLM can do some pretty interesting things, but the actual applicability is narrow. It seems to me that you have to know a fair amount about what you're asking it to do.
For example, last week I dusted off my very rusty coding skills to whip up a quick and dirty Python utility to automate something I'd done by hand a few too many times.
My first draft of the script worked, but was ugly and lacked any trace of good programming practices; it was basically a dumb batch file, but in Python. Because it worked part of me didn't care.
I knew what I should have done -- decompose it into a few generic functions; drive it from an intelligent data structure; etc -- but I don't code all the time anymore, and I never coded much in Python, so I lack the grasp of Python syntax and conventions to refactor it well ON MY OWN. Stumbling through with online references was intellectually interesting, but I also have a whole job to do and lack the time to devote to that. And as I said, it worked as it was.
But I couldn't let it go, and then had the idea "hey, what if I ask ChatGPT to refactor this for me?" It was very short (< 200 lines), so it was easy to paste into the Chat buffer.
Here's where the story got interesting. YES, the first pass of its refactor was better, but in order to get it to where I wanted it, I had to coach the LLM. It took a couple passes through before it had made the changes I wanted while still retaining all the logic I had in it, and I had to explicitly tell it "hey, wouldn't it be better to use a data structure here?" or "you lost this feature; please re-add it" and whatnot.
In the end, I got the script refactored the way I wanted it, but in order to get there I had to understand exactly what I wanted in the first place. A person trying to do the same thing without that understanding wouldn't magically get a well-built Python script.
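To make that concrete, the refactor being coached out of the model is the classic move from hard-coded steps to a data-driven loop; a generic illustration (not my actual script):

    import shutil
    from pathlib import Path

    # Before (in spirit): copy("jan.csv", ...); copy("feb.csv", ...); one line per file.
    # After: the same logic driven from a data structure and one generic function.
    JOBS = [
        {"src": "jan.csv", "dest": "archive/jan.csv"},
        {"src": "feb.csv", "dest": "archive/feb.csv"},
    ]

    def run_job(job):
        dest = Path(job["dest"])
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(job["src"], dest)

    for job in JOBS:
        run_job(job)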
I genuinely don't understand why some people are so critical of LLMs. This is new tech, we don't really understand the emergent effects of attention and transformers within these LLMs at all. It is very possible that, with some further theoretical development, LLMs which are currently just 'regurgitating and hallucinating' can be made to be significantly more performant indeed. In fact, reasoning models - when combined with whatever Google is doing with the 1M+ ctxt windows - are much closer to that than people who were using LLMs expected.
The tech isn't there yet, clearly. And stock valuations are over the board way too much. But, LLMs as a tech != the stock valuations of the companies. And, LLMs as a tech are here to stay and improve and integrate into everyday life more and more - with massive impacts on education (particularly K-12) as models get better at thinking and explaining concepts for example.
Note that there is a difference between being "bullish" (i.e., a market that is trending upwards) and "useful". I think there is general value in LLMs for semantic search & information extraction, but not as an exclusive path to the AGI that some of the market expects for its overinflated valuation.
She's a physicist. LLMs are not for creating new information. They're for efficiently delivering established information. I use it to quickly inform me about business decisions all the time, because a thousand answers about those questions already exist. She's just using it for the wrong thing.
LLMs are like any tool, you get what you put in. If you are frustrated with the results, maybe you need to think about what you're doing.
300/5290 functions decompiled and analyzed in less than three hours off of a huge codebase. By next weekend, a binary that had lost source code will have tests running on a platform it wasn't designed for.
I think she is right. And it may have some very real consequence. The latest Gen Alpha are AI native. They have been using AI for one or two years now. And as they grow up their knowledge will definitely be built on top of AI. This leads to a few fundamental problems.
1. AI inventing false information that then gets built into their foundational knowledge.
2. There is a lot less problem solving for them once they are used to AI.
I think the fundamentals of education need to take AI, or the current LLM chatbots, seriously and start asking and planning how to react to it. We have already witnessed Gen Z, in the era of Google, thinking they know everything, and if not, they can just google it - thinking they "know it all", only to be battered in the real world.
The way I see it, it's less about the technicalities of accuracy and more about the long term human and societal problems it presents when widely adopted.
On one hand, every new technology that comes about unregulated creates a set of ethical and in this particular case, existential issues.
- What will happen to our jobs?
- Who is held accountable when that car navigation system designed by an LLM went haywire and caused an accident?
- What will happen with education if we kill all entry level jobs and make technical skills redundant?
In a sense they're not new concerns in science, we research things to make life easier, but as technology advances, critical thinking takes a hit.
So yeah, I would say people are still right to be wary and skeptical of LLMs, as that's the normal reaction to disruptive technology, and one that will help us create adequate regulations to safeguard the future.
Simple example: My company is using Gemini to analyze every 13F filing and find correlations between all S&P500 companies the minute new earnings are released. We profited millions off of this in the last six months or so. Replicating this work alone without AI would require hiring dozens of people. How can I not be bullish on LLMs? This is only one of many things we are doing with it.
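The correlation half of that is standard dataframe work; a minimal sketch with made-up prices (the real pipeline obviously has to pull the filings and earnings data first):

    import pandas as pd

    # Hypothetical daily closing prices for a few S&P 500 tickers.
    prices = pd.DataFrame({
        "AAPL": [190.1, 191.3, 189.8, 192.5],
        "MSFT": [410.2, 412.0, 409.5, 415.1],
        "XOM":  [104.3, 103.9, 104.8, 103.2],
    })

    returns = prices.pct_change().dropna()
    print(returns.corr())  # pairwise correlation matrix of daily returns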
I do not understand how you can be bearish on LLMs. Data analysis, data entry, agents controlling browsers, browsing the web, doing marketing, doing much of customer support, writing BS React code for a promo that will be obsolete in 3 months anyway.
The possibilities are endless, and almost every week, there is a new breakthrough.
That being said, OpenAI has no moat, and there definitely is a bubble. I'm not bullish on AI stocks. I'm bullish on the tech.
I'd say the author tells us more than the headline or first sentence of the post. If you have recently scrolled through Sabine's posts on Twitter, or her clickbaity thumbnails, facial expressions, and headlines on YouTube [1], you would see that she is all-in for clicks. She often takes a popular belief, then negates it, and throws around counter-examples of why X or Y has absolutely failed. It's a repeating pattern to gain popularity, and it seems to work not only on Twitter and YouTube, but even here on Hacker News, given the massive number of upvotes her post has.
Sabine is on my blocklist as she very often puts out really ignorant and short-sighted perspectives.
LLMs are the most impactful technology we've had since the internet; that is why people are bullish on them. Anyone who fails to see that probably can't tie their own shoes without a "peer-reviewed" mechanism, lol.
I think LLM skeptics and cheerleaders all have a point but I lean toward skepticism. And that's because though LLMs are easy to pick up, they're impossible to "master" and are extremely finicky. The tech is really fun to tinker with and capable of producing some truly awesome results in the hands of a committed practitioner. That puts it in the category of specialized tools. Despite this fact, LLMs are hyped and valued by the markets like they are breakthrough consumer products. My own experience plus the vague/underwhelming adoption and revenue numbers reported by the major vendors tell me that something's not quite right in this area of the industry.
LLMs as a tool that you use and check can be useful, especially for code.
However, I think that putting some LLM in your customer-facing app/SaaS/game is like using the "I'm feeling lucky" button when Google introduced it. It only works for trivial things that you might not even have needed a search engine for (finding the address of a website you had already visited). But since it's so cheap to implement and feels like it's doing the work of humans for a tiny fraction of the cost, they won't care about the customers and will implement it anyway. So it'll probably flood any system that can get away with basic mistakes, hopefully not systems where human lives are at stake.
I’m not bullish because many of these models rest on the consumption of copyrighted material or information that wasn’t intended for mass consumption in this way.
Also, I like to think for myself. Writing code and thinking through what I am writing often exposes edge cases that I wouldn’t otherwise realize.
I was previously a non believer in LLMs, but I've come around to accepting them. Gemini has saved me so much time and energy it's actually insane. I wouldn't be able to do my work to my satisfaction (which is highly technical) without its support.
To each his own. I give o3 with Deep Research my notes and ask it for a high-level design document, then feed that to Claude and get a skeleton of a multi-service system, then build out functionality in each service with subsequent Claude requests.
Sure, it does middle-of-the-road stuff, but it comments the code well, I can manually tweak things at various levels of granularity to guide it along, and the design doc is on par with something a senior principal would produce.
I do in a week what a team of four would take a month and a half to do. It's insane.
Sure, don't be bullish. I'm frantically piecing together enough hardware to run a decent sized LLM at home.
A bunch of comments here seem to be missing a point: the author is (or at least was?) a scientist.
Her primary work interest is in the truth, not the statistically plausible.
Her point is that using an LLM to generate truth is pointless, and that people should stop advertising LLMs as "intelligent", since, to a scientist, being "intelligent" and being "dead wrong" are polar opposites.
Other use cases have feedback loops - it does not matter so much if Claude spits out wrong code, provided you have a compiler and automated tests.
Scientists _are_ acting as compilers to check truth. And they rely on truths compiled by other scientists, just like your programs rely on code written by other people.
What if I tell you that, from now on, any third-party library that you call will _statistically_ work 76% of the time, and I have no clue what it does in the remaining X%? (I don't know what X is; I haven't asked ChatGPT yet.)
In the meantime, I have yet to see a headline "AI X discovered life-changing new Y on its own" (the closest thing I know of is AlphaFold, which I both know is apparently "changing the world of scientists", and yet feel has "not changed the world of your average joe, so far" - emphasis on the "so far"); but I've already seen at least one headline of a "dumb mistake made because an AI hallucinated".
I suppose we have to hope the trend will revert at some point ? Hope, on a Friday...
AI coding overall still seems to be underrated by the average developer.
They try to add a new feature or change some behavior in a large existing codebase and it does something dumb and they write it off as a waste of time for that use case. And that's understandable. But if they had tweaked the prompt just a bit it actually might've done it flawlessly.
It requires patience and learning the best way to guide it and iterate with it when it does something silly.
Although you undoubtedly will lose some time re-attempting prompts and fixing mistakes and poor design choices, on net I believe the frontier models can currently make development much more productive in almost any codebase.
"Yes, I have tried Gemini, and actually it was even worse in that it frequently refuses to even search for a source and instead gives me instructions for how to do it myself. Stopped using it for that reason."
Thank you Sabine. Every time I have mentioned Gemini is the worst, and not even worth of consideration, I have been bombarded with downvotes, and told I am using it wrong.
One is the boss's view, looking for an AI to replace his employees someday. I think that is a dead end. It just keeps getting more sophisticated and increasingly impressive, but it won't work.
One is the worker's view, looking at AI to be a powerful tool that can leverage one's productivity. I think that is looking promising.
I don't really care for the chat bot to give me accurate sources. I care about an AI that can provide likely places to look for sources and I'll build the tool chain to lookup and verify the sources.
I pay our lawyer quite a lot and he also makes mistakes - typos, somewhat imprecise language in contracts, what have you - but we work to fix it and everything's OK.
Well you see, there are lots of people that are dependent on bullshitting their way through life, and don't actually contribute anything new or novel to the world. For these people, LLMs are great because they can generate more bullshit than they ever could before. For those that are actually trying to do new and interesting things, well, an LLM is only as good as what it has seen before, and you're doing something new and exciting. Congratulations, you beat the machine - until they steal your work and feed it to the LLM.
People who don't work in tech have no idea how hard it is to do certain things at scale. Skilled tech people are severely underappreciated.
From a sub-tweet:
>> no LLM should ever output a url that gives a 404 error. How hard can it be?
As a developer, I'm just imagining a server having to call up all the URLs to check that they still exist (and the extra costs/latency incurred there)... And if any URLs are missing, getting the AI to re-generate a different variant of the response, until you find one which does not contain the missing links.
And no, you can't do it from the client side either... It would just be confusing if you removed invalid URLs from the middle of the AI's sentence without re-generating the sentence.
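As a rough illustration of why this would be costly, here is a minimal sketch (the `generate` callback and helper names are hypothetical, assuming the `requests` library) of what "never emit a 404" would actually require on the server side: validate every URL in the response, and re-generate the whole answer whenever one is dead.

```python
import re
import requests

URL_RE = re.compile(r"https?://[^\s)\"']+")

def urls_resolve(text: str, timeout: float = 5.0) -> bool:
    """Return True only if every URL in the text answers with something other than 404."""
    for url in URL_RE.findall(text):
        try:
            resp = requests.head(url, allow_redirects=True, timeout=timeout)
            if resp.status_code == 404:
                return False
        except requests.RequestException:
            return False  # unreachable counts as broken too
    return True

def answer_without_dead_links(generate, prompt: str, max_attempts: int = 3) -> str:
    """Re-sample the model until no cited URL 404s; every retry is another full generation."""
    text = ""
    for _ in range(max_attempts):
        text = generate(prompt)  # one full (slow, paid) LLM call per attempt
        if urls_resolve(text):
            return text
    return text  # give up and return the last attempt
```

Every validation pass adds a network round-trip per URL, and every failed pass adds a full re-generation - which is exactly the latency and cost problem described above.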
You almost need to get the LLM to engineer/pre-process its own prompts in a way which guesses what the user is thinking in order to produce great responses...
Worse than that though... A fundamental problem of 'prompt engineering' is that people (especially non-tech people) often don't actually fully understand what they're asking. Contradictions in requirements are extremely common. When building software especially, people often have a vague idea of what they want... They strongly believe that they have a perfectly clear idea but once you scope out the feature in detail, mapping out complex UX interactions, they start to see all these necessary tradeoffs and limitations rise to the surface and suddenly they realize that they were asking for something they don't want.
It's hard to understand your own needs precisely; even harder to communicate them.
Why do people who don't like using LLMs keep insisting they are useless for the rest of us? If you don't like to use them, then simply don't use them.
I use them almost daily in my job and get tremendous use out of them. I guess you could accuse me of lying, but what do I stand to gain from that?
I've also seen people claim that only people who don't know how to code, or people building super simple, done-a-million-times apps, can get value out of LLMs. I don't believe that applies to my situation, but even if it did, so what? I do real work for a real company delivering real value, and the LLM delivers value to me. It's really as simple as that.
I can't believe they're complaining about LLMs not being able to do unit conversion. Of all things, this is the least interesting and last task I would ever ask an LLM to do.
"By my personal estimate currently GPT 4o DeepResearch is the best one."
If the o3-based, three-month-old strongest model is the best one, that's proof that there were quite significant improvements in the last two years.
I can't name any other technology that improved as much in two years.
o1 and o1 pro helped me with filing tax returns and answered questions that (probably quite bad) tax accountants (and less smart models) weren't able to (of course I read the referenced laws; I don't just trust the output either).
I hope she is aware of the limited context window and the limited ability to retrieve older tokens from conversations.
I have used LLMs for the exact same purposes she has - summarizing chapters or whole books and finding the source of a quote - both with success.
I think the key to a successful output lies in the way you prompt it.
Hallucinations should be expected though; as we all hopefully know, LLMs are more of an autocomplete than an intelligence, and we should stick to that mindset.
I feel like at this point, when people make some claim about LLMs, they need to actually include the model they are using. There are so many "LLMs can / can't do X" claims without reference to the model, which I think is relevant.
I am hoping that the LLM approach will face increasingly diminishing returns, however, so I am biased toward Sabine's griping. I don't want LLMs to go all the way to "AGI".
I haven't met any developers in real life who are hyped about LLMs. The only people who seem excited are managers and others who don't really know how to program.
I don't actually like LLMs that much or find them useful, but it's clear there are some things they are good at and some things they are bad at, and this post is all about the things they are weak at.
LLMs are a little bit magical but they are still a square peg. The fact they don't fit in a round hole is uninteresting. The interesting thing to debate is how useful they are at the things they are good at, not at the things they are bad at.
The keyword in title is "bullish". It's about the future.
Specifically I think it's about the potential of the transformer architecture & the idea that scaling is all that's needed to get to AGI (however you define AGI).
> Companies will keep pumping up LLMs until the day a newcomer puts forward a different type of AI model that will swiftly outperform them.
I don't mean to be disparaging to the original author, but I genuinely think a good litmus test today for the calibre of a human researcher's intelligence is what they are able to get out of the current state-of-the-art AI with all of its faults. Contrasting any of Terence Tao's comments over the last year with the comments above is telling. Perhaps it seems unfair to contrast such a celebrated mathematician with a popular science author, but in fact one would a priori expect that AI ought to be less helpful for the most talented mind in a field. Yet we seem to find exactly the opposite: this cottage industry of "academics" who, now more than a year since LLMs entered popular consciousness, seem to do nothing but decry "AI hype" - while both everyday people and seemingly the most talented researchers continue to pursue what interests them at a higher level.
If I were to be cynical, I think we've seen over the last decade the descent of most of academia, humanities as much as natural sciences, to a rather poor state, drawing entirely on a self-contained loop of references without much use or interest. Especially in the natural sciences, one can today with little effort obtain an infinitely more insightful and, yes, accurate synthesis of the present state of a field from an LLM than 99% of popular science authors.
This user just doesn't understand how to use an LLM properly.
The best solution to hallucination and inaccuracy is to give the LLM mechanisms for looking up the information it lacks. Tools, MCP, RAG, etc are crucial for use cases where you are looking for factual responses.
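A minimal sketch of that idea, with `search` and `llm` left as placeholders for whatever retrieval backend and model API you actually use (this is not any specific product's API): retrieve first, then let the model answer only from what was retrieved.

```python
def answer_with_retrieval(question: str, search, llm) -> str:
    """Ground the model in retrieved documents instead of its parametric memory.

    `search(question)` is any lookup you trust (a vector store, a web search API,
    an internal index) returning dicts with 'url' and 'text'; `llm(prompt)` is any
    chat-completion call.
    """
    passages = search(question)  # e.g. top-5 chunks with their source URLs
    context = "\n\n".join(f"[{p['url']}]\n{p['text']}" for p in passages)
    prompt = (
        "Answer the question using ONLY the sources below. "
        "Cite the URL you relied on, and say 'not found' if the sources don't cover it.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```

The URLs it can cite are then ones that were actually fetched, rather than ones the model dreamed up.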
This is interesting. I have replaced Google search with ChatGPT and Meta's AI. They more than deliver. Thinking about my use cases, I use Google to recollect things or to give me a starting point for further research. LLMs are great for that, so I am never going back to Google. However, I am curious about the cases where the OP sees this great gap and these failures.
This article seems to focus on the shortcomings of LLMs being wrong, but fails to consider the value LLMs provide and just how large of an opportunity that value presents.
If you look at any company on earth, especially large ones, they all share the same line item as their biggest expense: labor. Any technology that can reduce that cost represents an overwhelmingly huge opportunity.
Upon asking ChatGPT multiple times, it just wouldn't tell me why Vlad the Impaler has this name. It will refuse to say anything cruel, I guess, but it's very frustrating when history is not represented truthfully.
When I'm asking about topics unknown to me, what is it hiding from me? I don't know, and that's awful.
She's a scientist. In that area LLMs are quite useful in my opinion and part of my daily workflow. Quick scripts that use APIs to get data, cleaning the data and converting it. Quickly working with Polars data frames. Dumb daily stuff like "take this data from my CSV file and turn it into a LaTeX table"... but most importantly, freeing up time from tedious administrative tasks (let's not go into detail here).
Also great for brainstorming and quick drafting grant proposals. Anything prototyping and quickly glued together I'll go for LLMs (or LLM agents). They are no substitute for your own brain though.
I'm also curious about the hallucinated sources. I've recently read some papers on using LLM agents to conduct structured literature reviews, and they do it quite well and fairly reproducibly. I'm quite willing to build some LLM agents to reproduce my literature review process in the near future, since it's fairly algorithmic: check for surveys and reviews on the topic, scan for interesting papers within, check sources of sources, go through A-tier conference proceedings for the last X years and find relevant papers. Rinse, repeat.
I'm mostly bullish because of LLM-agents, not because of using stock models with the default chat interface.
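For what it's worth, the first pass of that loop is easy to sketch. Below is a hedged example using the public Semantic Scholar search endpoint as one possible paper source; the `judge` callback stands in for whatever LLM relevance check you would plug in, and nothing here is the commenter's actual setup.

```python
import requests

S2_SEARCH = "https://api.semanticscholar.org/graph/v1/paper/search"

def find_survey_candidates(topic: str, limit: int = 20) -> list[dict]:
    """Step 1 of the described process: pull surveys/reviews on the topic."""
    resp = requests.get(
        S2_SEARCH,
        params={"query": f"survey {topic}", "limit": limit,
                "fields": "title,year,abstract,externalIds"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("data", [])

def literature_pass(topic: str, judge) -> list[dict]:
    """Keep the candidates an LLM judge flags as relevant.

    `judge(title, abstract)` is any LLM call returning True/False; checking
    sources-of-sources and sweeping conference proceedings would be further
    passes over the same pattern.
    """
    return [p for p in find_survey_candidates(topic)
            if judge(p.get("title", ""), p.get("abstract") or "")]
```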
OP highlights application problems, and RAG specifically. But that is not an LLM problem.
Chat is such a “leaky” abstraction for LLMs
I think most people share the same negative experience because they only interact with LLMs through the chat UIs from OpenAI and Anthropic. The real magic moment for me was still the autocompletion moment from GitHub Copilot.
My experience mirrors hers. Asking questions is worthless because the answers are either 404 links, telling me how to use a search engine, or just flat out wrong and the code generated compiles maybe one time out of ten and when it does the implementation is usually poor.
When I evaluate them against areas where I possess professional expertise, I become convinced LLMs produce the Gell-Mann amnesia effect for any area I don't know.
Funny how most of the counter comments here used the form "my experience is different/it's amazing!" and then listed activities that are completely different from what Sabine listed :)
We really should stop reinforcing our echo bubbles and learn from other people. And sometimes be cool in the face of criticism.
I'm just glad I have something that can write logic in any unix shell script for me. It's the right combination of thing I don't have to do often, thing that doesn't fit my mental model, thing that works differently based on the platform, and thing I just can't be bothered to master.
What is a good tutorial or training to learn about LLMs from scratch?
I guess that’s the main problem, even more so for non-developers and -tech people. The learning curve is too steep, and people don’t know where to start.
LLM interactions are more a reflection of the user than of an objective technology. I'm starting to believe that people who think LLMs are terrible are suboptimal communicators. Because their input is bad, the output is too.
To me, the bullishness stems from the observation that the current generation of LLMs can plausibly lead to a route to human-level intelligence / AGI, because they have this causal behavior: (memory/context + inputs) --> outputs. Vaguely similar to humans.
Well strap in, because I only see this getting worse (or better, depending on your outlook). Faster chips, more power, more algorithm research and more data. I don't know what's coming exactly, but changes are coming fast.
I wonder when we will get an LLM which might be more "stupid" but knows what it does not know, rather than hallucinating... Though perhaps by then it would not be an LLM but an entirely different tech :)
Experts will be in denial of LLMs for a long time, while the non-experts will swiftly use it to bridge their own knowledge gap. This is the use case for LLMs, maybe more so than 100% correctness.
I'm genuinely just unimpressed by computers and space craft. Sometimes they just won't boot up, or blow up. I also just am unimpressed by wireless 4G/5G internet,... the printing press. (sarcasm)
I feel that sentiment on LLMs has almost approached a left/right-style ideological divide, where both sides seem to have a totally different interpretation of what they are seeing and what is important.
It feels like OP is not using LLM tools correctly. Yes, they hallucinate, but I've found they rarely do on the first run. It's only when you insist that they start doing it.
If you have tried to use LLMs and find them useless or without value, you should seriously consider learning how to correctly use them and doing more research. It is literally a skill issue and I promise you that you are in the process of being left behind. In the coming years Human + AI cooperatives are going to far surpass you in terms of efficiency and output. You are handicapping yourself by not becoming good at using them. These things can deliver massive value NOW. You are going to be a grumbling gray beard losing your job to 22 year old zoomers who spend 10 hours a day talking to LLMs.
A better model would be good for the stock market. The S&P 500 is shovel town. What would be bad is if AI fizzles out and doesn't deliver on hype valuations.
LLMs are not a miracle, they are a type of tool. The hype I am angry about is the “black magic? Well, clearly that can solve every problem” mode of thought.
I realized it would be a bubble this Christmas when I ran across a neighbor of mine walking their dog. I told them what I did and they immediately asked if I worked with AI; they were looking for AI programmers. No regard for details, or what field - just get me AI programmers.
He's a person with money and he wants AI programmers. I bet there are millions like him.
Don't get me wrong though, I do believe in a future with LLMs. But I believe they will become more and more specialized for specific tasks. The more general an AI is, the more it's likely to fail.
I use them every day and they work great. I even made a command (using Claude - actually Claude made everything in that script) that calls Gemini from the terminal, so I can ask shell-related questions directly there, just by typing: ai "how can I convert a webp to a png". The system prompt asks it to be brief, to use markdown (it displays nicely), notes that most questions are related to Linux, and provides information about my OS (uname -a); the last code block is also copied to the clipboard. Super useful - I imagine there are plenty of similar utilities online.
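For what it's worth, a minimal Python sketch of that kind of helper - the model name, prompt wording, and libraries are my guesses, not the commenter's actual script - using the google-generativeai client and pyperclip for the clipboard step:

```python
#!/usr/bin/env python3
"""Tiny terminal helper: `ai "how can I convert a webp to a png"`."""
import os
import platform
import re
import sys

import google.generativeai as genai
import pyperclip  # clipboard helper; `pip install pyperclip`

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

SYSTEM = (
    "Be brief, answer in markdown, and assume questions are about Linux. "
    f"The user's system is: {platform.platform()}"  # stands in for `uname -a`
)

def main() -> None:
    question = " ".join(sys.argv[1:])
    model = genai.GenerativeModel("gemini-1.5-flash", system_instruction=SYSTEM)  # assumed model
    answer = model.generate_content(question).text
    print(answer)
    fence = "`" * 3  # avoid embedding a literal fence in this snippet
    blocks = re.findall(fence + r"(?:\w+)?\n(.*?)" + fence, answer, flags=re.DOTALL)
    if blocks:
        pyperclip.copy(blocks[-1].strip())  # last code block goes to the clipboard

if __name__ == "__main__":
    main()
```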
I wonder if this site isn't impressed because there are a lot of 1% coders here that don't understand what most people do at work. It's mostly administrative. Take this spreadsheet, combine with this one, stuff it into power BI, email it to Debbie so she can spell check it and send to the client. Y'all forget there are companies that actually make things that don't have insane valuations like your bullshit apps do. A company that scopes municipal sewer pipes can't afford a $500k/yr dev, so there's a few $60k/yr office workers that fiddle with reports and spreadsheets all day. It's literally a whole department in some cases. Those are the jobs that are about to be replaced and there are a lot of those jobs out there.
So many people have shown me these stupid-ass AI summaries for random things, and if you have even a basic understanding of the relevant issue, the answers just seem bizarre. This feels like cheating on my homework and not understanding - like using Photomath on homework does.
I think of LLMs like smart but unreliable humans. You don't want to use them for anything that you need to have right. I would never have one write anything that I don't subsequently go over with a fine-toothed comb.
With that said, I find that they are very helpful for a lot of tasks, and improve my productivity in many ways. The types of things that I do are coding and a small amount of writing that is often opinion-based. I will admit that I am somewhat of a hacker, and more broad than deep. I find that LLMs tend to be good at extending my depth a little bit.
From what I can tell, Sabine Hossenfelder is an expert in physics, and I would guess that she already is pretty deep in the areas that she works in. LLMs are probably somewhat less useful at this type of deep, fact-based work, particularly because of the issue where LLMs don't have access to paywalled journal articles. They are also less likely to find something that she doesn't know (unlike with my use cases, where they are very likely to find things that I don't know).
What I have been hearing recently is that it will take a long time before LLMs are better than humans at everything. However, they are already better than many, many humans at a lot of things.
I have a very specific esoteric question like: "What material is both electrically conductive and good at blocking sound?" I could type this into google and sift through the titles and short descriptions of websites and eventually maybe find an answer, or I can put the question to the LLM and instantly get an answer that I can then research further to confirm.
This is significantly faster, more informative, more efficient, and a rewarding experience.
As others have said, it's a tool. A tool is only as good as how you use it. If you expect to build a house by yelling at your tools, I wouldn't be bullish either.
My take is that if you expect current LLMs to be some near perfect, near-AGI models, then you're going to be sorely disappointed.
If that disappoints you to such a degree that you simply won't use them, you might find yourself in a position some years ahead - could be 1...could be 2...could be 5...could be 10 - who knows, but when the time comes, you might just be outdated and replaced yourself.
When you closely follow the incremental improvements of tech, you don't really fall for the same hype hysteria. If you on the other hand only look into it when big breakthroughs are made, you'll get caught in the hype and FOMO.
And even if you don't want to explicitly use the tools, at least try to keep some surface-level attention to the progress and improvements.
I honestly believe that there are many, many senior engineers / scientists out there that currently just scoff at these models, and view them as some sort of toy tech that is completely overblown and overhyped. They simply refuse to use the tools. They'll point to some specific time a LLM didn't deliver, roll their eyes, and call it useless.
Then when these tools progress, and finally meet their standards, they will panic and scramble to get into the loop. Meanwhile their non-tech bosses and executives will see the tech as some magic that can be used to reduce headcount.
I mean the answer is simple, money. There's a bajillion dollars getting shoved into this crap, and most of the bulls are themselves pushing some AI thing. Look at YC, pretty much everything they're funding has some mention of AI. It's a massive bubble with people trying to cash in on the hype. Plus the managerial class being the scum they are, they're bullish because they keep getting sold on the idea that they can replace swathes of workers.
Use LLMs for what they are good at. Most of my prompts start with “What are my options for … ?” And they excel at that, particularly the recent reasoning models with the ability to search the web. They can help expand your horizons and analyze pros/cons from many angles.
Just today, I was thinking of making changes to my home theater audio setup and there are many ways to go about that, not to mention lots of competing products, so I asked ChatGPT for options and gave it a few requirements. I said I want 5.1 surround sound, I like the quality and simplicity of Sonos, but I want separate front left and right speakers instead of “virtual” speakers from a soundbar. I waited years thinking Sonos would add that ability, but they never did. I said I’d prefer to use the TV as the hub and do audio through eARC to minimize gaming latency and because the TV has enough inputs anyway, so I really don’t need a full blown AV receiver. Basically just a DAC/preamp that can handle HDMI eARC input and all of the channels.
It proceeded to tell me that audio-only eARC receivers that support surround sound don’t really exist as an off-the-shelf product. I thought, “What? That can’t be right, this seems like an obvious product. I can’t be the first one to have thought of this.” Turns out it was right, there are some stereo DAC/preamps that have an eARC input and I could maybe cobble together one as a DIY project, but nothing exactly like what I wanted. Interesting!
ChatGPT suggested that it’s probably because by the time a manufacturer fully implements eARC and all of the format decoding, they might as well just throw in a few video inputs for flexibility and mass-market appeal, plus one less SKU to deal with. And that kind of makes sense, though it adds excess buttons and bothers me from a complexity standpoint.
It then suggested WISA as a possible solution, which I had never heard of, and as a music producer I pay a lot of attention to speaker technology, so that was interesting to me. I’m generally pretty skeptical of wireless audio, as it’s rarely done well, and expensive when it is done well. But WISA seems like a genuine alternative to an AV receiver for someone who only wants it to do audio. I’m probably going to go with the more traditional approach, but it was fun learning about new tech in a brainstorming discussion. Google struggles with these sorts of broad research queries in my experience. I may or may not have found out about it if I had posted on Reddit, depending on whether someone knowledgeable happened to see my post. But the LLM is faster and knows quite a bit about many subjects.
I also can’t remember the last time it hallucinated when having a discussion like this. Whereas, when I ask it to write code, it still hallucinates and makes plenty of mistakes.
I have been using Claude this week for the first time on a _slightly_ bigger SwiftUI project than the few lines of bash or SQL I used it for before. I have never used Swift before, but I am amazed how much Claude could do. It feels to me as if we are at the point where anyone can now generate small tools for themselves with low effort. Maybe not production-ready, but good enough to use yourself. It feels like it should be good enough to empower the average user to break out of having to rely on pre-made apps to do small things. Kind of like bash for the average Joe.
What worked:
- generated a mostly working PoC with minimal input, hallucinating the UI layout, color scheme, etc. This is amazing because it did not bombard me with detailed questions; it just carried on and provided me with a baseline that I could then fine-tune
- it corrected build issues by me simply copy pasting the errors from Xcode
- got APIs working
- added debug code when it could not fix an issue after a few rounds
- resolved an API issue after I pointed it to a typescript SDK to the API (I literally gave a link to the file and told it, try to use this to work out where the problem is)
- it produces code very fast
What is not working great yet:
- it started off with one large file and crashed soon after because it hit a timeout when regenerating the file. I needed to ask it to split the file into a typical project layout
- some logic I asked it to implement explicitly got changed at some point during an unrelated task. To prevent this in the future, I asked it to mark this code part as important and only change it at explicit request. I don't know yet how long this code will stay protected
- by the time enough context got built up, usage warnings popped up in Claude
- only so many files are supported atm
So my takeaway is that it is very good at translating, i.e., API docs into code, errors into fixes. There is also a fine line between providing enough context and running out of tokens.
I am planning to continue my project to see how far I can push it. As I am getting close to the token limit now, I am thinking of structuring my app in a Claude-friendly way:
- clear internal APIs. Kind of like header files so that I can tell Claude what functions it can use without allowing it to change them or needing to tokenize the full source code
- adversarial testing. I don’t have tests yet, but I am thinking of asking one dedicated instance of Claude to generate tests. I will use other Claude instances for coding and provide them with failing test outputs like I do now with build errors. I hope it will fix itself similarly.
This is just the beginning - and it need not be nation states. Imagine instead of Russian disinfo, it is oil companies doing the same thing with positive takes on oil, climate change is a hoax, etc. Or <insert religion> pushing a narrative against a group they are against.
Perhaps attitudes to this new phenomenon are correlated with propensity to skepticism in general.
I will cite myself as Exhibit A. I am the sort of person who takes almost nothing at face value. To me, physiotherapy, and oenology, and musicology, and bed marketing, and mineral-water benefits, and very many other such things, are all obviously pseudoscience, worthy of no more attention than horoscopes. If I saw a ghost I would assume it was a hallucination caused by something I ate.
So it seems like no coincidence that I reflexively ignore the AI babble at the top of search results. After all, an LLM is a language-rehashing machine which (as we all know by now) does not understand facts. That's terribly relevant.
I remember reading, a couple of years back, about some Very Serious Person (i.e. a credible voice, I believe some kind of scientist) who, after a three-hour conversation with ChatGPT, had become convinced that the thing was conscious. Rarely have I rolled my eyes so hard. It occurred to me then that skepticism must be (even) less common a mindset than I assumed.
Regardless of whether she's right or not, isn't she the same person who recently said that Europe needs to invest tens of billions in foundation models as soon as possible?
I genuinely don’t understand why everyone is so hyper polarized on LLMs. I find them to be fantastically useful tools in certain situations, and definitely one of the most impressive technological developments in decades. It isn’t some silver-bullet, solve all issues solution. It can be wrong and you can’t simply take things it outputs for granted. But that does not make it anywhere near useless or any less impressive. This is especially true considering how truly young the technology is. It literally didn’t exist 10 years ago, and the iteration that thrust them into the public is less than 3 years old and has already advanced remarkably. I find the idea that they are useless snake-oil to be just as deluded of a take as the people claiming we have solved AGI.
As far as I understand LLMs, what is being asked is unfortunately close to impossible with an LLM.
Also, I find it disingenuous that apologists keep saying things close to "you are using it wrong", when LLM-based AI is advertised as something that should be more and more trusted (because it is more accurate, by some arbitrary metrics) and that might save you time (on some undescribed task).
Of course, in that use case most would say to use your own judgement to verify whatever is generated. But for the generation that uses LLMs as a source of knowledge (like some people use Wikipedia or Stack Overflow as a source of truth), it will be difficult to verify anything when LLM-generated content is all they have ever known as a source of knowledge.
Never underestimate the momentum of a poorly understood idea that appears as magic to the average person. Once the money starts flowing, that momentum will increase until the idea hits a brick wall that its creators can't gaslight people about.
I hope that realization happens before "vibe coding" is accepted as standard practice by software teams (especially when you consider the poor quality of software before the LLM era). If not, it's only a matter of time before we refer to the internet as "something we used to enjoy."
Not everyone is as impressed as us by new tech, especially when it's kinda buggy.
The article makes a lot of good points. I get a lot of slop responses to both coding and non-coding prompts, but I've also gotten some really really good responses, especially code completion from Copilot. Even today, ChatGPT saved me a ton of Google searches.
I'm going to continue using it and taking every response with a grain of salt. It can only get better and better.
I like them, but I still feel like I am employing a bunch of overconfident, narcissistic engineers, and if I ask them to do something, I am never really comfortable that the result is correct.
What I want is a work force where I can pass off a request and go home in confidence that the results were correctly implemented.
I watched this creator on YouTube and she is unsatisfied with LLMs all the time. I just blocked her so I don't see her videos, because they give off nothing but bad vibes. Wishing her good luck finding a topic she will be happy to discuss.
The fact that we have to learn a tool doesn’t make it a bad one. The fact that a tool doesn’t always get it 100% on the first try doesn’t make it useless. I strip a lot of screws with my screwdriver, but I don’t blame the screwdriver.
Of late, deaf tech forums have been taken over by language-model debates over which works best for speech transcription. (Multimodal language models are the state of the art in machine transcription. Everyone seems to forget that when complaining they can't cite sources for scientific papers yet.) The debates are at the point where it's become annoying how much space they take up, just like here on HN.
But then I remember: oh yeah, there was no such thing as live machine transcription ten years ago. And now there is. And it's going to continue to get better. It's already good enough to be very useful in many situations. I have elsewhere complained about the faults of AI models for machine transcription - in particular, when they make mistakes they tend to hallucinate something that is superficially grammatical and coherent instead - but for a sporadic single phrase in an audio transcription, that's sometimes tolerable. In many cases you still want a human transcriber, but the cost of that means the amount of transcription needed can never be satisfied.
It's a revolutionary technology. I think in a few years I'm going have glasses that continuously narrate the sounds around me and transcribe speech and it's going to be so good I can probably "pass" as a hearing person in some contexts. It's hard not to get a bit giddy and carried away sometimes.
These critics don't seem to have learned the lesson that the perfect is the enemy of the good.
I use ChatGPT all the time for academic research. Does it fabricate references? Absolutely, maybe about a third of the time. But has it pointed me to important research papers I might never have found otherwise? Absolutely.
The rate of inaccuracies and falsehoods doesn't matter. What matters is, is it saving you time and increasing your productivity. Verifying the accuracy of its statements is easy. While finding the knowledge it spits out in the first place is hard. The net balance is a huge positive.
People are bullish on LLM's because they can save you days' worth of work, like every day. My research productivity has gone way up with ChatGPT -- asking it to explain ideas, related concepts, relevant papers, and so forth. It's amazing.
Imagine there is a probabilistic oracle that can answer any yes/no question correctly with probability p. If p=100% or p=0% then it is obviously very useful. If p=50% then it is absolutely worthless. In other cases, such an oracle can be utilized in different ways to get the answer we want, and it is still a useful thing.
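To make that concrete: if p is strictly above 0.5 and the oracle's errors are independent across queries (a big assumption), majority voting over repeated queries drives the error rate toward zero. A small worked sketch:

```python
import random

def oracle(truth: bool, p: float) -> bool:
    """A yes/no oracle that answers correctly with probability p."""
    return truth if random.random() < p else not truth

def majority_vote(truth: bool, p: float, n: int = 21) -> bool:
    """Ask n times and take the majority; this only helps if errors are independent."""
    yes_votes = sum(oracle(truth, p) for _ in range(n))
    return yes_votes > n / 2

# With p = 0.76 a single query is wrong about 24% of the time, but the majority
# of 21 independent answers is wrong well under 1% of the time. With p = 0.5 no
# amount of repetition helps - the "absolutely worthless" case above.
trials = 10_000
accuracy = sum(majority_vote(True, 0.76) for _ in range(trials)) / trials
print(f"majority-vote accuracy at p=0.76: {accuracy:.3f}")
```

(And an oracle with p below 0.5 is useful too: just invert its answers.)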
There’s no question that we’re in a bubble which will eventually subside, probably in a “dot com” bust kind of way.
But let me tell you…last month I sent several hundred million requests to AI, as a single developer, and got exactly what I needed.
Three things are happening at once in this industry… (1) executives are over promising a literal unicorn with AGI, that is totally unnecessary for the ongoing viability of LLM’s and is pumping the bubble. (2) the technology is improving and delivery costs are changing as we figure out what works and who will pay. (3) the industry’s instincts are developing, so it’s common for people to think “AI” can do something it absolutely cannot do today.
But again…as one guy, for a few thousand dollars, I sent hundreds of millions of requests to AI that are generating a lot of value for me and my team.
Our instincts have a long way to go before we’ve collectively internalized the fact that one person can do that.
1. Write python code for a new type of loss function I was considering
2. Perform lots of annoying CSV munging ("split this CSV into 4 equal parts", "convert paths in this column into absolute paths", "combine these and then split into 4 distinct subsets based on this field.." - they're great for that)
3. Expedite some basic shell operations like "generate softlinks for 100 randomly selected files in this directory"
4. Generate some summary plots of the data in the files I was working with
5. Not to mention extensive use in Cursor & GH Copilot
The tool (Claude 3.7 mostly, integrated with my shell so it can execute shell commands and run Python locally) worked great in all cases. Yes, I could've done most of it myself, but I personally hate CSV munging and bulk file manipulations, and it's super nice to delegate that stuff to an LLM agent
edit: formatting
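As a flavour of item 2, here is a minimal pandas sketch of the "split this CSV into 4 equal parts" request (file names are placeholders, not from the comment):

```python
import pandas as pd

df = pd.read_csv("data.csv")            # placeholder input file
n_parts = 4
rows_per_part = -(-len(df) // n_parts)  # ceiling division

for i in range(n_parts):
    part = df.iloc[i * rows_per_part : (i + 1) * rows_per_part]
    part.to_csv(f"data_part{i + 1}.csv", index=False)
```

It is exactly the kind of thing anyone could write, which is precisely why it is pleasant to delegate.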
Please dispense with anyone's "expectations" when critiquing things! (Expectations are not a fault or property of the object of the expectations.)
Today's models (1) do things that are unprecedented. Their generality of knowledge, and ability to weave completely disparate subjects together sensibly, in real time (and faster if we want), is beyond any other artifact in existence. Including humans.
They are (2) progressing quickly. AI has been an active field (even through its famous "winters") for several decades, and they have never moved forward this fast.
Finally and most importantly (3), many people, including myself, continue to find serious new uses for them in daily work, that no other tech or sea of human assistants could replace cost effectively.
The only way I can make sense out of anyone's disappointment is to assume they simply haven't found the right way to use them for themselves. Or are unable to fathom that what is not useful for them is useful for others.
They are incredibly flexible tools, which means a lot of value, idiosyncratic to each user, only gets discovered over time with use and exploration.
That they have many limits isn't surprising. What doesn't? Who doesn't? Zeus help us the day AI doesn't have obvious limits to complain about.
For example, the other day I was chatting with it about the health risks associated with my high consumption of farmed salmon. It then generated a small program to simulate the accumulation of PCBs in my body. I could review the program, ask questions about the assumptions, etc. It all seemed very reasonable. A toxicokinetic analysis, it called it.
It then struck me how immensely valuable this is to a curious and inquisitive mind. This is essentially my gold standard of intelligence: take a complex question and break it down in a logical way, explaining every step of the reasoning process to me, and be willing to revise the analysis if I point out errors / weaknesses.
Now try that with your doctor. ;)
Can it make mistakes? Sure, but so can your doctor. The main difference is that here the responsibility is clearly on you. If you do not feel comfortable reviewing the reasoning then you shouldn’t trust it.
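For flavour, a minimal sketch of the kind of one-compartment accumulation model such a conversation might produce - every number below is an illustrative placeholder, not toxicology data, and this is not the program the commenter actually got:

```python
# One-compartment model: body burden changes by weekly intake minus first-order elimination.
# All parameters are made-up placeholders for illustration only.
weekly_intake_ug = 2.0                       # hypothetical PCB intake per week, micrograms
half_life_weeks = 5 * 52                     # assume a multi-year elimination half-life
elimination_rate = 0.693 / half_life_weeks   # first-order rate constant per week

burden = 0.0
for week in range(1, 20 * 52 + 1):           # simulate 20 years
    burden += weekly_intake_ug
    burden -= elimination_rate * burden
    if week % (5 * 52) == 0:
        print(f"year {week // 52:2d}: body burden ~ {burden:,.0f} ug")

# The burden plateaus near intake / elimination_rate rather than growing without
# bound - the sort of assumption you would want to interrogate in the chat.
```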
On the other hand, for trivial technical problems with well known solutions, LLMs are great. But those are in many senses the low value problems; you can throw human bodies against that question cheaply. And honestly, before Google results became total rubbish, you could just Google it.
I try to use LLMs for various purposes. In almost all cases where I bother to use them, which are usually subject matters I care about, the results are poorer than I can quickly produce myself because I care enough to be semi-competent at it.
I can sort of understand the kinds of roles that LLMs might replace in the next few years, but there are many roles where it isn’t even close. They are useless in domains with minimal training data.
I’m sure that in the future there will be a really good search tool that utilises an LLM but for now a plain model just isn’t designed for that. There are a ton of other uses for them, so I don’t think that we should discount them entirely based on their ability to output citations.
In programming circles it's also annoying when you try to help and you get fed garbage outputted by LLMs.
I believe models for generating visuals (image, video, sound generation) are much more interesting, as that's an area where errors do not matter as much. Though the ethics of how these models have been trained is another matter.
Category 1: people who don't like to admit that anything trendy can also be good at what it does.
Category 2: people who don't like to admit that anything made by for-profit tech companies can also be good at what it does.
Category 3: people who don't like to admit that anything can write code better than them.
Category 4: people who don't like to admit that anything which may put people out of work - people who didn't deserve it and who already earn less than the people creating the thing - can also be good at what it does
Category 5: people who aren't using llms for things they are good at
Category 6: people who can't bring themselves to communicate with AIs with any degree of humility
Category 7: people to whom none of the above applies
Write code to pull down a significant amount of public data using an open API. (That took about 30 seconds - I just gave it the swagger file and said “here’s what I want”)
Get the data (an hour or so), clean the data (barely any time, gave it some samples, it wrote the code), used the cleaned data to query another API, combined the data sources, pulled down a bunch of PDFs relating to the data, had the AI write code to use tesseract to extract data from the PDFs, and used that to build a dashboard. That’s a mini product for my users.
I also had a play with Mistral’s OCR and have tested a few things using that against the data. When I was out walking my dogs I thought about that more, and have come up with a nice workflow for a problem I had, which I’ll test in more detail next week.
That was all while doing an entirely different series of tasks, on calls, in meetings. I literally checked the progress a few times and wrote a new prompt or copy/pasted some stuff in from dev tools.
For the calls I was on, I took the recordings, passed them into my local instance of Whisper, fed the transcript into Claude with a prompt I use to extract action points, pasted those into a Google doc, and circulated them.
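That transcript-to-action-points step is straightforward to sketch. The model name and prompt wording below are my guesses rather than the commenter's actual setup, assuming the openai-whisper package and the Anthropic Python SDK:

```python
import anthropic
import whisper

ACTION_PROMPT = (
    "From the call transcript below, extract the action points as a bulleted list, "
    "each with an owner if one was mentioned.\n\nTranscript:\n{transcript}"
)

def call_to_actions(audio_path: str) -> str:
    """Local transcription with Whisper, then action-point extraction with Claude."""
    transcript = whisper.load_model("base").transcribe(audio_path)["text"]
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    reply = client.messages.create(
        model="claude-3-7-sonnet-latest",  # assumed model alias
        max_tokens=1024,
        messages=[{"role": "user",
                   "content": ACTION_PROMPT.format(transcript=transcript)}],
    )
    return reply.content[0].text

print(call_to_actions("call_recording.mp3"))  # placeholder file name
```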
One of the calls was an interview with an expert. The transcript + another prompt has given me the basis for an article (bulleted narrative + key quotes) - I will refine that tomorrow, and write the article, using a detailed prompt based on my own writing style and tone.
I needed to gather data for a project I’m involved in, so had Claude write a handful of scrapers for me (HTML source > here is what I need).
I downloaded two podcasts I need to listen to - but only need to listen to five minutes of each - and fed them into whisper then found the exact bits I needed and read the extracts rather than listening to tedious podcast waffle.
I turned an article I’d written into an audio file using elevenlabs, as a test for something a client asked me about earlier this week.
I achieved about three times as much today as I would have done a year ago. And finished work at 3pm.
So yeah, I don’t understand why people are so bullish about LLMs. Who knows?
It's a hammer -- sometimes it works well. It summarizes the user reviews on a site... cool, not perfect, but useful.
And like every tool, it is useless for 90% of life's situations.
And I know when it's useful because I've already tried a hammer on 1000 things and have figured out what I should be using a hammer on.
Why do we burden ourselves with such expectations? Look at cities like Dallas. It is designed for cars, not for humans walking. The buildings are far apart, workplaces are far from homes, and everything looks like it was designed for some King Kong-like creatures.
The burden of expectations on humans is driven by technology. Technology makes you work harder than before; it didn't make your life easier. Check how hectic life has become for you now versus a laid-back village peasant a century back.
The bullishness on LLMs is a bet that this trend of self-inflicted human agony and dependency on tech continues. Man is going back to the cradle, and LLMs are the milk bottle.
It's 100x easier to see how LLMs change everything. It takes very little vision to see what an advancement they are. I don't understand how you can NOT be bullish about LLMs (whether you happen to like them or not is a different question).
Or how about computer graphics? Early efforts to move 3D graphics hardware into the PC realm were met with extreme skepticism by my colleagues who were “computer graphics researchers” armed with the latest Silicon Graphics hardware. One researcher I was doing some work with in the mid-nineties remarked about PC graphics at the time: “It doesn’t even have a frame buffer. Look how terrible the refresh rate is. It flickers in a nauseating way.” Etc.
It’s interesting how people who are actual experts in a field where there is a major disruption going on often take a negative view of the remarkable new innovation simply because it isn’t perfect yet. One day, they all end up eating their words. I don’t think it’s any different with LLMs. The progress is nothing short of astonishing, yet very smart people continue to complain about this one issue of hallucination as if it’s the “missing framebuffer” of 1990s PC graphics…
On one hand, LLMs are overhyped and not delivering on promises made by their biggest advocates.
On the other hand, any other type of technology (not so overhyped) would be massively celebrated in significantly improving a subset of niche problems.
It’s worth acknowledging that LLMs do solve a good set of problems well, while also being overhyped as a silver bullet by folks who are generally really excited about its potential.
Reality is that none of us know what the future holds, or whether LLMs will have enough breakthroughs to solve more problems than they do today, but what they do solve today is still very impressive as is.
But even if it does work, you still need to double-check everything it does.
Anyway, my RPG group is going to try roleplaying with AI generated content (not yet as GM). We'll see how it goes.
> Eisegesis is "the process of interpreting text in such a way as to introduce one's own presuppositions, agendas or biases". LLMs feel very smart when you do the work of making them sound smart on your own end: when the interpretation of their output has a free parameter which you can mentally set to some value which makes it sensible/useful to you.
> This includes e. g. philosophical babbling or brainstorming. You do the work of picking good interpretations/directions to explore, you impute the coherent personality to the LLM. And you inject very few bits of steering by doing so, but those bits are load-bearing. If left to their own devices, LLMs won't pick those obviously correct ideas any more often than chance.
I think LLMs work best when they are used as a "creative" tool. They're good for the brainstorming part of a task, not for the finishing touches.
They are too unreliable to be put in front of your users. People don't want to talk to unpredictable chatbots. Yes, they can be useful in customer service chats because you can put them on rails and map natural language to predetermined actions. But generally speaking I think LLMs are most effective when used _by_ someone who's piloting them instead of wrapped in a service offered _to_ someone.
I do think we've squeezed 90%+ of what we could from current models. Throwing more dollars of compute at training or inference won't make much difference. The next "GPT moment" will come from some sufficiently novel approach.
LLMs did the "styling" first. They generate high-quality language output of the sort most of us would take as a sign of high intelligence and education in a human. A human who can write well can probably reason well, and probably even has some knowledge of facts they write confidently about.
> I use GPT, Grok, Gemini, Mistral etc every day in the hope they'll save me time searching for information and summarizing it.
Even worse, you're continually waiting for it to get better. If the present is bright and the future is brighter, bullishness is justified.
She has become more of an influencer than a scientist. And there is nothing wrong with that, unless she tries to pose as an authority on subjects she doesn't have a clue about. It's OK to have an opinion as an outsider, but it's not OK to pretend you are right and that you are an expert on every scientific or technical subject you happen to want to tweet about.
I don't believe OP's thesis is properly backed by the rest of her tweet, which seems to boil down to "LLMs can't properly cite links".
If LLMs performing poorly on an arbitrary, small-scoped test case makes you bearish on the whole field, I don't think that falls on the LLMs.
As soon as LLMs were introduced into the IDE it began to feeling like LLM autocomplete was almost reading my mind. With some context built up over a few hundred lines of initial architecture, autocomplete now sees around the same corners I am. It’s more than just “solve this contrived puzzle” or “write snake”. It combines the subject matter use case (informed by variable and type naming) underlying the architecture and sometimes produces really breathtaking and productive results. Like I said, it took some time but when it happened, it was pretty shocking.
Seeing what Copilot suggests most of the time, though, I would be very uncomfortable using it for vibe coding. I think it's going to be... entertaining watching this trend take off. I don't really fear I'm going to lose my job soon.
I'm skeptical that you can build a business on a calculator that's wrong 10% of the time when you're using it 24/7. You're gonna need a human who can do the math.
New technologies typically require multiple generations of refinement—iterations that optimize hardware, software, cost-efficiency, and performance—to reach mainstream adoption. Similarly, AI, Large Language Models (LLMs), and Machine Learning (ML) technologies are poised to become permanent fixtures across industries, influencing everything from automotive systems and robotics to software automation, content creation, document review, and broader business operations.
Considering the immense volume of new information generated and delivered to us constantly, it becomes evident that we will increasingly depend on automated systems to effectively process and analyze this data. Current challenges—such as inaccuracies and fabrications in AI-generated content—parallel the early imperfections of digital photography. These issues, while significant today, represent evolutionary hurdles rather than permanent limitations, suggesting that patience and continuous improvement will ultimately transform these AI systems into indispensable tools.
I don't think you can even be bullish or bearish about this tech. It's here and it's changing pretty much every sector you can think of. It would be like saying you're not bullish about the Internet.
I honestly can't imagine life without one of these tools. I have a subscription to pretty much all of them because I get so excited to try out new models.
Joining the mess of freeform, redundant, and sometimes self-contradicting data into JSON lines, and feeding it into AI with a big explicit prompt containing example conversions and corrections for possible pitfalls, has resulted in almost magically good output. I added a 'notes' field to the output and instructed the model to call out anything unusual, and it caught lots of date typos by context, ambiguously attributed notes, and more.
It would have been a man month or so of soul drowningly tedious and error prone intern level work, but now it was 40 minutes and $15 of Gemini usage.
So, even if it's not a galaxy brained super intelligence yet, it is a massive change to be able to automate what was once exclusively 'people' work.
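A minimal sketch of that pattern - field names, the example record, and the model name are placeholders, not the commenter's actual schema: each messy record goes in as a JSON line, and the prompt pins down a fixed output schema with a notes field for anything suspicious.

```python
import json

import google.generativeai as genai  # assumes genai.configure(api_key=...) was called

PROMPT_TEMPLATE = """Convert each input record to the target schema.
Target schema: {{"name": str, "date": "YYYY-MM-DD", "amount": float, "notes": str}}
Use the notes field to flag anything unusual (date typos, ambiguous attribution, etc.).

Example:
input:  {{"name": "J Smith", "date": "03/14/2025", "amt": "12,50"}}
output: {{"name": "J Smith", "date": "2025-03-14", "amount": 12.50, "notes": "comma decimal normalised"}}

Input records, one JSON object per line:
{records}
"""

def clean_records(messy_records: list[dict]) -> str:
    """Send the messy rows through the model and return its structured output."""
    lines = "\n".join(json.dumps(r, ensure_ascii=False) for r in messy_records)
    model = genai.GenerativeModel("gemini-1.5-pro")  # assumed model name
    return model.generate_content(PROMPT_TEMPLATE.format(records=lines)).text
```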
When I use LLM’s to explore applications of cutting edge nonlinear optics, I too am appalled about the quality of the output. When I use an LLM to implement a React program, something that has been done hundreds of times before by others, I find it performs well.
The present path of AI is nothing short of revolutionary. A lot of jobs and industries are going to suffer a major upheaval, and a lot of people are just living in some wishful-thinking moment where it will all go away.
I see people complaining it gives them bad results. Sure it does - so does all the other information we take in. It's our job to check it ourselves. Still, the amount of time it saves me, even if I have to correct it, is huge.
I can give an example that has nothing to do with work. I was searching for the smallest miniATX computer cases that would accept at least 3 HDDs (3.5"). The amount of time LLMs saved me is staggering.
Sure, there was one wrong result in the mix, and sure, I had to double check all the cases myself, but, just not having to go through dozens of cases, find the dimensions, calculate the volume, check the HDDs in difficult to read (and sometimes obtain) pages, saved days of work - yes I had done a similar search completely manually about 5 years ago.
This is a personal example, I also have others at work.
It’s truly revolutionary and it’s just starting.
This isn't a technology problem, it's a product problem - and one that may not be solvable with better models alone.
Another issue: people communicate uncertainty naturally. We say "maybe", "it seems", "I'm not sure, but...". LLMs suppress that entirely, for structural reasons. The output sounds confident and polished, which warps perception - especially when the content is wrong.
Also my gf who's not particularly tech savvy relies heavily on ChatGPT for her work. It's very useful for a variety of text (translation, summaries, answering some emails).
Maybe Sabine Hossenfelder tries to use them for things they can't do well and she's not aware that they work for other use cases.
This is the most common 'smart person' fallacy out there.
As for my 2 cents, LLMs can do sequence modeling and prediction tasks, so as long as a problem can be reduced to sequence modeling (which is a lot of them!), they can do the job.
This is like saying that the Fourier Transform is played out because you can only do so much with manipulating signal frequencies.
But even its creators, who acknowledge it is not AGI, are trying to use it as if it were. They want to sell you LLMs as "AI" writ large; that is, they want you to use them as your research assistant, your secretary, your lawyer, your doctor, and so on and so forth. LLMs on their own simply cannot do those tasks. They are great for other uses: troubleshooting, assisting with creativity and ideation, prototyping concepts of the same, and correlating lots of information, so long as a human then verifies the results.
LLMs right now are flour, sugar, and salt, mixed in a bowl and sold as a cake. Because they have no reasoning capability, only rote generation via prediction, LLMs cannot process contextual information in the way required for them to be trustworthy or reliable for the tasks people are trying to use them for. No amount of creative prompting can resolve this totally. (I'll note that I just read the recent Anthropic paper, which uses terms like "AI biology" and "concept" to imply that the AI has reasoning capacity - but I think these are misused terms. An LLM's "concept" of something has no referent in the real world, only a set of weights to other related concepts.)
What LLMs need is some sort of intelligent data store, tuned for their intended purpose, that can generate programmatic answers for the LLMs to decipher and present. Even then, their tendency to hallucinate makes things tough - they might imagine the user requested something they didn't, for instance. I don't have a clear solution to this problem. I suspect whoever does will have solved a much bigger, more complex problem than the already massive one that LLMs have solved, and if they are able to do so, will have brought us much, much closer to AGI.
I am tired of seeing every company under the sun claim otherwise to make a buck.
I haven't tried using LLMs for much else, but I am curious as long as I can run it on my own hardware.
I also totally get having a problem with the massive environmental impact of the technology. That's not AI's fault per se, but it's a valid objection.
I’ve recently been using Gemini (mostly 2.0 flash) a lot and I’ve noticed it sometimes will challenge me to try doing something by myself. Maybe it’s something in my system prompt or the way I worded the request itself. I am a long time user of 4o so it felt annoying at first.
Since my purpose was to learn how to do something, and being open-minded, I tried to comply with the request, and I can say that… it's been a really great experience in terms of retention of knowledge. Even if I make mistakes, Gemini will point them out and explain them nicely.
One is - Google, Facebook, OpenAI, Anthropic, Deepseek etc. have put a lot of capital expenditure into training frontier large language models, and are continuing to do so. There is a current bet that growing the size of LLMs, with more or maybe even synthetic data, with some minor breakthroughs (nothing as big as the AlexNet deep learning breakthrough, or transformers), will have a payoff for at least the leading frontier model. Similar to Moore's law for ICs, the bet is that more data and more parameters will yield a more powerful LLM - without that much more innovation needed. So the question for this is whether the capital expenditure for this bet will pay off.
Then there's the question of how useful current LLMs are, whether we expect to see breakthroughs at the level of Alexnet or transformers in the coming decades, whether non-LLM neural networks will become useful - text-to-image, image-to-text, text-to-video, video-to-text, image-to-video, text-to-audio and so on.
So there's the business side question, of whether the bet that spending a lot of capital expenditure training a frontier model will be worth it for the winner in the next few years - with the method being an increase in data, perhaps synthetic data, and increasing the parameter numbers - without much major innovation expected. Then there's every other question around this. All questions may seem important but the first one is what seems important to business, and is connected to a lot of the capital spending being done on all of this.
It's cliche at this point to say "you're using it wrong" but damn... it really is a thing. It's kind of like how some people can find something online in one Google query and others somehow manage to phrase things just wrong enough that they struggle. It really is two worlds. I can have AI pump out 100k tokens with a nearly 0% error rate, meanwhile my friends with equally high engineering skill struggle to get AI to edit 2 classes in their codebase.
There are a lot of critical skills and a lot of fluff out there. I think the fluff confuses things further. The variety of models and model versions confuses things EVEN MORE! When someone says "I tried LLMs and they failed at task xyz" ... what version was it? How long was the session? How did they prompt it? Did they provide sufficient context around what they wanted performed or answered? Did they have the LLM use tools if that is appropriate (web/deepresearch)?
It's never a like-for-like comparison. Today's cutting-edge models are nothing like even 6-months ago.
Honestly, with models like Claude 3.7 Sonnet (thinking mode) and OpenAI o3-mini-high, I'm not sure how people fail so hard at prompting and getting quality answers. The models practically predict your thoughts.
Maybe that's the problem, poor specifications in (prompt), expecting magic that conforms to their every specification (out).
I genuinely don't understand why some people are still pessimistic about LLMs.
Their strengths and flaws differ from our brains, to be sure, but some of these flaws are being mitigated and improved on by the month. Similarly, unaided humans cannot operate successfully in many situations. We build tools, teams, and institutions to help us deal with them.
For people interested in understanding the possibilities of LLMs for use in a specific domain, see The AI Revolution in Medicine: GPT-4 and Beyond by Peter Lee (Microsoft Research VP), Isaac Kohane (Harvard Biomedical Informatics MD) et al. It is an easy read showing the authors' systematic experiments using the OpenAI models via the ChatGPT interface for the medical/healthcare domain.
For a current-status follow-up to the above book, here is Peter Lee's podcast series The AI Revolution in Medicine, Revisited - https://www.microsoft.com/en-us/research/story/the-ai-revolu...
Instead of reading trivial blogs/tweets etc. which are useless, read the above to get a much better idea of an LLM's actual capabilities.
https://www.nerdwallet.com/article/investing/bullish-vs-bear...
The current discussion about LLMs guarantees that both positive and negative expectations are a valid title for an article xD
'But you're such a killjoy.'
Yes, it is an evil technology in its current shape. So we should focus on fixing it, instead of making it worse.
But I don't understand how you can come to this conclusion when using SOTA models like Claude Sonnet 3.7; its responses have always been useful, and when it doesn't get it right the first time you can keep prompting it with clarifications and error responses. On the rare occasion it's unable to get it right, I'm still left with a bulk of useful code that I can manually fix and refactor.
Either way, my interactions with Sonnet are always beneficial. Maybe it's a prompt issue? I only ask it to perform small, specific, deterministic tasks and provide the necessary context (with examples when possible) to achieve it.
I don't vibe code or unleash an LLM on an entire code base since the context is not large enough and I don't want it to refactor/break working code.
I just updated my company's commercial PPT. ChatGPT helped me with:
- Deep Research for great examples and references of such presentations.
- Restructuring my argument and slides according to some articles I found in the previous step and thought were pretty good.
- Coming up with copy for each slide.
- Iterating on new ideas as I progressed.
Now, without proper context and grounding, LLMs wouldn't be so helpful at this task, because they don't know my company, clients, product and strategy, and would be generic at best. The key: I provided it with my support portal documentation and a brain dump I recorded to text on ChatGPT with key strategic information about my company. Those are two bits of info I keep always around, so ChatGPT can help me with many tasks in the company.
From that grounding to the final PPT, it's pretty much a trivial and boring transformation task that would have cost me many, many hours to do.
An LLM can do some pretty interesting things, but the actual applicability is narrow. It seems to me that you have to know a fair amount about what you're asking it to do.
For example, last week I dusted off my very rusty coding skills to whip up a quick and dirty Python utility to automate something I'd done by hand a few too many times.
My first draft of the script worked, but was ugly and lacked any trace of good programming practices; it was basically a dumb batch file, but in Python. Because it worked part of me didn't care.
I knew what I should have done -- decompose it into a few generic functions; drive it from an intelligent data structure; etc -- but I don't code all the time anymore, and I never coded much in Python, so I lack the grasp of Python syntax and conventions to refactor it well ON MY OWN. Stumbling through with online references was intellectually interesting, but I also have a whole job to do and lack the time to devote to that. And as I said, it worked as it was.
But I couldn't let it go, and then had the idea "hey, what if I ask ChatGPT to refactor this for me?" It was very short (< 200 lines), so it was easy to paste into the Chat buffer.
Here's where the story got interesting. YES, the first pass of its refactor was better, but in order to get it to where I wanted it, I had to coach the LLM. It took a couple passes through before it had made the changes I wanted while still retaining all the logic I had in it, and I had to explicitly tell it "hey, wouldn't it be better to use a data structure here?" or "you lost this feature; please re-add it" and whatnot.
In the end, I got the script refactored the way I wanted it, but in order to get there I had to understand exactly what I wanted in the first place. A person trying to do the same thing without that understanding wouldn't magically get a well-built Python script.
The tech isn't there yet, clearly. And stock valuations are way over the top. But LLMs as a tech != the stock valuations of the companies. And LLMs as a tech are here to stay, improve, and integrate into everyday life more and more - with massive impacts on education (particularly K-12) as models get better at thinking and explaining concepts, for example.
300/5290 functions decompiled and analyzed in less than three hours off of a huge codebase. By next weekend, a binary that had lost source code will have tests running on a platform it wasn't designed for.
1. AI inventing false information that is then used to build on their foundational knowledge.
2. There is a lot less problem solving for them once they are used to AI.
I think the field of education needs to take AI, or the current LLM chatbots, seriously and start asking and planning how to react to them. We have already witnessed Gen Z, raised in the era of Google, thinking they know everything, and if not, that they can just google it. They think "they know it all", only to be battered in the real world.
AI may make it even worse.
On one hand, every new technology that comes about unregulated creates a set of ethical and in this particular case, existential issues.
- What will happen to our jobs?
- Who is held accountable when a car navigation system designed by an LLM goes haywire and causes an accident?
- What will happen with education if we kill all entry level jobs and make technical skills redundant?
In a sense they're not new concerns in science, we research things to make life easier, but as technology advances, critical thinking takes a hit.
So yeah, I would say people are still right to be wary and 'bearish' about LLMs, as that's the normal behaviour for disruptive technology, and it will help us create adequate regulations to safeguard the future.
I do not understand how you can be bearish on LLMs. Data analysis, data entry, agents controlling browsers, browsing the web, doing marketing, doing much of customer support, writing BS React code for a promo that will be obsolete in 3 months anyway.
The possibilities are endless, and almost every week, there is a new breakthrough.
That being said, OpenAI has no moat, and there definitely is a bubble. I'm not bullish on AI stocks. I'm bullish on the tech.
[1] https://youtube.com/@SabineHossenfelder/featured
LLMs are the most impactful technology we've had since the internet; that is why people are bullish on them. Anyone who fails to see that probably cannot tie their own shoes without a "peer-reviewed" mechanism, lol.
Also, I like to think for myself. Writing code and thinking through what I am writing often exposes edge cases that I wouldn’t otherwise realize.
1) LLMs are a wondrous technology, capable of doing some really ingenious things
2) The hundreds of billions spent on them will not meet a positive ROI
Bonus 3) they are not good at everything & they are very bad at some things, but they are sold as good at everything.
Sure, it does middle-of-the-road stuff, but it comments the code well, I can manually tweak things at the various levels of granularity to guide it along, and the design doc is on par with something a senior principal would produce.
I do in a week what a team of four would take a month and a half to do. It's insane.
Sure, don't be bullish. I'm frantically piecing together enough hardware to run a decent sized LLM at home.
Her primary work interest is in the truth, not the statistically plausible.
Her point is that using an LLM to generate truth is pointless, and that people should stop advertising LLMs as "intelligent", since, to a scientist, being "intelligent" and being "dead wrong" are polar opposites.
Other use cases have feedback loops - it does not matter so much if Claude spits out wrong code, provided you have a compiler and automated tests.
Scientists _are_ acting as compilers to check truth. And they rely on truths compiled by other scientists, just like your programs rely on code written by other people.
What if I tell you that, from now on, any third-party library that you call will _statistically_ work 76% of the time, and I have no clue what it does in the remaining X%? (I don't know what X is, I haven't asked ChatGPT yet.)
In the meantime, I have yet to see a headline like "AI X discovered life-changing new Y on its own" (the closest thing I know of is AlphaFold, which I know is apparently "changing the world of scientists", and yet feel has "not changed the world of your average joe, so far" - emphasis on the "so far"); but I've already seen at least one headline about a dumb mistake made because an AI hallucinated.
I suppose we have to hope the trend will reverse at some point? Hope, on a Friday...
They try to add a new feature or change some behavior in a large existing codebase and it does something dumb and they write it off as a waste of time for that use case. And that's understandable. But if they had tweaked the prompt just a bit it actually might've done it flawlessly.
It requires patience and learning the best way to guide it and iterate with it when it does something silly.
Although you undoubtedly will lose some time re-attempting prompts and fixing mistakes and poor design choices, on net I believe the frontier models can currently make development much more productive in almost any codebase.
Thank you Sabine. Every time I have mentioned Gemini is the worst, and not even worth of consideration, I have been bombarded with downvotes, and told I am using it wrong.
One is the worker's view, looking at AI as a powerful tool that can multiply one's productivity. I think that is looking promising.
I don't really care for the chat bot to give me accurate sources. I care about an AI that can provide likely places to look for sources and I'll build the tool chain to lookup and verify the sources.
vs
"Bullish" The LLM is going to revolutionise human behaviour and thought and bring about a new golden age.
The former is justifiable; the latter is just reinforcing the bubble.
People are looking for perfect instead of better.
For those who have Twitter blocked.
From a sub-tweet:
>> no LLM should ever output a url that gives a 404 error. How hard can it be?
As a developer, I'm just imagining a server having to call up all the URLs to check that they still exist (and the extra costs/latency incurred there)... And if any URLs are missing, getting the AI to re-generate a different variant of the response, until you find one which does not contain the missing links.
And no, you can't do it from the client side either... It would just be confusing if you removed invalid URLs from the middle of the AI's sentence without re-generating the sentence.
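For what it's worth, the server-side version of that check isn't conceptually hard, just costly at scale: pull every URL out of the draft answer, probe each one, and regenerate if anything is dead. A rough sketch, where `generate()` is an assumed stand-in for the actual model call, not a real API:

    import re
    import urllib.request

    URL_RE = re.compile(r"https?://[^\s)\"'>]+")

    def url_is_live(url: str, timeout: float = 5.0) -> bool:
        """HEAD-request a URL and report whether it resolves without an error status."""
        req = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.status < 400
        except (OSError, ValueError):
            # Network errors, timeouts, HTTP errors, malformed URLs: treat as dead.
            return False

    def generate_with_live_links(generate, prompt: str, max_attempts: int = 3) -> str:
        """Regenerate until no dead links remain, or the attempt budget runs out.
        `generate` is a stand-in for your model call: prompt -> text."""
        text = generate(prompt)
        for _ in range(max_attempts):
            dead = [u for u in URL_RE.findall(text) if not url_is_live(u)]
            if not dead:
                return text
            # Ask the model to replace the broken links; the extra latency and cost
            # on every response is exactly the trade-off described above.
            text = generate(prompt + "\nAvoid these broken URLs: " + ", ".join(dead))
        return text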
You almost need to get the LLM to engineer/pre-process its own prompts in a way which guesses what the user is thinking in order to produce great responses...
Worse than that though... A fundamental problem of 'prompt engineering' is that people (especially non-tech people) often don't actually fully understand what they're asking. Contradictions in requirements are extremely common. When building software especially, people often have a vague idea of what they want... They strongly believe that they have a perfectly clear idea but once you scope out the feature in detail, mapping out complex UX interactions, they start to see all these necessary tradeoffs and limitations rise to the surface and suddenly they realize that they were asking for something they don't want.
It's hard to understand your own needs precisely; even harder to communicate them.
I use them almost daily in my job and get tremendous use out of them. I guess you could accuse me of lying, but what do I stand to gain from that?
I've also seen people claim that only people who don't know how to code, or people building super simple, done-a-million-times apps, can get value out of LLMs. I don't believe that applies to my situation, but even if it did, so what? I do real work for a real company delivering real value, and the LLM delivers value to me. It's really as simple as that.
If the 3-month-old, o3-based model is still the strongest one, that is proof that there were quite significant improvements in the last 2 years.
I can't name any other technology that improved as much in 2 years.
o1 and o1 pro helped me with filing tax returns and answered questions that (probably quite bad) tax accountants (and less smart models) weren't able to (of course I read the referenced laws; I don't trust the output either).
I hope she is aware of the limited context window and ability to retrieve older tokens from conversations.
I have used LLMs for the exact same purposes she has - summarizing chapters or whole books and finding the source of a quote - both with success.
I think the key to a successful output lies in the way you prompt it.
Hallucinations should be expected though; as we all hopefully know, LLMs are more of an autocomplete than an intelligence, and we should stick to that mindset.
I am hoping that the LLM approach will face increasingly diminished returns however. So I am biased toward Sabine's griping. I don't want LLM to go all the way to "AGI".
LLMs are a little bit magical but they are still a square peg. The fact they don't fit in a round hole is uninteresting. The interesting thing to debate is how useful they are at the things they are good at, not at the things they are bad at.
If people aren't linking the conversation, it's really hard to take the complaint seriously.
The keyword in title is "bullish". It's about the future.
Specifically I think it's about the potential of the transformer architecture & the idea that scaling is all that's needed to get to AGI (however you define AGI).
> Companies will keep pumping up LLMs until the day a newcomer puts forward a different type of AI model that will swiftly outperform them.
If I were to be cynical, I think we've seen over the last decade the descent of most of academia, humanities as much as natural sciences, to a rather poor state, drawing entirely on a self-contained loop of references without much use or interest. Especially in the natural sciences, one can today with little effort obtain an infinitely more insightful and, yes, accurate synthesis of the present state of a field from an LLM than 99% of popular science authors.
The best solution to hallucination and inaccuracy is to give the LLM mechanisms for looking up the information it lacks. Tools, MCP, RAG, etc are crucial for use cases where you are looking for factual responses.
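As a minimal sketch of what "give the model a lookup mechanism" can mean in practice: retrieve the most relevant snippets first, then ask the model to answer only from them. The `search_docs` and `generate` callables are assumptions standing in for your retrieval index and model API.

    def answer_with_retrieval(question: str, search_docs, generate, k: int = 5) -> str:
        """Toy retrieval-augmented answer: ground the model in looked-up text.
        `search_docs(query, k)` -> list of text snippets (your index / RAG store).
        `generate(prompt)` -> model completion (whatever API you use)."""
        snippets = search_docs(question, k)
        context = "\n\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
        prompt = (
            "Answer the question using ONLY the numbered sources below. "
            "Cite the source numbers you used, and say 'not found' if the "
            "sources do not contain the answer.\n\n"
            f"{context}\n\nQuestion: {question}\nAnswer:"
        )
        return generate(prompt)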
If you look at any company on earth, especially large ones, they all share the same line item as their biggest expense: labor. Any technology that can reduce that cost represents an overwhelmingly huge opportunity.
Also great for brainstorming and quick drafting grant proposals. Anything prototyping and quickly glued together I'll go for LLMs (or LLM agents). They are no substitute for your own brain though.
I'm also curious about the hallucinated sources. I've recently read some papers on using LLM-agents to conduct structured literature reviews, and they do it quite well and fairly reproducibly. I'm quite willing to build some LLM-agents to reproduce my literature review process in the near future, since it's fairly algorithmic: check for surveys and reviews on the topic, scan for interesting papers within, check sources of sources, go through A-tier conference proceedings for the last X years and find relevant papers. Rinse, repeat.
I'm mostly bullish because of LLM-agents, not because of using stock models with the default chat interface.
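That loop really is algorithmic enough to sketch out. Here is a rough outline of the process described above, where every helper (`find_surveys`, `is_relevant`, `extract_references`, `venue_proceedings`) is a hypothetical stand-in for a search tool or an LLM call, not a real API:

    def literature_review(topic, venues, years, find_surveys, extract_references,
                          is_relevant, venue_proceedings, max_depth=2):
        """Sketch of the survey-then-snowball review loop described above.
        All callables are hypothetical stand-ins (a paper search index, an LLM
        relevance judge, ...); papers are assumed to have an `.id` attribute."""
        seen, relevant = set(), []
        frontier = find_surveys(topic)                # step 1: surveys & reviews
        for _ in range(max_depth):                    # steps 2-3: snowball references
            next_frontier = []
            for paper in frontier:
                if paper.id in seen:
                    continue
                seen.add(paper.id)
                if is_relevant(paper, topic):         # e.g. an LLM relevance check
                    relevant.append(paper)
                    next_frontier.extend(extract_references(paper))
            frontier = next_frontier
        for venue in venues:                          # step 4: A-tier proceedings sweep
            for year in years:
                for paper in venue_proceedings(venue, year):
                    if paper.id not in seen and is_relevant(paper, topic):
                        seen.add(paper.id)
                        relevant.append(paper)
        return relevant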
Chat is such a “leaky” abstraction for LLMs
I think most people share the same negative experience because they only interact with LLMs through the chat UIs from OpenAI and Anthropic. The real magic moment for me was still autocompletion in GitHub Copilot.
When I evaluate them against areas where I possess professional expertise, I become convinced LLMs produce the Gell-Mann amnesia effect for any area I don't know.
We really should stop reinforcing our echo bubbles and learn from other people. And sometimes be cool in the face of criticism.
What is a good tutorial or training resource to learn about LLMs from scratch?
I guess that’s the main problem, even more so for non-developers and -tech people. The learning curve is too steep, and people don’t know where to start.
They are impressive for what they are… but do you know what they are? I do, and that’s why I’m not that hyped about them.
It is the same discussion as with autonomous driving: it causes far fewer accidents than humans, but look at these anecdotal accidents...
I've been looking at these recently but not specifically to solve "LLM issues".
Actually I think they work really well, in any case where you can detect errors, which the OP apparently can.
If you have tried to use LLMs and find them useless or without value, you should seriously consider learning how to correctly use them and doing more research. It is literally a skill issue and I promise you that you are in the process of being left behind. In the coming years Human + AI cooperatives are going to far surpass you in terms of efficiency and output. You are handicapping yourself by not becoming good at using them. These things can deliver massive value NOW. You are going to be a grumbling gray beard losing your job to 22 year old zoomers who spend 10 hours a day talking to LLMs.
the bull case very obviously speaks for itself!
It’s like a parrot and if you know what you’re doing you can catch tons of mistakes.
He's a person with money and he wants AI programmers. I bet there are millions like him.
Don't get me wrong though, I do believe in a future with LLMs. But I believe they will become more and more specialized for specific tasks. The more general an AI is, the more it's likely to fail.
They excel in spitballing more than accurate citations.
With that said, I find that they are very helpful for a lot of tasks, and improve my productivity in many ways. The types of things that I do are coding and a small amount of writing that is often opinion-based. I will admit that I am somewhat of a hacker, and more broad than deep. I find that LLMs tend to be good at extending my depth a little bit.
From what I can tell, Sabine Hossenfelder is an expert in physics, and I would guess that she already is pretty deep in the areas that she works in. LLMs are probably somewhat less useful at this type of deep, fact-based work, particularly because of the issue where LLMs don't have access to paywalled journal articles. They are also less likely to find something that she doesn't know (unlike with my use cases, where they are very likely to find things that I don't know).
What I have been hearing recently is that it will take a long time before LLMs are better than humans at everything. However, they are already better than many, many humans at a lot of things.
I have a very specific esoteric question like: "What material is both electrically conductive and good at blocking sound?" I could type this into google and sift through the titles and short descriptions of websites and eventually maybe find an answer, or I can put the question to the LLM and instantly get an answer that I can then research further to confirm.
This is significantly faster, more informative, more efficient, and a rewarding experience.
As others have said, it's a tool. A tool is only as good as how you use it. If you expect to build a house by yelling at your tools, I wouldn't be bullish either.
If that disappoints you to such a degree that you simply won't use them, you might find yourself in a position some years ahead - could be 1...could be 2...could be 5...could be 10 - who knows, but when the time comes, you might just be outdated and replaced yourself.
When you closely follow the incremental improvements of tech, you don't really fall for the same hype hysteria. If you on the other hand only look into it when big breakthroughs are made, you'll get caught in the hype and FOMO.
And even if you don't want to explicitly use the tools, at least try to keep some surface-level attention to the progress and improvements.
I honestly believe that there are many, many senior engineers / scientists out there that currently just scoff at these models, and view them as some sort of toy tech that is completely overblown and overhyped. They simply refuse to use the tools. They'll point to some specific time a LLM didn't deliver, roll their eyes, and call it useless.
Then when these tools progress, and finally meet their standards, they will panic and scramble to get into the loop. Meanwhile their non-tech bosses and executives will see the tech as some magic that can be used to reduce headcount.
Just today, I was thinking of making changes to my home theater audio setup and there are many ways to go about that, not to mention lots of competing products, so I asked ChatGPT for options and gave it a few requirements. I said I want 5.1 surround sound, I like the quality and simplicity of Sonos, but I want separate front left and right speakers instead of “virtual” speakers from a soundbar. I waited years thinking Sonos would add that ability, but they never did. I said I’d prefer to use the TV as the hub and do audio through eARC to minimize gaming latency and because the TV has enough inputs anyway, so I really don’t need a full blown AV receiver. Basically just a DAC/preamp that can handle HDMI eARC input and all of the channels.
It proceeded to tell me that audio-only eARC receivers that support surround sound don’t really exist as an off-the-shelf product. I thought, “What? That can’t be right, this seems like an obvious product. I can’t be the first one to have thought of this.” Turns out it was right, there are some stereo DAC/preamps that have an eARC input and I could maybe cobble together one as a DIY project, but nothing exactly like what I wanted. Interesting!
ChatGPT suggested that it’s probably because by the time a manufacturer fully implements eARC and all of the format decoding, they might as well just throw in a few video inputs for flexibility and mass-market appeal, plus one less SKU to deal with. And that kind of makes sense, though it adds excess buttons and bothers me from a complexity standpoint.
It then suggested WISA as a possible solution, which I had never heard of, and as a music producer I pay a lot of attention to speaker technology, so that was interesting to me. I’m generally pretty skeptical of wireless audio, as it’s rarely done well, and expensive when it is done well. But WISA seems like a genuine alternative to an AV receiver for someone who only wants it to do audio. I’m probably going to go with the more traditional approach, but it was fun learning about new tech in a brainstorming discussion. Google struggles with these sorts of broad research queries in my experience. I may or may not have found out about it if I had posted on Reddit, depending on whether someone knowledgeable happened to see my post. But the LLM is faster and knows quite a bit about many subjects.
I also can’t remember the last time it hallucinated when having a discussion like this. Whereas, when I ask it to write code, it still hallucinates and makes plenty of mistakes.
What worked:
- generated a mostly working PoC with minimal input and hallucinated the UI layout, color scheme, etc. This is amazing because it did not bombard me with detailed questions. It just carried on and provided me with a baseline that I could then fine-tune
- it corrected build issues when I simply copy-pasted the errors from Xcode
- got APIs working
- added debug code when it could not fix an issue after a few rounds
- resolved an API issue after I pointed it to a TypeScript SDK for the API (I literally gave it a link to the file and told it to try to use it to work out where the problem was)
- it produces code very fast
What is not working great yet:
- it started off with one large file and crashed soon after because it hit a timeout when regenerating the file. I needed to ask it to split the file up into a typical project structure
- some logic I asked it to implement explicitly got changed at some point during an unrelated task. To prevent this in future I asked it to mark this part of the code as important and only change it on explicit request. I don't know yet how long this code will stay protected
- by the time enough context had built up, usage warnings popped up in Claude
- only so many files are supported atm
So my takeaway is that it is very good at translating, i.e., API docs into code, errors into fixes. There is also a fine line between providing enough context and running out of tokens.
I am planning to continue my project to see how far I can push it. As I am getting close to the limit of the token size now, I am thinking of structuring my app in a Claude friendly way:
- clear internal APIs. Kind of like header files so that I can tell Claude what functions it can use without allowing it to change them or needing to tokenize the full source code
- adversarial testing. I don’t have tests yet, but I am thinking of asking one dedicated instance of Claude to generate tests. I will use other Claude instances for coding and provide them with failing test outputs like I do now with build errors. I hope it will fix itself similarly.
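A rough sketch of what that test-and-fix loop could look like; `ask_tester`, `ask_coder`, and `run_tests` are assumptions for two separate model sessions and a local test runner, not any specific Claude API:

    def adversarial_fix_loop(source: str, ask_tester, ask_coder, run_tests,
                             max_rounds: int = 5):
        """One instance generates tests, another repairs the code until they pass.
        ask_tester(source) -> test file text
        ask_coder(source, tests, failures) -> revised source
        run_tests(source, tests) -> (passed: bool, failure_output: str)"""
        tests = ask_tester(source)        # dedicated "tester" session writes the tests
        for _ in range(max_rounds):
            passed, failures = run_tests(source, tests)
            if passed:
                return source, tests
            # Feed the failing output back, like copy-pasting build errors today.
            source = ask_coder(source, tests, failures)
        return source, tests              # best effort once the budget is spent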
https://euromaidanpress.com/2025/03/27/russian-propaganda-ne...
I will cite myself as Exhibit A. I am the sort of person who takes almost nothing at face value. To me, physiotherapy, and oenology, and musicology, and bed marketing, and mineral-water benefits, and very many other such things, are all obviously pseudoscience, worthy of no more attention than horoscopes. If I saw a ghost I would assume it was a hallucination caused by something I ate.
So it seems like no coincidence that I reflexively ignore the AI babble at the top of search results. After all, an LLM is a language-rehashing machine which (as we all know by now) does not understand facts. That's terribly relevant.
I remember reading, a couple of years back, about some Very Serious Person (i.e. a credible voice, I believe some kind of scientist) who, after a three-hour conversation with ChatGPT, had become convinced that the thing was conscious. Rarely have I rolled my eyes so hard. It occurred to me then that skepticism must be (even) less common a mindset than I assumed.
Maybe that's why?
Also, I find it disingenuous that apologists keep stating things close to "you are using it wrong", when LLM-based AI is advertised as something that should be trusted more and more (because it's more accurate, based on some arbitrary metrics) and might save some time (on some undescribed task).
Of course, in that use case most would say to use your judgement to verify whatever is generated, but for the generation that is using LLM-based AI as a source of knowledge (like some people use Wikipedia or Stack Overflow as a source of truth), it will be difficult to verify, when all they have ever known is LLM-generated content as their source of knowledge.
I hope that realization happens before "vibe coding" is accepted as standard practice by software teams (especially when you consider the poor quality of software before the LLM era). If not, it's only a matter of time before we refer to the internet as "something we used to enjoy."
The article makes a lot of good points. I get a lot of slop responses to both coding and non-coding prompts, but I've also gotten some really really good responses, especially code completion from Copilot. Even today, ChatGPT saved me a ton of Google searches.
I'm going to continue using it and taking every response with a grain of salt. It can only get better and better.
Wait until the investors want their returns
uhm, I dismiss this statement here? if you call 4o the best, that means you haven't genuinely explored other models before making such claims...
saying "why are people bullish" only to continue with bullying does not add any clarity to this world
I just hope they keep feeling that way and avoid LLMs. Less competition for those of us who are using them to make our jobs/lives easier every day.
...that was the LLM responding, and it did not set an alarm.