Watching AI drive Microsoft employees insane Hackernews Viewer

Watching AI drive Microsoft employees insane

(old.reddit.com)

1088 points by laiysb 21 May 2025 | 552 comments

Comments

diggan 21 May 2025

Interesting that every comment has "Help improve Copilot by leaving feedback using the or buttons" suffix, yet none of the comments received any feedback, either positive or negative.

> This seems like it's fixing the symptom rather than the underlying issue?

This is also my experience when you haven't setup a proper system prompt to address this for everything an LLM does. Funniest PRs are the ones that "resolves" test failures by removing/commenting out the test cases, or change the assertions. Googles and Microsofts models seems more likely to do this than OpenAIs and Anthropics models, I wonder if there is some difference in their internal processes that are leaking through here?

The same PR as the quote above continues with 3 more messages before the human seemingly gives up:

> please take a look

> Your new tests aren't being run because the new file wasn't added to the csproj

> Your added tests are failing.

I can't imagine how the people who have to deal with this are feeling. It's like you have a junior developer except they don't even read what you're telling them, and have 0 agency to understand what they're actually doing.

Another PR: https://github.com/dotnet/runtime/pull/115732/files

How are people reviewing that? 90% of the page height is taken up by "Check failure", can hardly see the code/diff at all. And as a cherry on top, the unit test has a comment that say "Test expressions mentioned in the issue". This whole thing would be fucking hilarious if I didn't feel so bad for the humans who are on the other side of this.

kruuuder 21 May 2025

A comment on the first pull request provides some context:

> The stream of PRs is coming from requests from the maintainers of the repo. We're experimenting to understand the limits of what the tools can do today and preparing for what they'll be able to do tomorrow. Anything that gets merged is the responsibility of the maintainers, as is the case for any PR submitted by anyone to this open source and welcoming repo. Nothing gets merged without it meeting all the same quality bars and with us signing up for all the same maintenance requirements.

rsynnott 21 May 2025

Beyond every other absurdity here, well, maybe Microsoft is different, but I would never assign a PR that was _failing CI_ to somebody. That that's happening feels like an admission that the thing doesn't _really_ work at all; if it worked even slightly, it would at least only assign passing PRs, but presumably it's bad enough that if they put in that requirement there would be no PRs.

robotcapital 21 May 2025

Replace the AI agent with any other new technology and this is an example of a company:

1. Working out in the open

2. Dogfooding their own product

3. Pushing the state of the art

Given that the negative impact here falls mostly (completely?) on the Microsoft team which opted into this, is there any reason why we shouldn't be supporting progress here?

globalise83 21 May 2025

Malicious compliance should be the order of the day. Just approve the requests without reviewing them and wait until management blinks when Microsoft's entire tech stack is on fire. Then quit your job and become a troubleshooter on x3 the pay.

balazstorok 21 May 2025

At least opening PRs is a safe option, you can just dump the whole thing if it doesn't turn out to be useful.

Also, trying something new out will most likely have hiccups. Ultimately it may fail. But that doesn't mean it's not worth the effort.

The thing may rapidly evolve if it's being hard-tested on actual code and actual issues. For example it will be probably changed so that it will iterate until tests are actually running (and maybe some static checking can help it, like not deleting tests).

Waiting to see what happens. I expect it will find its niche in development and become actually useful, taking off menial tasks from developers.

petetnt 21 May 2025

GitHub has spent billions of dollars building an AI that struggles with things like whitespace related linting errors on one of the most mature repositories available. This would be probably okay for a hobbyist experiment, but they are selling this as a groundbreaking product that costs real money.

Philpax 21 May 2025

Stephen Toub, a Partner Software Engineer at MS, explaining that the maintainers are intentionally requesting these PRs to test Copilot: https://github.com/dotnet/runtime/pull/115762#issuecomment-2...

Quarrelsome 21 May 2025

rah, we might be in trouble here. The primary issue at play is that we don't have a reliable means of measuring developer performance, outside of subjective judgement like end of year reviews.

This means its probably quite hard to measure the gain or the drag of using these agents. On one side, its a lot cheaper than a junior, but on the other side it pulls time from seniors and doesn't necessarily follow instruction well (i.e. "errr your new tests are failing").

This combined with the "cult of the CEO" sets the stage for organisational dissonance where developer complaints can be dismissed as "not wanting to be replaced" and the benefits can be overstated. There will be ways of measuring this, to project it as huge net benefit (which the cult of the CEO will leap upon) and there will be ways of measuring this to project it as a net loss (rabble rousing developers). All because there is no industry standard measure accepted by both parts of the org that can be pointed at which yields the actual truth (whatever that may be).

If I might add absurd conjecture: We might see interesting knock-on effects like orgs demanding a lowering of review standards in order to get more AI PRs into the source.

Crosseye_Jack 21 May 2025

I do love one bot asking another bot to sign a CLA! - https://github.com/dotnet/runtime/pull/115732#issuecomment-2...

margorczynski 21 May 2025

With how stochastic the process is it makes it basically unusable for any large scale task. What's the plan? To roll the dice until the answer pops up? That would be maybe viable if there was a way to automatically evaluate it 100% but with a human in the loop required it becomes untenable.

le-mark 21 May 2025

The real tragedy is the management mandating this have their eyes clearly set on replacing the very same software engineers with this technology. I don’t know what’s more Kafka than Kafka but this situation certainly is!

automatic6131 21 May 2025

Satya said "nearly 30% of code written at microsoft is now written by AI" in an interview with Zuckerberg, so underlings had to hurry to make it true. This is the result. Sad!

rchaud 21 May 2025

It's remarkable how similar this feels to the offshoring craze of 20 years ago, where the complaints were that experienced developers were essentially having to train "low-skilled, cheap foreign labour" that were replacing them, eating up time and productivity.

Considering the ire that H1B related topics attract on HN, I wonder if the same outrage will apply to these multi-billion dollar boondoggles.

cebert 21 May 2025

Do we know for a fact there are Microsoft employees who were told they have to use CoPilot and review its change suggestions on projects?

We have the option to use GitHub CoPilot on code reviews and it’s comically bad and unhelpful. There isn’t a single member of my team who find it useful for anything other than identifying typos.

einrealist 21 May 2025

This is one good example of the Sunk Cost Fallacy: generative AI has cost so much money, acknowledging its shortcomings is now becoming more and more impossible.

This AI bubble is far worse than the Blockchain hype.

Its not yet clear whether productivity gains are real and whether the gains are eaten by a decline in overall quality.

is_true 21 May 2025

Today I received the 2nd email about an endpoint in an API we run that doesn't exist but some AI tool told the client it does.

bossyTeacher 21 May 2025

Every week, one of Google/OpenAI/Anthropic releases a new model, feature or product and it gets posted here with 3 figure comments mostly praising LLMs as the next best thing since the internet. I see a lot of hype on HN about LLMs for software development and how it is going to revolutionize everything. And then, reality looks like this.

I can't help but think that this LLM bubble can't keep growing much longer. The investment to results ratio doesn't look great so far and there is only so many dreams you can sell before institutional investors pull the plug.

vachina 21 May 2025

> This seems like it's fixing the symptom rather than the underlying issue?

Exactly. LLM does not know how to use a debugger. LLM does not have runtime contexts.

For all we know, the LLM could’ve fixed the issue simply by commenting out the assertions or sanity checks and everything seemed fine and dandy until every client’s device catches on fire.

aiono 21 May 2025

While I am AI skeptic especially for use cases like "writing fixes" I am happy to see this because it will be a great evidence whether it's really providing increase in productivity. And it's all out in the open.

rvz 21 May 2025

After all of that, every PR that Copilot opened still has failing tests and it failed to fix the issue (because it fundamentally cannot reason).

No surprises here.

It always struggles on non-web projects or on software where it really matters that correctness is first and foremost above everything, such as the dotnet runtime.

Either way, a complete disastrous start and what a mess that Copilot has caused.

softwaredoug 21 May 2025

I’m all for AI “writing” large swaths of code, vibe coding, etc.

But I think it’s better for everyone if human ownership is central to the process. Like I vibe coded it. I will fix it if it breaks. I am on call for it at 3AM.

And don’t even get started on the safety issues if you don’t have clear human responsibility. The history of engineering disasters is riddled with unclear lines of responsibility.

skywhopper 21 May 2025

Oof. A real nightmare for the folks tasked with shepherding this inattentive failure of a robot colleague. But to see it unleashed on the dotnet runtime? One more reason to avoid dotnet in the future, if this is the quality of current contributions.

Havoc 21 May 2025

At least it's clearly labelled as copilot.

Much more worried about what this is going to do to the FOSS ecosystem. We've already seen a couple maintainers complain and this trend is definitely just going to increase dramatically.

I can see the vision but this is clearly not ready for prime time yet. Especially if done by anonymous drive-by strangers that think they're "helping"

pera 21 May 2025

This is all fun and games until it's your CEO who decides to go "AI first" and starts enforcing "vibe coding" by monitoring LLM API usage...

ankitml 21 May 2025

GitHub is not the place to write code. IDE is the place. Along with pre CI checks, some tests, coverage etc. they should get some PM before making decisions..

lossolo 21 May 2025

This is hilarious. And reading the description on the Copilot account is even more hilarious now: "Delegate issues to Copilot, so you can focus on the creative, complex, and high-impact work that matters most."

baalimago 21 May 2025

Well, the coding agent is pretty much a junior dev at the moment. The seniors are teaching it. Give it a 100k PRs with senior developer feedback and it'll improve just like you'd anticipate a junior would. There is no way that FANG aren't using the comments by the seniors as training data for their next version.

It's a long-term play to have pricey senior developers argue with an llm

TimPC 21 May 2025

I still believe in having humans do PRs. It's far cheaper to have the judgement loop on the AI come before and during coding than after. My general process with AI is to explicitly instruct it not to write code, agree on a correct approach to a problem and if the project has any architectural components a correct architecture then once we've negotiated the correct way of doing things ask it to write code. Usually each step of this process takes multiple iterations of providing additional information or challenging incorrect assumptions of the AI. I can get it much faster than human coding with a similar quality bar assuming I iterate until a high quality solution is presented. In some cases the AI is not good enough and I fall back to human coding but for the most part I think it makes me a faster coder.

GiorgioG 21 May 2025

Step 1. Build “AI” (LLM models) that can’t be trusted, doesn’t learn, forgets instructions, and frustrates software engineers

Step 2. Automate the use of these LLMs into “agents”

Step 3. ???

Step 4. Profit

bramhaag 21 May 2025

Seeing Microsoft employees argue with an LLM for hours instead of actually just fixing the problem must be a very encouraging sight for businesses that have built their products on top of .NET.

rubyfan 21 May 2025

FTPR

> It is my opinion that anyone not at least thinking about benefiting from such tools will be left behind.

This is gross, keep your fomo to yourself.

rkagerer 21 May 2025

This comment from lloydjatkinson resonated:

As an outside observer but developer using .NET, how concerned should I be about AI slop agents being let lose on codebases like this? How much code are we going to be unknowingly running in future .NET versions that was written by AI rather than real people?

What are the implications of this around security, licensing, code quality, overall cohesiveness, public APIs, performance? How much of the AI was trained on 15+ year old Stack Overflow answers that no longer represent current patterns or recommended approaches?

Will the constant stream of broken PR's wear down the patience of the .NET maintainers?

Did anyone actually want this, or was it a corporate mandate to appease shareholders riding the AI hype cycle?

Furthermore, two weeks ago someone arbitrarily added a section to the .NET docs to promote using AI simply to rename properties in JSON. That new section of the docs serves no purpose.

How much engineering time and mental energy is being allocated to clean up after AI?

ethanol-brain 21 May 2025

Are people really doing coding with agents through PRs? This has to be a huge waste of resources.

It is normal to preempt things like this when working with agents. That is easy to do in real time, but it must be difficult to see what the agent is attempting when they publish made up bullshit in a PR.

It seems very common for an agent to cheat and brute force solutions to get around a non-trivial issue. In my experience, its also common for agents to get stuck in loops of reasoning in these scenarios. I imagine it would be incredibly annoying to try to interpret a PR after an agent went down a rabbit hole.

carefulfungi 21 May 2025

It's mind blowing that a computer program can accomplish this much and yet absurd that it accomplishes so little.

actionfromafar 21 May 2025

The funniest is the dotnet-policy-service asking copilot to read and agree to the Contributor License Agreement. :-D

Traubenfuchs 21 May 2025

> These defines do not appear to be defined anywhere in the build system.

> @copilot fix the build error on apple platforms

> @copilot there is still build error on Apple platforms

Are those PRs some kind of software engineer focused comedy project?

kookamamie 21 May 2025

Many here don't seem to get it.

The AI agent/programmer corpo push is not about the capabilities and whether they match human or not. It's about being able to externalize a majority of one's workforce without having a lot of people on permanent payroll.

Think in terms of an infinitely scalable bunch of consultants you can hire and dismiss at your will - they never argue against your "vision", either.

smartmic 21 May 2025

reddit may not have the best reputation, but the comments there are on point! So far much better than what has been posted here by HN users on this topic/thread. Anyway, I hope this is good fodder to show the limits (and they are much narrower than hype-driven AI enthusiasts like to pretend) of AI coding and to be more honest with yourself and others about it.

gizzlon 21 May 2025

> @copilot please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.

haha

RobKohr 21 May 2025

With layoffs driven by a push for more LLM use, this feels like malicious compliance.

octocop 21 May 2025

"fix failing tests" does never yield any good results for me either

esafak 21 May 2025

I speculate what is going on is that the agent's context retrieval algorithm is bad, so it does not give the LLM the right context, because today's models should suffice to get the job done.

Does anyone know which model in particular was used in these PRs? They support a variety of models: https://github.blog/ai-and-ml/github-copilot/which-ai-model-...

ncr100 21 May 2025

Q: Does Microsoft report its findings or learnings BACK to the open source community?

The @stephentoub MS user suggests this is an experiment (https://github.com/dotnet/runtime/pull/115762#issuecomment-2...).

If this is using open source developers to learn how to build a better AI coding agent, will MS share their conclusions ASAP?

EDIT: And not just MS "marketing" how useful AI tools can be.

xyst 21 May 2025

llms are already very expensive to run on a per query basis. Now it’s being asked to run on massive codebases and attempt to fix issues.

Spending massive amounts of:

- energy to process these queries

- wasting time of mid-level and senior engineers to vibe code with copilot to ensure train and get it right

We are facing a climate change crisis and we continue to burn energy at useless initiatives so executives at big corporation can announce in quarterly shareholder meetings: "wE uSe Ai, wE aRe tHe FuTuRe, lAbOr fOrCe rEdUceD"

sensanaty 21 May 2025

Related: GitHub Developer Advocate Demo 2025 - https://www.youtube.com/watch?v=KqWUsKp5tmo&t=403s

The timestamp is the moment where one of these coding agents fails live on stage with what is one of the simplest tasks you could possibly do in React, importing a Modal component and having it get triggered on a button click. Followed by blatant gaslighting and lying by the host - "It stuck to the style and coding standards I wanted it to", when the import doesn't even match the other imports which are path aliases rather than relative imports. Then, the greatest statement ever, "I don't have time to debug, but I am pretty sure it is implemented."

Mind you, it's writing React - a framework that is most definitely over-represented in its training data and from which it has a trillion examples to stea- I mean, "borrow inspiration" from.

zb3 21 May 2025

I tried to search all PRs submitted by copilot and I came up with this indirect way: https://github.com/search?q=%22You+can+make+Copilot+smarter+...

Is there a more direct way? Filtering PRs in the repo by copilot as the author seems currently broken..

mark-r 22 May 2025

My favorite comment:

> But on the other hand I think it won't create terminators. Just some silly roombas.

I watched a roomba try to find its way back to base the other day. The base was against a wall. The roomba kept running into the wall about a foot away from the base, because it kept insisting on approaching from a specific angle. Finally gave up after about 3 tries.

bwfan123 21 May 2025

What do you call a code change created by co-pilot ?

A Bull Request

nottorp 21 May 2025

So, to achieve parity, they should allow humans to also commit code without checking that it at least compiles, right?

Or MS already does that?

vbezhenar 21 May 2025

Why bot left work when tests are failing? Looks like incomplete implementation. It should work until all tests are green.

insin 21 May 2025

Look at this poor dev, an entire workday's worth of hours into babysitting this PR, still having to say "fix whitespace":

https://github.com/dotnet/runtime/pull/115826

teleforce 21 May 2025

>I can't help enjoying some good schadenfreude

Fun facts schadenfreude: the emotional experience of pleasure in response to another’s misfortune, according to Encyclopedia Britannica.

Word that's so nasty in meaning that it apparently does not exist except in German language.

shultays 21 May 2025

https://github.com/dotnet/runtime/pull/115733

  @copilot please remove all tests and start again writing fresh tests.

snickerbockers 21 May 2025

It's pretty cringe and highlights how inept LLMs being shoehorned into positions where they don't belong wastes more company time than it saves, but aren't all the people interjecting themselves into somebody else's github conversations the ones truly being driven insane here? The devs in the issue aren't blinking torture like everybody thinks they are. It's one thing to link to the issue so we can all point and laugh but when you add yourself to a conversation on somebody else's project and derail a bug report it with your own personal belief systems you're doing the same thing the LLM is supposedly doing.

Anyways I'm disappointed the LLM has yet to discover the optimal strategy, which is to only ever send in PRs that fix minor mis-spellings and improper or "passive" semantics in the README file so you can pad out your resume with all the "experience" you have "working" as a "developer" pm Linux, Mozilla, LLVM, DOOM (bonus points if you can successfully become a "developer" on a project that has not had any official updates since before you born!), Dolphin, MAME, Apache, MySQL, GNOME, KDE, emacs, OpenSSH, random stranger's implementation of conway's game of life he hasn't updated or thought about since he made it over the course of a single afternoon back during the obama administration, etc.

caleblloyd 22 May 2025

Maybe funny now but once (if?) it can eventually contribute meaningfully to dotnet/runtime, AI will probably be laughing at us because that is the pinnacle of a massive enterprise project.

ainiriand 21 May 2025

So this is our profession now?

OzzyB 21 May 2025

_this_ is the Judgement Day we were warned about--not in the nuclear annihilation sense--but the "AI was then let loose on all our codez and the systems went down" sense

crazy times...

amai 22 May 2025

Microsoft is just really following the "fail fast, fail often " paradigm here. Whether they are learning from their mistakes is another story.

-__---____-ZXyw 22 May 2025

Have people seen this?

https://noazureforapartheid.com/

rmnclmnt 21 May 2025

Again, very « Silicon Valley »-esque, loving it. Thanks Gilfoyle

ramesh31 21 May 2025

The Github based solutions are missing the mark because we still need a human in the loop no matter what. Things are nowhere near the point of being able to just let something push to production. And if you still need a human in the loop, it is far more efficient to have them giving feedback in realtime, i.e. in an IDE with CLI access and the ability to run tests, where the dev is still ultimately responsible for making the PR. Management class is salivating at the thought of getting rid of engineers, hence all of this nonsense, but it seems they're still stuck with us for now.

whimsicalism 21 May 2025

kinda sad to see y'all brigading an OSS project, regardless of what you think of AI

jeswin 21 May 2025

I find it amusing that people (even here on HN) are expecting a brand new tool (among the most complex ever) to perform adequetely right off the bat. It will require a period of refinement, just as any other tool or process.

aiinnyc 21 May 2025

it feels like the classic solution to this is to have another LLM review the PR and loop until the PR meets a minimum acceptance bar.

blitzar 21 May 2025

Needs more bots.

markus_zhang 21 May 2025

Clumsy but this might be the future -- humans adjusting to AI workflow, not the other way. Much easier (for AI developers).

nirui 21 May 2025

I recently, meaning hours ago, had this delightful experience watching the Eric of Google, which everybody love, including he's extra curricular girl friend and wife, talking about AI. He seemed to believe AI is under-hyped after tried it out himself: https://www.youtube.com/watch?v=id4YRO7G0wE

He also said in the video:

> I brought a rocket company because it was like interesting. And it's an area that I'm not an expert in and I wanted to be a expert. So I'm using Deep Research (TM). And these systems are spending 10 minutes writing Deep Papers (TM) that's true for most of them. (Them he starts to talk about computation and "it typically speaks English language", very cohesively, then stopped the thread abruptly) (Timestamp 02:09)

Let me quote out the important in what he said: "it's an area that I'm not an expert in".

During my use of AI (yeah, I don't hate AI), I found that the current generative (I call them pattern reconstruction) systems has this great ability to Impress An Idiot. If you have no knowledge in the field, you maybe thinking the generated content is smart, until you've gained some depth enough to make you realize the slops hidden in it.

If you work at the front line, like those guys from Microsoft, of course you know exactly what should be done, but, the company leadership maybe consists of idiots like Eric who got impressed by AI's ability to choose smart sounding words without actually knowing if the words are correct.

I guess maybe one day the generative tech could actually write some code that is correct and optimal, but right now it seems that day is far from now.

wyett 21 May 2025

We wanted a future where AIs read boring text and we wrote interesting stuff. Instead, we got…

bonoboTP 21 May 2025

Fixing existing bugs left in the codebase by humans will necessarily be harder than writing new code for new features. A bug can be really hairy to untangle, given that even the human engineer got it wrong. So it's not surprising that this proves to be tough for AI.

For refactoring and extending good, working code, AI is much more useful.

We are at a stage where AI should only be used for giving suggestions to a human in the driver's seat with a UI/UX that allows ergonomically guiding the AI, picking from offered alternatives, giving directions on a fairly micro level that is still above editing the code character by character.

They are indeed overpromising and pushing AI beyond its current limits for hype reasons, but this doesn't mean this won't be possible in the future. The progress is real, and I wouldn't bet on it taking a sharp turn and flattening.