Wasting Inferences with Aider (worksonmymachine.substack.com)
denidoman 1 hour ago [-]
The current challenge is not to create a patch, but to verify it.

Testing a fix in a big application is a very complex task. First of all, you have to reproduce the issue to verify the steps (or create them, because many issues don't contain a clear description). Then you should switch to the fixed version and make sure that the issue no longer exists. Finally, you should do a little exploratory testing to make sure that the fix hasn't corrupted neighbouring logic (deep application knowledge is required for this).

To perform these steps you have to deploy staging environments with the original and fixed versions, or run everything locally and do the pre-setup (create users, entities, etc. to reach the corrupted state).

This is a very challenging area for current agents. Right now they simply can't do these steps - their mental models just aren't ready for that level of integration into the app and the infra. And creating 3/5/10/100 unverified pull requests just slows down the software development process.
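
To make the loop concrete, here is a minimal sketch of what "reproduce, verify the fix, poke the neighbours" can look like as a single regression test (pytest-style; the app functions and the bug are hypothetical, not from the article):

    # Hypothetical regression test: reproduce the reported issue first, then use
    # the same test to verify the fix. `create_task` / `add_tag` stand in for
    # whatever part of the app the ticket actually touches.
    from todo_app import create_task, add_tag  # assumed app under test

    def test_tags_are_attached_to_task():
        # 1. Reproduce: build the state described in the ticket.
        task = create_task("write report")
        add_tag(task, "work")

        # 2. Verify: this assertion fails on the broken version and passes
        #    once the fix is applied -- that is the actual verification signal.
        assert "work" in task.tags

        # 3. Exploratory check: make sure neighbouring logic still behaves.
        add_tag(task, "work")                 # adding a duplicate tag...
        assert task.tags.count("work") == 1   # ...should not create duplicates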

fxtentacle 2 hours ago [-]
For me, a team of junior developers that refuse to learn from their mistakes is the fuel of nightmares. I'm stuck in a loop where every day I need to explain to a new hire why they made the exact same beginner's mistake as the last person did the day before. Eventually, I'd rather spend half an hour of my own time than explain the problem once more...

Why anyone thinks having 3 different PRs for each Jira ticket might boost productivity is beyond me.

Related anime: I May Be a Guild Receptionist, But I'll Solo Any Boss to Clock Out on Time

noodletheworld 2 hours ago [-]
It may not be as stupid as it sounds.

Randomising LLM outputs (temperature) results in outputs that will always have some degree of hallucination.

That’s just math. You can’t mix a random factor in and magically expect it to not exist. There will always be p(generates random crap) > 0.

However, in any probabilistic system, you can run a function k times and you’ll get an output distribution that is meaningful if k is high enough.

3 is not high enough.

At 3, this is stupid; all you’re observing is random variance.

…but, in general, running the same prompt multiple times and taking some kind of general solution from the distribution isn’t totally meaningless, I guess.

The thing with LLMs is they scale in a way that actually allows this to be possible, in a way that scaling with humans can’t.

… like the monkeys and Shakespeare, there's probably a limit to the value it can offer; but it's not totally meaningless to try it.
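
For what it's worth, a rough sketch of the "run it k times and take something from the distribution" idea, assuming an OpenAI-style client; the model name and the scoring function are placeholders, not anything from the article:

    # Hypothetical best-of-k sampling: draw k completions at some temperature,
    # score each with a cheap automatic check, keep the best one.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def score(candidate: str) -> int:
        """Placeholder: e.g. does the patch apply cleanly, do the tests pass?"""
        return len(candidate)  # stand-in metric for illustration only

    def best_of_k(prompt: str, k: int = 10) -> str:
        candidates = []
        for _ in range(k):
            resp = client.chat.completions.create(
                model="gpt-4o",      # any capable model
                temperature=0.8,     # the randomness discussed above
                messages=[{"role": "user", "content": prompt}],
            )
            candidates.append(resp.choices[0].message.content)
        # With k large enough, the max over the distribution is meaningful;
        # at k = 3 you are mostly observing variance.
        return max(candidates, key=score)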

horsawlarway 1 hour ago [-]
I think this is an interesting idea, but I also somewhat suspect you've replaced a tedious problem with a harder, more tedious problem.

Take your idea further. Now I've got 100 agents, and 100 PRs, and some small percentage of them are decent. The task went from "implement a feature" to "review 100 PRs and select the best one".

Even assuming you can ditch 50 percent right off the bat as trash... Reviewing 50 potentially buggy implementations of a feature and selecting the best genuinely sounds worse than just writing the solution.

Worse... If you haven't solved the problem before anyways, you're woefully unqualified as a reviewer.

bbatchelder 16 minutes ago [-]
Even with human junior devs, ideally you'd maintain some documentation about common mistakes/gotchas so that when you onboard new people to the team they can read that instead of you having to hold their hand manually.

You can do the same thing for LLMs by keeping a file with those details available and included in their context.

You can even set up evaluation loops so that entries can be made by other agents.
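
A minimal sketch of that pattern, assuming a plain-text "lessons learned" file that gets prepended to every coding prompt (the file path and helper names are made up):

    # Hypothetical: keep a running "common mistakes / conventions" file and
    # prepend it to every LLM coding task so the agent sees past lessons.
    from pathlib import Path

    LESSONS_FILE = Path("docs/llm_conventions.md")  # assumed location

    def build_prompt(task_description: str) -> str:
        lessons = LESSONS_FILE.read_text() if LESSONS_FILE.exists() else ""
        return (
            "Project conventions and past mistakes to avoid:\n"
            f"{lessons}\n"
            f"Task:\n{task_description}"
        )

    def record_lesson(note: str) -> None:
        # Called by a human reviewer, or by another agent in an evaluation
        # loop, whenever the model repeats a mistake worth remembering.
        with LESSONS_FILE.open("a") as f:
            f.write(f"- {note}\n")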

simonw 2 hours ago [-]
One of the (many) differences between junior developers and LLM assistance is that humans can learn from their mistakes, whereas with LLMs it's up to you as the prompter to learn from their mistakes.

If an LLM screws something up you can often adjust their prompt to avoid that particular problem in the future.

nico 1 hour ago [-]
> For me, a team of junior developers that refuse to learn from their mistakes is the fuel of nightmares. I'm stuck in a loop where every day I need to explain to a new hire why they made the exact same

This is a huge opportunity, maybe the next big breakthrough in AI when someone figures out how to solve it

Instead of having a model that knows everything, have a model that can learn on the go from the feedback it gets from the user

Ideally a local model too. So something that runs on my computer that I train with my own feedback so that it gets better at the tasks I need it to perform

You could also have one at team level, a model that learns from the whole team to perform the tasks the team needs it to perform

freeone3000 26 minutes ago [-]
Continual feedback means continual training. No way around it. So you'd have to scope down the functional unit to a fairly small LoRA in order to get reasonable re-training costs here.
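
For reference, a minimal sketch of what a small adapter like that looks like with the Hugging Face peft library; the base model, rank, and target modules here are illustrative assumptions:

    # Hypothetical small-LoRA setup: only the adapter weights get retrained on
    # user feedback, which is what keeps continual-training costs reasonable.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # example base model
    config = LoraConfig(
        r=8,                                   # small rank => few trainable params
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],   # attention projections only
    )
    model = get_peft_model(base, config)
    model.print_trainable_parameters()         # typically well under 1% of the base model
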
nico 10 minutes ago [-]
Or maybe figure out a different architecture

Either way, the end user experience would be vastly improved

abc-1 2 hours ago [-]
Darn I wonder if systems could be modified so that common mistakes become less common or if documentation could be written once and read multiple times by different people.
danielbln 58 minutes ago [-]
We feed it conventions that are automatically loaded for every LLM task, so that the LLM adheres to coding style, comment style, common project tooling and architecture, etc.

These systems don't do online learning, but that doesn't mean you can't spoon-feed them what they should know and mutate that knowledge over time.

tekacs 1 hour ago [-]
Over the last two days, I've built out support for autonomy in Aider (a lot like Claude Code) that hybridizes with the rest of the app:

https://github.com/Aider-AI/aider/pull/3781

Edit: In case anyone wants to try it, I uploaded it to PyPI as `navigator-mode`, until (and if!) the PR is accepted. By "I", I mean that it uploaded itself. You can see the session where it did that here: https://asciinema.org/a/9JtT7DKIRrtpylhUts0lr3EfY

Edit 2: And as a Show HN, too: https://news.ycombinator.com/item?id=43674180

and, because Aider's already an amazing platform without the autonomy, it's very easy to use the rest of Aider's options, like using `/ask` first, using `/code` or `/architect` for specific tasks [1], but if you start in `/navigator` mode (which I built, here), you can just... ask for a particular task to be done and... wait and it'll often 'just get done'.

It's... decidedly expensive to run an LLM this way right now (Gemini 2.5 Pro is your best bet), but if it's $N today, I don't doubt that it'll be $0.N by next year.

I don't mean to speak in meaningless hype, but I think that a lot of folks who are speaking to LLMs' 'inability' to do things are also spending relatively cautiously on them, when tomorrow's capabilities are often here, just pricey.

I'm definitely still intervening as it goes (as in the Devin demos, say), but I'm also having LLMs relatively autonomously build out large swathes of functionality, the kind that I would put off or avoid without them. I wouldn't call it a programmer-replacement any time soon (it feels far from that), but I'm solo finishing architectures now that I know how to build, but where delegating them to a team of senior devs would've resulted in chaos.

[1]: also for anyone who hasn't tried it and doesn't like TUI, do note that Aider has a web mode and a 'watch mode', where you can use your normal editor and if you leave a comment like '# make this darker ai!', Aider will step in and apply the change. This is even fancier with navigator/autonomy.

nico 1 hour ago [-]
> It's... decidedly expensive to run an LLM this way right now

Does it work ok with local models? Something like the quantized deepseeks, gemma3 or llamas?

tekacs 1 hour ago [-]
It does for me, yes -- models seem to be pretty capable of adhering to the tool call format, which is really all that they 'need' in order to do a good job.

I'm still tweaking the prompts (and I've introduced a new, tool-call based edit format as a primary replacement to Aider's usual SEARCH/REPLACE, which is both easier and harder for LLMs to use - but it allows them to better express e.g. 'change the name of this function').

So... if you have any trouble with it, I would adjust the prompts (in `navigator_prompts.py` and `navigator_legacy_prompts.py` for non-tool-based editing). In particular when I adopted more 'terseness and proactively stop' prompting, weaker LLMs started stopping prematurely more often. It's helpful for powerful thinking models (like Sonnet and Gemini 2.5 Pro), but for smaller models I might need to provide an extra set of prompts that let them roam more.
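
As a purely illustrative example of the kind of tool-call edit being described (this is an invented schema, not the actual format in the PR), a structured call can express "rename this function" far more directly than a SEARCH/REPLACE block:

    # Hypothetical tool-call style edit, i.e. the payload an LLM would emit
    # instead of a SEARCH/REPLACE block. All names and fields are invented.
    rename_edit = {
        "tool": "rename_symbol",
        "arguments": {
            "file": "app/services/billing.py",
            "old_name": "calc_total",
            "new_name": "calculate_invoice_total",
        },
    }

    def apply_rename(edit: dict, source: str) -> str:
        """Toy applier: a real one would rename via the AST, not str.replace."""
        args = edit["arguments"]
        return source.replace(args["old_name"], args["new_name"])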

pton_xd 10 minutes ago [-]
The trend with LLMs so far has been: if you have an issue with the AI, wait 6 months for a more advanced model. Cobbling together workarounds for their deficiencies is basically a waste of effort.
wrs 1 hour ago [-]
I’ve been using Cursor and Code regularly for a few months now and the idea of letting three of them run free on the codebase seems insane. The reason for the chat interface is that the agent goes off the rails on a regular basis. At least 25% of the time I have to hit the stop button and go back to a checkpoint because the automatic lawnmower has started driving through the flowerbed again. And paradoxically, the more capable the model gets, the more likely it seems to get random ideas of how to fix things that aren’t broken.
barrell 20 minutes ago [-]
Had a similar experience with Claude Code lately. I got a notice some credits were expiring, so I opened up Claude Code and asked it to fix all the credo errors in an elixir project (style guide enforcement).

I gave it incredibly clear steps of what to run in what process, maybe 6 steps, 4 of which were individual severity levels.

Within a few minutes it would ask to commit code, create branches, run tests, start servers - always something new, none of which were in my instructions. It would also often run mix credo, get a list of warnings, deem them unimportant, then try to go do its own thing.

It was really cool, I basically worked through 1000 formatting errors in 2 hours with $40 of credits (that I would have had no use for otherwise).

But man, I can’t imagine letting this thing run a single command without checking the output

tekacs 8 minutes ago [-]
So... I know that people frame these sorts of things as if it's some kind of quantization conspiracy, but as someone who started using Claude Code the _moment_ that it came out, it felt particularly strong. Then, it feels like they... tweaked something, whether in CC or Sonnet 3.7 and it went a little downhill. It's still very impressive, but something was lost.

I've found Gemini 2.5 Pro to be extremely impressive and much more able to run in an extended fashion by itself, although I've found very high variability in how well 'agent mode' works between different editors. Cursor has been very very weak in this regard for me, with Windsurf working a little better. Claude Code is excellent, but at the moment does feel let down by the model.

I've been using Aider with Gemini 2.5 Pro and found that it's very much able to 'just go' by itself. I shipped a mode for Aider that lets it do so (sibling comment here) and I've had it do some huge things that run for an hour or more, but assuredly it does get stuck and act stupidly on other tasks as well.

My point, more than anything, is that... I'd try different editors and different (stronger) models and see - and that small tweaks to prompt and tooling are making a big difference to these tools' effectiveness right now. Also, different models seem to excel at different problems, so switching models is often a good choice.

danenania 2 hours ago [-]
Plandex[1] uses a similar “wasteful” approach for file edits (note: I’m the creator). It orchestrates a race between diff-style replacements plus validation, writing the whole file with edits incorporated, and (on the cloud service) a specialized model plus validation.

While it sounds wasteful, the calls are all very cheap since most of the input tokens are cached, and once a valid result is achieved, other in-flight requests are cancelled. It’s working quite well, allowing for quick results on easy edits with fallbacks for more complex changes/large files that don’t feel incredibly slow.

1 - https://github.com/plandex-ai/plandex
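
A rough sketch of that race-and-cancel pattern in asyncio; the strategies and latencies below are simulated stand-ins, not Plandex's actual code:

    # Hypothetical race between edit strategies: the first result that
    # validates wins, and the slower in-flight attempts are cancelled.
    import asyncio
    import random

    async def diff_replace_edit(task: str) -> str | None:
        await asyncio.sleep(random.uniform(0.1, 1.0))   # simulate model latency
        # None means the edit failed validation and should be ignored.
        return f"diff edit for {task!r}" if random.random() > 0.3 else None

    async def whole_file_edit(task: str) -> str | None:
        await asyncio.sleep(random.uniform(0.5, 2.0))
        return f"whole-file edit for {task!r}"

    async def race_edits(task: str) -> str:
        pending = {
            asyncio.create_task(s(task))
            for s in (diff_replace_edit, whole_file_edit)
        }
        try:
            while pending:
                done, pending = await asyncio.wait(
                    pending, return_when=asyncio.FIRST_COMPLETED
                )
                for t in done:
                    result = t.result()
                    if result is not None:      # first valid edit wins
                        return result
            raise RuntimeError("no strategy produced a valid edit")
        finally:
            for t in pending:                   # cancel the slower attempts
                t.cancel()

    print(asyncio.run(race_edits("fix off-by-one in pagination")))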

billmalarky 2 hours ago [-]
I was lucky enough to have a few conversations with Scott a month or so ago, and he is doing some really compelling work around the AISDLC and creating a factory-line approach to building software. Seriously folks, I recommend following this guy closely.

There's another guy in this space I know who's doing similar incredible things but he doesn't really speak about it publicly so don't want to discuss w/o his permission. I'm happy to make an introduction for those interested just hmu (check my profile for how).

Really excited to see you on the FP of HN Scott!

joshstrange 3 hours ago [-]
This is a very interesting idea and I really should consider Aider in the "scriptable" sense more; I only use it interactively.

I might add another step after each PR is created where another agent (or agents?) reviews and compares the results (maybe have the other 2 agents review the first agent's code?).

Stwerner 3 hours ago [-]
Thanks, and having another step for reviewing each other's code is a really cool extension to this, I'll give it a shot :) Whether it works or not, it could be really interesting for a future post!
brookst 2 hours ago [-]
Wonder if you could have the reviewer characterize any mistakes and feed those back into the coding prompt: “be sure to… be sure not to…”
lherron 1 hour ago [-]
I love this! I have a similar automation for moving a feature through ideation/requirements/technical design, but I usually dump the result into Cursor for the last mile and to save on inference. Seeing the cost analysis is eye-opening.

There’s probably also some upside to running the same model multiple times. I find Sonnet will sometimes fail, I’ll roll back and try again with same prompt but clean context, and it will succeed.

aqme28 1 hour ago [-]
It's cute but I don't see the benefit. In my experience, if one LLM fails to solve a problem, the other ones won't be too different.

If you picked a problem where LLMs are good, now you have to review 3 PRs instead of just 1. If you picked a problem where they're bad, now you have 3 failures.

I think there are not many cases where throwing more attempts at the problem is useful.

phamilton 3 hours ago [-]
Sincere question: Has anyone figured out how we're going to code review the output of an agent fleet?
jsheard 3 hours ago [-]
Insincere answer that will probably be attempted sincerely nonetheless: throw even more agents at the problem by having them do code review as well. The solution to problems caused by AI is always more AI.
brookst 2 hours ago [-]
s/AI/tech
lsllc 2 hours ago [-]
Simple, just ask an(other) AI! But seriously, different models are better/worse at different tasks, so if you can figure out which model is best at evaluating changes, use that for the review.
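
One hedged sketch of the "ask another model" idea, assuming an OpenAI-style client; the judge model and prompt wording are placeholders:

    # Hypothetical LLM-as-reviewer: have a different model rank the candidate
    # PR diffs and justify its pick, instead of reviewing every one by hand.
    from openai import OpenAI

    client = OpenAI()

    def pick_best_pr(task: str, diffs: dict[str, str]) -> str:
        numbered = "\n\n".join(f"--- {name} ---\n{d}" for name, d in diffs.items())
        resp = client.chat.completions.create(
            model="o3-mini",  # ideally not one of the models that wrote the PRs
            messages=[{
                "role": "user",
                "content": (
                    f"Task: {task}\n\nCandidate PRs:\n{numbered}\n\n"
                    "Which PR best solves the task with the fewest risks? "
                    "Answer with the PR name and a short justification."
                ),
            }],
        )
        return resp.choices[0].message.content
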
fxtentacle 2 hours ago [-]
You just don't. Choose randomly and then try to quickly sell the company. /s
IshKebab 3 hours ago [-]
We're going to have no traditional programming in 2 years? Riiight.

It would also be nice to see a demo where the task was something that I couldn't have done myself in essentially no time. Like, what happens if you say "tasks should support tags, and you should be able to filter/group tasks by tag"?

Stwerner 3 hours ago [-]
Gave it a shot real quick, looks like I need to fix something up about automatically running the migrations either in the CI script or locally...

But if you're curious, task was this:

----

Title: Bug: Users should be able to add tags to a task to categorize them

Description: Users should be able to add multiple tags to a task but aren't currently able to.

Given I am a user with multiple tasks
When I select one
Then I should be able to add one or many tags to it

Given I am a user with multiple tasks each with multiple tags
When I view the list of tasks
Then I should be able to see the tags associated with each task

----

And then we ended up with:

GPT-4o ($0.05): https://github.com/sublayerapp/buggy_todo_app/pull/51

Claude 3.5 Sonnet ($0.09): https://github.com/sublayerapp/buggy_todo_app/pull/52

Gemini 2.0 Flash ($0.0018): https://github.com/sublayerapp/buggy_todo_app/pull/53

One thing to note that I've found - I know you had the "...and you should be able to filter/group tasks by tag" on the request - usually when you have a request that is "feature A AND feature B" you get better results when you break it down into smaller pieces and apply them one by one. I'm pretty confident that if I spent time to get the migrations running, we'd be able to build that request out story-by-story as long as we break it out into bite-sized pieces.

IanCal 3 hours ago [-]
You can have a larger model split things out into more manageable steps and create new tickets - marked as blocked or not on each other, then have the whole thing run.
precompute 40 minutes ago [-]
Feels like a way to live with a bad decision rather than getting rid of it.
evertedsphere 3 hours ago [-]
love to see "Why It Matters" turn into the heading equivalent of "delve" in body text (although different in that the latter is a legitimate word while the former is a "we need to talk about…"–level turn of phrase)
emorning3 3 hours ago [-]
I see 'Waste Inferences' as a form of abductive reasoning.

I see LLMs as a form of inductive reasoning, and so I can see how WI could extend LLMs.

Also, I have no doubt that there are problems that can't be solved with just an LLM but would need abductive extensions.

Same comments apply to deductive (logical) extensions to LLMs.

DeathArrow 3 hours ago [-]
I don't really think having an agent fleet is a much better solution than having a single agent.

We would like to think that having 10 agents working on the same task will improve the chances of success 10x.

But I would argue that some classes of problems are hard for LLMs and where one agent will fail, 10 agents or 100 agents will fail too.

As an easy example I suggest leetcode hard problems.

adhamsalama 2 hours ago [-]
We need The Mythical Man-Month: LLM version book.