Interesting! Could you give an example with a bit more specific detail here? I t...

FloorEgg · 2025-09-04T00:04:12 1756944252

Yes, essentially.

There are multiple long-form text inputs: one set is provided by User A and another by User B. User A's inputs act as a prompt for User B, and then User A analyzes User B's input against the original User A inputs, producing an output.

My system takes User A and B inputs and produces the output with more accuracy and precision than User As do, by a wide margin.

Instead of trying to train a model on all the history of these inputs and outputs, the solution was a combination of goal->job->task breakdown (like a fixed agentic process), and lots of context and prompt engineering. I then test against customer legacy samples, and inspect any variances by hand. At first the variances were usually system errors, which informed improvements to context and prompt engineering. After working through about a thousand iterations of this loop (test -> inspect variance -> if it's a system mistake, improve the system -> repeat), and benefiting from a couple of base-model upgrades, the variances are now about 99.9% user error (bad historical data or user inputs) and 0.1% system error. Overall it took about 9 months to build, and this one niche is worth ~$30m a year in revenue, easy, and everywhere I look there are market niches like this... it's ridiculous. (And a basic chat interface like ChatGPT doesn't work for these types of problems, no matter how smart it gets, for a variety of reasons.)
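For illustration only, here is a rough Python sketch of what a test -> inspect variance loop like this could look like. It is not the actual system: `LegacySample`, `Variance`, `find_variances`, and the whitespace/case normalization are all hypothetical names and choices, and `run_system` stands in for whatever pipeline produces the output.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LegacySample:
    """One historical record: both users' inputs plus the output User A produced."""
    sample_id: str
    user_a_input: str
    user_b_input: str
    historical_output: str


@dataclass
class Variance:
    """A disagreement between the system and the historical record."""
    sample_id: str
    system_output: str
    historical_output: str
    # Set by the human reviewer: "system_error" or "user_error".
    verdict: str = "unreviewed"


def normalize(text: str) -> str:
    # Assumption: case and whitespace differences are not meaningful.
    return " ".join(text.lower().split())


def find_variances(
    samples: List[LegacySample],
    run_system: Callable[[str, str], str],
) -> List[Variance]:
    """Run the system on every legacy sample and collect the disagreements."""
    variances = []
    for sample in samples:
        output = run_system(sample.user_a_input, sample.user_b_input)
        if normalize(output) != normalize(sample.historical_output):
            variances.append(
                Variance(sample.sample_id, output, sample.historical_output)
            )
    return variances


def system_error_rate(variances: List[Variance]) -> float:
    """Share of hand-reviewed variances that were the system's fault."""
    reviewed = [v for v in variances if v.verdict != "unreviewed"]
    if not reviewed:
        return 0.0
    return sum(v.verdict == "system_error" for v in reviewed) / len(reviewed)
```

Each variance still gets a human verdict; the code only narrows down what to look at and tracks how the system-error share changes between iterations.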

So to summarize:

Instead of training a model on the historical inputs and outputs, the solution was to use the best base model LLMs, a pre-determined agentic flow, thoughtful system prompt and context engineering, and an iterative testing process with a human in the loop (me) to refine the overall system by carefully comparing the variances between system outputs and historical customer input/output samples.
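As a hedged sketch of what a "pre-determined agentic flow" could mean in code: the steps and their order are fixed up front, and the model only fills in each step. Everything here (`Task`, `Job`, `run_goal`, the prompt templates, and `call_model`) is a hypothetical illustration, not the real implementation; `call_model` is a placeholder for any LLM client.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    """One model call with its own narrowly scoped prompt."""
    name: str
    prompt_template: str  # may reference context keys and earlier task names


@dataclass
class Job:
    """An ordered group of tasks serving one part of the goal."""
    name: str
    tasks: List[Task]


def run_goal(
    jobs: List[Job],
    context: Dict[str, str],
    call_model: Callable[[str], str],
) -> Dict[str, str]:
    """Run every task in a fixed order; the model never chooses the next step."""
    results: Dict[str, str] = {}
    for job in jobs:
        for task in job.tasks:
            # Each prompt sees the original inputs plus all earlier task outputs.
            prompt = task.prompt_template.format(**{**context, **results})
            results[task.name] = call_model(prompt)
    return results


# Usage with a stand-in model that just echoes the prompt in upper case.
jobs = [
    Job("analyze", [
        Task("extract", "Extract the key claims from: {user_b_input}"),
        Task("grade", "Grade these claims: {extract} against: {user_a_input}"),
    ]),
]
context = {"user_a_input": "the rubric", "user_b_input": "the answer"}
outputs = run_goal(jobs, context, call_model=lambda prompt: prompt.upper())
```

Because the flow is fixed, a variance found during testing can be traced to one task's prompt or context and fixed there.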

marlott · 2025-09-04T01:13:43 1756948423

Thanks a lot for the detailed reply! Makes a lot of sense now. I'm working on similar problems, and have dabbled with this kind of approach.