A colleague of mine raised a very important point here. The class is being taught at the NYU business school (co-taught with Konstantinos Rizakos, AI/ML Product Mgmt). The fees are pretty high: ~$60,000/year ($2,000+/credit at 15 credits/semester). How much of an ask is it on the business model to incorporate human evaluation, say 25% of the cost (~$15,000 of spending per student), to have their exams evaluated orally by a TA, or to just do the damn exam in a controlled class environment?
Absolutely the easiest solution would have been to have a written exam on the cases and concepts that we discussed in class. It would take a few hours to create and grade the exam.
But at a university you should experiment and learn. What better class to experiment and learn in than "AI Product Management"? Students were actually intrigued by the idea themselves.
The key goal: we wanted to ensure that the projects that students submitted were actually their own work, not "outsourced" (in a general sense) to teammates or to an LLM.
Gemini 3 and NotebookLM with slide generation were released in the middle of the class, and we realized that it is feasible for a student to give a flawless presentation in front of the class without deeply understanding what they are presenting.
We could schedule oral exams during finals week, which would be a major disruption for the students, or schedule them during the break, violating university rules and ruining students' vacations.
But as I said, we learned that AI-driven interviews are more structured and better than human-driven ones, because humans do get tired, and they do have biases based on who they are interviewing. That's why we decided to experiment with voice AI for running the oral exam.
One (narrow) circumstance in which reviewing a large contribution, written with significant aid from an LLM, can be made easier is to jump on a call with the reviewer, explain what the change is, and answer their questions on why it is necessary and what it brings to the table. This first pass is useful for a few reasons:
1. It shifts the cognitive load from the reviewer to the author, because now the author has to do an elevator pitch, and this can work sort of like a "rubber duck": one would likely have to think about these questions up front.
2. In my experience this is much faster than a lonesome review with no live input from the author on the many choices they made.
After this first pass, have a reviewer give a go/no-go with optional comments on design/code quality, etc.
Have you ever done that with new contributors to open source projects? Typically things tend to be asynchronous but maybe it's a practice I've just not encountered in such context.
I've done that in contributions to unknown people's repos, but not necessarily open source ones. I believe this is quite undervalued for the reasons I listed.
In addition, 1:1 contact can speed things up immensely in such situations, because most activity on a change happens very soon after it is first posted, and that initial voluminous back and forth is much faster on a call than typed out on GitHub in a PR.
From my own experience, as one grows through their 30s (or probably much older) into what you mentioned, "money, houses, kids, friends", these ads pretty much don't target you very effectively anyway, because one's priorities have shifted and you care more about other things than what the attention economy is all about. IOW, these ads are all about the people who have attention to spare.
The general process feels very much like having kids over for a birthday party. Except you have to get them all to play nice and you have no idea what this other kid was conditioned on by their parents. Generally it would all work fine, all the kids know how the party progresses and what their roles are — if any.
But imagine how hard it would be if these kids had short-term memory only and would not know what to focus on except what you tell them. You literally have to tell them "Here is A-Z, pay attention to 'X' only and go do your thing". Add in other managers for this party, like a caterer, clowns, or your spouse, and they also have to tell the kids that, and remember and communicate what the other managers have done. No one has solved for this, really.
This is what it felt like in 2025 to code with LLMs on non-trivial projects, with somewhat of an improvement as the year went by. But I am not sure much progress was made in fixing the process part of the problem.
There was a time when, if you edited documentation in VS Code with Copilot on, it would complete internal user and project names whenever it encountered a path on some random LLM project we were building. I could find people and their projects by just googling the username and contextual keywords.
We all had a lot of laughs with tab autocomplete and wondered in anticipation what ridiculous stuff it would throw up next.
One thing that is interesting to think about: given a skill which is just "pre-context", how can it be _evolved_ to create prompts given _my_ context? e.g. here is their web artifacts builder skill from the desktop app:
```
web-artifacts-builder
Suite of tools for creating elaborate, multi-component claude.ai HTML artifacts using modern frontend web technologies (React, Tailwind CSS, shadcn/ui). Use for complex artifacts requiring state management, routing, or shadcn/ui components - not for simple single-file HTML/JSX artifacts.
```
Say I want to build a landing page with some relatively static content. I don't know it yet, but it's just going to be Bootstrap CSS, no SPA/React(ish); it'll be fine as a templated server-side thing. But I don't know how to express this in words. Could the skill _evolve_ based on what my preferences are and what is possible for a relative novice to grok and construct?
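For instance, an evolved counterpart to the skill above might look something like this (entirely hypothetical; the name and wording are made up to mirror the format above):

```
simple-static-site-builder
Tools for building mostly-static pages with plain Bootstrap CSS and
server-side templates, learned from this user's past choices. Prefer this
over web-artifacts-builder when there is no client-side state, routing, or
component library involved.
```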
This is a simple example, but it could extend to, say, using SQLite + Litestream instead of Postgres, or gradient boosted trees instead of an expensive transformer-based classifier.
Isn't at least part of that GH issue something that this https://docs.boundaryml.com/guide/introduction/what-is-baml is also trying to solve? LLM prompts should be functions with defined inputs and outputs; that was their starting point.
IIUC their most recent arc focuses on prompt optimization[0], where you can optimize (using DSPy and an optimization algorithm, GEPA[1]) with relative weights on different things like errors, token usage, and complexity.
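To make the shape of that concrete, here is a rough DSPy-flavoured sketch. The signature/metric pattern is standard DSPy; the GEPA call itself is an assumption about the current API and its exact arguments may differ, so it is left commented out:

```
import dspy

# dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # configure whichever LM you use

class SummarizeTicket(dspy.Signature):
    """Summarize a support ticket into a one-line triage note."""
    ticket: str = dspy.InputField()
    triage_note: str = dspy.OutputField()

summarize = dspy.Predict(SummarizeTicket)

def metric(example, prediction, trace=None, *args, **kwargs):
    # Weighted blend of the kinds of things mentioned above:
    # correctness vs. output length (a crude proxy for token usage).
    correct = float(example.triage_note.lower() in prediction.triage_note.lower())
    brevity = 1.0 if len(prediction.triage_note.split()) <= 30 else 0.5
    return 0.8 * correct + 0.2 * brevity

# Assumed optimizer API; GEPA iteratively rewrites the prompt to raise the metric.
# optimizer = dspy.GEPA(metric=metric)
# optimized = optimizer.compile(summarize, trainset=trainset)
```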
I gave Opus an "incorrect" research task (using this slash command[1]) in my REST server: research whether the SQLite + Litestream VFS can be used to create read-replicas for the REST service itself. This is obviously a dangerous use of the VFS[2], and of a system like SQLite in general (stale-reads and isolation-wise speaking). Ofc it happily went ahead and used Django's DB router feature, implementing `allow_relation` to return true if `obj._state.db` was the `replica` or the `default` master db.
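For context, the shape it produced is roughly the standard Django multi-database router recipe below. This is a sketch from memory, not the exact code; the database names and file paths are hypothetical, and it assumes Litestream maintains a read-only replica of the SQLite file:

```
# settings.py (sketch)
DATABASES = {
    "default": {"ENGINE": "django.db.backends.sqlite3", "NAME": "app.db"},
    "replica": {"ENGINE": "django.db.backends.sqlite3", "NAME": "app-replica.db"},
}
DATABASE_ROUTERS = ["myapp.routers.ReplicaRouter"]

# myapp/routers.py (sketch)
class ReplicaRouter:
    def db_for_read(self, model, **hints):
        return "replica"      # every read goes to the replica file

    def db_for_write(self, model, **hints):
        return "default"      # every write goes to the primary

    def allow_relation(self, obj1, obj2, **hints):
        # the `allow_relation` trick described above: permit relations as
        # long as both objects live on either the primary or the replica
        return {obj1._state.db, obj2._state.db} <= {"default", "replica"}
```

Nothing in this shape guards against the replica lagging behind the primary, which is exactly the stale-read and isolation problem mentioned above.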
Now Claude had access to this[2] link, and it got the data into the research prompt using web-searcher. But that's not the point. Any junior worth their salt (distributed systems 101) would have known what was obvious here; the failure was in paying attention to the _right_ thing. While there are ideas on prompt optimization out there[3][4], how many tokens a model can burn thinking about these things, and coming up with an optimal prompt and corrections to it, is a very hard problem to solve.
If you are not familiar with data systems, have a read of DDIA (Designing Data Intensive Applications), Chapter 3, especially the part on building a database from the ground up. It almost starts with something like "What's the simplest key-value store?": `echo` (an O(1) write to the end of a file, super fast) and `grep` (an O(n) read, slow), and then builds up all the way to LSM-trees and B-trees. It will all make a lot more sense why this preserves so many of those ideas.
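A rough Python transliteration of that opening idea (my own sketch; the book itself uses a couple of shell functions):

```
# The "simplest key-value store" idea from the start of DDIA chapter 3:
# writes append a line to a log file (fast, O(1)); reads scan the whole
# file and keep the last match (slow, O(n)).
def db_set(key: str, value: str, path: str = "database.log") -> None:
    with open(path, "a") as f:
        f.write(f"{key},{value}\n")     # append-only, like `echo ... >> file`

def db_get(key: str, path: str = "database.log") -> str | None:
    result = None
    try:
        with open(path) as f:
            for line in f:              # full scan, like `grep`
                k, _, v = line.rstrip("\n").partition(",")
                if k == key:
                    result = v          # later writes shadow earlier ones
    except FileNotFoundError:
        pass
    return result
```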