wow thanks for leaving this comment - i now realize two things:
1. the farmer's almanac i thought of when i saw the title and even read the article is not going anywhere
2. i have never before heard of the farmer's almanac referred to in this notice
yeah i think they shot themselves in the foot a bit here by creating the o series. the truth is that GPT-5 _is_ a huge step forward for the "GPT-x" line of models. The current GPT-x model was basically still 4o, with 4.1 available in some capacity. GPT-5 vs GPT-4o looks like a massive upgrade.
But it's only an incremental improvement over the existing o line. So people feel like the improvement from the current OpenAI SoTA isn't there to justify a whole bump. They probably should have just called o1 GPT-5 last year.
"The sculpture is already complete within the marble block, before I start my work. It is already there, I just have to chisel away the superfluous material."
Chat is a great UX _around_ development tools. Imagine having a pair programmer and never being allowed to speak to them. You could only communicate by taking over the keyboard and editing the code. You'd never get anything done.
Chat is an awesome powerup for any serious tool you already have, so long as the entity on the other side of the chat has the agency to actually manipulate the tool alongside you as well.
a little glossed over, but they do point out that the most important improvement o1 has over gpt-4o is not its "correct" score rising from 38% to 42%, but its "not attempted" rate going from 1% to 9%. The improvement is even more stark for o1-mini vs gpt-4o-mini: 1% to 28%.
They don't really describe what "success" would look like, but it seems to me like the primary goal is to minimize "incorrect" rather than to maximize "correct". The mini models would get there by maximizing "not attempted", with the larger models having much higher "correct". Then both model sizes could hopefully reach 90%+ "correct" when given access to external lookup tools.
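To make the arithmetic concrete: since every answer is either correct, incorrect, or not attempted, the headline numbers imply the "incorrect" rates directly. A quick sketch using only the figures quoted above:

```python
# Each answer falls in exactly one bucket, so the "incorrect" rate
# is whatever percentage the other two buckets leave over.
scores = {
    "gpt-4o": {"correct": 38, "not_attempted": 1},
    "o1":     {"correct": 42, "not_attempted": 9},
}
for model, s in scores.items():
    incorrect = 100 - s["correct"] - s["not_attempted"]
    print(f"{model}: {incorrect}% incorrect")
# gpt-4o: 61% incorrect
# o1: 49% incorrect
```

So a 4-point gain in "correct" hides a 12-point drop in "incorrect", which is the real headline if hallucination is what you care about.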
disagree - good products meet their users where they are and bury complexity under the hood. i can't imagine trying to use a calendar app (or any app really) that refuses to operate in any mode other than UTC.
OK but most people would agree that "only UTC" is not an ergonomic default. There is a balance.
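For what it's worth, the usual compromise is exactly that: UTC under the hood, local time at the edge. A minimal Python sketch (the timestamp and zone are made up for illustration):

```python
# Store timestamps in UTC internally; convert to the user's zone
# only when rendering. zoneinfo is in the standard library (3.9+).
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

stored = datetime(2024, 7, 1, 16, 30, tzinfo=timezone.utc)   # what the backend keeps
local = stored.astimezone(ZoneInfo("America/New_York"))      # what the UI shows
print(local.isoformat())  # 2024-07-01T12:30:00-04:00
```

The complexity (DST rules, zone changes) stays buried in the tz database; the user never sees UTC unless they ask for it.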
Also, are the users where they are because they want to be there, or because long ago some government or religious leader forced something through and they go along with it because of some kind of inertia?
It's kind of interesting because I think most people implementing RAG aren't even thinking about tokenization at all. They're thinking about embeddings:
1. chunk the corpus of data (various strategies but they're all somewhat intuitive)
2. compute embedding for each chunk
3. generate search query/queries
4. compute embedding for each query
5. rank corpus chunks by distance to query (vector search)
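The five steps above can be sketched end-to-end. Everything here is illustrative: the toy corpus is made up, and the bag-of-words `embed` is a stand-in for a real embedding model and vector store.

```python
# Minimal RAG retrieval sketch. The bag-of-words "embedding" is a
# stand-in for a learned embedding model; real systems call a model
# API here and rank with an approximate-nearest-neighbor index.
import math
from collections import Counter

def embed(text):
    # stand-in for a learned embedding: lowercase word counts
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. chunk the corpus (here, pre-chunked toy strings)
corpus = [
    "tokenization splits text into subword units",
    "embeddings map text into vectors",
    "vector search ranks chunks by distance",
]
# 2. compute an embedding for each chunk
chunk_vecs = [embed(c) for c in corpus]
# 3-4. generate a search query and embed it
query_vec = embed("how do embeddings represent text")
# 5. rank corpus chunks by similarity to the query
ranked = sorted(corpus, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
print(ranked[0])  # embeddings map text into vectors
```

Note that tokenization hides inside `embed` in both step 2 and step 4, which is exactly the article's point: it runs on every chunk and every query, yet nobody in the pipeline above ever mentions it.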
So this article really gets at the importance of a hidden, relatively mundane-feeling operation that can have an outsized impact on the performance of the system. I do wish it had more concrete recommendations in the last section, and a code sample of a robust pipeline with normalization, fine-tuning, and evals.