The examples appear to cover knowledge retrieval and factoids only.
The concept appears to be large-scale chain of thought and automatic prompt generation and fine-tuning… but there don’t appear to be actual examples of this.
The problem is, there is a big song and dance about string template prompts.
…but carefully crafted string templates would be a) simpler and b) arguably better with existing solutions for this task, because it’s a trivial task and you can hand-massage your string-template prompts for it.
So, the narrative really doesn’t make sense, unless you’re doing something hard, but the example just shows doing something easy in a very complicated way.
I get it, maybe you can scale this up better… but you’re really not showing it off well.
@wokwokwok Okay now we disagree.
This task is not easy, it's just easy to follow in one notebook. (If it were easy, the RAG score wouldn't be 26%.)
As for "carefully crafted string templates", I'm not sure what your argument here is. Are you saying you could have spent a few hours of trial and error writing 3 long prompts in a pipeline, until you matched what the machine does in 60 seconds?
You give DSPy (1) your free-form code with declarative calls to LMs, (2) a few inputs [labels optional], and (3) some validation metric [e.g., sanity checks].
It simulates your code on the inputs. When there's an LM call, it will make one or more simple zero-shot calls that respect your declarative signature. Think of this like a more general form of "function calling" if you will. It's just trying out things to see what passes your validation logic, but it's a highly-constrained search process.
The constraints enforced by the signature (per LM call) and the validation metric allow the compiler [with some metaprogramming tricks] to gather "good" and "bad" examples of execution for every step in which your code calls an LM. Even if you have no labels for it, because you're just exploring different pipelines. (Who has time to label each step?)
For now, we throw away the bad examples. The good examples become potential demonstrations. The compiler can now do an optimization process to find the best combination of these automatically bootstrapped demonstrations in the prompts. Maybe the best on average, maybe (in principle) the predicted best for a specific input. There's no magic here, it's just optimizing your metric.
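To make that concrete, here is a rough sketch of those three ingredients plus a compile step. (A simplified illustration rather than a copy-paste recipe: the LM setup and exact argument names may differ slightly depending on the release you're on.)

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Example LM setup (any supported LM works; this one is just for illustration).
lm = dspy.OpenAI(model="gpt-3.5-turbo")
dspy.settings.configure(lm=lm)

# (1) Free-form code with a declarative LM call: a signature, no prompt strings.
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

program = dspy.Predict(GenerateAnswer)

# (2) A few inputs (labels optional).
trainset = [
    dspy.Example(question="What castle did David Gregory inherit?",
                 answer="Kinnairdy Castle").with_inputs("question"),
    # ... a handful more ...
]

# (3) A validation metric: any Python function that sanity-checks a prediction.
def validate_answer(example, pred, trace=None):
    return example.answer.lower() in pred.answer.lower()

# The compiler simulates the program on the inputs, keeps traces that pass the
# metric, and searches over combinations of the bootstrapped demonstrations.
teleprompter = BootstrapFewShot(metric=validate_answer)
compiled_program = teleprompter.compile(program, trainset=trainset)
```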
The same bootstrapping logic lends itself (with more internal metaprogramming tricks, which you don't need to worry about) to finetuning models for your LM calls, instead of prompting.
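A rough sketch of that as well, reusing the program, trainset, and metric from above (the finetuning teleprompter's exact arguments, e.g. the target model, are illustrative):

```python
from dspy.teleprompt import BootstrapFinetune

# Same program, same small trainset, same metric -- but the bootstrapped traces
# become training data for a small model instead of prompt demonstrations.
# (The target model name here is just an illustrative placeholder.)
finetuner = BootstrapFinetune(metric=validate_answer)
finetuned_program = finetuner.compile(program, trainset=trainset, target="t5-large")
```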
In practice, this works really well because even tiny LMs can do powerful things when they see a few well-selected examples.
Hi Omar - thanks for engaging here. I have a similar question to simonw: it _feels_ like there is something useful here, but I haven't managed to grok it yet, even after sitting through the tutorial notebooks.
Specifically, to your description above, I'd love to see specific retrieval examples where you need more-complex pipelines. Zero-shot QA (1-step), few-shot QA (2-step), and retrieval + few-shot QA (3-step) all make sense, but when the README starts talking about demonstrations, I can't really follow when that is actually needed. Also, it starts feeling too magical when you introduce "smaller LMs", since I don't know what those are.
I'm trying to wrap my head around this project too, since it does seem interesting. Similar to what OP wrote, the sense I got from poking around (and of course from reading the bit in the README that basically says exactly this) was that there are two distinct pieces here: the first is a nice, clean library for working directly with LLMs that refreshingly lacks the assumptions and brittle abstractions found in many current LLM frameworks, and the second is everything related to automatic optimization of prompts.
The second half is the part I'm trying to better understand. More specifically, I understand that it uses a process to generate and select examples that are then added to the prompt, but I'm unclear whether it's also doing any prompt transformations beyond these example-related improvements.
To put it another way: if one were to reframe the second half as a library for automatic n-shot example generation and optimization, made possible via the various cool things this project has implemented like the spec language/syntax, is there anything lost or not covered by the new framing?
As more of an aside, I gave the paper a quick skim and plan on circling back to it when I have more time - are the ideas in the paper an accurate/complete representation of the under-the-hood workings, and general type of optimizations being performed, of the current state of the project?
As another related aside, I vaguely remember coming across this a month or two ago and coming away with a different impression/understanding of it at the time - has the framing of or documentation for the project changed substantially recently, or perhaps the scope of the project itself? I seem to recall focusing mostly on the LM and RM steps and reading up a bit on retrieval model options afterwards. I could very well be mixing up projects or just had focused on the wrong things the first time around of course.
Thanks! Lots to discuss from your excellent response, but I'll address the easy part first: DSPy is v2 of DSP (demonstrate-search-predict).
The DSPy paper hasn't been released yet. DSPy is a completely different thing from DSP. It's a superset. (We actually implemented DSPy _using_ DSPv1. Talk about bootstrapping!)
Reading the DSPv1 paper is still useful to understand the history of these ideas, but it's not a complete picture. DSPy is meant to be much cleaner and more automatic.
DSPy provides composable and declarative modules for instructing LMs in a familiar Pythonic syntax and an automatic compiler that teaches LMs how to conduct the declarative steps in your program. Specifically, the DSPy compiler will internally trace your program and then craft high-quality prompts for large LMs (or train automatic finetunes for small LMs) to teach them the steps of your task.
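For example, a small retrieval-augmented pipeline is just ordinary Python composing declarative modules, and the compiler traces each of these call sites separately. (A sketch, assuming an LM and a retrieval model have been configured via dspy.settings.)

```python
import dspy

class RAG(dspy.Module):
    """Retrieve passages, then answer with chain of thought."""

    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        # Declarative signature only; the compiler decides how to prompt
        # (or finetune) the LM for this step.
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate_answer(context=context, question=question)
```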
"A neural network layer is just a matrix. Why abstract that matrix and learn it?" Well, because it's not your job to figure out how to hardcode delicate string or floats that work well for a given architecture & backend.
We want developers to iterate quickly on system designs: How should we break down the task? Where do we call LMs? What should they do?
---
If you can guess the right prompts right away for each LLM, tweak them well for any complex pipeline, and rarely have to change the pipeline (and hence all prompts in it), then you probably won't need this.
That said, it turns out that (a) prompts that work well are very specific to particular LMs, large & especially small ones, (b) prompts that work well change significantly when you tweak your pipeline or your data, and (c) prompts that work well may be long and time-consuming to find.
Oh, and often the prompt that works well changes for different inputs. Thinking in terms of strings is a glaring anti-pattern.
I agree with you on all of those points - but my conclusion is different: those are the reasons it's so important to me that the prompts are not abstracted away from me!
I'm working with Llama 2 a bunch at the moment and much of the challenge is learning how to prompt it differently from how I prompt GPT-4. I'm not yet convinced that an abstraction will solve that problem for me.
> People seem to underestimate and overlook the importance of prompts.
We do this to each other as well. Being able to communicate clear, concise, and complete requests will produce better results with both humans and LLMs. What is interesting is that we can experiment with prompts against machines at a scale we cannot with other people. I'd really like to see more work towards leveraging this feature to improve our human interactions, kind of like empathy training in VR.
1] when prototyping, it's useful to not have to tweak each prompt by hand as long as you can inspect them easily
2] when the system design is "final", it's important to be able to tweak any prompts or finetunes with full flexibility
But we may or may not agree on:
3] automatic optimization can basically make #2 above necessary only very rarely
---
Anyway, the entire DSPy project has zero hard-coded prompts for tasks. It's all bootstrapped and validated against your own logic, in case you're worried that we're doing some opinionated prompting on your behalf.
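If you want to check what the compiler actually produced, you can run the compiled program and then print the last prompt the LM saw, something along these lines (reusing the names from the earlier sketch; adjust the helper if your version differs):

```python
# Run the compiled program, then print the most recent prompt sent to the LM,
# including the bootstrapped demonstrations.
compiled_program(question="What castle did David Gregory inherit?")
lm.inspect_history(n=1)
```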
It sounds fascinating! Is there anything one could read to figure out more about how this is being done? (From reading the docs, it's done by the "Teleprompter"s, right?)
One time (ironically, after I had learned about model-free methods) I got sucked into writing a heuristic for an A* algorithm. It turned into a bottomless pit of manually tuning various combinations of rules. I learned the value of machine learning the hard way.
If prompts can be learned, then eventually it will be better to learn them than to tune them manually. However, these ideas need not be mutually exclusive. If we reject the tyranny of the “or”, we can have a prompt prior that we tune manually and then update with a learning process, right?
P.S. Whoever wrote the title: I think it’s pretty silly to write “The Framework…” for anything, because it presumes yours is the only member of some category, which is never true!
There’s always DSP for those who need a lightweight but powerful programming model — not a library of predefined prompts and integrations.
It’s a very different experience from the hand-holding of LangChain, but it packs reusable magic into generic constructs like annotate, compile, etc. that work with arbitrary programs.
Very cool! But for hard enough problems, prompt engineering is kind of like hyperparameter tuning. It's only a final (and relatively minor) step after building up an effective architecture and getting its modules to work together.
DSP provides a high-level abstraction for building these architectures—with LMs and search. And it gets the modules working together on your behalf (e.g., it annotates few-shot demonstrations for LM calls automatically).
Once you're happy with things, it can compile your DSP program into a tiny LM that's a lot cheaper to work with.