Anthropic's recent announcement of tool use (function calling) caught my attention, specifically the claim that Claude models can correctly handle 250+ tools with >90% accuracy.
I've been working with GPT function calling for a while and noticed that the recall for larger and more complex functions is quite low.
So, I decided to compare GPT and Claude's performance in using different tools for tasks like web scraping and browser automation.
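
For context, this is roughly what registering a tool with Claude looks like. It's a minimal sketch using the official `anthropic` Python SDK; the `scrape_page` tool is a hypothetical stand-in for the scraping tools I actually tested, and the model ID is just whichever Claude model you point it at.

```python
# Minimal sketch: registering a single tool with Claude's Messages API.
# Assumes the `anthropic` Python SDK and an ANTHROPIC_API_KEY in the
# environment; `scrape_page` is a hypothetical tool for illustration.
import anthropic

client = anthropic.Anthropic()

scrape_page = {
    "name": "scrape_page",
    "description": "Fetch a URL and return the page's visible text content.",
    "input_schema": {
        "type": "object",
        "properties": {
            "url": {"type": "string", "description": "Absolute URL to fetch."},
        },
        "required": ["url"],
    },
}

response = client.messages.create(
    model="claude-3-opus-20240229",  # swap in whichever Claude model you're testing
    max_tokens=1024,
    tools=[scrape_page],
    messages=[{"role": "user", "content": "What's the top headline on example.com?"}],
)

# When the model decides to call a tool, stop_reason is "tool_use" and the
# content includes a tool_use block carrying the arguments the model chose.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)  # e.g. scrape_page {'url': 'https://example.com'}
```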
Learnings:
- AI agents still work best for simple, well-constrained tasks.
- To create a successful agent, you need to provide it with good tools. The LLM can then figure out the correct sequence of tool calls itself, which feels like a promising direction (see the agent-loop sketch after this list).
- Tool use is still quite slow and often very expensive. I spent around $50 on a single day of experimenting with Claude; imagine what testing a production-scale system would cost. Making the unit economics work is difficult, but it will improve as LLM costs continue to drop.
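
To make the second point concrete, here's the shape of the agent loop I mean: Claude plans the next tool call, we execute it and return the result, and the loop repeats until the model answers directly. This is a sketch under the same assumptions as above; `execute_tool` is a hypothetical dispatcher you'd wire up to your real implementations (a requests-based scraper, a browser automation action, and so on).

```python
# Sketch of the agent loop: Claude decides which tool to call next, we run it
# and feed the result back, until the model stops asking for tools.
import anthropic
import requests

client = anthropic.Anthropic()

def execute_tool(name: str, args: dict) -> str:
    """Hypothetical dispatcher; map tool names to real implementations here."""
    if name == "scrape_page":
        return requests.get(args["url"], timeout=10).text[:5000]
    raise ValueError(f"Unknown tool: {name}")

def run_agent(task: str, tools: list[dict]) -> str:
    messages = [{"role": "user", "content": task}]
    while True:
        response = client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            # Final answer: the model produced text instead of another tool call.
            return "".join(b.text for b in response.content if b.type == "text")

        # Echo the assistant turn, then answer each tool_use block with a
        # matching tool_result so the model can plan its next step.
        messages.append({"role": "assistant", "content": response.content})
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": execute_tool(block.name, block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
```

Every iteration is a full round trip to the model, and each one re-sends the entire growing conversation. That's where both the latency and the $50 came from.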