Anthropic's recent announcement of tool use (function calling) caught my attention, specifically the claim that Claude models can correctly handle 250+ tools with >90% accuracy.
I've been working with GPT function calling for a while and noticed that the recall for larger and more complex functions is quite low.
So, I decided to compare GPT and Claude's performance in using different tools for tasks like web scraping and browser automation.
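
For context, this is roughly what registering a tool with Claude looks like. It's a minimal sketch using the official `anthropic` Python SDK; the `scrape_page` tool is a hypothetical stand-in for the scraping tools I actually tested, and the model ID is just whichever Claude model you point it at.

```python
# Minimal sketch: registering a single tool with Claude's Messages API.
# Assumes the `anthropic` Python SDK and an ANTHROPIC_API_KEY in the
# environment; `scrape_page` is a hypothetical tool for illustration.
import anthropic

client = anthropic.Anthropic()

scrape_page = {
    "name": "scrape_page",
    "description": "Fetch a URL and return the page's visible text content.",
    "input_schema": {
        "type": "object",
        "properties": {
            "url": {"type": "string", "description": "Absolute URL to fetch."},
        },
        "required": ["url"],
    },
}

response = client.messages.create(
    model="claude-3-opus-20240229",  # swap in whichever Claude model you're testing
    max_tokens=1024,
    tools=[scrape_page],
    messages=[{"role": "user", "content": "What's the top headline on example.com?"}],
)

# When the model decides to call a tool, stop_reason is "tool_use" and the
# content includes a tool_use block carrying the arguments the model chose.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)  # e.g. scrape_page {'url': 'https://example.com'}
```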
Learnings:
- AI agents still work best for simple, well-constrained tasks.
- To create a successful agent, you need to provide it with good tools. The LLM can then figure out the correct sequence of tool calls itself, which feels like a promising direction (see the agent-loop sketch after this list).
- Tool use is still quite slow and often very expensive. I spent around $50 on a single day of experimenting with Claude; imagine what testing a production-scale system would cost. Making the unit economics work is difficult, but it will improve as LLM costs continue to drop.
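
To make the second point concrete, here's the shape of the agent loop I mean: Claude plans the next tool call, we execute it and return the result, and the loop repeats until the model answers directly. This is a sketch under the same assumptions as above; `execute_tool` is a hypothetical dispatcher you'd wire up to your real implementations (a requests-based scraper, a browser automation action, and so on).

```python
# Sketch of the agent loop: Claude decides which tool to call next, we run it
# and feed the result back, until the model stops asking for tools.
import anthropic
import requests

client = anthropic.Anthropic()

def execute_tool(name: str, args: dict) -> str:
    """Hypothetical dispatcher; map tool names to real implementations here."""
    if name == "scrape_page":
        return requests.get(args["url"], timeout=10).text[:5000]
    raise ValueError(f"Unknown tool: {name}")

def run_agent(task: str, tools: list[dict]) -> str:
    messages = [{"role": "user", "content": task}]
    while True:
        response = client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            # Final answer: the model produced text instead of another tool call.
            return "".join(b.text for b in response.content if b.type == "text")

        # Echo the assistant turn, then answer each tool_use block with a
        # matching tool_result so the model can plan its next step.
        messages.append({"role": "assistant", "content": response.content})
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": execute_tool(block.name, block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
```

Every iteration is a full round trip to the model, and each one re-sends the entire growing conversation. That's where both the latency and the $50 came from.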