Is that actually a moat? Seems like all model providers managed to scrape the en...

jmb99 · 2025-12-31T05:57:04 1767160624

Scraping text across the entire internet is orders of magnitude easier than scraping YouTube. Even ignoring the sheer volume of data (exabytes), you simply will get blocked at an IP and account level before you make a reasonable dent. Even if you controlled the entire IPv4 space, I’m not sure you could scrape all of YouTube without getting every single address banned. IPv6 makes address bans harder, true, but then you’re still left with the problem of actually transferring and then storing that much data.

earthnail · 2025-12-31T07:35:34 1767166534

For now, you actually get pretty far with Tor. Just reset your connection when you hit an IP ban by sending SIGHUP to the Tor daemon.

I did that when I was retraining Stable Audio for fun, and it really turned out to be trivial enough to pull off as a little evening side project.

tucnak · 2025-12-31T09:54:44 1767174884

IPv6 doesn't make it "harder," as they would typically ban whole /48 prefixes.
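
To make the prefix point concrete, here is a minimal sketch using Python's standard `ipaddress` module. The addresses come from the IPv6 documentation range (2001:db8::/32) and the helper names `ban_prefix` and `is_banned` are made up for illustration; this is not a description of how any particular site implements its bans.

```python
import ipaddress

def ban_prefix(addr: str, length: int = 48) -> ipaddress.IPv6Network:
    """Return the enclosing prefix of the given length for one address.

    strict=False lets ip_network() accept an address with host bits set
    and mask it down to the network it belongs to.
    """
    return ipaddress.ip_network(f"{addr}/{length}", strict=False)

def is_banned(addr: str, banned: set) -> bool:
    """True if the address falls inside any banned prefix."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in banned)

# One offending address gets its whole /48 banned (hypothetical policy).
banned = {ban_prefix("2001:db8:1234:1::1")}

# A different /64 inside the same /48 is still covered by the ban.
print(is_banned("2001:db8:1234:ffff::abcd", banned))

# A neighbouring /48 is not.
print(is_banned("2001:db8:1235::1", banned))

# Size of what a single /48 ban covers: 2**80 addresses, 2**16 /64 subnets.
net = next(iter(banned))
print(net.num_addresses == 2**80, 2 ** (64 - net.prefixlen))
```

Under that assumed policy, rotating through addresses or /64 subnets within one allocation does nothing, since a single entry covers all 65,536 /64s in the /48.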

monocasa · 2025-12-31T02:09:18 1767146958

And we're probably already starting to see that, given the semi-recent escalations in the game of cat and also cat between YouTube and the likes of youtube-dl.

Reminds me of Reddit's cracking down on API access after realizing that their data was useful. But I'd expect YouTube both to be quicker on the draw, knowing about AI data collection, and to have more time, because of the orders of magnitude greater bandwidth required to scrape video.

jakeydus · 2025-12-31T04:05:14 1767153914

And reddit turned around and sold it all for a mess of pottage…

satvikpendem · 2025-12-31T07:13:54 1767165234

Sold being the operative word, rather than giving it away for free.

monocasa · 2025-12-31T18:19:41 1767205181

Well, it is available for free either way. They pissed off their user base all for a horse that had already left the stable.

https://academictorrents.com/details/2d056b22743718ac81915f2...

satvikpendem · 2025-12-31T18:27:09 1767205629

Look at their stock price. They are doing very well since IPO, and much of it was revenue from selling their data.

monocasa · 2025-12-31T18:35:06 1767206106

Google's $60m/yr is the only thing keeping them profitable.

Mozilla's business model isn't really something to emulate, even if the stock market doesn't really see it that way.

satvikpendem · 2025-12-31T18:41:04 1767206464

Not really. Lots of companies have valuable data they sell and have been in business for decades just fine. It's even better for reddit because it's user generated, so they don't even have to do anything. The users who left during the API debacle are not the vast majority of users, who are generally casual and do not give a single shit about what happened, much as tech people like to think otherwise.

monocasa · 2025-12-31T22:40:10 1767220810

The casual users (to say nothing of the massive uptick in bot traffic) are some of the more useless data from an AI training perspective.

satvikpendem · 2025-12-31T22:43:23 1767221003

Again, this is a techie take. Lots of people, for example, use ChatGPT for personal therapy, and guess which subs that training data comes from: r/relationships, etc. Those trying to use it for other purposes are comparatively less frequent.

awesome_dude · 2025-12-31T03:16:22 1767150982

> Seems like all model providers managed to scrape the entire textual internet just fine

Google, though, has been doing it for literal decades. That could mean that they have something nobody else (except archive.org) has: a history of how the internet/knowledge has evolved.