
Depends if you're using the botanical definition or the (more common) culinary definition[0].

I would argue fruit and fruit are two words, one created semasiologically and the other created onomasiologically. Had we chosen a different pronunciation for one of those words, there would be no confusion about what fruits are.

[0] - https://en.wikipedia.org/wiki/Fruit#Botanical_vs._culinary


Deportation would also affect the statistics, in that it removes potential repeat-offenders, no?

The study doesn't mention repeat offenses, so I can only assume they sampled both first-offenders and repeat-offenders. If illegal immigrant offenders get deported rather than jailed, the statistics would be lower than if they were sent to prison and allowed to return to crime afterwards.


Yes, as far as I know most criminality follows a Pareto distribution, so consistently removing those who get arrested will indeed affect the stats here.

The 1% of the population responsible for 63% of all violent crime convictions: https://pmc.ncbi.nlm.nih.gov/articles/PMC3969807/


> While all this happens, twice as many will experience the same SA by their spouse, family member/friend, tinder date, or whatever

Figure 2 shows that sexual assault is almost twice as likely to be perpetrated by immigrants as by US-born citizens[0]. As a foreigner with no dog in this fight, I've seen a lot of outrage over both black-on-white and white-on-black violence in the US. Maybe some newspapers like Fox News are more selective in what they show, but AP News (which I've been reading for world news) seems to cover a pretty balanced range of racial crime.

> If some illegal / undocumented foreigner sexually assaults a white person, that's all you'll see in the newspapers and media for a week. There will be townhall meetings, debates, protests.

Maybe I'm misunderstanding the scale of this, but surely it doesn't compare to the response that George Floyd got?

[0] - https://www.pnas.org/doi/10.1073/pnas.2014704117#fig02


SA:

US-born citizen: 18.2

Legal immigrant: 31.2

Undocumented immigrant: 11.3

So this is a very US-specific (Texas) study, and it doesn't seem to provide more granular data. At least where I'm from (Norway), studies have shown that people born to immigrant parents tend to be most at risk, and to be assaulted by other people born to immigrant parents. So basically immigrant-on-immigrant violence. Not sure if that translates to Texas at all. The studies usually don't try to explain why these things happen, but it could be the result of things like socioeconomic factors.

Not to mention the dark figures: how many illegal immigrants who experience assault will report it to law enforcement, compared to legal citizens?


> be assaulted by other people born to immigrant parents. So basically immigrant-on-immigrant violence

People born of immigrant parents aren't immigrants!

(* terms and conditions apply)


Also, in America they miscategorize criminals as White whenever the FBI can possibly get away with it. The statistics are probably twice as bad in reality, but any figure that refutes globalism will be cooked.

I would like to live in this utopia where free software is funded by the state. This seems impossible to get implemented in our world though.

Several states fund open science, and a couple of them actually do fund open source projects. Germany has its Sovereign Tech Agency for this; France has publicly-funded research agencies that work on a lot of open source stuff, and there are others. There are EU initiatives as well.

It’s not perfect, but it is already something that is being done.


The EU does fund a lot of open source software.

So does the US. In fact they did for this software.

But how would that work? There isn't unlimited money, so who decides which software and which developers to support with state money? I don't trust a bureaucracy to decide which developers should get paid to work on sudo. Just look at the sudo-rs debacle, and that's without money involved.

You have a failure of imagination if this is what you think. Luckily, in politics we don't have to listen to people like you, and can instead listen to those with an actual vision of a better future.

Does this make conspiracy theorists highly intelligent?

No, but they emulate intelligence by making up connections between seemingly disparate things, where there are none.

They make connections but lack the critical thinking skills to weed out the bad/wrong ones.

Which is why, just occasionally, they're right, but mostly by accident.


Banks don't just keep your money in a vault when you store it in a bank account - that would be stupid for both the bank and you. Money loses value over time due to inflation, so banks reinvest 90% of your stored balance into loans, stocks, bonds, etc. This means that a theoretical $1B account would allow someone else to take a $100M loan to fund a new venture. This is how banks make money, and they pay you a small portion of their profits as interest on your money (since they're profiting off it).

This is still not a good idea for you, as the interest doesn't make up for inflation. Most people keep a small portion of their wealth in the bank, as easy access for emergencies (this is called dry powder[0]). The rest is typically invested into private equity, which allows new ventures to be created.

It's very rare for anyone to have more than $50M in the bank. The money is usually out in the market doing its work.

[0] - https://www.investopedia.com/terms/d/drypowder.asp
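The arithmetic in the comment above can be sketched as a toy calculation. All rates here are made up for illustration; they're not real bank or inflation figures:

```rust
fn main() {
    let deposit: f64 = 1_000_000_000.0; // the theoretical $1B account
    let reinvested = deposit * 0.90;    // ~90% lent out / invested by the bank
    let interest = 0.02;                // hypothetical rate the bank pays you
    let inflation = 0.03;               // hypothetical rate at which prices rise

    // Real (inflation-adjusted) growth of the deposit after one year.
    // With interest below inflation, this comes out negative.
    let real_growth = (1.0 + interest) / (1.0 + inflation) - 1.0;

    println!("available for loans: ${:.0}", reinvested);
    println!("real yearly return on the deposit: {:.2}%", real_growth * 100.0);
}
```

With these made-up rates the deposit loses close to 1% of purchasing power per year, which is the "interest doesn't make up for inflation" point above.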


The point of the story really was that most rich people (those who don't work for a living but live off capital assets) are in a position of economic power where their wealth accumulates. When the real economy grows a few percent per year but the wealthy gain 10-20% every year, their wealth comes at the expense of everyone else.

This is the trickle up economy where the few people at the top suction all the wealth to themselves. And the rate at which they acquire it exceeds the rate at which they spend it.


yep! Had to check to be sure:

    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.02s
     Running `target/debug/tvc decompress f854e0b307caf47dee5c09c34641c41b8d5135461fcb26096af030f80d23b0e5`
    === args ===
    decompress
    f854e0b307caf47dee5c09c34641c41b8d5135461fcb26096af030f80d23b0e5
    === tvcignore ===
    ./target
    ./.git
    ./.tvc

    === subcommand ===
    decompress
    ------------------
    tree ./src/empty-folder e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
    blob ./src/main.rs fdc4ccaa3a6dcc0d5451f8e5ca8aeac0f5a6566fe32e76125d627af4edf2db97


Huh, cool. What happens if you use vanilla git to clone a repo that contains empty folders? And do forges like GitHub display them properly?

You are completely right about tvc ls recomputing each hash, but I think it has to do this? A timestamp wouldn't be reliable, so the only reliable way to verify a file's contents would be to generate a hash.

In my lazy implementation, I don't even check whether the hashes match; the program reads, compresses and tries to write even the unchanged files. This is an obvious area to improve performance. I've noticed that git speeds up object lookups by generating two-letter directories from the first two characters of each hash, so objects aren't actually stored as `.git/objects/asdf12ha89k9fhs98...`, but as `.git/objects/as/df12ha89k9fhs98...`.
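That fan-out layout can be sketched like this (the function name is illustrative, not tvc's or git's actual code):

```rust
use std::path::PathBuf;

// Derive a git-style object path by splitting off the first two hex
// characters of the hash as a subdirectory, so no single directory
// holds every object.
fn object_path(hash: &str) -> PathBuf {
    let (dir, file) = hash.split_at(2);
    PathBuf::from(".git/objects").join(dir).join(file)
}

fn main() {
    let p = object_path("fdc4ccaa3a6dcc0d5451f8e5ca8aeac0f5a6566fe32e76125d627af4edf2db97");
    println!("{}", p.display());
}
```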

> why were TOML files not considered

I'm just not that familiar with TOML. Maybe that would be a better choice! I saw another commenter who complained about YAML. Though I would argue that the choice doesn't really matter to the user, since you would never actually write a commit object or a tree object by hand. These files are generated by git (or tvc), and only ever read by git/tvc. When you run `git cat-file <hash>`, you have to add the `-p` flag (--pretty) to render it in a human-readable format, and at that point it's just a matter of taste whether it's shown in YAML/TOML/JSON/XML/some special format.


> A timestamp wouldn't be reliable

I agree, but I'm still iffy on reading all files (already an expensive operation) in the repository, then hashing every one of them, every time you do an ls or a commit. I took a quick look and git seems to check whether it needs to recalculate the hash based on a combination of the modification timestamp and if the filesize has changed, which is not foolproof either since the timestamp can be modified, and the filesize can remain the same and just have different contents.

I'm not too sure how to solve this myself. Apparently this is a known thing in git and is called the "racy git" problem: https://git-scm.com/docs/racy-git/ But to be honest, perhaps I'm biased from working in a large repository, but I'd rather take the tradeoff of not rehashing often than suffer the rare case of a file being changed without modifying its timestamp, whilst remaining the same size. (I suppose this might have security implications if an attacker were to place such a file into my local repository, but at that point, having them have access to my filesystem is a far larger problem...)
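A minimal sketch of that mtime+size heuristic (the struct and function names are illustrative, not git's actual index code):

```rust
use std::fs;
use std::io;
use std::time::SystemTime;

// What we remembered about the file the last time we hashed it.
struct StatCache {
    mtime: SystemTime,
    size: u64,
}

// Rehash only when the size or mtime differ from the cached values.
// Like git's heuristic, this misses a same-size edit that also
// preserves the timestamp (the "racy git" case).
fn needs_rehash(path: &str, cached: &StatCache) -> io::Result<bool> {
    let meta = fs::metadata(path)?;
    Ok(meta.len() != cached.size || meta.modified()? != cached.mtime)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("racy_demo.txt");
    fs::write(&path, "hello")?;
    let meta = fs::metadata(&path)?;
    let cached = StatCache { mtime: meta.modified()?, size: meta.len() };

    // Unchanged file: the stat check says we can skip hashing.
    assert!(!needs_rehash(path.to_str().unwrap(), &cached)?);

    // Grow the file: the size difference alone triggers a rehash,
    // regardless of timestamp granularity.
    fs::write(&path, "hello world")?;
    assert!(needs_rehash(path.to_str().unwrap(), &cached)?);
    Ok(())
}
```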

> I'm just not that familiar with toml... Though I would argue that the choice doesn't really matter to the user, since you would never actually write...

Again, I agree. At best, _maybe_ it would be slightly nicer for a developer or a power user debugging an issue, if they prefer the TOML syntax, but ultimately it does not matter much what format it is in. I mainly asked out of curiosity, since your first thoughts were to use YAML or JSON, when (completely anecdotally) I see most Rust devs prefer TOML, probably because of familiarity with Cargo.toml. Which, by the way, I see you use too in your repository (as is to be expected with most Rust projects), so I suppose you must be at least a little familiar with it, at least from a user's perspective. But I suppose you likely have even more experience with YAML and JSON, which is why they came to mind first.


> ...based on a combination of the modification timestamp and if the filesize has changed

Oh, that is interesting. I feel like the only way to get a better and more reliable solution would be to have the OS generate a hash each time a file changes, and store it in the file's metadata. This seems like a reasonable feature for an OS to me, but I don't think any OS does this. Also, it would force programs to rely on whichever hashing algorithm the OS uses.


>... have the OS generate a hash each time the file changes...

I'm not sure I would want this either, tbh. If I have a 10GB file on my filesystem and I want to fseek to a specific position and change a single byte, I would probably not want the OS to re-hash the entire file, which could take far longer than not hashing it at all. (Or maybe it's fine, and it's fast enough on modern systems to do this every time any program modifies a file; I don't know how much it would impact performance.)

Perhaps a higher-resolution timestamp from the OS might help though, decreasing the chance of two versions of a file having the exact same timestamp (unless one was specifically crafted to be so).


This page is beautiful!

Bookmarked for later


Interestingly, I looked at GitHub Insights and found that this repo had 49 clones, from 28 unique cloners, before I published this article. I definitely did not clone it 49 times, and certainly not as 28 unique users. It's unlikely that the handful of friends who follow me on GitHub all cloned the repo. So I can only speculate that there are bots scraping new public GitHub repos and training on everything.

Maybe that's obvious to most people, but it was a bit surprising to see it myself. It feels weird to think that LLMs are being trained on my code, especially when I'm painfully aware of every corner I'm cutting.

The article doesn't contain any LLM output. I use LLMs to ask for advice on coding conventions (especially in rust, since I'm bad at it), and sometimes as part of research (zstd was suggested by chatgpt along with comparisons to similar algorithms).


Particularly on GitHub, might not even be LLMs, just regular bots looking for committed secrets (AWS keypairs, passwords, etc.)

I self-host Gitea. The instance is crawled by AI crawlers (I checked the IPs). They never clone; they just browse and take the code directly from there.

For reference, this is how I do it in my Caddyfile:

    (block_ai) {
        @ai_bots {
            header_regexp User-Agent (?i)(anthropic-ai|ClaudeBot|Claude-Web|Claude-SearchBot|GPTBot|ChatGPT-User|Google-Extended|CCBot|PerplexityBot|ImagesiftBot)
        }

        abort @ai_bots
    }

Then, in a specific app block, include it via:

    import block_ai

Most of them pretend to be real users though, and don't identify themselves with their user agent strings.

I have almost exactly this in my own caddyfile :-D The order of the items in the regex is a little different but mostly the same items. I just pulled them from my web access logs over time and update it every once in a while.

i run a cgit server on an r720 in my apartment with my code on it and that puppy screams whenever sam wants his code

blocking openai ips did wonders for the ambient noise levels in my apartment. they're not the only ones obviously, but they're they only ones i had to block to stay sane


Have you considered putting it behind Anubis or an equivalent?

Yes, but I haven't and would prefer not to

Understandable. It's an outrage that we even have to consider such measures.

Time to start including deliberate bugs. The correct version is in a private repository.

And what purpose would this serve, exactly?

Spite.

They used to do this with maps - eg. fake islands - to pick up when they were copied.

While I think this is a fun idea, we are in such a dystopian timeline that I fear you will end up being prosecuted under a digital equivalent of various laws like "why did you attack the intruder instead of fleeing" or "you can't simply remove a squatter just because it's your house, therefore you get an assault charge."

A kind of "they found this code, therefore you have a duty not to poison their model as they take it." Meanwhile if I scrape a website and discover data I'm not supposed to see (e.g. bank details being publicly visible) then I will go to jail for pointing it out. :(


I think if we're at the point where posting deliberate mistakes to poison training data is considered a crime, we would be far far far down the path of authoritarian corporate regulatory capture, much farther than we are now (fortunately).

Look, I get the fantasy of someday pulling out my musket^W ar15 and rushing downstairs to blow away my wife^W an evil intruder, but, like, we live in a society. And it has a lot of benefits, but it does mean you don't get to be "king of your castle" any more.

Living in a country with hundreds of millions of other civilians or a city with tens of thousands means compromising what you're allowed to do when it affects other people.

There's a reason we have attractive nuisance laws and you aren't allowed to put a slide on your yard that electrocutes anyone who touches it.

None of this, of course, applies to "poisoning" llms, that's whatever. But all your examples involved actual humans being attacked, not some database.


Thanks, that was the term I was looking for: "attractive nuisance". I wouldn't be surprised if a tech company could make that case: this user caused us tangible harm and cost (training, poisoned models) and left their data out for us to consume. It's the equivalent of putting poisoned candy on a park table, your honor!

That reminds me of the protagonist of Charles Stross's novel "Accelerando", a prolific inventor who is accused by the IRS of causing millions in losses because he releases all his ideas into the public domain instead of profiting from them and paying taxes on those profits.

This has been happening before LLMs too.

I don't really get why they need to clone in order to scrape ...?

> It feels weird to think that LLMs are being trained on my code, especially when I'm painfully aware of every corner I'm cutting.

That's very much expected. That's why the quality of LLM coding agents is like it is. (No offense.)

The "asking LLMs for advice" part is where the circular aspect starts to come into the picture. Not worse than looking at StackOverflow though which then links to other people who in turn turned to StackOverflow for advice.


Cloning gets you the raw text objects directly. If you scrape the web UI you're dealing with a lot of markup overhead that just burns compute during ingestion. For training data you usually want the structure to be as clean as possible from the start.

Sure, cloning a local copy. But why clone on github?

The quality of LLM coding agents is pretty good now.
