Hacker News | timbray's comments

+1 on Newsblur. I use it every day and it has flaws but nothing that really gets in my way.


The tests for the go code at https://github.com/timbray/RFC9839 are in effect test vectors.


I want to implement this. My code is in C.

How does this help me check my implementation? I guess I could ask ChatGPT to convert your tests to my code, but that seems the long way around.


https://github.com/timbray/RFC9839/blob/main/unichars.go

I don't know Go at all but I can pretty quickly understand:

    var unicodeAssignables = []runePair{
     {0x20, 0x7E},       // ASCII
     {0xA, 0xA},         // newline
     {0xA0, 0xD7FF},     // most of the BMP
     {0xE000, 0xFDCF},   // BMP after surrogates
     {0xFDF0, 0xFFFD},   // BMP after noncharacters block
     {0x9, 0x9},         // Tab
     {0xD, 0xD},         // CR
     {0x10000, 0x1FFFD}, // astral planes from here down
     {0x20000, 0x2FFFD},
     {0x30000, 0x3FFFD},
     {0x40000, 0x4FFFD},
     {0x50000, 0x5FFFD},
     {0x60000, 0x6FFFD},
     {0x70000, 0x7FFFD},
     {0x80000, 0x8FFFD},
     {0x90000, 0x9FFFD},
     {0xA0000, 0xAFFFD},
     {0xB0000, 0xBFFFD},
     {0xC0000, 0xCFFFD},
     {0xD0000, 0xDFFFD},
     {0xE0000, 0xEFFFD},
     {0xF0000, 0xFFFFD},
     {0x100000, 0x10FFFD},
    }
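For anyone porting this to another language, here is a minimal sketch of the membership check those ranges imply. It is not the code from the repository: `buildAssignables` and `isAssignable` are hypothetical names, the field names of `runePair` are assumed, and the astral-plane ranges are generated in a loop rather than listed one by one.

```go
package main

import "fmt"

// runePair is an inclusive range of code points. The type name comes from
// the snippet above; the field names lo and hi are assumptions.
type runePair struct {
	lo, hi rune
}

// buildAssignables reconstructs the table quoted above. The BMP ranges are
// spelled out; planes 1 through 16 each contribute xx0000 to xxFFFD.
func buildAssignables() []runePair {
	ranges := []runePair{
		{0x9, 0x9}, {0xA, 0xA}, {0xD, 0xD}, // tab, newline, CR
		{0x20, 0x7E},     // ASCII
		{0xA0, 0xD7FF},   // most of the BMP
		{0xE000, 0xFDCF}, // BMP after surrogates
		{0xFDF0, 0xFFFD}, // BMP after noncharacters block
	}
	for plane := rune(1); plane <= 0x10; plane++ {
		base := plane << 16
		ranges = append(ranges, runePair{base, base + 0xFFFD})
	}
	return ranges
}

var assignables = buildAssignables()

// isAssignable reports whether r falls inside any of the ranges.
// A linear scan is enough for 23 ranges.
func isAssignable(r rune) bool {
	for _, p := range assignables {
		if r >= p.lo && r <= p.hi {
			return true
		}
	}
	return false
}

func main() {
	// Boundary values like these make handy test vectors for a port.
	for _, r := range []rune{0x0, 0x9, 0x7F, 0xD800, 0xFDD0, 0xFFFE, 0x1F600, 0x10FFFD, 0x10FFFE} {
		fmt.Printf("U+%04X assignable: %v\n", r, isAssignable(r))
	}
}
```

The range boundaries (each `lo`, `hi`, and the code points just outside them) are the natural cases to carry over to a C implementation's tests.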


Yeah, for example it's how Java stores strings to this day. But I think it's more or less never transmitted over the network.


Even if all wire format encoding is utf8, you wouldn't be able to decode these new high codepoints into systems that are semantically utf16. Which is Java and JS at least, hardly "obsolete" targets to worry about.

And even Swift is designed so the strings can be utf8 or utf16 for cheap objc interop reasons.

Discarding compatibility with 2 of the top ~5 most widely used languages kind of reflects how disconnected the author of this is from the technical realities of whether any fixed utf8 would be feasible outside of the most toy use cases.
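To make the utf16 ceiling concrete, here is a small Go sketch using the standard `unicode/utf16` package. The helper name `toUTF16` is hypothetical. A surrogate pair carries 10 + 10 bits on top of 0x10000, so the largest code point utf16 can express is U+10FFFF, and anything above that has no encoding at all.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// toUTF16 returns the UTF-16 code units for r, or false when r cannot be
// represented: values above U+10FFFF, negative values, and lone surrogates.
func toUTF16(r rune) ([]uint16, bool) {
	switch {
	case r < 0 || r > 0x10FFFF:
		return nil, false
	case r >= 0xD800 && r <= 0xDFFF:
		return nil, false
	case r < 0x10000:
		return []uint16{uint16(r)}, true
	}
	hi, lo := utf16.EncodeRune(r)
	return []uint16{uint16(hi), uint16(lo)}, true
}

func main() {
	// The largest value a surrogate pair can carry: two 10-bit payloads
	// added to the 0x10000 offset.
	maxPair := 0x10000 + 0x3FF<<10 + 0x3FF
	fmt.Printf("max via surrogate pair: U+%X\n", maxPair)

	for _, r := range []rune{'A', 0x1F600, 0x10FFFF, 0x110000} {
		units, ok := toUTF16(r)
		fmt.Printf("U+%X -> %X (representable: %v)\n", r, units, ok)
	}
}
```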


Relevant: https://www.ietf.org/archive/id/draft-bray-unichars-15.html - IETF approved and will have an RFC number in a few weeks.

Tl;dr: Since we're kinda stuck with Uncorrected UTF-8, here are the "characters" you shouldn't use. Includes a bunch of stuff the OP mentioned.


The most important bit of that is the “Unicode Assignables” subset <https://www.ietf.org/archive/id/draft-bray-unichars-15.html#...>:

  unicode-assignable =
     %x9 / %xA / %xD /               ; useful controls
     %x20-7E /                       ; exclude C1 controls and DEL
     %xA0-D7FF /                     ; exclude surrogates
     %xE000-FDCF /                   ; exclude FDD0 nonchars
     %xFDF0-FFFD /                   ; exclude FFFE and FFFF nonchars
     %x10000-1FFFD / %x20000-2FFFD / ; (repeat per plane)
     %x30000-3FFFD / %x40000-4FFFD /
     %x50000-5FFFD / %x60000-6FFFD /
     %x70000-7FFFD / %x80000-8FFFD /
     %x90000-9FFFD / %xA0000-AFFFD /
     %xB0000-BFFFD / %xC0000-CFFFD /
     %xD0000-DFFFD / %xE0000-EFFFD /
     %xF0000-FFFFD / %x100000-10FFFD


This is really helpful - thanks. I write a CRDT library for text editing. I should probably restrict the characters that I transport to the "Unicode Assignables" subset. I can't think of any sensible reason to let people insert characters like U+0000 into a collaborative text document.
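A rough sketch of what that restriction could look like at the point where text enters the document, written in Go to match the reference code. The names `isAssignable` and `checkInsert` are hypothetical and not taken from any CRDT library. The check follows the ABNF above and also rejects invalid UTF-8, since Go would otherwise decode bad bytes as U+FFFD, which is itself in the subset.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isAssignable follows the unicode-assignable ABNF quoted above.
func isAssignable(r rune) bool {
	switch {
	case r == 0x9 || r == 0xA || r == 0xD: // useful controls
		return true
	case r >= 0x20 && r <= 0x7E:
		return true
	case r >= 0xA0 && r <= 0xD7FF:
		return true
	case r >= 0xE000 && r <= 0xFDCF:
		return true
	case r >= 0xFDF0 && r <= 0xFFFD:
		return true
	case r >= 0x10000 && r <= 0x10FFFF:
		// Every astral plane excludes its last two code points.
		return r&0xFFFF <= 0xFFFD
	}
	return false
}

// checkInsert returns an error describing the first problem found in s,
// or nil when every code point is in the Unicode Assignables subset.
func checkInsert(s string) error {
	if !utf8.ValidString(s) {
		return fmt.Errorf("insert is not valid UTF-8")
	}
	for i, r := range s {
		if !isAssignable(r) {
			return fmt.Errorf("disallowed code point U+%04X at byte offset %d", r, i)
		}
	}
	return nil
}

func main() {
	for _, s := range []string{"hello\tworld\n", "a\x00b", "caf\u00e9"} {
		fmt.Printf("%q -> %v\n", s, checkInsert(s))
	}
}
```

Whether to reject the whole insert, as here, or silently drop the offending code points is a design choice that depends on how the editor wants to surface errors.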


That crossed my mind when I saw the piece show up on HN. But I think they're already running more or less at capacity.


You're right, but I didn't realize that till later. Except that the original "Parable of the Sower" was from Jesus, not Olivia. But I also thought of Olivia's first.


A high-quality leather sofa these days is closer to $15K than $1500, ouch.


I dunno, my Wikipedia entry is about right.


Same @ my tests w/ video game trivia questions: they might not be extremely popular facts and most humans would struggle to answer them ad-hoc but the facts are in Wikipedia and I'm pretty certain Wikipedia is in the 15T tokens of the training material.


Wow, didn't know about that, thanks. But the query has to be "timbray", not "tim bray".


Got me jobs, helped me hire other people, got me a ticket to some of the big technology debates and then helped me win one or two. Gave me a place to write cat obituaries and heavy-metal reviews. Launched Feb 27, 2003 (20 years last month) and I haven't regretted it for a microsecond.

[https://tbray.org/ongoing/]

