I've given IPv6 a try in a company with a few tens of servers across 2 DCs, an office plus an additional location, and 3 ISPs in total. For me the real challenge is not just a different way to write an IP address or the question of NAT. The challenge is that IPv6 changes a lot of unexpected things:
- Our ISPs support IPv6, but routing quality is way worse than for IPv4, including occasional inability to reach some networks and higher latency than over IPv4. I had to create tickets for such issues and came to understand that most probably they just don't have IPv6 BGP sessions to all of the upstream providers they connect to.
- How should the VPN (an employee / road-warrior setup) be configured, when from a routing perspective you don't need a VPN to connect from your home to the office? Assume both ends have proper IPv6 connectivity and all devices in the office, as well as your laptop, have globally addressable IP addresses. An employee can have IPv4 or dual stack at home, while the office is dual stack. Very confusing. It looks like Fortigate doesn't have an answer either and decided not to support this case.
- You have to be careful with site-to-site VPNs, since even internal services like databases are now globally addressable. You really need proper firewall rules / routing policies so you don't leak unencrypted packets over the internet.
- SLAAC is cool but doesn't provide DNS configuration (there is RFC 8106, but is it supported by all OSes?). You need DHCPv6 for that. You have to choose: use only DHCPv6, use SLAAC + DHCPv6, or just rely on the fact that DNS will be provided by DHCP over IPv4 in a dual-stack setup.
- The way of providing a highly available gateway address in a network is different. You use router advertisements, where you can set priorities (rough sketch after this list). That is actually much better than any VIP mechanism (no issues with MAC table updates, etc.), but you need to know about it.
- OSPF works a bit differently. For example, there is no authentication in router communication in OSPF for IPv6 itself; you are supposed to use IPsec.
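To make the router-advertisement point concrete, here is a rough Python sketch using scapy's IPv6 neighbor-discovery layers (my own illustration; in practice radvd, FRR or the router itself would send these). It emits an RA with a high default-router preference plus an RDNSS option, i.e. the RFC 8106 way of pushing DNS servers without DHCPv6. The addresses and interface name are made up.

```python
from scapy.all import IPv6, ICMPv6ND_RA, ICMPv6NDOptRDNSS, send

# Router Advertisement to all nodes: prf=1 marks this router as "high"
# preference (RFC 4191), which is how primary/backup gateways are expressed.
ra = (
    IPv6(dst="ff02::1")
    / ICMPv6ND_RA(prf=1, routerlifetime=1800)
    # RDNSS option (RFC 8106): advertise a recursive DNS server via RA,
    # so hosts that honour it don't need DHCPv6 just for DNS.
    / ICMPv6NDOptRDNSS(dns=["2001:db8:0:1::53"], lifetime=600)
)

send(ra, iface="eth0")  # needs root; interface name is an example
```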
> You have to be careful with site-to-site VPNs, since even internal services like databases are now globally addressable. You really need proper firewall rules / routing policies so you don't leak unencrypted packets over the internet.
Sure, you can. But one advantage of IPv6 is that your addresses are globally unique. If you ever need to connect two networks that were created independently (like two different companies), unique addresses, even internal ones, are very helpful.
Sure, but ULAs within your network are still an extremely useful tool for dealing with annoying problems like renumbering if you don't have PI address space (and a majority of businesses won't). It's not an ideal solution, but unless you're a large enough shop to run multi-homed BGP sessions, the IANA doesn't give a shit about how much headache changing ISPs or backup connections causes, and a new allocation isn't on offer; so ULAs fill that gap and give you a non-changing, probably-unique (if you follow the proper process to generate your prefix) address space.
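For anyone curious, "the proper process" is the RFC 4193 pseudo-random algorithm. A minimal Python sketch of it (my own illustration, not a vetted implementation):

```python
import hashlib
import ipaddress
import time
import uuid

def generate_ula_prefix():
    """Roughly the RFC 4193 recipe: hash a timestamp plus a host identifier
    and keep the least-significant 40 bits as the Global ID."""
    seed = time.time_ns().to_bytes(8, "big") + uuid.getnode().to_bytes(8, "big")
    global_id = int.from_bytes(hashlib.sha1(seed).digest()[-5:], "big")
    # fd00::/8 | 40-bit Global ID | 16-bit Subnet ID (left as zero) -> a /48
    prefix_int = (0xFD << 120) | (global_id << 80)
    return ipaddress.IPv6Network((prefix_int, 48))

print(generate_ula_prefix())  # e.g. fdxx:xxxx:xxxx::/48, stable once you pick it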
There's no reason you wouldn't use a VPN in exactly the same way. Just because devices have routable addresses doesn't mean they will allow connections from arbitrary sources. You can configure them to only allow connections from your IPv6 VPN pool.
On the other hand, using globally unique IPv6 addresses means your VPN won't conflict with the source network users are connecting from.
SLAAC can provide DNS configuration; see RFC 6106, for instance.
You can do HA in the same way with VRRP or whatever too; as you point out, the built-in mechanisms are generally better, but you don't necessarily have to use them.
The chance of traffic leaking still exists, and does happen a lot with legacy IP too. The difference is that with v6 the traffic will be routed back to you, so you will be able to see it on your border firewalls. With legacy IP, the traffic will be dropped by the ISP or absorbed by the local network so you don't know it's happening and consequently you probably won't try to do anything about it.
With BGP (especially anycast) you don't have direct influence over where a request lands. You can steer traffic with techniques like AS-path prepending or per-session preferences (I'm not a BGP expert), but eventually it's not the CDN that decides where the request is routed. It's decided by the routers of the client's ISP and of the backbone networks, each making the decision themselves (shortest AS path from their point of view plus the BGP policies set by network operators), not by you. You can't, for example, split traffic between DCs in a specific proportion (like 30% here and 70% there). You can't split by anything other than network properties, e.g. you can't forward a request to the region that has a better chance of already having this content in its cache.
With DNS and dynamic responses you are directing the request to a specific DC, even a specific server, on almost every request. It may be one dedicated to this traffic type (a live stream is different from static images, etc.). Your DNS server can take the hostname ("www.google.com") into consideration - BGP doesn't even know the hostname in the URL. If you wanted to do that with BGP, you would need to place specific content in a dedicated /24 subnet, and that is impossible considering how few IPv4 addresses are available.
BGP doesn't even consider network latency or current network load. The CDN knows the load on its machines and network links and where given content is placed. The bottleneck may be storage, network or CPU, and it differs per site and per content type. They need to direct traffic on a per-request basis considering all of that, and at least the hostname from the URL. That's why DNS is used first.
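To illustrate the kind of control DNS gives you (and BGP doesn't), here is a toy sketch of weighted answer selection in a dynamic authoritative server; the DC names, weights and addresses are all made up:

```python
import random

# Hypothetical per-DC weights: the authoritative server picks one answer per
# query, which approximates a 30/70 split -- something BGP path selection
# cannot express.
DC_POOLS = {
    "dc-eu": {"weight": 30, "addrs": ["2001:db8:1::10", "2001:db8:1::11"]},
    "dc-us": {"weight": 70, "addrs": ["2001:db8:2::10"]},
}

def pick_answer():
    names = list(DC_POOLS)
    weights = [DC_POOLS[n]["weight"] for n in names]
    dc = random.choices(names, weights=weights, k=1)[0]
    return random.choice(DC_POOLS[dc]["addrs"])

picks = [pick_answer() for _ in range(10_000)]
share = sum(a.startswith("2001:db8:2") for a in picks) / len(picks)
print(f"share of answers pointing at dc-us: {share:.1%}")  # ~70%
```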
Varnish (HTTP caching only) also uses one thread per client. I believe worker threads are used to handle requests, while a dedicated thread handles all the idle connections between requests using epoll(). Also, the per-thread stack size is lowered so thousands of threads don't occupy a massive amount of memory.
Single-threaded HTTP servers have their own issues. If the bottleneck is the storage, then the lack of async open()/stat() and some other calls is problematic. We feel that pain when serving hundreds of millions of files (long-tail content) from slow storage with nginx. For that reason you can configure nginx to spawn multiple worker processes.
I thought nginx epolls file I/O too, alongside socket I/O. Or did you find that the first call to open() or stat() stalls, while reads/writes after that continue normally?
File IO isn’t epoll-able. The operations are always blocking.
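A minimal, Linux-only demonstration of that (my own sketch): epoll happily takes sockets and pipes but rejects a regular file with EPERM, which Python surfaces as PermissionError, so reads on regular files always block the calling thread.

```python
import select
import tempfile

with tempfile.TemporaryFile() as f:
    ep = select.epoll()
    try:
        # epoll_ctl() on a regular file fails with EPERM
        ep.register(f.fileno(), select.EPOLLIN)
    except PermissionError as e:
        print("regular file rejected by epoll:", e)
    finally:
        ep.close()
```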
Those could however be offloaded onto a thread pool, so the blocking doesn't affect other requests being processed by the same nginx worker. Nginx however only partially does that - whole-file read and write operations are offloaded, while a whole bunch of other I/O (stat, open, close) is executed on the main thread. I guess that's due to implementation challenges - one can't just make one operation async, one also needs to make every operation that uses those calls async.
I have some experience serving static content and working with CDNs. Here is what I find interesting / unique here:
- They are not using the OS page cache or any memory caching; every request is served directly from disks. This seems possible only when requests are spread across many NVMe disks, since a single high-end NVMe like the Micron 9300 PRO has a max read speed of about 3.5 GB/s (roughly 28 Gbps) - far less than 800 Gbps (back-of-envelope after this list). It looks like that works fine for long-tail content, but what about new hot content everybody wants to watch on release day? Do they spread the same content over multiple disks for this purpose?
- Async I/O resolves the issue of an nginx process stalling on a disk read, but only after you've already opened the file. Depending on the FS, the number of files, other FS activity and the directory structure, opening the file can block for a significant time, and there is no async open() AFAIK. How do they resolve that? Are we assuming the inode cache contains all inodes and open() time is insignificant? Or are they configuring nginx with a large open-file cache?
- TLS for streamed media became necessary because browsers started to complain about non-TLS content. But that makes things sooo complicated, as we see in the presentation (kTLS was 50% of CPU usage before encryption was offloaded to the NIC). One has to remember that the content is most probably already encrypted (DRM); we just add another layer of encryption / authentication. TLS for media segments makes so little sense IMO.
- When you rely on encryption or TCP offloading by the NIC, you are stuck with whatever your NIC supports. I guess no HTTP/3 over UDP or fancy congestion-control optimizations in TCP until the vendor somehow implements them in hardware.
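The back-of-envelope behind the first bullet, in case anyone wants to check my numbers (drive throughput is the vendor's sequential-read spec):

```python
# How many Micron 9300 PRO-class drives (~3.5 GB/s sequential read) would it
# take to feed an 800 Gbps box from disk alone, with no page cache at all?
drive_gbps = 3.5 * 8        # ~28 Gbit/s per drive
target_gbps = 800
print(target_gbps / drive_gbps)   # ~28.6 -> roughly 29 drives
```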
Responding to a few points. We do indeed use the OS page cache. The hottest files remain in cache and are not served from disk. We manage what is cached in the page cache and what is directly released using the SF_NOCACHE flag.
I believe our TLS initiative was started before browsers started to complain, and was done to protect our customers' privacy.
We have lots of fancy congestion optimizations in TCP. We offload TLS to the NIC, *NOT* TCP.
Can I ask whether your whole catalogue fits on a single server, so that content is simply replicated everywhere, or whether there is some layer above that directs requests to the specific group of servers storing the requested content? I assume the described machine is not just part of a tiered cache setup, since I don't think nginx is capable of complex caching scenarios.
How is this possible? If TCP is done on the host and TLS on the NIC, the data will need to pass through the CPU, right? But the slides show the CPU fully bypassed for data.
The CPU gets the I/O completion for the read, and is in charge of the RAM address where it was stored, but it doesn't need to read that data...
Modern NICs use packet descriptors that let you more or less say "take N bytes from this address, then M bytes from some other address", etc., to form the packet. So the kernel builds the TCP/IP header and then tells the NIC to send it together with the next bytes of data (and marks it for TLS, however that's done).
A Micron 9300 Pro is getting rather long in the tooth. They are using PCIe gen 4 drives that are twice as fast as the Micron 9300.
My own testing on single-socket systems that look rather similar to the ones they are using suggests it is much easier to push many 100 Gbit interfaces to their maximum throughput without caching. If your working set fits in cache, that may be different. If you have a legit need for sixteen 14 TiB (15.36 TB) drives, you won't be able to fit that amount of RAM into the system. (Edit: I saw a response saying they do use the cache for the most popular content. They seem to explicitly choose what goes into the cache, not allowing a bunch of random stuff to keep knocking the most important content out of it. That makes perfect sense and is not inconsistent with my assertion that simply hoping a half-TiB cache will do the right thing with 224 TiB of content is unrealistic.)
TLS is probably also to keep the cable company from snooping on the Netflix traffic, which would allow the cable company to more effectively market rival products and services. If there's a vulnerability in the decoders of encrypted media formats, putting the content in TLS prevents a MITM from exploiting that.
From the slides, you will see that they started working with Mellanox on this in 2016 and got the first capable hardware in 2020, with iterations since then. Maybe they see value in the engineering relationship to get the HW acceleration that they value into the hardware components they buy.
Disclaimer: I work for NVIDIA who bought Mellanox a while back. I have no inside knowledge of the NVIDIA/Netflix relationship.
Given one can specify arbitrary offsets for sendfile(), it's not clear to me that there must be any kind of O(k > 1) relationship between open() and sendfile() calls: As long as you can map requested content to a sub-interval of a file, you can co-mingle the catalogue into an arbitrarily small number of files, or potentially even stream directly off raw block devices.
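A toy sketch of that idea, in case it helps (the packed-file layout and names are made up): open one big pack file once and serve each object as an offset/length pair via sendfile(), so there is no per-request open().

```python
import os
import socket

# object name -> (byte offset, length) inside pack.bin; a real catalogue
# would come from an index, this one is invented for illustration.
CATALOGUE = {
    "title-123/segment-0001.m4s": (0, 2_000_000),
    "title-123/segment-0002.m4s": (2_000_000, 1_800_000),
}

pack_fd = os.open("pack.bin", os.O_RDONLY)  # opened once, reused for every request

def serve(conn: socket.socket, name: str) -> None:
    offset, length = CATALOGUE[name]
    sent = 0
    while sent < length:
        # Linux signature: os.sendfile(out_fd, in_fd, offset, count)
        n = os.sendfile(conn.fileno(), pack_fd, offset + sent, length - sent)
        if n == 0:
            break
        sent += n
```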
AFAIK no. The point of DRM is to prevent recording / playing the media on a device without a decryption key (authorization). So the goal is different from TLS, which is used by the client to ensure the content is authentic, unaltered in transit and not readable by a man-in-the-middle.
But do we really need such protection for a TV show?
"Metadata" in HLS / DASH is a separate HTTP request which can be served over HTTPS if you wish. Then it can refer to media segments served over HTTP (unless your browser / client doesn't like "mixed content").
FWIW, neither does the TLS layer: because the video is all chunked into fixed-time-length segments, each video causes a unique signature of variable-byte-size segments, making it possible to determine which Netflix movie someone is watching based simply on their (encrypted) traffic pattern. Someone built this for YouTube a while back and managed to get it up to like 98% accuracy.
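Here's a toy sketch of how such a classifier works, just to make the idea concrete; the titles and segment sizes are invented, and the real attacks use more robust matching than this near-exact comparison:

```python
# Every title yields a characteristic sequence of segment sizes, so an
# observer who only sees encrypted transfer sizes can match them against a
# precomputed fingerprint database.
FINGERPRINTS = {
    "title-A": [1_812_331, 1_645_002, 1_990_871, 1_401_554],
    "title-B": [2_104_220, 1_322_909, 1_876_113, 1_554_760],
}

def closest_title(observed, tolerance=5_000):
    for title, sizes in FINGERPRINTS.items():
        if len(sizes) == len(observed) and all(
            abs(a - b) <= tolerance for a, b in zip(sizes, observed)
        ):
            return title
    return None

# TLS adds a little framing overhead, but the sizes still stand out.
print(closest_title([1_812_400, 1_644_950, 1_990_900, 1_401_500]))  # title-A
```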
Did TLS 1.3 fix this with content-length hiding? Doesn't it add support for variable-length record padding that could prevent an attacker from measuring the plaintext length? Do any major servers support it?
I had a brief stint managing such an IT team, after which I gained big respect for them. I've never seen such a low budget-to-workload ratio in my IT career. Everybody expects them to fix "trivial" problems immediately, but the reality in my case was:
- 2 "IT guys" for > 500 headcount company
- If you look at an IT guy's instant messaging app, you will see many unread notifications, all the time. Without tickets it's impossible to properly prioritize and divide work. "Tickets for everything" was one of my first changes.
- They are responsible for the broadest set of technologies / vendors I've ever seen. That includes: networking (including cabling, switches, wireless, routing, VPN, firewalls), laptops, office apps, conferencing hardware (and software), IT security, accounting and warehouse software (including their MS SQL databases), Windows servers, virtualization, managing email accounts and maaaany more.
- Without tickets they can't show the real amount of work they are doing, so justifying budget and staffing adjustments to match the need is a big challenge. On the other hand, everybody expects trivial problems to be fixed without any forms (and that's most of their work!).
- Upper management doesn't perceive them as value, only as a cost.
- Trivial problems could be fixed faster, but that would require a stock of spare parts. This is bigger than you may think: a few generations of laptops in use, office hardware, spare monitors, docking stations, etc. That requires office space for a small "warehouse", which is not free. How do you justify this need considering the previous points?
- Many people fix "trivial problems" themselves: replacing a cable, "borrowing" a cable/mouse/monitor from a different desk. That creates more work for the IT guys: more issues caused by incomplete office hardware.
Note that they are using a 10GbE network, while even a single NVMe disk used in the test has several times more bandwidth, so the results are constrained by the network hardware.
Building on-premise infrastructure a year ago, we went for a 2x25Gbps network. That, or even 100Gbps, seems to be the current "mainstream", and 10GbE is definitely not enough for the NVMe speeds available today.
I categorize these as bugs caused by data inconsistency due to data duplication. That includes:
- Using asynchronous database replication and reading data from database slaves
- Duplicating the same data across multiple database tables (possibly for performance reasons)
- Having an additional system that duplicates some data. For example, being in the middle of rewriting a legacy system - a process split into phases, so functionality between the new and old systems overlaps for some period of time.
Based on my experience I always assume that inconsistency is unavoidable when the same information is stored in more than one place.
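A toy simulation of the first bullet (asynchronous replication plus reads from a replica), just to show how naturally the inconsistency appears; the data and lag are of course made up:

```python
import threading
import time

primary = {}
replica = {}

def replicate_forever(lag_s=0.5):
    # crude stand-in for an async replication stream
    while True:
        time.sleep(lag_s)
        replica.update(primary)

threading.Thread(target=replicate_forever, daemon=True).start()

primary["balance:42"] = 100                                      # write hits the primary
print("replica right after write:", replica.get("balance:42"))   # None (stale)
time.sleep(1)
print("replica a bit later:", replica.get("balance:42"))         # 100
```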
If you have your own domain, then move it to one of the listed DNS providers and use DNS challenge with ACME:
We are using certbot + Cloudflare this way. There is no HTTP request; certbot creates a temporary DNS record through the Cloudflare API to satisfy the challenge, so you can run the script anywhere. Then copy the cert to the device that needs it.
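Roughly, the flow looks like the sketch below (domain, e-mail and paths are examples; it assumes the certbot-dns-cloudflare plugin is installed and the credentials file holds a Cloudflare API token):

```python
import shutil
import subprocess

DOMAIN = "internal.example.com"
CREDS = "/root/.secrets/cloudflare.ini"   # contains dns_cloudflare_api_token=...

# The dns-cloudflare plugin creates the _acme-challenge TXT record via the
# Cloudflare API, so this can run on any machine, not on the target device.
subprocess.run(
    [
        "certbot", "certonly",
        "--dns-cloudflare",
        "--dns-cloudflare-credentials", CREDS,
        "-d", DOMAIN,
        "--non-interactive", "--agree-tos", "-m", "admin@example.com",
    ],
    check=True,
)

# Then ship the issued cert/key to whatever device needs it (scp, Ansible, ...).
shutil.copy(f"/etc/letsencrypt/live/{DOMAIN}/fullchain.pem", "/srv/deploy/")
shutil.copy(f"/etc/letsencrypt/live/{DOMAIN}/privkey.pem", "/srv/deploy/")
```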
Of the described mistakes, two come from a lack of understanding of how exactly DNS works. But I agree it is in fact hard, see [1].
1. "This strict DNS spec enforcement will reject a CNAME record at the apex of a zone (as per RFC-2181), including the APEX of a sub-delegated subdomain. This was the reason that customers using VPN providers were disproportionately"
- This is non-intuitive and many people are surprised by it. You cannot create any subdomain (even www.domain.tld) if you created "domain.tld CNAME something...". It looks like not every server/resolver enforces that restriction.
2. "based on expert advice, our understanding at the time was that DS records at the .com zone were never cached, so pulling it from the registrar would cause resolvers to immediately stop performing DNSSEC validation." - like any other record, they can be cached. DNS has also negative caching (caching of "not found responses". Moreover there are resolvers that allow configuring minimum TTL that can be higher that what your NS servers returns (like unbound - "cache-min-ttl" option) or can be configured to serve stale responses in case of resolution failures after the cached data expires [2]. That means returning TTL of "1s" will not work as you expect.
My (basic and conservative) mental model that "in DNS, everything including the lack of presence of a thing can be cached" is why I'm very cautious before rolling out anything from DKIM to DNSSEC. A deep understanding of specifications is vital. I'm somewhat surprised an organization of Slack's scale didn't have a consultant on the level of "I designed DNSSEC" on hand for this.
DNS is a bit like network engineering, in that simple errors have a tendency to cause large impacts that prevent trial and error. Before working as a sysadmin I thought that doing experimental lab setups was something only researchers and students did, but when you have an old system up and running, it can be quite difficult to get in there and make changes unless you are very sure about what you are doing.
Like networking, there can also be existing protocol errors and plain broken things that have, for one reason or another, been seemingly working for decades without causing a problem. Internet flag days are one of those things that poke at those problems, and maybe one day we will see a test for CNAME at the apex.
It's worth noting that this by itself is a reason not to do ambitious security things (and a global PKI is nothing if not ambitious) at the DNS layer. It's an extension of the end-to-end argument, or at least of the logic used in the Saltzer, Reed, and Clark paper: because it's difficult and error-prone to deploy policy code in the core of the network (here: the "conceptual" core of the protocol stack), we should work to move that policy further up the stack, closer to the applications that actually care about it.
The Saltzer, Reed, and Clark paper, if I'm remembering right, even calls out security specifically as one of those things you don't want to be doing in the middle of the network.
When people start to implement security at the BGP layer, which will likely happen some time soon, we will see things break. We will also see BGP fail if we don't do anything, as the protocol is ancient, has an untold amount of undefined behavior between different devices and suppliers, and is extremely fragile.
There have been many who have suggested that we should just scrap the whole thing called The Internet and start from scratch. It would be safer, but I don't think it is a serious alternative. DNS, BGP, IP, UDP, TCP and HTTP, to name a few, are seeing incremental changes, and the cost is preferable to the alternative of doing nothing. Ambitious security things would be much less costly if we had working redundancy in place, which is one of those things that flag days tend to illustrate. With good redundancy, people won't notice when HTTP becomes HTTP/2 and later HTTP/3. It also helped Google's development of QUIC that they controlled both ends of the connection.
> There have been many who have suggested that we should just scrap the whole thing called The Internet and start from scratch. It would be safer, but I don't think it is a serious alternative.
Yep - in this, as in many things in life, expert knowledge is knowing what experiments and tests you should be doing as much as which ones you can avoid.
Generally agree; for basic configuration it just works. But I was planning to use IPv6 in some network segments in my company. Providing a proper VPN (road-warrior style) for employees to access such a network isn't simple with IPv6, where every address is routable over the internet. Moreover, the Fortigate devices we have don't offer such functionality.