Yup, both have their uses. If you use a clipboard manager or have the clipboard synchronized between devices/remote desktops/VMs, the primary selection comes in handy for stuff you don't exactly want saved to disk, crossing VM boundaries, or transmitted over the network. I use middle-click pasting primarily for its separate buffer.
Virtualization.framework was introduced in Big Sur. It builds on top of Hypervisor.framework and is essentially Apple's QEMU (in some ways quite literally, it implements QEMU's pvpanic protocol for example). Before QEMU and other VMMs gained ARM64 Hypervisor.framework support, it was the only way to run virtual machines on ARM Macs and still is the only official way to virtualize ARM macOS.
The new Tahoe framework you're probably thinking of is Containerization, which is a WSL2-esque wrapper around Virtualization.framework allowing for easy installation of Linux containers.
>a WSL2-esque wrapper around Virtualization.framework allowing for easy installation of Linux containers.
So Linux is now a first class citizen on both Windows and Mac? I guess it really is true that 'if you can't beat em, join em.' Jobs must be rolling in his grave.
That's an interesting protocol choice, especially given the purpose. SMTP is probably the most filtered protocol on residential networks, SMB being a runner-up.
SMTP isn't filtered, it's port 25 that is. And from a short look at the readme, it looks like it's using the submission port 587, which shouldn't be filtered.
I was thinking this too. I'm assuming it doesn't look like an SMTP server from the outside? Because if it does, that would absolutely land your IP on many, many DNSBLs very quickly if it started getting probed.
Interesting idea though, spoofing protocols other than HTTP/HTTPS is probably a good idea for censorship evasion in countries with incredibly strict national firewalls.
TECHNICAL.md lays it out a bit more, but it claims to be RFC 5321 compliant with a realistic initiation sequence so it should somewhat look like a real SMTP server for the first bit.
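For anyone curious what that first bit looks like from the client's side, here's a minimal Python sketch (the hostname is a placeholder) walking through the greeting, EHLO, and STARTTLS steps any RFC 5321/3207 server is expected to handle:

    import smtplib
    import ssl

    HOST, PORT = "mail.example.org", 587   # placeholders

    client = smtplib.SMTP(HOST, PORT, timeout=10)   # server sends its 220 greeting here
    print(client.ehlo())                            # 250 + capability list (STARTTLS, AUTH, SIZE, ...)
    print(client.starttls(context=ssl.create_default_context()))
    print(client.ehlo())                            # capabilities re-advertised over TLS
    client.quit()

A scanner doing roughly this would see plausible banners and capability lists, which is presumably the point.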
Ending up on any DNSBLs shouldn't be a problem unless you have a static home IP you plan on running an actual SMTP server from after this though.
>SMTP traffic on port 587 (submission) is expected and normal
Any residential dynamic or static IP with this port opened is definitely going to get flagged. Most ISPs already prevent these ports from being open, either by policy or by residential routers.
It would probably very quickly end up on something like Spamhaus's PBL, which looks for this kind of thing.[1]
I would imagine you would also find yourself on Shodan pretty quickly getting hit with constant nmap & login attempts from malicious actors. Spam bots are always looking for insecure servers to send emails from.
I feel like ssh, SFTP, or even a secure DNS server would probably make more sense as something to hide traffic from DPI than an SMTP server.
Again, unless you're actually planning on sending "real" SMTP traffic to other "real" SMTP servers from your own "real" SMTP server operating on the same address, getting put on Spamhaus (or other DNSBLs) for having the port open without rDNS etc. configured is irrelevant. Like you say, there's a decent chance your ISP just blocks the port anyway and makes such a setup unfeasible, but that's why the readme says to host this on a VPS which allows the port.
Any time you have any externally open TCP port (home or VPS) you should expect to get scanned to shit by Shodan and millions of other bots. It doesn't matter if it's the default port for SFTP, DNS, SMTP, HTTP, Minecraft, or whatever - all of them are great targets for malicious actors, and as soon as the bots detect one open port they'll scan everything on that IP harder. I once exposed SSH/SFTP externally, forgot to disable certain login types that are enabled by default, left failed connection/authentication logging on, and ended up with GBs of logs in just one week.
>The PBL detects end-user IP address ranges which should not be attempting to directly deliver unauthenticated SMTP email to any Internet mail server. All the email originated by an IP listed in PBL is expected to be submitted - using authentication - to a SMTP server which delivers it to destination
Means in practice port 25 (unauthenticated) and port 587 (authenticated)
All boils down to the kind of DPI you're trying to work around, but generally the most common encrypted or otherwise difficult to process protocols strike me as the most preferable.
RTP isn't a bad choice, especially the WebRTC flavor of it:
- it's UDP; there's no need to worry about the TCP meltdown
- it's most commonly used for peer-to-peer and PBX communication; packets going in and out, from and to random IPs are expected
- high bandwidth RTP traffic is normal, and so are large irregularities in it
- it most often carries video; huge room for steganography
- WebRTC makes encryption mandatory
I've come across corporate networks that do block non-intranet WebRTC, however this probably isn't feasible at the Internet scale.
Other good choices are QUIC and WebSockets (assuming your network doesn't do MitM), and SSH, which by default comes with strong protection against MitM and actually has SOCKS5 tunneling built into the most popular implementations (try `ssh -D`). SSH is what some of my friends successfully use to bypass the Great Firewall.
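To make the `ssh -D` bit concrete: after something like `ssh -D 1080 user@host` you get a local SOCKS5 proxy, and anything SOCKS-aware can be pointed at it. A rough Python sketch (the URL is a placeholder, and it assumes requests is installed with the optional SOCKS extra):

    import requests  # pip install "requests[socks]"

    # Run separately: ssh -D 1080 user@host  ->  SOCKS5 proxy on localhost:1080
    proxies = {
        "http":  "socks5h://127.0.0.1:1080",
        "https": "socks5h://127.0.0.1:1080",
    }

    r = requests.get("https://example.org/", proxies=proxies, timeout=10)
    print(r.status_code)

The socks5h scheme matters: it makes DNS resolution happen on the far side of the tunnel too, so lookups don't leak to the local network.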
That being said, the shift of client-to-server SMTP from a common part of everyday internet traffic to something rather esoteric may have created some potential for firewall misconfigurations, and those might result in it being passed with minimal inspection. All depends on your particular firewall in the end.
You don't need WebSockets, just Connection: Upgrade to anything you want. You can upgrade directly to the SSH protocol and from then on just pass the decrypted data from the HTTPS socket to local port 22 with no further processing.
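Conceptually something like this on the server side; a hedged Python sketch (the port numbers and the "Upgrade: ssh" token are arbitrary choices for illustration, and a real deployment would sit behind TLS termination):

    import socket
    import threading

    LISTEN_PORT = 8080   # where the TLS terminator hands us decrypted requests
    SSH_PORT = 22        # local sshd

    def pump(src, dst):
        # shovel bytes one way until either side closes
        try:
            while True:
                data = src.recv(4096)
                if not data:
                    break
                dst.sendall(data)
        except OSError:
            pass
        finally:
            try:
                dst.shutdown(socket.SHUT_WR)
            except OSError:
                pass

    def handle(client):
        request = client.recv(8192)                    # good enough for a sketch
        if b"upgrade: ssh" in request.lower():         # arbitrary token for this example
            client.sendall(b"HTTP/1.1 101 Switching Protocols\r\n"
                           b"Connection: Upgrade\r\n"
                           b"Upgrade: ssh\r\n\r\n")
            ssh = socket.create_connection(("127.0.0.1", SSH_PORT))
            threading.Thread(target=pump, args=(ssh, client), daemon=True).start()
            pump(client, ssh)                          # from here on it's raw SSH bytes
        else:
            client.sendall(b"HTTP/1.1 404 Not Found\r\n\r\n")
            client.close()

    with socket.create_server(("0.0.0.0", LISTEN_PORT)) as srv:
        while True:
            conn, _ = srv.accept()
            threading.Thread(target=handle, args=(conn,), daemon=True).start()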
Hehe true, SSH traffic is so characteristically obvious that the packet size and timing can be used as a side channel to leak information about a session.
Tangential, but I recall reading about a similar technique used on SRTP packets to guess the phonemes being uttered without needing to decrypt the traffic.
I guess you would need to either mimic a protocol that always uses a fixed packet size/rate (like a MPEG-TS video stream or something), or artificially pad/delay your packets to throw off detection methods.
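The padding half of that is cheap to sketch; roughly this in Python (the 1200-byte wire size and the 2-byte length prefix are arbitrary choices here - the delay/jitter side is the harder part in practice):

    import os
    import struct

    WIRE_SIZE = 1200  # every packet leaves at exactly this size

    def pad(payload: bytes) -> bytes:
        # 2-byte length prefix, then the payload, then random filler up to WIRE_SIZE
        if len(payload) > WIRE_SIZE - 2:
            raise ValueError("payload too large; would need fragmentation")
        filler = os.urandom(WIRE_SIZE - 2 - len(payload))
        return struct.pack("!H", len(payload)) + payload + filler

    def unpad(packet: bytes) -> bytes:
        (length,) = struct.unpack("!H", packet[:2])
        return packet[2:2 + length]

    assert unpad(pad(b"hello")) == b"hello"
    assert len(pad(b"hello")) == WIRE_SIZE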
This is indeed similar in effect, but completely different in cause, to the phenomenon referenced in the article (device pixel ratio vs pixel aspect ratio).
What you're referring to stems from an assumption made a long time ago by Microsoft, later adopted as a de facto standard by most computer software. The assumption was that the pixel density of every display, unless otherwise specified, was 96 pixels per inch [1].
The value stuck and started being taken for granted, while the pixel density of displays started growing much beyond that—a move mostly popularized by Apple's Retina. A solution was needed to allow new software to take advantage of the increased detail provided by high-density displays while still accommodating legacy software written exclusively for 96 PPI. This resulted in the decoupling of "logical" pixels from "physical" pixels, with the logical resolution being most commonly defined as "what the resolution of the display would be given its physical size and a PPI of 96" [2], and the physical resolution representing the real amount of pixels. The 100x100 and 200x200 values in your example are respectively the logical and physical resolutions of your screenshot.
Different software vendors refer to these "logical" pixels differently, but the names you're going to encounter most are points (Apple), density-independent pixels ("DPs", Google), and device-independent pixels ("DIPs", Microsoft). The value of 96, while the most common, is also not a standard per se. Android uses 160 PPI as its base, and Apple for a long time used 72.
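For the 100x100 vs 200x200 example above, the bookkeeping boils down to this (a rough Python sketch using the base densities just mentioned):

    BASE_PPI = {"windows/web": 96.0, "android": 160.0, "apple_classic": 72.0}

    def scale_factor(physical_ppi: float, platform: str = "windows/web") -> float:
        # how many physical pixels correspond to one logical pixel
        return physical_ppi / BASE_PPI[platform]

    def logical_size(physical_px: int, physical_ppi: float) -> float:
        return physical_px / scale_factor(physical_ppi)

    # A 192 PPI ("2x") display: a 200x200-pixel screenshot measures 100x100 logical pixels.
    print(scale_factor(192.0))        # 2.0
    print(logical_size(200, 192.0))   # 100.0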
I might be misunderstanding what you're saying, but I'm pretty sure print and web were already more popular than anything Apple did. The need to be aware of output size and scale pixels was not at all uncommon by the time retina displays came out.
From what I recall only Microsoft had problems with this, and specifically on Windows. You might be right about software that was exclusive to desktop Windows. I don't remember having scaling issues even on other Microsoft products such as Windows Mobile.
Print was always density-independent. This didn't translate into high-density displays, however. The web, at least how I remember it, for the longest time was "best viewed in Internet Explorer at 800x600", and later 1024x768, until vector-based Flash came along :)
If my memory serves, it was Apple that popularized high pixel density in displays with the iPhone 4. They weren't the first to use such a display [1], but certainly the ones to start a chain reaction that resulted in phones adopting crazy resolutions all the way up to 4K.
It's the desktop software that mostly had problems scaling. I'm not sure about Windows Mobile. Windows Phone and UWP have adopted an Android-like model.
Why does the PPI matter at all? Thought we only cared about the scaling factor. So 2 in this 100 to 200 scenario. It's not like I'm trying to display a true to life gummy bear on my monitor, we just want sharp images.
These days, since most software takes a 1x scaling factor to mean 96 PPI (or 72 if you're Apple), yes; but at the very beginning there was no such reference. 100x100 pixels without a density could have meant 10x10 or 100x100 inches.
Some software, most notably image editors and word processors, still try to match the zoom of 100% with the physical size of a printout.
I think resolution always refers to physical resolution of the display. But rendering can be using scaling to make things appear to the user in whatever real size regardless of the underlying resolution.
That really depends on the context. Open your browser's Developer Tools and you'll see logical sizes everywhere. Android, which in my opinion has the best scaling model, and where the PPI wildly varies from device to device, nearly always operates on logical sizes. Windows, GNOME, and KDE on the other hand tend to give you measurements in physical pixels. macOS is a mix and match; Preview and QuickTime tell you the physical resolutions, Interface Builder and display preferences will only show logical dimensions.
>Before HD, almost all video was non-square pixels
Correct. This came from the ITU-R BT.601 standard, one of the first digital video standards, whose authors chose to define digital video as a sampled analog signal. Analog video never had a concept of pixels and operated on lines instead. The rate at which you could sample it could be arbitrary, and affected only the horizontal resolution. The rate chosen by BT.601 was 13.5 MHz, which resulted in a 10/11 pixel aspect ratio for 4:3 NTSC video and 59/54 for 4:3 PAL.
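The arithmetic, for anyone who wants to check it, using the commonly quoted square-pixel sampling rates (135/11 MHz for 525-line and 59/4 MHz for 625-line 4:3 systems); a quick Python sketch:

    from fractions import Fraction

    BT601 = Fraction(27, 2)           # 13.5 MHz luma sampling rate
    NTSC_SQUARE = Fraction(135, 11)   # ~12.27 MHz: rate that would give square pixels (525-line 4:3)
    PAL_SQUARE = Fraction(59, 4)      # 14.75 MHz: same for 625-line 4:3

    print(NTSC_SQUARE / BT601)   # 10/11 -> NTSC pixel aspect ratio
    print(PAL_SQUARE / BT601)    # 59/54 -> PAL pixel aspect ratio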
>SD channels on cable TV systems are 528x480
I'm not actually sure about America, but here in Europe most digital cable and satellite SDTV is delivered as 720x576i 4:2:0 MPEG-2 Part 2. There are some outliers that use 544x576i, however.
Good post. For anyone wondering "why do we have these particular resolutions, sampling and frame rates, which seem quite random", allow me to expand and add some color to your post (pun intended). Similar to how modern railroad track widths can be traced back to the wheel widths of Roman chariots, modern digital video standards still reverberate with echoes from 1930s black-and-white television standards.
BT.601 is from 1982 and was the first widely adopted component digital video standard (sampling analog video into 3 color components (YUV) at 13.5 MHz). Prior to BT.601, the main standard for digital video was SMPTE 244M, created by the Society of Motion Picture and Television Engineers, a composite video standard which sampled analog video at 14.32 MHz. Of course, a higher sampling rate is, all things being equal, generally better. The reason for BT.601 being lower (13.5 MHz) was a compromise - equal parts technical and political.
Analog television was created in the 1930s as a black-and-white composite standard, and in 1953 color was added by a very clever hack which kept all broadcasts backward compatible with existing B&W TVs. Politicians mandated this because they feared nerfing all the B&W TVs owned by voters. But that hack came with some significant technical compromises which complicated and degraded analog video for over 50 years. The composite and component sampling rates (14.32 MHz and 13.5 MHz) both trace back to frequencies already baked into analog television: 14.32 MHz is 4x the NTSC color subcarrier frequency, and 13.5 MHz is a common multiple of the NTSC and PAL line rates. And those two frequencies directly dictated all the odd-seeming horizontal pixel resolutions we find in pre-HD digital video (352, 704, 360, 720 and 768) and even the original PC display resolutions (CGA, VGA, XGA, etc). To be clear, analog television signals were never pixels. Each horizontal scanline was only ever an oscillating electrical voltage from the moment photons struck an analog tube in a TV camera to the home viewer's cathode ray tube (CRT). Early digital video resolutions were simply based on how many samples an analog-to-digital converter would need to fully recreate the original electrical voltage.
For example, 720 is tied to 13.5 Mhz because sampling the active picture area of an analog video scanline at 13.5 MHz generates 1440 samples (double per-Nyquist). Similarly, 768 is tied to 14.32 MHz generating 1536 samples. VGA's horizontal resolution of 640 is simply from adjusting analog video's rectangular pixel aspect ratio to be square (704 * 10/11 = 640). It's kind of fascinating that all these modern digital resolutions can be traced back to decisions made in the 1930s based on which affordable analog components were available, which competing commercial interests prevailed (RCA vs Philco) and the political sensitivities present at the time.
> For example, 720 is tied to 13.5 Mhz because sampling the active picture area of an analog video scanline at 13.5 MHz generates 1440 samples (double per-Nyquist).
I don't think you need to be doubling here. Sampling at 13.5 MHz generates about 720 samples.
13.5e6 Hz * 53.33...e-6 seconds = 720 samples
The sampling theorem just means that with that 13.5 MHz sampling rate (and 720 samples) signals up to 6.75 MHz can be represented without aliasing.
Non-square pixels come from the legacy of anamorphic film projection. This was developed from the need to capture wide aspect ratio images on standard 35mm film.
This allows the aspect ratio captured on film to stay fixed while images of various aspect ratios are displayed.
I based that on seeing the BBC Science TV series (and books) Connections by science historian James Burke. If it's been updated since, then I stand corrected. Regardless of the specific example, my point was that sometimes modern standards are linked to long-outdated historical precedents for no currently relevant reason.
While analog video did not have the concept of pixels, it specified the line frequency, the number of visible lines (576 in Europe, composed of 574 full lines and 2 half lines; some people count them as 575 lines, but the 2 half lines are located in 2 different lines of the image, not on the same line, so there are 576 distinct lines on the height of the image), the duration of the visible part of a line, and the image aspect ratio of 4:3.
From these 4 values one can compute the video sampling frequency that corresponds to square pixels. For the European TV standard, an image with square pixels would have been of 576 x 768 pixels, obtained at a video sampling frequency close to 15 MHz.
However, in order to allow more TV channels in the available bands, the maximum video frequency was reduced to a lower frequency than required for square pixels (which would have been close to 7.5 MHz in Europe) and then to an even lower maximum video frequency after the transition to PAL/SECAM, i.e. to lower than 5.5 MHz, typically about 5 MHz. (Before the transition to color, Eastern Europe had used sharper black&white signals, with a lower than 6.5 MHz maximum video frequency, typically around 6 MHz. The 5.5/6.5 MHz limits are caused by the location of the audio signal. France had used an even higher-definition B&W system, but that had completely different parameters than the subsequent SECAM, being an 819-line system, while the East-European system differed only in the higher video bandwidth.)
So sampling to a frequency high enough for square pixels would have been pointless as the TV signal had been already reduced to a lower resolution by the earlier analog processing. Thus the 13.5 MHz sampling frequency chosen for digital TV, corresponding to pixels wider than their height, was still high enough to preserve the information contained in the sampled signal.
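Roughly, the numbers behind that (52 us active line, 576 visible lines, 4:3 picture; a back-of-the-envelope Python sketch):

    ACTIVE_LINE_US = 52.0                      # visible part of a 625-line scanline, microseconds
    VISIBLE_LINES = 576
    SQUARE_SAMPLES = VISIBLE_LINES * 4 // 3    # 768 samples/line for square pixels at 4:3

    square_rate_mhz = SQUARE_SAMPLES / ACTIVE_LINE_US   # ~14.8 MHz needed for square pixels
    max_video_mhz = square_rate_mhz / 2                 # ~7.4 MHz video bandwidth (Nyquist)
    bt601_samples = 13.5 * ACTIVE_LINE_US               # ~702 active samples at 13.5 MHz
                                                        # (BT.601 digitizes a slightly wider
                                                        #  ~53.3 us window, hence 720)

    print(square_rate_mhz, max_video_mhz, bt601_samples)   # 14.77..., 7.38..., 702.0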
No, the reason why 13.5 MHz was chosen is because it was desirable to have the same sampling rate for both PAL and NTSC, and 13.5 MHz happens to be an integer multiple of both line frequencies. You can read the full history in this article:
That is only one condition among the conditions that had to be satisfied by the sampling rate, and there are an infinity of multiples which satisfy this condition, so this condition is insufficient to determine the choice of the sampling frequency.
Another condition that had to be satisfied by the sampling frequency was to be high enough in comparison with the maximum bandwidth of the video signal, but not much higher than necessary.
Among the common multiples of the line frequencies, 13.5 MHz was chosen because it also satisfied the second condition, the one I have discussed: it was possible to choose 13.5 MHz only because the analog video bandwidth had been standardized to values smaller than needed for square pixels. Otherwise, a common multiple of the line frequencies greater than 15 MHz would have been required for the sampling frequency (the next such multiple being 20.25 MHz).
Yeah. I recently stumbled across this in an interesting way. Went down a rabbit hole. I was recreating an old game for my education[1]. Scummvm supports Eye of the Beholder and I used it to take screenshots to compare against my own work. I was doing the intro scenes and noticed that the title screens are 320x200. My monitor is 1920x1200 and so the ratios are the same. It displays properly when I full screen my game and all is good. However, on scummvm, it looked vertically elongated. I did some digging and found this about old monitors and how they displayed. Scummvm has a setting called "aspect ratio correction" which stretches the pixels vertically and produces pillarboxing to give you the "original nostalgic feel".
I also have a few decrypted samples from the Hot Bird 13E, public DVB-T and T2 transmitters and Vectra DVB-C from Poland, but for that I'd have to dig through my backups.
Projects like this and Docker make me seriously wonder where software engineering is going. Don't get me wrong, I don't mean to criticize Docker or Toro in particular. It's the increasing dependency on such approaches that bothers me.
Docker was conceived to solve the problem of things "working on my machine", and not anywhere else. This was generally caused by the differences in the configuration and versions of dependencies. Its approach was simple: bundle both of these together with the application in unified images, and deploy these images as atomic units.
Somewhere along the line, however, the problem has mutated into "works on my container host". How is that possible? Turns out that with larger modular applications, the configuration and dependencies naturally demand separation. This results in them moving up a layer, in this case creating a network of inter-dependent containers that you now have to put together for the whole thing to start... and we're back to square one, with way more bloat in between.
Now hardware virtualization. I like how AArch64 generalizes this: there are 4 levels of privilege baked into the architecture. Each has control over the lower and can call up the one immediately above to request a service. Simple. Let's narrow our focus to the lowest three: EL0 (classically the user space), EL1 (the kernel), EL2 (the hypervisor). EL0, in most operating systems, isn't capable of doing much on its own; its sole purpose is to do raw computation and request I/O from EL1. EL1, on the other hand, has the powers to directly talk to the hardware.
Everyone is happy, until the complexity of EL1 grows out of control and becomes a huge attack surface, difficult to secure and easy to exploit from EL0. Not good. The naive solution? Go a level above, and create a layer that will constrain EL1, or actually, run multiple, per-application EL1s, and punch some holes through for them to still be able to do the job—create a hypervisor. But then, as those vaguely defined "holes", also called system calls and hypercalls, grow, won't the attack surface grow too?
Or in other words, with the user space shifting to EL1, will our hypervisor become the operating system, just like docker-compose became a dynamic linker?
I see a number of assumptions in your post which don't match my view of the picture.
Containers arose as a way to solve the dependency problems created by traditional Unix. They grew from tools like chroot, BSD jails, and Solaris Zones. Containers allow you to deploy dependencies that cannot be simultaneously installed on a traditional Unix host system. It's not a UNIX architecture limitation but rather a result of POSIX + tradition; e.g. Nix also solves this, but differently.
Containers (like chroot and jail before them) also help ensure that a running service does not depend on the parts of the filesystem it wasn't given access to. Additionally, containers can limit network access, and process tree access.
These limitations are not a proper security boundary, but definitely a dependency boundary, helping avoid spaghetti-style dependencies, and surprises like "we never realized that our ${X} depends on ${Y}".
Then, there's the Fundamental Theorem of Software Engineering [1], which states: "We can solve any problem by introducing an extra level of indirection." So yes, expect the number of levels of indirection to grow everywhere in the stack. A wise engineer can expect to merge or remove some levels here and there, when the need for them is gone, but they would never expect new levels of indirection to stop emerging.
To be honest, I've read your response 3 times and I still don't see where we disagree, assuming that we do.
I've mostly focused on the worst Docker horrors I've seen in production, extrapolating that to the future of containers, as pulling in new "containerized" dependencies will inevitably become just as effortless as it currently is with regular dependencies in the new-style high-level programming languages. You've primarily described a relatively fresh, or a well-managed Docker deployment, while admitting that spaghetti-style dependencies have become a norm and new layers will pile up (and by extension, make things hard to manage).
I think our points of view don't actually collide.
We do not disagree about the essence, but rather in accents. Some might say that sloppy engineers were happy to pack their Rube-Goldbergesque deployments into containers. I say that even the most excellent and diligent engineers sometimes faced situations where two pieces of software required incompatible versions of a shared library, which depended on a tree of other libraries with incompatible versions, etc., and there's a practical limit to what you can and should do with bash scripts and abuse of LD_PRELOAD.
Many of the "new" languages, like Go (16 years), Rust (13 years), or Zig (9 years), can just build static binaries, not even depending on libc. This has both upsides and downsides, especially with security fixes. Rebuilding a container to include an updated .so dependency is often easier and faster than rebuilding a Rust project.
Docker (or preferably Podman) is not a replacement for linkers. It's an augmentation to the package system, and a replacement for the common file system layout, which is inadequate for modern multi-purpose use of a Unix (well, Linux) box.
I see, you're providing a complementary perspective. I appreciate that, and indeed, Docker isn't always evil. My intention was to bring attention to the abuse of it and compare it to virtualization of unikernels, which to me appears to be on a similar trajectory.
As for the linker analogy, I compared docker-compose (not Docker proper) to a dynamic linker because it's often used to bring up larger multi-container applications, similar to how large monolithic applications with plenty of shared library dependencies are put together by ld.so. Those multi-container applications can be similarly brittle if developed under the assumption that merely wrapping them up in containers will ensure portability, defeating most of Docker's advantages and reducing it to a pile of excess layers of indirection. This is similar to the false belief that running kernel-mode code under a hypervisor is by itself more secure than running it as a process on top of a bare-metal kernel.
Containers got popular at a time when an increasing number of people were finding it hard to install software on their systems locally - especially if you were, for instance, having to juggle multiple versions of Ruby or multiple versions of Python, and those linked to various major versions of C libraries.
Unfortunately containers have always had an absolutely horrendous security story and they degrade performance by quite a lot.
The hypervisor is not going away anytime soon - it is what the entire public cloud is built on.
While you are correct that containers do add more layers - unikernels go the opposite direction and actively remove those layers. Also, imo the "attack surface" is by far the smallest security benefit - other architectural concepts such as the complete lack of an interactive userland is far more beneficial when you consider what an attacker actually wants to do after landing on your box. (eg: run their software)
When you deploy to AWS you have two layers of linux - one that AWS runs and one that you run - but you don't really need that second layer and you can have much faster/safer software without it.
I can understand the public cloud argument; if the cloud provider insists on you delivering an entire operating system to run your workloads, a unikernel indeed slashes the amount of layers you have to care about.
Suppose you control the entire stack though, from the bare metal up. (Correct me if I'm wrong, but) Toro doesn't seem to run on real hardware, you have to run it atop QEMU or Firecracker. In that case, what difference does it make if your application makes I/O requests through paravirtualized interfaces of the hypervisor or talks directly to the host via system calls? Both ultimately lead to the host OS servicing the request. There isn't any notable difference between the kernel/hypervisor and the user/kernel boundary in modern processors either; most of the time, privilege escalations come from errors in the software running in the privileged modes of the processor.
Technically, in the former case, besides exploiting the application, a hypothetical attacker will also have to exploit a flaw in QEMU to start processes or gain further privileges on the host, but that's just due to a layer of indirection. You can accomplish this without resorting to hardware virtualization. Once in QEMU, the entire assortment of your host's system calls and services is exposed, just as if you ran your code as a regular user space process.
This is the level you want to block exec() and other functionality your application doesn't need at, so that neither QEMU nor your code run directly can perform anything out of their scope. Adding a layer of indirection while still leaving the user/kernel or unikernel/hypervisor junction points unsupervised will only stop unmotivated attackers looking for low-hanging fruit.
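On Linux hosts that's seccomp territory; a hedged sketch using the libseccomp Python bindings as I understand them (the seccomp module, also published as pyseccomp; the syscall list is just an example):

    import errno
    import subprocess
    import seccomp   # libseccomp Python bindings (python3-seccomp / pyseccomp on PyPI)

    # Installed by the process itself once initialization is done: from this point on,
    # any attempt to exec another program fails with EPERM instead of succeeding.
    flt = seccomp.SyscallFilter(defaction=seccomp.ALLOW)
    for name in ("execve", "execveat"):
        flt.add_rule(seccomp.ERRNO(errno.EPERM), name)
    flt.load()

    try:
        subprocess.run(["/bin/sh", "-c", "id"])   # the child's execve gets refused
    except PermissionError as exc:
        print("blocked:", exc)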
> Suppose you control the entire stack though, from the bare metal up. (Correct me if I'm wrong, but) Toro doesn't seem to run on real hardware, you have to run it atop QEMU or Firecracker.
Some unikernels are intended to run under a hypervisor or on bare metal. Bare metal means you need some drivers, but if you have a use case for a unikernel on bare metal, you probably don't need to support the vast universe of devices, maybe only a few instances of a couple types of things.
I've got a not production ready at all hobby OS that's adjacent to a unikernel; runs in virtio hypervisors and on bare metal, with support for one NIC. In its intended hypothetical use, it would boot from PXE, with storage on nodes running a traditional OS, so supporting a handful of NICs would probably be sufficient. Modern NICs tend to be fairly similar in interface, so if the manufacturer provides documentation, it shouldn't take too long to add support at least once you've got one driver doing multiple tx/rx queues and all that jazz... plus or minus optimization.
For storage, you can probably get by with two drivers, one for sata/ahci and one for nvme. And likely reuse an existing filesystem.
Do you usually publish your hobby code publicly? If not, consider this an appeal to do so (:
> Modern NICs tend to be fairly similar in interface, so if the manufacturer provides documentation, it shouldn't take too long to add support at least once you've got one driver ... For storage, you can probably get by with two drivers
I take it that there aren't any pluggable drivers for NICs like there are for nvme/sata disks?
> I take it that there aren't any pluggable drivers for NICs like there are for nvme/sata disks?
I mean, there is NDIS / NDISWrapper. Or, I think it wouldn't be too hard to run netbsd drivers... but I'm crazy and want my drivers in userland, in Erlang, so none of that applies. :)
As a fair warning, there are some concurrency errors in the kernel which I haven't tracked down that result in sometimes getting stuck before the shell prompt comes up, the tcp stack is just ok enough to mostly work, and the dhcp client only works if everything goes right.
Erlang! Indeed a crazy idea (in a good way!), and while I'm not normally a big fan of unikernels, now you've got me seriously intrigued :)
I've been dabbling in Erlang and OS development myself, my biggest inspirations being Microsoft Singularity and QNX. The former is a C# lookalike of what you're making, or at least that's how it seems from my perspective.
The readme mentions a FreeBSD-like system call interface, but then the drivers and the network stack are written in Erlang, and, as you've mentioned, run in the user land. Is that actually a unikernel design with BEAM running in the kernel, or more of a microkernel hosting BEAM, with it providing device handling and the user space?
The original plan was BEAM on metal, but I had a hard time getting that started... so I pivoted to BEAM from pkg, running on a just enough kernel that exposes only the FreeBSD syscalls that actually get called.
Where that fits in the taxonomy of life, I'm not sure. There is a kernel/userspace boundary (and also a c-code/erlang code boundary in userspace), so it's not quite a unikernel. I wouldn't really call it a microkernel; there's none of the usual microkernel stuff... I just let userspace do i/o ports with the x86 task structure and do memory mapped i/o by letting it mmap anything (more or less). The kernel manages timers/timekeeping and interrupts, Erlang drivers open a socket to get notified when an interrupt fires --- level triggered interrupts would be an issue. Kernel also does thread spawning and mutex support, connects pipes, early/late console output, etc.
If I get through my roadmap (networked demo, uefi/amd64 support, maybe arm64 support, run the otp test suite), I might look again and see if I can eliminate the kernel/userspace divide now that I understand the underneath, but the minimal kernel approach lets me play around with the fun parts, so I'm pretty happy with it. I've got a slightly tweaked dist working and can hotload userspace Erlang code over the network, including the tcp stack, which was the itch I wanted to scratch... nevermind that the tcp stack isn't very good at the moment ;)
Really cool! Will definitely take a closer look in my spare time.
>I just let userspace do i/o ports [...] and do memory mapped i/o by letting it mmap anything (more or less). The kernel manages timers/timekeeping and interrupts [...]
This is how QNX does it too, allowing privileged processes to use MAP_PHYS and port I/O instructions on x86, and handle interrupts like they're POSIX signals. It all boils down to how you structure your design, but personally, I think that's not a bad approach at all. The cool thing about it is that, after the initial setup, you can drop the privileges for creating further mappings and handlers, reducing the attack surface.
Unless you're trying to absolutely minimize the cost and amount of context switches, I think moving BEAM into the kernel would be a downgrade, but again, I'm a big proponent of microkernels :)
I can't speak for all the various projects but imo these aren't made for bare metal - if you want true bare metal (metal you can physically touch) use linux.
One of the things that might not be so apparent is that when you deploy these to something like AWS all the users/process mgmt/etc. gets shifted up and out of the instance you control and put into the cloud layer - I feel that would be hard to do with physical boxen cause it becomes a slippery slope of having certain operations (such as updates) needing auth for instance.
> Suppose you control the entire stack though, from the bare metal up. (Correct me if I'm wrong, but) Toro doesn't seem to run on real hardware, you have to run it atop QEMU or Firecracker. In that case, what difference does it make if your application makes I/O requests through paravirtualized interfaces of the hypervisor or talks directly to the host via system calls? Both ultimately lead to the host OS servicing the request. There isn't any notable difference between the kernel/hypervisor and the user/kernel boundary in modern processors either; most of the time, privilege escalations come from errors in the software running in the privileged modes of the processor.
Toro can run on bare metal, although I stopped supporting that a few years ago. I tagged the commit in master when this happened. Also, I removed the TCP/IP stack in favor of VSOCK. Those changes, though, could be reversed in case there is interest in those features.
> In that case, what difference does it make if your application makes I/O requests through paravirtualized interfaces of the hypervisor or talks directly to the host via system calls?
Hypervisors expose a much smaller API surface area to their tenants than an operating system does to its processes which makes them much easier to secure.
That is an artifact of implementation. Monolithic operating systems with tons of shared services expose lots to their tenants. Austere hypervisors, the ones with small API surface areas, basically implement a microkernel interface yet both expose significantly more surface area and offer a significantly worse guest experience than microkernels. That is why high security systems designed for multi-level security for shared tenants that need to protect against state actors use microkernels instead of hypervisors.
> That is why high security systems designed for multi-level security for shared tenants
When you say "high security" do you mean Confidential Computing workloads run by Trusty (Enclave) / Virtee (Realm) etc? If so, aren't these systems limited in what they can do, as in, there usually is another full-blown OS that's running the user-facing bits?
> that need to protect against state actors
This is a very high bar for a software-only solution (like a microkernel) to meet? In my view, open hardware specifications, like OpenTitan, in combination with a small-ish software TCB, make it hard for state actors (even if not impossible).
No. I am talking about multi-level security [1] which allows a single piece of hardware to handle top secret and unclassified materials simultaneously via software protection. This protection is limited to software attempts to access top secret materials from the unclassified domain; hardware and physical attacks are out-of-scope.
There have been many such systems verified to be secure against state actors according to the TCSEC Orange Book Level A1 standard and the subsequent Common Criteria SKPP standard which requires both full formal proofs of security and explicitly requires the NSA to identify zero vulnerabilities during a multi-month penetration test before allowing usage in NSA and DoD systems.
I think they were talking more about the degraded performance.
In terms of the security aspects though, how do security holes in a layer that restricts things more than not having it degrade security? Seems like saying that CVEs in a browser's JavaScript sandboxing degrade the browser's security more than just not having sandboxes.
Duplicating a networking and storage layer on top of existing storage/networking layers that containers, and the orchestrators such as k8s provide, absolutely degrade performance - full stop. No one runs containers raw (w/out an underlying vm) in the cloud - they always exist on top of vms.
The problem with "container" security is that even in this thread many people seem to think that it is a security barrier of some kind when it was never designed to be one. The v8 sandbox was specifically created to deal with sandboxing. It still has issues but at least it was thought about and a lot of engineering went into it. Container runtimes are not exported via the kernel. Unshare is not named 'create_container'. A lot of the container issues we see are runtime issues. There are over a half-dozen different namespaces that are used in different manners and expose hard-to-understand gotchas. The various container runtimes decide themselves how to deal with these and they have to deal with all the issues in their code when using them. A very common class of bug these runtimes get hit by is TOCTOU (time of check to time of use) vulns.
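To make the "unshare is not named create_container" point concrete, the primitive the kernel actually gives you is about this small (a Python/ctypes sketch; Linux only, needs root or a user namespace, and it only touches the UTS namespace):

    import ctypes
    import socket

    CLONE_NEWUTS = 0x04000000          # from <sched.h>: hostname/domainname namespace
    libc = ctypes.CDLL("libc.so.6", use_errno=True)

    # The whole "container" primitive: detach one namespace from the parent.
    if libc.unshare(CLONE_NEWUTS) != 0:
        raise OSError(ctypes.get_errno(), "unshare(CLONE_NEWUTS) failed")

    name = b"not-a-container"
    libc.sethostname(name, len(name))  # visible only inside this UTS namespace
    print(socket.gethostname())        # "not-a-container" here; the host is untouched

    # Mounts, pids, cgroups, the network, the rootfs itself - all more of the same,
    # glued together by the runtime, which is exactly where the bugs tend to live.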
Right now there is a conversation about the upcoming change to systemd that runs sshd on vsock by default (you literally have to disable it via a kernel cli flag - systemd.ssh_auto=no) - guess what one of the concerns is? Vsock isn't bound to a network namespace. This is not itself a vulnerability but it most definitely is going to get taken advantage of in the future.
A container breakout is a valid CVE, but it also is an escape into an environment that is as secure as any unix environment was before we even had containers to begin with.
> other architectural concepts such as the complete lack of an interactive userland is far more beneficial when you consider what an attacker actually wants to do after landing on your box
What does that have to do with unikernel vs more traditional VMs? You can build a rootfs that doesn't have any interactive userland. Lots of container images do that already.
I am not a security researcher, but I wouldn't think it would be too hard to load your own shell into memory once you get access to it. At least, compared to pulling off an exploit in the first place.
I would think that merging kernel and user address spaces in a unikernel would, if anything, make it more vulnerable than a design using similar kernel options that did not attempt to merge everything into the kernel. Since now every application exploit is a kernel exploit.
A shell by design is explicitly made to run other programs. You type in 'ls', 'cat', 'grep', etc. but those are all different programs. A "webshell" can work to a degree as you could potentially upload files, cat files, write to files, etc. but you aren't running other programs under these conditions - that'd be code you're executing - scripting languages make this vastly easier than compiled ones. It's a lot more than just slapping a heavy-handed seccomp profile on your app.
Also merging the address space is not a necessity. In fact - 64-bit (which is essentially all modern cloud software) mandates virtual memory to begin with and many unikernel projects support elf loading.
The low level API of process isolation on Windows is Job Objects, that provide the necessary kernel APIs for namespacing objects and controlling resource use.
AppContainers, and Docker for Windows (the one for running dockerized Windows apps, not for running Linux Docker containers on top of WSL), use this API; these high-level features are just the 'porcelain'.
Windows containers are actually quite nice once you get past a few issues. Perf is the biggest, as it seems to run in a VM on Windows 11.
Perf is much better on Windows server. It's actually really pleasant to get your office appliances (a build agent etc) in a container on a beefy Windows machine running Windows server.
With a standard Windows Server license you are only allowed to have two Hyper-V virtual machines but unlimited "Windows containers". The design is similar to Linux, with namespaces bolted onto the main kernel, so they don't provide any better security guarantees than Linux namespaces.
Very useful if you are packaging trusted software and don't want to upgrade your Windows Server license.
>what an attacker actually wants to do after landing on your box.
Aren't there ways of overwriting the existing kernel memory or extending it to contain a new application if an attacker is able to attack the running unikernel?
What protections are provided by the unikernel to prevent this?
To be clear, there are still numerous attacks one might lob at you. For instance, if you are running a node app and the attacker uploads a new JS file that they can have the interpreter execute, that's still an issue. However, you won't be able to start running random programs like curling down some cryptominer or something - it'd all need to be contained within that code.
What becomes harder is if you have a binary that forces you to rewrite the program in memory as you suggest. That's where classic page protections come into play such as not exec'ing rodata, not writing to txt, not exec'ing heap/stack, etc. Just to note that not all unikernel projects have this and even if they do it might be trivial to turn them off. The kernel I'm involved with (Nanos) has other features such as 'exec protection' which prevents that app from exec-mapping anything not already explicitly mapped exec.
Running arbitrary programs, which is what a lot of exploit payloads try to achieve, is pretty different than having to stuff whatever they want to run inside the payload itself. For example, if you look at most malware, it's not just one program that gets run - it's like 30. Droppers exist solely to load third party programs on compromised systems.
> The kernel I'm involved with (Nanos) has other features such as 'exec protection' which prevents that app from exec-mapping anything not already explicitly mapped exec.
Does this mean JIT (and I guess most binary instrumentation (debuggers) / virtualization / translation tech) won't run as expected?
If the stack and heap are non-executable and page tables can't be modified then it's hard to inject code. Whether unikernels actually apply this hardening is another matter.
I always thought of Docker as a "fuck it" solution. It's the epitome of giving up. Instead of some department at a company releasing a libinference.so.3 and a libinference-3.0.0.x86_64.deb, they ship some Docker image that does inference and call it a microservice. They write that they launched, get a positive performance review, get promoted, and the Docker containers continue to multiply.
Python package management is a disaster. There should be ways of having multiple versions of a package coexist in /usr/lib/python, nicely organized by package name and version number, and import the exact version your script wants, without containerizing everything.
Electron applications are the other type of "fuck it" solution. There should be ways of writing good-looking native apps in JavaScript without actually embedding a full browser. JavaScript is actually a nice language to write front-ends in.
> Python package management is a disaster. There should be ways of having multiple versions of a package coexist in /usr/lib/python, nicely organized by package name and version number, and import the exact version your script wants, without containerizing everything.
Well sure, every language has some band-aid. The real solution should have been Python itself supporting:
import torch==2.9.1
Instead of a bunch of other useless crap additions to the language, this should have been a priority, along with the ability for multiple versions to coexist in PYTHON_PATH.
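For what it's worth, you can fake it today with per-version directories and a sys.path shuffle, which mostly demonstrates why it ought to be a language feature instead; a rough sketch (the /usr/lib/python-versions layout is made up):

    import importlib
    import sys

    VERSIONS_ROOT = "/usr/lib/python-versions"   # hypothetical layout: <root>/<pkg>/<version>/<pkg>/...

    def versioned_import(package: str, version: str):
        """Poor man's `import package==version`."""
        sys.path.insert(0, f"{VERSIONS_ROOT}/{package}/{version}")
        try:
            return importlib.import_module(package)
        finally:
            sys.path.pop(0)

    # torch = versioned_import("torch", "2.9.1")

    # The catch: sys.modules caches by bare name, so two versions still can't coexist
    # in one interpreter without much uglier tricks - which is exactly the problem.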
There is a vast amount of complexity involved in rolling things from scratch today in this fractured ecosystem and providing the same experience for everyone.
Sometimes, the reduction of development friction is the only reason a product ends up in your hands.
I say this as someone whose professional toolkit includes Docker, Python and Electron; Not necessarily tools of choice, but I'm one guy trying to build a lot of things and life is short. This is not a free lunch and the optimizer within me screams out whenever performance is left on the table, but everything is a tradeoff. And I'm always looking for better tools, and keep my eyes on projects such as Tauri.
I think there's merit to your criticisms of the way docker is used, but it also seems like it provides substantial benefits for application developers. They don't need to beg OS maintainers to update the package, and they don't need to maintain builds for different (OS, version) targets any more.
They can just say "here's the source code, here's a container where it works, the rest is the OS maintainer's job, and if Debian users running 10 year old software bug me I'm just gonna tell them to use the container"
Yeah I'm not against Docker in its entirety. I think it is good for development purposes to emulate multiple different environments and test things inside them, just not as a way to ship stuff.
Agree on all fronts. The advent of Dockerfiles as a poor man's packaging system and the per-language package managers has set the industry back several years in some areas IMHO.
Python has what, half a dozen mostly incompatible package managers? Node? Ruby? All because they're too lazy, inexperienced or stubborn to write or automate RPM spec files, and/or Debian rules files.
To be fair, the UNIX wars probably inspired this in the first place - outside of SVR4 derivatives, most commercial UNIX systems (HP-UX, AIX, Tru64) had their own packaging format. Even the gratis BSD systems all have their own variants of the same packaging system. This was the one thing that AT&T and Sun Solaris got right. Linux distros merely followed suit at the time - Red Hat with RPM, Debian with DEB, and then Slackware and half a dozen other systems - thankfully we seem to have coalesced on RPM, DEB, Flatpak, Snap, AppImage, etc... but yeah, that's before you get to the language-specific package management. It's a right mess, carried over from 90's UNIX "NIH" syndrome.
> This results in them moving up a layer, in this case creating a network of inter-dependent containers that you now have to put together for the whole thing to start... and we're back to square one, with way more bloat in between.
The difference is that you can move that whole bunch of interlinked containers to another machine and it will work. You don't get that when running on bare hardware. The technology of "containers" is ultimately about having the kernel expose a cleaned up "namespaced" interface to userspace running inside the container, that abstracts away the details of the original machine. This is very much not intended as "sandboxing" in a security sense, but for most other system administration purposes it gets pretty darn close.
At some point, few people even understand the whole system and whether all these layers are actually accomplishing anything.
It’s especially bad when the code running at rarified levels is developed by junior engineers and “sold” as an opaque closed source thing. It starts to actually weaken security in some ways but nobody is willing to talk about that.
> This results in them moving up a layer, in this case creating a network of inter-dependent containers that you now have to put together for the whole thing to start... and we're back to square one, with way more bloat in between.
Yea, with unneeded bloat like rule-based access controls, ACS and secret management. Some comments on this site.
> This results in them moving up a layer, in this case creating a network of inter-dependent containers that you now have to put together for the whole thing to start... and we're back to square one, with way more bloat in between.
I think you're over-egging the pudding. In reality, you're unlikely to use more than 2 types of container host (local dev and normal deployment maybe), so I think we've moved way beyond square 1. Config is normally very similar, just expressed differently, and being able to encapsulate dependencies removes a ton of headaches.
Nix is where we're going. Maybe not with the configuration language that annoys python devs, but declarative reproducible system closures are a joy to work with at scale.
Reproducible can have a lot of meanings. Nix guarantees that your build environment + commands are the same. It still uses all the usual build tools and it would be trivial to create a non-reproducible binary (--impure).
I've been running either Qubes OS or KVM/QEMU based VMs as my desktop daily driver for 10 years. Nothing runs on bare metal except for the host kernel/hypervisor and virt stack.
I've achieved near-native performance for intensive activities like gaming, music and visual production. Hardware acceleration is kind of a mess, but using tricks like GPU passthrough for multiple cards, dedicated audio cards and block device passthrough, I can achieve great latency and performance.
One benefit of this is that my desktop acts as a mainframe, and streaming machines to thin clients is easy.
My model for a long time has been not to trust anything I run, and this allows me to keep both my own and my client's work reasonably safe from a drive-by NPM install or something of that caliber.
Now that I also use an Apple Silicon MacBook as a daily driver, I very much miss the comfort of a fully virtualized system. I do stream virtual machines in from my mainframe. But the way Tahoe is shaping up, I might soon put Asahi on this machine and go back to a fully virtualized system.
I think this is the ideal way to do things, however, it will need to operate mostly transparently to an end user or they will quickly get security fatigue; the sacrifices involved today are not for those who lack patience.
I think it's fine if you do it for yourself. It's a bit of a poor man's Linux-turned-microkernel solution. In fact, I work like this too, and this extends to my Apple Silicon Mac. The separation does have big security advantages, especially when different pieces of hardware are exclusively passed to the different, closed-off "partitions" of the system and the layer orchestrating everything is as minimal as it gets, or at least as guarded against the guests as it gets.
What worries me is when this model escalates from being cobbled together by a system administrator with limited resources to becoming baked into the design of software; the appropriation of the hypervisor layer by software developers who are reluctant to untangle the mess they've created at the user/kernel boundary of their program and instead start building on top of hardware virtualization for "security", to ultimately go on and pollute the hypervisor as the level of host OS access proves insufficient. This is beautifully portrayed by the first XKCD you've linked. I don't want to lose the ability to securely run VMs as the interface between the host and the guest OSes grows just as unmanageable as that of Linux and BSD system calls and new software starts demanding that I let it use the entirety of it, just like some software already insists that I let it run as root because privilege dropping was never implemented.
If you develop software, you should know what kind of operating system access it needs to function and sandbox it appropriately, using the operating system's sandboxing facilities, not the tools reserved for system administrators.
I'm not talking about an IBM mainframe. The definition Google gives me for mainframe is `a large high-speed computer, especially one supporting numerous workstations or peripherals`, which is exactly what my machine is.
mainframe (noun)
main· frame ˈmān-ˌfrām
1: a large, powerful computer that can handle many tasks concurrently and is usually used commercially
2: (dated): a computer with its cabinet and internal circuits especially when considered separately from any peripherals connected to the computer
The features you list are great to have, but my setup fits the first definition of mainframe as described. If you feel this definition is not specific enough, email Merriam-Webster and don't bother me about it.
Webster is wrong. A mainframe is not a generic high performance computer (that would be HPC). A mainframe is a very specific high performance computer.
I repeat: I understand that mainframe has a specific meaning to many people, especially those who work on traditional mainframes, but I would rather you and the other user email both Google and Merriam-Webster about their wrong definitions, and not bother me about it. I will correct my usage once they have updated the definition to your standards.
> Now hardware virtualization. I like how AArch64 generalizes this: there are 4 levels of privilege baked into the architecture. Each has control over the lower and can call up the one immediately above to request a service. Simple. Let's narrow our focus to the lowest three: EL0 (classically the user space), EL1 (the kernel), EL2 (the hypervisor). EL0, in most operating systems, isn't capable of doing much on its own; its sole purpose is to do raw computation and request I/O from EL1. EL1, on the other hand, has the powers to directly talk to the hardware.
> Everyone is happy, until the complexity of EL1 grows out of control and becomes a huge attack surface, difficult to secure and easy to exploit from EL0. Not good. The naive solution? Go a level above, and create a layer that will constrain EL1, or actually, run multiple, per-application EL1s, and punch some holes through for them to still be able to do the job—create a hypervisor. But then, as those vaguely defined "holes", also called system calls and hypercalls, grow, won't the attack surface grow too?
(Disclaimer: this is all very far from my area of expertise, so some of the below may be wrong/misleading)
Nobody can agree whether microkernels or monolithic kernels are "better" in general, but most people seem to agree that microkernels are better for security [0], with seL4 [1] being a fairly strong example. But microkernels are quite a bit slower, and in the past, when computers were slower, that overhead was much more noticeable, while security was less of a concern than it is now, so essentially every mainstream operating system in the 90s used some sort of monolithic kernel. These days, people might prefer different security–performance tradeoffs, but we're still using kernels designed in the 90s, so it isn't easy to change this any more.
Moving things to the hypervisor level lets us gain most of the security benefits of microkernels while maintaining near-perfect compatibility with the classic Linux/NT kernels. And the combination of faster computers (performance overheads therefore being less of an issue), more academic research, and high-quality practical implementations [2] means that I don't expect the current microkernel-style hypervisors to gain much new attack surface.
This idea isn't without precedent either—Multics (from the early 70s) was partially designed around security, and used a similar design with hardware-enforced hierarchical security levels [3]. Classic x86 also supports 4 different "protection rings" [4], and virtualisation plus nested virtualisation adds 2 more, but nothing ever used rings 1 and 2, so adding virtualisation just brings us back to the same number of effective rings as the original design.
I would like to qualify that seL4 (and the entire family of L4 kernels) were created exactly to disprove the idea that microkernels were slow. They are extremely performant.
The idea that microkernels are slow came from analyzing a popular microkernel at the time - mach. It in no way is a true blanket statement for all microkernels.
> The idea that microkernels are slow came from analyzing a popular microkernel at the time - mach. It in no way is a true blanket statement for all microkernels.
Don't microkernels inherently require lots of context switches between kernel-space and user-space, which are especially slow in a post-Meltdown/Spectre world? I know that Linux has semi-recently added kTLS and KSMBD to speed up TLS/SMB, and Windows used to implement parts of font rendering and its HTTP server in kernel mode to speed things up too, so this gave me the impression that having more things inside the kernel (== more monolithic) is better for speed. Or is this only the case because of how the Linux/NT kernels are implemented, and doesn't apply to microkernels?
In an era where most development-oriented software is downloaded with wget/git clone/[package manager] install, this whole process feels like a slap in the face. And don't get me wrong, this is still a huge upgrade over the InstallShield Wizard of the previous versions, which rarely worked at all, and if it did, it would butcher your /etc/profile, but it's still an absolute abomination, bundling an entire JRE for the only rightful architecture, x86-64, just to download and unzip a few files.
Oh! I only worked with it commercially prior to that so I never got the memo. What an insanely stupid move. That was one of their USPs.
In general QNX was commercially mismanaged and technically excellent. I'm imagining a world where they clued in early on that an open source real-time OS would have run circles around the rest of the offerings and they'd have cleaned up on commercial licensing. Since the 80's they've steadily lost mindshare and marketshare, though I suspect they'll always be around in some form.
There's been talk about this on Reddit too, where our chief architect of QNX 8 broke down the decision. He mentioned it was ultimately a tough decision, but that in the end the cons outweighed the pros.
Hey, could you please post a link to the thread you're referring to? I'm guessing it had to do with the io-pkt to io-sock transition, but I couldn't find any information about that.
I've also noticed that all of the message passing system calls still accept the node ID. Are there plans to open up this interface to allow for implementation of custom network managers, maybe? I'd be very interested in exploring that.
Such decisions should always involve the customers. A chief architect that knocks one of the foundation stones out from under a building isn't doing the bureau they work for any favors.