Sounds like the vulnerability is in the software (the hardware works as specified): "Intel claims that this vulnerability is a software implementation issue, as their processors are functioning as per their documented specifications. However, software that fails to take the Intel-specific SYSRET behavior into account may be vulnerable."
Intel's position is disingenuous at best. The SYSRET instruction was introduced by AMD long before Intel added support for it, so it was entirely natural for people to expect that Intel's implementation would be consistent with AMD's specification.
If you build a car which experiences temporal anomalies when driven at 88 miles per hour, it isn't good enough to have a line of fine print in the middle of a 1500 page manual. People expect to be able to drive a car at 88 miles per hour without ending up in the wrong century, and you should either not violate that assumption or have really big warning signs.
I'd say blame accumulates pretty well on both sides, here. Certainly Intel should not have gratuitously changed their behavior from AMD's. On the other hand, OS code often needs to deal with CPU-specific behavior, and they definitely should have read that fine print in the middle of the 1500 page manual when implementing this code, especially when the 1500 page manual's index points to exactly that fine print for this instruction.
How precisely should OS developers have read the fine print in Intel's 1500 page manual which hadn't yet been published when support for SYSCALL/SYSRET was implemented?
Well, you didn't really need to read the fine print to do the right thing. AMD's manual also says noncanonical addresses are bad. You're correct that it was unwise for Intel to change the error behavior, but it's also unwise for the OS to depend on that error behavior when it's possible to avoid the error entirely.
I'm not sure AMD's manual is particularly clear here either. The Xen blog points out that the AMD behavior was determined by experimentation, because the manual doesn't specify when the canonical check occurs. From what I've gathered, the Intel behavior does in fact conform to the letter of the AMD spec; it just doesn't mirror AMD's implementation.
Presumably they should have at least read it after the Intel manuals shipped. (And did Intel really ship x86-64 CPUs without corresponding manuals? Weird if so.) Now, not doing so is a completely understandable mistake, but still a mistake.
And did Intel really ship x86-64 CPUs without corresponding manuals? Weird if so.
As I said, SYSCALL/SYSRET were introduced by AMD. Intel introduced their first x86-64 CPUs over a year after the architecture was introduced and in use (which is why FreeBSD still uses the name "amd64" for that platform).
I believe the point was that the software worked 100% correctly. Your point seems to be that the software should have been updated to support Intel CPUs when they were released.
From my perspective, Intel implemented an existing instruction differently. I don't know enough about the low-level details to know best practices for detecting which features a processor supports, but if existing software is told (or figures out) "Yeah, I can do that" by the processor, did Intel really expect every single program using the instruction to ship an update? Nice.
This isn't "every single program", it's only OS system call code. My understanding is that code in the kernel sometimes needs adjustment for new CPUs, due to errata or just plain old changes.
Sorry about that. Given that the code which uses the instruction is not a "program" per se but rather is found in OS kernels, I hope you can understand my confusion. How many x86-64 kernels existed and had to be changed at the time?
This whole issue is way lower-level than anything I deal with on a regular basis, so the whole time I've been looking at it from a very general perspective, e.g. 'a CPU instruction' rather than 'the specific CPU instruction used by operating systems to switch to user mode on 64-bit processors, originally implemented by AMD and later implemented by Intel in accordance with the spec but differing in implementation', as happened here.
My understanding is that instructions which are only used by OS kernels tend to be a little less stable than those used by general-purpose user programs. However, this is not exactly my area of expertise.
So every development team will need to carefully review every CPU manual that comes out in the future, in case Intel decides to change what some other instruction does? Even when the scope is narrowed to operating systems, and further to cases where one manufacturer adopts features introduced by another... ouch! At the very least, can we agree it is not a step in the right direction?
Of course people porting software will have to read the documentation. How else do you expect them to figure out the do's and don'ts of a new CPU?
Just like software, CPUs have release notes that describe the major behavioral changes and additions. For most software, these will provide sufficient info. However, if you are writing an OS or want to wring out maximum performance, you will have to read about every single detail. For example, suppose that division gets one cycle cheaper, or that a pipeline gets one stage deeper. That could shift the balance in such a way that a compiler needs to change in order to produce the best code possible.
Thanks for sharing your perspective. I guess I don't consider running (old) software on a processor that has implemented an existing instruction set to be porting, but in this case it comes down to how software is expected to enable use of processor features at run-time (and I don't know best practices there).
Here's how I see this particular scenario: old software detects CPU supports [whatever] and uses it, bug-free. New processor tells old software it supports [whatever]. Old software tries [whatever], but new processor works differently - resulting in critical security vulnerability! Processor company blames software.
"New processor tells old software it supports [whatever]"
Processors almost never tell software anything; their manuals tell the programmers how the CPU behaves. There are instructions that can inform code about optional features of the instruction set, but the software has to run on the CPU before it can query the CPU for its capabilities, so it has to make some assumptions about the CPU. (You could bootstrap things by having a smaller CPU boot the hardware and query the 'real' one for its capabilities, but that just moves the problem to the smaller CPU.)
So, in the end, the only way that software can know how the CPU behaves is because the programmers have read the CPU's documentation.
As to this specific case: I haven't bothered reading up on it, so I do not know what happened. It could be a lack of documentation by Intel, it could be that AMD's documentation was incomplete or ambiguous and that Intel followed it but implemented something slightly different, or it could be that Intel's documentation explicitly warned about this incompatibility with AMD devices.
And most specs leave room for different implementations, either by design or because the people writing the spec forgot or did not bother to describe a case. For a famous example, various revisions of the 6502 treated undocumented opcodes differently (http://visual6502.org/wiki/index.php?title=6502_Unsupported_...). And yes, there was code that made use of these.
I looked at the AMD and Intel specs, and both use CPUID function 80000001H to set bit 11 in EDX if SYSCALL/SYSRET is supported. Because both AMD and Intel 64-bit processors support the same general instruction set, software written to check only bit 11 could easily run into the scenario I described. Again, I do not know best practices in detecting CPU features, but I do know software developers often tend towards doing the least necessary to get the job done.
The issue seems to be present in Xen, Windows, and Linuxes -- this suggests that, regardless of technicalities, Intel did something unwise. (Or at least didn't communicate effectively with software vendors.)
"Linux actually fixed the bug in 2006, with CVE-2006-0744. [ http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2006-0744 ] But the description says “Linux kernel before 2.6.16.5 does not properly handle uncanonical return addresses on Intel EM64T CPUs…”, which makes it sound like something Linux-specific. It’s therefore not surprising that it attracted little notice from other operating systems."