Overview
TL;DR: full reliable 0day drive-by exploit against Fedora 25 + Google Chrome, by breaking out of Super Nintendo Entertainment System emulation via cascading side effects from a subtle and interesting emulation error. Very full details follow.
[UPDATE 13 Dec 2016 -- a couple of competent readers inform me that I've named the wrong processor! The actual emulated processor where the fault lies is the Sony SPC700, not the Ricoh 5A22. Whoops. The SPC700 is exclusively an audio co-processor. The 5A22, which is not emulated by Game Music Emu, is the main processor.]
I had a lot of fun compromising the Linux desktop using 6502 opcodes on the original Nintendo NES. Would it be possible to have even more fun? Why, yes it would! My previous NES related exploit suffered from multiple fun-limiting issues:
- Although it was a genuine 0day exploit, it only affected very old Linux distributions. Something affecting bang up to date Linux installs would generate greater lulz.
- The vulnerability that was abused -- a total lack of bounds checking on memory bank mapping -- was somewhat obvious. More fun can often be had with vulnerabilities that are slightly more subtle.
- The lack of “super”! The Super Nintendo Entertainment System (SNES) is even more iconic than the original NES. Regarding its 1990 release, Wikipedia notes “the resulting social disturbance led the Japanese government to ask video game manufacturers to schedule future console releases on weekends”. So we need more Super.
Resolving all the above, I present here a full, working, reliable, 0day exploit for current Linux distributions (Ubuntu 16.04 LTS and Fedora 25). It’s a full drive-by download in the context of Fedora. It abuses cascading subtle side effects of an emulation misstep that at first appears extremely difficult to exploit but ends up presenting beautiful and 100% reliable exploitation possibilities.
You’ve likely guessed it by now, but the Linux gstreamer media playback framework supports playback of SNES music files by…. emulating the SNES CPU and audio processor, courtesy of Game Music Emu. How cool is that?
Demo and impact
Today, the demos are videos instead of images. This first video shows a full, reliable drive-by download against Fedora 25 + Google Chrome. The strong reliability of this exploit makes it work inside Fedora’s tracker-extract process, which has highly variable heap state that has frustrated my other exploit attempts. Finally, decent exploit proof of my earlier suspicion that tracker + Google Chrome is very dangerous:
Exploit file: gnome_calc_fedora_25_libc_2.24-3.spc (rename it to .flac to get it to work as in the video).
And this second video shows a couple of different exploitation contexts in Ubuntu 16.04 LTS, using the same exploit file for each. Again, this is showcasing the reliability that the underlying vulnerability permits. The different exploited processes (gnome-video-thumbnailer and totem) have very different heap and threading setups:
Exploit file: xcalc_ubuntu_16.04_libc_2.23-0ubuntu3.spc (rename it to .mp3 to get it to work as in the video).
Impact is mixed. On Ubuntu, the faulty code is installed and on the attack surface by default, if you select the “mp3” option during install -- which I certainly always do. On Fedora, there’s a very sensible decision to split gstreamer1-plugins-bad into multiple packages, with only gstreamer1-plugins-bad-free installed by default. This limits the attack surface and does not include Game Music Emu. Of course, the gstreamer framework will happily offer to install gstreamer1-plugins-bad-free-extras, with a very nice UI, if the victim simply tries to open the relevant media file.
As always, the general lack of sandboxing here contributes to the severity. I think we inhabit a world where media parsing sandboxes should be mandatory these days. There’s hope: some of my other recent disclosures appear to have motivated a sandbox for Gnome’s tracker.
Introducing Blargg!
A Blargg, in Super Mario World, is a monster that lives in lava. It is also the nickname or handle of the original author of Game Music Emulator, and the handle appears in various macros, e.g.
Ay_Cpu.cpp:#if BLARGG_BIG_ENDIAN
I think this is a wonderful sounding word. In fact it sounds like a Terry Pratchett-esque expletive. Instead of presenting the exploit as a magically working thing, we’ll cover a little of the journey towards the final payloads. On this journey, we’ll make mistakes, and crash into nasty dead ends after following promising leads. Exploits do not magically appear; there is often a lot of undocumented sweat behind them. To deal with these frustrations, we’ll need a suitable expletive. To honor Terry, blarrrgggg!! it is.
Two vulnerabilities
At least two vulnerabilities exist in the core emulation logic of the Ricoh 5A22 processor, which is mostly in Spc_Cpu.h.
The Ricoh 5A22 processor instruction set will look immediately familiar to those who have coded a little bit of 6502 assembly. The 5A22 is based on a 65C816 processor, which is itself based on the 6502. So you’re still looking at a 64kB address space, three main 8-bit registers (A, X and Y), etc. But there are also some new instructions that can affect 16 bits of data, or multiple registers at once (oh the luxury)! There are even instructions to directly perform multiplication and even division.
1: Missing X register value clamp for the MOV (X)+,A instruction
(CESA-2016-0012)
This is one of the new, better-than-6502 instructions. It does two things with the single opcode 0xAF: write the value of the A register to the memory location indicated by the X register, and then increment this X register. The code is simple enough, from Spc_Cpu.h:
case 0xAF: // MOV (X)+,A
WRITE_DP( 0, x, a + no_read_before_write );
x++;
goto loop;
WRITE_DP( 0, x, a + no_read_before_write );
x++;
goto loop;
And contrast this with the code for the similar opcode 0xBF, which does a read instead of a write:
case 0xBF:{// MOV A,(X)+
int temp = x + dp;
x = (uint8_t) (x + 1);
a = nz = READ( -1, temp );
goto loop;
int temp = x + dp;
x = (uint8_t) (x + 1);
a = nz = READ( -1, temp );
goto loop;
As can be seen, it is normal to clamp the value of the X register to an 8-bit unsigned value, after modifying it in some way. Curiously, the 0xAF opcode appears to be the only place this is missed.
This lack of clamping wouldn’t be a problem, but for the fact that the X register is not defined as a uint8_t on the local stack frame:
int x = m.cpu_regs.x;
(Various references in the code suggest that this might be for performance because even Intel is slower for non-machine-word-width instructions. Thanks, Intel, I guess?)
At first glance, this appears to be a readily exploitable bug, then. Just hammer out a bunch of 0xAF opcodes in a row and eventually the X register will become so large that the write to the memory location referred to by X will be out of bounds. If only it were that simple. We’ll revisit this in the next section.
2: Missing SP register value clamp for the RET1 instruction
(CESA-2016-0013)
This is another new instruction that does a couple of things: first it restores the flags register from the stack and then restores the instruction pointer from the stack, like this:
case 0x7F: // RET1
temp = *sp;
SET_PC( GET_LE16( sp + 1 ) );
sp += 3;
goto set_psw;
temp = *sp;
SET_PC( GET_LE16( sp + 1 ) );
sp += 3;
goto set_psw;
Again, this code is curious because it appears to be the only modification of the SP register that does not implement appropriate clamping. The SP register is supposed to range from 0x00 to 0xff, representing “next push” stack pointer positions of virtual memory addresses 0x100 - 0x1ff. The sp local variable actually maintains the stack pointer as a raw host heap uint8_t* pointer. So by setting SP to 0xff and then issuing opcode 0x7F, we end up with an effective SP register value of 0x102. Furthermore, once you’ve over-incremented the SP register, further stack pushes and pops do not get stopped. The code only checks for the exact SP wrap around boundary values, for example for a PUSH:
#define PUSH( data )
{
*--sp = (uint8_t) (data);
if ( sp - ram == 0x100 )
sp += 0x100;
}
{
*--sp = (uint8_t) (data);
if ( sp - ram == 0x100 )
sp += 0x100;
}
So we can also envisage that this is a very exploitable bug: use RET1 to get the stack pointer out of bounds; then use POPs to increment the stack pointer until it gets somewhere interesting in the host heap -- using the POPped stack values to leak host heap pointers etc.; then use PUSHes to overwrite host heap structures and pointers, perhaps with values calculated from the leaky POPs.
Well, let’s going and see if things end up so simple. (Hint: no.)
3: Note on equivalence
(Not a vulnerability per-se, just an observation.)
It’s worth noting that these two separate vulnerabilities are likely to have largely equivalent impact. If we can get an out-of-range SP register value, we can transfer it to the X register to get an out-of-range X register value, and visa versa. This is courtesy of the MOV X,SP and MOV SP,X instructions:
case 0x9D: // MOV X,SP
x = nz = GET_SP();
goto loop;
case 0xBD: // MOV SP,X
SET_SP( x );
goto loop;
x = nz = GET_SP();
goto loop;
case 0xBD: // MOV SP,X
SET_SP( x );
goto loop;
Exploitation impediments
Ok, let’s proceed with exploitation of the MOV (X)+,A issue. This should be a straightforward buffer overflow. We typically start by proving the presence of the underlying vulnerability by triggering a crash. So let’s just execute the following sequence in a test SPC file:
AF 2F FD:
here:
MOV (X)+,A
BRA here
Little tests like this can be cooked up by putting the desired opcode sequence at offset 0x100 in this skeleton file: skeleton.spc.
And no crash. What the blarrgg? Upon investigation in the debugger, we see that the X register definitely gets way larger than 0xff, and values are written in virtual address space at offsets greater than 0xff, but then the writes stop.
What is going on here relates to audio chunk processing. In order to synchronize the processor and the audio generation, the processor is run for a little bit (typically 32768 cycles) and then the audio generator routine is run to generate waveforms based on whatever values the CPU has been banging into the audio hardware registers. All fine so far. But in between these little 32768 cycle runs of CPU, the CPU register state is saved to the heap and then restored to stack variables next time around. Here’s the code that implements saving to heap:
m.cpu_regs.pc = (uint16_t) GET_PC();
m.cpu_regs.sp = ( uint8_t) GET_SP();
m.cpu_regs.a = ( uint8_t) a;
m.cpu_regs.x = ( uint8_t) x;
m.cpu_regs.y = ( uint8_t) y;
m.cpu_regs.sp = ( uint8_t) GET_SP();
m.cpu_regs.a = ( uint8_t) a;
m.cpu_regs.x = ( uint8_t) x;
m.cpu_regs.y = ( uint8_t) y;
BLARRRRGG!!!! Well, we see that SP and X register clamping is implemented here. In effect, this means that we have a finite CPU cycle budget within which we have to cause our out-of-bounds register and then abuse some side effect of that.
We can live with this. Instead of using the MOV (X)+,A instruction to tediously increment the X register all the way past 0xffff (which is where the write to X will bust out of virtual CPU addresses and hopefully into the host heap), we can use the instruction only for the purposes of generating a large X register value. And we can then use a different side effect of a large X register, with this instruction:
D5 FF FF:
MOV 0xFFFF+X,A
The astute reader will note that this instruction would already appear dangerous even with in-bounds values of the X register. Taking the largest in-bounds value of 0xff, this would appear to write to virtual address 0x100fe, which is out-of-bounds. Well, looking at the runtime data structure in Snes_Spc.h, we see this is taken care of with the concept of padding:
struct state_t
{
...
{
...
struct
{
// padding to neutralize address overflow
union {
uint8_t padding1 [0x100];
uint16_t align; // makes compiler align data for 16-bit access
} padding1 [1];
uint8_t ram [0x10000];
uint8_t padding2 [0x100];
} ram;
{
// padding to neutralize address overflow
union {
uint8_t padding1 [0x100];
uint16_t align; // makes compiler align data for 16-bit access
} padding1 [1];
uint8_t ram [0x10000];
uint8_t padding2 [0x100];
} ram;
};
state_t m;
state_t m;
So we definitely need an out-of-bounds X register value. Let’s try again to get a heap overflow crash, with this program:
CD FF E8 41 AF D5 FF FF 2F FA:
MOV X,#0xFF
MOV A,#0x41
here:
MOV (X)+,A
MOV 0xFFFF+X,A
BRA here
…. and no crash? What the blarrggity blarrrggg? The extent of our bad luck becomes apparent with a bit of debugging under the test binary we are using, gst-play-1.0. It turns out that the heap in the thread arena for the decoding looks like this:
| header | stuff | stuff | SPC decoder |m| file data | free | PROT_NONE |
This layout is stable and it makes sense: the SPC decoder object is 70792 bytes. This is too small to be allocated with mmap(), but big enough that there is very unlikely to be a heap “hole” to satisfy the allocation request. Therefore, it will go at the end of the heap arena when it is allocated. In the case of gst-play-1.0, there is a subsequent allocation, of ~64kB, where the input file data is stored. This again is a large allocation so will go after the SPC decoder object at the end of the heap. There is no advantage to corrupting into the input file data; it is just a sequence of characters, no pointers and the attacker already has control over the content of the input file. And the input file data is large, so we do not have enough CPU cycles in our budget to corrupt past this allocation and mess with the top of heap chunk (av->top in glibc malloc terms).
So our sole option is to corrupt the 8 byte size + flags value (in red above) that lives between the heap chunks containing the SPC decoder object and the input file data. This will be tough! Not to say it is impossible, but the last time I tried to exploit glibc metadata corruption in a decent ASLR environment, I found it hard enough that I had to pull some tricks to “cheat” a little. We do not like this path, then. In fact, the situation is even more dire for two reasons:
- Corruption of the size value is verified in the debugger as occurring but is not causing a crash.
It turns out that the gstreamer decoder wrapper for libgme contains a bug. If the decoding starts successfully, then gme_delete() is never called and the chunks surrounding the corrupted value are never freed. A memory leak. Unbelieveable and BLARRRRGGGGG! (Reference: lack of gme_delete() call in gst_gme_dec_dispose(), gstgme.c) - Even though we wrote the byte value 0x41 to our out-of-bounds locations, the debugger shows that in fact, 0xff was written instead. Eh? It turns out that Snes_Spc::cpu_write() function has logic to preserve the “padding” bytes, that we briefly mentioned above, as 0xff values. So, if a virtual address >= 0x10000 is written, the logic is:
- Write the value to the host heap: RAM [addr] = (uint8_t) data;
- If the written address was >= 0x10000, put the 0xff value back, in cpu_write_high(): RAM [i + rom_addr] = cpu_pad_fill; // restore overwritten padding
- Also if written address was >= 0x10000, subtract 0x10000 and do the write again, effectively providing standard “wrap around” semantics at the address space boundary: cpu_write( data, i + rom_addr - 0x10000, time );
The heap situation looks a little better inside the totem video player process because the input file data is allocated in some other heap thread arena. So the SPC decoder object is right against the end of allocated memory in the area and we get to corrupt the av->top “top of memory” value. But that’s still not hugely useful in this instance because there is very little heap activity going on in the decode thread during decode.
Summary: after starting with two very promising looking vulnerabilities, situation is looking very grim.
Cascading the vulnerability
We’re running out of options. But is it possible that there are other side effects to having an out-of-bounds X register value, other than directly writing out of bounds? With this question at the front of our mind, a careful re-read of all the opcodes does turn up some ideas.
Cascading side effect #1: MUL
Although most operations on the A, X and Y registers clamp the resulting values carefully, the very interesting new multiply instruction, MUL, does not:
case 0xCF: { // MUL YA
unsigned temp = y * a;
a = (uint8_t) temp;
nz = ((temp >> 1) | temp) & 0x7F;
y = temp >> 8;
nz |= y;
goto loop;
unsigned temp = y * a;
a = (uint8_t) temp;
nz = ((temp >> 1) | temp) & 0x7F;
y = temp >> 8;
nz |= y;
goto loop;
This instruction takes two 8-bit values in the Y and A registers, and multiplies them together. The 16-bit result is split into a high and low 8-bit result value, again stored in the Y and A registers. Neat little instruction! Note how the A result value is clamped to uint8_t but the Y result value is not. That’s actually strictly correct: multiplying two unsigned 8-bit values has a largest result of 0xff * 0xff == 0xfe01, which would leave Y==0xfe, which does not need further clamping.
However, we have a couple of tricks to generate 8-bit register values that are out of bounds. We can increment X past 0xff and then transfer the X value to the A and / or Y registers, and then enter the multiply. With this corrupt input state, the lack of clamping of the Y value allows us to generate much larger out-of-range register values without needing too many instructions. In effect, we can exponentiate the out-of-bounds values. The instruction count limit we are operating under no longer seems so onerous.
Example: let’s say we use the 0xAF opcode trick to increment the X register to the value 512. Now, we can use the MOV A,X and then MOV Y,A instructions to move the value 512 into the A and Y registers. Using MUL then leaves 1024 in the X register. Repeating the trick, we get 4096 in the X register. Do it again and we get 65536. These numbers are getting bigger fast!
However… since the multiplication is done in an unsigned variable, and the result is right shifted by 8, the largest value we’ll ever get using this trick is 2^24 - 1, or about 16MB. (To get there or near there, we’ll need to avoid integer overflow at 65536 * 65536, perhaps by starting the multiplication series with a slightly smaller value. We’ll deal with it later.)
Is this useful? Unfortunately, thread heap arenas are 64MB in size, e.g.:
7fffbc000000-7fffbc16a000 rw-p 00000000 00:00 0
7fffbc16a000-7fffc0000000 ---p 00000000 00:00 0
7fffbc16a000-7fffc0000000 ---p 00000000 00:00 0
So that 16MB “reach” is only going to crash when it smashes into that mapped but PROT_NONE area of reserved but unused thread heap arena. Blarrggg…..
Cascading side effect #2: DIV
So now we have potentially very large values in the X register. Is there another chained side effect we can abuse? Or are we running out of rope here? It turns out that there’s one last hope: the DIV instruction is just as interesting as the MUL one. It also does transforms on the values of incoming registers, leaving the results in the A and Y registers, without any clamping on the Y result. Here’s the code:
case 0x9E: // DIV YA,X
{
unsigned ya = y * 0x100 + a;
{
unsigned ya = y * 0x100 + a;
...
if ( y < x * 2 )
{
a = ya / x;
y = ya - a * x;
}
else
{
a = 255 - (ya - x * 0x200) / (256 - x);
y = x + (ya - x * 0x200) % (256 - x);
}
if ( y < x * 2 )
{
a = ya / x;
y = ya - a * x;
}
else
{
a = 255 - (ya - x * 0x200) / (256 - x);
y = x + (ya - x * 0x200) % (256 - x);
}
...
a = (uint8_t) a;
Even though this code is fairly simple, I don’t claim to understand it 100%, particularly with large input values. What I do know is that I see various integer overflow opportunities, integer underflow opportunities, less useful div-by-zero issues, etc. If I’m being honest here, I copied the code into a simple .c file and plugged in various A, X and Y register input values to see what dropped out. I focussed on experimenting with “clean” input values that are powers of two, or one greater or less than that. And hey presto with a bit of experimentation, I found this:
Input: A = 0, X = 257, Y = 16777215
Output: A = 255, X = 257, Y = -131583
A negative value!! This has the potential to change everything. The use of the int type for the registers isn’t just about width, it’s about signedness. A negative value is awesome for us for two reasons:
- A negative value, when added to the base pointer for the virtual RAM, will be sign extended to 64-bits and therefore reference before the virtual RAM on the heap. This gets us past the serious problem that there is very little of interest after the virtual RAM.
- Looking again at Snes_Spc::cpu_write, a negative address value will avoid the logic that decides to restore padding values. It will also avoid logic that tries to “wrap around” address space writes.
Surely we can move an exploit forward now?
100% reliable exploitation
The virtual RAM is not allocated with a separate malloc() call; it’s a member array of a larger C++ object, the key parts of which are here:
class Spc_Emu : public Music_Emu { // base class defines virtual methods
...
...
Snes_Spc apu;
}
struct Snes_Spc {
Spc_Dsp dsp;
struct state_t
{
...
{
...
int extra_clocks;
sample_t* buf_begin;
sample_t const* buf_end;
sample_t* extra_pos;
sample_t extra_buf [extra_size];
...
sample_t* buf_begin;
sample_t const* buf_end;
sample_t* extra_pos;
sample_t extra_buf [extra_size];
...
struct
{
// padding to neutralize address overflow
union {
uint8_t padding1 [0x100];
uint16_t align; // makes compiler align data for 16-bit access
} padding1 [1];
uint8_t ram [0x10000];
uint8_t padding2 [0x100];
} ram;
};
state_t m;
{
// padding to neutralize address overflow
union {
uint8_t padding1 [0x100];
uint16_t align; // makes compiler align data for 16-bit access
} padding1 [1];
uint8_t ram [0x10000];
uint8_t padding2 [0x100];
} ram;
};
state_t m;
}
struct Spc_Dsp {
struct state_t
{
struct state_t
{
...
uint8_t* ram; // 64K shared RAM between DSP and SMP
uint8_t* ram; // 64K shared RAM between DSP and SMP
All of the above fields are located within the same heap chunk because they are part of the same object. We are starting to realize that this is an exceptionally beautiful vulnerability because we can use our negative X register to index backwards, and access the following fields without the unreliability of crossing a heap chunk:
- A vtable value. We can read it to defeat ASLR by calculating the address of various code and data sections relative to it. We can write it to direct code execution.
- A pointer value to the actual host heap memory address of the virtual RAM (Spc_Dsp::ram). This is pretty spectacular. We can now consider forging a vtable inside the virtual CPU address space… and then we know what address to overwrite the vtable pointer with so that our malicious vtable is cleanly referenced.
- Some sample buffering and timing fields that give us a bit of control over reads and writes outside of the main Spc_Emu object, if we need that.
(Sample file: perfect_vtable.spc. Forges a vtable inside virtual RAM and causes a fault trying to execute the string “xcalc” at a known address. Highly process, heap state and Linux distro independent.)
Memcpy fail
My first attempt to put all this together in an exploit on Ubuntu used one of my standard tricks: overwriting the memcpy() GOT entry with system() and then triggering some memcpy() in the code. At first glance, there’s an excellent trigger opportunity by having the virtual CPU write to a hardware register to enable or disable a small window of ROM:
void Snes_Spc::enable_rom( int enable )
{
if ( m.rom_enabled != enable )
{
m.rom_enabled = enable;
if ( enable )
memcpy( m.hi_ram, &RAM [rom_addr], sizeof m.hi_ram );
memcpy( &RAM [rom_addr], (enable ? m.rom : m.hi_ram), rom_size )
{
if ( m.rom_enabled != enable )
{
m.rom_enabled = enable;
if ( enable )
memcpy( m.hi_ram, &RAM [rom_addr], sizeof m.hi_ram );
memcpy( &RAM [rom_addr], (enable ? m.rom : m.hi_ram), rom_size )
Importantly, if memcpy() is rewired to system(), we just need control of the memory content referenced by the first parameter. Since the second memcpy() is copying to virtual RAM, we trivially have this control.
And I had working exploit along these lines in my debug build, that promptly failed in a real build of libgme.so. What the actual blarrggg?? It turns out that the optimizing compiler decides that since the copy length is a small-ish constant, it will expand these memcpy() calls to inline sequences of reads and writes. Great. Still, a plan B isn’t hard. Also, as noted in previous blog posts, this technique will work on Ubuntu but not Fedora, since the latter has stronger exploit mitigations turned on, including RELRO. In the next section, we’ll pursue something that will work on both Ubuntu on Fedora.
Putting it together
We have all the pieces and information to put together a decent exploit now. Here’s how the exploit works that you saw in the video. We’ll describe the Fedora variant:
1: Set the X register to 255 and then use opcode 0xAF (MOV (X)+,A) in a loop to bump it up to 511
It’s worth remembering that as we abuse the 0xAF opcode to bump up the value of the X register, it writes to the virtual address of the X register, effectively corrupting anything we put there. In this stage, virtual addresses 0xff - 0x1fe get corrupted. In later stages, other smaller areas of virtual RAM get corrupted.
To avoid all these areas of corruption, we host the main code and metadata at virtual address 0x800 - 0xa00 with supporting routines at 0x700. To avoid mistakes, the input file marks virtual addresses which get nailed with the byte 0xfe.
2: A bit of register copying, more additions and then a multiply (MUL, 0xCF)
The X register value of 511 is copied to the A and then Y register. The X register is then further bumped up using 0xAF again, to 513. This is transferred to A.
The MUL instruction is called, which does Y * A, or 511 * 513. The resulting value in Y, after the final right shift by 8, is 0x3ff.
3: Repeat the process of multiplying 2^n + 1 and 2^n - 1
The power of using (2^n + 1) * (2^n - 1) is that after the right shift by 8, this leaves us with a 2^n - 1 result in the Y register. This keeps our result as high as possible without getting to 2^16, which is too large and would leave us with integer overflow.
The sequence of resulting values in the Y register is: 0x3ff, 0xfff, 0xffff, 0xffffff.
4: Use of the division opcode (DIV, 0x9E) to create a negative value
With input A = 0, X = 257, Y = 16777215, we end up with Y = -131583
That’s actually a little large compared to what we need, as it references too far back in memory for us to be able to hit the Spc_Emu object. Playing around some more, we can use our new value and another DIV to get a more amenable value:
With input A = X = Y = -131583, we end up with Y = -65024. That’s much better. Since that value is close to but a bit smaller than negative the size of the 64kB address space, we’ll be able to corrupt backwards far enough to hit any field in the Spc_Emu object, but also still be able to corrupt forwards to conveniently read and write the first few hundred bytes of the virtual RAM.
5: Read the vtable value and store it at virtual RAM address 0x20 for convenience
Looking at offset 0xa00 in the exploit file, we see the first line of a table of reads and writes that are made:
7C EA 20 FE 00 00 00 00 00 00 00 00 76 74 62 6C
This is instructing the exploit loop to read 8 bytes from virtual address 0xea7c + Y, add the 8 byte constant 0, and then write to 0xfe20 + Y. (The last 4 bytes are not used in the code and in this case they spell “vtbl” to remind me what each line does.) Because it’s nice to look at a little 5A22 code, here’s the routine that implements the read / add / write loop:
CLRC ; clear carry, for clean addition
MOV X,#8 ; loop for 8 bytes
loop:
MOV A,(0x80)+Y ; read one byte using OOB Y
PUSH X ; save X
MOV X,#0 ; set X to 0
ADC (0x84+X) ; add constant from table
POP X ; restore X
MOV (0x82)+Y,A ; write one byte using OOB Y
INC 0x80 ; increment read location
...
INC 0x82 ; increment addition constant location
...
INC 0x84 ; increment write location
DEC X ; done one byte
BNE loop ; if not all 8 done, continue
RTS ; otherwise return
This code could be better: I didn’t take advantage of the 16-bit increment operations and I think I may have used a clumsy addressing mode for the ADC (add with carry). But it works. The reason it corrupts memory is that the Y register has a negative value in it. Once we get the negative value into the Y register, we just leave it there and never change it. For example, the instruction MOV A,(0x80)+Y loads a 16-bit virtual address from memory location 0x80, which we populated with 0xea7c from the metadata table. Then, Y (-65024) is added, giving -4996 as the actual virtual address that is read from. And in the real host heap, 4996 bytes before the start of the virtual RAM storage is the Spc_Emu vtable pointer. Voila.
6: Calculate the vtable value + 0x6d8, and store it at virtual address 0x28
The vtable value + 0x6d8 is the address of the free() GOT entry. As a reminder, the GOT is read only because we’re talking about the Fedora exploit.
7: Reload the address of the free() GOT entry and copy it to Spc_Emu::buf_begin
We also copy the address of the free() GOT entry, with an addition of 8, and store it to Spc_Emu::buf_end.
8: Corrupt some other timing related fields to make the sample generation engine behave
At this stage, we’re operating the main emulator object in a corrupt state, so we need to further corrupt some other fields to avoid problems. We set Spc_Emu::extra_clocks to -32760 so that the addition in Snes_Spc::end_frame ends up being close to 0 (8 I think), a safe value for our needs. We also set Spc_Emu::dsp_time to 0, a value which prevents the DSP (digital sound? processor) from running at all in Snes_Spc::end_frame. The DSP is a large chunk of code that touches a ton of state so causing a simple corruption to stop it running at all makes things safer and easier to reason about.
9: Allow the frame to end
We now end the current frame. To do this, we just run in a CPU loop that burns through our remaining CPU cycle budget. The loop is actually a little special; we’ll cover it more below. Once our virtual CPU pauses, some post-frame processing occurs. The DSP does not run because we caused a corruption to have that effect. But Snes_Spc::save_extra does run, which appears to be designed to handle small cycle offsets in the virtual CPU clock time vs. the virtual DSP clock time. Because of the buf_begin, buf_end and extra_clocks corruption we caused in steps above, the resulting effect is that 8 bytes of memory are read which correspond to the value of the free() function pointer inside the GOT. These bytes are placed in Spc_Emu::extra_buf.
10: New frame starts
When a new frame starts, we have the interesting problem: how does our virtual CPU know that a new frame has started? If the CPU is virtualized correctly, it shouldn’t really notice that its effort is being rationed into “frames”. And we do need to know that a new frame has started, because we need to know that interesting bytes await us in Spu_Emu::extra_buf, from step 9 above.
The solution is to monitor for a side effect that frustrated us (blargg! etc.) earlier. But now we can use it to our advantage. Take that, frustration! The act of exiting and then entering a frame saves and restores the CPU registers, clamping them to 8-bit values. So while we are waiting for the new frame, we have the CPU simply repeatedly do a multiplication that results in one value if our corrupt Y register is still corrupt, and another value if it has been clamped.
When the new frame does start, we need to start from scratch to re-generate the out-of-bounds value in the Y register so that we can continue to corrupt memory. Not a problem: we already have that code.
11: Save the free() pointer value from Spc_Emu::extra_buf
We read this value and save it at virtual address 0x30.
12: Calculate where system() is
Now that we have the value of free(), which is inside glibc, we can easily calculate the value of system(), which will be at a fixed offset. We do so and store it at virtual address 0x38.
13: Grab the Spc_Dsp::ram pointer
And store it at 0x40.
14: Read the ram pointer, add 0x100 to it, and overwrite the Spc_Emu vtable
Yeah, now it’s getting real. We now host a vtable at virtual address 0x100 in our virtual RAM.
15: Calculate vtable - 2370739 and store that at virtual RAM address 0x180
This is a bit magic. Also two more bits of magic:
Calculate ram + 0xc0 and store that at &Spc_Emu::vtable + 16.
Store system() at &Spu_Emu::vtable + 8.
16: Frame tick again. Exploitation occurs. Calc appears.
What just happened?
- After the frame tick, the play routine Spc_Emu::play_() is called again. It is virtual, and the vtable entry is at offset 0x80 into the vtable.
- We corrupted the vtable to point to 0x100 in virtual RAM, so a virtual call is made to the pointer at 0x180 in virtual RAM. This is vtable - 2370739.
- At vtable - 2370739, in libgme.so, is the following little x86_64 instruction sequence:
8bdd: 48 89 f8 mov %rdi,%rax
8be0: 48 8b 7f 10 mov 0x10(%rdi),%rdi
8be4: 48 8b 40 08 mov 0x8(%rax),%rax
8be8: 89 da mov %ebx,%edx
8bea: ff d0 callq *%rax
8be0: 48 8b 7f 10 mov 0x10(%rdi),%rdi
8be4: 48 8b 40 08 mov 0x8(%rax),%rax
8be8: 89 da mov %ebx,%edx
8bea: ff d0 callq *%rax
- This loads the pointer at object offset 0x10 into %rdi, which is where we put a pointer to offset 0xc0 into virtual RAM. That happens to be where the string “gnome-calculator” is in the input file.
- Then this loads the pointer at object offset 0x8 into %rax, which is where we put system().
- Then, it calls %rax, which calls system(). The first argument in the x86_64 ABI is %rdi, which points to “gnome-calculator”. Ergo, let’s math!
- This is a COP (call-oriented programming) payload, not ROP. In my opinion, ROP has issues. I recommend you prefer COP if you can. The reason I used any form of *OP is laziness. The path of least resistance to getting around Fedora’s RELRO was a quick little COP gadget. My usual trick of overwriting __free_hook looked to be messy.
Closing notes
Nothing much to add.
Appendix A: patch
No reason I shouldn’t propose a patch. Here it is. It fixes the two noted vulnerabilities and adds a bit of defensiveness to the opcodes that assisted exploitation. Ideally, the types of the local registers should be changed to uint8_t to provide a bunch more robustness. Perhaps the small performance difference mattered over a decade ago, when this code was written. Unlikely to still be the case.
diff -rubB gme-old/Spc_Cpu.h gme/Spc_Cpu.h
--- gme-old/Spc_Cpu.h 2016-12-10 19:05:05.000000000 -0800
+++ gme/Spc_Cpu.h 2016-12-12 15:43:11.563377845 -0800
@@ -76,8 +76,8 @@
// TODO: remove non-wrapping versions?
#define SPC_NO_SP_WRAPAROUND 0
-#define SET_SP( v ) (sp = ram + 0x101 + (v))
-#define GET_SP() (sp - 0x101 - ram)
+#define SET_SP( v ) (sp = ram + 0x101 + ((uint8_t) v))
+#define GET_SP() (uint8_t) (sp - 0x101 - ram)
#if SPC_NO_SP_WRAPAROUND
#define PUSH16( v ) (sp -= 2, SET_LE16( sp, v ))
@@ -485,7 +485,7 @@
case 0xAF: // MOV (X)+,A
WRITE_DP( 0, x, a + no_read_before_write );
- x++;
+ x = (uint8_t) (x + 1);
goto loop;
// 5. 8-BIT LOGIC OPERATION COMMANDS
@@ -808,7 +808,7 @@
unsigned temp = y * a;
a = (uint8_t) temp;
nz = ((temp >> 1) | temp) & 0x7F;
- y = temp >> 8;
+ y = (uint8_t) (temp >> 8);
nz |= y;
goto loop;
}
@@ -838,6 +838,7 @@
nz = (uint8_t) a;
a = (uint8_t) a;
+ y = (uint8_t) y;
goto loop;
}
@@ -1004,7 +1005,7 @@
case 0x7F: // RET1
temp = *sp;
SET_PC( GET_LE16( sp + 1 ) );
- sp += 3;
+ SET_SP(GET_SP() + 3);
goto set_psw;
case 0x8E: // POP PSW
POP( temp );
--- gme-old/Spc_Cpu.h 2016-12-10 19:05:05.000000000 -0800
+++ gme/Spc_Cpu.h 2016-12-12 15:43:11.563377845 -0800
@@ -76,8 +76,8 @@
// TODO: remove non-wrapping versions?
#define SPC_NO_SP_WRAPAROUND 0
-#define SET_SP( v ) (sp = ram + 0x101 + (v))
-#define GET_SP() (sp - 0x101 - ram)
+#define SET_SP( v ) (sp = ram + 0x101 + ((uint8_t) v))
+#define GET_SP() (uint8_t) (sp - 0x101 - ram)
#if SPC_NO_SP_WRAPAROUND
#define PUSH16( v ) (sp -= 2, SET_LE16( sp, v ))
@@ -485,7 +485,7 @@
case 0xAF: // MOV (X)+,A
WRITE_DP( 0, x, a + no_read_before_write );
- x++;
+ x = (uint8_t) (x + 1);
goto loop;
// 5. 8-BIT LOGIC OPERATION COMMANDS
@@ -808,7 +808,7 @@
unsigned temp = y * a;
a = (uint8_t) temp;
nz = ((temp >> 1) | temp) & 0x7F;
- y = temp >> 8;
+ y = (uint8_t) (temp >> 8);
nz |= y;
goto loop;
}
@@ -838,6 +838,7 @@
nz = (uint8_t) a;
a = (uint8_t) a;
+ y = (uint8_t) y;
goto loop;
}
@@ -1004,7 +1005,7 @@
case 0x7F: // RET1
temp = *sp;
SET_PC( GET_LE16( sp + 1 ) );
- sp += 3;
+ SET_SP(GET_SP() + 3);
goto set_psw;
case 0x8E: // POP PSW
POP( temp );
Appendix B: Godspeed to you, Terry Pratchett
“There is another point, of course. It would be a tragedy should anything untoward happen to our little visitor. It would be dreadful if he were to die, for example. Dreadful for the whole of our land, because the Agatean Emperor looks after his own and could certainly extinguish us at a nod. A mere nod. And that would be dreadful for you, Rincewind, because in the weeks that remained before the Empire’s huge mercenary fleet arrived certain of my servants would occupy themselves about your person in the hope that the avenging captains, on their arrival, might find their anger tempered by the sight of your still-living body. There are certain spells that can prevent the life departing from a body, be it never so abused, and- I see by your face that understanding dawns?”
“Yarrg.”