- Joined
- Oct 28, 2015
- Messages
- 114
- Reaction score
- 310
- First Language
- Portuguese
- Primarily Uses
- RMMZ
Event programming
or that time I simulated an ARM processor with events on RMVX Ace
or that time I simulated an ARM processor with events on RMVX Ace
Introduction
So, I think this one warrants a bit of context.
I'm not sure how this is on the international community, but I come from the brazilian RPG Maker community, and, at least to me, there seems to exist a big hesitancy from people there to learn scripting, often with the justification that it is too hard. People would actively avoid learning Ruby (and, more recently, JS) and make all sorts of systems with events (while occasionally throwing around script calls, mostly copied from somewhere); these would range from custom menus using pictures (which are mostly fine, to be honest) to whole battle systems (ABSs being specially common).
This always bugged me a little. The resulting systems were not bad per se (some were pretty awesome, actually), but more often than not they were insanely complicated, with a lot of duplication (I swear, one guy made a digital clock system that had a conditional for each minute of the day), and could be hugely simplified with some abstraction, which events unfortunately do not allow much of (and which scripting languages are specifically designed for). Making such systems with events (and especially making them look good and be usable) seems even harder than learning some programming and then writing them as scripts, at least to me.
With this in mind, I made this post [pt-br] a while back. It was mostly intended as a joke, pretending to be an introductory tutorial to "event programming", which then quickly derailed into radix math and then low level programming with C, kind of as a way of poking fun at that idea that events were easier to work with and that programming was some kind of dark magic that only the chosen ones could understand (in the end it might've ended up becoming more mystifying of programming than anything, though; my bad).
The end result was a number sorting program compiled from C with GCC (with an ARMv6 backend) and run inside an RMVX Ace game, all purely via events (no script calls involved), which I honestly think is pretty cool (albeit not really useful for making actual games, at least not directly).
I don't think translating the post word-by-word would be appropriate, because it was very snarky originally, and specifically targeted at a community I was a part of for a long time, so this post is going to have a more serious and direct tone. I still think this is relevant to post here because I learned some things in the process, which I'd like to share.
Throughout this post I'll be assuming that you have at least some understanding of programming; I'm not going to spend much time explaining how stuff works, because that's not really the point here. I'll try my best to offer some additional reading material for the interested reader that might not be familiar with some particular concept used here; ideally, the reader should have at least a basic understanding of programming, some math (especially radixes) and how CPUs generally operate. The goal here is not to explain these, but to use them to simulate a processor on RMVX Ace with events only.
Final product
So, just to give a little taste of what the final product looks like, this is the system working:
The first few seconds there, when nothing happens, I'm actually locked into the event while it loads the program (which is encoded as a sequence of variable assignments and common event calls) into memory. This takes some time because writing stuff into memory is kinda expensive (we'll see why later).
Some time later, the event "asks" for a numeric input (this is done by executing a
LOAD
instruction reading from a specific part of memory, which is mapped as a "peripheral device"). This corresponds to how many numbers I'm going to input after (i.e. the size of the array it's going to sort).It then asks for numbers three more times (corresponding to the number I gave in the first prompt), and stays "silent" for some time (while I run around it to show that it is doing that in parallel).
When it finishes processing (sorting the list), it then shows three messages in sequence (again, this is done via a
STORE
instruction, writing into a specific part of memory): the numbers I gave it, but in ascending order.It's clear that it's not the most efficient thing ever, and it would probably be much easier to just emulate the same behavior with event commands directly. But where's the fun in that?
System overview
You can download the complete system, with the sorting program loaded, here: Dropbox. I'm going to use it as reference throughout this post. It'd be kinda unwieldy to show the whole implementation here; it's much easier if you download it and have it by your side while you read this post, this way I can give a higher-level explanation of what each part does, and you can get a more concrete notion of how it's done by looking directly into the implementation.
We are going to simulate an ARMv6-M processor (architecture reference is available here). More specifically, the registers and execution state (section A2.3 in the documentation), memory architecture (chapter A3) and instruction set (A4, A5 and A6, and some details spread around the rest of the document). This ignores some aspects of the processor architecture (notably, exception and interrupt handling, privileged execution, and caches), but should be enough to be able to run most simple programs.
I chose ARMv6-M because it has an open specification, and it's a relatively simple processor, with a fairly reduced and very simple instruction set (no turing-complete
MOV
instructions on this one), which makes it easier to simulate.The system will consist essentially of a bunch of common events, each of which does some aspect of the function of the CPU. There are two main "sections" of common events: memory-related (read/write from ROM, RAM and peripheral) and instruction-related (decoding and execution). The system also uses some variables: one for ROM, one for RAM (yes, the ROM and RAM are both a single variable each; more on that later), one for each register, and some auxiliary variables used by the different processes. Processor flags are represented by switches.
The next two sections will describe the general idea behind the memory-related events and instruction-related events, respectively. The last section will then show how a program is compiled and loaded into an event.
Memory
Personally, I think the most interesting part of the system is how it represents and manipulates memory. The method shown here also relates to why this is only possible on RM XP through VX Ace, and not on the more recent versions (yes, it has to do with Ruby).
There's two key concepts to explore here: Bignums (hence why we want Ruby-based engines, which offer automatic conversion into Bignums for numeric variables) and radixes.
How so? Well, memory in computers is usually a big array of bytes (at least conceptually, in practice that's not really true, but for our purposes this serves well as an abstraction). Likewise, numbers in positional notation are essentially arrays of digits. It's not hard then to see that numbers are essentially a generalization of memory! In particular, memory is just a number in base 256 (because each byte has one of 256 values, like a base-256 digit).
This Numberphile video exemplifies this concept well:
The idea here is to encode our memory as a single (very big) number, and then everytime we need to store or load a byte (or many bytes) we just run the algorithm that gets or sets a digit on the number.
We kind of need to be able to do this to simulate memory minimally effectively. We can't afford to go around creating one variable for each position in our memory, because they're limited to just 5k, and it would be hell to select which variable to set and read from a given position (also, reading multiple bytes becomes less convenient).
Anyhow, the algorithms for getting and setting a position in memory are the following:
Rich (BB code):
load(M: bignum, addr: int) -> byte
exp = 1 // start with 256^0 = 1
addr times do // multiply by 256 until we get 256^addr
exp *= 256
return (M / exp) mod 256 // divide by the exponent and take modulus with 256
store(ref M: bignum, addr: int, new: byte)
// first we read the value, the same way we did on load
exp = 1
addr times do
exp *= 256
old = (M / exp) mod 256
// subtract the old value times the exponent to make the corresponding digit 0
M -= old * exp
// add the new value times the exponent, setting the digit to the new value
M += new * exp
The actual implementation has some much needed optimizations (in particular for sequential access), and could have even more (e.g. doing exponentiation by squaring might be nice, and we could take advantage of alignment when reading more than a single byte). These functions also only read/write bytes (8 bits), while the implementation has common events for reading/writing half-words (16-bit) and words (32-bit) too, but they shouldn't be hard to generalize.
One unfortunate aspect of this implementation is that it is slow. Bignum operations are not the cheapest things, and we need a fair amount of them, on fairly big numbers, to do our memory manipulation. Sadly, I couldn't think of a better way of doing this. One possible solution would be to implement some kind of cache, which I didn't really look into much at the time (because it was not really essential, to be honest).
You can find this implemented on the following events in the provided complete implementation:
- [041: (AUX) Find Address] Does the "address finding" job, i.e. calculates the corresponding exponent for a given address. It is used as a subroutine of all the reading/writing common events.
- [002-006: (ROM) Read/Write Byte/Half-Word/Word] [011-016: (RAM) Read/Write Byte/Half-Word/Word] Do the remaining work of subtracting and adding values into the memory (ROM, which contains programs, and RAM, respectively) using the calculated exponent.
0x60000000
, i.e. when a program reads from address 0x60000100
, it's actually reading from address 0x100
(256) in RAM.This also allows programs to communicate with peripheral devices (which, in our case, are message boxes and number inputs, i.e. the player) by reading from a specific range of addresses in memory (in our case,
0x40000000
).The peripheral device behavior is handles by these events in the provided implementation:
- [031-036: (PER) Read/Write Byte/Half-Word/Word] Reads and writes bytes into the "peripheral devices". It shows messages with values written into it, and shows numeric inputs when read (it then returns the value it gets to the
LOAD
instruction as if it was written in that position in memory).
- [011-016: (MEM) Read/Write Byte/Half-Word/Word] Reads and writes bytes into "memory". It maps addresses accordingly, forwarding processing to the appropriate common event (ROM, RAM or PER, depending on the address range). This is the event that's actually used by the simulated CPU to run
LOAD
andSTORE
instructions.
CPU
This is the party where we go and actually simulate the processor. It's also waaaaay boring.
You see, most of what a CPU does is brainless, "manual" work:
- Load an instruction from memory (this is done for us by the common events in the previous section);
- Decode it (which is basically a long sequence of conditional statements, with some occasional bit manipulation);
- Run it (which usually involves only very simple tasks, like adding numbers, comparing them, doing jumps, etc.);
- Repeat.
The common events responsible for the CPU simulation in the implementation are the following:
- [191: (CPU) Step] Does a single iteration of the process described above; it calls the next common events as subroutines as needed.
- [192-203: (CPU) Group 1-7] Handles decoding of a given group of instructions. Instructions are split into groups according to similarities in how they're encoded (which usually has to do with them having related functionality, too);
- [101-176: (CPU) <Instruction>] Simulates the execution of an instruction. Admittedly, I didn't implement some of the instructions that weren't needed for the example here; their implementation is left as an exercise to the reader, I guess?
Some auxiliary events used in the implementation of certain operations are pretty interesting though. Namely:
- [045: (AUX) 2-Complement] Computes the two's complement of a given number (making it "negative in binary"). This is useful for subtraction (there are no negative bytes in memory!), as one might guess.
- [046: (AUX) Not] Computes the binary negation of a given number. This is useful for (you guessed it), computing the binary negation of a given number.
- [050: (AUX) Sign Extend 32] Extends a signed number (a.k.a.
sext
ing; I know, lmao) to 32-bits. This is useful when you have to convert signed bytes or half-words into signed words (which happens for some instructions).
Compiling and loading code
Ok, so far we've got a functional CPU simulator, but that's no good if we have nothing to run on it.
Ideally, we'd also not really want to manually write ARMv6 assembly code and encode it to load into our system. So let's do something a bit more sophisticated: we'll compile C code into our target architecture, and then just use the compiler toolchain to get our binary code that the processor understands. This process is quite similar to how one would do it for an embedded system (which, not by coincidence, is the target of the ARMv6-M architecture).
For this section, I'm assuming you have access to a system running Ubuntu. If you're on Windows 10+, try WSL if you haven't already. It's usually pretty decent, and definitely good enough for our purposes.
So here's what you're going to need to do this:
Bash:
sudo apt-get update -y
sudo apt-get install -y gcc-arm-none-eabi binutils-arm-none-eabi libc6-armel-cross libc6-dev-armel-cross
Each of these packages provide a component necessary to use GCC with an ARMv6 back-end (i.e. make GCC generate binary code targeting our processor architecture). In particular,
gcc-arm-none-eabi
will install the GCC toolchain for the specific architecture (by default, the commands are the usual gcc
, objdump
, objcopy
, etc., but with the prefix gcc-arm-none-
).Now, let's get started. Here's the code we'll be compiling:
C:
// file: sample.c
#include <stdint.h>
#define RAM(offset) ((uint8_t*) 0x60000000 + (offset))
#define STACK_BASE RAM(1024)
#define MEM(offset) (STACK_BASE + 1 + (offset))
volatile uint32_t* peripheral = (uint32_t*) 0x40000000;
// Read a word (32-bits) from the mapped peripheral device (executes the
// numeric input command and returns the value given)
inline static uint32_t input() {
return *peripheral;
}
// Writes a word (32-bits) into the mapped peripheral device (executes the
// show message command with the given value)
inline static void output(uint32_t word) {
*peripheral = word;
}
// Insertion sort
void sort(uint32_t* arr, uint32_t n) {
for (int i = 1; i < n; i++) {
uint32_t key = arr[i];
int j;
for (j = i - 1; j >= 0 && arr[j] > key; j--) {
arr[j + 1] = arr[j];
}
arr[j + 1] = key;
}
}
uint32_t* elements = (uint32_t*) MEM(0);
void _start() {
uint32_t n = input();
for (int i = 0; i < n; i++) {
elements[i] = input();
}
sort(elements, n);
for (int i = 0; i < n; i++) {
output(elements[i]);
}
}
The code will read a given amount of numbers as input, sort them, and then display them in order. There's a few things to unpack here before we proceed with compiling it. From the beginning:
C:
#include <stdint.h>
#define RAM(offset) ((uint32_t*) 0x60000000 + (offset))
#define STACK_BASE RAM(1024)
#define MEM(offset) (STACK_BASE + 1 + (offset))
These macros are here to help us refer to our special memory mapped location without having to calculate them every time:
RAM
calculates the mapped address for some address in memory;STACK_BASE
is the stack base pointer; it points to address 1024 in the RAM, meaning that we have a reserved stack of size 1024 bytes (the stack grows backwards);MEM
calculates the mapped address for some offset in memory that we can use (i.e. not the stack).
stdint.h
. This is OK: as long as the compiler knows how to get the code for the header, it should be able to compile it into our target architecture just fine (in this case, it doesn't even need to generate any code, it's just grabbing some type aliases). Just don't go including huge libraries with humungous functions (looking at you malloc
), otherwise you'll be waiting a long time for your program to even load (remember, we're dealing with a potato here Further down, there's the code that deals with the input and output of the program. Notice that it simply reads from and writes into memory at the peripheral mapped address:
C:
volatile uint32_t* peripheral = (uint32_t*) 0x40000000;
// Read a word (32-bits) from the mapped peripheral device (executes the
// numeric input command and returns the value given)
inline static uint32_t input() {
return *peripheral;
}
// Writes a word (32-bits) into the mapped peripheral device (executes the
// show message command with the given value)
inline static void output(uint32_t word) {
*peripheral = word;
}
One important detail here is that
peripheral
is marked as volatile
; this disables some compiler optimizations that could otherwise break the program logic.Right below, we see our sorting algorithm:
C:
// Insertion sort
void sort(uint32_t* arr, uint32_t n) {
for (int i = 1; i < n; i++) {
uint32_t key = arr[i];
int j;
for (j = i - 1; j >= 0 && arr[j] > key; j--) {
arr[j + 1] = arr[j];
}
arr[j + 1] = key;
}
}
This is a pretty standard implementation of insertion sort. I chose it for the sorting algorithm because it is simple, does relatively few memory accesses (which are very expensive here) if done right, and it performs well for small lists (which is what we're going to have any chance to be sorting in an acceptable amount of time, anyway).
Other than that, there's nothing much special here.
Finally, the entrypoint of our program:
C:
uint32_t* elements = (uint32_t*) MEM(0);
void _start() {
uint32_t n = input();
for (int i = 0; i < n; i++) {
elements[i] = input();
}
sort(elements, n);
for (int i = 0; i < n; i++) {
output(elements[i]);
}
}
The implementation of the
_start
function is pretty standard C code. There's no special meaning to the function name, by the way, it's just something we can use to identify where the program starts later.Notice that
elements
is just some pointer into memory. If you're used to programming in C on a higher level, you might be worried that this could cause a segmentation fault or something like that; but that's of course not the case, because we don't have an operating system to get in our way here. The memory is all ours, to use as we please! One detail here that you might have also noticed is that
elements
is a global variable. This is because it is constant, and having it be local on _start
would make it part of the stack (which requires writing into memory, which takes time and make the program bigger). Making it global makes the program faster and smaller, which is a big win here. But please don't go making variables global everywhere when this is not a concern With the code explained, we can finally go and compile it.
Well... almost. First we need to tell the linker (the thing that links what the compiler spits out and turns it into an actual program) what our system looks like (memory mapping, stack pointer, that sort of stuff), because it's not really something very standard.
This is done with a linker script:
SCSS:
/* file: linker.ls */
MEMORY {
ROM(rx) : ORIGIN = 0x00000000, LENGTH = 512M
RAM(rwx) : ORIGIN = 0x60000000, LENGTH = 512M
}
SECTIONS {
. = ORIGIN(ROM);
.text : {
*(.text)
} > ROM
.data ALIGN(4) : {
_data_start = .;
*(.data)
. = ALIGN(4);
_data_end = .;
} > ROM
_stacktop = ORIGIN(RAM) + 1024;
_data_load = LOADADDR(.data);
}
This basically tells the linker that we have two regions in memory: ROM (from
0x00000000
to 0x1fffffff
) and RAM (from 0x60000000
to 0x7fffffff
); the program (here, the .text
section) should go into ROM, as should the static data (.data
). Also, the stack has 1024 bytes and is part of the RAM.With this, we can finally compile our program (for real this time):
Bash:
arm-none-eabi-gcc -nostdlib -Wall -O2 -Wl,-Tlinker.ls -march=armv6 -mthumb sample.c -o sample.elf
arm-none-eabi-objcopy -O binary sample.elf sample.bin
This generates a file
sample.elf
, which contains the binary of our program, along with some extra data required by the ELF file format. We then extract only the program code into a file sample.bin
with arm-none-eabi-objcopy
.You can see the resulting assembly code with
arm-none-eabi-objdump
, like this:
Bash:
arm-none-eabi-objdump -d sample.elf --visualize-jumps
Which should give you something like this:
Code:
sample.elf: file format elf32-littlearm
Disassembly of section .text:
00000000 <sort>:
0: b5f0 push {r4, r5, r6, r7, lr}
2: 4684 mov ip, r0
4: 2901 cmp r1, #1
6: /----------- d916 bls.n 36 <sort+0x36>
8: | 0003 movs r3, r0
a: | 3b04 subs r3, #4
c: | 0089 lsls r1, r1, #2
e: | 0006 movs r6, r0
10: | 2700 movs r7, #0
12: | 1858 adds r0, r3, r1
14: | /-------> 6875 ldr r5, [r6, #4]
16: | | 0039 movs r1, r7
18: | | 0033 movs r3, r6
1a: | | /----> 001a movs r2, r3
1c: | | | ca10 ldmia r2!, {r4}
1e: | | | 42ac cmp r4, r5
20: | | | /-- d904 bls.n 2c <sort+0x2c>
22: | | | | 605c str r4, [r3, #4]
24: | | | | 3b04 subs r3, #4
26: | | | | 3901 subs r1, #1
28: | | \--|-- d2f7 bcs.n 1a <sort+0x1a>
2a: | | | 4662 mov r2, ip
2c: | | \-> 3604 adds r6, #4
2e: | | 6015 str r5, [r2, #0]
30: | | 3701 adds r7, #1
32: | | 42b0 cmp r0, r6
34: | \-------- d1ee bne.n 14 <sort+0x14>
36: \----------> bdf0 pop {r4, r5, r6, r7, pc}
00000038 <_start>:
38: b5f8 push {r3, r4, r5, r6, r7, lr}
3a: 4f0d ldr r7, [pc, #52] ; (70 <_start+0x38>)
3c: 683d ldr r5, [r7, #0]
3e: 6878 ldr r0, [r7, #4]
40: 6829 ldr r1, [r5, #0]
42: 2900 cmp r1, #0
44: /----- d00f beq.n 66 <_start+0x2e>
46: | 008c lsls r4, r1, #2
48: | 0003 movs r3, r0
4a: | 1826 adds r6, r4, r0
4c: | /-> 682a ldr r2, [r5, #0]
4e: | | c304 stmia r3!, {r2}
50: | | 429e cmp r6, r3
52: | \-- d1fb bne.n 4c <_start+0x14>
54: | f7ff ffd4 bl 0 <sort>
58: | cf09 ldmia r7!, {r0, r3}
5a: | 18e4 adds r4, r4, r3
5c: | /-> cb04 ldmia r3!, {r2}
5e: | | 6002 str r2, [r0, #0]
60: | | 42a3 cmp r3, r4
62: | \-- d1fb bne.n 5c <_start+0x24>
64: | /-> bdf8 pop {r3, r4, r5, r6, r7, pc}
66: \--|-> 2100 movs r1, #0
68: | f7ff ffca bl 0 <sort>
6c: \-- e7fa b.n 64 <_start+0x2c>
6e: 46c0 nop ; (mov r8, r8)
70: 00000074 .word 0x00000074
This also gives us the exact position of the entrypoint on the generated binary code (here,
0x38
, the address of the _start
function).We can then get the hexadecimal representation of that binary with
hexdump
:
Bash:
hexdump -e '16/1 "%02x " "\n"' sample.bin
The output should be something like this:
Code:
f0 b5 84 46 01 29 16 d9 03 00 04 3b 89 00 06 00
00 27 58 18 75 68 39 00 33 00 1a 00 10 ca ac 42
04 d9 5c 60 04 3b 01 39 f7 d2 62 46 04 36 15 60
01 37 b0 42 ee d1 f0 bd f8 b5 0d 4f 3d 68 78 68
29 68 00 29 0f d0 8c 00 03 00 26 18 2a 68 04 c3
9e 42 fb d1 ff f7 d4 ff 09 cf e4 18 04 cb 02 60
a3 42 fb d1 f8 bd 00 21 ff f7 ca ff fa e7 c0 46
74 00 00 00 00 00 00 40 01 04 00 60
And we're done! Now all that's left is to copy the output of
hexdump
and the entrypoint position into the "Load binary code" script given in the implementation, and it should automatically configure the event of ID 1 on map 1 of your project to load the code of your program (just make sure that such event exists).Once it's loaded, running the program is as simple as calling the common event [(CPU) Step] until the switch [Halt] is ON
Final words
Overall, I was (and am) very happy with the result of this undertaking. It's pretty cool being able to compile C programs and run them on RPG Maker without a single line of script (with scripts it becomes kinda trivial, you can even make programs and run them on the fly, as I demonstrated in another early post of mine [pt-br]).
I guess this also shows just how powerful events can be, and how much you can do with them alone (not that you should). Limiting myself to them also made me have to think of some creative solutions to problems that would never have been even a thing had I been using a scripting language, so I think it also counts as a cool brain teaser.
One funny consequence of this is that events are Turing complete (I mean, we probably could've guessed, but it's nice to have an actual proof by construction, right?), and thus it's impossible to decide if a given event ever halts, in general. I guess this shows that events are at least as complicated as scripts...?
My point here is, events are pretty cool, but please don't overuse them. And VX Ace is still the superior engine.
Thank you for your time.
Last edited: