Sunday, 5 July 2020

Writing a winning 4K intro in Rust


I recently wrote my first 4K intro in Rust and released it at the Nova 2020 where it took first place in the new school intro competition. Writing a 4K intro is quite involved and requires you to master many different areas at the same time. Here I will focus on what I learned about making Rust code as small as possible.



You can view the demo on youtube, download the executable at pouet or get the source code from github

A 4K intro is a demo where the entire program ( including any data ) has two be 4096 bytes or less so it is important that the code is as space efficient as possible. Rust has a bit of a reputation for creating bloated executables so I wanted to find out if is possible to create very space efficient code with it.

The setup

The entire intro is written in a combination of Rust and glsl. Glsl is used for rendering everything on screen but Rust does everything else; world creation, camera and object control, creating instruments and playing music etc.

Some of the features I depend on, such as xargo, are not yet part of stable Rust so I use the nightly rust toolchain. To install and use the nightly toolchain as default you need the following rustup commands.

rustup toolchain install nightly
rustup default nightly

I use crinkler to compress the object file generated by the rust compiler.

I also used shader minifier for pre-processing the glsl shader to make it smaller and more crinkler friendly. The shader minifier doesn't support output into .rs files so I ended up using its raw output and manually copying it into my shader.rs file. (In hindsight, I should have written something to automate that stage. Or even created a PR for shader minifier)

The starting point was the proof of concept code I developed earlier (https://www.codeslow.com/2020/01/writing-4k-intro-in-rust.html) which I thought was pretty lean at the time. That article also goes into but more detail about setting up the toml file and how to use xargo for compiling tiny executable.

Optimizing the design for code size

Many of the most effective size optimizations have nothing to do with clever hacks but are the result of rethinking the design.

My initial design had one part of the code creating the world, including placing the spheres and another part was responsible for moving the spheres. At some point I realized that the sphere placement and sphere moving code were doing very similar things and I could merge them into one sightly more complicated function that did both. Unfortunately, this type of optimization can make the code less elegant and readable.

Looking at the assembly code

At some point you have to look at the compiled assembly code to understand what the code gets compiled into and what size optimizations are worth it. The Rust compiler has a very useful option, --emit=asm for outputting assembler code. The following command creates a .s assembly file;

xargo rustc --release --target i686-pc-windows-msvc -- --emit=asm

It is not necessary to be an expert in assembler to benefit from studying the assembler output but it definitely helps to have a basic understanding assembler syntax. The release version uses opt-level = "z which causes the compiler to optimize for the smallest possible size. This can make it a bit tricky to work out which part of the assembly code corresponds to which part of the Rust code.

I discovered that the Rust compiler can be surprisingly good at minimizing code; getting rid of unused code and unnecessary parameters and folding code. It can also do some strange things which is why it is essential to occasionally study the resulting assembly code.

Using cargo features

I worked with two versions of the code; one version does logging and allows the viewer to manipulate the camera which is used for creating interesting camera paths. Rust allows you to define features that you can use to optionally include bits of functionality. The toml file has a [features] section that lets you declare the available features and their dependencies. My 4K intro has the following section in the toml file;

[features]
logger = []
fullscreen = []

Neither of the optional features has dependencies so they effectively work as being conditional compilation flags. The conditional blocks of code are preceded by #[cfg(feature)] statement. Using features in itself does not make the code smaller but it makes development process much nicer when you easily switch between different feature sets.

        #[cfg(feature = "fullscreen")]
        {
            // This code is compiled only if the full screen feature has been selected
        }

        #[cfg(not(feature = "fullscreen"))]
        {
            // This code is compiled only if the full screen feature has not been selected
        }

Having inspected the compiled code I am certain that only the selected features get included in the compiled code.

One of the main uses of features was to enable logging and error checking for the debug build. The code loading and compiling the glsl shader failed frequently and without useful error messages it would have been extremely painful to find the problems.

using get_unchecked

When putting code inside an unsafe{} block I sort of assumed that all safety checks would be disabled within this block but this is not true, all the usual checks are still applied and these checks can be expensive.

By default Rust range checks all array accesses. Take the following Rust code

    delay_counter = sequence[ play_pos ];

Before doing the table look up the compiler would insert code that checks that play_pos is not indexing past the end of sequence and panic if that was the case. This adds considerable size to the code as there can be a lot of table look-ups like this.

Converting the above code into

    delay_counter = *sequence.get_unchecked( play_pos );

tells the compiler to not perform any range checks and just do the table look-up. This is clearly a potentially dangerous operation and can thus only be performed within an unsafe code block

Making loops space efficient.

Initially all my loops used the idiomatic rust way of doing loops, using the for x in 0..10 syntax which I just assumed would be compiled into tightest possible loop. Surprisingly, this was not the case. The simplest case;

for x in 0..10 {
    // do code
}

would get translated into assembly code that did the following;

    setup loop variable
loop:
    check for loop condition    
    if loop finished, jump to end
    // do code inside loop
    unconditionally jump to loop
end:

whereas if did the following rust code

let x = 0;
loop{
    // do code
    x += 1;
    if i == 10 {
        break;
    }
}

would get directly compiled into;

    setup loop variable
loop:
    // do code inside loop
    check for loop condition    
    if loop not finished, jump to loop
end:

Note that the loop condition is checked at the end of each loop which makes the unconditional jump unnecessary. This is small space saving for one loop but they do add up when there are 30 loops in the program.

The other, much harder to understand, problem with the idiomatic Rust loop is that in some cases it the compiler would add some additional iterator setup code that really bloated the code. I never fully understood what triggered this additional iterator setup as it was always trivial to replace the for {} constructs with a loop{} construct.

Using vector instructions

I spent a lot of time optimizing the glsl code and one of the best class of optimizations ( which also usually made the code run faster) was to operate on an entire vector at a time instead of operating at a component at a time.

For example, the ray tracing code use a fast grid traversal algorithm to check which parts of the map each ray visits. The original algorithm considers each axis separately but it is possible to rewrite the algorithm so it considers all axes at the same time and does not need any branches. Rust doesn't really have a native vector type like glsl but you can use intrinsics to tell it to use SIMD instructions.

To use intrinsics I would convert the following code

        global_spheres[ CAMERA_ROT_IDX ][ 0 ] += camera_rot_speed[ 0 ]*camera_speed;
        global_spheres[ CAMERA_ROT_IDX ][ 1 ] += camera_rot_speed[ 1 ]*camera_speed;
        global_spheres[ CAMERA_ROT_IDX ][ 2 ] += camera_rot_speed[ 2 ]*camera_speed;

into

        let mut dst:x86::__m128 = core::arch::x86::_mm_load_ps(global_spheres[ CAMERA_ROT_IDX ].as_mut_ptr());
        let mut src:x86::__m128 = core::arch::x86::_mm_load_ps(camera_rot_speed.as_mut_ptr());
        dst = core::arch::x86::_mm_add_ps( dst, src);
        core::arch::x86::_mm_store_ss( (&mut global_spheres[ CAMERA_ROT_IDX ]).as_mut_ptr(), dst );

which would be quite a bit smaller ( but a lot less readable ). Sadly, for some reason this broke the debug build while working perfectly on the release build. Clearly, this is a problem with my intrinsics knowledge and not a problem with Rust. This is something I would spend more time on for my next 4K intro as the space saving were significant.

Using OpenGL

There are lot of standard Rust crates for loading OpenGL functions but by default they all load a very large set of OpenGL functions. Each loaded function takes up some space because the loader has to know its name. Crinkler does a very good job of compressing this kind of code but it is not able to completely get rid of the overhead so I had to create my own version gl.rs that only includes the OpenGL functions that are used in the code.

Conclusion

My first objective was to write a competitive proper 4K intro to prove that language was suitable for scenarios where every single byte counts and you really need low level control. Typically this has been the sole domain of assembler and C. The secondary objective was to write it using idiomatic Rust as much possible.

I think I was fairly successful on the first objective. At no point during the development did I feel that Rust was holding me back in any way or that I was sacrificing performance or capabilities because I was using Rust rather than C.

I was less successful on the second objective. There is far too much unsafe code that doesn't really need to be there. Unsafe has a corrupting effect; it is very easy to use unsafe code to quickly accomplish something (like using mutable statics) but once the unsafe code is there it begets more unsafe code and suddenly it is everywhere. In the future I think I would be far more cautious about using unsafe and only use it when there really is no alternative.

8 comments:

  1. Very nice!
    If you're to make another intro, I'll be happy to add a Rust output in Shader Minifier (please file an issue on https://github.com/laurentlb/shader_minifier/issues).

    Laurent

    ReplyDelete
  2. Did you write the music? It's great.

    ReplyDelete
  3. 16x larger filesize than https://youtu.be/sWblpsLZ-O8 which was also posted to ycombinator recently.

    Nice work though, demos are cool.

    ReplyDelete
  4. Genewitch: 256b demos and 4k demos are completely different, you are comparing apple and oranges.

    ReplyDelete
  5. I just hope before the next update that rust actually optimizes the game it should not take the amount of computer power that this game needs to run flawlessly

    ReplyDelete
  6. It should be noted that if you put something like assert!(v.len() < size) if you use indexing with known indices, the bounds check can be optimized out. Same if say you have [T; 256] where you're using u8 indices or index as v[ix & 0xFF].

    ReplyDelete
    Replies
    1. LLVM is pretty smart, basically.

      Delete
    2. You can use std::hints::unreachable_unchecked to write an assert_assume macro that does an assert in debug or an unreachable_unchecked in release. Allows for the same types of assumptions without taking space for an actual check

      Delete

Writing a winning 4K intro in Rust

I recently wrote my first 4K intro in Rust and released it at the Nova 2020 where it took first place in the new school intro competition. W...