Code Slow: 2020

Sunday, 5 July 2020

Writing a winning 4K intro in Rust

I recently wrote my first 4K intro in Rust and released it at Nova 2020, where it took first place in the new school intro competition. Writing a 4K intro is quite involved and requires you to master many different areas at the same time. Here I will focus on what I learned about making Rust code as small as possible.

You can view the demo on YouTube, download the executable at pouet or get the source code from GitHub.

A 4K intro is a demo where the entire program ( including any data ) has to be 4096 bytes or less, so it is important that the code is as space efficient as possible. Rust has a bit of a reputation for creating bloated executables so I wanted to find out if it is possible to create very space efficient code with it.

The setup

The entire intro is written in a combination of Rust and glsl. Glsl is used for rendering everything on screen but Rust does everything else; world creation, camera and object control, creating instruments and playing music etc.

Some of the features I depend on, such as xargo, are not yet part of stable Rust so I use the nightly rust toolchain. To install and use the nightly toolchain as default you need the following rustup commands.

rustup toolchain install nightly
rustup default nightly

I use crinkler to compress the object file generated by the rust compiler.

I also used shader minifier for pre-processing the glsl shader to make it smaller and more crinkler friendly. The shader minifier doesn't support output into .rs files so I ended up using its raw output and manually copying it into my shader.rs file. (In hindsight, I should have written something to automate that stage. Or even created a PR for shader minifier)
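The manual copy step could be automated with a small helper along these lines. This is only a sketch: the function name shader_to_rs and the constant name passed to it are hypothetical, and it assumes the minified shader is available as plain text.

```rust
/// Wraps minified GLSL text in one line of Rust source that declares a
/// NUL-terminated string constant. `name` is a hypothetical constant name.
pub fn shader_to_rs(name: &str, glsl: &str) -> String {
    let mut out = String::new();
    out.push_str("pub static ");
    out.push_str(name);
    out.push_str(": &str = \"");
    for c in glsl.chars() {
        match c {
            '"' => out.push_str("\\\""),  // escape quotes for the Rust literal
            '\\' => out.push_str("\\\\"), // escape backslashes
            '\n' => out.push_str("\\n"),  // keep newlines (needed after #version)
            '\r' => {}                    // drop carriage returns
            _ => out.push(c),
        }
    }
    // Append the terminating NUL expected by the GL calls, then close the literal.
    out.push_str("\\0\";\n");
    out
}
```

Writing the returned string into shader.rs from a build script or a small command line tool would replace the copy and paste step.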

The starting point was the proof of concept code I developed earlier (https://www.codeslow.com/2020/01/writing-4k-intro-in-rust.html) which I thought was pretty lean at the time. That article also goes into a bit more detail about setting up the toml file and how to use xargo for compiling a tiny executable.

Optimizing the design for code size

Many of the most effective size optimizations have nothing to do with clever hacks but are the result of rethinking the design.

My initial design had one part of the code creating the world, including placing the spheres, and another part was responsible for moving the spheres. At some point I realized that the sphere placement and sphere moving code were doing very similar things and I could merge them into one slightly more complicated function that did both. Unfortunately, this type of optimization can make the code less elegant and readable.

Looking at the assembly code

At some point you have to look at the compiled assembly code to understand what the code gets compiled into and what size optimizations are worth it. The Rust compiler has a very useful option, --emit=asm for outputting assembler code. The following command creates a .s assembly file;

xargo rustc --release --target i686-pc-windows-msvc -- --emit=asm

It is not necessary to be an expert in assembler to benefit from studying the assembler output but it definitely helps to have a basic understanding of assembler syntax. The release version uses opt-level = "z" which causes the compiler to optimize for the smallest possible size. This can make it a bit tricky to work out which part of the assembly code corresponds to which part of the Rust code.

I discovered that the Rust compiler can be surprisingly good at minimizing code; getting rid of unused code and unnecessary parameters and folding code. It can also do some strange things which is why it is essential to occasionally study the resulting assembly code.

Using cargo features

I worked with two versions of the code; one version does logging and allows the viewer to manipulate the camera which is used for creating interesting camera paths. Rust allows you to define features that you can use to optionally include bits of functionality. The toml file has a [features] section that lets you declare the available features and their dependencies. My 4K intro has the following section in the toml file;

[features]
logger = []
fullscreen = []

Neither of the optional features has dependencies so they effectively work as conditional compilation flags. The conditional blocks of code are preceded by a #[cfg(feature = "...")] attribute. Using features does not in itself make the code smaller but it makes the development process much nicer when you can easily switch between different feature sets.

        #[cfg(feature = "fullscreen")]
        {
            // This code is compiled only if the full screen feature has been selected
        }

        #[cfg(not(feature = "fullscreen"))]
        {
            // This code is compiled only if the full screen feature has not been selected
        }

Having inspected the compiled code I am certain that only the selected features get included in the compiled code.
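As a minimal sketch of the same idea applied to whole functions, the logging call below compiles to an empty function when the logger feature is off. The function names are hypothetical, and the logging variant uses println! for brevity, which assumes std is available (the intro itself is built without it).

```rust
/// Returns a label for the window mode that was selected at compile time.
pub fn window_mode() -> &'static str {
    // cfg! evaluates to a compile-time boolean, so the unused branch is removed.
    if cfg!(feature = "fullscreen") {
        "fullscreen"
    } else {
        "windowed"
    }
}

/// Logging variant: only compiled when the `logger` feature is enabled.
#[cfg(feature = "logger")]
pub fn log(msg: &str) {
    println!("{}", msg);
}

/// Release variant: an empty function that the optimizer can drop entirely.
#[cfg(not(feature = "logger"))]
pub fn log(_msg: &str) {}
```

A feature is selected on the command line with the --features option, for example --features logger.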

One of the main uses of features was to enable logging and error checking for the debug build. The code loading and compiling the glsl shader failed frequently and without useful error messages it would have been extremely painful to find the problems.

Using get_unchecked

When putting code inside an unsafe{} block I sort of assumed that all safety checks would be disabled within the block, but this is not true. All the usual checks are still applied and these checks can be expensive.

By default Rust range checks all array accesses. Take the following Rust code

    delay_counter = sequence[ play_pos ];

Before doing the table look up the compiler would insert code that checks that play_pos is not indexing past the end of sequence and panic if that was the case. This adds considerable size to the code as there can be a lot of table look-ups like this.

Converting the above code into

    delay_counter = *sequence.get_unchecked( play_pos );

tells the compiler to not perform any range checks and just do the table look-up. This is clearly a potentially dangerous operation and can thus only be performed within an unsafe code block.
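A self-contained sketch of the two variants side by side is shown here. The function names and the u8 element type are assumptions made for illustration.

```rust
/// Reads `sequence[play_pos]` without a bounds check.
///
/// # Safety
/// The caller must guarantee that `play_pos < sequence.len()`.
pub unsafe fn delay_at(sequence: &[u8], play_pos: usize) -> u8 {
    // No range check and no panic path is generated for this access.
    unsafe { *sequence.get_unchecked(play_pos) }
}

/// The checked equivalent: panics if `play_pos` is out of range.
pub fn delay_at_checked(sequence: &[u8], play_pos: usize) -> u8 {
    sequence[play_pos]
}
```

Comparing the output of --emit=asm for the two functions is a quick way to see how much code the range check and the panic path add.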

Making loops space efficient

Initially all my loops used the idiomatic rust way of doing loops, using the for x in 0..10 syntax which I just assumed would be compiled into tightest possible loop. Surprisingly, this was not the case. The simplest case;

for x in 0..10 {
    // do code
}

would get translated into assembly code that did the following;

    setup loop variable
loop:
    check for loop condition    
    if loop finished, jump to end
    // do code inside loop
    unconditionally jump to loop
end:

whereas the following Rust code

let mut x = 0;
loop{
    // do code
    x += 1;
    if x == 10 {
        break;
    }
}

would get directly compiled into;

    setup loop variable
loop:
    // do code inside loop
    check for loop condition    
    if loop not finished, jump to loop
end:

Note that the loop condition is checked at the end of each loop which makes the unconditional jump unnecessary. This is a small space saving for one loop but the savings do add up when there are 30 loops in the program.

The other, much harder to understand, problem with the idiomatic Rust loop is that in some cases the compiler would add some additional iterator setup code that really bloated the code. I never fully understood what triggered this additional iterator setup as it was always trivial to replace the for {} constructs with a loop{} construct.
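The two loop shapes can be compared with a pair of small functions such as the following. This is a sketch with hypothetical names. Note that the loop{} form always runs its body at least once, so it is only a drop-in replacement when the iteration count is known to be at least one.

```rust
/// Idiomatic form: the range iterator tests the condition before each pass.
pub fn sum_with_for(n: u32) -> u32 {
    let mut total = 0;
    for x in 0..n {
        total += x;
    }
    total
}

/// Bottom-tested form: the condition is checked after the body.
/// Requires `n >= 1`, because the body runs before the first check.
pub fn sum_with_loop(n: u32) -> u32 {
    let mut total = 0;
    let mut x = 0;
    loop {
        total += x;
        x += 1;
        if x == n {
            break;
        }
    }
    total
}
```

Both functions return the same value for any n of one or more; whether the second one is actually smaller depends on the compiler version and settings, so it is worth checking the assembly output.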

Using vector instructions

I spent a lot of time optimizing the glsl code and one of the best classes of optimizations ( which also usually made the code run faster ) was to operate on an entire vector at a time instead of operating on one component at a time.

For example, the ray tracing code uses a fast grid traversal algorithm to check which parts of the map each ray visits. The original algorithm considers each axis separately but it is possible to rewrite the algorithm so it considers all axes at the same time and does not need any branches. Rust doesn't really have a native vector type like glsl but you can use intrinsics to tell it to use SIMD instructions.

To use intrinsics I would convert the following code

        global_spheres[ CAMERA_ROT_IDX ][ 0 ] += camera_rot_speed[ 0 ]*camera_speed;
        global_spheres[ CAMERA_ROT_IDX ][ 1 ] += camera_rot_speed[ 1 ]*camera_speed;
        global_spheres[ CAMERA_ROT_IDX ][ 2 ] += camera_rot_speed[ 2 ]*camera_speed;

into

        let mut dst:x86::__m128 = core::arch::x86::_mm_load_ps(global_spheres[ CAMERA_ROT_IDX ].as_mut_ptr());
        let mut src:x86::__m128 = core::arch::x86::_mm_load_ps(camera_rot_speed.as_mut_ptr());
        dst = core::arch::x86::_mm_add_ps( dst, src);
        core::arch::x86::_mm_store_ss( (&mut global_spheres[ CAMERA_ROT_IDX ]).as_mut_ptr(), dst );

which would be quite a bit smaller ( but a lot less readable ). Sadly, for some reason this broke the debug build while working perfectly on the release build. Clearly, this is a problem with my intrinsics knowledge and not a problem with Rust. This is something I would spend more time on for my next 4K intro as the space savings were significant.
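The cause of the debug build failure is not established here, but two properties of the intrinsics used above may be worth checking: _mm_load_ps requires a 16-byte aligned address, and _mm_store_ss writes only the lowest of the four lanes. The intrinsic version also leaves out the multiplication by camera_speed. The sketch below avoids all three by using unaligned loads and a full-width store. The function name and the assumption that both operands are [f32; 4] arrays are illustrative only.

```rust
#[cfg(target_arch = "x86")]
use core::arch::x86::*;
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

/// Adds `speed * scale` to `pos`, four lanes at a time.
/// Assumes an SSE-capable target (always true on x86_64).
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn add_scaled(pos: &mut [f32; 4], speed: &[f32; 4], scale: f32) {
    unsafe {
        let dst = _mm_loadu_ps(pos.as_ptr());   // unaligned load of 4 floats
        let src = _mm_loadu_ps(speed.as_ptr()); // unaligned load of 4 floats
        let scaled = _mm_mul_ps(src, _mm_set1_ps(scale)); // speed * scale per lane
        // Store all four lanes back, not just the lowest one.
        _mm_storeu_ps(pos.as_mut_ptr(), _mm_add_ps(dst, scaled));
    }
}

/// Scalar fallback so the sketch also builds on other architectures.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn add_scaled(pos: &mut [f32; 4], speed: &[f32; 4], scale: f32) {
    for i in 0..4 {
        pos[i] += speed[i] * scale;
    }
}
```

The unaligned variants may cost a few bytes or cycles compared with the aligned ones, so the size benefit would need to be measured again.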

Using OpenGL

There are a lot of standard Rust crates for loading OpenGL functions but by default they all load a very large set of OpenGL functions. Each loaded function takes up some space because the loader has to know its name. Crinkler does a very good job of compressing this kind of code but it is not able to completely get rid of the overhead so I had to create my own version of gl.rs that only includes the OpenGL functions that are used in the code.

Conclusion

My first objective was to write a competitive proper 4K intro to prove that the language was suitable for scenarios where every single byte counts and you really need low level control. Typically this has been the sole domain of assembler and C. The secondary objective was to write it using idiomatic Rust as much as possible.

I think I was fairly successful on the first objective. At no point during the development did I feel that Rust was holding me back in any way or that I was sacrificing performance or capabilities because I was using Rust rather than C.

I was less successful on the second objective. There is far too much unsafe code that doesn't really need to be there. Unsafe has a corrupting effect; it is very easy to use unsafe code to quickly accomplish something (like using mutable statics) but once the unsafe code is there it begets more unsafe code and suddenly it is everywhere. In the future I think I would be far more cautious about using unsafe and only use it when there really is no alternative.

Saturday, 11 January 2020

Writing a 4K intro in Rust

Some weeks ago I started looking at using Rust for making a 64K intro. I started out by making a minimal window app that takes up only 3.5Kbytes without any compression. Some of the feedback I got encouraged me to have a look at Crinkler for compressing the executable. Given that crinkler is really targeted at 4K intros I decided to try creating a minimal modern OpenGL app that can be squeezed into 4K or less.

Rust OpenGL in 2 Kbytes

Debugging

Before working on new features I had to address debugging.

My minimal application was a nightmare to debug. If something went badly wrong it would just freeze without any indication of what happened. I couldn't print out any debug messages to console because the std library was not included. The only method of debugging I had was to pop up windows MessageBoxes at certain points but even that was limited because I could not do any string formatting.

Most crucially, I could not step through the code because the optimizations meant that the debugger could not tell what the variable values were.

I thought I could just create a new debug profile in the toml file which would not use any optimizations. This way both versions would have the same capabilities but I could easily step through the non-optimized dev build. I added this profile to the toml.

[profile.dev]
lto = true 
opt-level = 0
panic = "abort"

This did not work. The linker complained about unresolved symbols to memset and memcpy. It was surprising to find out that the optimization level changed the set of required external functions. My suspicion is that the higher optimization just happened to make the memory management functions unnecessary for my particular program.

Initially I started writing my versions of memset and memcpy but then I realized that I could just copy the std versions into my program. (Not something I would normally recommend).

Once I had created the replacements for memset and memcpy I was able to build a dev version that I could now properly debug and step through the code.
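For illustration, byte-wise replacements can look like the sketch below. These are not the std versions mentioned above, and the tiny_ names are hypothetical: in a real no_std build the functions would have to be exported under the exact names memset and memcpy with the C calling convention ( #[no_mangle] pub unsafe extern "C" fn ) so that the linker can resolve the missing symbols.

```rust
/// Fills `n` bytes at `dest` with the low byte of `c`, like C `memset`.
///
/// # Safety
/// `dest` must be valid for writes of `n` bytes.
pub unsafe fn tiny_memset(dest: *mut u8, c: i32, n: usize) -> *mut u8 {
    let mut i = 0;
    while i < n {
        unsafe { *dest.add(i) = c as u8 };
        i += 1;
    }
    dest
}

/// Copies `n` bytes from `src` to `dest`, like C `memcpy`.
///
/// # Safety
/// `src` must be valid for reads and `dest` for writes of `n` bytes,
/// and the two regions must not overlap.
pub unsafe fn tiny_memcpy(dest: *mut u8, src: *const u8, n: usize) -> *mut u8 {
    let mut i = 0;
    while i < n {
        unsafe { *dest.add(i) = *src.add(i) };
        i += 1;
    }
    dest
}
```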

Setting up OpenGL

There are several parts to getting an application ready to use OpenGL:

Creating an OpenGL context.
Setup all OpenGL functions.
Setting up all the data required to draw something.

There are several good resources that go through the above steps. The Ironic blog has an excellent set of tutorials for setting up OpenGL with Rust. Additionally I looked at the learnopengl articles for more in-depth explanations on OpenGL. Finally, the official OpenGL website has a lot of information but it is not the best place for easy-to-understand introductory explanations.

Setting up the context

Setting up the GL context is pretty straightforward and just uses functions and enums imported by winapi. The context setup effectively boils down to the following bit of code ( with all error checking removed )

        // Describe the pixel format we want for the window's device context
        let mut PFD : PIXELFORMATDESCRIPTOR = core::mem::zeroed();
        PFD.nSize = core::mem::size_of::<PIXELFORMATDESCRIPTOR>() as u16;
        PFD.nVersion = 1;
        PFD.dwFlags = PFD_DRAW_TO_WINDOW | PFD_SUPPORT_OPENGL | PFD_DOUBLEBUFFER;
        PFD.iPixelType = PFD_TYPE_RGBA;
        PFD.cColorBits = 32;
        PFD.cAlphaBits = 8;
        PFD.cDepthBits = 24;

        // Pick the closest matching format, then create and activate the context
        let pFDID : i32 = ChoosePixelFormat(hDC, &PFD );
        SetPixelFormat(hDC, pFDID, &PFD);
        let hRC : HGLRC = wglCreateContext(hDC);
        wglMakeCurrent(hDC, hRC);

Importing GL functions

None of the modern OpenGL functions are present in winapi but must be loaded into memory before they can be used. This has nothing to do with winapi but is just how OpenGL functionality is accessed. The crate gl has the functionality for loading these functions into memory and providing bindings to all of the functions. Unfortunately the generated bindings require std to function so I need to pull out the required functionality manually. Also, the bindings generator creates bindings for functions that I won't be using, thus bloating the executable size.

My OpenGL function loader is very much based on jonil's project on github.

The GL function loader needs to locate the address for the OpenGL function and provide a safe wrapper for each function. The init function below is the heart of the function loader. It goes through the LOAD_DESC array and loads the function names listed in it and stores them at the corresponding indices in the GL_API array.

The type of the GL_API array where all the function addresses are stored is usize which is guaranteed to be large enough to store a function pointer. There is no general function pointer type in Rust which is why they are stored in an array of usize.

static mut GL_API: [usize; 696] = [0; 696]; // large enough to store all function pointers  
//..
const CreateProgramIdx: u16 = 96;
//..
static LOAD_DESC: &'static [(u16, &'static str)] = &[
    (CreateProgramIdx, "glCreateProgram\0"),
    //...
];

pub fn init() {
    let handle : HMODULE;
    unsafe { handle = LoadLibraryA( "Opengl32.dll\0".as_ptr() as *const i8);  }
    for &(index, name) in LOAD_DESC {
        unsafe {
            let mut prc = wglGetProcAddress(name.as_ptr() as *const i8) as usize;
            if prc == 0 {
                prc = GetProcAddress( handle, name.as_ptr() as *const i8 ) as usize;
            }
            GL_API[ index as usize] =  prc;
        }
    }
}

The wrapper functions do the address look-up and call into OpenGL. Each wrapper function wraps one of the underlying actual GL functions and is responsible for converting the usize value into the actual function pointer type and then calling it with the given arguments. Each wrapper function looks like the following;

pub unsafe fn GenBuffers(n: GLsizei, buffers: *mut GLuint) -> () {
    mem::transmute::<_, extern "system" fn(GLsizei, *mut GLuint) -> ()>(*GL_API.get_unchecked(GenBuffersIdx as usize))(n, buffers)
}

The wrapper function GenBuffers wraps the OpenGL function of the same name. The mem::transmute call converts the usize into an external function pointer without doing any checks. This is clearly an unsafe function ( as is calling the external function ).
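The store-as-usize and transmute-back pattern can be tried in isolation, without OpenGL, using a local function in place of a loaded one. This is a sketch only: the add function and the helper names are stand-ins, and the address comes from a cast rather than from wglGetProcAddress.

```rust
/// Stand-in for a function that would normally be located at run time.
extern "C" fn add(a: i32, b: i32) -> i32 {
    a + b
}

/// Returns the function's address as a plain integer, which is what the
/// loader stores in its table after a successful lookup.
pub fn store_address() -> usize {
    add as *const () as usize
}

/// Turns the stored integer back into a typed function pointer and calls it.
///
/// # Safety
/// `addr` must be the address of a function with exactly this signature
/// and calling convention; nothing checks this.
pub unsafe fn call_add(addr: usize, a: i32, b: i32) -> i32 {
    let f = unsafe {
        core::mem::transmute::<usize, extern "C" fn(i32, i32) -> i32>(addr)
    };
    f(a, b)
}
```

If the stored value is zero, for example because a lookup failed, the call is undefined behaviour, which is why the debug build benefits from checking the table after loading.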

This is enough to get OpenGL functions loaded but nothing is being shown on the screen yet. For that we need to set up vertex and fragment shaders.

Setting up Shaders

Modern OpenGL requires that at least one fragment shader and one vertex shader are set up before anything can be rendered. The process of setting up the shaders is identical regardless of what type of shader it is so I created the helper function shader_from_source that creates both kinds of shader objects.

pub fn shader_from_source( shader_source : &str, kind: gl::GLenum, error_dest : &mut [i8] ) -> Option<gl::GLuint> {
    let id;
    let mut success: gl::GLint = 1;
    unsafe {
        id = gl::CreateShader(kind);
        gl::ShaderSource(id, 1, &shader_source.as_ptr(), 0 as *const _);
        gl::CompileShader(id);
        gl::GetShaderiv(id, gl::COMPILE_STATUS, &mut success);
    }
 
    if success == 0 {
        unsafe{ gl::GetShaderInfoLog( id, error_dest.len() as i32,  0 as *mut _, error_dest.as_mut_ptr() as *mut gl::GLchar ); }
        return None;
    }
    return Some( id );
}

I have tried to keep it as Rust-like as possible despite not having std available. The one odd thing about the code above is that the error text is stored into a buffer given to the function as an argument rather than allocated by the function itself ( There is no std that would bring in String and there is not even a heap allocator ). I considered using a local static buffer for errors but that would have broken too many ideas about Rust memory safety for comfort.
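The caller-owned buffer pattern can be shown without any OpenGL calls. In this sketch the hypothetical write_error function plays the role of gl::GetShaderInfoLog: it fills a buffer supplied by the caller, truncates the text if needed and always leaves it NUL-terminated, so no allocation is required.

```rust
/// Copies `msg` into `dest`, truncating if necessary, and NUL-terminates it.
/// Returns the number of message bytes written, not counting the NUL.
pub fn write_error(dest: &mut [u8], msg: &[u8]) -> usize {
    if dest.is_empty() {
        return 0; // no room even for the terminator
    }
    // Reserve the last byte of the buffer for the terminating NUL.
    let n = core::cmp::min(msg.len(), dest.len() - 1);
    dest[..n].copy_from_slice(&msg[..n]);
    dest[n] = 0;
    n
}
```

The caller declares a fixed-size array on the stack, passes it in as a mutable slice and hands the pointer to a message box only when the call reports a failure.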

The shaders themselves are pretty simple. The vertex position is passed through to the fragment shader which in turn does some colorful calculation based on the screen location and time. To vary its output over time the fragment shader does need to be told the time through the iTime uniform. ( The full code is on github )

    let vtx_shader_src : &'static str = "#version 330 core
    layout (location = 0) in vec3 Position; 
    void main()
    {
     gl_Position = vec4(Position, 1.0);
    }\0";

    let frag_shader_src : &'static str = "#version 330 core
    in vec4 gl_FragCoord;
    out vec4 Color;
    uniform float iTime;
    void main()
    {
        // Do interesting stuff
    }\0";

Passing these shaders into the shader_from_source function creates the required shaders.

    let vtx_shader = match gl_util::shader_from_source( vtx_shader_src, gl::VERTEX_SHADER, &mut error_message ) {
        Some( shader ) => shader,
        None => { show_error( error_message.as_ptr()  ); 0 }
    };

    let frag_shader  = match gl_util::shader_from_source( frag_shader_src, gl::FRAGMENT_SHADER,  &mut error_message ) {
        Some( shader ) => shader,
        None => { show_error( error_message.as_ptr() ); 0 }
    };

These shaders are finally combined into a shader program that describes the entire graphics pipeline from vertices into screen pixels. program_from_shaders is a utility function that puts all the given shaders into the new shader program.

    let shader_prog = match gl_util::program_from_shaders(vtx_shader, frag_shader, &mut error_message ) {
        Some( prog ) => prog,
        None => { show_error( error_message.as_ptr() ); 0 }
    };

Setting up vertices

Shaders need vertices to operate on. Modern OpenGL requires a buffer object to hold the actual vertex data.

    gl::GenBuffers(1, &mut vertex_buffer_id);

The buffer holding the vertex data is just some data in memory but OpenGL also needs to know how the data is laid out so it can map it to the vertex shader. To avoid having to fully define the vertex layout every time the selected vertices are changed OpenGL uses vertex arrays. These combine the vertex buffer and its configuration into a single object that can be selected by one bind call.

        gl::GenVertexArrays(1, &mut vertex_array_id );
        gl::BindVertexArray(vertex_array_id);

After the gl::BindVertexArray call all vertex configuration is stored on the bound vertex array.

The vertex data is loaded using gl::BufferData which loads the given block of memory into the bound buffer. The code loads 4 vertices, each holding 3 floats.

    gl::BindBuffer(gl::ARRAY_BUFFER, vertex_buffer_id);
    gl::BufferData( gl::ARRAY_BUFFER, size_of::<gl::GLfloat>() as isize * 3 * 4, vtx_coords.as_ptr() as *const gl::CVoid, gl::STATIC_DRAW);
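The size argument above is simply 4 vertices times 3 floats times 4 bytes. The sketch below spells that out with one possible set of coordinates: a quad covering the whole screen, listed in triangle strip order. The actual values of vtx_coords are not shown in this post, so these coordinates are an assumption.

```rust
/// Four corners of a screen-covering quad in triangle-strip order
/// (x, y, z per vertex, in normalized device coordinates).
pub static VTX_COORDS: [f32; 12] = [
    -1.0, -1.0, 0.0, // bottom left
     1.0, -1.0, 0.0, // bottom right
    -1.0,  1.0, 0.0, // top left
     1.0,  1.0, 0.0, // top right
];

/// Size in bytes of the vertex data, computed the same way as in the
/// BufferData call: bytes per float * floats per vertex * vertices.
pub fn vertex_bytes() -> isize {
    (core::mem::size_of::<f32>() * 3 * 4) as isize
}
```

With a 4-byte float this gives 48 bytes, which matches the size of the whole array.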

The vertices are configured using a series of gl::EnableVertexAttribArray and gl::VertexAttribPointer calls to first enable attribute location and then define it. The attribute location has to match the layout location used in the vertex shader. The vertex shader only uses location 0 for the Position input.

    gl::EnableVertexAttribArray(0);     // enable location 0
    gl::VertexAttribPointer( 0, 3, gl::FLOAT, gl::FALSE, 3 * size_of::<f32>() as gl::GLint, 0 as *const CVoid );

Now we are finally ready to draw the scene in the main loop

    gl::UseProgram(shader_prog);
    gl::BindVertexArray(vertex_array_id);
    gl::DrawArrays( gl::TRIANGLE_STRIP, 0, 4 );

Passing uniform data

The fragment shader uses the uniform iTime for its animation. Uniforms are effectively global variables that all stages of the shader can read and every fragment/vertex sees the same values. They are ideal for passing global parameters like transformation matrices and time.

The following code finds the OpenGL identifier of the uniform iTime in the program and sets its value from time.

    let time_loc : i32 = gl::GetUniformLocation(shader_prog, "iTime\0".as_ptr());
    gl::Uniform1f(time_loc, time );

Syncing the refresh

By default OpenGL will render frames as fast as it can. This means that the application will max out the GPU and do a lot of unnecessary work because many of the rendered frames will never be shown to the user if the frames are calculated faster than the display refreshes.

OpenGL on Windows has an extension function wglSwapIntervalEXT that can be used for setting the minimum number of display refreshes between displaying frames. If it is set to one OpenGL will show a new frame for every display refresh if a new frame is available.

This means that the program will block on the gl::SwapBuffers call if a frame is already waiting to be shown. But it also means that GPU utilization on my computer goes from 99% to somewhere around 10% (depending on what is being rendered)

This only works on Windows but this is not really a concern here given that the code is already tightly coupled with the Windows API.

Using Crinkler

Crinkler performs the real magic in compressing the executable into an incredibly small size. Crinkler is a bit more complicated to use than the standard exe packers such as UPX because it operates on object files, not the final executable. In effect crinkler is a compressing linker and this is one of the reasons why it achieves such impressive compression rates.

It took me a while to figure out how to coax xargo into outputting the right kind of object file but the correct command is;

xargo rustc --release --target i686-pc-windows-msvc -- --emit=obj

The cargo rustc docs explain that the rustc command compiles the package and passes in the extra options after the --. In this case the only required extra option is --emit=obj which tells the compiler to emit an object file ( as explained in the rustc documentation ).

Once the object file is available crinkler can be run with;

crinkler /OUT:test.exe /SUBSYSTEM:WINDOWS miniwin.o  /ENTRY:mainCRTStartup "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\Lib\10.0.17763.0\um\x86" gdi32.lib user32.lib opengl32.lib kernel32.lib

The command tells crinkler

to create a file called test.exe (/OUT:test.exe)
that the program is a windows program (/SUBSYSTEM:WINDOWS) rather than a console program
to use the object file miniwin.o. The object files are simply listed in the command
the program entry point is mainCRTStartup (/ENTRY:mainCRTStartup)
which path the libraries can be found in
which libraries to link to
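The argument list described above could be assembled in code as a first step towards automating the link stage. This is a sketch: the function name is hypothetical, it only builds the arguments and does not run anything, and the flags are exactly those from the command shown earlier.

```rust
/// Builds the crinkler argument list for one object file.
/// `lib_path` is the Windows SDK library directory for the x86 target.
pub fn crinkler_args(obj: &str, out: &str, lib_path: &str) -> Vec<String> {
    vec![
        format!("/OUT:{}", out),             // name of the compressed executable
        "/SUBSYSTEM:WINDOWS".to_string(),    // windows program, not console
        obj.to_string(),                     // object file emitted by rustc
        "/ENTRY:mainCRTStartup".to_string(), // program entry point
        format!("/LIBPATH:{}", lib_path),    // where the import libraries live
        "gdi32.lib".to_string(),
        "user32.lib".to_string(),
        "opengl32.lib".to_string(),
        "kernel32.lib".to_string(),
    ]
}
```

Assuming crinkler is on the PATH, the list can be passed to std::process::Command::new("crinkler").args(...) from a small build tool.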

This compresses the executable into which is 1911 bytes long into only X bytes. This leaves over 2000 compressed bytes for doing something interesting. ( The fragment shader is not smallest possible and is probably costing a couple hundred bytes )

Wrapping up

All the code is available on github at https://github.com/janiorca/tinywin/tree/master/miniwinGL

There is still some functionality missing ( sound and fullscreen mode ) and the invocation of Crinkler is still very manual but I think this provides a good starting point for anyone thinking about using Rust for writing a 4K intro. The code is not particularly aggressively structured so it could probably be quite a bit smaller. A shader minifier would also help save some extra bytes.
