Code Slow: 2019

Saturday, 7 December 2019

Tiny Windows executable in Rust

I have recently spent a lot of time writing pixel shaders and given that I have already written a pure Rust mod player I have started to think about trying my hand at writing a 64K intro in Rust.

One of the main challenges in writing a 64K intro is to squeeze all the code and assets into 64K of memory. There are several tiny frameworks written in C++ that set up a Windows app with a modern OpenGL context. I could not find anything similar for Rust so I decided to create the smallest possible bare bones app that does just that.

First attempt at a minimal windows app

There are quite a few good crates for creating a small windows apps, such as winit. My first attempt was to create a windows program using winit and enabling all possible size optimizations to see what I end up with. If the resulting program is small enough I could declare success and stop there.
`
After setting up a small project using the winit example and enabling most of the size optimizations listed in johnthagen's excellent article on https://github.com/johnthagen/min-sized-rust I ended up with the following toml file.

[profile.release]
lto = true 
codegen-units = 1
opt-level = "z"   
panic = 'abort'

[dependencies]
winit = "0.20.0-alpha4"

The settings in the toml file enable the have the following optimizations;

lto = true Enables link time optimizations which tells the compiler to optimize code generation at link time and can result in dropping code that is not used.
codegen-units = 1 Tells the compiler to use only one code generator instead of running several in parallel. Disabling parallel code generation makes the compilation slower but makes all optimizations possible.
opt-level = "z" Tells the compiler to optimize for minimal code size. This will make the code less performant but it will take up less space.
panic = 'abort' Stops Rust generating a helpful stack trace and panic message when it panics. With this optimization it will be much harder to figure out what went wrong when the program crashes.

With all these optimizations enabled the release build of the winit window example comes to 233 kilobytes. This is nearly 4 times as large as the entire 64K intro can be and the app does not do anything yet!

Xargo

The one optimization I didn't initially enable from johnthagen's article was using xargo as it seemed like a lot of work to set up all the necessary tools and figuring out the command line but my initial result make it clear that I need to use all the available tools to minimize the executable size. ( Re-reading the xargo section I realized that the setup was actually quite simple)

Xargo tells Rust to compile its custom version of the std crate. By default a standard version is used that includes a lot of unused functionality. Enabling xargo requires a new setup file; xargo.toml that describes how to compile the std library.

[dependencies]
std = {default-features=false,features=["panic_immediate_abort"]}

It also requires switching to nightly build and installing xargo. The command line for running xargo is the same as cargo but you also need to define the target platform ( which you can find out by looking up hostwhen running the command rustc -vV). In my case the command is

xargo build --target x86_64-pc-windows-msvc --release

With xargo the size of the executable is now 118 kilobytes. This is a big improvement but it is still too large. It is also a lot bigger than the 30kb that johnthagen achieved at this point. The big difference is that we are building a windows app and using the winit crate to access Windows functionality. Clearly, we need to get rid of the crate and access Windows API without any helper crates.

Winapi crate

The winapi crate provides foreign function interfaces to the Windows API but does not provide any coded functionality of its own. This means that the application is responsible for all of functionality not provided by Windows.

I created a new project using the previous cargo.toml and xargo.toml files as my starting point but change the dependencies section in cargo.toml to;

[dependencies]
winapi = { version = "0.3.8", features = ["winuser", "libloaderapi" ] }

Earlier versions of the winapi crate split the functionality across multiple crate but now some parts of the Windows API are enabled using 'features'. The APIs I need are in winuser and libloaderapi.

The main function

I learned a lot about setting up the plain Window in Rust from TheSatoshiChiba's gist and lot of the code here is based on his example.

The top of the main.rs file has the following statements;

#![no_main]
#![windows_subsystem = "windows"]

The first line tells the compiler the the code will provide its own C entry point to the code. Normally this would be provided by the std crate but that uses additional std functionality that bloats the executable.

The second line defines this as a windows program that does not have a console window. Some older Rust Windows applications needed additional functionality to close the console window. This eliminates the need for that. There is full explanation on https://rust-lang.github.io/rfcs/1665-windows-subsystem.html

The actual main functions is;

#[no_mangle]
pub fn main(_argc: i32, _argv: *const *const u8) {
    let mut window = create_window(  );

    loop {
        if !handle_message( &mut window ) {
            break;
        }
    }
}

The line #[no_mangle] turns off the Rust name mangling making it conform with the signature expected for the the main functions. The rest of the code 'just' creates a window and enters a message processing loop.

Character sets with W and A funtions

There are two variants of many Windows functions; the W and A variants. The difference is in which character set they use. The W versions use wide characters strings (16 bit unicode ) and the A versions use ANSI characters.

Rust natively uses UTF-8 for its strings. For the first 127 characters UTF-8 and ANSI have the same encoding, so if I stick with those characters I can just assume that they are the same.

If I was writing an interactive application that needs to run in different regions I should really be using the W versions but because I intend to use this for writing a 64K intro I am going to use the A version because it makes the required string handling much easier.

Creating the window

All of the interaction with the Windows API must be marked as unsafe code because Rust cannot verify that the code satisfies all the usual Rust rules regarding ownership and initialization. All of the code dealing with the Windows API is surrounded by unsafe{ ... } block.

The window creation function performs two operations; registering a new window class, and creating a windoww of that class.

The code for registering the window class is

let hinstance = GetModuleHandleA( 0 as *const i8 );
let wnd_class = WNDCLASSA {
    style : CS_OWNDC | CS_HREDRAW | CS_VREDRAW,     
    lpfnWndProc : Some( window_proc ),
    hInstance : hinstance,
    lpszClassName : "MyClass\0".as_ptr() as *const i8,
    cbClsExtra : 0,         
    cbWndExtra : 0,
    hIcon: 0 as HICON,
    hCursor: 0 as HICON,
    hbrBackground: 0 as HBRUSH,
    lpszMenuName: 0 as *const i8,
};
RegisterClassA( &wnd_class );

The first line gets the handle to the module that launched the current module. In a C program this would be passed into winMain as an argument but calling GetModuleHandleA with a null argument returns the same handle.

The rest of the function uses the WNDCLASSA constructor to setup windows class object. Because I am using A variants of the Windows API I can directly use the string constant without any conversion. The one details is that Rust string slices are not null terminated which is why I have added the zero to the end of the class name.

The rest of the create_window function creates the actual window and returns the handle.

        let handle = CreateWindowExA(
            0,                                  // dwExStyle 
            "MyClass\0".as_ptr() as *const i8, // class we registered.
            "MiniWIN\0".as_ptr() as *const i8,  // title
            WS_OVERLAPPEDWINDOW | WS_VISIBLE, // dwStyle
            // size and position
            CW_USEDEFAULT, CW_USEDEFAULT, CW_USEDEFAULT, CW_USEDEFAULT, 
            0 as HWND,                  // hWndParent
            0 as HMENU,                 // hMenu
            hinstance,                  // hInstance
            0 as LPVOID );              // lpParam

The message pump

The second part of the code is the message handler which reads messages from the message queue and processes them.

fn handle_message( window : HWND ) -> bool {
    unsafe {
        let mut msg = MaybeUninit::uninit();
        if GetMessageA( msg.as_mut_ptr(), window, 0, 0 ) > 0 {
            TranslateMessage( msg.as_ptr() ); 
            DispatchMessageA( msg.as_ptr() ); 
            true
        } else {
            false
        }
    }
}

The noteworthy feature here is the use of MaybeUninit::uninit() which creates a structure on the stack without initializing it. Under normal Rust safety rules I would have to use the constructor to create the structure but here the msg structure is filled in by GetMessageA. Because a lot of Windows APIs follow this pattern it is nice to be able to use MaybeUninit::uninit() to avoid having to call constructors with dummy values.

The window_proc

The final piece is to create the window proc that handles all the messges that are routed to it. Strictly speaking this is not necessary because we could just have passed DefWindowProcA to RegisterClassA but I want to see something that proves that this is a well functioning normal Windows app.

pub unsafe extern "system" fn windowProc(hwnd: HWND,
    msg: UINT, wParam: WPARAM, lParam: LPARAM) -> LRESULT {

    match msg {
        winapi::um::winuser::WM_PAINT => {
            let mut paint_struct = MaybeUninit::uninit();
            let mut rect = MaybeUninit::uninit();
            let hdc = BeginPaint(hwnd, paint_struct.as_mut_ptr());
            GetClientRect(hwnd, rect.as_mut_ptr());
            DrawTextA(hdc, "Test\0".as_ptr() as *const i8, -1, rect.as_mut_ptr(), DT_SINGLELINE | DT_CENTER | DT_VCENTER);
            EndPaint(hwnd, paint_struct.as_mut_ptr());
        }
        winapi::um::winuser::WM_DESTROY => {
            PostQuitMessage(0);
        }
        _ => { return DefWindowProcA(hwnd, msg, wParam, lParam); }
    }
    return 0;
}

The shrinking executable

The new program comes to 10 kilobytes (or precisely 10240 bytes). This is a big improvement and takes the excutable into the range required for 64K intro. ( An exe packer like UPX reduces the size of the executable into 7680 bytes)

I had a look at the produced exe file and it looks like the small parts of the std library that do get used bring in a lot of extra code. It is time to completely jettison std library.

I add the following line to the of main.rs to completely break off any dependency on std

#![no_std]

Removing std causes two bits of automatically provided functionality to go missing. First, the panic handler is missing. This can be added with the following code;

#[panic_handler]
#[no_mangle]
pub extern fn panic( _info: &PanicInfo ) -> ! { loop {} }

The second issue is that we do not have a mainCRTStartup. This does work like initializing the runtime, getting the arguments etc. before calling main. None of that is required for the app so I can just repurpose the old main function into mainCRTStartup. The new function looks like;

#[no_mangle]
pub extern "system" fn mainCRTStartup() {
    let window = create_window(  );
    loop {
        if !handle_message( window ) {
            break;
        }
    }
}

Without std the executable is now 3584 bytes. The executable still looks quite padded out so the executable can probably still be made smaller yet but I think this requires more control of the linker. I might yet return to that but for now I think the executable is small enough to be a starting point for a 64K intro.

Next steps

The next stage is to setup a moden OpenGL rendering context without growing the code too much. All the code is on github https://github.com/janiorca/tinywin/tree/master/miniwin

Monday, 1 July 2019

glium - instancing

I have spent a bit of time looking into instancing with glium and OpenGL. It turns out that glium makes it very easy to make use of the powerful tehcnique.

Instancing

In my previous code I created an array of cubes by creating the actual vertices for every cube that I wanted to display. This is clearly incredibly inefficient and inflexible. Once the cubes have been created they can't easily be moved or adjusted and I am duplicating a lot of vertices. I could draw one cube per draw call but the overhead per call can become substantial if there are a lot of cubes ( or other objects).

Instancing reuses the same object data and redraws the same object where some of the data changes for each instance.

OpenGL has several ways of supporting instancing.

You can use a special glsl variable InstanceID that is incremented for each instance and use this to modify the vertex data ( for example, by looking up data by InstanceID ).
You can define a vertex type where some of the vertex data only changes for each instance. This is the approach I am going to take

Vertex types for Instancing

I am keeping my existing vertex Vertex3t2 type that defines the position and texture coordinates but I am also defining a new vertex type Attr that has all per-instance data. The new vertex will have the matrix for defining the instance position and orientation and its color;

#[derive(Copy, Clone)]
struct Attr {
    world_matrix: [[f32; 4]; 4],
}
glium::implement_vertex!(Attr, world_matrix);

This new "instance" vertex is defined using the exactly same type of code as ordinary vertices.

Setting up the vertex data

Because I am using instancing I build only one cube centered around the origin. I use the exactly same add_cube function I developed in the last post.

    let mut verts: Vec<Vertex3t2> = Vec::new();
    add_cube(&mut verts, &glm::Vec3::new(0.0, 0.0, 0.0));
    let indices = glium::index::NoIndices(glium::index::PrimitiveType::TrianglesList);

I also need to build the per-instance vertex data. Because the Attr vertices will be used for data changes for each frame I creating the vertex buffer using dynamic to let glium know that the vertices will be dynamcic ( whereas the cube vertices are stored in a regular vertex buffer )

    let mut positions: Vec<Attr> = Vec::new();
    build_object(&mut positions);
    let position_buffer = glium::VertexBuffer::dynamic(&display, &positions).unwrap();

The build object creates position and color vertices. For now I will simply recreate the same array of cubes that I created before.

fn build_object(attribs: &mut Vec<Attr>) {
    for x in -4..5 {
        for y in -4..5 {
            for z in -4..5 {
                let pos_matrix = glm::translate(&glm::identity(), 
                    &(glm::Vec3::new( x as f32, y as f32, z as f32 )*1.5f32 ) );
                attribs.push(Attr { world_matrix: pos_matrix.into() });
            }
        }
    }
}

Updating the shader code

The vertex shader needs to be updated to make use of the additional instance specific data. The vertex shader sees the instance data just like any other vertex data but it only gets updated between instances;

    in vec3 position;
    in vec2 tex_coords;
    in mat4 world_matrix;
    out vec2 v_tex_position;
    uniform mat4 view_matrix;
    uniform mat4 perspective;
    void main() {
        v_tex_position = tex_coords; 
        gl_Position = perspective * view_matrix * world_matrix * vec4(position, 1.0);
    }

There are a couple of changes here.

There is a new vertex input world_matrix which is the instance specific matrix that holds the object-to-world transformation
The old matrix uniform has been replaced by the view_matrix uniform. The previous code did not separate the world and view transforms but now we want to move both the instances and the camera independently so we separate out the view and world matrices
The calculation of the position has been changed to reflect the separate view and matrix transformations. The way to read the position calculation code is from right-to-left; First the position is transformed into world coordinates, then into view space and finally the perspective is applied.

The fragment shader does not need any changes.

Updating the rendering code

I have updated the code in the draw loop to make the part view matrix plays much explicit than before.

The view matrix is constructed around the idea that the camera always looks at the origin and rotates around it around the y- and x-axes.

Because the view matrix describes the transform from world space into view space it is constructed as the inverse of the camera-to-world transformation. I could construct the camera-to-world matrix and then take its inverse but I have chosen to manually construct the view matrix by applying the different transforms in opposite direction and order.

    let mut view_matrix_glm = glm::translate(&glm::identity(), &glm::Vec3::new(0.0, 0.0, -18.0));
    view_matrix_glm = glm::rotate_x(&view_matrix_glm, camera_angles.x);
    view_matrix_glm = glm::rotate_y(&view_matrix_glm, camera_angles.y);
    let view_matrix: [[f32; 4]; 4] = view_matrix_glm.into();

    let perspective_glm = glm::perspective(1.0, 3.14 / 2.0, 0.1, 1000.0);
    let perspective: [[f32; 4]; 4] = perspective_glm.into();
    let uniforms = glium::uniform! { view_matrix : view_matrix, perspective : perspective };

Finally, the actual render call needs to be updated to let glium know we now using two vertex buffers.

    target.draw( (&vertex_buffer, position_buffer.per_instance().unwrap()),
                &indices, &program, &uniforms, &params ).unwrap();

The difference is that now we pass in a tuple of vertex buffers. The first member contains the cube vertices and the seconds buffer contains the instance vertices.

The draw code now uses a unique matrix for each cube.

Updating the camera

I want to be able to move the camera. We already have the code for converting the camera angles into a view matrix so I just need to hook up to the mouse events to update the camera angles.

The camera should only be updated when the mouse is pressed down, so I need to capture the left mouse button state;

        glutin::WindowEvent::MouseInput { device_id: _device_id, state,  button, ..} => match button {
            glutin::MouseButton::Left => {
                mouse_down = state == glutin::ElementState::Pressed;
            }
        }

The mouse positions are posted via DeviceEvents so it needs to in its own match statement;

    glutin::Event::DeviceEvent { device_id, event } => match event {
        glutin::DeviceEvent::MouseMotion { delta } => {
            if mouse_down {
                camera_angles.y += delta.0 as f32 / 100.0f32;
                camera_angles.x += delta.1 as f32 / 100.0f32;
            }
        }
        _ => (),
    },

Animating the cubes

To really see the instance based matrices in action they need to be animated. So I replace the build_object function with some code to allocate space for the matrices.

    let world_matrix : [[f32;4];4] = ( glm::Mat4x4::identity() ).into();
    let mut positions : Vec<Attr> = vec!( Attr { world_matrix: world_matrix }; 32*80);

This create 32*80 matrices ( this number is completely arbitrary and depends on the animation function. Something I will address in the future ) and sets them all to identity. Before the draw I call animate_object that recalculates all the matrices, overriding the previous matrices. Finally I call write on the vertex buffer object to push them to OpenGL.

    animate_object(&mut positions, t);
    position_buffer.write(&positions);

The code inside animate_object function that animates the matrices somewhat unimportant as long as the cubes get animated. It is a lot of fun to play with for the purpose of demonstrating but is fun to play with. The code below produces the pulsating arch at the top of this posting.

fn animate_object(attribs: &mut Vec<Attr>, iTime: f32) {
    let mut cursor: glm::Mat4x4 = glm::Mat4x4::identity();
    cursor = glm::translate(&cursor, &glm::Vec3::new( -5.0, -10.0f32, 0.0f32 ) );
    let mut idx : usize = 0; 
    for y in 0..80 {
        cursor = glm::translate(&cursor, &glm::Vec3::new(0.0, 1.5, 0.0));
        cursor = glm::rotate_x(&cursor, 0.04);

        let radius = 7.0 + f32::sin(iTime + y as f32 * 0.4f32) * 2.0f32 * glm::smoothstep(0.0, 0.2, (y as f32) / 20.0f32);
        let points = 32;
        for c in 0..points {
            let mut inner = glm::rotate_y(&cursor, std::f32::consts::PI * 2.0 * c as f32 / (points as f32));
            inner = glm::translate(&inner, &glm::Vec3::new(0.0, 0.0, radius));
            attribs[ idx ].world_matrix = inner.into();
            idx += 1;
        }
    }
}

The code is a reasonable demonstration of how powerful instancing is and how easy it is to do instancing with glium.

The cubes are pretty ugly as they don't really have any proper shading. In my next blog post I want to add better, more interesting shading

Code Slow