April 2017 Update

I've finally removed SDL2 and the CRT (on Windows) from my personal codebase! I've been furiously hacking away at a platform layer + needed libs for most of April, and while it wasn't ready for Ludum Dare, it's coming along now; I can get a window + OpenGL graphics + input working with no issues.

Removing the CRT proved to be an issue for a lot of the free/public domain libraries I was using. After correctly mapping intrinsics and other library functions to my own code, adding my own special-case sort functions to replace instances of qsort, and commenting out #includes when things didn't work, this is how things ended up:

stb_vorbis didn't make it. I couldn't stop it from generating a __chkstk in one of its procedures.
stb_image works, but has trouble. I ended up spending a lot of time tracing its allocations, only to find that it expects realloc to behave like malloc when it is passed a null pointer, which differs from the behavior of Microsoft's HeapReAlloc. Right now it's having trouble with PNGs generated by Aseprite, so I'm still looking into that.
stb_sprintf works perfectly!
I ended up cutting nuklear; I didn't want to rewrite the vertex buffer renderer, and I find the library has a lot of little gotchas. If a library has problems and I have to read its source code to figure out how to use it, I'd rather copy the hard stuff out or write it myself. The amount of time I saved in VMFSketch by using nuklear was significantly offset by the amount of time lost trying to figure out why my app would crash if I didn't call nk_begin on every window every frame. In the end, if I'm prepared to write my own GUI stuff (which I've already done in part for Rituals), I would only be using nuklear for its font atlas stuff, which isn't too much to implement myself.
To that end, I pulled in stb_truetype, which compiled just fine after replacing all its intrinsics.
stb_rectpack had to have a few instances of qsort replaced.
If I remember correctly, miniz, sts_mixer, dr_wav, and dr_flac all worked just fine after cleaning up.
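The realloc mismatch described for stb_image can be papered over with a thin wrapper. This is only a sketch of the idea, not the code from my codebase: crt_free_realloc and the heap_* helpers are hypothetical names, the non-Windows branch is a stand-in so the sketch builds elsewhere, and freeing on a zero size is one possible choice rather than something stb_image requires.

```c
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
/* Thin wrappers over the Win32 process heap. */
static void *heap_alloc(size_t size)            { return HeapAlloc(GetProcessHeap(), 0, size); }
static void *heap_resize(void *p, size_t size)  { return HeapReAlloc(GetProcessHeap(), 0, p, size); }
static void  heap_release(void *p)              { HeapFree(GetProcessHeap(), 0, p); }
#else
#include <stdlib.h>
/* Stand-in backend for other platforms. heap_resize deliberately
   rejects NULL to mimic the behavior described above. */
static void *heap_alloc(size_t size)            { return malloc(size); }
static void *heap_resize(void *p, size_t size)  { return p ? realloc(p, size) : NULL; }
static void  heap_release(void *p)              { free(p); }
#endif

/* realloc-style entry point: a null pointer means "allocate",
   which is what C's realloc promises and what callers may rely on. */
static void *crt_free_realloc(void *p, size_t size)
{
    if (p == NULL) {
        return heap_alloc(size);
    }
    if (size == 0) {
        /* Assumption: treat a zero size as a free and return NULL. */
        heap_release(p);
        return NULL;
    }
    return heap_resize(p, size);
}
```

stb_image lets you route its allocations through your own functions with the STBI_MALLOC, STBI_REALLOC and STBI_FREE macros, so a wrapper like this can be plugged in there.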

Small, single-file libraries tend to do pretty well if you have replacements for the commonly used CRT functions, namely qsort, memcpy and the math.h transcendentals, all of which I spent time writing replacements for.

Funnily enough, the transcendental math was the easiest: these zlib-licensed implementations seem to be pretty good. Add several hours translating an atan2 approximation to SSE2 intrinsics, adding single-float versions, and filling in some of the gaps (pow, fabs, min, max, sqrt/rsqrt, ldexp), and I have a reasonable replacement for most of the CRT's math functions.
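For a sense of what the gap-filling looks like, here is a rough sketch of a few single-float helpers built on SSE intrinsics. The m_ names are made up for this example and it assumes an x86 target with SSE; the rsqrt version adds one Newton-Raphson step on top of the hardware estimate, which is a common refinement and not necessarily what my code does.

```c
#include <xmmintrin.h> /* SSE scalar intrinsics */

/* Square root via the scalar SSE instruction. */
static float m_sqrtf(float x)
{
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
}

/* Reciprocal square root: hardware estimate (roughly 12 bits)
   refined with one Newton-Raphson iteration. */
static float m_rsqrtf(float x)
{
    float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    return y * (1.5f - 0.5f * x * y * y);
}

/* Absolute value: andnot(a, b) computes ~a & b, so using -0.0f
   (only the sign bit set) as the mask clears the sign bit of x. */
static float m_fabsf(float x)
{
    return _mm_cvtss_f32(_mm_andnot_ps(_mm_set_ss(-0.0f), _mm_set_ss(x)));
}

static float m_minf(float a, float b)
{
    return _mm_cvtss_f32(_mm_min_ss(_mm_set_ss(a), _mm_set_ss(b)));
}

static float m_maxf(float a, float b)
{
    return _mm_cvtss_f32(_mm_max_ss(_mm_set_ss(a), _mm_set_ss(b)));
}
```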

Sorting turned out to be a little more difficult, but only because I bothered translating Orson Peters' pdqsort to C. The translation of an iterator-heavy C++ library to the sort(type* array, size_t count) C-style ended up being more confusing than I had accounted for; I found myself debugging line-by-line in two instances of Visual Studio. Some problems I ran into:

std::less behaves differently than qsort comparison functions.
Keeping track of offsets is tricky when converting from iterators to array/count style. I could have, and maybe should have, used two pointers instead.
Debugging with C++ iterators is a pain, since you don't know where you are relative to the start of the array. I suppose my implementation doesn't work with generic list-like objects... but that's a bridge to cross when I get there.
When bugs had me reading and writing outside of the array, offsetting the array in a bigger chunk of memory and writing all the external values to -1 helped find where things were going wrong.
Converting everything to macros wasn't as straightforward as I'd have liked; I ended up inlining everything to get around this, which was no fun.
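To illustrate the first point in the list above: std::less is a yes/no "strictly less than" predicate, while a qsort comparison function returns negative, zero or positive. A minimal sketch of bridging the two is below. The names are hypothetical and the insertion sort only stands in for a real sort routine; this is not the pdqsort translation itself.

```c
#include <stddef.h>

/* qsort-style three-way comparison: negative, zero or positive. */
typedef int (*compare_fn)(const void *a, const void *b);

static int compare_ints(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* std::less-style predicate derived from a three-way comparison.
   It has to be strictly "< 0": using "<= 0" would claim that equal
   elements are less than each other. */
static int less_than(compare_fn compare, const void *a, const void *b)
{
    return compare(a, b) < 0;
}

/* Array/count-style insertion sort that only ever asks "is a < b?". */
static void insertion_sort_ints(int *array, size_t count, compare_fn compare)
{
    size_t i;
    for (i = 1; i < count; ++i) {
        int value = array[i];
        size_t j = i;
        while (j > 0 && less_than(compare, &value, &array[j - 1])) {
            array[j] = array[j - 1];
            --j;
        }
        array[j] = value;
    }
}
```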

Actually, I'm pretty sure there's still a bug in there that makes it run slower when a lot of values are the same (probably a <= vs a <), but the important thing is that it correctly sorts everything I thought to throw at it. When I start using it more, I'll compare its speed to other implementations.
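The guard-value trick from the list above is easy to turn into a small test helper. This is a sketch under a few assumptions: the names are invented for the example, the inputs never contain -1, and the two sorts are stand-ins (one correct, one with a deliberate off-by-one) rather than my actual implementation.

```c
#include <stddef.h>

enum { GUARD_COUNT = 8, MAX_COUNT = 64, GUARD_VALUE = -1 };

typedef void (*sort_fn)(int *array, size_t count);

/* Places the input in the middle of a larger block filled with
   GUARD_VALUE, runs the sort, then checks that no guard slot was
   written and no guard value was pulled into the array.
   Returns 1 if the sort stayed in bounds, 0 otherwise. */
static int sort_stays_in_bounds(sort_fn sort, const int *input, size_t count)
{
    int block[GUARD_COUNT + MAX_COUNT + GUARD_COUNT];
    size_t total = GUARD_COUNT + count + GUARD_COUNT;
    size_t i;

    if (count > MAX_COUNT) return 0;
    for (i = 0; i < total; ++i) block[i] = GUARD_VALUE;
    for (i = 0; i < count; ++i) block[GUARD_COUNT + i] = input[i];

    sort(block + GUARD_COUNT, count);

    for (i = 0; i < GUARD_COUNT; ++i) {
        if (block[i] != GUARD_VALUE) return 0;                       /* wrote before the array */
        if (block[GUARD_COUNT + count + i] != GUARD_VALUE) return 0; /* wrote past the array */
    }
    for (i = 0; i < count; ++i) {
        if (block[GUARD_COUNT + i] == GUARD_VALUE) return 0;         /* read a guard into the array */
    }
    return 1;
}

/* Stand-in for the sort under test. */
static void stand_in_sort(int *array, size_t count)
{
    size_t i;
    for (i = 1; i < count; ++i) {
        int value = array[i];
        size_t j = i;
        while (j > 0 && value < array[j - 1]) {
            array[j] = array[j - 1];
            --j;
        }
        array[j] = value;
    }
}

/* Deliberately broken: touches one element past the end. */
static void off_by_one_sort(int *array, size_t count)
{
    stand_in_sort(array, count + 1);
}
```

With a non-negative input, sort_stays_in_bounds(stand_in_sort, ...) passes and sort_stays_in_bounds(off_by_one_sort, ...) fails, which narrows down where to start stepping through in the debugger.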

And, last of the big three things I've wrestled with, replacing memcpy had me running in circles for a while. I wasn't able to find too much on the subject; talking to d7 and J_vanRijn in the discord and mmozeiko's post were my main sources of information (there's a big post on CodeProject too, but supposedly it's all licensed under the CPOL, which is pretty restrictive). d7's conclusion was that there's too much variation in processors to really write one memcpy to rule them all; for his current project, a simple memcpy was all that was needed. Starting with his general process as a base, I played around with some common stuff (loop unrolling, Duff's device), which maybe helped a little bit. Mārtiņš' advice holds true though: rep movsb is pretty good in most cases. My final version pretty much matches or beats the builtin CRT memcpy across a range of sizes... at least on my Skylake i5. Keep in mind that rep movsb isn't optimized on pre-Ivy Bridge Intel chips, and I don't know about AMD chips. The final implementation looks like this:

For sizes less than 16 bytes, copy them as a series of ints.
For sizes less than 1024 bytes in the AVX route and 512 bytes in the SSE route, use head and tail copying (copy the first N bytes, then the last N bytes, and so on). This tends to be a lot faster than the builtin memcpy on my computer.
For sizes greater than 1 KB and less than 1<<21 bytes (2 MB), use movsb. (It's an intrinsic on cl; you can use inline assembly on gcc/clang.)
For sizes greater than 2 MB, I use _mm_load_si128 and _mm_stream_si128 on 16-byte aligned buffers and _mm_lddqu_si128 and _mm_storeu_si128 on unaligned buffers. This ends up being faster than both movsb and memcpy for aligned buffers; I can't remember how it did for unaligned ones. However, I highly doubt I'll actually be doing any copies of this size.
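The head and tail copying from the list above is the least obvious of these, so here is a rough sketch of it for sizes up to 64 bytes using SSE2. It is not the code from my implementation: small_copy is a hypothetical name, the sub-16-byte case uses a plain byte loop instead of int copies to keep the sketch short, and it assumes an x86 target with SSE2 and non-overlapping buffers.

```c
#include <emmintrin.h> /* SSE2: _mm_loadu_si128, _mm_storeu_si128 */
#include <stddef.h>

/* Copies size bytes for size <= 64. The head and tail blocks are
   allowed to overlap in the middle, so no per-byte remainder loop
   is needed for sizes of 16 bytes or more. */
static void small_copy(void *dst, const void *src, size_t size)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    if (size < 16) {
        /* Simplification: byte loop instead of a series of ints. */
        while (size--) *d++ = *s++;
    } else if (size <= 32) {
        /* One 16-byte block from the front, one from the back. */
        __m128i head = _mm_loadu_si128((const __m128i *)s);
        __m128i tail = _mm_loadu_si128((const __m128i *)(s + size - 16));
        _mm_storeu_si128((__m128i *)d, head);
        _mm_storeu_si128((__m128i *)(d + size - 16), tail);
    } else {
        /* 33..64 bytes: two blocks from the front, two from the back. */
        __m128i head0 = _mm_loadu_si128((const __m128i *)s);
        __m128i head1 = _mm_loadu_si128((const __m128i *)(s + 16));
        __m128i tail1 = _mm_loadu_si128((const __m128i *)(s + size - 32));
        __m128i tail0 = _mm_loadu_si128((const __m128i *)(s + size - 16));
        _mm_storeu_si128((__m128i *)d, head0);
        _mm_storeu_si128((__m128i *)(d + 16), head1);
        _mm_storeu_si128((__m128i *)(d + size - 32), tail1);
        _mm_storeu_si128((__m128i *)(d + size - 16), tail0);
    }
}
```

The same pattern extends to larger size classes by loading more blocks from each end, which is how it can cover the ranges up to 512 or 1024 bytes.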

Again, your mileage may vary here, and I expect to have to revisit this as other people try to run my code. It's possible that I'll need to provide a movsb alternative for Sandy Bridge and earlier, or for AMD chips. A small note: AVX wasn't faster at scale, but it consistently did better in the 512-1024 byte range with head/tail copies. If the buffer was in cache, it was much, much faster up to 4096 bytes. You can check out the code for this here.
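For reference, the movsb path can be wrapped so that it builds on both cl and gcc/clang. This is a sketch rather than my actual code: copy_rep_movsb is a hypothetical name, and the byte loop at the end is only there so the function still works on non-x86 targets.

```c
#include <stddef.h>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h> /* __movsb */
#endif

static void copy_rep_movsb(void *dst, const void *src, size_t size)
{
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    /* cl exposes the instruction as an intrinsic. */
    __movsb((unsigned char *)dst, (const unsigned char *)src, size);
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* gcc/clang: rep movsb copies (R|E)CX bytes from [(R|E)SI] to
       [(R|E)DI]; all three registers are updated by the instruction. */
    __asm__ __volatile__("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(size)
                         :
                         : "memory");
#else
    /* Portable fallback for other targets. */
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    while (size--) *d++ = *s++;
#endif
}
```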

That's about it. I wrote my own OpenGL loader, but that's remarkably simple if you already have a list of functions and their parameters. Creating OpenGL contexts with Win32 is a pain, but well documented. Feature-wise, soon I'm going to implement a texture atlas system to use with stb_truetype and my own graphics. I haven't done audio yet. I'm told WASAPI is the way to go for modern stuff, and from poking around it seems like it works from pure C? I'm not sure yet, so that's more testing to come. I grabbed sts_mixer for a LD a while back, and it seems to be okay, but I might take a stab at writing my own too.
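To show why the loader is simple once you have the function list, here is a sketch using a macro list of three real GL entry points. Everything else is assumed for the example: the macro and type names are invented, the resolver is passed in as a callback, and fake_resolver is a stand-in so the sketch runs without a GL context. On Win32 the callback would wrap wglGetProcAddress (plus GetProcAddress on opengl32.dll for the old GL 1.1 functions), and the function pointer types would also need the APIENTRY calling convention.

```c
#include <string.h> /* strcmp, used only by the stand-in resolver */

typedef unsigned int GLenum;
typedef unsigned int GLuint;

/* The function list: return type, name, parameters. */
#define GL_FUNCTIONS(X) \
    X(GLuint, glCreateShader, (GLenum type)) \
    X(void, glCompileShader, (GLuint shader)) \
    X(void, glDeleteShader, (GLuint shader))

/* Declare a pointer type and a global pointer for each entry. */
#define DECLARE_GL(ret, name, params) \
    typedef ret (*name##_proc) params; \
    static name##_proc name;
GL_FUNCTIONS(DECLARE_GL)
#undef DECLARE_GL

typedef void (*gl_proc)(void);
typedef gl_proc (*gl_resolver)(const char *name);

/* Resolves every entry in the list by name.
   Returns the number of functions that could not be found. */
static int load_gl_functions(gl_resolver resolve)
{
    int missing = 0;
#define LOAD_GL(ret, name, params) \
    name = (name##_proc)resolve(#name); \
    if (name == 0) ++missing;
    GL_FUNCTIONS(LOAD_GL)
#undef LOAD_GL
    return missing;
}

/* Stand-in resolver and functions, for illustration only. */
static GLuint fake_create_shader(GLenum type) { (void)type; return 1; }
static void fake_compile_shader(GLuint shader) { (void)shader; }

static gl_proc fake_resolver(const char *name)
{
    if (strcmp(name, "glCreateShader") == 0) return (gl_proc)fake_create_shader;
    if (strcmp(name, "glCompileShader") == 0) return (gl_proc)fake_compile_shader;
    return 0; /* glDeleteShader is left unresolved on purpose */
}
```

Adding a function is then a one-line change to the list, and the count returned by load_gl_functions makes it easy to notice when a driver is missing something.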

As for future plans? I plan to get this working pretty well in May. I'd like to stream more and put together a few videos, but I don't expect to be too consistent with it this month.