114 points by theanonymousone | 4 days ago | 36 comments
thisislife2 · about 16 hours ago
This is an interesting project - kudos for executing it. I have to admit that when I was starting out in this field, I too fantasised: "Would this software be faster, smaller and better in assembly?". Of course, assembly programming made some sense in embedded electronics, which can be very resource-constrained and even specialised for one particular application. From that angle, perhaps you should consider making this a specialised program that runs on something like a Raspberry Pi - running such a web server directly on it, without an OS (or with a very minimal one), would make for a really cool and interesting project.
hatefulheart · about 15 hours ago
What on earth are you talking about? Assembly makes sense in desktop computing as well. Have you ever, for example, watched a video? What do you think powers the codecs, JSX?
xnorswap · about 15 hours ago
I'd assume most are written in C or some other higher-level language. Certainly I checked x264 (the VideoLAN implementation of H.264) and that is C.
otherjason · about 15 hours ago
The statistics reported by GitLab for the x264 repo (https://code.videolan.org/videolan/x264) report that the project is 13.5% assembly; common utilities used in the inner loops of the codec have optimized assembly implementations for several CPU architectures.
xnorswap · about 14 hours ago
I stand corrected.
jamal-kumar · about 14 hours ago
A lot of the encoding side of ffmpeg now uses hand-coded assembly optimizations to take advantage of AVX-512 instructions on newer x64 processors - in stable form since February 2025, with a claimed "100x speed increase" for some routines: https://www.techspot.com/news/108715-ffmpeg-gets-100x-faster...
thisislife2 · about 12 hours ago
Yes, I do know that some assembly is used in systems programming and other niches where it makes sense. To be clear, I was talking about the phase some of us go through (as amateurs) when we think everything would be "faster, smaller, better" if written in assembly - Python is slow. What about C or Pascal? Wouldn't asm be faster? ... but, as we all realise sooner or later, there's a reason we prefer to code in high-level or very high-level languages, and why premature optimisation can be a real handicap.
hatefulheart · about 11 hours ago
Ah yes, the niche that is video, audio, game and systems programming.
When those three to four amateurs still doing those niche things grow up they’ll move on to real programming, like putting together a solid skills.md file.
perching_aix · about 7 hours ago
Ah yes, assembly in contemporary game and systems programming. Definitely an everyday encounter, along with the millions of AV codec developers.
What?
senfiaj · about 15 hours ago
IMHO, for servers, I/O (FS/DB, network, etc.) is usually the greater bottleneck. Micro-optimizations make sense only for CPU-bound problems.
jvanderbot · about 15 hours ago
I am pained to think of TLS/HTTPS implemented as a hobby project in ASM, but would be impressed to see it.
LegionMammal978 · about 15 hours ago
I did actually make an attempt at that once for BGGP5 [0]. (That is, making a minimal, horribly insecure 'client' implementing just enough behavior to get a response from a server.) But I got demoralized by how much space the binary blobs for the crypto algorithms took up, in comparison to the actual machine code.
[0] https://binary.golf/5/
dorianmariecom · about 14 hours ago
That URL leads to a scam.
cevn · about 14 hours ago
If you keep scrolling down there is some kind of contest, I was confused at first as well. It seems to be parodying scam sites.
golem14 · about 13 hours ago
I'd really like to see a TCP/IP stack written in native Forth (if anyone needs a really good therapist, that sounds like a _great_ project to try ;)
I mean, it doesn't look _that_ daunting, but the fact that no one seems ever to have released an open-source version (there are rumours of proprietary stacks, though) speaks for itself.
One of these days ...
globnomulous · about 13 hours ago
I love this. I'm wondering: what skills did this project hone? What are you better at now than you were before you undertook it? Or was it just for fun?
drob518 · about 12 hours ago
I’m curious what the performance of this implementation is versus a server written in C, C++, or Rust. How much performance can a human still squeeze out at the assembly level versus today’s state of the art compilers?
mananaysiempre · about 10 hours ago
> How much performance can a human still squeeze out at the assembly level versus today’s state of the art compilers?
Most of the squeezing is to be had in the parts where the compiler can’t help. (Which I guess is logically equivalent to saying that you can’t often do meaningfully better than the compiler on the things that the compiler is concerned with, but you have to admit it reads very differently.) Two important widely-applicable examples are data layout (locality, in particular getting rid of large and costly-to-traverse pointers) and vectorization; what they have in common is that you may well have to redesign the entire flow of data in your program around the issue before you get meaningful improvements. (And there is often an order-of-magnitude improvement to be had on a CPU-bound task, if you are willing to spend the time and effort to optimize.)
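To make the data-layout point concrete, here is a toy C sketch (my illustration; the struct layout and names are invented, not from any real codebase): summing one hot field out of a fat record drags mostly cold bytes through every cache line, while a struct-of-arrays layout streams a dense, trivially vectorizable array.

```c
#include <stddef.h>

/* Array-of-structs: each 64-byte cache line holds mostly cold fields. */
struct particle_aos { double x, y, z, mass; char cold[32]; };

/* Struct-of-arrays: one dense array per field. */
struct particles_soa { double *x; double *y; double *z; double *mass; };

double sum_x_aos(const struct particle_aos *p, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += p[i].x;  /* strided loads */
    return s;
}

double sum_x_soa(const struct particles_soa *p, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += p->x[i]; /* contiguous loads */
    return s;
}
```

Both functions compute the same sum; the difference is purely how much of each fetched cache line is useful, which is exactly the kind of change that requires redesigning the data flow rather than tweaking instructions.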
There are also specific situations where the approaches used by modern compilers work badly. The straightforward switch-based interpreter is a well-known example: modern Clang essentially turns into Clippy and goes “looks like you’re writing an interpreter, would you like me to duplicate your dispatch for you” so branch prediction works out as well as in manual assembly, but it still allocates registers a function at a time, so when the function in question is the entirety of the interpreter including the slowpaths, the regalloc sucks. Tail-call interpreters and __attribute__((cold, noinline, preserve_most)) amount to expressing the exact same control-flow graph in such a way that the compiler can digest it better, ironically by understanding less of it at any given time. This is one way that the dumb fundamental nature of the admittedly quite smart modern compiler shines through.
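For illustration of the duplicated-dispatch shape (my sketch, using the GNU C computed-goto extension supported by GCC and Clang, not Clang's internal transformation): each opcode handler ends in its own indirect jump, so the branch predictor sees one jump site per opcode instead of a single shared switch.

```c
#include <stdint.h>

/* Tiny bytecode VM: 0 = halt, 1 = add1, 2 = double. Each handler ends in
   its own "goto *ops[*pc++]", i.e. the dispatch is duplicated per opcode. */
int64_t run(const uint8_t *pc, int64_t acc) {
    static void *ops[] = { &&op_halt, &&op_add1, &&op_dbl };
#define DISPATCH() goto *ops[*pc++]
    DISPATCH();
op_add1: acc += 1; DISPATCH();
op_dbl:  acc *= 2; DISPATCH();
op_halt:
#undef DISPATCH
    return acc;
}
```

The register-allocation complaint in the comment is about exactly this function shape: the whole interpreter, slowpaths included, is one function as far as the compiler's allocator is concerned.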
And in very tight loops there are still places where doing things by hand can help. For instance, when computing a histogram of byte values over a large block (for which I’m not aware of any public vectorized code that would go faster than the best scalar options) I’ve seen Clang lose as much as 20% to (contemporary) GCC on the best C implementation[1] or its straightforward manual translation to assembly, because Clang had decided it knew better which order the instructions should go in. As a less exotic case, I’ve seen GCC lose out by about 20% to (contemporary) Clang in vectorized loops because it had decided that having half the loop body be MOVs (or rather VMOVDQAs) would be a better idea than taking advantage of AVX’s ability to not overwrite either of the input arguments, and though MOVs are basically free on a superscalar they’re not that free. I’ve even seen both GCC and Clang ignore an explicit __builtin_expect() and compile a very predictable (but unavoidable) inner-loop branch into a CMOV, once again costing me about 20% in performance.
So if you do in fact care about the difference between 1.1 cycles/byte and 1.3 cycles/byte, yes you can beat a compiler even on a micro level. You just probably don’t have the, depending on your point of view, fortune or misfortune of working on code like that.
[1] https://github.com/powturbo/Turbo-Histogram
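As a sketch of the scalar technique in play (my illustration, not the Turbo-Histogram code): keeping several independent count tables stops runs of equal bytes from serializing on the same counter's load-increment-store chain, at the cost of a final merge.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte histogram with 4 independent accumulator tables, merged at the end. */
void hist4(const uint8_t *p, size_t n, uint32_t out[256]) {
    uint32_t c[4][256];
    memset(c, 0, sizeof c);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        c[0][p[i + 0]]++;  /* four independent dependency chains */
        c[1][p[i + 1]]++;
        c[2][p[i + 2]]++;
        c[3][p[i + 3]]++;
    }
    for (; i < n; i++) c[0][p[i]]++;  /* tail */
    for (int v = 0; v < 256; v++)
        out[v] = c[0][v] + c[1][v] + c[2][v] + c[3][v];
}
```

Instruction scheduling inside a loop like this is exactly where the reported 20% compiler-vs-compiler swings live.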
hansvm · about 2 hours ago
Today's state of the art compilers can't even do vectorized integer division by a compile time known constant very well. They definitely can't map high-level constructs onto low-level patterns, and they don't carry anywhere near enough semantic information through the different optimization passes to be able to take even very safe, simple, sane shortcuts with zero possibility of UB or other issues. There's a lot of performance being left on the table.
Mind you, somebody who's sympathetic to the machine's needs can easily scrape most of that performance back by writing C/C++/Zig in a way that easily maps to the optimal assembly. The optimizer won't make your code drastically worse too often, so if you start with something nice then actually dropping down into assembly has limited use cases and usually limited benefits...if you know what you're doing and throw out every style guide as you do so.
As to this server in particular? At first blush it looks more like a learning exercise. You'll go a lot further with clever incremental routines and appropriately leveraging your OS's async API than you will by shaving a few instructions here and there.
As to servers in general? Your kernel is the real bottleneck. If you need all of its features then you don't have a lot of options, but if you're like most applications then you're leaving a ton of performance on the floor not going for kernel bypass (not that using your kernel for network is a _bad_ decision, but you are nevertheless incurring a 10x-50x performance hit as the cost). Assembly shenanigans literally don't matter in comparison.
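On the division-by-constant point above, the scalar transformation compilers do reliably apply looks like this multiply-shift sketch (my illustration, Hacker's Delight-style magic number for an unsigned divide by 7); the complaint is that the vectorized equivalent is often generated poorly or not at all.

```c
#include <stdint.h>

/* x / 7 without a divide: multiply by a rounded reciprocal, then fix up
   because the exact magic constant needs 33 bits. */
uint32_t div7(uint32_t x) {
    uint32_t q = (uint32_t)(((uint64_t)x * 0x24924925u) >> 32); /* high half */
    return (q + ((x - q) >> 1)) >> 2;
}
```

This is the same sub/shr/add/shr fixup sequence GCC emits for a scalar `x / 7u`.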
pjdesno · about 12 hours ago
> I’m curious what the performance of this implementation is
Almost certainly crap.
As the author states, it's a simple fork-on-request server, which was state-of-the-art in about 1996. But that's not the point.
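For reference, the fork-on-request model being described is roughly this C sketch (simplified and hedged: no error handling, one request per connection, HTTP/1.0 close-on-response, all function names my own):

```c
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

/* Listening TCP socket on 127.0.0.1; port 0 picks an ephemeral port. */
int make_listener(uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in a = { 0 };
    a.sin_family = AF_INET;
    a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    a.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&a, sizeof a) < 0) return -1;
    if (listen(fd, 16) < 0) return -1;
    return fd;
}

uint16_t listener_port(int fd) {
    struct sockaddr_in a;
    socklen_t len = sizeof a;
    getsockname(fd, (struct sockaddr *)&a, &len);
    return ntohs(a.sin_port);
}

/* The 1996 model: accept, fork, let the child own the connection. */
void accept_and_fork_once(int lfd) {
    static const char resp[] = "HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok";
    int cfd = accept(lfd, NULL, NULL);
    if (cfd < 0) return;
    if (fork() == 0) {
        char buf[1024];
        read(cfd, buf, sizeof buf);       /* read (and ignore) the request */
        write(cfd, resp, sizeof resp - 1);
        close(cfd);
        _exit(0);
    }
    close(cfd);                           /* parent drops its copy */
    wait(NULL);                           /* demo only: reap synchronously */
}
```

A process per request was fine for 1996-era load; it's the model that event loops (and later io_uring-style APIs) displaced.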
jubilanti · about 11 hours ago
And that wasn't the question the parent comment asked.
tredre3 · about 10 hours ago
The parent asked two questions:
1. What is the performance of this implementation of an assembly server? The comment you replied to answered it: it's likely crap. Hand-written assembly is almost always worse than compiler output.
2. How much performance can a human still squeeze out at the assembly level? That question is different and remains unanswered. But in my experience the answer is the same.
nikisweeting
But I've struggled with the IO parts; it seems every system is so different that it's hard to live without LLVM-IR as the middle layer abstracting away compiler and target differences.
DeathArrow · about 13 hours ago
Meanwhile another 10000 developers published desktop apps, mobile apps and system software written in javascript. /s
imvetri · about 13 hours ago
How do you build a web server for the hardware in assembly only? I mean, it almost sounds like building an OS from scratch.
StilesCrisis · about 13 hours ago
"i hate string parsing. especially in assembly."
Well. Interesting choice of side project!
theanonymousone · about 14 hours ago
Something funny has happened. I didn't submit this link today, as shown on the front page; I submitted it three days ago, actually before the original author's submission above.
tomku · about 8 hours ago
HN has a "second chance queue" that sometimes revives posts and gives them another shot. Not entirely sure how it works, but it sounds like what happened here.
globnomulous · about 13 hours ago
Maybe combine the threads, dang?