Sensei Posted October 27, 2014 Posted October 27, 2014 (edited) Hello! In other thread author suggested using jump-table with sub-routines as a method of optimizing complex if() { } else if() { } statement. I have other, alternative suggestion. Hand-made binary-tree for complex if() statement.. Suppose so we have following code: if( state == 0 ) { } else if( state == 1 ) { } else if( state == 2 ) { } else if( state == 3 ){}else if( state == 4 ){}else if( state == 5 ){}else if( state == 6 ){}else if( state == 7 ){}else if( state == 8 ){}else if( state == 9 ){}else if( state == 10 ){}else if( state == 11 ){}else if( state == 12 ){}else if( state == 13 ){} else if( state == 14 ) { [... code executed for 14...] } else if( state == 15 ) { [... code executed for 15...] } In the best case, it'll find appropriate code after 1 comparison, in the worst scenario it'll have to compare all 16 cases. Such code we can rearrange to: if( state >= 8 ) // 50% of cases { if( state >= 12 ) // 25% of cases { if( state >= 14 ) // 12.5% of cases { if( state == 15 ) // 6.25% of cases { [... code executed for 15...] } else { [... code executed for 14...] } } else // states 12-13 { [...] } } else // states between 8 and 11 { [...] } } else // states smaller than 8 { [...] } Now any case is found after just 4 executions of if(). max.400% speedup. [math]\log_2(16)=4[/math] If we would have 256 cases in normal if() else if(), this method would reduce them to just 8 executions of if(). max.3200% speedup. [math]\log_2(256)=8[/math] Using it has sense when cases have approximately the same probability of being executed. It shouldn't be hard to make program generating such code with empty {} to fill them, taking as argument quantity of cases.. Best Regards! Edited October 27, 2014 by Sensei
SarK0Y Posted October 29, 2014 Posted October 29, 2014 if-reducing techniques have been always on the very edge of HPC. some implementation, you can get here: http://alg0z.blogspot.ru/2014/07/fastsortfsort-no-if-vs-qsort.html It's short post about & the're link to take really working algo for.
Strange Posted October 29, 2014 Posted October 29, 2014 I have other, alternative suggestion. Hand-made binary-tree for complex if() statement.. That is quite neat. I don't think I have ever seen that approach in source code but it is one strategy used when compiling case statements (and possibly if-else-if statements, but they are not as easy to analyse in the general case). I can imagine this being used in a loop, which reminded me of this: http://www.lysator.liu.se/c/duffs-device.html
Sensei Posted October 30, 2014 Author Posted October 30, 2014 (edited) switch() case code generated by C/C++ compiler can be much more efficient than above reduced quantity of if() code. Right-shifting instruction is like dividing by 2, but is also putting bit that's gone to Carry status register (at least on CPUs I worked). Pseudo-code that should be generated by compiler: value >>= 1; BCC case2 // Branch on Carry Clear [... code executed when the less significant bit is 1...] BRA end case2: [...code executed when the less significant bit is 0...] end: Now, repeat it as many times as there is significant bits in state variable. But jump-table for such switch is probably better option, like: state *= address of( second ) - address of( start ); JMP start+state start: BRA case1 second: BRA case2 (....repeat as many times as needed...) case1: [...code for case 1... ] BRA end case2: [...code for case 1... ] BRA end [....more cases...] end: size of BRA+parameter will be probably dividable by 2, so should be possible to be replaced multiplication by left-shifting of argument (state <<= 2).. Edited October 30, 2014 by Sensei
SarK0Y Posted October 30, 2014 Posted October 30, 2014 That is quite neat. I don't think I have ever seen that approach in source code but it is one strategy used when compiling case statements (and possibly if-else-if statements, but they are not as easy to analyse in the general case). I can imagine this being used in a loop, which reminded me of this: http://www.lysator.liu.se/c/duffs-device.html send(to, from, count) register short *to, *from; register count; { register n=(count+7)/8; switch(count%8){ case 0: do{ *to = *from++; case 7: *to = *from++; case 6: *to = *from++; case 5: *to = *from++; case 4: *to = *from++; case 3: *to = *from++; case 2: *to = *from++; case 1: *to = *from++; }while(--n>0); } } http://www.lysator.liu.se/c/duffs-device.html funny, but useless such tricks toughly dependent onto compiler. By practice, real HPC gets based on many tricks/methods to boost algos. However, better off to implement any freaky codes w/ pure asm because it's much more efficient & even more portable/compatible switch() case code generated by C/C++ compiler can be much more efficient than above reduced quantity of if() code. Right-shifting instruction is like dividing by 2, but is also putting bit that's gone to Carry status register (at least on CPUs I worked). Pseudo-code that should be generated by compiler: value >>= 1; BCC case2 // Branch on Carry Clear [... code executed when the less significant bit is 1...] BRA end case2: [...code executed when the less significant bit is 0...] end: Now, repeat it as many times as there is significant bits in state variable. But jump-table for such switch is probably better option, like: state *= address of( second ) - address of( start ); JMP start+state start: BRA case1 second: BRA case2 (....repeat as many times as needed...) case1: [...code for case 1... ] BRA end case2: [...code for case 1... ] BRA end [....more cases...] end: size of BRA+parameter will be probably dividable by 2, so should be possible to be replaced multiplication by left-shifting of argument (state <<= 2).. actually, i have seen many claims how efficient compilers can do, but i haven't seen such efficiency in practice. that if-reducing strategy, 1st of the all, is based upon function pointers. So algo changes itself on-fly. i cannot imagine compilers to implement such kind of things in automatic mode. perhaps, Skynet would be ok on that
Strange Posted October 30, 2014 Posted October 30, 2014 funny, but useless such tricks toughly dependent onto compiler. Not useless because it was written to solve a specific purpose. No compiler dependent because it is standard C so all compilers will correctly compile it. Although modern compilers might generate more efficient code automatically. However, better off to implement any freaky codes w/ pure asm because it's much more efficient & even more portable/compatible Huh? Assembler code is about as non-portable (and non-maintainable) as you can get. One of the reasons for using high level languages is to create portable code. Also, there are few, if any, cases where hand-coded assembler is better than a modern compiler.
SarK0Y Posted October 30, 2014 Posted October 30, 2014 Huh? Assembler code is about as non-portable (and non-maintainable) as you can get. One of the reasons for using high level languages is to create portable code. Also, there are few, if any, cases where hand-coded assembler is better than a modern compiler. asm code is quite portable, if you deal w/ CPUs of the same set of instructions 2nd moment: if someone cannot write asm codes faster than compilers, it's Just matter of weak skills/knowledge of programmer. And 3rd, you're right: high-level programming has been only for easy/fast developing. For standard codes, compilers are the best choice. For really true HPC, you must deal mostly w/ pure asming. + look at intrinsics: The Intel Intrinsics Guide is an interactive reference tool for Intel intrinsic instructions, which are C style functions that provide access to many Intel instructions - including Intel® SSE, AVX, AVX-512, and more - without the need to write assembly code. https://software.intel.com/sites/landingpage/IntrinsicsGuide/ that's pure admittion of impossibility to abstract hardware level fully Not useless because it was written to solve a specific purpose.No compiler dependent because it is standard C so all compilers will correctly compile it. did you encounter ever w/ situations where compiler does do something wrong??? full portability is from theoretical assumptions. if you do run something specific, ye have to keep in mind a probability of whatever bugs due to compiler/hardware/users/(3rd party libs). 1
Sensei Posted October 30, 2014 Author Posted October 30, 2014 full portability is from theoretical assumptions. I am porting Windows VisualStudio C/C++ code couple times per month to Macintosh OS X XCode, and there is no way source code is compiling without problems.. f.e. XCode does not like 'static' in front of functions... static void func( void ) { } and you have error... Such code is telling compiler on Windows that func() is visible only from the same C/C++ source file, and nothing else, so func() might be reused in other source file.. Otherwise there is linker error " 1>test.obj : error LNK2005: "void __cdecl func(void)" (?func@@YAXXZ) already defined in main.obj" But it's subject for completely different thread..
Strange Posted October 30, 2014 Posted October 30, 2014 asm code is quite portable, if you deal w/ CPUs of the same set of instructions Er, so it's portable as long as you don't want to port it. Therefore, not portable. 2nd moment: if someone cannot write asm codes faster than compilers, it's Just matter of weak skills/knowledge of programmer. Nope. There are few cases where a person can outperform a compiler. For really true HPC, you must deal mostly w/ pure asming. + look at intrinsics: Intrinsics like that are only used to access specialised instructions. But a good compiler will use them anyway. When I worked in HPC, I don't remember anyone writing in assembler code. And certainly not "pure assembler". did you encounter ever w/ situations where compiler does do something wrong? But you weren't talking about compiler bugs. They are pretty rare, anyway.
SarK0Y Posted October 31, 2014 Posted October 31, 2014 @Strange Nope. There are few cases where a person can outperform a compiler. compiler Just applies templates to convert pseudo-code to machine instructions, what template gets picked up has been up to programmer. if you solve standard task, compiler could be quite enough. specific tricks run beyond the capability of compilers. For instance, you want efficient 3OE (out-of-order execution) optimization, then you must write deeply asmed algo. Intrinsics like that are only used to access specialised instructions. But a good compiler will use them anyway. When I worked in HPC, I don't remember anyone writing in assembler code. And certainly not "pure assembler". HPC w/o Asming has been because of the idea to cut edges on funding. But, in fact, it's surrogate HPC. Until recently software developers have relied on the increasing capacity of hardware to compensate for increased software inefficiency, but the benefits of the so-called Moore’s Dividend are running out [3]. While the number of transistors on a chip is still doubling every two years, the gain in number of transistors can no longer be used to increase individual processor performance due to insufficient instruction level parallelism in a program and a chips power dissipation limit. Instead, the gain is being used to increase the number of processors on a chip. Therefore, unless the application itself is highly parallel in nature the potential performance improvement from increased hardware capacity has reached its limit. To accommodate future computing requirements it is necessary that accidental growth be minimized. Hence, software efficiency is becoming more important.Current software development practices (particularly reliance on shared libraries, elaborate class hierarchies, and layers of function calls), while increasing programmer productivity, also lead to software bloat and inefficient program execution. Existing compiler technology generally includes optimization options, but it neither removes unneeded or unreachable code from dynamically linked libraries nor simplifies class hierarchies. Because of this, existing technologies do not significantly reduce software bloat nor improve execution efficiency. Tools are needed to reduce software bloat and collapse software layers and thus improve software efficiency and robustness without impacting developer productivity.http://www.navysbir.com/n13_1/N131-061.htm Actually, you can compile my fastsort (for floats) & try to outperform it w/ purely C-written algo But you weren't talking about compiler bugs. They are pretty rare, anyway. if you make freaky tricks to speed-up, they become not so rare + one freaky tricks at C level makes different asm output for different hardware/compilers/(compile-time options).
Strange Posted October 31, 2014 Posted October 31, 2014 compiler Just applies templates to convert pseudo-code to machine instructions. OK. So you don't know how compilers work. Fine.
SarK0Y Posted October 31, 2014 Posted October 31, 2014 OK. So you don't know how compilers work. Fine. well, enlighten me upon it, if you're so knowledgeable. Please However, i prefer real test drive
EdEarl Posted February 4, 2018 Posted February 4, 2018 (edited) I think the fastest code will use an indirect function call, and a vector (array) of function address per array element. Then call the function via the array. Here is an example program that does an indirect call, but without the array. If you need help, let me know. It's been a few years and I'm rusty, but should be able to do it again. Sorry for the delay. Edited February 4, 2018 by EdEarl
Sensei Posted February 5, 2018 Author Posted February 5, 2018 (edited) 16 hours ago, EdEarl said: I think the fastest code will use an indirect function call, Jumping to subroutine is slower than jumping to code using goto (or assembler equivalent of it), because local variables in registers and return address have to be pushed to/popped from CPU stack. More memory accesses required. Edited February 5, 2018 by Sensei
EdEarl Posted February 5, 2018 Posted February 5, 2018 2 minutes ago, Sensei said: Jumping to subroutine is slower than jumping to code using goto (or assembler equivalent of it), because variable in registers and return address have to be pushed to/popped from CPU stack. More memory accesses required. Yes, at least on older machines. On modern chips there is so much caching and other run-time optimization it's possible there is no difference. It's best to run a test to be sure.
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now