Sensei Posted February 17, 2015 Posted February 17, 2015 (edited) Hello! Suppose so you, mathematician, have such task: find whether point P (xp,yp) (2d) or (xp,yp,zp) (3d) is in triangle defined by vertexes A,B,C. Please show me your algorithms. The more optimized algorithm, the better *) Here is common stage needed for all below algorithms: (please note that this can be already calculated once at triangle initialization stage) min = Min( a, Min( b, c ) ); max = Max( a, Max( b, c ) ); if( ( p.x >= min.x ) && ( p.x <= max.x ) && ( p.y >= min.y ) && ( p.y <= max.y ) && ( p.z >= min.z ) && ( p.z <= max.z ) ) { [...more detailed check here....] } else { // outside of bounding-box of triangle } Example algorithm 1: Scanline algorithm. Find which vertex of triangle has lowest y, which has middle y, which has highest y. Common edge will be called f.e. right (from top to bottom y). 2nd edge will be called f.e. left (from top to middle, or from middle to bottom) Split triangle to 2 smaller triangles, with middle y to be bottom, or middle y to be top. right.x = top.x + ( bottom.x - top.x ) * ( p.y - top.y ) / ( bottom.y - top.y ); //right.y = p.y; if( p.y >= middle.y ) { // we're in bottom part of triangle left.x = middle.x + ( bottom.x - middle.x ) * ( p.y - middle.y ) / ( bottom.y - middle.y ); //left.y = p.y; } else { // we're in top part of triangle left.x = top.x + ( middle.x - top.x ) * ( p.y - top.y ) / ( middle.y - top.y ); //left.y = p.y; } if( left.x > right.x ) swap( left.x, right.x ); // p.y we don't have to bother, it was rejected at 1st common stage if( ( p.x >= left.x ) && ( p.x <= right.x ) ) { // we're in triangle. } For 2D triangle it's done. For 3D triangle we need to repeat with other axis. (I skipped division by 0, which needs to be handled properly) Example algorithm 2: Calculate angle between ABP, say alpha. Calculate angle between ABC, say beta. If alpha is bigger than beta, point is outside. Otherwise further check: If we divide alpha/beta we will have variable which we can use for interpolation: f = alpha / beta; length_ba = Length( b, a ); // length between B and A length_bc = Length( b, c ); // length between B and C length_bp = Length( b, p ); // length between B and P length = length_ba * ( 1.0 - f ) + length_bc * f; if( length >= length_bp ) { // point P inside of triangle ABC } else { // outside } double Length( vector &a, vector &b ) { vector c = a - b; return( sqrt( DotProduct( c, c ) ) ); } double DotProduct( vector &a, vector &b ) { return( a.x*b.x + a.y*b.y + a.z*b.z ); } *) As less as possible sqrt()/sin()/cos()/asin()/acos() etc. etc. very slow functions. This week I had application which was doing 2 billions of sqrt() on single thread of Core i7 it was taking 90 seconds. After getting rid of sqrt() and replacing it with something lighter, same code started running in.. 3 seconds. 30 times speed up just because no sqrt(). Internal loop also had 2 bln executions. Best Regards! ps. Please don't cheat and use Internet sources, just your brain. Thank you Edited February 17, 2015 by Sensei 1
Strange Posted February 17, 2015 Posted February 17, 2015 (edited) Had to do this all the time when coding 3D graphics. The standard solution, if I remember correctly, was to substitute the coordinates of the point into the line equations for each of the edges; if all three solutions are negative, then the point is inside the triangle. (This depends on all triangles being defined in a consistent order.) Worst case, in 2D this needs 2 multiplies and 1 add for each edge (it works with any ploygon, not just triangles) and then a comparison for each edge. In 3D it requires 3 multiplies and 2 adds for each edge. I seem to remember that there are optimizations (barycentric coordinates?) which can reduce the operations needed. Edited February 17, 2015 by Strange
Sensei Posted February 17, 2015 Author Posted February 17, 2015 (edited) Answers should be in spoiler tag to not influence other people.. But show your algorithm in such way, it will be possible to count operations needed (multiply, sub, add etc.) So there will be chance to estimate whether it's more optimal or not. Edited February 17, 2015 by Sensei
Strange Posted February 17, 2015 Posted February 17, 2015 This week I had application which was doing 2 billions of sqrt() on single thread of Core i7 it was taking 90 seconds. Shame you weren't doing inverse square root, you could have used this neat trick: http://en.wikipedia.org/wiki/Fast_inverse_square_root
Sensei Posted February 18, 2015 Author Posted February 18, 2015 Worst case, in 2D this needs 2 multiplies and 1 add for each edge (it works with any ploygon, not just triangles) and then a comparison for each edge. In 3D it requires 3 multiplies and 2 adds for each edge. I seem to remember that there are optimizations (barycentric coordinates?) which can reduce the operations needed. Sounds like you are counting operations after preprocessing stage, when points are no longer in their raw A, B, C positions anymore.. ? Strange: In one algorithm there was stored raw position of point f.e. A, and deltas B-A, and C-A, or something like that. So it's instant 6 subtractions less (cached at preprocess of triangles). In other algorithm one can calculate normal vector (and/or length) for AB and BC and CA, instead of doing this at triangle-hit routine. But this way we will never be able to compare one algorithm with another when you won't include what is done at preprocessing! ps. Aren't you talking about triangle-ray intersecting routine? A bit different from triangle-point. The whole point of this thread was to do something constructive, brainstorm that will lead to finding another unknown algorithm.. I was hoping for mathematician coming in with something out of box, but so far nobody have idea.. ?
Daedalus Posted February 18, 2015 Posted February 18, 2015 (edited) One technique that I used to determine if a point is in a triangle (2d or 3d), involved finding the areas of 4 triangles created by the points as follows: First we find the area of the triangle defined by the vertices. Then, we find the areas of the three triangles created by the point and two of the vertices and sum them up. If our sum at any point in the summation process is greater than the area of the triangle, we can break from the algorithm because we know the point isn't in the triangle. Worst case scenario is that we have to add up all three areas. If the sum is equal to the area of the triangle, then the point is in the triangle, else it is not. No square roots or vector calculus needed. Super fast, and easy to implement on the GPU using Cuda or whatever!!! Edited February 18, 2015 by Daedalus 4
Strange Posted February 18, 2015 Posted February 18, 2015 Sounds like you are counting operations after preprocessing stage, when points are no longer in their raw A, B, C positions anymore.. ? What preprocessing stage? The whole point of this thread was to do something constructive, brainstorm that will lead to finding another unknown algorithm.. I was hoping for mathematician coming in with something out of box, but so far nobody have idea.. ? Some very bright people have spent decades on this. But you never know, someone might have a novel insight.
Sensei Posted February 18, 2015 Author Posted February 18, 2015 (edited) What preprocessing stage? You have xp,yp and xa,ya and xb,yb, raw positions, in total 6 variables, right? Show me how using three math operations: 2 multiplications and 1 addition (like you said in #2 post), you're knowing that xp,yp is below or above line defined by xa,ya and xb,yb... ?? Daedalus, congratulations. Finally something new. +1 for fresh algorithm. But I don't think so it's going to be faster than algorithm #1. You will have something like 16 multiplications (division can replaced by multiply by 1/x), 12 subtractions, 8 additions or more, for 2D triangle calculation. Edited February 18, 2015 by Sensei
Strange Posted February 18, 2015 Posted February 18, 2015 You have xp,yp and xa,ya and xb,yb, raw positions, in total 6 variables, right? Show me how using three math operations: 2 multiplications and 1 addition (like you said in #2 post), you're knowing that xp,yp is below or above line defined by xa,ya and xb,yb... ?? Do you mean generating the line equation from the vertices? That only needs to be done once for each triangle, not for each point test, so it barely counts as an overhead. Basically, you represent the triangles as sets of line equations rather than the coordinates of the vertices. (I think this is where barycentric coordinates come in ... but it is about 30 years since I looked at this!)
Daedalus Posted February 19, 2015 Posted February 19, 2015 (edited) Daedalus, congratulations. Finally something new. +1 for fresh algorithm. But I don't think so it's going to be faster than algorithm #1. You will have something like 16 multiplications (division can replaced by multiply by 1/x), 12 subtractions, 8 additions or more, for 2D triangle calculation. Considering that you want optimized algorithms, I think I got you beat. The more optimized algorithm, the better *) My algorithm can be easily optimized to take advantage of parallel processing on the GPU or any processor that supports parallel processing. Of course, this type of thing is something that you normally do on the GPU anyways. For instance, suppose we have a scene of 3D objects which are all composed of triangles. Now, let's say that we want to select an object or a triangle. All we have to do is transform the vertices to screen coordinates (done on the GPU anyways), pass in the mouse coordinates to the GPU, and launch 4 threads per triangle needing testing. The number of threads launched to test all the triangles seem like madness, but that is a small task for the GPU. It was made to launch thousands of concurrent threads. The following equation is for the area of a 2D triangle: [math]A=\frac{x_1(y_2-y_3)+x_2(y_3-y_1)+x_3(y_1-y_2)}{2}[/math] Launching 4 threads per triangle allows me to reduce the processing time to the amount of time it takes to calculate the area for one triangle + the cycles needed to add the three areas, and do one comparison. Because each thread calculates one of the areas needed for the test, we reduce the number of sequential operations to only having to do (3 multiplications, 3 subtractions, 2 additions, and 1 division) in parallel, then the final thread does 2 additions and 1 comparison. Factor in a few cycles for the GPU to spin up 4 threads. We can reduce this time even further by launching 3 threads for each thread we launched previously for 12 concurrent threads per triangle to test. From there, we only have to do (1 multiplication and 1 subtraction) for each of the area calculations in parallel. Then, the four threads that started the test performs 2 additions and 1 division to finish calculating the areas, but still do so in parallel. The last thread does 2 additions to add up the three areas, and then 1 comparison to complete the test. By using parallel processing, we can reduce the number of sequential operations to 1 multiplication, 1 subtraction, 4 additions, 1 division, and 1 comparison plus a few cycles to launch 12 threads. From what I can tell, your algorithm has to test the results of one set of calculations before moving on to the next. That makes it harder to parallelize your algorithm, where my technique has this advantage. Of course, it's not as fast as your algorithm if we had to sequentially process every calculation. Edited February 19, 2015 by Daedalus
Sensei Posted February 19, 2015 Author Posted February 19, 2015 (edited) Considering that you want optimized algorithms, I think I got you beat. I was not talking about optimizing by using multiple threads! This way there is completely no way to compare one algorithm with another. You're to some level right: in some special circumstances your algorithm thanks to parallel working could be faster. But: - triangles must be already in GPU memory. Uploading them takes a lot of time. You might end up uploading data for longer than algorithm is working. - GPU memory is several times smaller than main-board memory, thus quantity of triangles possible to process will be much smaller. How much memory do you have on gfx card? 2 GB? 4 GB? If user of application will have small amount of memory, whole algorithm might have no sense. 2 GB, with 3*4*3=36 bytes per triangle, is enough for holding just 59 millions triangles. Or 28 mln with double precision. - for 3d scenes, only few right one triangles should be examined, just those that are in kd-tree/octree leaf, not entire scene.. We can reduce this time even further by launching 3 threads for each thread we launched previously for 12 concurrent threads per triangle to test. And we can launch 1024 cores to process 256 triangles at once.. Or we can launch renderfarm with 200 computers, and have even faster result. *) But if somebody would do it this way, with this algorithm, it would be sign he has no idea about programming.. In typical case, quantity of triangles SHOULD be less than 10 in KD-Tree/Octree leaf.. And then just content of leaf is processed, instead of entire scene! Simple compare min-max bounding box with point will get rid of 99.999% of wrong triangles. And main routine will be working with just these which are very very close to triangle. *) Yes, I was writing software which was working on renderfarm. It did job in less than 1 hour on 70+ Xenon/Core i7 machines what my couple machines crunched whole week. But better optimized algorithm is a way to bypass need for adding more cores, more machines, and crunching by brute force way, in the first place. Edited February 19, 2015 by Sensei
Daedalus Posted February 19, 2015 Posted February 19, 2015 (edited) I was not talking about optimizing by using multiple threads! This way there is completely no way to compare one algorithm with another. You're to some level right: in some special circumstances your algorithm thanks to parallel working could be faster. But: - triangles must be already in GPU memory. Uploading them takes a lot of time. You might end up uploading data for longer than algorithm is working. - GPU memory is several times smaller than main-board memory, thus quantity of triangles possible to process will be much smaller. How much memory do you have on gfx card? 2 GB? 4 GB? If user of application will have small amount of memory, whole algorithm might have no sense. 2 GB, with 3*4*3=36 bytes per triangle, is enough for holding just 59 millions triangles. Or 28 mln with double precision. - for 3d scenes, only few right one triangles should be examined, just those that are in kd-tree/octree leaf, not entire scene.. And we can launch 1024 cores to process 256 triangles at once.. Or we can launch renderfarm with 200 computers, and have even faster result. *) But if somebody would do it this way, with this algorithm, it would be sign he has no idea about programming.. In typical case, quantity of triangles SHOULD be less than 10 in KD-Tree/Octree leaf.. And then just content of leaf is processed, instead of entire scene! Simple compare min-max bounding box with point will get rid of 99.999% of wrong triangles. And main routine will be working with just these which are very very close to triangle. *) Yes, I was writing software which was working on renderfarm. It did job in less than 1 hour on 70+ Xenon/Core i7 machines what my couple machines crunched whole week. But better optimized algorithm is a way to bypass need for adding more cores, more machines, and crunching by brute force way, in the first place. All good points. However, I wasn't talking about actual usage. Only theoretically as it applies to optimization through parallelization. For a complete system, as you have pointed out, there are several factors to consider besides just parallelizing the point in a triangle test. Edited February 19, 2015 by Daedalus
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now