Category Archives: Programming

N9 Scratchbox/Harmattan SDK Install

I'm experimenting with compiling programs for the N9.
Go to http://harmattan-dev.nokia.com and get their python script harmattan-sdk-setup.py. Make it executable:
chmod a+x harmattan-sdk-setup.py
Run the script as root:
sudo ./harmattan-sdk-setup.py
Press "0" for admininstall. This will start a lengthy install. On my machine it started with installing python-qt4. Then scratchbox (89Mb download for X86 and another 89Mb download for ARM). Then comes the SDK itself, one 825 Mb download for the X86-SDK and another 806 Mb download for the ARMEL-SDK.

Now I get:

:~/Desktop$ scratchbox
bash: /usr/bin/scratchbox: Permission denied

That's because the scratchbox install creates a new user group sbox, and adds your username to this group. We now need to log out of the machine and log in again so that this group is created and your user is added to the group. Now we can run scratchbox. Now, from another terminal on the host-machine, start Xephyr with this:

Xephyr :2 -host-cursor -screen 854x480 -dpi 96 -ac +extension Composite &

From the scratchbox terminal we can now start meego:

meego-sb-session start

This gives a little bit of error-messages, but nothing major I guess. And we have a working "phone":

If we start Xephyr in portrait mode instead with 480x854x16 it looks a bit better: (should also work by passing '-landscape' flag to meego-sb-session)

The screen where all the open applications are shown looks a bit strange:

This environment should allow coding and compiling in the correct environment on X86, and then cross-compiling on the ARMEL-target and packaging into debs for sending to the device. But on IRC I am told there's an alternative QtSDK toolchain that does the same thing - maybe in a simpler way. To be continued...

On the density of doubles and floats

With finite precision numbers, if you take a small number eps, there comes a point when x == x + eps. Here's some code that figures out when that happens:

double epsD(double x) {
    double r;
    r = 1000.0;
    while ( x < (x + r) )
        r = r / 2.0;
    return ( 2.0 * r );
}

With x around one the resolution limit is roughly at 1e-16 for doubles, and 1e-7 for floats. The top data points show how the relative accuracy stays constant, so at x=1e7 the resolution is only about eps=1 for floats.

Thinking about CAD/CAM related stuff, if you have a fancy top of the line CNC-machine with the longest axis x=1000mm and a theoretical accuracy of one micron, or 0.001mm, then we're at the point where we really should use doubles (which are slower, but how much slower?).

python source for plotting the figure: epsplot.py.tar

Qt and OpenGL test

It looks like there is no efficient way of updating (adding and removing) triangles in a polydata-surface in VTK, so for the cutting-simulation I am looking at other visualization options. Despite all the tutorials and documentation out there on the interwebs it always takes about two hours to get these "Hello World" examples running...

Download a zip-file with the source and cmake file: (This compiles and runs on Ubuntu 10.04LTS) qtopengl

Faster waterlines with OpenMP

This example has three times more fibers, and thus also CL-points, than the original one, but it still runs in a reasonable time of ~15s because (1) I hard-coded the matrix-determinant expressions everywhere instead of relying on a slow general purpose function and (2) the batch-processing of the fibers now uses OpenMP in order to put all those multi-cores to work.

(red=vertex contacts, green=facet contacts, blue=edge contacts)

My initial point-ordering scheme based on a complete breadth-first-search at each CL-point is a bit naive and slow (that's not included in the 15s time), so that still needs more work.

Radians vs. DiamondAngle

Over at Freesteel, Julian talks about using a "DiamondAngle" in the interval [0,4] for representing a plane angle instead of the usual radian in [0,2PI]. The argument is that using diangles there is an exact floating-point number for north/south/east/west, and that conversion to/from diangles is faster because it doesn't involve calling trigonometric functions.

I did a test with this, converting 10 million uniformly distributed radian angles to/from unit-vectors, and ditto for diangles. Converting diangles to unit-vectors is ca 25% faster than calling sin() and cos(), while the reverse transform is almost 3x faster than calling atan2. Runtimes in seconds below:

compare radians to diangles
cos/sin 1.13
atan2 0.78
dia2vec 0.83
vec2dia 0.27
Download code(diangle_test.cpp) and cmake-file(CMakeLists.txt).
I will use a diangle to specify a position on the ellipse, for the offset-ellipse solver, which is central to the edge-drop function of drop-cutter for toroidal tools.

Matrix determinant with Boost::uBLAS

Boost uBLAS provides BLAS functionality, but doesn't have a function for computing the determinant of a matrix. Googling for this turns up a few code snippets, but it's best to document this completely here now since I got it to work, and it will be useful for opencamlib sooner or later.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
#include <boost /numeric/ublas/matrix.hpp>
#include </boost><boost /numeric/ublas/io.hpp>
#include </boost><boost /numeric/ublas/lu.hpp>
 
namespace bnu = boost::numeric::ublas;
 
int determinant_sign(const bnu::permutation_matrix<std ::size_t>& pm)
{
    int pm_sign=1;
    std::size_t size = pm.size();
    for (std::size_t i = 0; i < size; ++i)
        if (i != pm(i))
            pm_sign *= -1.0; // swap_rows would swap a pair of rows here, so we change sign
    return pm_sign;
}
 
double determinant( bnu::matrix<double>& m ) {
    bnu::permutation_matrix</std><std ::size_t> pm(m.size1());
    double det = 1.0;
    if( bnu::lu_factorize(m,pm) ) {
        det = 0.0;
    } else {
        for(int i = 0; i < m.size1(); i++)
            det *= m(i,i); // multiply by elements on diagonal
        det = det * determinant_sign( pm );
    }
    return det;
}
 
int main () {
    bnu::matrix<double> m(3, 3);
    for (unsigned i = 0; i < m.size1() ; ++i) {
        for (unsigned j = 0; j < m.size2() ; ++j) {
            m (i, j) = 3 * i + sqrt(j+1); // fill matrix
            m(i,j) = m(i,j)*m(i,j);       // with some numbers
        }
    }
    std::cout << "before det() call m= " << m << std::endl;
    double det = determinant(m);
    std::cout << "after det() call  m= " << m << std::endl; // m has changed afted determinant() call!
    std::cout << "determinant=" << det << std::endl;
}

I'm trying the WP-syntax plugin here for the first time. The include statements are garbled, but otherwise it seems to work.

download source: utst1.cpp

This compiles on Lucid Lynx with the following CMakeLists.txt:

cmake_minimum_required(VERSION 2.6)
PROJECT(utst1)
find_package( Boost )
if(Boost_FOUND)
    include_directories(${Boost_INCLUDE_DIRS})
endif()
ADD_EXECUTABLE(utst1 utst1.cpp)
target_link_libraries(utst1 ${Boost_LIBRARIES})

OpenMP test on i7

Here's a simple piece of c-code (try zipped version) for testing how to parallelize code with OpenMP. It compiles with
gcc -fopenmp -lm otest.c

The CPU-load while running looks like this:

cpuload

Looks like two logical CPUs never get used (two low lines beyond "5" in the chart). It outputs some timing information:

running with 1 threads: runtime = 17.236827 s clock=17.230000
running with 2 threads: runtime = 8.624231 s clock=17.260000
running with 3 threads: runtime = 5.791805 s clock=17.090000
running with 4 threads: runtime = 5.241023 s clock=20.820000
running with 5 threads: runtime = 4.107738 s clock=20.139999
running with 6 threads: runtime = 4.045839 s clock=20.240000
running with 7 threads: runtime = 4.056122 s clock=20.280001
running with 8 threads: runtime = 4.062750 s clock=20.299999

which can be plotted like this:
chart
I'm measuring the clock-cycles spent by the program using clock(), which I hope is some kind of measure of how much work is performed. Note how the amount of work increases due to overheads related to creating threads and communication between them. Another plot shows the speedup:
speedup

The i7 uses Hyper Threading to present 8 logical CPUs to the system with only 4 physical cores. Anyone care to run this on a real 8-core machine ? :)

Next stop is getting this to work from a Boost Python extension.

More pystones with shedskin

As I'm very much an amateur programmer with not too much time to learn new stuff I've decided my CAM-algorithms are going to be written in Python (don't hold your breath, they'll be online when they'll be online...). The benefits of rapid development will more than outweigh the performance issues of Python at this stage.

But then I found Mark Dufour's project shedskin (see also blog here and Mark's MSc thesis here), a Python to C++ compiler! Can you have the best of both worlds? Develop and debug your code interactively with Python and then, when you're happy with it, translate it automagically over to C++ and have it run as fast as native code?

Well, shedskin doesn't support any and all python constructs (yet?), and only a limited number of modules from the standard library are supported. But still I think it's a pretty cool tool. For someone who doesn't look forward to learning C++ from the ground up typing 'shedskin -e myprog.py' followed by 'make' is just a very simple way to create fast python extensions! As a test, I ran shedskin on the pystone benchmark and called both the python and c++ version from my multiprocessing test-code:

Python version

Processes	Pystones	Wall time	pystones/s	Speedup
1		50000		0.7		76171		1.0X
2		100000		0.7		143808		1.9X
3		150000		0.7		208695		2.7X
4		200000		0.8		264410		3.5X
5		250000		1.0		244635		3.2X
6		300000		1.2		259643		3.4X

'shedskinned' C++ version

Processes	Pystones	Wall time		pystones/s	Speedup
1		5000000			2.9		1696625		1.0X
2		10000000		3.1		3234625		1.9X
3		15000000		3.1		4901829		2.9X
4		20000000		3.4		5968676		3.5X
5		25000000		4.4		5714151		3.4X
6		30000000		5.1		5890737		3.5X

A speedup of around 20x.

multiprocessing pystone benchmark

A simple pystone benchmark using the python multiprocessing package. Seems to scale quite well - guess how many cores my machine has! :)


" Simple multiprocessing test.pystones benchmark "
" Anders Wallin 2008Jun15 anders.e.e.wallin (at) gmail.com "
from test import pystone
import processing
import time

STONES_PER_PROCESS= 10*pystone.LOOPS

def f(q):
    t=pystone.pystones(STONES_PER_PROCESS)
    q.put(t,block=True)

if __name__ == '__main__':
    print 'multiprocessing test.pystones() benchmark'
    print 'You have '+str(processing.cpuCount()) + ' CPU(s)'
    print 'Processes\tPystones\tWall time\tpystones/s'

    results = processing.Queue()
    for N in range(1,processing.cpuCount()+3):
        p=[]
        q=processing.Queue()
        results=[]

        for m in range(1,N+1):
            p.append( processing.Process(target=f,args=(q,)) )

        start=time.time()
        for pr in p:
            pr.start()
        for r in p:
            results.append( q.get() )
        stop=time.time()

        cputime = stop-start

        print str(N)+'\t\t'+str(N*STONES_PER_PROCESS) \
              +'\t\t'+ str(cputime)+'\t'+str( N*STONES_PER_PROCESS / cputime )