Speed comparison

Discussion in 'General Discussion' started by ndzinn, Nov 30, 2013.

  1. ndzinn

    ndzinn New Member

    Joined:
    Nov 30, 2013
    Messages:
    31
    Likes Received:
    0
    Just thought some might be interested in processing speed comparisons among Udoo quad running Ubuntu 12.04, Raspberry Pi running Raspbian and Beaglebone Black running Debian wheezy. I've written a Kalman filter in Matlab. It integrates GPS and an IMU (inertial measurement unit). Heavy math computation. In Matlab on Windows 7 with an Intel i7 it takes 51 seconds to process 33 minutes of data. Translated into C and compiled in Visual Studio, it takes 0.8 seconds (W7/i7). Using the GCC compiler on the three single board computers and using the -O3 optimization switch, it takes about 18 seconds on the RPi and the BBB (remarkably, the RPi is a bit faster than the BBB for heavy math computations) and 4.5 seconds on the Udoo. So, for what's important to me (your mileage may vary), the Udoo is 4 times faster than the competition!
     
  2. Lifeboat_Jim

    Lifeboat_Jim New Member

    Joined:
    Sep 16, 2013
    Messages:
    399
    Likes Received:
    1
    That's good to hear. Many thanks for doing the exercise and sharing the results here :)

    Team UDOO (in their marketing/promotional collateral) always said 4x the power of a rPi but it's good to have empirical evidence that is true.

    Perhaps you can post your code so people can run their own benchmarks?
     
  3. ndzinn

    ndzinn New Member

    Joined:
    Nov 30, 2013
    Messages:
    31
    Likes Received:
    0
    Well, Jim, the Kalman filter is much too long and complicated to post, but a substitute, FPU-intensive test program is given below. It compiles on Udoo, RPi and BBB with the command:

    gcc fpuTest.c -lm -O3 -o test

    To run "test", type "./test" (no quotes)

    The syntax is not the greatest, but it's written to be compatible with Matlab and Python for cross-language comparisons. And the syntax is not optimized, either, but that's not the point. The point is to exercise the FPU.

    The variable "product" needs to be close to 1 (unity), i.e. at least 10 successive 9s for 10000 iterations. The execution time is also given.

    Interestingly, the Arduino Due does OK with this test, slow but accurate. The Arduino Uno fails miserably in accuracy and it's unbearably slow.

    I'm getting 1.75 seconds with my Udoo quad.

    Good luck,
    Noel

    #include <math.h>
    #include <time.h>
    #include <stdio.h>

    int main(void)
    {
        double pii, d2r, product, angle;
        clock_t start_t, end_t;
        double diff_t;
        double counter, dex;   /* doubles, to mirror the Matlab and Python versions */

        pii = 4.0*atan(1.0);   /* pi */
        d2r = pii/180.0;       /* degrees to radians */
        product = 1.0;
        start_t = clock();     /* processor time used so far */
        for(counter=1.0; counter<=10000.0; counter++)
        {
            for(dex=1.0; dex<=360.0; dex++)
            {
                angle = dex*d2r;
                /* sin^2 + cos^2 is 1 mathematically; round off accumulates in product */
                product = product*(pow(sin(angle), 2.0) + pow(cos(angle),2.0));
            }
        }
        end_t = clock();
        diff_t = ((double) (end_t - start_t)) / CLOCKS_PER_SEC;
        printf("product = %18.15f\n", product);
        printf("elapsed = %18.15f\n", diff_t);
        return 0;
    }
     
  4. DracoLlasa

    DracoLlasa UDOOer

    Joined:
    Oct 15, 2013
    Messages:
    419
    Likes Received:
    3
    i have been doing a lot of testing work on the UDOO (as you can see in my signature) and i would like to try and reproduce your test and the results as i own all of the noted boards.

    So to make sure i understand you right you're saying on each board i will just run
    gcc fpuTest.c -lm -O3 -o test

    Then to run the test i will type "./test" (no quotes)

    Are there any pre-reqs i need to install?
     
  5. ndzinn

    ndzinn New Member

    Joined:
    Nov 30, 2013
    Messages:
    31
    Likes Received:
    0
    Terrific. Please share your results.

    You need to save everything between and including #include <math.h> (first line of the program) and the last closing brace (}) as a file. I called it <fpuTest.c>, but you can name it anything as long as you make the change in the compilation statement,

    gcc <name you choose> -lm -O3 -o test

    Of course, you need to have GCC. Udoo's Linaro has it in the image. I remember that Raspbian has it. I may have had to download it for Debian wheezy on the BBB with apt-get in the usual way.

    Then you need to navigate to the directory in which the file is located and execute the gcc command.

    -lm means to link the math library. gcc does not link it by default when compiling C, whereas g++ (the C++ front end of GCC) links it automatically.

    -O3 is an optimization switch that really speeds things up.

    -o is the switch for the name of the executable file. In this case it's "test".

    Then you execute test with the command ./test

    The program will run and report the variable product (a measure of accuracy) and the time of execution.

    I haven't tried this on Debian armhf on the Udoo yet because the image is just too minimal for my taste.

    Good luck,
    Noel
     
  6. ndzinn

    ndzinn New Member

    Joined:
    Nov 30, 2013
    Messages:
    31
    Likes Received:
    0
    Maybe an explanation of this little program would be helpful.

    There are two nested loops. The outer loop runs the inner loop 10000 times. That seems good to get times of a couple seconds with the boards we're interested in. The inner loop runs through the integer angles of a circle, 1 through 360 degrees. It adds the square of the sine to the square of the cosine for each of the 360 angles. You'll remember from your basic trigonometry that this sum is exactly 1 (unity). It then multiplies this number (unity) times a variable called product that starts out in the program defined as 1 exactly. Now, 1 times 1 should be 1, right? Well, yes mathematically, but not exactly if executed numerically on a computer. So, it does this 360 times in the inner loop and 10000 times in the outer loop. Lots of trig and lots of multiplication. The final value of the product variable represents the numerical round off in the computer. It should be close to 1 (unity). Then there's the time it takes to execute all this, quick if you're using a hard float FPU, slow if you're using software (soft float).

    My interest in this is embedded mathematical computing for navigation (e.g. GPS, inertial systems, acoustics subsea). Udoo is the fastest of these boards I've used. This little program doesn't do anything useful but exercise the board. It's a stand-in for the useful stuff!
     
  7. DracoLlasa

    DracoLlasa UDOOer

    Joined:
    Oct 15, 2013
    Messages:
    419
    Likes Received:
    3
    great! thanks for the feedback and the details.
    I will see if i can get this test run done on all my boards and document the results.
    I will try and post here when i do eventually get it completed. I want to make sure you get a chance to see the results and make sure i didn't muck anything up :)

    There are a few other tests i found previously when looking to do benchmarking, found here:
    http://elinux.org/RPi_Performance

    This is what i was originally going to work through, some of it looked similar in result to what you have..
    Anyways, it may all be valuable, in the end, multiple benchmarks methods are usually needed to get a well rounded view
     
  8. ndzinn

    ndzinn New Member

    Joined:
    Nov 30, 2013
    Messages:
    31
    Likes Received:
    0
    I agree with your statement that "multiple benchmarks methods are usually needed to get a well rounded view".

    My little program is intended specifically to exercise the FPU (floating point unit). In the ARM Cortex-A9 (Udoo) and A8 (BBB) there are two: NEON, which is vectorized (SIMD) and aimed at media and graphics work, and VFP (Vector Floating Point) version 3 (I'm writing this from memory), which is not vectorized (despite the name). Cortex-A9 and A8 are version 7 of the ARM architecture. The ARM11 in the RPi is version 6 of the architecture, and it has an earlier version of VFP, v2. Before that, most ARM chips had no hardware floating point. It was either slow soft floats or fixed-point math, which is not a skill I intend to learn. So we're now on the cusp of a revolution with FPUs in these efficient ARM processors. But even if it's in the hardware, the OS needs to support it. Debian v6 (squeeze) does not; Debian v7 (wheezy) does. Raspbian on the RPi is a version of wheezy and it does. Given the fast results I get with Linaro, I suppose it does, too.

    So, this little test program is very specific. Your link points to broader tests to be run.
     
  9. andcmp

    andcmp New Member

    Joined:
    May 8, 2013
    Messages:
    161
    Likes Received:
    0
    Many thanks and good work ndzinn! We always claimed that UDOO is 4 times faster than RPi and you have empirically proven that.
    I will follow the discussion on this thread, it could be a great featured story on our blog.
     
  10. DracoLlasa

    DracoLlasa UDOOer

    Joined:
    Oct 15, 2013
    Messages:
    419
    Likes Received:
    3
    ndzinn
    So i reproduced your test and here were my results, they seem to be the same as yours..
    assuming you agree, i will now start the remaining testing on the other boards along with the other tests
    Code:
    ubuntu@udoo:~$ ./test
    product =  0.999999999954481
    elapsed =  1.740000000000000
     
  11. ndzinn

    ndzinn New Member

    Joined:
    Nov 30, 2013
    Messages:
    31
    Likes Received:
    0
    DracoLlasa, you nailed it! -Noel
     
  12. DracoLlasa

    DracoLlasa UDOOer

    Joined:
    Oct 15, 2013
    Messages:
    419
    Likes Received:
    3
    cool i started working on the linpack benchmarks which seem to do a lot of what you covered or at least have the options to... but the output is much more detailed/complicated. i like having a single number or 2 that i can use for comparison so i will have to see how i can consolidate the feedback from linpack and your scripts along with the others into a single output comparison document.. but that is what we have weekends for now isn't it :)
     
  13. Lifeboat_Jim

    Lifeboat_Jim New Member

    Joined:
    Sep 16, 2013
    Messages:
    399
    Likes Received:
    1
    but does it play Crysis?
     
  14. DracoLlasa

    DracoLlasa UDOOer

    Joined:
    Oct 15, 2013
    Messages:
    419
    Likes Received:
    3
    So how do we run this on the SAM3X8 (Due) side, is there a sketch you can share that i can use? I'm working on benchmarking all of my boards with all of these tests
     
  15. ndzinn

    ndzinn New Member

    Joined:
    Nov 30, 2013
    Messages:
    31
    Likes Received:
    0
    Here's the sketch I used for Uno and Due:

    #include <Time.h>

    void setup(){
        Serial.begin(9600);
    }

    // loop() repeats, so the test reruns and reprints until the board is reset
    void loop(){
        double pii, d2r, product, angle;
        double start_t, end_t;
        double diff_t;
        double counter, dex;

        pii = 4.0*atan(1.0);
        d2r = pii/180.0;
        product = 1.0;
        // Time Library clock, converted to seconds (whole seconds only)
        start_t = 3600.0*hour() + 60.0*minute() + second();
        for(counter=1.0; counter<=10000.0; counter++)
        {
            for(dex=1.0; dex<=360.0; dex++)
            {
                angle = dex*d2r;
                product = product*(pow(sin(angle), 2.0) + pow(cos(angle),2.0));
            }
        }
        end_t = 3600.0*hour() + 60.0*minute() + second();
        diff_t = (end_t - start_t);
        Serial.print("product is ");
        Serial.println(product, 15);
        Serial.print("elapsed is ");
        Serial.println(diff_t, 5);
    }

    You see that you need the Arduino Time Library <Time.h>, which is different from the C <time.h> library (similar function, different syntax).

    You can get Time.h here:

    http://playground.arduino.cc/Code/time
     
  16. DracoLlasa

    DracoLlasa UDOOer

    Joined:
    Oct 15, 2013
    Messages:
    419
    Likes Received:
    3
    ok so i have spent a lot of time tonight digging around trying to find some good CPU performance tests to use and i have found a flaw in that pretty much everything i can run to establish a performance benchmark is single threaded. Now that is fine for the other boards i want to test, but for the UDOO it prevents me from getting a real measurement of the UDOO in comparison.

    The script you started this post on is single threaded, the linpack tests from the RPi page are all single threaded, I have a python script that works with prime numbers but still only burns on one CPU.

    are you aware of a way to convert the linpack or any other of the tools out there to actually take all of the cores into account?

    i ran the linpack test twice at the same time, ensuring it was actually using 2 cores, the numbers for each process were the exact same as running it once, so in theory i could just take the total output (KFLOPS) and multiply it by 4.

    Anyways, interested in some feedback or tips on a linux/ARM/multi-core benchmark tool
     
  17. DracoLlasa

    DracoLlasa UDOOer

    Joined:
    Oct 15, 2013
    Messages:
    419
    Likes Received:
    3
    And thanks for this i will check it out.. i need something to make the Due work for thermal testing along with benchmarking
     
  18. ndzinn

    ndzinn New Member

    Joined:
    Nov 30, 2013
    Messages:
    31
    Likes Received:
    0
    Be prepared for a wait. Here are my results:

    8-bit AVR Arduino Uno, product = 0.799242067337036, time = 968 seconds
    Cortex-M3 Arduino Due, product = 0.999999999957812, time = 328 seconds

    Granularity of time is just integer seconds, but it doesn't matter.

    Note that the Due is slow but accurate. The Uno is slower but very inaccurate.
     
  19. DracoLlasa

    DracoLlasa UDOOer

    Joined:
    Oct 15, 2013
    Messages:
    419
    Likes Received:
    3
    we both posted at about the same time so check also to see my other post about the multi-core issue, would really like your feedback on that topic.

    Note that i am not skilled with programming at all, im just trying to do my best with what i can find to produce reliable test results for the community... for this and other topics

    also, regarding your script
    you have
    product = value
    time = value

    Time is pretty obvious, but is 'product' like how accurate it was in doing the computations?
     
  20. ndzinn

    ndzinn New Member

    Joined:
    Nov 30, 2013
    Messages:
    31
    Likes Received:
    0
    Pity. Lost my multithreading post in the over-posting. This forum is not up to heavy traffic yet. Here's a reconstruction.

    I'm sure there is multithreading help for you out there, but not being much of a C programmer, I can't point you to it. This little test program does lend itself to parallel programming because it's all for loops. For example, in Matlab the parfor loop replaces the for loop to process in parallel. It's that simple (if you own the toolbox).

    It's interesting to observe the loads on the Udoo cores using the system monitor in Gnome. Linaro does a good job distributing the load, though on my Udoo quad cores 2 and 3 seem busier than cores 1 and 4. My point is that Linaro will certainly assure that the test program will run on a dedicated core. Perhaps you can observe this. That's got to be faster than on a single core CPU due to the overhead of the OS. In fact, Linaro may be doing more than that. Have you seen the lineup of companies supporting Linaro? I think these companies are real serious about moving ARM architecture forward. And we benefit.

    BTW, I have a Python version of the test program. Disappointing on the RPi. Haven't tried it on the Udoo yet.

    Noel
     
