Linux Performance Improvements

This page documents progress on a project to improve Asio performance on Linux 2.6.x.

Goal

The goal of this project is to improve the performance of Asio on Linux, in both single-threaded and multi-threaded uses.

How to participate

If you care about Asio's performance on Linux, then please consider getting involved in this project. There are two ways you can help:

  • Supply performance numbers. For the work to be effective, benchmarks are needed from real applications. (No source code required - just the numbers please!)
  • Test the changes to ensure they don't introduce bugs.

To start, you need to:

  1. Check out the baseline version of Asio.
  2. Benchmark your application.
  3. Publish your numbers (with a short description) below under Baseline.
  4. Consider adding a description of your test setup at the bottom of the page.

And then, as new versions are made available:

  1. Check out the new tag.
  2. Benchmark again.
  3. Publish numbers under the tag heading below.
  4. Report any bugs.

How to get a version of Asio for testing

Work will be done on a branch called linux-perf-branch in Asio's CVS repository. To get a particular version, check out (or update to) the tag as specified. If you want to use Boost.Asio rather than Asio, you will need to run the boostify.pl script in Asio's root directory and copy the content of the boostified directory to your Boost distribution.
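
Moving an existing working copy to a given tag might look like the sketch below. It only prints the commands rather than running them. The helper name tag_update_cmds and the ASIO_DIR variable are hypothetical, and the working copy location is an assumption; the cvs update options used are -r (select a tag), -d (create new directories) and -P (prune empty ones).

```shell
# Hypothetical helper: print the commands that move an existing
# working copy to a CVS tag. ASIO_DIR (default: asio) is an assumed
# path to that working copy, not something this page specifies.
tag_update_cmds() {
  echo "cd ${ASIO_DIR:-asio}"
  echo "cvs update -d -P -r $1"
}

# Show the commands for the first tag on the branch.
tag_update_cmds linux-perf-1
```

Review the printed commands before running them by hand.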

Progress

Baseline

CVS Tag: linux-perf-branch-start

CK Echo Test 1: 165 MB/s (higher is better)

CK Echo Test 2: 330 MB/s

CK Echo Test 3: 263 MB/s

CK Echo Test 4: 242 MB/s

CK Echo Test 5: 222 MB/s

CK HTTP Test 1: 9460 req/s (higher is better)

Linux-Perf-1

Posts completion handlers generated by the reactor task in a batch, to reduce locking overhead.

CVS Tag: linux-perf-1

CK Echo Test 1: 172 MB/s

CK Echo Test 2: 335 MB/s

CK Echo Test 3: 265 MB/s

CK Echo Test 4: 247 MB/s

CK Echo Test 5: 225 MB/s

CK HTTP Test 1: 9340 req/s
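
When comparing a tag against the baseline, it can help to express each result as a percentage change. The following is a minimal sketch, assuming awk is installed; the function name pct_change is made up for illustration.

```shell
# pct_change BASE CUR: print the signed percentage change from BASE to CUR,
# rounded to one decimal place.
pct_change() {
  awk -v base="$1" -v cur="$2" 'BEGIN { printf "%+.1f\n", (cur - base) / base * 100 }'
}

pct_change 165 172     # CK Echo Test 1, baseline vs linux-perf-1: prints +4.2
pct_change 9460 9340   # CK HTTP Test 1, baseline vs linux-perf-1: prints -1.3
```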

Linux-Perf-2

Uses an edge-triggered (rather than level-triggered) epoll reactor. However, the reactor no longer performs speculative reads and writes outside the reactor mutex.

CVS Tag: linux-perf-2

CK Echo Test 1: 174 MB/s

CK Echo Test 2: 349 MB/s

CK Echo Test 3: 292 MB/s

CK Echo Test 4: 257 MB/s

CK Echo Test 5: 232 MB/s

CK HTTP Test 1: 9100 req/s

Linux-Perf-3

Eliminates signal blocking while reactor operations are performed.

CVS Tag: linux-perf-3

CK Echo Test 1: 177 MB/s

CK Echo Test 2: 352 MB/s

CK Echo Test 3: 293 MB/s

CK Echo Test 4: 256 MB/s

CK Echo Test 5: 232 MB/s

CK HTTP Test 1: 9040 req/s

Linux-Perf-4

Use per-descriptor operation queues. Use an edge-triggered strategy for interrupting the reactor. Don't explicitly delete descriptors from epoll. Re-enable null_buffers support.

CVS Tag: linux-perf-4

CK Echo Test 1: 177 MB/s

CK Echo Test 2: 343 MB/s

CK Echo Test 3: 291 MB/s

CK Echo Test 4: 259 MB/s

CK Echo Test 5: 233 MB/s

CK HTTP Test 1: 10040 req/s

Linux-Perf-5

Use per-descriptor mutexes.

CVS Tag: linux-perf-5

CK Echo Test 1: 177 MB/s

CK Echo Test 2: 352 MB/s

CK Echo Test 3: 282 MB/s

CK Echo Test 4: 259 MB/s

CK Echo Test 5: 233 MB/s

CK HTTP Test 1: 10006 req/s

Linux-Perf-6

Re-run the reactor immediately if there may be more events available (i.e. the events array was full) and there are other threads processing handlers.

CVS Tag: linux-perf-6

CK Echo Test 1: 177 MB/s

CK Echo Test 2: 354 MB/s

CK Echo Test 3: 290 MB/s

CK Echo Test 4: 257 MB/s

CK Echo Test 5: 233 MB/s

CK HTTP Test 1: 9920 req/s

Linux-Perf-7

Fix a bug preventing null_buffers() support from working. Fix a problem where a task_io_service member variable was being incorrectly accessed outside the lock.

CVS Tag: linux-perf-7

Linux-Perf-8

Change strands so that they share a pool of implementations, to make copying and destruction of strand objects cheaper.

CVS Tag: linux-perf-8

Linux-Perf-9

Use a thread-private handler queue inside run() for running a small number of additional handlers outside of the common handler queue.

CVS Tag: linux-perf-9

Linux-Perf-10

Add support for using timerfd to manage timeouts for timer operations. Ensure items in the thread-private handler queue are moved to the common queue when an exception is thrown.

CVS Tag: linux-perf-10

Linux-Perf-11

Where possible, run the epoll reactor from multiple threads.

CVS Tag: linux-perf-11

CK Echo CPU Scalability:

(Graphs: linux-perf-11-100conn-16KB.png, linux-perf-11-1000conn-16KB.png)

Test setups

CK Echo Test 1

This test uses the src/tests/performance programs included with non-Boost Asio. It measures total throughput across all sockets in MB/s. The hardware has two Intel Xeon E5310 quad core processors (1.6GHz) and 6GB RAM, and runs 64-bit Debian Linux.

Asio is configured using: CXXFLAGS="-O2 -finline-limit=1000"

Server program is run using: taskset -c 0 ./server 0.0.0.0 55555 1 16384

Client program is run using: taskset -c 1 ./client localhost 55555 1 16384 1 100

CK Echo Test 2

Configuration as for CK Echo Test 1.

Server program is run using: taskset -c 0 ./server 0.0.0.0 55555 1 16384

Client program is run using: taskset -c 1 ./client localhost 55555 1 16384 10 100

CK Echo Test 3

Configuration as for CK Echo Test 1.

Server program is run using: taskset -c 0 ./server 0.0.0.0 55555 1 16384

Client program is run using: taskset -c 1 ./client localhost 55555 1 16384 100 100

CK Echo Test 4

Configuration as for CK Echo Test 1.

Server program is run using: taskset -c 0 ./server 0.0.0.0 55555 1 16384

Client program is run using: taskset -c 1 ./client localhost 55555 1 16384 1000 100

CK Echo Test 5

Configuration as for CK Echo Test 1.

Server program is run using: taskset -c 0 ./server 0.0.0.0 55555 1 16384

Client program is run using: taskset -c 1 ./client localhost 55555 1 16384 10000 100
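
CK Echo Tests 1 to 5 differ only in the fifth client argument (1, 10, 100, 1000 and 10000). The sketch below prints the command lines for each test so they can be checked or pasted into a terminal. The function name ck_echo_client_cmd is hypothetical, and nothing is executed.

```shell
# ck_echo_client_cmd N: print the client command line for CK Echo Test N (1-5).
# The fifth client argument is 10^(N-1): 1, 10, 100, 1000, 10000.
ck_echo_client_cmd() {
  sessions=1
  i=1
  while [ "$i" -lt "$1" ]; do
    sessions=$((sessions * 10))
    i=$((i + 1))
  done
  echo "taskset -c 1 ./client localhost 55555 1 16384 $sessions 100"
}

# The server command line is the same for all five tests.
for n in 1 2 3 4 5; do
  echo "CK Echo Test $n:"
  echo "  server: taskset -c 0 ./server 0.0.0.0 55555 1 16384"
  echo "  client: $(ck_echo_client_cmd "$n")"
done
```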

CK HTTP Test 1

This test uses HTTP Server Example 1 from Asio with ab to measure requests per second. Hardware configuration as for CK Echo Test 1.

Server program is run using: taskset -c 0 ./http_server 0.0.0.0 8090 ../doc_root

Client program is run using: taskset -c 1 ab -c 100 -n 1000000 'http://127.0.0.1:8090/data_4K.html'
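
The requests-per-second figure can be pulled out of ab's report with a short filter. This is a sketch that assumes the report contains a line of the form "Requests per second: <number> [#/sec] (mean)"; the function name ab_req_per_sec is hypothetical and the sample value is purely illustrative, not a measurement.

```shell
# ab_req_per_sec: read ab output on stdin and print the mean
# requests-per-second figure.
ab_req_per_sec() {
  awk -F: '/^Requests per second:/ { split($2, f, " "); print f[1] }'
}

# Illustrative input only; in practice pipe the real ab output in:
#   taskset -c 1 ab -c 100 -n 1000000 'http://127.0.0.1:8090/data_4K.html' | ab_req_per_sec
printf 'Requests per second:    1234.56 [#/sec] (mean)\n' | ab_req_per_sec
```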

CK Echo CPU Scalability

This test compares the throughput of a single io_service running on N CPUs (one thread per CPU) against N io_services each running on one CPU. It uses the src/tests/performance programs included with non-Boost Asio. Hardware configuration as for CK Echo Test 1.

RtB Echo Tests

These tests are like the CK Echo tests, except that they are run on two Intel Xeon E5420 quad core processors @ 2.5 GHz with 16 GB RAM, Debian GNU/Linux and GCC 4.3.2. The programs are run using the bash script shown below.

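#!/bin/bash
# Stop any server left over from a previous run.
killall server
timeout=100
for bufsize in 16384 32768 65536; do
  for nothreads in 1 2 4; do
    for nosessions in 1 10 100; do
      echo "Bufsize: $bufsize Threads: $nothreads Sessions: $nosessions"
      # Start the server in the background and remember its PID.
      ./server 0.0.0.0 55555 $nothreads $bufsize & srvpid=$!
      ./client localhost 55555 $nothreads $bufsize $nosessions $timeout
      # The server does not exit on its own, so kill it after each run.
      kill -9 $srvpid
    done
  done
done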

Buffer  Threads  Sessions  Baseline  perf-1  perf-2  perf-3  perf-4  perf-5  perf-6   perf-7  perf-8  perf-10
16384   1        1         344       348     272     364     280     385     370      299     485     468
16384   1        10        686       549     568     734     581     740     745      742     792     817
16384   1        100       545       495     526     573     568     577     568      625     611     617
16384   2        1         251       231     248     266     256     254     266      253     275     276
16384   2        10        671       615     439     423     450     675     662      687     808     677
16384   2        100       578       577     400     437     411     596     608      620     647     656
16384   4        1         229       248     239     242     242     232     233      233     244     271
16384   4        10        444       515     369     369     370     567     594      594     580     627
16384   4        100       543       589     382     389     421     656     656      655     659     681
32768   1        1         366       450     375     378     678     676     n/a (?)  666     387     388
32768   1        10        701       919     940     947     953     950     719      951     754     745
32768   1        100       590       542     553     555     558     556     614      556     640     641
32768   2        1         373       363     375     380     368     379     355      382     396     395
32768   2        10        638       722     639     590     611     752     803      864     734     784
32768   2        100       600       610     518     478     510     668     629      645     647     700
32768   4        1         326       347     316     350     340     318     322      331     337     417
32768   4        10        588       658     502     488     502     677     686      685     678     703
32768   4        100       601       618     491     490     507     664     663      659     664     660
65536   1        1         1091      1108    1123    539     733     538     1172     536     553     560
65536   1        10        1062      1056    1082    834     1089    1092    1078     1097    810     833
65536   1        100       484       480     475     552     476     478     480      479     571     508
65536   2        1         417       415     422     484     503     511     704      428     486     716
65536   2        10        839       863     678     674     676     843     799      856     820     844
65536   2        100       615       689     516     517     558     657     594      622     618     640
65536   4        1         422       424     402     408     406     403     402      399     430     472
65536   4        10        763       755     609     612     620     747     755      747     747     775
65536   4        100       663       683     534     515     523     674     670      671     662     668

Another test is run on an Intel Core2 Duo T9400 @ 2.53 GHz. A dash marks a configuration with no recorded result.

Buffer  Threads  Sessions  Baseline  perf-1  perf-2  perf-3  perf-4  perf-5  perf-6  perf-7  perf-8
16384   1        1         330       343     345     348     353     -       355     354     367
16384   1        10        870       895     938     930     953     956     966     963     1110
32768   1        1         507       517     520     522     -       527     534     -       -
32768   1        10        1257      1274    1326    1322    1321    1347    1359    1355    1476
65536   1        1         -         984     983     898     908     918     918     918     -
65536   1        10        1652      -       1704    1703    1691    1728    1710    1728    1838
131072  1        1         -         -       -       1176    1171    1172    -       -       1283
131072  1        10        1200      1343    -       -       -       1326    1366    1370    1356

Topic revision: r21 - 11 Sep 2009 - 08:22:53 - RutgerTerBorg