Linux Performance Improvements

This page documents progress on a project to improve Asio performance on Linux 2.6.x.

Goal

The goal of this project is to improve the performance of Asio on Linux, in both single-threaded and multi-threaded uses.

How to participate

If you care about Asio's performance on Linux, then please consider getting involved in this project. There are two ways you can help:

  • Supply performance numbers. For the work to be effective, benchmarks are needed from real applications. (No source code required - just the numbers please!)
  • Test the changes to ensure they don't introduce bugs.

To start, you need to:

  1. Check out the baseline version of Asio.
  2. Benchmark your application.
  3. Publish your numbers (with a short description) below under Baseline.
  4. Consider adding a description of your test setup at the bottom of the page.

And then, as new versions are made available:

  1. Check out the new tag.
  2. Benchmark again.
  3. Publish numbers under the tag heading below.
  4. Report any bugs.

How to get a version of Asio for testing

Work will be done on a branch called linux-perf-branch in Asio's CVS repository. To get a particular version, check out (or update to) the tag as specified. If you want to use Boost.Asio rather than Asio, you will need to run the boostify.pl script in Asio's root directory and copy the content of the boostified directory to your Boost distribution.

Progress

Baseline

CVS Tag: linux-perf-branch-start

CK Echo Test 1: 165 MB/s (higher is better)

CK Echo Test 2: 330 MB/s

CK Echo Test 3: 263 MB/s

CK Echo Test 4: 242 MB/s

CK Echo Test 5: 222 MB/s

CK HTTP Test 1: 9460 req/s (higher is better)

Linux-Perf-1

Posts completion handlers generated by the reactor task in a batch, to reduce locking overhead.
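The idea of batching can be illustrated with a small standalone sketch. This is not Asio's implementation: handler_queue, post_batch and run_all are hypothetical names, and the lock counter exists only to make the saving visible. Posting N handlers one at a time would take the lock N times, whereas a batch takes it once.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a handler queue shared by all threads.
class handler_queue {
public:
  using handler = std::function<void()>;

  // Posts a whole batch while taking the lock only once.
  void post_batch(std::vector<handler>& batch) {
    std::lock_guard<std::mutex> lock(mutex_);
    ++lock_acquisitions_;
    for (auto& h : batch) {
      assert(h && "empty handler");
      queue_.push_back(std::move(h));
    }
    batch.clear();
  }

  // Runs every queued handler outside the lock; returns how many ran.
  std::size_t run_all() {
    std::deque<handler> ready;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      ++lock_acquisitions_;
      ready.swap(queue_);
    }
    for (auto& h : ready) h();
    return ready.size();
  }

  // Illustration only: read without the lock, so call it from one thread.
  std::size_t lock_acquisitions() const { return lock_acquisitions_; }

private:
  std::mutex mutex_;
  std::deque<handler> queue_;
  std::size_t lock_acquisitions_ = 0;
};
```

In this sketch, a reactor pass that produces three completion handlers would collect them in a local vector and call post_batch once, rather than locking the queue three times.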

CVS Tag: linux-perf-1

CK Echo Test 1: 172 MB/s

CK Echo Test 2: 335 MB/s

CK Echo Test 3: 265 MB/s

CK Echo Test 4: 247 MB/s

CK Echo Test 5: 225 MB/s

CK HTTP Test 1: 9340 req/s
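When comparing a tag against the baseline, it can help to express each result as a relative change. The helper below is a minimal sketch for that arithmetic; percent_change is a hypothetical name and not part of Asio or its test programs.

```cpp
#include <cassert>

// Relative change of `current` against `baseline`, in percent.
// Positive means the tagged version is faster than the baseline.
double percent_change(double baseline, double current) {
  assert(baseline > 0.0);
  return (current - baseline) / baseline * 100.0;
}
```

For example, percent_change(165, 172) for CK Echo Test 1 is roughly +4.2, while percent_change(9460, 9340) for CK HTTP Test 1 is roughly -1.3.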

Linux-Perf-2

Uses an edge-triggered (rather than level-triggered) epoll reactor. However, the reactor no longer performs speculative reads and writes outside the reactor mutex.
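For readers unfamiliar with the distinction: an edge-triggered registration (EPOLLET) reports a descriptor only when its readiness changes, so the consumer has to keep reading until the kernel returns EAGAIN. The following is a minimal sketch of that pattern using plain epoll calls. It is not Asio's reactor code, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

// Puts fd into non-blocking mode; returns false on failure.
bool make_nonblocking(int fd) {
  int flags = fcntl(fd, F_GETFL, 0);
  return flags != -1 && fcntl(fd, F_SETFL, flags | O_NONBLOCK) != -1;
}

// Registers fd for edge-triggered read readiness.
bool add_edge_triggered(int epfd, int fd) {
  epoll_event ev{};
  ev.events = EPOLLIN | EPOLLET;
  ev.data.fd = fd;
  return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0;
}

// Reads until the kernel reports EAGAIN (or end of stream).
// Returns the total number of bytes read, or -1 on error.
long drain(int fd) {
  assert(fd >= 0);
  char buf[4096];
  long total = 0;
  for (;;) {
    ssize_t n = read(fd, buf, sizeof buf);
    if (n > 0) { total += n; continue; }
    if (n == 0) return total;          // end of stream
    if (errno == EINTR) continue;      // interrupted: retry
    if (errno == EAGAIN || errno == EWOULDBLOCK) return total;
    return -1;
  }
}
```

After drain returns, a further epoll_wait on the same descriptor reports nothing until new data arrives, which is what lets an edge-triggered reactor avoid re-arming the descriptor on every operation.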

CVS Tag: linux-perf-2

CK Echo Test 1: 174 MB/s

CK Echo Test 2: 349 MB/s

CK Echo Test 3: 292 MB/s

CK Echo Test 4: 257 MB/s

CK Echo Test 5: 232 MB/s

CK HTTP Test 1: 9100 req/s

Linux-Perf-3

Eliminates signal blocking while reactor operations are performed.

CVS Tag: linux-perf-3

CK Echo Test 1: 177 MB/s

CK Echo Test 2: 352 MB/s

CK Echo Test 3: 293 MB/s

CK Echo Test 4: 256 MB/s

CK Echo Test 5: 232 MB/s

CK HTTP Test 1: 9040 req/s

Linux-Perf-4

Use per-descriptor operation queues. Use an edge-triggered strategy for interrupting the reactor. Don't explicitly delete descriptors from epoll. Re-enable null_buffers support.

CVS Tag: linux-perf-4

CK Echo Test 1: 177 MB/s

CK Echo Test 2: 343 MB/s

CK Echo Test 3: 291 MB/s

CK Echo Test 4: 259 MB/s

CK Echo Test 5: 233 MB/s

CK HTTP Test 1: 10040 req/s

Linux-Perf-5

Use per-descriptor mutexes.
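Taken together with the per-descriptor operation queues from Linux-Perf-4, the effect is that work on one descriptor no longer contends with work on another. The sketch below shows one way such a record could look. descriptor_state and its members are hypothetical names chosen for illustration, not Asio's actual types.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical per-descriptor record. A reactor could store a pointer to it
// in epoll_event::data.ptr so that no shared lookup table is needed.
struct descriptor_state {
  // An operation returns true once it has completed.
  using op = std::function<bool()>;

  std::mutex mutex;          // guards only this descriptor's queues
  std::deque<op> read_ops;
  std::deque<op> write_ops;

  void enqueue_read(op o) {
    assert(o);
    std::lock_guard<std::mutex> lock(mutex);
    read_ops.push_back(std::move(o));
  }

  // Runs queued read operations in order until one is not yet complete.
  // Returns the number of operations that completed.
  int perform_reads() {
    std::lock_guard<std::mutex> lock(mutex);
    int done = 0;
    while (!read_ops.empty() && read_ops.front()()) {
      read_ops.pop_front();
      ++done;
    }
    return done;
  }
};
```

Two threads handling events for different descriptors lock different mutexes in this arrangement, so they only serialize when they touch the same descriptor.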

CVS Tag: linux-perf-5

CK Echo Test 1: 177 MB/s

CK Echo Test 2: 352 MB/s

CK Echo Test 3: 282 MB/s

CK Echo Test 4: 259 MB/s

CK Echo Test 5: 233 MB/s

CK HTTP Test 1: 10006 req/s

Linux-Perf-6

Re-run the reactor immediately if there may be more events available (i.e. the events array was full) and there are other threads processing handlers.
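The "events array was full" heuristic can be sketched with plain epoll calls. This is an illustration under simplifying assumptions: the array capacity is deliberately tiny, the check for other threads processing handlers is omitted, and add_ready_pipe is a demo-only helper that never closes its descriptors. Descriptors are registered edge-triggered so that an event already returned is not reported again on the next pass.

```cpp
#include <cassert>
#include <sys/epoll.h>
#include <unistd.h>

// Demo helper: registers the read end of a new pipe (edge-triggered) and
// writes one byte so that it becomes ready. Returns false on failure.
bool add_ready_pipe(int epfd) {
  int fds[2];
  if (pipe(fds) != 0) return false;
  epoll_event ev{};
  ev.events = EPOLLIN | EPOLLET;
  ev.data.fd = fds[0];
  return epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) == 0 &&
         write(fds[1], "x", 1) == 1;
}

// Polls epfd and keeps polling without blocking while the events array
// comes back full, since more events may be waiting in the kernel.
// Returns the total number of events collected, or -1 on error.
int collect_events(int epfd, int first_timeout_ms) {
  assert(epfd >= 0);
  constexpr int capacity = 2;  // deliberately tiny for illustration
  epoll_event events[capacity];
  int total = 0;
  int timeout = first_timeout_ms;
  for (;;) {
    int n = epoll_wait(epfd, events, capacity, timeout);
    if (n < 0) return -1;
    total += n;  // a real reactor would dispatch events[0..n) here
    if (n < capacity) return total;  // array not full: nothing more pending
    timeout = 0;                     // array was full: re-run immediately
  }
}
```

With three ready descriptors and a capacity of two, the first pass fills the array, so the sketch polls again at once and picks up the third event instead of leaving it for a later cycle.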

CVS Tag: linux-perf-6

CK Echo Test 1: 177 MB/s

CK Echo Test 2: 354 MB/s

CK Echo Test 3: 290 MB/s

CK Echo Test 4: 257 MB/s

CK Echo Test 5: 233 MB/s

CK HTTP Test 1: 9920 req/s

Linux-Perf-7

Fix a bug preventing null_buffers() support from working. Fix a problem where a task_io_service member variable was being incorrectly accessed outside the lock.

CVS Tag: linux-perf-7

Linux-Perf-8

Change strands so that they share a pool of implementations, to make copying and destruction of strand objects cheaper.
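A pooled design of this kind might look roughly like the sketch below. The names strand_pool, strand_impl and strand, the round-robin allocation, and the pool size of 4 are all assumptions made for illustration. They are not taken from Asio's source.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <mutex>

// Hypothetical shared implementation. In a real strand this would hold the
// queue of pending handlers; here it only carries a mutex.
struct strand_impl {
  std::mutex mutex;
};

// Fixed pool of implementations handed out round-robin.
class strand_pool {
public:
  static constexpr std::size_t size = 4;  // arbitrary size for the sketch

  strand_impl* allocate() {
    std::lock_guard<std::mutex> lock(mutex_);
    return &impls_[next_++ % size];
  }

private:
  std::mutex mutex_;
  std::array<strand_impl, size> impls_;
  std::size_t next_ = 0;
};

// A strand is only a pointer into the pool, so copying and destroying it
// involves no allocation and no reference counting.
class strand {
public:
  explicit strand(strand_pool& pool) : impl_(pool.allocate()) {
    assert(impl_ != nullptr);
  }
  strand_impl* impl() const { return impl_; }

private:
  strand_impl* impl_;
};
```

The trade-off in such a design is that once more strands exist than pool entries, two unrelated strands can end up on the same implementation and be serialized with each other.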

CVS Tag: linux-perf-8

Linux-Perf-9

Use a thread-private handler queue inside run() for running a small number of additional handlers outside of the common handler queue.
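One possible reading of this change is sketched below: a thread takes a few handlers from the common queue under a single lock acquisition, then runs them from a queue that only it can see. The scheduler class and run_some are hypothetical, and the exception path (returning unrun handlers to the common queue) follows what Linux-Perf-10 describes rather than any code shown on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <iterator>
#include <mutex>
#include <utility>

class scheduler {
public:
  using handler = std::function<void()>;

  void post(handler h) {
    assert(h);
    std::lock_guard<std::mutex> lock(mutex_);
    common_.push_back(std::move(h));
  }

  // Moves up to `batch` handlers to a thread-private queue under one lock,
  // then runs them without the lock. Returns the number of handlers run.
  std::size_t run_some(std::size_t batch) {
    std::deque<handler> priv;  // visible only to the calling thread
    {
      std::lock_guard<std::mutex> lock(mutex_);
      while (!common_.empty() && priv.size() < batch) {
        priv.push_back(std::move(common_.front()));
        common_.pop_front();
      }
    }
    std::size_t ran = 0;
    try {
      while (!priv.empty()) {
        handler h = std::move(priv.front());
        priv.pop_front();
        h();
        ++ran;
      }
    } catch (...) {
      // Give unrun handlers back to the front of the common queue,
      // preserving their order, before propagating the exception.
      std::lock_guard<std::mutex> lock(mutex_);
      common_.insert(common_.begin(),
                     std::make_move_iterator(priv.begin()),
                     std::make_move_iterator(priv.end()));
      throw;
    }
    return ran;
  }

  std::size_t pending() {
    std::lock_guard<std::mutex> lock(mutex_);
    return common_.size();
  }

private:
  std::mutex mutex_;
  std::deque<handler> common_;
};
```

Keeping the batch small matters in a design like this: handlers sitting in a private queue cannot be picked up by other idle threads until they are run or handed back.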

CVS Tag: linux-perf-9

Linux-Perf-10

Add support for using timerfd to manage timeouts for timer operations. Ensure items in the thread-private handler queue are moved to the common queue when an exception is thrown.
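The appeal of timerfd is that a timeout becomes just another descriptor the reactor can wait on, instead of a value that has to be folded into the epoll_wait timeout. The sketch below shows the system calls involved. add_timeout and expirations are hypothetical helpers, and arming one timerfd per timeout is a simplification for the example, not a description of how Asio arranges its timers.

```cpp
#include <cassert>
#include <cstdint>
#include <ctime>
#include <sys/epoll.h>
#include <sys/timerfd.h>
#include <unistd.h>

// Creates a one-shot timerfd that expires after `ms` milliseconds and
// registers it with epfd. Returns the timer descriptor, or -1 on failure.
int add_timeout(int epfd, long ms) {
  assert(ms > 0);  // an all-zero it_value would disarm the timer
  int tfd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK | TFD_CLOEXEC);
  if (tfd < 0) return -1;
  itimerspec spec{};
  spec.it_value.tv_sec = ms / 1000;
  spec.it_value.tv_nsec = (ms % 1000) * 1000000L;
  epoll_event ev{};
  ev.events = EPOLLIN;
  ev.data.fd = tfd;
  if (timerfd_settime(tfd, 0, &spec, nullptr) != 0 ||
      epoll_ctl(epfd, EPOLL_CTL_ADD, tfd, &ev) != 0) {
    close(tfd);
    return -1;
  }
  return tfd;
}

// Reads the expiration count from a non-blocking timerfd.
// Returns 0 if the timer has not fired yet.
std::uint64_t expirations(int tfd) {
  std::uint64_t count = 0;
  ssize_t n = read(tfd, &count, sizeof count);
  return n == static_cast<ssize_t>(sizeof count) ? count : 0;
}
```

Once the timer fires, epoll_wait returns the timer descriptor like any socket, and reading it yields the number of expirations since the last read.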

CVS Tag: linux-perf-10

Linux-Perf-11

Where possible, run the epoll reactor from multiple threads.

CVS Tag: linux-perf-11

CK Echo CPU Scalability:

[Graphs: linux-perf-11-100conn-16KB.png, linux-perf-11-1000conn-16KB.png]

Test setups

CK Echo Test 1

This test uses the src/tests/performance programs included with non-Boost Asio. It measures total throughput across all sockets in MB/s. The hardware is two Intel Xeon E5310 quad-core processors (1.6 GHz) with 6 GB RAM, running 64-bit Debian Linux.

Asio is configured using: CXXFLAGS="-O2 -finline-limit=1000"

Server program is run using: taskset -c 0 ./server 0.0.0.0 55555 1 16384

Client program is run using: taskset -c 1 ./client localhost 55555 1 16384 1 100

CK Echo Test 2

Configuration as for CK Echo Test 1.

Server program is run using: taskset -c 0 ./server 0.0.0.0 55555 1 16384

Client program is run using: taskset -c 1 ./client localhost 55555 1 16384 10 100

CK Echo Test 3

Configuration as for CK Echo Test 1.

Server program is run using: taskset -c 0 ./server 0.0.0.0 55555 1 16384

Client program is run using: taskset -c 1 ./client localhost 55555 1 16384 100 100

CK Echo Test 4

Configuration as for CK Echo Test 1.

Server program is run using: taskset -c 0 ./server 0.0.0.0 55555 1 16384

Client program is run using: taskset -c 1 ./client localhost 55555 1 16384 1000 100

CK Echo Test 5

Configuration as for CK Echo Test 1.

Server program is run using: taskset -c 0 ./server 0.0.0.0 55555 1 16384

Client program is run using: taskset -c 1 ./client localhost 55555 1 16384 10000 100

CK HTTP Test 1

This test uses HTTP Server Example 1 from Asio with ab to measure requests per second. Hardware configuration as for CK Echo Test 1.

Server program is run using: taskset -c 0 ./http_server 0.0.0.0 8090 ../doc_root

Client program is run using: taskset -c 1 ab -c 100 -n 1000000 'http://127.0.0.1:8090/data_4K.html'

CK Echo CPU Scalability

This test compares the throughput of a single io_service running on N CPUs (one thread per CPU) against N io_services each running on one CPU. It uses the src/tests/performance programs included with non-Boost Asio. Hardware configuration as for CK Echo Test 1.

RtB Echo Tests

Like the CK Echo tests, except that they are run on two Intel Xeon E5420 quad-core processors @ 2.5 GHz with 16 GB RAM, running Debian GNU/Linux with GCC 4.3.2. The programs are run using the bash script shown below.

#!/bin/bash
# Stop any server left over from a previous run.
killall server
timeout=100
for bufsize in 16384 32768 65536; do
  for nothreads in 1 2 4; do
    for nosessions in 1 10 100; do
      echo "Bufsize: $bufsize Threads: $nothreads Sessions: $nosessions"
      # Start the server in the background and remember its PID.
      ./server 0.0.0.0 55555 $nothreads $bufsize & srvpid=$!
      ./client localhost 55555 $nothreads $bufsize $nosessions $timeout
      kill -9 $srvpid
    done
  done
done

Buffer Threads Sessions Baseline perf-1 perf-2 perf-3 perf-4 perf-5 perf-6 perf-7 perf-8 perf-10 perf-11
16384 1 1 344 348 272 364 280 385 370 299 485 468 321
16384 1 10 686 549 568 734 581 740 745 742 792 817 616
16384 1 100 545 495 526 573 568 577 568 625 611 617 571
16384 2 1 251 231 248 266 256 254 266 253 275 276 259
16384 2 10 671 615 439 423 450 675 662 687 808 677 754
16384 2 100 578 577 400 437 411 596 608 620 647 656 649
16384 4 1 229 248 239 242 242 232 233 233 244 271 263
16384 4 10 444 515 369 369 370 567 594 594 580 627 623
16384 4 100 543 589 382 389 421 656 656 655 659 681 571
32768 1 1 366 450 375 378 678 676 n/a (?) 666 387 388 522
32768 1 10 701 919 940 947 953 950 719 951 754 745 985
32768 1 100 590 542 553 555 558 556 614 556 640 641 596
32768 2 1 373 363 375 380 368 379 355 382 396 395 388
32768 2 10 638 722 639 590 611 752 803 864 734 784 794
32768 2 100 600 610 518 478 510 668 629 645 647 700 742
32768 4 1 326 347 316 350 340 318 322 331 337 417 411
32768 4 10 588 658 502 488 502 677 686 685 678 703 700
32768 4 100 601 618 491 490 507 664 663 659 664 660 597
65536 1 1 1091 1108 1123 539 733 538 1172 536 553 560 748
65536 1 10 1062 1056 1082 834 1089 1092 1078 1097 810 833 867
65536 1 100 484 480 475 552 476 478 480 479 571 508 791
65536 2 1 417 415 422 484 503 511 704 428 486 716 511
65536 2 10 839 863 678 674 676 843 799 856 820 844 854
65536 2 100 615 689 516 517 558 657 594 622 618 640 708
65536 4 1 422 424 402 408 406 403 402 399 430 472 n/a
65536 4 10 763 755 609 612 620 747 755 747 747 775 771
65536 4 100 663 683 534 515 523 674 670 671 662 668 681

Another test is run on an Intel Core2 Duo T9400 @ 2.53 GHz.

Buffer Threads Sessions Baseline perf-1 perf-2 perf-3 perf-4 perf-5 perf-6 perf-7 perf-8 perf-11
16384 1 1 330 343 345 348 353 n/a 355 354 367 383
16384 1 10 870 895 938 930 953 956 966 963 1110 1093
32768 1 1 507 517 520 522 n/a 527 534 n/a n/a n/a
32768 1 10 1257 1274 1326 1322 1321 1347 1359 1355 1476 n/a
65536 1 1 n/a 984 983 898 908 918 918 918 n/a 923
65536 1 10 1652 n/a 1704 1703 1691 1728 1710 1728 1838 1845
131072 1 1 n/a n/a n/a 1176 1171 1172 n/a n/a 1283 1357
131072 1 10 1200 1343 n/a n/a n/a 1326 1366 1370 1356 1173

Date: 18 November 2009
System: Linux 2.6.31-ARCH #1 SMP PREEMPT Tue Nov 10 19:01:40 CET 2009 x86_64 Intel(R) Core(TM)2 Duo CPU E6750 @ 2.66GHz GenuineIntel GNU/Linux
Compiler: GCC 4.4.2

Buffer Threads Sessions baseline perf-1 perf-2 perf-3 perf-4 perf-5 perf-6 perf-7 perf-8 perf-9 perf-10 perf-11
16384 1 1 490 513 527 536 0 554 546 553 0 0 0 592
16384 1 10 922 952 1016 1019 1038 1044 1042 1046 0 1167 1173 1180
16384 1 100 478 484 498 484 490 491 487 487 0 522 481 526
16384 2 1 413 451 452 483 493 482 498 492 459 475 528 588
16384 2 10 832 920 985 1001 1013 1022 1019 1003 1152 1137 1133 1000
16384 2 100 430 461 497 487 485 455 460 452 515 540 553 645
16384 4 1 445 441 0 394 456 385 443 397 492 536 530 575
16384 4 10 806 902 952 979 988 1009 996 1010 0 1125 1137 1031
16384 4 100 400 476 497 496 486 479 479 479 516 525 497 661
32768 1 1 684 704 711 741 752 743 748 749 0 779 780 793
32768 1 10 1317 1332 1406 1424 1441 1435 1444 1418 1574 1546 1559 1518
32768 1 100 473 470 475 471 496 464 492 489 499 516 466 532
32768 2 1 595 661 648 649 649 653 654 637 725 710 692 771
32768 2 10 1084 1309 1333 1396 1406 1418 1419 1278 1528 1520 1512 0
32768 2 100 455 468 473 472 485 562 548 505 560 684 495 0
32768 4 1 636 615 637 665 678 600 596 643 725 714 699 792
32768 4 10 989 1288 1352 1378 1365 1390 1389 1407 1526 1500 1516 1401
32768 4 100 449 468 474 474 483 476 480 504 543 470 452 699
65536 1 1 1110 1147 1156 1182 1205 1197 1184 1197 1287 1197 1277 1307
65536 1 10 1609 1635 1467 1684 1646 1689 1650 1727 1811 1721 1714 1643
65536 1 100 511 476 497 460 525 517 510 518 528 0 544 591
65536 2 1 1002 1021 1070 1113 1127 1122 1122 1122 1208 1178 1213 938
65536 2 10 1328 1660 1649 1710 1713 1702 1680 1712 1816 1670 1644 1699
65536 2 100 462 591 448 527 513 551 532 556 535 0 598 873
65536 4 1 1000 714 737 742 757 715 695 706 745 1102 0 793
65536 4 10 1304 1677 1688 1731 1724 1762 1738 1744 1847 1770 0 1689
65536 4 100 461 463 455 0 516 525 511 533 541 552 0 741
http - - 19063 18053 16867 17349 17611 18630 17496 17538 17781 17862 17775 17698

Topic revision: r26 - 07 Sep 2010 - 22:43:06 - TWikiAdminUser