Academia.eduAcademia.edu

Software Pipelining for Packet Filters

2007, Lecture Notes in Computer Science

https://doi.org/10.1007/978-3-540-75444-2_44

Abstract

Packet filters play an essential role in traffic management and security management on the Internet. In order to create software-based packet filters that are fast enough to work even under a DOS attack, it is vital to effectively combine both the higher-level optimization related to algorithmic structure and the lower-level optimization related to acceleration techniques in compiler study. In the present paper, we focus on the lower-level (machine code) optimization using software-pipelining, and report experimental results that indicate the potential of our approach for accelerating packet filter performance. The technical difficulty is that the packet filter is a lump of conditional branches, so that standard optimization techniques usually applied to basic blocks is not directly applicable to this problem. Using predicated execution and enhanced modulo scheduling, we solve this problem and achieve 20 times higher performance compared with a conventional interpreter-based packet filter. We also compare the proposed filters and compiler-based packet filters, and obtain a better than twofold increase in performance.

Software Pipelining for Packet Filters Yoshiyuki Yamashita1 and Masato Tsuru2 1 Saga University, Honjyo 1, Saga, 840-8502 Japan [email protected] 2 Kyushu Institute of Technology, Kawazu 680-4, Iizuka, 820-8502 Japan [email protected] Abstract. Packet filters play an essential role in traffic management and security management on the Internet. In order to create software-based packet filters that are fast enough to work even under a DOS attack, it is vital to effectively combine both the higher-level optimization related to algorithmic structure and the lower-level optimization related to acceler- ation techniques in compiler study. In the present paper, we focus on the lower-level (machine code) optimization using software-pipelining, and report experimental results that indicate the potential of our approach for accelerating packet filter performance. The technical difficulty is that the packet filter is a lump of conditional branches, so that standard opti- mization techniques usually applied to basic blocks is not directly appli- cable to this problem. Using predicated execution and enhanced modulo scheduling, we solve this problem and achieve 20 times higher perfor- mance compared with a conventional interpreter-based packet filter. We also compare the proposed filters and compiler-based packet filters, and obtain a better than two-fold increase in performance. 1 Introduction Packet filters, which basically inspect the header and/or payload of each in- coming packet and perform appropriate actions on each packet, are essential for traffic management and security management on the Internet and so are implemented in a variety of systems/devices, such as IP routers and firewalls. Software-based packet filters are cost-effective and flexible but are general rel- atively slow, whereas hardware-based packet filters (e.g., those using ASIC or FPGA[8,11,6] for pattern matching) are fast but expensive and less flexible. Recently, the rapid growth of network bandwidth has led to the requirement for high-speed packet filters. On the other hand, emerging applications of packet filters require increases scalability and flexibility of filter rules. In particular, in the case of DOS attacks, it is important to drop an enormous number of useless packets efficiently, while valid packets should be handled properly. In addition, these filters should be easily modified in response to new circumstances or new  This work was supported in part by Hitachi, Ltd, the Ministry of Internal Af- fairs Communications, Japan, and JSPS, Grant-in-Aid for Scientific Research (C) (19500057). R. Perrott et al. (Eds.): HPCC 2007, LNCS 4782, pp. 446–459, 2007.  c Springer-Verlag Berlin Heidelberg 2007 Software Pipelining for Packet Filters 447 requirements. In order to realize a packet filter that manages both flexibility and high-speed in a cost-effective manner, it is of practical importance to establish an approach by which to create software-based packet filters that are fast enough to work even when under DOS attack. This requires the effective combination of both higher-level optimization related to algorithmic structure adapted to the input packet sequence and lower-level (machine code) optimization related to acceleration techniques in a compiler study. We will focus on the lower-level optimization. Several studies have attempted to produce a native machine code from a packet filter rule and to make the fil- ter faster compared with the conventional interpreter-based packet filter[2,3,7]. However, to the best of our knowledge, none of these studies attempted to apply the state-of-the-arts optimization techniques based on software-pipelining. Al- though software-pipelining is a common technique in compiler construction, it is technically difficult to apply software-pipelining to a packet filter because the filter consists of several (more than 20 in some cases) conditional branches, but the standard software-pipelining algorithm is not applicable to programs with conditional branches. We solve this program with predicated execution[4] and enhanced modulo scheduling[9]. The present paper is an extension of our previous research [10], which showed that we achieved from 10 to 20 times higher performance. In the present paper, we will show that further improvement of the algorithms and more appropriate choices of scheduling parameters result in 12 to 28 times higher performance and an average of 20 times higher performance. In addition, in order to explain our optimization techniques in detail, we present concrete code examples in the C language and the assembly language using an Intel IA-64 Itanium 2 processor. 2 Framework For the purpose of simplification, we consider a packet filtering procedure to consist of two serial procedures. The first procedure classifies each received packet into one of three categories: usually accept, drop, or forward, according to pre- defined filter rules and the context of the packet. The second procedure is to process the packet according to the above classification. These two procedures are usually invoked serially within a single loop, such as for(;;) { n_packets = receive(); @ for(i = 0; i < n_packets; i++) { result = filter(i); action(result); @ } } where we assume that when a packet stream is incoming at a very high rate (e.g., due to a DOS attack), two or more consecutive packets are received by each receive() and are stored in the receiving buffer, after which they are continuously processed in the internal loop. 448 Y. Yamashita and M. Tsuru Simply speaking, making a packet filter faster is approximately equivalent to executing this loop in a shorter time. Hence, in the present paper, we apply code optimization techniques to this loop. The function filter() in the above loop usually consists of only logical and arithmetic operations. Therefore, we expect that it is rather easy to optimize this function with various code opti- mization techniques. On the other hand, since the function action() includes complex tasks, the function must perform tasks such as hoist a packet to an upper-layer and send a packet to other network ports, it is almost impossible to apply optimization techniques to this function. Then, in order to obtain the maximum effects of optimization, in other words, to enable the application of loop optimization techniques, we divide the above loop into two loops as follows: for(;;) { n_packets = receive(); for(i = 0; i < n_packets; i++){ // Loop1 result[i] = filter(i); } for(i = 0; i < n_packets; i++){ // Loop2 action(result[i]); } } We therefore concentrate on how to optimize Loop1 efficiently. The main idea is to process multiple consecutive packets together using a software-pipelined procedure. This approach is especially effective in suffering from DOS attacks, where a major problem is how to prevent the unprocessed packets from being incessantly lost due to buffer overflow, that is, how to in- crease the maximum acceptable input packet rate. Consider a very simple ana- lytical model. Let T1 and T2 be the average times of executing (non-optimized) filter() for one packet and action() in cases of accept and forward for one packet, respectively. We also assume the execution time of action() in case of drop is negligible. Then, let α denote the ratio of non-dropped (valid) packets to all of the incoming packets. In addition, let A denote the acceleration ratio of the execution time in optimizing Loop1, when many consecutive packets are processed together. The average processing time per packet is then T1 /A + αT2 . In a DOS attack situation, we can assume that α  1 because the number of received packets is huge but the number of valid packets does not vary from a non-DOS attack situation. Therefore, the ratio of the optimized pro- cessing time to the non-optimized processing time is approximately given as (T1 /A + αT2 )/(T1 + αT2 ) ≈ 1/A. It shows that the code optimization in Loop1 directly affects the total performance of the packet filter. 3 Loop Optimization In this section, we briefly explain the three steps of the optimizations compared herein. Our proof-of-concept prototype of the packet filter is based on tcpdump[5] (or bpf as its basis). Actually, the compilers for packet filters in our experiments are implemented by rewriting the source program of tcpdump. Software Pipelining for Packet Filters 449 List 1: bpf’s filter machine code for rule, ip and tcp  (000) ldh [12] (001) jeq #0x800 jt 2 jf 5 (002) ldb [23] (003) jeq #0x6 jt 4 jf 5 (004) ret #96 (005) ret #0   3.1 Interpreter tcpdump translates a filter rule into a code for the pre-defined virtual machine, called a filter machine, which consists of an accumulator, an index register, and an array of scratch memory. tcpdump then executes the code on the interpreter that simulates the filter machine. The code is a sequence of bpf instructions. bpf instructions can load a part of a packet contents into a virtual register, perform arithmetic on virtual registers, and return an integer value indicating whether a packet is accepted. For example, let us consider the tcpdump rule, ip and tcp, which accepts only IPv4 tcp packets. List 1 shows the corresponding filter machine code of bpf instructions. Here, the instructions ldh, ldb, jeq, and ret denote load a half word, load a byte, jump if equal, and return a constant value, respectively. The tags jt m and jf m denote that the execution jumps to location m when the comparison yields true and false, respectively. This code corresponds to the function filter(i) in the previous section. 3.2 Compiling into a C Program Since tcpdump translates a filter rule into a filter machine code and executes the code on an interpreter, the filtering process is very slow. The process generally be- comes faster when we compile the rule into a native code. Several studies???[3,7] have already reported that the filter becomes approximately five times faster. As the first step of our research, we compile the rule into a native code. Specifically, we translate a filter rule into a C program and compile the C program into a native code using an existing C compiler. In the present paper, we examine two types of C programs because we expect that compilers treat them differently. The first type is a C program in which each bpf instruction in a filter machine code is straightforwardly replaced with equivalent C statements, including labels and goto statements. For example, List 2 is the C program translated from List 1. The other type of C program is a structured program without goto statements, like List 3. Structured programs are sometimes more redundant than the unstructured programs but are expected to be compiled into more efficient codes by highly-optimizing C compilers. Rewriting the source program of the bpf interpreter, we can easily build a translator that translates a given filter machine code into both unstructured and structured C programs. The C compilers the authors used in this research are gcc, the Gnu C compiler, and icc, Intel’s C compiler. 450 Y. Yamashita and M. Tsuru List 2: unstructured C program for List 1  for(i = 0; i < n_packets; i++) { byte* p = packets[i]; L0: A = EXTRACT_SHORT(&p[12]); L1: if(A == 0x800) goto L2; else goto L5; L2: A = p[23]; L3: if(A == 6) goto L4; else goto L5; L4: A = 96; goto L6; L5: A = 0; goto L6; L6: result[i] = A; }   List 3: structured C program for List 1  for(i = 0; i < n_packets; i++) { byte* p = packets[i]; short tmps = EXTRACT_SHORT(&p[12]); if(tmps == 0x800) { byte tmpb = p[23]; if(tmpb == 6) A = 96; else A = 0; } else A = 0; result[i] = A; }   3.3 Simple List Scheduling A C compiler sometimes generates slow code because the compiler has little in- formation specific to a given C program so that it is likely to select a conservative but inefficient instruction pattern. As the second step of our research, we build a translator that directly translates a filter machine code into a native assem- bly code using elaborately selected instruction patterns. A simple instruction scheduling, so-called list scheduling, is applied to individual basic blocks. List 4 is an example of such simply list-scheduled code translated from List 3, based on the Intel IA-64 architecture. Here, we assume that the registers r100 and r110 hold the addresses of arrays packets and results (see List 2), respectively, and that r10 and r20 hold the constant values 0x800 and 96, respectively. For ease of reading, in the present paper, every register used in the codes is singly assigned, although register allocation algorithms are applied to actual codes. Double semicolons ";;" indicate a boundary of adjacent in- struction groups, in each group of which all of the instructions can be executed simultaneously. 3.4 Software Pipelining As the third step of our optimization, we apply software-pipelining techniques[1], which are highly sophisticated aggressive instruction scheduling techniques for Software Pipelining for Packet Filters 451 List 4: naive code for List 3  L0: ld8 r32 = [r100],8 ;; // r100 == &packets[i] adds r40 = 12,r32 ;; ld2 r50 = [r40] ;; cmp.eq p6,p7 = r10,r50 // r10 == 0x800 (p7)br L2 ;; adds r60 = 23,r32 ;; ld1 r70 = [r60] ;; cmp.eq p6,p7 = 6,r70 (p7)br L1 ;; mov r80 = r20 // r20 == 96 br L3 ;; L1: mov r80 = r0 br L3 ;; L2: mov r80 = r0 ;; L3: st4 [r110] = r80,4 // r110 == &results[i] br.ctop.dptk L0   loops. Although software pipelining techniques are popular and have been im- plemented for several existing compilers, including gcc, most of the compilers can only apply the techniques to loops whose bodies are basic blocks, but cannot apply the techniques to loops with conditional branches like List 2 or List 3. In order to obtain software-pipelined codes for loops with conditional branches, we adopt two special techniques. The first is called predicated execution[4] (PE for short), which can be performed by processors with the facility of predicate registers, such as the Intel Itanium 2 [4]. The second is called enhanced modulo scheduling (EMS for short) [9], which can be performed by any RISC processor, but generates code of larger size. Since these techniques are the primary theme in the present paper, we will explain them in detail in the following two sections. 4 Predicated Execution In this section, we briefly explain the basic concepts of PE and the software- pipelining technique based on PE. 4.1 Predication Predicate registers are used to eliminate branch instructions, which may seriously slow program execution. List 5, translated from List 4, is an example of code that uses predicate registers. For example, the instruction cmp.eq p10,p20 = r10,r50 assigns true and false values to predicate registers p10 and p20, respec- tively, if the values of r10 and r50 are equal. Otherwise, this instruction assigns false and true values to predicate registers p10 and p20, respectively. The in- struction (p10)adds r60 = 23,r32 performs addition if p10 is true. Otherwise, this instruction works as a nop instruction, and the instruction is said to be nullified. The compare instruction (p10)cmp.eq.unc p30,p40 = 6,r70, which is qualified with the predicate register p10, assigns the appropriate truth values 452 Y. Yamashita and M. Tsuru List 5: naive PE code for List 3  L0: ld8 r32 = [r100],8 ;; adds r40 = 12,r32 ;; ld2 r50 = [r40] ;; cmp.eq p10,p20 = r10,r50 ;; (p10)adds r60 = 23,r32 (p20)mov r80 = r0 ;; (p10)ld1 r70 = [r60] ;; (p10)cmp.eq.unc p30,p40 = 6,r70 ;; (p30)mov r80 = r20 (p40)mov r80 = r0 ;; st4 [r110] = r80,4 br.ctop.dptk L0   List 6: software-pipelined PE code (kernel) for List 3  //stage L0: ld8 r32 = [r100],8 //1 (p12)cmp.eq.unc p30,p40 = 6,r71 //4 (p11)adds r60 = 23,r34 //3 ld2 r50 = [r41] //2 (p41)mov r82 = r0 //5 (p21)mov r80 = r0 ;; //3 (p11)ld1 r70 = [r60] //3 cmp.eq p10,p20 = r10,r50 //2 adds r40 = 12,r32 //1 st4 [r110] = r82,4 //5 (p30)mov r81 = r20 //4 br.ctop.dptk L0   to p30 and p40 according to the results of comparison if p10 is true. Otherwise, the instruction assigns false values to both p30 and p40. 4.2 Software pipelining for Predicated Code Since a loop body without branches is a straight-line code, we can optimize the loop by applying a standard method of software-pipelining[1] (or modulo scheduling as a concrete algorithms). List 6 is such a software-pipelined predi- cated code (hereinafter PE code) optimized from List 5 under the assumption that every memory access latency is one machine cycle (MC). In order to mini- mize the initiation interval of the loop iterations (hereinafter I.I.), we use register rotation[4] for both predicate and integer registers. In List 6, for example, just after the loop-back branch instruction br.ctop is executed, the predicate regis- ters p10, ... , p40 are renamed p11, ..., p41, respectively, and the integer registers r32, ..., r81 are renamed r33, ..., r82, respectively. The code in List 6 consists of five software pipeline stages, denoted 1 through 5 following the corresponding instruction at each line of List 6. Software Pipelining for Packet Filters 453 The idealistic I.I. of the loop in List 6 is 2 MC because the first six instructions of the loop can be executed simultaneously and so the last six instructions can also be executed, although it is almost impossible to attain I.I. because the memory access latency is longer than 1 MC and often varies greatly. A major benefit of using PE is that PE frees us from branches (and branch penalties) and can facilitate the application of software-pipelining. Conversely, a major disadvantage is that the instructions that would be nullified do not contribute any effective computation and waste processor resources. In general, a PE code becomes slower when the code size (precisely speaking, the number of nullified instructions) increases. 5 Enhanced Modulo Scheduling Since it is difficult to explain the EMS algorithm[9] in detail in the limited space available here, we only present the concept of how the processor runs effectively on the code generated by the EMS algorithm. The EMS algorithm generates List 7: EMS code (kernel) for List 3  Lxx_tx: //stage Lfx_tt: //stage ld8 r32 = [r100],8 //1 ld2 r50 = [r40] //1 st4 [r110] = r82,4 //4 mov r81 = r20 //3 cmp.eq p6,p7 = r10,r51 //2 br.ctop.dptk Lxx_xx (p7)br Lfx_tx ;; br Lexit ;; Ltx_tx: Lfx_tf: adds r40 = 12,r32 //1 ld2 r50 = [r40] //1 adds r60 = 23,r33 //2 mov r81 = r0 //3 cmp.eq p6,p7 = 6,r71 //3 br.ctop.dptk Lxx_xx (p7)br Ltx_tf ;; br Lexit ;; Ltx_tt: Lxx_xx: ld2 r50 = [r40] //1 ld8 r32 = [r100],8 //1 mov r81 = r20 //3 st4 [r110] = r82,4 //4 ld1 r70 = [r60] //2 cmp.eq p6,p7 = r10,r51 //2 br.ctop.dptk Lxx_tx (p7)br Lfx_xx ;; br Lexit ;; Ltx_xx: Ltx_tf: adds r40 = 12,r32 //1 ld2 r50 = [r40] //1 adds r60 = 23,r33 ;; //2 mov r81 = r0 //3 ld2 r50 = [r40] //1 ld1 r70 = [r60] //2 ld1 r70 = [r60] //2 br.ctop.dptk Lxx_tx br.ctop.dptk Lxx_tx br Lexit ;; br Lexit ;; Lfx_tx: Lfx_xx: adds r40 = 12,r32 //1 adds r40 = 12,r32 //1 mov r80 = r0 //2 mov r80 = r0 ;; //2 cmp.eq p6,p7 = 6,r71 //3 ld2 r50 = [r40] //1 (p7)br Lfx_tf ;; br.ctop.dptk Lxx_xx ;; Lexit: continues to the right column   454 Y. Yamashita and M. Tsuru such a code. Strictly speaking, the EMS algorithm originally given in [9] treats a loop having a few unnested conditional branches and is not sufficient to optimize the loops with many (more than 20 in the present cases, as shown in the next section) deeply-nested (maximum depth: 10) conditional branches. The authors are now preparing another paper that explains how to treat such complex loops. 5.1 Difference from Predicated Execution A software-pipelined PE code uses a set of predicate registers to memorize trace information of the evaluation results of compare instructions. Review the code in List 6. The rotating predicate register p10 is defined at the second stage and is referred at the third and fourth stages as p11 and p12, respectively. p20 is defined as the second stage and referred at the third stage as p21. p30 is defined and referred at the fourth stage. p40 is defined at the fourth stage and referred at the fifth stage as p41. Thus, at the beginning of every new iteration, the predicate information that must live over iterations is held in the set of the registers, (p11, p12, p21, p41), so that there are at least 16 (=24 ) patterns of code executions for the one code in List 6. The EMS algorithm does not use predicate register, but instead generates two or more code patterns corresponding to the set of the truth values the predicate registers possess in the PE code. 5.2 Code Example List 7 is the kernel part of the software-pipelined code generated with the EMS al- gorithm (hereinafter EMS code). Each basic block in the code has a label of form La1 b1 a2 b2 , where the letters ai and bi represent the truth values of the results of the comparisons cmp.eq p6,p7 = r10,r51 and cmp.eq p6,p7 = 6,r71, re- spectively, for the i-th past iteration. The letters ‘t’, ‘f’, and ‘x’ denote the true, false, and undefined (or useless) values, respectively. For example, the label Lxx tx at the first line in List 7 represents the situation in which no compar- ison has yet been evaluated for the most recent iteration but the comparison cmp.eq p6,p7 = r10,r51 has been evaluated affirmatively for the second most recent iteration. If a series of adjacent packets are all IPv4 tcp packets, the pro- gram traverses only the three basic blocks labeled Lxx tx, Ltx tx, and Ltx tt. If none of the packets is a IPv4 tcp packet, the program traverses the basic blocks labeled Lxx xx and Lfx xx. Otherwise, the trace becomes more complicated. Under the assumption that there is no branch penalty, the ideal I.I. of the code in List 7 is 3 MC, even if the processor runs on any path. In contrast to the PE code, the EMS code is not free from branch penalty in general but rather contains no nullified instructions because the EMS code only executes the instructions required for the computation. Therefore, the EMS code is faster than the PE code if the penalty of the nullified instructions is greater than the branch penalty. Software Pipelining for Packet Filters 455 Table 1. Filter Rule in Our Experiments No Rule depth #ifs #insts matching ratio 1 ip 1 1 7 100.0% 2 ip and tcp 2 2 11 99.8% 3 tcp 3 3 17 99.8% 4 ip and udp 2 2 11 0.2% 5 udp 3 3 17 0.2% 6 ip and dst net ip address 2 2 15 4.9% 7 dst net ip address 4 6 36 4.9% 8 ip and net ip address 3 3 22 10.9% 9 net ip address 5 9 59 10.9% 10 ip and dst port number 7 19 66 9.3% 11 dst port number 8 29 88 9.3% 12 ip and port number 9 19 90 21.41% 13 port number 10 29 125 21.41% 6 Experiments This section reports the experimental results obtained using the code optimiza- tion methods explained in Sections 3.1, 3.2, 3.3, 4, and 5. 6.1 Environment Table 1 shows 13 examples of the filter rules of tcpdump examined herein. Each rule is simple, short, and uni-functional, although, in reality, a longer rule that combines these simple rules is likely to be employed. This is because examining each simple rule separately is suitable for analyzing the effect of our optimizations in depth. In order to clarify how the rule is complex to optimize, the columns depth, #ifs, and #insts in the table show three characteristics of the structured C program in Section 3.2 and the native assembly code in Section 3.3 corresponding to each rule. The depth is the maximum depth of nests of if constructs that the program contains, #ifs is the number of if constructs that the program contains, and #insts is the number of assembly instructions not including branch instructions. The sample packets used in the experiments were captured from the network of the authors’ lab using tcpdump -w. The total number of packets was 10, 000. The column matching ratio in Table 1 is the ratio of the packets accepted by each rule. In our experiments, the 10, 000 packets are loaded on a large buffer (virtual network). Then, every n packets packets are copied from the virtual network buffer to the receiving buffer and are processed by Loop1 10, 000/n packets times (recall Section 2). The total execution time is then divided by 10, 000 to obtain the execution time per packet for the given n packets. For performance evaluation purposes, we use the execution time for the optimal n packets, which varies depending on the filter rule. Experiments were performed on an Intel IA-64 Itanium 2 processor (900 MHz, revision 7), Linux version 2.4.18-1. 456 Y. Yamashita and M. Tsuru Table 2. Execution Time per Packet (MC) u s u s No Tip Tgcc Tgcc Ticc Ticc Tsls Tpe Tems Thyb 1 80.6 17.5 17.6 13.4 7.6 11.2 4.3 4.3 4.3 2 126.7 19.4 19.5 15.5 8.6 14.1 4.5 5.8 4.5 3 143.6 20.6 21.7 16.2 9.6 15.3 5.5 7.5 5.5 4 126.5 19.5 19.6 14.6 8.6 14.1 4.5 7.7 4.5 5 143.6 20.6 21.7 16.2 9.6 15.3 5.5 7.6 5.5 6 132.5 188.1 188.8 184.4 189.5 18.7 6.9 10.4 6.9 7 132.7 190.6 189.8 187.1 184.9 19.0 10.7 9.9 10.7 8 181.8 332.0 332.0 330.1 329.2 24.9 9.1 13.5 9.1 9 181.6 333.7 333.5 331.4 330.5 25.2 18.2 14.4 14.4 10 266.2 54.3 53.6 32.5 34.5 30.0 18.6 15.1 15.1 11 283.2 55.1 55.4 30.7 35.4 31.1 21.2 16.9 16.9 12 309.5 64.6 67.3 37.1 40.6 37.2 22.2 20.5 20.5 13 326.5 66.7 69.4 37.6 39.5 38.0 28.8 22.2 22.2 6.2 Results Table 2 shows the average execution time per packet for Loop1 (in units of machine cycle time). Here, Tip is the execution time of the original filter function u s of tcpdump, Tgcc and Tgcc are the execution times of unstructured and structured C programs, such as List 2 and List 3, respectively, compiled by gcc 2.96 with a u s -O3 option, Ticc and Ticc are the execution times of unstructured and structured C programs compiled by Intel icc 9.1 with a -O3 option, Tsls is the execution time of the simply list-scheduled code, such as List 4, Tpe is the execution time of the software-pipelined PE code, such as List 6, and Tems is the execution time of the EMS code, such as List 7. Since a software-pipelined code and its execution time can vary greatly depending on the memory access latency, we profiled the possible execution times in terms of the memory access latency and selected the optimal execution time as Tpe (or Tems ). Table 3 shows the acceleration ratios Ax = Tip /Tx, and Figure 1 shows this information in the form of a bar graph. From the tables, we observe the following: The ratio Augcc is approximately 5, except for cases No. 5 through No. 8, which agrees with the results reported in [7]. The ratio Asgcc is approximately Augcc for all cases, because gcc applies no special optimization to loops with conditional branches, regardless of their program structures. Asicc is approximately twice as large as Auicc for cases No. 1 through No. 5, because icc optimizes the loops of which the bodies are small and well-structured by applying software pipelining with predicated execution but sacrifices optimization when the program size is sufficiently large. The ratio Ape shows that the PE code is at least 10 times faster and ap- proximately 18 times faster on average than the interpreter-based execution. This ratio decreases as the number #insts increases. The ratio Aems shows that Software Pipelining for Packet Filters 457 Table 3. Acceleration Ratio (Ax = Tip /Tx ) No Aip Augcc Asgcc Auicc Asicc Asls Ape Aems Ahyb 1 1.0 4.6 4.6 6.0 10.6 7.2 18.7 18.7 18.7 2 1.0 6.5 6.5 8.2 14.7 8.9 28.2 21.8 28.2 3 1.0 7.0 6.6 8.9 15.0 9.4 26.1 19.1 26.1 4 1.0 6.5 6.5 8.7 14.7 9.0 28.1 16.4 28.1 5 1.0 7.0 6.6 8.9 15.0 9.4 26.1 18.9 26.1 6 1.0 0.7 0.7 0.7 0.7 7.1 19.2 12.7 19.2 7 1.0 0.7 0.7 0.7 0.7 7.0 12.4 13.4 12.4 8 1.0 0.5 0.5 0.6 0.6 7.3 20.0 13.5 20.0 9 1.0 0.5 0.5 0.5 0.5 7.2 10.0 12.6 12.6 10 1.0 4.9 5.0 8.2 7.7 8.9 14.3 17.6 17.6 11 1.0 5.1 5.1 9.2 8 9.1 13.4 16.8 16.8 12 1.0 4.8 4.9 8.3 7.6 8.3 13.9 15.1 15.1 13 1.0 4.9 4.7 8.7 8.3 8.6 11.3 14.7 14.7 ave. 1.0 5.7∗ 5.6∗ 8.3∗ 11.3∗ 8.3 18.6 16.3 19.7 ∗ Calculated without No. 6 to No. 9. the EMS code is at least 12 times faster and 16 times faster on average than the interpreter-based execution. Since the declining speed of Aems is lower than that of Ape , Aems is larger than Ape when #insts is greater than 30. This will be discussed further in the following subsection. Tsls can be regarded as the minimum execution time (or, conversely, as the maximum execution speed) of a program to which no software-pipelining method has been applied, because in all cases except cases No. 6 through No. 9, Tsls is u approximately Ticc , which is the execution time of the program compiled by the Intel compiler optimized for the Itanium 2 processor. Then, Aems and Ape are approximately twice as large as Asls on average. Roughly speaking, this ratio with respect to Asls indicates the effect of software-pipelining. The values in entries {No. 6, ..., No. 9} × {Tgcc u s , Tgcc u , Ticc s , Ticc } are extraor- dinarily large compared to the corresponding values for Tip . This is because memory access conflicts occur on the Itanium 2 machine used in this study, and the compilers gcc and icc can never recognize this symptom. The algorithms associated with Tsls , Tpe , and Tems manage to avoid this conflict by using a special load/store pattern. 6.3 Comparison of PE and EMS Algorithms In general, Tpe is smaller than Tems in simple rules, (cases No. 1 through No. 5) and conversely Tems is smaller than Tpe in complex rules. (cases No. 10 through No. 13). This is because the PE code is affected by the number of nullified in- structions, which is usually proportional to the code size (or the complexity of the rule). On the other hand, the EMS code contains no nullified instruction, but rather branch instructions that cause branch penalties. In a simple rule, the 458 Y. Yamashita and M. Tsuru acceleration ratio 20 10 0 s s Aip Augcc Agcc Auicc Aicc Asls Ape Aems Ahyb Fig. 1. Average Acceleration Ratios 30 MC Tpe execution time 20 Tems 10 0 0 20 40 60 80 100 120 #insts filter rule complexity Fig. 2. Linear Approximation of Execution Times branch penalties dominate the slowdown of the execution but, in a complex rule, the nullified instructions dominate the slowdown. Figure 2 shows the plots of all Tpe (shown as circles in the graph) and Tems (shown as cross-hairs in the graph) in terms of #insts. Using the least mean square method, we can approximate Tpe and Tems as the following liner equations: Tpe = 0.22#insts + 2.93, Tems = 0.14#insts + 6.01 as in Figure 2 with considerably small errors. The root mean square errors of these equations are 1.13 and 1.86, respectively. The intersection of these two lines is located at #insts = 39.3. Therefore, we can alternatively select the better software-pipelining algorithm according to the size of #insts. The execution time Thyb in Table 2 is a value obtained alternatively from Tpe and Tems and so is the acceleration ratio Ahyb in Table 3. Consequently, this hybrid algorithm is approximately 20 (=19.7) times faster than the interpreter-based execution on average, 3.5 times faster than gcc-based execution, and 1.7 times faster than icc-based execution. Software Pipelining for Packet Filters 459 References 1. Appel, A.W.: Modern Compiler Implementation in C. Cambridge University Press, Cambridge (1997) 2. Begel, A., McCanne, S., Graham, S.: BPF+: Exploiting Global Data-Flow Opti- mization in a Generalized Packet Filter Architecture. In: ACM SIGCOMM 1999 (1999) 3. Cristea, M.L., Bos, H.: A Compiler for Packet Filters. In: Proceedings of ASCI 2004 (2004) 4. Intel Itanium Architecture Software Developer’s Manual, http://www.intel.com/design/itanium2/documentation.htm 5. Jacobson, V., et al.: tcpdump(1), bpf.., Unix Manual Page (1990) 6. Kumar, S., Dharmapurikar, S., Yu, F., Crowly, P., Turner, J.: Algorithms to Accel- erate Multiple Regular Expressions Matching for Deep Packet Inspection In: ACM SIGCOMM 2006 (2006) 7. Okumura, T., Mossé, D., et al.: Network QoS Management Framework for Server Clusters An End-Host Retrofitting Event-Handler Approachusin Netnice. In: 3rd Int. Symp. on Cluster Computing and the Grid (2003) 8. Singh, S., Baboescu, F., Varghese, G., Wang, J.: Packet Classification Using Mul- tidimensional Cutting. In: ACM SIGCOMM 2003 (2003) 9. Warter, N.J., Haab, G.E., Bockhaus, J.W.: Enhanced Modulo Scheduling for Loops with Conditional Branches. In: IEEE MICRO-25 (1992) 10. Yamashita, Y., Tsuru, M.: Code Optimization for Packet Filters. In: SAINT2007, Workshop on Internet Measurement Technology and its Applications to Building Next Generation Internet (2007) 11. Yusuf, S., Luk, W.: Bitwise Optimised CAM for Network Intrusion Detection Sys- tems. In: Int. Conf. Field Programmable Logic Appl. (2005)

References (11)

  1. Appel, A.W.: Modern Compiler Implementation in C. Cambridge University Press, Cambridge (1997)
  2. Begel, A., McCanne, S., Graham, S.: BPF+: Exploiting Global Data-Flow Opti- mization in a Generalized Packet Filter Architecture. In: ACM SIGCOMM 1999 (1999)
  3. Cristea, M.L., Bos, H.: A Compiler for Packet Filters. In: Proceedings of ASCI 2004 (2004)
  4. Intel Itanium Architecture Software Developer's Manual, http://www.intel.com/design/itanium2/documentation.htm
  5. Jacobson, V., et al.: tcpdump(1), bpf.., Unix Manual Page (1990)
  6. Kumar, S., Dharmapurikar, S., Yu, F., Crowly, P., Turner, J.: Algorithms to Accel- erate Multiple Regular Expressions Matching for Deep Packet Inspection In: ACM SIGCOMM 2006 (2006)
  7. Okumura, T., Mossé, D., et al.: Network QoS Management Framework for Server Clusters An End-Host Retrofitting Event-Handler Approachusin Netnice. In: 3rd Int. Symp. on Cluster Computing and the Grid (2003)
  8. Singh, S., Baboescu, F., Varghese, G., Wang, J.: Packet Classification Using Mul- tidimensional Cutting. In: ACM SIGCOMM 2003 (2003)
  9. Warter, N.J., Haab, G.E., Bockhaus, J.W.: Enhanced Modulo Scheduling for Loops with Conditional Branches. In: IEEE MICRO-25 (1992)
  10. Yamashita, Y., Tsuru, M.: Code Optimization for Packet Filters. In: SAINT2007, Workshop on Internet Measurement Technology and its Applications to Building Next Generation Internet (2007)
  11. Yusuf, S., Luk, W.: Bitwise Optimised CAM for Network Intrusion Detection Sys- tems. In: Int. Conf. Field Programmable Logic Appl. (2005)
About the author
Kyushu Institute of Technology, Faculty Member
Papers
60
Followers
5
View all papers from Masato Tsuruarrow_forward