
Thread title: Translation

  1. #1
     Just starting out
     Join Date: Feb 2009
     Posts: 18

     Default: Translation

    Hi, can anyone write a one-page summary of this article? I've tried hard myself, but the text is heavy going. I'd be grateful for any help.

    Accelerating Database Operators Using a Network Processor

    Brian T. Gold*, Anastassia Ailamaki*, Larry Huston†, Babak Falsafi*

    *Carnegie Mellon University, Pittsburgh, PA
    †Intel Research Pittsburgh, Pittsburgh, PA

    ABSTRACT
    Database management systems (DBMSs) do not take full
    advantage of modern microarchitectural resources, such as
    wide-issue out-of-order processor pipelines. Increases in processor
    clock rate and instruction-level parallelism have left
    memory accesses as the dominant bottleneck in DBMS execution.
    Prior research indicates that simultaneous multithreading
    (SMT) can hide memory access latency from a single
    thread and improve throughput by increasing the number
    of outstanding memory accesses. Rather than expend
    chip area and power on out-of-order execution, as in current
    SMT processors, we demonstrate the effectiveness of using
    many simple processor cores, each with hardware support
    for multiple thread contexts. This paper shows an existing
    hardware architecture—the
    network processor—already fits
    the model for multi-threaded, multi-core execution. Using
    an Intel IXP2400 network processor, we evaluate the performance
    of three key database operations and demonstrate
    improvements of 1.9X to 2.5X when compared to a general-purpose
    processor.

    1. INTRODUCTION
    Memory access stalls dominate execution time in modern
    database management systems (DBMSs) [3, 6, 15, 19]. Exponential
    increases in processor clock rates and the relatively
    slow improvement of DRAM access latency only exacerbate
    the memory access bottleneck. Moreover, expending chip
    area and power on wide-issue superscalar microarchitectures
    or larger out-of-order instruction windows has produced diminishing
    returns for database workloads [15, 20].
    Research has shown that memory access
    latency is the
    key bottleneck in DBMS performance, and current architectures
    are unable to expose multiple pending accesses to
    the memory system [15, 20]. Unlike most scientific applications,
    database operations exhibit sequences of dependent
    memory accesses, which limit the opportunity for speculation
    and out-of-order execution to issue parallel memory
    accesses in a single-threaded microarchitecture. Instead of
    focusing on instruction-level parallelism in a single thread,
    recent proposals show thread-level parallelism can significantly
    improve database performance by increasing the aggregate
    number of pending misses.
    For example, simultaneous multithreading (SMT) hides
    single-thread memory access latency by interleaving the execution
    of threads on an aggressive out-of-order pipeline.
    Lo et al. showed a 3X improvement in OLTP throughput
    with a simulated eight-context SMT architecture [17]. However,
    the overhead of supporting multiple contexts on an
    aggressive microarchitecture limits the expansion of SMT
    architectures beyond four or eight threads.
    In this paper, we increase memory-level parallelism further
    by using many simple processing cores with basic hardware
    support for low-overhead context switching. Unlike
    SMT, each core in our model executes a single thread in
    program order and switches contexts only on off-chip memory
    accesses. The area and power that would be devoted to
    an aggressive microarchitecture are instead used to integrate
    more cores on a single chip, increasing the thread parallelism
    by a factor of eight over the most aggressive SMT proposals
    [17].
    Our evaluation makes use of an existing hardware platform, the
    network processor, which we find well-suited for
    executing many relational operators. Using a real hardware
    prototype, we demonstrate a 1.9X to 2.5X performance improvement
    over a 2.8GHz Pentium 4 Xeon on sequential
    scan, index range scan, and hash join operators. In addition
    to reporting performance improvements, we give insight on
    how to map common database operators to the network processor
    architecture.
    This paper is organized as follows. In Section 2, we review
    work related to query execution on current and future hardware.
    In Section 3, we give an overview of network processor
    architecture and the IXP2400 in particular. In Section 4, we
    show how to map common database operators onto a network
    processor. In Section 5, we present the results of this
    study and conclude in Section 6.

    2. RELATED WORK
    In addition to architectural approaches to improving
    memory-level parallelism, database researchers have proposed
    a number of software techniques designed to reduce
    memory access latency. Cache-conscious data placement attempts
    to restructure the layout and placement of data to
    improve spatial and/or temporal locality [2, 9]. However,
    for many applications where access patterns are effectively
    randomized (such as a hash join), memory access remains a
    performance bottleneck.

    [Figure 1: IXP2400 network processor features. Block diagram: 8 microengines, each with 8 thread contexts (one active, the others waiting), local memory, and register files; a DRAM controller with attached DRAM; on-chip scratch memory; and a PCI controller connecting to the host CPU.]
    Software prefetching hides memory access latency by inserting
    timely prefetch instructions among existing data access
    references. Jump pointers [14, 18, 21] were proposed for
    linked data structures such as the hash table chains studied
    in this paper. The basic idea of a jump pointer is to
    automatically prefetch cache blocks from specially inserted
    pointers in existing data structures. In a hash table, for
    example, the ‘next’ pointer of each linked list element is
    prefetched as soon as a list element is fetched. Jump pointers
    may require extensive modifications to software data structures
    and may provide insufficient lookahead as cache miss latency
    increases.
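    To make the jump-pointer idea concrete, here is a minimal C sketch of a bucket-chain walk that prefetches each element's 'next' pointer as soon as the element itself is available. The node layout is a hypothetical stand-in for a real hash-chain element; __builtin_prefetch is the GCC/Clang prefetch intrinsic.

        #include <stddef.h>

        /* Hypothetical hash-chain node; real DBMS structures differ. */
        struct chain_node {
            unsigned key;
            struct chain_node *next;
        };

        /* Walk a bucket chain, prefetching the next element while the
         * current one is examined. Lookahead is only one node deep,
         * which is exactly the limitation noted above as miss latency
         * grows. */
        struct chain_node *probe_chain(struct chain_node *head, unsigned key)
        {
            for (struct chain_node *n = head; n != NULL; n = n->next) {
                if (n->next != NULL)
                    __builtin_prefetch(n->next, /*rw=*/0, /*locality=*/1);
                if (n->key == key)
                    return n;
            }
            return NULL;
        }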
    In [8], Chen et al. showed that with properly inserted
    prefetch instructions in a hash join algorithm, a performance
    increase of 2.0-2.9X could be achieved in simulation. This
    work studied two approaches to prefetching in a hash join
    and detailed how to partition the join code to insert timely
    prefetch instructions. Group prefetching exploits spatial
    parallelism to overlap multiple cache misses from independent
    probe tuples. Software-pipelined prefetching, by contrast,
    hides memory accesses in a pipeline of probe operations.
    Our hash join implementation is most similar to the
    group prefetching in [8], because we exploit inter-tuple parallelism
    across multiple hardware threads.
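    The two-stage structure of group prefetching can be sketched in plain C as follows, reusing chain_node from the sketch above; the batch size and the modulo hash are illustrative assumptions, not the exact algorithm of [8].

        #define GROUP 8   /* illustrative batch size */

        /* Stage 1 issues one bucket prefetch per probe tuple in the
         * group, so up to GROUP cache misses overlap; stage 2 walks
         * the chains, finding the headers already (or nearly) in
         * cache. */
        void probe_group(struct chain_node **buckets, size_t nbuckets,
                         const unsigned *keys, size_t ntuples)
        {
            for (size_t base = 0; base < ntuples; base += GROUP) {
                size_t lim = (base + GROUP < ntuples) ? base + GROUP : ntuples;
                struct chain_node *head[GROUP];

                for (size_t i = base; i < lim; i++) {        /* stage 1 */
                    head[i - base] = buckets[keys[i] % nbuckets];
                    __builtin_prefetch(head[i - base], 0, 1);
                }
                for (size_t i = base; i < lim; i++)          /* stage 2 */
                    for (struct chain_node *n = head[i - base]; n; n = n->next)
                        if (n->key == keys[i]) {
                            /* emit join match for probe tuple i */
                            break;
                        }
            }
        }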
    In addition to SMT, the architecture community has investigated
    several other multi-threading strategies. Similarly
    to the network processor model, switch-on-miss multithreading
    switches contexts on infrequent events such as L2
    cache misses [1, 10]. Prior evaluations used a limited number
    of thread contexts (four or fewer) and a single core, whereas
    current technology affords us 8 thread contexts per core and
    8 or more processing cores per chip. Tera’s MTA [4] implemented
    fine-grained multi-threading by storing 128 thread
    contexts in hardware and switching among ready threads
    every clock cycle. We are not aware of any evaluation of
    database workloads on the MTA architecture.
    A number of recent papers have explored the use of graphics
    processors in the execution of database operators [5, 11,
    22]. Modern graphics processors are exemplified by 8-16
    parallel pipelines and very high-bandwidth memory interfaces,
    making them well-suited for some database operations.
    The principal disadvantage of graphics architectures
    remains their restrictive programming model, which limits
    control flow and prohibits random writes to memory.
    [Figure 2: Thread execution on a microengine. Threads move between active, waiting, and ready states: a thread issues a DRAM request, waits out the DRAM latency while other threads execute, and resumes execution when the data becomes available, so threads T0..TN overlap their memory latencies in aggregate.]
    Consequently, a small subset of database operators have
    been mapped to current graphics architectures. In contrast,
    the network processor’s general purpose programming model
    supports any relational operator that a conventional architecture
    can execute.
    3. NETWORK PROCESSOR OVERVIEW
    Many network applications process numerous independent
    streams of packets. The high bandwidth and parallelism
    required in these situations often exceeds what a
    general-purpose microprocessor can provide. The recent
    emergence of special-purpose architectures, termed network
    processors, addresses the need for high-throughput compute
    elements in such applications.
    In this paper, we use the Intel IXP2400 network processor,
    shown in Figure 1 as a block diagram. Network processors
    contain a number of simple processing cores, termed
    microengines (MEs) in the Intel chips. The IXP2400 has 8
    microengines clocked at 600 MHz, and each microengine has
    hardware support for 8 thread contexts. Switching thread
    contexts takes just 4 clock cycles. More information on the
    network processor hardware can be found in [12, 13].
    The IXP2400 has several levels of memory, listed in Table
    1. Unlike a general-purpose processor, the IXP2400
    has no hardware-managed memory hierarchy. Data in local
    memory can only move to and from DRAM (or SRAM)
    with an explicit sequence of microengine instructions. This
    explicit control allows the programmer to manage ‘cached’
    data structures and know they will remain in low-latency
    storage. The software-managed memory makes programming
    more difficult, particularly for algorithms with complex
    control flow. Our experience shows that relatively few
    data structures need to move between levels of memory, and
    memory access is otherwise straightforward.
    Table 1: IXP2400 Memory Hierarchy

        Type           Size              Latency (cycles)
        DRAM           1GB               > 120
        SRAM           128MB             > 90
        Scratchpad     16KB (on-chip)    > 60
        Local Memory   2560B (per-ME)    3
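    To illustrate what the explicit movement looks like, the sketch below copies a page's slot array from DRAM into per-microengine local memory before use. This is a C sketch only: the intrinsic names me_dram_read and me_wait_signal are hypothetical stand-ins, not the actual Intel microengine C API.

        /* Hypothetical stand-ins for the microengine's explicit memory
         * operations -- NOT the real Intel microengine C intrinsics. */
        extern void me_dram_read(void *dst, unsigned long dram_addr,
                                 unsigned bytes);  /* async DRAM -> local */
        extern void me_wait_signal(void);          /* yield until done    */

        /* Per-ME local memory is only 2560B (Table 1), so a 'cached'
         * structure must be small; in exchange, nothing can evict it
         * behind the programmer's back. */
        static unsigned short slot_array[256];     /* lives in local memory */

        void cache_slot_array(unsigned long page_addr, unsigned nslots)
        {
            me_dram_read(slot_array, page_addr,
                         nslots * sizeof(unsigned short));
            me_wait_signal();  /* other threads run during the >120-cycle access */
        }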

    [Figure 3: Record layout on a page. The slot array points to each record's header (access 1), and the record header locates the attribute data (access 2).]
    3.1 Programming Model
    Each microengine uses its low-overhead context switching
    to hide memory access latency from individual threads.
    Thread execution is non-preemptive—a thread explicitly releases
    control of the microengine, and the next thread is chosen
    round robin from the set of ready threads. Figure 2 illustrates
    this style of multi-threaded execution. Threads wait
    for specific hardware signals from the memory controller or
    another peripheral, indicating that their operation has completed
    and the thread is ready to run.
    Although the programming model is somewhat unusual,
    the basics of parallel programming still apply. When mapping
    database operators to threads in a network processor,
    we desire independent threads that do not require synchronization.
    Where synchronization is required, we place these
    threads on the same microengine, wherever possible. The
    non-preemptive programming model means no explicit synchronization
    is required between threads on a single microengine.
    Only one thread runs at any time, and the programmer
    controls when a context switch occurs. Intra-microengine
    coordination is accomplished through global
    registers, which can be accessed in a single cycle and incur
    no overhead.
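    This execution model can be mimicked in ordinary C as a round-robin loop over per-thread state machines. The following is a sketch under the assumption that run_until_memory_access advances a thread to its next yield point; on the real hardware, this dispatch and the wake-up signaling are performed by the microengine itself.

        enum thread_state { READY, WAITING };

        struct me_thread {
            enum thread_state state;
            int step;             /* where this thread's operator resumes */
        };

        /* Hypothetical: runs one thread until it issues an off-chip
         * memory access, then returns. A signal handler (not shown)
         * flips the thread back to READY when the access completes. */
        extern void run_until_memory_access(struct me_thread *t);

        /* Non-preemptive round-robin dispatch: exactly one thread runs
         * at a time, so threads on the same microengine need no locks. */
        void microengine_dispatch(struct me_thread threads[], int n)
        {
            for (int i = 0; ; i = (i + 1) % n) {   /* round robin, forever */
                if (threads[i].state != READY)
                    continue;
                run_until_memory_access(&threads[i]);
                threads[i].state = WAITING;        /* sleeps on a hw signal */
            }
        }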
    In mapping common relational operators to the network
    processor, we find that data accesses can be statically mapped
    to one level in the memory hierarchy (listed in Table 1). For
    example, we know that individual record attributes in a sequential
    scan will have no temporal locality and can therefore
    be kept in DRAM. Metadata, such as page headers or
    lock tables, should be kept close to the processor in local memory
    or scratchpad space.
    The next section gives an explanation of how to decompose
    common database operators onto network processor
    threads.

    4. RUNNING DATABASE OPERATIONS
    In this paper, we study three fundamental database operators:
    a sequential scan, a clustered or non-clustered index
    scan, and a hash join. Prior work [3] used a similar set
    of operations to evaluate DBMS execution behavior. The
    sequential scan models linear, contiguous access to one relation.
    The non-clustered index scan represents random,
    non-clustered access over a single table. Two-table access
    patterns are modeled through an equijoin, implemented using
    the most common and challenging operator—a hash join.
    [Figure 4: Probing a hash table. The probe key is hashed to a bucket (1) and the bucket chain is walked (2, 3, 4), comparing key attributes at each cell.]
    The hash join models contiguous access to one table, and a
    dependent, random access to the second relation.
    4.1 Sequential Scan
    We model a sequential scan by iterating over all records
    on a set of pages, accessing one or more attributes in each
    record. We use slotted data pages with a layout derived from
    the Shore storage manager [7]. In general, records are variable
    length, and both a slot array and record header are used
    to find attribute data in the page. Figure 3 illustrates the sequence
    of operations in our sequential scan model. Reading
    an attribute requires at least accessing the slot array (labeled
    ‘1’ in the figure); accessing variable-length records, however,
    requires a second indirection through the record header (labeled
    ‘2’).
    In the network processor implementation, each page is
    scanned by one microengine, with each thread fetching data
    from one record. By assigning pages to each microengine, we
    avoid costly synchronization overhead while tracking completed
    record indices. Pages are assigned round-robin to
    distribute work.
    When a page is first accessed, the slot array is brought
    into the microengine’s local memory to reduce subsequent
    access latency. Threads access tuple data in 64-byte chunks,
    which are stored in local registers and processed accordingly.
    The evaluations in this paper use attributes smaller than
    the maximum 64-byte DRAM transfer, so only one memory
    access is required.
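    A sketch of one thread's share of the scan, reusing the hypothetical me_dram_read/me_wait_signal helpers from Section 3; the slotted-page field names and offsets are illustrative assumptions, not the exact Shore layout.

        #define ATTR_CHUNK 64  /* maximum single DRAM transfer, per the text */

        struct slot       { unsigned short rec_offset;  };  /* record offset */
        struct rec_header { unsigned short attr_offset; };  /* attr offset   */

        /* Fetch one attribute of record 'idx': (1) consult the slot
         * array already cached in local memory; (2) read the record
         * header from DRAM; then pull the attribute in one 64-byte
         * chunk. Each me_wait_signal() lets other contexts run. */
        void scan_one_record(const struct slot *slots, unsigned long page_addr,
                             int idx, unsigned char out[ATTR_CHUNK])
        {
            unsigned long rec = page_addr + slots[idx].rec_offset;  /* (1) */
            struct rec_header hdr;

            me_dram_read(&hdr, rec, sizeof hdr);                    /* (2) */
            me_wait_signal();
            me_dram_read(out, rec + hdr.attr_offset, ATTR_CHUNK);
            me_wait_signal();
        }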
    4.2 Index Scan
    Similarly to the sequential scan, we model an index scan
    by iterating over a set of pages, accessing one or more attributes
    in each record. In this case, however, we access a
    uniformly-distributed fraction of the records on each page.
    This model corresponds to a range scan where a list of pages
    is built from an index lookup, and then the pages are accessed
    sequentially. We assume the data access time dominates
    index lookup.
    The network processor implementation of the index scan
    follows directly from the sequential scan discussed above.
    The key performance impact of the index scan is the overhead
    of accessing the slot array. When accessing all records
    on a page, this initial cost is amortized. For low selectivities,
    however, the slot array access approaches the tuple access
    in cost.
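    The index-scan model differs from the sequential scan only in which records are visited. A minimal sketch, reusing scan_one_record from above, with rand() standing in for the index-driven selection:

        #include <stdlib.h>

        /* Visit a uniformly-distributed fraction of the records on a
         * page. The one-time cost of caching the slot array is spread
         * over fewer accesses as selectivity drops, which is why the
         * slot array dominates at low selectivity. */
        void index_scan_page(const struct slot *slots, unsigned long page_addr,
                             int nrecs, double selectivity,
                             unsigned char out[ATTR_CHUNK])
        {
            for (int i = 0; i < nrecs; i++)
                if ((double)rand() / RAND_MAX < selectivity)
                    scan_one_record(slots, page_addr, i, out);
        }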
    [Figure 5: Relative throughput in sequential scan for varying amounts of parallelism (1, 2, 4, and 8 microengines; threads per microengine on the x-axis), as compared to a 2.8GHz Pentium 4.]
    4.3 Hash Join
    We model the probe phase of a hash join, which for large
    relations will dominate time spent constructing the hash table
    itself. Figure 4 shows the sequence of data accesses when
    probing the hash table. Similar to the sequential scan, each
    page of the outer relation is assigned to one microengine in
    the network processor. Each thread in a microengine walks
    the hash table with a single record from the assigned page
    until a match is found.
    To minimize hash bucket length, we place the hash table
    header in DRAM to allow it to grow beyond the network
    processor’s limited on-chip storage. A general-purpose
    microprocessor with large on-chip cache has a distinct advantage
    here, as the frequently accessed bucket headers will
    likely be kept in cache. For modestly-sized outer relations,
    however, the network processor will offset the cost of accessing
    the header by overlapping the cost of pointer-chasing
    from many simultaneous probes.
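    A sketch of one thread's probe walk, again using the hypothetical DRAM helpers from Section 3 and the cell layout described in Section 5.4 (four key/pointer pairs per cell); the field names and widths are assumptions.

        #define PAIRS_PER_CELL 4

        /* One chain cell packs four (key, pointer) pairs, so a single
         * DRAM access inspects four candidates; 'next' is the DRAM
         * address of the following cell, with 0 ending the chain. */
        struct hash_cell {
            unsigned      key[PAIRS_PER_CELL];
            unsigned long tuple[PAIRS_PER_CELL];  /* matching tuple address */
            unsigned long next;
        };

        /* Walk the bucket chain for one probe key. Every
         * me_wait_signal() yields the microengine, so up to 8 probes
         * overlap their DRAM latency on one ME. Returns the tuple
         * address, or 0 if no match. */
        unsigned long probe_bucket(unsigned long bucket_addr, unsigned key)
        {
            struct hash_cell cell;
            for (unsigned long addr = bucket_addr; addr != 0; addr = cell.next) {
                me_dram_read(&cell, addr, sizeof cell);
                me_wait_signal();
                for (int i = 0; i < PAIRS_PER_CELL; i++)
                    if (cell.key[i] == key)
                        return cell.tuple[i];
            }
            return 0;
        }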
    5. EVALUATION
    In this section, we describe the methodology used in this
    work and our experimental results. We study three fundamental
    operators used in nearly all DBMS queries: sequential
    scan, index scan, and hash join. We discuss the
    implications of these results on other, similar query operators.
    5.1 Methodology
    The experiments presented in this paper use a Radisys
    ENP-2611 development board, which places an Intel IXP2400
    network processor and 256MB PC2100 DRAM on a PCI expansion
    card. The IXP2400 contains 8 microengines clocked
    at 600MHz, each with support for 8 thread contexts.
    Network processor DRAM is separate from the main memory
    on the host computer. When a microengine thread accesses
    DRAM, it accesses the local DRAM on the expansion
    card. Integrating a network processor with a full DBMS on a
    host processor requires transferring data from system memory
    to DRAM on the network processor expansion card. We
    do not model these overheads in this paper.
    [Figure 6: Throughput (Mtuples/sec) in index scan with varying selectivity, for the Pentium 4 and the network processor. The network processor result uses 8 microengines with 8 threads per microengine.]
    We use a 2.8GHz Pentium 4 Xeon with 512KB L2 cache
    and 3GB of PC2100 DRAM as the general-purpose reference
    platform. We disable Hyper-Threading in our experiments,
    because these operators are bound by the number of miss
    handlers (four in the Pentium 4 used). Experiments using
    Hyper-Threading on this processor showed no performance
    improvements. For the Pentium 4, the database operators
    were coded in C and compiled using gcc-3.2 with
    -O6 optimization.
    Experiments with the Intel C Compiler (icc)
    found no measurable difference. Where applicable, we report
    cache miss rates and branch predictor accuracy using
    the PAPI performance counter library.
    Our evaluations use the TPC-H dbgen tool to generate realistic
    data tables using a scale factor of 0.25 (250MB ASCII dataset).
    The scan operators use the orders table, while the join operator
    uses orders as the inner relation and the lineitem relation as
    the outer. We chose a 250MB dataset so that all data would fit in
    memory on the network processor expansion card.

    5.2 Sequential Scan
    We sequentially scan the records of the TPC-H orders table and
    extract customer keys from each record. Figure 5 shows the
    relative throughput for varying levels of
    thread parallelism, as compared to the 2.8GHz Pentium
    4 implementation. We observe that an aggregate total of
    four threads is required to (nearly) match the Pentium 4
    throughput. More than four threads on a single microengine
    produces diminishing returns as the bottleneck becomes instruction
    execution in the microengine’s simple pipeline. That
    is, threads remain in the
    ready state waiting for another
    thread to yield control of the pipeline.
    The number of miss handlers constrains the Pentium 4
    throughput, as only four L1D misses can be outstanding
    at any time. Because TPC-H relations consist of variable-length
    records, the hardware stride prefetcher reduces L2 misses by just
    10%. Each record access incurs one off-chip miss; however, the
    sequential scan consumes just over one-third of the available
    memory bandwidth with hardware
    prefetching turned off. The Pentium 4 is unable to expose
    sufficient memory-level parallelism to hide the long off-chip
    access latency.

    [Figure 7: Branch predictor accuracy for index scan, as selectivity varies from 0% to 100%.]
    The peak speedup of 1.9X is obtained with an aggregate
    of 16 threads on the network processor. Note, however, that
    the bottleneck with 16 threads is now the memory controller,
    which limits the number of simultaneous outstanding accesses
    to 16. Adding more than 16 threads produces no improvement,
    whether through more microengines or increasing
    the number of threads per microengine. Alleviating this
    bottleneck requires a more aggressive memory controller or
    increasing the number of memory controllers and distributing
    accesses across them. The Intel IXP2800 takes the latter
    approach, using three Rambus DRAM channels and distributing
    all DRAM requests across the three channels.
    5.3 Index Scan
    The index scan also iterates over the orders table, but
    now tuples are selected at random within each page. Figure
    6 illustrates the results, where the network processor
    throughput is obtained from 8 microengines with 8 threads
    per microengine. Results were identical using any configuration
    providing at least 16 threads.
    We see a sharp drop in throughput on the network processor
    as selectivity decreases, because of the relative increase
    in overhead of accessing the slot array. As fewer tuples are
    fetched from a page, the initial slot array access becomes
    more expensive. The shape of the Pentium 4 curve differs
    due to branch misprediction effects. As selectivity approaches
    50%, branch mispredictions peak, as illustrated in
    Figure 7. The cost of instruction rollback on a misprediction
    is not the primary source of performance overhead. Rather,
    the lost opportunity for look-ahead and prefetching of future
    tuples is the major overhead in the non-clustered range
    scan. Prior work indicates a significant fraction of time in
    DBMS execution is due to branch mispredictions [3].

    5.4 Hash Join
    We model a ‘simple’ hash join where the inner relation fits
    in a single hash table in memory. The results here are applicable
    to partition-based hash joins, such as GRACE [16],
    which repetitively execute simple hash joins over each partition.
    In our model, elements in the hash table contain order
    keys from the TPC-H orders relation. We use 32K hash buckets to
    restrict the average bucket depth to 2.08. This configuration was
    intentionally biased towards the Pentium 4 to make our analysis
    conservative. Using fewer buckets increases the length of each
    bucket chain and improves the network processor's relative
    performance. Each cell in the hash bucket chain contains four
    (key, pointer) pairs to minimize memory accesses in both the
    network processor and Pentium 4 implementations.

    [Figure 8: Relative throughput in probe of hash table for varying amounts of parallelism (1, 2, 4, and 8 microengines; threads per microengine on the x-axis), as compared to a 2.8GHz Pentium 4.]
    Figure 8 shows the relative join throughput as the number
    of microengines and threads varies. The data-dependent
    branches and cache misses on the Pentium 4 implementation
    have a significant impact on single-thread performance.
    Given a total of 4 threads, the network processor meets or
    outperforms the Pentium 4. As in the sequential and index
    scans, the limited parallelism in the memory controller of
    the IXP2400 limits speedup beyond 2.5X.
    The thread-level parallelism and memory-controller bottleneck
    on the IXP2400 account for the similarity in Figures
    5 and 8. In these operations, the fundamental performance
    improvement comes from issuing more parallel requests
    to memory on the network processor. The relative
    performance in the hash join is higher than the sequential
    scan because the data-dependent, random memory access
    patterns hamper the Pentium 4’s ability to expose parallel
    memory accesses.

    6. CONCLUSIONS
    Memory access stalls dominate DBMS execution time,
    and the continued scaling of processor clock frequency only
    exacerbates the problem. We demonstrate that existing network
    processors, with their multi-core, multi-threaded architecture,
    can expose more parallel accesses to memory than
    a conventional, single-threaded processor. We study the implementation
    of three fundamental database operators: sequential
    scan, index scan, and hash join. We demonstrate
    speedups ranging from 1.9X for the sequential scan to 2.5X
    for the hash join. With selectivity less than 50%, the network
    processor outperforms a 2.8GHz Pentium 4 by more
    than 3X in an index scan.
    7. ACKNOWLEDGEMENTS
    The authors would like to thank Jared Smolens for help
    with early versions of this work, Minglong Shao for help with
    Shore, and members of the Carnegie Mellon Impetus group
    and the anonymous reviewers for their feedback on earlier
    drafts of this paper. This work was partially supported by
    grants and equipment donations from Intel Corporation and
    a graduate fellowship from the Department of Defense.

  2. #2

  3. This user has thanked ***Spring*** for this useful post.

