
Sunday, January 9, 2011

Cache Miss

3 Kinds of Cache Miss

  • Compulsory
  • Capacity
  • Conflict

Reduce Miss Rate

  • Larger Block Size — compulsory
  • Higher Associativity — conflict
  • Victim Cache — conflict
  • HW Prefetching Instruction, Data — compulsory
  • SW Prefetching Data — compulsory
  • Compiler Optimizations

Reduce Miss Penalty

  • Read priority over write on miss
  • Early Restart and Critical Word First on miss
  • Non-blocking Caches (Hit Under Miss)
  • Multi-level Caches

Cache Bus

Cache Size is 64KB, Block size is 32B and the cache is Two-Way Set Associative. For a 32-bit physical address, give the division between Block Offset, Index and Tag.
Number of Blocks:
64KB / 32B = 2048 blocks
Index:
2-way set associative -> 2048 / 2 = 1024 sets -> 10 bits for index
Block Offset:
32B block -> 5 bits for block offset
Tag:
32 - 10 - 5 = 17 bits for tag
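
As a cross-check, here is a minimal C sketch of the same arithmetic (the parameters are the ones given in the question; log2u is just a helper defined here):

#include <stdio.h>

static unsigned log2u(unsigned x)        /* helper: log2 of a power of two */
{
    unsigned bits = 0;
    while (x > 1) { x >>= 1; bits++; }
    return bits;
}

int main(void)
{
    unsigned cache_size  = 64 * 1024;    /* 64KB cache              */
    unsigned block_size  = 32;           /* 32B blocks              */
    unsigned ways        = 2;            /* two-way set associative */
    unsigned addr_bits   = 32;           /* 32-bit physical address */

    unsigned blocks      = cache_size / block_size;              /* 2048 */
    unsigned sets        = blocks / ways;                        /* 1024 */
    unsigned offset_bits = log2u(block_size);                    /* 5    */
    unsigned index_bits  = log2u(sets);                          /* 10   */
    unsigned tag_bits    = addr_bits - index_bits - offset_bits; /* 17   */

    printf("offset = %u, index = %u, tag = %u\n",
           offset_bits, index_bits, tag_bits);
    return 0;
}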

Snooping vs Snarfing

Snooping is the process where the individual caches monitor address lines for accesses to memory locations that they have cached. When a write operation is observed to a location that a cache has a copy of, the cache controller invalidates its own copy of the snooped memory location.
Snarfing is where a cache controller watches both address and data, in an attempt to update its own copy of a memory location when a second master modifies that location in main memory.

Write Back vs Write Through

Write-Through cache: once the write is done in the cache, the main memory is also updated immediately, to maintain data integrity.
Write-Back cache: once the write is done in the cache, a flag known as the dirty bit is set. The data is written back to main memory later, after a period of time or when the block is evicted from the cache.
The differences between write-through and write-back caches come down to two factors: performance and integrity of data. Write-through has better integrity, but it performs a memory write for every store. Write-back holds the write until the cache line has to be reused, so it provides better performance by saving many memory write cycles.
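
A minimal sketch of the two policies for a single hypothetical cache line (cache_data, dirty, and mem_write are illustrative names, not any real API):

static unsigned cache_data;              /* the cached copy of one word              */
static int      dirty;                   /* write-back only: line is stale in memory */

extern void mem_write(unsigned addr, unsigned value);  /* assumed slow memory write  */

/* Write-through: update the cache and main memory on every store. */
void store_write_through(unsigned addr, unsigned value)
{
    cache_data = value;
    mem_write(addr, value);              /* memory is always up to date   */
}

/* Write-back: update only the cache and set the dirty bit;
   memory is written later, when the line is evicted. */
void store_write_back(unsigned value)
{
    cache_data = value;
    dirty = 1;
}

void evict_write_back(unsigned addr)
{
    if (dirty) {
        mem_write(addr, cache_data);     /* flush the dirty line once     */
        dirty = 0;
    }
}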

Memory Mapped IO vs Port IO

Memory-mapped I/O (MMIO) and port I/O (also called port-mapped I/O or PMIO) are two complementary methods of performing input/output between the CPU and peripheral devices in a computer.
Memory-mapped I/O (not to be confused with memory-mapped file I/O) uses the same address bus to address both memory and I/O devices, and the CPU instructions used to access memory are also used for accessing devices. To accommodate the I/O devices, areas of the CPU's addressable space must be reserved for I/O. The reservation might be temporary (the Commodore 64 could bank switch between its I/O devices and regular memory) or permanent. Each I/O device monitors the CPU's address bus and responds to any CPU access of its assigned address space, connecting the data bus to the desired device's hardware register.
Port-mapped I/O uses a special class of CPU instructions specifically for performing I/O. This is found, for example, on Intel microprocessors, with the IN and OUT instructions, which can read and write one, two, or four bytes (outb, outw, outl) to an I/O device. I/O devices have a separate address space from general memory, accomplished either by an extra "I/O" pin on the CPU's physical interface or by an entire bus dedicated to I/O.
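
A hedged C sketch of the two styles. The MMIO register address 0x40000000 and the port numbers 0x70/0x71 are made-up examples; outb/inb/ioperm are the Linux/x86 helpers from <sys/io.h> and need root privileges:

#include <stdint.h>
#include <sys/io.h>                     /* outb/inb/ioperm on Linux/x86   */

/* Memory-mapped I/O: the device register is just an address.
   0x40000000 is only an illustrative address. */
#define DEVICE_REG (*(volatile uint32_t *)0x40000000u)

void mmio_example(void)
{
    uint32_t status;
    DEVICE_REG = 0x1;                   /* an ordinary store reaches the device */
    status = DEVICE_REG;                /* an ordinary load reads the device    */
    (void)status;
}

/* Port-mapped I/O: a separate address space reached with IN/OUT.
   Ports 0x70/0x71 are only illustrative numbers. */
void pio_example(void)
{
    uint8_t value;
    ioperm(0x70, 2, 1);                 /* ask the kernel for access to the ports */
    outb(0x0A, 0x70);                   /* OUT instruction: write one byte        */
    value = inb(0x71);                  /* IN instruction: read one byte          */
    (void)value;
}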

DMA

DMA Advantage

DMA is a hardware mechanism that handles data transfers between peripheral devices and memory. Computers that have DMA can transfer data to and from devices with much less CPU overhead than computers without it.
Without DMA, using programmed input/output (PIO) for communication with peripheral devices, or load/store instructions in the case of multicore chips (with a polled waiting loop or interrupt-driven I/O), the CPU is typically fully occupied for the entire duration of the read or write operation and is thus unavailable to perform other work. With DMA, the CPU initiates the transfer, does other work while the transfer is in progress, and receives an interrupt from the DMA controller once the operation is done. This is especially useful in real-time computing applications where not stalling behind concurrent operations is critical, and in various forms of stream processing where data processing and transfer must proceed in parallel to achieve sufficient throughput.
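
A hedged sketch of that flow in C, using made-up DMA controller register names (DMA_SRC, DMA_DST, DMA_LEN, DMA_CTRL); the real register layout depends entirely on the hardware:

#include <stdint.h>

/* Hypothetical DMA controller registers; addresses and bits are placeholders. */
#define DMA_SRC   (*(volatile uint32_t *)0x50000000u)
#define DMA_DST   (*(volatile uint32_t *)0x50000004u)
#define DMA_LEN   (*(volatile uint32_t *)0x50000008u)
#define DMA_CTRL  (*(volatile uint32_t *)0x5000000Cu)
#define DMA_START 0x1u

static volatile int dma_done;           /* set by the DMA-complete ISR */

void dma_complete_isr(void)             /* assumed to be attached to the DMA interrupt */
{
    dma_done = 1;
}

void copy_with_dma(uint32_t src, uint32_t dst, uint32_t len)
{
    dma_done = 0;
    DMA_SRC  = src;                     /* 1. program the transfer              */
    DMA_DST  = dst;
    DMA_LEN  = len;
    DMA_CTRL = DMA_START;               /* 2. start it; the CPU is now free     */

    /* ... do other useful work here while the DMA runs ... */

    while (!dma_done)                   /* 3. only wait when the data is needed */
        ;
}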

Single Mode & Burst Mode

Traditional Synchronous DMA moves a byte or word at a time between system memory and a peripheral, handshaking with the I/O port for each transfer. This sort of transfer recognizes that the port may not always be in a ready condition; the handshaking is a hardware mechanism to throttle the transactions.
With this sort of transfer, the program sets up the controller and then carries on, oblivious to the state of the DMA transaction. The hardware moves one byte or word between memory and I/O each time the I/O port signals it is ready for another transaction. On each read indication, the DMA controller asserts Bus Request, waits for a Bus Acknowledge in response, and then takes over the bus for a single cycle. Then, the DMA controller goes idle again, waiting for another ready signal from the port. Thus, the program and DMA cycles share bus cycles, with the controller winning any contest for control of the bus. Sometimes this is called "Cycle Stealing".
Burst Mode DMA, in contrast, generally assumes that the destination and source addresses can take transfers as fast as the controller can generate them. The program sets up the controller, and then (perhaps after a single ready indication from a port occurs), the entire source block is copied to the destination. The DMA controller gains exclusive access to the bus for the duration of the transfer, during which time the program is effectively shut down. Burst mode DMA can transfer data very rapidly indeed.


RISC vs CISC

The major difference is that RISC has fewer instructions, and each of them performs a simpler function, has a simpler format, and usually takes the same number of clock cycles, so we can apply pipelining to improve performance. On the other hand, CISC processors have a large number of different, complex instructions, and each may require a different number of clock cycles and a different number of bits. A RISC processor has better performance, but CISC code can have far fewer instructions than the equivalent RISC code. Moreover, CISC typically requires more hardware to support its powerful instructions and therefore has higher power consumption. For example, CISC may have dedicated hardware for multiplication, while RISC may combine shift and add instructions to achieve multiplication, as in the sketch below.
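
A small C sketch of that last point: multiplication synthesized from shift and add, the way a RISC core without a hardware multiplier might do it in software:

#include <stdint.h>

uint32_t mul_shift_add(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    while (b != 0) {
        if (b & 1u)                     /* if the current bit of b is set... */
            result += a;                /* ...add the shifted multiplicand   */
        a <<= 1;                        /* shift the multiplicand left       */
        b >>= 1;                        /* move to the next bit of b         */
    }
    return result;
}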

Pipeline

Multiple Cycle CPU

As its name implies, the multi-cycle CPU requires multiple cycles to execute a single instruction. This means that our CPI will be greater than 1.
The big advantage of the multi-cycle design is that we can use more or fewer cycles to execute each instruction, depending on its complexity. For example, we can take five cycles to execute a load instruction, but only three cycles to execute a branch instruction. The big disadvantage of the multi-cycle design is increased complexity: control is now a finite state machine, whereas before it was just combinational logic.
Another important difference between the single-cycle and multi-cycle designs is the cycle time. In the single-cycle processor, the cycle time was determined by the slowest instruction. In the multi-cycle design, the cycle time is determined by the slowest functional unit (memory, registers, ALU). This greatly reduces the cycle time.

Outline

This outline describes what happens in each cycle of our multi-cycle CPU. All the events described in each numbered item take place in one clock cycle.
1) Instruction fetch: load IR with the instruction at PC, and load PC with PC + 4. If it is not a branch instruction, PC + 4 will be used to fetch the instruction in the next clock cycle. If it is a branch instruction, taken or not taken is resolved after two clock cycles (after the ALU); at the third cycle, PC + 4 or the branch address is selected to fetch the next instruction.
2) Instruction decode, read registers: parse the instruction, load registers A and B with values from the register file, and load ALUOut with the target address of the branch.
3) Execute: if executing a load or store, perform the effective address computation and put the result in ALUOut. For arithmetic instructions, load ALUOut with the result of the appropriate computation. For a beq instruction, if the ALU result of A - B is zero, load PC with the value in ALUOut. If executing a beq, we are done; return to step 1.
4) Memory: if executing a load, load MDR with the data at address ALUOut. If executing a store, write the data in register B into memory at address ALUOut. If executing an arithmetic instruction, write the value in ALUOut into the register file. If executing a store or an arithmetic instruction, we are done; return to step 1.
5) Writeback: if we are here, we are executing a load instruction. Write the value in MDR into the register file, and return to step 1.

Pipeline Performance

Ideally, an n-stage pipeline gives an n-times performance improvement over a non-pipelined design. In practice this is rarely achieved, because:
1) Uneven pipeline stages: we can rarely divide the work into stages with equal execution time.
2) Hazards: we cannot always keep the pipeline full.
3) Latch overhead between pipeline stages.
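
As a rough model (ignoring the three effects above and assuming one instruction completes per cycle once the pipeline is full), the ideal speedup of an n-stage pipeline executing N instructions over a non-pipelined design is:

\text{Speedup} = \frac{T_{\text{non-pipelined}}}{T_{\text{pipelined}}} = \frac{N \cdot n}{n + (N - 1)} \longrightarrow n \quad \text{as } N \to \infty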


Hazards

A hazard is a potential problem that can cause a pipelined processor to produce an incorrect result. There are typically three types of hazards: data hazards, structural hazards, and branch hazards (control hazards).

Data Hazard

A data hazard can happen when a CPU tries to simultaneously execute multiple instructions that exhibit data dependence.
RAW - Read After Write
A RAW Data Hazard refers to a situation where we refer to a result that has not yet been calculated, for example:
i1. R2 <- R1 + R3
i2. R4 <- R2 + R3
The first instruction is calculating a value to be saved in register 2, and the second is going to use this value to compute a result for register 4. However, in a pipeline, when we fetch the operands for the 2nd operation, the results from the first will not yet have been saved, and hence we have a data dependency.
We say that there is a data dependency with instruction 2, as it is dependent on the completion of instruction 1.
WAR - Write After Read
A WAR Data Hazard represents a problem with concurrent execution, for example:
i1. R4 <- R1 + R3
i2. R3 <- R1 + R2
If we are in a situation that there is a chance that i2 may be completed before i1 (i.e. with concurrent execution) we must ensure that we do not store the result of register 3 before i1 has had a chance to fetch the operands.
WAW - Write After Write
A WAW Data Hazard is another situation which may occur in a concurrent execution environment, for example:
i1. R2 <- R1 + R2
i2. R2 <- R4 + R7
We must delay the WB (Write Back) of i2 until i1 has written its result.
Solutions
1) Data forwarding: some data hazards cannot be solved by forwarding alone, e.g. the load-use case of the lw instruction in the MIPS architecture.
2) Pipeline stall: e.g. for the lw load-use case.
3) Instruction scheduling: the compiler does this. Ex.
lw R1, 1000(R2)
lw R3, 2000(R2)
add R4, R1, R3
lw R1, 3000(R2)

add R6, R4, R1
sw R6, 1000(R2)
change to
lw R1, 1000(R2)
lw R3, 2000(R2)
lw Rn, 3000(R2) // moved up past the add; R1 is renamed to Rn so it does not clobber the first add's operand
add R4, R1, R3

add R6, R4, Rn

sw R6, 1000(R2)

Structural Hazard

A structural hazard occurs when a part of the processor's hardware is needed by two or more instructions at the same time. A structural hazard might occur, for instance, if a program were to execute a branch instruction followed by a computation instruction. Because they are executed in parallel, and because branching is typically slow (requiring a comparison, program counter-related computation, and writing to registers), it is quite possible (depending on architecture) that the computation instruction and the branch instruction will both require the ALU (arithmetic logic unit) at the same time.

Branch Hazard

Branching hazards (also known as control hazards) occur when the processor is told to branch - i.e., if a certain condition is true, then jump from one part of the instruction stream to another - not necessarily to the next instruction sequentially. In such a case, the processor cannot tell in advance whether it should process the next instruction (when it may instead have to move to a distant instruction).
This can result in the processor doing unwanted actions.
1) Stall the pipeline: until we know whether the branch is taken or not taken.
2) Predict taken or not taken: and fill the pipeline with the corresponding instructions. The compiler can mark a branch as likely taken or unlikely taken; if the prediction turns out wrong, the CPU cancels (nullifies) the delay-slot instructions.
3) Delayed branch: fill the delay slot with an instruction that is valid whether the branch is taken or not. The compiler does this.
4) Hardware branch predictor: this is dynamic branch prediction; static branch prediction is done by the compiler. A sketch of a simple predictor follows.
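
A minimal C sketch of the classic 2-bit saturating-counter predictor that such hardware often uses (the table size and PC indexing below are illustrative choices):

#include <stdint.h>

#define BHT_SIZE 1024                   /* illustrative branch history table size */

/* One 2-bit saturating counter per entry:
   0,1 = predict not taken; 2,3 = predict taken. */
static uint8_t bht[BHT_SIZE];

int predict_taken(uint32_t pc)
{
    return bht[(pc >> 2) % BHT_SIZE] >= 2;     /* index by PC, ignore byte offset */
}

void update_predictor(uint32_t pc, int taken)
{
    uint8_t *ctr = &bht[(pc >> 2) % BHT_SIZE];
    if (taken && *ctr < 3)
        (*ctr)++;                              /* strengthen the "taken" state     */
    else if (!taken && *ctr > 0)
        (*ctr)--;                              /* strengthen the "not taken" state */
}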


Harvard vs Von Neumann

Difference

With a Harvard architecture, there are two separate memory spaces, one for instructions and one for data. We can increase throughput because while we are executing one instruction, we can be fetching the next instruction. Another advantage is that the instruction bus and the data bus can have different widths.


Reentrant Function

Definition
A reentrant function is one that can be used by more than one task concurrently without fear of data corruption. Conversely, a non-reentrant function is one that cannot be shared by more than one task unless mutual exclusion to the function is ensured, either by using a semaphore or by disabling interrupts during critical sections of code. A reentrant function can be interrupted at any time and resumed at a later time without loss of data. Reentrant functions either use local variables or protect their data when global variables are used.
A reentrant function:
1) Does not hold static data over successive calls
2) Does not return a pointer to static data; all data is provided by the caller of the function
3) Uses local data, or ensures protection of global data by making a local copy of it
4) Must not call any non-reentrant functions
Floating Point Reentrancy for the Keil C51 C compiler
The floating point operations for which the compiler generates code (+, -, *, /) are fully reentrant, but only a few functions in math.h are reentrant.
Those that are not reentrant must be protected from interrupts. One way this can be accomplished is to put a wrapper that disables and re-enables interrupts around the floating point math calls. For example:
#pragma disable        // Disable Interrupts for this function
float ISR_sin (float x)
{
    return (sin(x));
}
Examples
void func(int* x, int* y)
{
    int tmp;
    tmp=*x;
    *x=*y;
    *y=tmp;
}

Is the function above reentrant?
Yes, it is reentrant: it uses only local variables and data provided by the caller.
Non-Reentrant to Reentrant
Non-reentrant version
char* FileName(char *name)
{
    static char fname[13];
    sprintf(fname, "%s", name);       // can't write fname = name
    return fname;
}
Reentrant versions
char* FileName1(char *name, char *fname_buf)      // use thread's own buffer
{
    sprintf(fname_buf, "%s", name);          // can write fname_buf = name
    return fname_buf;
}
char* FileName2(char *name)
{
    char *fname_buf = (char*)malloc(13);       // malloc will allocate memory every call
    sprintf(fname_buf, "%s", name);       // can write fname_buf = name
    return fname_buf;
}

The first version is better, because the second requires the caller to call free() to release the memory, and allocating memory on every call takes more time.

Inter Process Communication

Advantages of process cooperation
  • Information sharing
  • Computation speed-up
  • Modularity
  • Convenience
Disadvantages of process cooperation
  • Data corruption, deadlocks, increased complexity
  • Requires processes to synchronize their processing

Event Flags

Shared Memory

If the threads are in different processes, we need to request a segment of shared memory from the kernel, and the processes communicate by reading and writing data in that segment.
If the threads are in the same process, there are several kinds of shared memory:
1) Shared global variables: global variables visible to the source code of multiple threads.
2) Shared private data: private variables whose addresses are given to other threads.
3) Static variables: all static variables within a shared function become shared data. All functions called from a shared function are also shared functions.
With shared memory, synchronization must be handled by the processes/threads themselves to avoid data corruption.
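
For the cross-process case, a minimal sketch using the POSIX shm_open/mmap interface (the name "/demo_shm" and the 4096-byte size are arbitrary choices; link with -lrt on older glibc):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t size = 4096;                            /* arbitrary segment size */

    /* Request a named shared memory segment from the kernel. */
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0666);
    if (fd < 0) { perror("shm_open"); return 1; }
    ftruncate(fd, size);

    /* Map it into this process; another process mapping the same
       name sees exactly the same bytes. */
    char *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    sprintf(p, "hello from pid %d", (int)getpid());      /* write into shared memory */

    munmap(p, size);
    close(fd);
    /* shm_unlink("/demo_shm") removes the segment once everyone is done. */
    return 0;
}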

Message Passing

In message passing, there is no shared memory between processes. Using kernel functions, the kernel copies data from the source process into kernel space, and then copies the data from kernel space into the destination process. There are basically two operations, send(pid, message) and receive(pid, message). Synchronization and data protection are handled by the kernel.
Blocking & Non-Blocking
  • Blocking Send — sender blocked until message received by mailbox or process
  • Nonblocking Send — sender resumes operation immediately after sending
  • Blocking Receive — receiver blocks until a message is available
  • Nonblocking Receive — receiver returns immediately with either a valid or null message.
Buffering
All messaging systems require a framework to temporarily buffer messages. These queues are implemented in one of three ways:
  • Zero Capacity — No messages may be queued within the link; the sender must block until the receiver retrieves the message.
  • Bounded Capacity — The link has a finite number of message buffers. If no buffers are available (the buffer is full), the sender must block until one is freed.
  • Unbounded Capacity — The link has unlimited buffer space, so the sender never needs to block. It can be approximated by a linked list, although it still has a limit (when the heap is exhausted).
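
A minimal sketch of kernel-mediated message passing using a pipe: write() copies the message into kernel space and read() copies it out in the receiving process:

#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    pipe(fds);                                   /* kernel-managed channel       */

    if (fork() == 0) {                           /* child acts as the receiver   */
        char buf[64];
        ssize_t n;
        close(fds[1]);
        n = read(fds[0], buf, sizeof buf - 1);   /* blocking receive             */
        if (n > 0) {
            buf[n] = '\0';
            printf("child received: %s\n", buf);
        }
        _exit(0);
    }

    close(fds[0]);                               /* parent acts as the sender    */
    write(fds[1], "hello", strlen("hello"));     /* kernel copies the data       */
    close(fds[1]);
    wait(NULL);
    return 0;
}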

Polled Waiting vs Interrupt Driven

Real-time systems are said to be “event-driven”, meaning that a primary function of the system is to respond to “events” that occur in the system’s environment. How does the program detect an event happening? There are two fundamental approaches:

Polled Waiting

The program begins with some initialization and then enters an infinite loop that tests each possible event to which the system must respond. For each event that is set, the program invokes the appropriate servicing function.
#define TRUE 1

/* Event flags, typically set by ISRs or read from hardware status registers. */
volatile int event_1, event_2, event_n;

int main (void)
{
    sys_init();
    while (TRUE)
    {
        if (event_1)
            service_event_1();
        if (event_2)
            service_event_2();
        // ........
        if (event_n)
            service_event_n();
    }
}

When to use

  • Fast devices: for devices which input or output data at very high speed, the loop will not spin uselessly for many iterations, so polling does not introduce much overhead. With interrupt-driven I/O, by contrast, the overall overhead of handling so many interrupts could be very high.
  • Periodic devices: we can let the task that runs the polled waiting loop sleep for a period of time and wake it up when the data is about ready.

Disadvantage

  • The response time to an event varies widely depending on where in the loop the program is when the event occurs. For example, if event_1 occurs immediately before the if (event_1) statement is executed, the response time is very short. However if it occurs immediately after the test, the program must go through the entire loop before servicing event_1.
  • For slow devices, busy waiting introduces huge overhead.
  • All events are treated as having equal priority (if all events are tested in the same process).
  • As new features, and hence new events, are added to the system, the loop gets longer and so does the response time.
  • The response time is also a function of how many events happen to be set at the same time and consequently get serviced in the same pass through the loop.

Interrupt Driven

When to use

  • Slow and non-periodic devices: for this kind of device, interrupt-driven I/O can achieve much higher efficiency than polled waiting.
  • Devices that require guaranteed short latency: polled waiting introduces variable latency.

Disadvantage

  • It has a large per-byte overhead, and therefore a low transfer rate when only one or a few bytes are transferred per interrupt. The CPU needs to: 1. push the program counter and flag register; 2. read the interrupt type code, access the interrupt vector table, and load the address of the ISR; 3. possibly push the values of all registers that will be used in the ISR; 4. when the ISR returns, restore the previous state by popping values from the stack back into the registers.
  • For the same reason, for high-speed devices the overhead introduced by interrupt-driven I/O becomes significant.
  • It increases complexity and introduces synchronization problems.
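
For example, a minimal interrupt-driven receive in the Keil C51 style used earlier (reg51.h, interrupt number 4, and the flag handling are standard 8051 conventions, shown here only as an illustrative sketch):

#include <reg51.h>                      /* Keil C51 SFR definitions for the 8051 */

static volatile unsigned char rx_byte;
static volatile unsigned char rx_ready = 0;   /* set by the ISR, cleared by main */

void serial_isr(void) interrupt 4       /* interrupt 4 is the 8051 UART vector   */
{
    if (RI) {
        RI = 0;
        rx_byte  = SBUF;                /* grab the received byte                */
        rx_ready = 1;                   /* tell the main loop an event happened  */
    }
}

void main(void)
{
    ES = 1;                             /* enable the serial interrupt           */
    EA = 1;                             /* enable interrupts globally            */
    while (1) {
        if (rx_ready) {
            rx_ready = 0;
            /* process rx_byte here */
        }
        /* the CPU is free for other work between interrupts */
    }
}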
