24 Aug 2016

Important Questions and Solutions - CS704

MSCS
Question 1:
The running program's memory access pattern
is: 0x0, 0x8, 0x10, 0x18, 0x20, 0x28
a) For a direct-mapped cache of size 1 KB and a block size of 8 bytes (2 words), how many sets are there?
b) With the same cache and block size, what is the miss rate of the direct-mapped cache?
c) Which of the following would decrease the miss rate the most?
i. Increasing the degree of associativity to 2.
ii. Increasing the block size to 16 bytes.

(a)
• Cache size = 1 KB = 1024 bytes

• Block size = 8 bytes
• Number of blocks = 1024 / 8 = 128
• In a direct-mapped cache each set holds one block, so the cache has 128 sets; if we instead make sets of 2 blocks (2-way set associative, as in part c-i), the number of sets = 128 / 2 = 64


(b) When we access 0x0 it is a miss.
The 8-byte block covering bytes 0-7 comes into the cache.
When we access 0x8 it is a miss because that block is not yet in the cache.
The 8-byte block covering bytes 8-15 comes into the cache.
When we access 0x10 (16 decimal) it is a miss because that block is not yet in the cache.
The 8-byte block covering bytes 16-23 comes into the cache.
When we access 0x18 (24 decimal) it is a miss because that block is not yet in the cache.
The 8-byte block covering bytes 24-31 comes into the cache.
When we access 0x20 (32 decimal) it is a miss because that block is not yet in the cache.
The 8-byte block covering bytes 32-39 comes into the cache.
When we access 0x28 (40 decimal) it is a miss because that block is not yet in the cache.
The 8-byte block covering bytes 40-47 comes into the cache.
Number of accesses = 6
Number of misses = 6
Miss rate = 100%


(c) Increasing the block size to 16 bytes. With 16-byte blocks, each block covers two consecutive addresses of the pattern (e.g., 0x0 and 0x8), so every other access hits and the miss rate drops to 50%. Increasing the associativity does not help, because all the misses in part (b) are compulsory misses, not conflict misses.
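
A minimal sketch (not part of the original answer) that simulates this access pattern against a direct-mapped cache; with block_size = 8 it reports 6 misses out of 6 accesses, and changing block_size to 16 reproduces the 50% miss rate argued for part (c):

#include <stdio.h>

#define CACHE_SIZE 1024   /* 1 KB, as in the question */

int main(void) {
    unsigned addrs[] = {0x0, 0x8, 0x10, 0x18, 0x20, 0x28};
    int n = sizeof(addrs) / sizeof(addrs[0]);
    int block_size = 8;                         /* change to 16 for part (c)   */
    int num_blocks = CACHE_SIZE / block_size;
    long tags[CACHE_SIZE / 8];                  /* one tag per cache block     */
    int misses = 0;

    for (int i = 0; i < num_blocks; i++) tags[i] = -1;   /* empty cache        */

    for (int i = 0; i < n; i++) {
        unsigned block_addr = addrs[i] / block_size;     /* strip block offset */
        unsigned index = block_addr % num_blocks;        /* direct-mapped index */
        long tag = block_addr / num_blocks;
        if (tags[index] != tag) { misses++; tags[index] = tag; }
    }
    printf("%d misses out of %d accesses\n", misses, n);
    return 0;
}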

_______________________________

Question 2: In the following code (15)

for (i = 2; i < 100; i = i + 1)
{
    a[i]   = b[i] + a[i];    /* S1 */
    c[i-1] = a[i] + d[i];    /* S2 */
    a[i-1] = 2 * b[i];       /* S3 */
    b[i+1] = 2 * b[i];       /* S4 */
}
I. List all dependences (output, anti, and true).
II. Indicate whether the true dependences are loop carried or not.

III. Why is the loop not parallel?

Solution:
I))
There are a total of six dependences in the loop:

1. There is an anti-dependence from S1 to S1 on a:   a[i] = b[i] + a[i]   /* S1 */

2. There is a true dependence from S1 to S2 on a:

   a[i] = b[i] + a[i]     /* S1 */

   c[i-1] = a[i] + d[i]   /* S2 */
   Hence, the value of a used in S2 depends on the result of a computed in S1.

3. There is a loop-carried true dependence from S4 to S1 on b.

4. There is a loop-carried true dependence from S4 to S3 on b.

5. There is a loop-carried true dependence from S4 to S4 on b.

6. There is a loop-carried output dependence from S1 to S3 on a.

II))  For a loop to be parallel, each iteration must be independent of all others. The true dependence from S1 to S2 (dependence 2) stays within a single iteration, so it is not loop carried. Dependences 3, 4 and 5 (in part I), however, are loop-carried true dependences on b; they cannot be removed by renaming or any similar technique, and because of them the iterations of the loop are not independent.

III)) The loop-carried dependences mentioned in (II) imply that the loop is not parallel as written. The loop could only be made parallel by rewriting it as a functionally equivalent loop whose iterations are independent. A sketch of a first step in such a rewrite is given below.
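
As a hedged illustration (this particular rewrite is not given in the original answer), all the loop-carried true dependences come from the recurrence in S4, so a natural first step is to split that recurrence into its own serial loop; the remaining loop is still not parallel until the anti- and output dependences on a are also removed by renaming:

/* arrays a, b, c, d and index i as declared in the question */

/* Step 1: hoist the recurrence in S4 into its own (serial) loop.
   Because no other statement writes b, this produces exactly the same
   values of b as the original interleaving. */
for (i = 2; i < 100; i = i + 1)
    b[i+1] = 2 * b[i];              /* S4 */

/* Step 2: the remaining statements. NOTE: this loop is still NOT parallel
   as written, because S3 in iteration i+1 overwrites the a[i] that S1 and
   S2 read in iteration i (anti- and output dependences); removing those
   requires renaming a, e.g. writing S1's results into a temporary array. */
for (i = 2; i < 100; i = i + 1)
{
    a[i]   = b[i] + a[i];   /* S1 */
    c[i-1] = a[i] + d[i];   /* S2 */
    a[i-1] = 2 * b[i];      /* S3 */
}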

______________________________________________________

Question 3:
As caches increase in size, blocks often increase in size as well.
a) If a large instruction cache has larger blocks, is there still a need for prefetching? Explain the interaction between prefetching and increased block size in an instruction cache.
b) Is there still a need for data prefetch instructions when data blocks get larger?

Solution:
a)) Program basic blocks are often short (fewer than 10 instructions), so program execution does not follow sequential locations for very long. As blocks get larger, the program is more likely not to execute every instruction in the block, branching out early instead, which makes instruction prefetching less attractive.

b)) Data structures often comprise lengthy sequences of memory addresses, and programs often access a data structure with a sequential sweep. Large data blocks work well with such access patterns, and data prefetching is likely still of value for these highly sequential accesses.
_____________________________________

Q: Implementations of a directory protocol in which the whole directory is centralized in one location have problems scaling to more than a couple of hundred processors. What is the bottleneck in basic directory protocols and why does it occur? How might this bottleneck be removed?
The bottleneck is the directory accesses themselves. With a central directory, all memory requests must be checked in a single location, so as more processors are added the directory has to handle more and more requests. The directory can instead be distributed across the individual nodes: directory requests are then directed at the node responsible for tracking the block rather than at a central location, as sketched below.
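
A tiny illustrative sketch (the interleaving function is an assumption, not something stated in the answer) of how a distributed directory can assign each memory block a fixed home node that holds its directory entry:

#include <stdio.h>

#define NUM_NODES 256   /* e.g. a couple of hundred processors */

/* The node whose directory tracks a given memory block; here blocks are
   simply interleaved across the nodes by block address. */
unsigned home_node(unsigned long block_addr) {
    return (unsigned)(block_addr % NUM_NODES);
}

int main(void) {
    printf("block 0x12345 is tracked by node %u\n", home_node(0x12345UL));
    return 0;
}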

--------------------------------------------------
How many total bits are required for a direct-mapped cache with 16 KB of data, 4 word blocks, assuming a 32 bit address?
  • 16 KB = 4K words = 2^12 words
  • Block size of 4 words => 2^10 blocks
  • Each block has 4 x 32 = 128 bits of data + tag + valid bit
  • tag bits + valid bit = (32 - 10 - 2 - 2) + 1 = 18 + 1 = 19
  • Total cache size = 2^10*(128 + 19) = 2^10 * 147

  Therefore, 2^10 x 147 = 147 Kbits (about 18.4 KB) are needed for the cache
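
A small sketch (an assumed helper, not part of the original answer) that reproduces this arithmetic for a direct-mapped cache with a 32-bit address:

#include <stdio.h>

int main(void) {
    int addr_bits = 32;
    int blocks = 1 << 10;            /* 16 KB of data / 16-byte blocks = 2^10 blocks */
    int index_bits = 10;             /* log2(blocks)                                 */
    int block_offset_bits = 2;       /* log2(4 words per block)                      */
    int byte_offset_bits = 2;        /* log2(4 bytes per word)                       */

    int tag_bits = addr_bits - index_bits - block_offset_bits - byte_offset_bits;
    int bits_per_block = 4 * 32 + tag_bits + 1;           /* data + tag + valid bit  */
    long total_bits = (long)blocks * bits_per_block;

    printf("tag = %d bits, %d bits/block, total = %ld bits (%.1f KB)\n",
           tag_bits, bits_per_block, total_bits, total_bits / 8.0 / 1024.0);
    return 0;
}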
_________________________________
Q: How is redundancy achieved in a RAID system? 
Ans:
There is no redundancy in RAID 0. For RAID 1, redundancy is achieved by having two identical copies of all data. For higher levels, redundancy is achieved by the use of error-correcting codes.

It is also achieved through the use of parity bits or, as with RAID 2, a Hamming code. This information can be spread across all the disks or, as with RAID 4, kept on a single dedicated disk. RAID 1 uses mirroring to protect data (the same data is kept on each disk). A small parity sketch is given below.
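
A minimal sketch (illustrative only, not from the original answer) of RAID 4/5-style parity: the parity block is the XOR of the data blocks, so any single failed block can be rebuilt from the survivors:

#include <stdio.h>

int main(void) {
    unsigned char d0 = 0xA5, d1 = 0x3C, d2 = 0x0F;   /* data blocks (1 byte each here)  */
    unsigned char parity = d0 ^ d1 ^ d2;             /* stored on the parity disk       */

    unsigned char rebuilt_d1 = d0 ^ d2 ^ parity;     /* recover d1 after its disk fails */
    printf("d1 = 0x%02X, rebuilt d1 = 0x%02X\n", d1, rebuilt_d1);
    return 0;
}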
____________________________________
Question 3: Determine the path from source address 110 to destination address 010 in a 3-stage omega network.



The switch connections are shown as red circles (in the accompanying figure).
Number of nodes: n = 8
Number of stages: 3
Number of switches per stage: 4
Total number of switches: 12
Number of crosspoints: 16

In the omega network, the 3-bit address a2 a1 a0 corresponds to the 3 stages of the network, S2, S1, S0, from left to right; the number of address bits equals the number of switch stages.
To find the connection pattern, XOR the source and destination addresses, e.g.,
Src (110) -> dest (010), XOR result:
100 -> Cross (S2), Straight (S1), Straight (S0)

Let us see how the switch at each stage operates to establish the connection.
For stage i:
IF   the source and destination differ in the i-th bit
THEN the connection crosses the switch in the i-th stage
ELSE the connection is straight in the i-th stage
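
A short sketch (illustrative only) of this routing rule: XOR the 3-bit source and destination; a 1 in bit position i sets the stage-Si switch to Cross, a 0 leaves it Straight:

#include <stdio.h>

void route(unsigned src, unsigned dst) {
    unsigned diff = src ^ dst;                   /* 110 ^ 010 = 100                  */
    for (int stage = 2; stage >= 0; stage--)     /* stages S2, S1, S0, left to right */
        printf("S%d: %s\n", stage,
               ((diff >> stage) & 1) ? "Cross" : "Straight");
}

int main(void) {
    route(6, 2);   /* 110 -> 010: prints S2: Cross, S1: Straight, S0: Straight */
    return 0;
}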

This is shown in the figure as:
- the path 110 -> 010 (red) and
- the path 010 -> 011 (blue)

These two paths experience blockage: the stage-S2 switch used by 010 has to wait until 110 has passed, otherwise a collision results.
____________________________________________________
2D Mesh and 2D Torus
  • A two-dimensional mesh is an example of an asymmetric network topology and uses bisection bandwidth as its performance metric
  • Here, the nodes (processors) are arranged in an array structure, forming a 2D mesh
  • An example 3x4 mesh structure is shown in the figure
  • A switch is associated with each (processor) node [shown as a blue circle]
  • Each switch has one port for the processor and four ports to interconnect the processor to its four nearest-neighbor nodes, i.e., the nodes in the left-right and up-down positions
  • This structure is sometimes also referred to as the NEWS communication pattern, representing North, East, West and South communication
  • Note that the switches in the top/bottom rows and the left/right columns do not connect among themselves, and thus have unused ports
  • Connecting the unused ports of the switches in the top/bottom rows and the left/right columns with wraparound links forms a 2D torus; a small sketch of the resulting neighbor computation follows
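
A tiny illustrative sketch (the dimensions are chosen to match the 3x4 example above, not taken from the original text) of the NEWS neighbors in a 2D torus, where the edge nodes wrap around instead of leaving switch ports unused:

#include <stdio.h>

#define ROWS 3
#define COLS 4

/* Print the North/South/East/West neighbors of node (r, c) in a torus;
   the modulo arithmetic provides the wraparound links. */
void torus_neighbors(int r, int c) {
    printf("(%d,%d): N=(%d,%d) S=(%d,%d) E=(%d,%d) W=(%d,%d)\n",
           r, c,
           (r - 1 + ROWS) % ROWS, c,        /* North (wraps from the top row)    */
           (r + 1) % ROWS, c,               /* South (wraps from the bottom row) */
           r, (c + 1) % COLS,               /* East  (wraps from the last col)   */
           r, (c - 1 + COLS) % COLS);       /* West  (wraps from the first col)  */
}

int main(void) {
    torus_neighbors(0, 0);   /* a corner node still has four neighbors via wraparound */
    return 0;
}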
____________________________________________
Q))
(a) How do hardware speculation techniques differ from the instruction-set-extension approach of making branches predictable at compile time?
(b) Discuss the poison-bit approach used to preserve exception behavior under speculation.

Solution(a)
Hardware speculation techniques
• Allow execution of an instruction dependent on a predicted-taken branch such that there are no consequences (including exceptions such as a memory violation) if the branch is not actually taken
• Further, they prevent a speculative instruction from causing exceptions that stop the program (i.e., a memory violation)
• This can be achieved if hardware support for speculation buffers the results and exceptions from an instruction until it is known that the instruction would actually execute

It combines three key ideas:
Dynamic Branch Prediction
o Dynamic branch prediction is used to choose which instruction to execute, i.e., the next in sequence or the branch target.
Speculation
o Speculate to allow the execution of the instructions before the control dependence is resolved.

Dynamic scheduling
o Dynamic scheduling to deal with the scheduling of different combinations of basic blocks.

Instruction Set Extension
• The extended instruction set, including conditional or predicated instructions,
o allows the compiler to group instructions across branches
o eliminates branches
o converts control dependences into data dependences
• These approaches are equally useful for hardware-intensive as well as software-intensive schemes, i.e., for dynamic as well as static scheduling

Solution(b)
A set of bits called "poison bits" is attached to the result registers written by speculated instructions when an instruction causes an exception.
• The poison bits cause a fault when a normal instruction attempts to use the register
• This approach tracks the exceptions as they occur but postpones any terminating exception until a value is actually used
• The scheme adds a poison bit to every register and another bit to every instruction to indicate whether the instruction is speculative
• The poison bit of the destination register is set whenever a speculative instruction results in a terminating exception
• All other exceptions are handled immediately
• If a speculative instruction uses a register whose poison bit is on, the destination register has its poison bit turned on as well

Now, if a normal instruction attempts to read a source register whose poison bit is on, the instruction causes a fault. A small sketch of this bookkeeping follows.
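
A minimal sketch (illustrative only; the register count, function names and fault handling are assumptions) of the poison-bit bookkeeping described above:

#include <stdio.h>
#include <stdbool.h>

#define NREGS 8

bool poison[NREGS];   /* one poison bit per register */

/* Model "dst <- f(src)": 'speculative' is the extra per-instruction bit,
   'faulted' models a terminating exception occurring during execution.  */
void execute(int dst, int src, bool speculative, bool faulted) {
    if (speculative) {
        /* Defer the exception: just mark (or propagate) the poison bit. */
        poison[dst] = faulted || poison[src];
    } else {
        if (poison[src]) {
            printf("fault: normal instruction read poisoned r%d\n", src);
            return;                       /* terminating exception raised here */
        }
        poison[dst] = false;              /* non-speculative exceptions would  */
                                          /* be handled immediately            */
    }
}

int main(void) {
    execute(1, 0, true,  true);    /* speculative load faults: r1 is poisoned */
    execute(2, 1, true,  false);   /* speculative use of r1 propagates to r2  */
    execute(3, 2, false, false);   /* normal use of r2 -> the fault is raised */
    return 0;
}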
_______________________________
Q1: Define multiprocessor cache coherence and explain it, considering three processors which see old values in their caches.
Multiprocessor Cache Coherence
• When shared data are cached the shared value may be replicated in multiple caches.
• This results in a reduction in access latency and fulfills the bandwidth requirements,
• but, due to difference in the communication for load/store and strategy to write in the caches, values in different caches may not be consistent, i.e.,
• There may be conflict (or inconsistency) for the shared data being read by the multiple processors simultaneously.
• This conflict or contention in the caching of shared data is referred to as the cache coherence problem
• Informally, we can say that memory system is coherent if any read of a data item returns the most recently written value of that data item.
• This definition contains two aspects of memory behavior:
Coherence, which defines what value can be returned by a read
Consistency, which determines when a written value will be returned by a read
Cache Coherency Problem
The processors P1, P2, P3 see old values in their caches because there exist several alternative ways of writing to the caches:
• For example, in write-back caches, value written back to memory depends on which cache flushes or writes back value (and when);
• i.e., value returned depends on the program order, program issue order or order of completion etc.
• The cache coherency problem exists even on uniprocessors, due to the interaction between caches and I/O devices; there it is infrequent, so software solutions work well
• However, the problem is performance-critical in multiprocessors where the order among multiple processes is crucial and needs to be treated as a basic hardware design issue
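
A toy sketch (an assumed scenario, not taken from the original text) of the situation described above: with private write-back caches and no coherence protocol, P2 and P3 keep reading a stale copy of X after P1 has written it:

#include <stdio.h>

int main(void) {
    int memory_X = 0;     /* shared location X in memory          */
    int cache[3];         /* private cached copies for P1, P2, P3 */

    for (int p = 0; p < 3; p++) cache[p] = memory_X;   /* everyone reads X = 0 */

    cache[0] = 1;         /* P1 writes X = 1; with write-back, memory  */
                          /* (and the other caches) are not updated    */

    printf("P2 reads %d, P3 reads %d (stale); memory still holds %d\n",
           cache[1], cache[2], memory_X);
    return 0;
}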

_______________________________________________
A)) What is a cluster SAN? Explain the performance challenges clusters face.

A SAN is basically the cluster interconnect.
System Area Network (SAN): the interconnection network of hundreds of nodes within the machine room, so the distance of a link is less than 100 meters.
A massively parallel machine providing high bandwidth can be built from off-the-shelf components connected this way, instead of depending on custom machines or networks.
B))
Clusters face some performance challenges, such as:
        Non-standard connections
        Division of memory
• Non-standard connections: Multiprocessors are usually connected by the memory bus, which offers high bandwidth and low latency; contrary to this, clusters are connected using the I/O bus of the computer, and thus suffer large conflicts at high speed.
• Division of memory: A large single program running on a cluster of N machines requires N independent memory units and N copies of the operating system; a shared-address multiprocessor, on the other hand, allows a program to use almost all the memory in the computer.
• However, despite these challenges, clusters have advantages with respect to dependability and scalability.
___________________________________
Qs)) What is the difference between a DRAM cell and an SRAM cell? Explain enhanced DRAMs.
Static RAM: Basic Cell
• Write:
1. Drive the bit lines (bit and its complement bit', e.g., bit = 1, bit' = 0)
2. Select the row
• Read:
1. Pre-charge bit and bit' to Vdd
2. Select the row
3. The cell pulls one line low
4. The sense amp on the column detects the difference between bit and bit'
Basic DRAM Cell
• Write:
Drive bit line and Select row
• Read:
Pre-charge bit line to Vdd and Select row
Cell and bit line share charges
o Here, only a very small voltage change occurs on the bit line, so a sense amplifier is used to detect the change
Apply a write to restore the value
• Refresh
Just do a dummy read to every cell.

The state-of-the-art DRAM cell has only one transistor; the bit is stored as charge on a tiny capacitor.

Enhanced DRAMs
1. Fast Page Mode DRAM (FPM DRAM)
In a normal DRAM, we can only read and write M bits at a time, because only one row and one column are selected at any time by the row and column address.
In other words, for each M-bit memory access, we have to provide a row address followed by a column address.
Page Mode DRAM: Motivation
• Regular DRAM organization:
N rows x N columns x M bits
Read and write M bits at a time
Each M-bit access requires a RAS/CAS cycle
• Fast Page Mode DRAM
An N x M "register" saves a row

2. Extended data out DRAM (EDO DRAM)
It is an enhanced FPM DRAM with more closely spaced CAS signals.
3. Synchronous DRAM (SDRAM)
It is driven by the rising clock edge instead of asynchronous control signals.
4. Double data rate synchronous DRAM (DDR SDRAM)
It is an enhancement of SDRAM that uses both clock edges as control signals.
5. Video RAM (VRAM)
It is like FPM DRAM, but output is produced by shifting the row buffer; it is dual-ported to allow concurrent reads and writes.
__________________________________
Fully Associative Cache
• The address is sent to all entries at once and compared in parallel, and only the one that matches is sent to the output.
• This is called an associative lookup.
• Hardware intensive.
• A fully associative cache is limited to 64 or fewer entries.
• Conflict misses are zero for a fully associative cache. Assume we have 64 entries here; the first 64 items we access can all fit in.
• But when we try to bring in the 65th item, we will need to throw one of them out to make room for the new item.
• This brings us to the type of cache miss called a capacity miss.

Set Associative Cache
• This organization allows a block to be placed in a restricted set of places in the cache, where a set is a group of blocks at each index value
• Here a block is first mapped onto a set (i.e., mapped at an index value) and then it can be placed anywhere within that set
• The set is usually chosen by bit selection, i.e.,
• (block address) MOD (number of sets in the cache)
• If there are n blocks in a set, the cache placement is referred to as n-way set associative
• A 2-way set associative cache, for example, has two cache entries for each cache index; essentially, two direct-mapped caches working in parallel.
• This is how it works: the cache index selects a set from the cache, and the two tags in the set are compared in parallel with the upper bits of the memory address.
• If neither tag matches the incoming address tag, we have a cache miss.
• Otherwise, we have a cache hit and we select the data on the side where the tag match occurs. A small sketch of the index/tag split follows.
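
An illustrative sketch (the 64-set, 8-byte-block parameters are assumptions chosen for the example) of the bit-selection rule above, splitting an address into set index and tag for a 2-way set associative cache:

#include <stdio.h>

#define NUM_SETS   64
#define BLOCK_SIZE 8

int main(void) {
    unsigned addr = 0x12345678;
    unsigned block_addr = addr / BLOCK_SIZE;       /* strip the block offset     */
    unsigned set_index  = block_addr % NUM_SETS;   /* (block address) MOD (sets) */
    unsigned tag        = block_addr / NUM_SETS;   /* compared in parallel with  */
                                                   /* the 2 tags in the set      */
    printf("set index = %u, tag = 0x%X\n", set_index, tag);
    return 0;
}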
_______________________________________________
Reducing Hit Time
Hit time depends on:
1. The clock rate of the processor
2. The cache access time
The four commonly used techniques to reduce hit time are:
1. Small and Simple Caches
2. Avoiding Address translation during Indexing
3. Pipelined Cache Access
4. Trace Cache

Small and Simple Caches
• The most time-consuming operation in a cache hit is comparing the long address tag
• This requires a bigger circuit, which is always slow compared to smaller circuits
• Thus an obvious approach to keeping the hit time small is to keep the cache smaller and simpler
• The cache should be kept small enough to fit on the same chip as the processor, to avoid the time penalty of going off chip; and
• simpler, such as direct-mapped, so that the tag comparison can be overlapped with the data transmission

Pipelined Cache Access
• This technique reduces the impact of cache hit time by pipelining the cache access, so the latency of a first-level cache hit is spread over several pipeline stages
• This allows a fast cycle time, which increases the bandwidth of instructions
         For the Pentium it takes 1 clock cycle to access the instruction cache;
         for the Pentium Pro through Pentium III it takes 2 clock cycles, and
         for the Pentium 4 it takes 4 clock cycles

Trace Caches
• Trace caches are used with multiple-issue processors, which must supply several instructions per cycle
• A trace cache holds a dynamic sequence of instructions, including taken branches, loaded into a cache block
• Branch prediction is thus folded into the cache, and instruction prefetching effectively follows the predicted path
• Benefit: no wasted words and no conflicts within a block, since a block holds only instructions that are actually executed; conflicts are avoided
• Drawback: it complicates the address mapping, because blocks are no longer aligned to power-of-2 multiples of words, which adds a requirement to the address-mapping hardware
• Demerit: the same instruction may be stored multiple times (in different traces)

______________________________________________
