Locating Failed Chips on a Sun 501-1102 "Sirius" VME RAM Board
Introduction
The Sun 501-1102 8 MB ECC DRAM board is often found in older Sun3 and Sun4 series VME-based servers. It contains 288 DRAM chips, which provide 8 MB of usable data storage plus additional storage for ECC check bits. The check bits allow on-the-fly detection and correction of single-bit errors, detection of all double-bit errors, and detection of some triple-bit errors.
Given the age of these boards, it is not uncommon for them to develop various faults. Generally, these faults are caused by one or more failing DRAM chips. Thanks to the detailed diagnostic mode present in Sun boot PROMs, the address of the failure and the value of the ECC syndrome bits are reported. This information is generally sufficient to locate the failing chip or chips, provided you know the layout of the board and the wiring of the ECC processors.
Having recently encountered a failing board, I asked around to see if I could find anyone who knew the requisite magic to translate the failure data into a part designator. Failing that, I set out to reverse-engineer enough of the DRAM layout to repair my board with a minimum number of chip replacements. After a good deal of experimentation and PCB tracing, I managed to completely map the DRAM layout. Armed with that information, I was able to locate and replace the single faulty chip on my board.
This page presents the DRAM layout information I learned and provides a tool that takes the failure address and ECC syndrome displayed in the boot PROM error message and tells you which chip or chips need replacing. If you're only interested in the tool, you can jump to it HERE.
DRAM IC Layout
The board contains 288 individual 256Kx1 DRAM chips, physically arranged in 16 columns of 18 chips each. 32 chips are used to store ECC check bits. The remaining 256 chips store actual data bits (256K bits per chip multiplied by 256 chips gives 64 Mbits, or 8 MBytes). Electronically, the chips are divided into two major banks, with each bank storing 4 MB of data as well as the associated check bits. Within each major bank, the chips are further divided into two minor banks, with each minor bank containing 2 MB of data storage plus the associated check bits. Each minor bank contains 64 ICs to store data bits and 8 ICs to store ECC check bits.
Figure 1 shows the major and minor banks as you view the board from the component side with the VME connectors away from you. The number in each box represents the bit position (within a 32-bit word) that the DRAM IC is responsible for. The positions are given as the CPU sees the data. Interestingly, the data bits are wired to the EDC engine in a very different order, as described in the next section.
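The chip counts and capacities above can be cross-checked with a little arithmetic. The following is only a sanity check of the figures quoted in this section; the constant names are made up for illustration.

```python
# Sanity check of the chip counts and capacities quoted in the text.
# All constant names are illustrative; the numbers come from the text.
CHIP_BITS = 256 * 1024            # one 256Kx1 DRAM holds 256K bits
COLUMNS, CHIPS_PER_COLUMN = 16, 18
MINOR_BANKS = 2 * 2               # 2 major banks x 2 minor banks each
DATA_CHIPS_PER_MINOR = 64
CHECK_CHIPS_PER_MINOR = 8

total_chips = COLUMNS * CHIPS_PER_COLUMN
data_chips = MINOR_BANKS * DATA_CHIPS_PER_MINOR
check_chips = MINOR_BANKS * CHECK_CHIPS_PER_MINOR
data_bytes = data_chips * CHIP_BITS // 8

print(total_chips, data_chips, check_chips)   # 288 256 32
print(data_bytes // (1024 * 1024), "MB")      # 8 MB
```

The physical count (16 x 18) and the electrical count (4 minor banks of 72 chips) both come to 288, which is a useful consistency check on the bank description.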
EDC Engine Layout
The EDC engine consists of four AMD Am29C60 ICs cascaded together to form a single unit capable of handling 64-bit words. (Each AMD chip handles 16 data bits.) Handling 64 bits at a time allows the engine to detect and correct all single-bit errors using 8 check bits per 64-bit word, an average of one additional check bit per 8 data bits. Thus EDC needs the same number of additional DRAM ICs as simple byte parity but provides far greater protection.
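To see why the 64-bit width matters, here is a small sketch based on the standard Hamming bound for single-error-correcting, double-error-detecting codes. It is generic coding theory rather than anything taken from the Am29C60 documentation, and the function name is invented for this example.

```python
def sec_ded_check_bits(data_bits: int) -> int:
    """Check bits needed to correct single and detect double bit errors."""
    r = 1
    # Hamming bound: r check bits must be able to name every single-bit
    # error position (data or check bit) plus the "no error" case.
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # one extra overall parity bit adds double-error detection

# Compare against byte parity, which costs one bit per 8 data bits.
for width in (8, 16, 32, 64):
    print(width, sec_ded_check_bits(width), width // 8)
# 8 5 1
# 16 6 2
# 32 7 4
# 64 8 8
```

Only at a width of 64 bits does the check-bit count drop to the same 8 bits that byte parity would need, which matches the one-check-bit-per-8-data-bits figure given above.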
The DRAM ICs in each minor bank are arranged 64 bits wide by 256K deep. The bits are fed to the EDC chips in a different order than they are sent on the data bus. I'm not sure why there is a difference. Perhaps it is due to board layout constraints. Figure 2 shows the bit positions represented by the DRAM ICs as sent to the EDC engine.
Decoding Syndrome Bits
The EDC engine reports "syndrome" bits when it encounters an error. The syndrome is calculated as the difference between (i.e., the XOR of) the check bits stored when the data was written and the check bits recalculated when the data is read. The syndrome can be used to determine exactly which bit is faulty in the case of a single-bit error, or to determine the existence of double- or triple-bit errors. The syndrome bits are decoded according to the table shown in Figure 3 (taken from the datasheet for the IDT39C60, a clone of the Am29C60). Note that the bit positions described by the syndrome correspond to the DRAM ICs as connected to the EDC engine, not as connected to the data bus. Take the value shown in the table and look it up in Figure 2, using the correct major bank and pair of minor banks based on the address of the failure.
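The idea of a syndrome as the XOR of stored and recomputed check bits can be illustrated with a toy example. The sketch below uses a textbook Hamming code over 8 data bits; it is NOT the Am29C60's check-bit matrix and has no double-error detection, so its syndrome values do not correspond to Figure 3. All names are invented for the illustration. For a real board, use the table in Figure 3.

```python
def data_positions(n):
    """Codeword positions for n data bits; powers of two are check bits."""
    out, pos = [], 1
    while len(out) < n:
        if pos & (pos - 1):       # nonzero means pos is not a power of two
            out.append(pos)
        pos += 1
    return out

POSITIONS = data_positions(8)     # [3, 5, 6, 7, 9, 10, 11, 12]

def check_bits(word):
    """XOR together the position numbers of every data bit that is 1."""
    c = 0
    for bit, pos in enumerate(POSITIONS):
        if (word >> bit) & 1:
            c ^= pos
    return c

def locate(stored_check, word_read):
    """Return (syndrome, failing data bit index or None)."""
    syndrome = stored_check ^ check_bits(word_read)
    bit = POSITIONS.index(syndrome) if syndrome in POSITIONS else None
    return syndrome, bit

word = 0b10110010
stored = check_bits(word)                 # saved alongside the data on write
print(locate(stored, word))               # (0, None): no error
print(locate(stored, word ^ (1 << 4)))    # (9, 4): data bit 4 flipped
```

A zero syndrome means the data read back matches what was written; a nonzero syndrome names the failing bit. In this toy code a syndrome that is a power of two would indicate a flipped check bit rather than a data bit.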
Feel free to send questions, comments, and reports of success or failure to me at akropel1@rochester.rr.com.
This page Copyright (c) 2002 by Adam Kropelin, All Rights Reserved.