S1200BTLR - software able to test memory?

idata
Employee

Hi,

I have a server with an E3-1280 v2 on an S1200BTLR board with 16GB of 1600 MHz RAM.

Does anyone know of any software that can test the memory in this setup?

I suspect a fault but am having trouble proving it: the OS crashes infrequently, and the event logs indicate a hardware failure in the memory.

I've tried 'memtest86 v4.0s' (latest for servers) and also 'gold memory'.

They don't work well: they pass some tests but show failures for lots of others.

I put in slower memory (1333), but the tests still fail in the same fashion.

So something is up with how these memory test programs work on these boards...

Cheers,

Bryce S.

idata
Employee

Bryce,

I've used Memtest86 on various Intel platforms and I generally assume it is accurate.

In what way is your memory failing?

I suggest you look at the System Event Log and see what it says regarding your memory. Is there an amber light on the front panel?

idata
Employee

Hi Jason,

Here is a photo of the memory test failing. It happens early in the test, when it hits 'moving inversions':

It shows these errors for both the 1600 MHz RAM and the 1333 MHz RAM, and since the server otherwise works, I can't trust this memtest86 result.

The system event log for the time the server crashed had:

Source: WHEA-Logger

Event ID: 46

A fatal hardware error has occurred.

Component: Memory

Error Source: Generic

So I wanted to run tests for a day or two to prove or disprove that a memory problem exists.

The front panel system status LED blinks green (no amber).

Any ideas?

Regards, Bryce.

idata
Employee

Bryce,

Sorry, I should have specified, when I said SEL I meant the BMC log. If there's something logged at the hardware level then I would think that is definitive.

By the way, you never really described why you think the memory is faulting besides indicating a system crash. Is there anything more there that you can say about the event - was it a single event?

Is memtest86 performing a modulo-x test? If so, does the memory pass this test? If it does, you can likely ignore the moving-inversions results: "caching, buffering and out of order execution will interfere with the moving inversions algorithm and make less effective." (http://www.memtest86.com/tech.html)
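For readers unfamiliar with the two tests, here is a minimal sketch of their access patterns, run against a simulated memory with one stuck bit. It is a simplification and not memtest86's actual code: the `StuckBitMemory` class, the 64-byte size, the fault address and the `0x55` pattern are all assumptions made for illustration, and a Python simulation has none of the caching or buffering effects mentioned above.

```python
class StuckBitMemory:
    """Hypothetical byte-addressable memory where one cell has bits stuck at 1."""

    def __init__(self, size, bad_addr, stuck_mask):
        self._cells = bytearray(size)
        self._bad_addr = bad_addr
        self._stuck_mask = stuck_mask

    def __len__(self):
        return len(self._cells)

    def __getitem__(self, addr):
        value = self._cells[addr]
        if addr == self._bad_addr:
            value |= self._stuck_mask  # the faulty bits always read back as 1
        return value

    def __setitem__(self, addr, value):
        self._cells[addr] = value


def moving_inversions(mem, pattern=0x55):
    """Fill, then sweep up and down, checking each cell before inverting it."""
    inverse = pattern ^ 0xFF
    errors = []
    for addr in range(len(mem)):
        mem[addr] = pattern
    # Ascending sweep: expect the pattern, then write its complement.
    for addr in range(len(mem)):
        if mem[addr] != pattern:
            errors.append(addr)
        mem[addr] = inverse
    # Descending sweep: expect the complement, then write the pattern back.
    for addr in reversed(range(len(mem))):
        if mem[addr] != inverse:
            errors.append(addr)
        mem[addr] = pattern
    return errors


def modulo_x(mem, x=20, pattern=0x55):
    """Write the pattern to every x-th cell, the complement elsewhere, then check."""
    inverse = pattern ^ 0xFF
    errors = []
    for offset in range(x):
        for addr in range(len(mem)):
            mem[addr] = pattern if addr % x == offset else inverse
        # Only the cells holding the pattern are verified for this offset.
        for addr in range(offset, len(mem), x):
            if mem[addr] != pattern:
                errors.append(addr)
    return errors


faulty = StuckBitMemory(size=64, bad_addr=37, stuck_mask=0x02)
print(moving_inversions(faulty))  # addresses flagged by moving inversions
print(modulo_x(faulty))           # addresses flagged by modulo-20
```

On a genuinely faulty cell both patterns flag the same address, which is why a module that passes modulo-x but fails moving inversions points more at the test than at the DIMM.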

Kick that around and see how it goes,

Jason

idata
Employee

Hi Jason,

memtest86 had a Modulo-20 test, but this failed as dramatically as all the rest.

Do you know how I can view this BMC log? On older machines I used to be able to view it via the BIOS, but that is not part of this machine. I booted the management CD that came with the server, but it has no utility to view the log. I can't install the Active System Console software, as it does not run (http://communities.intel.com/thread/30361) and its PHP conflicts with some database monitoring software installed; they won't play nicely together.

I'm not sure memory is the problem. I've had two random crashes so far; the Windows event log seems to indicate memory, but I'm not sure. I just want to test it to either eliminate it or confirm it as the problem.

Cheers, Bryce.

idata
Employee

Bryce,

There is a SEL viewer in the latest firmware update package for S1200BTL: http://downloadcenter.intel.com/confirm.aspx?httpDown=http://downloadmirror.intel.com/21059/eng/E5_Windows.zip&lang=eng&Dwnldid=21059&DownloadType=undefined&OSFullname=undefined

Not a bad idea to consider upgrading your firmware before taking any further steps, besides looking into the SEL, that is.

Jason

idata
Employee

Hi Jason,

Thanks for that - got the logs now.

Items of interest before the unexpected reboots:

773 | 8/15/2012 - 4:37:31 AM | Memory | Mmry ECC Sensor (# 0x02) | CRITICAL event: Mmry ECC Sensor reports uncorrectable error. There has been an uncorrectable ECC or other uncorrectable memory error for the memory module CPU_1, Channel = A, DIMM = 2. | BIOS - LUN# 0 (Channel# 0

1231 | 8/20/2012 - 2:57:17 PM | Memory | Mmry ECC Sensor (# 0x02) | CRITICAL event: Mmry ECC Sensor reports uncorrectable error. There has been an uncorrectable ECC or other uncorrectable memory error for the memory module CPU_1, Channel = A, DIMM = 2. | BIOS - LUN# 0 (Channel# 0
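If the exported log grows, the entries can be tallied by location with a short script. The following is only a sketch: it assumes each uncorrectable-error entry contains the text `CPU_n, Channel = X, DIMM = n` exactly as in the two entries above, and the function name `tally_ecc_locations` is made up for this example. Other SEL viewers or firmware versions may word their entries differently.

```python
import re
from collections import Counter

# Matches the location text seen in the two entries above.
LOCATION = re.compile(r"CPU_(\d+), Channel = (\w+), DIMM = (\d+)")


def tally_ecc_locations(lines):
    """Count uncorrectable-error entries per (cpu, channel, dimm) location."""
    counts = Counter()
    for line in lines:
        if "uncorrectable" not in line:
            continue  # skip entries that are not uncorrectable memory errors
        match = LOCATION.search(line)
        if match:
            cpu, channel, dimm = match.groups()
            counts[(int(cpu), channel, int(dimm))] += 1
    return counts


event = (
    "CRITICAL event: Mmry ECC Sensor reports uncorrectable error. "
    "There has been an uncorrectable ECC or other uncorrectable memory error "
    "for the memory module CPU_1, Channel = A, DIMM = 2."
)
sel_lines = [event, event]  # the two entries quoted in this post

for location, count in tally_ecc_locations(sel_lines).items():
    print(location, count)
```

Both entries land on the same location (CPU 1, channel A, DIMM 2), which is the detail that makes the slot-versus-module question worth chasing.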

I'm still not sure whether it's the memory or maybe the controller: between the two failures above, all the memory was swapped to 1333 for testing and then the 1600 put back, so there is only a 1 in 4 chance that the same module ended up in the same slot.

And by the way, there is now an amber light on the front panel for system status.

Any thoughts on this?

Regards, Bryce.

idata
Employee

Bryce,

To isolate the controller versus the DIMM, move the DIMMs around and make a note of the moves. When a further failure occurs, you can pin it down to one or the other. You're not out of the woods yet, but it looks like there is a discovery path to your issue.
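The bookkeeping for that swap test can be sketched in a few lines. This is an illustration only: the `Failure` record, the slot name "A2" and the module labels "DIMM-1" to "DIMM-3" are hypothetical, and the rule is deliberately simple (failures that stay with one slot across different modules point at the slot, channel or controller; failures that follow one module across slots point at the module).

```python
from typing import NamedTuple


class Failure(NamedTuple):
    slot: str  # slot reported in the SEL, e.g. "A2" (hypothetical naming)
    dimm: str  # label of the module sitting in that slot at the time


def diagnose(failures):
    """Return 'slot', 'dimm' or 'inconclusive' from the recorded failures."""
    if len(failures) < 2:
        return "inconclusive"  # one data point cannot separate the two
    slots = {f.slot for f in failures}
    dimms = {f.dimm for f in failures}
    if len(slots) == 1 and len(dimms) > 1:
        return "slot"  # same slot fails regardless of which module is in it
    if len(dimms) == 1 and len(slots) > 1:
        return "dimm"  # the fault travels with the module
    return "inconclusive"


# Same slot, different modules after a swap: suspect the slot or controller.
print(diagnose([Failure("A2", "DIMM-1"), Failure("A2", "DIMM-3")]))
# Same module, different slots: suspect the module itself.
print(diagnose([Failure("A2", "DIMM-1"), Failure("B1", "DIMM-1")]))
```

Two failures in the same slot with the same module stay inconclusive, which is why the modules need to be moved and the moves written down before the next failure.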

Jason
