ASRock C224/C226 + Brocade 1020 = … this is why

This is why this build out process has taken so long (and why I have to hold on to so many damn boxes). Only this time, it’s not just my frustrations at side problems that arise and derail; this one could hurt someone.

On Saturday night, I planned to use a combined server/workstation build intended for Miami to perform a mirror copy of my main 5TB image store, and then go out. I didn’t leave the house for another 24 hours.

The machine has an ASRock E3C226D4i-14s board currently running an E3-1245v3. To accommodate the ITX “plus” motherboard, and to let me over-deploy storage to a site 1,500 miles away, I chose a Silverstone DS380 case, which has eight hotswap 3.5″ bays. Seeing as I have to make/seed a few of these copies, and that integrity is critical, I chose to use this build to mirror off my DAM intake station. Initially, I left it as is and tried to get a bit of a throughput boost by teaming the onboard NICs. I was getting 1.1Gbps, but the source array can perform much higher reads and even the write array can sustain 130+MB/s writes.
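To put those numbers on one scale, here is a rough back-of-the-envelope sketch. It assumes decimal units and ignores protocol overhead, and the function names are mine, purely for illustration.

```python
def gbps_to_mb_per_s(gbps: float) -> float:
    """Convert a link rate in gigabits/s to megabytes/s (decimal units, no overhead)."""
    return gbps * 1000 / 8


def copy_hours(size_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to move size_tb terabytes at a sustained rate in MB/s."""
    seconds = size_tb * 1_000_000 / rate_mb_per_s
    return seconds / 3600


teamed = gbps_to_mb_per_s(1.1)  # the teamed onboard NICs
print(f"1.1 Gbps ~ {teamed:.1f} MB/s")
print(f"5 TB at that rate ~ {copy_hours(5, teamed):.1f} hours")
```

By this estimate the teamed link works out to about 137.5 MB/s, so a single 5TB pass is on the order of ten hours before any overhead.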

To speed up the process, I decided to stop and replace the Quadro K600 with a Brocade 1020 that had been sitting around for a few weeks (and worked just fine before that), move the machine next to the DAM station, and run it by direct SFP+ connection, with the added benefit of putting it on a UPS-backed outlet. Even if I chose to do two copies at once, at least I’d have that benefit.

In short, Windows booted twice with the card, and much slower than expected. The first time, I grabbed the full Brocade management pack off the DAM server, installed it, and rebooted. On the second boot I started a simple SMB file copy of ~5TB and, within seconds, bluescreen. I pulled that card and replaced it with a working one from an ASRock C2750D4I. Slightly different revision. Same deal.

What followed was hours of reboots and simple checks and permutations that are both inane and necessary.  By 10AM – Sunday now – yet to sleep, I started a very long and organized email to ASRock because I either had a defective board or there was a firmware problem. I am convinced of the second at this point.

I had certainly noticed that my temps were at least 12C higher, sometimes 20C, with the Brocade (8.5W) than in a sealed case running the Quadro (41W), and I had certainly noticed the 300W+ draw on a UPS that sits at 230W with the NEC PA241W, the DAM station, and whichever Haswell+Quadro machine is attached all active. I attributed this in part to the Brocade stressing the system in some sort of processing lock, and in part to decreased airflow with the case open. At its worst, I believe I saw mid-60s, but by then I was done actively diagnosing and was instead organizing my write-up and grabbing screenshots.

Monday arrived; so did a CPU for an E3C226D2I compact workstation build I wanted to try. The intended case is a Norco ITX-S4, essentially a half height, half RAM version of the E3C224D4I-14S build with no server responsibilities or permanent place; I’ve even considered 12V and some inbuilt battery options.

Unfortunately, I ordered a 1265v2, not v3. So I pulled the Xeon E3-1230v3 (and RAM) from my test bench Supermicro X10SL7 and went about putting a lot of cables and drives in a very small case.

The first concern, of course, was heat. With only a single 80mm fan providing exhaust and a Noctua NH-L9i HSF, I was concerned about airflow. I decided to boot from IPMI and watch the temps. To kill two birds with one stone, I decided to go directly back to the now very (still) overdue task of making a mirror of my image archive/composite work, so I went to use the 2nd Brocade 1020. I plugged it in, hit the power button and walked over to my workbench station.

In the dashboard, I noticed two things:

  • crap: preview window was sitting at the Brocade OpROM splash
  • double crap: my temps with less than a minute uptime were ~67C.

So, at this point, it’s 3AM Tuesday morning so I go to the local bar for a drink and to vaguely rethink shuffling parts around, because this build wasn’t going to work.
Not exactly.

What I didn’t realize is that while my temps are of course higher, they aren’t that high. Right now the C226 has been steady at 45C/42C for hours with 2x8GB DDR3L ECC, the Quadro K600, 4x4TB drives, and a Seagate 600 Pro 240GB.

Norco C226+K600

This build is completely feasible for its intended use, though I may drill holes to replace the 80mm (already swapped for a Noctua 80mm PWM) with a 92mm and bolt on a 120mm outside the case behind it.

So, why didn’t I immediately jump and say “oh, it’s the Brocade being mishandled again” when I’m having obvious boot problems?

Because the problem is slightly different.  The E3C224D4I-14S surges and locks up the system completely: for example, keyboard LED lights, whether by PS/2 or USB, will not toggle.  And if I see a power surge, after no more than thirty seconds, but usually just a few, power draw drops to (slightly below) normal levels because most of the system is halted.  As a result, temps took a lot longer to climb on the C224; sure, I was seeing 47C/53C  CPU/MB to start, but it took hours of rebooting, sometimes with the card pulled, to get to 60+C temps.  Also “luckily” the E3C226D2I handled its problems “better” – the system wasn’t locked, I was free to hit ‘x’ and watch the OpROM load all over again.

And keep cooking. In the minute it took me to watch the boot process just from the LEDs, leave my desk area, and head over to the workbench with the IPMI page, it got 22C hotter than it’s running right now, after hours of uptime with a “hotter” card.

So now it’s Tuesday – yesterday – and I’m doing similar fiddling.  I pull the card and install Windows just fine.  Since I never had a hard lock with the Brocade, I decide to give it another shot.  Incidentally, it was at this point that I saw that ASRock got back to me and I’m looking at my phone.
It was where I also got very lucky.
My air conditioner turned on.
My lights turned off.

And of course, now everything in the room is beeping: the desk APC, the rack Eaton 5125, two sets of desktop speakers and three tablets and two phones with emails from Spiceworks, SNMP, the Eaton UPS directly, from whatever other gremlins that may roam – fun.  And on the way back from resetting my circuit breaker I walk by my workbench station, open to the ASRock’s IPMI page.

77/79 CPU/MB

Yeah, I ran to pull the plug.  My UPS was happy to supply power as my little 8 inch computer with $900 in HDDs alone roasted itself with no end in sight, straight through the power event.  Had I just sat there – hey, the system’s still responsive, don’t fiddle, give it a minute and observe… or worse – avoid the temptation to fiddle – let it sit for a bit and go to the store… this is a Manhattan apartment building.  I have old 50s wood floors.  There are real worst case scenarios that transcend “off site backup.”
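For what it’s worth, the check I was doing by eyeball can be sketched in a few lines. This is only an illustration under loud assumptions: the samples are (seconds, temperature in C) pairs that would have to come from somewhere like the IPMI page, the 70C limit and 10C-per-minute rise are thresholds I made up, and `first_trip` is a hypothetical helper, not anything ASRock or the BMC provides.

```python
from typing import Iterable, Optional, Tuple


def first_trip(
    samples: Iterable[Tuple[float, float]],
    limit_c: float = 70.0,             # assumed absolute ceiling, not a vendor spec
    max_rise_c_per_min: float = 10.0,  # assumed "this is running away" rate
) -> Optional[Tuple[float, str]]:
    """Return (time, reason) for the first sample that should cut power, else None."""
    prev = None
    for t, temp in samples:
        if temp >= limit_c:
            return (t, "over limit")
        if prev is not None:
            dt = t - prev[0]
            if dt > 0:
                # Normalize the climb since the previous sample to degrees per minute.
                rate = (temp - prev[1]) * 60.0 / dt
                if rate >= max_rise_c_per_min:
                    return (t, "rising too fast")
        prev = (t, temp)
    return None


# A steady 45C never trips; a 22C jump inside a minute trips on rate alone.
print(first_trip([(0, 45), (60, 45), (120, 45)]))
print(first_trip([(0, 45), (60, 67)]))
```

The rate check matters more than the ceiling here: the climb I saw would have been flagged well before anything read 77C.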

What’s also scary is that this card is not new. It is well established, tested in turnkey and big contract custom deployments, used in conjunction with boards from IBM, Dell, rebranded, etc. It should be as in-spec a PCIe card as a card can be, so it raises the question: what else could trigger this? Even if the answer is “nothing,” consider that it is recently in vogue at some of the advanced home networking forums because of eBay liquidations, plus the curiosity surrounding some really interesting and compelling motherboards that ASRock is selling. It’s not a huge number of people, but still: people.

So that’s why this, that’s why the emails, the blog post, the hours upon hours of walking this through and overly detailed emails, and presenting everything as best as I have recorded or done, errors of distraction, sleep deprivation, and all.

Oh, and to completely rule out the card: that second Brocade card, the working pull, went back into the ASRock Avoton, booted fine, installed Windows fine, and managed 2+Gbps for a few seconds (locking up the whole UI) but recovered without error to do a 100MB/s file copy over 65GB+, and then to reduce IO bottlenecks, LANBench where it did 2.7+Gbps.  Less than an hour after being cooked at 80C.  It’s not the card.
Both ASRock builds shared nothing other than the 4x4TB drives (and only briefly, to reduce risk and diagnostic complexity)… and ASRock Haswell Xeon firmware.

The boards are otherwise great, doing exactly what I want, giving me options that no one else really makes.  ASRock knows how to make motherboards.  That’s not the question.
But something is very wrong here.


This entry was posted on כ״ד ניסן ה׳ תשע״ד - Wednesday, April 23rd, 2014 at 19:50 and is filed under ruminations.
