Critical: BSOD when accessing DP over a network share

edited May 2013 in DrivePool
Hi there,

I am experiencing frequent BSODs (message BAD_POOL_CALLER) when using a pooled drive over the network.

My setup:
+ Drivepool installed on: Win Server 2012 64, 16GB mem, core i7
+ Pool created with:
  - Archive disks: 3 volumes of 2.6TB located on 3 HDD of 3TB
  - Feeder disk: 1 volume on 1 HDD of 750GB
+ Pool shared over the network:
  - Mounted drive letter directly;
  - Some folders in the drive.
+ Network going through gigabit interfaces and a switch
+ Client PC connecting to the shared pool using Windows 8 Pro 64

When does the BSOD happen:
+ Install TreeSizeFree
+ Analyse a shared pool with a lot of content: 7334 folders, 25 361 files, 4.21 TB

I cannot upload the memory_dump, it is 1GB...
Checked the drivepool logs, nothing appears at the time of the BSOD.

Any help appreciated, this issue is critical to me, making all of my config unworkable...

Comments

  • edited May 2013 Member
    Just ran some more tests.

    Doing it again, another BSOD, so seems consistent.
    I am doing my tests on a Media folder located at the root of my pool (the share is called Media as well).
    I then have shared (read only) the Media folder located in my three archive disks and my feeder disk, named Media{0-2,F}

    Did the treesize test on all of them, no problem, no BSOD at all.
    Moreover, the performance of treesize on these shares is amazing, I obtain the result in 2 secs for a share containing 14493 files, 2TB.

    Then I did the treesize again on the DrivePool share.
    This time no BSOD... but very slow to process all the folders, something like 1 minute (compared to the 4x2s...)

    Perhaps a problem with idle disks and DrivePool...?
    By testing all of the shares on my regular HDDs, I certainly woke them up.
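The timing comparison above (direct disk shares versus the DrivePool share) can be repeated without TreeSize. The following is a minimal sketch, not the tool used in the thread: it walks a share, counts files and bytes, and reports the elapsed time. The function name and the share paths in the comments are assumptions for illustration.

```python
import os
import time


def scan_tree(root):
    """Walk `root` and return (file_count, total_bytes, elapsed_seconds)."""
    start = time.perf_counter()
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total_bytes += os.path.getsize(os.path.join(dirpath, name))
                file_count += 1
            except OSError:
                # Skip files that vanish or cannot be read during the scan.
                continue
    return file_count, total_bytes, time.perf_counter() - start


# Example usage (hypothetical share names, replace with the real UNC paths):
# for share in (r"\\server\Media0", r"\\server\Media"):
#     files, size, secs = scan_tree(share)
#     print(f"{share}: {files} files, {size / 1e12:.2f} TB in {secs:.1f} s")
```

Running it once per share, on a cold and then a warm server, separates the cost of spinning up idle disks from the cost of the share options themselves.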
  • Member
    Found where the performance difference comes from, between the direct disk share and the DP share...
    Nothing to do with DP here, just that I activated the share option "Enable access-based enumeration" for the DP share.
    Deactivating it brings the performance to the same level as for direct disk shares.

    Redid some tests, before and after removing that option on the share (before and after a server reboot), and unable to trigger the BSOD anymore.
    So the mystery remains... perhaps my client machine is caching some data that are required... I don't know.

    Tell me if you have ideas of something I could try.
    I will get back to you if that happens again.
  • edited May 2013 Member
    Happened again.

    In fact, I also have XBMC on my client PC.
    Opening a video in XBMC triggered the BSOD after 5 seconds...
    Note that DP was rebalancing at the same time, don't know if that could have an impact.
    Perhaps accessing some specific files triggers that, investigating...

    Note that after the server reboot, I have been notified by Windows that one of my disks needs repair.
    Testing with Stablebit Scanner, it tells me that the file system is damaged.
  • Member
    Ok, finalized my tests.

    Found that the BSOD happens under those conditions:
    + I am reading a large file (XBMC reading a mkv file of 7.64GB)
    + DP is rebalancing (would be a big coincidence that it is moving that exact file; plus I think the balancing did not start moving anything, it was analyzing the disks?)

    Reproducibility: every time when those conditions are met.

    What does not seem to impact the BSOD:
    + listing folders content (the treesize thing was my mistake, first time I had the BSOD I was running XBMC and treesize at the same time)
    + enabling or not access-based enumeration on the share

    Mitigation: can the DP staff give me some directions?
    (for the moment I have deactivated balancing completely, but it is not really a good option to me)


  • Member
    And another BSOD with another message on the server: KMODE_EXCEPTION_NOT_HANDLED (srv2.sys)

    This time it happened under those conditions:
    + DP not balancing
    + XBMC on the client machine refreshing its video library (so analyzing lots of files on the DP share)
  • edited May 2013 Member
    Ok doing more tests.

    I have Symantec Endpoint Protection on my Server 2012.
    Deactivated it as it could be a potential cause.

    Redoing an XBMC update (and playing a video at the same time), once again a BSOD after 20 secs or so.
    Message: BAD_POOL_HEADER
  • Member
    Here is the MS Event log for the BSOD:
    The computer has rebooted from a bugcheck.  The bugcheck was: 0x00000019 (0x0000000000000020, 0xfffffa801a32f1d0, 0xfffffa801a32fff0, 0x0000000014e24a00). A dump was saved in: C:\Windows\MEMORY.DMP. Report Id: 052513-18782-01.

    After this last one I had some errors reported by DP as well.

    Attached are all the errors from DP and the minidumps from MS.
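The event-log line above carries the bug check code and its four parameters in one string. As a small sketch (the function and table names are illustrative, and the table only covers the codes that come up in this thread), it can be parsed and mapped to a name like this:

```python
import re

# Names for the bug check codes that appear in this thread.
BUGCHECK_NAMES = {
    0x19: "BAD_POOL_HEADER",
    0x1E: "KMODE_EXCEPTION_NOT_HANDLED",
    0x50: "PAGE_FAULT_IN_NONPAGED_AREA",
    0x7E: "SYSTEM_THREAD_EXCEPTION_NOT_HANDLED",
    0xC2: "BAD_POOL_CALLER",
}

EVENT_RE = re.compile(r"The bugcheck was:\s*(0x[0-9a-fA-F]+)\s*\(([^)]*)\)")


def parse_bugcheck(message):
    """Return (code, name, [parameters]) from a bugcheck event message."""
    match = EVENT_RE.search(message)
    if match is None:
        raise ValueError("no bugcheck found in message")
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return code, BUGCHECK_NAMES.get(code, "UNKNOWN"), params
```

Applied to the message quoted above, the code 0x19 resolves to BAD_POOL_HEADER, which matches the message seen on the blue screen.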
  • edited May 2013 Member
    Tried rolling back my network card drivers, does not change anything (still experiencing BSOD).

    It is consistently happening when I read a big file while listing folders contents (playing a video while refreshing the XBMC Media lib).

    Another one just happened again, this time not using XBMC.
    Just exploring to a folder, and opening a video file with MediaPlayer Classic.
    The file starts to play, and right after a BSOD: PAGE_FAULT_IN_NONPAGED_AREA

    After server reboot, I tried playing a file from a share pointing to the real hard drive managed by DP.
    No problem.
    Then opened the same file from the DP share... BSOD: KMODE_EXCEPTION_NOT_HANDLED
    So the problem really seems to come from DP.
  • edited May 2013 Member
    Did more tests.

    Opened to play many of my video files through explorer in my client PC using the shares directly on the HDDs (read only).
    Tested perhaps 10, no BSOD.

    Then did the same (same files) on the share coming from DP... no BSOD.

    Then updated the media library of XBMC (processing all the files from the DP share).
    And here we have the BSOD after 20secs.

    Here is the report from WhoCrashed:
    On Sat 25/05/2013 16:32:23 GMT your computer crashed
    crash dump file: C:\Windows\Minidump\052513-33633-01.dmp
    This was probably caused by the following module: ntoskrnl.exe (nt+0x5A440)
    Bugcheck code: 0x1E (0xFFFFFFFFC0000005, 0xFFFFF8004020E9FA, 0x0, 0x49A)
    file path: C:\Windows\system32\ntoskrnl.exe
    description: NT Kernel & System
    Bug check description: This indicates that a kernel-mode program generated an exception which the error handler did not catch.
    This appears to be a typical software driver bug and is not likely to be caused by a hardware problem.
    The crash took place in the Windows kernel. Possibly this problem is caused by another driver that cannot be identified at this time.

    On Sat 25/05/2013 16:32:23 GMT your computer crashed
    crash dump file: C:\Windows\memory.dmp
    This was probably caused by the following module: srv2.sys (srv2+0x6F353)
    Bugcheck code: 0x1E (0xFFFFFFFFC0000005, 0xFFFFF8004020E9FA, 0x0, 0x49A)
    file path: C:\Windows\system32\drivers\srv2.sys
    description: Smb 2.0 Server driver
    Bug check description: This indicates that a kernel-mode program generated an exception which the error handler did not catch.
    This appears to be a typical software driver bug and is not likely to be caused by a hardware problem.
    The crash took place in a standard Microsoft module. Your system configuration may be incorrect. Possibly this problem is caused by another driver on your system that cannot be identified at this time.

    The crash comes from srv2.sys, which is Microsoft's SMB 2.0 server driver.
    Bad interaction with the covefs hdd driver ?
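The four bug check parameters in the WhoCrashed report can be read a little further. The sketch below assumes the usual layout for bug check 0x1E (exception code, address of the faulting instruction, then the two exception parameters) and, for an access violation, that the exception parameters are the access type and the target address. The helper names are made up for illustration.

```python
STATUS_ACCESS_VIOLATION = 0xC0000005

# Assumed meaning of the first exception parameter of an access violation.
ACCESS_TYPES = {0: "read", 1: "write", 8: "execute"}


def ntstatus(param):
    """Drop the 64-bit sign extension and keep the 32-bit NTSTATUS value."""
    return param & 0xFFFFFFFF


def describe_0x1e(params):
    """Summarise the four parameters of a 0x1E bug check."""
    code = ntstatus(params[0])
    summary = {
        "exception_code": f"0x{code:08X}",
        "fault_at": f"0x{params[1]:X}",
    }
    if code == STATUS_ACCESS_VIOLATION:
        summary["access"] = ACCESS_TYPES.get(params[2], "unknown")
        summary["target_address"] = f"0x{params[3]:X}"
    return summary
```

Under those assumptions, the parameters quoted above describe an access violation (0xC0000005) on a read of address 0x49A, which would be consistent with code following a null or nearly null pointer.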
  • Member
    Ok, my final word to this (for now).

    I activated verifier.exe for the srv2.sys driver, the two cove drivers, and my network card driver.
    I then provoked a BSOD again, same technique (XBMC lib update on client PC).

    Here is the analysis by WhoCrashed:
    On Sat 25/05/2013 17:13:31 GMT your computer crashed
    crash dump file: C:\Windows\Minidump\052513-23431-01.dmp
    This was probably caused by the following module: srv2.sys (srv2+0x6F339)
    Bugcheck code: 0x1000007E (0xFFFFFFFFC0000005, 0xFFFFF88017F59339, 0xFFFFF88017CBC748, 0xFFFFF88017CBBF80)
    file path: C:\Windows\system32\drivers\srv2.sys
    description: Smb 2.0 Server driver
    Bug check description: This indicates that a system thread generated an exception which the error handler did not catch.
    This appears to be a typical software driver bug and is not likely to be caused by a hardware problem.
    The crash took place in a standard Microsoft module. Your system configuration may be incorrect. Possibly this problem is caused by another driver on your system that cannot be identified at this time.

    On Sat 25/05/2013 17:13:31 GMT your computer crashed
    crash dump file: C:\Windows\memory.dmp
    This was probably caused by the following module: srv2.sys (srv2+0x6F339)
    Bugcheck code: 0x7E (0xFFFFFFFFC0000005, 0xFFFFF88017F59339, 0xFFFFF88017CBC748, 0xFFFFF88017CBBF80)
    file path: C:\Windows\system32\drivers\srv2.sys
    description: Smb 2.0 Server driver
    Bug check description: This bug check indicates that a system thread generated an exception that the error handler did not catch.
    The crash took place in a standard Microsoft module. Your system configuration may be incorrect. Possibly this problem is caused by another driver on your system that cannot be identified at this time.

    Just to let you know, my server is almost new (3 weeks).
    It is used as HyperV host, AD DC, DNS, and file server (hopefully with DP).
    Before installation the memory was tested with memTest86 from a boot ISO for a whole afternoon, no problem.
    The BIOS has been updated with the latest firmware.
    All the drivers from the manufacturer are installed, the device manager reports no problem at all.
    I scanned as well the system for viruses, nothing.
    The power supply is behind a UPS and over-dimensioned (650W for a measured consumption of 80W).
    The PC temperature is almost negative (30°C for the CPU and the HDDs).
    All the HDD have been tested with no error reported (using Stablebit scanner).

    No... definitely I think I have eliminated all the potential causes but DP...
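One consistency check on the verifier-enabled report above: for bug check 0x7E the second parameter is the address where the exception occurred, so subtracting the reported module offset (srv2+0x6F339) should give the load address of srv2.sys. This is a rough sketch of that arithmetic, with illustrative function names; it also compares the two srv2 offsets reported in this thread.

```python
PAGE_SIZE = 0x1000


def module_base(fault_address, offset):
    """Load address of a module, given a fault address and its offset."""
    return fault_address - offset


def is_page_aligned(address):
    """Driver images are loaded on page boundaries."""
    return address % PAGE_SIZE == 0


base = module_base(0xFFFFF88017F59339, 0x6F339)
print(f"srv2.sys base: 0x{base:X}, page aligned: {is_page_aligned(base)}")

# Distance between the two crash sites reported in srv2.sys (in bytes).
print("offset delta:", 0x6F353 - 0x6F339)
```

The base comes out page aligned, so the address and the offset agree with each other, and the two crash sites sit only 26 bytes apart, which suggests (but does not prove) that both crashes hit the same routine.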
  • edited May 2013 Member
    I'm experiencing the same kind of BSOD's. I have submitted a memory dump to Alex.

    I can provoke a crash by opening a large HD movie in VLC on a Client PC (running Windows 8), and rapidly fast forward then rewind multiple times.

    My suspicion falls on an incompatibility between DP and the new version of SMB in Windows 8/2012.
    http://support.microsoft.com/kb/2709568
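Fast-forwarding and rewinding in a player amounts to reads at scattered offsets in a large file. A minimal sketch of that access pattern, assuming nothing about how VLC actually issues its reads (the function name, chunk size and round count are arbitrary choices), could look like this:

```python
import os
import random


def seek_stress(path, rounds=200, chunk=1024 * 1024, seed=0):
    """Read `chunk` bytes at `rounds` random offsets; return bytes read."""
    size = os.path.getsize(path)
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    total = 0
    with open(path, "rb", buffering=0) as handle:
        for _ in range(rounds):
            # Pick an offset that leaves room for a full chunk when possible.
            handle.seek(rng.randrange(max(size - chunk, 1)))
            total += len(handle.read(chunk))
    return total


# Example usage (hypothetical path on the pooled share):
# seek_stress(r"\\server\Media\Movies\big_movie.mkv")
```

Running it against the same file through the pool share and through a direct disk share gives a repeatable version of the manual test.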
  • Covecube
    Guys, just to let you know, this "Srv2.sys" crash isn't related to DrivePool, unfortunately. So there really isn't anything that can be done on our end.

    The issue seems to be a bad update. Not SMB specifically, but a recent update/fix to it (some time in April, from what I can tell).

    I've been hit by this bad update myself, and nothing short of using server backup to restore to a point before then was able to fix the issue for me (a reinstall would have worked, but that's wielding a sledgehammer to remove a screw, IMO).

    The worst part is that whichever update it was, Microsoft seems to have silently pulled it and replaced it with a fixed version, so it's all but impossible to tell which update it actually was.

    And like I've said, I've been hit by this myself (mid April) and Alex was able to help a good deal. But only in so much as confirming that it was an issue with the Srv2.sys file and not actually with DrivePool.
    And I can confirm that if you're experiencing this issue, uninstalling DrivePool won't help. But DrivePool does seem to exacerbate the crashing. It got to the point that I could copy any large file from \\server\c$\fs\ (where my pool drives reside) and cause a crash within minutes (but accessing from DrivePool directly didn't cause any crash).


    I kind of wish this was something caused by DrivePool, because then Alex could fix it. But...


  • Member
    Erk, how can you prevent this update from Microsoft then if you do not know which one it is?
    Is it possible to roll-back this driver only?
    I am afraid I do not have server backups before that specific update :/

    Another question as well...
    Some crashes happened while DP was balancing, and now I have many files with 1ee9da1-709f-4fe2-be8a-e373276b9375.copytemp at the end (I assume the failed balanced copies when the BSOD arrived).
    Does DP garbage collect these files, or is it safe to remove them by hand?
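Whether those files are safe to delete is a question for the DrivePool developers, but listing them first is harmless. This sketch only reports files whose names end in `.copytemp` (the suffix quoted above) and deletes nothing; the function name and the example path are assumptions.

```python
import os


def find_copytemp(root, suffix=".copytemp"):
    """Return a sorted list of (path, size_in_bytes) for leftover temp files."""
    leftovers = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(suffix):
                full = os.path.join(dirpath, name)
                leftovers.append((full, os.path.getsize(full)))
    return sorted(leftovers)


# Example usage (hypothetical pool drive letter):
# for path, size in find_copytemp("P:\\"):
#     print(f"{size:>15,d}  {path}")
```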
  • Member
    Plus wait a minute, very strange:
    + my server install is only about one week old, so all the updates I made to it are one week old at most;
    + the crash really only happens when using a share over DP, never with a share over the HDD directly.

    So even if the problem comes from SMB, it is really DP that triggers it, isn't it?
    So yes, uninstalling DP would fix the problem; but I am not there yet.
  • Here are some of the recent updates/hotfixes that change srv2.sys


    Will try Installing/Uninstalling these.
  • edited May 2013 Member
    On my side I will try a fresh install of the OS in a VM, without any update, and check if it works...
    But perhaps will do that tomorrow... getting late in here :(

    By the way, I just watched an entire movie on a share directly on one HDD, no BSOD :/
  • edited May 2013 Member
    Ok, just tested it, fresh install of server 2012 in a VM;
    added the disks, DP recognizes them directly (good :D)

    Join my domain, share over net...
    The test now, reset my XBMC db (thank you GIT, just another branch)...
    ... and go, rediscover my whole media library, and to stress a little play a movie at the same time...

    ... and... no BSOD, everything ok.
    Ok, now will try installing the updates, not all at the same time, to figure out which one is faulty.
    Pfff, Microsoft, what a shame!
  • edited May 2013 Member
    I have installed all updates but KB2811660 and KB2836988.
    And BSOD here we are!
    So none of these is involved :/

    Drashna, if you mitigated this problem for yourself, could you give me the details of your srv2.sys file?
    (date, version, size)
    And what did you do regarding the updates then? You deactivated them completely?
  • Covecube
    Guys,

    I've seen this a few times and have analyzed these dumps from a number of people.

    The crashes started coming in after Microsoft screwed up one of their updates (yes, it's true).

    See:

    And:

    ----

    As far as I can see, all of the dumps that I've looked at are implicating srv2.sys (which is Microsoft's SMB driver).

    All of the dumps that I've received indicate a memory corruption around network read / write I/O in srv2.sys.

    While it's possible for any kernel driver to corrupt memory of another driver (including ours), our read / write code hasn't changed recently (well, for over a year in that area), and we've never seen this type of error until recently.

    Also, all of the problems started around the time of MS13-036.

    In addition, and very importantly, none of the test servers here are experiencing this problem.

    This is the information that I gave to Christopher (our Technical Support) and this is what he's been conveying to you.

    As you know, I am always open to entertain other options, but at this time, it really looks like this is not caused by DrivePool.

    I'd really hate to push off the problem to someone else, but I have no other explanation in this case:
    • The problem only started recently.
    • The dump implicates srv2.sys (a Microsoft driver).
    • Our read / write code hasn't changed.
    • Clearly not everyone is experiencing this problem.
  • edited May 2013 Covecube
    I should also note one other important fact.

    My analysis has determined that this error only happens in the Windows 8 Kernel (that includes Windows Server 2012 and Essentials).

    Please let me know if this is not the case.
  • Covecube
    Paris,

    Another thing to note is that Drashna (our technical support rep) has repaired the issue without any code changes on my part in DrivePool.

    His BSOD was very easily reproducible (within minutes, well, he can explain this better than I can), while none of the test servers here were experiencing any type of instability at all with the same version.
  • Covecube
    Thanks Alex.

    And as for reproducible: I picked a large file, somewhere in the 6GB ballpark, and would try copying that file again and again until the system crashed. Sometimes it would crash immediately, sometimes it would take a couple of tries. But usually on the first file. After a while, I uninstalled DrivePool to make sure that wasn't it. But when I tried using \\SERVER\c$\FS\drive\file (yes, I have all my drives mounted to folders in C:\FS\), it would still crash. It tended to take a bit longer with DrivePool installed, but it crashed just as reliably.

    The irony is, none of my VMs were experiencing the same issue. Only the main server.

    And I tested this *thoroughly* over the course of 2 days. Knowing that every crash had a chance of corrupting data and damaging the drive. A week *after* I fixed my server, one of my 3TBs lost the partition table, and came up as a "RAW" disk. I have no doubt that was related to all the rebooting due to BSODs.

    As for my srv2.sys file, the current, "working" version is:
    Version: 6.2.9200.16579
    Date: 04/08/2013 19:33PM
    Size: 608KB

    And by mitigating .... I had a good server backup that worked. 
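To compare a driver file between two machines, the date and size can be read with the standard library alone. This is a sketch under that limitation: the file version shown above is stored in the file's version resource, which Python's standard library does not read, so it is left out here. The function name is illustrative.

```python
import os
from datetime import datetime


def file_details(path):
    """Return size and last-modified time of a file for side-by-side comparison."""
    info = os.stat(path)
    return {
        "path": path,
        "size_bytes": info.st_size,
        "size_kb": round(info.st_size / 1024),
        "modified": datetime.fromtimestamp(info.st_mtime).strftime("%Y-%m-%d %H:%M"),
    }


# Example usage on the server:
# print(file_details(r"C:\Windows\system32\drivers\srv2.sys"))
```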
  • Member
    Hi, thank you all for the details.

    Myself I have reproduced the exact same problem on a VM, so seems consistent:
    + install of server 2012;
    + activate remote desktop connections, change name, and make member of the domain;
    + reboot;
    + install DP 2.0.0.260
    + share the DP drive, stress that share from a remote client: read one big file in XBMC while updating the media library (lots of file accesses in little time)
    + No BSOD at all at this point.
    + make all updates from Microsoft;
    + reboot;
    + stress the share as before;
    + BSOD in seconds.
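The stress step in the list above (one large sequential read while many small files are touched) can be approximated without XBMC. The following is a minimal sketch, not what XBMC does internally: one thread streams a big file while a second thread walks the share and stats every file. All names and paths are assumptions.

```python
import os
import threading


def read_file(path, chunk=4 * 1024 * 1024):
    """Read a file sequentially to the end; return the number of bytes read."""
    total = 0
    with open(path, "rb") as handle:
        while True:
            data = handle.read(chunk)
            if not data:
                break
            total += len(data)
    return total


def list_tree(root):
    """Stat every file below `root`; return how many files were seen."""
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                os.stat(os.path.join(dirpath, name))
                count += 1
            except OSError:
                continue
    return count


def stress(big_file, root):
    """Run the sequential read and the tree listing at the same time."""
    results = {}
    reader = threading.Thread(
        target=lambda: results.update(bytes_read=read_file(big_file)))
    lister = threading.Thread(
        target=lambda: results.update(files_seen=list_tree(root)))
    reader.start()
    lister.start()
    reader.join()
    lister.join()
    return results


# Example usage (hypothetical paths on the pooled share):
# print(stress(r"\\server\Media\Movies\big_movie.mkv", r"\\server\Media"))
```

Running the same call against a direct disk share and against the pool share keeps the client-side load identical between the two cases.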

    And once again, no crash when accessing the drives shared directly, not through DP.
    So even if DP is not the first culprit (and it is proven by the tests above), it is definitely interacting badly with the updated srv2.sys

    Now I am looking for a way to work around this.
    My (supposedly faulty) srv2.sys version is exactly the same as yours Drashna.
    So perhaps is it something else coming from an update that has a side effect on srv2.sys
    I have tried uninstalling some updates from my test VM... reverting srv2.sys to a previous version (from 2012).
    No change at all, still the BSOD in seconds. So that would indicate that srv2.sys is not really the culprit.

    I do not really understand how you managed to get yours working, Drashna.
    Understand me, I am not trying to push the fault on DP here, but to find a solution.
    If a MS update is faulty and you do not know which one, what did you do?
    Deactivate all updates and not install any?
  • Member
    Ok, I have tried uninstalling all the updates from my VM (but one that you cannot).
    Still experiencing the BSOD!

    So I am good for a complete re-install of my system.
    I think I will deactivate MS updates on my server main OS.
    Use it for AD DC and file sharing, that's it.
    Then install another OS as VM that will be used for other things like SQL server.
    What a pity MS...
  • Member
    I have very bad news.

    Just reinstalled everything, without any update at all
    + install server 2012 
    + install drivers 
    + activate roles: hyperv, AD, file server 
    + activate windows 
    + reconfigure my users and groups in AD 
    + install SB scanner and DP 
    + recreate my shares 

    I do my stress test... 
    BSOD after 3 secs! 

    So guys I am afraid that the bad MS update explanation does not work anymore...
  • Member
    I have even tried reverting to my first backup, so having that state:
    + fresh server 2012 install
    + no specific driver installed, using MS defaults
    + no specific role
    + Windows not activated
    + install DP only
    + recreate one share in read only using the Administrator authentication

    Stress test... BSOD after 3 secs.
    I cannot have a cleaner install I am afraid.
  • Covecube
    Paris,

    Thanks for the info.

    I will follow your steps to try and reproduce the issue over here.
  • Covecube
    Paris, 
    I'd say run a memory test, but this is too specific to be memory (IMO). I'd recommend checking out the medium you're using to install, verify that it is indeed "intact". Also, have you run "sfc /scannow" after installing, just to make sure?

    And though I doubt it would help... have you tried it without DrivePool installed? I'm sure that you'd most likely experience the same BSOD issues, but... leave no stone unturned, right?

    And I agree, it sounds like there is definitely something else going on here and that srv2.sys is only part of the issue. And there isn't a way to disable SMB v3 (which is new to Server 2012), without disabling SMB v2. So no way to eliminate SMB3 as a culprit. And it's generally recommended NOT to disable smb v2 for any real length of time. And since both SMB2 and SMB3 appear to be part of the same module....
    If you're interested in how to disable it just to test:

    And it sounds like you are running Server 2012 Standard. If you got this through MSDN/TechNet/SA/DreamSpark/etc, you should have access to free support from Microsoft. It may not be a bad idea to follow up on that as well. (Otherwise, Microsoft charges in the $250 ballpark for each and every support request, even if it's something wrong with their code...)


  • edited May 2013 Member
    Hi there, thank you for the feedback.
    I will try what you recommend ASAP.
    Regarding the memtest, my server hardware is less than a month old.
    Before installing anything, I ran memtest86 from a bootable USB key for a few hours, and everything went fine.

    For the moment, the only workaround I found is:
    + install server 2012 (I am using the Datacenter edition)
    + install drivers 
    + activate roles: hyperv, AD, file server 
    + activate windows 
    + reconfigure my users and groups in AD
    + install another server 2012 in a VM
    + make it member of my domain
    + do not do any update in it, do not install any additional stuff (no driver, no AV)
    + install DP in the VM
    + mount my HDD as SCSI devices in the VM
    + share DP managed folder into that VM

    Doing so, it seems stable. No BSOD in the VM nor in the hosting server.
    But if I perform a Windows update into the VM, then BSOD in the VM.

    What really surprises me is that a fresh server install provokes BSOD in the host OS, but not in the VM.
    Looks like a bad interaction with some hardware drivers, doesn't it?
    I will publish my hardware specs in case it could be useful.

    I will try the sfc /scannow.
    The only things I tried for my HDDs until now are the Stablebit scanner (full scan of every HDDs, including sectors and filesystem), and no problem detected.
  • Covecube
    Paris,

    Just an update on my investigation:

    I've set up new hardware to test your case with a brand new Windows Server 2012 Standard install.

    I've updated the server with the latest updates and have been throwing all kinds of data at it.

    I've been copying large file test sets (up to 13 GB / file) and tiny test sets. I've also tried using both Windows 8 and Windows 7 as the client. I've also tried USB 3.0 vs. SATA and a duplicated vs. non-duplicated pool.

    So far, not a single crash.

    Next I'm going to disassemble srv2.sys to try and figure out exactly what it's trying to do at the point of the crash. I do know (from the dumps) that the crash is always happening on an uncached read.
This discussion has been closed.