
Critical: BSOD when accessing DP over a network share


Comments

  • Member
    Thanks for the feedback, Alex.
    I don't know if it makes a difference, but my server edition is Datacenter.

    Really strange that the bug is not consistent on a VM...
    Do you think it could come from my client PC (making an illegal SMB call or whatever)?

    Anyway, I really appreciate your involvement and feedback on this :D
  • Covecube
    Do you think it could come from my client PC (making an illegal SMB call or whatever)?
    Yes, that's something that I'm considering. Or perhaps some network appliance is making those calls?

    So far, I've identified the exact piece of code in srv2.sys that is causing the crash, but I can't get any copy routine from a Windows client to hit that code.

    But it's too early to tell, I'll keep looking until I have something more definitive.
  • Covecube
    Paris,

    Once you get a chance, can you try to turn on Network I/O Boost in DrivePool and see if that avoids the crash?
  • edited May 2013 Member
    Hi Alex. Thank you very much for handling that.
    I am very busy at work currently, but will try that out ASAP.

    Also, I do software development as part of my job, so if you want me to try more advanced things, I can handle Visual Studio (C++).
    I don't know if there is a special debug mode for DP to create traces or anything like that.
  • Covecube
    Paris, there is a tracing function for DrivePool, "CoveFs_TracingEnabled", in here:


  • Member
    Should I also set CoveFsDebugOutput before I get a BSOD, to capture maximum info?
  • Member
    I've experienced this issue as well, on WSE 2012 and Windows 8 Pro.  I sent a stack in some time ago, but since the dump implicates srv2.sys it was pretty much dismissed.  I've tried using Driver Verifier to get a BSOD/dump, and while it will crash, for some reason it never produces a dump file for me, so I have abandoned that.

    I've had Network I/O Boost enabled for some time and still experience the BSOD, so I don't think that will remedy it.

    I had started a thread on TechNet about this issue:

  • Member
    One other note: I am using XBMC to access shares, but I've also experienced the crash using VLC to access music over the network from a share, and I believe (all things being equal) I've seen it just hitting the share via Explorer. So while I'd like to say it's due to XBMC's SMB access being screwy, I don't think that's the root cause either.
  • edited May 2013 Covecube
    Paris,

    Since you're a C++ coder, here's what's going on:

    • In the kernel we have something called MDLs (http://msdn.microsoft.com/en-us/library/windows/hardware/gg463193.aspx). In short, an MDL is a struct that describes a memory allocation.

    • In the process of servicing network requests, srv2.sys of course manages a set of buffers and requests the file system to fill them.

    • CoveFS (the DrivePool file system) simply forwards these MDLs down to NTFS without touching them in any way.

    • In the srv2.sys routine that is written to continue an existing read, there is a piece of code that frees an MDL. The problem is that this particular MDL has already been freed, and double-freeing memory is a very bad thing that leads to an immediate BSOD (a minimal sketch of this pattern follows this list).

    • The curiosity here is that it only runs this code for paged MDLs, meaning memory that is backed by a page file. It would be very unusual to create small temporary buffers from paged memory because that would slow everything down, so I'm not sure why srv2 would be doing this. And I haven't seen a single MDL pass through this code that is paged.

      This is what I've been trying to reproduce, and none of the machines here are generating these paged MDLs.

      I'm now suspecting that this may have something to do with the environment in which srv2 is running (like the CPU type).

    • The reason I suggested turning on Network I/O Boost is that this particular crash happens in a branch of code that is executed right after a successful fast I/O cached read. That should never happen with Network I/O Boost enabled, so I'm wondering if this will "trick" srv2 into going around the bad code and not crashing. Or it could just crash in a different place, but it would be good to know either way.
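
    To make the double free concrete, here is a minimal sketch in C. This is not srv2.sys source; the function and its flow are hypothetical illustrations (assuming a buffer from nonpaged pool), and only the MDL APIs (IoAllocateMdl, MmBuildMdlForNonPagedPool, IoFreeMdl) and the MdlFlags bit are real kernel facilities:

        /* Hypothetical sketch of the double-free pattern described above.
           This is NOT srv2.sys code; only the APIs and flags are real. */
        #include <ntddk.h>

        VOID SketchMdlDoubleFree(PVOID Buffer, ULONG Length)
        {
            /* Describe the buffer with an MDL, much like a server driver
               does before handing a read buffer down the file system
               stack. (Buffer is assumed to come from nonpaged pool.) */
            PMDL Mdl = IoAllocateMdl(Buffer, Length, FALSE, FALSE, NULL);
            if (Mdl == NULL) {
                return;
            }
            MmBuildMdlForNonPagedPool(Mdl);

            /* CoveFS forwards MDLs like this one down to NTFS untouched;
               eventually the MDL's owner frees it, exactly once: */
            IoFreeMdl(Mdl);

            /* The srv2.sys bug described above amounts to a second free
               of the same MDL, on a branch taken only for paged MDLs,
               roughly:

                   if ((Mdl->MdlFlags & MDL_SOURCE_IS_NONPAGED_POOL) == 0) {
                       IoFreeMdl(Mdl);   // double free -> immediate BSOD
                   }
            */
        }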

    I still don't see anything to indicate that this is being caused by DrivePool, but will continue to investigate.

    ----

    In terms of getting a better dump, yes you can help there.

    You can enable Driver Verifier (a command-line equivalent follows the steps below):

    • [Start] + R -> verifier
    • Select Create standard settings.
    • Select Automatically select all drivers installed on this computer.
    • Click Finish.
    • Reboot.
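
    If you prefer the command line, the same standard settings can be applied with verifier.exe, the stock Windows tool behind that wizard (the switches below are its documented ones):

        :: From an elevated command prompt, then reboot:
        verifier /standard /all

        :: To confirm which drivers are being verified:
        verifier /query

        :: When testing is done, turn verification back off (and reboot):
        verifier /reset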

    Then you can launch:
    C:\Program Files\StableBit\DrivePool\DrivePool.Service.exe

    (as admin)

    This will stop the system service and get you into the console, where you can interact with the DrivePool service.

    Hit 'T' to turn off the service logging.

    Now hit 'T' again to turn on tracing (this will turn on all tracing including file system tracing).

    Then reproduce the crash.
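
    Putting the capture steps above together (the path assumes a default install location; adjust if yours differs):

        :: From an elevated command prompt:
        "C:\Program Files\StableBit\DrivePool\DrivePool.Service.exe"

        :: In the console that opens:
        ::   press 'T' once  -> turns off the default service logging
        ::   press 'T' again -> turns on full tracing (including file
        ::                      system tracing)
        :: Then reproduce the crash.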

    Once you have a dump, upload it here:

    This will generate the best dump possible, and it will include file system trace data so that I can take a look at the exact sequence of events that led to this issue.
  • Covecube
    @Samael and @Paris

    Exactly what CPU are you using?
  • edited May 2013 Member
    Hi everyone,

    Alex, I will do what you recommend on Thursday and post the dumps in here.

    Samael, at the beginning I also thought that it could have been a bad interaction with XBMC.
    But, same as you, reading a file on the share through VLC produced exactly the same result.

    Here are the complete hardware specs of my server:
    • Seasonic G-650, 650W
    • ASRock Fatal1ty Z77 Professional
    • Intel Core i7 3770S
    • Crucial Ballistix Tactical LP, 2 x 8 GB, PC3-12800
    • 1xINTEL Cherryville 520 Series 240GB
    • 1xWD Black 750GB + 2xWD Red 3TB + 1xWD Green 3TB
  • Member
    i5 3570
    16GB Corsair Ballistix RAM (4x4, DDR3 1300), run through 9 passes of memtest86 while at work
    1x2TB WD Black (boot drive)
    1x1TB WD Green (backup drive)
    3x4TB WD Black (pool drives)
    1x2TB Samsung F4 (pool drive)
    I've seen this happen with 2 mobos, a Gigabyte Z77 chipset board and an ASRock Z77, but I don't think the model is really relevant.
    600 watt corsair psu
  • Member
    3 network cards: the onboard Realtek gigabit chipset, plus 2 additional PCI Realtek-based gigabit cards (dedicated to Hyper-V instances).  Traffic to the host OS (with DP) goes through the onboard NIC exclusively.
  • edited May 2013 Covecube
    Which just adds to the oddness: I'm using an AMD system and was getting the issue.
  • Member
    I did not mention it, but my motherboard has 2 integrated gigabit network cards.
    They are configured as a NIC team, and all the traffic goes through them (host OS and Hyper-V).
  • Covecube
    Hmm... I also have two NICs, and they were teamed... 
    Try breaking the team and see what happens.
  • Member
    @Drashna: yep, already tried it ;) no change.
  • Member
    Alex, I followed your steps and uploaded a 7z archive with:
    - the MEMORY.DMP
    - the minidump folder
    - the DrivePool logs folder

    This time it took longer to get the BSOD.
    I had to open a few video files for it to happen (usually it happened on the first file).

    Now I will try the same with Network I/O Boost activated.
  • edited May 2013 Member
    Tried it with Network I/O Boost, and so far I have not been able to reproduce the BSOD.

    To summarize:
    - host with Server 2012, Hyper-V, DNS, no updates
    - VM with Server 2012 (all updates), DP, share over DP
      + Network I/O Boost deactivated: BSOD
      + Network I/O Boost activated: no BSOD

    I will leave it like that for long-run tests over the weekend (with Verifier and DP in trace mode).
    If it does not crash, I will try to reactivate DP on the host OS, with the same configuration.
    I'll keep you posted...
  • Covecube
    Paris,

    Thanks for the dump, it was very informative.

    Right now it looks like a bad interaction between these drivers:

    • netvsc63.sys - Virtual NDIS6.3 Miniport
    • vmbkmcl.sys - Hyper-V VMBus KMCL
    • srv2.sys - SMB2 driver

    So it's a bad interaction between the networking drivers and srv2. DrivePool is not involved.

    I will post the full analysis, and how I reached these conclusions, on the new forum.
  • Member
    Thanks for the feedback, Alex.

    You say that DrivePool is not involved, but:
    - changing an option in DrivePool prevents the BSOD (tested the whole day and it seems OK with Network I/O Boost)
    - the BSOD only happens when using shares over DrivePool

    So to me the conclusion would rather be: DrivePool is not causing the BSOD directly, but it triggers it as a side effect...