I am trying to solve a persistent IO problem that occurs when we read from or write to a Windows 2003 Clustered Fileshare. It happens regularly and seems to be triggered by traffic. We are writing via .NET's FileStream object.
Basically we are writing from a Windows 2003 Server running IIS to a Windows 2003 file share cluster. When writing to the file share, the IIS server often gets two errors. One is an Application Popup from Windows, the other is a warning from MRxSmb. Both say the same thing:
[Delayed Write Failed] Windows was unable to save all the data for the file \Device\LanmanRedirector. The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere.
On reads, we are also getting errors, which are System.IO.IOException errors: "The specified network name is no longer available."
We have other servers writing more and larger files to this File Share Cluster without an issue. The problem only comes up from this one group of servers, so it doesn't seem related to writing large files. We've applied all the hotfixes referenced in articles online dealing with this issue, and yet it continues.
Our network team ran Network Monitor and didn't see any packet loss, from what I understand, but as I wasn't present for that test I can't say that for certain.
Any ideas of where to check? I'm out of avenues to explore or tests to run. I'm guessing the issue is some kind of network problem, but as it's only happening when these servers connect to that File Share cluster, I'm not sure what kind of problem it might be.
This issue is awfully specific, and potentially hardware related, but any help you can give would be of assistance.
Eric Sipple
I've heard of AutoDisconnect causing similar issues (even if the device isn't idle). You may want to try disabling that on the server.
I am having similar problems:
Writing to a machine that is also part of a Windows 2003 R2 NLB cluster sometimes results in "Delayed Write Failed", "the semaphore has timed out" or "the specified network name is no longer available".
This is reproducible for the same files, even after rebooting all machines involved.
If I rename the problem files (some of which are quite small), the problem remains.
If I write the files to another location (physical disk) on the same machine, the problem remains.
I uninstalled all anti-virus software; the problem remains.
I have reset the TCP/IP stack; the problem temporarily disappears, but after some time it returns for the same files.
PARTLY SOLVED the problem:
I deleted (not stopped) the host from the NLB cluster. Problem solved.
It seems to have something to do with writing to a share on a server that is also part of a network load balancing cluster.
I have not yet found other people posting NLB cluster related file write problems. However, I did find many posts complaining about similar problems, none of which seem to have been solved.
Anne
I've seen other people reporting the "delayed write failed" error. One recommendation was to adjust the size of the cache; there's a utility from Sysinternals (http://technet.microsoft.com/en-us/sysinternals/bb897561.aspx) that will allow you to do that.
I am running an Azure App Service instance and after about a day or so with perhaps 15-30 people using the site, no one can access the MySQL database anymore and they just get this error when initiating a request:
"An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full"
I have many instances that work just fine with MSSQL (all of my other App Services have SSL certificates), but this is my first App Service that has given this error (and it just so happens to use a MySQL database). This particular instance also currently does not have an SSL certificate, but I'm not sure if that has anything to do with the issue.
I have tried mimicking all settings from my working App Services and it still does not work. I'm not sure exactly how to diagnose the issue. All of my database calls are being closed and I am disposing of the connections, and also they all use the "using" statement, yet I still get this error. Any guidance would be greatly appreciated. Thanks.
"An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full"
The above error is usually not caused by MySQL itself but by the OS that MySQL is running on.
There are two OS-level reasons for this: the first is related to memory allocation and the other to TCP port allocation.
The error can occur if the MySQL program gets more memory (RAM) than the OS itself, so much so that the OS can't function properly.
The second reason would be that the port in use is higher than the allowed TCP “ephemeral” port range. On older Windows versions the default limit is 5000. Since you are using MySQL, whose default TCP port is 3306, this is probably not the reason.
To remedy the memory-related issue, edit the boot.ini file and remove the /3gb switch. Removing it stops applications from claiming the extra memory and leaves more memory for the OS.
You can edit boot.ini either with Bootcfg, a command-line tool for editing boot.ini, or with a text editor, ideally Notepad. You can open boot.ini by running the following command:
notepad.exe Boot.ini
Refer to this Microsoft documentation on editing Boot.ini.
You can also refer to this article by Dan Benediktson on the same issue.
This is my first Stack Overflow question so apologies if this isn't great...
I'm sure this is something either super simple I am missing or something very complex that I've gotten myself into, but I am using ClickOnce for the first time to create an automated updater for a company application I developed.
The application itself was originally written in VB but I have translated it into C#. We use this to automate a database of assets, which changes very frequently. I have been tasked to allow it to complete automated updates to keep from confusing some of the techs with uninstall/reinstalling the application weekly.
I volunteered to make an FTP server using a personal server machine I use at home. Normally this machine would be used for local networking but I've wanted to create an FTP server for some time (this is my first FTP server too).
So I went on my way, set the publish location for the build to ftp://[IP.ADDRESS]:21/Folder/Subfolder and the Installation folder URL to http://[IP.ADDRESS]:21/Folder/Subfolder
Long story short, when I try to test an update (changing only the assembly version), I get an error:
System.Deployment.Application.DeploymentDownloadException: Downloading http://[IP.ADDRESS]:21/Folder/Subfolder/applciation.application did not succeed ---> System.Net.WebException: The server committed a protocol violation.
I did some research and tried adding an SSL certificate and changed the update path to https://[IP.ADDRESS]:21/Folder/Subfolder/ then tested that. This time around, I get this error:
System.Deployment.Application.DeploymentDownloadException: Downloading http://[IP.ADDRESS]:21/Folder/Subfolder/applciation.application did not succeed ---> System.Net.WebException: The underlying connection was closed: An unexpected error occurred on a send. --> System.IO.IOException: The handshake failed due to an unexpected format.
I cannot tell if this is progress or if I moved backwards here LOL. I've been jumping back and forth and going to many threads to try to figure out where this is going wrong. I'm also having a pretty tricky time finding out if this is an error with how I've set up ClickOnce or if this is an error in how I have set up FTP with IIS.
Apologies if this is not enough information, I can provide more if necessary. Also apologies if this is too much information! Any help or guidance is appreciated!
I'm guessing you're working for a small company and infrastructure/resources are at a premium. With that in mind I'll offer some suggestions:
Does your company have a network shared drive? I don't like ClickOnce, but I have deployed it to network shares in the past with success. This has the benefit of you not needing to deal with security.
Have you considered migrating this to a web application? Web development seemed really daunting when I was a native app developer, but with Blazor and ASP.NET Core it's become a lot more accessible. This would completely get rid of the need for updating the application.
Consider an alternative deployment route. ClickOnce is not incredibly well supported.
I'd be remiss if I didn't throw a red flag on security. FTP is a very old protocol and is basically insecure by design. Hosting it on your home server means that you're transmitting the app over the public internet... What would happen if someone outside your company installed the application?
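One more thing that may be worth checking in the question's setup: the installation URL uses http:// (and later https://) on port 21, which is normally the FTP control port. An HTTP client connecting there would most likely receive an FTP greeting instead of an HTTP or TLS response, which could plausibly explain both error messages. The sketch below is only an illustration of that check; the function name is made up, and 203.0.113.10 is a placeholder documentation address, not the poster's server.

```python
from urllib.parse import urlsplit

# Well-known default ports for the schemes involved in the question.
DEFAULT_PORTS = {"ftp": 21, "http": 80, "https": 443}


def port_mismatch(url):
    """Return the scheme that normally owns the URL's explicit port,
    or None if there is no explicit port or no conflict."""
    parts = urlsplit(url)
    if parts.port is None:
        return None
    for scheme, port in DEFAULT_PORTS.items():
        if port == parts.port and scheme != parts.scheme:
            return scheme
    return None


if __name__ == "__main__":
    # Placeholder address (hypothetical), same URL shapes as in the question.
    for url in ("ftp://203.0.113.10:21/Folder/Subfolder",
                "http://203.0.113.10:21/Folder/Subfolder",
                "https://203.0.113.10:21/Folder/Subfolder"):
        owner = port_mismatch(url)
        verdict = "ok" if owner is None else f"port normally belongs to {owner}"
        print(url, "->", verdict)
```

If that turns out to be the cause, the usual arrangement is to publish over FTP but serve the installation URL from an actual web site on its own port.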
Background
I have written a utility that watches for files in a certain directory, and then copies them to defined target locations on remote machines. There is also a feature that allows stopping defined services in order to allow copying to the target.
In our work environment, these remote machines are typically VMs (we use VMWare Workstation) and the machines are part of a VM sub-domain, and are configured to use NAT networking (share the host machine's IP address). So when I say "remote" it's really referring to a VM running on the host.
Problem
For my utility, I'm trying to copy files using a UNC path to the target directory, and using the machine name get a list of services using the ServiceController.GetServices(string machineName) method.
So if you had a VM named server-1, you might be trying to copy a file to \\server-1\c$\destinationfolder. Most of the time this works, but sometimes I see an exception because the target directory can't be found. When this happens, we also see an error when trying to get services on the remote machine - "The RPC server is unavailable."
When the VM is restarted, everything works fine... for a while.
I'm having a hard time trying to nail down the issue, because it's sporadic and doesn't affect most people. I'm wondering if it's an IP issue, where VMWare changes the IP and it's stale in the host's cache? (If I sound like I don't really know what I'm talking about here, it's only because I don't... my networking knowledge is fairly basic). When I look up issues with the 'RPC server is unavailable' error, I see a lot of answers regarding firewalls, which I don't believe is the case here. We don't run anything like McAfee internally and since it works most of the time, it doesn't seem like the cause.
Actual Questions
Anyone have any thoughts as to what might cause this problem? As a follow-up, if it is a stale IP issue, how could I recreate the issue for debugging purposes, so I can try to come up with a good way to resolve it going forward?
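One way to gather evidence for or against the stale-IP theory would be to record what the machine name resolves to before each copy and flag any change or failed lookup. The following is a rough sketch, not part of the utility described above: the function names are invented, and it goes through the operating system's resolver, which may not be exactly the path a UNC copy or ServiceController uses.

```python
import socket


def resolve_ipv4(host, resolver=socket.gethostbyname):
    """Resolve host to an IPv4 string, or None if the lookup fails."""
    try:
        return resolver(host)
    except OSError:  # socket.gaierror and socket.herror are subclasses
        return None


def address_changed(host, last_seen, resolver=socket.gethostbyname):
    """Compare the current resolution of host with the last recorded one.

    Returns (changed, current). A failed lookup yields current=None and
    counts as a change whenever an address had been recorded before.
    """
    current = resolve_ipv4(host, resolver)
    return (current != last_seen, current)
```

Calling address_changed("server-1", previous) before each copy attempt and logging the result next to any "RPC server is unavailable" error should show whether the failures line up with an address change. The resolver parameter exists only so the check can be exercised with a fake lookup function.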
(Sorry if this is a really long question, it said to be specific)
The company I work for has a number of sites, which have been running for some time with no problems. The applications are a mix of ASP.NET 2.0, 3.5, and 4.0, all using ADO.NET to connect to a SQL Server Standard instance (on the same webserver) and all hosted with IIS7.
The problem began when we moved to an upgraded webserver. We made every effort to set up the server, db instance and IIS with the exact same settings (except for the different machine name, and the fact that we had upgraded from SQLExpress to Standard), and as far as we could tell, we did. Both servers are running Windows Server 2008 R2 (all current updates applied), and received a default install.
The problem is very apparent when starting up one of these applications. When you reach the login page of our application, the page itself loads extremely fast. This is true even when you load the page from a new machine that could not possibly have the page cached, with IIS caching disabled. The problem becomes visible when you enter your login information and click the login button. Because of the (not great) design of our databases, the login process must access a number of databases, theoretically up to 150 separate DBs, but in practice usually 2. The problem occurs even when only 2 databases (the minimum) are opened. Not a great design, but we have to live with it for now.
When trying to initially open a connection to the database, the entire process stops for about 20 seconds every time, regardless of whether you are connecting to 2 DBs or 40. I have run a .NET profiler (JetBrains dotTrace) against the process, and the only information I could take from it was that one or all of the calls to SqlConnection.Open() accounted for 90% of the time. This only happens on first use of the application, but the problem is compounded by the fact that IIS seems to disregard the recycling settings we have set for it, and recycles the application after a few minutes of idle time, causing the problem to occur again.
I also tried to use the SQL Server profiler to see which database operations were the cause of the slowdown, but because of all the other DB activity (and the fact that I had to do this on our production server, because the problem doesn't occur in our test environments), I couldn't pin down the exact operation that was causing the stoppage. I will try coming in late at night and shutting down the production sites to run the SQL profiler, but I might not be able to do this right away.
In the course of researching the problem, I have tried a couple of solutions:
Thinking it might be a name resolution problem, I tried modifying the hosts file on the webserver as well as giving the connection strings an IP address instead of the server name to resolve, with no difference. I have heard of the LLMNR protocol causing problems like this, but I think connecting by IP or resolving with the hosts file should have eliminated that possibility, though I admit I never tried actually turning off LLMNR.
I have increased the idle timeouts, recycling intervals etc. in IIS, but these don't even seem to be respected, much less solve the problem. This leads me to believe there is a setting overriding the IIS application settings on the machine.
Multiple other code fixes, none of which made any difference. Is a SQL Server setting causing the problem?
Other stuff that I have forgotten by now.
Any ideas, experience or whatevers would be greatly appreciated in helping me solve this problem!
I would advise using a non-TCP connection if you are still running the SQL instance on the local machine. SQL Server supports several protocols; TCP, named pipes, and shared memory are the more common ones.
Named Pipes
Data Source=np:computer\instance
Shared Memory
Data Source=lpc:computer\instance
Personally I prefer Shared Memory. Remember that you need to enable these protocols, and to avoid configuration mistakes I suggest you disable all the ones you are not using.
see http://msdn.microsoft.com/en-us/library/ms187892.aspx
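To make the prefixes above easier to compare, here is a small sketch that builds the Data Source part of a connection string for a chosen protocol. The helper name and the protocol keys are my own invention; only the np:, lpc: and tcp: prefixes come from SQL Server itself, and "computer" and "instance" are the same placeholders used above.

```python
# Protocol prefixes accepted in a SQL Server "Data Source" value.
PREFIXES = {"tcp": "tcp:", "named_pipes": "np:", "shared_memory": "lpc:"}


def data_source(protocol, computer, instance=None):
    """Build a Data Source value that forces one specific protocol."""
    if protocol not in PREFIXES:
        raise ValueError(f"unknown protocol: {protocol!r}")
    # A default instance is addressed by computer name alone.
    target = computer if instance is None else f"{computer}\\{instance}"
    return f"Data Source={PREFIXES[protocol]}{target}"


if __name__ == "__main__":
    print(data_source("shared_memory", "computer", "instance"))
    print(data_source("named_pipes", "computer", "instance"))
```

Keep in mind that shared memory only works when the client runs on the same machine as the SQL Server instance, which matches the setup described in the question.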
IIS Reset
In IIS7 there are two ways to configure the idle timeout. Both begin by clicking on the "Application Pools" section and right-clicking the appropriate application pool. If you click the "Recycling..." option there is one setting. The other is in "Advanced Settings...": under the "Process Model" section you will find "Idle Time-out (minutes)", which disables the process timeout when set to zero. This latter option is the one that works for us.
If I were you I'd solve this problem first as restarting the appdomain and/or worker process is always painful even if you don't have a 20 second lag.
Some ideas:
from the web server, can you ping the db server and get a "normal" response, or are you seeing a similar delay?
if you're seeing a delay, run a tracert to see if you can nail down where the slowness is occurring
try using a tool like QueryExpress (http://www.albahari.com/queryexpress.aspx) which doesn't require an install to run. You can download this EXE and run it from your web server. See if you can connect to your db using this and run queries in a normal fashion.
Try something like SysInternals' TcpView (http://technet.microsoft.com/en-us/sysinternals/bb897437) to take a look at your open connections and see what activity is happening on your server and how much data is being sent to and received from your db server.
Just some initial thoughts on where I'd start to look based upon your problem description. I hope this helps. Good luck with things!
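Along the same lines as the ping and tracert suggestions, it can help to time the name lookup and the raw TCP connect separately, so you can see which of the two (if either) eats the 20 seconds. This is only a rough sketch with an invented function name; it measures a plain TCP connection from Python and says nothing about what SqlConnection.Open() does after the socket is established.

```python
import socket
import time


def time_lookup_and_connect(host, port, timeout=30.0):
    """Return (lookup_seconds, connect_seconds) for one TCP connection."""
    start = time.perf_counter()
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    lookup = time.perf_counter() - start

    # Use the first address the resolver returned.
    family, socktype, proto, _, sockaddr = infos[0]
    start = time.perf_counter()
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        sock.connect(sockaddr)
    connect = time.perf_counter() - start
    return lookup, connect
```

You would call it with your own server name and port, for example time_lookup_and_connect("dbserver", 1433), where "dbserver" is a hypothetical name and 1433 is the default port for a default SQL Server instance (named instances usually listen on a different port). If both numbers are tiny, the delay is probably happening after the connection is made rather than in the network path.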
With IIS not respecting recycling settings: did restarting IIS/rebooting change the behavior?
Whenever I debug my application (ASP.NET Web Application converted to Web Role), I am able to get to the login page. I go ahead and sign in, debug through that and it seems to work fine, but as soon as it takes me to the landing page after login, DevFC.exe stops working with the error:
An unhandled exception ('System.Net.Sockets.SocketException') occurred in DevFC.exe [8072].
Now, I've searched for the issue and have seen something about DevFC.exe crashing due to VMWare Workstation that listens on the same port (12000) and HTC Sync that also listens on that port. I have neither of those applications on my machine, so I am lost here. Using TCPView (from Sysinternals), I find no other application using that port.
The one thing I do notice is that [System Process] goes crazy creating connections to localhost:12000 once DevFC.exe gets started.
Anyone have some insight on this?
This might sound ridiculous, but restarting your machine might solve the issue. If that doesn't work, try your project on a different machine. If the project works there, then there is an issue with your primary machine; try uninstalling and then reinstalling the Azure SDK. If the devFabric still crashes on the secondary machine, then it's something related to your project.
You might want to take a look at the logs created in the DevFC folder here: %localappdata%\dftmp\DevFCLogs (C:\Users\[user]\AppData\Local\dftmp\DevFCLogs). This will hopefully shed light on the actual error. In my case, it was a conflict on port 12001: I ran netstat -ab afterwards and found it was vmware-hostd.exe, a service included with VMWare Workstation 8. I know you said you don't have that, but you may have some other conflicting software.
See this thread as well for more detail.
http://social.msdn.microsoft.com/Forums/en-US/windowsazuredevelopment/thread/7e205afd-4b9a-4387-8e10-99e4b8f27788
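If you want a quick check before digging through the logs, the sketch below tries to bind a listener on the ports mentioned in this thread (12000 and 12001) and reports whether each one is free. The function name is made up, and this only tells you whether something already holds the port, not which process; for that you still need netstat -ab or TCPView.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP listener can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


if __name__ == "__main__":
    # Ports taken from the question and the answer above.
    for port in (12000, 12001):
        state = "free" if port_is_free(port) else "already in use"
        print(f"port {port}: {state}")
```

Run it while the emulator is stopped; a port that shows as already in use at that point is a likely candidate for the conflict.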