I'm trying to convert a file's encoding and replace some text along the way. Unfortunately, I'm getting an OutOfMemory exception. I'm not sure why. As I understand it, it streams the original file line by line into a var (str), completes a couple of string replacements, and then writes the converted line to the StreamWriter.
Can someone tell me what I'm doing wrong here?
EDIT 1
- I'm currently testing a single file - 1GB:2.5m rows.
- Replaced read and replace into a single line. Same results!
EDIT 2
By the way, can anyone tell me why the question was downvoted? I'd like to know for future postings.
The problem is with the file itself. It's output from SQL Server BCP where I explicitly flag the row terminator with a specific string. By default, when the row terminator flag is omitted, BCP adds a newline at the end of each row and the code below works perfectly.
What I still don't understand is: when I set the row terminator flag with a specific string, each record appears on a new line, so why doesn't StreamReader see each record as a separate line? Instead, it appears to treat the entire file as one long line. That still doesn't explain the OOM exception, since I have well over 100 GB of memory.
Unfortunately, explicitly setting the row terminator flag is a must. For now, I'll take this over to dba exchange.
Thanks
static void Main(string[] args)
{
String msg = String.Empty;
String str = String.Empty;
DirectoryInfo dInfo = new DirectoryInfo(@"\\server\share");
foreach (var f in dInfo.GetFiles())
{
using (StreamReader sr = new StreamReader(f.FullName, Encoding.Unicode, false))
{
using (StreamWriter sw = new StreamWriter(f.DirectoryName + "\\new\\" + f.Name, false, Encoding.UTF8))
{
try
{
while (!sr.EndOfStream)
{
str = sr.ReadLine().Replace("this","that");
sw.WriteLine(str);
}
}
catch (Exception e)
{
msg += f.Name + ": " + e.Message;
}
}
}
}
Console.WriteLine(msg);
Console.ReadLine();
}
Well, your main reading and writing code only needs one line of data in memory at a time. Your msg string, on the other hand, keeps growing with each exception.
You would need many millions of files in the folder to get an OutOfMemory exception this way, though.
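For what it's worth, StreamReader.ReadLine only treats "\n", "\r", or "\r\n" as the end of a line. If the BCP row terminator contains none of those, ReadLine will try to return the whole file as a single string, and a string that large can likely hit the runtime's per-object size limits no matter how much RAM is installed. Below is a minimal sketch of reading records split on an arbitrary terminator instead. The TerminatedReader class, the ReadRecords method, and the "|~|" terminator in the usage note are made up for illustration; they are not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class TerminatedReader
{
    // Yields records separated by an arbitrary terminator string,
    // reading the input in fixed-size chunks instead of via ReadLine.
    public static IEnumerable<string> ReadRecords(TextReader reader, string terminator)
    {
        if (string.IsNullOrEmpty(terminator))
            throw new ArgumentException("Terminator must not be empty.", nameof(terminator));

        var pending = new StringBuilder();   // text not yet returned as a record
        var buffer = new char[4096];
        int read;
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            pending.Append(buffer, 0, read);
            string text = pending.ToString();
            int start = 0;
            int hit;
            while ((hit = text.IndexOf(terminator, start, StringComparison.Ordinal)) >= 0)
            {
                yield return text.Substring(start, hit - start);
                start = hit + terminator.Length;
            }
            // Keep the unfinished tail; it may hold a terminator split across two chunks.
            pending.Clear();
            pending.Append(text, start, text.Length - start);
        }
        if (pending.Length > 0)
            yield return pending.ToString(); // last record without a trailing terminator
    }
}
```

In the loop from the question, this would take the place of the ReadLine call, e.g. foreach (var record in TerminatedReader.ReadRecords(sr, "|~|")) sw.WriteLine(record.Replace("this", "that")); where "|~|" stands in for whatever terminator was passed to BCP.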
I am trying to modify a file-stream inline as the file has the potential to be very large and I don't want to load it into memory. The piece of information I'm editing will always be the same length so in theory I can just swap the content out using a stream reader but it doesn't seem to be writing to the correct place
I have created a section of code that using a stream reader will read line by line until it finds a regex match and will then attempt to swap the bytes out with the edited line. The code is as follows:
private void UpdateFile(string newValue, string path, string pattern)
{
var regex = new Regex(pattern, RegexOptions.IgnoreCase);
int index = 0;
string line = "";
using (var fileStream = File.OpenRead(path))
using (var streamReader = new StreamReader(fileStream, Encoding.Default, true, 128))
{
while ((line = streamReader.ReadLine()) != null)
{
if (regex.Match(line).Success)
{
break;
}
index += Encoding.Default.GetBytes(line).Length;
}
}
if (line != null)
{
using (Stream stream = File.Open(path, FileMode.Open))
{
stream.Position = index + 1;
var newLine = regex.Replace(line, newValue);
var oldBytes = Encoding.Default.GetBytes(line);
var newBytes = Encoding.Default.GetBytes("\n" + newLine);
stream.Write(newBytes, 0, newBytes.Length);
}
}
}
The code almost works as expected: it inserts the updated line, but always a little too early, and how early varies slightly with the file I'm editing. I expect it has something to do with the way I am managing the stream position, but I don't know the correct way to approach this.
Unfortunately the exact files I'm working on are under NDA.
The structure is as follows though:
A file will have an unknown amount of data followed by a line of a known format, for example:
Description: ABCDEF
I know the portion that follows "Description: " will always be 6 characters, so I do a replace on the line to replace with, for example, UVWXYZ.
The problem is that for example if a file read as
'...
UNIMPORTANT UNKNOWN DATA
DESCRIPTION: ABCDEF
MORE DATA
...'
it will come out as something like
'...
UNIMPORTANT UNKNOWN DDESCRIPTION: UVWXYZDEF
MORE DATA
...'
I think the problem here is that you are not counting the line feed ("\n") that ends each line, so your index puts the stream at the wrong position. Note that this assumes lines end in a single "\n"; with "\r\n" endings each line is one byte longer still. Try the following code:
private void UpdateFile(string newValue, string path, string pattern)
{
var regex = new Regex(pattern, RegexOptions.IgnoreCase);
int index = 0;
string line = "";
using (var fileStream = File.OpenRead(path))
using (var streamReader = new StreamReader(fileStream, Encoding.Default, true, 128))
{
while ((line = streamReader.ReadLine()) != null)
{
if (regex.Match(line).Success)
{
break;
}
index += Encoding.Default.GetBytes(line + "\n").Length;
}
}
if (line != null)
{
using (Stream stream = File.Open(path, FileMode.Open))
{
stream.Position = index;
var newBytes = Encoding.Default.GetBytes(regex.Replace(line + "\n", newValue));
stream.Write(newBytes, 0, newBytes.Length);
}
}
}
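Counting bytes per decoded line stays fragile, because ReadLine hides whether a line ended in "\n" or "\r\n" and the reader may also skip a byte-order mark. As an alternative, here is a rough sketch that skips the StreamReader entirely and searches the raw bytes for the marker, then overwrites the bytes right after it. This is a different technique from the code above, not a drop-in fix: the match is case-sensitive (unlike the regex with IgnoreCase), and the simple matcher assumes the marker's first byte does not occur again inside the marker, which the code checks. The InPlacePatch class and OverwriteAfterMarker method are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

static class InPlacePatch
{
    // Finds the first occurrence of marker in the file's raw bytes and overwrites
    // the bytes immediately after it with replacement. Returns false if not found.
    public static bool OverwriteAfterMarker(string path, string marker, string replacement, Encoding encoding)
    {
        byte[] markerBytes = encoding.GetBytes(marker);
        byte[] newBytes = encoding.GetBytes(replacement);
        if (markerBytes.Length == 0)
            throw new ArgumentException("Marker must not be empty.", nameof(marker));
        // Assumption that keeps the matcher simple: the first byte is unique within the marker.
        if (Array.IndexOf(markerBytes, markerBytes[0], 1) >= 0)
            throw new ArgumentException("Marker's first byte must not repeat inside the marker.", nameof(marker));

        using (var stream = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
        {
            int matched = 0; // number of marker bytes matched so far
            int b;
            while ((b = stream.ReadByte()) != -1)
            {
                if (b == markerBytes[matched])
                    matched++;
                else
                    matched = (b == markerBytes[0]) ? 1 : 0;

                if (matched == markerBytes.Length)
                {
                    // The stream is now positioned on the first byte after the marker.
                    stream.Write(newBytes, 0, newBytes.Length);
                    return true;
                }
            }
        }
        return false;
    }
}
```

For the example in the question, a call might look like InPlacePatch.OverwriteAfterMarker(path, "DESCRIPTION: ", "UVWXYZ", Encoding.Default). Because only the six bytes after the marker are rewritten, the line ending and everything after it are left untouched.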
In your example, you are "off" by 4 characters. Not quite the common "off by one" error, but close. But maybe a different pattern would help the most?
Programs nowadays rarely work "on the file" like that. There is just too much that can go wrong, all the way up to a power loss mid-process. Instead they:
1. Create an empty new file in the same location, often with a temporary name and hidden.
2. Write the output to the new file.
3. Once you are done and everything is good (all the caches are flushed and everything is on disk, which Stream.Close() or Dispose() takes care of), replace the old file with the new file using the OS move operation.
The advantage is that you cannot lose the original data. Even if the computer loses power mid-operation, at worst the temporary file is messed up. You still have the original file, and you can just delete the temporary file and restart the work from scratch if you need to. Indeed, trying to recover the temporary file only makes sense in rare cases (word processors).
The replacement of the old file by the new file is done with a move. If both are on the same partition, that is literally just a rename operation in the file system. And as modern file systems are basically designed like top-of-the-line, robust relational databases, there is no danger in this.
You can find that pattern in everything from your word processor of choice to backup programs, the download manager of Firefox (as you might be overwriting a file that was there before) and even zipping programs. Every time you have a long writing phase and want to minimize the danger, it is the go-to pattern.
And as you can work entirely in memory, without having to deal with moving the read/write position around, it will get around your issue too.
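Edit: I wrote some source code for it from memory/documentation. Treat it as a sketch.
string sourcepath; // contains the source file path, set by other code
string temppath; // contains the path of the temp file. Should be in the same folder, and thus on the same partition
// Open the input and the output. A single using statement cannot declare two different types, so the usings are stacked.
// Suppressing OS write caching (WriteThrough) on the output is optional and will be detrimental to performance.
// Both reader and writer default to UTF-8 here; pass an Encoding explicitly if the file uses another one.
using (var reader = new StreamReader(sourcepath))
using (var outStream = new FileStream(temppath, FileMode.Create, FileAccess.Write, FileShare.None, 4096, FileOptions.WriteThrough))
using (var writer = new StreamWriter(outStream))
{
string line;
// iterate over the input
while ((line = reader.ReadLine()) != null)
{
// do processing on line here
// note: WriteLine ends every line with Environment.NewLine, whatever the source used
writer.WriteLine(line);
}
}
// Replace the original with the temp file. File.Move throws if the destination already exists, so use File.Replace (null = keep no backup copy).
File.Replace(temppath, sourcepath, null);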
I created a class with the responsibility to generate a text file where each line represents the information of an object of 'MyDataClass' class. Below is a simplification of my code:
public class Generator
{
private readonly Stream _stream;
private readonly StreamWriter _streamWriter;
private readonly List<MyDataClass> _items;
public Generator(Stream stream)
{
_stream = stream;
_streamWriter = new StreamWriter(_stream, Encoding.GetEncoding("ISO-8859-1"));
}
public void Generate()
{
foreach (var item in _items)
{
var line = AnotherClass.GetLineFrom(item);
_streamWriter.WriteLine(line);
}
_streamWriter.Flush();
_stream.Position = 0;
}
}
And I call this class like this:
using (var file = new FileStream("name", FileMode.OpenOrCreate, FileAccess.ReadWrite))
{
new Generator(file).Generate();
}
When I run the application in Visual Studio (I tested with run (Ctrl+F5) and debug (F5), in both Debug and Release mode), everything goes according to plan. But when I publish the application to an IIS server, the StreamWriter class puts an extra \r before the end of each line.
Check out the hexadecimal dumps of both generated files:
Running in Visual Studio:
http://www.jonataspiazzi.xpg.com.br/hex_vs.bmp
Running in IIS:
http://www.jonataspiazzi.xpg.com.br/hex_iis.bmp
Some things I already checked:
Logged the line variable (from var line = AnotherClass.GetLineFrom(item);) to see whether AnotherClass was including an extra '\r'.
That turned up nothing: the last char in line is a regular char, as expected (a space in the example above).
Wrote a separate piece of code to see whether the problem affects all StreamWriter instances on IIS.
I tried this:
var ms = new MemoryStream();
var sw = new StreamWriter(ms, Encoding.GetEncoding("ISO-8859-1"));
sw.WriteLine("Test");
sw.WriteLine("Of");
sw.WriteLine("Lines");
sw.Flush();
ms.Position = 0;
In this case the code works well for both visual studio and IIS.
I've been at this for 3 days and have already tried everything I can think of. Does anyone have a clue about what else I can try?
UPDATE
It gets weirder! I tried replacing the line _streamWriter.WriteLine(line); with:
_streamWriter.Write(linhaTexto + Environment.NewLine);
And even worse:
_streamWriter.Write(linhaTexto + "\r\n");
Both keep generating the extra \r character.
I then tried replacing it with this:
_streamWriter.Write(linhaTexto + "#\r\n#");
And get:
http://www.jonataspiazzi.xpg.com.br/hex_sharp.bmp
According to MSDN, WriteLine:
Writes data followed by a line terminator to the text string or stream.
So your last line should be
_streamWriter.Write(line);
Put it outside of your loop and change your loop so it doesn't manage the last line.
My guess is that the extra \r is added during FTP (maybe try a binary transfer)
Like here
I've tested the code, and the extra \r is not caused by the code in the question.
I had a similar issue. Environment.NewLine and WriteLine gave me an extra \r character, but the code below worked for me:
StringBuilder sbFileContent = new StringBuilder();
sbFileContent.Append(line);
sbFileContent.Append("\n");
streamWriter.Write(sbFileContent.ToString());
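A possibly tidier way to get the same effect is the writer's NewLine property, which controls what WriteLine appends. The sketch below writes a few lines with an explicit terminator and returns the raw bytes so they can be inspected without a hex editor. It only pins down what the writer itself produces; it does not explain where an extra \r comes from if something downstream (a transfer, for instance) rewrites line endings. NewLineDemo and WriteLines are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

static class NewLineDemo
{
    // Writes each line followed by the given terminator and returns the bytes produced.
    public static byte[] WriteLines(string newLine, params string[] lines)
    {
        using (var ms = new MemoryStream())
        {
            using (var sw = new StreamWriter(ms, Encoding.GetEncoding("ISO-8859-1")))
            {
                sw.NewLine = newLine; // WriteLine now appends exactly this string
                foreach (var line in lines)
                    sw.WriteLine(line);
            } // disposing the writer flushes it into the MemoryStream
            return ms.ToArray();
        }
    }
}
```

For example, BitConverter.ToString(NewLineDemo.WriteLines("\n", "Test", "Of")) gives 54-65-73-74-0A-4F-66-0A, with a single 0A after each line and no 0D anywhere.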
I just now had a similar problem where the code below would randomly insert blank lines in the output file (outFile)
using (StreamWriter outFile = new StreamWriter(outFilePath, true)) {
foreach (string line in File.ReadLines(logPath)) {
string concatLine = parse(line, out bool shouldWrite);
if (shouldWrite) {
outFile.WriteLine(concatLine);
}
}
}
Using Antar's idea, I changed my parse function so that it returned a line with Environment.NewLine appended, i.e.
return myStringBuilder.Append(Environment.NewLine).ToString();
and then in the foreach loop above, changed the
outFile.WriteLine(concatLine);
to
outFile.Write(concatLine);
and now it writes the file without a bunch of random new lines inserted. However, I still have absolutely no idea why I should have to do this.
I need to be able to take a text file with unknown encoding (e.g., UTF-8, UTF-16, ...) and copy it line by line, making specific changes as I go. In this example, I am changing the encoding, however there are other uses for this kind of processing.
What I can't figure out is how to determine if the last line has a newline! Some programs care about the difference between a file with these records:
Rec1<newline>
Rec2<newline>
And a file with these:
Rec1<newline>
Rec2
How can I tell the difference in my code so that I can take appropriate action?
using (StreamReader reader = new StreamReader(sourcePath))
using (StreamWriter writer = new StreamWriter(destinationPath, false, outputEncoding))
{
bool isFirstLine = true;
while (!reader.EndOfStream)
{
string line = reader.ReadLine();
if (isFirstLine)
{
writer.Write(line);
isFirstLine = false;
}
else
{
writer.Write("\r\n" + line);
}
}
//if (LastLineHasNewline)
//{
// writer.Write("\n");
//}
writer.Flush();
}
The commented-out code is what I want to be able to do, but I can't figure out how to set the condition LastLineHasNewline! Remember, I have no a priori knowledge of the input file encoding.
Remember, I have no a priori knowledge of the input file encoding.
That's the fundamental problem to solve.
If the file could be using any encoding, then there is no concept of reading "line by line" as you can't possibly tell what the line ending is.
I suggest you first address this part, and the rest will be easy. Now, without knowing the context it's hard to say whether that means you should be asking the user for the encoding, or detecting it heuristically, or something else - but I wouldn't start trying to use the data before you can fully understand it.
As often happens, the moment you go to ask for help, the answer comes to the surface. The commented out code becomes:
if (LastLineHasNewline(reader))
{
writer.Write("\n");
}
And the function looks like this:
private static bool LastLineHasNewline(StreamReader reader)
{
byte[] newlineBytes = reader.CurrentEncoding.GetBytes("\n");
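int newlineByteCount = newlineBytes.Length;
// A stream shorter than one encoded newline cannot end with one, and seeking before its start would throw
if (reader.BaseStream.Length < newlineByteCount)
return false;
reader.BaseStream.Seek(-newlineByteCount, SeekOrigin.End);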
byte[] inputBytes = new byte[newlineByteCount];
reader.BaseStream.Read(inputBytes, 0, newlineByteCount);
for (int i = 0; i < newlineByteCount; i++)
{
if (newlineBytes[i] != inputBytes[i])
return false;
}
return true;
}
I have an application that crunches a bunch of text files. Currently, I have code like this (snipped-together excerpt):
FileInfo info = new FileInfo(...);
if (info.Length > 0) {
string contents = getFileContents(...);
// uses a StreamReader
// returns reader.ReadToEnd();
Debug.Assert(!string.IsNullOrEmpty(contents)); // FAIL
}
private string getFileContents(string filename)
{
TextReader reader = null;
string text = "";
try
{
reader = new StreamReader(filename);
text = reader.ReadToEnd();
}
catch (IOException e)
{
// File is concurrently accessed. Come back later.
text = "";
}
finally
{
if (reader != null)
{
reader.Close();
}
}
return text;
}
Why am I getting a failed assert? The FileInfo.Length property was already used to validate that the file is non-empty.
Edit: This appears to be a bug in my own code -- I'm catching IO exceptions and returning an empty string. But, because of the discussion around fileInfo.Length, here's something interesting: fileInfo.Length returns 2 for an empty text file that contains only a BOM (created in Notepad).
You might have a file which is empty apart from a byte-order mark. I think TextReader.ReadToEnd() would remove the byte-order mark, giving you an empty string.
Alternatively, the file could have been truncated between checking the length and reading it.
For diagnostic purposes, I suggest you log the file length when you get an empty string.
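The byte-order-mark case is easy to reproduce. The sketch below writes the given bytes to a temporary file, reports the length on disk, and reads the file back with a default StreamReader, which detects and strips a leading BOM. BomDemo and ReadBack are made-up names for illustration.

```csharp
using System;
using System.IO;

static class BomDemo
{
    // Writes fileBytes to a temp file, reports its size on disk,
    // and returns what a default StreamReader decodes from it.
    public static string ReadBack(byte[] fileBytes, out long lengthOnDisk)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(path, fileBytes);
            lengthOnDisk = new FileInfo(path).Length;
            using (var reader = new StreamReader(path)) // BOM detection is on by default
            {
                return reader.ReadToEnd();
            }
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Calling BomDemo.ReadBack(new byte[] { 0xFF, 0xFE }, out long length) with just the UTF-16 little-endian BOM leaves length at 2 while the returned string is empty, which is exactly the combination that trips the assert.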
See that catch (IOException) block you have? That's what returns an empty string and triggers the assert even when the file is not empty.
If I remember correctly, a file ends with an end-of-file marker, which won't be included when you call ReadToEnd.
Therefore, the file size is not 0, but its content size is.
What's in the getFileContents method?
It may be repositioning the stream's pointer to the end of the stream before ReadToEnd() is called.