Problem with indexed XML file - c#

I scanned a 2.8 GB XML file for the positions (indexes) of particular tags. Then I use the Seek method to set a start point in that file. The file is UTF-8 encoded.
So indexing is like that:
using (StreamReader sr = new StreamReader(pathToFile))
{
    long index = 0;
    while (!sr.EndOfStream)
    {
        string line = sr.ReadLine();
        index += (line.Length + 2); // remember the \r\n chars
        if (LineHasTag(line))
        {
            SaveIndex(index - line.Length - 2); // need the beginning of the line
        }
    }
}
So afterwards I have the indexed positions in another file. But when I use Seek it doesn't seem to work, because the position is set somewhere before where it should be.
I loaded some content of that file into a char array and manually checked the correct index of a tag I need. It's the same as the one indexed by the code above. But the Seek method on StreamReader.BaseStream still places the pointer earlier in the file. Quite strange.
Any suggestions?
Best regards,
ventus

Seek deals in bytes - you're assuming there's one byte per character. In UTF-8, one character in the BMP can take up to three bytes.
My guess is that you've got non-ASCII characters in your file - those will take more than one byte.
I think there may also be a potential problem with the byte order mark, if there is one. I can't remember offhand whether StreamReader will swallow that automatically - which would put you off by 3 bytes to start with.
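If byte-accurate offsets are needed, one option is to count encoded bytes instead of characters while scanning. Below is a sketch of that idea, keeping the question's own LineHasTag and SaveIndex methods and its assumption of \r\n line endings; HasUtf8Bom is a hypothetical helper defined below, not a framework method:
using System.IO;
using System.Text;

// Start past the UTF-8 BOM (EF BB BF) if there is one: StreamReader
// consumes it silently, but Seek on the raw stream still counts it.
long byteIndex = HasUtf8Bom(pathToFile) ? 3 : 0;

using (StreamReader sr = new StreamReader(pathToFile, Encoding.UTF8))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        if (LineHasTag(line))
        {
            SaveIndex(byteIndex); // byte offset of the start of this line
        }
        // Advance by the encoded byte length, not the char count;
        // +2 keeps the question's assumption of \r\n line endings.
        byteIndex += Encoding.UTF8.GetByteCount(line) + 2;
    }
}

static bool HasUtf8Bom(string path)
{
    using (FileStream fs = File.OpenRead(path))
    {
        byte[] bom = new byte[3];
        return fs.Read(bom, 0, 3) == 3
            && bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF;
    }
}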

Related

Write text to file in C# with 513 space characters

Here is code that writes the string to a file:
System.IO.File.WriteAllText("test.txt", "P ");
It's basically the character 'P' followed by a total of 513 space characters.
When I open the file in Notepad++, it appears to be fine. However, when I open it in Windows Notepad, all I see is garbled characters.
If instead of 513 space characters I add 514 or 512, it opens fine in Notepad.
What am I missing?
What you are missing is that Notepad is guessing, and it is not because your length is specifically 513 spaces ... it is because it is an even number of bytes and the file size is >= 100 total bytes. Try 511 or 515 spaces ... or 99 ... you'll see the same misinterpretation of your file contents. With an odd number of bytes, Notepad can assume that your file is not any of the double-byte encodings, because those would all result in 2 bytes per character = even number of total bytes in the file. If you give the file a few more low-order ASCII characters at the beginning (e.g., "PICKLE" + spaces), Notepad does a much better job of understanding that it should treat the content as single-byte chars.
The suggested approach of including Encoding.UTF8 is the easiest fix ... it will write a BOM to the beginning of the file which tells Notepad (and Notepad++) what the format of the data is, so that it doesn't have to resort to this guessing behavior (you can see the difference between your original approach and the BOM approach by opening both in Notepad++, then look in the bottom-right corner of the app. With the BOM, it will tell you the encoding is UTF-8-BOM ... without it, it will just say UTF-8).
I should also say that the contents of your file are not 'wrong', per se... the weird format is purely due to Notepad's "guessing" algorithm. So unless it's a requirement that people use Notepad to read your file with 1 letter and a large, odd number of spaces ... maybe just don't sweat it. If you do change to writing the file with Encoding.UTF8, then you do need to ensure that any other system that reads your file knows how to honor the BOM, because it is a real change to the contents of your file. If you cannot verify that all consumers of your file can/will handle the BOM, then it may be safer to just understand that Notepad happens to make a bad guess for your specific use case, and leave the raw contents exactly how you want them.
You can verify the physical difference in your file with the BOM by doing a binary read and then converting the bytes to a string (you can't "see" the change with ReadAllText, because it honors & strips the BOM):
byte[] contents = System.IO.File.ReadAllBytes("test.txt");
Console.WriteLine(Encoding.ASCII.GetString(contents));
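If you want to see the BOM itself, you can dump the first bytes as hex; with the Encoding.UTF8 overload the file will start with EF BB BF. A small sketch using the same test.txt as above:
using System;
using System.IO;

byte[] contents = File.ReadAllBytes("test.txt");
// "EF-BB-BF" at the start indicates a UTF-8 BOM.
Console.WriteLine(BitConverter.ToString(contents, 0, Math.Min(3, contents.Length)));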
Try passing in a different encoding:
i. System.IO.File.WriteAllText(filename, stringVariable, Encoding.UTF8);
ii. System.IO.File.WriteAllText(filename, stringVariable, Encoding.UTF32);
iii. etc.
Also, you could try another way to build your string, to make it easier to read, change, and count, instead of tapping the space bar 513 times:
i. Use the string constructor (as @Tigran suggested)
var result = "P" + new String(' ', 513);
ii. Use the StringBuilder
var stringBuilder = new StringBuilder();
stringBuilder.Append("P");
for (var i = 1; i <= 513; i++) { stringBuilder.Append(" "); }
iii. Or both
public string AppendSpacesToString(string stringValue, int numberOfSpaces)
{
    var stringBuilder = new StringBuilder();
    stringBuilder.Append(stringValue);
    stringBuilder.Append(new String(' ', numberOfSpaces));
    return stringBuilder.ToString();
}

Reading a text file from Unity3d

I have an error in a script which reads from a text file outside the program.
The error is
FormatException: Input string was not in the correct format
It's obvious what's wrong, but I just don't understand why it can't read it properly.
My code:
using (FileStream fs = new FileStream(@"D:\Program Files (x86)\Steam\SteamApps\common\blabla...to my file.txt))
{
    byte[] b = new byte[1024];
    UTF8Encoding temp = new UTF8Encoding(true);
    while (fs.Read(b, 0, b.Length) > 0)
    {
        //Debug.Log(temp.GetString(b));
        var converToInt = int.Parse(temp.GetString(b));
        externalAmount = converToInt;
    }
    fs.Close();
}
The text file has 4 lines of values.
Each line represents an object in a game. All I am trying to do is read these values. Unfortunately I get the error, which I can't explain.
So how can I read new lines without getting the error?
the text file looks like this
12
5
6
0
4 lines, no more, all values on a separate line.
There's no closing " on your new FileStream(" ...); but I'm going to assume that's an issue from copy-pasting your code to Stack Overflow.
The error you're getting is likely because you're trying to parse spaces to int, which won't work; the input string (" " in this case) was not in the correct format (an int).
Split your lines on spaces (.Split(' ')) and parse every item in the resulting array.
A couple problems:
Problem 1
fs.Read(b, 0, b.Length) may read one byte, or all of them. The normal way to read a text file like this is to use StreamReader instead of FileStream. The StreamReader has a convenience constructor for opening a file that works the same way, but it can read line by line and is much more convenient. Here's the documentation and an excellent example: https://msdn.microsoft.com/en-us/library/f2ke0fzy(v=vs.110).aspx
If you insist on reading directly from a FileStream, you will either need to
Parse your string outside the loop so you can be certain you've read the whole file into your byte buffer (b), or
Parse the new content byte by byte until you find a particular separator (for example a space or a newline) and then parse everything in your buffer and reset the buffer.
Problem 2
Most likely your buffer already contains everything in the file. Your file is so small that the FileStream object is probably reading the whole thing in a single shot, even though that's not guaranteed.
Since your string buffer contains ALL the characters in the file, you are effectively trying to parse "12\n5\n6\n0" as an integer, and the parser is choking on the newline characters. Since newlines are non-numeric, it has no idea how to interpret them.
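Following the StreamReader suggestion from Problem 1, here is a minimal sketch of reading the four values line by line; pathToFile stands in for the truncated Steam path, and externalAmount is the question's own variable:
using System.IO;

using (StreamReader reader = new StreamReader(pathToFile))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // Each line holds one integer, e.g. "12".
        externalAmount = int.Parse(line); // keeps the last value, like the original loop
    }
}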

Without Overwriting, is it possible to remove string from a text file?

I am trying to append a line to a text file, but by removing the enclosing bracket first.
Below is what my text file data looks like:
{
"1455494402": 8,
"1456272000": 2,
"1456358400": 1}
Now when I append, the text file data should look like this:
{
"1455494402": 8,
"1456272000": 2,
"1456358400": 1,
"1454716800": 1,
"1454803200": 4,
"1454889600": 7,
"1458345600": 17,
"1458518400": 1 }
There are two options to do this, I think:
By overwriting the whole file with the new data (a burden, right? Performance hit)
Or by just appending the latest data (seems fast, but not possible without removing the last bracket)
The first option is not so smart, I think.
The second option is cool, but how do I remove the last bracket before appending the latest data? Can't I just replace } with the new data?
So far my research has taught me that writing the whole file again is the only better option. Do you also think so? Can't I just append (in the sense of removing the bracket and then appending)?
EDIT: Please do not consider this a duplicate; I want to know whether the second option is possible or not. I already know I can do it with the first option mentioned in the details above.
Assuming that:
You know that there will be a curly bracket (or any other character that is encoded as a single byte) at the end of the file,
The file is written as UTF-8,
then you can overwrite the last character as follows:
string filename = "test.txt";
File.WriteAllText(filename, "{One\nTwo\nThree}"); // Note curly brace at the end.
using (var file = new StreamWriter(File.OpenWrite(filename)))
{
    file.BaseStream.Position = file.BaseStream.Length - 1;
    file.Write("\nfour}"); // New line at end, previous brace is replaced.
}
This is very fragile, however. If the last character happens to be one that is encoded in more than one byte, then this will not work.
It is likely that it's not worth you taking the chance unless the files are very large and you have made timings that indicate it is worth introducing such brittle code to speed it up.
Note that this code can also be modified to work with ASCII or ANSI files by changing the encoding passed to the StreamWriter() constructor.
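For instance, targeting an ASCII file only changes the encoding argument; a sketch based on the code above, using the StreamWriter(Stream, Encoding) overload:
using System.IO;
using System.Text;

using (var file = new StreamWriter(File.OpenWrite(filename), Encoding.ASCII))
{
    // Same trick: the final '}' is a single byte in ASCII too.
    file.BaseStream.Position = file.BaseStream.Length - 1;
    file.Write("\nfour}");
}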

C# fast way to replace text in a html file

I want to replace text from a certain range in my HTML file (like from position 1000 to 200000) with text from another HTML file. Can someone recommend the best way to do this?
Pieter's way will work, but it does involve loading the whole file into memory. That may well be okay, but if you've got particularly large files you may want to consider an alternative:
Open a TextReader on the original file
Open a TextWriter for the target file
Copy blocks of text by calling Read/Write repeatedly, with a buffer of say 8K characters until you've read the initial amount (1000 characters in your example)
Write the replacement text out to the target writer by again opening a reader and copying blocks
Skip the text you want to ignore in the original file, by repeatedly reading into a buffer and just ignoring it (incrementing a counter so you know how much you've skipped, of course)
Copy the rest of the text from the original file in the same way.
Basically it's just lots of copying operations, including one "copy" which doesn't go anywhere (for skipping the text in the original file).
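Here is a minimal sketch of that approach, under the assumption that the positions are character offsets; the file names are placeholders, and the buffer is the 8K one from the list above:
using System;
using System.IO;

const int startIndex = 1000;   // characters to keep from the start
const int skipLength = 199000; // characters to drop (positions 1000..200000)
char[] buffer = new char[8192];

using (TextReader source = File.OpenText("input.html"))
using (TextWriter target = File.CreateText("output.html"))
{
    // 1. Copy the first startIndex characters unchanged.
    int remaining = startIndex;
    while (remaining > 0)
    {
        int read = source.Read(buffer, 0, Math.Min(buffer.Length, remaining));
        if (read <= 0) break;
        target.Write(buffer, 0, read);
        remaining -= read;
    }

    // 2. Copy the whole replacement file.
    using (TextReader replacement = File.OpenText("replacement.html"))
    {
        int read;
        while ((read = replacement.Read(buffer, 0, buffer.Length)) > 0)
            target.Write(buffer, 0, read);
    }

    // 3. Skip the replaced range: a "copy" that goes nowhere.
    remaining = skipLength;
    while (remaining > 0)
    {
        int read = source.Read(buffer, 0, Math.Min(buffer.Length, remaining));
        if (read <= 0) break;
        remaining -= read;
    }

    // 4. Copy the rest of the original file.
    int tail;
    while ((tail = source.Read(buffer, 0, buffer.Length)) > 0)
        target.Write(buffer, 0, tail);
}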
Try this:
string input = File.ReadAllText("<< input HTML file >>");
string replacement = File.ReadAllText("<< replacement HTML file >>");
int startIndex = 1000;
int endIndex = 200000;
var sb = new StringBuilder(
    input.Length - (endIndex - startIndex) + replacement.Length);
sb.Append(input.Substring(0, startIndex));
sb.Append(replacement);
sb.Append(input.Substring(endIndex));
string output = sb.ToString();
The replacement code Pieter posted does the job, and using the StringBuilder with the known resulting length is a clever way to save performance.
This should do what you asked, but sometimes when working with structured data like HTML, it is preferable to load it as XML (I have used the HtmlAgilityPack for that). Then you could use XPath to find the node you want to replace, and work with it. It might be slower, but as I said, you can then work with the structure.
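A short sketch of that structured approach with HtmlAgilityPack; the XPath expression and file names here are made up for illustration:
using System.IO;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.Load("input.html");

// Hypothetical selector; pick the node whose content you want to replace.
HtmlNode target = doc.DocumentNode.SelectSingleNode("//div[@id='content']");
if (target != null)
{
    // Assumes the replacement file holds a single top-level element.
    HtmlNode replacement = HtmlNode.CreateNode(File.ReadAllText("replacement.html"));
    target.ParentNode.ReplaceChild(replacement, target);
}

doc.Save("output.html");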

C# How can I remove newline characters from binary?

Basically I have binary data. I don't mind if it's unreadable, but I'm writing it to a file which is parsed, so it's important that newline characters are taken out.
I thought I had done the right thing when I converted it to a string....
byte[] b = (byte[])SubKey.GetValue(v[i]);
s = System.Text.ASCIIEncoding.ASCII.GetString(b);
and then removed the newlines
string t = s.Replace("\n", "");
but it's not working?
Newline might be \r\n, and your binary data might not be ASCII encoded.
Firstly, a newline (Environment.NewLine) is usually two characters on Windows; do you mean removing single carriage-return or line-feed characters?
Secondly, applying a text encoding to binary data is likely to lead to unexpected conversions. E.g. what will happen to bytes of the binary data that do not map to ASCII characters?
A newline character may be \n, \r, or \r\n depending on the operating system; in that order, those are the markers for Linux, (classic) Macintosh, and Windows.
But if you say your file is binary, how do you know it has ASCII newlines in its content at all?
If it is a binary file, it may hold some structure; if so, removing newline characters will shift left all the data after each newline and corrupt the file.
I would imagine removing the bytes in a binary chunk which correspond to line feeds would actually corrupt the binary data, thereby making it useless.
Perhaps you'd be better off using base64 encoding, which will produce ASCII-safe output.
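A minimal sketch of that Base64 idea, reusing SubKey and v[i] from the question's snippet:
using System;

byte[] b = (byte[])SubKey.GetValue(v[i]);

// Base64 output uses only A-Z, a-z, 0-9, '+', '/' and '=',
// so it can never contain \r or \n.
string encoded = Convert.ToBase64String(b);

// The consumer can recover the original bytes later:
byte[] decoded = Convert.FromBase64String(encoded);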
If this is text data, then load it as text data (using the correct encoding), replace it as a string, and re-encode it (using the correct encoding). For some encodings you might be able to do a swap at the file level (without decoding/encoding), but I wouldn't bet on it.
If this is any other binary representation, you will have to know the exact details. For example, it is common (but not certain) for strings embedded in part of a binary file to have a length prefix. If you change the data without changing the length prefix, you've just corrupted the file. And to change the length prefix you need to know the format (it might be big-endian or little-endian, any fixed number of bytes, or the prefix itself could be variable-length). Or it might be delimited. Or there might be relative offsets scattered through the file that all need fixing.
Just as likely, you could by chance have the same byte sequence in the binary that doesn't actually represent a newline; you could be completely trashing the data.
