C# fast way to replace text in an HTML file

I want to replace the text in a certain range of my HTML file (say, from position 1000 to 200000) with text from another HTML file. Can someone recommend the best way to do this?

Pieter's way will work, but it does involve loading the whole file into memory. That may well be okay, but if you've got particularly large files you may want to consider an alternative:
Open a TextReader on the original file
Open a TextWriter for the target file
Copy blocks of text by calling Read/Write repeatedly, with a buffer of say 8K characters until you've read the initial amount (1000 characters in your example)
Write the replacement text out to the target writer by again opening a reader and copying blocks
Skip the text you want to ignore in the original file, by repeatedly reading into a buffer and just ignoring it (incrementing a counter so you know how much you've skipped, of course)
Copy the rest of the text from the original file in the same way.
Basically it's just lots of copying operations, including one "copy" which doesn't go anywhere (for skipping the text in the original file).
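The steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the names RangeReplacer, CopyChars and ReplaceRange are made up for this example, positions are counted in characters as in the question, and the target path is assumed to differ from the source path.

```csharp
using System;
using System.IO;

public static class RangeReplacer
{
    // Copies up to 'count' characters from reader to writer.
    // Pass a null writer to skip text instead of copying it.
    private static long CopyChars(TextReader reader, TextWriter writer, long count, char[] buffer)
    {
        long done = 0;
        while (done < count)
        {
            int want = (int)Math.Min(buffer.Length, count - done);
            int read = reader.Read(buffer, 0, want);
            if (read == 0) break; // end of input
            writer?.Write(buffer, 0, read);
            done += read;
        }
        return done;
    }

    // Hypothetical helper: writes source[0..startIndex) + replacement + source[endIndex..) to targetPath.
    public static void ReplaceRange(string sourcePath, string replacementPath, string targetPath,
                                    long startIndex, long endIndex)
    {
        var buffer = new char[8192]; // 8K characters, as suggested above
        using (var source = new StreamReader(sourcePath))
        using (var replacement = new StreamReader(replacementPath))
        using (var target = new StreamWriter(targetPath))
        {
            CopyChars(source, target, startIndex, buffer);           // text before the range
            CopyChars(replacement, target, long.MaxValue, buffer);   // the replacement text
            CopyChars(source, null, endIndex - startIndex, buffer);  // the "copy" that goes nowhere
            CopyChars(source, target, long.MaxValue, buffer);        // the rest of the original
        }
    }
}
```

Only one 8K buffer is held in memory at a time, so the size of the files does not matter much.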

Try this:
string input = File.ReadAllText("<< input HTML file >>");
string replacement = File.ReadAllText("<< replacement HTML file >>");
int startIndex = 1000;
int endIndex = 200000;
var sb = new StringBuilder(
input.Length - (endIndex - startIndex) + replacement.Length
);
sb.Append(input.Substring(0, startIndex));
sb.Append(replacement);
sb.Append(input.Substring(endIndex));
string output = sb.ToString();

The replacement code Pieter posted does the job, and sizing the StringBuilder up front with the known result length is a clever way to avoid reallocations.
That should do what you asked, but when working with structured data like HTML it is sometimes preferable to load it into a DOM, much as you would with XML (I have used the HtmlAgilityPack for that). Then you can use XPath to find the node you want to replace and work with it. It might be slower, but as I said, you can then work with the structure.

Related

Get only strings from binary file

I'm trying to display a string from a binary file in a XAML label, but when I display the file content I get a single "corrupted" character instead of the entire file content.
I think the problem is in reading the file. Changing the label content itself has always worked fine with the most basic technique:
label.Text = mystring;
The thing is, my binary files contain data at the start that isn't text (some random data that I don't care about). My theory is that my program starts reading, hits a non-ASCII character, and stops reading.
I read the file using the File class, which may be the wrong approach:
label.Text = File.ReadAllText(my_file);
So I'm stuck now. I don't know exactly what I'm supposed to do.
Hope you can help me :D
I can't tell much without looking at the file, but it seems you need to specify the encoding.
Something like this:
string myText = File.ReadAllText(path, Encoding.Default);
You need to know how your binary file is structured, and you need to know the encoding of the strings. A Unicode text file usually has a marker in its first two or three bytes (a byte order mark) that identifies its encoding. This way the system can tell whether it's UTF-8, UTF-16, ...
If you try to read a binary file, this information is not there. Instead, the reader will most probably run into unexpected binary data, so you cannot read a binary file as text. If your file is structured so that binary data comes first and only text follows, just skip the first part and start reading at the beginning of the second part. But I don't think it is that easy:
If it really is binary data, chances are that the file structure is much more complicated and you will need to do more work to read it.
If only the first two bytes are binary data, then maybe it's a text file after all and you can read it without problems; you may only need to pass the right encoding to the reading function.
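For the simple case described above (a binary header followed only by text), a sketch could look like this. The header length and the text encoding are assumptions that have to come from knowledge of the file format; HeaderSkipper and ReadTextAfterHeader are names invented for this example.

```csharp
using System.IO;
using System.Text;

public static class HeaderSkipper
{
    // Hypothetical helper: decodes everything after the first 'headerLength' bytes as text.
    public static string ReadTextAfterHeader(string path, int headerLength, Encoding encoding)
    {
        byte[] bytes = File.ReadAllBytes(path);
        if (headerLength >= bytes.Length)
        {
            return string.Empty; // nothing after the header
        }
        return encoding.GetString(bytes, headerLength, bytes.Length - headerLength);
    }
}
```

With a hypothetical 16-byte header and UTF-8 text, usage would be along the lines of label.Text = HeaderSkipper.ReadTextAfterHeader(my_file, 16, Encoding.UTF8);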

Write text to file in C# with 513 space characters

Here is some code that writes a string to a file:
System.IO.File.WriteAllText("test.txt", "P ");
It's basically the character 'P' followed by a total of 513 space characters.
When I open the file in Notepad++, it appears to be fine. However, when I open it in Windows Notepad, all I see is garbled characters.
If I add 514 or 512 space characters instead of 513, it opens fine in Notepad.
What am I missing?
What you are missing is that Notepad is guessing, and it is not because your length is specifically 513 spaces ... it is because it is an even number of bytes and the file size is >= 100 total bytes. Try 511 or 515 spaces ... or 99 ... you'll see the same misinterpretation of your file contents. With an odd number of bytes, Notepad can assume that your file is not any of the double-byte encodings, because those would all result in 2 bytes per character = even number of total bytes in the file. If you give the file a few more low-order ASCII characters at the beginning (e.g., "PICKLE" + spaces), Notepad does a much better job of understanding that it should treat the content as single-byte chars.
The suggested approach of including Encoding.UTF8 is the easiest fix ... it will write a BOM to the beginning of the file which tells Notepad (and Notepad++) what the format of the data is, so that it doesn't have to resort to this guessing behavior (you can see the difference between your original approach and the BOM approach by opening both in Notepad++, then look in the bottom-right corner of the app. With the BOM, it will tell you the encoding is UTF-8-BOM ... without it, it will just say UTF-8).
I should also say that the contents of your file are not 'wrong', per se... the weird format is purely due to Notepad's "guessing" algorithm. So unless it's a requirement that people use Notepad to read your file with 1 letter and a large, odd number of spaces ... maybe just don't sweat it. If you do change to writing the file with Encoding.UTF8, then you do need to ensure that any other system that reads your file knows how to honor the BOM, because it is a real change to the contents of your file. If you cannot verify that all consumers of your file can/will handle the BOM, then it may be safer to just understand that Notepad happens to make a bad guess for your specific use case, and leave the raw contents exactly how you want them.
You can verify the physical difference the BOM makes to your file by doing a binary read and then converting the bytes to a string (you can't "see" the change with ReadAllText, because it honors and strips the BOM):
byte[] contents = System.IO.File.ReadAllBytes("test.txt");
Console.WriteLine(Encoding.ASCII.GetString(contents));
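Since the BOM bytes are not printable ASCII, it may be easier to look at them as hex. A small sketch (BomDemo and FirstBytesHex are names made up for this example):

```csharp
using System;
using System.IO;

public static class BomDemo
{
    // Returns the first 'count' bytes of the file as hex, e.g. "EF-BB-BF".
    public static string FirstBytesHex(string path, int count = 3)
    {
        byte[] contents = File.ReadAllBytes(path);
        int n = Math.Min(count, contents.Length);
        return BitConverter.ToString(contents, 0, n);
    }
}
```

A file written with File.WriteAllText(path, "P", Encoding.UTF8) should start with EF-BB-BF, the UTF-8 byte order mark, while the two-argument overload writes UTF-8 without a BOM, so the file starts directly with 50 (the letter 'P').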
Try passing in a different encoding:
i. System.IO.File.WriteAllText(filename , stringVariable, Encoding.UTF8);
ii. System.IO.File.WriteAllText(filename , stringVariable, Encoding.UTF32);
iii. etc.
You could also build your string another way, to make it easier to read, change and count, instead of tapping the space bar 513 times:
i. Use the string constructor (like @Tigran suggested)
var result = "P" + new String(' ', 513);
ii. Use the stringBuilder
var stringBuilder = new StringBuilder();
stringBuilder.Append("P");
for (var i = 1; i <= 513; i++) { stringBuilder.Append(" "); }
iii. Or both
public string AppendSpacesToString(string stringValue, int numberOfSpaces)
{
    var stringBuilder = new StringBuilder();
    stringBuilder.Append(stringValue);
    stringBuilder.Append(new String(' ', numberOfSpaces));
    return stringBuilder.ToString();
}

Reading a text file from Unity3d

I have an error in a script that reads from a text file outside the program.
The error is
FormatException: Input string was not in the correct format
It's obvious what's wrong, but I just don't understand why it can't read the file properly.
My code:
using (FileStream fs = new FileStream(@"D:\Program Files (x86)\Steam\SteamApps\common\blabla...to my file.txt))
{
    byte[] b = new byte[1024];
    UTF8Encoding temp = new UTF8Encoding(true);
    while (fs.Read(b, 0, b.Length) > 0)
    {
        //Debug.Log(temp.GetString(b));
        var converToInt = int.Parse(temp.GetString(b));
        externalAmount = converToInt;
    }
    fs.Close();
}
The text file has 4 lines of values.
Each line represents an object in the game. All I am trying to do is read these values. Unfortunately I get this error, which I can't explain.
So how can I read new lines without getting the error?
the text file looks like this
12
5
6
0
4 lines, no more, with each value on a separate line.
There's no closing " on your new FileStream(" ...); but I'm going to assume that's just an issue from copy-pasting your code to Stack Overflow.
The error you're getting is likely because you're trying to parse spaces as an int, which won't work; the input string (" " in this case) was not in the correct format (int).
Split your lines on spaces (Split(' ')) and parse every item in the resulting array.
A couple problems:
Problem 1
fs.Read(b, 0, b.Length) may read one byte, or all of them. The normal way to read a text file like this is to use StreamReader instead of FileStream. StreamReader has a convenience constructor for opening a file that works the same way, but it can read line by line and is much more convenient. Here's the documentation and an excellent example: https://msdn.microsoft.com/en-us/library/f2ke0fzy(v=vs.110).aspx
If you insist on reading directly from a FileStream, you will either need to
Parse your string outside the loop so you can be certain you've read the whole file into your byte buffer (b), or
Parse the new content byte by byte until you find a particular separator (for example a space or a newline) and then parse everything in your buffer and reset the buffer.
Problem 2
Most likely your buffer already contains everything in the file. Your file is so small that the FileStream object is probably reading the whole thing in a single shot, even though that's not guaranteed.
Since your byte buffer contains ALL the characters in the file, you are effectively trying to parse "12\n5\n6\n0" as a single integer, and the parser is choking on the newline characters. Since newlines are non-numeric, it has no idea how to interpret them.
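Along the lines of the line-by-line suggestion above, a minimal sketch that parses one integer per line could look like this. AmountReader and ReadInts are names invented for this example, and silently skipping lines that are not valid integers is an assumption about what the game wants.

```csharp
using System.Collections.Generic;
using System.IO;

public static class AmountReader
{
    // Reads one integer per line; blank or non-numeric lines are skipped.
    public static List<int> ReadInts(string path)
    {
        var values = new List<int>();
        foreach (string line in File.ReadLines(path))
        {
            int value;
            if (int.TryParse(line.Trim(), out value))
            {
                values.Add(value);
            }
        }
        return values;
    }
}
```

For the four-line file in the question, the returned list would hold 12, 5, 6 and 0, one entry per game object.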

Replacing a word in a text file

I'm writing a little program where the data saved about some users is stored in a text file. I'm using System.IO with a StreamWriter to write new information to my text file.
The text in the file is formatted like so :
name1, 1000, 387
name2, 2500, 144
... and so on. I'm using infos = line.Split(',') to split the values into an array, which is more useful for searching. I use a while loop to find the correct line (where the name matches) and return the number of points via infos[1].
I'd like to modify this infos[1] value and set it to something else. I'm trying to find a way to replace a word in C#, but I can't find a good way to do it. From what I've read, there is no way to replace a single word; you have to rewrite the complete file.
Is there a way to delete a line completely, so that I could rewrite it at the end of the text file and not have to worry about it being duplicated?
I tried using the Replace method, but it didn't work. I'm a bit lost after looking at the answers proposed for similar problems, so I would really appreciate it if someone could explain what my options are.
If I understand you correctly, you can use the File.ReadLines method and LINQ to accomplish this. First, get the line you want:
var line = File.ReadLines("path")
    .FirstOrDefault(x => x.StartsWith("name1 or whatever"));
if (line != null)
{
    /* change the line; strings are immutable, so assign the result back to line */
}
Then write the new line to your file, excluding the old line:
// File.ReadLines streams the file lazily, so without ToList() the file would
// still be open for reading when WriteAllLines tries to write to it.
var lines = File.ReadLines("path")
    .Where(x => !x.StartsWith("name1 or whatever"))
    .ToList();
var newLines = lines.Concat(new [] { line });
File.WriteAllLines("path", newLines);
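If the line should stay where it is rather than move to the end of the file, a variation is to rewrite the whole file with one field changed. This is only a sketch under the assumption that the file uses the "name, points, other" layout from the question; PointsFile and SetPoints are hypothetical names.

```csharp
using System.IO;

public static class PointsFile
{
    // Rewrites the file, replacing the second field of every line whose first field equals 'name'.
    public static void SetPoints(string path, string name, int newPoints)
    {
        string[] lines = File.ReadAllLines(path);
        for (int i = 0; i < lines.Length; i++)
        {
            string[] infos = lines[i].Split(',');
            if (infos.Length >= 2 && infos[0].Trim() == name)
            {
                infos[1] = " " + newPoints;        // keep the space after the comma
                lines[i] = string.Join(",", infos);
            }
        }
        File.WriteAllLines(path, lines);
    }
}
```

File.ReadAllLines loads everything into memory and closes the file before it is rewritten, which is fine for a small user list.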
The concept you are looking for is called 'RandomAccess' for file reading/writing. Most of the easy-to-use I/O methods in C# are 'SequentialAccess', meaning you read a chunk or a line and move forward to the next.
However, what you want to do is possible, but you need to read some tutorials on file streams. Here is a related SO question. .NET C# - Random access in text files - no easy way?
You are probably either reading the whole file, or reading it line-for-line as part of your search. If your fields are fixed length, you can read a fixed number of bytes, keep track of the Stream.Position as you read, know how many characters you are going to read and need to replace, and then open the file for writing, move to that exact position in the stream, and write the new value.
It's a bit complex if you are new to streams. If your file is not huge, copying a file line for line can be done pretty efficiently by the System.IO library if coded correctly, so you might just follow your second suggestion which is read the file line-for-line, write it to a new Stream (memory, temp file, whatever), replace the line in question when you get to that value, and when done, replace the original.
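To illustrate the fixed-length approach described above, here is a rough sketch that overwrites a field in place. It assumes a single-byte encoding (ASCII here), a field of known width, and a byte offset you have already worked out while reading; FixedWidthEditor and OverwriteField are names made up for this example.

```csharp
using System;
using System.IO;
using System.Text;

public static class FixedWidthEditor
{
    // Overwrites 'fieldWidth' bytes at 'byteOffset' with newValue, padded with spaces.
    public static void OverwriteField(string path, long byteOffset, int fieldWidth, string newValue)
    {
        if (newValue.Length > fieldWidth)
        {
            throw new ArgumentException("Value does not fit in the fixed-width field.");
        }
        byte[] bytes = Encoding.ASCII.GetBytes(newValue.PadRight(fieldWidth));
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Write))
        {
            fs.Seek(byteOffset, SeekOrigin.Begin); // jump straight to the field
            fs.Write(bytes, 0, bytes.Length);      // overwrite without touching the rest
        }
    }
}
```

This only works because the new value takes exactly as many bytes as the old one. As soon as field lengths vary, you are back to rewriting the file.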
It is most likely you are new to C# and don't realize the strings are immutable (a fancy way of saying you can't change them). You can only get new strings from modifying the old:
String MyString = "abc 123 xyz";
MyString.Replace("123", "999"); // does not work
MyString = MyString.Replace("123", "999"); // works
[Edit:]
If I understand your follow-up question, you could do this:
infos[1] = infos[1].Replace("1000", "1500");

Problem with indexed XML file

I scanned a 2.8 GB XML file for the positions (index) of particular tags. Then I use the Seek method to set a start point in that file. The file is UTF-8 encoded.
The indexing looks like this:
using(StreamReader sr = new StreamReader(pathToFile)){
    long index = 0;
    while(!sr.EndOfStream){
        string line = sr.ReadLine();
        index += (line.Length + 2); //remember the \r\n chars
        if(LineHasTag(line)){
            SaveIndex(index-line.Length); //need beginning of the line
        }
    }
}
So afterwards I have the indexed positions in another file. But when I use Seek, the result is off: the position ends up somewhere before where it should be.
I loaded some of the file's content into a char array and manually checked the correct index of a tag I need. It's the same as the one produced by the code above. Still, the Seek method on StreamReader.BaseStream places the pointer earlier in the file. Quite strange.
Any suggestions?
Best regards,
ventus
Seek deals in bytes, but you're assuming there's one byte per character. In UTF-8, one character in the BMP can take up to three bytes.
My guess is that you've got non-ASCII characters in your file, and those will take more than one byte each.
I think there may also be a potential problem with the byte order mark, if there is one. I can't remember offhand whether StreamReader will swallow that automatically, which would put you 3 bytes out to start with.
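One way to get offsets that Seek can use is to count bytes instead of characters while indexing. The following is only a sketch: LineIndexer and IndexLines are invented names, the lineHasTag delegate stands in for the LineHasTag method from the question, and reading byte by byte is kept for clarity rather than speed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class LineIndexer
{
    // Returns the byte offset of the start of every line for which lineHasTag returns true.
    public static List<long> IndexLines(string path, Func<string, bool> lineHasTag)
    {
        var offsets = new List<long>();
        var lineBytes = new List<byte>();
        long position = 0;  // bytes consumed so far
        long lineStart = 0; // byte offset where the current line begins

        using (var stream = new BufferedStream(File.OpenRead(path)))
        {
            while (true)
            {
                int b = stream.ReadByte();
                if (b != -1)
                {
                    position++;
                    if (b != '\n') { lineBytes.Add((byte)b); continue; }
                }
                if (b == -1 && lineBytes.Count == 0) break; // end of file, no pending line

                byte[] raw = lineBytes.ToArray();
                // Skip a UTF-8 byte order mark (EF BB BF) at the very start of the file.
                int skip = (lineStart == 0 && raw.Length >= 3 &&
                            raw[0] == 0xEF && raw[1] == 0xBB && raw[2] == 0xBF) ? 3 : 0;
                string line = Encoding.UTF8.GetString(raw, skip, raw.Length - skip).TrimEnd('\r');
                if (lineHasTag(line)) offsets.Add(lineStart + skip);

                lineBytes.Clear();
                lineStart = position;
                if (b == -1) break; // last line had no trailing newline
            }
        }
        return offsets;
    }
}
```

Each saved offset can then be passed to Seek(offset, SeekOrigin.Begin) on a FileStream before wrapping it in a fresh StreamReader, since both sides now agree on bytes.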
