I am trying to replace non-printable characters, i.e. extended ASCII characters, in a HUGE string.
foreach (string line in File.ReadLines(txtfileName.Text))
{
    MessageBox.Show(Regex.Replace(line,
        @"\p{Cc}",
        a => string.Format("[{0:X2}]", " ")
    ));
}
This doesn't seem to be working.
Example: AAÂAA should be converted to AA AA
Assuming the encoding is UTF-8, try this:
string strReplacedVal = Encoding.ASCII.GetString(
    Encoding.Convert(
        Encoding.UTF8,
        Encoding.GetEncoding(
            Encoding.ASCII.EncodingName,
            new EncoderReplacementFallback(" "),
            new DecoderExceptionFallback()
        ),
        Encoding.UTF8.GetBytes(line)
    )
);
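For instance, wrapped in a small helper (the helper name and the console output are only for illustration, not part of the original answer):
static string ToAsciiWithSpaces(string line)
{
    return Encoding.ASCII.GetString(
        Encoding.Convert(
            Encoding.UTF8,
            Encoding.GetEncoding(
                Encoding.ASCII.EncodingName,
                new EncoderReplacementFallback(" "),
                new DecoderExceptionFallback()),
            Encoding.UTF8.GetBytes(line)));
}

// Applied to the sample from the question:
Console.WriteLine(ToAsciiWithSpaces("AAÂAA")); // "AA AA"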
Since you are opening the file as UTF-8, it must be UTF-8. Its code units are one byte, and UTF-8 has the very nice property of encoding characters above ␡ with bytes exclusively above 0x7F, and characters at or below ␡ with bytes exclusively at or below 0x7F.
For efficiency, you can rewrite the file in place a few KB at a time.
Note that some characters might be replaced by more than one space, though.
// Operates on a UTF-8 encoded text file
using (var stream = File.Open(path, FileMode.Open, FileAccess.ReadWrite))
{
    const int size = 4096;
    var buffer = new byte[size];
    int count;
    while ((count = stream.Read(buffer, 0, size)) > 0)
    {
        var changed = false;
        for (int i = 0; i < count; i++)
        {
            // obliterate all bytes that are not encoded characters between ␠ and ␡
            if (buffer[i] < ' ' || buffer[i] > '\x7f')
            {
                buffer[i] = (byte)' ';
                changed = true;
            }
        }
        if (changed)
        {
            stream.Seek(-count, SeekOrigin.Current);
            stream.Write(buffer, 0, count);
        }
    }
}
In C#, I have a file which has Unix line endings (\n); I need to replace those with Windows line endings (\r\n). But:
1 - I don't know the original file encoding (UTF-8, Unicode, ISO 8859-1, etc.), and
2 - I don't know how big the original file may be.
The first point is important - I cannot simply read and write each line using a StreamWriter because I don't know the original encoding.
How can I achieve this?
private void Unix2Dos(string fileName)
{
    const byte CR = 0x0D;
    const byte LF = 0x0A;
    byte[] DOS_LINE_ENDING = new byte[] { CR, LF };
    byte[] data = File.ReadAllBytes(fileName);
    using (FileStream fileStream = File.OpenWrite(fileName))
    {
        BinaryWriter bw = new BinaryWriter(fileStream);
        int position = 0;
        int index = 0;
        do
        {
            index = Array.IndexOf<byte>(data, LF, position);
            if (index >= 0)
            {
                if ((index > 0) && (data[index - 1] == CR))
                {
                    // already a DOS ending
                    bw.Write(data, position, index - position + 1);
                }
                else
                {
                    bw.Write(data, position, index - position);
                    bw.Write(DOS_LINE_ENDING);
                }
                position = index + 1;
            }
        }
        while (index >= 0);
        bw.Write(data, position, data.Length - position);
        fileStream.SetLength(fileStream.Position);
    }
}
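Usage is then just a call with the path of the file to convert in place (the path here is only a placeholder):
Unix2Dos(@"C:\data\somefile.txt");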
Reference: http://csharp-goodies.blogspot.com/2011/02/convert-files-from-dos-to-unix-and-back.html
I would like to make a method that counts the occurrences of a series of characters in a .txt file (C#). I've found some related questions here that have valid answers. However, there are certain circumstances that restrict the possible solutions:
The method has to work quite fast, because I have to use it several hundred times in the program.
The text in the file is too long to be read into a single string.
Thank you for your help.
The method has to work quite fast, because I have to use it several hundred times in the program.
According to recent benchmarks, SequenceEqual of Span<T> tends to be the fastest way to compare array slices in .NET nowadays (except for unsafe or P/Invoke approaches).
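For instance, a tiny standalone illustration (not tied to the file logic below):
byte[] data = { 1, 2, 3, 4, 5 };
byte[] pattern = { 3, 4 };
bool match = data.AsSpan(2, 2).SequenceEqual(pattern); // true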
The text in the file is too long to be read into a single string.
This issue can easily be tackled using FileStream or StreamReader.
In a nutshell, you need to read the file in chunks: read a fixed-size part from the file, look for occurrences in it, read the next part, look for occurrences, and so on. This can be coded without moving the cursor back; you just need to take the leftover of each part into account when dealing with the next part.
Here is my approach using FileStream and Span<T>:
public static int CountOccurences(Stream stream, string searchString, Encoding encoding = null, int bufferSize = 4096)
{
    if (stream == null)
        throw new ArgumentNullException(nameof(stream));
    if (searchString == null)
        throw new ArgumentNullException(nameof(searchString));
    if (!stream.CanRead)
        throw new ArgumentException("Stream must be readable.", nameof(stream));
    if (bufferSize <= 0)
        throw new ArgumentException("Buffer size must be a positive number.", nameof(bufferSize));

    // detecting encoding
    Span<byte> bom = stackalloc byte[4];
    var actualLength = stream.Read(bom);
    if (actualLength == 0)
        return 0;
    bom = bom.Slice(0, actualLength);

    Encoding detectedEncoding;
    if (bom.StartsWith(Encoding.UTF8.GetPreamble()))
        detectedEncoding = Encoding.UTF8;
    else if (bom.StartsWith(Encoding.UTF32.GetPreamble()))
        detectedEncoding = Encoding.UTF32;
    else if (bom.StartsWith(Encoding.Unicode.GetPreamble()))
        detectedEncoding = Encoding.Unicode;
    else if (bom.StartsWith(Encoding.BigEndianUnicode.GetPreamble()))
        detectedEncoding = Encoding.BigEndianUnicode;
    else
        detectedEncoding = null;

    if (detectedEncoding != null)
    {
        if (encoding == null)
            encoding = detectedEncoding;
        if (encoding == detectedEncoding)
            bom = bom.Slice(detectedEncoding.GetPreamble().Length);
    }
    else if (encoding == null)
        encoding = Encoding.ASCII;

    // acquiring a buffer
    ReadOnlySpan<byte> searchBytes = encoding.GetBytes(searchString);
    bufferSize = Math.Max(Math.Max(bufferSize, searchBytes.Length), 128);
    var bufferArray = ArrayPool<byte>.Shared.Rent(bufferSize);
    try
    {
        var buffer = new Span<byte>(bufferArray, 0, bufferSize);

        // looking for occurrences
        bom.CopyTo(buffer);
        actualLength = bom.Length + stream.Read(buffer.Slice(bom.Length));

        var occurrences = 0;
        do
        {
            var index = 0;
            var endIndex = actualLength - searchBytes.Length;
            for (; index <= endIndex; index++)
                if (buffer.Slice(index, searchBytes.Length).SequenceEqual(searchBytes))
                    occurrences++;

            if (actualLength < buffer.Length)
                break;

            ReadOnlySpan<byte> leftover = buffer.Slice(index);
            leftover.CopyTo(buffer);
            actualLength = leftover.Length + stream.Read(buffer.Slice(leftover.Length));
        }
        while (true);

        return occurrences;
    }
    finally { ArrayPool<byte>.Shared.Return(bufferArray); }
}
This code requires C# 7.2 to compile. You may have to include the System.Buffers and System.Memory NuGet packages as well. If you use a .NET Core version lower than 2.1, or a platform other than .NET Core, you need to include this "polyfill" as well:
static class Compatibility
{
    public static int Read(this Stream stream, Span<byte> buffer)
    {
        // copied over from corefx sources (https://github.com/dotnet/corefx/blob/master/src/Common/src/CoreLib/System/IO/Stream.cs)
        byte[] sharedBuffer = ArrayPool<byte>.Shared.Rent(buffer.Length);
        try
        {
            int numRead = stream.Read(sharedBuffer, 0, buffer.Length);
            if ((uint)numRead > buffer.Length)
                throw new IOException("Stream was too long.");
            new Span<byte>(sharedBuffer, 0, numRead).CopyTo(buffer);
            return numRead;
        }
        finally { ArrayPool<byte>.Shared.Return(sharedBuffer); }
    }
}
Usage:
using (var fs = new FileStream(@"path-to-file", FileMode.Open, FileAccess.Read, FileShare.Read))
    Console.WriteLine(CountOccurences(fs, "string to search"));
When you don't specify the encoding argument, the encoding is auto-detected by examining the BOM of the file. If no BOM is present, ASCII encoding is assumed as a fallback.
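If the file has no BOM but its encoding is known, it can be passed explicitly; for example, forcing UTF-8 (path and search string are placeholders as above):
using (var fs = new FileStream(@"path-to-file", FileMode.Open, FileAccess.Read, FileShare.Read))
    Console.WriteLine(CountOccurences(fs, "string to search", Encoding.UTF8));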
I am writing a WinForms app to convert written text into Unicode code points and UTF-8 byte values. This bit is working well:
//------------------------------------------------------------------------
// Convert to UTF8
// The return will be either 1 byte, 2 bytes or 3 bytes.
//-----------------------------------------------------------------------
UTF8Encoding utf8 = new UTF8Encoding();
StringBuilder builder = new StringBuilder();
string utext = rchtxbx_text.Text;

// do one char at a time
for (int text_index = 0; text_index < utext.Length; text_index++)
{
    byte[] encodedBytes = utf8.GetBytes(utext.Substring(text_index, 1));
    for (int index = 0; index < encodedBytes.Length; index++)
    {
        builder.AppendFormat("{0}", Convert.ToString(encodedBytes[index], 16));
    }
    builder.Append(" ");
}

rchtxtbx_UTF8.SelectionFont = new System.Drawing.Font("San Serif", 20);
rchtxtbx_UTF8.AppendText(builder.ToString() + "\r");
As an example, the characters 乘义ש give me e4b998 e4b989 d7a9; note I have a mix of LTR and RTL text. Now if the user inputs the number e4b998, I want to show them it is 乘, which is U+4E58 in Unicode.
I have tried a few things, and the closest I got, though still far away, is:
Encoding utf8 = Encoding.UTF8;
rchtxbx_text.Text = Encoding.ASCII.GetString(utf8.GetBytes("e4b998"));
What do I need to do to input e4b998 and write 乘 to a textbox?
Something like this:
Split source into 2-character chunks: "e4b998" -> {"e4", "b9", "98"}
Convert chunks into bytes
Encode bytes into the final string
Implementation:
string source = "e4b998";
string result = Encoding.UTF8.GetString(Enumerable
    .Range(0, source.Length / 2)
    .Select(i => Convert.ToByte(source.Substring(i * 2, 2), 16))
    .ToArray());
If you have an int as source rather than a string, format it back to a hex string first (e.g. with source.ToString("x")) and run it through the same conversion.
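A minimal end-to-end sketch of that int case (the variable names are placeholders, not from the original answer):
int source = 0xE4B998;
string hex = source.ToString("x"); // "e4b998"
byte[] bytes = Enumerable
    .Range(0, hex.Length / 2)
    .Select(i => Convert.ToByte(hex.Substring(i * 2, 2), 16))
    .ToArray();
string result = Encoding.UTF8.GetString(bytes); // "乘"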
I have a large text file into which a new line should be inserted after every 2000 characters. What I have done so far is:
string FilePath = Path.Combine(strFullProcessedPath, strFileName);
StreamReader reader = new StreamReader(FilePath);
string firstLine = reader.ReadLine();
if (firstLine.Length > 2000)
{
    string text = File.ReadAllText(FilePath);
    text = Regex.Replace(text, @"(.{2000})", "$1\r\n", RegexOptions.Multiline);
    reader.Close();
    File.WriteAllText(FilePath, text);
}
It is giving an out of memory exception.
Please, can anyone give me some advice?
In the case of a very large (multi-gigabyte) file which doesn't fit in memory, you can try storing the processed data in a temporary file. Avoid ReadAllText; instead, read and write with the help of a buffer (which is conveniently 2000 chars in this context):
// Initial and target file
string FilePath = Path.Combine(strFullProcessedPath, strFileName);
// Temporary file
string tempFile = Path.ChangeExtension(FilePath, ".~temp");

char[] buffer = new char[2000];

using (StreamReader reader = new StreamReader(FilePath)) {
    bool first = true;

    using (StreamWriter writer = new StreamWriter(tempFile)) {
        while (true) {
            int size = reader.ReadBlock(buffer, 0, buffer.Length);

            if (size > 0) {     // Do we have anything to write?
                if (!first)     // Are we in the middle and have to add a new line?
                    writer.WriteLine();

                for (int i = 0; i < size; ++i)
                    writer.Write(buffer[i]);
            }

            // The last (incomplete) chunk
            if (size < buffer.Length)
                break;

            first = false;
        }
    }
}

// Remove the original file and move the temporary file into its place
File.Delete(FilePath);
File.Move(tempFile, FilePath);
Edit: even if the file is not that large (300 MB, see comments), avoid string processing; several copies of the initial string can easily lead to an out-of-memory condition. Something like this:
private static IEnumerable<string> ToChunks(string text, int size) {
    int n = text.Length / size + (text.Length % size == 0 ? 0 : 1);

    for (int i = 0; i < n; ++i)
        if (i == n - 1)
            yield return text.Substring(i * size);       // Last chunk
        else
            yield return text.Substring(i * size, size); // Inner chunk
}

...

string FilePath = Path.Combine(strFullProcessedPath, strFileName);

// Read once, but do not Replace or otherwise manipulate the whole string...
string text = File.ReadAllText(FilePath);

// ...instead, extract 2000-char chunks and write them out as lines
File.WriteAllLines(FilePath, ToChunks(text, 2000));
You can't simply insert newlines into an existing file - you need to rewrite the entire thing, basically. The easiest way to do that is to use two files - a source and a destination - and then perhaps delete and rename at the end (so the temporary destination file takes the name of the original). This means you can now loop over the source file without reading it all into memory first; essentially, as pseudo-code:
using(...open source for read...)
using(...create dest for write...)
{
    char[] buffer = new char[2000];
    int charCount;
    while (TryBuffer(source, buffer, out charCount)) {
        // if true, we filled the buffer; don't need to worry
        // about charCount
        Write(destination, buffer, buffer.Length);
        Write(destination, CRLF);
    }
    if (charCount != 0) // final chunk when returned false
    {
        // write any remaining charCount chars as a final chunk
        Write(destination, buffer, charCount);
    }
}
So that leaves the implementation of TryBuffer and Write. In this case, TextReader and TextWriter are probably your friends, since you are dealing in characters rather than bytes.
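For instance, a minimal sketch of those two helpers over TextReader/TextWriter, matching the pseudo-code above (the names and signatures are only assumptions):
static bool TryBuffer(TextReader source, char[] buffer, out int charCount)
{
    // ReadBlock keeps reading until the buffer is full or the reader is exhausted
    charCount = source.ReadBlock(buffer, 0, buffer.Length);
    return charCount == buffer.Length;
}

static void Write(TextWriter destination, char[] buffer, int count)
{
    destination.Write(buffer, 0, count);
}

static void Write(TextWriter destination, string text) // for the CRLF, e.g. "\r\n"
{
    destination.Write(text);
}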
I am trying to write some code using BinaryWriter and then BinaryReader.
When I want to write, I use the Write() method.
But the problem is that between two Write calls an extra byte appears, which is decimal 31 in the ASCII table (sometimes 24).
You can see that the byte at index 4 (the 5th byte) has the ASCII decimal value 31. I didn't insert it there. As you can see, the first 4 bytes are reserved for a number (Int32), and the rest is other data (mostly some text - this is not important now).
As you can see from the code, I write:
- into the 1st line a number, 10
- into the 2nd line the text "This is some text..."
How did that 5th byte (dec 31) get in between?
And this is the code I have:
static void Main(string[] args)
{
    //
    //// SEND - RECEIVE:
    //
    SendingData();

    Console.ReadLine();
}

private static void SendingData()
{
    int[] commandNumbers = { 1, 5, 10 }; //10 is for the users (when they send some text)!

    for (int i = 0; i < commandNumbers.Length; i++)
    {
        //convert to byte[]
        byte[] allBytes;
        using (MemoryStream ms = new MemoryStream())
        {
            using (BinaryWriter bw = new BinaryWriter(ms))
            {
                bw.Write(commandNumbers[i]); //allocates 1st 4 bytes - FOR MAIN COMMANDS!
                if (commandNumbers[i] == 10)
                    bw.Write("This is some text at command " + commandNumbers[i]); //HERE ON THIS LINE IS MY QUESTION!!!
            }
            allBytes = ms.ToArray();
        }

        //convert back:
        int valueA = 0;
        StringBuilder sb = new StringBuilder();
        foreach (var b in GetData(allBytes).Select((a, b) => new { Value = a, Index = b }))
        {
            if (b.Index == 0) //1st num
                valueA = BitConverter.ToInt32(b.Value, 0);
            else //other text
            {
                foreach (byte _byte in b.Value)
                    sb.Append(Convert.ToChar(_byte));
            }
        }
        if (sb.ToString().Length == 0)
            sb.Append("ONLY COMMAND");
        Console.WriteLine("Command = {0} and Text is \"{1}\".", valueA, sb.ToString());
    }
}

private static IEnumerable<byte[]> GetData(byte[] data)
{
    using (MemoryStream ms = new MemoryStream(data))
    {
        using (BinaryReader br = new BinaryReader(ms))
        {
            int j = 0;
            byte[] buffer = new byte[4];

            for (int i = 0; i < data.Length; i++)
            {
                buffer[j++] = data[i];
                if (i == 3) //SENDING COMMAND DATA
                {
                    yield return buffer;
                    buffer = new byte[1];
                    j = 0;
                }
                else if (i > 3) //SENDING TEXT
                {
                    yield return buffer;
                    j = 0;
                }
            }
        }
    }
}
If you look at the documentation for Write(string), you'll see that it writes a length-prefixed string. So the 31 is the number of characters in your string -- perfectly normal.
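For instance, a quick sketch (assuming the reading side uses BinaryReader) showing that the length prefix is consumed transparently by ReadString:
var ms = new MemoryStream();
var bw = new BinaryWriter(ms);
bw.Write(10);
bw.Write("This is some text at command 10");
bw.Flush();

ms.Position = 0;
var br = new BinaryReader(ms);
Console.WriteLine(br.ReadInt32());  // 10
Console.WriteLine(br.ReadString()); // the text; the length prefix is consumed for you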
You should probably use Encoding.GetBytes and then write the bytes instead of writing a string, for example:
bw.Write(
    Encoding.UTF8.GetBytes("This is some text at command " + commandNumbers[i])
);
When a string is written to a binary stream, the first thing it does is write the length of the string. The string "This is some text at command 10" has 31 characters, which is the value you're seeing.
You should check the documentation of methods you use before asking questions about them:
A length-prefixed string represents the string length by prefixing to the string a single byte or word that contains the length of that string. This method first writes the length of the string as a UTF-7 encoded unsigned integer, and then writes that many characters to the stream by using the BinaryWriter instance's current encoding.
;-)
(Though in fact it is an LEB128 and not UTF-7, according to Wikipedia).
The reason this byte is there is that you're adding a variable amount of information, so the length is needed. If you were to add two strings, how would you know where the first ended and the second began?
If you really don't want or need that length byte, you can always convert the string to a byte array and use that.
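For completeness, a sketch of what the reading side then has to do once the length prefix is gone, assuming the whole message is already in a byte[] such as allBytes from the question and the text simply runs to the end of the buffer:
using (var br = new BinaryReader(new MemoryStream(allBytes)))
{
    int command = br.ReadInt32();                                    // fixed-size part
    byte[] textBytes = br.ReadBytes(allBytes.Length - sizeof(int));  // the rest is the text
    string text = Encoding.UTF8.GetString(textBytes);
}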
OK, here is my edited code. I removed BinaryWriter (while BinaryReader is still there!!), and now it works very well - no more extra bytes.
What do you think? Is there anything I could do better, to make it run faster?
I'm especially interested in that foreach loop, which reads from another method that uses yield return!!
New Code:
static void Main(string[] args)
{
    //
    //// SEND - RECEIVE:
    //
    SendingData();

    Console.ReadLine();
}

private static void SendingData()
{
    int[] commands = { 1, 2, 3 };
    // 1 - user text
    // 2 - new game
    // 3 - join game
    // ...

    for (int i = 0; i < commands.Length; i++)
    {
        //convert to byte[]
        byte[] allBytes;
        using (MemoryStream ms = new MemoryStream())
        {
            // 1st - write a command:
            ms.Write(BitConverter.GetBytes(commands[i]), 0, 4);

            // 2nd - write a text:
            if (commands[i] == 1)
            {
                //some example text (like that user sends it):
                string myText = "This is some text at command " + commands[i];
                byte[] myBytes = Encoding.UTF8.GetBytes(myText);
                ms.Write(myBytes, 0, myBytes.Length);
            }
            allBytes = ms.ToArray();
        }

        //convert back:
        int valueA = 0;
        StringBuilder sb = new StringBuilder();
        foreach (var b in ReadingData(allBytes).Select((a, b) => new { Value = a, Index = b }))
        {
            if (b.Index == 0)
            {
                valueA = BitConverter.ToInt32(b.Value, 0);
            }
            else
            {
                sb.Append(Convert.ToChar(b.Value[0]));
            }
        }
        if (sb.ToString().Length == 0)
            sb.Append("ONLY COMMAND");
        Console.WriteLine("Command = {0} and Text is \"{1}\".", valueA, sb.ToString());
    }
}

private static IEnumerable<byte[]> ReadingData(byte[] data)
{
    using (MemoryStream ms = new MemoryStream(data))
    {
        using (BinaryReader br = new BinaryReader(ms))
        {
            int j = 0;
            byte[] buffer = new byte[4];

            for (int i = 0; i < data.Length; i++)
            {
                buffer[j++] = data[i];
                if (i == 3) //SENDING COMMAND DATA
                {
                    yield return buffer;
                    buffer = new byte[1];
                    j = 0;
                }
                else if (i > 3) //SENDING TEXT
                {
                    yield return buffer;
                    j = 0;
                }
            }
        }
    }
}