In C++, we can define a custom locale that enables a stream object to ignore non-digits in a file and read only the integers.
Can we do something similar in C#? How can we efficiently read only integers from a text file? Does the C# stream object use a locale? If so, can we define a custom locale to use with the stream object so that it ignores unwanted characters while reading the file?
Here is an example in C++ that efficiently counts the frequency of words in a text file:
Elegant ways to count the frequency of words in a file
My proposal:
public void ReadJustNumbers()
{
    // Match one or more consecutive digits.
    Regex r = new Regex(@"\d+");
    using (var sr = new StreamReader("xxx"))
    {
        string line;
        while (null != (line = sr.ReadLine()))
        {
            foreach (Match m in r.Matches(line))
            {
                Console.WriteLine(m.Value);
            }
        }
    }
}
where xxx is the file name; obviously you will use the matched digits in a more elegant way than dumping them to the console ;)
I'm trying to validate whether a file is an SVG by looking at the first few bytes. I know I can do this for PNG and other image file types, but what about SVG?
Maybe I have to convert the bytes to a string and validate using a regex instead?
If performance is a concern and you don't want to read all the SVG file contents, you can use the XmlReader class to have a look at the first element:
private static bool IsSvgFile(Stream fileStream)
{
    try
    {
        using (var xmlReader = XmlReader.Create(fileStream))
        {
            return xmlReader.MoveToContent() == XmlNodeType.Element &&
                   "svg".Equals(xmlReader.Name, StringComparison.OrdinalIgnoreCase);
        }
    }
    catch
    {
        return false;
    }
}
If you don't want to use an XML parser (you probably don't), then I think a 99%+ reliable method would be to read the first, say, 256 bytes. Then check for the string "<svg ", or use the regex /^<svg /gm.
And/or check for the string " xmlns=\"http://www.w3.org/2000/svg\"".
From my experience working with SVG, this would catch almost all SVG files, with very few false negatives.
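A rough sketch of that idea (the helper name and the 256-byte probe size are my own; assumes using System.IO and System.Text):

private static bool LooksLikeSvg(Stream fileStream)
{
    // Probe only the first 256 bytes instead of reading the whole file.
    var buffer = new byte[256];
    int read = fileStream.Read(buffer, 0, buffer.Length);
    string head = Encoding.UTF8.GetString(buffer, 0, read);

    // Accept either an opening <svg tag or the SVG namespace declaration.
    return head.Contains("<svg ") ||
           head.Contains("xmlns=\"http://www.w3.org/2000/svg\"");
}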
You can retrieve the file as a byte array and, knowing that the SVG format is XML, simply check it like this (img being the file's bytes):
var text = Encoding.UTF8.GetString(img);
isSvg = text.StartsWith("<?xml ") || text.StartsWith("<svg ");
I am working on a project which reads in 2 CSV files:
var myFullCsv = ReadFile(myFullCsvFilePath);
var masterCsv = ReadFile(csvFilePath);
and then creates a new var containing the extra lines that exist in myFullCsv but not in masterCsv. The code is great because of its simplicity:
var extraFilesCsv = myFullCsv.Except(masterCsv);
The csv files read in contain data like this:
c01.jpg,95182,24f77a1e,\Folder1\FolderA\,
c02.jpg,131088,c17b1f13,\Folder1\FolderA\,
c03.jpg,129485,ddc964ec,\Folder1\FolderA\,
c04.jpg,100999,930ee633,\Folder1\FolderA\,
c05.jpg,101638,b89f1f28,\Folder1\FolderA\,
However, I have just found a situation where the case of some characters in each file does not match. For example (JPG in caps):
c01.JPG,95182,24f77a1e,\Folder1\FolderA\,
If the data is like this then it is not included in extraFilesCsv but I need it to be. Can anybody tell me how I can make this code insensitive to the case of the text?
Edit: Sorry, I forgot that ReadFile was not a standard command. Here is the code:
public static IEnumerable<string> ReadFile(string path)
{
    string line;
    using (var reader = File.OpenText(path))
        while ((line = reader.ReadLine()) != null)
            yield return line;
}
I'm assuming you've read in both csv files and have a collection of strings representing each file.
You can specify a specific EqualityComparer in the call to Except(), which controls how the elements of the two collections are compared.
You can create your own comparer or, assuming both collections are of strings, try specifying an existing one that ignores case:
var extraFilesCsv
= myFullCsv.Except(masterCsv, StringComparer.CurrentCultureIgnoreCase);
By default, if you don't specify a comparer, it uses EqualityComparer<TElement>.Default, which differs based on the class type you're comparing.
For strings, it first does a straight-up a==b comparison by default, which is case-sensitive. (The exact implementation on the string class is a little more complicated, but it's probably unnecessary to post it here.)
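If you ever need matching rules beyond what the built-in comparers offer, a hand-rolled comparer is straightforward. A sketch (the class name is mine; assumes using System and System.Collections.Generic):

class LineComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {
        return string.Equals(x, y, StringComparison.OrdinalIgnoreCase);
    }

    public int GetHashCode(string obj)
    {
        // Must be consistent with Equals, so hash case-insensitively too.
        return StringComparer.OrdinalIgnoreCase.GetHashCode(obj);
    }
}

// Usage:
// var extraFilesCsv = myFullCsv.Except(masterCsv, new LineComparer());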
Ok I'm not sure how to explain this, but take a look at this:
string fornavn = borgerXML.GetElementsByTagName("Fornavne")[id].InnerXml;
Say I have 4 of these strings, which are getting different elements from the XML file.
On every single one of these, I want to do the following:
.Replace("æ", "æ")
.Replace("Ã?", "Æ")
.Replace("ø", "ø")
.Replace("Ã?", "Ø")
.Replace("Ã¥", "å")
.Replace("Ã?", "Å");
If I could store these 6 lines of code as a method or something, I'd only have to paste that method for every one of those 4 string elements, instead of pasting 4 times 6 lines of code.
Is there some way to do this?
Thanks in advance.
The actual problem here is that the file which, according to your comment, declares that it is encoded using iso-8859-1:
<?xml version="1.0" encoding="iso-8859-1"?>
does not in fact match that encoding. Try this with LINQPad:
void Main()
{
    string wrong = "Ã¦ Ã¸ Ã¥";
    var bytes = Encoding.GetEncoding("iso-8859-1").GetBytes(wrong);
    string correct = Encoding.UTF8.GetString(bytes);
    correct.Dump();
}
This will output:
æ ø å
(note that I only used every second letter as the characters you've posted in the question are not an exact match to how they are actually encoded).
To fix this, go back to the code that produces the file and ensure it actually uses iso-8859-1 encoding when writing the file, instead of UTF-8.
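For example, if you control the writing side, something along these lines keeps the declaration and the actual bytes in sync (a sketch; the file name and document variable are placeholders, assuming using System.Text and System.Xml):

// Emit the file with the encoding its declaration claims.
var settings = new XmlWriterSettings
{
    Encoding = Encoding.GetEncoding("iso-8859-1")
};
using (var writer = XmlWriter.Create("borger.xml", settings))
{
    borgerXML.Save(writer); // the writer emits a matching declaration
}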
public string Convert(string source)
{
    return source.Replace("Ã¦", "æ")
                 .Replace("Ã?", "Æ")
                 .Replace("Ã¸", "ø")
                 .Replace("Ã?", "Ø")
                 .Replace("Ã¥", "å")
                 .Replace("Ã?", "Å");
}
If I understood you correctly, then for each string you want to "process", call:
Convert(value);
I borrow a bit from Regex Replace - Multiple Characters (you could actually have found this yourself with the search function) to make a more generic solution which works for any number of characters.
First, define a function that can substitute dynamically:
string MultiReplace(string original, Dictionary<string, string> replacements)
{
    // Initialize the result with the original in case no replacements apply.
    string result = original;
    foreach (var r in replacements)
    {
        result = result.Replace(r.Key, r.Value);
    }
    return result;
}
Define a reusable variable to save the overhead of creating it every time you need it:
//You can create & initialize a dictionary variable like this:
var charmap = new Dictionary<string, string>
{
    // The uppercase keys below are the presumed mojibake sequences; the
    // question shows them as "Ã?" because their second byte is unprintable.
    { "Ã¦", "æ" },
    { "Ã†", "Æ" },
    { "Ã¸", "ø" },
    { "Ã˜", "Ø" },
    { "Ã¥", "å" },
    { "Ã…", "Å" },
};
Call this function with your specific task:
var a = MultiReplace(fornavn, charmap);
In general it should be possible to fix this on the XML level though.
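For instance (a sketch, assuming the bytes really are UTF-8 despite the declaration), you could load the file through a StreamReader with an explicit encoding, since the parser takes the reader's decoding over the in-file declaration:

// Decode as UTF-8 regardless of the iso-8859-1 declaration.
// "borger.xml" is a placeholder file name.
var borgerXML = new XmlDocument();
using (var reader = new StreamReader("borger.xml", Encoding.UTF8))
{
    borgerXML.Load(reader);
}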
BR Florian
I would like to know: if I have an English dictionary in a text file, what is the best way to check whether a given string is a proper and correct English word? My dictionary contains about 100,000 English words and I have to check an average of 60,000 words in one go. I am just looking for the most efficient way. Also, should I store all the strings first, or just process them as they are generated?
Thanx
100k is not too great a number, so you can just pop everything into a HashSet<string>.
HashSet lookup is hash-based, so it will be lightning fast.
An example of how this might look in code:
string[] lines = File.ReadAllLines(@"C:\MyDictionary.txt");
HashSet<string> myDictionary = new HashSet<string>();

foreach (string line in lines)
{
    myDictionary.Add(line);
}

string word = "aadvark";
if (myDictionary.Contains(word))
{
    Console.WriteLine("There is an aadvark");
}
else
{
    Console.WriteLine("The aadvark is a lie");
}
You should probably use HashSet<string> if you're using .NET 3.5 or higher.
Just load the dictionary of valid words into a HashSet<string> and then either use Contains on each candidate string, or use some of the set operators to find all words which aren't valid.
For example:
// There are loads of ways of loading words from a file, of course
var valid = new HashSet<string>(File.ReadAllLines("dictionary.txt"));
var candidates = new HashSet<string>(File.ReadAllLines("candidate.txt"));
var validCandidates = candidates.Intersect(valid);
var invalidCandidates = candidates.Except(valid);
You may also wish to use case-insensitive comparisons or something similar - use the StringComparer static properties to get at appropriate instances of StringComparer which you can pass to the HashSet constructor.
If you're using .NET 2, you can use a Dictionary<string, whatever> as a poor-man's set - basically use whatever you like as the value, and just check for keys.
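For example (a sketch; the file name is a placeholder):

// Case-insensitive set of valid words (.NET 3.5+):
var valid = new HashSet<string>(
    File.ReadAllLines("dictionary.txt"),
    StringComparer.OrdinalIgnoreCase);

// .NET 2 poor-man's set: the keys carry the words, values are ignored.
Dictionary<string, bool> valid2 =
    new Dictionary<string, bool>(StringComparer.OrdinalIgnoreCase);
foreach (string word in File.ReadAllLines("dictionary.txt"))
    valid2[word] = true;

bool ok = valid2.ContainsKey("aardvark");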
In C#, I have a string that I'm obtaining from WebClient.DownloadString. I've tried setting client.Encoding to new UTF8Encoding(false), but that's made no difference - I still end up with a byte order mark for UTF-8 at the beginning of the result string. I need to remove this (to parse the resulting XML with LINQ), and want to do so in memory.
So I have a string that starts with \x00EF\x00BB\x00BF, and I want to remove that if it exists. Right now I'm using
if (xml.StartsWith(ByteOrderMarkUtf8))
{
    xml = xml.Remove(0, ByteOrderMarkUtf8.Length);
}
but that just feels wrong. I've tried all sorts of code with streams, GetBytes, and encodings, and nothing works. Can anyone provide the "right" algorithm to strip a BOM from a string?
I recently had issues with the .NET 4 upgrade, but until then the simple answer is
String.Trim()
removes the BOM up until .NET 3.5.
However, in .NET 4 you need to change it slightly:
String.Trim(new char[]{'\uFEFF'});
That will also get rid of the byte order mark, though you may also want to remove the ZERO WIDTH SPACE (U+200B):
String.Trim(new char[]{'\uFEFF','\u200B'});
You could also use this to remove other unwanted characters.
Some further information, from the String.Trim Method documentation:
The .NET Framework 3.5 SP1 and earlier versions maintain an internal list of white-space characters that this method trims. Starting with the .NET Framework 4, the method trims all Unicode white-space characters (that is, characters that produce a true return value when they are passed to the Char.IsWhiteSpace method). Because of this change, the Trim method in the .NET Framework 3.5 SP1 and earlier versions removes two characters, ZERO WIDTH SPACE (U+200B) and ZERO WIDTH NO-BREAK SPACE (U+FEFF), that the Trim method in the .NET Framework 4 and later versions does not remove. In addition, the Trim method in the .NET Framework 3.5 SP1 and earlier versions does not trim three Unicode white-space characters: MONGOLIAN VOWEL SEPARATOR (U+180E), NARROW NO-BREAK SPACE (U+202F), and MEDIUM MATHEMATICAL SPACE (U+205F).
I had some incorrect test data, which caused me some confusion. Based on How to avoid tripping over UTF-8 BOM when reading files I found that this worked:
private readonly string _byteOrderMarkUtf8 =
    Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());

public string GetXmlResponse(Uri resource)
{
    string xml;
    using (var client = new WebClient())
    {
        client.Encoding = Encoding.UTF8;
        xml = client.DownloadString(resource);
    }

    if (xml.StartsWith(_byteOrderMarkUtf8, StringComparison.Ordinal))
    {
        xml = xml.Remove(0, _byteOrderMarkUtf8.Length);
    }

    return xml;
}
Setting the client Encoding property correctly reduces the BOM to a single character. However, XDocument.Parse still will not read that string. This is the cleanest version I've come up with to date.
This works as well:
int index = xmlResponse.IndexOf('<');
if (index > 0)
{
    xmlResponse = xmlResponse.Substring(index, xmlResponse.Length - index);
}
A quick and simple method to remove it directly from a string:
private static string RemoveBom(string p)
{
    string BOMMarkUtf8 = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
    if (p.StartsWith(BOMMarkUtf8))
        p = p.Remove(0, BOMMarkUtf8.Length);
    return p.Replace("\0", "");
}
How to use it:
string yourCleanString = RemoveBom(yourBOMString);
If the variable xml is of type string, you did something wrong already - in a character string, the BOM should not be represented as three separate characters, but as a single code point.
Instead of using DownloadString, use DownloadData, and parse byte arrays instead. The XML parser should recognize the BOM itself, and skip it (except for auto-detecting the document encoding as UTF-8).
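A sketch of that approach (resource is the Uri from the earlier example; assumes using System.IO, System.Net, and System.Xml.Linq):

// Hand the raw bytes to the XML parser; it consumes the BOM itself.
byte[] data;
using (var client = new WebClient())
{
    data = client.DownloadData(resource);
}

XDocument document;
using (var stream = new MemoryStream(data))
{
    document = XDocument.Load(stream);
}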
I had a very similar problem (I needed to parse an XML document represented as a byte array that had a byte order mark at the beginning of it). I used one of Martin's comments on his answer to come to a solution. I took the byte array I had (instead of converting it to a string) and created a MemoryStream object with it. Then I passed it to XDocument.Load, which worked like a charm. For example, let's say that xmlBytes contains your XML in UTF-8 encoding with a byte order mark at the beginning of it. Then, this would be the code to solve the problem:
var stream = new MemoryStream(xmlBytes);
var document = XDocument.Load(stream);
It's that simple.
If starting out with a string, it should still be easy to do (assume xml is your string containing the XML with the byte order mark):
var bytes = Encoding.UTF8.GetBytes(xml);
var stream = new MemoryStream(bytes);
var document = XDocument.Load(stream);
I wrote the following post after coming across this issue.
Essentially instead of reading in the raw bytes of the file's contents using the BinaryReader class, I use the StreamReader class with a specific constructor which automatically removes the byte order mark character from the textual data I am trying to retrieve.
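Presumably the constructor in question is the one taking an encoding plus a BOM-detection flag; a sketch (the file name is a placeholder):

// With BOM detection on, StreamReader consumes a leading BOM,
// so the text it returns never starts with U+FEFF.
string text;
using (var reader = new StreamReader("data.xml", Encoding.UTF8, true))
{
    text = reader.ReadToEnd();
}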
It's of course best if you can strip it out while still on the byte array level to avoid unwanted substrings / allocs. But if you already have a string, this is perhaps the easiest and most performant way to handle this.
Usage:
string feed = ""; // input
bool hadBOM = FixBOMIfNeeded(ref feed);
var xElem = XElement.Parse(feed); // now does not fail
/// <summary>
/// You can get this or test it originally with: Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble())[0];
/// But no need, this way we have a constant. The three bytes `[239, 187, 191]` (a BOM) evaluate to a single C# char.
/// </summary>
public const char BOMChar = (char)65279;

public static bool FixBOMIfNeeded(ref string str)
{
    if (string.IsNullOrEmpty(str))
        return false;

    bool hasBom = str[0] == BOMChar;
    if (hasBom)
        str = str.Substring(1);

    return hasBom;
}
Pass the byte buffer (obtained via DownloadData) to Encoding.UTF8.GetString(byte[]) to get the string, rather than downloading the buffer as a string. You probably have more problems with your current method than just trimming the byte order mark. Unless you're properly decoding it as I suggest here, Unicode characters will probably be misinterpreted, resulting in a corrupted string.
Martin's answer is better, since it avoids allocating an entire string for XML that still needs to be parsed anyway. The answer I gave best applies to general strings that don't need to be parsed as XML.
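In code, the suggestion looks roughly like this (client and resource as in the earlier examples; note that GetString keeps a leading BOM if one is present, so pair it with one of the BOM checks above):

// Decode the downloaded bytes explicitly as UTF-8.
byte[] raw = client.DownloadData(resource);
string xml = Encoding.UTF8.GetString(raw);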
I ran into this when I had a Base64 encoded file to transform into a string. While I could have saved it to a file and then read it correctly, here's the best solution I could think of to get from the byte[] of the file to the string (based lightly on TrueWill's answer):
public static string GetUTF8String(byte[] data)
{
    byte[] utf8Preamble = Encoding.UTF8.GetPreamble();
    if (data.StartsWith(utf8Preamble))
    {
        return Encoding.UTF8.GetString(data, utf8Preamble.Length, data.Length - utf8Preamble.Length);
    }
    else
    {
        return Encoding.UTF8.GetString(data);
    }
}
Where StartsWith(byte[]) is the logical extension:
public static bool StartsWith(this byte[] thisArray, byte[] otherArray)
{
    // Handle invalid/unexpected input
    // (nulls, thisArray.Length < otherArray.Length, etc.)
    if (thisArray == null || otherArray == null ||
        thisArray.Length < otherArray.Length)
    {
        return false;
    }

    for (int i = 0; i < otherArray.Length; ++i)
    {
        if (thisArray[i] != otherArray[i])
        {
            return false;
        }
    }

    return true;
}
// The second argument turns on byte-order-mark detection, so the BOM is
// consumed by the reader and never reaches the XML parser.
StreamReader sr = new StreamReader(strFile, true);
XmlDocument xdoc = new XmlDocument();
xdoc.Load(sr);
Yet another generic variation to get rid of the UTF-8 BOM preamble:
var preamble = Encoding.UTF8.GetPreamble();
if (!functionBytes.Take(preamble.Length).SequenceEqual(preamble))
    preamble = Array.Empty<Byte>();

return Encoding.UTF8.GetString(functionBytes, preamble.Length, functionBytes.Length - preamble.Length);
Use a regex replace to filter out any characters other than the alphanumeric characters, whitespace, and hyphens that a normal certificate thumbprint value contains:
certificateThumbprint = Regex.Replace(certificateThumbprint, @"[^a-zA-Z0-9\-\s]", "");
And there you go. Voila!! It worked for me.
I solved the issue with the following code:
using System.IO;
using System.Xml.Linq;

void method()
{
    byte[] bytes = GetXmlBytes();
    XDocument doc;
    using (var stream = new MemoryStream(bytes))
    {
        doc = XDocument.Load(stream);
    }
}