I am using Roslyn to generate a large amount of code (about 60k lines).
The problem comes when I use Formatter.Format() to format the whitespace. The actual formatting takes way too long (~60k lines in ~200s).
The code I use:
public string GenerateCode()
{
    var workspace = new AdhocWorkspace();
    OptionSet options = workspace.Options;
    options = options.WithChangedOption(CSharpFormattingOptions.NewLinesForBracesInMethods, true);
    options = options.WithChangedOption(CSharpFormattingOptions.NewLinesForBracesInProperties, true);

    CompilationUnitSyntax compilationUnit = CreateCompilationUnit(); // this method builds the syntax tree
    SyntaxNode formattedNode = Formatter.Format(compilationUnit, workspace, options);

    var sb = new StringBuilder();
    using (var writer = new StringWriter(sb))
    {
        formattedNode.WriteTo(writer);
    }
    return sb.ToString();
}
I came to the realization that human-readable formatting is not essential (though it would still be nice). I stopped formatting the code, but then the generated code doesn't compile, because some keywords are missing the necessary whitespace around them, for example "publicstaticclassMyClass".
I tried different options of the Formatter, but none were sufficient.
Then I looked for an alternative "minimal" formatter; to my knowledge, there isn't one.
Finally, I managed to solve this by putting extra whitespace in the identifiers themselves:
var className = "MyClass";

SyntaxFactory.ClassDeclaration(" " + className)
    .AddModifiers(
        // using explicit identifiers with extra whitespace
        SF.Identifier(" public "),
        SF.Identifier(" static "));
        // instead of the SyntaxKind enum:
        //SF.Token(SyntaxKind.PublicKeyword),
        //SF.Token(SyntaxKind.StaticKeyword));
And for the code generation:
public string GenerateCode()
{
    var workspace = new AdhocWorkspace();
    CompilationUnitSyntax compilationUnit = CreateCompilationUnit(); // this method builds the syntax tree

    var sb = new StringBuilder();
    using (var writer = new StringWriter(sb))
    {
        compilationUnit.WriteTo(writer);
    }
    return sb.ToString();
}
This way the generation is much faster (~60k lines in ~2 s; not really lines, since the output is unformatted, but it is the same amount of code). Although this works, it seems rather hacky. Another solution might be to write an alternative Formatter, but that is a task I don't wish to undertake.
Did anyone come up with a better solution? Is there some way to use the Formatter more efficiently?
Note: The time measurements provided include the time of building the syntax tree and several other procedures. In both cases the formatting accounts for about 98% of the measured time, so the numbers are still usable for comparison.
The "minimal formatter" you're looking for is the .NormalizeWhitespace() method.
It's not suitable for code you intend humans to maintain, but I'm assuming that shouldn't be an issue since you're dealing with a 60k line file!
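For example, a minimal sketch of the generation method using it (the same CreateCompilationUnit() as in the question; no workspace or formatting options are needed):
public string GenerateCode()
{
    // NormalizeWhitespace inserts just enough trivia to make the tree valid and readable.
    CompilationUnitSyntax compilationUnit = CreateCompilationUnit();
    SyntaxNode formattedNode = compilationUnit.NormalizeWhitespace();

    var sb = new StringBuilder();
    using (var writer = new StringWriter(sb))
    {
        formattedNode.WriteTo(writer);
    }
    return sb.ToString();
}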
Related
I have a common problem for which I haven't found a proper solution: I have multiple XML strings with a specific tag (e.g. MIME_SOURCE), and I don't know which XML string contains which value, but I have to replace all occurrences.
On the other hand, I have a dictionary containing every possible value from the XML as a key and the value to replace it with as the value. As I said, I don't know what to replace in which XML.
E.g.
Part of the first XML:
<MIME>
    <MIME_SOURCE>\Web\Bilder Groß\1509_131_021_01.jpg</MIME_SOURCE>
</MIME>
<MIME>
    <MIME_SOURCE>\Web\Bilder Groß\1509_131_021_01_MitWasserzeichen.jpg</MIME_SOURCE>
</MIME>
<MIME>
    <MIME_SOURCE>\Web\Bilder Groß\icon_top.jpg</MIME_SOURCE>
</MIME>
Part of the second XML:
<MIME>
    <MIME_SOURCE>\Web\Bilder klein\5478.jpg</MIME_SOURCE>
</MIME>
Dictionary looks like:
{"\Web\Bilder Groß\1509_131_021_01.jpg", "/Web/Bilder Groß/1509_131_021_01.jpg"}
{"\Web\Bilder Groß\1509_131_021_01_MitWasserzeichen.jpg", "/Web/Bilder Groß/1509_131_021_01_MitWasserzeichen.jpg"}
{"\Web\Bilder Groß\icon_top.jpg", "icon_top.jpg"}
{"\Web\Bilder klein\5478.jpg", "5478.jpg"}
My main problem: if I iterate through the dictionary for each XML string, the effort is the number of XML strings multiplied by the number of dictionary entries (n*m). This is really bad in my case, as there can be around a million XML strings and at least thousands of dictionary entries.
Currently I'm using string.Replace for each key of the dictionary for each XML.
Do you have a good idea how to speed up this process?
Edit:
I've changed code to the following one:
var regex = new Regex(@"<MIME_SOURCE>[\s\S]*?<\/MIME_SOURCE>");
foreach (Match match in regex.Matches(stringForXml))
{
    // do replacements ...
}
This fits the requirements for now, as the replacement is only done for each MIME_SOURCE in the XML. But I will also have a look at the mentioned algorithm.
The most correct way is to properly parse your XML. Then you can go through it in a single pass:
var xml = @"<root>
    <MIME>
        <MIME_SOURCE>\Web\Bilder Groß\1509_131_021_01.jpg</MIME_SOURCE>
    </MIME>
    <MIME>
        <MIME_SOURCE>\Web\Bilder Groß\1509_131_021_01_MitWasserzeichen.jpg</MIME_SOURCE>
    </MIME>
    <MIME>
        <MIME_SOURCE>\Web\Bilder Groß\icon_top.jpg</MIME_SOURCE>
    </MIME>
</root>";

var replacements = new Dictionary<string, string>()
{
    {@"\Web\Bilder Groß\1509_131_021_01.jpg", "/Web/Bilder Groß/1509_131_021_01.jpg"},
    {@"\Web\Bilder Groß\1509_131_021_01_MitWasserzeichen.jpg", "/Web/Bilder Groß/1509_131_021_01_MitWasserzeichen.jpg"},
    {@"\Web\Bilder Groß\icon_top.jpg", "icon_top.jpg"},
    {@"\Web\Bilder klein\5478.jpg", "5478.jpg"}
};

var doc = XDocument.Parse(xml);
foreach (var source in doc.Root.Descendants("MIME_SOURCE"))
{
    if (replacements.TryGetValue(source.Value, out var replacement))
    {
        source.Value = replacement;
    }
}
var result = doc.ToString();
If you can make some assumptions about how your XML is structured (e.g. no whitespace inside the <MIME_SOURCE> tags, no attributes, etc.), then you can use a regex, which again lets you make a single pass:
var result = Regex.Replace(xml, @"<MIME_SOURCE>([^<]+)</MIME_SOURCE>", match =>
{
    if (replacements.TryGetValue(match.Groups[1].Value, out var replacement))
    {
        return $"<MIME_SOURCE>{replacement}</MIME_SOURCE>";
    }
    return match.Value;
});
You'll have to benchmark different approaches yourself on your own data. Use BenchmarkDotNet.
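For example, a rough BenchmarkDotNet harness comparing the two approaches above might look like this (just a sketch; the empty collections are placeholders you would fill with your real XML strings and dictionary):
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class ReplaceBenchmarks
{
    // Placeholders: load your real data here (e.g. in a [GlobalSetup] method).
    private readonly List<string> xmlStrings = new List<string>();
    private readonly Dictionary<string, string> replacements = new Dictionary<string, string>();

    [Benchmark]
    public string ViaXDocument()
    {
        string last = null;
        foreach (var xml in xmlStrings)
        {
            var doc = XDocument.Parse(xml);
            foreach (var source in doc.Root.Descendants("MIME_SOURCE"))
            {
                if (replacements.TryGetValue(source.Value, out var replacement))
                {
                    source.Value = replacement;
                }
            }
            last = doc.ToString();
        }
        return last;
    }

    [Benchmark]
    public string ViaRegex()
    {
        string last = null;
        foreach (var xml in xmlStrings)
        {
            last = Regex.Replace(xml, @"<MIME_SOURCE>([^<]+)</MIME_SOURCE>", match =>
                replacements.TryGetValue(match.Groups[1].Value, out var replacement)
                    ? $"<MIME_SOURCE>{replacement}</MIME_SOURCE>"
                    : match.Value);
        }
        return last;
    }
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<ReplaceBenchmarks>();
}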
As I already mentioned in a comment above, I used to have a similar problem (see: c# Fastest string search in all files).
Using the Aho–Corasick algorithm that was suggested to me in the accepted answer, I was able to make the string search fast enough for my problem (going from minutes of execution time to mere seconds).
An implementation of said algorithm can be found here.
Here is a little sample of how to use the implementation linked above (looking for some needles in a haystack):
static bool anyViaAhoCorasick(string[] needles, string haystack)
{
    var trie = new Trie();
    trie.Add(needles);
    trie.Build();
    return trie.Find(haystack).Any();
}
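Applied to the replacement problem above, a rough sketch might look like this (assuming the same Trie type, whose Find returns the needles that occur in the haystack; xmlStrings and replacements stand in for your XML strings and dictionary):
// Build the trie once over all dictionary keys, then for each XML string
// replace only the keys that actually occur in it.
var trie = new Trie();
trie.Add(replacements.Keys.ToArray());
trie.Build();

foreach (var xml in xmlStrings)
{
    var updated = xml;
    foreach (string key in trie.Find(xml).Distinct())
    {
        updated = updated.Replace(key, replacements[key]);
    }
    // ... use updated ...
}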
I have been trying to search for string patterns in a large text file. I am reading it line by line and checking each line, which takes a lot of time. I did try a HashSet with ReadAllLines:
HashSet<string> strings = new HashSet<string>(File.ReadAllLines(#"D:\Doc\Tst.txt"));
Now when I try to search for the string it doesn't match, because it looks for a match on the entire row; I just want to check whether the string appears anywhere in the row.
I also tried this:
using (System.IO.StreamReader file = new System.IO.StreamReader(@"D:\Doc\Tst.txt"))
{
    while ((CurrentLine = file.ReadLine()) != null)
    {
        vals = chk_log(CurrentLine, date_Format, (range.Cells[i][counter]).Value2, vals);
        if (vals == true)
            break;
    }
}
bool chk_log(string LineText, string date_to_chk, string publisher, bool tvals)
{
    if (LineText.Contains(date_to_chk))
    {
        if (LineText.Contains(publisher))
            tvals = true;
        else
            tvals = false;
    }
    else
    {
        tvals = false;
    }
    return tvals;
}
But this is consuming too much time. Any help on this would be good.
Reading into a HashSet doesn't make sense to me (unless there are a lot of duplicated lines), since you aren't testing for membership of the set.
Taking a really naive approach, you could just do this:
var isItThere = File.ReadAllLines(@"d:\docs\st.txt")
    .Any(x => x.Contains(date_to_chk) && x.Contains(publisher));
65K lines at (say) 1K a line isn't a lot of memory to worry about, and I personally wouldn't bother with Parallel, since it sounds like it would be super fast to do anyway.
You could replace Any with First to find the first result, or Where to get an IEnumerable<string> containing all results.
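For example (same predicate as above):
// First matching line (use FirstOrDefault if a missing match shouldn't throw):
var firstMatch = File.ReadAllLines(@"d:\docs\st.txt")
    .First(x => x.Contains(date_to_chk) && x.Contains(publisher));

// All matching lines:
var allMatches = File.ReadAllLines(@"d:\docs\st.txt")
    .Where(x => x.Contains(date_to_chk) && x.Contains(publisher));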
You can use a compiled regular expression instead of String.Contains (compile once before looping over the lines). This typically gives better performance.
var regex = new Regex($"{date}|{publisher}", RegexOptions.Compiled);
foreach (string line in File.ReadLines(@"D:\Doc\Tst.txt"))
{
    if (regex.IsMatch(line)) break;
}
This also shows a convenient standard library function for reading a file line by line.
Or, depending on what you want to do...
var isItThere = File.ReadLines(#"D:\Doc\Tst.txt").Any(regex.IsMatch);
I'm currently using the snippet below to convert XML data (not well formed) to CSV format after doing some processing in between. It only converts those elements in the XML data that contain an integer from the list testList (List<int> testList), and it only converts and writes to the file once that match has been made. I need to use this algorithm for files that are several GBs in size. Currently it processes a 1 GB file in ~7.5 minutes. Can someone suggest any changes I could make to improve performance? I've fixed everything I could, but it won't get any faster. Any help will be appreciated!
Note: Message.TryParse is an external parsing method that I have to use and can't exclude or change.
Note: StreamElements is just a customized XmlReader that improves performance.
foreach (var element in StreamElements(p, "XML"))
{
    string joined = string.Concat(element.ToString().Split().Take(3)) +
                    string.Join(" ", element.ToString().Split().Skip(3));

    List<string> listX = new List<string>();
    listX.Add(joined.ToString());

    Message msg = null;
    if (Message.TryParse(joined.ToString(), out msg))
    {
        var values = element.DescendantNodes().OfType<XText>()
            .Select(v => Regex.Replace(v.Value, "\\s+", " "));

        foreach (var val in values)
        {
            for (int i = 0; i < testList.Count; i++)
            {
                if (val.ToString().Contains("," + testList[i].ToString() + ","))
                {
                    var line = string.Join(",", values);
                    sss.WriteLine(line);
                }
            }
        }
    }
}
I'm seeing some things you could probably improve:
You're calling .ToString() on joined a couple of times, when joined is already a string.
You may be able to speed up your regex replace by compiling your regex first, outside of the loop.
You're iterating over values multiple times, and each time it has to re-evaluate the LINQ query that defines values. Call .ToList() when you assign the result of that LINQ statement to values (see the sketch after this list).
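Putting those three suggestions together, a sketch of the loop could look like this (same StreamElements, Message.TryParse, testList, sss and p as in your code; the output behaviour is kept the same):
// Compiled regex and precomputed search strings, hoisted out of the loop.
var whitespace = new Regex(@"\s+", RegexOptions.Compiled);
var needles = testList.Select(i => "," + i + ",").ToList();

foreach (var element in StreamElements(p, "XML"))
{
    var parts = element.ToString().Split();
    string joined = string.Concat(parts.Take(3)) + string.Join(" ", parts.Skip(3));

    if (Message.TryParse(joined, out Message msg))
    {
        // Materialize the projection once instead of re-running the LINQ query.
        var values = element.DescendantNodes().OfType<XText>()
            .Select(v => whitespace.Replace(v.Value, " "))
            .ToList();

        foreach (var val in values)
        {
            foreach (var needle in needles)
            {
                if (val.Contains(needle))
                {
                    sss.WriteLine(string.Join(",", values));
                }
            }
        }
    }
}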
But before focusing on stuff like this, you really need to identify what's taking the time in your code. My guess is that it's almost all spent in these two places:
Reading from the XML stream
Writing to sss
If I'm right, then anything else you focus on is going to be premature optimization. Spend some time testing what happens if you comment out various parts of your loop, to see where the time is being spent.
As part of a recent project I had to read and write a CSV file and put the data in a grid view in C#. In the end I decided to use a ready-built parser to do the work for me.
Because I like to do that kind of stuff, I wondered how I would go about writing my own.
So far all I've managed to do is this:
// Read the header
StreamReader reader = new StreamReader(dialog.FileName);
string row = reader.ReadLine();
string[] cells = row.Split(',');

// Create the columns of the dataGridView
for (int i = 0; i < cells.Count() - 1; i++)
{
    DataGridViewTextBoxColumn column = new DataGridViewTextBoxColumn();
    column.Name = cells[i];
    column.HeaderText = cells[i];
    dataGridView1.Columns.Add(column);
}

// Display the contents of the file
while (reader.Peek() != -1)
{
    row = reader.ReadLine();
    cells = row.Split(',');
    dataGridView1.Rows.Add(cells);
}
My question: is carrying on like this a wise idea, and if it is (or isn't) how would I test it properly?
As a programming exercise (for learning and gaining experience) it is probably a very reasonable thing to do. For production code, it may be better to use an existing library mainly because the work is already done. There are quite a few things to address with a CSV parser. For example (randomly off the top of my head):
Quoted values (strings)
Embedded quotes in quoted strings
Empty values (NULL ... or maybe even NULL vs. empty).
Lines without the correct number of entries
Headers vs. no headers.
Recognizing different data types (e.g., different date formats).
If you have a very specific input format in a very controlled environment, though, you may not need to deal with all of those.
... is carrying on like this a wise idea ...?
Since you're doing this as a learning exercise, you may want to dig deeper into lexing and parsing theory. Your current approach will show its shortcomings fairly quickly as described in Stop Rolling Your Own CSV Parser!. It's not that parsing CSV data is difficult. (It's not.) It's just that most CSV parser projects treat the problem as a text splitting problem versus a parsing problem. If you take the time to define the CSV "language", the parser almost writes itself.
RFC 4180 defines a grammar for CSV data in ABNF form:
file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = %x2C
CR = %x0D ;as per section 6.1 of RFC 2234
DQUOTE = %x22 ;as per section 6.1 of RFC 2234
LF = %x0A ;as per section 6.1 of RFC 2234
CRLF = CR LF ;as per section 6.1 of RFC 2234
TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
This grammar shows how single characters are built up to create more and more complex language elements. (As written, definitions go the opposite direction from complex to simple.)
If you start with a grammar, you can write parsing functions that mirror non-terminal grammar elements (the lowercase items). Julian M Bucknall describes the process in Writing a parser for CSV data. Take a look at Test-Driven Development with ANTLR for an example of the same process using a parser generator.
Keep in mind, there is no one accepted CSV definition. CSV data in the wild is not guaranteed to implement all of the RFC 4180 suggestions.
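As a rough illustration of writing parsing functions that mirror the grammar's non-terminals, here is a sketch for a single record (no error handling, CRLF handling between records is left out, and the type and method names are mine, not from the RFC):
using System.Collections.Generic;
using System.Text;

public class CsvRecordParser
{
    private readonly string _input;
    private int _pos;

    public CsvRecordParser(string input) { _input = input; }

    // record = field *(COMMA field)
    public List<string> ParseRecord()
    {
        var fields = new List<string> { ParseField() };
        while (Peek() == ',') { _pos++; fields.Add(ParseField()); }
        return fields;
    }

    // field = (escaped / non-escaped)
    private string ParseField() => Peek() == '"' ? ParseEscaped() : ParseNonEscaped();

    // escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
    private string ParseEscaped()
    {
        _pos++; // consume the opening DQUOTE
        var sb = new StringBuilder();
        while (_pos < _input.Length)
        {
            char c = _input[_pos++];
            if (c == '"')
            {
                if (Peek() == '"') { sb.Append('"'); _pos++; } // 2DQUOTE is a literal quote
                else break;                                     // closing DQUOTE
            }
            else sb.Append(c);
        }
        return sb.ToString();
    }

    // non-escaped = *TEXTDATA
    private string ParseNonEscaped()
    {
        var sb = new StringBuilder();
        while (_pos < _input.Length && _input[_pos] != ',' && _input[_pos] != '\r' && _input[_pos] != '\n')
            sb.Append(_input[_pos++]);
        return sb.ToString();
    }

    private char Peek() => _pos < _input.Length ? _input[_pos] : '\0';
}
For example, new CsvRecordParser("\"a\",\"b,c\"").ParseRecord() yields the two fields a and b,c.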
Get (or make) some CSV data and write Unit Tests using NUnit or Visual Studio Testing Tools.
Be sure to test edge cases like
"csv","Data","with","a","trailing","comma",
and
"csv","Data","with,","commas","and","""quotes""","in","it"
This comes from
http://www.gigawebsolution.com/Posts/Details/61/Building-a-Simple-CSV-Parser-in-C#
public interface ICsvReaderWriter
{
    List<string[]> Read(string filePath, char delimiter);
    void Write(string filePath, List<string[]> lines, char delimiter);
}

public class CsvReaderWriter : ICsvReaderWriter
{
    public List<string[]> Read(string filePath, char delimiter)
    {
        var fileContent = new List<string[]>();
        using (var reader = new StreamReader(filePath, Encoding.Unicode))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (!string.IsNullOrEmpty(line))
                {
                    fileContent.Add(line.Split(delimiter));
                }
            }
        }
        return fileContent;
    }

    public void Write(string filePath, List<string[]> lines, char delimiter)
    {
        using (var writer = new StreamWriter(filePath, true, Encoding.Unicode))
        {
            foreach (var line in lines)
            {
                var data = line.Aggregate(string.Empty,
                        (current, column) => current + string.Format("{0}{1}", column, delimiter))
                    .TrimEnd(delimiter);
                writer.WriteLine(data);
            }
        }
    }
}
Parsing a CSV file isn't difficult, but it involves more than simply calling String.Split().
You are breaking the lines at each comma, but it's possible for fields to contain embedded commas. In those cases, CSV wraps the field in double quotes, so you must also look for double quotes and ignore commas within them. In addition, fields can even contain embedded double quotes: such a field must be wrapped in double quotes, and the embedded quotes must be "doubled up" to indicate that the quote is a literal character.
If you'd like to see how I did it, you can check out this article.
I guess I need some regex help. I want to find all tags like <?abc?> so that I can replace them with whatever the result of the code run inside is. I just need help with the regex for the tag/code string, not with parsing the code inside :p.
<b><?abc print 'test' ?></b> would result in <b>test</b>
Edit: Not specifically but in general, matching (<?[chars] (code group) ?>)
This will build up a new copy of the string source, replacing <?abc code?> with the result of process(code)
Regex abcTagRegex = new Regex(@"\<\?abc(?<code>.*?)\?>");
StringBuilder newSource = new StringBuilder();
int curPos = 0;

foreach (Match abcTagMatch in abcTagRegex.Matches(source))
{
    string code = abcTagMatch.Groups["code"].Value;
    string result = process(code);

    newSource.Append(source.Substring(curPos, abcTagMatch.Index - curPos));
    newSource.Append(result);
    curPos = abcTagMatch.Index + abcTagMatch.Length;
}
newSource.Append(source.Substring(curPos));
source = newSource.ToString();
N.B. I've not been able to test this code, so some of the functions may be slightly the wrong name, or there may be some off-by-one errors.
var regex = new Regex(@"<\?(\w+) (\w+) (.+?)\?>");
This will take this source
<b><?abc print 'test' ?></b>
and break it up like this:
Value: <?abc print 'test' ?>
SubMatch: abc
SubMatch: print
SubMatch: 'test'
These can then be sent to a method that can handle it differently depending on what the parts are.
If you need more advanced syntax handling, you'll need to go beyond regex, I believe.
I designed a template engine using ANTLR, but that's way more complex ;)
exp = new Regex(@"<\?abc print '(.+)' \?>");
str = exp.Replace(str, "$1");
Something like this should do the trick. Change the regexes as you see fit.