I'm currently learning C# and I've been working on an XML parser for the last two days. It actually works fine; my issue is the amount of time it takes to parse more than 10k pages. This is my code:
public static void startParse(int id_min, int id_max, int numberofthreads)
{
    List<Thread> workerThreads = new List<Thread>();
    List<string> results = new List<string>();

    // Split [id_min, id_max] into one contiguous chunk per thread.
    int part = (id_max - id_min) / numberofthreads;
    int start = id_min;
    int end = 0;

    for (int i = 0; i < numberofthreads; i++)
    {
        if (i != 0)
            start = end + 1;
        end = start + part;
        if (i == numberofthreads - 1)
            end = id_max; // last chunk absorbs any remainder

        // Copy the loop variables so each thread captures its own values.
        int _i = i;
        int _start = start;
        int _end = end;
        Thread t = new Thread(() =>
        {
            Console.WriteLine("i = " + _i);
            Console.WriteLine("start = " + _start);
            Console.WriteLine("end = " + _end + "\r\n");
            string parse = new ParseWH().parse(_start, _end);
            lock (results)
            {
                results.Add(parse);
            }
        });
        workerThreads.Add(t);
        t.Start();
    }

    // Wait for every worker before writing the combined output.
    foreach (Thread thread in workerThreads)
        thread.Join();

    File.WriteAllText(".\\result.txt", String.Join("", results));
    Console.Beep();
}
What I'm actually doing is splitting the range of elements to parse across several threads, so each thread handles X elements. Each 100 elements takes roughly 20 seconds, yet it took me 17 minutes to parse 10,000 elements. What I need is for the threads to each work simultaneously on 100 of those 10,000 elements so the whole job can be done in about 20 seconds. Is there a solution for that?
Parse code:
public string parse(int id_min, int id_max)
{
    XmlDocument xml = new XmlDocument();
    WebClient user;
    XmlElement element;
    XmlNodeList nodes;
    string result = "";
    string address;
    int i = id_min;

    while (i <= id_max)
    {
        user = new WebClient();
        // user.Headers.Add("User-Agent", "Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30");
        user.Encoding = UTF8Encoding.UTF8;
        address = "http://fr.wowhead.com/item=" + i + "?xml";
        xml.LoadXml(user.DownloadString(new Uri(address)));
        element = xml.DocumentElement;
        nodes = element.SelectNodes("/wowhead");

        // Skip ids that do not exist on the site.
        if (xml.SelectSingleNode("/wowhead/error") != null)
        {
            Console.WriteLine("error " + i);
            i++;
            continue;
        }

        result += "INSERT INTO item_wh (entry, class, subclass, displayId, quality, name, level) VALUES (";
        foreach (XmlNode node in nodes)
        {
            // entry
            result += node["item"].Attributes["id"].InnerText;
            result += ", ";
            // class
            result += node["item"]["class"].Attributes["id"].InnerText;
            result += ", ";
            // subclass
            result += node["item"]["subclass"].Attributes["id"].InnerText;
            result += ", ";
            // displayId
            result += node["item"]["icon"].Attributes["displayId"].InnerText;
            result += ", ";
            // quality
            result += node["item"]["quality"].Attributes["id"].InnerText;
            result += ", \"";
            // name
            result += node["item"]["name"].InnerText;
            result += "\", ";
            // level
            result += node["item"]["level"].InnerText;
            result += ");";
            // newline
            result += "\r\n";
        }
        i++;
    }
    return result;
}
The best solution for CPU-bound work (such as parsing) is to launch as many threads as there are cores in your machine: fewer than that and you are not taking advantage of all your cores; more than that and excessive context switching may kick in and hinder performance.
So essentially, threadnbrs should be set to Environment.ProcessorCount.
Also, consider using the Parallel class instead of creating threads yourself:
Parallel.ForEach(thingsToParse, somethingToParse =>
{
    var parsed = Parse(somethingToParse);
    results.Add(parsed); // results must be thread-safe here; see the note below
});
You must agree that it looks much cleaner and is much easier to maintain.
Also, you'll be better off using a ConcurrentBag instead of a regular List + lock, as ConcurrentBag is built for concurrent loads and could give you better performance.
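Combining the two suggestions, a minimal sketch might look like this (ParseWH and the id range come from the question; capping the degree of parallelism explicitly is an optional choice, since Parallel picks its own by default):

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelParser
{
    public static void StartParse(int id_min, int id_max)
    {
        var results = new ConcurrentBag<string>(); // thread-safe, no lock needed

        Parallel.ForEach(
            Enumerable.Range(id_min, id_max - id_min + 1),
            new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
            id =>
            {
                // parse one element per iteration; ParseWH is the question's class
                results.Add(new ParseWH().parse(id, id));
            });

        // note: bag order is not deterministic, just like the original
        // per-thread result list
        File.WriteAllText(".\\result.txt", string.Join("", results));
    }
}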
Finally! I got it working by launching multiple processes of my application simultaneously.
Which means that if I have 10k elements, I run 10 processes of 1,000 elements each. Increase the number of processes to decrease the number of elements per process, and it goes faster and faster!
(I'm currently on a very fast Internet connection, with a Samsung 960 M.2 SSD for storage and a 6-core Core i7 Skylake.)
Okay, so I found "Trying to run multiple HTTP requests in parallel, but being limited by Windows (registry)": the mechanism is called the "thread pool". I finally decided to download the XML files first and then parse the documents offline, instead of parsing the website on the fly to produce the SQL format. The new method works: I can download and write up to 10,000 XML files in only 9 seconds. I tried to push it to 150k (all the website's pages), but now I have a strange bug: I get duplicate items... I'm going to try to rewrite the full code using the correct approach for pools, multiple tasks/threads, and Dictionary and IEnumerable containers. Fingers crossed it works on 150k items without losing data in the process, and I'll post back the full code.
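For reference, the per-connection cap that throttles parallel HTTP requests in classic .NET can be raised via ServicePointManager before starting the downloads. A minimal sketch of the download-first approach, assuming a pre-existing .\xml\ output folder (the URL pattern is from the question; the limit of 100 is an arbitrary choice):

using System.Net;
using System.Threading.Tasks;

class Downloader
{
    static void Main()
    {
        // Classic .NET throttles concurrent HTTP connections per host
        // (the default is 2), which silently serializes parallel downloads.
        ServicePointManager.DefaultConnectionLimit = 100;

        // Download the raw XML now, parse offline later (the poster's approach).
        Parallel.For(1, 10001, id =>
        {
            using (var client = new WebClient())
            {
                client.DownloadFile("http://fr.wowhead.com/item=" + id + "?xml",
                                    ".\\xml\\" + id + ".xml");
            }
        });
    }
}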
Related
I am working on a C# .NET Core 3.1 application that needs to insert 300-500 million rows of Avro file data into a GBQ table. My idea is to batch-insert the data using .NET Tasks so the inserts run asynchronously without blocking the main thread, and to log a success or failure message when all tasks are finished. I wrote some sample code: if I use Task.Run(), it breaks the batchId and loses some data. Using RunSynchronously works fine, but it blocks the main thread and takes some time, which is still acceptable. I'm just wondering what's wrong with my code and whether Task.Run() is a good idea for my case. Thanks a lot! Here is my code: https://dotnetfiddle.net/CPKsMv Just in case that doesn't work, it's pasted here again:
using System;
using System.Collections;
using System.Threading.Tasks;
public class Program
{
public static void Main()
{
ArrayList forecasts = new ArrayList();
for(var k = 0; k < 100; k++){
forecasts.Add(k);
}
int size = 6;
var taskNum = (int) Math.Ceiling(forecasts.Count / (double) size);
Console.WriteLine("task number:" + taskNum);
Console.WriteLine("item number:" + forecasts.Count);
Task[] tasks = new Task[taskNum];
var i = 0;
for(i = 0; i < taskNum; i++) {
int start = i * size;
if (forecasts.Count - start < size) {
size = forecasts.Count - start;
}
// Method 1: This works well, but need take some time to finish
//tasks[i] = new Task(() => {
//var batchedforecastRows = forecasts.GetRange(start, size);
// GbqTable.InsertRowsAsync(batchedforecastRows);
//Console.WriteLine("batchID:" + (i + 1) + "["+string.Join( ",", batchedforecastRows.ToArray())+"]");
//});
// tasks[i].RunSynchronously();
// Method 2: will lose data: (94, 95) and batchId is messed
// Sample Print below:
// batchID:18 Inserted:[90,91,92,93]
// batchID:18 Inserted:[96,97,98,99]
tasks[i] = Task.Run(() => {
var batchedforecastRows = forecasts.GetRange(start, size);
// GbqTable.InsertRowsAsync(batchedforecastRows);
Console.WriteLine("batchID:" + (i + 1) + " Inserted:["+string.Join( ",", batchedforecastRows.ToArray())+"]");
});
}
}
}
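The lost rows in Method 2 come from the lambdas closing over the mutable variables size and i: every queued task reads whatever values those variables hold when it actually runs, so the last iteration's shrunken size (and the post-loop i) leak into earlier batches. Main also never waits for the tasks. A sketch of the usual fix, written as a drop-in replacement for the loop in Main, using per-iteration locals and Task.WaitAll (batchSize and batchId are names I introduced):

// Capture per-iteration copies so each task sees its own batch,
// and block until every task has finished before Main returns.
for (i = 0; i < taskNum; i++)
{
    int start = i * size;
    int batchSize = Math.Min(size, forecasts.Count - start); // size itself stays untouched
    int batchId = i + 1; // private copy of i for this iteration

    tasks[i] = Task.Run(() => {
        var batchedforecastRows = forecasts.GetRange(start, batchSize);
        Console.WriteLine("batchID:" + batchId + " Inserted:[" + string.Join(",", batchedforecastRows.ToArray()) + "]");
    });
}
Task.WaitAll(tasks); // without this, Main may exit before the tasks run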
I am working on a program that tracks the amount of time it takes to sum all prime numbers up to a certain number, and I'm trying to find the most efficient way to obtain this value; I have a Stopwatch (System.Diagnostics) tracking how long it takes. Currently, I can find the sum of all prime numbers up to 40,000 in about 33-34 seconds with the code below:
private void ListThePrimes()
{
    prime = false;
    while (primes < 30000)
    {
        for (int i = 2; i < n; i++)
        {
            output = n % i;
            if (output == 0)
            {
                prime = false;
                break;
            }
            else
            {
                prime = true;
            }
        }
        if (prime == true)
        {
            sum += n; // n itself is the prime just found
            primes++;
        }
        n++;
    }
}
However, I feel like there is a way to write this code more efficiently, as my goal is to reach the same time with much higher numbers, like 200,000 or so. This is my Stopwatch code, which runs on a button click, if needed:
var timer = new Stopwatch();
timer.Start();
ListThePrimes();
timer.Stop();
TimeSpan timeTaken = timer.Elapsed;
string foo = timeTaken.ToString(@"m\:ss\.fff");
MessageBox.Show("The sum is " + sum + ". It took this program " + foo + " seconds to run.");
Would appreciate it if someone could let me know if there is a more efficient way to perform this action.
You need to optimize how you get prime numbers; your way is extremely inefficient. The common way to do so is the Sieve of Eratosthenes. Using this method I can easily get all prime numbers up to 100,000 in milliseconds, and summing them is trivial beyond that.
var n = 100000;
var a = Enumerable.Range(0, n + 1).Select(_ => true).ToArray();
for (var i = 2; i <= Math.Sqrt(n); i++)
{
    if (a[i])
    {
        // mark every multiple of i, starting at i*i, as composite
        for (var j = i * i; j <= n; j += i)
        {
            a[j] = false;
        }
    }
}
var result = a.Select((x, i) => new { IsPrime = x, Prime = i })
              .Where(x => x.IsPrime && x.Prime > 1)
              .Sum(x => x.Prime);
Console.WriteLine(result);
Live example: https://dotnetfiddle.net/eBelZD
You really have to work on your prime number method. A new approach would be the sieve of Eratosthenes, but your current code can also be improved quite a bit:
Remove unnecessary variables like output; just put the expression directly in the if condition.
You can halve the numbers you're checking, because apart from 2 primes are always odd, so don't do n++ but n += 2 (see the sketch below).
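A minimal sketch of trial division with both changes applied, plus the standard stop-at-the-square-root bound (the method name and long sum are my own choices; 2 must be handled separately once the loop steps by two):

// Trial division with the two fixes: odd candidates only, divisors only up
// to sqrt(n). 2 is handled separately because the loop skips even numbers.
static long SumPrimes(int count)
{
    long sum = 2;      // 2 is the only even prime
    int primes = 1;    // we already counted 2
    for (int n = 3; primes < count; n += 2)
    {
        bool prime = true;
        for (int i = 3; i * i <= n; i += 2)
        {
            if (n % i == 0) { prime = false; break; }
        }
        if (prime)
        {
            sum += n;
            primes++;
        }
    }
    return sum;
}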
This is from Joseph Albahari's excellent C# 5.0 in a Nutshell book.
In one of his chapters, he mentions a race condition in this code block. My guess is it's meant to be pretty self-evident, as he didn't bother to specify where it is, but running the code multiple times I was unable to reproduce said race condition:
_button.Click += (sender, args) =>
{
_button.IsEnabled = false;
Task.Run (() => Go());
};
void Go()
{
for (int i = 1; i < 5; i++)
{
int result = GetPrimesCount (i * 1000000, 1000000);
Dispatcher.BeginInvoke (new Action (() =>
_results.Text += result + " primes between " + (i*1000000) + " and " +
((i+1)*1000000-1) + Environment.NewLine));
}
Dispatcher.BeginInvoke (new Action (() => _button.IsEnabled = true));
}
I don't agree with @Serge's answer. You don't even need multiple threads to see the problem. Try running your code in its original form and notice the output. For me it's the following, and it's sometimes random (I fixed the first value):
1000000 primes between 5000000 and 5999999
1000000 primes between 5000000 and 5999999
1000000 primes between 5000000 and 5999999
1000000 primes between 5000000 and 5999999
Notice the last two values on each line: they're the same in every row, but they should depend on i. The problem is not that the operation isn't atomic, because the GUI thread executes the actions sequentially anyway.
This occurs because the lambda passed to BeginInvoke reads the value of i at the moment of execution, not at the moment the lambda is created, so by the time the actions run they all see the loop's final value of i. The solution is to explicitly pass i as a parameter to the lambda, like so:
for (int i = 1; i < 5; i++)
{
    int result = 1000000; // stand-in for GetPrimesCount so the demo runs anywhere
    Dispatcher.BeginInvoke(new Action<int>(j =>
        _results.Text += result + " primes between " + (j * 1000000) + " and " +
        ((j + 1) * 1000000 - 1) + Environment.NewLine), i);
}
This line will not be executed atomically:
_results.Text += result + " primes between " + (i*1000000) + " and " + ((i+1)*1000000-1) + Environment.NewLine));
So, being executed from 5 concurrently running threads, it may produce different, funny results.
I can convert the PDF pages into images. If the PDF is less than 50 pages, it works fast...
but any PDF larger than 1000 pages takes a lot of time to complete.
Can anyone review this code and make it work for large file sizes?
I have used the PdfLibNet DLL (which will not work in 4.0) in .NET 3.5.
Here is my sample code:
public void ConverIMG(string filename)
{
PDFWrapper wrapper = new PDFWrapper();
wrapper.RenderDPI = Dpi;
wrapper.LoadPDF(filename);
int count = wrapper.PageCount;
for (int i = 1; i <= wrapper.PageCount; i++)
{
string fileName = AppDomain.CurrentDomain.BaseDirectory + @"IMG\" + i.ToString() + ".png";
wrapper.ExportJpg(fileName, i, i, (double)100, 100);
while (wrapper.IsJpgBusy)
{
Thread.Sleep(50);
}
}
wrapper.Dispose();
}
PS:
We need to split the pages and convert them to images in parallel, and we need to get the completion status.
If PDFWrapper performance degrades for documents bigger than 50 pages, it suggests it is not very well written. To overcome this you could do the conversion in 50-page batches and recreate the PDFWrapper after each batch. There is an assumption here that ExportJpg() gets slower with the number of calls and that its initial speed does not depend on the size of the PDF.
This is only a workaround for apparent problems in PDFWrapper; a proper solution would be to use a fixed library. Also, I would suggest Thread.Sleep(1) if you really need to wait while yielding.
public void ConverIMG(string filename)
{
PDFWrapper wrapper = new PDFWrapper();
wrapper.RenderDPI = Dpi;
wrapper.LoadPDF(filename);
int count = wrapper.PageCount;
for (int i = 1; i <= count; i++)
{
string fileName = AppDomain.CurrentDomain.BaseDirectory + @"IMG\" + i.ToString() + ".png";
wrapper.ExportJpg(fileName, i, i, (double) 100, 100);
while (wrapper.IsJpgBusy)
{
Thread.Sleep(1);
}
if (i % 50 == 0)
{
wrapper.Dispose();
wrapper = new PDFWrapper();
wrapper.RenderDPI = Dpi;
wrapper.LoadPDF(filename);
}
}
wrapper.Dispose();
}
In my program I need to write large text files (~300 MB); the files contain numbers separated by spaces. I'm using this code:
TextWriter guessesWriter = TextWriter.Synchronized(new StreamWriter("guesses.txt"));
private void QueueStart()
{
while (true)
{
if (writeQueue.Count > 0)
{
guessesWriter.WriteLine(writeQueue[0]);
writeQueue.Remove(writeQueue[0]);
}
}
}
private static void Check()
{
TextReader tr = new StreamReader("data.txt");
string guess = tr.ReadLine();
b = 0;
List<Thread> threads = new List<Thread>();
while (guess != null) // Reading each row and analyze it
{
string[] guessNumbers = guess.Split(' ');
List<int> numbers = new List<int>();
foreach (string s in guessNumbers) // Converting each guess to a list of numbers
numbers.Add(int.Parse(s));
threads.Add(new Thread(GuessCheck));
threads[b].Start(numbers);
b++;
guess = tr.ReadLine();
}
}
private static void GuessCheck(object listNums)
{
List<int> numbers = (List<int>) listNums;
if (!CloseNumbersCheck(numbers))
{
writeQueue.Add(numbers[0] + " " + numbers[1] + " " + numbers[2] + " " + numbers[3] + " " + numbers[4] + " " + numbers[5] + " " + numbers[6]);
}
}
private static bool CloseNumbersCheck(List<int> numbers)
{
int divideResult = numbers[0]/10;
for (int i = 1; i < 6; i++)
{
if (numbers[i]/10 != divideResult)
return false;
}
return true;
}
The file data.txt contains data in this format (the dots mean more numbers following the same logic):
1 2 3 4 5 6 1
1 2 3 4 5 6 2
1 2 3 4 5 6 3
.
.
.
1 2 3 4 5 6 8
1 2 3 4 5 7 1
.
.
.
I know this is not very efficient and I was looking for some advice on how to make it quicker.
If you know how to save a LARGE amount of numbers more efficiently than in a .txt file, I would appreciate it.
One way to improve efficiency is a larger buffer on your output stream. You are using the defaults, which give you roughly a 1 KB buffer, but you won't see maximum performance with less than a 64 KB buffer. Open your file like this (the false is the append flag, which StreamWriter's string-path overloads require before the encoding):
new StreamWriter("guesses.txt", false, new UTF8Encoding(false, true), 65536)
Instead of reading and writing line by line (ReadLine and WriteLine), you should read and write big blocks of data (ReadBlock and Write). This way you will access the disk a lot less and get a big performance boost. But you will need to manage the end of each line yourself (look at Environment.NewLine).
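A minimal sketch of block-wise copying with those calls, reusing the question's file names (the 64 KB buffer size is an arbitrary choice):

using System.IO;

// Read in 64 KB character blocks instead of line by line; the consumer has
// to split on Environment.NewLine itself.
using (var reader = new StreamReader("data.txt"))
using (var writer = new StreamWriter("guesses.txt"))
{
    char[] buffer = new char[65536];
    int read;
    while ((read = reader.ReadBlock(buffer, 0, buffer.Length)) > 0)
    {
        writer.Write(buffer, 0, read); // write the whole block as-is
    }
}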
Efficiency could be improved by using a BinaryWriter. Then you could just write the integers out directly, which lets you skip the parsing step on the read and the ToString conversion on the write.
It also looks like you are creating a bunch of threads in there. Additional threads will slow down your performance; you should do all of the work on a single thread, since threads are very heavyweight objects.
Here is a more-or-less direct conversion of your code to use a BinaryWriter. (This does not address the thread problem.)
BinaryWriter guessesWriter = new BinaryWriter(File.Open("guesses.dat", FileMode.Create)); // BinaryWriter wraps a Stream, not a TextWriter
private void QueueStart()
{
while (true)
{
if (writeQueue.Count > 0)
{
lock (guessesWriter)
{
guessesWriter.Write(writeQueue[0]);
}
writeQueue.Remove(writeQueue[0]);
}
}
}
private const int numbersPerThread = 6;
private static void Check()
{
BinaryReader tr = new BinaryReader(File.OpenRead("data.txt")); // BinaryReader wraps a Stream; data.txt must itself contain binary Int32s for this to work
b = 0;
List<Thread> threads = new List<Thread>();
while (tr.BaseStream.Position < tr.BaseStream.Length)
{
List<int> numbers = new List<int>(numbersPerThread);
for (int index = 0; index < numbersPerThread; index++)
{
numbers.Add(tr.ReadInt32());
}
threads.Add(new Thread(GuessCheck));
threads[b].Start(numbers);
b++;
}
}
Try using a buffer in between. There is a BufferedStream. Right now you use very inefficient disk access patterns.
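A minimal sketch of what that could look like, wrapping the output FileStream in a BufferedStream before handing it to the writer (the file name, buffer size, and sample line are placeholders):

using System.IO;

// Accumulate writes in memory and flush them to disk in large chunks
// instead of hitting the disk on every WriteLine.
var fileStream = new FileStream("guesses.txt", FileMode.Create, FileAccess.Write);
var buffered = new BufferedStream(fileStream, 1 << 16); // 64 KB buffer
using (var writer = new StreamWriter(buffered))
{
    writer.WriteLine("1 2 3 4 5 6 1"); // example line in the question's format
}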