I can convert the PDF pages into images. If the PDF has fewer than 50 pages it works fast, but if a PDF is larger than 1000 pages it takes a lot of time to complete.
Can anyone review this code and make it work for large files?
I have used the PdfLibNet DLL (which will not work in 4.0) with .NET 3.5.
Here is my sample code:
public void ConverIMG(string filename)
{
    PDFWrapper wrapper = new PDFWrapper();
    wrapper.RenderDPI = Dpi;
    wrapper.LoadPDF(filename);
    int count = wrapper.PageCount;
    for (int i = 1; i <= wrapper.PageCount; i++)
    {
        string fileName = AppDomain.CurrentDomain.BaseDirectory + @"IMG\" + i.ToString() + ".png";
        wrapper.ExportJpg(fileName, i, i, (double)100, 100);
        while (wrapper.IsJpgBusy)
        {
            Thread.Sleep(50);
        }
    }
    wrapper.Dispose();
}
PS:
We need to split the pages and convert them to images in parallel, and we need to get the completion status.
If PDFWrapper performance degrades for documents bigger than 50 pages, it suggests the library is not very well written. To overcome this you could do the conversion in 50-page batches and recreate the PDFWrapper after each batch. This assumes that ExportJpg() gets slower with the number of calls and that its initial speed does not depend on the size of the PDF.
This is only a workaround for the apparent problems in PDFWrapper; a proper solution would be to use a fixed library. I would also suggest Thread.Sleep(1) if you really need to busy-wait while yielding.
public void ConverIMG(string filename)
{
    PDFWrapper wrapper = new PDFWrapper();
    wrapper.RenderDPI = Dpi;
    wrapper.LoadPDF(filename);
    int count = wrapper.PageCount;
    for (int i = 1; i <= count; i++)
    {
        string fileName = AppDomain.CurrentDomain.BaseDirectory + @"IMG\" + i.ToString() + ".png";
        wrapper.ExportJpg(fileName, i, i, (double)100, 100);
        while (wrapper.IsJpgBusy)
        {
            Thread.Sleep(1);
        }
        if (i % 50 == 0)
        {
            wrapper.Dispose();
            wrapper = new PDFWrapper();
            wrapper.RenderDPI = Dpi;
            wrapper.LoadPDF(filename);
        }
    }
    wrapper.Dispose();
}
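Addressing the PS: in .NET 3.5 (no Task Parallel Library) you can split the page range across plain threads and keep a shared progress counter. This is only a hedged sketch; it assumes separate PDFWrapper instances can safely load the same file on different threads, which I have not verified for PdfLibNet, and the method name, batchSize, and the counter are illustrative.

public void ConvertImagesParallel(string filename, int batchSize)
{
    // Probe the document once for its page count.
    PDFWrapper probe = new PDFWrapper();
    probe.LoadPDF(filename);
    int pageCount = probe.PageCount;
    probe.Dispose();

    int completed = 0;                  // completion status; read it from the UI thread
    object progressLock = new object();
    List<Thread> workers = new List<Thread>();

    for (int start = 1; start <= pageCount; start += batchSize)
    {
        int first = start;                                      // local copies so each
        int last = Math.Min(start + batchSize - 1, pageCount);  // thread sees its own range
        Thread t = new Thread(delegate()
        {
            PDFWrapper w = new PDFWrapper();  // one wrapper per thread
            w.RenderDPI = Dpi;
            w.LoadPDF(filename);
            for (int i = first; i <= last; i++)
            {
                string file = AppDomain.CurrentDomain.BaseDirectory + @"IMG\" + i + ".png";
                w.ExportJpg(file, i, i, (double)100, 100);
                while (w.IsJpgBusy) Thread.Sleep(1);
                lock (progressLock) completed++;
            }
            w.Dispose();
        });
        workers.Add(t);
        t.Start();
    }
    foreach (Thread t in workers) t.Join();  // completed == pageCount here
}

With a 1,000-page file and batchSize = 50 this starts 20 threads at once; capping the number of concurrent workers would be kinder to memory, but that is beyond the sketch.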
I am working on a C# .NET Core 3.1 application that needs to insert 300-500 million rows of Avro file data into a GBQ table. My idea is to batch-insert the data using .NET Tasks, so the inserts run asynchronously without blocking the main thread, and to log a success or failure message once all tasks have finished. I wrote some sample code: if I use Task.Run(), it breaks the batchId and loses some data. Using RunSynchronously() works fine, but it blocks the main thread and takes some time, which is still acceptable. I'm just wondering what's wrong with my code and whether Task.Run() is a good idea for my case. Thanks a lot! Here is my code: https://dotnetfiddle.net/CPKsMv Just in case that doesn't work well, it's pasted here again:
using System;
using System.Collections;
using System.Threading.Tasks;

public class Program
{
    public static void Main()
    {
        ArrayList forecasts = new ArrayList();
        for (var k = 0; k < 100; k++)
        {
            forecasts.Add(k);
        }
        int size = 6;
        var taskNum = (int)Math.Ceiling(forecasts.Count / (double)size);
        Console.WriteLine("task number:" + taskNum);
        Console.WriteLine("item number:" + forecasts.Count);
        Task[] tasks = new Task[taskNum];
        var i = 0;
        for (i = 0; i < taskNum; i++)
        {
            int start = i * size;
            if (forecasts.Count - start < size)
            {
                size = forecasts.Count - start;
            }
            // Method 1: This works well, but needs some time to finish
            //tasks[i] = new Task(() => {
            //    var batchedforecastRows = forecasts.GetRange(start, size);
            //    GbqTable.InsertRowsAsync(batchedforecastRows);
            //    Console.WriteLine("batchID:" + (i + 1) + "[" + string.Join(",", batchedforecastRows.ToArray()) + "]");
            //});
            //tasks[i].RunSynchronously();

            // Method 2: will lose data (94, 95) and the batchId is messed up.
            // Sample print:
            // batchID:18 Inserted:[90,91,92,93]
            // batchID:18 Inserted:[96,97,98,99]
            tasks[i] = Task.Run(() =>
            {
                var batchedforecastRows = forecasts.GetRange(start, size);
                // GbqTable.InsertRowsAsync(batchedforecastRows);
                Console.WriteLine("batchID:" + (i + 1) + " Inserted:[" + string.Join(",", batchedforecastRows.ToArray()) + "]");
            });
        }
    }
}
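For what it's worth, the symptoms described in the comments look like classic closure-capture issues rather than Task.Run() itself: size is mutated by the last iteration before earlier tasks get scheduled (losing 94 and 95), i is shared across all lambdas (duplicated batch IDs), and Main exits without waiting for the tasks. A hedged sketch of the usual remedy, copying per-iteration values into locals and awaiting completion:

for (int i = 0; i < taskNum; i++)
{
    int start = i * size;
    int batchSize = Math.Min(size, forecasts.Count - start); // local; never mutate the shared 'size'
    int batchId = i + 1;                                     // local copy of the loop counter
    tasks[i] = Task.Run(() =>
    {
        var batch = forecasts.GetRange(start, batchSize);
        Console.WriteLine("batchID:" + batchId + " Inserted:[" + string.Join(",", batch.ToArray()) + "]");
    });
}
Task.WaitAll(tasks); // don't let Main return before the batches finish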
My goal is to calculate how much space a file occupies in MemoryCache. To do this I wrote a method that loads thousands of objects into memory and computes the difference between some values taken before and after loading the objects.
This is before adding the files:
Process p = Process.GetCurrentProcess();
long memoriaNonPaginata = p.NonpagedSystemMemorySize64;
long memoriaPaginata = p.PagedMemorySize64;
long memoriaDiSistema = p.PagedSystemMemorySize64;
long ramIniziale = memoriaNonPaginata + memoriaPaginata + memoriaDiSistema;
Then I add the files:
while (counter < numeroOggettiInCache)
{
    string k = counter.ToString();
    data = data + k.Substring(0, 1);
    int position = r.Next(0, 28000);
    data = data.Remove(position, 1);
    _cache.CreateEntry("cucu" + counter.ToString());
    _cache.Set("cucu" + counter.ToString(), data, opt);
    counter++;
}
Afterwards I read the values again:
Process p1 = Process.GetCurrentProcess();
long memoriaNonPaginataPost = p1.NonpagedSystemMemorySize64;
long memoriaPaginataPost = p1.PagedMemorySize64;
long memoriaDiSistemaPost = p1.PagedSystemMemorySize64;
long ramFinale = memoriaNonPaginataPost + memoriaPaginataPost + memoriaDiSistemaPost;
My question is whether the properties I used (NonpagedSystemMemorySize64, PagedMemorySize64, PagedSystemMemorySize64) are correct and sufficient to calculate how much space the cached files use. Thank you, regards.
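Not an authoritative answer, but note that those Process counters describe the whole process (native plus managed, shared system memory included), so the delta will be noisy. A common alternative for managed data, shown here only as a sketch (PopulateCache is a hypothetical name for the while-loop above), is GC.GetTotalMemory:

// GC.GetTotalMemory(true) forces a full collection first, so the delta
// approximates the managed memory retained by the cache entries.
long before = GC.GetTotalMemory(true);

PopulateCache(); // hypothetical name for the loop that fills _cache

long after = GC.GetTotalMemory(true);
Console.WriteLine("Approximate cache footprint: {0} bytes", after - before);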
In C#, 64-bit Windows + .NET 4.5 (or later) + enabling gcAllowVeryLargeObjects in the App.config file allows for objects larger than two gigabytes. That's cool, but unfortunately, the maximum number of elements that C# allows in a character array is still limited to about 2^31 = 2.15 billion chars. Testing confirmed this.
To overcome this, Microsoft recommends in Option B creating the arrays natively (their 'Option C' doesn't even compile). That suits me, as speed is also a concern. Is there some tried and trusted unsafe / native / interop / PInvoke code for .NET out there that can replace and act as an enhanced StringBuilder to get around the 2 billion element limit?
Unsafe/pinvoke code is preferred, but not a deal breaker. Alternatively, is there a .NET (safe) version available?
Ideally, the StringBuilder replacement will start off small (preferably user-defined), and then repeatedly double in size each time the capacity has been exceeded. I'm mostly looking for append() functionality here. Saving the string to a file would be useful too, though I'm sure I could program that bit if substring() functionality is also incorporated. If the code uses pinvoke, then obviously some degree of memory management must be taken into account to avoid memory loss.
I don't want to recreate the wheel if some simple code already exists, but on the other hand, I don't want to download and incorporate a DLL just for this simple functionality.
I'm also using .NET 3.5 to cater for users who don't have the latest version of Windows.
The size of strings in C++ is unlimited according to this answer.
You could write your string processing code in C++ and use a DLL import to communicate between your C# code and C++ code. This makes it simple to call your C++ functions from the C# code.
The parts of your code which do the processing on the large strings will dictate where the border between the C++ and C# code will need to be. Obviously any references to the large strings will need to be kept on the C++ side, but aggregate processing result information can then be communicated back to the C# code.
Here is a link to a code project page that gives some guidance on C# to C++ DLL imports.
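As a purely illustrative sketch of the shape this takes (BigString.dll and every function in it are hypothetical, something you would write yourself in C++; the commented declarations show the matching C++ exports):

using System;
using System.Runtime.InteropServices;

// Hedged sketch: BigString.dll is a hypothetical C++ DLL exporting plain
// C functions along the lines of:
//   extern "C" __declspec(dllexport) void*     bigstring_create();
//   extern "C" __declspec(dllexport) void      bigstring_append(void* s, const wchar_t* text);
//   extern "C" __declspec(dllexport) long long bigstring_length(void* s);
//   extern "C" __declspec(dllexport) void      bigstring_destroy(void* s);
static class BigStringNative
{
    [DllImport("BigString.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern IntPtr bigstring_create();

    [DllImport("BigString.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Unicode)]
    public static extern void bigstring_append(IntPtr s, string text);

    [DllImport("BigString.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern long bigstring_length(IntPtr s);

    [DllImport("BigString.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern void bigstring_destroy(IntPtr s);
}

The C# side only ever holds an opaque IntPtr, so the string itself never touches the managed heap; the destroy call is what prevents the memory leak mentioned in the question.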
So I ended up creating my own BigStringBuilder class. It's a list where each list element (or page) is a char array (type List<char[]>).
Provided you're using 64-bit Windows, you can now easily surpass the 2-billion-character element limit. I managed to test creating a giant string around 32 gigabytes large (I needed to increase virtual memory in the OS first, otherwise I could only get around 7 GB on my 8 GB RAM PC). I'm sure it handles more than 32 GB easily. In theory, it should be able to handle around 1,000,000,000 * 1,000,000,000 chars, or one quintillion characters, which should be enough for anyone.
Speed-wise, some quick tests show that it's only around 33% slower than a StringBuilder when appending. I got very similar performance if I went for a 2D jagged char array (char[][]) instead of List<char[]>, but Lists are simpler to work with, so I stuck with that.
Hope somebody else finds it useful! There may be bugs, so use with caution. I tested it fairly well though.
// A simplified version specially for StackOverflow
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public class BigStringBuilder
{
    List<char[]> c = new List<char[]>();
    private int pagedepth;
    private long pagesize;
    private long mpagesize; // https://stackoverflow.com/questions/11040646/faster-modulus-in-c-c
    private int currentPage = 0;
    private int currentPosInPage = 0;

    public BigStringBuilder(int pagedepth = 12) { // pagesize is 2^pagedepth (it must be a power of 2 for a fast indexer)
        this.pagedepth = pagedepth;
        pagesize = (long)Math.Pow(2, pagedepth);
        mpagesize = pagesize - 1;
        c.Add(new char[pagesize]);
    }

    // Indexer for this class, so you can use convenient square-bracket indexing to address char elements within the array!!
    public char this[long n] {
        get { return c[(int)(n >> pagedepth)][n & mpagesize]; }
        set { c[(int)(n >> pagedepth)][n & mpagesize] = value; }
    }

    public string[] returnPagesForTestingPurposes() {
        string[] s = new string[currentPage + 1];
        for (int i = 0; i < currentPage + 1; i++) s[i] = new string(c[i]);
        return s;
    }

    public void clear() {
        c = new List<char[]>();
        c.Add(new char[pagesize]);
        currentPage = 0;
        currentPosInPage = 0;
    }

    public void fileOpen(string path)
    {
        clear();
        StreamReader sr = new StreamReader(path);
        int len = 0;
        while ((len = sr.ReadBlock(c[currentPage], 0, (int)pagesize)) != 0) {
            if (!sr.EndOfStream) {
                currentPage++;
                if (currentPage > (c.Count - 1)) c.Add(new char[pagesize]);
            }
            else {
                currentPosInPage = len;
                break;
            }
        }
        sr.Close();
    }

    // See: https://stackoverflow.com/questions/373365/how-do-i-write-out-a-text-file-in-c-sharp-with-a-code-page-other-than-utf-8/373372
    public void fileSave(string path) {
        StreamWriter sw = File.CreateText(path);
        for (int i = 0; i < currentPage; i++) sw.Write(new string(c[i]));
        sw.Write(new string(c[currentPage], 0, currentPosInPage));
        sw.Close();
    }

    public long length() {
        return (long)currentPage * (long)pagesize + (long)currentPosInPage;
    }

    public string ToString(long max = 2000000000) {
        if (length() < max) return substring(0, length());
        else return substring(0, max);
    }

    public string substring(long x, long y) {
        StringBuilder sb = new StringBuilder();
        for (long n = x; n < y; n++) sb.Append(c[(int)(n >> pagedepth)][n & mpagesize]);
        return sb.ToString();
    }

    public bool match(string find, long start = 0) {
        if (start + find.Length > length()) return false; // would read past the end
        for (int i = 0; i < find.Length; i++) if (this[start + i] != find[i]) return false;
        return true;
    }

    public void replace(string s, long pos) {
        for (int i = 0; i < s.Length; i++) {
            c[(int)(pos >> pagedepth)][pos & mpagesize] = s[i];
            pos++;
        }
    }

    public void Append(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            c[currentPage][currentPosInPage] = s[i];
            currentPosInPage++;
            if (currentPosInPage == pagesize)
            {
                currentPosInPage = 0;
                currentPage++;
                if (currentPage == c.Count) c.Add(new char[pagesize]);
            }
        }
    }
}
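A quick usage sketch for anyone trying it (the file path is illustrative):

BigStringBuilder bsb = new BigStringBuilder();
for (int i = 0; i < 1000000; i++) bsb.Append("Hello, world! ");
Console.WriteLine(bsb.length());          // total characters appended
Console.WriteLine(bsb[42]);               // random access via the indexer
Console.WriteLine(bsb.substring(0, 13));  // "Hello, world!"
bsb.fileSave(@"C:\temp\big.txt");         // illustrative path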
I'm working on a UI project which has to handle huge datasets (35 new values every second) which are then displayed in a graph. The user shall be able to change the view from a 10-minute window up to a month view. To achieve this I wrote myself a helper function which truncates a lot of data down to a 600-element array, which is then displayed in a LiveView chart.
I found out that at the beginning the software works very well and fast, but the longer the software runs (e.g. for a month), and as the memory usage rises (to ca. 600 MB), the function gets a lot slower (up to 8x).
So I ran some tests to find the source of this. Quite surprised, I found out that there is something like a magic number where the function gets 2x slower: just changing the loop count from 71494 to 71495 takes the runtime from 19 ms to 39 ms.
I'm really confused. Even when you comment out the second for loop (where the arrays are getting truncated) it is a lot slower.
Maybe this has something to do with the garbage collector? Or does C# compress memory automatically?
Using Visual Studio 2017 with the newest updates.
The Code
using System;
using System.Collections.Generic;
using System.Diagnostics;

namespace TempoaryTest
{
    class ProductNameStream
    {
        public struct FileValue
        {
            public DateTime Time;
            public ushort[] Value;
            public ushort[] Avg1;
            public ushort[] Avg2;
            public ushort[] DAvg;
            public ushort AlarmDelta;
            public ushort AlarmAverage;
            public ushort AlarmSum;
        }
    }

    public static class Program
    {
        private const int MAX_MEASURE_MODEL = 600;
        private const int TEST = 71494;
        //private const int TEST = 71495; //this one doubles the consuming time!

        public static void Main(string[] bleg)
        {
            List<ProductNameStream.FileValue> fileValues = new List<ProductNameStream.FileValue>();
            ProductNameStream.FileValue fil = new ProductNameStream.FileValue();
            DateTime testTime = DateTime.Now;
            Console.WriteLine("TEST: {0} {1:X}", TEST, TEST);

            //Creating example List
            for (int n = 0; n < TEST; n++)
            {
                fil = new ProductNameStream.FileValue
                {
                    Time = testTime = testTime.AddSeconds(1),
                    Value = new ushort[8],
                    Avg1 = new ushort[8],
                    Avg2 = new ushort[8],
                    DAvg = new ushort[8]
                };
                for (int i = 0; i < 8; i++)
                {
                    fil.Value[i] = (ushort)(n + i);
                    fil.Avg1[i] = (ushort)(TEST - n - i);
                    fil.Avg2[i] = (ushort)(n / (i + 1));
                    fil.DAvg[i] = (ushort)(n * (i + 1));
                }
                fil.AlarmDelta = (ushort)DateTime.Now.Ticks;
                fil.AlarmAverage = (ushort)(fil.AlarmDelta / 2);
                fil.AlarmSum = (ushort)(n);
                fileValues.Add(fil);
            }

            var sw = Stopwatch.StartNew();

            /* May look like the same as MAX_MEASURE_MODEL but since we use int
             * as counter we must be aware of the int round down.*/
            int cnt = (fileValues.Count / (fileValues.Count / MAX_MEASURE_MODEL)) + 1;
            ProductNameStream.FileValue[] newFileValues = new ProductNameStream.FileValue[cnt];
            ProductNameStream.FileValue[] fileValuesArray = fileValues.ToArray();

            //Truncate the big list to a 600 Array
            for (int n = 0; n < fileValues.Count; n++)
            {
                if ((n % (fileValues.Count / MAX_MEASURE_MODEL)) == 0)
                {
                    cnt = n / (fileValues.Count / MAX_MEASURE_MODEL);
                    newFileValues[cnt] = fileValuesArray[n];
                    newFileValues[cnt].Value = new ushort[8];
                    newFileValues[cnt].Avg1 = new ushort[8];
                    newFileValues[cnt].Avg2 = new ushort[8];
                    newFileValues[cnt].DAvg = new ushort[8];
                }
                else
                {
                    for (int i = 0; i < 8; i++)
                    {
                        if (newFileValues[cnt].Value[i] < fileValuesArray[n].Value[i])
                            newFileValues[cnt].Value[i] = fileValuesArray[n].Value[i];
                        if (newFileValues[cnt].Avg1[i] < fileValuesArray[n].Avg1[i])
                            newFileValues[cnt].Avg1[i] = fileValuesArray[n].Avg1[i];
                        if (newFileValues[cnt].Avg2[i] < fileValuesArray[n].Avg2[i])
                            newFileValues[cnt].Avg2[i] = fileValuesArray[n].Avg2[i];
                        if (newFileValues[cnt].DAvg[i] < fileValuesArray[n].DAvg[i])
                            newFileValues[cnt].DAvg[i] = fileValuesArray[n].DAvg[i];
                    }
                    if (newFileValues[cnt].AlarmSum < fileValuesArray[n].AlarmSum)
                        newFileValues[cnt].AlarmSum = fileValuesArray[n].AlarmSum;
                    if (newFileValues[cnt].AlarmDelta < fileValuesArray[n].AlarmDelta)
                        newFileValues[cnt].AlarmDelta = fileValuesArray[n].AlarmDelta;
                    if (newFileValues[cnt].AlarmAverage < fileValuesArray[n].AlarmAverage)
                        newFileValues[cnt].AlarmAverage = fileValuesArray[n].AlarmAverage;
                }
            }

            Console.WriteLine(sw.ElapsedMilliseconds);
        }
    }
}
This is most likely being caused by the garbage collector, as you suggested.
I can offer two pieces of evidence to indicate that this is so:
If you put GC.Collect() just before you start the stopwatch, the difference in times goes away.
If you instead change the initialisation of the list to new List<ProductNameStream.FileValue>(TEST);, the difference in time also goes away.
(Initialising the list's capacity to the final size in its constructor prevents multiple reallocations of its internal array while items are being added to it, which will reduce pressure on the garbage collector.)
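In code, the two changes amount to something like this (a sketch against the code in the question):

// Change 1: collect before timing, so garbage from building the list
// is not collected inside the measured region.
GC.Collect();
var sw = Stopwatch.StartNew();

// Change 2 (alternative): pre-size the list so its internal array
// never has to be reallocated while the items are added.
List<ProductNameStream.FileValue> fileValues =
    new List<ProductNameStream.FileValue>(TEST);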
Therefore, I assert based on this evidence that it is indeed the garbage collector that is impacting your timings.
Incidentally, the threshold value was slightly different for me, and for at least one other person too (which isn't surprising if the timing differences are being caused by the garbage collector).
Your data structure is inefficient and is forcing you to do a lot of allocations during the computation. Have a look at fixed-size buffers inside a struct (the fixed keyword). Also, preallocate the list; don't rely on the list constantly adjusting its size, which also creates garbage.
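As a hedged sketch of the fixed-size-buffer idea (this changes FileValue's layout, so all code touching those fields would need an unsafe context and the /unsafe compiler switch):

// Embedding the eight ushorts directly in the struct removes four
// heap-allocated arrays per FileValue, so building 71k of them
// produces far less garbage.
public unsafe struct FileValue
{
    public DateTime Time;
    public fixed ushort Value[8];
    public fixed ushort Avg1[8];
    public fixed ushort Avg2[8];
    public fixed ushort DAvg[8];
    public ushort AlarmDelta;
    public ushort AlarmAverage;
    public ushort AlarmSum;
}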
In my program I need to write large text files (~300 MB). The text files contain numbers separated by spaces. I'm using this code:
TextWriter guessesWriter = TextWriter.Synchronized(new StreamWriter("guesses.txt"));

private void QueueStart()
{
    while (true)
    {
        if (writeQueue.Count > 0)
        {
            guessesWriter.WriteLine(writeQueue[0]);
            writeQueue.Remove(writeQueue[0]);
        }
    }
}

private static void Check()
{
    TextReader tr = new StreamReader("data.txt");
    string guess = tr.ReadLine();
    b = 0;
    List<Thread> threads = new List<Thread>();
    while (guess != null) // Reading each row and analyzing it
    {
        string[] guessNumbers = guess.Split(' ');
        List<int> numbers = new List<int>();
        foreach (string s in guessNumbers) // Converting each guess to a list of numbers
            numbers.Add(int.Parse(s));
        threads.Add(new Thread(GuessCheck));
        threads[b].Start(numbers);
        b++;
        guess = tr.ReadLine();
    }
}

private static void GuessCheck(object listNums)
{
    List<int> numbers = (List<int>)listNums;
    if (!CloseNumbersCheck(numbers))
    {
        writeQueue.Add(numbers[0] + " " + numbers[1] + " " + numbers[2] + " " + numbers[3] + " " + numbers[4] + " " + numbers[5] + " " + numbers[6]);
    }
}

private static bool CloseNumbersCheck(List<int> numbers)
{
    int divideResult = numbers[0] / 10;
    for (int i = 1; i < 6; i++)
    {
        if (numbers[i] / 10 != divideResult)
            return false;
    }
    return true;
}
The file data.txt contains data in this format (dots mean more numbers following the same logic):
1 2 3 4 5 6 1
1 2 3 4 5 6 2
1 2 3 4 5 6 3
.
.
.
1 2 3 4 5 6 8
1 2 3 4 5 7 1
.
.
.
I know this is not very efficient and I was looking for some advice on how to make it quicker.
If you happen to know how to save a LARGE amount of numbers more efficiently than in a .txt file, I would appreciate it.
One way to improve efficiency is with a larger buffer on your output stream. You are using the defaults, which probably give you a 1 KB buffer, but you won't see maximum performance with less than a 64 KB buffer. Open your file like this:
new StreamWriter("guesses.txt", false, new UTF8Encoding(false, true), 65536)
Instead of reading and writing line by line (ReadLine and WriteLine), you should read and write big blocks of data (ReadBlock and Write). This way you will access the disk a lot less and get a big performance boost. But you will need to manage the end of each line yourself (look at Environment.NewLine).
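A hedged sketch of what block-based reading can look like (the 64 KB buffer is an arbitrary choice; carrying partial lines across buffer boundaries is the part you must handle yourself):

using (StreamReader reader = new StreamReader("data.txt"))
{
    char[] buffer = new char[64 * 1024];
    string carry = "";  // partial line left over from the previous block
    int read;
    while ((read = reader.ReadBlock(buffer, 0, buffer.Length)) > 0)
    {
        string chunk = carry + new string(buffer, 0, read);
        string[] lines = chunk.Split(new[] { Environment.NewLine }, StringSplitOptions.None);
        for (int i = 0; i < lines.Length - 1; i++)
        {
            // process lines[i] (a complete row of numbers)
        }
        carry = lines[lines.Length - 1]; // may be an incomplete row
    }
    if (carry.Length > 0)
    {
        // process the final row
    }
}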
The efficiency could be improved by using a BinaryWriter. Then you could just write out the integers directly. This would let you skip the parsing step on the read and the ToString conversion on the write.
It also looks like you are creating a bunch of threads in there. Additional threads will slow down your performance. You should do all of the work on a single thread, since threads are very heavyweight objects.
Here is a more-or-less direct conversion of your code to use a BinaryWriter. (This does not address the thread problem.)
BinaryWriter guessesWriter = new BinaryWriter(File.Open("guesses.dat", FileMode.Create));

private void QueueStart()
{
    while (true)
    {
        if (writeQueue.Count > 0)
        {
            lock (guessesWriter)
            {
                guessesWriter.Write(writeQueue[0]);
            }
            writeQueue.Remove(writeQueue[0]);
        }
    }
}

private const int numbersPerThread = 7; // each row in the sample data holds 7 numbers

private static void Check()
{
    // Note: this assumes the input file was also written with a BinaryWriter
    // (raw Int32s), not the original space-separated text format.
    BinaryReader tr = new BinaryReader(File.Open("data.dat", FileMode.Open));
    b = 0;
    List<Thread> threads = new List<Thread>();
    while (tr.BaseStream.Position < tr.BaseStream.Length)
    {
        List<int> numbers = new List<int>(numbersPerThread);
        for (int index = 0; index < numbersPerThread; index++)
        {
            numbers.Add(tr.ReadInt32());
        }
        threads.Add(new Thread(GuessCheck));
        threads[b].Start(numbers);
        b++;
    }
}
Try using a buffer in between. There is a BufferedStream. Right now you use very inefficient disk access patterns.
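For example (a minimal sketch; the 64 KB buffer size is just a common choice, not a magic number):

using System.IO;

// Wrap the raw FileStream in a BufferedStream so that many small writes
// are coalesced into a few large disk operations.
Stream buffered = new BufferedStream(
    new FileStream("guesses.txt", FileMode.Create, FileAccess.Write),
    64 * 1024);
TextWriter guessesWriter = TextWriter.Synchronized(new StreamWriter(buffered));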