I have a problem with updating a database that contains my own entities:
class WordEntity
{
public int ID { get; set; }
public string word { get; set; }
public int frequency { get; set; }
public override string ToString()
{
return word;
}
}
I have already filled it with words from a txt file and counted the number of appearances of each word.
Now I need to add more words from another txt file and count their appearances as well. The problem is writing a LINQ statement that updates the frequencies of the existing words and adds the new ones.
I used this, but EF throws an exception related to .Concat:
var t = context.Words
.Concat(tempList)
.GroupBy(w => w.word)
.Select(w => new WordEntity() {word = w.Key, frequency = w.Sum(z => z.frequency)});
tempList is a List<WordEntity> of new words from the new txt file.
Please help.
There are different strategies you could use, but in essence you need to check, for each word in the new file, whether the word is already in the database. If it isn't, you need to add it (with frequency 1). Otherwise, you need to increase the frequency of the existing word. One solution is something like this:
using var context = new MyContext();
foreach (var word in ReadWordsFromFile(filename))
{
var existingWord = await context.Words.SingleOrDefaultAsync(w => w.Word == word);
if (existingWord is null)
{
context.Add(new WordEntity { Word = word, Frequency = 1 });
}
else
{
existingWord.Frequency++;
}
}
await context.SaveChangesAsync();
You can also (as you were doing) try to read all entities from the database at once and do the whole operation in memory:
var existingWords = await context.Words.ToDictionaryAsync(w => w.Word);
foreach (var word in ReadWordsFromFile(filename))
{
if (existingWords.ContainsKey(word))
existingWords[word].Frequency++;
else
{
var wordEntity = new WordEntity { Word = word, Frequency = 1 };
context.Add(wordEntity);
existingWords[word] = wordEntity;
}
}
As with the first approach, you still need to call await context.SaveChangesAsync(); after the loop to persist the changes. This approach may be faster (as everything is done in memory) but could become problematic as the database grows, since you will need more and more memory to fetch all the data from the database. The first solution only fetches the words from the database that are actually required.
Although Jeroen's answer will work, it is not very efficient. Suppose the word "The" appears 10,000 times in the file; then he will fetch it from the database 10,000 times and add +1 each time.
Wouldn't it be better to first count that the word appears 10,000 times, and then add or update the frequency by 10,000 in one go?
You could do this with the following:
IEnumerable<string> newWords = ReadWordsFromFile(...);
var newWordFrequencies = newWords.GroupBy(word => word,
// parameter resultSelector: from every key (which is a word) and all occurrences
// of this word, make one new anonymous object:
(key, wordsEqualToThisKey) => new
{
Word = key,
Count = wordsEqualToThisKey.Count(),
});
foreach (var newWord in newWordFrequencies)
{
// fetch the frequency from the database. if it exists: add count
var fetchedExistingWord = dbContext.Words
.Where(existingWord => existingWord.Word == newWord.Word)
.FirstOrDefault();
if (fetchedExistingWord != null)
{
fetchedExistingWord.Frequency += newWord.Count;
}
else
{
// new Word is not in the database yet; add it
dbContext.Words.Add(new WordEntity
{
Word = newWord.Word,
Frequency = newWord.Count,
});
}
}
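Finally, call dbContext.SaveChanges(); once the loop completes, so that both the updated frequencies and the newly added words are written to the database.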
I am trying to achieve the following:
Get the data from the SQL DB.
Pass the data to the PerformStuff method, which calls the third-party method MethodforResponse (it checks the input and provides a response).
Save the response (XML) back to the SQL DB.
Below is the sample code. Performance-wise it's not good; if there are 1,000,000 records in the DB it's very slow.
Is there a better way of doing it? Any ideas or hints to make it better?
Please help.
using thirdpartylib;
public class Program
{
static void Main(string[] args)
{
var response = PerformStuff();
Save(response);
}
public class TestRequest
{
public int col1 { get; set; }
public bool col2 { get; set; }
public string col3 { get; set; }
public bool col4 { get; set; }
public string col5 { get; set; }
public bool col6 { get; set; }
public string col7 { get; set; }
}
public class TestResponse
{
public int col1 { get; set; }
public string col2 { get; set; }
public string col3 { get; set; }
public int col4 { get; set; }
}
public static TestRequest GetDataId(int id)
{
TestRequest testReq = null;
try
{
SqlCommand cmd = DB.GetSqlCommand("proc_name");
cmd.AddInSqlParam("@Id", SqlDbType.Int, id);
SqlDataReader dr = DB.GetDataReader(cmd);
while (dr.Read())
{
testReq = new TestRequest();
testReq.col1 = dr.GetInt32("col1");
testReq.col2 = dr.GetBoolean("col2");
testReq.col3 = dr.GetString("col3");
testReq.col4 = dr.GetBoolean("col4");
testReq.col5 = dr.GetString("col5");
testReq.col6 = dr.GetBoolean("col6");
testReq.col7 = dr.GetString("col7");
}
dr.Close();
}
catch (Exception ex)
{
throw;
}
return testReq;
}
public static TestResponse PerformStuff()
{
var response = new TestResponse();
//give ids in list
var ids = thirdpartylib.Methodforid();
TestRequest request = null;
foreach (int id in ids)
{
request = GetDataId(id);
var output = thirdpartylib.MethodforResponse(request);
foreach (var data in output.Elements())
{
response.col4 = Convert.ToInt32(data.Id().Class());
response.col2 = data.Id().Name().ToString();
}
}
//request details
response.col1 = request.col1;
response.col2 = request.col2;
response.col3 = request.col3;
return response;
}
public static void Save(TestResponse response)
{
var Sb = new StringBuilder();
try
{
Sb.Append("<ROOT>");
Sb.Append("<id");
Sb.Append(" col1='" + response.col1 + "'");
Sb.Append(" col2='" + response.col2 + "'");
Sb.Append(" col3='" + response.col3 + "'");
Sb.Append(" col4='" + response.col4 + "'");
Sb.Append("></Id>");
Sb.Append("</ROOT>");
var cmd = DB.GetSqlCommand("saveproc");
cmd.AddInSqlParam("@Data", SqlDbType.VarChar, Sb.ToString());
DB.ExecuteNoQuery(cmd);
}
catch (Exception ex)
{
throw;
}
}
}
Thanks!
I think the root of your problem is that you get and insert data record-by-record. There is no possible way to optimize it. You need to change the approach in general.
You should think of a solution that:
1. Gets all the data in one command to the database.
2. Process it.
3. Save it back to the database in one command, using a technique like BULK INSERT. Please be aware that BULK INSERT has certain limitations, so read the documentation carefully (a sketch using SqlBulkCopy follows below).
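For illustration, the usual way to do a bulk write from C# is SqlBulkCopy from System.Data.SqlClient. This is only a minimal sketch: the table name dbo.ResponseTable, its columns, and the processedResponses collection are assumptions for illustration, not taken from your code.
// Build one DataTable in memory, then push it to SQL Server in a single bulk operation.
// Requires: using System.Data; using System.Data.SqlClient;
var table = new DataTable();
table.Columns.Add("col1", typeof(int));
table.Columns.Add("col2", typeof(string));
table.Columns.Add("col3", typeof(string));
table.Columns.Add("col4", typeof(int));
foreach (var r in processedResponses) // processedResponses: your already-processed results
{
    table.Rows.Add(r.col1, r.col2, r.col3, r.col4);
}
using (var connection = new SqlConnection(connectionString))
{
    connection.Open();
    using (var bulkCopy = new SqlBulkCopy(connection))
    {
        bulkCopy.DestinationTableName = "dbo.ResponseTable"; // assumed table name
        bulkCopy.WriteToServer(table);
    }
}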
Your question is very broad, and the method PerformStuff() will be fundamentally slow because it pays O(n) * db_lookup_time before another iteration of the output can start. So, to me it seems you're going about this problem the wrong way.
Database query languages are made to optimize data traversal, so iterating by id and then checking values works against this and produces the slowest possible lookup time.
Instead, leverage SQL's powerful query language and use clauses like where id < 10 and value > 100, because you ultimately want to limit the size of the data set that needs to be processed by C# (as sketched after the list below).
So:
Read just the smallest amount of data you need from the DB.
Process this data as a unit; parallelism might help.
Write back the modifications in one DB connection.
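As a rough sketch of that idea (the table and column names here are made up for illustration, not taken from your schema), the filtering happens inside SQL Server and only the rows you actually need cross the wire:
// Let the database do the filtering instead of looping over ids in C#.
using (var connection = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    "SELECT col1, col2, col3 FROM dbo.Requests WHERE id < @maxId AND value > @minValue",
    connection))
{
    cmd.Parameters.AddWithValue("@maxId", 10);
    cmd.Parameters.AddWithValue("@minValue", 100);
    connection.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // process one row at a time; the result set is already as small as possible
        }
    }
}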
Hope this sets you in the right direction.
Based on your comment, there are multiple things you can enhance in your solution, from memory consumption to CPU usage.
Take advantage of paging at the database level. Do not fetch all records at once; to avoid memory leaks and/or high memory consumption in the case of 1+ million records, take them chunk by chunk and do whatever you need to do with each chunk.
Since you don't need to save the XML into the database, you can choose to save the response to a file instead. Saving the XML to a file gives you the opportunity to stream the data to your local disk.
Instead of assembling the XML yourself, use XmlSerializer to do that job for you. XmlSerializer works nicely with XmlWriter, which in turn can write to any stream, including a FileStream (see the sketch at the end of this answer). There is a thread about it which you can take as an example.
To conclude, the PerformStuff method won't only be faster, it will require far fewer resources (memory, CPU), and most importantly, you'll easily be able to constrain the resource consumption of your program (by changing the size of the database page).
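A minimal sketch of that last point, assuming the TestResponse type from the question and an arbitrary output path:
// Serialize the response straight to disk instead of concatenating strings.
// Requires: using System.IO; using System.Xml; using System.Xml.Serialization;
var serializer = new XmlSerializer(typeof(TestResponse));
using (var stream = new FileStream(@"C:\temp\response.xml", FileMode.Create))
using (var writer = XmlWriter.Create(stream))
{
    serializer.Serialize(writer, response);
}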
Observation: your requirement looks like it matches the map / reduce pattern.
If the values in your ids collection returned by thirdpartylib.Methodforid() are reasonably dense, and the table behind your proc_name stored procedure has close to the same number of rows as the ids collection has items, you can retrieve all the records you need with a single SQL query (and a many-row result set) rather than retrieving them one by one. That might look something like this:
public static TestResponse PerformStuff()
{
var response = new TestResponse();
var idHash = new HashSet<int> (thirdpartylib.Methodforid());
SqlCommand cmd = DB.GetSqlCommand("proc_name_for_all_ids");
using (SqlDataReader dr = DB.GetDataReader(cmd)) {
while (dr.Read()) {
var id = dr.GetInt32("id");
if (idHash.Contains(id)) {
var testReq = new TestRequest();
testReq.col1 = dr.GetInt32("col1");
testReq.col2 = dr.GetBoolean("col2");
testReq.col3 = dr.GetString("col3");
testReq.col4 = dr.GetBoolean("col4");
testReq.col5 = dr.GetString("col5");
testReq.col6 = dr.GetBoolean("col6");
testReq.col7 = dr.GetString("col7");
var output = thirdpartylib.MethodforResponse(testReq);
foreach (var data in output.Elements()) {
response.col4 = Convert.ToInt32(data.Id().Class());
response.col2 = data.Id().Name().ToString();
}
} /* end if hash.Contains(id) */
} /* end while dr.Read() */
} /* end using() */
return response;
}
Why might this be faster? It makes many fewer database queries, and instead streams in the multiple rows of data to process. This will be far more efficient than your example.
Why might it not work?
If the id values must be processed in the same order produced by thirdpartylib.Methodforid(), it won't work.
If there's no way to retrieve all the rows, that is, no proc_name_for_all_ids stored procedure is available, you won't be able to stream the rows.
The following code works perfectly fine on a small data set. However, GetMatchCount and BuildMatchArray are very sluggish on large result sets. Can anyone recommend a different approach to save processing time? Would it be better to write the array to a file? Are lists just generally slow and not the best option?
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
public class Client
{
public int Id;
public string FirstName
{
get
{
var firstName = //<call to get from database via Id>
return firstName;
}
}
public string MiddleName
{
get
{
var middleName = //<call to get from database via Id>
return middleName;
}
}
public string LastName
{
get
{
var lastName = //<call to get from database via Id>
return lastName;
}
}
public string FullName
{
get
{
return FirstName + " " + MiddleName + " " + LastName;
}
}
public int GetMatchCount(IEnumerable<string> clientFirstNames, IEnumerable<string> clientMiddleNames, IEnumerable<string> clientLastNames)
{
var clientFullNames = BuildMatchArray(clientFirstNames, clientMiddleNames, clientLastNames);
return clientFullNames.Count(x => x == FullName);
}
public string[] BuildMatchArray(IEnumerable<string> clientFirstNames, IEnumerable<string> clientMiddleNames, IEnumerable<string> clientLastNames)
{
Debug.Assert(clientFirstNames.Count() == clientMiddleNames.Count() && clientMiddleNames.Count() == clientLastNames.Count());
var clientFullNames = new List<string>();
for (int i = 0; i < clientFirstNames.Count(); i++)
{
clientFullNames.Add(clientFirstNames.ElementAt(i) + " " + clientMiddleNames.ElementAt(i) + " " + clientLastNames.ElementAt(i));
}
return clientFullNames.ToArray();
}
}
Where are you getting these strings? If you are using lazy sequences, every time you call Count() you will have to iterate the entire sequence to count how many objects are in the sequence. If the IEnumerable<T> is really a T[] or List<T>, then Count() is optimized to just call the Length or Count property, which isn't expensive. Similarly, ElementAt is also very inefficient and iterates the collection. So with an in-memory lazy sequence this performance will be bad, but if you are streaming results from SQL or an external source, it will be really bad or possibly even incorrect.
A more performant implementation of BuildMatchArray would be something like this:
public IEnumerable<string> ZipNames(IEnumerable<string> firsts,
IEnumerable<string> middles, IEnumerable<string> lasts)
{
using(var e1 = firsts.GetEnumerator())
using(var e2 = middles.GetEnumerator())
using(var e3 = lasts.GetEnumerator())
{
var stop = false;
while (!stop)
{
var hasNext1 = e1.MoveNext();
var hasNext2 = e2.MoveNext();
var hasNext3 = e3.MoveNext();
if (hasNext1 && hasNext2 && hasNext3)
{
yield return $"{e1.Current} {e2.Current} {e3.Current}";
}
else
{
stop = true;
Debug.Assert(!(hasNext1 || hasNext2 || hasNext3));
}
}
}
}
This requires only one iteration of each input collection, and doesn't need to copy elements to a new List<T>. Another point to note is that List<T> starts with capacity for 4 elements, and when it fills up, it copies all elements to a new list with double the capacity. So if you have a large sequence, you will copy many times.
This implementation is very similar to System.Linq.Enumerable.Zip
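For reference, if you don't need the length-mismatch assertion, chaining the built-in Zip gives essentially the same streaming behaviour (a sketch using the parameter names from the question):
// Combine three sequences lazily, without intermediate lists or ElementAt calls.
var clientFullNames = clientFirstNames
    .Zip(clientMiddleNames, (first, middle) => first + " " + middle)
    .Zip(clientLastNames, (firstMiddle, last) => firstMiddle + " " + last);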
In your case, you also shouldn't do a ToArray to your sequence. This will require another copying, and can potentially be a huge array. If you are only sending that array to .Count(x => x == y), then keeping a lazy IEnumerable would be better, because Count operates lazily for lazy sequences and streams data in and counts elements as it sees them, without ever requiring the full collection to be in memory.
See IEnumerable vs List - What to Use? How do they work?
I have the following piece of code that I use to check whether copying data from one table to another missed some records.
There are reasons why this can happen, but I won't go into the details here.
Now, fortunately, this code runs against a few hundred records at a time, so I can allow myself to load them into memory and use LINQ to Objects.
As I expected, my code is very slow, and I'm wondering if anyone could suggest a way to improve the speed.
void Main()
{
var crossed_data = from kv in key_and_value_table
from ckv in copy_of_key_and_value_table
where kv.key != ckv.key
select new { kv, ckv };
List<Key_and_value> difference = new List<Key_and_value>();
foreach (var v in crossed_data)
{
if (crossed_data.Select(s => s.kv.Key).ToList().
Contains(v.ckv.Key) == false)
{
difference.Add(v.ckv);
}
}
}
public class Key_and_value
{
public string Key { get; set; }
public decimal Value { get; set; }
}
many thanks in advance
B
You are doing your Select every iteration when you do not need to. You can move it to the external scope like so.
var keys = crossed_data.Select(s => s.kv.Key).ToList();
foreach (var v in crossed_data)
{
if (keys.Contains(v.ckv.Key) == false)
{
difference.Add(v.ckv);
}
}
This should improve the speed a fair bit.
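If it is still not fast enough, a further optional tweak is to put the keys into a HashSet<string>, so each lookup is O(1) on average instead of scanning the whole list:
// Build the lookup once; HashSet<string>.Contains avoids a linear scan per check.
var keys = new HashSet<string>(crossed_data.Select(s => s.kv.Key));
foreach (var v in crossed_data)
{
    if (!keys.Contains(v.ckv.Key))
    {
        difference.Add(v.ckv);
    }
}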
I have a class as follows :
public class Test
{
public int Id {get;set;}
public string Name { get; set; }
public string CreatedDate {get;set;}
public string DueDate { get; set; }
public string ReferenceNo { get; set; }
public string Parent { get; set; }
}
and I have a list of Test objects
List<Test> testobjs = new List<Test>();
Now I would like to convert it into CSV in the following format:
"1,John Grisham,9/5/2014,9/5/2014,1356,0\n2,Stephen King,9/3/2014,9/9/2014,1367,0\n3,The Rainmaker,4/9/2014,18/9/2014,1";
I searched for "Converting list to csv c#" and I got solutions as follows:
string.Join(",", list.Select(n => n.ToString()).ToArray())
But this will not put the \n where needed, i.e. after each object.
Is there any faster way, other than string building, to do this? Please help...
Use ServiceStack.Text
Install-Package ServiceStack.Text
and then use the string extension methods ToCsv(T)/FromCsv()
Examples:
https://github.com/ServiceStack/ServiceStack.Text
Update:
ServiceStack.Text is now also free in v4, which used to be commercial. No need to specify the version anymore! Happy serializing!
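If I remember the API correctly, usage then looks roughly like this (verify the exact namespace and extension names against the ServiceStack.Text docs):
using ServiceStack.Text; // provides the ToCsv()/FromCsv() extension methods

var csv = testobjs.ToCsv();                    // serialize the list to a CSV string
var roundTripped = csv.FromCsv<List<Test>>();  // ...and back into a list again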
Because speed was mentioned in the question, my interest was piqued on just what the relative performances might be, and just how fast I could get it.
I know that StringBuilder was excluded, but it still felt like probably the fastest, and StreamWriter has of course the advantage of writing to either a MemoryStream or directly to a file, which makes it versatile.
So I knocked up a quick test.
I built a list of half a million objects identical to yours.
Then I serialized with CsvSerializer, and with two hand-rolled tight versions, one using a StreamWriter to a MemoryStream and the other using a StringBuilder.
The hand-rolled code was written to cope with quotes but nothing more sophisticated. The code was pretty tight, with the minimum of intermediate strings I could manage and no concatenation... but it is not production code and certainly earns no points for style or flexibility.
But the output was identical in all three methods.
The timings were interesting:
Serializing half a million objects, five runs with each method, all times to the nearest whole ms:
StringBuilder 703 734 828 671 718 Avge= 730.8
MemoryStream 812 937 874 890 906 Avge= 883.8
CsvSerializer 1,734 1,469 1,719 1,593 1,578 Avge= 1,618.6
This was on a high end i7 with plenty of RAM.
Other things being equal, I would always use the library.
But if a 2:1 performance difference became critical, or if RAM or other issues turned out to exaggerate the difference on a larger dataset, or if the data were arriving in chunks and was to be sent straight to disk, I might just be tempted...
Just in case anyone's interested, the core of the code (for the StringBuilder version) was
private void writeProperty(StringBuilder sb, string value, bool first, bool last)
{
if (! value.Contains('\"'))
{
if (!first)
sb.Append(',');
sb.Append(value);
if (last)
sb.AppendLine();
}
else
{
if (!first)
sb.Append(",\"");
else
sb.Append('\"');
sb.Append(value.Replace("\"", "\"\""));
if (last)
sb.AppendLine("\"");
else
sb.Append('\"');
}
}
private void writeItem(StringBuilder sb, Test item)
{
writeProperty(sb, item.Id.ToString(), true, false);
writeProperty(sb, item.Name, false, false);
writeProperty(sb, item.CreatedDate, false, false);
writeProperty(sb, item.DueDate, false, false);
writeProperty(sb, item.ReferenceNo, false, false);
writeProperty(sb, item.Parent, false, true);
}
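The driver around it was essentially just this (a sketch; testList stands in for the half-million-item list):
// One StringBuilder, one pass over the list, then a single ToString() at the end.
var sb = new StringBuilder();
foreach (var item in testList)
{
    writeItem(sb, item);
}
string csv = sb.ToString();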
If you don't want to load a library, you can create the following method:
private void SaveToCsv<T>(List<T> reportData, string path)
{
var lines = new List<string>();
// Use the public properties of T as the CSV columns.
IEnumerable<PropertyDescriptor> props = TypeDescriptor.GetProperties(typeof(T)).OfType<PropertyDescriptor>();
var header = string.Join(",", props.ToList().Select(x => x.Name));
lines.Add(header);
// One line per object, with the values in the same order as the header.
var valueLines = reportData.Select(row => string.Join(",", header.Split(',').Select(a => row.GetType().GetProperty(a).GetValue(row, null))));
lines.AddRange(valueLines);
File.WriteAllLines(path, lines.ToArray());
}
and then call the method:
SaveToCsv(testobjs, "C:/PathYouLike/FileYouLike.csv")
Your best option would be to use an existing library. It saves you the hassle of figuring it out yourself and it will probably deal with escaping special characters, adding header lines etc.
You could use the CsvSerializer from ServiceStack, but there are several others on NuGet.
Creating the CSV will then be as easy as string csv = CsvSerializer.SerializeToCsv(testobjs);
You could use the FileHelpers library to convert a List of objects to CSV.
Consider the given object, add the DelimitedRecord Attribute to it.
[DelimitedRecord(",")]
public class Test
{
public int Id {get;set;}
public string Name { get; set; }
public string CreatedDate {get;set;}
public string DueDate { get; set; }
public string ReferenceNo { get; set; }
public string Parent { get; set; }
}
Once the List is populated (as per the question, it is testobjs):
var engine = new FileHelperEngine<Test>();
engine.HeaderText = engine.GetFileHeader();
string dirPath = Environment.GetFolderPath(Environment.SpecialFolder.ApplicationData) + "\\" + ConfigurationManager.AppSettings["MyPath"];
if (!Directory.Exists(dirPath))
{
Directory.CreateDirectory(dirPath);
}
//File location, where the .csv goes and gets stored.
string filePath = Path.Combine(dirPath, "MyTestFile_" + ".csv");
engine.WriteFile(filePath, testobjs);
This will just do the job for you. I'd been using this to generate data reports for a while until I switched to Python.
PS: Too late to answer but hope this helps somebody.
Use Cinchoo ETL
Install-Package ChoETL
or
Install-Package ChoETL.NETStandard
The sample below shows how to use it:
List<Test> list = new List<Test>();
list.Add(new Test { Id = 1, Name = "Tom" });
list.Add(new Test { Id = 2, Name = "Mark" });
using (var w = new ChoCSVWriter<Test>(Console.Out)
.WithFirstLineHeader()
)
{
w.Write(list);
}
Output CSV:
Id,Name,CreatedDate,DueDate,ReferenceNo,Parent
1,Tom,,,,
2,Mark,,,,
For more information, go to github
https://github.com/Cinchoo/ChoETL
Sample fiddle: https://dotnetfiddle.net/M7v7Hi
LINQtoCSV is the fastest and lightest I've found and is available on GitHub. It lets you specify options via property attributes.
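From memory, basic usage is along these lines; the class and property names used here (CsvContext, CsvFileDescription, SeparatorChar, FirstLineHasColumnNames) should be checked against the project's documentation:
// Write the list with LINQtoCSV's CsvContext.
var fileDescription = new CsvFileDescription
{
    SeparatorChar = ',',
    FirstLineHasColumnNames = true
};
var csvContext = new CsvContext();
csvContext.Write(testobjs, @"C:\temp\testobjs.csv", fileDescription);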
Necromancing this one a bit; I ran into the exact same scenario as above and went down the road of using FastMember so we didn't have to adjust the code every time a property was added to the class:
[HttpGet]
public FileResult GetCSVOfList()
{
// Get your list
IEnumerable<MyObject> myObjects =_service.GetMyObject();
//Get the type properties
var myObjectType = TypeAccessor.Create(typeof(MyObject));
var myObjectProperties = myObjectType.GetMembers().Select(x => x.Name);
//Set the first row as your property names
var csvFile = string.Join(',', myObjectProperties);
foreach(var myObject in myObjects)
{
// Use ObjectAccessor in order to maintain column parity
var currentMyObject = ObjectAccessor.Create(myObject);
var csvRow = Environment.NewLine;
foreach (var myObjectProperty in myObjectProperties)
{
csvRow += $"{currentMyObject[myObjectProperty]},";
}
csvRow = csvRow.TrimEnd(',');
csvFile += csvRow;
}
return File(Encoding.ASCII.GetBytes(csvFile), "text/csv", "MyObjects.csv");
}
Should yield a CSV with the first row being the names of the fields, and rows following. Now... to read in a csv and create it back into a list of objects...
Note: example is in ASP.NET Core MVC, but should be very similar to .NET framework. Also had considered ServiceStack.Text but the license was not easy to follow.
For the best solution, you can read this article: Convert List of Object to CSV File C# - Codingvila
using Codingvila.Models;
using System;
using System.Collections.Generic;
using System.ComponentModel.DataAnnotations;
using System.Linq;
using System.Text;
using System.Web;
using System.Web.Mvc;
namespace Codingvila.Controllers
{
public class HomeController : Controller
{
public ActionResult Index()
{
CodingvilaEntities entities = new CodingvilaEntities();
var lstStudents = (from Student in entities.Students
select Student);
return View(lstStudents);
}
[HttpPost]
public FileResult ExportToCSV()
{
#region Get list of Students from Database
CodingvilaEntities entities = new CodingvilaEntities();
List<object> lstStudents = (from Student in entities.Students.ToList()
select new[] { Student.RollNo.ToString(),
Student.EnrollmentNo,
Student.Name,
Student.Branch,
Student.University
}).ToList<object>();
#endregion
#region Create Name of Columns
var names = typeof(Student).GetProperties()
.Select(property => property.Name)
.ToArray();
lstStudents.Insert(0, names.Where(x => x != names[0]).ToArray());
#endregion
#region Generate CSV
StringBuilder sb = new StringBuilder();
foreach (var item in lstStudents)
{
string[] arrStudents = (string[])item;
foreach (var data in arrStudents)
{
//Append data with comma(,) separator.
sb.Append(data + ',');
}
//Append new line character.
sb.Append("\r\n");
}
#endregion
#region Download CSV
return File(Encoding.ASCII.GetBytes(sb.ToString()), "text/csv", "Students.csv");
#endregion
}
}
}