Merge 2 TSV files in C# code cleanup

I'm given 2 Excel files that I convert to TSV files, and in the end I have to deliver a single TSV file. The first file is the main file (strWorksheetPath), and all of its lines have to be included. The second file (strPrintPath) has additional information, but not every line in the main file has extra information. To do this in C# I followed this MSDN guide, and it works fine. Unfortunately, file 1 has 23 columns and file 2 has 10, adding up to 33 columns and so 33 properties in total. I created some temp classes to see if everything works, but it looks very messy in my opinion.
Is there a way to clean up my code and make it tidier, for example by not having to create temp classes or by condensing some of the repetitive code?
public static void ConvertTSVtoMontDataTable(string strWorksheetPath, string strPrintPath,
bool closeConnection = true)
{
// Check if the main file exist.
if (!File.Exists(strWorksheetPath)) return;
// Load both files.
var mainFile = File.ReadAllLines(strWorksheetPath);
var extraFile = File.ReadAllLines(strPrintPath);
// Create 2 lists.
var mainLines = mainFile.Select(line => new TempMainLine(line)).ToList();
var extraLines = extraFile.Select(line => new TempExtraLine(line)).ToList();
var lines = new List<TempLine>();
// Merge both files.
var leftOuterJoinQuery =
from worksheetLine in mainLines
join printLine in extraLines on string.Concat(worksheetLine.prop6, worksheetLine.prop8) equals
string.Concat(printLine.prop4, printLine.prop5) into lineGroup
from line in lineGroup.DefaultIfEmpty()
select
new TempLine(worksheetLine.prop0, worksheetLine.prop1, worksheetLine.prop2, worksheetLine.prop3,
worksheetLine.prop4, worksheetLine.prop5, worksheetLine.prop6, worksheetLine.prop7,
worksheetLine.prop8, worksheetLine.prop9, worksheetLine.prop10, worksheetLine.prop11,
worksheetLine.prop12, worksheetLine.prop13, worksheetLine.prop14, worksheetLine.prop15,
worksheetLine.prop16, worksheetLine.prop17, worksheetLine.prop18, worksheetLine.prop19,
worksheetLine.prop20, worksheetLine.prop21, worksheetLine.prop22, line == null ? "" : line.prop0,
line == null ? "" : line.prop1, line == null ? "" : line.prop2, line == null ? "" : line.prop3,
line == null ? "" : line.prop4, line == null ? "" : line.prop5, line == null ? "" : line.prop6,
line == null ? "" : line.prop7, line == null ? "" : line.prop8, line == null ? "" : line.prop9);
foreach (var tempLine in leftOuterJoinQuery)
{
lines.Add(tempLine);
}
// Write output to new temp file (TESTING)
using (
var file =
new StreamWriter(Path.Combine(Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location),
"output.txt")))
{
foreach (var item in lines)
{
file.WriteLine(item.prop0 + (char)9 + item.prop1 + (char)9 + item.prop2 + (char)9 + item.prop3 +
(char)9 + item.prop4 + (char)9 + item.prop5 + (char)9 + item.prop6 + (char)9 +
item.prop7 + (char)9 + item.prop8 + (char)9 + item.prop9 + (char)9 + item.prop10 +
(char)9 + item.prop11 + (char)9 + item.prop12 + (char)9 + item.prop13 + (char)9 +
item.prop14 + (char)9 + item.prop15 + (char)9 + item.prop16 + (char)9 +
item.prop17 + (char)9 + item.prop18 + (char)9 + item.prop19 + (char)9 +
item.prop20 + (char)9 + item.prop21 + (char)9 + item.prop22 + (char)9 +
item.prop23 + (char)9 + item.prop24 + (char)9 + item.prop25 + (char)9 +
item.prop26 + (char)9 + item.prop27 + (char)9 + item.prop28 + (char)9 +
item.prop29 + (char)9 + item.prop30 + (char)9 + item.prop31 + (char)9 +
item.prop32);
}
}
}

I thought about this some more. Regardless of what your Temp* classes look like, something along the lines of the code below will work, assuming (based on the code you presented) that you output every column from both files in the order in which they came in. If you need to exclude fields, change the order, etc., that would require some changes or a different solution entirely.
It basically just reads the two files in, joins on the Split() result, and then combines the two lines with a tab between them. I didn't see a point in handling the left outer join case of a missing print-file line, but if you need the extra tabs you could replace line ?? "" with something like line ?? new string('\t', 9) (10 empty columns need 9 separators; the tab that joins the two lines is already added).
Note that this is probably not the most efficient way to go about this, and if your files are huge you'd definitely want to optimize it a bit.
// Merge both files.
var lines =
    from worksheetLine in mainFile
    join printLine in extraFile
        on string.Concat(worksheetLine.Split('\t')[6], worksheetLine.Split('\t')[8])
        equals string.Concat(printLine.Split('\t')[4], printLine.Split('\t')[5])
        into lineGroup
    from line in lineGroup.DefaultIfEmpty()
    // The tab keeps the last main column and the first extra column apart.
    select string.Concat(worksheetLine, "\t", line ?? "");
// Write output to new temp file (TESTING)
using (var file = new StreamWriter(
    Path.Combine(Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location), "output.txt")))
{
    foreach (var item in lines)
    {
        file.WriteLine(item);
    }
}
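If the repeated Split() calls become a concern, one possible variation is to split each line once and index the second file with ToLookup. This is only a sketch under assumptions: MergeLines and its parameters are hypothetical names, the key columns are passed in as indexes, and unmatched main lines are padded with empty fields. The key parts are joined with a tab, which cannot occur inside a TSV field, so that "ab" + "c" and "a" + "bc" do not collide the way they can with a plain string.Concat key.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: left-outer-joins main lines with extra lines on a
// composite key built from the given column indexes.
static IEnumerable<string> MergeLines(
    IEnumerable<string> mainLines, IEnumerable<string> extraLines,
    int[] mainKeyColumns, int[] extraKeyColumns, int extraColumnCount)
{
    // Build the key with a separator that cannot occur inside a TSV field.
    string Key(string[] fields, int[] columns) =>
        string.Join("\t", columns.Select(c => fields[c]));

    // Split every extra line once and index the raw lines by key.
    var extraByKey = extraLines
        .Select(line => new { Line = line, Fields = line.Split('\t') })
        .ToLookup(x => Key(x.Fields, extraKeyColumns), x => x.Line);

    // Placeholder for main lines without a match: N empty fields = N - 1 tabs.
    var emptyExtra = new string('\t', extraColumnCount - 1);

    foreach (var mainLine in mainLines)
    {
        var key = Key(mainLine.Split('\t'), mainKeyColumns);
        // A lookup returns an empty sequence for an unknown key.
        foreach (var extra in extraByKey[key].DefaultIfEmpty(emptyExtra))
            yield return mainLine + "\t" + extra;
    }
}

// Small demo: 3 main columns and 3 extra columns, key = columns 0 and 1.
var main = new[] { "A\t1\tfoo", "B\t2\tbar" };
var extra = new[] { "A\t1\tx" };
var merged = MergeLines(main, extra, new[] { 0, 1 }, new[] { 0, 1 }, 3).ToList();
foreach (var line in merged)
    Console.WriteLine(line);
```

For the files in the question the call would presumably use the key columns { 6, 8 } and { 4, 5 } and an extra column count of 10.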

Related

Transferring a tab separated variable file via clipboard to Excel but retaining line breaks WITHIN the cell

I am using a C# script in Tabular Editor to read a Power BI file and use the clipboard to get the information into Excel. I'm a total newbie to both, so my code isn't great, but it works. My code is below.
For the Expression field I am using .Replace("\n"," ") to replace the line breaks with a space.
Otherwise the text after a line break moves into the next row in Excel and no longer aligns with the corresponding column.
Is there a way I can achieve both,
i.e. replace the line breaks with something that Excel recognizes as a line break but that still stays within a single Excel cell?
I have googled, read multiple threads, and tried \r, \x0A, CHAR(10), WrapText in Excel, etc.
var tsv = "Table_Name\tTable_MeasureCount\tMeasure_Name\tMeasure_Description\tMeasure_DisplayFolder\tMeasure_IsHidden\tMeasure_DataType\tMeasure_FormatString\tMeasure_DataCategory\tMeasure_ErrorMessage\tMeasure_Expression";
foreach(var Table in Model.Tables)
foreach(var Measure in Table.Measures)
{
tsv += "\r\n" + Table.Name
+ "\t" + Table.Measures.Count
+ "\t" + Measure.Name
+ "\t" + Measure.Description
+ "\t" + Measure.DisplayFolder
+ "\t" + Measure.IsHidden
+ "\t" + Measure.DataType
+ "\t" + Measure.FormatString
+ "\t" + Measure.DataCategory
+ "\t" + Measure.ErrorMessage.Replace("\n"," ")
+ "\t" + Measure.Expression.Replace("\n"," ");
}
tsv.Output();

This is the logic I ended up with (I couldn't get a function to work in Tabular Editor)
var csv = "\"Table_Name\",\"Table_MeasureCount\",\"Measure_Name\",\"Measure_Description\",\"Measure_DisplayFolder\",\"Measure_IsHidden\",\"Measure_DataType\",\"Measure_FormatString\",\"Measure_DataCategory\",\"Measure_ErrorMessage\",\"Measure_Expression\"";
string str1 = "\r\n\"";
string str2 = "\",\"";
string str3 = "\"";
//foreach(var Measure in Model.AllMeasures)
foreach(var Table in Model.Tables)
{
foreach(var Measure in Table.Measures)
{
csv += str1 + Table.Name
+ str2 + Table.Measures.Count
+ str2 + Measure.Name
+ str2 + Measure.Description
+ str2 + Measure.DisplayFolder
+ str2 + Measure.IsHidden
+ str2 + Measure.DataType
+ str2 + Measure.FormatString.Replace("\"","\"\"")
+ str2 + Measure.DataCategory
+ str2 + Measure.ErrorMessage.Replace("\"","\"\"")
+ str2 + Measure.Expression.Replace("\"","\"\"")
+ str3;
}
}
csv.Output();

Now that we have established that CSV is the best option, since it allows line breaks inside a field, I'd suggest the following:
First create a simple function that turns a string into a CSV-safe value:
public string stringToCsvSafe(string str)
{
return "\"" + str.Replace("\"", "\"\"") + "\"";
}
Then in your loop build your output string, something like this:
var newRow = new List<object>() {
Table.Name,
Table.Measures.Count,
stringToCsvSafe(Measure.Name),
stringToCsvSafe(Measure.Description),
...
};
csv += string.Join(",", newRow) + "\n";
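As a self-contained illustration of the same idea, the sketch below quotes every field, including the ones that are not strings, and keeps line breaks inside the quotes. ToCsvField and ToCsvRow are hypothetical names, and the sample values are made up; whether a script host such as Tabular Editor accepts function declarations is not something this sketch can confirm.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: quote one field using the usual CSV convention
// (wrap in double quotes, double any embedded quotes). Null becomes empty.
static string ToCsvField(object value)
{
    var text = Convert.ToString(value) ?? string.Empty;
    return "\"" + text.Replace("\"", "\"\"") + "\"";
}

// Hypothetical helper: escape every value and join them into one CSV row.
static string ToCsvRow(IEnumerable<object> values) =>
    string.Join(",", values.Select(ToCsvField));

// Made-up sample row: a name, a count, a multi-line expression,
// a value with embedded quotes, and a missing value.
var row = ToCsvRow(new object[] { "Sales", 3, "SUM(\n  [Amount]\n)", "say \"hi\"", null });
Console.WriteLine(row);
```

Quoting every field is slightly more verbose than quoting only where needed, but it avoids having to decide per column which values might contain commas, quotes, or line breaks.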

Updating front end during postback

I have a method that runs through a loop that can take quite a while to complete, because it has to get data back from an API.
What I would like to do is display a message on the front end that shows how the system is progressing during each iteration. Is there a way to update the front end while processing?
public static void GetScreenshot(List<string> urlList, List<DesiredCapabilities> capabilities, String platform, Literal updateNote)
{
foreach (String url in urlList)
{
String siteName = new Uri(url).Host;
String dir = AppDomain.CurrentDomain.BaseDirectory+ "/Screenshots/" + siteName + "/" + DateTime.Now.ToString("yyyy-MM-dd_HH-mm");
foreach (DesiredCapabilities cap in capabilities)
{
String saveDirectory = "";
if (platform == "btnGenDesktopScreens")
{
saveDirectory = dir + "/" + cap.GetCapability("os") + "-" + cap.GetCapability("os_version") + "-" + cap.GetCapability("browser") + cap.GetCapability("browser_version");
}
else if(platform == "btnMobile")
{
saveDirectory = dir + "/" + cap.GetCapability("platform") + "" + cap.GetCapability("device") + "-" + cap.GetCapability("browserName");
}
updateNote.Text += "<br/>" + cap.GetCapability("platform") + " - " + cap.GetCapability("device") + "-" + cap.GetCapability("browserName");
//I'd like to display a message here
TakeScreenshot(url, cap, saveDirectory);
//I'd like to display a message here
}
}
}
Has anyone come across a method of doing this?

Depending on how you're returning the feedback to the user, you might be able to do this by using HttpResponse.Flush in a loop to push parts of the HTML response to the user a bit at a time. See https://msdn.microsoft.com/en-us/library/system.web.httpresponse.flush(v=vs.100).aspx

Remove duplicated items from XML by an attribute

I'm trying to delete <shipmentIndex Name=\"shipments\">whatever...</shipmentIndex>
if it appears more than once, keeping only one.
I have surrounded the item I want to delete with ****.
The code I am using worked before, but then I added .Value == "shipments"
and now it fails.
How can I keep this code and only fix the .Value == "shipments" part so that it works?
class Program
{
static void Main(string[] args)
{
string renderedOutput =
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
"<RootDTO xmlns:json='http://james.newtonking.com/projects/json'>" +
"<destination>" +
"<name>xxx</name>" +
"</destination>" +
"<orderData>" +
"<items json:Array='true'>" +
"<shipmentIndex Name=\"items\" >111</shipmentIndex>" +
"<barcode>12345</barcode>" +
"</items>" +
"<items json:Array='true'>" +
"<shipmentIndex Name=\"items\">222</shipmentIndex>" +
"<barcode>12345</barcode>" +
"</items>" +
"<items json:Array='true'>" +
"<shipmentIndex Name=\"items\">222</shipmentIndex>" +
"<barcode>12345</barcode>" +
"</items>" +
"<misCode>9876543210</misCode>" +
"<shipments>" +
"<sourceShipmentId></sourceShipmentId>" +
"<shipmentIndex shipments=\"shipments\">111</shipmentIndex>" +
"</shipments>" +
"<shipments>" +
"<sourceShipmentId></sourceShipmentId>" +
"<shipmentIndex Name=\"shipments\">222</shipmentIndex>" +
****
"<shipmentIndex Name=\"shipments\">222</shipmentIndex>" +
****
"</shipments>" +
"</orderData>" +
"</RootDTO>";
var xml = XElement.Parse(renderedOutput);
xml.Element("orderData").Descendants("shipments")
.SelectMany(s => s.Elements("shipmentIndex")
.GroupBy(g => g.Attribute("Name").Value == "shipments")
.SelectMany(m => m.Skip(1))).Remove();
}
}

I'm not sure I understand the question 100%, but here goes:
I think you want to filter the results to only include those elements whose Name attribute equals 'shipments'. However, not all of the shipmentIndex elements have a 'Name' attribute, so you are probably getting a NullReferenceException. You need to add a check to ensure that the 'Name' attribute exists.
xml.Element("orderData").Descendants("shipments")
.SelectMany(s => s.Elements("shipmentIndex")
.GroupBy(g => g.Attribute("Name") != null && g.Attribute("Name").Value == "shipments")
.SelectMany(m => m.Skip(1))).Remove();
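A variation on the null check, shown here as a sketch on a shortened version of the XML: casting an XAttribute to string yields null when the attribute is missing, so the comparison is safe without a separate check. This version also assumes that "duplicate" means the same Name attribute and the same text content within one shipments element, which is an interpretation of the question rather than something it states.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Shortened stand-in for the XML in the question.
var xml = XElement.Parse(
    "<RootDTO><orderData>" +
    "<shipments><shipmentIndex shipments=\"shipments\">111</shipmentIndex></shipments>" +
    "<shipments>" +
    "<shipmentIndex Name=\"shipments\">222</shipmentIndex>" +
    "<shipmentIndex Name=\"shipments\">222</shipmentIndex>" +
    "</shipments>" +
    "</orderData></RootDTO>");

// The explicit (string) conversion returns null for a missing attribute,
// so elements without a Name attribute are simply filtered out.
var duplicates = xml.Element("orderData")
    .Elements("shipments")
    .SelectMany(s => s.Elements("shipmentIndex")
        .Where(e => (string)e.Attribute("Name") == "shipments")
        .GroupBy(e => e.Value)          // same text content = duplicate
        .SelectMany(g => g.Skip(1)))    // keep the first of each group
    .ToList();                          // materialize before changing the tree

duplicates.Remove();

Console.WriteLine(xml.Descendants("shipmentIndex").Count());
```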

If you want to delete the duplicate from the renderedOutput string:
Match match = Regex.Match(renderedOutput, "<shipmentIndex Name=\"shipments\">([^<]*)</shipmentIndex>");
int index = renderedOutput.IndexOf(match.ToString());
renderedOutput = renderedOutput.Remove(index, match.ToString().Length);

how to increase the size of array or free the memory after each iteration. Error: Index was outside the bounds of the array c#

I read data from a 27 MB text file that contains 10001 rows, so I need to handle large data. I do some processing on each row and then write it back to a text file. This is the code I am using:
StreamReader streamReader = System.IO.File.OpenText("D:\\input.txt");
string lineContent = streamReader.ReadLine();
int count = 0;
using (StreamWriter writer = new StreamWriter("D:\\ft1.txt"))
{
do
{
if (lineContent != null)
{
string a = JsonConvert.DeserializeObject(lineContent).ToString();
string b = "[" + a + "]";
List<TweetModel> deserializedUsers = JsonConvert.DeserializeObject<List<TweetModel>>(b);
var CreatedAt = deserializedUsers.Select(user => user.created_at).ToArray();
var Text = deserializedUsers.Where(m => m.text != null).Select(user => new
{
a = Regex.Replace(user.text, @"[^\u0000-\u007F]", string.Empty)
.Replace(@"\/", "/")
.Replace("\\", @"\")
.Replace("\'", "'")
.Replace("\''", "''")
.Replace("\n", " ")
.Replace("\t", " ")
}).ToArray();
var TextWithTimeStamp = Text[0].a + " (timestamp:" + CreatedAt[0] + ")";
writer.WriteLine(TextWithTimeStamp);
}
lineContent = streamReader.ReadLine();
}
while (streamReader.Peek() != -1);
streamReader.Close();
}
This code works for the first 54 iterations, since I get 54 lines in the output file. After that it fails with the error "Index was outside the bounds of the array." at the line
var TextWithTimeStamp = Text[0].a + " (timestamp:" + CreatedAt[0] + ")";
I am not sure whether the maximum capacity of the array has been exceeded. If so, how can I increase it? Or can I write each line as it is encountered in the loop with
writer.WriteLine(TextWithTimeStamp);
and then clear the storage, or do something else that solves this issue? I tried using a list instead of an array, but the issue is the same. Please help.

Change this line
var TextWithTimeStamp = Text[0].a + " (timestamp:" + CreatedAt[0] + ")";
to
var TextWithTimeStamp = (Text.Any() ? Text.First().a : string.Empty) +
" (timestamp:" + (CreatedAt.Any() ? CreatedAt.First() : string.Empty) + ")";
The Text and CreatedAt collections are built from the data, so they can be empty (zero items) in some cases, for example when an entry has no text and the Where filter removes it.
In those cases Text[0] and CreatedAt[0] will fail. So, before using the first element, check whether the collection contains any items. The LINQ method Any() does exactly that.
Update
If you want to skip the lines that do not contain text, change these lines
var TextWithTimeStamp = Text[0].a + " (timestamp:" + CreatedAt[0] + ")";
writer.WriteLine(TextWithTimeStamp);
to
if (Text.Any())
{
var TextWithTimeStamp = Text.First().a + " (timestamp:" + CreatedAt.First() + ")";
writer.WriteLine(TextWithTimeStamp);
}
Update 2
To include all the strings from CreatedAt rather than only the first one, you can join all the values into one comma-separated string. A general example:
var strings = new List<string> { "a", "b", "c" };
var allStrings = string.Join(",", strings); //"a,b,c"
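To see why the index can be out of range even though the file is small, the sketch below reproduces the situation with made-up data: each input line becomes a one-element batch, and the Where filter leaves that batch empty when the text is null. It uses FirstOrDefault(), which returns null for an empty sequence, as an alternative to the Any() check; the anonymous objects are hypothetical stand-ins for TweetModel.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the deserialized tweets; the second has no text.
var tweets = new[]
{
    new { created_at = "Mon Oct 08", text = "hello" },
    new { created_at = "Tue Oct 09", text = (string)null },
};

var output = new List<string>();
foreach (var tweet in tweets)
{
    // One tweet per input line, wrapped in a batch as in the question.
    var batch = new[] { tweet };
    var texts = batch.Where(t => t.text != null).Select(t => t.text).ToArray();

    // texts[0] would throw for the second tweet; FirstOrDefault returns null.
    var first = texts.FirstOrDefault();
    output.Add(first == null
        ? "(skipped)"
        : first + " (timestamp:" + tweet.created_at + ")");
}

foreach (var line in output)
    Console.WriteLine(line);
```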

Reading Multiple XML files

I have created a small XML tool to count the occurrences of an element across multiple XML files.
The code gives correct results for elements that are mandatory in the XML files.
But for elements that may or may not be present in the XML files, the software gives me results like this:
10/8/2012 11:27:51 AM
C:\Documents and Settings\AlaspuMK\Desktop\KS\success\4CPK-PMF0-004D-P565-00000-00.xml
Instance: 0
10/8/2012 11:27:51 AM
C:\Documents and Settings\AlaspuMK\Desktop\KS\success\4CPK-PMF0-004D-P566-00000-00.xml
Instance: 0
10/8/2012 11:27:51 AM
C:\Documents and Settings\AlaspuMK\Desktop\KS\success\4CPK-PMF0-004D-P567-00000-00.xml
Instance: 0
10/8/2012 11:27:51 AM
C:\Documents and Settings\AlaspuMK\Desktop\KS\success\4CPK-PMG0-004D-P001-00000-00.xml
**Instance: 11**
10/8/2012 11:27:51 AM
C:\Documents and Settings\AlaspuMK\Desktop\KS\success\4CPK-PMG0-004D-P002-00000-00.xml
Instance: 0
The problem is that there may be 500-1000 XML files. When I search for a tag that may or may not be present, the tool gives me a result for each and every file. For such a tag the instance count may be 0 or more.
Can anyone suggest changes to my code so that it finds only the file names in which the instance count is greater than 0, and prints those in the text box?
My current code:
public void SearchMultipleTags()
{
if (txtSearchTag.Text != "")
{
try
{
//string str = null;
//XmlNodeList nodelist;
string folderPath = textBox2.Text;
DirectoryInfo di = new DirectoryInfo(folderPath);
FileInfo[] rgFiles = di.GetFiles("*.xml");
foreach (FileInfo fi in rgFiles)
{
int i = 0;
XmlDocument xmldoc = new XmlDocument();
xmldoc.Load(fi.FullName);
//rtbox2.Text = fi.FullName.ToString();
foreach (XmlNode node in xmldoc.GetElementsByTagName(txtSearchTag.Text))
{
i = i + 1;
//
}
rtbox2.Text += DateTime.Now + "\n" + fi.FullName + " \nInstance: " + i.ToString() + "\n\n";
//rtbox2.Text += fi.FullName + "instances: " + str.ToString();
}
}
catch (Exception ex)
{
MessageBox.Show("Invalid Path or Empty File name field.");
}
}
else
{
MessageBox.Show("Dont leave field blanks.");
}
}

If I understand correctly, you want to display the text only if i is greater than 0?
if(i > 0 )
rtbox2.Text += DateTime.Now + "\n" + fi.FullName + " \nInstance: " + i.ToString() + "\n\n";

Use
if(i > 0)
rtbox2.Text += DateTime.Now + "\n" + fi.FullName + " \nInstance: " + i.ToString() + "\n\n";
instead of simply
rtbox2.Text += DateTime.Now + "\n" + fi.FullName + " \nInstance: " + i.ToString() + "\n\n";

You could always just use this code inside the try block:
rtbox2.Text =
String.Join(Environment.NewLine + Environment.NewLine,
from fi in (new DirectoryInfo(textBox2.Text)).GetFiles("*.xml")
let xd = XDocument.Load(fi.FullName)
let i = xd.Descendants(txtSearchTag.Text).Count()
where i > 0
select String.Join(Environment.NewLine, new []
{
DateTime.Now.ToString(),
fi.FullName,
i.ToString(),
}));
Does it all in one line (bar the formatting). :-)
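The same filter can be tried without touching the file system. The sketch below uses made-up in-memory documents in place of the files, so the names and contents are assumptions; only the Descendants(...).Count() and "where count > 0" logic mirrors the query above. One thing to keep in mind is that Descendants with a plain name only matches elements that are in no namespace, so files that declare a default namespace may need an XName that includes it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical in-memory stand-ins for the XML files on disk.
var documents = new Dictionary<string, string>
{
    ["a.xml"] = "<root><item/><item/></root>",
    ["b.xml"] = "<root><other/></root>",
    ["c.xml"] = "<root><group><item/></group></root>",
};
var tag = "item";

// Count the matching elements per document and keep only non-zero counts.
var hits = documents
    .Select(d => new
    {
        Name = d.Key,
        Count = XDocument.Parse(d.Value).Descendants(tag).Count()
    })
    .Where(x => x.Count > 0)
    .OrderBy(x => x.Name)
    .Select(x => x.Name + " Instance: " + x.Count)
    .ToList();

foreach (var hit in hits)
    Console.WriteLine(hit);
```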
