I am having some problems with a fairly easy task; I feel like I'm missing something very obvious here.
I have a .csv file which is semicolon-separated. The file contains several numbers with dots, like "1.300", but it also contains dates like "2015.12.01". The task is to find and delete all dots, but only those in numbers, not those in dates. The dates and numbers are completely variable and never at the same position in the file.
My question now: what is the 'best' way to handle this problem?
From a programmer's point of view: is it a good solution to just split at every semicolon, count the dots, and delete the dot if there is only one? This is the only way to solve the problem I have come up with so far.
Example source file:
2015.12.01;
13.100;
500;
1.200;
100;
Example result:
2015.12.01;
13100;
500;
1200;
100;
If you can rely on the fact that dates have two dots and numbers just one, you can use that as a filter:
string s = "123.45";
if (s.Count(x => x == '.') == 1)
{
s = s.Replace(".", null);
}
The source file looks like a valid file generated by a program running on a machine whose locale uses . as the thousand separator (most of Europe does) and date separator (German locales only I think). Such locales also use ; as the list separator.
If the question was only how to parse such dates and numbers, the answer would be to pass the proper culture to the parse function, e.g. decimal.Parse("13.500", new CultureInfo("de-DE")) would return 13500. The actual issue though is that the data must be fed to another program that uses . as the decimal separator.
The safest option would be to change the locale used by the exporting program (e.g. the thread CultureInfo if the exporter is a .NET program, the locale in an SSIS package, etc.) to a locale like en-GB, so that numbers are exported with . as the decimal separator and the weird date format is avoided. This assumes that the next program in the pipeline doesn't expect German for dates and English for numbers.
Another option would be to load the text, parse the fields using the proper locale then export them in the format required by the next program.
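The parse-and-re-export option could look roughly like the sketch below. It is only an illustration under assumptions: the names CsvNormalizer and NormalizeLine are made up, dates are assumed to always have the form yyyy.MM.dd, and the separators are set explicitly on a NumberFormatInfo instead of being taken from a named culture.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class CsvNormalizer
{
    // Assumed source format: "." groups thousands, "," is the decimal separator.
    private static readonly NumberFormatInfo SourceFormat = new NumberFormatInfo
    {
        NumberGroupSeparator = ".",
        NumberDecimalSeparator = ","
    };

    public static string NormalizeLine(string line)
    {
        return string.Join(";", line.Split(';').Select(NormalizeField));
    }

    private static string NormalizeField(string field)
    {
        // Check for a date first: the number parser does not validate group
        // sizes, so it would otherwise read "2015.12.01" as 20151201.
        if (DateTime.TryParseExact(field, "yyyy.MM.dd", CultureInfo.InvariantCulture,
                                   DateTimeStyles.None, out _))
        {
            return field;
        }

        const NumberStyles styles = NumberStyles.AllowLeadingSign
                                  | NumberStyles.AllowThousands
                                  | NumberStyles.AllowDecimalPoint;
        if (decimal.TryParse(field, styles, SourceFormat, out decimal number))
        {
            // Re-export with "." as the decimal separator and no grouping.
            return number.ToString(CultureInfo.InvariantCulture);
        }

        // Anything else (text, empty fields) is passed through untouched.
        return field;
    }
}
```

Calling CsvNormalizer.NormalizeLine on the example line from the question leaves the date alone and strips the dots from the numbers. Real CSV with quoted fields would need a proper CSV reader instead of Split.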
Finally, a regular expression could be used to match only the numeric fields and remove the dot. This can be a bit tricky and depends on the actual contents.
For example (\d+)\.(\d{3}) can be used to match numbers if there is only one thousand separator. This can fail if some text field contains similar values. Or ;(\d+)\.(\d{3}); could match only a full field, except the first and last fields, eg:
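Regex.Replace("1.457;2016.12.30;13.000;1,50;2015.12.04;13.456", @";(\d+)\.(\d{3});", @"$1$2;")
produces (note that the replacement does not put back the leading ;, so the matched field runs into the previous one):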
1.457;2016.12.3013000;1,50;2015.12.04;13.456
A regular expression that would match either numbers between ; or the first/last field could be
(^|;)(\d+)\.(\d{3})(;|$)
This would produce 1457;2016.12.30;13000;1,50;2015.12.04;13456, eg:
var data="1.457;2016.12.30;13.000;1,50;2015.12.04;13.456";
var pattern=@"(^|;)(\d+)\.(\d{3})(;|$)";
var replacement=@"$1$2$3$4";
var result= Regex.Replace(data,pattern,replacement);
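One limitation of the pattern above is that the ; delimiters are consumed by the match, so of two adjacent numeric fields such as 1.000;2.000 only the first is rewritten. A possible variation, offered as a sketch rather than a drop-in fix, uses lookarounds so that the delimiters are only inspected, and it also accepts several thousand separators in one number. The names DotStripper and StripThousandsDots are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotStripper
{
    // A whole field made of 1-3 digits followed by one or more ".ddd" groups.
    // The lookbehind/lookahead check the field boundaries without consuming them.
    private static readonly Regex GroupedNumber =
        new Regex(@"(?<=^|;)\d{1,3}(?:\.\d{3})+(?=;|$)");

    public static string StripThousandsDots(string line)
    {
        // Remove every dot inside each matched number.
        return GroupedNumber.Replace(line, m => m.Value.Replace(".", ""));
    }
}
```

This is meant to be applied to one line at a time; for a multi-line string RegexOptions.Multiline would be needed so that ^ and $ match at line boundaries.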
The advantage of a regex over splitting and replacing strings is that it's a lot faster and more memory efficient. Instead of generating temporary strings for each split, manipulation, a Regex only calculates indexes in the source. A string object is generated only when you request the final text result. This results in far fewer allocations and garbage collections.
Even in medium-sized files this can result in 10x better performance.
I wouldn't rely on the number of dots as mistakes can be made.
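You can use double.TryParse to safely test whether the string is a number. Pass an explicit culture, otherwise the result depends on the machine's locale:
// requires: using System.Globalization;
var data = "2015.12.01;13.100;500;1.200;100;";
var dataArray = data.Split(';');
foreach (var s in dataArray)
{
    double result;
    // With the invariant culture and NumberStyles.Float, "." is the decimal point and
    // thousands separators are not allowed, so "13.100" is accepted and "2015.12.01" is not.
    if (double.TryParse(s, NumberStyles.Float, CultureInfo.InvariantCulture, out result))
    {
        // implement your logic here
        Console.WriteLine(s.Replace(".", string.Empty));
    }
}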
I tried to figure out the basics of these numeric string formatters. So I think I understand the basics but there is one thing I'm not sure about
So, for example
#,##0.00
It turns out that it produces identical results as
#,#0.00
or
#,0.00
#,#########0.00
So my question is: why do people use #,## so often? (I see it a lot when googling.)
Maybe I missed something.
You can try it out yourself here and put the following inside that main function
double value = 1234.67890;
Console.WriteLine(value.ToString("#,0.00"));
Console.WriteLine(value.ToString("#,#0.00"));
Console.WriteLine(value.ToString("#,##0.00"));
Console.WriteLine(value.ToString("#,########0.00"));
Probably because Microsoft uses the same format specifier in their documentation, including the page you linked. It's not too hard to figure out why; #,##0.00 more clearly states the programmer's intent: three-digit groups separated by commas.
What happens?
The following function is called:
public string ToString(string? format)
{
return Number.FormatDouble(m_value, format, NumberFormatInfo.CurrentInfo);
}
It is important to realize that the format string determines how the number is rendered; your formats just happen to give the same result for this value.
Examples:
value.ToString("#,#") // 1,235
value.ToString("0,0") // 1,235
value.ToString("#") // 1235
value.ToString("0") // 1235
value.ToString("#.#") // 1234.7
value.ToString("#.##") // 1234.68
value.ToString("#.###") // 1234.679
value.ToString("#.#####") // 1234.6789
value.ToString("#.######") // = value.ToString("#.#######") = 1234.6789
We see that
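for this value it doesn't matter whether the integer part uses # or 0 (they only differ when the number has fewer digits than the format has placeholders: 0 pads with zeros, # prints nothing)
one occurrence is enough for an arbitrarily large number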
double value = 123467890;
Console.WriteLine(value.ToString("#")); // Prints the full number
, and . however, are treated differently
A , between digit placeholders only switches on group separators. After a ., at most as many decimals are shown as there are placeholders (fewer with #, as for #.######).
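To see where # and 0 actually differ, it helps to format a value with fewer digits than the format has placeholders. The helper below is a small hypothetical wrapper (PlaceholderDemo.Format is not a framework API); it pins the invariant culture so the separators don't depend on the machine's settings.

```csharp
using System;
using System.Globalization;

public static class PlaceholderDemo
{
    public static string Format(double value, string format)
    {
        // Invariant culture: "," is the group separator, "." the decimal point.
        return value.ToString(format, CultureInfo.InvariantCulture);
    }
}
```

With this helper, 1234.6789 gives "1,234.68" for both "#,##0.00" and "#,#.##", but 0.5 gives "0.50" with "#,##0.00" and ".5" with "#,#.##". That difference is a practical reason to keep the 0 placeholders next to the decimal point.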
At this point it's clear that it has to do with the programmer's intent. If you want to display the number as 1,234.68 or 1234.6789, you would format it as
"#,###.##" or "#,#.##" // 1,234.68
"####.#####" or "#.#####" // 1234.6789 (use "0.00000" to keep the trailing zero: 1234.67890)
Given the input 123.45, I'm trying to get the output 123,45 via String.Format.
Because of the system I'm working in, the actual format strings (e.g. {0:0.00}) are saved in a config file and transformed later down the pipeline.
Editing the config file to add a new format is "safe" and I can get this through quite quickly. Editing the actual parser later down the line is "risky" so will need a more significant QA resource and I need to avoid this.
As a result, some caveats:
I only have access to string.Format(pattern, input). No overloads.
I cannot send a localisation. I know that if I send string.Format(new System.Globalization.CultureInfo("de-DE"), "{0:0.00}", 123.45) then I've got what I need. But I cannot pass the localisation.
So can it be done?
Is there any format I can pass to string.Format which will transform 123.45 into 123,45?
If you can multiply the input by 100, then the following should work:
double input = 123.45;
string pattern = "{0:###\\,##}";
var result = String.Format(pattern, input*100);
Just for the fun of it :)
double value =123.45;
Console.WriteLine(String.Format("{0:#0.00}\b\b\b,", value));
This of course only works when there is a cursor, like in the console, otherwise the backspace control characters have no effect.
Sorry, but I can't think of a real way of accomplishing this.
Struggling with the basics - I'm trying to code a simple currency converter. The XML provided by external source uses comma as a decimal separator for exchange rate (kurs_sredni):
<pozycja>
<nazwa_waluty>bat (Tajlandia)</nazwa_waluty>
<przelicznik>1</przelicznik>
<kod_waluty>THB</kod_waluty>
<kurs_sredni>0,1099</kurs_sredni>
</pozycja>
I already managed to load the data from XML into a nifty list of objects (kursyAktualne), and now I'm trying to do the math. I'm stuck with the conversion.
First of all I'm assigning "kurs_sredni" to a string, trying to replace "," with "." and converting the hell out of it:
string kursS = kursyAktualne[iNa].kurs_sredni;
kursS.Replace(",",".");
kurs = Convert.ToDouble(kursS);
MessageBox.Show(kurs.ToString());
The message box shows 1099 instead of the expected 0.1099, and kursS still has a comma, not a dot.
I tried toying with some CultureInfo stuff I googled, but that was too random. I need to understand how to control this.
Just use decimal.Parse but specify a CultureInfo. There's nothing "random" about it - pick an appropriate CultureInfo, and then use that. For example:
using System;
using System.Globalization;
class Test
{
static void Main()
{
var french = CultureInfo.GetCultureInfo("fr-FR");
decimal value = decimal.Parse("0,1099", french);
Console.WriteLine(value.ToString(CultureInfo.InvariantCulture)); // 0.1099
}
}
This is just using French as one example of a culture which uses , as a decimal separator. It would probably make sense to use the culture of the origin of the data.
Note that decimal is a better pick for currency values than double - you're trying to represent an "artificial" construct which is naturally specified in base10, rather than a "natural" continuous value such as a weight.
(I would also be wary of a data provider who provides data in a non-standard format. If they're getting that wrong, who knows what else they'll get wrong. It's not like XML doesn't have a well-specified format for numbers...)
It is because the Replace method returns a new string with the characters replaced. It does not modify your original string.
So you need to reassign it:
kursS = kursS.Replace(",",".");
Replace returns a string. So you need an assignment.
kursS = kursS.Replace(",", ".");
There is a "neater" way of doing this using CultureInfo. Look it up on the MSDN website.
Your Replace result isn't used; the original value, which still contains the comma, is what gets converted.
You should do:
kursS = kursS.Replace(",", ".");
In addition, this method isn't really safe if there are thousands separators.
So if you are not using culture settings you should do:
kursS = kursS.Replace(".", "").Replace(",", ".");
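If you would rather not juggle Replace calls at all, a sketch along the following lines parses the text directly. The separators are an assumption about the data ("," as decimal separator, "." as thousands separator), and RateParser and ParseRate are hypothetical names.

```csharp
using System;
using System.Globalization;

public static class RateParser
{
    // Assumed input format: "," is the decimal separator, "." groups thousands.
    private static readonly NumberFormatInfo CommaDecimal = new NumberFormatInfo
    {
        NumberDecimalSeparator = ",",
        NumberGroupSeparator = "."
    };

    public static decimal ParseRate(string text)
    {
        // NumberStyles.Number allows sign, thousands separators and a decimal point.
        return decimal.Parse(text, NumberStyles.Number, CommaDecimal);
    }
}
```

Because the format is passed explicitly, the result does not depend on the culture of the machine running the code.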
Just looking to see what the best way to approach the following situation would be.
I am trying to make a small job that reads in a txt file which has a thousand or so lines;
Each line is about 40 characters long (mostly numbers, some letter identifiers).
I have used
DataTable txtCache = new DataTable();
txtCache.Columns.Add(new DataColumn("Column1"));
string[] lines = System.IO.File.ReadAllLines(FILEcheck.Properties.Settings.Default.filePath);
foreach (string line in lines)
{
txtCache.Rows.Add(line);
}
However, what I really want to do is a bit confusing and hard to explain, so I'll do my best. An example of a line is below:
5498494000584454684840}eD44448774V6468465 Z
In the beginning of that long string is a "84", and then a "58" a little bit later. I need to do a comparison on these two numbers. They could be anything, but only a few combinations are acceptable in the file. They will always be in the same spot and same amount of characters (so it will always be 2 numbers and always in the 4-5 location). So I want to have 3 columns. I want the full string in 1 column, and then the 2 individual smaller numbers in columns of themselves. I can then compare them later on, and if there is an issue, I can return the full string which caused the issue.
Is this possible? I am just not sure how to parse out a substring based on character location and then loading it into a datatable.
Any advice would be appreciated. Thank you,
You could create a column for each of the items you are looking to store (whole string, first number, second number), and then add a row for each of the lines in the input file. You could just use the Substring method to parse out the two-digit numbers and store them. To do your analysis, you could parse the numbers out from the strings, or whatever else you need to do.
lines[0].Substring(3,2) will give you "84" in your above example. If you want the int, you could use Int32.Parse(lines[0].Substring(3,2)).
Substring reference: http://msdn.microsoft.com/en-us/library/aka44szs%28v=vs.110%29.aspx
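Put together, the three-column table could be built roughly like this. It is a sketch with assumptions: the class, method, and column names are invented, and the offset of the second number (index 10) is read off the single example line in the question, so it should be checked against the real file layout.

```csharp
using System;
using System.Data;

public static class LineTable
{
    public static DataTable Build(string[] lines)
    {
        var table = new DataTable();
        table.Columns.Add("FullLine", typeof(string));
        table.Columns.Add("FirstCode", typeof(string));
        table.Columns.Add("SecondCode", typeof(string));

        foreach (string line in lines)
        {
            // Skip lines too short to contain both codes.
            if (line.Length < 12) continue;

            // Zero-based offsets: characters 4-5 and 11-12 of the line.
            table.Rows.Add(line, line.Substring(3, 2), line.Substring(10, 2));
        }
        return table;
    }
}
```

Keeping the full line in the first column means that a failed comparison can report the offending line directly.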
How can I change values in string from 0,00 to 0.00? - only numeric values, not all chars "," to "."
FROM
string myInputString = "<?xml version=\"1.0\"?>\n<List xmlns:Table=\"urn:www.navision.com/Formats/Table\"><Row><HostelMST>12,0000</HostelMST><PublicMST>0,0000</PublicMST><TaxiMST>0,0000</TaxiMST><ParkMST>0,0000</ParkMST><RoadMST>0,0000</RoadMST><FoodMST>0,0000</FoodMST><ErrorCode>0</ErrorCode><ErrorDescription></ErrorDescription></Row></List>\n";
TO
string myInputString = "<?xml version=\"1.0\"?>\n<List xmlns:Table=\"urn:www.navision.com/Formats/Table\"><Row><HostelMST>12.0000</HostelMST><PublicMST>0.0000</PublicMST><TaxiMST>0.0000</TaxiMST><ParkMST>0.0000</ParkMST><RoadMST>0.0000</RoadMST><FoodMST>0.0000</FoodMST><ErrorCode>0</ErrorCode><ErrorDescription></ErrorDescription></Row></List>\n";
Thanks for the answers, but I mean to change only numeric values, not all "," characters to "."
I don't want to change the string from
string = "<Attrib>txt txt, txt</Attrib><Attrib1>12,1223</Attrib1>";
to
string = "<Attrib>txt txt. txt</Attrib><Attrib1>12.1223</Attrib1>";
but this one is ok
string = "<Attrib>txt txt, txt</Attrib><Attrib1>12.1223</Attrib1>";
Try this :
Regex.Replace("attrib1='12,34' attrib2='43,22'", "(\\d),(\\d)", "$1.$2")
output : attrib1='12.34' attrib2='43.22'
The best method depends on the context. Are you parsing the XML? Are you writing the XML. Either way it's all to do with culture.
If you are writing it, then I am assuming your culture is set to something which uses commas as decimal separators and you're not aware of that fact. Firstly, change your culture in Windows settings to something which better fits the way you do things. Secondly, if you are writing the numbers out for human display, leave the output culture-sensitive so it fits whoever is reading it. If it is to be parsed by another machine, use the invariant culture like so:
12.1223.ToString(CultureInfo.InvariantCulture);
If you are reading (which I assume is what you are doing) then you can use the culture info again. If it was from a human source (e.g. they typed it in a box) then again use their default culture info (default in float.Parse). If it is from a computer then use InvariantCulture again:
float f = float.Parse("12.1223", CultureInfo.InvariantCulture);
Of course, this assumes that the text was written with the invariant culture. But as you're asking the question, it wasn't (unless you have control over how it is written, in which case use InvariantCulture to write it, as suggested above). You can then use a specific culture which does understand commas to parse it:
NumberFormatInfo commaNumberFormatInfo = new NumberFormatInfo();
commaNumberFormatInfo.NumberDecimalSeparator = ",";
float f = float.Parse("12,1223", commaNumberFormatInfo);
I strongly recommend joel.neely's regex approach or the one below:
Use XmlReader to read all nodes
Use double.TryParse with the formatter = a NumberFormatInfo that uses a comma as decimal separator, to identify numbers
Use XmlWriter to write a new XML
Use CultureInfo.InvariantCulture to write the numbers on that XML
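The steps above might be sketched as follows. Note that this swaps in LINQ to XML (XDocument) for the XmlReader/XmlWriter pair to keep the example short, and that XmlDecimalFixer and Fix are hypothetical names. It assumes numeric values fill a whole leaf element and use "," as the decimal separator with no thousands separators.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

public static class XmlDecimalFixer
{
    // Assumed input format: "," as decimal separator.
    private static readonly NumberFormatInfo CommaDecimal = new NumberFormatInfo
    {
        NumberDecimalSeparator = ","
    };

    public static string Fix(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        const NumberStyles styles = NumberStyles.AllowLeadingSign
                                  | NumberStyles.AllowDecimalPoint;

        // Materialize the list first so the tree is not modified while being enumerated.
        foreach (XElement leaf in doc.Descendants().Where(e => !e.HasElements).ToList())
        {
            // Only elements whose whole text parses as a number are rewritten.
            if (decimal.TryParse(leaf.Value, styles, CommaDecimal, out decimal number))
            {
                leaf.Value = number.ToString(CultureInfo.InvariantCulture);
            }
        }
        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

Free text such as "txt txt, txt" fails the TryParse check and is left alone. XDocument.ToString does not emit the XML declaration, so it would have to be written separately if the consumer needs it.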
The answer from ScarletGarden is a start, but you'll need to know the complete context and grammar of "numeric values" in your data.
The problem with the short answer is that cases such as this get modified:
<elem1>quantity<elem2>12,6 of which were broken</elem2></elem1>
Yes, there's probably a typo (missing space after the comma) but human-entered data often has such errors.
If you include more context, you're likely to reduce the false positives. A pattern like
([\s>]-?$?\d+),(\d+[\s<])
(which you can escape to taste for your programming language of choice) would only match when the "digits-comma-digits" portion (with optional sign and currency symbol) was bounded by space or an end of an element. If all of your numeric values are isolated within XML elements, then you'll have an easier time.
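For the easier case mentioned last, where every numeric value fills its element completely, a stricter pattern than the one above could be sketched like this. It is an assumption-laden variant, not the pattern from this answer: no currency symbol, no surrounding whitespace, and the names ElementNumberFixer and Fix are made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ElementNumberFixer
{
    // Matches "digits,digits" only when it sits directly between ">" and "</",
    // i.e. when it is the entire text of an element.
    private static readonly Regex WholeElementNumber =
        new Regex(@"(?<=>)(-?\d+),(\d+)(?=</)");

    public static string Fix(string xml)
    {
        return WholeElementNumber.Replace(xml, "$1.$2");
    }
}
```

A value like "12,6 of which were broken" is not touched, because the digits are not immediately followed by a closing tag.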
string newStr = myInputString.Replace("0,00", "0.00");
While you could theoretically do this using a Regex, the pattern would be complex and hard to test. ICR is on the right track: you need to do this based on culture.
Do you know that your numbers are always going to be using a comma as a decimal separator instead of a period? It looks like you can, given that Navision is a Danish company.
If so, you'll need to traverse the XML document in the string, and rewrite the numeric values. It appears you can determine this on node name, so this won't be an issue.
When you convert the number, use something similar to this:
internal double ConvertNavisionNumber(string rawValue)
{
double result = 0;
if (double.TryParse(rawValue, NumberStyles.Number, new CultureInfo("da-DK").NumberFormat, out result))
return result;
else
return 0;
}
This tells the TryParse() method that you're converting a number from Danish (da-DK). Once you call the function, you can use ToString(CultureInfo.InvariantCulture) to write the number out with a period as the decimal separator (a plain ToString() uses the current culture, which gives the same result on a US or Canadian machine). This also takes the different thousands separator into account (1,234.56 in Canada is written as 1.234,56 in Denmark).
ConvertNavisionNumber("4,43").ToString(CultureInfo.InvariantCulture)
will result in "4.43".
ConvertNavisionNumber("1.234,56").ToString(CultureInfo.InvariantCulture)
will result in "1234.56".
If the , is not used anywhere in the string except within numbers, you can use the following:
string newStr = myInputString.Replace(",", ".");