I have a variable containing Persian digits, like this:
string Value="۱۰۳۶۷۵۱";
I want to convert these digits to their English equivalents and store the result back in my string, like this:
Value="1036751";
How can I do this? I would prefer an easy way, such as using CultureInfo, over a switch statement.
You can use the Windows.Globalization.NumberFormatting.DecimalFormatter class to parse the string. This will parse strings in any of the supported numeral systems (as long as it is internally coherent).
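If the WinRT formatter is not available, a different technique can do the same job on any .NET runtime. The following is a minimal sketch, not the DecimalFormatter approach described above: it relies on char.IsDigit and char.GetNumericValue, which recognize decimal digits from any Unicode numeral system. The class and method names (DigitConverter, ToAsciiDigits) are made up for illustration.

```csharp
using System;
using System.Text;

public static class DigitConverter
{
    // Replaces every Unicode decimal digit (Persian, Arabic-Indic, ...)
    // with its ASCII counterpart and leaves all other characters untouched.
    public static string ToAsciiDigits(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (char.IsDigit(c))
            {
                // GetNumericValue returns 0..9 for decimal digit characters
                sb.Append((char)('0' + (int)char.GetNumericValue(c)));
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

With the value from the question, DigitConverter.ToAsciiDigits("۱۰۳۶۷۵۱") should give "1036751". Unlike a parse-and-format round trip, this keeps leading zeros and any non-digit characters in place.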
You can do that with a number of tools. iTextPdfSharp will likely be able to do it. It will amount to opening the document and walking the tree in the catalog that has the bookmarks in it. Their code works fine, but be sure to download the spec so you can understand the structure of the tree. I worked on the original version of Acrobat, and many of my fellow engineers felt that the bookmark tree was a little over-complicated.
BitMiracle offers similar code. They routinely patrol Stack Overflow, so you might see an answer from them too (HI!) - you can see a sample of their work here for authoring bookmarks.
If you're willing to pay money, this is easy using Atalasoft's DotPdf (disclaimer: I work for Atalasoft and wrote nearly all of DotPdf). In our API, we try to hide the complexity of the structure where possible (for example, if you want to iterate over the chain of actions taken when a bookmark is clicked, it's a foreach instead of a tree walk) and we've wrapped the bookmark tree into standard List<T> collections.
public void WalkBookmarks(Stream pdf)
{
    // open the doc
    PdfDocument doc = new PdfDocument(pdf);
    if (doc.BookmarkTree != null)
    {
        // walk the list of top level bookmarks
        WalkBookmarks(doc.BookmarkTree.Bookmarks, 0);
    }
}

public void WalkBookmarks(PdfBookmarkList list, int depth)
{
    if (list == null) return;
    foreach (PdfBookmark bookmark in list)
    {
        // indent to the depth of the list and write the Text
        // you can also get the color, basic font styling and
        // the action associated with the bookmark
        for (int i = 0; i < depth; i++) Console.Write(" ");
        Console.WriteLine(bookmark.Text);
        // recurse on any children
        WalkBookmarks(bookmark.Children, depth + 1);
    }
}
PDFs can contain at least three different things which may be called "table of contents":
Document outline (bookmarks), a set of specific PDF structures
List of hyperlinks at the beginning of a document. Each hyperlink leads to a place within the document
List of text strings where each string names a part of the document and, optionally, specifies on which page this part starts.
I do not know of any out-of-the-box or easy-to-implement solutions for the third case. The other cases are simpler.
For the first case, almost any PDF library will do. @plinth (Hi!) gave at least two solutions for such a case.
For the second case a solution could be implemented using Docotic.Pdf library. Basically, you might try to:
enumerate all links in a document
find all links that are close to each other (you'll need to build up some heuristics for what to treat as "close")
retrieve text from found links
If your case is "list of hyperlinks" then the Extract text from link target sample might give you some clues for a start.
Disclaimer: I work for Bit Miracle, vendor of Docotic.Pdf library.
You'll need a PDF library such as PDFlib (http://www.pdflib.com/) to read PDF files. That should do the trick, good luck!
I'm working on an application which has to create Word documents using the Office Open XML SDK 2.5. My current idea is to start from a template with an empty body (so all the namespaces etc. are already defined) and add Paragraphs to it. If I need images, I will add the ImageParts and try to give each ImagePart the Id present in the predefined paragraph that will contain the image. I will store the paragraphs as XML in a database, fetch the ones I need, fill in or modify some values if needed, and insert them into my Word document. But this is the tricky part: how can I insert them so that I don't have to query on their content to find one of the paragraphs later on? In other words, I need Ids. I have some options in mind:
For each possible paragraph I have, manually create an SdtBlock. This SdtBlock will have an Id which matches the Id of each paragraph in the database. This seems like a lot of manual work though, and I'd rather be able to create future Word documents more easily...
I chose this approach but I insert Building Blocks which can be stored in templates with a specific tag name.
Create the paragraphs, copy the XML from the developer tool, and manually add a ParagraphId. This seems even more of a nightmare though, because for every new paragraph I will have to create new Ids etc. It would also be impossible to insert tables, as there is no way (afaik) to give those an Id.
Work with bookmarks to know where to insert the data. I don't really like this either, as bookmarks are visible to everyone. I know I can replace them, but then I don't have any way to identify individual paragraphs later on.
**** my database and just add everything in the template :D Remove the paragraphs I don't need by deleting the bookmarks with their content. This idea seems the worst of all though, as I don't want to depend on having a template file with all possible content for each Word file I need to generate.
Anyone with experience in OpenXml who knows which approach would be the best? Maybe there is another approach which is better and I have completely overlooked? The ideal solution would be that I can add Ids in Office Word but that's a no-go as I haven't found anything to do that yet.
Thanks in advance!
Content Controls (sdt) were designed for this, although I'm not sure the designers ever contemplated "targeting" each and every paragraph in the document...
Back in the 2003/2007 days it was possible to add custom XML mark-up to a Word document, which would have been exactly what you're looking for. But Microsoft lost a patent court case around 2009 and had to pull the functionality. So content controls are really your only good choice.
Your approach could, possibly, be combined with the BuildingBlocks concept. BuildingBlocks are Word content stored in a Word template file as valid Word Open XML. They can be assigned to "galleries" and categorized. There is a Content Control of type BuildingBlock that can be associated with a specific Gallery and Category which might help you to a certain extent and would be an alternative to storing content in a database.
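As a rough illustration of the content control idea, the sketch below wraps a paragraph in a block-level content control (SdtBlock) that carries a tag, and later finds it again by that tag. It assumes a project reference to the Open XML SDK (DocumentFormat.OpenXml); the helper names TaggedBlocks, Wrap and Find are hypothetical, and the tag value would be whatever Id the database uses.

```csharp
using System.Linq;
using DocumentFormat.OpenXml.Wordprocessing;

public static class TaggedBlocks
{
    // Wraps a paragraph in a block-level content control carrying the given tag.
    // The paragraph must not already be attached to a document.
    public static SdtBlock Wrap(Paragraph paragraph, string tag)
    {
        return new SdtBlock(
            new SdtProperties(new Tag { Val = tag }),
            new SdtContentBlock(paragraph));
    }

    // Returns the first block-level content control with the given tag, or null.
    public static SdtBlock Find(Body body, string tag)
    {
        return body.Descendants<SdtBlock>()
            .FirstOrDefault(sdt =>
                sdt.SdtProperties?.GetFirstChild<Tag>()?.Val?.Value == tag);
    }
}
```

A caller would append TaggedBlocks.Wrap(paragraph, "para-42") to the body when generating the document and use TaggedBlocks.Find(body, "para-42") when the paragraph has to be located again. Because the tag is part of the standard markup, it should survive a round trip through Word.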
OK, I did some research: you can do it in strict OpenXML, but only before you open your file in Word. Word will remove everything it cannot read.
using (WordprocessingDocument document = WordprocessingDocument.Open(path, true)) {
document.MainDocumentPart.Document.Body.Ancestors().First()
.SetAttribute(new OpenXmlAttribute() {
LocalName = "someIdName",
Value = "111" });
}
Here, for example, I set the attribute "someIdName", which doesn't exist in OpenXML, on some random element. You can set it anywhere and use it as an id.
I have an observable collection with a class that has 2 string properties: Word and Translation. I want to create a word file in format:
word = translation word = translation
word = translation word = translation...
The word document needs to be in 2 Columns (PageLayout) and the Word should be in bold.
I have first tried Microsoft.Office.Interop.Word.
PageSetup.TextColumns.SetCount(2) sets the PageLayout. As for the text itself I used a foreach loop and in each iteration I did this:
paragraph.Range.Text = Word + " = " + Translation;
object boldStart = paragraph.Range.Start;
object boldEnd = paragraph.Range.Start + Word.Length;
Word.Range boldPart = document.Range(boldStart, boldEnd);
boldPart.Bold = 1;
paragraph.Range.InsertParagraphAfter();
This does exactly what I want, but with 1000 items in the collection it takes about 10 seconds, and much longer with 10k+ items. I then used a StringBuilder and just set document.Content.Text = sb.ToString(); that takes less than a second, but I can't make the word bold that way.
Then I switched to the Open XML SDK 2.5, but even after reading the MSDN documentation I still have no idea how to make just part of the text bold, and I don't know whether it's even possible to set the PageLayout column count. The best I could do was to make it look the same as with Interop.Word, but with just 1 column and a creation time under 1 second.
Should I be using Interop.Word or Open XML (or maybe a combination) for this? And can someone please show me how to write this properly, so it doesn't take forever if the collection is relatively large? Any help is appreciated. :)
OOXML can be intimidating at first. http://officeopenxml.com/anatomyofOOXML.php has some good examples. Whenever you get confused unzip the docx and browse the contents to see how it's done.
The basic idea is you'd open Word, create a template with the styling you want and a code word to find the paragraph, then multiply the paragraph, replacing the text in that template with each word.
Your Word template would look like this:
Here's some pseudo code to get you started, assuming you have the SDK installed
// needs: System.Linq, System.Text.RegularExpressions,
// DocumentFormat.OpenXml.Packaging, DocumentFormat.OpenXml.Wordprocessing
var templateRegex = new Regex("\\[templateForWords\\]");
var wordPlacementRegex = new Regex("\\[word\\]");
var translationPlacementRegex = new Regex("\\[translation\\]");
using (var document = WordprocessingDocument.Open(stream, true))
{
    MainDocumentPart mainPart = document.MainDocumentPart;
    // find the paragraph that carries the template marker
    Paragraph paragraphTemplate = mainPart.Document.Body
        .Descendants<Paragraph>()
        .First(p => templateRegex.IsMatch(p.InnerText));
    // assumes YourDictionary is a Dictionary<string, string> of word -> translation
    foreach (var pair in YourDictionary)
    {
        var paraClone = (Paragraph)paragraphTemplate.CloneNode(true);
        // replace the markers in each text element of the clone;
        // assumes Word did not split a marker across several runs
        foreach (Text text in paraClone.Descendants<Text>())
        {
            text.Text = templateRegex.Replace(text.Text, "");
            text.Text = wordPlacementRegex.Replace(text.Text, pair.Key);
            text.Text = translationPlacementRegex.Replace(text.Text, pair.Value);
        }
        // inserting before the template keeps the dictionary's order
        paragraphTemplate.InsertBeforeSelf(paraClone);
    }
    paragraphTemplate.Remove();
    // write the modified body back to the package
    mainPart.Document.Save();
}
OpenXML is absolutely better, because it is faster, has fewer bugs, and is more reliable and flexible at runtime (especially in a server environment). It's also not really difficult to find out how to build a given element using OpenXML. As a docx file is just a zip file with XML files inside, I open it and read the XML to see how Word itself does it. First, I create a document and format it in Word (in your case, a file with two columns and bold words inside), save it, and rename it to a .zip file. Then I open it, go to the "word" directory inside, and open the file "document.xml". This file contains the essential part of the XML, and by looking at it, it's not difficult to figure out how to recreate it in OpenXML.
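For the layout in the question, inspecting document.xml this way typically shows a bold run as a run whose properties contain a bold element, and the column count as a cols element inside the section properties at the end of the body. A minimal sketch of generating that with the Open XML SDK might look like the following; it assumes a reference to the DocumentFormat.OpenXml package, and GlossaryWriter is a made-up name.

```csharp
using System.Collections.Generic;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class GlossaryWriter
{
    // Writes one "word = translation" paragraph per entry, with the word in bold,
    // into a new two-column document at the given path.
    public static void Write(string path, IEnumerable<KeyValuePair<string, string>> entries)
    {
        using (var doc = WordprocessingDocument.Create(path, WordprocessingDocumentType.Document))
        {
            MainDocumentPart mainPart = doc.AddMainDocumentPart();
            var body = new Body();

            foreach (var entry in entries)
            {
                // first run: the word, with bold run properties
                var boldRun = new Run(
                    new RunProperties(new Bold()),
                    new Text(entry.Key));

                // second run: " = translation"; keep the leading space
                var plainRun = new Run(
                    new Text(" = " + entry.Value) { Space = SpaceProcessingModeValues.Preserve });

                body.AppendChild(new Paragraph(boldRun, plainRun));
            }

            // section properties go last in the body; two text columns
            body.AppendChild(new SectionProperties(
                new Columns { ColumnCount = new Int16Value((short)2) }));

            mainPart.Document = new Document(body);
            mainPart.Document.Save();
        }
    }
}
```

Since the whole document is built in memory and written once, this avoids the per-paragraph COM calls that make the Interop loop slow for large collections.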
Open XML is a much better option than Office COM. The problem is that it is a low-level file format library that, unlike Office COM, doesn't work at a high level of abstraction. You might want to go that route, but I recommend that you first consider a commercial library that gives you the benefits of a high-level DOM without the need to have MS Word installed on the production machine. Our company recently purchased this toolkit, which allows you to use a template-based approach as well as a DOM/programmatic approach to generate, modify and create documents.
I am trying to write code (in C#) that can search for any plain-text word or phrase in a markdown file. Currently I'm doing this by a long-winded method: convert the markdown to HTML, strip HTML element tags out of the HTML text and then use a simple regular expression to search that for the word/phrase in question. Needless to say, this can be pretty slow.
A concrete example might show the problem. Say the markdown file contains
Something ***significant***
I would like to be able to find that by providing the search phrase something significant (i.e. ignoring the ***'s).
Is there an efficient way of doing this (i.e. that avoids the conversion to HTML) and doesn't involve me writing my own markdown parser?
Edit:
I want a generic way to search for any text or phrase in markdown text that contains any valid markdown formatting. The first answers were ways to match the specific text example I gave.
Edit:
I should have made it clear: this is required for a simple user-facing search and the markdown files could contain any valid markdown formatting. For this reason I need to be able to ignore anything in the markdown that the user wouldn't see as text if they converted the markdown to HTML. E.g. the markdown text that specifies an image (like ![Valid XHTML](http://w3.org/Icons/valid-xhtml10)) should be skipped during the search. Converting to HTML produces decent results for the user because it then reasonably accurately reflects what a user sees (but it's just a slow solution, especially when there's a lot of markdown text to look through).
Use a regexp
var str = "Something ***significant***";
var regexp = new Regex("Something.+significant.+");
Console.WriteLine(regexp.Match(str).Success);
I want to do the same thing. I can think of one way to achieve that.
Your method has two steps.
Get the plain text out of the markdown source (which itself has two steps: Markdown -> HTML, then HTML -> stripped plain text)
Search within the plain text
Now, if the markdown source is persisted in a data store, then you may be able to also persist the plain text for search purposes only. So the step to extract the plain text from the markdown may be executed only once when persisting the markdown source (or every time the markdown source is updated), but the code that actually searches in the markdown could be executed immediately on the already persisted plain text data as many times as you want.
For example, if you have a relational DB with a column like markdown_text, you could also create a plain_text column and recreate its value every time the markdown_text column is changed.
Users won't mind if saving their markdown takes a few milliseconds (or even seconds) longer than before. Users tend to feel safe when something that alters the system's state takes some time (they feel that something is actually happening in the system), rather than happening immediately (they feel that something went wrong and their command did not execute). But they will feel frustrated if searching takes more than a few ms to complete. In general, users want queries to complete immediately but commands to take some time (not more than a few seconds, though).
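The idea of paying the conversion cost on save rather than on search can be sketched in memory as follows. This is only an illustration under assumptions: MarkdownNote is a made-up type standing in for a database row with markdown_text and plain_text columns, and the conversion function is passed in by the caller (for example, the existing Markdown -> HTML -> stripped text pipeline).

```csharp
using System;

public sealed class MarkdownNote
{
    private readonly Func<string, string> _toPlainText;
    private string _markdown = "";

    // toPlainText is whatever (possibly slow) conversion the application already has
    public MarkdownNote(Func<string, string> toPlainText)
    {
        _toPlainText = toPlainText;
    }

    // Cached plain text, refreshed only when the markdown changes
    public string PlainText { get; private set; } = "";

    public string Markdown
    {
        get { return _markdown; }
        set
        {
            _markdown = value;
            PlainText = _toPlainText(value); // the slow step runs once per save
        }
    }

    // Searches only the cached plain text, so no conversion happens here
    public bool Contains(string phrase)
    {
        return PlainText.IndexOf(phrase, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

For example, with a stand-in converter such as md => md.Replace("*", ""), setting Markdown to "Something ***significant***" makes Contains("something significant") return true without re-running the conversion on each search.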
Try this:
string input = "Something ***significant***";
string v = input.Replace("***", "");
Console.WriteLine(v);
I have been having trouble finding a solution to this problem.
I am parsing the content of a number of ebooks, finding specific terms and characters, marking the locations and lengths of each term.
A normal case would be something like this (excerpts from A Game of Thrones):
"When he paused to look down, his head swam dizzily and he felt his fingers slipping. Bran cried out and clung for dear life."
If we are searching for the character "Bran", its location is 85 and length is 4. Easy enough.
My issue arises when there is a paragraph like this:
<span height="-0em"><font size="7">D</font></span>aenerys Targaryen wed Khal Drogo
We need to match "Daenerys Targaryen". It is easy enough to strip the HTML and match the string, but in this example the result needs to include the HTML. Thus the expected result here would be location = 0, length = 67.
Another situation, caused by random anchor tags scattered throughout:
Did anyone outside the Vale even suspect where Catelyn <a></a>Stark had taken him?
Again, searching for "Catelyn Stark" needs to include the HTML, so location = 47, length = 20.
I have been able to get around it temporarily by adding those specific cases (searching for "Catelyn <a></a>Stark" specifically), but clearly I should have a more robust solution, which I cannot seem to get my head around. My attempts have used RegEx, but with limited success.
I have found various questions regarding HTML matching/stripping (and whether or not to use RegEx =)), but this case seems to be somewhat unique.
Stripping the tags isn't an option as the content must be preserved.
This is within a stand-alone C# application.
Any ideas, steps in the right direction, or similar examples should your search go better than mine would be greatly appreciated!
One possible approach would be to insert the following between each letter in your search string:
(?:<[^>]*>)*
So when searching for the character "Bran" your regex would become the following:
(?:<[^>]*>)*B(?:<[^>]*>)*r(?:<[^>]*>)*a(?:<[^>]*>)*n
This will allow your regex to match any number of HTML tags anywhere within the search string. Note that this will only work if your search strings are always something simple like a character's name, and not regular expressions (this method will fail if there is repetition like a* in your search string).
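Building that pattern by hand for every name is tedious, so it can be generated. A minimal sketch, assuming plain search terms as described above: each character is escaped with Regex.Escape and prefixed with the optional-tags group. The names TagTolerantSearch and Build are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagTolerantSearch
{
    // Matches any run of HTML tags, including none at all
    private const string Tags = "(?:<[^>]*>)*";

    // Builds a regex that finds the term even when tags are interleaved
    // between (or directly in front of) its characters.
    public static Regex Build(string term)
    {
        var pieces = term.Select(c => Tags + Regex.Escape(c.ToString()));
        return new Regex(string.Concat(pieces));
    }
}
```

For the sentence from the question, TagTolerantSearch.Build("Catelyn Stark").Match(html) reports Index 47 and Length 20, and for the "Daenerys Targaryen" paragraph the match starts at 0 with length 67, because the leading tags are included in the match.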
I would create a function that takes "Daenerys Targaryen" as a parameter and strips the first letter. It would then search only for "aenerys Targaryen", and if that is found, it would search for ">D<" or whatever the first letter is. Does that make sense?
Example:
public static string searchFor(string str)
{
// strip first letter of search string (in this case "D")
// search for the rest of the string ("aenerys Targaryen")
// if found, search for ">D<"
// if found, search for HTML tags with "D" inside (using regex)
// if found, search for HTML tags with the previous HTML tag in them (using regex)
return result;
}
Well, using JavaScript or PHP you can get the text of elements and the text of documents, search there, and then use a regex to return the closest match (containing the HTML).
Another option would be to index the books first using something like the Lucene search engine (which happens to let you index different formats, HTML being one of them).
You can then use the Lucene API to search your documents a little more easily.
In PHP we have Zend_Search_Lucene, which works perfectly for this kind of thing.
Lucene Search can be found at:
http://lucene.apache.org/core/
Have fun!
I'm attempting to write an application to extract properties and code from proprietary IDE design files. The file format looks something like this:
HEADING
{
SUBHEADING1
{
PropName1 = PropVal1;
PropName2 = PropVal2;
}
SUBHEADING2
{
{ 1 ; PropVal1 ; PropValue2 }
{ 2 ; PropVal1 ; PropValue2 ; OnEvent1=BEGIN
MESSAGE('Hello, World!');
{ block comments are between braces }
//inline comments are after double-slashes
END;
PropVal3 }
{ 1 ; PropVal1 ; PropVal2; PropVal3 }
}
}
What I am trying to do is extract the contents under the subheading blocks. In the case of SUBHEADING2, I would also separate each token as delimited by the semicolons. I had reasonably good success with just counting the brackets and keeping track of what subheading I'm currently under. The main issue I encountered involves dealing with the code comments.
This language happens to use {} for block comments, which interferes with the brackets in the file format. To make it even more interesting, it also needs to take into account double-slash inline comments and ignore everything up to the end of the line.
What is the best approach to tackling this? I looked at some of the compiler libraries discussed in another article (ANTLR, Doxygen, etc.) but they seem like overkill for solving this specific parsing issue.
I'd suggest writing a tokenizer and parser; this will give you more flexibility. The tokenizer basically does a simple text-wise breakdown of the source code and puts it into a more usable data structure; the parser figures out what to do with it, often leveraging recursion.
Terms to google: tokenizer, parser, compiler design, grammars
Math expression evaluator: http://www.codeproject.com/KB/vb/math_expression_evaluator.aspx
(you might be able to take an example like this and hack it apart into what you want)
More info about parsing: http://www.codeproject.com/KB/recipes/TinyPG.aspx
You won't have to go nearly as far as those articles go, but, you're going to want to study a bit on this one first.
You should be able to put something together in a few hours, using regular expressions in combination with some code that uses the results.
Something like this should work:
- Initialize the process by loading the file into a string.
- Pull each top-level block from the string, using regex tags to separately identify the block keyword and contents.
- If a block is found:
  - Make a decision based on the keyword.
  - Pass the content to this process recursively.
Following this, you would process HEADING, then the first SUBHEADING, then the second SUBHEADING, then each sub-block. For the sub-block containing the block comment, you would presumably know based on the block's lack of a keyword that any sub-block is a comment, so there is no need to process the sub-blocks.
No matter which solution you choose, I'm pretty sure the best way is to have 2 parsers/tokenizers: one for the main file structure with {} as grouping characters, and one for the code blocks.
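A rough sketch of the two-scanner idea: the outer scanner counts structural braces, and as soon as it meets BEGIN it hands over to an inner scanner that understands comments and string literals and returns once the matching END is reached. This rests on several assumptions not stated in the question: code always starts with the keyword BEGIN and ends with a matching END, nested BEGIN/END pairs balance (constructs such as CASE ... END would need extra handling), and string literals use single quotes. All names are hypothetical.

```csharp
using System;

public static class DesignFileScanner
{
    // Returns the index of the '}' closing the '{' at openIndex, or -1 if unbalanced.
    public static int FindClosingBrace(string text, int openIndex)
    {
        int depth = 0;
        int i = openIndex;
        while (i < text.Length)
        {
            if (IsWordAt(text, i, "BEGIN"))
            {
                i = SkipCode(text, i + 5); // braces in code are comments, not structure
                continue;
            }
            char c = text[i];
            if (c == '{') depth++;
            else if (c == '}' && --depth == 0) return i;
            i++;
        }
        return -1;
    }

    // Starts just after BEGIN; returns the index just past the matching END.
    private static int SkipCode(string text, int i)
    {
        int nesting = 1;
        while (i < text.Length)
        {
            char c = text[i];
            if (c == '{') i = SkipPast(text, i, '}');                  // block comment
            else if (c == '/' && i + 1 < text.Length && text[i + 1] == '/')
                i = SkipPast(text, i, '\n');                           // line comment
            else if (c == '\'') i = SkipPast(text, i + 1, '\'');       // string literal
            else if (IsWordAt(text, i, "BEGIN")) { nesting++; i += 5; }
            else if (IsWordAt(text, i, "END"))
            {
                i += 3;
                if (--nesting == 0) return i;
            }
            else i++;
        }
        return i;
    }

    // Returns the index just past the next occurrence of stop, or the end of the text.
    private static int SkipPast(string text, int from, char stop)
    {
        int end = text.IndexOf(stop, from);
        return end < 0 ? text.Length : end + 1;
    }

    // True if word occurs at i and is not part of a longer identifier.
    private static bool IsWordAt(string text, int i, string word)
    {
        if (string.CompareOrdinal(text, i, word, 0, word.Length) != 0) return false;
        int after = i + word.Length;
        bool startOk = i == 0 || !char.IsLetterOrDigit(text[i - 1]);
        bool endOk = after >= text.Length || !char.IsLetterOrDigit(text[after]);
        return startOk && endOk;
    }
}
```

Given the index of an opening brace, FindClosingBrace returns where that block ends, so the text between the two indexes can be split on semicolons or scanned recursively for sub-blocks, without the comment braces inside the code throwing off the count.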