Is it possible to use regular expressions to define syntax highlighting in Scintilla? And if so, how do I do it?
I have a custom language to process which cannot be described in simple terms of keywords and delimiters. The meaning of particular structures in this language depends only on their position relative to keywords. I have a regular-expression-based parser for this format; all I need is to apply regular-expression-defined rules as text styles.
I mean if something matches regex1, it should have style1. Is it possible? How?
If not, can I set styles for manually selected ranges? I mean assigning a style number to a specified character range in the editor. How do I do that?
Is it possible to define Scintilla styles in code, rather than in an XML file?
EDIT:
OK, I've found a way.
// Style every Keyword0 match. RawText is the UTF-8 byte buffer, so the match
// indices are valid byte offsets only while the text is pure ASCII.
foreach (Match m in Patterns.Keyword0.Matches(Encoding.ASCII.GetString(e.RawText)))
    e.GetRange(m.Index, m.Index + m.Length).SetStyle(1);
The problem is the RawText property. It's a byte buffer of UTF-8 encoded text. The Text property contains nice UTF-16 text, but the GetRange method accepts byte offsets, not character offsets. If I do the conversion on every TextChanged event, I lose almost all of the speed advantage of using Scintilla.
Of course the easiest way would be to change the internal encoding to UTF-16, but when I do that, I get an exception saying the encoding is not supported. The only one supported seems to be UTF-8, which is ridiculously hard (and slow) to process.
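One way to keep most of the speed is to run the regex over the UTF-16 text and translate character offsets to byte offsets incrementally, so each span is encoded only once per pass instead of re-encoding the whole buffer. A rough sketch building on the snippet above (e.GetRange/SetStyle and Patterns.Keyword0 are from that snippet; whether the UTF-16 text comes from e.Text or the control's Text property depends on the wrapper version):
// assumes: using System.Text; using System.Text.RegularExpressions;
string text = e.Text;                       // UTF-16 view of the document (assumed available)
char[] chars = text.ToCharArray();
int charPos = 0, bytePos = 0;               // running character -> UTF-8 byte offset mapping
foreach (Match m in Patterns.Keyword0.Matches(text))
{
    // advance over the unstyled gap, then measure the match itself
    bytePos += Encoding.UTF8.GetByteCount(chars, charPos, m.Index - charPos);
    int byteLen = Encoding.UTF8.GetByteCount(chars, m.Index, m.Length);
    e.GetRange(bytePos, bytePos + byteLen).SetStyle(1);
    bytePos += byteLen;
    charPos = m.Index + m.Length;
}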
I'm hitting a wall here.
The key to this is to set the lexer to SCLEX_CONTAINER and then handle the SCN_STYLENEEDED notification. This means you only ever have to process the text that actually needs styling.
There are several guides linked at the top of the Scintilla documentation that detail various aspects of implementing custom lexers, so I won't repeat any of that here.
As for performance: I've written custom Scintilla lexers in Python that decode the UTF-8 text when styling and have never noticed any significant issues, so I'd be amazed if you couldn't at least match that using C#.
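For a rough idea of the shape this takes, here is a sketch against a newer ScintillaNET wrapper (hedged: the names used here, Lexer.Container, StyleNeeded, GetEndStyled, StartStyling, SetStyling, GetTextRange, are from the 3.x API and differ in older versions; Patterns.Keyword0 stands in for whatever regex rules the container lexer applies):
scintilla.Lexer = Lexer.Container;           // SCLEX_CONTAINER: the application does the styling
scintilla.StyleNeeded += (sender, e) =>
{
    // Scintilla only asks for the range from the end of the styled text up to e.Position.
    int startPos = scintilla.Lines[scintilla.LineFromPosition(scintilla.GetEndStyled())].Position;
    int endPos   = e.Position;
    string chunk = scintilla.GetTextRange(startPos, endPos - startPos);

    scintilla.StartStyling(startPos);
    scintilla.SetStyling(endPos - startPos, 0);            // default style for the whole chunk...
    foreach (Match m in Patterns.Keyword0.Matches(chunk))  // ...then restyle the regex matches
    {
        scintilla.StartStyling(startPos + m.Index);
        scintilla.SetStyling(m.Length, 1);
    }
};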
In C#, the StringInfo and TextElementEnumerator classes provide methods and properties for working with text elements.
And here we can find the definition of a text element:
The .NET Framework defines a text element as a unit of text that is displayed as a single character, that is, a grapheme. A text element can be any of the following:
Yes, it says a text element is a grapheme in .NET. I also tested it with some Unicode characters myself, and it really seemed to be true until I tried the Korean letter '가'.
As we all know, some Unicode characters consist of multiple code points. Since we may face such code point sequences, I'm using StringInfo and TextElementEnumerator instead of plain String.
StringInfo and TextElementEnumerator correctly reported surrogate pairs. And "\u0061\u0308", a Unicode character which consists of multiple code points, was recognized as one text element just as expected. But for "\u1100\u1161", it failed to report that this, too, is one text element.
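A minimal repro of those two checks, for reference (this reflects the .NET Framework behaviour described here; newer runtimes have since changed grapheme handling, so results may differ):
using System;
using System.Globalization;

class TextElementTest
{
    static void Main()
    {
        // base letter + combining diaeresis: one text element, as expected
        Console.WriteLine(new StringInfo("\u0061\u0308").LengthInTextElements);   // 1
        // Hangul lead consonant + vowel: reported as 2 on .NET Framework,
        // even though it renders as the single syllable "가"
        Console.WriteLine(new StringInfo("\u1100\u1161").LengthInTextElements);   // 2
    }
}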
"\u1100" is a leading letter "ㄱ", and "\u1161" is a vowel letter "ㅏ". They can be individual characters and shown to the users just as I write here and you can see them now. But if they are used together, they are rendered as one character "가" instead of "ㄱㅏ".
There are two ways to represent the Korean character "가":
Using a single code point U+AC00 from Hangul Syllable.
Using two code points U+1100 and U+1161 from Jamo.
Most of the time the former is used. The latter is rarely used; to be honest, I can't imagine when it would be used at all.
Anyway, the first is just one precomposed letter and the second is a sequence of a lead consonant and a vowel which is treated as one character. When rendered they look exactly the same, and the two forms are in fact canonically equivalent.
Also, the following expression returns true in C#:
"\u1100\u1161".Normalize() == "\uAC00"
I wonder why Normalize() works just fine here when C# doesn't think they form one complete text element.
I thought it had something to do with my .NET version, but it turns out that's not the case; the same thing happens in Mono too.
I tested this with ICU as well, and it could treat "\u1100\u1161" as one grapheme correctly!
I initially thought StringInfo and TextElementEnumerator could eliminate the need for ICU4C in some simple cases, so I'm very disappointed now.
Here's my question :
Am I doing something wrong here?
or
Is a text element in .NET not a user-perceived character, unlike in ICU?
The basic issue here is that per the Korean standard KS X 1026, the two jamos ㄱ and ㅏ are distinct from their combined form 가. In fact, this exact example is used in the official standard (see section 6.2).
Long story short, Microsoft attempted to follow the standard but other operating systems and applications don't necessarily do so. Hence you can get "malformed" content from other software / platforms that appears to be parsed incorrectly on Windows / in .NET, even though it is parsed "correctly" on those platforms.
You will either need to ensure your data is correctly formed in the first place (unlikely, given that the de-facto standard is to completely ignore the official standard) or you will need to use ICU (or a similar library) to deal with these cases.
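If pulling in ICU is overkill, one partial workaround (building on the Normalize() observation above, and only covering composed/decomposed equivalence, not every KS X 1026 nuance) is to normalize to NFC before doing any text-element work:
using System;
using System.Globalization;
using System.Text;

string s = "\u1100\u1161";                                     // decomposed jamo pair
string nfc = s.Normalize(NormalizationForm.FormC);             // composes to "\uAC00" ("가")
Console.WriteLine(new StringInfo(nfc).LengthInTextElements);   // 1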
When adding a ToolStripMenuItem to a form, setting RightToLeft to Yes, and having a quote at the end of the text, why does it place the quote at the front of the text?
ToolStripMenuItem1.Text = "Name \"Text\"";
ToolStripMenuItem1.RightToLeft = System.Windows.Forms.RightToLeft.Yes;
Displays as; "Name "Text
Edit: This also happens with single quotes.
By setting RightToLeft to Yes, you are asking the Windows text rendering engine to apply the text layout rules used in right-to-left languages such as Arabic and Hebrew. Those rules are pretty subtle, especially because English phrases are not uncommon in those languages. It is not going to render "txeT emaN" as it normally does with Arabic or Hebrew glyphs; that wouldn't make sense to anybody. It needs to identify sentences or phrases and reverse those. Quotes are special: they delimit a phrase.
Long story short, you are abusing a feature to get alignment when it was really meant to do something far more involved. Don't use it for that.
EDIT:
If you just want to change text alignment, there is always ToolStripItem.Alignment or ToolStripItem.Padding to try...
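A quick sketch of those two routes (Alignment positions the whole item within its parent strip; Padding just adds space around it):
// right-align the item inside its ToolStrip/MenuStrip
ToolStripMenuItem1.Text = "Name \"Text\"";
ToolStripMenuItem1.Alignment = ToolStripItemAlignment.Right;
// or simply pad it to push the text over, leaving the layout direction alone
ToolStripMenuItem1.Padding = new Padding(20, 0, 0, 0);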
The feature you are using is meant for localization and can support mixed content...
The way you use that feature seems more like abuse... since Windows needs to make sense of "mixed content" of which you provide an extreme case (no Arabic at all).
I always find it hard to be sure that such behaviour, though unexpected and unintuitive, is really a bug...
Rendering logic for BiDi text is rather complex - for example, an exclamation mark somewhere in text that should be rendered right-to-left can reverse the direction, depending on the implementation...
For some insight see http://www.unicode.org/reports/tr9/
As it seems, .NET is not fully compliant with the Unicode BiDi algorithm... there is even a library that tries to implement it; see http://sourceforge.net/projects/nbidi/
I am using the following regex:
(<(table|h[1-6])[^>]*>(?<op>.+?)<\/(table|h[1-6])>)
to extract tables (and headings) from an HTML document.
I've found it works quite well on the documents we are using (documents converted with Word's "Save as filtered HTML"). However, if a table contains another table inside it, the regex pairs the outer table's start tag with the inner table's end tag rather than with the outer table's end tag.
Is there a way in a regex to specify that if it finds another opening table tag within the match, it should ignore the next closing tag and go for the one after it, and so on?
Don't do this.
HTML is not a regular grammar, so a regular expression is not a good tool with which to parse it. What you are asking for in your last sentence is a contextual parser, not a regular expression. Bare regular-expression parsing is too likely to fail to parse HTML correctly to be responsible coding.
HtmlAgilityPack is an MsPL-licensed solution I've used in the past that has widely acceptable license terms and provides a well-formed DOM which can be probed with XPath or manipulated in other useful ways ("extract all text, dropping the tags" is a popular one for importing HTML mail for search, for example, and is nigh trivial once a DOM parser has ripped through the HTML, leaving you to code only the part that adds value for your specific business case).
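For instance, something along these lines (a sketch; the XPath union mirrors the table/h1-h6 capture the regex was attempting, and 'html' is assumed to hold the Word-exported markup):
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode.SelectNodes("//table | //h1 | //h2 | //h3 | //h4 | //h5 | //h6");
if (nodes != null)
{
    foreach (var node in nodes)
    {
        // OuterHtml keeps the element intact, nested tables and all;
        // nested tables also show up as nodes of their own.
        Console.WriteLine(node.OuterHtml);
    }
}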
Is there a way in a regex to specify that if it finds another opening table tag within the match, it should ignore the next closing tag and go for the one after it, and so on?
Since nobody's actually answered this part, I will—No.
This is part of what makes regular languages "regular". A regular language is one that can be recognized by a certain regular grammar, often described in a syntax that looks very much like basic regular expressions (10* to match a 1 followed by any number of 0s), or by a DFA. "Regular expressions" are based strongly on these regular languages, as their name implies, but add features such as lookaheads and lookbehinds. As a general rule, a regular language knows nothing about what's around it or what it has seen, only what it's looking at currently and which of its finitely many states it's in.
TL;DR: Why does this matter to you? Since a regular language cannot "count" elements in that way, it is impossible to keep a tally of the number of <table> and </table> elements you have seen. An HTML parser does just that: since it is not trying to emulate a regular language, it can count the number of opening and closing tags it sees.
This is the prime example of why it's best not to use regular expressions to parse HTML; even though you know how it may be formed, you cannot parse it since there may be nested elements. If you could guarantee there would be no nested tables, it may be feasible to do this, but even then, using a parser would be much simpler.
Plea to the theoretical computer scientists: I did my best to explain what I know from the CS Theory classes I've taken in a way that most people here should be able to understand. I know that regular languages can "count" finite numbers of things. Feel free to correct me, but please be kind!
Regular expressions are not really suited for this, as what you're trying to do requires knowing that this is a nested language. Without that knowledge it will be really hard (and the result hard to read and maintain) to extract this information.
Maybe do something with an XPath navigator?
I am wondering if it is possible to extract the index position in a given string where a Regex failed when trying to match it?
For example, if my regex was "abc" and I tried to match that with "abd" the match would fail at index 2.
Edit for clarification: The reason I need this is to simplify the parsing component of my application. The application is an Assembly language teaching tool which allows students to write, compile, and execute assembly-like programs.
Currently I have a tokenizer class which converts input strings into Tokens using regex's. This works very well. For example:
Given the input "INP :x:", the tokenizer produces the following tokens:
Token.OPCODE, Token.WHITESPACE, Token.LABEL, Token.EOL
These tokens are then analysed to ensure they conform to the syntax for a given statement. Currently this is done using if statements and is proving cumbersome. The upside of this approach is that I can provide detailed error messages, e.g.:
if(token[2] != Token.LABEL) { throw new SyntaxError("Expected label");}
I want to use a regular expression to define a syntax instead of the annoying IF statements. But in doing so I lose the ability to return detailed error reports. I therefore would at least like to inform the user of WHERE the error occurred.
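For what it's worth, one hedged middle ground (not from the original post) is to drive the per-position checks from a small table instead of a chain of ifs; the failing index and a tailored message then fall out of the loop, at the cost of not literally being a regex:
// Illustrative only: 'tokens' is assumed to be the tokenizer's List<Token> output,
// SyntaxError is the exception type used above, and the table is an invented example rule.
var syntax = new[]
{
    (Expected: Token.OPCODE,     Message: "Expected opcode"),
    (Expected: Token.WHITESPACE, Message: "Expected whitespace"),
    (Expected: Token.LABEL,      Message: "Expected label"),
    (Expected: Token.EOL,        Message: "Expected end of line"),
};
for (int i = 0; i < syntax.Length; i++)
{
    if (i >= tokens.Count || tokens[i] != syntax[i].Expected)
        throw new SyntaxError($"{syntax[i].Message} at token {i}");
}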
I agree with Colin Younger; I don't think it is possible with the existing Regex class. However, I think it is doable if you are willing to sweat a little:
Get the Regex class source code (e.g. use http://www.codeplex.com/NetMassDownloader to download the .NET source).
Change the code to add a readonly property exposing the failure index.
Make sure your code uses that Regex rather than Microsoft's.
I guess such an index would only have meaning in some simple cases, like in your example.
If you take a regex like "ab*c*z" (where by * I mean any characters) and the string "abbbcbbcdd", what should the index you are talking about be?
It will depend on the algorithm used for matching...
It could fail on "abbbc..." or on "abbbcbbc..."
I don't believe it's possible, but I am intrigued why you would want it.
In order to do that you would need either callbacks embedded in the regex (which AFAIK C# doesn't support) or preferably hooks into the regex engine. Even then, it's not clear what result you would want if backtracking was involved.
It is not possible to tell where a regex fails. As a result you need to take a different approach: compare strings. Use a regex to remove all the parts that can vary, and compare the result with the part you know does not change.
I ran into the same problem, came across your answer, and had to work out my own solution. Here it is:
https://stackoverflow.com/a/11730035/637142
hope it helps
PHP has a great function called htmlspecialchars() where you pass it a string and it replaces all of HTML's special characters with their safe equivalents; it's almost a one-stop shop for sanitizing input. Very nice, right?
Well, is there an equivalent in any of the .NET libraries?
If not, can anyone link to any code samples or libraries that do this well?
Try this.
var encodedHtml = HttpContext.Current.Server.HtmlEncode(...);
System.Web.HttpUtility.HtmlEncode(string)
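For example (assuming a reference to System.Web):
using System.Web;

string raw  = "<script>alert(\"x & y\")</script>";
string safe = HttpUtility.HtmlEncode(raw);
// safe == "&lt;script&gt;alert(&quot;x &amp; y&quot;)&lt;/script&gt;"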
Don't know if there's an exact replacement, but there is a method, HttpUtility.HtmlEncode, that replaces special characters with their HTML equivalents. A close cousin is HttpUtility.UrlEncode for encoding URLs. You could also use validator controls like RegularExpressionValidator, RangeValidator, and System.Text.RegularExpressions.Regex to make sure you're getting what you want.
Actually, you might want to try this method:
HttpUtility.HtmlAttributeEncode()
Why? Citing the HtmlAttributeEncode page in the MSDN docs:
The HtmlAttributeEncode method converts only quotation marks ("), ampersands (&), and left angle brackets (<) to equivalent character entities. It is considerably faster than the HtmlEncode method.
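So it fits when the value is going inside a quoted attribute, for example:
// only ", & and < are converted, per the documentation quoted above
string title = HttpUtility.HtmlAttributeEncode("He said \"5 < 6\" & left");
// title == "He said &quot;5 &lt; 6&quot; &amp; left"
string tag = "<a title=\"" + title + "\">link</a>";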
In addition to the given answers:
When using the Razor view engine (the default view engine in ASP.NET MVC), using the '@' character to display values automatically encodes the displayed value. This means that you don't have to encode explicitly.
On the other hand, when you don't want the text to be encoded, you have to say so explicitly (by using @Html.Raw). Which is, in my opinion, a good thing from a security point of view.
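A tiny Razor sketch of both cases (Model.UserComment and Model.TrustedHtml are made-up properties for illustration):
@* encoded automatically: "<b>hi</b>" renders as &lt;b&gt;hi&lt;/b&gt; *@
<p>@Model.UserComment</p>
@* rendered verbatim - only do this for markup you trust *@
<p>@Html.Raw(Model.TrustedHtml)</p>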