Remove inline styles from innerHtml using HtmlAgilityPack

I am parsing a web page to return all the unique sentences on the page, each with a minimum of two words. It almost works. The following appears as one sentence on the page; however, my code separates the text in the <b></b> tags from the rest of the sentence. How do I ignore inline tags such as <b> or <strong> so that the sentence is returned as it appears in the browser, including the text inside those tags?
Currently it returns "NHL Playoffs" as one line of text and "Takeaways: Sharks beat Penguins for first Stanley Cup Final win" as a second sentence, when it is really just one sentence.
<span class="titletext"><b>NHL Playoffs</b> Takeaways: Sharks beat Penguins for first Stanley Cup Final win</span>
Here is my asp.net vb.net code (c# solution is fine).
Public Shared Function validateIsMoreThanOneWord(input As String, numberWords As Integer) As Boolean
If String.IsNullOrEmpty(input) Then
Return False
End If
Return (input.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries).Length >= numberWords)
End Function
Private Sub form1_Load(sender As Object, e As EventArgs) Handles form1.Load
Try
Dim html = New HtmlDocument()
html.LoadHtml(New WebClient().DownloadString("http://news.google.ca/nwshp?hl=en&ei=4H1UV7-NNOfCjwTAl4bABw&ved=0EKkuCAkoBw"))
Dim root = html.DocumentNode
Dim myList As New List(Of String)()
For Each node As HtmlNode In root.Descendants().Where(Function(n) n.NodeType = HtmlNodeType.Text AndAlso n.ParentNode.Name <> "script" AndAlso n.ParentNode.Name <> "style" AndAlso n.ParentNode.Name <> "css")
If Not node.HasChildNodes Then
Dim text As String = HttpUtility.HtmlDecode(node.InnerText)
If Not String.IsNullOrEmpty(text) And Not String.IsNullOrWhiteSpace(text) Then
If validateIsMoreThanOneWord(text.Trim(), 2) Then
myList.Add(text.Trim())
End If
End If
End If
Next
'remove dups from array and other stuff
Dim q As String() = myList.Distinct().ToArray()
For i As Integer = 0 To UBound(q)
Response.Write(q(i).Trim() & "<br/>")
Next
Response.Write(q.Count)
Catch ex As Exception
Response.Write(ex.Message)
End Try
End Sub
Hope you can shed some light on a solution. Thanks!

Since you are looping over every text node under the root whose parent is not <script>, <style> or css, each child node of a .titletext span is indeed treated as a separate piece of text.
What you want is to retrieve the InnerText of each .titletext entry.
The following is what I would do in C#; it should give you the idea of what you need to do.
HtmlWeb w = new HtmlWeb();
var htmlDoc = w.Load("http://news.google.ca/nwshp?hl=en&ei=4H1UV7-NNOfCjwTAl4bABw&ved=0EKkuCAkoBw");
// XPath attribute tests use '@'; SelectNodes returns null when nothing matches
var textTitles = htmlDoc.DocumentNode.SelectNodes("//span[@class='titletext']");
//for testing purposes
if (textTitles != null)
    foreach (var textTitle in textTitles)
        Console.WriteLine(textTitle.InnerText);
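To see the same idea outside HtmlAgilityPack, here is a minimal sketch in Python using only the standard library's html.parser. It is an illustration rather than the answer's method: the names TitleTextCollector and extract_titles are made up for this example, and it assumes the titles sit in span elements whose class list includes titletext. The point is that text is gathered per element, so fragments inside <b> or <strong> stay part of the same sentence.

```python
from html.parser import HTMLParser


class TitleTextCollector(HTMLParser):
    """Collects the full text of every <span class="titletext">, nested tags included."""

    def __init__(self):
        super().__init__()
        self.titles = []   # finished titles, one string per matching span
        self._depth = 0    # > 0 while the parser is inside a matching span
        self._parts = []   # text fragments of the title being collected

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Track nested spans so the right closing tag ends the title.
            if tag == "span":
                self._depth += 1
        elif tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            if "titletext" in classes:
                self._depth = 1
                self._parts = []

    def handle_endtag(self, tag):
        if self._depth and tag == "span":
            self._depth -= 1
            if self._depth == 0:
                self.titles.append("".join(self._parts).strip())

    def handle_data(self, data):
        # Text inside <b>, <strong>, etc. arrives here like any other text.
        if self._depth:
            self._parts.append(data)


def extract_titles(html):
    collector = TitleTextCollector()
    collector.feed(html)
    collector.close()
    return collector.titles
```

Feeding it the span from the question yields a single entry that contains both the bold part and the rest of the sentence.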

Related

Copy document content (including formatting and page format) to another using Word Interop in c# with 100% fidelity

I want to copy the content of a document created by the user to an existing document. The existing document content must be an exact mirror to the document created by the user.
I cannot simply copy the file using System.IO or saving a copy of the document created by the user using SaveAs methods in Word Interop. This is because the existing document is a document that is generated from a webserver and has VBA modules for uploading it back to the server.
The document generated by the webserver (existing document) is a Word 2003 document, but the document created by the user is either a Word 2003 document or Word 2007+.
Having these limitations in mind, I first created the following method:
string tempsave = //location of user created document;
string savelocation = //location of existing document;
Word.Application objWordOpen = new Word.Application();
Document doclocal = objWordOpen.Documents.Open(tempsave);
Document d1 = objWordOpen.Documents.Open(savelocation);
Word.Range oRange = doclocal.Content;
oRange.Copy();
d1.Activate();
d1.UpdateStyles();
d1.ActiveWindow.Selection.WholeStory();
d1.ActiveWindow.Selection.PasteAndFormat(Word.WdRecoveryType.wdFormatOriginalFormatting);
This generally works. However, the tables are messed up.
Also, if there is a page break, the output in the existing document differs from the user-created document.
Also, a paragraph mark is added at the end of the existing document that is not in the user-created document.
The page format is also messed up: the output has mirror margins set up.
I have also tried using Range.Insert() method and setting the range without copying as described here https://stackoverflow.com/a/54500605/10468231, but I am still having these issues.
I have also tried adding the VBA modules to the document, but there are also Document Variables and other custom properties and I don't want to mess with the file being uploaded to the server.
How do I handle these issues? Both the documents are based on Normal template.
I am open to another suggestion regarding this topic, but I know that .doc files are not handled as easily as .docx format, this is why I think I am stuck with COM Interop.
Thank you.
UPDATE
Based on the macropod code posted by Charles Kenyon, I have managed to copy more of the formatting from the source to the target. Still, there is a difference at the page break: the paragraph mark is placed on the new page instead of on the same page.
Also, the text is slightly larger, even though the Font Size is the same.
Word.Range oRange;
oRange = Source.Content;
Target.Content.FormattedText = oRange.FormattedText;
LayoutTransfer(Source, Target);
LayoutTransfer method:
private void LayoutTransfer(Document source, Document target)
{
float sPageHght;
float sPageWdth;
float sHeaderDist;
float sFooterDist;
float sTMargin;
float sBMargin;
float sLMargin;
float sRMargin;
float sGutter;
WdGutterStyle sGutterPos;
WdPaperSize lPaperSize;
WdGutterStyleOld lGutterStyle;
int lMirrorMargins;
WdVerticalAlignment lVerticalAlignment;
WdSectionStart lScnStart;
WdSectionDirection lScnDir;
int lOddEvenHdFt;
int lDiffFirstHdFt;
bool bTwoPagesOnOne;
bool bBkFldPrnt;
int bBkFldPrnShts;
bool bBkFldRevPrnt;
WdOrientation lOrientation;
foreach (Word.Section section in source.Sections)
{
lPaperSize = section.PageSetup.PaperSize;
lGutterStyle = section.PageSetup.GutterStyle;
lOrientation = section.PageSetup.Orientation;
lMirrorMargins = section.PageSetup.MirrorMargins;
lScnStart = section.PageSetup.SectionStart;
lScnDir = section.PageSetup.SectionDirection;
lOddEvenHdFt = section.PageSetup.OddAndEvenPagesHeaderFooter;
lDiffFirstHdFt = section.PageSetup.DifferentFirstPageHeaderFooter;
lVerticalAlignment = section.PageSetup.VerticalAlignment;
sPageHght = section.PageSetup.PageHeight;
sPageWdth = section.PageSetup.PageWidth;
sTMargin = section.PageSetup.TopMargin;
sBMargin = section.PageSetup.BottomMargin;
sLMargin = section.PageSetup.LeftMargin;
sRMargin = section.PageSetup.RightMargin;
sGutter = section.PageSetup.Gutter;
sGutterPos = section.PageSetup.GutterPos;
sHeaderDist = section.PageSetup.HeaderDistance;
sFooterDist = section.PageSetup.FooterDistance;
bTwoPagesOnOne = section.PageSetup.TwoPagesOnOne;
bBkFldPrnt = section.PageSetup.BookFoldPrinting;
bBkFldPrnShts = section.PageSetup.BookFoldPrintingSheets;
bBkFldRevPrnt = section.PageSetup.BookFoldRevPrinting;
var index = section.Index;
target.Sections[index].PageSetup.PaperSize = lPaperSize;
target.Sections[index].PageSetup.GutterStyle = lGutterStyle;
target.Sections[index].PageSetup.Orientation = lOrientation;
target.Sections[index].PageSetup.MirrorMargins = lMirrorMargins;
target.Sections[index].PageSetup.SectionStart = lScnStart;
target.Sections[index].PageSetup.SectionDirection = lScnDir;
target.Sections[index].PageSetup.OddAndEvenPagesHeaderFooter = lOddEvenHdFt;
target.Sections[index].PageSetup.DifferentFirstPageHeaderFooter = lDiffFirstHdFt;
target.Sections[index].PageSetup.VerticalAlignment = lVerticalAlignment;
target.Sections[index].PageSetup.PageHeight = sPageHght;
target.Sections[index].PageSetup.PageWidth = sPageWdth;
target.Sections[index].PageSetup.TopMargin = sTMargin;
target.Sections[index].PageSetup.BottomMargin = sBMargin;
target.Sections[index].PageSetup.LeftMargin = sLMargin;
target.Sections[index].PageSetup.RightMargin = sRMargin;
target.Sections[index].PageSetup.Gutter = sGutter;
target.Sections[index].PageSetup.GutterPos = sGutterPos;
target.Sections[index].PageSetup.HeaderDistance = sHeaderDist;
target.Sections[index].PageSetup.FooterDistance = sFooterDist;
target.Sections[index].PageSetup.TwoPagesOnOne = bTwoPagesOnOne;
target.Sections[index].PageSetup.BookFoldPrinting = bBkFldPrnt;
target.Sections[index].PageSetup.BookFoldPrintingSheets = bBkFldPrnShts;
target.Sections[index].PageSetup.BookFoldRevPrinting = bBkFldRevPrnt;
}
}
UPDATE 2
Actually, the page break not remaining in line with paragraph format is not an issue of copying fidelity, but rather an issue of conversion from .doc to .docx. (https://support.microsoft.com/en-us/help/923183/the-layout-of-a-document-that-contains-a-page-break-may-be-different-i)
Maybe someone thought of a method to overcome this.

The following code by Paul Edstein (macropod) may assist you. It will at least give you an idea of the complexities you are facing.
' ============================================================================================================
' KEEP NEXT THREE TOGETHER
' ============================================================================================================
'
Sub CombineDocuments()
' Paul Edstein
' https://www.msofficeforums.com/word-vba/43339-combine-multiple-word-documents.html
'
' Users occasionally need to combine multiple documents that may or may not have the same page layouts,
' Style definitions, and so on. Consequently, combining multiple documents is often rather more complex than
' simply copying & pasting content from one document to another. Problems arise when the documents have
' different page layouts, headers, footers, page numbering, bookmarks & cross-references,
' Tables of Contents, Indexes, etc., etc., and especially when those documents have used the same Style
' names with different definitions.
'
' The following Word macro (for Windows PCs only) handles the more common issues that arise when combining
' documents; it does not attempt to resolve conflicts with paragraph auto-numbering,
' document -vs- section page numbering in 'page x of y' numbering schemes, Tables of Contents or Indexing issues.
' Neither does it attempt to deal with the effects on footnote or endnote numbering & positioning or with the
' consequences of duplicated bookmarks (only one of which can exist in the merged document) and any corresponding
' cross-references.
'
' The macro includes a folder browser. Simply select the folder to process and all documents in that folder
' will be combined into the currently-active document. Word's .doc, .docx, and .docm formats will all be processed,
' even if different formats exist in the selected folder.
'
Application.ScreenUpdating = False
Dim strFolder As String, strFile As String, strTgt As String
Dim wdDocTgt As Document, wdDocSrc As Document, HdFt As HeaderFooter
strFolder = GetFolder: If strFolder = "" Then Exit Sub
Set wdDocTgt = ActiveDocument: strTgt = ActiveDocument.fullname
strFile = Dir(strFolder & "\*.doc", vbNormal)
While strFile <> ""
If strFolder & "\" & strFile <> strTgt Then
Set wdDocSrc = Documents.Open(FileName:=strFolder & "\" & strFile, AddToRecentFiles:=False, Visible:=False)
With wdDocTgt
.Characters.Last.InsertBefore vbCr
.Characters.Last.InsertBreak (wdSectionBreakNextPage)
With .Sections.Last
For Each HdFt In .Headers
With HdFt
.LinkToPrevious = False
.range.Text = vbNullString
.PageNumbers.RestartNumberingAtSection = True
.PageNumbers.StartingNumber = wdDocSrc.Sections.First.Headers(HdFt.Index).PageNumbers.StartingNumber
End With
Next
For Each HdFt In .Footers
With HdFt
.LinkToPrevious = False
.range.Text = vbNullString
.PageNumbers.RestartNumberingAtSection = True
.PageNumbers.StartingNumber = wdDocSrc.Sections.First.Headers(HdFt.Index).PageNumbers.StartingNumber
End With
Next
End With
Call LayoutTransfer(wdDocTgt, wdDocSrc)
.range.Characters.Last.FormattedText = wdDocSrc.range.FormattedText
With .Sections.Last
For Each HdFt In .Headers
With HdFt
.range.FormattedText = wdDocSrc.Sections.Last.Headers(.Index).range.FormattedText
.range.Characters.Last.Delete
End With
Next
For Each HdFt In .Footers
With HdFt
.range.FormattedText = wdDocSrc.Sections.Last.Footers(.Index).range.FormattedText
.range.Characters.Last.Delete
End With
Next
End With
End With
wdDocSrc.Close SaveChanges:=False
End If
strFile = Dir()
Wend
With wdDocTgt
' Save & close the combined document
.SaveAs FileName:=strFolder & "\Forms.docx", FileFormat:=wdFormatXMLDocument, AddToRecentFiles:=False
' and/or:
.SaveAs FileName:=strFolder & "\Forms.pdf", FileFormat:=wdFormatPDF, AddToRecentFiles:=False
.Close SaveChanges:=False
End With
Set wdDocSrc = Nothing: Set wdDocTgt = Nothing
Application.ScreenUpdating = True
End Sub
' ============================================================================================================
Private Function GetFolder() As String
' used by CombineDocument macro by Paul Edstein, keep together in same module
' https://www.msofficeforums.com/word-vba/43339-combine-multiple-word-documents.html
Dim oFolder As Object
GetFolder = ""
Set oFolder = CreateObject("Shell.Application").BrowseForFolder(0, "Choose a folder", 0)
If (Not oFolder Is Nothing) Then GetFolder = oFolder.Items.Item.Path
Set oFolder = Nothing
End Function
Sub LayoutTransfer(wdDocTgt As Document, wdDocSrc As Document)
' works with previous Combine Documents macro from Paul Edstein, keep together
' https://www.msofficeforums.com/word-vba/43339-combine-multiple-word-documents.html
'
Dim sPageHght As Single, sPageWdth As Single
Dim sHeaderDist As Single, sFooterDist As Single
Dim sTMargin As Single, sBMargin As Single
Dim sLMargin As Single, sRMargin As Single
Dim sGutter As Single, sGutterPos As Single
Dim lPaperSize As Long, lGutterStyle As Long
Dim lMirrorMargins As Long, lVerticalAlignment As Long
Dim lScnStart As Long, lScnDir As Long
Dim lOddEvenHdFt As Long, lDiffFirstHdFt As Long
Dim bTwoPagesOnOne As Boolean, bBkFldPrnt As Boolean
Dim bBkFldPrnShts As Long, bBkFldRevPrnt As Boolean
Dim lOrientation As Long
With wdDocSrc.Sections.Last.PageSetup
lPaperSize = .PaperSize
lGutterStyle = .GutterStyle
lOrientation = .Orientation
lMirrorMargins = .MirrorMargins
lScnStart = .SectionStart
lScnDir = .SectionDirection
lOddEvenHdFt = .OddAndEvenPagesHeaderFooter
lDiffFirstHdFt = .DifferentFirstPageHeaderFooter
lVerticalAlignment = .VerticalAlignment
sPageHght = .PageHeight
sPageWdth = .PageWidth
sTMargin = .TopMargin
sBMargin = .BottomMargin
sLMargin = .LeftMargin
sRMargin = .RightMargin
sGutter = .Gutter
sGutterPos = .GutterPos
sHeaderDist = .HeaderDistance
sFooterDist = .FooterDistance
bTwoPagesOnOne = .TwoPagesOnOne
bBkFldPrnt = .BookFoldPrinting
bBkFldPrnShts = .BookFoldPrintingSheets
bBkFldRevPrnt = .BookFoldRevPrinting
End With
With wdDocTgt.Sections.Last.PageSetup
.GutterStyle = lGutterStyle
.MirrorMargins = lMirrorMargins
.SectionStart = lScnStart
.SectionDirection = lScnDir
.OddAndEvenPagesHeaderFooter = lOddEvenHdFt
.DifferentFirstPageHeaderFooter = lDiffFirstHdFt
.VerticalAlignment = lVerticalAlignment
.PageHeight = sPageHght
.PageWidth = sPageWdth
.TopMargin = sTMargin
.BottomMargin = sBMargin
.LeftMargin = sLMargin
.RightMargin = sRMargin
.Gutter = sGutter
.GutterPos = sGutterPos
.HeaderDistance = sHeaderDist
.FooterDistance = sFooterDist
.TwoPagesOnOne = bTwoPagesOnOne
.BookFoldPrinting = bBkFldPrnt
.BookFoldPrintingSheets = bBkFldPrnShts
.BookFoldRevPrinting = bBkFldRevPrnt
.PaperSize = lPaperSize
.Orientation = lOrientation
End With
End Sub
' ============================================================================================================

I used a template and, after editing it, copied it several times into a new Word document.
It worked like this:
Word.Range rng = wordDocTarget.Content;
rng.Collapse(Word.WdCollapseDirection.wdCollapseEnd);
rng.FormattedText = wordDocSource.Content.FormattedText;
An alternative could be to insert a whole file into a range / document:
rng = wordDoc.Range();
rng.Collapse(Word.WdCollapseDirection.wdCollapseEnd);
rng.InsertFile(filepath);

Vb.net how to compare large text files

Hi all. The code below compares the contents of two text files. It works fine for small files, but when the files have a lot of lines (80,000 and up) it runs very, very slowly, which is not acceptable. Could you please give me some ideas?
Public Class Form1
Const TEST1 = "D:\a.txt"
Const TEST2 = "D:\b.txt"
Public file1 As New Dictionary(Of String, String)
Public file2 As New Dictionary(Of String, String)
Public text1 As String()
Public i As Integer
Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
'Declare two dictionaries. The key for each will be the text from the input line up to,
'but not including the first ",". The value for each will be the entire input line.
'Dim file1 As New Dictionary(Of String, String)
'Dim file2 As New Dictionary(Of String, String)
'Dim text1 As String()
For Each line As String In System.IO.File.ReadAllLines(TEST1)
Dim part() As String = line.Split(",")
file1.Add(part(0), line)
Next
For Each line As String In System.IO.File.ReadAllLines(TEST2)
Dim part() As String = line.Split(",")
file2.Add(part(0), line)
Next
' AddText("The following lines from " & TEST2 & " are also in " & TEST1)
For Each key As String In file1.Keys
If file2.ContainsKey(key) Then
TextBox1.Text &= (file1(key)) & vbCrLf
MsgBox(file2(key))
Label1.Text = file1(key)
Else
TextBox2.Text &= (file1(key)) & vbCrLf
End If
Next
text1 = TextBox1.Lines
IO.File.WriteAllLines("D:\Same.txt", text1)
text1 = TextBox2.Lines
IO.File.WriteAllLines("D:\Differrent.txt", text1)
End Sub

The first thing I would change is the use of a Dictionary. I would use a HashSet instead, since only the keys of the second file are needed.
See HashSet versus Dictionary
Then I would change the ReadAllLines loop. ReadAllLines loads every line into memory before the loop starts, while ReadLines reads lazily, so you can start working on each line immediately.
See What's the fastest way to read a text file line-by-line?
The third point is switching the order in which the files are read: first TEST2, then TEST1. This way, while you read the TEST1 lines you can immediately check whether the file2 HashSet contains the key, adding each line either to a list of found strings or to a list of not-found strings.
Dim TEST1 = "D:\temp\test3.txt"
Dim TEST2 = "D:\temp\test6.txt"
Dim file2Keys As New Hashset(Of String)
For Each line As String In System.IO.File.ReadLines(TEST2)
Dim parts = line.Split(",")
file2Keys.Add(parts(0))
Next
Dim listFound As New List(Of String)()
Dim listNFound= New List(Of String)()
For Each line As String In System.IO.File.ReadLines(TEST1)
Dim parts = line.Split(",")
If file2Keys.Contains(parts(0)) Then
listFound.Add(line)
Else
listNFound.Add(line)
End If
Next
IO.File.WriteAllText("D:\temp\Same.txt", String.Join(Environment.NewLine, listFound.ToArray()))
IO.File.WriteAllText("D:\temp\Differrent.txt", String.Join(Environment.NewLine, listNFound.ToArray()))

Build XML tree from Text File in .NET

I have a text file that shows a tree structure. The number of leading spaces indicates the level of a given member. In the example below, groups can have members or subgroups, which can in turn have members, and so on:
MainGroup
 Member1
 Member2
 Group1
  Member11
  Member12
  Group12
   Member21
   Member22
 Member3
Sorry everyone,
My first time and first question here so was figuring out the whole formatting thing.
This is what I have tried so far:
I am reading the text file into datatable (This is not necessary but I need the datatable to display the data for me.).
Going through each row (has one column), I create a node. I find the number of spaces. If it is zero, I add attributes to this node and add it to doc. If it has spaces, I loop through and keep adding child nodes to this node. That is where things are not working for me.
Sub ExportToEXML
Dim datarow As DataRow
Dim fileName As String = ""
Dim level As Integer = 0
Dim counter As Integer = 0
Dim doc As XmlDocument = New XmlDocument
Dim docNode As XmlNode = doc.CreateXmlDeclaration("1.0", "UTF-8", Nothing)
doc.AppendChild(docNode)
Dim ComponentsNode As XmlNode = doc.CreateElement("Components")
doc.AppendChild(ComponentsNode)
Dim firstrow As DataRow
For i As Integer = 0 To dt.Rows.Count - 1
firstrow = dt.Rows.Item(i)
fileName = firstrow(0)
level = CountSpacesBeforeFirstChar(fileName)
Dim partNode As XmlNode = doc.CreateElement("Component")
Dim att As XmlAttribute = doc.CreateAttribute("Name")
att.Value = fileName
partNode.Attributes.Append(att)
GetChildNodes(partNode, i, doc, 0, level, dt)
ComponentsNode.AppendChild(partNode)
Next
doc.Save("D:\TestXML.xml")
End Sub
Private Sub GetChildNodes(ByRef xNode As XmlNode, ByRef rowInd As Integer, ByRef xDoc As XmlDocument, level As Integer, table As DataTable)
Dim lev As Integer
Dim fileName As String
Dim dr As DataRow
For i As Integer = rowInd + 1 To table.Rows.Count - 1
dr = table.Rows.Item(i)
fileName = dr(0)
lev = CountSpacesBeforeFirstChar(fileName)
If lev = 0 Then 'has no children
Exit Sub
End If
If lev > level Then
Dim partNode As XmlNode = xDoc.CreateElement("Component")
Dim att As XmlAttribute = xDoc.CreateAttribute("Name")
att.Value = fileName
partNode.Attributes.Append(att)
xNode.AppendChild(partNode)
GetChildNodes(xNode, i, xDoc, lev, table)
End If
Next
End Sub

Well, you should read the file (if it's not too big read the whole into memory otherwise not), create an empty XML document, iterate through the lines, and depending on the indentation of the line create new Nodes and add them to the appropriate XML element (e.g. keep track of the 'last' node for each level and add them as a child element). Of course you can delay the XML creation to a later phase, and build an object hierarchy based on the file content and simply serialize it when you are done. Or maybe this whole thing can be done with a smart regex. There are quite a few possible solutions.
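The "keep track of the last node for each level" idea can be sketched as follows. This is a minimal Python illustration using xml.etree.ElementTree, not code from the thread: the function name build_tree is hypothetical, and it assumes one leading space per level and the Components / Component / Name naming used in the question.

```python
import xml.etree.ElementTree as ET


def build_tree(lines):
    """Builds a <Components> tree from lines whose leading-space count is their depth."""
    root = ET.Element("Components")
    # stack[0] is the root; stack[d + 1] is the most recent element seen at depth d,
    # so the parent of a line at depth d is always stack[d].
    stack = [root]
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        depth = len(line) - len(line.lstrip(" "))
        if depth > len(stack) - 1:
            raise ValueError(f"depth increases by more than one level at {line!r}")
        del stack[depth + 1:]  # forget nodes at this depth or deeper
        node = ET.SubElement(stack[depth], "Component", Name=line.strip())
        stack.append(node)
    return root
```

Calling ET.tostring(build_tree(lines), encoding="unicode") on the result gives the XML as a string. Because only one node per level is remembered, the whole file is processed in a single pass without recursion.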
But frankly: SO is not a place where you will magically get code with no effort. (Well, sometimes it is, but nonetheless: show us you made some effort to actually solve the problem before you ask a very general question.)

Here's a relatively concise way to do this:
Sub ParseHierarchy(ByRef inputFilePath As String, ByRef outputFilePath As String)
' We'll treat depth as zero-based to match the number of spaces in the lines
Dim depth As Integer = -1
Dim settings As XmlWriterSettings = New XmlWriterSettings
settings.Indent = True
Using writer As XmlWriter = XmlWriter.Create(outputFilePath, settings)
For Each line As String In File.ReadLines(inputFilePath)
Dim nextDepth As Integer = GetLineDepth(line)
If nextDepth - depth > 1 Then
Throw New ApplicationException( _
"Depth cannot increase by more than 1 at a time.")
End If
'' Close any elements at a deeper or the same depth as the next one
CloseElements(writer, depth - nextDepth + 1)
depth = nextDepth
writer.WriteStartElement("Component")
writer.WriteAttributeString("Name", line.Trim())
Next
'' Close any elements that are still open
CloseElements(writer, depth + 1)
End Using
End Sub
Private Sub CloseElements(ByRef writer As XmlWriter, ByVal count As Integer)
For i = 1 To count
writer.WriteEndElement()
Next
End Sub
Private Function GetLineDepth(line As String) As Integer
Return Regex.Match(line, "^\s*").Length
End Function
When run on your sample file, the output is:
<Component Name="MainGroup">
  <Component Name="Member1" />
  <Component Name="Member2" />
  <Component Name="Group1">
    <Component Name="Member11" />
    <Component Name="Member12" />
    <Component Name="Group12">
      <Component Name="Member21" />
      <Component Name="Member22" />
    </Component>
  </Component>
  <Component Name="Member3" />
</Component>

Delete Matching Braces in Visual Studio

In Visual Studio I can jump from/to opening/closing brace with the Control+] shortcut.
Is there a shortcut that will allow me to delete both braces at once (maybe with a macro/extension)?
e.g.
foo = ( 1 + bar() + 2 );
When I am on the first opening brace I would like to delete it and its matching brace to get
foo = 1 + bar() + 2;

There isn't an inherent way to do this with Visual Studio. You would need to implement a macro to do this.
If you choose the macro route you'll want to get familiar with the Edit.GoToBrace command. This is the command which will jump you from the current to the matching brace. Note it will actually dump you after the matching brace so you may need to look backwards one character to find the element to delete.
The best way to implement this as a macro is to:
1. Save the current caret position
2. Execute Edit.GoToBrace
3. Delete the brace to the left of the caret
4. Delete the brace at the original caret position
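The logic behind those steps can be tried out on a plain string. The following Python sketch is only an illustration of the matching-and-deleting idea, not a Visual Studio macro: find_match and delete_brace_pair are hypothetical names, pos stands in for the caret position, and braces inside string literals or comments are not treated specially.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}
OPENERS = {close: open_ for open_, close in PAIRS.items()}


def find_match(text, pos):
    """Returns the index of the brace matching text[pos], or -1 if there is none."""
    ch = text[pos]
    if ch in PAIRS:
        partner, step = PAIRS[ch], 1      # opening brace: scan forwards
    elif ch in OPENERS:
        partner, step = OPENERS[ch], -1   # closing brace: scan backwards
    else:
        return -1
    depth = 0
    i = pos
    while 0 <= i < len(text):
        if text[i] == ch:
            depth += 1                    # same kind of brace nests one level deeper
        elif text[i] == partner:
            depth -= 1
            if depth == 0:
                return i
        i += step
    return -1


def delete_brace_pair(text, pos):
    """Removes the brace at pos and its partner; returns text unchanged if unmatched."""
    match = find_match(text, pos)
    if match == -1:
        return text
    first, second = sorted((pos, match))
    # Both braces are cut out in one expression, so no offset needs adjusting.
    return text[:first] + text[first + 1:second] + text[second + 1:]
```

Removing both characters in a single slice expression sidesteps the offset shift that a macro has to handle when it deletes the two braces one after the other.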

Make a macro to press Ctrl+] twice and then backspace, then Ctrl+minus and a delete.
The Ctrl+minus moves the cursor back in time.

It's not quite as simple as JaredPar suggested but I'm no Macro expert either.
This works for (), {} and []
Sub DeleteMatchingBrace()
Dim sel As TextSelection = DTE.ActiveDocument.Selection
Dim ap As VirtualPoint = sel.ActivePoint
If (sel.Text() <> "") Then Exit Sub
' reposition
DTE.ExecuteCommand("Edit.GoToBrace") : DTE.ExecuteCommand("Edit.GoToBrace")
If (ap.DisplayColumn <= ap.LineLength) Then sel.CharRight(True)
Dim c As String = sel.Text
Dim isRight As Boolean = False
If (c <> "(" And c <> "[" And c <> "{") Then
sel.CharLeft(True, 1 + IIf(c = "", 0, 1))
c = sel.Text
sel.CharRight()
If (c <> ")" And c <> "]" And c <> "}") Then Exit Sub
isRight = True
End If
Dim line = ap.Line
Dim pos = ap.DisplayColumn
DTE.ExecuteCommand("Edit.GoToBrace")
If (isRight) Then sel.CharRight(True) Else sel.CharLeft(True)
sel.Text = ""
If (isRight And line = ap.Line) Then pos = pos - 1
sel.MoveToDisplayColumn(line, pos)
sel.CharLeft(True)
sel.Text = ""
End Sub
Then add a shortcut to this macro in VS.

Regex: absolute url to relative url (C#)

I need a regex to run against strings like the one below that will convert absolute paths to relative paths under certain conditions.
<p>This website is <strong>really great</strong> and people love it <img alt="" src="http://localhost:1379/Content/js/fckeditor/editor/images/smiley/msn/teeth_smile.gif" /></p>
Rules:
If the url contains "/Content/", I would like to get the relative path.
If the url does not contain "/Content/", it is an external file, and the absolute path should remain.
Regex unfortunately is not my forte, and this is too advanced for me at this point. If anyone can offer some tips I'd appreciate it.
Thanks in advance.
UPDATE:
To answer questions in the comments:
At the time the Regex is applied, All urls will begin with "http://"
This should be applied to the src attribute of both img and a tags, not to text outside of tags.

You should consider using the Uri.MakeRelativeUri method - your current algorithm depends on external files never containing "/Content/" in their path, which seems risky to me. MakeRelativeUri will determine whether a relative path can be made from the current Uri to the src or href regardless of changes you or the external file store make down the road.
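The underlying point, deciding by which site a URL belongs to rather than by whether it contains "/Content/", can be sketched like this. The Python below is an assumption-laden illustration using urllib.parse, not the .NET Uri.MakeRelativeUri method: relative_if_same_site and the site_root parameter are hypothetical, and it produces a root-relative path instead of a path relative to the current page.

```python
from urllib.parse import urlsplit


def relative_if_same_site(url, site_root):
    """Returns a root-relative URL when url is on the same site, else url unchanged."""
    target, base = urlsplit(url), urlsplit(site_root)
    if (target.scheme, target.netloc) != (base.scheme, base.netloc):
        return url  # external file: keep the absolute URL
    relative = target.path or "/"
    if target.query:
        relative += "?" + target.query
    if target.fragment:
        relative += "#" + target.fragment
    return relative
```

With this check an external image whose path happens to contain "/Content/" keeps its absolute URL, which is the risk the answer describes.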

Unless I'm missing the point here, if you replace
^(.*)([Cc]ontent.*)
With
/$2
You will end up with
/Content/js/fckeditor/editor/images/smiley/msn/teeth_smile.gif
This will only happen if "content" is found, so in case you have a URL such as:
http://localhost:1379/js/fckeditor/editor/images/smiley/msn/teeth_smile.gif
nothing will be replaced.
Hope it helps, and that I didn't miss anything.
UPDATE
Obviously, this assumes you are using an HTML parser to find the URL inside the a href (which you should, in case you're not :-))
Cheers

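This is for Perl; I do not know C#. Note that the closing quote is matched by \4, so it has to be written back at the end of the replacement:
s#(<(img|a)\s[^>]*?\s(src|href)=)(["'])http://[^'"]*?(/Content/[^'"]*?)\4#$1$4$5$4#g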
If c# has perl-like regex it will be easy to port.
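As a rough guide for porting, here is the same approach written with Python's re module. It is a sketch under the question's stated assumptions (every URL starts with http:// and only src/href attributes of img and a tags are rewritten); ATTR_URL and make_content_urls_relative are names invented for this example, and a real HTML parser is still the safer choice for arbitrary markup.

```python
import re

# Matches a src/href attribute of an <img> or <a> tag whose http:// URL contains
# /Content/. Group 2 captures the quote so the same character closes the value.
ATTR_URL = re.compile(
    r"""(<(?:img|a)\b[^>]*?\s(?:src|href)=)(["'])http://[^"']*?(/Content/[^"']*)\2"""
)


def make_content_urls_relative(html):
    """Rewrites matching attribute values to start at /Content/; leaves the rest as is."""
    return ATTR_URL.sub(
        lambda m: m.group(1) + m.group(2) + m.group(3) + m.group(2),
        html,
    )
```

URLs without "/Content/" and URLs that appear in text outside a tag do not match the pattern, so they are left untouched.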

This function can convert all the hyperlinks and image sources inside your HTML to absolute URLs and for sure you can modify it also for CSS files and Javascript files easily:
Private Function ConvertALLrelativeLinksToAbsoluteUri(ByVal html As String, ByVal PageURL As String)
Dim result As String = Nothing
' Getting all Href
Dim opt As New RegexOptions
Dim XpHref As New Regex("(href="".*?"")", RegexOptions.IgnoreCase)
Dim i As Integer
Dim NewSTR As String = html
For i = 0 To XpHref.Matches(html).Count - 1
Application.DoEvents()
Dim Oldurl As String = Nothing
Dim OldHREF As String = Nothing
Dim MainURL As New Uri(PageURL)
OldHREF = XpHref.Matches(html).Item(i).Value
Oldurl = OldHREF.Replace("href=", "").Replace("HREF=", "").Replace("""", "")
Dim NEWURL As New Uri(MainURL, Oldurl)
Dim NewHREF As String = "href=""" & NEWURL.AbsoluteUri & """"
NewSTR = NewSTR.Replace(OldHREF, NewHREF)
Next
html = NewSTR
Dim XpSRC As New Regex("(src="".*?"")", RegexOptions.IgnoreCase)
For i = 0 To XpSRC.Matches(html).Count - 1
Application.DoEvents()
Dim Oldurl As String = Nothing
Dim OldHREF As String = Nothing
Dim MainURL As New Uri(PageURL)
OldHREF = XpSRC.Matches(html).Item(i).Value
Oldurl = OldHREF.Replace("src=", "").Replace("SRC=", "").Replace("""", "")
Dim NEWURL As New Uri(MainURL, Oldurl)
Dim NewHREF As String = "src=""" & NEWURL.AbsoluteUri & """"
NewSTR = NewSTR.Replace(OldHREF, NewHREF)
Next
Return NewSTR
End Function
