I have simple code that parses HTML with HtmlAgilityPack and uploads the results to a SQL database server. It works properly with a foreach loop. I wanted to try a Parallel.ForEach loop instead. The scraping is faster, but I sometimes (not every time) get a stack overflow exception. Can you look at the code and tell me why?
//for each link object
Parallel.ForEach(link, _link =>
{
pageNumber=0;
position=1;
//try to load every single page number
while (true)
{
//load page's html
siteHtml = web.Load(_link.LinkUrl + "?page=" + pageNumber);
try
{
doc.LoadHtml(siteHtml.DocumentNode.SelectNodes("//section[@class='answers']")[0].InnerHtml);
comments = doc.DocumentNode.SelectNodes("//singleAnswer");
foreach (HtmlNode _comments in comments)
{
HtmlAgilityPack.HtmlDocument HtmlPage2 = new HtmlAgilityPack.HtmlDocument();
HtmlPage2.LoadHtml(_comments.InnerHtml);
commentId = Convert.ToInt32(_comments.GetAttributeValue("id", ""));
commentValue = (HtmlPage2.DocumentNode.SelectNodes("//p[@class='content']")[0].InnerText);
if (commentValue.Contains(_link.Keyword))
{
sql.updateComment(_link.LinkId, commentValue);
_link.Position = position;
_link.MyCommentId = commentId;
goto NextLink;
}
position++;
}
}
catch
{
}
if (!siteHtml.DocumentNode.InnerHtml.Contains(@"class=""pagingx"" rel=""nextPage"">"))
{
break;
}
pageNumber++;
}
NextLink:;
groupBox1.Invoke(new Action(delegate ()
{
groupBox1.Text = "Link's statistics (" + finished + "/" + linksUrlElements + ")";
}));
});
I also noticed that CPU usage is much higher with Parallel.ForEach than with foreach.
I am trying to redact some Word files using C# and Open XML. I need to do a controlled replacement of numbers with a certain phrase. Each Word file contains a different amount of information. I want to use Open XML PowerTools for this purpose.
I tried the plain Open XML SDK approach to replace the text, but it is very unreliable and produces random errors such as a zero-length error. I also tried a regex replace, which seems to work, but it replaces throughout the whole document, which is highly undesirable.
Here is a snippet of the code:
private void redact_Replaceall(string wfile)
{
try
{
using (WordprocessingDocument doc = WordprocessingDocument.Open(wfile, true))
{
var ydoc = doc.MainDocumentPart.GetXDocument();
IEnumerable<XElement> content = ydoc.Descendants(W.body);
Regex regex = new Regex(@"\d+\.\d{2,3}");
int count1 = OpenXmlPowerTools.OpenXmlRegex.Match(content, regex);
int count2 = OpenXmlPowerTools.OpenXmlRegex.Replace(content, regex, replace_text, null);
statusBar1.Text = "Try 1: Found: " + count1 + ", Replaced: " + count2;
doc.MainDocumentPart.PutXDocument();
}
}
catch(Exception e)
{
MessageBox.Show("Replace all experienced error: " + e.Message);
}
}
Basically, I want to do this redaction based on the content of each paragraph. I am able to get the paragraphs using the line below, but not their IDs:
IEnumerable<XElement> content = ydoc.Descendants(W.p);
Here is my approach using the plain Open XML SDK, but I get a lot of errors depending on the file:
foreach (DocumentFormat.OpenXml.Wordprocessing.Paragraph para in bod.Descendants<DocumentFormat.OpenXml.Wordprocessing.Paragraph>())
{
foreach (var run in para.Elements<Run>())
{
foreach (var text in run.Elements<Text>())
{
string temp = text.Text;
int firstlength = first.Length + 1;
int secondlength = second.Length + 1;
if (text.Text.Contains(first) && !(temp.Length > firstlength))
{
text.Text = text.Text.Replace(first, "DELETED");
}
if (text.Text.Contains(second) && !(temp.Length > secondlength))
{
text.Text = text.Text.Replace(second, "DELETED");
}
}
}
}
Here is my latest approach, but I am stuck on it:
private void redact_Replacebadones(string wfile)
{
try
{
using (WordprocessingDocument doc = WordprocessingDocument.Open(wfile, true))
{
var ydoc = doc.MainDocumentPart.GetXDocument();
/* from XElement xele in ydoc.Root.Elements();
List<string> lhsElements = xele.Elements("lhs")
.Select(el => el.Attribute("id").Value)
.ToList();
*/
/// XElement
IEnumerable<XElement> content = ydoc.Descendants(W.p);
foreach (var p in content )
{
if (p.Value.Contains("each") && !p.Value.Contains("DELETED"))
{
string to_overwrite = p.Value;
Regex regexop = new Regex(@"\d+\.\d{2,3}");
regexop.Replace(to_overwrite, "Deleted");
p.SetValue(to_overwrite);
MessageBox.Show("NAME :" + p.GetParagraphInfo() +" VValue:"+to_overwrite);
}
}
doc.MainDocumentPart.PutXDocument();
}
}
catch (Exception e)
{
MessageBox.Show("Replace each experienced error: " + e.Message);
}
}
This may be a bit late. Open XML PowerTools by Eric White has a SearchAndReplace function that replaces text content, so you don't have to handle it with regex.
This function also handles text that is split across runs. (When you edit a word in Word, it can be split into several runs, so you don't find the search phrase directly.)
Maybe this helps somebody.
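To illustrate the point about runs, here is a minimal sketch that uses only System.Xml.Linq, not PowerTools itself. The paragraph content and the number 12.345 are made up for the example. It only shows why a per-run search can miss a phrase that a search over the joined paragraph text finds. It is not how SearchAndReplace is implemented.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// WordprocessingML main namespace.
XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

// Hypothetical paragraph in which the number "12.345" is split across two runs.
var paragraph = new XElement(w + "p",
    new XElement(w + "r", new XElement(w + "t", "Price each: 12.")),
    new XElement(w + "r", new XElement(w + "t", "345")));

// Checking one w:t element at a time misses the phrase.
bool foundInSingleRun = paragraph.Descendants(w + "t")
    .Any(t => t.Value.Contains("12.345"));

// Joining the text of all runs first finds it.
string joined = string.Concat(paragraph.Descendants(w + "t").Select(t => t.Value));
bool foundInJoinedText = joined.Contains("12.345");

Console.WriteLine($"single run: {foundInSingleRun}, joined: {foundInJoinedText}");
```

This is also why replacing text element by element, as in the nested foreach over runs in the question, can skip matches depending on how the file was edited.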
I am using the BingSearchContainer.cs with a Winform in C#. I am returning the results using the following code. After a good couple of hours looking I can't figure out how to return the other pages of results. It is only possible to return a maximum of 50 results at a time. I would like to return more pages and then add these to "imageSet" to have a full list of resulting images. Any hints or pointers would be really useful, thanks in advance for any help.
void bingSearch(string searchTerm)
{
try
{
imageSet = new List<Bing.ImageResult>();
const string bingKey = "[key]";
var bing = new BingSearchContainer(
new Uri("https://api.datamarket.azure.com/Bing/Search/")) { Credentials = new NetworkCredential(bingKey, bingKey) };
var query = bing.Image("\"" + searchTerm + "\"" + "(" + site1 + " OR " + site2 + ")", null, null, null, null, null, ImageFilters);
Debug.Print("Full Search: " + query.ToString());
query = query.AddQueryOption("$top", 50);
query = query.AddQueryOption("$skip", 20);
var results = query.Execute();
int index = 0;
foreach (var result in results)
{
imageSet.Add(result);
Debug.Print("URL: " + imageSet[index].MediaUrl);
index++;
}
Debug.Print("Results: " + imageSet.Count);
}
catch
{
Debug.Print("Error");
}
}
Solved this.
Actually it is very simple. The "$skip" query option sets the offset for paging: with an offset of 0 I get the first 50 images, with an offset of 50 I get the next 50 images, and so on.
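A minimal sketch of that paging loop, assuming a page size of 50. FetchAll, fetchPage and the fake data source are hypothetical names for illustration. In the real code, fetchPage would rebuild the query with the given "$skip" and "$top" values and call Execute().

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Requests pages of `pageSize` items, advancing the offset each time, and stops
// when a page comes back short or `maxResults` has been reached.
// `fetchPage(skip, top)` is a hypothetical stand-in for the real query.
static List<T> FetchAll<T>(Func<int, int, IEnumerable<T>> fetchPage, int pageSize, int maxResults)
{
    var all = new List<T>();
    for (int skip = 0; skip < maxResults; skip += pageSize)
    {
        List<T> page = fetchPage(skip, pageSize).ToList();
        all.AddRange(page);
        if (page.Count < pageSize)
        {
            break; // a short page means there is nothing left to fetch
        }
    }
    return all;
}

// Fake source with 120 items standing in for the remote service.
var fakeSource = Enumerable.Range(0, 120).ToList();
List<int> results = FetchAll((skip, top) => fakeSource.Skip(skip).Take(top), 50, 1000);
Console.WriteLine("Results: " + results.Count);
```

With the fake source this makes three requests (offsets 0, 50 and 100) and collects all 120 items into one list, which is what the question wanted for imageSet.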
After importing a number of XML files into my application, I tried to modify them using the XmlDocument class. I created a few methods to do the modifications.
The first method works fine, but the second one throws a System.IO exception along the lines of "file is already being used by another process".
Can anyone help me solve this issue?
Sample code of what I'm doing:
Method1(fileList);
Method2(fileList);
Method3(fileList);
private void Method1(IList<RenamedImportedFileInfo> fileList)
{
try
{
string isDefaultAttribute = Resource.Resources.ImportIsDefaultAttribute;
string editorsPath = editorsFolderName + Path.DirectorySeparatorChar + meterType;
string profilesPath = profileFolderName + Path.DirectorySeparatorChar + meterType;
string strUriAttribute = Resource.Resources.ImportUriAttribute;
foreach (RenamedImportedFileInfo renameInfo in fileList)
{
if (renameInfo.NewFilePath.ToString().Contains(editorsPath) && (renameInfo.IsProfileRenamed != true))
{
var xmldoc = new XmlDocument();
xmldoc.Load(renameInfo.NewFilePath);
if (xmldoc.DocumentElement.HasAttribute(isDefaultAttribute))
{
xmldoc.DocumentElement.Attributes[isDefaultAttribute].Value = Resource.Resources.ImportFalse;
}
XmlNodeList profileNodes = xmldoc.DocumentElement.GetElementsByTagName(Resource.Resources.ImportMeasurementProfileElement);
if (profileNodes.Count == 0)
{
profileNodes = xmldoc.DocumentElement.GetElementsByTagName(Resource.Resources.ImportBsMeasurementProfileElement);
}
if (profileNodes.Count > 0)
{
foreach (RenamedImportedFileInfo profileName in oRenamedImportedFileList)
{
if (profileName.NewFilePath.ToString().Contains(profilesPath))
{
if (string.Compare(Path.GetFileName(profileName.OldFilePath), Convert.ToString(profileNodes[0].Attributes[strUriAttribute].Value, CultureInfo.InvariantCulture), StringComparison.OrdinalIgnoreCase) == 0)
{
profileNodes[0].Attributes[strUriAttribute].Value = Path.GetFileName(profileName.NewFilePath);
renameInfo.IsProfileRenamed = true;
break;
}
}
}
}
xmldoc.Save(renameInfo.NewFilePath);
xmldoc = null;
profileNodes = null;
}
}
oRenamedImportedFileList = null;
}
catch (NullReferenceException nullException) { LastErrorMessage = nullException.Message; }
}
Thanks,
Raj
You are probably opening the same file twice in your application. Before you can open it again, you have to close it (or leave it open and work on the same document without opening it again).
For help on how to implement this, please show us more code so we can give you advice.
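As a rough illustration of "close it before you open it again", here is a minimal sketch. EditXml, the temp file and the attribute names are made up for the example and are not taken from the question's code. The idea is that every method loads through a stream that is disposed before it returns, so the next method can open the same file.

```csharp
using System;
using System.IO;
using System.Xml;

// Loads an XML file through a stream that is disposed right after loading,
// applies `modify`, then saves. No handle stays open between calls.
static void EditXml(string path, Action<XmlDocument> modify)
{
    var doc = new XmlDocument();
    using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        doc.Load(stream);
    } // the read handle is released here
    modify(doc);
    doc.Save(path); // Save(string) opens, writes and closes the file itself
}

// Hypothetical sample file in the temp directory.
string path = Path.Combine(Path.GetTempPath(), Guid.NewGuid() + ".xml");
File.WriteAllText(path, "<profile isDefault=\"true\" />");

// Two consecutive edits of the same file, as Method1 and Method2 would do.
EditXml(path, d => d.DocumentElement.SetAttribute("isDefault", "false"));
EditXml(path, d => d.DocumentElement.SetAttribute("uri", "renamed.xml"));

string result = File.ReadAllText(path);
Console.WriteLine(result);
File.Delete(path);
```

If some other part of the application (for example the import step) still holds a FileStream or reader on the file, the same rule applies there: wrap it in a using block so it is closed before Method1 runs.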
I've just started using your ImapX library to retrieve and read mail from Gmail.
Everything is working fine, and it's a great library.
However, when I try to mark a mail as read using Message.Process(), it throws an IndexOutOfRangeException.
private void Start()
{
int amountRead = 0;
failedMessages.Clear();
foreach (string origin in Properties.Settings.Default.MailOrigins)
{
IMailOriginAdapter adapter = MailOriginFactory.CreateMailOriginContainer(origin);
foreach (ImapX.Message message in adapter.Messages())
{
if (SendWebRequest(url))
{
message.Process();
amountRead++;
Dispatcher.BeginInvoke(new MethodInvoker(delegate
{
this.btnStart.Content = "Read [" + amountRead + "/" + GmailUser.Instance.Messages.Count + "]";
}));
}
else
{
failedMessages.Add(message);
}
}
System.Windows.MessageBox.Show(adapter.GmailFromEmail() + " reading completed.");
}
}
Hopefully someone can help me with this problem, which I've had for over a month now.
Thanks in advance.
Yours Sincerely,
Larssy1
I have posted the complete code below so you can see what I'm doing.
Situation:
I create an IHTMLDocument2 currentDoc pointing to the DomDocument.
I write the HTML string to it.
I close the currentDoc.
The program shows the HTML, including the CSS, 100% correctly. This works.
Now I want to change the CSS: instead of 2 columns I set it to 3 columns
(simply changing width:48% to width:33%)
and rerun the code with the new 33%.
Now it suddenly doesn't apply any CSS styles anymore.
When I close the program and then change the CSS to 33% again, it works flawlessly.
So somehow, without disposing the complete WebBrowser, I can't load the CSS a second time.
Or the first CSS is cached somewhere and conflicts with the second CSS. I'm just guessing here and really need help solving this.
I searched the internet and Stack Overflow long enough that I need to post this; even if someone else has already posted it somewhere, I didn't find it.
private void doWebBrowserPreview()
{
if (lMediaFiles.Count == 0)
{
return;
}
Int32 iIndex = 0;
for (iIndex = 0; iIndex < lMediaFiles.Count; iIndex++)
{
if (!lMediaFiles[iIndex].isCorrupt())
{
break;
}
}
String strPreview = String.Empty;
String strLine = String.Empty;
// Set example Media
String strLinkHTM = lMediaFiles[iIndex].getFilePath();
FileInfo movFile = new FileInfo(strLinkHTM + lMediaFiles[iIndex].getFileMOV());
String str_sizeMB = (movFile.Length / 1048576).ToString();
if (str_sizeMB.Length > 3)
{
// String.Insert returns a new string, so the result must be assigned back
str_sizeMB = str_sizeMB.Insert(str_sizeMB.Length - 3, ".");
}
//Get info about our media files
MediaInfo MI = new MediaInfo();
MI.Open(strLinkHTM + lMediaFiles[iIndex].getFileM4V());
String str_m4vDuration = // MI.Get(0, 0, 80);
MI.Get(StreamKind.Video, 0, 74);
str_m4vDuration = "Duration: " + str_m4vDuration.Substring(0, 8) + " - Hours:Minutes:Seconds";
String str_m4vHeightPixel = MI.Get(StreamKind.Video, 0, "Height"); // "Height (Pixel): " +
Int32 i_32m4vHeightPixel;
Int32.TryParse(str_m4vHeightPixel, out i_32m4vHeightPixel);
i_32m4vHeightPixel += 16; // for the quicktime embed menu
str_m4vHeightPixel = i_32m4vHeightPixel.ToString();
String str_m4vWidthPixel = MI.Get(StreamKind.Video, 0, "Width"); //"Width (Pixel): " +
foreach (XElement xmlLine in s.getTemplates().getMovieHTM().Element("files").Elements("file"))
{
var query = xmlLine.Attributes("type");
foreach (XAttribute result in query)
{
if (result.Value == "htm_header")
{
foreach (XElement xmlLineDes in xmlLine.Descendants())
{
if (xmlLineDes.Name == "dataline")
{
strLine = xmlLineDes.Value;
strLine = strLine.Replace(@"%date%", lMediaFiles[iIndex].getDay().ToString() + " " + lMediaFiles[iIndex].getMonth(lMediaFiles[iIndex].getMonth()) + " " + lMediaFiles[iIndex].getYear().ToString());
strPreview += strLine + "\n";
}
}
}
}
}
strLine = "<style type=\"text/css\">" + "\n";
foreach (XElement xmlLine in s.getTemplates().getLayoutCSS().Element("layoutCSS").Elements("layout"))
{
var query = xmlLine.Attributes("type");
foreach (XAttribute result in query)
{
if (result.Value == "layoutMedia")
{
foreach (XElement xmlLineDes in xmlLine.Elements("layout"))
{
var queryL = xmlLineDes.Attributes("type");
foreach (XAttribute resultL in queryL)
{
if (resultL.Value == "layoutVideoBox")
{
foreach (XElement xmlLineDesL in xmlLineDes.Descendants())
{
if (xmlLineDesL.Name == "dataline")
{
strLine += xmlLineDesL.Value + "\n";
}
}
}
}
}
}
}
}
strLine += "</style>" + "\n";
strPreview = strPreview.Insert(strPreview.LastIndexOf("</head>", StringComparison.Ordinal), strLine);
for (Int16 i16Loop = 0; i16Loop < 3; i16Loop++)
{
foreach (XElement xmlLine in s.getTemplates().getMovieHTM().Element("files").Elements("file"))
{
var query = xmlLine.Attributes("type");
foreach (XAttribute result in query)
{
if (result.Value == "htm_videolist")
{
foreach (XElement xmlLineDes in xmlLine.Descendants())
{
if (xmlLineDes.Name == "dataline")
{
strLine = xmlLineDes.Value;
strLine = strLine.Replace(@"%m4vfile%", strLinkHTM + lMediaFiles[iIndex].getFileM4V());
strLine = strLine.Replace(@"%moviefile%", strLinkHTM + lMediaFiles[iIndex].getFileMOV());
strLine = strLine.Replace(@"%height%", str_m4vHeightPixel);
strLine = strLine.Replace(@"%width%", str_m4vWidthPixel);
strLine = strLine.Replace(@"%duration%", str_m4vDuration);
strLine = strLine.Replace(@"%sizeMB%", str_sizeMB);
strLine = strLine.Replace(@"%date%", lMediaFiles[iIndex].getDay().ToString() + " " + lMediaFiles[iIndex].getMonth(lMediaFiles[iIndex].getMonth()) + " " + lMediaFiles[iIndex].getYear().ToString());
strPreview += strLine + "\n";
}
}
}
}
}
}
foreach (XElement xmlLine in s.getTemplates().getMovieHTM().Element("files").Elements("file"))
{
var query = xmlLine.Attributes("type");
foreach (XAttribute result in query)
{
if (result.Value == "htm_footer")
{
foreach (XElement xmlLineDes in xmlLine.Descendants())
{
if (xmlLineDes.Name == "dataline")
{
strPreview += xmlLineDes.Value + "\n";
}
}
}
}
}
webBrowserPreview.Navigate("about:blank");
webBrowserPreview.Document.OpenNew(false);
mshtml.IHTMLDocument2 currentDoc = (mshtml.IHTMLDocument2)webBrowserPreview.Document.DomDocument;
currentDoc.clear();
currentDoc.write(strPreview);
currentDoc.close();
/*
try
{
if (webBrowserPreview.Document != null)
{
IHTMLDocument2 currentDocument = (IHTMLDocument2)webBrowserPreview.Document.DomDocument;
int length = currentDocument.styleSheets.length;
IHTMLStyleSheet styleSheet = currentDocument.createStyleSheet(@"", 0);
//length = currentDocument.styleSheets.length;
//styleSheet.addRule("body", "background-color:blue");
strLine = String.Empty;
foreach (XElement xmlLine in s.getTemplates().getLayoutCSS().Element("layoutCSS").Elements("layout"))
{
var query = xmlLine.Attributes("type");
foreach (XAttribute result in query)
{
if (result.Value == "layoutMedia")
{
foreach (XElement xmlLineDes in xmlLine.Elements("layout"))
{
var queryL = xmlLineDes.Attributes("type");
foreach (XAttribute resultL in queryL)
{
if (resultL.Value == "layoutVideoBox")
{
foreach (XElement xmlLineDesL in xmlLineDes.Descendants())
{
if (xmlLineDesL.Name == "dataline")
{
strLine += xmlLineDesL.Value;
}
}
}
}
}
}
}
}
//TextReader reader = new StreamReader(Path.Combine(Path.GetDirectoryName(Application.ExecutablePath), "basic.css"));
//string style = reader.ReadToEnd();
styleSheet.cssText = strLine;
}
}
catch (Exception ex)
{
MessageBox.Show(ex.Message);
}*/
webBrowserPreview.Refresh();
}
I have now also implemented the berkelium-sharp approach in my project.
It has the same bug!
Found a solution!
First attempt, which didn't work:
I had a persistent form (the main form) with a nested WebBrowser inside it.
After changing the HTML and its CSS, I told it to navigate to the new HTML.
This didn't work either:
Then I tried putting the WebBrowser on its own form, which I simply open and close each
time I need a refresh, to be sure the garbage collector cleans everything up.
Then I tried Berkelium and rewrote it to my needs,
with the same logic as attempt 2 with the WebBrowser. No luck either.
So I tried opening Firefox itself to see whether I could reproduce this behaviour with a real browser. Indeed I could! (If you simply open the same file again, Firefox doesn't actually navigate to it; it detects that the file is already open and just refreshes it.
I noticed this because the page opened so fast.)
A little scripting to force opening the same file twice (navigating) in one Firefox session had the same effect: all CSS corrupted!
So, for some reason, you shouldn't navigate to the same file twice. Instead of closing anything, simply force a refresh, not a "Navigate".
I hope this helps others, since I lost a lot of time finding out that navigating to the same file more than once is what corrupts the stylesheets.