Selecting page data from dynamic columns HtmlAgility pack - c#

I'm using the HtmlAgility pack to scrape data from this url:
http://www.myfitnesspal.com/food/diary/chuckgross
Essentially, the only data I really need is calories, protein, fat and carbs. The issue is that these columns are user ordered (and users can even hide some of them!).
I'm trying to return that page data into a class:
public class NutritionRecord
{
public string Calories { get; set; }
public string Protein { get; set; }
public string Fat { get; set; }
public string Carbs { get; set; }
}
My idea was to scrape the row with the column names (it's a footer), then scrape the Totals row, combine them into a new table, and then somehow select the data for a given column. I haven't gotten that far. This is what I have so far, but I feel like I'm just flailing:
http://pastebin.com/uYvMYuM3
This code returns an HTML table, and I cannot figure out how to get the data from the columns. Example in English: give me the data in the cell whose column header == "protein".
What the table looks like:
<table class='resultsTable'>
<tr class='labels'>
<th>Calories</th>
<th>Protein</th>
<th>Fat</th>
<th>Carbs</th>
<th>Fiber</th>
</tr>
<tr class='resultsTotals'>
<td>2,386</td>
<td>194</td>
<td>109</td>
<td>161</td>
<td>38</td>
</tr>
</table>

Try this. You don't need to scrape the totals; just generate them from the result of the following. This should take care of hidden and reordered columns:
public class NutritionRecord
{
public string Meal { get; set; }
public string MealPart { get; set; }
public string Calories { get; set; }
public string Protein { get; set; }
public string Fat { get; set; }
public string Carbs { get; set; }
public string Fiber { get; set; }
public string Sugar { get; set; }
}
and the scrape part:
var html = new WebClient().DownloadString("http://www.myfitnesspal.com/food/diary/chuckgross");
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var list = new List<NutritionRecord>();
var orderedColumnsList = doc.DocumentNode.SelectNodes("//tr[@class='meal_header']/td[@class='alt']").Select(td=>td.InnerText.Trim()).ToList();
var trs = doc.DocumentNode.SelectNodes("//tr").ToList();
for (var i = 0; i < trs.Count; i++)
{
bool isMealHeader = false;
if (trs[i].Attributes["class"] != null)
{
isMealHeader = trs[i].Attributes["class"].Value == "meal_header";
}
if (isMealHeader)
{
var dataRows = trs[i].SelectNodes("./following-sibling::*").TakeWhile(tr => !tr.HasAttributes)
.Select(tr => new NutritionRecord() {
Meal = WebUtility.HtmlDecode( trs[i].SelectSingleNode("./td[@class='first alt']").InnerText.Trim()),
MealPart = WebUtility.HtmlDecode(tr.SelectSingleNode("./td[@class='first alt']").InnerText.Trim()),
Calories = tr.SelectSingleNode(string.Format("./td[not(contains(@class, 'delete'))][{0}]", orderedColumnsList.IndexOf("Calories") + 2)).InnerText,
Protein = tr.SelectSingleNode(string.Format("./td[not(contains(@class, 'delete'))][{0}]", orderedColumnsList.IndexOf("Protein") + 2)).InnerText,
Fat = tr.SelectSingleNode(string.Format("./td[not(contains(@class, 'delete'))][{0}]", orderedColumnsList.IndexOf("Fat") + 2)).InnerText,
Carbs = tr.SelectSingleNode(string.Format("./td[not(contains(@class, 'delete'))][{0}]", orderedColumnsList.IndexOf("Carbs") + 2)).InnerText,
Fiber = tr.SelectSingleNode(string.Format("./td[not(contains(@class, 'delete'))][{0}]", orderedColumnsList.IndexOf("Fiber") + 2)).InnerText,
});
list.AddRange(dataRows);
}
}
Also, to handle the column order, read the InnerText of the column headers in order, then use IndexOf to get the index of a given column name, and use that index to get the value. For example:
var orderedColumnsList = doc.DocumentNode.SelectNodes("//tr[@class='labels']/th").Select(td => td.InnerText.Trim()).ToList();
var carbsValue = doc.DocumentNode.SelectSingleNode(string.Format("//tr[@class='resultsTotals']/td[{0}]", orderedColumnsList.IndexOf("Carbs") + 1)).InnerText;
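One caveat with the IndexOf approach: if the user has hidden a column, IndexOf returns -1, the positional XPath matches nothing, and SelectSingleNode returns null. A minimal sketch that pairs headers and totals by position into a dictionary instead, assuming the table markup shown in the question. TotalsReader and ReadTotals are hypothetical names, and HtmlAgilityPack has to be referenced as a NuGet package:

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package, not part of the base class library

public static class TotalsReader
{
    // Hypothetical helper: maps each header label to the total in the same column position.
    public static Dictionary<string, string> ReadTotals(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var headers = doc.DocumentNode.SelectNodes("//tr[@class='labels']/th");
        var totals = doc.DocumentNode.SelectNodes("//tr[@class='resultsTotals']/td");

        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        if (headers == null || totals == null)
            return result; // SelectNodes returns null when nothing matches

        // Pair the cells by position; stop at the shorter row to stay in range.
        for (var i = 0; i < Math.Min(headers.Count, totals.Count); i++)
            result[headers[i].InnerText.Trim()] = totals[i].InnerText.Trim();

        return result;
    }
}
```

A caller can then use totals.TryGetValue("Protein", out var protein) and treat a false return as "column hidden" instead of catching a NullReferenceException.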


C# Searching MongoDB string that starts with "xyz"

I'm trying to code a storage system and ran into a problem that I don't know how to solve.
public static List<string> GetAllItemsFromDB(string Searchbar)
{
List<string> retList = new List<string>();
retList.Clear();
var filter = Builders<DB_Package_Item>.Filter.Eq(Item => Item.Item_Name, Searchbar);
var ItemsMatch = Item_DB.Find(filter).ToList();
foreach (var Item in ItemsMatch.ToList())
{
retList.Add(Item.Item_Name);
}
return retList;
}
This works. But when I change the filter to:
var filter = Builders<DB_Package_Item>.Filter.ElemMatch(Item => Item.Item_Name, Searchbar);
It crashes as soon as I type any character in the search bar, with the error: "System.InvalidOperationException: The serializer for field 'Item_Name' must implement IBsonArraySerializer and provide item serialization info."
I just don't get why.
This is the Data_Package for MongoDB
public class DB_Package_Item
{
public Guid Id { get; set; }
public int Item_ID { get; set; }
public int Box_ID { get; set; }
public string Item_Name { get; set; }
public int Quantity { get; set; }
public string Partnumber { get; set; }
public string Supplier { get; set; }
}
Thanks for any help!
FilterDefinitionBuilder<TDocument>.ElemMatch<TItem> (the $elemMatch operator) matches elements of an array field, so it is not appropriate for searching a string field by a search term.
You need the $regex operator, which is FilterDefinitionBuilder<TDocument>.Regex in C# syntax.
using System.Text.RegularExpressions;
var filter = Builders<DB_Package_Item>.Filter.Regex(Item => Item.Item_Name
, new Regex("^" + Searchbar));
For case-insensitive:
new Regex("^" + Searchbar, RegexOptions.IgnoreCase)
Or
var filter = Builders<DB_Package_Item>.Filter.Regex(Item => Item.Item_Name
, new BsonRegularExpression("^" + Searchbar));
For case-insensitive,
new BsonRegularExpression("^" + Searchbar, "i")
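Because Searchbar comes straight from user input, characters such as ".", "+" or "(" would be interpreted as regex syntax, and some inputs (for example "c++") are not valid patterns at all. A minimal sketch that escapes the term before anchoring it; SearchPatterns and StartsWith are hypothetical names, and only the base class library is used:

```csharp
using System;
using System.Text.RegularExpressions;

public static class SearchPatterns
{
    // Hypothetical helper: builds an anchored, case-insensitive "starts with" regex
    // in which regex metacharacters from the user's input are escaped.
    public static Regex StartsWith(string term)
    {
        return new Regex("^" + Regex.Escape(term), RegexOptions.IgnoreCase);
    }
}
```

The result can be passed to the filter in the same way as the new Regex(...) call above, for example Filter.Regex(Item => Item.Item_Name, SearchPatterns.StartsWith(Searchbar)).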

C# parsing multiple json

The JSON data is as follows:
{"Sucess":true,
"Code":0,
"Msg":"Sucess",
"Data":{
"UserDayRanking":
{
"UserID":11452112,
"UserCharm":0,
"UserName":"gay",
"UserGender":1,
"UserLevel":36,
"UserPhoto":"http://res.xxx.com/2020/3/16/63719926625601201487545U11452112.jpeg",
"Ranking":0,
"IsNobility":0,
"NobilityType":0,
"NobilityLevel":0,
"UserShowStyle":null,
"LiveLevelUrl":null,
"IsStealth":false},
"DayRankingList":[
{
"UserID":3974854,
"UserCharm":114858,
"UserName":"jack",
"UserGender":1,
"UserLevel":91,
"UserPhoto":"http://res.xxx.com/2020/2/15/63717400601924412312384U3974854.jpeg",
"Ranking":2,
"IsNobility":1,
"NobilityType":1,
"NobilityLevel":3,
"UserShowStyle":
{
"NameColor":100102,
"BorderColor":100403,
"LiangMedal":0,
"DztCountDown":0,
"Mounts":100204,
"LiveLevelCode":0,
"LiveRights":null
},
"LiveLevelUrl":null,
"IsStealth":false
},
{"UserID":6231512,
"UserCharm":22644,
"UserName":"red.girl",
"UserGender":1,
"UserLevel":57,
"UserPhoto":"http://res.xxx.com/2019/11/20/63709843050801519858823U6231512.jpeg",
"Ranking":3,
"IsNobility":0,
"NobilityType":0,
"NobilityLevel":0,
"UserShowStyle":{
"NameColor":0,
"BorderColor":0,
"LiangMedal":0,
"DztCountDown":0,
"Mounts":0,
"LiveLevelCode":0,
"LiveRights":null
},
"LiveLevelUrl":null,
"IsStealth":false}
],
"LiveCharmSwitch":1,
"IsSelf":false
}
}
I want to use C# to extract:
"UserID": 3974854,
"UserCharm": 114858,
"UserName": "jack",
"UserID":6231512,
"UserCharm":22644,
"UserName":"red.girl",
That is, I want to extract UserID, UserCharm and UserName. This JSON has many layers.
What I want after the extraction is the following, with id assigned in order:
id = 1, UserID = 3974854, UserCharm = 114858, UserName = jack
id = 2, UserID = 6231512, UserCharm = 22644, UserName = red.girl
I use the following code, but it only extracts the first one:
string json = @"{""Sucess"":true,""Code"":0,""Msg"":""Sucess"",""Data"":{""UserDayRanking"":{""UserID"":11452112,""UserCharm"":0,""UserName"":""gay"",""UserGender"":1,""UserLevel"":36,""UserPhoto"":""http://res.xxx.com/2020/3/16/63719926625601201487545U11452112.jpeg"",""Ranking"":0,""IsNobility"":0,""NobilityType"":0,""NobilityLevel"":0,""UserShowStyle"":null,""LiveLevelUrl"":null,""IsStealth"":false},""DayRankingList"":[{""UserID"":3974854,""UserCharm"":114858,""UserName"":""jack"",""UserGender"":1,""UserLevel"":91,""UserPhoto"":""http://res.xxx.com/2020/2/15/63717400601924412312384U3974854.jpeg"",""Ranking"":2,""IsNobility"":1,""NobilityType"":1,""NobilityLevel"":3,""UserShowStyle"":{""NameColor"":100102,""BorderColor"":100403,""LiangMedal"":0,""DztCountDown"":0,""Mounts"":100204,""LiveLevelCode"":0,""LiveRights"":null},""LiveLevelUrl"":null,""IsStealth"":false},{""UserID"":6231512,""UserCharm"":22644,""UserName"":""red.girl"",""UserGender"":1,""UserLevel"":57,""UserPhoto"":""http://res.xxx.com/2019/11/20/63709843050801519858823U6231512.jpeg"",""Ranking"":3,""IsNobility"":0,""NobilityType"":0,""NobilityLevel"":0,""UserShowStyle"":{""NameColor"":0,""BorderColor"":0,""LiangMedal"":0,""DztCountDown"":0,""Mounts"":0,""LiveLevelCode"":0,""LiveRights"":null},""LiveLevelUrl"":null,""IsStealth"":false}],""LiveCharmSwitch"":1,""IsSelf"":false}}";
List<Info> jobInfoList = JsonConvert.DeserializeObject<List<Info>>(z);
foreach (Info jobInfo in jobInfoList)
{
//Console.WriteLine("UserName:" + jobInfo.UserName);
}
public class Info
{
public string UserCharm { get; set; }
public string UserName { get; set; }
public data DayRankingList { get; set; }
}
public class data
{
public int UserID { get; set; }
public string UserCharm { get; set; }
public string UserName { get; set; }
public string UserGender { get; set; }
public string UserLevel { get; set; }
}
The above code only shows username = jack and never shows username = red.girl.
As it looks to me, you want some details from your JSON that are in DayRankingList. As you only want some of the data, we can use a tool like http://json2csharp.com/ to create our classes and then remove what we don't need. We end up with the following classes.
public class DayRankingList
{
public int UserID { get; set; }
public int UserCharm { get; set; }
public string UserName { get; set; }
}
public class Data
{
public List<DayRankingList> DayRankingList { get; set; }
}
public class RootObject
{
public Data Data { get; set; }
}
Which you can deserialise like this
string json = .....
var root = JsonConvert.DeserializeObject<RootObject>(json);
Then if you wish, you can extract the inner data into a new List<> and then just work on that.
List<DayRankingList> rankingLists = root.Data.DayRankingList;
//Do something with this, such as output it
foreach(DayRankingList drl in rankingLists)
{
Console.WriteLine(String.Format("UserID {0} UserCharm {1} UserName {2}", drl.UserID, drl.UserCharm, drl.UserName));
}
You can use Json.Linq to parse your JSON into a JObject and enumerate the DayRankingList items (since it's an array). Then convert every item into the data class and order the resulting sequence by UserID:
var jObject = JObject.Parse(json);
var rankingList = (jObject["Data"] as JObject)?.Property("DayRankingList");
var list = rankingList.Value
.Select(rank => rank.ToObject<data>())
.OrderBy(item => item?.UserID);
foreach (var user in list)
Console.WriteLine($"{user.UserID} {user.UserName}");
Another way is to copy your JSON, go to the Edit->Paste Special->Paste JSON as classes menu in Visual Studio and generate a proper class hierarchy (I got 5 classes, which are too long to post here), then use them during deserialization.
The most type-safe way is to define the class structure that you want, like jason.kaisersmith suggested.
To get the final format you need, though, you might want to do an extra LINQ OrderBy and Select to include the id:
var finalList = rankingLists.OrderBy(rl => rl.UserID).Select((value, index) => new
{
id = index + 1, // Select's index is zero-based; the desired output starts at 1
value.UserID,
value.UserCharm,
value.UserName
});
foreach (var drl in finalList)
{
Console.WriteLine($"id = {drl.id}, UserID = {drl.UserID}, UserCharm = {drl.UserCharm}, UserName = {drl.UserName}");
}
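Putting the pieces of this thread together, here is a minimal end-to-end sketch. It assumes Newtonsoft.Json (a NuGet package) and relies on its default behaviour of ignoring JSON properties that have no matching class member. RankingEntry, RankingData, RankingRoot and RankingDemo are hypothetical names chosen to avoid clashing with the classes above:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json; // NuGet package

public class RankingEntry
{
    public int UserID { get; set; }
    public int UserCharm { get; set; }
    public string UserName { get; set; }
}

public class RankingData
{
    public List<RankingEntry> DayRankingList { get; set; }
}

public class RankingRoot
{
    public RankingData Data { get; set; }
}

public static class RankingDemo
{
    // Deserializes the payload, orders the entries by UserID and numbers them from 1.
    public static List<string> Describe(string json)
    {
        var root = JsonConvert.DeserializeObject<RankingRoot>(json);
        return root.Data.DayRankingList
            .OrderBy(e => e.UserID)
            .Select((e, index) =>
                $"id = {index + 1}, UserID = {e.UserID}, UserCharm = {e.UserCharm}, UserName = {e.UserName}")
            .ToList();
    }
}
```

Calling RankingDemo.Describe(json) and writing each returned string to the console gives one line per entry of DayRankingList.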

Get the full value from a table in a web page using Html Agility Pack

I am trying to get the full value of each "Transaction" and its URL using the Html Agility Pack. When I inspect the HTML source in the browser, I can see the full transaction id with a URL. My question is: how do I get the full value of every transaction and the URL associated with it?
Here is the URL of the site: http://explorer.litecoin.net/address/LeDGemnpqQjrK8v1s5HZKaDgjgDKQ2MYiK
Example of the data being returned: "TransactionBlockApprox. TimeAmountBalanceCurrency
5130f066e0...4682752013-11-28 09:14:170.30.3LTC"
protected void Page_Load(string address)
{
string Url = address;
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
string wallet = doc.DocumentNode.SelectNodes("/html/body/div/div/div/table")[0].InnerText;
}
Sometimes you have to do things manually. You can't get the full link directly because JavaScript is used; to get it, you have to work with the InnerHtml of the td element that contains the href and take what is between the double quotes. It is also better to represent the data as objects, such as:
public class data
{
public itemWithlink Transaction { get; set; }
public itemWithlink Block { get; set; }
public itemWithlink ApproxTime { get; set; }
public itemWithlink Amount { get; set; }
public itemWithlink Balance { get; set; }
public itemWithlink Currency { get; set; }
}
public class itemWithlink
{
public string numberOrname { get; set; }
public string link { get; set; }
}
Then, to produce a list of the table rows, with the link value set whenever one is found:
var list = htmlDoc.DocumentNode.SelectNodes("//table/tr").
Skip(1).
Select(tr => tr.Elements("td").
Select(td => new itemWithlink() {
numberOrname = td.InnerText, link = td.InnerHtml.Contains("href") ?
// Take the text between the first and last double quote; Substring expects (start, length).
td.InnerHtml.Substring(td.InnerHtml.IndexOf("\"") + 1, td.InnerHtml.LastIndexOf("\"") - td.InnerHtml.IndexOf("\"") - 1)
.Replace("..",@"http://explorer.litecoin.net/") : null })
.ToArray())
.Select(row => new data() { Transaction = row[0], Block = row[1], ApproxTime = row[2], Amount = row[3], Balance = row[4] , Currency = row[5] }).ToList();
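If the href attribute is present in the downloaded HTML (which the InnerHtml.Contains("href") check above implies), it can also be read as an attribute of the anchor element instead of being cut out of the markup by quote position. A hedged sketch, assuming each linked cell contains an a element with a relative href. LinkCells and ReadRows are hypothetical names, and HtmlAgilityPack has to be referenced as a NuGet package:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package

public static class LinkCells
{
    // Hypothetical helper: for every data row, returns each cell's text and,
    // when the cell contains a link, that link resolved against the page address.
    public static List<(string Text, string Link)[]> ReadRows(string html, Uri pageUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Rows that contain td cells; a header row made of th cells is skipped.
        var rows = doc.DocumentNode.SelectNodes("//table//tr[td]");
        if (rows == null)
            return new List<(string Text, string Link)[]>();

        return rows.Select(tr => tr.Elements("td").Select(td =>
        {
            var anchor = td.SelectSingleNode(".//a[@href]");
            var href = anchor?.GetAttributeValue("href", "");
            // new Uri(base, relative) resolves "../" segments against the page URL.
            var link = string.IsNullOrEmpty(href) ? null : new Uri(pageUri, href).AbsoluteUri;
            return (Text: td.InnerText.Trim(), Link: link);
        }).ToArray()).ToList();
    }
}
```

Resolving through Uri avoids the string Replace on ".." and keeps working if the site changes how deep its relative links are.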

EF: Fetching Object containing a list as property

I have the following classes:
public class DbCatalogEntry
{
public int id { get; set; }
private string filename;
public long size { get; set; }
public int duration { get; set; }
public int height { get; set; }
public int width { get; set; }
public string vcodec { get; set; }
public List<CatalogAudioStreamEntry> audiostreams { get; set; }
public string imdbId { get; set; }
public int ownedByUser { get; set; }
public string mxHash { get; set; }
}
public class CatalogAudioStreamEntry
{
public int bitrate { get; set; }
public string codec { get; set; }
public string language { get; set; }
public int channels { get; set; }
public string features { get; set; }
}
This is a one-to-many relationship. I am trying to retrieve a list of DbCatalogEntry objects from the database and also fill the audiostreams in the same query.
I tried the following but it does not work:
var MovieEntryList =
(from vwCat in ctx.sp_GetMovieCatalog(getCurrentUserId())
select new DbCatalogEntry()
{
id = vwCat.id,
size = vwCat.size,
duration = vwCat.duration,
vcodec = vwCat.codec,
Filename = vwCat.filename,
imdbId = vwCat.imdb_id,
mxHash = vwCat.mxhash,
ownedByUser = (int)vwCat.owned,
width = vwCat.width,
height = vwCat.height,
audiostreams =
(from astr in ctx.audiostreamentry
where astr.movie_id == vwCat.id
select new CatalogAudioStreamEntry()
{
bitrate = astr.bitrate,
channels = astr.channels,
codec = astr.codec,
features = astr.features,
language = astr.language
}).ToList()
}).ToList();
A search revealed that you cannot put a ToList() into a LINQ to Entities query because it cannot be converted. I read multiple suggestions about changing audiostreams to IEnumerable but was not able to get this to work either. Most attempts compiled fine but failed at runtime with "Unable to create a constant value of type...".
Can anyone point me to the correct direction to solve this issue? The important thing is that the round trips to the database must be kept at a minimum so it would not be possible to create a client-side subquery for each DbCatalogEntry to fill the list.
Update #1:
To provide more details: This is the model that was generated by EF from my database:
sp_GetMovieCatalog(getCurrentUserId()) is a stored function on the SQL server which accepts one parameter which is used to filter the results.
What I want to do is select from movieentry and load all associated rows from audiostreamentry.
This data should be used to create instances of DbCatalogEntry, so the result type would be List<DbCatalogEntry>.
Is that possible at all? Maybe there is a better/easier solution?
I believe you have this problem because you are trying to use the same context for multiple queries.
You can try it this way:
var MovieEntryList =
(from vwCat in ctx.sp_MovieCatalog
where vwCat.owned == currentUserId
select new DbCatalogEntry()
{
id = vwCat.id,
size = vwCat.size,
duration = vwCat.duration,
vcodec = vwCat.codec,
Filename = vwCat.filename,
imdbId = vwCat.imdb_id,
mxHash = vwCat.mxhash,
ownedByUser = (int)vwCat.owned,
width = vwCat.width,
height = vwCat.height,
audiostreams = vwCat.audiostreamentry.Select(astr=>
new CatalogAudioStreamEntry()
{
bitrate = astr.bitrate,
channels = astr.channels,
codec = astr.codec,
features = astr.features,
language = astr.language
}).ToList()
}).ToList();
Or this way:
var MovieEntryList = ctx.audiostreamentry.Where(p => p.movieCatolog.ownedByUser == currentUserId)
//.ToList() // if you call ToList() here, everything after it runs as LINQ to Objects and you won't have most of these problems
.GroupBy(p => new { Id = p.movieCatolog.id, Owner = p.movieCatolog.ownedByUser })
.Select(p => new DbCatalogEntry {
id = p.Key.Id,
ownedByUser = p.Key.Owner,
audiostreams = p.Select(x => new CatalogAudioStreamEntry
{
bitrate = x.bitrate,
channels = x.channels,
codec = x.codec,
features = x.features,
language = x.language
}).ToList()
})
.ToList();
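If the nested projection still refuses to translate, one fallback is to run exactly two queries (one for the movies, one for all of their audio streams) and stitch the results together in memory. That is two round trips no matter how many movies are returned, not one per movie. The sketch below shows only the in-memory stitching step; MovieRow, AudioRow, MovieDto and CatalogAssembler are hypothetical stand-ins for the materialized query results, not types from the question:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for rows that were already materialized with ToList().
public class MovieRow { public int Id; public string Filename; }
public class AudioRow { public int MovieId; public string Codec; public string Language; }

public class MovieDto
{
    public int Id;
    public string Filename;
    public List<AudioRow> AudioStreams;
}

public static class CatalogAssembler
{
    // Groups the audio rows by movie id once, then attaches each group to its movie.
    public static List<MovieDto> Assemble(IEnumerable<MovieRow> movies, IEnumerable<AudioRow> audio)
    {
        var byMovie = audio.ToLookup(a => a.MovieId);
        return movies.Select(m => new MovieDto
        {
            Id = m.Id,
            Filename = m.Filename,
            // A lookup returns an empty sequence for a missing key, so movies
            // without audio streams get an empty list rather than null.
            AudioStreams = byMovie[m.Id].ToList()
        }).ToList();
    }
}
```

ToLookup keeps the stitching step linear in the number of rows, whereas filtering the audio list once per movie would rescan it for every entry.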

Get child element in another child element of XML

I have an XML file that looks like this (it was returned by a WCF web service after passing some values into it):
<Title>
<Questions>
<QuestionID> 1 </QuestionID>
<QuestionType> Quiz </QuestionType>
<Question> What is the shape? </Question>
<SubQuestionSequence> Part 1 </SubQuestionSequence>
<SubQuestions>
<Keywords> Ring </Keywords>
<ParentQuestionID> 1 </ParentQuestionID>
</SubQuestions>
<SubQuestionSequence> Part2 </SubQuestionSequence>
<SubQuestions>
<Keywords> Round </Keywords>
<ParentQuestionID> 1 </ParentQuestionID>
</SubQuestions>
</Questions>
</Title>
The classes that hold the child elements are below (written in C#). The commented line is supposed to reference the SubQuestion class, but I'm not sure how to write that part:
public class Questions {
public int QuestionID { get; set; }
public string QuestionType { get; set; }
public string Question { get; set; }
public string SubQuestionSequence { get; set; }
//suppose to call subQuestion here
}
public class SubQuestion {
public string Keywords { get ; set ; }
public int ParentQuestionID { get; set; }
}
This is the actual code-behind of the file, including the query. I do not know how to query elements that have another subsection:
void client_GetQuestionCompleted(object sender, GetQuestionCompletedEventArgs e)
{
if (e.Error != null)
return;
string result = e.Result.Nodes[0].ToString();
XDocument doc = XDocument.Parse(result);
var QuestionDetails = from Query in doc.Descendants("QuestionDetail")
select new Questions
{
QuestionID = (int)Query.Element("QuestionID"),
QuestionType = (string)Query.Element("QuestionType"),
Question = (string)Query.Element("Question"),
SubQuestionSequence = (string)Query.Element("SubQuestionSequence")
};
int z = 0;
foreach (var QuestionDetail in QuestionDetails)
{
qID = QuestionDetail.QuestionID;
qType = QuestionDetail.QuestionType;
quest = QuestionDetail.Question;
subQS = QuestionDetail.SubQuestionSequence;
z++;
}
}
As you can see above, how can I get the child elements of SubQuestions (Keywords and ParentQuestionID) when SubQuestions is itself a child element?
[edit] How can I retrieve the repeated elements inside the child element? I want some parts to loop and retrieve data, while others don't need a loop.
int z = 0;
foreach (var QuestionDetail in QuestionDetails)
{
qID = QuestionDetail.QuestionID;
qType = QuestionDetail.QuestionType;
quest = QuestionDetail.Question;
subQS[z] = QuestionDetail.SubQuestionSequence;
//doing it this way, i can only retrieve one row of record only,
//even though i used an array to save.
subKeyword[z] = QuestionDetail.SubQuestion.Keywords;
z++;
}
As long as there is only a single SubQuestions element, you can simply access Query.Element("SubQuestions").Element("Keywords") and Query.Element("SubQuestions").Element("ParentQuestionID").
[edit]
As for your class with a property of type SubQuestion, you would simply use
public class Questions {
public int QuestionID { get; set; }
public string QuestionType { get; set; }
public string Question { get; set; }
public string SubQuestionSequence { get; set; }
public SubQuestion SubQuestion{ get; set; }
}
public class SubQuestion {
public string Keywords { get ; set ; }
public int ParentQuestionID { get; set; }
}
and then in your query you can use e.g.
var QuestionDetails = from Query in doc.Descendants("QuestionDetail")
select new Questions
{
QuestionID = (int)Query.Element("QuestionID"),
QuestionType = (string)Query.Element("QuestionType"),
Question = (string)Query.Element("Question"),
SubQuestionSequence = (string)Query.Element("SubQuestionSequence"),
SubQuestion = new SubQuestion() {
Keywords = (string)Query.Element("SubQuestions").Element("Keywords"),
ParentQuestionID = (int)Query.Element("SubQuestions").Element("ParentQuestionID")
}
};
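The sample XML actually contains two SubQuestions elements and two SubQuestionSequence elements per question, which Element(...) cannot return because it yields only the first match. A minimal sketch that uses Elements(...) to collect the repeated children into lists. It assumes the question element is named Questions as in the XML shown (the query above uses QuestionDetail), and QuestionWithSubs, SubQuestionItem and QuestionParser are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public class SubQuestionItem
{
    public string Keywords { get; set; }
    public int ParentQuestionID { get; set; }
}

public class QuestionWithSubs
{
    public int QuestionID { get; set; }
    public string Question { get; set; }
    public List<string> SubQuestionSequences { get; set; }
    public List<SubQuestionItem> SubQuestions { get; set; }
}

public static class QuestionParser
{
    public static List<QuestionWithSubs> Parse(string xml)
    {
        var doc = XDocument.Parse(xml);
        return doc.Descendants("Questions").Select(q => new QuestionWithSubs
        {
            // Single-valued children: Element(...) returns the first match.
            QuestionID = (int)q.Element("QuestionID"),
            Question = ((string)q.Element("Question"))?.Trim(),
            // Repeated children: Elements(...) returns every match, in document order.
            SubQuestionSequences = q.Elements("SubQuestionSequence")
                .Select(s => s.Value.Trim())
                .ToList(),
            SubQuestions = q.Elements("SubQuestions").Select(s => new SubQuestionItem
            {
                Keywords = ((string)s.Element("Keywords"))?.Trim(),
                ParentQuestionID = (int)s.Element("ParentQuestionID")
            }).ToList()
        }).ToList();
    }
}
```

The two lists stay in document order, so SubQuestionSequences[i] lines up with SubQuestions[i] as long as the XML keeps the sequence and sub-question elements paired as in the sample.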
