Extracting Data from your HTML Feed
Feeds can be used to get content from ordinary HTML web pages. The fetched page can be accessed through the Liquid doc object, as follows:
{{doc}}
On the test tab this looks like:
This returns the entire HTML document. To extract content from it, there are three helpers that can be used:
Liquid’s standard ‘strip_html’ filter can be useful when working with HTML documents: https://shopify.github.io/liquid/filters/strip_html
In this example we will get the biography text from the Taxi for Email Twitter profile.
Feed set up
Set the feed URL to https://twitter.com/taxiforemail
Set the method to ‘GET’
Set the data type to ‘HTML’
Data Extraction
First, open the Twitter page in a browser, then use the browser’s ‘inspect’ tool to find the element we’re looking for:
We can see that the text is in a <p> tag with the class ‘ProfileHeaderCard-bio’. We can use this to make the following CSS selector:
p.ProfileHeaderCard-bio
We can get the content of this <p> element through the doc object, using the find_first_by_css filter:
{{doc | find_first_by_css: 'p.ProfileHeaderCard-bio'}}
This returns the biography together with its HTML tags.
If we want just the text, without the HTML tags, we can add the strip_html filter:
{{doc | find_first_by_css: 'p.ProfileHeaderCard-bio' | strip_html}}
This returns just the biography text, with no HTML tags.
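To make the two steps easier to picture, here is a rough Python sketch of what "find the first element matching p.ProfileHeaderCard-bio, then strip the tags" amounts to. It is not how Taxi implements these filters. The function names mirror the Liquid filters only for readability, the SAMPLE markup is made up for illustration, and returning the element's inner HTML (rather than the whole element) is an assumption. The tag-stripping regex is a simplified approximation of Liquid's strip_html.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup standing in for the fetched page (not the real Twitter HTML).
SAMPLE = """
<html><body>
  <div class="ProfileHeaderCard">
    <p class="ProfileHeaderCard-bio u-dir">Example bio with a <a href="https://example.com">link</a>.</p>
    <p class="ProfileHeaderCard-bio">Second match, ignored.</p>
  </div>
</body></html>
"""


class _FirstMatch(HTMLParser):
    """Collects the inner HTML of the first <tag> whose class list contains css_class."""

    def __init__(self, tag, css_class):
        super().__init__(convert_charrefs=True)
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # 0 = outside the match; >0 = nesting depth inside it
        self.done = False   # set once the first match has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth > 0:
            # Already inside the match: keep the child tag as written.
            self.parts.append(self.get_starttag_text())
            if tag == self.tag:
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.done or self.depth == 0:
            return
        if tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth > 0 and not self.done:
            self.parts.append(data)


def find_first_by_css(html, tag, css_class):
    """Return the inner HTML of the first `tag.css_class` element, or None if absent."""
    parser = _FirstMatch(tag, css_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts) if parser.done else None


def strip_html(fragment):
    """Remove anything that looks like a tag (simplified approximation)."""
    return re.sub(r"<.*?>", "", fragment, flags=re.S)


if __name__ == "__main__":
    bio = find_first_by_css(SAMPLE, "p", "ProfileHeaderCard-bio")
    print(bio)              # Example bio with a <a href="https://example.com">link</a>.
    print(strip_html(bio))  # Example bio with a link.
```

Only the first matching element is used, even though the sample contains two paragraphs with the same class, which is the behaviour the name find_first_by_css suggests.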