Extracting Data from your HTML Feed

Extracting HTML

Feeds can also fetch content from ordinary HTML web pages. The fetched page is available through the Liquid doc object, as follows:

On the test tab this looks like:

This returns the entire HTML document. To extract content from it, there are three helpers that can be used:

Tag Stripping:

Liquid’s standard ‘strip_html’ filter can be useful when working with HTML documents: https://shopify.github.io/liquid/filters/strip_html
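
To illustrate what tag stripping does to a fragment of HTML, here is a rough Python sketch using only the standard library. It is an analogue for explanation, not the Liquid implementation: the class name TextOnly, the function strip_tags and the sample markup are all made up for this example, and edge cases such as script blocks or comments may be handled differently by strip_html.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes and ignores every tag (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; the tags themselves are dropped.
        self.parts.append(data)


def strip_tags(html: str) -> str:
    """Return the text content of an HTML fragment with all tags removed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Made-up sample markup, not taken from a real page.
sample = '<p class="bio">Email design, <a href="#">simplified</a>.</p>'
print(strip_tags(sample))  # prints: Email design, simplified.
```

In a feed, the same effect comes from appending the strip_html filter to an expression that returns HTML.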

HTML

In this example we will get the biography text from the Taxi for Email Twitter profile.

Feed set up

Data Extraction

First, open the Twitter page in a browser. Then use the browser’s ‘inspect’ tool to find the element we’re looking for:

We can see that the text is in a <p> tag with the class ‘ProfileHeaderCard-bio’. We can use this to make the following CSS selector:

p.ProfileHeaderCard-bio
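
To make the selector concrete: it matches a <p> element whose class list contains ProfileHeaderCard-bio, and "first" means the first such element in document order. The following Python sketch mimics that rule with the standard library only. It is an assumption-laden illustration, not how the feed filter is implemented: FirstMatch, find_first and the sample page are hypothetical, only a simple tag.class selector is supported, and the markup is assumed to be well formed.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class FirstMatch(HTMLParser):
    """Captures the markup of the first element matching tag.css_class."""

    def __init__(self, tag, css_class):
        super().__init__(convert_charrefs=True)
        self.tag, self.css_class = tag, css_class
        self.depth = 0      # greater than 0 while inside the matched element
        self.done = False   # True once the matched element has been closed
        self.out = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            # Outside a match: only start capturing on tag + class match.
            classes = (dict(attrs).get("class") or "").split()
            if tag != self.tag or self.css_class not in classes:
                return
        self.out.append(self.get_starttag_text())
        if tag in VOID:
            self.done = self.depth == 0
        else:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID or self.done or self.depth == 0:
            return
        self.out.append(f"</{tag}>")
        self.depth -= 1
        self.done = self.depth == 0

    def handle_data(self, data):
        if self.depth and not self.done:
            self.out.append(data)


def find_first(html, tag, css_class):
    """Return the first matching element as an HTML string, or None."""
    parser = FirstMatch(tag, css_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.out) if parser.out else None


# Made-up page; the real profile markup will differ.
page = ('<div><p class="intro">Hi</p>'
        '<p class="ProfileHeaderCard-bio u-dir">Email '
        '<a href="/x">design</a><br>made easy</p></div>')
print(find_first(page, "p", "ProfileHeaderCard-bio"))
```

Note that the match still succeeds when the element carries extra classes (u-dir in the sample), which is how a class selector behaves in CSS.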

We can get the content of this <p> element through the doc object, using the find_first_by_css filter:

This gives the following result:

If we want just the text, without the HTML tags, we can add the strip_html filter:

#{{doc | find_first_by_css: 'p.ProfileHeaderCard-bio' | strip_html }}

Which gives just the text:
