# Extracting Data from your HTML Feed

### Extracting HTML

Feeds can fetch content from ordinary HTML web pages. The fetched page is available through the Liquid doc object, as follows:

![](https://downloads.intercomcdn.com/i/o/91174523/90d90531e59f8b1dcd0e36de/image27.png)

On the test tab this looks like:

![](https://downloads.intercomcdn.com/i/o/91175934/66ad5950090b37710682fb6d/image11.png)

This returns the entire HTML document. Three helpers are available for extracting content from it:

![](https://downloads.intercomcdn.com/i/o/91176304/051f5c871241bc0d19ca2ac5/Screenshot+2018-12-14+at+13.40.04.png)

### Tag Stripping

Liquid’s standard ‘strip\_html’ filter removes every HTML tag from a string, which is useful when working with HTML documents: [https://shopify.github.io/liquid/filters/strip\_html](https://shopify.github.io/liquid/filters/strip_html)
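As a small illustration, here is the example from the Liquid documentation linked above. It is written in plain Liquid syntax, exactly as it appears in that documentation; the feed example later in this article prefixes its expression with `#`, and whether a quoted string literal behaves the same way inside a feed template is an assumption rather than something this article confirms.

```liquid
{{ "Have <em>you</em> read <strong>Ulysses</strong>?" | strip_html }}
```

The tags are removed and the text between them is kept, so this renders as:

```
Have you read Ulysses?
```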

### HTML

In this example, we will get the biography text from the [Taxi for Email Twitter profile](https://twitter.com/taxiforemail).

![](https://downloads.intercomcdn.com/i/o/91184129/101e37fb152ddc59a3e966ba/image22.png)

Feed set up

* Set the feed URL to [https://twitter.com/taxiforemail](https://twitter.com/taxiforemail)
* Set the method to ‘GET’
* Set the data type to ‘HTML’

Data Extraction

First, open the Twitter page in a browser, then use the browser’s ‘inspect’ tool to find the element we’re looking for:

![](https://downloads.intercomcdn.com/i/o/91184917/28b4ed59b1a6eafc69d378a5/image19.png)

We can see that the text is in a \<p> tag with the class ‘ProfileHeaderCard-bio’. We can use this to make the following CSS selector:

p.ProfileHeaderCard-bio

We can get the content of this \<p> element through the doc object, using the find\_first\_by\_css filter:

![](https://downloads.intercomcdn.com/i/o/91184997/e0cacde95e7476a6950af475/image25.png)
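For readers who cannot view the screenshot, the expression is presumably along these lines. This is a sketch inferred from the strip\_html version of the same expression given later in this article, with that filter left off; it is not a transcription of the image.

```liquid
#{{doc | find_first_by_css: 'p.ProfileHeaderCard-bio' }}
```

The filter takes the CSS selector as its argument and, as the name suggests, returns the first element in the document that matches it.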

This gives the following result:

![](https://downloads.intercomcdn.com/i/o/91185504/85ee6438aff25dccc7fc69ac/image26.png)

If we want just the text from this, without HTML tags, we can add the strip\_html filter:

\#{{doc | find\_first\_by\_css: 'p.ProfileHeaderCard-bio' | strip\_html }}

This gives just the text:

![](https://downloads.intercomcdn.com/i/o/91185687/6c93d241347dceffdc3d2195/image14.png)
