Web Scraping 1: Selectors & Shell in Scrapy & Python (Scrapy Series)


Now let’s talk about Scrapy. Scrapy is a framework that helps you
extract data from websites. But before we can scrape data from a website,
we need a way of telling Scrapy exactly which data we need. So the first step is always to analyze the website
that we want to scrape.

Now, this is a sample website that is made for practicing scraping. This page shows one random quote. Every time I refresh, a different quote
is shown, but the structure remains the same. Let’s say that we want to extract the
quote, the author, and the tags. These are the three pieces of information
that we want to scrape.

So, first step: let’s go and check the website
structure. Right-click and inspect. We can see that the quote text is inside
a span, the author is inside a span, and the tags are inside a div. The tags are
individual elements because they probably link to other quotes on the same subject.

So now we have two options. Number one, we
can use XPath to select a specific element, or, number two, we can use CSS. If you are comfortable with XPath, go
ahead and use it. XPath expressions are very powerful and very flexible. But for a newcomer XPath can be a little bit
daunting, so CSS is a better choice. In this series we will be using CSS, so let’s
start with the first item: the text.

Let’s right-click
the text, the actual quote, and click Inspect, and here we have it. Let’s expand it. This text is inside a span that has
a class called “text” and an “itemprop” attribute whose value is also “text”. So “itemprop” is the attribute name and “text” is the attribute value. We can use either of these to find the element.

To check whether we are finding the element
correctly, let’s use something called the Scrapy shell. I’m going to copy this URL, go to the command
prompt, and run “pip install scrapy”. There is one more package that
I am going to install, called “ipython”; the Scrapy shell uses it automatically when it is available. Because I already have these packages installed,
it doesn’t take any time.

Now let’s run the scrapy command. Scrapy has various commands, and one of them
is shell. This is the one we are going to use: the shell gives us an interactive scraping
console. Let’s see it in action to understand how
it works. We type “scrapy shell” followed by the URL that we copied. Scrapy will connect to this
URL and give us some useful objects. The most important object here is response. If we
just print response, we can see the URL that we fetched and that the response
code was 200, which means the page was fetched successfully. Codes in the 200 range mean OK, 300 means a redirect, 400 means a client error,
and 500 means a server error.

Now if I print response.text,
what I see is the complete HTML of this particular page. What we need to do now is locate this
particular quote, and we saw that its class is “text”. So let’s copy this class. Let me press Ctrl+L to clear the screen. We have response.css(), and inside
the parentheses, in quotes, we can write our CSS selector. Because this is a class, we have to
prefix it with a “.”, a period: response.css('.text'). Let’s press Enter, and we can see that we get back
a selector object.

Now, we don’t need a selector object; what
we need is the actual element. To get the actual element, we use the get()
method, and we can see that we now have the complete span
element. So response.css() gives a selector, and
response.css().get() gives the specific element.

What we actually need, though, is the text. For that we use a pseudo-element, ::text (a Scrapy-specific extension to CSS selectors),
appended to the selector: response.css('.text::text').get(). This returns the actual text of the quote. So
far so good. Let’s move on. Now what we need is the author. Right-click > Inspect, and we have the author
here, in a tag called “small”. It has an “author” class and one more
attribute, “itemprop”, which equals “author”. So we can use either the class or the itemprop attribute. Let’s try the class first. Nothing really changes: I just
repeat the same command and paste in author, response.css('.author::text').get(). There we have it.

What you see on the page and what you see
in the shell are different: we have one quote here and
a different quote there. That is because these are two separate requests, and
because the quotes are picked randomly, each request shows a different one.

So here, very easily, we can get the author
by class. What if this class was not there? We could use the other attribute, “itemprop”. Now, CSS has no dedicated shorthand
for itemprop the way it has “.” for classes, so in this case we can
use attribute selectors. Attribute selectors use square brackets, and inside the square brackets I have simply pasted
the attribute and its value: response.css('[itemprop="author"]'). Now let’s call the get() method, and we get the whole element, the same as before. To get just the text, I
have to add ::text, and it has to go outside the square brackets; then we get the author. So now we have the text (the actual quote), and
we have the author, with two ways to get it.

Finally, it’s time to get the tags. Right-click > Inspect, and here
we can see that all these tags are contained inside
link tags that have the class
“tag”. So this is again straightforward. We’ll say response.css('.tag::text') and get().

Now there is a problem here. There are multiple tags and
we want all of them, but here we are getting only one. The solution is to call the getall()
method. So there are two methods, get() and getall(). If we call getall(), it returns
all the elements that match this particular CSS selector. So get() returns one value, and getall()
returns all values as a list. Have a look at the square brackets: if you see square brackets, it’s a list; if you see quotes, it’s a string. So get() returns a string, while getall() returns
a list of strings. And as you might have guessed, if there are
multiple values and you call get(), you will get only the first result.

Now that we have all the selectors ready,
let’s go and create our first spider.
