scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
Scrapy | A Fast and Powerful Scraping and Web Crawling Framework
I am trying to pass a user defined argument to a scrapy's spider. Can anyone suggest on how to do that?
I read about a parameter -a
somewhere but have no idea how to use it.
Source: (StackOverflow)
I am writing a crawler for a website using scrapy with CrawlSpider.
Scrapy provides an in-built duplicate-request filter which filters duplicate requests based on urls. Also, I can filter requests using rules member of CrawlSpider.
What I want to do is to filter requests like:
http:://www.abc.com/p/xyz.html?id=1234&refer=5678
If I have already visited
http:://www.abc.com/p/xyz.html?id=1234&refer=4567
NOTE: refer is a parameter that doesn't affect the response I get, so I don't care if the value of that parameter changes.
Now, if I have a set which accumulates all ids I could ignore it in my callback function parse_item (that's my callback function) to achieve this functionality.
But that would mean I am still at least fetching that page, when I don't need to.
So what is the way in which I can tell scrapy that it shouldn't send a particular request based on the url?
Source: (StackOverflow)
I don't have a specific code issue I'm just not sure how to approach the following problem logistically with the Scrapy framework:
The structure of the data I want to scrape is typically a table row for each item. Straightforward enough, right?
Ultimately I want to scrape the Title, Due Date, and Details for each row. Title and Due Date are immediately available on the page...
BUT the Details themselves aren't in the table -- but rather, a link to the page containing the details (if that doesn't make sense here's a table):
|-------------------------------------------------|
| Title | Due Date |
|-------------------------------------------------|
| Job Title (Clickable Link) | 1/1/2012 |
| Other Job (Link) | 3/2/2012 |
|--------------------------------|----------------|
I'm afraid I still don't know how to logistically pass the item around with callbacks and requests, even after reading through the CrawlSpider section of the Scrapy documentation.
Source: (StackOverflow)
I have the item
object and i need to pass that along many pages to store data in single item
LIke my item is
class DmozItem(Item):
title = Field()
description1 = Field()
description2 = Field()
description3 = Field()
Now those three description are in three separate pages. i want to do somrething like
Now this works good for parseDescription1
def page_parser(self, response):
sites = hxs.select('//div[@class="row"]')
items = []
request = Request("http://www.example.com/lin1.cpp", callback =self.parseDescription1)
request.meta['item'] = item
return request
def parseDescription1(self,response):
item = response.meta['item']
item['desc1'] = "test"
return item
But i want something like
def page_parser(self, response):
sites = hxs.select('//div[@class="row"]')
items = []
request = Request("http://www.example.com/lin1.cpp", callback =self.parseDescription1)
request.meta['item'] = item
request = Request("http://www.example.com/lin1.cpp", callback =self.parseDescription2)
request.meta['item'] = item
request = Request("http://www.example.com/lin1.cpp", callback =self.parseDescription2)
request.meta['item'] = item
return request
def parseDescription1(self,response):
item = response.meta['item']
item['desc1'] = "test"
return item
def parseDescription2(self,response):
item = response.meta['item']
item['desc2'] = "test2"
return item
def parseDescription3(self,response):
item = response.meta['item']
item['desc3'] = "test3"
return item
Source: (StackOverflow)
I'm new to Scrapy and I'm looking for a way to run it from a Python script. I found 2 sources that explain this:
http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/
http://snipplr.com/view/67006/using-scrapy-from-a-script/
I can't figure out where I should put my spider code and how to call it from the main function. Please help. This is the example code:
# This snippet can be used to run scrapy spiders independent of scrapyd or the scrapy command line tool and use it from a script.
#
# The multiprocessing library is used in order to work around a bug in Twisted, in which you cannot restart an already running reactor or in this case a scrapy instance.
#
# [Here](http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a) is the mailing-list discussion for this snippet.
#!/usr/bin/python
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings') #Must be at the top before other imports
from scrapy import log, signals, project
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process, Queue
class CrawlerScript():
def __init__(self):
self.crawler = CrawlerProcess(settings)
if not hasattr(project, 'crawler'):
self.crawler.install()
self.crawler.configure()
self.items = []
dispatcher.connect(self._item_passed, signals.item_passed)
def _item_passed(self, item):
self.items.append(item)
def _crawl(self, queue, spider_name):
spider = self.crawler.spiders.create(spider_name)
if spider:
self.crawler.queue.append_spider(spider)
self.crawler.start()
self.crawler.stop()
queue.put(self.items)
def crawl(self, spider):
queue = Queue()
p = Process(target=self._crawl, args=(queue, spider,))
p.start()
p.join()
return queue.get(True)
# Usage
if __name__ == "__main__":
log.start()
"""
This example runs spider1 and then spider2 three times.
"""
items = list()
crawler = CrawlerScript()
items.append(crawler.crawl('spider1'))
for i in range(3):
items.append(crawler.crawl('spider2'))
print items
# Snippet imported from snippets.scrapy.org (which no longer works)
# author: joehillen
# date : Oct 24, 2010
Thank you.
Source: (StackOverflow)
My question is really how to do the same thing as a previous question, but in Scrapy 0.14.
Using one Scrapy spider for several websites
Basically, I have GUI that takes parameters like domain, keywords, tag names, etc. and I want to create a generic spider to crawl those domains for those keywords in those tags. I've read conflicting things, using older versions of scrapy, by either overriding the spider manager class or by dynamically creating a spider. Which method is preferred and how do I implement and invoke the proper solution? Thanks in advance.
Here is the code that I want to make generic. It also uses BeautifulSoup. I paired it down so hopefully didn't remove anything crucial to understand it.
class MySpider(CrawlSpider):
name = 'MySpider'
allowed_domains = ['somedomain.com', 'sub.somedomain.com']
start_urls = ['http://www.somedomain.com']
rules = (
Rule(SgmlLinkExtractor(allow=('/pages/', ), deny=('', ))),
Rule(SgmlLinkExtractor(allow=('/2012/03/')), callback='parse_item'),
)
def parse_item(self, response):
contentTags = []
soup = BeautifulSoup(response.body)
contentTags = soup.findAll('p', itemprop="myProp")
for contentTag in contentTags:
matchedResult = re.search('Keyword1|Keyword2', contentTag.text)
if matchedResult:
print('URL Found: ' + response.url)
pass
Source: (StackOverflow)
I am following this guide http://doc.scrapy.org/en/0.16/topics/practices.html#run-scrapy-from-a-script to run scrapy from my script.
Here is part of my script:
crawler = Crawler(Settings(settings))
crawler.configure()
spider = crawler.spiders.create(spider_name)
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
print "It can't be printed out!"
It works at it should: visits pages, scrape needed info and stores output json where I told it(via FEED_URI). But when spider finishing his work(I can see it by number in output json) execution of my script wouldn't resume.
Probably it isn't scrapy problem. And answer should somewhere in twisted's reactor.
How could I release thread execution?
Source: (StackOverflow)
In my previous question, I wasn't very specific over my problem (scraping with an authenticated session with Scrapy), in the hopes of being able to deduce the solution from a more general answer. I should probably rather have used the word crawling
.
So, here is my code so far:
class MySpider(CrawlSpider):
name = 'myspider'
allowed_domains = ['domain.com']
start_urls = ['http://www.domain.com/login/']
rules = (
Rule(SgmlLinkExtractor(allow=r'-\w+.html$'), callback='parse_item', follow=True),
)
def parse(self, response):
hxs = HtmlXPathSelector(response)
if not "Hi Herman" in response.body:
return self.login(response)
else:
return self.parse_item(response)
def login(self, response):
return [FormRequest.from_response(response,
formdata={'name': 'herman', 'password': 'password'},
callback=self.parse)]
def parse_item(self, response):
i['url'] = response.url
# ... do more things
return i
As you can see, the first page I visit is the login page. If I'm not authenticated yet (in the parse
function), I call my custom login
function, which posts to the login form. Then, if I am authenticated, I want to continue crawling.
The problem is that the parse
function I tried to override in order to log in, now no longer makes the necessary calls to scrape any further pages (I'm assuming). And I'm not sure how to go about saving the Items that I create.
Anyone done something like this before? (Authenticate, then crawl, using a CrawlSpider
) Any help would be appreciated.
Source: (StackOverflow)
I have a question on how to do this thing in scrapy. I have a spider that crawls for listing pages of items.
Every time a listing page is found, with items, there's the parse_item() callback that is called for extracting items data, and yielding items. So far so good, everything works great.
But each item, has among other data, an url, with more details on that item. I want to follow that url and store in another item field (url_contents) the fetched contents of that item's url.
And I'm not sure how to organize code to achieve that, since the two links (listings link, and one particular item link) are followed differently, with callbacks called at different times, but I have to correlate them in the same item processing.
My code so far looks like this:
class MySpider(CrawlSpider):
name = "example.com"
allowed_domains = ["example.com"]
start_urls = [
"http://www.example.com/?q=example",
]
rules = (
Rule(SgmlLinkExtractor(allow=('example\.com', 'start='), deny=('sort='), restrict_xpaths = '//div[@class="pagination"]'), callback='parse_item'),
Rule(SgmlLinkExtractor(allow=('item\/detail', )), follow = False),
)
def parse_item(self, response):
main_selector = HtmlXPathSelector(response)
xpath = '//h2[@class="title"]'
sub_selectors = main_selector.select(xpath)
for sel in sub_selectors:
item = ExampleItem()
l = ExampleLoader(item = item, selector = sel)
l.add_xpath('title', 'a[@title]/@title')
......
yield l.load_item()
Source: (StackOverflow)
I've written a working crawler using scrapy,
now I want to control it through a Django webapp, that is to say:
- Set 1 or several
start_urls
- Set 1 or several
allowed_domains
- Set
settings
values
- Start the spider
- Stop / pause / resume a spider
- retrieve some stats while running
- retrive some stats after spider is complete.
At first I thought scrapyd was made for this, but after reading the doc, it seems that it's more a daemon able to manage 'packaged spiders', aka 'scrapy eggs'; and that all the settings (start_urls
, allowed_domains
, settings
) must still be hardcoded in the 'scrapy egg' itself ; so it doesn't look like a solution to my question, unless I missed something.
I also looked at this question : How to give URL to scrapy for crawling? ;
But the best answer to provide multiple urls is qualified by the author himeslf as an 'ugly hack', involving some python subprocess and complex shell handling, so I don't think the solution is to be found here. Also, it may work for start_urls
, but it doesn't seem to allow allowed_domains
or settings
.
Then I gave a look to scrapy webservices :
It seems to be the good solution for retrieving stats. However, it still requires a running spider, and no clue to change settings
There are a several questions on this subject, none of them seems satisfactory:
I know that scrapy is used in production environments ; and a tool like scrapyd shows that there are definitvely some ways to handle these requirements (I can't imagine that the scrapy eggs scrapyd is dealing with are generated by hand !)
Thanks a lot for your help.
Source: (StackOverflow)
I am working on scrapy 0.20 with python 2.7.
I found pycharm a good python debugger.
I want to test my scrapy spiders using it.
Anyone knows how to do that please?
What I have tried
Actually I tried to run the spider as a scrip. As a result, I built that scrip.
Then, I tried to add my scrapy project to pycharm as a model like this:
file->setting->project structure->add content root.
But I don't know what else I have to do
Source: (StackOverflow)
I'm a bit confused as to how cookies work with Scrapy, and how you manage those cookies.
This is basically a simplified version of what I'm trying to do:

The way the website works:
When you visit the website you get a session cookie.
When you make a search, the website remembers what you searched for, so when you do something like going to the next page of results, it knows the search it is dealing with.
My script:
My spider has a start url of searchpage_url
The searchpage is requested by parse()
and the search form response gets passed to search_generator()
search_generator()
then yield
s lots of search requests using FormRequest
and the search form response.
Each of those FormRequests, and subsequent child requests need to have it's own session, so needs to have it's own individual cookiejar and it's own session cookie.
I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?
If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned? Is it possible to make only the first request generator spawn new spiders and make sure that from then on only that spider deals with future requests?
I assume I have to disable multiple concurrent requests.. otherwise one spider would be making multiple searches under the same session cookie, and future requests will only relate to the most recent search made?
I'm confused, any clarification would be greatly received!
EDIT:
Another options I've just thought of is managing the session cookie completely manually, and passing it from one request to the other.
I suppose that would mean disabling cookies.. and then grabbing the session cookie from the search response, and passing it along to each subsequent request.
Is this what you should do in this situation?
Source: (StackOverflow)
I want scrapy to crawl pages where going on to the next link looks like this:
<a rel='nofollow' href="#" onclick="return gotoPage('2');"> Next </a>
Will scrapy be able to interpret javascript code of that?
With livehttpheaders extension I found out that clicking Next generates a POST with a really huge piece of "garbage" starting like this:
encoded_session_hidden_map=H4sIAAAAAAAAALWZXWwj1RXHJ9n
I am trying to build my spider on the CrawlSpider
class, but I can't really figure out how to code it, with BaseSpider
I used the parse()
method to process the first URL, which happens to be a login form, where I did a POST with:
def logon(self, response):
login_form_data={ 'email': 'user@example.com', 'password': 'mypass22', 'action': 'sign-in' }
return [FormRequest.from_response(response, formnumber=0, formdata=login_form_data, callback=self.submit_next)]
And then I defined submit_next() to tell what to do next. I can't figure out how do I tell CrawlSpider which method to use on the first URL?
All requests in my crawling, except the first one, are POST requests. They are alternating two types of requests: pasting some data, and clicking "Next" to go to the next page.
Source: (StackOverflow)
In the Scrapy docs, there is the following example to illustrate how to use an authenticated session in Scrapy:
class LoginSpider(BaseSpider):
name = 'example.com'
start_urls = ['http://www.example.com/users/login.php']
def parse(self, response):
return [FormRequest.from_response(response,
formdata={'username': 'john', 'password': 'secret'},
callback=self.after_login)]
def after_login(self, response):
# check login succeed before going on
if "authentication failed" in response.body:
self.log("Login failed", level=log.ERROR)
return
# continue scraping with authenticated session...
I've got that working, and it's fine. But my question is: What do you have to do to continue scraping with authenticated session
, as they say in the last line's comment?
Source: (StackOverflow)