Commit bac78a9c authored by pushkar191098's avatar pushkar191098

Addition: training material for scraping.

1 merge request: !1 addition: main_server_code, scripts, docs
Showing with 1421 additions and 8 deletions
@@ -11,7 +11,7 @@ from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
-def single_product(log, driver, download_dir, new_output_dir):
+def single_product(log, driver, download_dir, new_output_dir, win_handle=2):
try:
doc_section = driver.find_elements(
By.XPATH, '//ul[@class="documentation__content"]//li')
@@ -20,7 +20,7 @@ def single_product(log, driver, download_dir, new_output_dir):
'a').get_attribute('href')
product_name = str(driver.current_url).split('-')[-1].strip()
try:
-product_name = product_name.split('?')[1].strip
+product_name = product_name.split('-')[-1].split('?')[:1][0]
except:
pass
driver.switch_to.new_window()
@@ -39,8 +39,8 @@ def single_product(log, driver, download_dir, new_output_dir):
time.sleep(2)
driver.close()
-driver.switch_to.window(driver.window_handles[2])
-except:
+driver.switch_to.window(driver.window_handles[win_handle])
+except Exception as e:
log.info('exception', traceback.format_exc())
@@ -118,14 +118,12 @@ def GraingerSelenium(agentRunContext):
'//button[@aria-label="Submit Search Query"]').click()
time.sleep(5)
check_url = str(driver.current_url)
# If multi_products are there in search params
if '?search' in check_url:
if len(driver.find_elements(By.XPATH, '//div[@class = "multi-tiered-category"]')) > 0:
multi_product(log, wait, driver, download_dir, new_output_dir)
# If single_products are there in search params
else:
-single_product(log, driver, download_dir, new_output_dir)
+single_product(log, driver, download_dir, new_output_dir, 0)
log.job(config.JOB_RUNNING_STATUS, 'Downloaded All Invoices')
File added
%% Cell type:markdown id: tags:
# Scrapy documentation
Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages.
It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
%% Cell type:markdown id: tags:
---
%% Cell type:markdown id: tags:
## INSTALLATION
You can install Scrapy and its dependencies from PyPI with:
> pip install Scrapy
For more information see [Installation documentation](https://docs.scrapy.org/en/latest/intro/install.html)
%% Cell type:markdown id: tags:
----
%% Cell type:markdown id: tags:
### SAMPLE SPIDER CODE
```
# file_name = quotes_spider.py
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'https://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```
%% Cell type:markdown id: tags:
To run your Scrapy spider:
> scrapy runspider quotes_spider.py -o quotes.json
%% Cell type:markdown id: tags:
## What just happened?
When you ran the command `scrapy runspider quotes_spider.py`, Scrapy looked for a Spider definition inside it and ran it through its crawler engine.
The crawl started by making requests to the URLs defined in the start_urls attribute (in this case, only the URL for quotes in the humor category) and called the default callback method parse, passing the response object as an argument. In the parse callback, we loop through the quote elements using a CSS Selector, yield a Python dict with the extracted quote text and author, look for a link to the next page and schedule another request using the same parse method as callback.
Here you notice one of the main advantages of Scrapy: requests are scheduled and processed asynchronously. This means that Scrapy doesn’t need to wait for a request to be finished and processed; it can send another request or do other things in the meantime. This also means that other requests can keep going even if some request fails or an error happens while handling it.
%% Cell type:markdown id: tags:
---
%% Cell type:markdown id: tags:
### Simplest way to dump all my scraped items into a JSON/CSV/XML file?
To dump into a JSON file:
> scrapy crawl myspider -O items.json
To dump into a CSV file:
> scrapy crawl myspider -O items.csv
To dump into an XML file:
> scrapy crawl myspider -O items.xml
For more information see [Feed exports](https://docs.scrapy.org/en/latest/topics/feed-exports.html)
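%% Cell type:markdown id: tags:
The same exports can also be configured in code through the FEEDS setting (a minimal sketch, assuming Scrapy >= 2.1, where FEEDS was introduced):
```
# in settings.py, or as custom_settings on a spider
FEEDS = {
    'items.json': {'format': 'json'},
    'items.csv': {'format': 'csv'},
}
```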
---
%% Cell type:markdown id: tags:
Scrapy project example: [quotesbot](https://github.com/scrapy/quotesbot)
---
%% Cell type:markdown id: tags:
### Learn to Extract data
The best way to learn how to extract data with Scrapy is trying selectors using the Scrapy shell.
Run:
> scrapy shell 'https://quotes.toscrape.com/page/1/'
Using the shell, you can try selecting elements using CSS with the response object:
> >>> response.css('title')
> [<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
The result of running response.css('title') is a list-like object called SelectorList, which represents a list of Selector objects that wrap around XML/HTML elements and allow you to run further queries to fine-grain the selection or extract the data.
To extract the text from the title above, you can do:
> >>> response.css('title::text').getall()
> ['Quotes to Scrape']
There are two things to note here: one is that we’ve added ::text to the CSS query, to mean we want to select only the text elements directly inside the <title> element.
The other thing is that the result of calling .getall() is a list: it is possible that a selector returns more than one result, so we extract them all. When you know you just want the first result, as in this case, you can do:
> >>> response.css('title::text').get()
> 'Quotes to Scrape'
As an alternative, you could’ve written:
> >>> response.css('title::text')[0].get()
> 'Quotes to Scrape'
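%% Cell type:markdown id: tags:
Besides CSS, the response object also supports XPath directly (Scrapy converts CSS selectors to XPath under the hood):
> >>> response.xpath('//title/text()').get()
> 'Quotes to Scrape'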
---
%% Cell type:markdown id: tags:
## Run Scrapy from a script
You can use the API to run Scrapy from a script, instead of the typical way of running Scrapy via `scrapy crawl`.
Remember that Scrapy is built on top of the Twisted asynchronous networking library, so you need to run it inside the Twisted reactor.
The first utility you can use to run your spiders is `scrapy.crawler.CrawlerProcess`.
This class starts a Twisted reactor for you, configures the logging and sets shutdown handlers; it is the class used by all Scrapy commands.
The example below uses `scrapy.crawler.CrawlerRunner` instead, which leaves reactor management to you: you have to shut down the Twisted reactor yourself after the spider is finished. This can be achieved by adding callbacks to the deferred returned by the `CrawlerRunner.crawl` method.
Here’s an example of its usage, along with a callback to manually stop the reactor after MySpider has finished running.
```
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


class MySpider(scrapy.Spider):
    # Your spider definition
    ...


configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()
d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished
```
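%% Cell type:markdown id: tags:
For comparison, a minimal CrawlerProcess sketch; it starts and stops the reactor for you, so no manual shutdown callback is needed:
```
import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    # Your spider definition
    ...


process = CrawlerProcess()
process.crawl(MySpider)
process.start()  # blocks until the crawl is finished
```
%% Cell type:markdown id: tags:
Two sample spiders from this training material follow: one scraping product data from applied.com, one scraping search results from in.rsdelivers.com.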
```
import json
import time

from elasticsearch import Elasticsearch

import scrapy
from scrapy import Request


class AppliedSpider(scrapy.Spider):
    name = 'applied'
    user_agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'

    def __init__(self, search_param=''):
        self.api_url = 'https://www.applied.com'
        self.start_urls = [
            'https://www.applied.com/search?page=0&search-category=all&override=true&isLevelUp=false&q=' + search_param]
        super().__init__()

    def collect_data(self, response):
        # product url parsing
        # specification data
        spec = dict()
        for trs in response.xpath('//*[@id="specifications"]//table//tr'):
            key = trs.xpath('./td[1]/text()').get().strip()
            value = trs.xpath('./td[2]/text()').get().strip()
            spec[key] = value
        # final data
        data = {
            'company': response.xpath('//h1[@itemprop="brand"]/a/text()').get().strip(),
            'product': response.xpath('//span[@itemprop="mpn name"]/text()').get().strip(),
            'details': response.xpath('//div[@class="details"]//text()').get().strip(),
            'item': response.xpath('//div[@class="customer-part-number"]/text()').get().strip(),
            'description': [x.strip() for x in response.xpath('//div[@class="short-description"]/ul/li/text()').extract()],
            'specification': spec,
            'url': response.url.strip(),
            'timestamp': int(time.time() * 1000)
        }
        yield data

    def parse(self, response):
        # search url parsing
        for scrape_url in response.xpath('//a[@class="hide-for-print more-detail"]/@href').extract():
            # extract product url
            yield Request(self.api_url + scrape_url, self.collect_data)
        # extract next page url and re-run function
        next_page = response.xpath('//a[@class="next"]/@href').get()
        if next_page is not None:
            yield Request(self.api_url + next_page, self.parse)
```
```
import scrapy


class RSSpider(scrapy.Spider):
    crawler = 'RSSpider'
    name = 'RSSpider'
    main_domain = 'https://in.rsdelivers.com'
    start_urls = ['https://in.rsdelivers.com/productlist/search?query=749']

    def parse(self, response):
        for ele in response.css('a.snippet'):
            my_href = ele.xpath('./@href').get()
            yield scrapy.Request(url=self.main_domain + my_href, callback=self.collect_data)

    def collect_data(self, response):
        data = dict()
        meta_data = response.css('div.row-inline::text').extract()
        # the text nodes arrive as key/separator/value triples, hence the stride of 3
        for i in range(0, 100, 3):
            try:
                data[meta_data[i]] = meta_data[i + 2]
            except IndexError:
                # fewer than 100 text nodes; stop at the end of the list
                break
        data['title'] = str(response.css('h1.title::text').get()).strip()
        data['url'] = response.url
        yield data
```
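%% Cell type:markdown id: tags:
Either spider can be run the same way as the earlier example (the filename here is hypothetical; use whatever the spider is saved as):
> scrapy runspider applied_spider.py -o products.json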
%% Cell type:markdown id: tags:
## SELENIUM AUTOMATION AND WEBSCRAPING
%% Cell type:markdown id: tags:
Load the Driver
%% Cell type:code id: tags:
``` python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
my_service = Service('/home/amruth/Music/chromedriver')
driver = webdriver.Chrome(service=my_service)
```
%% Cell type:markdown id: tags:
Extra imports
%% Cell type:code id: tags:
``` python
# supported locator strategies
from selenium.webdriver.common.by import By
# to handle time-related delays
import time
# create creds.py with USERNAME, PASSWORD variables
import creds
```
%% Cell type:markdown id: tags:
To fetch the home URL
%% Cell type:code id: tags:
``` python
driver.get("https://kronos.tarento.com/login")
driver.maximize_window()
```
%% Cell type:markdown id: tags:
Login Content
%% Cell type:code id: tags:
``` python
time.sleep(1)
driver.find_element(By.XPATH, '//*[@type="email"]').send_keys(creds.USERNAME)
time.sleep(1)
driver.find_element(By.XPATH, '//*[@type="password"]').send_keys(creds.PASSWORD)
```
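%% Cell type:markdown id: tags:
Fixed time.sleep() delays are fragile; an explicit wait is more robust. A minimal sketch (assumes the same login page as above):
%% Cell type:code id: tags:
``` python
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# wait up to 10 seconds for the password field to be present
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.XPATH, '//*[@type="password"]')))
```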
%% Cell type:markdown id: tags:
To select the checkbox (clicking via JavaScript avoids "element not interactable" errors)
%% Cell type:code id: tags:
``` python
time.sleep(2)
driver.execute_script('arguments[0].click();',driver.find_element(By.XPATH, '//*[@type="checkbox"]'))
```
%% Cell type:markdown id: tags:
To click on the login button
%% Cell type:code id: tags:
``` python
driver.find_element(By.XPATH, '//*[@type="submit"]').click()
```
%% Cell type:markdown id: tags:
Scraping data from the page
%% Cell type:code id: tags:
``` python
time.sleep(2)
try:
my_username = driver.find_element(By.XPATH, '//a[@role="button"]').text.strip()
output = 'logged in as:' + my_username
except:
output = 'Login failed'
print(output)
```
%% Output
Login failed
%% Cell type:markdown id: tags:
To close the window and quit the browser (close() closes the current tab; quit() ends the whole session)
%% Cell type:code id: tags:
``` python
driver.close()
driver.quit()
```
%% Cell type:markdown id: tags:
This material covers:
* selenium scripts
* generalisation of pdf_scripts
* scrapy docs
* general refactor
%% Cell type:markdown id: tags:
# SELENIUM-WEBDRIVER-BASICS
%% Cell type:markdown id: tags:
### To install Selenium
> pip install selenium

For more details, refer to this link: https://selenium-python.readthedocs.io/
%% Cell type:markdown id: tags:
#### NOTES
1. Mismatched versions of Chrome and chromedriver will not work together.
2. For Firefox, profile_path is mandatory.
%% Cell type:markdown id: tags:
--------------------------------------------------------------------
%% Cell type:markdown id: tags:
To Initialize the driver
%% Cell type:code id: tags:
``` python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
my_service = Service('/home/amruth/Music/chromedriver')
driver = webdriver.Chrome(service=my_service)
```
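%% Cell type:markdown id: tags:
Optionally, browser options can be passed at construction. A small sketch (reuses my_service from above; '--headless' runs Chrome without a visible window):
%% Cell type:code id: tags:
``` python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

my_options = Options()
my_options.add_argument('--headless')  # run without a visible browser window
driver = webdriver.Chrome(service=my_service, options=my_options)
```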
%% Cell type:markdown id: tags:
To fetch a URL
Syntax: driver.get('my_url')
%% Cell type:code id: tags:
``` python
driver.get('https://www.google.com/')
```
%% Cell type:markdown id: tags:
To get the current URL
%% Cell type:code id: tags:
``` python
driver.current_url
```
%% Output
'https://www.google.com/'
%% Cell type:markdown id: tags:
To maximize the window
%% Cell type:code id: tags:
``` python
driver.maximize_window()
```
%% Cell type:markdown id: tags:
To go back to the previous page
%% Cell type:code id: tags:
``` python
#get a new page
driver.get("https://www.cricbuzz.com/")
```
%% Cell type:code id: tags:
``` python
#back to previous page with back()
driver.back()
```
%% Cell type:markdown id: tags:
To go forward a page
%% Cell type:code id: tags:
``` python
driver.forward()
```
%% Cell type:markdown id: tags:
To refresh the page
%% Cell type:code id: tags:
``` python
driver.refresh()
```
%% Cell type:markdown id: tags:
To take a screenshot
%% Cell type:code id: tags:
``` python
driver.save_screenshot(filename='/home/amruth/Pictures/2.png')
```
%% Output
True
%% Cell type:markdown id: tags:
To get the session ID
%% Cell type:code id: tags:
``` python
driver.session_id
```
%% Output
'52cb5dafe613edf285132391b58ed44a'
%% Cell type:markdown id: tags:
To view page source
%% Cell type:code id: tags:
``` python
driver.page_source
```
%% Cell type:markdown id: tags:
To create and switch to a new tab
%% Cell type:code id: tags:
``` python
driver.switch_to.new_window()
```
%% Cell type:markdown id: tags:
To get list of tabs
%% Cell type:code id: tags:
``` python
driver.window_handles
```
%% Output
['CDwindow-7A88A3B7E81EE88473EFA8F5FB49CD5D',
'CDwindow-585DC24AC56D0BC0A12B3FA2796921EF']
%% Cell type:markdown id: tags:
To close the tab
%% Cell type:code id: tags:
``` python
driver.close()
```
%% Cell type:markdown id: tags:
To switch back to an old tab
%% Cell type:code id: tags:
``` python
driver.switch_to.window(driver.window_handles[0])
```
%% Cell type:markdown id: tags:
To quit the browser
%% Cell type:code id: tags:
``` python
driver.quit()
```
%% Cell type:markdown id: tags:
# *What are Locators?*
* A locator is a command that tells Selenium IDE which GUI element it needs to operate on (say text boxes, buttons, check boxes, etc.).
* Identifying the correct GUI element is a prerequisite to creating an automation script.
%% Cell type:code id: tags:
``` python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
```
%% Cell type:code id: tags:
``` python
my_service = Service(r"C:\Drivers\chromedriver_win32\chromedriver.exe")
driver = webdriver.Chrome(service=my_service)
driver.get("https://www.amazon.in/")
```
%% Cell type:markdown id: tags:
# Types of Locators
%% Cell type:markdown id: tags:
# 1. By Tag Name
%% Cell type:markdown id: tags:
Syntax: driver.find_element(By.TAG_NAME, 'tag_name')
%% Cell type:code id: tags:
``` python
driver.find_elements(By.TAG_NAME, 'input')
```
%% Cell type:code id: tags:
``` python
driver.find_element(By.TAG_NAME, 'input')
```
%% Cell type:code id: tags:
``` python
driver.find_elements(By.TAG_NAME, 'a')
```
%% Cell type:code id: tags:
``` python
# no <button> tag is available on this page, so find_element will throw an error
driver.find_element(By.TAG_NAME, 'button')
```
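%% Cell type:markdown id: tags:
A sketch of handling the missing element explicitly rather than letting the error propagate (NoSuchElementException is what find_element raises):
%% Cell type:code id: tags:
``` python
from selenium.common.exceptions import NoSuchElementException

try:
    button = driver.find_element(By.TAG_NAME, 'button')
except NoSuchElementException:
    button = None  # element not present on the page
```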
%% Cell type:markdown id: tags:
# 2. By Name
%% Cell type:markdown id: tags:
Syntax: driver.find_element(By.NAME, 'my_name')
%% Cell type:code id: tags:
``` python
# <input type="text" id="twotabsearchtextbox" value="" name="field-keywords"
# autocomplete="off" placeholder="" class="nav-input nav-progressive-attribute" dir="auto" tabindex="0" aria-label="Search">
driver.find_element(By.NAME, 'field-keywords')
```
%% Cell type:code id: tags:
``` python
# <input data-addnewaddress="add-new" id="unifiedLocation1ClickAddress" name="dropdown-selection"
# type="hidden" value="add-new" class="nav-progressive-attribute">
driver.find_element(By.NAME, 'dropdown-selection')
```
%% Cell type:markdown id: tags:
# 3. By ID
%% Cell type:markdown id: tags:
* IDs are generally unique to an element.
Syntax: driver.find_element(By.ID, 'my_id')
%% Cell type:code id: tags:
``` python
# <input type="text" id="twotabsearchtextbox" value="" name="field-keywords"
# autocomplete="off" placeholder="" class="nav-input nav-progressive-attribute" dir="auto" tabindex="0" aria-label="Search">
driver.find_element(By.ID, 'twotabsearchtextbox')
```
%% Cell type:code id: tags:
``` python
# <div id="nav-cart-count-container">
driver.find_element(By.ID, 'nav-cart-count-container')
```
%% Cell type:markdown id: tags:
# 4. By Class Name
%% Cell type:markdown id: tags:
Syntax: driver.find_element(By.CLASS_NAME, 'class_name')
%% Cell type:code id: tags:
``` python
# single word class_name
# <div class="nav-search-field ">
driver.find_elements(By.CLASS_NAME, 'nav-search-field')
```
%% Cell type:code id: tags:
``` python
# <div class="nav-left">
driver.find_elements(By.CLASS_NAME, 'nav-left')
```
%% Cell type:code id: tags:
``` python
# multi word class_name
```
%% Cell type:code id: tags:
``` python
# If the class_name has spaces in it, passing the exact class_name will not work.
driver.find_element(By.CLASS_NAME, 'nav-search-submit nav-sprite')
```
%% Cell type:code id: tags:
``` python
# put a dot "." instead of the spaces (will work)
driver.find_element(By.CLASS_NAME, 'nav-search-submit.nav-sprite')
```
%% Cell type:markdown id: tags:
# 5. By Link Text
%% Cell type:markdown id: tags:
* The text enclosed within an anchor tag is used to identify a link or hyperlink.
Syntax: driver.find_element(By.LINK_TEXT, 'text')
%% Cell type:code id: tags:
``` python
# LINK_TEXT needs an exact match, so the partial text 'Best' finds nothing
driver.find_elements(By.LINK_TEXT, 'Best')
```
%% Cell type:code id: tags:
``` python
driver.find_elements(By.LINK_TEXT, 'Best Sellers')
```
%% Cell type:markdown id: tags:
# 6. By Partial Link Text
%% Cell type:markdown id: tags:
* The partial text enclosed within an anchor tag is used to identify a link or hyperlink.
Syntax: driver.find_element(By.PARTIAL_LINK_TEXT, 'text')
%% Cell type:code id: tags:
``` python
driver.find_element(By.PARTIAL_LINK_TEXT, 'Best')
```
%% Cell type:code id: tags:
``` python
driver.find_elements(By.PARTIAL_LINK_TEXT, 'Best Sellers')
```
%% Cell type:markdown id: tags:
# 7. By XPATH
%% Cell type:markdown id: tags:
* The element is identified with an XPath built from an HTML attribute, its value, and the tagName.
* XPath is of two types: absolute and relative.
* For absolute XPath, we have to traverse from the root to the element.
* For relative XPath, we can start from any position in the DOM.
* An XPath expression should follow a particular rule: //tagname[@attribute='value']. The tag name is optional; if it is omitted, the expression becomes //*[@attribute='value'].
Syntax: driver.find_element(By.XPATH, '//XPATH')
%% Cell type:code id: tags:
``` python
# //tag_name
driver.find_elements(By.XPATH, '//input')
```
%% Cell type:code id: tags:
``` python
# //tag_name[@attribute = "value"]
driver.find_elements(By.XPATH, '//input[@type="text"]')
```
%% Cell type:code id: tags:
``` python
# //*[@attribute = "value"]
# * matches any tag name; elements are selected by the attribute and value alone
driver.find_elements(By.XPATH, '//*[@id="nav-xshop"]')
```
%% Cell type:code id: tags:
``` python
# //*[@attribute = "value"]/tag_name
# / selects a direct child
driver.find_elements(By.XPATH, '//div[@class="nav-fill"]/div')
```
%% Cell type:code id: tags:
``` python
# //*[@attribute = "value"]//tag_name
# // selects all descendants
driver.find_elements(By.XPATH, '//div[@class="nav-fill"]//div')
```
%% Cell type:code id: tags:
``` python
# //tagname[. = "text"]
driver.find_elements(By.XPATH, '//a[. = "Best Sellers"]')
```
%% Cell type:code id: tags:
``` python
# //tag_name/..
# .. means parent of the tag_name
driver.find_elements(By.XPATH, '//input/..')
```
%% Cell type:code id: tags:
``` python
driver.find_elements(By.XPATH, '//*[@id="nav-tools"]/a')
```
%% Cell type:code id: tags:
``` python
driver.find_elements(By.XPATH, '//*[@id="nav-tools"]/a[1]')
```
%% Cell type:code id: tags:
``` python
driver.find_elements(By.XPATH, '//*[@id="nav-tools"]/a[last()]')
```
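%% Cell type:markdown id: tags:
XPath also offers functions such as contains(); a small sketch matching the same "Best Sellers" links as above by partial text:
%% Cell type:code id: tags:
``` python
# //tag_name[contains(., "text")]
# matches elements whose text contains the given substring
driver.find_elements(By.XPATH, '//a[contains(., "Best")]')
```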
%% Cell type:markdown id: tags:
# 8. By CSS Locator
%% Cell type:markdown id: tags:
* The element is identified with a CSS selector built from an HTML attribute, value, or tagName.
Syntax: driver.find_elements(By.CSS_SELECTOR, 'input#txt')
%% Cell type:code id: tags:
``` python
# tag_name
driver.find_elements(By.CSS_SELECTOR, 'input')
```
%% Cell type:code id: tags:
``` python
# tag_name.class1.class2
driver.find_elements(By.CSS_SELECTOR, 'input.nav-input.nav-progressive-attribute')
```
%% Cell type:code id: tags:
``` python
# tag_name#id
driver.find_elements(By.CSS_SELECTOR, 'input#twotabsearchtextbox')
```
%% Cell type:code id: tags:
``` python
# parent_tag_name > child_tag_name
driver.find_elements(By.CSS_SELECTOR, 'div > input')
```
%% Cell type:code id: tags:
``` python
# #id
driver.find_elements(By.CSS_SELECTOR, '#twotabsearchtextbox')
```
%% Cell type:code id: tags:
``` python
# #id > parent_tag_name > child_tag_name
driver.find_elements(By.CSS_SELECTOR, '#CardInstanceQNqkNMgnYMdkg9dk0pUzTQ > div > div')
```
%% Cell type:code id: tags:
``` python
# tag_name[attribute = "value"]
driver.find_elements(By.CSS_SELECTOR, 'input[aria-label="Search"]')
```
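%% Cell type:markdown id: tags:
CSS attribute selectors also support prefix matching (standard CSS, not Selenium-specific); a small sketch using the search box's name="field-keywords" from above:
%% Cell type:code id: tags:
``` python
# tag_name[attribute^="prefix"] : attribute value starts with "prefix"
driver.find_elements(By.CSS_SELECTOR, 'input[name^="field"]')
```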