Unverified Commit 0ea279e9 authored by Dhiraj Suthar, committed by GitHub

Merge pull request #1 from dileep-gadiraju/develop

addition: main_server_code, scripts, docs
# python-webscraping-quickstart
Python based Web-scraping Quick Start Project.
For scraping, the project uses the Selenium & Scrapy frameworks.
Run `python app.py`.
Successful local deployment should show `Server is up on port 5001`.
## Documentation
For scripting and configuration documentation, refer to [Documentation](docs/README.md).
## API Reference
_The following are mandatory Request Body Parameters:_

| Parameter | Type | Description |
| :-------- | :------- | :-------------------------------- |
| `JobId` | `string` | `(required) uuid of a job` |
### API Authorization
Currently the project uses HTTP Basic authorization for authentication.
Set the following environment variables:
| Variables | Type | Description |
| :-------- | :------- | :-------------------------------- |
| `BASIC_HTTP_USERNAME` | `string` | username for server |
| `BASIC_HTTP_PASSWORD` | `string` | password for server |
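
For illustration, a minimal client sketch using Python's `requests` — the endpoint path `/api/v0/status` is hypothetical (substitute the actual route from the API Reference), but the Basic-auth mechanics and the `JobId` body parameter are as documented above:

```
# Hypothetical client call; only the auth scheme and the JobId parameter
# are documented -- the route and response shape here are assumptions.
import requests

resp = requests.post(
    "http://localhost:5001/api/v0/status",           # hypothetical route
    json={"JobId": "6f1c9c58-0000-0000-0000-000000000000"},  # mandatory body parameter
    auth=("test", "generic@123#"),  # BASIC_HTTP_USERNAME / BASIC_HTTP_PASSWORD
)
print(resp.status_code, resp.text)
```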
## Authors
- [@dileep-gadiraju](https://github.com/dileep-gadiraju)
deploy/dev.env 0 → 100644
BASIC_HTTP_USERNAME=test
BASIC_HTTP_PASSWORD=generic@123#
ELASTIC_DB_URL=https://localhost:9200
BLOB_SAS_TOKEN=XXXXX
BLOB_ACCOUNT_URL=YYYYY
BLOB_CONTAINER_NAME=ZZZZZ
MAX_RUNNING_JOBS=4
version: '3.7'

services:
  web-scraping-project:
    deploy:
      replicas: 1
      update_config:
        parallelism: 3
        delay: 10s
      restart_policy:
        condition: on-failure
    ports:
      - "5001:5001"
    env_file:
      - ./dev.env
    networks:
      - frontend

networks:
  frontend:
    driver: overlay
    external: true
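
Note: `deploy.replicas`, `update_config`, and the external `overlay` network are Docker Swarm features; a stack file like this is typically deployed with `docker stack deploy -c <compose-file> <stack-name>` against an initialized swarm, rather than with plain `docker-compose up`.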
docs/README.md 0 → 100644
# Configuration README
* [Configure config.py](config.md)
* [Configure agents](agents.md)
* [Configure azure](azure.md)
* [Configure Environment Variables](env-variables.md)
* [Configure ElasticSearch Log](eslog.md)
* [Configure scripts.py](scripts.md)
* [Docker deployment](docker.md)
docs/agents.md 0 → 100644
# Agent Configurations
To include a new agent, add its agent_data entry to `/static/agents.json`.
format:
```
{
  "agentId": "MY-AGENT-1",
  "description": "Crawler For my_agent_1",
  "provider": "AGENT-PROVIDER-X",
  "URL": "https://www.my-agent.com",
  "scripts": {
    "scriptType1": "myAgentScript1",
    "scriptType2": "myAgentScript2",
    "scriptType3": "myAgentScript3",
    ...
  }
}
```
example:
```
[
  {
    "agentId": "APPLIED-SELENIUM",
    "description": "Crawler For Applied",
    "provider": "Applied",
    "URL": "https://www.applied.com",
    "scripts": {
      "info": "AppliedSelenium",
      "pdf": "AppliedSelenium"
    }
  },
  {
    "agentId": "GRAINGER-SELENIUM",
    "description": "Crawler For Grainger",
    "provider": "Grainger",
    "URL": "https://www.grainger.com",
    "scripts": {
      "info": "GraingerSelenium",
      "pdf": "GraingerSelenium"
    }
  }
]
```
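
Note: each key under `scripts` must be one of the script types configured in `AGENT_SCRIPT_TYPES` (see [config.md](config.md)), and each value must match the name of a function exported by that scriptType package under `./src/scripts` — the server resolves scripts by these names at startup (see `src/app.py`).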
docs/azure.md 0 → 100644
# Azure
1. Initialize BlobStorage object.
```
blob_storage = BlobStorage()
```
2. Set the folder for storage.
```
blob_storage.set_agent_folder(folder_name)
```
arguments:
* folder_name : Name of the folder.
3. Upload the file to BlobStorage.
```
b_status, b_str = blob_storage.upload_file(file_name, data)
```
arguments:
* file_name : Name of the file.
* data : data to be uploaded.

Whether existing blobs are overwritten is controlled by the `overwrite` flag passed to the `BlobStorage` constructor.
return:
* b_status : (boolean), whether the data was uploaded.
* b_str : the exception message if the upload failed, `'true'` otherwise.
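
Putting the three steps together — a minimal sketch, assuming the blob credentials are set in `config.py`; `BlobStorage` is imported from the `models` package, where it is exported:

```
# Minimal end-to-end sketch of the BlobStorage flow documented above.
from models import BlobStorage

blob_storage = BlobStorage(overwrite=True)   # overwrite is a constructor flag
blob_storage.set_agent_folder('MY-AGENT-1')  # files are stored under this folder

b_status, b_str = blob_storage.upload_file('result.json', b'{"A": "123"}')
if not b_status:
    print('upload failed:', b_str)
```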
docs/config.md 0 → 100644
# Configure config.py
* [Server configuration](#server-configuration)
* [Agent configuration](#agent-configuration)
* [AzureBlob configuration](#azureblob-configuration)
* [ElasticSearch DB variables](#elasticsearch-db-variables)
* [Logging configuration](#logging-configuration)
## Server configuration
| Variables | Type | Description |
| :-------- | :------- | :------------------------- |
| `SERVER_HOST` | `string` | host for Server |
| `SERVER_PORT` | `string` | port for Server |
| `SERVER_DEBUG` | `bool` | debugging for Server |
| `SERVER_CORS` | `bool` | CORS policy for Server |
| `SERVER_STATIC_PATH` | `string` | static folder path for Server |
| `API_URL_PREFIX` | `string` | url prefix for Server |
| `API_MANDATORY_PARAMS`| `list` | mandatory parameters for request |
| `BASIC_HTTP_USERNAME` | `string` | username to access Server |
| `BASIC_HTTP_PASSWORD` | `string` | password to access Server |
## Agent configuration
| Variables | Type | Description |
| :-------- | :------- | :------------------------- |
| `AGENT_SCRIPT_TYPES` | `dict` | types of scraping_scripts |
| `AGENT_CONFIG_PATH` | `string` | file_path for agent_configuration(json file) |
| `AGENT_CONFIG_PKL_PATH`| `string` | file_path for agent_configuration(pickle file) |
## AzureBlob configuration
| Variables | Type | Description |
| :-------- | :------- | :------------------------- |
| `BLOB_INTIGRATION` | `bool` | enable/disable AzureBlob Storage |
| `BLOB_SAS_TOKEN` | `string` | SAS Token for AzureBlob Storage |
| `BLOB_ACCOUNT_URL` | `string` | Account URL for AzureBlob Storage|
| `BLOB_CONTAINER_NAME` | `string` | Container for AzureBlob Storage |
## ElasticSearch DB variables
| Variables | Type | Description |
| :-------- | :------- | :------------------------- |
| `ELASTIC_DB_URL` | `string` | URL of ElasticSearch Server |
| `ES_LOG_INDEX` | `string` | Info Logging Index in ElasticSearch |
| `ES_JOB_INDEX` | `string` | Job Logging Index in ElasticSearch |
| `ES_DATA_INDEX` | `string` | Data Logging Index in ElasticSearch |
## Logging configuration
| Variables | Type | Description |
| :-------- | :------- | :------------------------- |
| `JOB_OUTPUT_PATH` | `string` | folder_path for JOB output |
| `MAX_RUNNING_JOBS` | `int` | Max No. of Running Jobs |
| `MAX_WAITING_JOBS` | `int` | Max No. of Waiting Jobs |
| `JOB_RUNNING_STATUS` | `string` | Status for Running Jobs |
| `JOB_COMPLETED_SUCCESS_STATUS`| `string` | Status for Successful Jobs |
| `JOB_COMPLETED_FAILED_STATUS` | `string` | Status for Failed Jobs |
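
For illustration, a fragment of what `config.py` might look like — a sketch under the assumption that values are read from environment variables with development defaults; only the variable names are given by the tables above:

```
# Hypothetical config.py fragment; the shipped file may differ.
import os

SERVER_HOST = os.environ.get('SERVER_HOST', '0.0.0.0')
SERVER_PORT = os.environ.get('SERVER_PORT', '5001')
SERVER_DEBUG = os.environ.get('SERVER_DEBUG', 'false').lower() == 'true'

BASIC_HTTP_USERNAME = os.environ.get('BASIC_HTTP_USERNAME', '')
BASIC_HTTP_PASSWORD = os.environ.get('BASIC_HTTP_PASSWORD', '')

ELASTIC_DB_URL = os.environ.get('ELASTIC_DB_URL', 'http://localhost:9200')
MAX_RUNNING_JOBS = int(os.environ.get('MAX_RUNNING_JOBS', '4'))
MAX_WAITING_JOBS = int(os.environ.get('MAX_WAITING_JOBS', '10'))
```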
docs/docker.md 0 → 100644
# Docker Deployment
* Stop and remove existing containers with name `web-scraping-project`.
```
docker stop web-scraping-project
docker rm web-scraping-project
```
* Build Docker image: `web-scraping-project`
```
docker build -t web-scraping-project ./src/
```
_Note: ./src/ contains Dockerfile_
* Spawn: `web-scraping-project`.
```
docker run --name web-scraping-project -p 5001:5001 --env-file ./deploy/dev.env -it web-scraping-project
```
_Note: the environment file (`--env-file`) is read from local storage_
docs/env-variables.md 0 → 100644
# Environment Variables
The following environment variables are supported:
| Variables | Type | Description |
| :-------- | :------- | :------------------------- |
| `BASIC_HTTP_USERNAME` | `string` | username for server |
| `BASIC_HTTP_PASSWORD` | `string` | password for server |
| `ELASTIC_DB_URL` | `string` | URL of elasticsearch_DB |
| `BLOB_SAS_TOKEN` | `string` | azure blob_storage SAS token |
| `BLOB_ACCOUNT_URL` | `string` | azure blob_storage account_URL |
| `BLOB_CONTAINER_NAME` | `string` | azure blob_storage container_name|
| `MAX_RUNNING_JOBS` | `int` | maximum jobs running at a time |
| `MAX_WAITING_JOBS` | `int` | maximum jobs waiting at a time |
docs/eslog.md 0 → 100644
# ElasticSearch Log
* Initialize Log object.
```
log = Log(agentRunContext)
```
* Types of logs:
1. `log.job` : logs the job status; entries are added to `config.ES_JOB_INDEX`.
Syntax:
```
log.job(status, message)
```
Examples:
```
log.job(config.JOB_RUNNING_STATUS, 'Job Started')
try:
    # your code goes here
    log.job(config.JOB_COMPLETED_SUCCESS_STATUS, 'Job Completed')
except Exception:
    log.job(config.JOB_COMPLETED_FAILED_STATUS, 'Job Failed')
```
2. `log.info` : logs job info; entries are added to `config.ES_LOG_INDEX`.
Syntax:
```
log.info(info_type, message)
```
Examples:
```
log.info('info', 'This is a generalization project')
log.info('warning', 'Script is taking more than usual time')
log.info('exception', 'No Products Available')
```
3. `log.data` : logs scraped job data; entries are added to `config.ES_DATA_INDEX`.
Syntax:
```
log.data(data)
```
Example:
```
data = {
"A" : "123",
"B" : "Generic Project"
}
log.data(data)
```
docs/scripts.md 0 → 100644
# Scripts
1. Create a `python_file` in the respective scriptType folder under `./src/scripts`.
2. Format of the script `my_agent_script.py`:
```
# imports
import traceback

import config
from models import Log

# create a function
def myAgentScript(agentRunContext):
    log = Log(agentRunContext)
    try:
        log.job(config.JOB_RUNNING_STATUS, 'Job Started')
        # Your script
        # goes here
        log.job(config.JOB_COMPLETED_SUCCESS_STATUS, 'Successfully Scraped Data')
    except Exception as e:
        log.job(config.JOB_COMPLETED_FAILED_STATUS, str(e))
        log.info('exception', traceback.format_exc())
```
3. Add the script to the scriptType folder's `__init__.py` as
```
from .my_agent_script import myAgentScript
```
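
Note: the function name (here `myAgentScript`) must match the value registered for the corresponding scriptType in `/static/agents.json` (see [agents.md](agents.md)); the server looks the function up by that name when loading agents at startup.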
.gitignore 0 → 100644
**__pycache__
*.vscode
*.log
/env
exp_result.py
**.DS_Store
/upload/*
src/Dockerfile 0 → 100644
FROM python:3.9-slim
COPY / /app
WORKDIR /app
RUN apt update
RUN pip3 install -r requirements.txt
COPY start.sh /usr/bin/start.sh
RUN chmod +x /usr/bin/start.sh
CMD ["/usr/bin/start.sh"]
src/app.py 0 → 100644
import json
import os
import sys

from flask import Flask
from flask.blueprints import Blueprint
from flask_basicauth import BasicAuth
from flask_cors import CORS

# local imports
import config
import routes
from models import AgentUtils

# flask server
server = Flask(__name__)

# server configuration
config.SERVER_STATIC_PATH = server.static_folder
server.config['BASIC_AUTH_USERNAME'] = config.BASIC_HTTP_USERNAME
server.config['BASIC_AUTH_PASSWORD'] = config.BASIC_HTTP_PASSWORD

# basic_auth for server
basic_auth = BasicAuth(server)

# load agents config
with open(os.path.join(config.SERVER_STATIC_PATH, config.AGENT_CONFIG_PATH), 'r') as f:
    agent_list = json.load(f)

__import__("scripts")
my_scripts = sys.modules["scripts"]

agentUtils = AgentUtils()
agentUtils.filepath = os.path.join(
    config.SERVER_STATIC_PATH, config.AGENT_CONFIG_PKL_PATH)
pkl_agent_list = agentUtils.listAgents()
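
# register agents present in agents.json but not yet in the pickle store
# (assumes new agents are appended at the end of agents.json)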
len_diff = len(agent_list) - len(pkl_agent_list)
for i in range(len(agent_list)-1, len(agent_list)-len_diff-1, -1):
    agent = agent_list[i]
    agent_script = dict()
    for type in config.AGENT_SCRIPT_TYPES.values():
        agent_script[type] = my_scripts.__dict__[
            type].__dict__[agent['scripts'][type]]
    agentUtils.addAgent(agent['agentId'],
                        agent['description'],
                        agent['provider'],
                        agent_script,
                        agent['URL'])

# server CORS policy
if config.SERVER_CORS:
    cors = CORS(server, resources={r"/api/*": {"origins": "*"}})

# add blueprint routes to server
for blueprint in vars(routes).values():
    if isinstance(blueprint, Blueprint):
        server.register_blueprint(blueprint, url_prefix=config.API_URL_PREFIX)

# sample route
@server.route('/')
def home():
    return "<h1>HI</h1>"

# start server
if __name__ == "__main__":
    print('starting server at {} at port {}'.format(
        config.SERVER_HOST, config.SERVER_PORT))
    server.run(host=config.SERVER_HOST,
               port=config.SERVER_PORT,
               debug=config.SERVER_DEBUG,
               threaded=True)
src/models/__init__.py 0 → 100644
from .scraping_utils import get_driver
from .elastic_wrapper import Log
from .errors import ValueMissing, FormatError, BadRequestError
from .blob_storage import BlobStorage
src/models/blob_storage.py 0 → 100644
import os

import config
from azure.storage.blob import BlobServiceClient


class BlobStorage(object):

    def __init__(self, overwrite=False):
        self.blob_service_client = BlobServiceClient(
            account_url=config.BLOB_ACCOUNT_URL, credential=config.BLOB_SAS_TOKEN)
        self.root_folder = None
        self.overwrite = overwrite

    @property
    def root_folder(self):
        return self._root_folder

    @root_folder.setter
    def root_folder(self, rf):
        self._root_folder = rf

    @property
    def blob_service_client(self):
        return self._blob_service_client

    @blob_service_client.setter
    def blob_service_client(self, bsc):
        self._blob_service_client = bsc

    def set_agent_folder(self, agent_folder):
        self.root_folder = agent_folder

    def upload_file(self, file_name, file_contents):
        upload_file_path = os.path.join(self.root_folder, file_name)
        blob_client = self.blob_service_client.get_blob_client(
            container=config.BLOB_CONTAINER_NAME, blob=upload_file_path)
        try:
            blob_client.upload_blob(file_contents, overwrite=self.overwrite)
        except Exception as e:
            return False, str(e)
        return True, 'true'
src/models/elastic_wrapper.py 0 → 100644
import json
import time

import config
from elasticsearch import Elasticsearch


class Log(object):

    @classmethod
    def from_default(cls):
        return cls(None)

    def __init__(self, agentRunContext):
        self.agentRunContext = agentRunContext
        self.es_client = Elasticsearch([config.ELASTIC_DB_URL])

    def __populate_context(self):
        data = {
            'agentId': self.agentRunContext.requestBody['agentId'],
            'jobId': self.agentRunContext.jobId,
            'jobType': self.agentRunContext.jobType,
            'timestamp': int(time.time()*1000),
            'buildNumber': config.BUILD_NUMBER
        }
        return data

    def __index_data_to_es(self, index, data):
        if self.es_client.ping():
            self.es_client.index(index=index, body=json.dumps(data))
        else:
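            # ES unreachable: append the entry to a local fallback file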
            with open('logger.txt', 'a+') as f:
                f.write(json.dumps(data)+'\n')

    def info(self, info_type, message):
        info_data = self.__populate_context()
        info_data['type'] = info_type
        info_data['message'] = message
        self.__index_data_to_es(config.ES_LOG_INDEX, info_data)

    def data(self, data):
        data.update(self.__populate_context())
        self.__index_data_to_es(config.ES_DATA_INDEX, data)

    def job(self, status, message):
        job_data = self.__populate_context()
        job_data['status'] = status
        job_data['message'] = message
        self.__index_data_to_es(config.ES_JOB_INDEX, job_data)

    def get_status(self, jobId):
        if not self.es_client.ping():
            return {'status': 'ES_CONNECTION_FAILED', 'message': "Not able to connect to ES DB"}
        else:
            search_param = {
                "sort": [
                    {
                        "timestamp": {
                            "order": "desc"
                        }
                    }
                ],
                "query": {
                    "bool": {
                        "must": [
                            {"match": {
                                "jobId.keyword": jobId
                            }}
                        ]
                    }
                }
            }
            res = self.es_client.search(
                index=config.ES_JOB_INDEX, body=search_param)
            if len(res['hits']['hits']) > 0:
                source = res['hits']['hits'][0]['_source']
                return {'status': source['status'], 'message': source['message']}
            else:
                return {'status': 'JOBID_NOT_FOUND', 'message': "Please check the given jobId"}
src/models/errors.py 0 → 100644
from flask import jsonify
class RestAPIError(Exception):
    def __init__(self, status_code=500, payload=None):
        self.status_code = status_code
        self.payload = payload

    def to_response(self):
        return jsonify({'error': self.payload}), self.status_code


class BadRequestError(RestAPIError):
    def __init__(self, payload=None):
        super().__init__(400, payload)


class InternalServerErrorError(RestAPIError):
    def __init__(self, payload=None):
        super().__init__(500, payload)


class FormatError(Exception):
    def __init__(self, code, message):
        self._code = code
        self._message = message

    @property
    def code(self):
        return self._code

    @property
    def message(self):
        return self._message

    def __str__(self):
        return self.__class__.__name__ + ': ' + self.message


class WorkflowkeyError(Exception):
    def __init__(self, code, message):
        self._code = code
        self._message = message

    @property
    def code(self):
        return self._code

    @property
    def message(self):
        return self._message

    def __str__(self):
        return self.__class__.__name__ + ': ' + self.message


class FileErrors(Exception):
    def __init__(self, code, message):
        self._code = code
        self._message = message

    @property
    def code(self):
        return self._code

    @property
    def message(self):
        return self._message

    def __repr__(self):
        return {"code": self.code, "message": self.__class__.__name__ + ': ' + self.message}


class FileEncodingError(Exception):
    def __init__(self, code, message):
        self._code = code
        self._message = message

    @property
    def code(self):
        return self._code

    @property
    def message(self):
        return self._message

    def __str__(self):
        return self.__class__.__name__ + ': ' + self.message


class ServiceError(Exception):
    def __init__(self, code, message):
        self._code = code
        self._message = message

    @property
    def code(self):
        return self._code

    @property
    def message(self):
        return self._message

    def __str__(self):
        return self.__class__.__name__ + ': ' + self.message


class ValueMissing(Exception):
    def __init__(self, message):
        self.message = message

    @property
    def message(self):
        return self._message

    @message.setter
    def message(self, value):
        self._message = value

    def __str__(self):
        return self.message

    def __repr__(self):
        return self.message
src/models/scraping_utils.py 0 → 100644
import os
from pathlib import Path

import config
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

chrome_path = Service(config.CHROMEDRIVER_PATH)
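
# Headless Chrome blocks downloads by default; this registers and sends a
# Chrome DevTools command (Page.setDownloadBehavior) so that files are
# downloaded into download_dir.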
def enable_download_headless(browser, download_dir):
    browser.command_executor._commands["send_command"] = (
        "POST", '/session/$sessionId/chromium/send_command')
    params = {'cmd': 'Page.setDownloadBehavior', 'params': {
        'behavior': 'allow', 'downloadPath': download_dir}}
    browser.execute("send_command", params)


def get_driver(temp_directory):
    Path(temp_directory).mkdir(parents=True, exist_ok=True)
    download_dir = os.path.join(temp_directory)
    chrome_options = Options()
    d = DesiredCapabilities.CHROME
    d['goog:loggingPrefs'] = {'browser': 'ALL'}
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--window-size=1920x1080")
    chrome_options.add_argument("--disable-notifications")
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--verbose')
    chrome_options.add_argument('--log-level=3')
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.page_load_strategy = 'normal'
    chrome_options.add_argument(
        '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36')
    chrome_options.add_argument('--disable-software-rasterizer')
    chrome_options.add_experimental_option("prefs", {
        "download.default_directory": str(download_dir),
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing_for_trusted_sources_enabled": False,
        "safebrowsing.enabled": False,
        "plugins.always_open_pdf_externally": True
    })
    driver = webdriver.Chrome(
        service=chrome_path, options=chrome_options, desired_capabilities=d)
    enable_download_headless(driver, download_dir)
    return driver