Unverified Commit 0ea279e9 authored by Dhiraj Suthar, committed by GitHub

Merge pull request #1 from dileep-gadiraju/develop

addition: main_server_code, scripts, docs
# python-webscraping-quickstart
Python based Web-scraping Quick Start Project.
For scraping, the project uses the Selenium & Scrapy frameworks.
Run `python app.py`.
Successful local deployment should show `Server is up on port 5001`.
## Documentation
For scripting and configuration documentation, refer to [Documentation](docs/README.md).
## API Reference
_The following are mandatory Request Body Parameters:_

| Parameter | Type | Description |
| :-------- | :------- | :-------------------------------- |
| `JobId` | `string` | `(required) uuid of a job` |
### API Authorization
Currently the project uses HTTP Basic authorization for authentication.
Set the following environment variables:
| Variables | Type | Description |
| :-------- | :------- | :-------------------------------- |
| `BASIC_HTTP_USERNAME` | `string` | username for server |
| `BASIC_HTTP_PASSWORD` | `string` | password for server |
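
For illustration, a minimal client sketch using Python's `requests` — the endpoint path `/api/v0/status` is hypothetical (substitute the actual route from the API Reference), but the Basic-auth mechanics and the `JobId` body parameter are as documented above:

```
# Hypothetical client call; only the auth scheme and the JobId parameter
# are documented -- the route and response shape here are assumptions.
import requests

resp = requests.post(
    "http://localhost:5001/api/v0/status",           # hypothetical route
    json={"JobId": "6f1c9c58-0000-0000-0000-000000000000"},  # mandatory body parameter
    auth=("test", "generic@123#"),  # BASIC_HTTP_USERNAME / BASIC_HTTP_PASSWORD
)
print(resp.status_code, resp.text)
```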
## Authors
- [@dileep-gadiraju](https://github.com/dileep-gadiraju)
deploy/dev.env 0 → 100644
BASIC_HTTP_USERNAME=test
BASIC_HTTP_PASSWORD=generic@123#
ELASTIC_DB_URL=https://localhost:9200
BLOB_SAS_TOKEN=XXXXX
BLOB_ACCOUNT_URL=YYYYY
BLOB_CONTAINER_NAME=ZZZZZ
MAX_RUNNING_JOBS=4
version: '3.7'

services:
  web-scraping-project:
    deploy:
      replicas: 1
      update_config:
        parallelism: 3
        delay: 10s
      restart_policy:
        condition: on-failure
    ports:
      - "5001:5001"
    env_file:
      - ./dev.env
    networks:
      - frontend

networks:
  frontend:
    driver: overlay
    external: true
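
Note: `deploy.replicas`, `update_config`, and the external `overlay` network are Docker Swarm features; a stack file like this is typically deployed with `docker stack deploy -c <compose-file> <stack-name>` against an initialized swarm, rather than with plain `docker-compose up`.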
docs/README.md 0 → 100644
# Configuration README
* [Configure config.py](config.md)
* [Configure agents](agents.md)
* [Configure azure](azure.md)
* [Configure Environment Variables](env-variables.md)
* [Configure ElasticSearch Log](eslog.md)
* [Configure scripts.py](scripts.md)
* [Docker deployment](docker.md)
docs/agents.md 0 → 100644
# Agent Configurations
To include a new agent, add its agent_data entry to `/static/agents.json`.
format:
```
{
  "agentId": "MY-AGENT-1",
  "description": "Crawler For my_agent_1",
  "provider": "AGENT-PROVIDER-X",
  "URL": "https://www.my-agent.com",
  "scripts": {
    "scriptType1": "myAgentScript1",
    "scriptType2": "myAgentScript2",
    "scriptType3": "myAgentScript3",
    ...
  }
}
```
example:
```
[
  {
    "agentId": "APPLIED-SELENIUM",
    "description": "Crawler For Applied",
    "provider": "Applied",
    "URL": "https://www.applied.com",
    "scripts": {
      "info": "AppliedSelenium",
      "pdf": "AppliedSelenium"
    }
  },
  {
    "agentId": "GRAINGER-SELENIUM",
    "description": "Crawler For Grainger",
    "provider": "Grainger",
    "URL": "https://www.grainger.com",
    "scripts": {
      "info": "GraingerSelenium",
      "pdf": "GraingerSelenium"
    }
  }
]
```
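
Note: each key under `scripts` must be one of the script types configured in `AGENT_SCRIPT_TYPES` (see [config.md](config.md)), and each value must match the name of a function exported by that scriptType package under `./src/scripts` — the server resolves scripts by these names at startup (see `src/app.py`).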
docs/azure.md 0 → 100644
# Azure
1. Initialize BlobStorage object.
```
blob_storage = BlobStorage()
```
2. Set the folder for storage.
```
blob_storage.set_agent_folder(folder_name)
```
arguments:
* folder_name : Name of the folder.
3. Upload the file to BlobStorage.
```
b_status, b_str = blob_storage.upload_file(file_name, data)
```
arguments:
* file_name : Name of the file.
* data : data to be uploaded.

Whether existing blobs are overwritten is controlled by the `overwrite` flag passed to the `BlobStorage` constructor.
return:
* b_status : (boolean), whether the data was uploaded.
* b_str : the exception message if the upload failed, `'true'` otherwise.
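
Putting the three steps together — a minimal sketch, assuming the blob credentials are set in `config.py`; `BlobStorage` is imported from the `models` package, where it is exported:

```
# Minimal end-to-end sketch of the BlobStorage flow documented above.
from models import BlobStorage

blob_storage = BlobStorage(overwrite=True)   # overwrite is a constructor flag
blob_storage.set_agent_folder('MY-AGENT-1')  # files are stored under this folder

b_status, b_str = blob_storage.upload_file('result.json', b'{"A": "123"}')
if not b_status:
    print('upload failed:', b_str)
```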
docs/config.md 0 → 100644
# Configure config.py
* [Server configuration](#server-configuration)
* [Agent configuration](#agent-configuration)
* [AzureBlob configuration](#azureblob-configuration)
* [ElasticSearch DB variables](#elasticsearch-db-variables)
* [Logging configuration](#logging-configuration)
## Server configuration
| Variables | Type | Description |
| :-------- | :------- | :------------------------- |
| `SERVER_HOST` | `string` | host for Server |
| `SERVER_PORT` | `string` | port for Server |
| `SERVER_DEBUG` | `bool` | debugging for Server |
| `SERVER_CORS` | `bool` | CORS policy for Server |
| `SERVER_STATIC_PATH` | `string` | static folder path for Server |
| `API_URL_PREFIX` | `string` | url prefix for Server |
| `API_MANDATORY_PARAMS`| `list` | mandatory parameters for request |
| `BASIC_HTTP_USERNAME` | `string` | username to access Server |
| `BASIC_HTTP_PASSWORD` | `string` | password to access Server |
## Agent configuration
| Variables | Type | Description |
| :-------- | :------- | :------------------------- |
| `AGENT_SCRIPT_TYPES` | `dict` | types of scraping_scripts |
| `AGENT_CONFIG_PATH` | `string` | file_path for agent_configuration(json file) |
| `AGENT_CONFIG_PKL_PATH`| `string` | file_path for agent_configuration(pickle file) |
## AzureBlob configuration
| Variables | Type | Description |
| :-------- | :------- | :------------------------- |
| `BLOB_INTIGRATION` | `bool` | enable/disable AzureBlob Storage |
| `BLOB_SAS_TOKEN` | `string` | SAS Token for AzureBlob Storage |
| `BLOB_ACCOUNT_URL` | `string` | Account URL for AzureBlob Storage|
| `BLOB_CONTAINER_NAME` | `string` | Container for AzureBlob Storage |
## ElasticSearch DB variables
| Variables | Type | Description |
| :-------- | :------- | :------------------------- |
| `ELASTIC_DB_URL` | `string` | URL of ElasticSearch Server |
| `ES_LOG_INDEX` | `string` | Info Logging Index in ElasticSearch |
| `ES_JOB_INDEX` | `string` | Job Logging Index in ElasticSearch |
| `ES_DATA_INDEX` | `string` | Data Logging Index in ElasticSearch |
## Logging configuration
| Variables | Type | Description |
| :-------- | :------- | :------------------------- |
| `JOB_OUTPUT_PATH` | `string` | folder_path for JOB output |
| `MAX_RUNNING_JOBS` | `int` | Max No. of Running Jobs |
| `MAX_WAITING_JOBS` | `int` | Max No. of Waiting Jobs |
| `JOB_RUNNING_STATUS` | `string` | Status for Running Jobs |
| `JOB_COMPLETED_SUCCESS_STATUS`| `string` | Status for Successful Jobs |
| `JOB_COMPLETED_FAILED_STATUS` | `string` | Status for Failed Jobs |
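
For illustration, a fragment of what `config.py` might look like — a sketch under the assumption that values are read from environment variables with development defaults; only the variable names are given by the tables above:

```
# Hypothetical config.py fragment; the shipped file may differ.
import os

SERVER_HOST = os.environ.get('SERVER_HOST', '0.0.0.0')
SERVER_PORT = os.environ.get('SERVER_PORT', '5001')
SERVER_DEBUG = os.environ.get('SERVER_DEBUG', 'false').lower() == 'true'

BASIC_HTTP_USERNAME = os.environ.get('BASIC_HTTP_USERNAME', '')
BASIC_HTTP_PASSWORD = os.environ.get('BASIC_HTTP_PASSWORD', '')

ELASTIC_DB_URL = os.environ.get('ELASTIC_DB_URL', 'http://localhost:9200')
MAX_RUNNING_JOBS = int(os.environ.get('MAX_RUNNING_JOBS', '4'))
MAX_WAITING_JOBS = int(os.environ.get('MAX_WAITING_JOBS', '10'))
```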
docs/docker.md 0 → 100644
# Docker Deployment
* Stop and remove existing containers with name `web-scraping-project`.
```
docker stop web-scraping-project
docker rm web-scraping-project
```
* Build Docker image: `web-scraping-project`
```
docker build -t web-scraping-project ./src/
```
_Note: ./src/ contains Dockerfile_
* Spawn: `web-scraping-project`.
```
docker run --name web-scraping-project -p 5001:5001 --env-file ./deploy/dev.env -it web-scraping-project
```
_Note: the environment file (`--env-file`) is read from local storage_
docs/env-variables.md 0 → 100644
# Environment Variables
The following environment variables are supported:
| Variables | Type | Description |
| :-------- | :------- | :------------------------- |
| `BASIC_HTTP_USERNAME` | `string` | username for server |
| `BASIC_HTTP_PASSWORD` | `string` | password for server |
| `ELASTIC_DB_URL` | `string` | URL of elasticsearch_DB |
| `BLOB_SAS_TOKEN` | `string` | azure blob_storage SAS token |
| `BLOB_ACCOUNT_URL` | `string` | azure blob_storage account_URL |
| `BLOB_CONTAINER_NAME` | `string` | azure blob_storage container_name|
| `MAX_RUNNING_JOBS` | `int` | maximum jobs running at a time |
| `MAX_WAITING_JOBS` | `int` | maximum jobs waiting at a time |
docs/eslog.md 0 → 100644
# ElasticSearch Log
* Initialize Log object.
```
log = Log(agentRunContext)
```
* Types of logs:
1. `log.job` : logs the job status; entries are added to `config.ES_JOB_INDEX`.
Syntax:
```
log.job(status, message)
```
Examples:
```
log.job(config.JOB_RUNNING_STATUS, 'Job Started')
try:
    # your code goes here
    log.job(config.JOB_COMPLETED_SUCCESS_STATUS, 'Job Completed')
except Exception:
    log.job(config.JOB_COMPLETED_FAILED_STATUS, 'Job Failed')
```
2. `log.info` : logs job info; entries are added to `config.ES_LOG_INDEX`.
Syntax:
```
log.info(info_type, message)
```
Examples:
```
log.info('info', 'This is a generalization project')
log.info('warning', 'Script is taking more than usual time')
log.info('exception', 'No Products Available')
```
3. `log.data` : logs scraped job data; entries are added to `config.ES_DATA_INDEX`.
Syntax:
```
log.data(data)
```
Example:
```
data = {
"A" : "123",
"B" : "Generic Project"
}
log.data(data)
```
docs/scripts.md 0 → 100644
# Scripts
1. Create a `python_file` in the respective scriptType folder under `./src/scripts`.
2. Format of the script `my_agent_script.py`:
```
# imports
import traceback

import config
from models import Log

# create a function
def myAgentScript(agentRunContext):
    log = Log(agentRunContext)
    try:
        log.job(config.JOB_RUNNING_STATUS, 'Job Started')
        # Your script
        # goes here
        log.job(config.JOB_COMPLETED_SUCCESS_STATUS, 'Successfully Scraped Data')
    except Exception as e:
        log.job(config.JOB_COMPLETED_FAILED_STATUS, str(e))
        log.info('exception', traceback.format_exc())
```
3. Add the script to the scriptType folder's `__init__.py` as
```
from .my_agent_script import myAgentScript
```
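
Note: the function name (here `myAgentScript`) must match the value registered for the corresponding scriptType in `/static/agents.json` (see [agents.md](agents.md)); the server looks the function up by that name when loading agents at startup.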
.gitignore 0 → 100644
**__pycache__
*.vscode
*.log
/env
exp_result.py
**.DS_Store
/upload/*
src/Dockerfile 0 → 100644
FROM python:3.9-slim
COPY / /app
WORKDIR /app
RUN apt update
RUN pip3 install -r requirements.txt
COPY start.sh /usr/bin/start.sh
RUN chmod +x /usr/bin/start.sh
CMD ["/usr/bin/start.sh"]
src/app.py 0 → 100644
import json
import os
import sys

from flask import Flask
from flask.blueprints import Blueprint
from flask_basicauth import BasicAuth
from flask_cors import CORS

# local imports
import config
import routes
from models import AgentUtils

# flask server
server = Flask(__name__)

# server configuration
config.SERVER_STATIC_PATH = server.static_folder
server.config['BASIC_AUTH_USERNAME'] = config.BASIC_HTTP_USERNAME
server.config['BASIC_AUTH_PASSWORD'] = config.BASIC_HTTP_PASSWORD

# basic_auth for server
basic_auth = BasicAuth(server)

# load agents config
with open(os.path.join(config.SERVER_STATIC_PATH, config.AGENT_CONFIG_PATH), 'r') as f:
    agent_list = json.load(f)

__import__("scripts")
my_scripts = sys.modules["scripts"]

agentUtils = AgentUtils()
agentUtils.filepath = os.path.join(
    config.SERVER_STATIC_PATH, config.AGENT_CONFIG_PKL_PATH)
pkl_agent_list = agentUtils.listAgents()
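
# register agents present in agents.json but not yet in the pickle store
# (assumes new agents are appended at the end of agents.json)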
len_diff = len(agent_list) - len(pkl_agent_list)
for i in range(len(agent_list)-1, len(agent_list)-len_diff-1, -1):
    agent = agent_list[i]
    agent_script = dict()
    for type in config.AGENT_SCRIPT_TYPES.values():
        agent_script[type] = my_scripts.__dict__[
            type].__dict__[agent['scripts'][type]]
    agentUtils.addAgent(agent['agentId'],
                        agent['description'],
                        agent['provider'],
                        agent_script,
                        agent['URL'])

# server CORS policy
if config.SERVER_CORS:
    cors = CORS(server, resources={r"/api/*": {"origins": "*"}})

# add blueprint routes to server
for blueprint in vars(routes).values():
    if isinstance(blueprint, Blueprint):
        server.register_blueprint(blueprint, url_prefix=config.API_URL_PREFIX)

# sample route
@server.route('/')
def home():
    return "<h1>HI</h1>"

# start server
if __name__ == "__main__":
    print('starting server at {} at port {}'.format(
        config.SERVER_HOST, config.SERVER_PORT))
    server.run(host=config.SERVER_HOST,
               port=config.SERVER_PORT,
               debug=config.SERVER_DEBUG,
               threaded=True)
src/models/__init__.py 0 → 100644
from .scraping_utils import get_driver
from .elastic_wrapper import Log
from .errors import ValueMissing, FormatError, BadRequestError
from .blob_storage import BlobStorage
src/models/blob_storage.py 0 → 100644
import os

import config
from azure.storage.blob import BlobServiceClient


class BlobStorage(object):

    def __init__(self, overwrite=False):
        self.blob_service_client = BlobServiceClient(
            account_url=config.BLOB_ACCOUNT_URL, credential=config.BLOB_SAS_TOKEN)
        self.root_folder = None
        self.overwrite = overwrite

    @property
    def root_folder(self):
        return self._root_folder

    @root_folder.setter
    def root_folder(self, rf):
        self._root_folder = rf

    @property
    def blob_service_client(self):
        return self._blob_service_client

    @blob_service_client.setter
    def blob_service_client(self, bsc):
        self._blob_service_client = bsc

    def set_agent_folder(self, agent_folder):
        self.root_folder = agent_folder

    def upload_file(self, file_name, file_contents):
        upload_file_path = os.path.join(self.root_folder, file_name)
        blob_client = self.blob_service_client.get_blob_client(
            container=config.BLOB_CONTAINER_NAME, blob=upload_file_path)
        try:
            blob_client.upload_blob(file_contents, overwrite=self.overwrite)
        except Exception as e:
            return False, str(e)
        return True, 'true'
src/models/elastic_wrapper.py 0 → 100644
import json
import time

import config
from elasticsearch import Elasticsearch


class Log(object):

    @classmethod
    def from_default(cls):
        return cls(None)

    def __init__(self, agentRunContext):
        self.agentRunContext = agentRunContext
        self.es_client = Elasticsearch([config.ELASTIC_DB_URL])

    def __populate_context(self):
        data = {
            'agentId': self.agentRunContext.requestBody['agentId'],
            'jobId': self.agentRunContext.jobId,
            'jobType': self.agentRunContext.jobType,
            'timestamp': int(time.time()*1000),
            'buildNumber': config.BUILD_NUMBER
        }
        return data

    def __index_data_to_es(self, index, data):
        if self.es_client.ping():
            self.es_client.index(index=index, body=json.dumps(data))
        else:
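            # ES unreachable: append the entry to a local fallback file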
            with open('logger.txt', 'a+') as f:
                f.write(json.dumps(data)+'\n')

    def info(self, info_type, message):
        info_data = self.__populate_context()
        info_data['type'] = info_type
        info_data['message'] = message
        self.__index_data_to_es(config.ES_LOG_INDEX, info_data)

    def data(self, data):
        data.update(self.__populate_context())
        self.__index_data_to_es(config.ES_DATA_INDEX, data)

    def job(self, status, message):
        job_data = self.__populate_context()
        job_data['status'] = status
        job_data['message'] = message
        self.__index_data_to_es(config.ES_JOB_INDEX, job_data)

    def get_status(self, jobId):
        if not self.es_client.ping():
            return {'status': 'ES_CONNECTION_FAILED', 'message': "Not able to connect to ES DB"}
        else:
            search_param = {
                "sort": [
                    {
                        "timestamp": {
                            "order": "desc"
                        }
                    }
                ],
                "query": {
                    "bool": {
                        "must": [
                            {"match": {
                                "jobId.keyword": jobId
                            }}
                        ]
                    }
                }
            }
            res = self.es_client.search(
                index=config.ES_JOB_INDEX, body=search_param)
            if len(res['hits']['hits']) > 0:
                source = res['hits']['hits'][0]['_source']
                return {'status': source['status'], 'message': source['message']}
            else:
                return {'status': 'JOBID_NOT_FOUND', 'message': "Please check the given jobId"}
src/models/errors.py 0 → 100644
from flask import jsonify
class RestAPIError(Exception):
    def __init__(self, status_code=500, payload=None):
        self.status_code = status_code
        self.payload = payload

    def to_response(self):
        return jsonify({'error': self.payload}), self.status_code


class BadRequestError(RestAPIError):
    def __init__(self, payload=None):
        super().__init__(400, payload)


class InternalServerErrorError(RestAPIError):
    def __init__(self, payload=None):
        super().__init__(500, payload)


class FormatError(Exception):
    def __init__(self, code, message):
        self._code = code
        self._message = message

    @property
    def code(self):
        return self._code

    @property
    def message(self):
        return self._message

    def __str__(self):
        return self.__class__.__name__ + ': ' + self.message


class WorkflowkeyError(Exception):
    def __init__(self, code, message):
        self._code = code
        self._message = message

    @property
    def code(self):
        return self._code

    @property
    def message(self):
        return self._message

    def __str__(self):
        return self.__class__.__name__ + ': ' + self.message


class FileErrors(Exception):
    def __init__(self, code, message):
        self._code = code
        self._message = message

    @property
    def code(self):
        return self._code

    @property
    def message(self):
        return self._message

    def __repr__(self):
        return {"code": self.code, "message": self.__class__.__name__ + ': ' + self.message}


class FileEncodingError(Exception):
    def __init__(self, code, message):
        self._code = code
        self._message = message

    @property
    def code(self):
        return self._code

    @property
    def message(self):
        return self._message

    def __str__(self):
        return self.__class__.__name__ + ': ' + self.message


class ServiceError(Exception):
    def __init__(self, code, message):
        self._code = code
        self._message = message

    @property
    def code(self):
        return self._code

    @property
    def message(self):
        return self._message

    def __str__(self):
        return self.__class__.__name__ + ': ' + self.message


class ValueMissing(Exception):
    def __init__(self, message):
        self.message = message

    @property
    def message(self):
        return self._message

    @message.setter
    def message(self, value):
        self._message = value

    def __str__(self):
        return self.message

    def __repr__(self):
        return self.message
src/models/scraping_utils.py 0 → 100644
import os
from pathlib import Path

import config
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

chrome_path = Service(config.CHROMEDRIVER_PATH)
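
# Headless Chrome blocks downloads by default; this registers and sends a
# Chrome DevTools command (Page.setDownloadBehavior) so that files are
# downloaded into download_dir.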
def enable_download_headless(browser, download_dir):
    browser.command_executor._commands["send_command"] = (
        "POST", '/session/$sessionId/chromium/send_command')
    params = {'cmd': 'Page.setDownloadBehavior', 'params': {
        'behavior': 'allow', 'downloadPath': download_dir}}
    browser.execute("send_command", params)


def get_driver(temp_directory):
    Path(temp_directory).mkdir(parents=True, exist_ok=True)
    download_dir = os.path.join(temp_directory)
    chrome_options = Options()
    d = DesiredCapabilities.CHROME
    d['goog:loggingPrefs'] = {'browser': 'ALL'}
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--window-size=1920x1080")
    chrome_options.add_argument("--disable-notifications")
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--verbose')
    chrome_options.add_argument('--log-level=3')
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.page_load_strategy = 'normal'
    chrome_options.add_argument(
        '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36')
    chrome_options.add_argument('--disable-software-rasterizer')
    chrome_options.add_experimental_option("prefs", {
        "download.default_directory": str(download_dir),
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing_for_trusted_sources_enabled": False,
        "safebrowsing.enabled": False,
        "plugins.always_open_pdf_externally": True
    })
    driver = webdriver.Chrome(
        service=chrome_path, options=chrome_options, desired_capabilities=d)
    enable_download_headless(driver, download_dir)
    return driver