Addition: Docker Documentation

Commit 31da2d7f authored 2 years ago by pushkar191098

Showing 69 additions and 24 deletions.
--- a/README.md
+++ b/README.md



-# python-crawler-quickstart
+# python-scraping-quickstart

-Python based Web crawler Quick Start Project. 
+Python based Web-scraping Quick Start Project. 

 For Scraping the project uses Selenium & Scrapy framework.

@@ -69,6 +69,17 @@ _The following are mandatory Request Body Parameters_
 | :-------- | :------- | :-------------------------------- |
 | `JobId`   | `string` | `(required) uuid of a job`        |

+### API Authorization
+
+Currently the project uses HTTP Basic authorization for authentication.
+
+Set the following environment variables:
+| Variables             | Type     | Description                        |
+| :--------             | :------- | :--------------------------------  |
+| `BASIC_HTTP_USERNAME` | `string` |  username for server               |
+| `BASIC_HTTP_PASSWORD` | `string` |  password for server               |
+
+
 ## Authors

 - [@dileep-gadiraju](https://github.com/dileep-gadiraju)
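
Reviewer note on the `API Authorization` section added to README.md: a client would typically send the two credentials as a standard HTTP Basic `Authorization` header. The sketch below shows one way to build that header from the `BASIC_HTTP_USERNAME` and `BASIC_HTTP_PASSWORD` variables. The helper name `basic_auth_header` is hypothetical and not part of this project, and reusing the server's variables on the client side is an assumption made only for illustration.

```python
import base64
import os


def basic_auth_header(username: str, password: str) -> str:
    """Return the value of an HTTP Basic `Authorization` header.

    Hypothetical helper for illustration; not part of the project code.
    """
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


if __name__ == "__main__":
    # Assumption: the client reads the same variable names the README lists.
    user = os.environ.get("BASIC_HTTP_USERNAME", "")
    pwd = os.environ.get("BASIC_HTTP_PASSWORD", "")
    headers = {"Authorization": basic_auth_header(user, pwd)}
    # Pass `headers` to the HTTP client of your choice; only the header
    # names are printed here so the credentials are not echoed.
    print(sorted(headers))
```

Basic credentials are only base64-encoded, not encrypted, so this scheme is usually paired with HTTPS.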

--- a/docs/dev.env
+++ b/docs/dev.env
--- a/deploy/web-scraping.yml
+++ b/deploy/web-scraping.yml
+version: '3.7'
+services:
+  web-scraping-project:
+    deploy:
+      replicas: 1
+      update_config:
+        parallelism: 3
+        delay: 10s
+      restart_policy:
+        condition: on-failure
+    ports:
+      - "5001:5001"
+    env_file:
+    - ./dev.env
+
+    networks:
+      - frontend
+
+networks:
+  frontend:
+    driver: overlay
+    external: true
--- a/docs/README.md
+++ b/docs/README.md
@@ -10,4 +10,6 @@

 [Configure ElasticSearch Log](eslog.md)

-[Configure scripts.py](scripts.md)
\ No newline at end of file
+[Configure scripts.py](scripts.md)
+
+[Docker Deployment](docker.md)
\ No newline at end of file
--- a/docs/docker.md
+++ b/docs/docker.md
+# Docker Deployment
+
+* Stop and remove existing containers with name `web-scraping-project`.
+```
+docker stop web-scraping-project 
+docker rm web-scraping-project
+```
+
+* Build Docker image: `web-scraping-project`
+```
+docker build -t web-scraping-project ./src/
+```
+_Note: `./src/` contains the Dockerfile._
+
+
+* Start a container: `web-scraping-project`.
+```
+docker run --name web-scraping-project -p 5001:5001 --env-file ./deploy/dev.env -it web-scraping-project
+```
+
+_Note: The environment file (`--env-file`) is read from the local filesystem._
\ No newline at end of file
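
Reviewer note on docs/docker.md: the stop, remove, build, and run steps could be scripted so they always run in the same order. The sketch below assembles the exact commands from the document as argument lists and prints them by default. The function names `docker_commands` and `deploy` are hypothetical, and tolerating failures of `docker stop` and `docker rm` (for example when no container exists yet) is an assumption about the intended workflow.

```python
import subprocess


def docker_commands(name="web-scraping-project", context="./src/",
                    env_file="./deploy/dev.env", port=5001):
    """Return the four docker.md commands as argv lists, in order."""
    stop = ["docker", "stop", name]
    remove = ["docker", "rm", name]
    build = ["docker", "build", "-t", name, context]
    run = ["docker", "run", "--name", name, "-p", f"{port}:{port}",
           "--env-file", env_file, "-it", name]
    return [stop, remove, build, run]


def deploy(dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for index, command in enumerate(docker_commands()):
        if dry_run:
            print(" ".join(command))
            continue
        # Assumption: stop/rm may fail when no container exists, so only
        # the build and run steps (index 2 and 3) must succeed.
        subprocess.run(command, check=index >= 2)


if __name__ == "__main__":
    deploy(dry_run=True)
```

Calling `deploy(dry_run=False)` would run the commands for real and requires a working Docker installation.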
--- a/docs/env-variables.md
+++ b/docs/env-variables.md
@@ -5,8 +5,8 @@ The Following are supported Environment-Variables

 | Variables             | Type      | Description                       |
 | :--------             | :-------  | :-------------------------        |
-| `BASIC_HTTP_PASSWORD` | `string`  |  username for server              |
-| `BASIC_HTTP_USERNAME` | `string`  |  password for server              |
+| `BASIC_HTTP_USERNAME` | `string`  |  username for server              |
+| `BASIC_HTTP_PASSWORD` | `string`  |  password for server              |
 | `ELASTIC_DB_URL`      | `string`  |  URL of elasticsearch_DB          |
 | `BLOB_SAS_TOKEN`      | `string`  |  azure blob_storage SAS token     |
 | `BLOB_ACCOUNT_URL`    | `string`  |  azure blob_storage account_URL   |
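
Reviewer note on docs/env-variables.md: since the username and password descriptions were previously swapped, a startup check that fails fast on missing variables may help catch configuration mistakes early. The sketch below validates the five variables visible in this hunk. The name `load_settings` is hypothetical, and treating all five as required is an assumption, since the document only calls them "supported".

```python
import os

# Variable names taken from the table in docs/env-variables.md.
# Assumption: every one of them must be set for the service to start.
REQUIRED_VARIABLES = (
    "BASIC_HTTP_USERNAME",
    "BASIC_HTTP_PASSWORD",
    "ELASTIC_DB_URL",
    "BLOB_SAS_TOKEN",
    "BLOB_ACCOUNT_URL",
)


def load_settings(environ=None):
    """Return the required variables as a dict, or raise if any is unset."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARIABLES if not environ.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARIABLES}
```

Passing a plain dict instead of `os.environ` keeps the check easy to exercise in tests.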

--- a/docs/eslog.md
+++ b/docs/eslog.md
@@ -8,7 +8,7 @@ log = Log(agentRunContext)

 * Types of logs:
    
-    1. log.job : it shows the job status, logs are shown in `general-job-stats`
+    1. log.job : it shows the job status, logs are added to `config.ES_JOB_INDEX`.
        
        Syntax:
        ```
@@ -25,7 +25,7 @@ log = Log(agentRunContext)
            log.job(config.JOB_COMPLETED_FAILED_STATUS, 'Job Failed')
        ```

-    2. log.info : it shows the job info, logs are shown in `general-app-logs`
+    2. log.info : it shows the job info, logs are added to `config.ES_LOG_INDEX`.

        Syntax:
        ```
@@ -37,7 +37,7 @@ log = Log(agentRunContext)
        log.info('warning', 'Script is taking more than usual time')
        log.info('exception', 'No Products Available')
        ```
-    3. log.data : it shows the job data, logs are shown in `general-acrawled-data`
+    3. log.data : it shows the job data, logs are added to `config.ES_DATA_INDEX`.
        
        Syntax:
        ```

--- a/src/Dockerfile
+++ b/src/Dockerfile
-FROM mycrawlercontainerregistry.azurecr.io/general-1-crawlerbase:latest
+FROM python:3.9-slim
 COPY / /app
 WORKDIR /app
+RUN apt update

 RUN pip3 install -r requirements.txt
 COPY start.sh /usr/bin/start.sh
 RUN chmod +x /usr/bin/start.sh
-ENTRYPOINT ["/bin/bash","/usr/bin/start.sh"]
-
-
-#FROM python:3.6-slim-stretch
-#COPY / /app
-#WORKDIR /app
-#RUN apt update
-
-#RUN pip3 install -r requirements.txt
-#COPY start.sh /usr/bin/start.sh
-#RUN chmod +x /usr/bin/start.sh
-#CMD ["/usr/bin/start.sh"]
+CMD ["/usr/bin/start.sh"]
--- a/src/requirements.txt
+++ b/src/requirements.txt
@@ -7,12 +7,11 @@ itsdangerous==1.1.0
 Flask-Cors==3.0.10
 Flask-RESTful==0.3.9
 uuid==1.30
-selenium==4.1.5
+selenium==4.2.0
 Flask-BasicAuth==0.2.0
 Flask-HTTPBasicAuth==1.0.1
 pandas==1.4.2
 python-dateutil==2.8.1
 beautifulsoup4==4.9.3
 azure-storage-blob==12.10.0b1
-lxml==4.5.1
 scrapy==2.6.1
--- a/test/crawling-api-collection.json
+++ b/test/crawling-api-collection.json
 {
 	"info": {
 		"_postman_id": "9a1bcfd6-80ac-49a6-ad43-da29f9f6c9d0",
-		"name": "crawling-api-collections",
+		"name": "scraping-api-collections",
 		"schema": "https://schema.getpostman.com/json/collection/v2.1.0/collection.json",
 		"_exporter_id": "14608642"
 	},