Commit 31da2d7f authored by pushkar191098's avatar pushkar191098

Addition: Docker Documentation

1 merge request: !1 addition: main_server_code, scripts, docs
Showing with 69 additions and 24 deletions
# python-crawler-quickstart
# python-scraping-quickstart
Python based Web crawler Quick Start Project.
Python based Web-scraping Quick Start Project.
For scraping, the project uses the Selenium & Scrapy frameworks.
......@@ -69,6 +69,17 @@ _The following are mandatory Request Body Parameters_
| :-------- | :------- | :-------------------------------- |
| `JobId` | `string` | `(required) uuid of a job` |
### API Authorization
Currently the project uses HTTP Basic authentication.
Set the following environment variables:
| Variables | Type | Description |
| :-------- | :------- | :-------------------------------- |
| `BASIC_HTTP_USERNAME` | `string` | username for server |
| `BASIC_HTTP_PASSWORD` | `string` | password for server |
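A hedged illustration of calling the API with these credentials is shown below; the `/api/status` path and host are assumptions (the actual routes live in the bundled Postman collection), while `JobId` is the body parameter documented above:
```
curl -u "$BASIC_HTTP_USERNAME:$BASIC_HTTP_PASSWORD" \
  -H "Content-Type: application/json" \
  -d '{"JobId": "123e4567-e89b-12d3-a456-426614174000"}' \
  http://localhost:5001/api/status
```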
## Authors
- [@dileep-gadiraju](https://github.com/dileep-gadiraju)
......
File moved
version: '3.7'
services:
  web-scraping-project:
    deploy:
      replicas: 1
      update_config:
        parallelism: 3
        delay: 10s
      restart_policy:
        condition: on-failure
    ports:
      - "5001:5001"
    env_file:
      - ./dev.env
    networks:
      - frontend
networks:
  frontend:
    driver: overlay
    external: true
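This compose file relies on Swarm-only features (`deploy.replicas`, `update_config`, an external `overlay` network), so it targets `docker stack deploy` rather than plain `docker-compose up`. A minimal deployment sketch, assuming the file is saved as `docker-compose.yml` (the filename and path are assumptions):
```
# one-time setup: enable swarm mode and create the external overlay network
docker swarm init
docker network create --driver overlay frontend

# deploy (or update) the stack defined in the compose file
docker stack deploy -c docker-compose.yml web-scraping-project
```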
......@@ -10,4 +10,6 @@
[Configure ElasticSearch Log](eslog.md)
[Configure scripts.py](scripts.md)
\ No newline at end of file
[Configure scripts.py](scripts.md)
[Docker Deployment](docker.md)
\ No newline at end of file
# Docker Deployment
* Stop and remove any existing container named `web-scraping-project`.
```
docker stop web-scraping-project
docker rm web-scraping-project
```
* Build the Docker image `web-scraping-project`.
```
docker build -t web-scraping-project ./src/
```
_Note: `./src/` contains the Dockerfile._
* Run the `web-scraping-project` container.
```
docker run --name web-scraping-project -p 5001:5001 --env-file ./deploy/dev.env -it web-scraping-project
```
_Note: the environment file passed via `--env-file` is read from local storage._
\ No newline at end of file
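To verify the deployment, a few standard Docker checks can help; the HTTP probe below assumes the server responds on `/`, which this commit does not confirm:
```
docker ps --filter name=web-scraping-project   # container should be listed as Up
docker logs -f web-scraping-project            # follow the server logs
curl -I -u "$BASIC_HTTP_USERNAME:$BASIC_HTTP_PASSWORD" http://localhost:5001/
```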
......@@ -5,8 +5,8 @@ The Following are supported Environment-Variables
| Variables | Type | Description |
| :-------- | :------- | :------------------------- |
| `BASIC_HTTP_PASSWORD` | `string` | username for server |
| `BASIC_HTTP_USERNAME` | `string` | password for server |
| `BASIC_HTTP_USERNAME` | `string` | username for server |
| `BASIC_HTTP_PASSWORD` | `string` | password for server |
| `ELASTIC_DB_URL` | `string` | URL of the Elasticsearch DB |
| `BLOB_SAS_TOKEN` | `string` | Azure Blob Storage SAS token |
| `BLOB_ACCOUNT_URL` | `string` | Azure Blob Storage account URL |
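A sketch of a `dev.env` file wiring these variables together; every value below is a placeholder, not taken from this commit:
```
BASIC_HTTP_USERNAME=admin
BASIC_HTTP_PASSWORD=change-me
ELASTIC_DB_URL=http://localhost:9200
BLOB_SAS_TOKEN=<sas-token>
BLOB_ACCOUNT_URL=https://<account>.blob.core.windows.net
```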
......
......@@ -8,7 +8,7 @@ log = Log(agentRunContext)
* Types of logs:
1. log.job : it shows the job status, logs are shown in `general-job-stats`
1. log.job : records the job status; log entries are written to the `config.ES_JOB_INDEX` index.
Syntax:
```
......@@ -25,7 +25,7 @@ log = Log(agentRunContext)
log.job(config.JOB_COMPLETED_FAILED_STATUS, 'Job Failed')
```
2. log.info : it shows the job info, logs are shown in `general-app-logs`
2. log.info : records informational job messages; log entries are written to the `config.ES_LOG_INDEX` index.
Syntax:
```
......@@ -37,7 +37,7 @@ log = Log(agentRunContext)
log.info('warning', 'Script is taking more than usual time')
log.info('exception', 'No Products Available')
```
3. log.data : it shows the job data, logs are shown in `general-acrawled-data`
3. log.data : records the scraped job data; log entries are written to the `config.ES_DATA_INDEX` index.
Syntax:
```
......
FROM mycrawlercontainerregistry.azurecr.io/general-1-crawlerbase:latest
FROM python:3.9-slim
COPY / /app
WORKDIR /app
RUN apt-get update
RUN pip3 install -r requirements.txt
COPY start.sh /usr/bin/start.sh
RUN chmod +x /usr/bin/start.sh
ENTRYPOINT ["/bin/bash","/usr/bin/start.sh"]
#FROM python:3.6-slim-stretch
#COPY / /app
#WORKDIR /app
#RUN apt update
#RUN pip3 install -r requirements.txt
#COPY start.sh /usr/bin/start.sh
#RUN chmod +x /usr/bin/start.sh
#CMD ["/usr/bin/start.sh"]
CMD ["/usr/bin/start.sh"]
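The Dockerfile copies `start.sh` into the image and runs it via `CMD`, but the script itself is not part of this diff. A minimal sketch of what it might contain, assuming the Flask entrypoint is `app.py` (the filename is an assumption):
```
#!/bin/bash
# hypothetical start.sh; the real script is not shown in this commit
set -e
cd /app
exec python3 app.py   # entrypoint filename is an assumption
```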
......@@ -7,12 +7,11 @@ itsdangerous==1.1.0
Flask-Cors==3.0.10
Flask-RESTful==0.3.9
uuid==1.30
selenium==4.1.5
selenium==4.2.0
Flask-BasicAuth==0.2.0
Flask-HTTPBasicAuth==1.0.1
pandas==1.4.2
python-dateutil==2.8.1
beautifulsoup4==4.9.3
azure-storage-blob==12.10.0b1
lxml==4.5.1
scrapy==2.6.1
{
"info": {
"_postman_id": "9a1bcfd6-80ac-49a6-ad43-da29f9f6c9d0",
"name": "crawling-api-collections",
"name": "scraping-api-collections",
"schema": "https://schema.getpostman.com/json/collection/v2.1.0/collection.json",
"_exporter_id": "14608642"
},
......