Tarento / delivery-excellence / digital-assets / python-webscraping-quickstart
Commit 31da2d7f, authored 2 years ago by pushkar191098
Addition: Docker Documentation
Parent: 786adeb1
Branches: main, develop
1 merge request: !1 addition: main_server_code, scripts, docs
Changes: 10 changed files, with 69 additions and 24 deletions

README.md (+13 -2)
deploy/dev.env (+0 -0)
deploy/web-scraping.yml (+22 -0)
docs/README.md (+3 -1)
docs/docker.md (+21 -0)
docs/env-variables.md (+2 -2)
docs/eslog.md (+3 -3)
src/Dockerfile (+3 -13)
src/requirements.txt (+1 -2)
test/scraping-api-collection.json (+1 -1)
README.md (+13 -2)
-# python-crawler-quickstart
+# python-scraping-quickstart

-Python based Web crawler Quick Start Project.
+Python based Web-scraping Quick Start Project.
For Scraping the project uses Selenium & Scrapy framework.
...
...
@@ -69,6 +69,17 @@ _The following are mandatory Request Body Parameters_
| :-------- | :------- | :-------------------------------- |
| `JobId` | `string` | `(required) uuid of a job` |
+### API Authorization
+
+Currently the project uses basic authorization for authentication.
+
+Set the following environment variables:
+
+| Variables | Type | Description |
+| :-------- | :------- | :-------------------------------- |
+| `BASIC_HTTP_USERNAME` | `string` | username for server |
+| `BASIC_HTTP_PASSWORD` | `string` | password for server |
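As a rough illustration of what a client has to send once these two variables are set on the server, the sketch below builds an HTTP Basic `Authorization` header with the Python standard library. The helper name `basic_auth_header`, the placeholder fallback credentials, and the idea of reading the same variable names on the client side are assumptions made for this example; none of them come from the project's code.

```python
import base64
import os


def basic_auth_header(username: str, password: str) -> dict:
    """Return an HTTP Basic Authorization header for the given credentials."""
    # Basic auth is "username:password", base64-encoded, prefixed with "Basic ".
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Hypothetical client-side usage: reuse the variable names from the table above.
# The fallback values are placeholders for this sketch only.
headers = basic_auth_header(
    os.environ.get("BASIC_HTTP_USERNAME", "user"),
    os.environ.get("BASIC_HTTP_PASSWORD", "secret"),
)
```

The resulting dictionary can be passed as request headers by whichever HTTP client is used to call the API.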
## Authors
- [@dileep-gadiraju](https://github.com/dileep-gadiraju)
...
...
docs/dev.env → deploy/dev.env (+0 -0)
File moved
deploy/web-scraping.yml (new file, 0 → 100644, +22 -0)
+version: '3.7'
+
+services:
+  web-scraping-project:
+    deploy:
+      replicas: 1
+      update_config:
+        parallelism: 3
+        delay: 10s
+      restart_policy:
+        condition: on-failure
+    ports:
+      - "5001:5001"
+    env_file:
+      - ./dev.env
+    networks:
+      - frontend
+
+networks:
+  frontend:
+    driver: overlay
+    external: true
docs/README.md (+3 -1)
...
@@ -10,4 +10,6 @@
[Configure ElasticSearch Log](eslog.md)
-[Configure scripts.py](scripts.md)
\ No newline at end of file
+[Configure scripts.py](scripts.md)
+
+[docker deployment](docker.md)
\ No newline at end of file
docs/docker.md (new file, 0 → 100644, +21 -0)
+# Docker Deployment
+
+* Stop and remove any existing container named `web-scraping-project`.
+```
+docker stop web-scraping-project
+docker rm web-scraping-project
+```
+
+* Build the Docker image `web-scraping-project`:
+```
+docker build -t web-scraping-project ./src/
+```
+_Note: `./src/` contains the Dockerfile._
+
+* Run a container named `web-scraping-project` from that image:
+```
+docker run --name web-scraping-project -p 5001:5001 --env-file ./deploy/dev.env -it web-scraping-project
+```
+_Note: The environment file (`--env-file`) is read from local storage._
\ No newline at end of file
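The stop, remove, build, and run steps above can be sketched as a small Python helper that assembles the same `docker` commands as argument lists. This is only an illustration: the function name `docker_deploy_commands` and its parameters are hypothetical, and the defaults simply mirror the values used in `docs/docker.md`. The helper builds the commands without executing anything.

```python
import shlex


def docker_deploy_commands(name="web-scraping-project", port=5001,
                           env_file="./deploy/dev.env", context="./src/"):
    """Return the stop/rm/build/run commands from docs/docker.md as argv lists."""
    return [
        ["docker", "stop", name],                  # stop an existing container
        ["docker", "rm", name],                    # remove it
        ["docker", "build", "-t", name, context],  # build the image from ./src/
        ["docker", "run", "--name", name, "-p", f"{port}:{port}",
         "--env-file", env_file, "-it", name],     # start a new container
    ]


# Dry run: render each command as a shell line instead of executing it.
rendered = [shlex.join(argv) for argv in docker_deploy_commands()]
```

Each list could be handed to `subprocess.run(argv)` to execute it. Note that `docker stop` and `docker rm` exit with a non-zero status when no container with that name exists, so a caller would probably not want `check=True` for those two.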
docs/env-variables.md (+2 -2)
...
...
@@ -5,8 +5,8 @@ The Following are supported Environment-Variables
| Variables | Type | Description |
| :-------- | :------- | :------------------------- |
-| `BASIC_HTTP_PASSWORD` | `string` | username for server |
-| `BASIC_HTTP_USERNAME` | `string` | password for server |
+| `BASIC_HTTP_USERNAME` | `string` | username for server |
+| `BASIC_HTTP_PASSWORD` | `string` | password for server |
| `ELASTIC_DB_URL` | `string` | URL of elasticsearch_DB |
| `BLOB_SAS_TOKEN` | `string` | azure blob_storage SAS token |
| `BLOB_ACCOUNT_URL` | `string` | azure blob_storage account_URL |
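One way to catch a missing setting before starting the container is to check the env file against the variable names shown in this hunk. The sketch below is an assumption-laden illustration: `parse_env_lines`, `missing_variables`, and `REQUIRED_VARIABLES` are hypothetical names, the required list covers only the five variables visible here (the rest of the table is elided), and the parser handles only plain `KEY=VALUE` lines and `#` comments.

```python
# Only the variables visible in this hunk; the full table may list more.
REQUIRED_VARIABLES = (
    "BASIC_HTTP_USERNAME",
    "BASIC_HTTP_PASSWORD",
    "ELASTIC_DB_URL",
    "BLOB_SAS_TOKEN",
    "BLOB_ACCOUNT_URL",
)


def parse_env_lines(lines):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = line.partition("=")
        values[key.strip()] = value
    return values


def missing_variables(values, required=REQUIRED_VARIABLES):
    """Return the required names that are absent or empty, in table order."""
    return [name for name in required if not values.get(name)]
```

For a real file, the lines could come from `open("deploy/dev.env")`; an empty result from `missing_variables` would mean every listed variable has a non-empty value.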
...
...
docs/eslog.md (+3 -3)
...
...
@@ -8,7 +8,7 @@ log = Log(agentRunContext)
* Types of logs:
-   1. log.job : it shows the job status, logs are shown in `general-job-stats`
+   1. log.job : it shows the job status, logs are added to `config.ES_JOB_INDEX`.
    Syntax:
```
...
...
@@ -25,7 +25,7 @@ log = Log(agentRunContext)
log.job(config.JOB_COMPLETED_FAILED_STATUS, 'Job Failed')
```
-   2. log.info : it shows the job info, logs are shown in `general-app-logs`
+   2. log.info : it shows the job info, logs are added to `config.ES_LOG_INDEX`.
Syntax:
```
...
...
@@ -37,7 +37,7 @@ log = Log(agentRunContext)
log.info('warning', 'Script is taking more than usual time')
log.info('exception', 'No Products Available')
```
-   3. log.data : it shows the job data, logs are shown in `general-acrawled-data`
+   3. log.data : it shows the job data, logs are added to `config.ES_DATA_INDEX`.
Syntax:
```
...
...
src/Dockerfile (+3 -13)
-FROM mycrawlercontainerregistry.azurecr.io/general-1-crawlerbase:latest
+FROM python:3.9-slim
COPY / /app
WORKDIR /app
RUN apt update
RUN pip3 install -r requirements.txt
COPY start.sh /usr/bin/start.sh
RUN chmod +x /usr/bin/start.sh
-ENTRYPOINT ["/bin/bash","/usr/bin/start.sh"]
-#FROM python:3.6-slim-stretch
-#COPY / /app
-#WORKDIR /app
-#RUN apt update
-#RUN pip3 install -r requirements.txt
-#COPY start.sh /usr/bin/start.sh
-#RUN chmod +x /usr/bin/start.sh
-#CMD ["/usr/bin/start.sh"]
+CMD ["/usr/bin/start.sh"]
src/requirements.txt (+1 -2)
...
...
@@ -7,12 +7,11 @@ itsdangerous==1.1.0
Flask-Cors==3.0.10
Flask-RESTful==0.3.9
uuid==1.30
-selenium==4.1.5
+selenium==4.2.0
Flask-BasicAuth==0.2.0
Flask-HTTPBasicAuth==1.0.1
pandas==1.4.2
python-dateutil==2.8.1
beautifulsoup4==4.9.3
azure-storage-blob==12.10.0b1
lxml==4.5.1
scrapy==2.6.1
test/crawling-api-collection.json → test/scraping-api-collection.json (+1 -1)
{
    "info": {
        "_postman_id": "9a1bcfd6-80ac-49a6-ad43-da29f9f6c9d0",
-       "name": "crawling-api-collections",
+       "name": "scraping-api-collections",
        "schema": "https://schema.getpostman.com/json/collection/v2.1.0/collection.json",
        "_exporter_id": "14608642"
    },
...
...