Commit 77bd2ac

Initial import
0 parents  commit 77bd2ac

7 files changed

Lines changed: 162 additions & 0 deletions

.dockerignore

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
# configurations
.idea

# crawlee and apify storage folders
apify_storage
crawlee_storage
storage

# installed files
node_modules

# git folder
.git

# data
data
src/storage
dist

.editorconfig

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
root = true

[*]
indent_style = space
indent_size = 4
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true
end_of_line = lf

.gitignore

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
# This file tells Git which files shouldn't be added to source control

.idea
.vscode
storage
apify_storage
crawlee_storage
node_modules
dist
tsconfig.tsbuildinfo
storage/*
!storage/key_value_stores
storage/key_value_stores/*
!storage/key_value_stores/default
storage/key_value_stores/default/*
!storage/key_value_stores/default/INPUT.json

# Added by Apify CLI
.venv

README.md

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
## Scrape single-page in TypeScript template

A template for scraping data from a single web page in TypeScript (Node.js). The URL of the web page is passed in via input, which is defined by the [input schema](https://docs.apify.com/platform/actors/development/input-schema). The template uses the [Axios client](https://axios-http.com/docs/intro) to get the HTML of the page and the [Cheerio library](https://cheerio.js.org/) to parse the data from it. The data are then stored in a [dataset](https://docs.apify.com/sdk/js/docs/guides/result-storage#dataset) where you can easily access them.

The scraped data in this template are page headings, but you can easily edit the code to scrape whatever you want from the page.

## Included features

- **[Apify SDK](https://docs.apify.com/sdk/js/)** - a toolkit for building [Actors](https://apify.com/actors)
- **[Input schema](https://docs.apify.com/platform/actors/development/input-schema)** - define and easily validate a schema for your Actor's input
- **[Dataset](https://docs.apify.com/sdk/js/docs/guides/result-storage#dataset)** - store structured data where each object stored has the same attributes
- **[Axios client](https://axios-http.com/docs/intro)** - promise-based HTTP Client for Node.js and the browser
- **[Cheerio](https://cheerio.js.org/)** - library for parsing and manipulating HTML and XML

## How it works

1. `Actor.getInput()` gets the input where the page URL is defined
2. `axios.get(url)` fetches the page
3. `cheerio.load(response.data)` loads the page data and enables parsing the headings
4. The following selector parses the headings from the page; edit the code here to parse whatever you need from the page:

```javascript
$("h1, h2, h3, h4, h5, h6").each((_i, element) => {...});
```

5. `Actor.pushData(headings)` stores the headings in the dataset
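
The five steps above can be sketched end to end. The block below is a minimal, dependency-free illustration under stated assumptions, not the template's actual source: `fetchHtml` and `store` are hypothetical injected callbacks standing in for `axios.get(url)` and `Actor.pushData(headings)`, the `Heading` shape (`level`, `text`) is assumed, and a regular expression stands in for the Cheerio selector so the sketch runs on plain Node.js.

```typescript
// Assumed shape of one scraped heading, e.g. { level: 'h1', text: 'Title' }.
interface Heading {
    level: string;
    text: string;
}

// Hypothetical stand-ins for the real calls named in the steps above.
type FetchHtml = (url: string) => Promise<string>; // role of axios.get(url)
type Store = (items: Heading[]) => Promise<void>; // role of Actor.pushData(headings)

// Regex stand-in for the Cheerio selector "h1, h2, h3, h4, h5, h6".
// This is not a robust HTML parser; it only handles simple, well-formed markup.
function extractHeadings(html: string): Heading[] {
    const headings: Heading[] = [];
    const pattern = /<(h[1-6])\b[^>]*>([\s\S]*?)<\/\1>/gi;
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(html)) !== null) {
        const level = match[1].toLowerCase();
        // Drop nested tags and collapse whitespace to approximate the element text.
        const text = match[2].replace(/<[^>]+>/g, '').replace(/\s+/g, ' ').trim();
        headings.push({ level, text });
    }
    return headings;
}

// Step 1 (reading the URL from the Actor input) is assumed to have happened
// already; the URL arrives here as a plain argument.
async function scrapeHeadings(url: string, fetchHtml: FetchHtml, store: Store): Promise<Heading[]> {
    const html = await fetchHtml(url); // step 2: fetch the page
    const headings = extractHeadings(html); // steps 3-4: load and parse the headings
    await store(headings); // step 5: store the headings
    return headings;
}
```

In a real Actor, `fetchHtml` would presumably wrap `axios.get(url)` and return `response.data`, `store` would call `Actor.pushData`, and `extractHeadings` would use the Cheerio selector from step 4 instead of the regular expression.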

## Resources

- [Web scraping in Node.js with Axios and Cheerio](https://blog.apify.com/web-scraping-with-axios-and-cheerio/)
- [Web scraping with Cheerio in 2023](https://blog.apify.com/web-scraping-with-cheerio/)
- [Video tutorial](https://www.youtube.com/watch?v=yTRHomGg9uQ) on building a scraper using CheerioCrawler
- [Written tutorial](https://docs.apify.com/academy/web-scraping-for-beginners/challenge) on building a scraper using CheerioCrawler
- [Integration with Zapier](https://apify.com/integrations), Make, Google Drive, and others
- [Video guide on getting scraped data using Apify API](https://www.youtube.com/watch?v=ViYYDHSBAKM)
- A short guide on how to build web scrapers using code templates: [web scraper template](https://www.youtube.com/watch?v=u-i-Korzf8w)

## Getting started

For complete information, [see this article](https://docs.apify.com/platform/actors/development#build-actor-locally). To run the Actor, use the following command:

```bash
apify run
```

## Deploy to Apify

### Connect Git repository to Apify

If you've created a Git repository for the project, you can easily connect it to Apify:

1. Go to the [Actor creation page](https://console.apify.com/actors/new)
2. Click the **Link Git Repository** button

### Push project on your local machine to Apify

You can also deploy the project on your local machine to Apify without a Git repository.

1. Log in to Apify. You will need to provide your [Apify API Token](https://console.apify.com/account/integrations) to complete this action.

```bash
apify login
```

2. Deploy your Actor. This command will deploy and build the Actor on the Apify Platform. You can find your newly created Actor under [Actors -> My Actors](https://console.apify.com/actors?tab=my).

```bash
apify push
```

## Documentation reference

To learn more about Apify and Actors, take a look at the following resources:

- [Apify SDK for JavaScript documentation](https://docs.apify.com/sdk/js)
- [Apify SDK for Python documentation](https://docs.apify.com/sdk/python)
- [Apify Platform documentation](https://docs.apify.com/platform)
- [Join our developer community on Discord](https://discord.com/invite/jyEM2PRvMU)

eslint.config.mjs

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
import apify from '@apify/eslint-config';

// eslint-disable-next-line import/no-default-export
export default [
    { ignores: ['**/dist'] }, // Ignores need to happen first
    ...apify,
    {
        languageOptions: {
            sourceType: 'module',

            parserOptions: {
                project: 'tsconfig.eslint.json',
            },
        },
    },
];

tsconfig.eslint.json

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
{
    "extends": "./tsconfig.json",
    "include": ["./src/**/*", "./test/**/*", "./scripts/**/*", "./types/**/*"]
}

tsconfig.json

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,16 @@
1+
{
    "extends": "@apify/tsconfig",
    "compilerOptions": {
        "module": "ESNext",
        "target": "ESNext",
        "outDir": "dist",
        "moduleResolution": "node",
        "noUnusedLocals": false,
        "lib": ["ES2022"],
        "skipLibCheck": true,
        "typeRoots": ["./types", "./node_modules/@types"],
        "strict": true
    },
    "include": ["./src/**/*", "./types/**/*"],
    "exclude": ["node_modules"]
}
