With the Bardeen scraper, you can extract data from any website and send it directly to your favorite apps. This means that you can ditch copy-pasting from your day-to-day processes forever.
The scraper allows you to do things like copying LinkedIn profile data to your Notion database with one click, saving interesting tweets to a Google Doc, and much more.
In this tutorial, you will learn how the scraper works and how you can use it to save time.
To understand web scraping, let’s take a look at the big picture.
All digital data is stored in databases. Most of your web apps have APIs that allow you to access your data stored in those databases.
But what about the rest of the data on the web? Take this tutorial, for example. It has a title, body, link, and other data fields. This information is stored in our database and is used to generate this web page.
You can’t get structured information from our tutorials database in its original form, because you don’t have access to it. But all of the information from the database is shown on the website.
We just need to convert web pages back into structured data. And this is exactly what the scraper does – it extracts specific parts of a web page using a scraper template.
What’s a scraper template?
A scraper template specifies what information to extract from a given web page and where it is located.
Scraper templates work only for pages of the same type. For example, a scraper template for LinkedIn profile pages will only work on profile pages and will fail if used on a LinkedIn company page.
There are two types of scraper templates: for individual pages and for lists.
The individual page scraper template will get one element for each data field defined. For example, a LinkedIn profile page has only one “name” field.
The list scraper template, on the other hand, will look for similar elements repeating on the page. So you should use a list scraper template to extract data from LinkedIn search, which shows multiple people at the same time. In that case, the “name” field shows up once for each list item.
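The difference between the two template types can be sketched in a few lines of Python. This is only an illustration and not how Bardeen is implemented: the TemplateScraper class, the idea of describing a template as a mapping from field names to CSS class names, and the sample HTML are all assumptions made for the example.

```python
from html.parser import HTMLParser


class TemplateScraper(HTMLParser):
    """Toy scraper. A template here is {field name: CSS class} (an assumption)."""

    def __init__(self, item_class, fields):
        super().__init__()
        self.item_class = item_class  # class that marks one item (or the whole page)
        self.fields = fields          # {field name: class of the element holding it}
        self.rows = []                # one dict per item found
        self.item_depth = 0           # > 0 while the parser is inside an item
        self.field = None             # field currently being read, if any
        self.field_depth = 0

    def handle_starttag(self, tag, attrs):
        # Note: void tags such as <br> inside an item are not handled in this sketch.
        classes = (dict(attrs).get("class") or "").split()
        if self.item_depth == 0:
            if self.item_class in classes:
                self.item_depth = 1
                self.rows.append({name: None for name in self.fields})
            return
        self.item_depth += 1
        if self.field is not None:
            self.field_depth += 1
            return
        for name, css_class in self.fields.items():
            if css_class in classes:
                self.field, self.field_depth = name, 1
                break

    def handle_endtag(self, tag):
        if self.field is not None:
            self.field_depth -= 1
            if self.field_depth == 0:
                row = self.rows[-1]
                if row[self.field] is not None:
                    row[self.field] = row[self.field].strip()
                self.field = None
        if self.item_depth > 0:
            self.item_depth -= 1

    def handle_data(self, data):
        if self.field is not None:
            row = self.rows[-1]
            row[self.field] = (row[self.field] or "") + data


def scrape_list(html, item_class, fields):
    """List template: returns one row per repeated item."""
    parser = TemplateScraper(item_class, fields)
    parser.feed(html)
    return parser.rows


def scrape_individual(html, page_class, fields):
    """Individual page template: returns a single record (or None)."""
    rows = scrape_list(html, page_class, fields)
    return rows[0] if rows else None


# Made-up sample pages for the example.
PROFILE_HTML = (
    '<div class="profile"><h1 class="name">Ada Lovelace</h1>'
    '<p class="headline">Mathematician</p></div>'
)
SEARCH_HTML = (
    '<ul>'
    '<li class="result"><span class="name">Ada Lovelace</span></li>'
    '<li class="result"><span class="name">Alan Turing</span></li>'
    '</ul>'
)

profile = scrape_individual(PROFILE_HTML, "profile", {"name": "name", "headline": "headline"})
people = scrape_list(SEARCH_HTML, "result", {"name": "name"})
```

The individual template yields one record with one “name” value, while the list template yields one row per search result, each with its own “name”.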
What can I scrape?
You can scrape the following types of data and interactions:
Text. Ex: name of a person from a LinkedIn profile page.
Link. Ex: links from Google Search results.
Image. Ex: profile images from LinkedIn.
Click. Ex: click on the “contact info” link to open a popup.
Input. Ex: fill out a form.
You can also get fields that are not displayed on the page:
The current website URL.
The meta title, which is the text that shows in Google search or as a tab name.
The meta image, which is the preview image that shows when you share a link on social media. It’s also known as an Open Graph image.
The exact time when the page was scraped. This is often used when scraping the same page multiple times.
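To make these hidden fields concrete, here is a rough Python sketch of where they live in a page's HTML. It is an illustration only: the MetaFieldParser class, the hidden_fields function, the output key names, and the sample page are hypothetical and not part of Bardeen.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser


class MetaFieldParser(HTMLParser):
    """Reads the <title> text and the Open Graph image from a page."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.image = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("property") == "og:image":
            self.image = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data


def hidden_fields(url, html, now=None):
    """Return the fields that are not displayed in the page body."""
    parser = MetaFieldParser()
    parser.feed(html)
    scraped_at = now or datetime.now(timezone.utc)
    return {
        "url": url,
        "meta_title": parser.title.strip() if parser.title else None,
        "meta_image": parser.image,
        "scraped_at": scraped_at.isoformat(),
    }


# Made-up page and URL for the example.
PAGE = (
    '<html><head><title>Scraper tutorial</title>'
    '<meta property="og:image" content="https://example.com/preview.png">'
    '</head><body></body></html>'
)
fields = hidden_fields(
    "https://example.com/tutorial",
    PAGE,
    now=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```

Passing a fixed time through the now parameter keeps the example reproducible; in normal use it would default to the current time.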
What are the available scraper actions?
Scrape data on an active tab
This action will scrape data from the currently opened page. You should use this action for workflows that need just one record copied at a time.
For example, imagine you are building a gift wish list on Notion from Amazon. So when you find a cool product, you can launch Bardeen and get it copied with one click. Here you go through one page at a time.
Scrape data on URLs in the background
This action will scrape data from multiple links in the background. Bardeen will open a new browser tab in the background and extract information.
You can use this action to enrich data.
Imagine you have a list of Twitter profile links and nothing else. To do your influencer outreach campaign, you need to get more information such as name, follower count, etc. Without this information, you won’t know whom to reach out to and can’t personalize your emails.
For this use case, you can enter a list of links as an action argument, and Bardeen will scrape the missing information. No more virtual assistants or copy-pasting.
Trigger: when website data changes
Instead of checking a website a million times a day to get updates, you can set this trigger to do it for you.
This trigger scrapes a website every 10 minutes. If something changed compared to the original version, the trigger fires.
This trigger will return new information, which you can use in your Autobook to send you a notification (email, Slack, or SMS), for example.
Use this trigger to track competitor prices, government tenders, and even product availability.
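The idea behind a change trigger can be sketched as comparing each new scrape against a stored baseline. The Python below is a simplified illustration under that assumption, not Bardeen's actual mechanism; the function names and the sample product record are made up, and the scheduling itself (running every 10 minutes) is left out.

```python
import hashlib
import json

POLL_INTERVAL_SECONDS = 10 * 60  # the trigger described above checks every 10 minutes


def fingerprint(record):
    """Stable hash of a scraped record (a dict of field -> value)."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def changed_fields(old, new):
    """Return {field: (old value, new value)} for every field that differs."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


def check_for_change(baseline, latest):
    """Return the changed fields if the trigger should fire, else None."""
    if fingerprint(baseline) == fingerprint(latest):
        return None
    return changed_fields(baseline, latest)


# Made-up scrape results for a product page.
baseline = {"product": "Desk lamp", "price": "$25", "in_stock": True}
latest = {"product": "Desk lamp", "price": "$19", "in_stock": True}

diff = check_for_change(baseline, latest)
unchanged = check_for_change(baseline, dict(baseline))
```

The returned dictionary of changed fields plays the role of the “new information” that a notification step could include in an email or Slack message.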
Creating a scraper template
A scraper template describes what information you want to extract and where it is located on the page.
Since every website is different, you need to create a scraper template for each website that you want to scrape. The good news is that it’s super easy.
You can create a scraper template in the Playbook builder or the popup window.
All scraper actions (mentioned above) need a scraper template to run. You can pick one from a list of your existing templates, or create one from there.
You can also create or edit scraper templates right from the popup window. Find the scraper icon and click on “New Scraper Template.”
Next, select either an individual or a list scraper type.
Give your scraper template a name, so that it’s easier to find it later in the scraper actions. Remember, one website may have multiple templates: one for LinkedIn profiles, one for companies, and another one to scrape search results. So name them distinctively.
Click on an element that you want to extract and select the data type. Don’t forget that you can get hidden fields by clicking on “Get More Fields!”
Creating a list scraper
When creating a list scraper template, there is an additional step – defining a list.
You will be asked to click on the same element inside two different list items. We do this to understand which list you want to scrape, since some websites have multiple lists on one page.
Bardeen will draw boxes around each element. Click on an element inside any of the boxes to add it to your scraper template.
Loading more list items (pagination)
After you are done defining your list scraper template, a new window will open. It will ask you if you want to load more items (aka pagination).
On the majority of websites, long lists don’t get loaded entirely onto one page. Instead, they get loaded dynamically (infinite scroll) or are spread out across multiple pages.
You have two options to scrape longer lists: infinite scroll and click pagination.
Websites such as Facebook or Instagram load new list elements when you scroll to the bottom of a page. For those websites, choose “infinite scroll.”
Other websites, such as Google or LinkedIn, require you to click a button to go to the next page. For those sites, pick “click pagination” and select the element that takes you to the next page.
Now, what if a list has a million list items but you only need a few hundred?
You can specify either the max number of items or the max number of pages to scrape inside the scraper actions. If left blank, the scraper will try to get as many items as possible.
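How the two limits interact can be sketched as a simple loop. This is an illustrative Python sketch, not Bardeen's code: scrape_with_limits, get_page, and the fake five-page site are hypothetical names and data introduced for the example.

```python
def scrape_with_limits(get_page, max_items=None, max_pages=None):
    """Collect items page by page until a limit is hit or the pages run out.

    get_page(n) is a hypothetical callable that returns the items on page n
    (0-based), or an empty list when there are no more pages.
    """
    items = []
    page = 0
    while max_pages is None or page < max_pages:
        batch = get_page(page)
        if not batch:
            break  # no more pages to load
        items.extend(batch)
        page += 1
        if max_items is not None and len(items) >= max_items:
            return items[:max_items]  # trim the last page to the item limit
    return items


# A fake site with 5 pages of 10 items each.
FAKE_SITE = [[f"item-{p}-{i}" for i in range(10)] for p in range(5)]


def get_fake_page(n):
    return FAKE_SITE[n] if n < len(FAKE_SITE) else []


first_25 = scrape_with_limits(get_fake_page, max_items=25)
two_pages = scrape_with_limits(get_fake_page, max_pages=2)
everything = scrape_with_limits(get_fake_page)
```

With both limits left as None, the loop keeps going until the site stops returning items, which mirrors the “as many items as possible” behavior described above.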
If you want to stop scraping jobs in progress, close the app window and click on the "stop scraping" button at the bottom right corner of the screen. You can also do this from the Activities tab → Queue.
How to edit a scraper template
There are two common scenarios when you will want to update your scraper template.
The first one is when the scraper template breaks and no longer extracts all information correctly. This usually happens when a website gets updated or changes its markup (code).
The second one is when you want to extract additional data fields.
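For the first scenario, one simple way to notice a broken template is to check whether a field that used to return values now comes back empty in every row. The helper below is a hypothetical sketch of that check, not a Bardeen feature; the function name and sample rows are made up.

```python
def missing_fields(rows, expected_fields):
    """Return the expected fields that came back empty in every row.

    If `rows` is empty, every expected field is reported as missing.
    """
    return [
        field
        for field in expected_fields
        if all(not row.get(field) for row in rows)
    ]


# Made-up scrape results before and after a website changed its markup.
rows_before = [{"name": "Ada", "headline": "Mathematician"}]
rows_after = [{"name": "Ada", "headline": None}, {"name": "Alan", "headline": ""}]

ok = missing_fields(rows_before, ["name", "headline"])
broken = missing_fields(rows_after, ["name", "headline"])
```

A non-empty result suggests that the element for that field has moved or been renamed, so the template needs to be edited.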
Editing a scraper template is as easy as creating one. Click on the scraper icon in the popup window and choose which scraper template you want to edit.
A new window will open with the web page that you originally used to create your scraper template. From there, add new data fields or delete the existing ones.
Building Playbooks with scraper
Building automations with the scraper is similar to building any other Playbook. Go to the Builder, add an action, and choose the scraper template.
Scraper outputs data as a table. You can map data from your scraper template to other actions.
For example, let’s try adding LinkedIn profile data to our Notion database.
Click on a box next to a column name and map the related field from the scraper action.
When you use the list scraper, it will output a table with multiple rows. Bardeen will run every action once per row. In the example above, Notion will create one database entry for each LinkedIn profile returned by the list scraper.
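The mapping and the once-per-row behavior can be pictured with a short Python sketch. It only illustrates the idea: map_row, run_for_each_row, the column names, and fake_create_entry (a stand-in for a real “create database entry” action) are all hypothetical.

```python
def map_row(row, mapping):
    """Rename scraper fields to destination column names.

    mapping is {destination column: scraper field}.
    """
    return {column: row.get(field) for column, field in mapping.items()}


def run_for_each_row(rows, mapping, action):
    """Call `action` once per scraped row and collect the results."""
    return [action(map_row(row, mapping)) for row in rows]


# Made-up output of a list scraper: a table with two rows.
scraped = [
    {"name": "Ada Lovelace", "headline": "Mathematician"},
    {"name": "Alan Turing", "headline": "Computer scientist"},
]
mapping = {"Name": "name", "Role": "headline"}

created = []  # stands in for the destination database


def fake_create_entry(entry):
    """Pretend to create one database entry and return its position."""
    created.append(entry)
    return len(created)


results = run_for_each_row(scraped, mapping, fake_create_entry)
```

Two scraped rows lead to two calls of the action, so the stand-in database ends up with two entries.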
Using multiple scraper templates in one Playbook (deep scraper)
You can combine multiple scraper templates in one Playbook. This is usually done when scraping search results and then going through every page on that list to extract additional data. We call the combination of list and individual scraper templates a “deep scraper.”
To build such a Playbook, configure the first scraper action as usual. Then use the scraped links from the first scraper action as the URL argument in your second scraper action.
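The chaining described above can be sketched as a list scrape followed by one individual scrape per link. The Python below is an illustration under that assumption; deep_scrape, the two stand-in scraper functions, and the example.com URLs are hypothetical and not part of Bardeen.

```python
def deep_scrape(list_url, scrape_list, scrape_page, link_field="url"):
    """Scrape a list, then visit every link in it to add more fields.

    scrape_list(url) returns a list of rows (dicts) that contain `link_field`.
    scrape_page(url) returns a dict of extra fields for one page.
    """
    enriched = []
    for row in scrape_list(list_url):
        details = scrape_page(row[link_field])
        enriched.append({**row, **details})
    return enriched


# Made-up data standing in for a search page and two profile pages.
FAKE_SEARCH = {
    "https://example.com/search": [
        {"name": "Ada", "url": "https://example.com/in/ada"},
        {"name": "Alan", "url": "https://example.com/in/alan"},
    ]
}
FAKE_PROFILES = {
    "https://example.com/in/ada": {"headline": "Mathematician"},
    "https://example.com/in/alan": {"headline": "Computer scientist"},
}


def fake_scrape_list(url):
    return FAKE_SEARCH[url]


def fake_scrape_page(url):
    return FAKE_PROFILES[url]


people = deep_scrape("https://example.com/search", fake_scrape_list, fake_scrape_page)
```

Each output row combines the fields from the list scrape with the fields from the matching individual page.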
Selectors used in the demo (relative to the list container):
Name: :scope .feed-shared-actor__name
Headline: :scope .feed-shared-actor__description
Post: :scope .feed-shared-text
Profile url: :scope [data-control-name="actor_container"]
Explore scraper use cases
For some extra inspiration, try out the Playbooks below. As Picasso reportedly said, “Good artists copy, great artists steal.”
LinkedIn profile scraper (XPath)
Is scraping legal?
Yes! Public information on the web is legal to scrape.
In 2017, hiQ Labs, an analytics firm that powered its algorithms with data scraped from LinkedIn, sued LinkedIn after LinkedIn demanded that it stop scraping. In 2019, the US Court of Appeals for the Ninth Circuit ruled in hiQ’s favor.
This ruling supports the legality of scraping publicly available information under US case law.
Can I get blocked?
Some websites, including LinkedIn, go to extra lengths to prevent scraping. New accounts that visit too many pages may get flagged or temporarily disabled.
Bardeen runs in your browser, which means that all scraper actions run from your computer and your local IP address. This makes you less likely to get blocked.
Use the scraper at your own discretion, nevertheless.