How to Make a Web Scraper with AWS Lambda and the Serverless Framework?
To build a new function, modify your Lambda code, or execute it, go to the AWS Lambda web portal.

Before starting development, you should be familiar with the following:

  • Node.js and modern JavaScript
  • NPM
  • The Document Object Model
  • Basic Linux command line
  • Basic donkey care

The idea behind AWS is that Amazon provisions and maintains every aspect of your application's infrastructure, from storage to processing power, in a cloud environment (i.e., on Amazon's computers), so you can build cloud-hosted apps that scale automatically. You won't have to set up or manage servers because Amazon takes care of that. A Lambda function is a cloud-hosted function that executes on demand, triggered by events or API requests. We recommend using the Serverless Framework to develop the Lambda function.

For instance, suppose you want to collect the recipes posted on a particular website. You can scrape that information from the site. In this tutorial, we scrape job listings from The Donkey Sanctuary.

Step 1: Serverless Setup


Start by reading the quick start guide for the Serverless Framework. Serverless removes most of the difficulty of setting up AWS infrastructure, allowing you to develop and test locally before deploying everything to the cloud.

Create a new serverless project:

$ serverless create --template aws-nodejs --path donkeyjob
$ cd donkeyjob

The project starts with a serverless.yml file. YAML is a commonly used language for configuration, and this file holds all of the AWS configuration information. For the time being, we can ignore all of the comments and keep just the following. (The nodejs6.10 runtime shown here has since been retired by AWS; substitute a currently supported Node.js runtime.)

service: donkeyjob

provider:
  name: aws
  runtime: nodejs6.10

functions:
  getdonkeyjobs:
    handler: handler.getdonkeyjobs

This configuration declares a function named getdonkeyjobs, so we need to export a function with that name from handler.js.

This is the function that will be deployed to AWS and triggered to scrape the job listings.

Create a basic function in handler.js.

A Lambda handler receives three arguments: an event, a context, and a callback. Let's start with the basics for now. You can remove the rest of the file.

module.exports.getdonkeyjobs = (event, context, callback) => {
  callback(null, 'Hello world');
};

Test the script locally:

$ serverless invoke local --function getdonkeyjobs

The result of the above script will be “Hello world”.
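
To make the callback contract concrete, here is a minimal sketch that calls the same handler directly from Node. The `invokeLocally` helper is hypothetical (it is not part of the Serverless Framework) and simply wraps the callback-style handler in a Promise; the empty event and context objects are placeholders.

```javascript
// The handler from this step, unchanged.
const getdonkeyjobs = (event, context, callback) => {
  callback(null, 'Hello world');
};

// Hypothetical helper: wraps a callback-style Lambda handler in a Promise.
// The first callback argument is an error (or null), the second is the result.
function invokeLocally(handler, event = {}, context = {}) {
  return new Promise((resolve, reject) => {
    handler(event, context, (error, result) => {
      if (error) reject(error);
      else resolve(result);
    });
  });
}

invokeLocally(getdonkeyjobs).then((result) => console.log(result)); // prints "Hello world"
```

Passing an error as the first callback argument marks the invocation as failed, which is why the later examples end their promise chains with `.catch(callback)`.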

Step 2: Scraping The Data

Next, we build the scraping functionality for the Donkey Sanctuary jobs page: we parse the page's HTML to get the list of jobs in the following format:

[
{job: 'Marketing Campaigns Officer', closing: 'Fri Jul 21 2017 00:00:00 GMT+0100', location: 'Leeds, UK'},
{job: 'Registered Veterinary Nurse', closing: 'Sat Jul 22 2017 00:00:00 GMT+0100', location: 'Manchester, UK'},
{job: 'Building Services Manager', closing: 'Fri Jul 21 2017 00:00:00 GMT+0100', location: 'London, UK'}
];

We use Axios to request the page contents, then pass the HTML string to a separate parsing function that can be tested on its own. Inside the parsing function, we use the cheerio library to parse the HTML and extract the desired data.

Cheerio is similar to jQuery: you feed it an HTML string (for example, the response to a GET request for a page) and it constructs a document object model for you to navigate and manipulate.

Moment is a useful package for dealing with dates that makes it simple to produce an ISO string.

// handler.js
const request = require('axios');
const {extractListingsFromHTML} = require('./helpers');

module.exports.getdonkeyjobs = (event, context, callback) => {
  // Fetch the vacancies page, parse it, and hand the jobs to the callback
  request('https://www.thedonkeysanctuary.org.uk/vacancies')
    .then(({data}) => {
      const jobs = extractListingsFromHTML(data);
      callback(null, {jobs});
    })
    .catch(callback);
};

// helpers.js
const cheerio = require('cheerio');
const moment = require('moment');

function extractListingsFromHTML (html) {
  const $ = cheerio.load(html);
  const vacancyRows = $('.view-Vacancies tbody tr');

  const vacancies = [];
  vacancyRows.each((i, el) => {
    // Extract information from each row of the jobs table
    let closing = $(el).children('.views-field-field-vacancy-deadline').first().text().trim();
    let job = $(el).children('.views-field-title').first().text().trim();
    let location = $(el).children('.views-field-name').text().trim();
    // Keep only the date before the '-' separator and convert it to an ISO string
    closing = moment(closing.slice(0, closing.indexOf('-') - 1), 'DD/MM/YYYY').toISOString();

    vacancies.push({closing, job, location});
  });

  return vacancies;
}

module.exports = {
  extractListingsFromHTML
};
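
The date handling in the parser is the easiest part to get wrong, so here is a dependency-free sketch of the same idea. It assumes the deadline cell contains text such as `21/07/2017 - 17:00` (the exact format is a guess based on the `slice` call above), and it swaps Moment for the built-in `Date` with UTC midnight, so the output does not depend on the machine's time zone. The name `parseDeadline` is hypothetical.

```javascript
// Hypothetical helper: turns "DD/MM/YYYY - ..." into an ISO string, or null.
function parseDeadline(text) {
  // Keep only what precedes the first '-' (also works when there is no '-').
  const datePart = text.split('-')[0].trim();
  const match = /^(\d{2})\/(\d{2})\/(\d{4})$/.exec(datePart);
  if (!match) return null;

  const [, day, month, year] = match.map(Number);
  const date = new Date(Date.UTC(year, month - 1, day));

  // Reject impossible dates such as 31/02/2017, which Date would roll over.
  if (date.getUTCDate() !== day || date.getUTCMonth() !== month - 1) return null;
  return date.toISOString();
}

console.log(parseDeadline('21/07/2017 - 17:00')); // 2017-07-21T00:00:00.000Z
console.log(parseDeadline('not a date'));         // null
```

Returning `null` for unparseable text makes a changed page format visible in the output instead of silently producing a wrong date.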

To use cheerio, you need to understand how to navigate the DOM precisely and select the elements you want. To do this, use your browser's dev tools to study the HTML structure of the site you're scraping, and keep in mind that if the layout of that HTML changes in the future, your scraper may stop working.
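
One inexpensive guard against such layout changes is to check the parser's output before using it. The sketch below is an illustration, not part of the original scraper: `assertListingsLookSane` is a hypothetical name, and treating an empty result as an error is an assumption (a site with no open vacancies would also produce an empty list).

```javascript
const REQUIRED_FIELDS = ['job', 'closing', 'location'];

// Hypothetical guard: throws if the scraped listings look wrong, which is
// usually the first symptom of a changed page layout or renamed CSS class.
function assertListingsLookSane(listings) {
  if (!Array.isArray(listings) || listings.length === 0) {
    throw new Error('No listings extracted: the page layout may have changed');
  }
  listings.forEach((listing, index) => {
    for (const field of REQUIRED_FIELDS) {
      if (typeof listing[field] !== 'string' || listing[field] === '') {
        throw new Error(`Listing ${index} is missing "${field}"`);
      }
    }
  });
  return listings;
}

// Usage with a hand-written sample in the shape the parser returns.
const sample = [
  { job: 'Chef', closing: '2017-07-21T00:00:00.000Z', location: 'Sheffield, UK' }
];
console.log(assertListingsLookSane(sample).length); // 1
```

In the handler, a thrown error would reach `.catch(callback)`, so the failed run shows up in the function's logs rather than passing unnoticed.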


If we run our function now, it should print the array of jobs:

$ serverless invoke local --function getdonkeyjobs

Step 3: Setup DynamoDB

A Lambda function cannot persist data by itself; it only holds temporary state while it runs. We will configure DynamoDB as an AWS resource and give the Lambda function permission to interact with it. The serverless.yml will now look like this:

service: donkeyjob

provider:
  name: aws
  runtime: nodejs6.10

functions:
  getdonkeyjobs:
    handler: handler.getdonkeyjobs

resources:
  Resources:
    donkeyjobs:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: donkeyjobs
        AttributeDefinitions:
          - AttributeName: listingId
            AttributeType: S
        KeySchema:
          - AttributeName: listingId
            KeyType: HASH
        ProvisionedThroughput:
          ReadCapacityUnits: 1
          WriteCapacityUnits: 1

    # A policy is a resource that states one or more permissions. It lists actions, resources and effects.

    DynamoDBIamPolicy:
      Type: AWS::IAM::Policy
      DependsOn: donkeyjobs
      Properties:
        PolicyName: lambda-dynamodb
        PolicyDocument:
          Version: '2012-10-17'
          Statement:
            - Effect: Allow
              Action:
                - dynamodb:DescribeTable
                - dynamodb:Query
                - dynamodb:Scan
                - dynamodb:GetItem
                - dynamodb:PutItem
                - dynamodb:UpdateItem
                - dynamodb:DeleteItem
              Resource: arn:aws:dynamodb:*:*:table/donkeyjobs
        Roles:
          - Ref: IamRoleLambdaExecution

To create the DynamoDB resource, we have to deploy this configuration to AWS. Because a Lambda function is just a function, we can test it locally without communicating with AWS, but we can't test how it works with a database without actually having one.

So we deploy:

$ serverless deploy

This sends our program to AWS and generates the resources we specified in the configuration file.

Step 4: Interact with DynamoDB


The database is now ready to use. To work with it, we install the aws-sdk package (the AWS Software Development Kit), which makes interacting with DynamoDB simple.

Each time we scrape a new list of jobs, we follow these steps.

Fetch yesterday's jobs from the database using the dynamo.scan method. The stored item looks like this:

{
jobs: [ {job: 'Donkey Feeder',
closing: 'Fri Jul 21 2017 00:00:00 GMT+0100',
location: 'Leeds, UK'},
{job: 'Chef',
closing: 'Fri Jul 21 2017 00:00:00 GMT+0100',
location: 'Sheffield, UK'}
],
listingId: 'Fri Jul 21 2017 14:25:35 GMT+0100 (BST)'
}

Compare yesterday's jobs with today's jobs to find the new ones, using some handy lodash functions.

Delete yesterday's jobs from the database with dynamo.delete.

Save today's jobs in their place with dynamo.put.

Call the callback with the new jobs.

const request = require('axios');
const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient();
const { differenceWith, isEqual } = require('lodash');
const { extractListingsFromHTML } = require('./helpers');

module.exports.getdonkeyjobs = (event, context, callback) => {
  let newJobs, allJobs;

  request('https://www.thedonkeysanctuary.org.uk/vacancies')
    .then(({ data }) => {
      allJobs = extractListingsFromHTML(data);

      // Retrieve yesterday's jobs
      return dynamo.scan({
        TableName: 'donkeyjobs'
      }).promise();
    })
    .then(response => {
      // Figure out which jobs are new
      let yesterdaysJobs = response.Items[0] ? response.Items[0].jobs : [];

      newJobs = differenceWith(allJobs, yesterdaysJobs, isEqual);

      // Get the ID of yesterday's jobs which can now be deleted
      const jobsToDelete = response.Items[0] ? response.Items[0].listingId : null;

      // Delete old jobs
      if (jobsToDelete) {
        return dynamo.delete({
          TableName: 'donkeyjobs',
          Key: {
            listingId: jobsToDelete
          }
        }).promise();
      } else return;
    })
    .then(() => {
      // Save the list of today's jobs
      return dynamo.put({
        TableName: 'donkeyjobs',
        Item: {
          listingId: new Date().toString(),
          jobs: allJobs
        }
      }).promise();
    })
    .then(() => {
      callback(null, { jobs: newJobs });
    })
    .catch(callback);
};

We can test the function locally by executing:

$ serverless invoke local --function getdonkeyjobs

This time, we should expect our callback to receive a list of all the positions currently published on The Donkey Sanctuary site, because they are all 'new' to us: there are no jobs in our database from the day before.

If you go to the AWS console now, open DynamoDB, select the donkeyjobs table, and look at its items, you should see today's data saved there.

If you run the function locally again, the jobs array will be empty. That's because we're comparing the scraped jobs to whatever is already in the database, and nothing has changed unless a new job was added in the last few minutes.
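
The first-run and second-run behaviour can be reproduced without AWS. The sketch below is only an illustration under stated assumptions: `createFakeDynamo` is a hypothetical in-memory stand-in that mimics just the `scan`, `delete`, and `put` calls (with `.promise()`) used in the handler, and `sameJob` is a simplified replacement for lodash's `differenceWith(..., isEqual)` that works because each listing is a flat object with three string fields.

```javascript
// Hypothetical in-memory stand-in for the DocumentClient calls used above.
// It is not part of aws-sdk and ignores TableName.
function createFakeDynamo() {
  const items = new Map(); // listingId -> stored item
  const wrap = (fn) => ({ promise: () => Promise.resolve().then(fn) });
  return {
    scan: () => wrap(() => ({ Items: [...items.values()] })),
    delete: ({ Key }) => wrap(() => { items.delete(Key.listingId); return {}; }),
    put: ({ Item }) => wrap(() => { items.set(Item.listingId, Item); return {}; })
  };
}

// Field-by-field comparison of two listings.
const sameJob = (a, b) =>
  a.job === b.job && a.closing === b.closing && a.location === b.location;

// Same steps as the handler: scan, diff, delete the old item, put the new one.
async function syncJobs(dynamo, allJobs) {
  const { Items } = await dynamo.scan({ TableName: 'donkeyjobs' }).promise();
  const previous = Items[0];
  const oldJobs = previous ? previous.jobs : [];
  const newJobs = allJobs.filter((job) => !oldJobs.some((old) => sameJob(job, old)));

  if (previous) {
    await dynamo.delete({
      TableName: 'donkeyjobs',
      Key: { listingId: previous.listingId }
    }).promise();
  }
  await dynamo.put({
    TableName: 'donkeyjobs',
    Item: { listingId: new Date().toString(), jobs: allJobs }
  }).promise();
  return newJobs;
}

const todaysJobs = [
  { job: 'Donkey Feeder', closing: '2017-07-21T00:00:00.000Z', location: 'Leeds, UK' },
  { job: 'Chef', closing: '2017-07-21T00:00:00.000Z', location: 'Sheffield, UK' }
];

async function demo() {
  const dynamo = createFakeDynamo();
  const firstRun = await syncJobs(dynamo, todaysJobs);  // table empty: everything is new
  const secondRun = await syncJobs(dynamo, todaysJobs); // nothing changed: no new jobs
  return { firstRun, secondRun };
}

demo().then(({ firstRun, secondRun }) => {
  console.log(firstRun.length, secondRun.length); // 2 0
});
```

A fake like this only checks the comparison logic; permissions, table names, and the real service's behaviour still need the deployed table.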

Step 5: Sending a Text Using Nexmo

Now that we have a list of new opportunities, let's send an SMS notifying our users of all the fascinating donkey jobs they can apply for!

To begin, create an account with Nexmo. It gives you a free $2 credit to play with, which is more than enough. After you sign up, you should be taken to a dashboard where you will find an API key and an API secret. You'll need these to send a text message through Nexmo.

The nexmo npm package makes it easy to send a text. Install it and require it in your handler.js file. We can then send any text message we wish just before calling the final callback in our getdonkeyjobs handler:

.then(() => {
  if (newJobs.length) {
    const nexmo = new Nexmo({
      apiKey: NEXMO_API_KEY,
      apiSecret: NEXMO_API_SECRET
    });
    nexmo.message.sendSms('Donkey Jobs Finder', MY_PHONE_NUMBER, 'Hello, we found a new donkey job!');
  }
  callback(null, { jobs: newJobs });
})
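
The snippet above uses `NEXMO_API_KEY`, `NEXMO_API_SECRET`, and `MY_PHONE_NUMBER` without showing where they come from. One common option, sketched here as an assumption rather than the original author's setup, is to read them from environment variables so the credentials stay out of the source code. The helper name `loadSmsConfig` is hypothetical.

```javascript
// Hypothetical helper: collects the SMS settings from environment variables
// and fails early, with a clear message, if any of them is missing.
function loadSmsConfig(env = process.env) {
  const names = ['NEXMO_API_KEY', 'NEXMO_API_SECRET', 'MY_PHONE_NUMBER'];
  const missing = names.filter((name) => !env[name]);
  if (missing.length) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return {
    apiKey: env.NEXMO_API_KEY,
    apiSecret: env.NEXMO_API_SECRET,
    to: env.MY_PHONE_NUMBER
  };
}

// Usage with placeholder values (not real credentials).
const config = loadSmsConfig({
  NEXMO_API_KEY: 'example-key',
  NEXMO_API_SECRET: 'example-secret',
  MY_PHONE_NUMBER: '440000000000'
});
console.log(config.apiKey); // example-key
```

With this in place, `apiKey` and `apiSecret` can be passed to the `Nexmo` constructor and `to` used as the recipient. For the deployed function, the variables would also need to be defined in the Lambda's environment.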

To test this, we'll need to clear the DynamoDB table so that the Lambda function assumes there are new jobs, and then run our function locally once again.

With any luck, an SMS message should arrive!

The final step is to improve the formatting of our text message. For this, we can write a new helper function that takes a list of jobs and returns a formatted message with the job title, location, and closing date of each one.

Remember that every time we want to test our function, we'll have to clear the table again (there are certainly better ways to do this, but for now it's easy enough to remove the item in the AWS console).

function formatJobs (list) {
  return list.reduce((acc, job) => {
    return `${acc}${job.job} in ${job.location} closing on ${moment(job.closing).format('LL')}\n\n`;
  }, 'We found:\n\n');
}

module.exports = {
  extractListingsFromHTML,
  formatJobs
};
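
To see what the resulting message looks like, here is a self-contained variant of the helper. It is a sketch under one stated substitution: Moment's `format('LL')` is replaced with the built-in `toLocaleDateString` in the `en-US` locale and UTC, so the example runs without extra packages. The sample job is made up for illustration.

```javascript
// Stand-in for moment(...).format('LL'), e.g. "July 21, 2017".
function formatClosingDate(isoString) {
  return new Date(isoString).toLocaleDateString('en-US', {
    year: 'numeric',
    month: 'long',
    day: 'numeric',
    timeZone: 'UTC'
  });
}

// Same reduce as above: one line per job, separated by blank lines.
function formatJobs(list) {
  return list.reduce(
    (acc, job) =>
      `${acc}${job.job} in ${job.location} closing on ${formatClosingDate(job.closing)}\n\n`,
    'We found:\n\n'
  );
}

const message = formatJobs([
  { job: 'Chef', closing: '2017-07-21T00:00:00.000Z', location: 'Sheffield, UK' }
]);
console.log(message);
```

In the handler, `formatJobs(newJobs)` would then take the place of the fixed string passed as the third argument to `nexmo.message.sendSms`.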

And now that we've finished, we can finally deploy our entire application to AWS:

$ serverless deploy

Step 6: Configuring Lambda to Execute Every Day

After we've deployed the function, we can run it from the AWS Lambda console to make sure it's working properly.

We can also set the function to run automatically once a day. In the Lambda console, select 'Add Trigger', choose 'CloudWatch Events' from the drop-down menu, and fill in the relevant details. To run it daily, we can use the schedule expression rate(1 day).
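
As an alternative to the console, the Serverless Framework can declare the same daily trigger in serverless.yml through a `schedule` event on the function. The fragment below is a sketch of how the functions section from earlier might look with it added; the trigger is created on the next `serverless deploy`.

```yaml
functions:
  getdonkeyjobs:
    handler: handler.getdonkeyjobs
    events:
      # Invoke the function once a day
      - schedule: rate(1 day)
```

Keeping the schedule in the configuration file means it is versioned with the code and recreated automatically if the service is deployed to a fresh account or stage.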
