The code for this project can be found in the GitHub repo https://github.com/klameer/jobscrape.git along with instructions on how to get up and running quickly.

As an IT contractor and freelancer, I spend a lot of time keeping an eye on the job boards in case a suitable opportunity comes up. And there are a lot of job boards out there. I have compiled a list of 10 that I need to visit at least once a week. This is a mostly manual process. You can speed it up with browser ninjutsu and shortcut keys, but I would be lucky if I visited all 10 of the sites in one go.

Since this is a manual process, a better solution would be to automate as much of it as possible, so that the work of actually visiting the sites is done by a script that pulls out only the information deemed important.

As it turns out, this is pretty easy to do with a little up-front work in Python. With not a lot of code, a script can visit each of the job sites and accumulate the relevant data. There are also great frameworks for displaying the accumulated data and making it easy to sift through.

This solution will extract data from two of the major job sites, Monster.com and Total Jobs, and the same process can be used to bring in additional job sites in the future. The solution will look for data analyst roles and consolidate all the information into a single page that can be viewed at your leisure.

The objective is one page that has all the data you are looking for.

Libraries

Creating the solution is quick and the output is high quality. This is made possible by the excellent software libraries available for you to use; it's just a matter of assembling the blocks. Without these libraries, this post would either not be possible or be pretty long.

  1. Python Requests. A simple way to download web pages. Because access to web pages can be scripted, it is great for our requirement of accessing multiple web pages and gathering their content.
  2. Beautiful Soup. The HTML text within web pages is long and messy, and there is no elegant way of traversing raw HTML. That is, until Beautiful Soup came along. It gives you a useful set of functions that turn the page into an object hierarchy that is accessible in code.
  3. Python Flask. A minimalist web framework that prides itself on being as lightweight as possible. We need to display our consolidated dataset in an interactive format, and Flask is the quickest way to do it.
  4. TinyDB. The solution consolidates data, which needs to be stored somewhere. There are multiple ways of doing this in Python. TinyDB is a document database and works well for storing Python dictionary objects.
  5. Bootstrap. A popular framework for designing the front end of web applications. You can get up and running with well-styled web pages quickly.
  6. jQuery DataTables. A great way of displaying tabular data on a web page. It turns an ordinary table into something of wonder, with pagination and search functionality straight out of the box.
  7. Regular Expressions (re). Regular expressions are a way of finding patterns within text. Once a pattern has been found, the matching text can be manipulated, for example by replacing or deleting values. Regular expressions are great for cleaning data into the format you need.
  8. datetime. The datetime module is a comprehensive solution for dealing with dates. It comes in useful when trying to figure out the latest jobs posted on the site.

Downloading Data

Two of the most popular job boards in the UK are Monster and Total Jobs. In many cases the jobs posted on these boards are not unique and can appear on multiple boards at the same time, either because agencies post jobs to several sites or because jobs on some sites are featured on others. The solution looks for particular jobs (data analyst roles), and these two sites were chosen because there is not much overlap between them.

Both Monster and Total Jobs express their job searches as a link. This is known as a GET request, where the link determines the list of jobs shown. The workflow for downloading data is to create the link that displays the type of jobs of interest, then use what is on that page to build the consolidated job list. The links tend not to change over time, so this is a one-off process that has to be carried out for each job board.
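As a minimal sketch of what this means in practice, the search parameters are carried in the query string of the link. The parameter names below are taken from the Monster URL used later in this post, and requests assembles them into the final link for you.

import requests

# Assemble a Monster search link from its query parameters
# (parameter names taken from the search URL used later in this post)
params = {"q": "Data-Analyst", "where": "london", "sort": "dt.rv.di"}
resp = requests.get("https://www.monster.co.uk/jobs/search/Contract_8", params=params)
print(resp.url)  # the full GET link that returns the job listing page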

The output from a webpage is HTML, and without Python's Beautiful Soup library it is hard to work out which elements you are looking for and extract the text from them. Monster and Total Jobs are built very differently, so the HTML elements for each will differ.

Start off by loading the libraries common to both sites.

import requests   # download web pages
import bs4 as bs  # Beautiful Soup, for parsing HTML
import re         # regular expressions, for cleaning text
import datetime   # for working out which jobs were posted today

# Today's date as a string, used to filter for jobs posted today
today = datetime.datetime.now().strftime("%Y-%m-%d")

Monster

Go to the Monster website and search for the required role. What I normally look for is contract data analyst roles; contract roles tend to have a different URL.

We are looking for today's jobs. The posting date is available in a time tag's datetime attribute. The job title also contains line feeds and carriage returns that need to be removed so that only the text of the title remains. The whole scrape is wrapped in a get_monster() function so the merge step later in the post can call it.

def get_monster():
    url = 'https://www.monster.co.uk/jobs/search/Contract_8?q=Data-Analyst&where=london&sort=dt.rv.di'

    final = []
    resp = requests.get(url)
    soup = bs.BeautifulSoup(resp.text, "lxml")

    # Each job sits in an article element inside the results wrapper
    jobs = soup.find("section", {"id": "resultsWrapper"}).find_all("article", {"class": "js_result_row"})

    for job in jobs:
        job_title = job.find("div", {"class": "jobTitle"}).find("a").text
        job_link = job.find("div", {"class": "jobTitle"}).find("a").get("href")
        job_title = re.sub(r"\n|\r", "", job_title)  # strip line feeds and carriage returns

        job_date = job.find("time").get("datetime")
        job_date = datetime.datetime.strptime(job_date, "%Y-%m-%dT%H:%M").strftime("%Y-%m-%d")

        row = {"source": "monster", "title": job_title, "link": job_link, "time": job_date}

        # Keep only jobs posted today
        if job_date == today:
            final.append(row)

    return final
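Run standalone, a quick sanity check of the function might look like this (the output depends on whatever is on the site that day):

# Print the first few of today's Monster jobs
for row in get_monster()[:3]:
    print(row["title"], row["link"])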

Total Jobs

The same applies to Total Jobs. The difference is that there is no specific date to filter on. Looking at the web page, jobs posted on the current day say they were posted "Today", so that is what will be filtered on. As with Monster, the scrape is wrapped in a function, get_total_jobs(), so the merge step can call it.

url = "https://www.totaljobs.com/jobs/contract/data-analyst"
resp = requests.get(url)

soup = bs.BeautifulSoup(resp.text, "lxml")
jobs = []
results  = soup.find_all("div", {"class":"job-title"})
for result in results:
    job_title = result.find("a").text
    job_title = re.sub("\n", "", job_title)
    job_link = result.find("a").get("href")
    job = {"source":"totaljobs", "title":job_title, "link":job_link}
    jobs.append(job)

results  = soup.find_all("div", {"class":"detail-body"})
i = 0
for result in results:
    job_date = result.find("li", {"class", "date-posted"}).text
    job_date = re.sub(" +|\n", "", job_date)
    jobs[i]["time"] = job_date
    i += 1
final = []
for job in jobs:
    if job['time'] == 'Today':
        final.append(job)

Merge Datasets

The blocks of code above each return a list of jobs from their website, and these lists need to be merged. Well-organised software projects are as modular as the solution allows; in the case of this solution, downloading the data has been separated from displaying it.

from tinydb import TinyDB

db = TinyDB("myfile.json")
db.purge()  # clear out yesterday's jobs (renamed truncate() in TinyDB 4)

for job in get_monster():
    db.insert(job)

for job in get_total_jobs():
    db.insert(job)

TinyDB is a handy document database and a great way of managing unstructured data, such as objects that have different properties from each other. Object serialization is the process of converting objects into a format that can be stored. There are many ways to do this in Python, the most popular being Pickle, which can save and retrieve any object type. Pickle could be used for storage here, but TinyDB is great for JSON documents, and the jobs and their links are stored as JSON because it is easy to display in this format.
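As a brief illustration of how TinyDB stores and retrieves these dictionaries, here is a small sketch (the file name and record are made up for the example):

from tinydb import TinyDB, Query

db = TinyDB("example.json")  # hypothetical file, for illustration only
db.insert({"source": "monster", "title": "Data Analyst", "time": "Today"})

# Query the documents back out by field value
Job = Query()
print(db.search(Job.source == "monster"))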

Display on a Webpage

As mentioned earlier, the fastest way to get up and running with a web application in Python is with Flask. A micro framework gives you bare-bones functionality, and you decide what other libraries to attach to your solution. The Flask app will have two functions: one to show the main page and another to retrieve the data and serve it as an API.

from flask import Flask
from flask import render_template
from tinydb import TinyDB
import json

app = Flask(__name__)

# Serve the main page
@app.route("/")
def index():
    return render_template("index.html")

# Serve the consolidated jobs as JSON, in the shape DataTables expects
@app.route("/data")
def data():
    db = TinyDB("myfile.json")
    data = db.all()
    ret_val = {"data": data}
    return json.dumps(ret_val)

if __name__ == "__main__":
    app.run(debug=True)

Bootstrap

Bootstrap has a starter template. Once this is inserted into your index.html file, you have all the styling of Bootstrap. Standard Bootstrap styles most HTML elements, including the table that displays the consolidated data.
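A minimal index.html along these lines, assuming Bootstrap is loaded from its CDN (the version pinned here is illustrative), gives the page its styling. The table and script from the next section go inside the body:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css">
    <title>Job Board</title>
</head>
<body>
    <div class="container">
        <!-- table and DataTables script from the next section go here -->
    </div>
</body>
</html>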

JavaScript Data Tables

The future vision for this project is to search for jobs on more than 10 sites. Consolidated, this could be quite a number of records. DataTables gives you functionality like pagination, search and ordering that lets you sift through a large number of records to quickly get to the data you need.

The DataTables library depends on the jQuery library, which means it needs to be loaded after jQuery. The Bootstrap template already loads jQuery, making it available on the page.

<table id="table_id" class="table table-striped table-bordered" style="width:100%">
    <thead>
        <tr>
            <th>source</th>
            <th>title</th>
        </tr>
    </thead>
</table>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js"></script>
<script type="text/javascript" charset="utf8" src="https://cdn.datatables.net/1.10.19/js/jquery.dataTables.js"></script>
$(document).ready( function () {
    $('#table_id').DataTable({
      "ajax":"/data",
      "columns":[
        {"data":"source"},
        {"data": "title",
          "fnCreatedCell": function(nTd, sData, oData, iRow, iCol){
            $(nTd).html("<a href='" + oData.link + "'>" + oData.title + "</a>");
          }}]});});

DataTables makes a call to the data function, which returns the data that populates the table. The table then neatly displays the consolidated list of jobs along with links that make them easy to access.
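For reference, the /data endpoint returns JSON shaped like this (the values here are illustrative), which matches the top-level "data" array that DataTables expects by default:

{"data": [{"source": "monster",
           "title": "Data Analyst",
           "link": "https://www.monster.co.uk/...",
           "time": "2019-06-01"}]}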

Requirements
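Based on the libraries used above, a requirements.txt for the project would look something like this (package names only; pin versions as needed):

requests
beautifulsoup4
lxml
flask
tinydb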

Next Steps

This is a simple solution intended as a template that can be built on. Going forward, I would like to add the following functionality.

  1. Obviously, bring in other job sites using the same process to give wider coverage of jobs.
  2. Be able to put in multiple search terms. Currently this looks only for data analysts, but the same process can be adapted to look for any other type of job displayed on the websites.
  3. Build a word cloud of the most frequently used relevant words in the job descriptions. This can help tweak your CV to make sure it includes the words employers are searching for. A small sketch of the counting step follows this list.
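A minimal sketch of the counting behind such a word cloud, assuming the job descriptions have been scraped into plain strings (the sample data here is made up):

from collections import Counter
import re

# Hypothetical scraped job descriptions, for illustration only
descriptions = [
    "Contract Data Analyst with strong SQL and Python skills",
    "Data Analyst needed: Python, SQL and Tableau experience",
]

# Split into lowercase words and count the most frequent ones
words = re.findall(r"[a-z]+", " ".join(descriptions).lower())
print(Counter(w for w in words if len(w) > 2).most_common(10))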