Distributed Processing

Payroll matters. One of Justworks’ most important responsibilities is delivering payroll to our customers’ employees. If you’ve ever collected a paycheck, you know it needs to be two things: on time and correct.

What you might not have considered is how complex it is to run payroll calculations for one employee, let alone a large number of employees in a limited time window. Factor in federal, state, and local taxes with their ever-changing regulations, plus benefits, deductions, time off, and more, and it adds up to a lot of processing. So much so that running this job serially would take far longer than the window of time available to deliver the payments that people need.

The only option that allows this process to scale is to adopt distributed processing.

Distributed Processing

Distributed processing is a common strategy for scaling processing power. The idea is to split a large volume of work into a number of independent “jobs” and send them to multiple servers to be processed in parallel. At a time when cloud computing makes it easy to instantiate and configure fleets of servers, distributed processing is a natural strategy for software organizations to adopt as they scale up. Whereas a large volume of work might take a long time to process on a single server, distributing it across a fleet of servers can, if done right, complete it in a fraction of the time.
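
To make this concrete, here is a minimal sketch of fanning work out as independent jobs using Rails’ ActiveJob. PayrollJob is an illustrative name, and Payroll.calculate_for is a hypothetical method standing in for the real per-employee work.

class PayrollJob < ApplicationJob
  queue_as :payroll

  def perform(employee_id)
    # Any available worker in the fleet can pick this up and run it
    # independently of the others.
    Payroll.calculate_for(Employee.find(employee_id))
  end
end

# Enqueue one job per employee; the queue backend distributes them
# across the server fleet.
Employee.find_each { |employee| PayrollJob.perform_later(employee.id) }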

Challenges in Reliability

To implement distributed processing well, we should be able to add new servers easily to increase parallel computing power. As such, when the servers process jobs, they do so independently of each other. But how do we ensure that no job is processed more than once, by multiple servers? What if a server fails, restarts, and picks up a job that another server is already processing? What if a bug in your code sends the same job to multiple servers? How do we handle race conditions where multiple servers start processing the same job at once? Certain functions, such as payroll, are transactional in nature and would produce errors if a job were processed multiple times by multiple servers.

Database Locks

One technique we found useful is to leverage database locking to ensure that each job is processed only once. This requires persisting information about each job and its processing status in a database table, like the example below.

The job_items database table

id | job_name             | job_results     | job_status
---|----------------------|-----------------|-------------
1  | “Do a thing”         | “Thing is done” | “Done”
2  | “Do a thing”         |                 | “Processing”
3  | “Do the other thing” |                 | “New”

The important column to note in the table above is job_status. It denotes whether a job has yet to start processing (“New”), is currently being processed (“Processing”), or has finished (“Done”).
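
For reference, a table like this could be created with a migration along the following lines. This is a sketch: the column types and the default value are assumptions, not the exact production schema.

class CreateJobItems < ActiveRecord::Migration[7.0]
  def change
    create_table :job_items do |t|
      t.string :job_name, null: false
      t.string :job_results
      t.string :job_status, null: false, default: "New"
      t.timestamps
    end
  end
end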

With the information in the job_status column, each server can independently figure out whether it has been sent a duplicate job. To avoid a race condition in which multiple servers are about to start processing the same job, a server can place a database lock on the specific record that represents the job and update its job_status to “Processing” before doing the actual processing.

Implementation

def process_job(job_item_id)
  job_item = nil
  do_process = false

  # Keep this transaction short: lock the row, claim the job, release.
  ActiveRecord::Base.transaction do
    job_item = JobItem.lock.find(job_item_id)

    # Only claim the job if no other server has picked it up yet.
    if job_item.job_status == "New"
      job_item.job_status = "Processing"
      job_item.save!
      do_process = true
    end
  end

  # The actual work happens outside the transaction, so the lock is
  # held only for the brief status update above.
  if do_process
    # Process the job!

    job_item.job_status = "Done"
    job_item.save!
  end
end

The above example is written in Ruby on Rails. Rails provides an easy way to retrieve a record with a pessimistic lock so that the record cannot be updated by another request until the lock is released by the database transaction. Other frameworks such as Django (Python), Spring Framework (Java), and Entity Framework (C#) should also have similar functionality to allow the developer to lock a record for updates.

The technique illustrated above uses a database transaction that is meant to execute quickly. It first locks a record and checks whether the record’s job_status is “New”, denoting that it hasn’t been picked up for processing. Because the record was retrieved with a lock, other servers trying to retrieve it must wait until the current transaction completes. The only update the code performs within the transaction is setting the job_status column to “Processing”, which is what makes this lock quick to release. If a second server tries to retrieve the record, it only gets it after job_status has already been updated to “Processing”. This prevents the second server from executing the code inside the conditional, and in doing so ensures that only one server processes the job.
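
For the curious, here is roughly what the pessimistic lock looks like at the database level (PostgreSQL shown; the exact SQL varies by adapter):

# Rails appends FOR UPDATE to the query, so an identical query from a
# second server blocks until the first transaction commits.
JobItem.lock.find(job_item_id)
# => SELECT "job_items".* FROM "job_items"
#    WHERE "job_items"."id" = $1 LIMIT 1 FOR UPDATE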

Note that it is possible for a job to remain stuck in “Processing” status if an error occurs and terminates the process. When this happens, we would need to reset the status back to “New” so the job can be picked up for processing again; a sketch of one such recovery pass follows.
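
The sketch below assumes the standard Rails updated_at timestamp on job_items; the one-hour staleness threshold is an arbitrary example, and in practice you would pick a threshold safely longer than your longest expected job.

# Reset stuck jobs (e.g., run periodically from a scheduled task).
JobItem.where(job_status: "Processing")
       .where("updated_at < ?", 1.hour.ago)
       .update_all(job_status: "New")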

Scale with Reliability and Data Integrity

As your organization grows, scaling your system to handle a larger load becomes a necessity. This is a great problem to have! Care must be taken, however, to ensure reliability and data integrity as you adopt system architectures to absorb a growing need for processing power. Distributed processing is one way to achieve larger scale, and the technique outlined above is one implementation that increases confidence that your system remains reliable and your data sound.