Taming and Transforming Legacy Distributed Systems and Microservices With Django

Palmetto Park's home-grown CRM, built by multiple freelancers and distributed across multiple disparate microservices, suffered from frequent failures and data integrity issues. The system's complexity and lack of observability caused critical features like the property alerts to be unreliable, risking missed business opportunities and escalating maintenance costs.

Client
Timeline
November 2020 / 12 months
Services
Software Architecture
Technologies
Django
Celery
Python
Client Testimonial

"Strategic planners"

Elyon Technologies spent time caring about the actual deliverable rather than jumping into projects without a strategy. We spent time on planning as well as thinking through alternatives.

Drew Nichols
Founder/CEO @ Palmetto Park Realty
Problem
Dozens of separate freelancers had created a duct-taped system of disparate microservices that frequently failed and were very difficult to monitor, test, or observe.
Solution
Elyon centralized and consolidated scattered, unobservable pieces of software and logic into an observable, testable, and centralized software system, benefiting the business and the software in a number of ways.

Background

Drew had hired dozens of freelancers over the course of five years to build a home-grown CRM for his niche, yet sizable, real estate business on the east coast, Palmetto Park Realty. Over time, Palmetto Park Realty ended up with a system that had been "duct taped" together by many developers with varying levels of skill, experience, and no big picture vision for the software or integrated components.

Problem

As the system grew over time, things would fail in unexpected and very difficult to diagnose ways. For example, the property alerts system would sometimes work, and sometimes not work, and it was hard to figure out why it wasn’t working. Drew’s team could miss out on great closing opportunities due to the system not working, and the buggy functionality was getting costly. He also had data quality and integrity issues, and the old CRM was tied to Angular V1 and an old version of MongoDB, and upgrades were going to be very difficult without upgrading Angular, which would have been a large project on its own.

Drew needed to:

  1. Get his alerts system properly working and also observable, being able to tell if something went wrong right away and to know without a doubt whether it was succeeding or failing.
  2. Get other critical systems in a similar situation working properly and observable.
  3. Fix his data integrity and quality issues and make sure data he had and was presenting to his users was correct.
  4. Still have the existing systems up and running, and make these changes incrementally over time.

Thought Process

Drew’s systems suffered from a “microservices too early/without the actual scaling needs” type of problem. There were several independent systems communicating in a number of different ways: webhooks, one system doing a cron to populate data that another system would be on a cron to read from and check for changes, and it was a nightmare to debug. Each of these systems was also doing a relatively straightforward/simple thing, but being run on different VMs and servers, and having no observability as to if the server was up or not without someone manually SSHing into the servers to see that they were up and things running as expected.

Given all of this, Elyon proposed migrating existing critical systems to a modern Django server and stack that:

  1. Was very tried and true. We weren’t re-inventing the wheel (I.E. adopting Angular 1 and MongoDB right when they came out), and there were and are very large systems powered by Django in production. Django was also a very stable and mature framework, and there are plenty of developers comfortable working with it (and Python in general).
  2. Centralized everything and made any functionality ported over or new software written easy to monitor and debug.
  3. Had support for a relational database and foreign key integrity - The Django ORM is top notch here.
  4. Had great job scheduling and queueing support (Celery was a battle tested tried and true solution here).
  5. Was straightforward to test and write high quality, maintainable code in that could stand the test of time. Django has great test support for any part of it you want to work with, and was an ideal choice here.

We decided to go for it, and start with the alerts system. We ran everything on one server, and coordinated the parts of the Django stack with a simple docker compose setup. We had the web server, Celery beat, and the Celery workers all running on the same server with the same configuration, and set up Sentry to work with all of them, including the logging integration so that it was trivial to send warning and error logs to Sentry any time they showed up.

From there, we actually refactored an existing alerts piece of code that was one of the Python microservices, and moved it into a Celery Beat scheduled Celery task. We also wrote very thorough unit tests to check the work of the existing Python code that had already been written for the alerts, found a number of bugs, and found a big long term issue with unbounded data growth and lag that was likely contributing to the degradation of performance Drew had seen over time. From there, we tied everything together, and got this task firing every 15 minutes as requested.

Right away, Drew got to see high email send numbers on his Mailgun dashboard that he hadn’t seen in a long time, indicating alerts were firing properly and hitting users as expected. Beyond that, we had all of this running in Celery beat with Django Celery Results, so it was very easy to see every time jobs had scheduled to run, and what the result of the job/task was (success/failure), and failures were sent to Sentry automatically, which would deliver a real-time notification that Drew or me could immediately respond to. Sentry surfaced another issue or two shortly thereafter, and we were able to patch the code, and write tests to reproduce the issue

Following the success of alerts, we continued on as follows:

  • Any new features were only built on this new system.
  • Existing critical features were migrated one at a time to the new system.
  • Old functionality was still supported by a combination of raw queries to MongoDB within the new Python code, along with a polling, cursor based MongoDB sync that robustly synchronized data from MongoDB into our new PostgreSQL database and did a number of data quality checks and transformations.

After a number of things had been built from above, we tackled a larger challenge of rebuilding the Agent lead matching system and Slack bot, which had a number of complex pieces of logic, but was critical for Drew’s business.

From there, we continued building new functionality and migrating old functionality until the old system was shut off.

Reflections

Drew had a highly trafficked and very built out website. And, while certain things were broken or intermittently failing, it still worked pretty well for the most part. It was critical to keep it working as it already was while we built this new system.

Sometimes, things like this are done by very careful refactors to an existing system over time, without having to introduce new systems.

However, in our case, introducing a new system was a huge win, because it came with a new environment that brought stability, testability, monitoring, and overall robustness, serving as a foundation for future business ventures. At the same time, when it was needed, we were able to do write records to both the old MongoDB and the new PostgreSQL (so the old Node API would still return proper results in certain cases). With our new system, we could easily write robust tests to assert that we were inserting, updating, and deleting records appropriately so that existing functionality would not be impacted, and had monitoring for anything unexpected that would come up. Every step of the way, we built the new world while still supporting the old one, until the old one was no longer needed.