Mo’ People Mo’ Problems: Scaling the Engineering Team and Supporting Customers with David Thorman, Our First DevOps Engineer

David Thorman is Kloudless’ first employee and current DevOps Lead. He reads a lot of code and makes sure the servers don’t go down. Prior to Kloudless, he was the Lead Unix Systems Administrator at UC Berkeley’s SA-IT, where he was responsible for the FreeBSD and Red Hat Enterprise Linux systems that ran critical resident-facing network services.

Since you were the first employee at Kloudless, how have you seen the engineering priorities evolve?

Early on, our main concern was “how do we get things out as fast as possible” and “how are we going to build the product from scratch”. Now, there is a lot more focus on ongoing maintainability, stability in terms of the API that we present as well as the internal abstractions that need to be maintained for any new code that comes out. There’s also a lot of focus on database performance, making sure that our application is making efficient queries that don’t blow it up.

When we opened our office in Taiwan, the team went from being just me, Tim, and Vinod to many more engineers. I’m doing as much as possible to leverage the new engineering resources that we have. Before, when engineering resources were the main bottleneck, the focus was on “how can we do everything most efficiently ourselves”. Now, we have to make different decisions. We have to care more about how we manage actual engineers to make sure they are all pushing toward a common goal that impacts revenue instead of everyone fire-fighting with enterprise support or working on their own projects.

How has your role evolved?

I used to code, but I don’t so much anymore. I started off doing DevOps, systems administration, and infrastructure automation and monitoring. Once that was pretty stable and we needed work, I moved to doing development and work on the API platform as it evolved from our email product. Through that work, I became more familiar with Python, Django, software engineering best practices, and whatever else needed doing.

Now, instead of building features, I’m focused on architecture, planning, and a lot of code reviews. I am also the primary person on enterprise support and its underlying infrastructure, for example the tooling around appliance builds, support, and everything else. Now that we are growing, the actual implementation will be moved to someone else, but I’ll still be involved from a planning and oversight perspective. In a typical day, I read a lot of code and logs and look at graphs to make sure our platform is still up and responsive.

What is your proudest accomplishment at Kloudless?

I am most proud of the Enterprise Appliance. I ended up writing almost all of the packaging, configuration, and tooling around it.

How did you go about building Kloudless Enterprise, and what issues did you encounter?

I repackaged our platform using open source tools like Packer by HashiCorp and SaltStack so that everything can run on a single image: AMIs for AWS, virtual machines for VMware and VirtualBox, and Docker containers. Almost all of the components that make up our cloud stack are included: NGINX, PostgreSQL, Redis, and the TICK Stack from InfluxData. Each part (like Redis, the database, etc.) can also be toggled by the user so the appliance can use a hosted service like RDS instead of the embedded one.

The main issue that we encountered was making sure that the appliance is usable and scalable in a large production setting. This means focusing on being able to programmatically provision the appliance using existing tools like Docker Swarm and K8s. Initially, our appliance had an interactive configuration flow and a stateful clustering mechanism with primary/secondary election. This quickly needed to change for the appliance to be usable as a central part of our customers’ infrastructure, so now deployment is much more efficient and it is easy to scale a cluster up or down.

What are some of the other engineering challenges at Kloudless right now?

Observability into our Cloud platform is a challenge right now. Our engineers and I need insight into the platform to know where improvements need to be made. We’re working on making sure that everything can be meaningfully measured and exposed in a way that people can actually use.

For example, setting alerts on metrics and ensuring engineers can access logs to make sure the feature that they released is actually working. They need to be able to get sufficient info to debug things when they break. This is especially important for enterprise customers, where we don’t always have access directly to the machine itself, so things like logs are critical to figuring out where something went wrong and how we can replicate the issue.

What are some memorable challenges you’ve personally encountered?

I feel like there are a lot of war stories. They all kind of revolve around the database going down for various reasons. We’ve been DDoS’d by Google notifications many times. That’s always fun, because someone will connect an account for an organization that has thousands of users, which results in a lot of activity. Google has a few more servers than we do, so we’ve had to implement different levels of throttling both at the NGINX level and the API service level, so we can dynamically throttle a certain percentage of inbound requests. So that’s pretty cool. There was also the time that Box contacted us because one of our customers was the largest user of their Events platform. Our events platform has been one of the central struggles.
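
To make the idea of throttling "a certain percentage of inbound requests" concrete, here is a minimal Python sketch of a probabilistic throttle whose rejection rate can be changed at runtime. It is only an illustration: the class name `PercentageThrottle` and its methods are hypothetical, and it is not the NGINX or API service configuration described in the interview.

```python
import random


class PercentageThrottle(object):
    """Rejects roughly `reject_fraction` of requests (hypothetical helper)."""

    def __init__(self, reject_fraction=0.0, rng=None):
        # An injectable random source keeps the behavior reproducible in tests.
        self.rng = rng or random.Random()
        self.set_fraction(reject_fraction)

    def set_fraction(self, reject_fraction):
        # Can be called at any time to raise or lower the throttle.
        if not 0.0 <= reject_fraction <= 1.0:
            raise ValueError("reject_fraction must be between 0 and 1")
        self.reject_fraction = reject_fraction

    def allow(self):
        # random() returns a float in [0.0, 1.0), so a fraction of 0.0
        # always allows the request and a fraction of 1.0 always rejects it.
        return self.rng.random() >= self.reject_fraction


throttle = PercentageThrottle(reject_fraction=0.3)
if not throttle.allow():
    pass  # a real service would answer with something like HTTP 429 here
```

A caller could raise the fraction with `set_fraction` while a notification burst is in progress and lower it again once the database has recovered.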

Is there a particularly interesting project that you’re currently working on?

The stuff I’m looking at right now is kind of weird: PostgreSQL won’t reclaim dead tuples from one of our high-churn tables and we don’t really know why. For background: PostgreSQL uses MVCC, so whenever you update or delete a row, the old version of the row has to be kept around until it’s no longer visible to any live transactions. What’s weird is that there are no long-running idle transactions, but the tuples are never reclaimed by autovacuum, so it’s a mystery. There’s this other tool called pg_repack which does successfully reclaim the dead tuples, but it doesn’t seem like it should be necessary.
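
The visibility rule described above can be sketched as a toy model in Python. This is a deliberate simplification, not PostgreSQL's actual implementation: transaction IDs are plain integers, and the function name `reclaimable` is hypothetical.

```python
def reclaimable(dead_version_xids, active_transaction_xids):
    """Return the dead row versions that could be cleaned up in this toy model.

    dead_version_xids: ID of the transaction that superseded or deleted
        each old row version.
    active_transaction_xids: IDs of transactions that are still running.
    """
    # An old version may still be visible to any transaction that began
    # before the version was superseded, so the oldest active transaction
    # sets the cleanup horizon.
    if active_transaction_xids:
        horizon = min(active_transaction_xids)
    else:
        horizon = float("inf")
    return [xid for xid in dead_version_xids if xid < horizon]


# One old transaction (ID 108) holds back cleanup of the newest dead version.
print(reclaimable([100, 105, 110], [108]))
# With no active transactions, every dead version can be reclaimed.
print(reclaimable([100, 105, 110], []))
```

In this model a single old transaction is enough to hold back cleanup, which is why the absence of long-running transactions makes the behavior described above surprising.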

What programming languages and development tools do you like to use?

I mostly write code in Python and bash using a pretty minimalist vim configuration. I use a lot of command line tools, so I have an “IDE”. The shell (zsh) is nice and you can do everything you need from the command line for the most part. Pretty much all of our code is written in Python. Python also makes it relatively easy to diagnose and extend libraries. For our web stack we use Django, Gevent, Celery, and uWSGI. We get most of our performance from taking advantage of Gevent coroutines, since most of our processing is focused around network I/O.

Python is a good general purpose language because you can program in a lot of different paradigms. It does object orientation pretty well and functional pretty well. It’s pretty easy to do non-high-performance things with it and there are a lot of good libraries because it’s really popular. The syntax is really straightforward so it’s easy to write code without getting bogged down. It does have issues around memory management (mostly leaks) and it’s also dynamic so there isn’t as much static analysis that can be done. We use Python 2.7, so without type annotations you can run into issues when unexpected data types get passed around. I also like to do “fun” things around meta-programming, like dynamically generating classes and functions, but this can sometimes lead to code being difficult to debug or understand.
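
As a small illustration of what "dynamically generating classes" can look like, the sketch below builds a class at runtime with the built-in `type()`. The helper `make_record_class` and the `File` example are hypothetical and are not taken from the Kloudless codebase.

```python
def make_record_class(name, field_names):
    """Build a simple record class with the given fields at runtime."""
    fields = tuple(field_names)

    def __init__(self, **kwargs):
        unknown = set(kwargs) - set(fields)
        if unknown:
            raise TypeError("unexpected fields: %s" % ", ".join(sorted(unknown)))
        # Fields that were not passed default to None.
        for field in fields:
            setattr(self, field, kwargs.get(field))

    def __repr__(self):
        values = ", ".join("%s=%r" % (f, getattr(self, f)) for f in fields)
        return "%s(%s)" % (name, values)

    # type(name, bases, namespace) creates a new class object.
    namespace = {"__init__": __init__, "__repr__": __repr__, "fields": fields}
    return type(name, (object,), namespace)


File = make_record_class("File", ["id", "name"])
f = File(id=1, name="report.pdf")
```

The debugging cost mentioned above shows up even here: there is no `class File` statement anywhere in the source to search for, so a reader has to trace the factory call to find out where the class comes from.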

I also like the programming language Lua. It’s a really tiny language, so it is easy to embed in other tools like NGINX. A lot of games also use Lua, either for internal scripting or for building the games themselves. I have also started using a tool called Hammerspoon, which automates window management and also uses Lua.

Do you have favorite open source projects that you’ve either contributed to or follow closely?

I love open source. I contributed to some SaltStack formulas that we use in our enterprise appliance. I also wrote a quick script for sending stats from PgBouncer, a connection pooler for PostgreSQL, to statsd. I don’t really get to contribute that much to open source tools. Our company is built almost exclusively out of open source tools, except for internal code and hosted services. We don’t use or license proprietary libraries in building our platform. This has its trade-offs, but it’s better at the end of the day because we can modify things as needed. In the past, we’ve had to fork open source projects in order to add features that we needed. In a couple of cases, those were even included upstream.
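
For readers curious what a script like the PgBouncer-to-statsd one might involve, here is a rough Python sketch. It is not the script mentioned above: the function names, the metric prefix, and the sample row are assumptions for illustration, and fetching the rows from PgBouncer's admin console is left out. It relies only on the plain-text statsd gauge format, `name:value|g`, sent over UDP.

```python
import socket


def pool_rows_to_statsd(rows, prefix="pgbouncer"):
    """Turn per-database counter rows into statsd gauge lines."""
    lines = []
    for row in rows:
        for key, value in sorted(row.items()):
            # Skip the label column and anything that is not numeric.
            if key == "database" or not isinstance(value, (int, float)):
                continue
            lines.append("%s.%s.%s:%s|g" % (prefix, row["database"], key, value))
    return lines


def send_lines(lines, host="127.0.0.1", port=8125):
    """Send each line as one UDP datagram (8125 is the usual statsd port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for line in lines:
            sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


# Illustrative sample data only; real values would come from PgBouncer.
sample = [{"database": "app", "cl_active": 12, "cl_waiting": 0}]
lines = pool_rows_to_statsd(sample)
```

Keeping the formatting separate from the network call makes the conversion easy to check without a running statsd daemon.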

I also contributed to Wee-Slack, which is a Python plugin for WeeChat. The graphical client for Slack is really memory hungry, so using a command line tool instead is nice. Unfortunately Slack doesn’t make it easy to maintain full support of features. I also follow PostgreSQL, NGINX, and Celery development to an extent, because they’re always building out cool new features.

Many engineers at Kloudless are into mechanical keyboards. Tell me more about them.

Mechanical keyboards are cool! I like typing on mechanical keyboards because they feel much nicer than the chiclet key keyboards that come on all the Mac products. The switch actuation feels really nice and you don’t have to bottom out the key in order to trigger it, which helps prevent RSI. You can also switch out the key caps to some artsy looking ones that are pretty cool.

The keyboard I have at home uses Cherry MX Browns. Cherry is an old keyboard company. They’re one of the most popular brands of switches and they make many different types. They’re categorized by color. Blue switches are clicky and Brown switches have a tactile bump but are not clicky. I use Browns because they’re not too loud and still feel really nice. Mechanical keyboards cost a lot though, which is kind of a drag.

A lot of people in the Taipei office have the ErgoDox EZ, which is made by a Taiwanese company. It’s much more ergonomic and extremely customizable. I haven’t quite been able to justify the steep price tag for myself though.

Read it on Kloudless Official Blog
