Cameron Hotchkies

Categories

  • coding

Tags

  • data-structures
  • data-types
  • javascript
  • json
  • kotlin

Javascript and the interchange format it begat, JSON, have subtle complexities that aren’t necessarily obvious. They are both very quick to learn, and powerful, which makes them seem deceptively simple.

Before jumping in, let’s start with the basic understanding that JavaScript has always been a weakly typed language. This is what makes it so forgiving. That isn’t to say it’s untyped, there actually are some basic types:

  • string
  • number
  • list
  • map
  • null

Those are the only real types that can be represented by JSON at the extremes, but the Object Notation part of JSON indicates that more complex classes can be described.

Let’s say we have a system that determines the number of pods deployed in each data center, currently in the us-sfo data center we have 180 pods, and 47 pods in the us-sjc data center. We could write a JSON data structure that looks like this:

{
  "data_centers": {
    "us-sfo": 180,
    "us-sjc": 47
  }
}

This makes sense, right? It’s built for expansion in case we want something other than data centers, and all of the counts are there.

Let’s look at it when we convert it to a strongly typed language, like kotlin:

data class DataCenters(
  @Json(name="us-sfo") sfo: Int,
  @Json(name="us-sjc") sjc: Int
)
data class Response(data_centers: DataCenters)

A few things pop out here. The names are already causing trouble due to language mismatches, but that’s really a signal of a deeper mistake that is happening. This data structure mapping is papering over the fact the underlying data is really structured to be:

data class Response(data_centers: Map<String, Int>)

The underlying data is important, because as new entries are added, the code will push itself in the direction towards the Map. If we were to add one us-xyz data center, it’s trivial to add one more member variable to DataCenters but if we add 5 more, developers are going to stop adding distinct members and start to just leave it as the less descriptive Maps.

Eventually time will pass

With this, it hopefully becomes apparent that while it’s quick to write, it’s also quick to forget what the integer represented. Was it the total capacity of the data center? Was it the data center’s internal serial number? The elevation? Oh right, active pods.

What we have run into here is the fact that a Map (or Dictionary) can be substituted for lists of unique items, but the pain isn’t felt until much later. When storing in a Map, we automatically do a premature optimization of removing the duplicated name from the object since it also appears in the key field.

What we are really trying to describe is a collection of data centers. Data centers already have two attributes we care about, which fundamentally makes them an explicit object themselves.

data class DataCenter(name: String, deployedPodCount: Int)
data class Response(data_centers: List<DataCenter>)

This leads to restructuring the original JSON as

{
  "data_centers": [
    { "name": "us-sfo", "deployed_pod_count": 180},
    { "name": "us-sjc", "deployed_pod_count": 47 }
  ]
}

This also has the benefit of being more descriptive for new users and consumers of the data format.

Why does this matter?

When we default to the use of Map as a collection, we are forcing an access pattern. Random access by key. All code that doesn’t conform to that access pattern becomes substantially harder to read. If we wanted to just find the most populated datacenter, we would need:


if (dataCenters.sfo > dataCenters.sjc) {
  if dataCenters.sfo > dataCenters.xyz {
     ...
  }
} else if (dataCenters.sjc > dataCenters.xyz) { ... }

// or if we had stopped adding attributes and left as a Map

dataCenters.keys.reduce { dcLeft, dcRight ->
  if (dataCenters.getOrDefault(dcLeft, 0) <
        dataCenters.getOrDefault(dcRight, 0)) {
    dcLeft
  } else {
    dcRight
  }
}

compare that to

dataCenters
  .sortedByDescending { it.deployedPodCount }
  .first()

// which the IDE will encourage you to simplify as:
dataCenters.maxByOrNull { it.deployedPodCount }!!

The latter is far easier to read and comprehend, and doesn’t require people to remember context of what each integer actually is. This is a trivial example, so even the hard mode is trivial. Maps of maps of maps of ints become a cognitive nightmare.

The Big Os are lying to you

An astute reader who has just finished a gauntlet of FAANG interviews will probably respond with:

But a Hashtable has a lookup time of O(1) as opposed to this O(n) nonsense you get from sorting that list. You fool! You’ve wasted precious cycles!!

In the case of the non-key lookups, we are going to have to iterate the entire key set THEN do the O(1) lookup which isn’t free. For the best case lookup by data center name, it’s also important to note that algorithmic complexity doesn’t account for the fact that hashing a string is constant time, but for low values of n (which is most cases) the CPU effort to scan an entire list with string comparisons will be negligible.

The amount of coworker time you’ve saved by making the logic readable will be orders of magnitude higher, and you will have made the world a slightly better place.

.last()

It’s JSON (JS Object Notation) not JSMN (JS Map Notation), so if you have a Map, it better be for an object of some kind. If the name of the entry is a plural, you probably meant to have a list. This applies to data exchange, configurations, and even YAML.