
zz note: blurring row typed records and JSON

David Jeske edited this page May 31, 2020 · 28 revisions

One common reason dynamic languages are preferred is that they handle heterogeneous external data without fighting a static type-checker, as discussed somewhat in zz-node: heterogeneous lists of maps and zz note: row optional polymorphic records.

Idea

This note explores the idea of blurring the lines between static row-typed records, and dynamic external data like JSON, by:

  • making runtime data forms built from 'dynamic' hashes like JSON (or Clojure immutable HAMTs), instead of static flattened C-style structs (see data-layout below)
  • performing row-typed inference on record usage, for compile-time static analysis
    • static inference - if production and usage of a type is completely internal, then infer and produce static errors (this happens today)
    • static checking - IF given a data-type declaration, perform compile time static analysis that code and shape match (this happens today)
  • ...but add a third model...
    • dynamic inference - for a type declared 'dynamic', infer the shape of the type based on code producing or accessing that data... or even "learn" from traces of real world data, then:
      • generate this type as output data to (a) display in IDEs, (b) publish externally as schema specs, (c) feed back into static checking as warnings or errors as we choose, depending on the level of calcification we desire
  • introduce more flexibility to handle more of the real world messiness of data heterogeneity
    • value-dependent typing
    • the ability to calcify parts of a data-schema while allowing other parts to be flexible
    • the ability to specify bounds on what types of changes should be "automatically adopted" when inferred from code, and which others should be disallowed
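The "learn from traces of real world data" idea above can be sketched in a few lines. This is only an illustration under stated assumptions: `infer_record_shape`, the trace contents, and the example email address are hypothetical, and a real system would also need to handle nesting, versions, and ill-typed values.

```python
import json

def infer_record_shape(records):
    """Infer field names, observed value types, and optionality from sample dicts."""
    total = len(records)
    seen = {}  # field name -> {"types": set of type names, "count": int}
    for record in records:
        for key, value in record.items():
            info = seen.setdefault(key, {"types": set(), "count": 0})
            info["types"].add(type(value).__name__)
            info["count"] += 1
    # A field is optional if at least one sampled record lacks it.
    return {
        key: {"types": sorted(info["types"]), "optional": info["count"] < total}
        for key, info in seen.items()
    }

# Hypothetical trace of student records (the email address is made up).
trace = json.loads("""[
    {"name": "Stephen Cousins", "email": "stephen@example.com"},
    {"name": "Austin A. Newton"},
    {"name": "Adam Wilhite", "cellphone": "555-555-1212"}
]""")

shape = infer_record_shape(trace)
```

The resulting `shape` could then be displayed in an IDE, published as a schema, or fed back into static checking, as the list above suggests.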

data-layout:

  • a simple starting point could be making all data-forms dynamic hashes and lists, and only layering static analysis on top of this. This obviously has performance consequences, but seeing how popular dynamic languages are, this hardly seems like a problem.
    • the flexible form could be based on functional immutable HAMT like Clojure
  • a higher-performance version could generate code to talk to static-access-interfaces, rather than dynamic hashes.... and generate interface proxies to allow a dynamic hash to be handed to the same code.
    • an important bit here is that reflection and handling the standard data-model must be aware of the real-world messiness of 'extra-data' or 'ill-typed data' otherwise that static assumption kills us. For example loading a bunch of JSON, changing it, and saving it out, must preserve "extra-data" by default, or we've ruined the system and forced people back to dynamic languages again.
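As a rough sketch of the "interface proxy over a dynamic hash" idea, the class below exposes a static-looking accessor while keeping the whole parsed hash underneath, so extra-data survives a load, change, and save. The names `UniversityView` and `mascot` are hypothetical and only serve the illustration.

```python
import json

class UniversityView:
    """Static-looking accessors layered over a dynamic hash (hypothetical proxy)."""

    def __init__(self, data):
        self._data = data  # the full parsed dict, including fields we do not know

    @property
    def university(self):
        return self._data["university"]

    @university.setter
    def university(self, value):
        self._data["university"] = value

    def to_json(self):
        # Serialize the underlying hash, so unknown fields survive the round trip.
        return json.dumps(self._data)

# "mascot" is an extra field this program was never written to expect.
incoming = '{"university": "Old Name", "mascot": "Bulldog", "students": []}'
view = UniversityView(json.loads(incoming))
view.university = "New Name"
outgoing = json.loads(view.to_json())
```

Code written against the view only sees `university`, yet `outgoing` still carries `mascot` untouched.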

Background

Some forms of this heterogeneity include:

  • optional data elements in heterogeneous maps - present only when relevant, such as in JSON
  • optional nesting - DOM models, trees, etc.
  • multi-versions - where different records in a list might have different forms based on a version specifier
  • extra-data - where flexible data-formats carry data beyond what a specific program expected at the time it was written
  • format mistakes - where different elements in parsed external data might simply be "wrong" due to human editing mistakes, software mistakes, bugs, etc.

Here we will use JSON as our primary example, because real world JSON includes every one of the above forms. However, this is equally true of other formats, including XML, HTML, database-query results, and all sorts of self-describing data.

Here is an example JSON fragment:

JSON  
{ "universities": [
    {   "university": "South Carolina State University",
        "students"  : [
             { "name": "Stephen Cousins", "email" : "[email protected]" },
             { "name": "Austin A. Newton"},
             { "name": "Adam Wilhite", "cellphone" : "555-555-1212" },
             { "name": "Enis Kurtay YILMAZ" }
        ]
    }
  ]
}

Two existing, and highly unsatisfactory ways of dealing with JSON in static languages are:

  • (1) accessing a static typed "generic" JSON interface
  • (2) modeling a rigidly typed container for parsed data

(1) Using a static-typed "generic" JSON interface

...is syntactically cumbersome, exploding code-size by 2-4x and ruining readability, and doesn't provide any actual static analysis safety over expectations of the data-shape. In fact, this is often cited as a reason dynamically typed languages are superior for handling certain data-manipulation tasks. For example, C# includes facilities for parsing JSON into an XML data-model, where accessing elements in code looks like this:

// root is an XmlNode for one university entry (requires: using System.Xml;)
var university = root.SelectSingleNode("university").InnerText;
foreach (XmlNode student_node in root.SelectNodes("students")) {
  var name = student_node.SelectSingleNode("name").InnerText;
  // ... additional code to handle optionality of email/phone
}

This is worse than using JSON from a dynamic language, as the syntax is incredibly wordy and cumbersome without actually providing static analysis benefits. In fact, it eliminates any chance of static analysis of the actual data-shape. Modern C# has admitted this truth, and added a "dynamic" keyword to remove the syntactic pain. Here is an example:

dynamic root = JObject.Parse(incoming_data);
var university = root.university;
foreach (var student in root.students) {
   var name = student.name;
}

On the plus side, the code is readable and simple, much like it would be in fully dynamic JavaScript or Python. And we still get the benefit of static analysis for other parts of our program. However, on the minus side, we are not getting any help with static analysis of whether the code meets our expectations of the shape of this data.

(2) Modeling a rigidly typed container for parsed data

...is often cited as one of the benefits of using static languages. However, in practice this has many problems. See an example of modeling JSON in C#.

The full data-modeling is large and cumbersome, and fails to handle optionality or extra-data. Here is the part that merely models a University entry.

public class University {
   public string university { get; set; }
   public IList <Student> students { get; set; }
}

This example is more repetitive boilerplate ("public", "get; set;") than modeling, which ultimately makes it better suited to being code-generated than hand-written. Perhaps C# 9 Source Generators will provide a practical way to do this, but they will not solve several of the other problems with this method.

Here is a summary of some limitations of static rigid type modeling:

  • (a) the type declarations are often larger and more cumbersome than writing JSON itself
  • (b) they often can't hold "extra-data" outside the static specification, so there is no way to parse through extra-data (though this can be fixed with even more code)
  • (c) typical static-type systems are incapable of modeling the above real-world heterogeneity, which over time forces both the static analysis and readability to suffer
  • (d) We would often like to selectively ignore static analysis (or selectively calcify our analysis) but this method doesn't allow us to do this. Neither do the dynamic keyword methods.
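Limitation (b) and its "even more code" fix can be illustrated with a small sketch. The classes and helper functions here are hypothetical; the point is only that a rigid container silently drops fields outside its declaration, while preserving them takes extra hand-written plumbing.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class RigidStudent:
    name: str

@dataclass
class FlexibleStudent:
    name: str
    extra: dict = field(default_factory=dict)  # holds fields outside the declaration

def parse_rigid(raw):
    # Only the declared field is copied; everything else is silently lost.
    return RigidStudent(name=raw["name"])

def parse_flexible(raw):
    known = {"name"}
    extra = {k: v for k, v in raw.items() if k not in known}
    return FlexibleStudent(name=raw["name"], extra=extra)

def dump_flexible(student):
    # Merge the extra fields back in when writing the record out.
    return {"name": student.name, **student.extra}

raw = {"name": "Adam Wilhite", "cellphone": "555-555-1212"}
rigid = parse_rigid(raw)
flexible = parse_flexible(raw)
```

Here `rigid` has lost the cellphone entirely, while `dump_flexible(flexible)` reproduces the original record, at the cost of plumbing that has to be repeated for every modeled type.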

Inference, Row-Types, and Value-Dependent-Types to the rescue!

What we really want to do, is to write code that looks normal, and have the compiler give us as much static analysis support as it can, and no more than we choose. For example, if we write the following function:

def handleData(root : [inferred] UniversityJson):
  for university in root.universities:
    print "University Name: " + university.university
    for student in university.students:
      print "   Student Name: " + student.name
      if exists(student.email):
        print "      Student Email: " + student.email
      if exists(student.cellphone):
        print "      Student Cell: " + student.cellphone

We'd like the compiler to tell us that the shape/type of the inferred UniversityJson is:

UniversityJson : { universities : list[
       {  university: type(stringable),
          students  : list[
             { name: type(stringable),
               [optional] email : type(stringable),
               [optional] cellphone : type(stringable) } ]
       } ]
}

If we supplied the above type-spec as a constraint for UniversityJson, and removed one of the [optional] keywords, static analysis could warn us if the code is changed in a way that's incompatible with this spec. We could also easily generate a JSON schema for this data, suitable for publishing to others.
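Generating a JSON schema from such a shape is mostly a mechanical walk over it. The sketch below assumes a small, hypothetical dict encoding of the shape (`kind`, `fields`, `items`, `optional`) and maps `type(stringable)` to a plain JSON Schema string, which is a simplification.

```python
def to_json_schema(shape):
    """Convert a small, hypothetical shape description into a JSON Schema dict."""
    kind = shape["kind"]
    if kind == "string":
        return {"type": "string"}
    if kind == "list":
        return {"type": "array", "items": to_json_schema(shape["items"])}
    if kind == "record":
        fields = shape["fields"]
        schema = {
            "type": "object",
            "properties": {n: to_json_schema(s["shape"]) for n, s in fields.items()},
        }
        # Fields not marked optional become "required" in the schema.
        required = [n for n, s in fields.items() if not s.get("optional", False)]
        if required:
            schema["required"] = required
        return schema
    raise ValueError("unknown shape kind: " + kind)

STRING = {"kind": "string"}
student_shape = {"kind": "record", "fields": {
    "name": {"shape": STRING},
    "email": {"shape": STRING, "optional": True},
    "cellphone": {"shape": STRING, "optional": True},
}}
university_shape = {"kind": "record", "fields": {
    "university": {"shape": STRING},
    "students": {"shape": {"kind": "list", "items": student_shape}},
}}
root_shape = {"kind": "record", "fields": {
    "universities": {"shape": {"kind": "list", "items": university_shape}},
}}

root_schema = to_json_schema(root_shape)
```

Dropping `"optional": True` from a field in the shape moves it into the generated `required` list, which is the same calcification step described above.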

JSON Schema conditionals are a good look into the issue of value-dependent types for JSON.
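JSON Schema expresses this kind of rule with its if/then/else keywords. The sketch below hand-codes the same idea for the multi-version case: which fields are required depends on the value of a `version` field. The version numbers, the field sets, and the `check_record` helper are all hypothetical.

```python
# Hypothetical rule: the required fields depend on the value of "version".
REQUIRED_BY_VERSION = {
    1: {"name"},
    2: {"name", "email"},
}

def check_record(record):
    """Return a list of problems; the expected shape depends on record['version']."""
    version = record.get("version")
    if version not in REQUIRED_BY_VERSION:
        return ["unknown version: " + repr(version)]
    missing = REQUIRED_BY_VERSION[version] - set(record)
    return ["missing field: " + name for name in sorted(missing)]

ok_v1 = check_record({"version": 1, "name": "Austin A. Newton"})
bad_v2 = check_record({"version": 2, "name": "Austin A. Newton"})
```

A static checker with value-dependent types would aim to report the `bad_v2` problem at compile time for any code path that builds or reads a version 2 record.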

Examples of ways to use this

A programming system built on this concept might provide static analysis errors for many things that currently surface as runtime errors, in places where programmers often favor dynamically typed languages to avoid the cost of static type modeling. And in cases where it couldn't, it would be no worse than dynamic languages, and probably better than static/dynamic hybrids with only local inference (like TypeScript).

  • config files - type documentation, change notes, better "shape" warnings/errors, static code analysis
  • templating - static analysis of template variables
  • UI DOM and code - inference of dependencies between declared DOM shapes and code

Arguments against this idea

  • It might be impossible or impractical to infer or statically check the kinds of messy value-dependent-type that exists.
  • It might be that forcing programmers to ever think about correctness is part of what hurts productivity, and that people prefer dynamic languages because they let them make more progress by always focusing on the specific examples they are working on instead of on abstractions
  • It might be that .. ?