\[ \def\ea{\widehat{\alpha}} \def\eb{\widehat{\beta}} \def\eg{\widehat{\gamma}} \def\sep{ \quad\quad} \newcommand{\mark}[1]{\blacktriangleright_{#1}} \newcommand{\expr}[3]{#1\ \ \vdash\ #2\ \dashv\ \ #3} \newcommand{\packto}[2]{#1\ \approx >\ #2} \newcommand{\apply}[3]{#1 \bullet #2\ \Rightarrow {\kern -1em} \Rightarrow\ #3} \newcommand{\subtype}[2]{#1\ :\leqq\ #2} \newcommand{\braced}[1]{\lbrace #1 \rbrace} \]

1. What is morloc?

morloc is a functional programming language where functions are imported from foreign languages and unified through a common type system. The compiler generates the code needed to compose functions across languages and to automate mundane tasks such as data validation, type/format conversions, data caching, distributed computing, and file reading/writing.

The ultimate purpose of the morloc system is to serve as the foundation for a universal library of functions. Each function in the library has a universal type that specifies how the function relates to other functions. Each function has zero or more implementations. Each implementation has a language-specific type that aligns with the universal type. The morloc compiler takes the specification for a program, written as a composition of typed functions, and generates an optimized program.

morloc is a query language that takes a program specification and searches against a function database, returning an optimized composition of functions. The functions may all be selected from one language or from several.

2. Getting Started


2.1. Hello world!

export hello
hello = "Hello World"

The "export" keyword exports the variable "hello" from the module.

Paste this into a file (e.g. "hello.loc") and then it can be imported by other morloc modules or directly compiled into a program where every exported term is a subcommand.

morloc make hello.loc

This will generate a single file named "nexus.pl". The nexus is the executable script that the user will interact with. For this simple example, it is the only generated file. It is currently written in Perl.

Calling "nexus.pl" with no arguemtns or with the -h flag, will print a help message:

$ ./nexus.pl -h
The following commands are exported:
  hello [0]

The [0] states the number of arguments the "command" hello takes.

The command is called like so:

$ ./nexus.pl hello
Hello World

2.2. Single language function composition

The following code uses only C++ functions (fold, map, add and mul).

import cppbase (fold, map, add, mul)

export square
export sumOfSquares

square x = mul x x

sumOfSquares xs = fold add 0 (map square xs)
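
For readers less familiar with the fold/map idiom, the same composition can be written directly in Haskell (assuming morloc's fold is a left fold, which matches this usage):

square :: Int -> Int
square x = x * x

sumOfSquares :: [Int] -> Int
sumOfSquares xs = foldl (+) 0 (map square xs)

-- sumOfSquares [1,2,3] == 14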

If the morloc script above is pasted into the file "example-1.loc", it can be compiled as follows:

morloc install cppbase
morloc make example-1.loc

The install command clones the cppbase repo from GitHub into the local directory ~/.morloc/lib. The make command will generate a file named nexus.pl, which is an executable interface to the exported functions.

You can see the exported functions and the number of arguments they take:

$ ./nexus.pl
The following commands are exported:
  square [1]
  sumOfSquares [1]

Then you can call the exported functions:

$ ./nexus.pl sumOfSquares [1,2,3]
14

The nexus.pl executable dispatches the command to the compiled C++ program, pool-cpp.out.

2.3. Composition between languages

morloc can compose functions across languages. For example:

import math (fibonacci)
import rbase (plotVectorPDF, ints2reals)

export fibplot

fibplot n = plotVectorPDF (ints2reals (fibonacci n)) "fibonacci-plot.pdf"

The fibplot function calculates Fibonacci numbers using a C++ function and plots them using an R function.

plotVectorPDF is defined in the morloc module rbase. The module contains the following files:

README.md
package.yaml
main.loc
core.R
rbase.R

The main.loc file contains morloc function signatures, compositions, and export statements. The core.R and rbase.R files contain R source code. rbase.R contains the general serialization functions required for R interoperability with other languages. The core.R file contains mostly core utilities that are common between languages (map, zip, fold, add, sub, etc). ints2reals is an alias for the base R function as.numeric. plotVectorPDF is a wrapper around the generic R plot function, as shown below:

plotVectorPDF <- function(x, filename){
  pdf(filename)
  plot(x)
  dev.off()
}

This is a perfectly normal R function with no extra boilerplate. It takes an arbitrary input x and a filename, passes x to the generic plot function, and writes the result to a PDF with the name filename.

The main.loc file contains the type signatures for this function:

plotVectorPDF :: [Num] -> Str -> ()
plotVectorPDF R :: [numeric] -> character -> ()

The first signature is the general type, the second is the concrete, R-specific type.

Similarly the fibonacci C++ function has the types:

fibonacci :: Int -> List Int
fibonacci Cpp :: "size_t" -> "std::vector<$1>" size_t

The general type, Int → List Int, describes a function that takes an integer and returns a list of integers. List Int could be written equivalently as [Int]. The concrete type signature aligns to the general one in that it must be a function of the same number of arguments, and any parameterized types must have the same number of parameters (and so on down recursively into the type). "std::vector<$1>" size_t aligns to List Int (or equivalently, [Int]). The type that is used in C++ prototypes and type annotations is generated from the concrete type via macro expansion into the type constructor: size_t replaces $1, yielding the concrete type std::vector<size_t>.
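
As a rough sketch of this macro expansion, consider the following Haskell. The helper expandType is hypothetical, not the compiler's actual function, and it handles only the single-digit placeholders $1 through $9:

import Data.Char (digitToInt)

-- Hypothetical sketch: substitute $1..$9 in a concrete type template
-- with the rendered parameter types.
expandType :: String -> [String] -> String
expandType template params = go template
  where
    go ('$':c:cs) | c `elem` ['1'..'9'] = params !! (digitToInt c - 1) ++ go cs
    go (c:cs) = c : go cs
    go [] = []

-- expandType "std::vector<$1>" ["size_t"] == "std::vector<size_t>"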

The fibonacci function itself is a normal C++ function with the prototype:

std::vector<size_t> fibonacci(size_t n)

3. Syntax and Features

3.1. Primitives and containers

data : True | False
     | number
     | string
     | [data, ...]       -- lists
     | (data, ...)       -- tuples
     | {x = data, ...}   -- records

As far as morloc is concerned, number is of arbitrary length and precision. Strings are double quoted and support escapes. In the future, I will add support for string interpolation.

3.2. Semicolons and whitespace

Currently, morloc requires semicolons at the ends of expressions. Semicolons will always be supported in morloc, but I consider them ugly and an unnecessary syntax error source, so I will eventually remove them (as soon as I can figure out how to do so in the parser).

3.3. Functions

Function definition follows Haskell syntax.

foo x = g (f x)

I currently have no support for infix operators, but will in the future. Also, I do not yet support eta reduction.

3.4. Type signatures

General type declarations roughly follow Haskell syntax:

take :: Int -> List a -> List a

Where a is a generic type variable. I support [a] as sugar for List a.

Concrete type declarations describe the type of the function in a specific language:

take Cpp :: "int" -> "std::vector<$1>" a -> "std::vector<$1>" a

This signature describes a C++ function whose arguments are an int and a std::vector storing generic values. The $1 will be replaced by the C++ type inferred for a when the target C++ code is generated. For example, "std::vector<$1>" int would translate to std::vector<int>.

For languages that do not have parameterized types, such as Python, the type signature for take might be:

take Py :: "int" -> "list" a -> "list" a

The "list" a types would translate to simply list in Python.

3.5. Sourcing functions

Sourcing a function from a foreign language is done as follows:

source Cpp from "foo.h" ("mlc_foo" as foo)

foo :: A -> B
foo Cpp :: "int" -> "std::string"

Here we state that we are importing the function mlc_foo from the C++ source file foo.h and calling it foo. We then give its general and concrete types.

Currently morloc treats language-specific functions as black boxes. The compiler does not parse the C++ code to ensure the type the programmer wrote is correct (this may be possible in the future).

3.6. Type defaults

morloc stores default primitive and basic container types for each supported language. In most languages, there is a sensible default type for reals, integers, bools, lists, tuples, and maybe records. Many languages will not distinguish between lists and tuples. Some languages have dynamic records (Python and R) and others require the generation of type definitions (C and Haskell). These defaults simplify type definitions, allowing us to write [a] in place of the C++ type "std::vector<$1>" a.

3.7. Type aliases

morloc supports type aliases as syntactic sugar that disappears before the typechecking step.

type (Pairlist a b) = [(a,b)]
type Cpp (Pairlist a b) = [(a,b)]

3.8. Records, objects, and tables

Records, objects, and tables are all defined with the same syntax (for now) but have subtly different meanings.

A record is a named, heterogeneous list such as a struct in C, a dict in Python, or a list in R. The type of the record exactly describes the data stored in the record (in contrast to parameterized types like [a] or Map a b).

A table is like a record where every type is replaced by a list of that type. But a table is not just syntactic sugar for a record of lists; the table annotation is passed with the record through the compiler all the way to the translator, where the language-specific serialization functions may have special handling for tables.

An object is a record with baggage. It is a special case of an OOP class where all arguments passed to the constructor can be accessed as public fields.

All three are defined in similar ways.

record (PersonRec a) = PersonRec {name :: Str, age :: Int}
record Cpp (PersonRec a) = "MyObj" {name :: Str, age :: Int}

table (PersonTbl a) = PersonTbl {name :: Str, age :: Int}
table R (PersonTbl a) = "data.frame" {name :: Str, age :: Int}
table Cpp (PersonTbl a) = "struct" {name :: Str, age :: Int}

record (PersonRec a) = PersonRec {name :: Str, age :: Int}
object Cpp (PersonRec a) = "MyObj" {name :: Str, age :: Int}

Notice that object is undefined for general types, since they don’t check luggage. Also note the difference between the type constructor (e.g. PersonRec) and the data constructor (e.g., "MyObj"). The latter corresponds to the class constructor in the OOP language.

3.9. Modules

A module includes all the code defined under the module <module_name> statement. It can be imported with the import command.

The following module defines the constant x and exports it.

module Foo
x = 42
export x

Another module can import Foo:

import Foo (x)

...

A term may be imported from multiple modules. For example:

module Add
import cppbase (add)
import pybase (add)
import rbase (add)

export add

This module imports the C++, Python, and R add functions and exports all of them. Modules that import add will import three different versions of the function. The compiler will choose which to use.

3.10. Core libraries

Each new language will have a base library that roughly corresponds to the Haskell prelude. They will have functions for mapping over lists, working with strings, etc. They will also contain standard type aliases for each language. For example, type Cpp Str = "std::string".

4. Code Generation and Optimization

The input to the code generator is a list of bipartite, ambiguous, abstract syntax trees (ASTs). There is one tree for each command exported from the root module. The AST is implemented as a pair of mutually recursive data structures: SAnno and SExpr.

SAnno associates a set of subtrees with a shared annotation (general type and metadata).

data SAnno g f c = SAnno (f (SExpr g f c, c)) g

Where:

  • g - an annotation for the group of child trees (what they have in common)

  • f - a collection type (e.g., One or Many)

  • c - an annotation for the specific child tree

SExpr represents an expression that may have SAnno children.

data SExpr g f c
  = UniS
  | VarS EVar
  | AccS (SAnno g f c) EVar
  | ListS [SAnno g f c]
  | TupleS [SAnno g f c]
  | LamS [EVar] (SAnno g f c)
  | AppS (SAnno g f c) [SAnno g f c]
  | NumS Scientific
  | LogS Bool
  | StrS Text
  | RecS [(EVar, SAnno g f c)]
  | CallS Source
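
As a hand-built illustration (a sketch, not compiler output), the body of a composition like foo x = g (f x) could be encoded as follows, using the One wrapper introduced under Realize below and the EV constructor shown in the Expr example of the Specification section; the unit annotations stand in for real type metadata:

{-# LANGUAGE OverloadedStrings #-}

-- Sketch: `\x -> g (f x)` as an SExpr with f = One and () annotations.
-- Assumes the SAnno/SExpr definitions above and an EVar constructor
-- EV scope name, as in the Expr example later in this manual.
ex :: SExpr () One ()
ex = LamS [EV [] "x"] body
  where
    body  = SAnno (One (AppS g [inner], ())) ()
    inner = SAnno (One (AppS f [x], ())) ()
    g = var "g"
    f = var "f"
    x = var "x"
    var n = SAnno (One (VarS (EV [] n), ())) ()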

4.1. Rewrite

The SAnno trees are not quite ready for translation yet. They contain morloc functions on the right-hand side of equations. Yet these functions are not "real" and should not appear (except as debugging info) in the final generated code.

For example, consider the following morloc script:

import cppbase (mul, add)

export foo

foo x y = bar x (baz y)
bar x y = mul x (add y 1)
baz x = mul x x

The Rewrite step redefines the exported foo function by eliminating the bar and baz abstractions, as below:

foo x y = mul x (add (mul y y) 1)

The process is complicated when there are unresolved topology differences. For example, if bar has two definitions, as below:

import cppbase (mul, add, bar)

export foo

foo x y = bar x (baz y)
bar x y = mul x (add y 1)
baz x = mul x x

Now bar can either be the morloc abstraction or the C++ function. So we can generate two possible forms of foo:

foo x y = mul x (add (mul y y) 1)
foo x y = bar x (mul y y)

These two ASTs will be included under the Many in the SAnno output.

4.2. Realize

The f in the SAnno type takes on two forms in the generator: One or Many:

data One a = One a
data Many a = Many [a]

The input to the generator is ambiguous, hence f is Many. The "Realize" step collapses the tree down to a single "realization" (or "instance"). Thus the realize function, eliding implementation details, has the type:

realize :: SAnno g Many [Type] -> SAnno g One Type

Two levels of ambiguity are removed in this step. The Many to One transition selects a single sub-tree topology. For example, suppose we have the following:

import math (sum, div, length, mean)
source R ("mean")

export mean
mean xs = div (sum xs) (length xs)

Here we have three definitions of mean. One is sourced from the R language (where it is a built in function). One is sourced from the morloc math module, where it is implemented in C. One is defined as a morloc composition of the three functions div, sum, and length; these are all implemented in C currently, but they could gain more implementations in the future.

There are three definitions of mean, and two topologies (thus two elements in Many). The topologies are either the (div (sum xs) (length xs)) tree or the call to the R or C functions. The first problem is the Many→One selection. The second problem is the [Type]→Type problem, where the sourced implementation is chosen. Here we decide just between a single R and single C function. But the choice could be more involved, such as choosing between a dozen sort algorithms all written in C.

This data structure is the starting point for an epic optimization problem.

4.3. Optimization

Algorithm optimization

In the future, when morloc is mature, the realization step will incorporate community knowledge, performance modeling, and benchmarking to make the optimal decision. For now, I assign a relative cost to each pair of inter-language calls and find the tree that minimizes the total cost.

The most interesting optimizations involve choices between algorithms. We could build formal performance models for each algorithm and parameterize them empirically for each implementation, weighing factors such as:

  • performance

  • accuracy

  • parameterization

Build optimization

The goal of build optimization is to 1) ensure the program compiles, 2) minimize the dependencies and 3) tailor the build to the local architecture. In theory, a morloc program can avoid bit-rot and adapt to any architecture so long as there exists at least one valid tree instance.

I haven’t worked on build optimization yet, but I imagine the input to the mature morloc build machine will be a description of the local architecture and a list of possible ASTs, ordered by expected performance. The machine could then try to build the "best" tree. If the build fails, the machine then finds the next highest scoring tree that does not contain the failing component.

Making this process efficient through judicious use of deep knowledge gathered from the community will be a major focus in the future. The knowledge gained in one build (e.g., function X failed on OS Y in state Z) could be uploaded automatically to the community knowledgebase and accessed in future builds.

Interoperability optimization

The minimization of inter-operability costs is the easiest optimization and the only one that is currently supported. Program performance can be optimized by reducing the number and cost of foreign calls.

Penalties for calls between languages
          C    Python    R
C         1    100       1000
Python    10   2         1000
R         10   100       3

The values in the table above are obviously very rough, but they demonstrate important principles for optimizing morloc programs. Calls within languages are cheap and between languages are expensive. Major performance improvements could be obtained by removing the start-up costs of loading the R/Python runtime (e.g., passing data to an open R server rather than restarting R for every call).
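
The following Haskell toy (illustrative only, not the compiler's code) shows what minimizing total penalty over a call tree looks like, assuming the row of the table above gives the calling language and each node lists its candidate implementation languages:

data Lang = C | Py | R deriving (Eq, Show)

-- Penalties from the table above (row = caller, column = callee).
callCost :: Lang -> Lang -> Int
callCost C  C  = 1;   callCost C  Py = 100; callCost C  R = 1000
callCost Py C  = 10;  callCost Py Py = 2;   callCost Py R = 1000
callCost R  C  = 10;  callCost R  Py = 100; callCost R  R = 3

-- A call-tree node: candidate languages, then child calls.
data Call = Call [Lang] [Call]

-- Minimal total penalty of a tree, given the caller's language.
minCost :: Lang -> Call -> Int
minCost caller (Call langs kids) =
  minimum [callCost caller l + sum (map (minCost l) kids) | l <- langs]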

4.4. Serialization

The current interoperability paradigm morloc follows under the hood is based entirely on serialization. To be clear, serialization is not a fundamental requirement of morloc. JSON serialization could be replaced with machine-level interoperability for a pair of languages. This change would be invisible to morloc programmers, since serialization interop is handled by the compiler.

Data types that have an unambiguous mapping to the JSON data model can be automatically serialized without special handling added by the programmer. The JSON data model follows this grammar:

json : number
     | bool
     | string
     | null
     | [json]
     | {label : json, ...}

Types that are compositions of primitives and containers can be automatically serialized. This includes records and the subset of objects for which arguments passed to the constructor are assigned to accessible fields. For other types, an (un)packing function that simplifies the data is required. For example, take the general type Map a b, which maps keys of type a to values of type b. In a given language, the Map type may be implemented as a hash table, a tree, pair lists, or even a connection to a database. The types a and b do not give enough information to serialize the object. Therefore, the user must provide an unpack function, which could be Map a b → ([a],[b]) or perhaps Map a b → [(a,b)], while the pack function works in the opposite direction. See below for the morloc code required to add support for Map to C++.

source Cpp "map-packing.hpp" ("unpack_map", "pack_map")

unpack_map :: Map a b -> ([a],[b])
unpack_map Cpp :: "std::map<$1,$2>" a b -> ([a],[b])

pack_map :: ([a],[b]) -> Map a b
pack_map Cpp :: ([a],[b]) -> "std::map<$1,$2>" a b

Note that the unpack function Map a b → ([a],[b]) does not necessarily take us all the way to a serializable form, since a and b may be arbitrarily complex objects. This is no problem; morloc will recursively handle (de)serialization all the way down.
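
For intuition, Haskell analogues of this pack/unpack pair might look as follows (a sketch; in practice the functions live in the language-specific libraries and the compiler handles the recursion):

import qualified Data.Map as Map

-- Reduce a Map to a serializable pair of lists, and rebuild it.
unpackMap :: Map.Map a b -> ([a], [b])
unpackMap m = (Map.keys m, Map.elems m)

packMap :: Ord a => ([a], [b]) -> Map.Map a b
packMap (ks, vs) = Map.fromList (zip ks vs)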

5. Specification

This section is a work in progress

The type system presented within this section is based on the work of Dunfield and Krishnaswami [1]. I preserved their naming conventions where possible. Only the basic type checking and inference (System F) is described here; the broader knowledge system introduced in the prior section has not yet been implemented.

5.1. Grammar

\[ \begin{align} program &\ :\ [statement] \sep & \text{brackets indicate a list of 0 or more} \\ statement &\ :\ \text{Source}\ x\ L \sep & \text{source var x from language L} \\ {} &\ |\ \text{Import}\ v\ L \sep & \text{import all terms in module v} \\ {} &\ |\ \text{Export}\ x \sep & \text{export var x from this module} \\ {} &\ |\ x\ L ::\ A_L \sep & \text{signature for language L} \\ {} &\ |\ x=expr\ \sep & \text{declaration} \\ expr &\ :\ (\ ) \sep & \text{nothing} \\ {} &\ |\ x \sep & \text{a variable name} \\ {} &\ |\ x\ ::\ [A_L] \sep & \text{annotation, only required for top-level expressions} \\ {} &\ |\ \lambda x .\ expr \sep & \text{abstraction} \\ {} &\ |\ expr\ expr \sep & \text{application} \\ {} &\ |\ num\ |\ str\ |\ bool \sep & \text{primitives} \\ {} &\ |\ [e_1, ..., e_n] \sep & \text{list} \\ {} &\ |\ (e_1, ..., e_n) \sep & \text{tuple} \\ {} &\ |\ \braced{v_1=e_1,\ ...,\ v_n = e_n} \sep & \text{record} \end{align} \]

This system departs from conventional lambda calculus by supporting multiple languages in one program. It is designed to typecheck programs composed of many languages that share a common general type but that also have language-specific, concrete types. Thus an expression may have the abstract type Num, the C type double, and the Python type float. Top-level signatures can be used to specify the type of a term in different languages. A term may be sourced from many different languages, thus \(x\) is multilingual.

\[ \begin{align} \text{Types} \sep A,B,C\ &: \alpha_L \sep & \text{var in language L} \\ &|\ A \rightarrow B \sep & \text{function} \\ &|\ \forall\ \alpha_L\ . A \sep & \text{universal quantification over L} \\ &|\ \alpha_L\ [A] \sep & \text{parameterization} \\ &|\ \ea_L [A] [d] \sep & \text{existential (unsolved) variable with parameters and defaults} \end{align} \]

Every type is associated with either the general language or a concrete realization of the general type into a specific language.

\(\text{TypeAnnotation} \sep A^{\bullet}, B^{\bullet}, C^{\bullet}\ := A \; L \; \braced{P_i}_{i=0}^m \; \braced{C_i}_{i=0}^n\)
\(\text{Typeset} \sep \vec{A} := A_1^{\bullet},\ ...,\ A_n^{\bullet}\)

A type annotation couples a type in a given language with a set of properties and constraints. The "properties" of the type are a set of n-ary relations describing the type. These properties will eventually be used to implement the equivalent of Haskell typeclasses. In the near future, I will add support for casting functions that can be defined to allow automatic, language-specific, handling for conversions between types (e.g., handling automatic conversions from integers to doubles). The "constraints" are not yet implemented, but will be a list of assertions that must be met. When possible, these constraints will be evaluated statically, otherwise they will be translated into (possibly optional) runtime assertions.

Typesets serve as collections of all that is known about a type as it is represented across languages. Different languages may have different sets of properties and constraints; they are unified only by a common name and complete interchangeability.

5.2. Declarative type system

Coming soon (or late)

5.3. Algorithmic Rules

\[ \begin{align} \Gamma, \Delta & \ :\ [entry] \sep & \text{ordered list of }entry \text{ items} \\ entry & \ :\ \alpha_L \sep & \text{type variable from language L} \\ & \ |\ e = \vec{A} \sep & \text{annotated expression} \\ & \ |\ \ea_L \sep & \text{existential} \\ & \ |\ \ea_L = t \sep & \text{solved existential} \\ & \ |\ \mark{x} \sep & \text{a named marker} \\ & \ |\ \text{Source}\ v\ L\ \sep & \text{source term from language L} \\ \end{align} \]

The context is a list storing type annotations, solved and unsolved existential variables, markers, and source/export info.

The subtyping rules are adapted from [1].

I fleshed out the type system with containers (lists, tuples, records) and parameterized types. The main change, though, is the addition of multilingual support.

Table 1. Subtyping Rules

\(\expr{\Gamma}{A\ <:\ B}{\Delta}\) \(\quad\quad\) Under input context \(\Gamma\), type \(A\) is a subtype of \(B\), with output context \(\Delta\)

\(\frac{}{\expr{\Gamma}{\alpha_L\ <:\ \alpha_L}{\Gamma}}\) <:Var \(\quad\quad\) \(\frac{\packto{\beta_{L2}}{\alpha_{L1}}}{\expr{\Gamma}{\alpha_{L1}\ <:\ \beta_{L2}}{\Gamma}}\) <:AlienVar*

\(\frac{}{\expr{\Gamma}{\ea_L\ <:\ \ea_L}{\Gamma}}\) <:Exvar \(\quad\quad\) \(\frac{}{\expr{\Gamma}{\ea_{L1}\ <:\ \eb_{L2}}{\Gamma,\ \packto{\eb_{L2}}{\ea_{L1}}}}\) <:AlienExvar*

\(\frac{\Delta_1 = \Gamma \quad [\expr{\Delta_{i}}{A_i\ <:\ B_i}{\Delta_{i+1}}]_{i=1}^n}{\expr{\Gamma}{\alpha_L\ [A_i]_{i=1}^n <:\ \alpha_L\ [B_i]_{i=1}^n}{\Delta_{n+1}}}\) <:App \(\quad\quad\) \(\frac{\Delta_1 = \Gamma \quad [ \expr{\Delta_i}{A_i\ <:\ B_i}{\Delta_{i+1}} ]_{i=1}^n \quad \packto{\beta_{L2}\ [B_i]_{i=1}^m}{\alpha_{L1}\ [A_i]_{i=1}^n}}{\expr{\Gamma}{\alpha_{L1}\ [A_i]_{i=1}^n\ <:\ \beta_{L2}\ [B_i]_{i=1}^n}{\Delta_{n+1}}}\) <:AppAlien

\(\frac{\packto{\alpha_{L1}\ [A_i]_{i=1}^n}{\beta_{L2}}}{\expr{\Gamma}{\alpha_{L1}\ [A_i]_{i=1}^n\ <:\ \beta_{L2}}{\Gamma}}\) <:AppAlienVarL \(\quad\quad\) \(\frac{\packto{\alpha_{L1}}{\beta_{L2}\ [B_i]_{i=1}^n}}{\expr{\Gamma}{\alpha_{L1}\ <:\ \beta_{L2}\ [B_i]_{i=1}^n}{\Gamma}}\) <:AppAlienVarR

\(\frac{\expr{\Gamma}{B_1\ <:\ A_1}{\Theta} \quad \expr{\Theta}{[\Theta]A_2\ <:\ [\Theta]B_2}{\Delta}}{\expr{\Gamma}{A_1 \rightarrow A_2\ <:\ B_1 \rightarrow B_2}{\Delta}}\) <:→

\(\frac{\expr{\Gamma,\mark{\ea_L},\ea_L}{[\ea_L/\alpha_L]A\ <:\ B}{\Delta,\mark{\ea_L},\Theta}}{\expr{\Gamma}{\forall \alpha_L . A\ <:\ B}{\Delta}}\) <:∀L \(\quad\quad\) \(\frac{\expr{\Gamma,\alpha_L}{A<:B}{\Delta,\alpha_L,\Theta}}{\expr{\Gamma}{A <: \forall \alpha_L . B}{\Delta}}\) <:∀R

\(\frac{\ea_L \notin FV(A) \quad \expr{\Gamma[\ea_L]}{\subtype{\ea_L}{A}}{\Delta}}{\expr{\Gamma[\ea_L]}{\ea_L\ <:\ A}{\Delta}}\) <:InstantiateL \(\quad\quad\) \(\frac{\ea_L \notin FV(A) \quad \expr{\Gamma[\ea_L]}{\subtype{A}{\ea_L}}{\Delta}}{\expr{\Gamma[\ea_L]}{A\ <:\ \ea_L}{\Delta}}\) <:InstantiateR

\(\text{subtype} :: A \rightarrow B \rightarrow \Gamma \rightarrow \Delta\)

Type variables in different languages have no subtype relationship. As far as the typechecker goes, it is assumed that the language-specific (concrete) types match if the general types do. Note that functions are not annotated with languages. Thus the subtype test \(A \rightarrow B\ <:\ \alpha_L\) and its reverse will both raise errors.

The rule <:AlienExvar stores in context an "existential assertion" that cannot be evaluated until the existential variables it contains are solved.

Parameterized types across languages are supported. This may seem impossible, since not all languages support parameterized types. It is easiest to explain with examples showing general, Haskell, C++, Python, and R signatures.

-- a map of strings to integers
x :: Map String Integer
x Hask :: "Map $1 $2" String Integer
x Cpp :: "std::map<$1,$2>" "std::string" int
x Py :: "dict" str int
x R :: "list" str int

To allow for different parameterization syntax across languages, the first term is a pattern that takes the parameters as arguments. For Haskell and C++, the parameterized types would ultimately be formed into Map String Integer and std::map<std::string,int>, respectively. For dynamic languages, the parameters will not appear in the final type itself (dict and list, respectively), but the type information will be preserved.

Table 2. Instantiation Rules

\(\expr{\Gamma}{\subtype{\ea_L}{A}}{\Delta}\) \(\quad\quad\) Under input context \(\Gamma\), instantiate \(\ea_L\) such that \(\ea_L <: A\), with output context \(\Delta\)

\(\frac{\Gamma\ \vdash\ \tau}{\expr{\Gamma,\ea_L,\Gamma'}{\subtype{\ea_L}{\tau}}{\Gamma,\ea_L=\tau,\Gamma'}}\) InstLSolve \(\quad\quad\) \(\frac{}{\expr{\Gamma[\ea_L][\eb_L]}{\subtype{\ea_L}{\eb_L}}{\Gamma[\ea_L][\eb_L=\ea_L]}}\) InstLReach

\(\frac{\expr{\Gamma[\ea_2,\ea_1,\ea=\ea_2\rightarrow\ea_1]}{\subtype{A_1}{\ea_1}}{\Theta} \quad \expr{\Theta}{\subtype{\ea_2}{[\Theta]A_2}}{\Delta}}{\expr{\Gamma[\ea]}{\subtype{\ea}{A_1 \rightarrow A_2}}{\Delta}}\) InstLArr \(\quad\quad\) \(\frac{\expr{\Gamma[\ea_L],\beta_L}{\subtype{\ea_L}{B}}{\Delta,\beta_L,\Delta'}}{\expr{\Gamma[\ea_L]}{\subtype{\ea_L}{\forall \beta_L . B}}{\Delta}}\) InstLAllR

\(\frac{\Gamma\ \vdash\ \tau}{\expr{\Gamma,\ea_L,\Gamma'}{\subtype{\tau}{\ea_L}}{\Gamma,\ea_L=\tau,\Gamma'}}\) InstRSolve \(\quad\quad\) \(\frac{}{\expr{\Gamma[\ea_L][\eb_L]}{\subtype{\eb_L}{\ea_L}}{\Gamma[\ea_L][\eb_L=\ea_L]}}\) InstRReach

\(\frac{\expr{\Gamma[\ea_{L,2},\ea_{L,1},\ea_L=\ea_{L,2}\rightarrow\ea_{L,1}]}{\subtype{\ea_{L,1}}{A_1}}{\Theta} \quad \expr{\Theta}{\subtype{[\Theta]A_2}{\ea_{L,2}}}{\Delta}}{\expr{\Gamma[\ea_L]}{\subtype{A_1 \rightarrow A_2}{\ea_L}}{\Delta}}\) InstRArr \(\quad\quad\) \(\frac{\expr{\Gamma[\ea_L],\ \blacktriangleright \eb_L,\ \eb_L}{\subtype{[\eb_L/\beta_L]B}{\ea_L}}{\Delta,\ \blacktriangleright \eb_L,\ \Delta'}}{\expr{\Gamma[\ea_L]}{\subtype{\forall \beta_L . B}{\ea_L}}{\Delta}}\) InstRAllL

\(\text{instantiate}\ ::\ A \rightarrow B \rightarrow \Gamma \rightarrow \Delta\)

The instantiation rules above are identical to the rules of Dunfield and Krishnaswami [1]. However, in morloc, existential types additionally carry parameters and default values.

Table 3. Transform rules

\(\packto{A_{L1}}{B_{L2}}\) \(\quad\quad\) Type \(A\) in language \(L1\) can be uniquely transformed to type \(B\) in language \(L2\)

\(\frac{}{\expr{\Gamma}{\packto{A_L}{A_L}}{\Gamma}}\) SerializeCis \(\quad\quad\) \(\frac {f\ L_1\ ::\ \text{packs}\ \Rightarrow\ A'_{L1}\ \rightarrow\ C_{L1} \quad g\ L_2\ ::\ \text{unpacks}\ \Rightarrow\ D_{L2}\ \rightarrow\ B'_{L2} \quad \subtype{A'_{L1}}{A_{L1}} \quad \subtype{B'_{L2}}{B_{L2}}} {\expr{\Gamma}{\packto{A_{L1}}{B_{L2}}}{\Gamma}}\) SerializeTrans

\(\frac{f\ L\ ::\ \text{cast}\ \Rightarrow\ A_L\ \rightarrow\ X_L \quad \packto{X_L}{B_L}}{\expr{\Gamma}{\packto{A_{L}}{B_{L}}}{\Gamma}}\) Cast

\(\text{cast}\ ::\ A\ \rightarrow\ B\ \rightarrow\ \Gamma\ \rightarrow\ \Gamma\)

The transform rules assert that types are interconvertible. The serialization rules transform between semantically equivalent types that are expressed in different languages. The cast rules transform between semantically different types expressed in the same language.

SerializeCis is a trivial rule stating that any type can be converted to itself. SerializeTrans states that types \(A_{L1}\) and \(B_{L2}\) can be interconverted if there exists a function for serializing type \(A\) in language \(L_1\) to a standard intermediate form (e.g., JSON) and a deserialization function from the standard intermediate to \(B\) in language \(L_2\). The serialization function may be more polymorphic than \(A\) and \(B\). For example, a general serialization function may exist which would serialize any type in the given language into JSON.

These assertions alone are not sufficient for proving that two types are interconvertible. The serialization functions show only that a path exists between the types (e.g., they serialize to a common JSON object); they do not show that the types are semantically equivalent. Semantic equivalence is demonstrated through typechecking of the general, language-independent type. That is, if the language-specific types under consideration are not semantically equivalent, an error will be raised elsewhere in the typechecking process.

Prior to morloc v0.22.0, I explicitly wrote serialization functions with morloc signatures using the pack/unpack property. In v0.22.0, I replaced this system with a language-specific approach of passing some form of type-template to the serialization machinery particular to the given language. This means serialization is up to the libraries in the specific languages and the type checker will not be able to catch the absence of a serialization path. This approach has worked well, but I am not happy with the absence of static checks or with how awkward it is to explain.

The Cast rule handles directed automatic conversions between types within a language. A common example would be the conversion of integers to doubles. The current rules are very strict, requiring type identity for casting, and are not amenable to more general transformations. Note the rule is recursive. The cast functions form a directed graph (usually highly disconnected and possibly cyclic) of unambiguous and unfailing transformations between types. They should describe relationships where there is a single obvious meaning (e.g., a->[a] or PositiveInteger->Integer) and that will never fail (so string to integer would not be included).

Further, the rules specified here are assertions showing the transformations are possible. There may be multiple paths to accomplishing the transforms that will differ in performance and require different dependencies at build time. Choosing which path to take is not the responsibility of the typechecker and will be dependent on the user’s system architecture and local configuration.

Table 4. synthesize

\(\expr{\Gamma}{e \Rightarrow A}{\Delta}\) \(\quad\quad\) Under input context \(\Gamma\), \(e\) synthesizes output type \(A\), with output context \(\Delta\)

\(\frac{\expr{\Gamma, x:A_L}{e_2\ \Rightarrow\ \_}{\Delta}}{\expr{\Gamma}{x\ L\ ::\ A_L\ ;\ e_2}{\Delta}}\) Signature \(\quad\quad\) \(\frac{}{\expr{\Gamma}{\text{Source }L\ x}{\Gamma,\ \ea_L}}\) Source

\(\frac{e\ \Rightarrow\ \_\ \vdash\ \Theta \quad \lbrace x:A\ |\ (x:A)\ \in\ \Theta \rbrace\ \vdash\ \Theta' \quad \lbrace x:A\ |\ x\ \in\ xs,\ (x:A) \in \Theta' \rbrace\ \vdash\ \Delta}{\expr{\Gamma}{\text{Import}\ e\ xs}{\Gamma, \Delta}}\) Import

\(\frac{x \notin \text{FV}(\Gamma) \quad \expr{\Gamma[x:A], \mark{x}}{e\ \Leftarrow\ A}{\Delta,\mark{x}, \Theta}}{\expr{\Gamma}{x=e}{\Delta}}\) DeclareCheck \(\quad\quad\) \(\frac{x \notin \text{FV}(\Gamma) \quad \expr{\Gamma,\mark{x}}{e\ \Rightarrow\ A}{\Delta,\mark{x}, \Theta}}{\expr{\Gamma}{x=e}{\Delta,\ x:\text{Gen}(A)}}\) DeclareInfer

\(\text{synthesizeToplevel} :: \Gamma \rightarrow e \rightarrow \Delta\)

\(\frac{L = \text{MLang}}{\expr{\Gamma}{\text{number}\ \Rightarrow\ \text{Num}}{\Gamma}}\) Num⇒ \(\quad\quad\) \(\frac{L = \text{MLang}}{\expr{\Gamma}{\text{int} \Rightarrow \text{Int}}{\Gamma}}\) Int⇒ \(\quad\quad\) \(\frac{L = \text{MLang}}{\expr{\Gamma}{\text{string} \Rightarrow \text{Str}}{\Gamma}}\) Str⇒ \(\quad\quad\) \(\frac{L = \text{MLang}}{\expr{\Gamma}{\text{bool} \Rightarrow \text{Bool}}{\Gamma}}\) Bool⇒

\(\frac{L = \text{MLang} \quad \expr{\Gamma}{x_1 \Rightarrow A}{\Delta_1} \quad \expr{\Delta_1}{x_2 \Leftarrow A}{\Delta_2} \quad ... \quad \expr{\Delta_{n-1}}{x_n \Leftarrow A}{\Delta_n}}{\expr{\Gamma}{[x_1,x_2, ..., x_n]}{\Delta_n,\ \text{List}\ A}}\) List⇒

\(\frac{L = \text{MLang} \quad \expr{\Gamma}{x_1 \Rightarrow A_1}{\Delta_1} \quad ... \quad \expr{\Delta_{n-1}}{x_n \Rightarrow A_n}{\Delta_n}}{\expr{\Gamma}{(x_1,x_2,\ ...\ x_n)}{\Delta_n,\ \text{Tuple}\ A_1\ ...\ A_n}}\) Tuple⇒

\(\frac{L = \text{MLang} \quad \expr{\Gamma}{x_1 \Rightarrow A_1}{\Delta_1} \quad ... \quad \expr{\Delta_{n-1}}{x_n \Rightarrow A_n}{\Delta_n}}{\expr{\Gamma}{\lbrace (k_1,x_1),(k_2, x_2),\ ...,\ (k_n, x_n) \rbrace}{\Delta_n,\ \lbrace (k_1, A_1),\ ...,\ (k_n, A_n) \rbrace}}\) Record⇒

\(\frac{L \quad \expr{\Gamma,\ea_L,\eb_L,x:\ea_L}{e \Leftarrow \eb_L}{\Delta, x:\ea_L, \Theta}}{\expr{\Gamma}{\lambda x.e\ \Rightarrow\ \ea_L\rightarrow \eb_L}{\Delta}}\) →I⇒

\(\text{synthesizeSingular} :: L \rightarrow \Gamma \rightarrow e \rightarrow (\Delta,\ A)\)

\(\frac{(\,x\,:\,A_L\,)\ \in\ \vec{A}\ \in\ \Gamma}{\expr{\Gamma}{x\ \overset{L}{\Rightarrow} A_L}{\Gamma}}\) Var \(\quad\quad\) \(\frac{\forall\ M\ .\ (x:A_m)\ \notin\ \vec{A}\ \in\ \Gamma \quad (\,x\,:\,A\,)\ \in\ \vec{A}\ \in\ \Gamma \quad \Gamma\ \vdash\ x=e \quad \expr{\Gamma}{e \overset{L}{\Rightarrow} A_L}{\Delta}}{\expr{\Gamma}{x\ \overset{L}{\Rightarrow} A_L}{\Delta}}\) Var⇒

\(\frac{\Gamma\ \vdash\ A \quad \Delta_1 = \Gamma \quad \braced{ \expr{\Delta_i}{e \overset{L_i}{\Leftarrow} A_i}{\Delta_{i+1}} }_{i=1}^k}{\expr{\Gamma}{(e:\vec{A})\ \Rightarrow\ \vec{A}}{\Delta}}\) Anno \(\quad\quad\) \(\frac{\expr{\Gamma}{e_1\ \Rightarrow\ \vec{A}}{\Delta} \quad\quad \braced{ \Delta\ \vdash\ [\Delta] \apply{A_{L_i}}{e_2}{C_{L_i}}\ |\ L_i \in \text{lang}(\vec{A}) }_{i=1}^k}{\expr{\Gamma}{e_1 e_2 \Rightarrow \vec{C}}{\Delta}}\) →E

\(\text{synthesizeSpread} :: \Gamma \rightarrow e \rightarrow (\Delta_k,\ [A_L])\)

I added typechecking rules for primitives, containers, declarations, and signatures. The primitive rules are axioms where the types are inferred by the lexer. The containers include homogeneous lists, tuples, and records. A declaration allows a variable to be assigned to an expression. Top-level shadowing is not allowed (i.e., no re-assignment). Also the types are generalized, with all remaining existential variables pulled out as universal quantifiers.

The three functions synthesizeToplevel, synthesizeSingular, and synthesizeSpread are all specializations of a general function of type:

synthesize :: L → Γ → e → [A]

Each rule will be described in the sections below.

synthesizeToplevel

The top-level statements import/source terms, specify their type (Signature), and build compositions from them (Declaration).

The Import rule is premised on the evaluation of \(e\), which is an entire module body that yields a full context. The term \((e\ \Rightarrow\ \_)\) is an inference that throws away the resulting type, being run only for the context it generates.

synthesizeSpread

morloc data structures can be typed in MLang, but not directly in other languages without additional information. For example, is [Num] in C++ an array or a vector? Is Num a double or a float? Determining the concrete type requires a concrete type signature. Thus the concrete types are checked rather than synthesized.

Synthesizing a lambda requires we choose a language. Nothing in the body of the lambda expression specifies the language of the lambda. The language of the subcomponents may differ from the language of the lambda or may have no concrete binding at all (e.g., \(\lambda x . 42\)).

The Var⇒ rule handles cases where an expression with no concrete type (i.e., one that does not make a function call) is assigned to a variable and is then used in a second expression. For example:

import pybase (map, add)
ys = [1,2,3]
foo x = map (add x) ys

map and add are both functions imported from Python3. ys, though, is defined as a general list of numbers. At the location where it is defined, no language can be inferred. It is not until ys is used within foo that its concrete type can be inferred.

Table 5. check

\(\expr{\Gamma}{e \Leftarrow A}{\Delta}\) \(\quad\quad\) Under input context \(\Gamma\), \(e\) checks against input type \(A\), with output context \(\Delta\)

\(\frac{\expr{\Gamma,x:A_L}{e \Leftarrow B_L}{\Delta,x:A_L,\Theta}}{\expr{\Gamma}{\lambda x.e \Leftarrow A_L \rightarrow B_L}{\Delta}}\) →I \(\quad\quad\) \(\frac{\expr{\Gamma,\alpha_L}{e \Leftarrow A_L}{\Delta,\alpha_L,\Theta}}{\expr{\Gamma}{e \Leftarrow \forall \alpha_L . A_L}{\Delta}}\) ∀I \(\quad\quad\) \(\frac{\expr{\Gamma}{e \overset{L}{\Rightarrow} A_L}{\Theta} \quad\quad \expr{\Theta}{[\Theta]A_L\ <:\ [\Theta]B_L}{\Delta}}{\expr{\Gamma}{e \Leftarrow B_L}{\Delta}}\) Sub \(\quad\quad\)

\(\text{check} :: \Gamma \rightarrow e \rightarrow A \rightarrow (\Delta,\ B)\)

Table 6. apply

\(\expr{\Gamma}{\apply{A}{e}{C}}{\Delta}\) \(\quad\quad\) Under \(\Gamma\), applying a function of type \(A\) to \(e\) synthesizes type \(C\), with output context \(\Delta\)

\(\frac{\expr{\Gamma[\ea_{2L},\ \ea_{1L},\ \ea_L\ =\ \ea_{1L}\ \rightarrow\ \ea_{2L}]}{e \Leftarrow\ \ea_{1L}}{\Delta}}{\expr{\Gamma[\ea_L]}{\apply{\ea_L}{e}{\ea_{2L}}}{\Delta}}\) \(\ea_L\)App \(\quad\quad\) \(\frac{\expr{\Gamma,\ea_L}{\apply{[\ea_L/\alpha_L]A}{e}{C}}{\Delta}}{\expr{\Gamma}{\apply{\forall\alpha_L . A}{e}{C}}{\Delta}}\) ∀App \(\quad\quad\) \(\frac{\expr{\Gamma}{e \Leftarrow A}{\Delta}}{\expr{\Gamma}{\apply{A \rightarrow C}{e}{C}}{\Delta}}\) →App \(\quad\quad\)

\(\text{apply} :: \Gamma \rightarrow e \rightarrow A \rightarrow (\Delta,\ [(L,\ B)])\)

5.4. Let polymorphism

We depart from Hindley-Milner (HM) by excluding a let term. In HM, expressions bound in let are generalized, allowing statements such as:

let f = \x -> x in
    (f 42, f "lettuce")

Here \(f\) is generalized to \(\forall a\ .\ a \rightarrow a\), allowing it to retain its polymorphism. The same is not true of variables bound in functions (in HM at least). For example, the following Haskell expression fails to typecheck:

foo :: (a -> a) -> (Int, String)
foo f = (f 42, f "lettuce")

We do not support let expressions or let-polymorphism; instead we generalize expressions only if they are bound at the top level (i.e., in declaration terms). This follows the practice argued for in [2].
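
By contrast, a binding at the top level is generalized, so the polymorphic use that fails inside foo typechecks when f is bound globally:

f :: a -> a
f x = x

pair :: (Int, String)
pair = (f 42, f "lettuce")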

5.5. Examples

In this section, I will step through several type inference examples.

The goal of the type inference engine is to assign types to every expression. Expressions will usually have two or more types (a general type and one or more concrete types). The input to the type inference engine is a directed, acyclic graph (DAG) of modules, with (name, alias) pairs for edges and PreparedNode objects for nodes. The PreparedNode object contains a list of expressions (the module body). The value returned is a TypedNode object with every expression recursively labeled.

The first example is the minimal hello world program:

export hw
hw = "hello"

The hello world is an example of a program that can be run but has no concrete types. The general nexus dispatch program will simply print "hello" for the user without dispatching to a function in any target language. More useful examples of functions without concrete types would be modules that export constants (where the concrete type will be inferred later) or modules that define interfaces exporting general type signatures and function compositions that are implemented in downstream modules.

The program also imports nothing. It may be imported by other modules, in which case the exported term hw would be exposed.

Since there are no imports to consider, the algorithm will simply infer the type of hw = "hello". This statement parses to the Expr object Declaration (EV [] "hw") (StrE "hello"). The full Expr record is:

data Expr
  = SrcE [Source]
  | Signature EVar EType
  | Declaration EVar Expr
  | UniE
  | VarE EVar
  | AccE Expr Text
  | ListE [Expr]
  | TupleE [Expr]
  | LamE EVar Expr
  | AppE Expr Expr
  | AnnE Expr [UnresolvedType]
  | NumE Scientific
  | LogE Bool
  | StrE Text
  | RecE [(Text, Expr)]

An EVar type stores a term name and its scope. Since hw is at the top-level, the scope is an empty list, [].

The rule DeclareInfer is triggered. Here we run into a deviation between the specification and the implementation: I removed expression marks from the implementation. I am going to need to rewrite the mathy spec above, so ignore the details of the spec for now.

6. In Development

6.1. Index and Refinement types

Nothing here has been implemented yet

This section will first introduce what information the additional type annotations will provide and will then cover what can be done with this information.

The type system described so far allows for clear expressions of general intent, for example, (a,b) → b. But often these signatures are too general to be informative and too permissive to be safe. They can be improved by allowing constraints to be placed on the types.

I’ll introduce these ideas with a series of motivating examples.

Suppose there is a function with the signature [a] → [a]. This could be a sort function, reverse function, or subsampling function. Index types can clarify the meaning:

reverse :: [a]_{n} -> [a]_{n}
sort :: [a]_{n} -> [a]_{n}    -- note, for now, this has the same signature as reverse
head :: [a]_{n} -> a
transpose :: Matrix_{m,n} a -> Matrix_{n,m} a

The index types just describe the dimension of the data.

-- where n is an integer or expression that evaluates to an integer, the
-- expressions may have arithmetic binary operators, min, max, etc
A_{n}
A_{n+1}
A_{n} -> A_{1}
A_{n} -> A_{m} -> A_{m + n}
A_{n} -> A_{m} -> A_{max m n}

-- Expressions may have multiple indices (e.g., matrices and graphs)
Matrix_{m,n} -- row and column counts
Graph_{e,v}  -- counts of edges and vertices in the graph

How do I specify the shape of a term?

shapeA :: shape => A -> Int
shapeMatrix :: shape => Matrix -> (Int,Int)

Index types, though, are just sugar for a subset of refinement types.


type Prob = { x:Num | 0 <= x <= 1 }
type Index = { x:Int | 1 <= x }

reverse :: xs:[a]_{n} -> {ys:[a]_{n} | xs[i] == ys[n - i], 1 <= i <= n }
sort :: [a]_{n} -> {xs:[a]_{n} | xs[i] <= xs[i+1], 0 < i < n}
head :: { [a]_{n}      | n > 1 } -> a
head :: { [a]_{n} -> a | n > 1 }
head :: [a]_{n} -> a where { n > 1 }

Array indices can also be accessed in the types. This allows, for example, types with autocorrelation to be expressed.

type GaussianWalk = { xs:[Num]_n | xs[i+1] ~ Norm(xs[i], 1), 0 < i < n }
scale :: { [x:Num]_{n} | x ~ Norm(m,s) } -> { [y:Num]_{n} | y ~ Norm(0, 1) }
scale :: { xs:[Num]_{n} | xs[i] ~ Norm(m,s), 0 < i <= n } -> { ys:[Num]_{n} | ys[i] ~ Norm(0, 1)}

A few questions:

  • should indexing be 1-based (as in math) or 0-based (as in most computer languages)?

  • does indexing make sense for non-ordered sets (e.g., graph nodes and edges)?

  • what about dictionaries that are indexed with non-integer keys?

  • what about scoping? Are both i's the same in the code below? { xs:[Num]_n | xs[i] ~ Exp(4), 0 < i <= n } -> { ys:[Num]_m | ys[i] ~ Exp(5), 0 < i < m }

Scoping should be global within the type signature. So above, the actual bounds on i would be 0 < i < min n m, which is probably not what the user intended. This would leave some elements untyped.

  • some refinements may be handled statically, but probably not all

  • is indexing just sugar that can be rewritten in refinement type terms? (probably)

-- using indexing
reverse :: [a]_n -> [a]_n

-- using refinement type syntax
reverse :: xs:[a] -> { ys:[a] | len xs = len ys }
reverse :: { xs:[a] -> ys:[a] | len xs = len ys }

6.1.1. How are the types used

The System F layer of the morloc type system is statically typed. Typeclasses, when they are added, will be statically typed as well.

However, the refinement types, particularly when they use user-defined functions, will not always be checkable statically, and instead will generate (optional) runtime assertions.

Hybrid systems that perform static checking for decidable problems and generate runtime assertions of undecidable cases have been explored [3].

How will we divide what can and cannot be statically checked?

6.1.2. Additional typing examples

align :: { [xs:[Char]]_n | len xs[i] > k1, 0 < i <= m } -> { [ys:[Char]]_n | len ys[i] = k2}

align :: Char -> [xs:[Char]]_n -> [ys:[Char]]_n
  where
    0 < i <= n
    len xs[i] > k1
    len ys[i] = k2
    k2 >= k1
    ungap xs[i] = ungap ys[i]

ungap :: [Char]_n -> { [Char]_m | m <= n }
foo :: {n:Int | n >= 0} -> [a]_{n}
bar :: {[a]_{n} | n > 0} -> a

-- fails at compile time
f xs = bar (foo 0 xs)

-- passes at compile time, refining the constraint on n to {n > 0}
g k xs = bar (foo k xs)

6.1.3. Typing and test generation

One goal of morloc is to be able to generate test data from the type signatures and ensure that running the functions on the generated data 1) doesn’t fail at runtime and 2) yields data that respects all constraints on the function output.

The goal is essentially to add property testing to the type signatures.

6.1.4. Typing and performance modeling

If data can be generated based on the type signatures, and if functions are given that map from input data to model features, then a performance model can be generated that is a function of the input features. These input features are descriptions of the input data. For example, a generic list has exactly one feature: its length. The features of a graph might be edge count, node count, and select centrality measures.

This could be implemented as a typeclass:

class Typecheckable a where
  features :: a -> [Num]

That is, if a vector of numeric features can be derived for a type, then for a list of examples of this type a feature table can be created. Then the examples can be run through the function and performance statistics can be measured. Then a model can be made that predicts runtime and memory usage from the features.
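
A sketch of this idea in Haskell, with a hypothetical Graph type standing in for a structure with several features:

{-# LANGUAGE FlexibleInstances #-}

class Typecheckable a where
  features :: a -> [Double]

-- A list has exactly one feature: its length.
instance Typecheckable [a] where
  features xs = [fromIntegral (length xs)]

-- A hypothetical graph type with two features (edge and node counts).
data Graph = Graph { edges :: Int, nodes :: Int }

instance Typecheckable Graph where
  features g = [fromIntegral (edges g), fromIntegral (nodes g)]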

6.1.5. Another discussion

[[Char]_{m_i}]_n -> Matrix_{n,m} Char

[[Char]_{m_i}]_n -> Matrix_{n,m >= m_i} Char

[[Char]_{m_i}]_n -> [[Char]_{m | m >= m_i}]_n
    xs:[...]_{5}
 --------------------
    len xs = 5

    xs:[...]_n
 --------------------
    exists Nat n
    let n = len xs    -- bind n to the first dimension of xs

    x:Array_{n,m} a
 --------------------
    exists Nat n
    let n = nrow x
    exists Nat m
    let m = ncol x

    xs:[...]_{n > 5}
 --------------------
    exists Nat n
    n = len xs
    n > 5

    x:Array_{n,m > 5} a
 --------------------
    exists Nat n
    n = nrow x
    exists Nat m
    m = ncol x
    m > 5

Exactly 1 value is available to be bound in each dimensional slot. The expression n uses one value, so this one value MUST be the row dimension. m > 5 is in the column dimension slot and defines one variable, m, so m must be the column dimension.

    xs:[...]_{n > m}
 --------------------
    ! cannot resolve

Here the length dimension slot has two variables defined, but there is only one slot, so an error is raised.

    xs:[...]_{n > m}
    ys:[...]_{m > 5}
 --------------------
    exists Nat n
    exists Nat m
    -- cannot resolve {n > m}, so continue
    m = len ys
    m > 5
    -- now backtrack
    n = len xs
    n > m

    xs:[...]_{m > n} -> ys:[...]_{m > p} -> zs:[...]_{p}
 ----------------------------
    -- same thing, just requires 3 loops

    xs:[...]_{n > m}
    ys:[...]_{m < n}
 --------------------
    exists Nat n
    exists Nat m
    -- cannot resolve {n > m} so continue
    -- cannot resolve {m > n} so backtrack
    ... -- infinite loop, break with a raised error when a loop resolves nothing

The law of relevance: any statement in the index clause must be about the index of the type.

    xs:[...]_n -> ys:[...]_n
 ----------------------------
    exists Nat n
    n = len xs
    n = len ys

    xs:[...]_n -> ys:[...]_{n > 5}
 ----------------------------
    exists Nat n
    n = len xs
    n = len ys
    n > 5


    xs:[ys:[...]_n]
 --------------------
    forall ys
        len ys = n

    xs:[ys:[...]_{n_i; n_i < 5}]
 --------------------
    forall i in

6.2. Generating user interfaces

There are several types of user interfaces:

  • A command line utility

  • A programming library

  • A web API (e.g., a REST interface)

  • A browser or cellphone app

  • A graphical user interface on a PC

I will go through each of these in detail, but first there are a few missing features from morloc that will be required to generate any of these.

parameterization. Currently all function arguments are positional. This works fine in languages like Haskell, where parameters are usually passed in either a Reader monad or a record. But in languages like Python or R, and in command line utilities and node-based programming frameworks (like Blender shader specifications), positional arguments and parameters are dealt with differently. One solution would be to add all parameters as a record in the first positional argument. The record would be interpreted as the list of keyword arguments in languages that support them and would be treated as a normal record in languages that don't.

passing by filename. Passing data between languages in JSON form is convenient, but limited. Binary data, such as images or music, would be better passed as files or binary blobs. Large data that is not in memory would be better passed as files or references to database objects. Any data that is threaded through a pool, but not directly used, should be passed as a reference to avoid unnecessary (de)serialization.

Documentation. There is nothing novel that needs to be done here. I just need a basic literate programming system, in the Don Knuth tradition, similar to Javadoc, Haddock, and every other programming language's documentation tool. I'll probably model my system after Haddock. The documentation can be leveraged into man pages and usage statements for the CLI utilities, transferred as documentation in programmatic libraries, and built into docs for web APIs, GUIs, and everything else.

I will walk through each of these and outline what morloc should do with respect to generating the UI.

Command line utility

This is what morloc currently generates. Command line utilities are deeply formulaic, and many frameworks exist to generate them from minimal documentation and type annotations. What is currently missing from morloc-generated CLIs is clean handling for parameters (flags, keyword arguments), passing data by "reference" (through filenames), conventions for STDIN/STDOUT/STDERR piping, and a documentation system.

Modules that are intended to be used as CLI tools (or other UIs) will probably be written in a different style than modules intended to be used as libraries. The key difference is that functions in modules compiled into UIs will usually be called directly rather than composed together. CLIs often have many flags and parameters that direct branching to determine which function to call.

Below is an example that illustrates this idea:

import bio (aligner)

type AlignmentFormat = s:String where
    s in {"fasta", "clustalw", "stockholm"}

type SequenceFormat = s:String where
    s in {"fasta", "twobit"}

parameter AlignOpt =
  { "output-format|o" :: AlignmentFormat ? "fasta"
  , "intput-format|i" :: SequenceFormat ? "fasta"
  }

align :: AlignOpt -> Filename -> Filename -> ()
align opt fi fo = writer (aligner (reader fi)) fo where
    reader = case opt["input-format"] of
        "fasta" -> readFasta
        "twobit" -> readTwoBit
    writer = case opt["output-format"] of
        "fasta" -> writeFasta
        "clustalw" -> writeClustalW
        "stockholm" -> writeStockholm

Here the generated command line tool has options for specifying the input and output formats of the sequence files. This approach would not normally be used in a library, since it would be easier to just export all the readers and writers and let the user compose them.

The example also uses some experimental syntax that is not yet implemented in morloc. The records have default values ? "fasta" and specify short options |o in the string key names. The "record" is also not a record, but rather a string map.

parameter NormOpt =
  { "verbose|v" :: Bool ? False
  , "mean|m" :: Num ? 0
  , "stddev|s" :: Num ? 1
  }

norm :: NormOpt -> i:Int -> [Num]_{i}

quiet_norm = norm (NormOpt / {verbose = False})

Here quiet_norm will have the type:

quiet_norm :: {"mean|m" :: Num ? 0, "stddev|s" :: Num ? 1} -> i:Int -> [Num]_{i}

Why use "/" for partial application? Because dividing by a term removes it from an equation.

Another example where output type varies with parameter

Programmatic library

This is easy. morloc already generates the pools for each language, and the manifold nexus calls the pools. Generating a library that can be imported into other languages is actually an easier problem. The compiler would fix the first call in each exported composition to the desired target language. Then typechecking and intermediate representation generation would proceed as they currently do. The final code generation step would not generate a manifold nexus and would instead just make the one-language pool. It might need to generate a little more package boilerplate than is currently required, but this is usually pretty formulaic, and nearly all the info required for the process is already in the system. What is missing is the programmer metadata: the author, maintainer, copyright, and license information and such. But this is an easy thing to add.

6.3. Effect handling - monads, algebraic effects, and nothing

We could use monads to describe different effects

read :: Filename -> IO Text

write :: Filename -> Text -> IO ()

writeErr :: Text -> IO ()

modifyText :: Text -> Either Text Text

main infile outfile = do
  f <- read infile
  case modifyText f of
    Left err -> writeErr err
    Right f' -> write outfile f'

We could ignore effects and just "let it be"

read :: Filename -> Text

write :: Filename -> Text -> ()

modifyText :: Text -> Text

main i o = write o (modifyText (read i))

Or we could use algebraic effects. This approach has been explored in the Koka language [4].

read :: Filename -> Text +IO

write :: Filename -> Text -> () +IO

modifyText :: Text -> Text +Err

main i o = write o (modifyText (read i))

Here effects are added to functions with the +EFFECT syntax. Some languages will be able to enforce the contracts; some will not. We get composability back, and monadic concepts from Haskell don't leak into other languages. Implementation signatures would not have to specify effects.

The effect annotations are only specified explicitly in the signatures for functions that perform the effect. And these effects may be implemented only for certain implementations. For example:

add :: Int -> Int
add Hask :: "Integer" -> "Integer"
add Cpp :: "int" -> "int" +Exn

Here the ideal add function will never go wrong. The Hask add function, with integers of arbitrary length, will never go wrong unless memory runs out (and we don't model that sort of error). But the C++ add can go wrong due to overflow, since the type is bounded.

Some functions have intrinsic effects, which will be written in the ideal type, such as read and write IO functions. Implementations are free to impose effects as well. So different realizations of a given ideal composition may have different effects.

This system is not mutually exclusive with monadic error handling. For example:

head    :: [a] -> a +Exn
headMay :: [a] -> Maybe a

References

[1] J. Dunfield & N. R. Krishnaswami, Complete and easy bidirectional typechecking for higher-rank polymorphism. ACM SIGPLAN Notices (ACM, 2013), pp. 429–442.

[2] D. Vytiniotis, S. Peyton Jones, & T. Schrijvers, Let should not be generalized. Proceedings of the 5th ACM SIGPLAN workshop on Types in language design and implementation (ACM, 2010), pp. 39–50.

[3] D. N. Xu, Hybrid contract checking via symbolic simplification. Proceedings of the ACM SIGPLAN 2012 workshop on Partial evaluation and program manipulation (2012), pp. 107–116.

[4] D. Leijen, Koka: Programming with Row Polymorphic Effect Types. Electronic Proceedings in Theoretical Computer Science, 153 (2014) 100–126. https://doi.org/10.4204/eptcs.153.8.