\[ \def\ea{\widehat{\alpha}} \def\eb{\widehat{\beta}} \def\eg{\widehat{\gamma}} \def\sep{ \quad\quad} \newcommand{\mark}[1]{\blacktriangleright_{#1}} \newcommand{\expr}[3]{#1\ \ \vdash\ #2\ \dashv\ \ #3} \newcommand{\packto}[2]{#1\ \approx >\ #2} \newcommand{\apply}[3]{#1 \bullet #2\ \Rightarrow {\kern -1em} \Rightarrow\ #3} \newcommand{\subtype}[2]{#1\ :\leqq\ #2} \newcommand{\braced}[1]{\lbrace #1 \rbrace} \]

1. Intro

Morloc is a strongly-typed functional programming language where functions are imported from foreign languages and unified through a common type system. This language is designed to serve as the foundation for a universal library of functions. Each function in the library has one general type and zero or more implementations. An implementation may be either a function sourced from a foreign language or a composition of such functions. All interop code is generated by the Morloc compiler.

2. Why Morloc?

2.1. Compose functions across languages under a common type system

Morloc allows functions from polyglot libraries to be composed in a simple functional language. The focus isn’t on classic interoperability (e.g., calling Python from C) or serialization (e.g., sending data between applications via protobufs) — though morloc implementations may use these under the hood. Instead, you define types, import implementations, and build complex programs through function composition. The compiler invisibly generates any required interop code.

2.2. Write in your favorite language, share with everyone

Do you want to write in language X but have to write in language Y because everyone in your team does or because your expected users do? Love C for algorithms, R for statistics, but don’t want to write full apps in either? Morloc lets you mix and match, so you can use each language where it shines, with no bindings or boilerplate.

2.3. Run benchmarks and tests across languages

Tired of learning new benchmark and testing suites across all your languages? Is it hard to benchmark similar tools wrapped in applications with varying input formats, input validation costs, or startup overhead? In Morloc, functions with the same general type signature can be swapped in and out for benchmarking and testing. The same test suites and test cases will work across all supported languages because the inputs and outputs of all functions of the same type share equivalent Morloc binary forms, making validation and comparison easy.

2.4. Design universal libraries

With Morloc, we can build abstract libraries using the general types as a logical framework. Then we can import implementations of these functions from one or more of the supported languages and easily test and benchmark them. These libraries are the foundation for an ecosystem where functions may be verified, organized/searched by type, and used to build rigorous programs.

2.5. Make better bioinformatics workflows

Within the bioinformatics space, Morloc can serve as a replacement for the brittle application/file paradigm of workflow design. Replace heavy CLI applications with pure function libraries, ad hoc textual file formats with explicit data structures, and workflow specifications with function compositions. See the first Morloc paper for details (pre-released here).

3. Current status

Morloc is under heavy development in several areas:

  • language support – We need to further standardize the language onboarding process and then start adding new languages beyond the three that are currently supported (Python, R, and C++)

  • type system – There’s lots to do here: sum types, effect handling, constraints, extensible records

  • performance – The shared library implementation lacks proper memory defragmentation, and there is some unnecessary memory copying between languages

  • scaling – I’ve implemented some of the infrastructure and syntax for remote job submission, but more work is needed before it can be used in practice

  • syntax – Pattern matching, custom operators, namespaces, string interpolation, and more are on the roadmap

  • tooling – We need a linter, debugger, dependency manager, and better backend generators that produce better CLI usage statements and programmatic APIs

There is one island of stability, though: the native functions Morloc imports are fully independent of Morloc itself. For a given Morloc program, most of your code will be pure functions in native languages (e.g., Python, C++, or R). This code will never have to change between Morloc versions. Where Morloc will change is in how it describes these native functions, the syntax it uses to compose them, and the particulars of code generation.

Is Morloc ready for production? Maybe. Currently, Morloc has many sharp edges, and new versions may introduce breaking changes. So Morloc is most appropriate right now for adventurous first adopters who can solve problems and write clear issue reports. Morloc may be about one year of full-time work away from v1.0.

Want to contribute? The most helpful thing you can do is join the community (see the Contact section), try out Morloc, and offer feedback on social media or via GitHub issue reports. The community is just starting, and the language is young, so you can strongly influence how the system evolves.

4. Getting Started

4.1. Install the compiler

Warning Not well tested

The easiest way to start using Morloc is through containers. I recommend using podman, since it doesn’t require a daemon or sudo access. But Docker, Singularity, and other container engines are fine as well.

An image with the morloc executable and batteries included can be retrieved from the GitHub container registry as follows:

$ podman pull ghcr.io/morloc-project/morloc/morloc-full:0.54.0

The version tag 0.54.0 may be replaced with the desired Morloc version.

Now you can enter a shell with a full working installation of Morloc:

$ podman run --shm-size=4g \
             -v $HOME:$HOME \
             -w $PWD \
             -e HOME=$HOME \
             -it ghcr.io/morloc-project/morloc/morloc-full:0.54.0 \
             /bin/bash

The --shm-size=4g option sets the shared memory space to 4GB. Morloc uses shared memory for communication between languages, but containers often limit the shared memory space to 64MB by default. By mounting your home directory, the changes you make in the container (including the installation of Morloc modules) will be persistent across sessions.

You can set up a script to run commands in a Morloc environment. To do this, paste the following code into a file named menv:

mkdir -p ~/.morloc
podman run --rm \
           --shm-size=4g \
           -e HOME=$HOME \
           -v $HOME/.morloc:$HOME/.morloc \
           -v $PWD:$HOME \
           -w $HOME \
           ghcr.io/morloc-project/morloc/morloc-full:0.54.0 "$@"

Make it executable (chmod 755 menv) and place it in a bin folder on your PATH (e.g., ~/bin). The script will mount your current working directory and your Morloc home directory, allowing you to run commands in a morloc-compatible environment.

With the menv script, you can run commands like so:

$ menv morloc --version             # get the current morloc version
$ menv morloc -h                    # list morloc commands

This should print the Morloc version and usage info.

Next you need to initialize the Morloc home directory:

$ menv morloc init -f               # setup the morloc environment

This will write required headers to your environment and build the required libraries.

You can install Morloc modules as well:

$ menv morloc install types         # install a morloc module

These modules will be retrieved from GitHub and written into Morloc home.

You can compile Morloc programs within this container as well:

$ menv morloc make -o foo foo.loc   # compile a local morloc module

The last command builds a Morloc program with the executable "foo" from the Morloc script file "foo.loc". The generated executable may not work on your system since it was compiled within the container environment, so you should run it in the container environment as well:

$ menv ./foo bar 1 2 3

More advanced solutions with richer dependency handling will be introduced in the future, but for now this allows easy experimentation with the language in a safe(ish) sandbox.

The menv morloc or menv ./foo syntax is a bit verbose, but I’ll let you play with alternative aliases. The conventions here are still fluid. Let me know if you find something better or if you find bugs in this approach.

4.2. Installing from source

Warning Not well tested

If you want to compile Morloc from source, but don’t want to install a Haskell environment, the following instructions may be helpful.

First clone the Morloc repo:

$ git clone https://github.com/morloc-project/morloc
$ cd morloc

Now, you need a container to build Morloc. Create the following script in your PATH and name it mtest (or whatever you like):

podman run --shm-size=4g \
           --rm \
           -e HOME=$HOME \
           -e PATH="$HOME/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" \
           -v $HOME/.morloc:$HOME/.morloc \
           -v $HOME/.local/bin:$HOME/.local/bin \
           -v $PWD:$HOME \
           -w $HOME \
           ghcr.io/morloc-project/morloc/morloc-test:latest "$@"

Swap out podman for whichever Docker-compatible container engine you prefer.

This script will allow the image to alter your MORLOC_HOME directory and to install the Morloc executable locally (to ~/.local/bin).

With this container, you can build the Morloc executable:

$ mtest stack install

This will build the morloc executable. The stack utility will install a Haskell compiler (ghc) in a local sandbox along with all required Haskell modules. This will take a while the first time you run it.

On success, morloc will be installed to ~/.local/bin. You can test the build like so:

$ mtest morloc -h

You may run the Morloc test suite from here as well:

$ mtest stack test

And you can build Morloc programs:

$ mtest morloc make foo.loc
$ mtest ./nexus foo 1 2 3

As before, you need to run the generated executable in the container environment as well.

4.3. Setting up IDEs

Editor support for Morloc is still a work in progress.

If you are working in vim, you can install Morloc syntax highlighting as follows:

$ mkdir -p ~/.vim/syntax/
$ mkdir -p ~/.vim/ftdetect/
$ curl -o ~/.vim/syntax/loc.vim https://raw.githubusercontent.com/morloc-project/vimmorloc/main/loc.vim
$ echo 'au BufRead,BufNewFile *.loc set filetype=loc' > ~/.vim/ftdetect/loc.vim

Developing a full plugin is left as an exercise for the user (pull requests welcome).

If you are working in VS Code, I’ve made a simple extension that offers syntax highlighting and snippets. You can pull the extension from GitHub and move it into your VS Code extensions folder:

$ git clone https://github.com/morloc-project/vscode ~/.vscode-oss/extensions/morloc

Update the path to the extensions folder as needed on your system. This manually installs the extension, which is not ideal. I’ll push the extension to the official VS Code package manager soon.

4.4. Say hello

The inevitable "Hello World" case is implemented in Morloc like so:

module main (hello)
hello = "Hello up there"

The module named main exports the term hello which is assigned to a literal string value.

Paste this code into a file (e.g. "hello.loc") and then it can be imported by other Morloc modules or directly compiled into a program where every exported term is a subcommand.

$ morloc make hello.loc

This command will produce two files: a C program, nexus.c, and its compiled binary, nexus. The nexus is the command line interface (CLI) to the commands exported from the module.

Calling nexus with no arguments or with the -h flag will print a help message:

$ ./nexus -h
Usage: ./nexus [OPTION]... COMMAND [ARG]...

Nexus Options:
 -h, --help            Print this help message
 -o, --output-file     Print to this file instead of STDOUT
 -f, --output-format   Output format [json|mpk|voidstar]

Exported Commands:
  hello
    return: Str

This usage message is automatically generated. For each exported term, it specifies the input (none, in this case) and output types as inferred by the compiler. For this case, the exported command is just the term hello, so no input types are listed.

The command is called as so:

$ ./nexus hello
Hello up there

4.5. Dice rolling

Let’s write a little program that rolls a pair of 20-sided dice and prints the larger result. Here is the Morloc script:

module dnd (rollAdv)
import types
source Py from "foo.py" ("roll", "max", "narrate")

roll :: Int -> Int -> [Int]
max :: [Int] -> Int
narrate :: Int -> Str

rollAdv = narrate (max (roll 2 20))

Here we define a module named dnd that exports the function rollAdv. In line 2, we import the required type definitions from the Morloc module types. Later on we’ll go into how these types are defined. In line 3, we source three functions from the Python file "foo.py". In lines 5-7, we assign each of these functions a Morloc type signature. You can think of the arrows in the signatures as separating arguments. For example, the function roll takes two integers as arguments and returns a list of integers. The square brackets indicate lists. In the final line, we define the rollAdv function.

The Python functions are sourced from the Python file "foo.py" with the following code:

import random

def roll(n, d):
    # Roll a d-sided die n times, return a list of results
    return [random.randint(1, d) for _ in range(n)]

def narrate(roll_value):
    return f"You rolled a {roll_value!s}"

Nothing about this code is particular to Morloc.
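Because the sourced functions are ordinary Python, the composition that rollAdv describes can be tried directly in Python, without Morloc. The sketch below repeats the two functions from foo.py and adds a helper, roll_adv, which is hypothetical and not part of the original file. It only mirrors the Morloc expression narrate (max (roll 2 20)); it says nothing about the code Morloc actually generates.

```python
import random

def roll(n, d):
    # Roll a d-sided die n times, return a list of results
    return [random.randint(1, d) for _ in range(n)]

def narrate(roll_value):
    return f"You rolled a {roll_value!s}"

def roll_adv():
    # Hypothetical helper mirroring the Morloc definition:
    #   rollAdv = narrate (max (roll 2 20))
    # max is the Python builtin, which is why foo.py does not define it.
    return narrate(max(roll(2, 20)))

print(roll_adv())
```

The printed number varies between runs but always falls between 1 and 20.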

One of Morloc’s core values is that foreign source code never needs to know anything about the Morloc ecosystem. Sourced code should always be nearly idiomatic code that uses normal data types. The inputs and outputs of these functions are natural Python integers, lists, and strings — they are not Morloc-specific serialized data or ad hoc textual formats.

We can compile and run this program as so:

$ morloc make main.loc
$ ./nexus rollAdv
"You rolled a 20"

Because the rolls are random, the result may differ from run to run.

So, what’s the point? We could have done this more easily in a pure Python script. Morloc generates a CLI for us, type checks the program, and performs some runtime validation (by default, just on the final inputs and outputs). But there are other tools in the Python universe that can achieve this same end. Where Morloc is uniquely valuable is in the polyglot setting.

4.6. Polyglot dice rolling

In this next example, we rewrite the prior dice example with all three functions being sourced from different languages:

module dnd (rollAdv)

import types

source R from "foo.R" ("roll")
source Cpp from "foo.hpp" ("max")
source Py from "foo.py" ("narrate")

roll :: Int -> Int -> [Int]
max :: [Int] -> Int
narrate :: Int -> Str

rollAdv = narrate (max (roll 2 20))

Note that all of this code is exactly the same as in the prior example except the source statements.

The roll function is defined in R:

roll <- function(n, d){
    sample(1:d, n, replace = TRUE)
}

The max function is defined in C++:

#pragma once
#include <vector>
#include <algorithm>

template <typename A>
A max(const std::vector<A>& xs) {
    return *std::max_element(xs.begin(), xs.end());
}

The narrate function is defined in Python:

def narrate(roll_value):
    return f"You rolled a {roll_value!s}"

This can be compiled and run in exactly the same way as the prior monoglot example. It will run a bit slower, mostly because of the heavy cost of starting the R interpreter.

The Morloc compiler automatically generates all code required to translate data between the languages. Exactly how this is done will be discussed later.

4.7. Parallelism example

Here is an example showing a parallel map function written in Python that calls C++ functions.

module m (sumOfSums)

import types

source Py from "foo.py" ("pmap")
source Cpp from "foo.hpp" ("sum")

pmap a b :: (a -> b) -> [a] -> [b]
sum :: [Real] -> Real

sumOfSums = sum . pmap sum

This Morloc script exports a function that sums a list of lists of real numbers. Here we use the dot operator for function composition. The sum function is implemented in C++:

// C++ header sourced by morloc script
#pragma once
#include <vector>

double sum(const std::vector<double>& vec) {
    double sum = 0.0;
    for (double value : vec) {
        sum += value;
    }
    return sum;
}

The parallel pmap function is written in Python:

# Python3 file sourced by morloc script
import multiprocessing as mp

def pmap(f, xs):
    with mp.Pool() as pool:
        results = pool.map(f, xs)
    return results

The inner summation jobs will be run in parallel. The pmap function has the same signature as the non-parallel map function, so can serve as a drop-in replacement.
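The drop-in property can be checked in plain Python. This is a minimal sketch, not the code Morloc generates: it swaps multiprocessing for a thread pool from concurrent.futures so that it runs the same way on every platform, and the names seq_map, pmap_threads, and sum_of_sums are hypothetical stand-ins. Python's builtin sum plays the role of the C++ sum.

```python
from concurrent.futures import ThreadPoolExecutor

def seq_map(f, xs):
    # Sequential reference with the shape (a -> b) -> [a] -> [b]
    return [f(x) for x in xs]

def pmap_threads(f, xs):
    # Same signature as seq_map; a thread pool stands in for mp.Pool here
    with ThreadPoolExecutor() as pool:
        return list(pool.map(f, xs))

def sum_of_sums(mapper, xss):
    # Mirrors the Morloc definition: sumOfSums = sum . pmap sum
    return sum(mapper(sum, xss))

data = [[1, 2], [3, 4, 5]]
print(sum_of_sums(seq_map, data), sum_of_sums(pmap_threads, data))  # 15 15
```

Since both mappers take a function and a list and return a list in input order, either can be passed to sum_of_sums without changing anything else.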

This can be compiled and run with the lists being provided in JSON format:

$ morloc make main.loc
$ ./nexus sumOfSums '[[1,2],[3,4,5]]'

5. Syntax and Features

5.1. Source functions from foreign languages

In Morloc, you can import functions from many languages and compose them under a common type system. The syntax for importing functions from source files is as follows:

source Cpp from "foo.hpp" ("map", "sum", "snd")
source Py from "foo.py" ("map", "sum", "snd")

This brings the functions map, sum, and snd into scope in the Morloc script. Each of these functions must be defined in the C++ and Python scripts. For Python, since map and sum are builtins, only snd needs to be defined. So the foo.py file only requires the following two lines:

def snd(pair):
    return pair[1]

The C++ file, foo.hpp, may be implemented as a simple header file with generic implementations of the three required functions.

#pragma once
#include <vector>
#include <tuple>

// map :: (a -> b) -> [a] -> [b]
template <typename A, typename B, typename F>
std::vector<B> map(F f, const std::vector<A>& xs) {
    std::vector<B> result;
    result.reserve(xs.size());
    for (const auto& x : xs) {
        result.push_back(f(x));
    }
    return result;
}

// snd :: (a, b) -> b
template <typename A, typename B>
B snd(const std::tuple<A, B>& p) {
    return std::get<1>(p);
}

// sum :: [a] -> a
template <typename A>
A sum(const std::vector<A>& xs) {
    A total = A{0};
    for (const auto& x : xs) {
        total += x;
    }
    return total;
}

Note that these implementations are completely independent of Morloc — they have no special constraints, they operate on perfectly normal native data structures, and their usage is not limited to the Morloc ecosystem. The Morloc compiler is responsible for mapping data between the languages. But to do this, Morloc needs a little information about the function types. This is provided by the general type signatures, like so:

map a b :: (a -> b) -> [a] -> [b]
snd a b :: (a, b) -> b
sum :: [Real] -> Real

The syntax for these type signatures is inspired by Haskell, with the exception that generic terms (a and b here) must be declared on the left. Square brackets represent homogeneous lists, parenthesized comma-separated values represent tuples, and arrows represent functions. In the map type, (a → b) is a function from generic value a to generic value b; [a] is the input list of initial values; [b] is the output list of transformed values.

Removing the syntactic sugar for lists and tuples, the signatures may be written as:

map a b :: (a -> b) -> List a -> List b
snd a b :: Tuple2 a b -> b
sum :: List Real -> Real

These signatures provide the general types of the functions. But one general type may map to multiple native, language-specific types. So we need to provide an explicit mapping from general to native types.

type Cpp => List a = "std::vector<$1>" a
type Cpp => Tuple2 a b = "std::tuple<$1,$2>" a b
type Cpp => Real = "double"
type Py => List a = "list" a
type Py => Tuple2 a b = "tuple" a b
type Py => Real = "float"

These type functions guide the synthesis of native types from general types. Take the C++ mapping for List a as an example. The basic C++ list type is vector from the standard template library. After the Morloc typechecker has solved for the type of the generic parameter a, and recursively converted it to C++, its type will be substituted for $1. So if a is inferred to be a Real, it will map to the C++ double, and then be substituted into the list type yielding std::vector<double>. This type will be used in the generated C++ code.
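The substitution step can be sketched in a few lines of Python. This is only an illustration of the idea, not the compiler's implementation: the table CPP_TYPES copies the C++ mappings listed above, while the function to_native and the (name, parameters) encoding of types are assumptions made for the example.

```python
# C++ type functions copied from the mappings above:
# general type name -> native template
CPP_TYPES = {
    "List": "std::vector<$1>",
    "Tuple2": "std::tuple<$1,$2>",
    "Real": "double",
}

def to_native(general, table):
    # A general type is modelled as (name, [parameter types]).
    name, params = general
    native = table[name]
    # Convert each parameter recursively, then substitute it for $1, $2, ...
    # Work from the highest index down so "$1" never clobbers "$10".
    for i in range(len(params), 0, -1):
        native = native.replace(f"${i}", to_native(params[i - 1], table))
    return native

real = ("Real", [])
print(to_native(("List", [real]), CPP_TYPES))
# std::vector<double>
print(to_native(("List", [("Tuple2", [real, ("List", [real])])]), CPP_TYPES))
# std::vector<std::tuple<double,std::vector<double>>>
```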

5.2. Functions

Functions are defined with arguments separated by whitespace:

foo x = g (f x)

Here foo is the Morloc function name and x is its first argument.

Morloc supports the . operator for composition, so we can re-write foo as:

foo = g . f

Morloc supports partial application of arguments.

For example, to multiply every element in a list by 2, we can write:

multiplyByTwo = map (mul 2.0)
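For readers more familiar with Python, composition and partial application look roughly like this with functools. This is an analogy only. The helpers compose and map_list are hypothetical, and f and g are arbitrary stand-ins for the functions in foo = g . f.

```python
from functools import partial, reduce
import operator

def compose(*fs):
    # compose(g, f)(x) == g(f(x)), like the Morloc dot operator
    return reduce(lambda g, f: lambda x: g(f(x)), fs)

def map_list(f, xs):
    # List-returning map, so the result can be printed directly
    return [f(x) for x in xs]

# multiplyByTwo = map (mul 2.0)
multiply_by_two = partial(map_list, partial(operator.mul, 2.0))

# foo = g . f, with hypothetical stand-ins for g and f
f = partial(operator.add, 1)
g = partial(operator.mul, 10)
foo = compose(g, f)

print(multiply_by_two([1.0, 2.5]))  # [2.0, 5.0]
print(foo(2))                       # 30
```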

5.3. Modules

A module includes all the code defined under the module <module_name> statement. It can be imported with the import command.

The following module defines the constant x and exports it.

module foo (x)
x = 42

Another module can import foo:

import foo (x)

...

A term may be imported from multiple modules. For example:

module main (add)
import cppbase (add)
import pybase (add)
import rbase (add)

This module imports the C++, Python, and R add functions and exports all of them. Modules that import add will import three different versions of the function. The compiler will choose which to use.

5.4. Docstrings and toolboxes

Morloc has early support for docstrings in comments that propagate to the generated CLI code.

For example:

module main (add, sum)

import types

source Py from "main.py" ("add", "sum")

--' Add two floats
add :: Real -> Real -> Real

--' Sum a list of floats
sum :: [Real] -> Real

The special comment --' introduces a docstring that is attached to the following type signature and will be propagated through to the code generated by the backend.

$ morloc make main.loc
$ ./nexus -h
Usage: ./nexus [OPTION]... COMMAND [ARG]...

Nexus Options:
 -h, --help            Print this help message
 -o, --output-file     Print to this file instead of STDOUT
 -f, --output-format   Output format [json|mpk|voidstar]

Exported Commands:
  add   Add two floats
          param 1: Real
          param 2: Real
          return: Real
  sum   Sum a list of floats
          param 1: [Real]
          return: Real

5.5. User arguments and outputs

User data is passed to Morloc executables as positional arguments to the specified function subcommand. Each argument may be a literal JSON string or a filename. For files, the format may be JSON, MessagePack, or Morloc binary (VoidStar). The Morloc nexus first checks for a ".json" extension; if found, the nexus attempts to parse the file as JSON. Next, the nexus checks for a ".mpk" or ".msgpack" extension; if found, it attempts to parse the file as MessagePack. If neither extension is found, it attempts to parse the file first as Morloc binary, then as MessagePack, and finally as JSON. See the parse_cli_data_argument function in morloc.h for details.
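The order in which file formats are tried can be summarized as a small decision function. This Python sketch is an assumption-laden paraphrase of the rules above: it does not parse anything and is not a translation of parse_cli_data_argument. The name formats_to_try and the format labels are hypothetical, and it covers only the filename case, not literal JSON arguments.

```python
import os

def formats_to_try(filename):
    # Hypothetical helper: returns the parsers to attempt, in order,
    # following the extension rules described above.
    ext = os.path.splitext(filename)[1]
    if ext == ".json":
        return ["json"]
    if ext in (".mpk", ".msgpack"):
        return ["msgpack"]
    # No recognized extension: Morloc binary, then MessagePack, then JSON
    return ["voidstar", "msgpack", "json"]

print(formats_to_try("baz.json"))   # ['json']
print(formats_to_try("data.mpk"))   # ['msgpack']
print(formats_to_try("data.vs"))    # ['voidstar', 'msgpack', 'json']
```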

Passing literal JSON on the command line can be a little unintuitive since extra quoting may be required. Here are a few examples:

# The Bash shell removes the outer quotes, so double quoting is required
$ ./nexus foo '"this is a literal string"'

# Single-quoting a list is fine, but inner strings still need double quotes
$ ./nexus bar '["asdf", "df"]'

# By default, output is written in JSON format
$ ./nexus baz 1 2 3 > baz.json

# The output can be directly read by a downstream morloc program
$ ./nexus bif baz.json
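If the quoting rules feel error-prone, one option is to let a script build the command line. The sketch below uses Python's json.dumps to encode each value and shlex.quote to protect it from the shell. The helper name nexus_command is hypothetical, and it only builds the command string; it does not run anything.

```python
import json
import shlex

def nexus_command(subcommand, *args):
    # Hypothetical helper: JSON-encode each argument, then shell-quote it
    quoted = [shlex.quote(json.dumps(arg)) for arg in args]
    return " ".join(["./nexus", subcommand] + quoted)

print(nexus_command("foo", "this is a literal string"))
# ./nexus foo '"this is a literal string"'
print(nexus_command("bar", ["asdf", "df"]))
# ./nexus bar '["asdf", "df"]'
print(nexus_command("baz", 1, 2, 3))
# ./nexus baz 1 2 3
```

The first two results match the hand-quoted commands above, and plain numbers need no quoting at all.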

Data may be written to MessagePack or VoidStar via the -f argument:

$ ./nexus -f voidstar head '[["some","random"],["data"]]' > data.vs
$ ./nexus -f json head data.vs > data.json
$ ./nexus -f mpk reverse data.json > data.mpk
$ ./nexus reverse data.mpk
"some"

The VoidStar format is the richest and is the only form that contains the schema describing the data.

5.6. Mapping general types to native types

When a function is sourced from a foreign language, Morloc needs to know how Morloc general types map to the function’s native types. This information is encoded in language-specific type functions. For example:

type R => Bool = "logical"
type Py => Bool = "bool"
type Cpp => Bool = "bool"

type R => Int32 = "integer"
type Py => Int32 = "int"
type Cpp => Int32 = "int32_t"

Language-specific types are always quoted since they may contain syntax that is illegal in the Morloc language.

Consider an integer addition function addi:

addi :: Int32 -> Int32 -> Int32

This can be automatically mapped to a C++ function with the prototype int32_t addi(int32_t x, int32_t y).

Containers can be similarly mapped to native types:

type Py => List a = "list" a
type Cpp => List a = "std::vector<$1>" a

The $1 symbol is used to represent the interpolation of the first parameter into the native type. So the Morloc type List Int32 would translate to std::vector<int32_t> in C++.

5.7. Records and tables

Morloc has dedicated support for defining records and tables.

Here is a record example:

module foo (incAge)

import types (Int, Str)

source R from "foo.R" ("incAge")

record Person = Person
    { name :: Str
    , age :: Int
    }

-- Increment the person's age
incAge :: Person -> Person

Where the "foo.R" file contains the function:

incAge <- function(person){
    person$age <- person$age + 1
    person
}

This may be compiled and run like so:

$ morloc make foo.loc
$ ./nexus incAge '{"name":"Alice","age":32}'

Tables are similar, but all fields are lists of equal length:

module foo (readPeople, addPeople)

import types (Int, Str)

source R from "people-tables.R"
   ( "read.delim" as readPeople
   , "addPeople")

table People = People
    { name :: Str
    , age :: Int
    }

readPeople :: Filename -> People
addPeople :: [Str] -> [Int] -> People -> People

With "people-tables.R" containing:

addPeople <- function(names, ages, df){
    rbind(df, data.frame(name = names, age = ages))
}

This can be compiled and run as so:

# read a tab-delimited file containing person rows
./nexus readPeople data.tab > people.json

# add a row to the table
./nexus addPeople '["Eve"]' '[99]' people.json

The record and table types are currently excessively strict. Defining functions that add or remove fields/columns requires defining entirely new records/tables. Generic functions for operations such as removing lists of columns cannot be defined at all. Future versions of Morloc will have more flexible tables, but for now most operations should be done in coarser functions. Alternatively, custom non-parameterized tabular/record types may be defined.

The case study in the Morloc paper uses a JsonObj type that represents an arbitrarily nested object that serializes to/from JSON. In Python, it deserializes to a dict object; in R, to a list object; and in C++, to an ordered_json object from Niels Lohmann’s json package.

A similar approach could be used to define a non-parameterized table type that serializes to CSV or some binary type (such as Parquet).

These non-parameterized solutions are flexible and easy to use, but lack the reliability of the typed structures.

5.8. Type hierarchies

In some cases, there is a single obvious native type for a given Morloc general type. For example, most languages have exactly one reasonable way to represent a boolean. However, other data types may have many forms. The Morloc List is a simple example. In Python, the list type is most often used for representing ordered lists; however, it is inefficient for heavy numeric problems. In such cases, it is better to use a NumPy array. Further, there are data structures that are isomorphic to lists but that are more efficient for certain problems, such as stacks and queues.

We can define type hierarchies that represent these relationships.

-- aliases at the general level
type Stack       a = List a
type LList       a = List a
type ForwardList a = List a
type Deque       a = List a
type Queue       a = List a
type Vector      a = List a


-- define a C++ specialization for each special type
type Cpp => Stack a = "std::stack<$1>" a
type Cpp => LList a = "std::list<$1>" a
type Cpp => ForwardList a = "std::forward_list<$1>" a
type Cpp => Deque a = "std::deque<$1>" a
type Cpp => Queue a = "std::queue<$1>" a

Here we equate each of the specialized containers with the general List type. This indicates that they all share the same common form and can all be converted to the same binary. Then we specify language-specific patterns as desired. When the Morloc compiler seeks a native form for a type, it evaluates these type functions in incremental steps. At each step, the compiler first checks whether there is a direct native mapping for the language; if none is found, it evaluates the general type function.
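The stepwise lookup can be sketched in Python as follows. This is a hedged illustration rather than the compiler's algorithm. The ALIASES and NATIVE tables restate the type functions shown in this document (the List entries come from the earlier mapping examples), and the function resolve is a hypothetical name.

```python
# General-level aliases: specialized name -> general type it reduces to
ALIASES = {
    "Stack": "List", "LList": "List", "ForwardList": "List",
    "Deque": "List", "Queue": "List", "Vector": "List",
}

# Native mappings per language, restated from the type functions in this document
NATIVE = {
    "Cpp": {
        "Stack": "std::stack<$1>",
        "LList": "std::list<$1>",
        "ForwardList": "std::forward_list<$1>",
        "Deque": "std::deque<$1>",
        "Queue": "std::queue<$1>",
        "List": "std::vector<$1>",
    },
    "Py": {"List": "list"},
}

def resolve(name, lang):
    # At each step, prefer a direct native mapping for the language;
    # otherwise take one step along the general alias and try again.
    while True:
        native = NATIVE.get(lang, {}).get(name)
        if native is not None:
            return native
        if name not in ALIASES:
            raise KeyError(f"no {lang} mapping for {name}")
        name = ALIASES[name]

print(resolve("Stack", "Cpp"))   # std::stack<$1>
print(resolve("Vector", "Cpp"))  # std::vector<$1>
print(resolve("Deque", "Py"))    # list
```

Vector has no C++ specialization in the listing above, so it falls through to the List mapping. Every specialized type falls through in Python for the same reason.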

Native type annotations are also passed to the language binders, allowing them to implement specialized behavior for more efficient conversion to binary.

5.9. One term may have many definitions

Morloc supports what might be called term polymorphism. Each term may have many definitions. For example, the function mean has three definitions below:

import base (sum, div, size, fold, add)
import types
source Cpp from "mean.hpp" ("mean")
mean :: [Real] -> Real
mean xs = div (sum xs) (size xs)
mean xs = div (fold 0 add xs) (size xs)

Here mean has three definitions: it is sourced directly from C++, it is defined in terms of the sum function, and it is defined more generally with sum written as a fold operation. The Morloc compiler is responsible for deciding which implementation to use.

The equals operator in Morloc indicates functional substitutability. When you say a term is "equal" to something, you are giving the compiler an option for what may be substituted for the term. The function mean, for example, has many functionally equivalent definitions. They may be in different languages, or they may be more optimal in different situations.
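To make substitutability concrete, here are the two composed definitions of mean transcribed into Python. This is only an analogy. The names mean_sum and mean_fold are hypothetical, Python's builtin sum, len, and functools.reduce stand in for the Morloc sum, size, and fold, and the argument order of reduce need not match Morloc's fold.

```python
from functools import reduce
import operator

def mean_sum(xs):
    # mean xs = div (sum xs) (size xs)
    return sum(xs) / len(xs)

def mean_fold(xs):
    # mean xs = div (fold 0 add xs) (size xs)
    return reduce(operator.add, xs, 0) / len(xs)

xs = [1.0, 2.0, 6.0]
print(mean_sum(xs), mean_fold(xs))  # 3.0 3.0
```

Either function could replace the other at any call site, which is the property the equals operator asserts.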

Now this ability to simply state that two things are the same can be abused. The following statement is syntactically allowed in Morloc:

x = 1
x = 2

What is x after this code is run? It is 1 or 2. The latter definition does not mask the former; it is added alongside it. In this case, the two values are certainly not substitutable. Morloc has a simple value checker that will catch this type of primitive contradiction. However, the value checker cannot yet catch more nuanced errors, such as:

x = div 1 (add 1 1)
x = div 2 1

In this case, the value checker cannot look within the implementation of add, so it cannot know that there is a contradiction. For this reason, some care is needed in making these definitions.

5.10. Overload terms with typeclasses

In addition to term polymorphism, Morloc offers more traditional ad hoc polymorphism over types. Here typeclasses may be defined and type-specific instances may be given. This idea is similar to typeclasses in Haskell, traits in Rust, interfaces in Java, and concepts in C++.

In the example below, Addable and Foldable classes are defined and used to create a polymorphic sum function.

class Addable a where
    zero a :: a
    add a :: a -> a -> a

instance Addable Int where
    source Py from "arithmetic.py" ("add")
    source Cpp from "arithmetic.hpp" ("add")
    zero = 0

instance Addable Real where
    source Py from "arithmetic.py" ("add")
    source Cpp from "arithmetic.hpp" ("add")
    zero = 0.0

class Foldable f where
    foldr a b :: (a -> b -> b) -> b -> f a -> b

instance Foldable List where
    source Py from "foldable.py" ("foldr")
    source Cpp from "foldable.hpp" ("foldr")

sum = foldr add zero

The instances may import implementations for many languages.

The native functions may themselves be polymorphic, so the imported implementations may be repeated across many instances. For example, the Python add may be written as:

def add(x, y):
    return x + y

And the C++ add as:

template <class A>
A add(A x, A y){
    return x + y;
}
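The contents of foldable.py are not shown here. As a minimal sketch, assuming a Python list is the native form of List, a foldr matching the general signature (a -> b -> b) -> b -> f a -> b could look like the following. The sum_ints helper is a hypothetical stand-in for the Morloc definition sum = foldr add zero with the Int instance.

```python
def add(x, y):
    return x + y


def foldr(f, init, xs):
    """Right fold: f(x1, f(x2, ... f(xn, init)))."""
    acc = init
    # Walk the list from the right so the last element meets init first
    for x in reversed(xs):
        acc = f(x, acc)
    return acc


def sum_ints(xs, zero=0):
    """Hypothetical equivalent of `sum = foldr add zero` for Addable Int."""
    return foldr(add, zero, xs)


print(sum_ints([1, 2, 3]))  # 6
```

Because add is written generically, the same Python function serves both the Int and Real instances; only the zero value differs.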

Typeclasses work currently, but they are not yet in their final form. They cannot be directly imported and they are not explicit in type signatures. I would be happy to hear your thoughts on Morloc typeclasses. Getting them right is crucial to the grand structure of the future Morloc library.

5.11. Binary forms

Every Morloc general type maps unambiguously to a binary form that consists of several fixed-width literal types, a list container, and a tuple container. The literal types include a unit type, a boolean, signed integers (8, 16, 32, and 64 bit), unsigned integers (8, 16, 32, and 64 bit), and IEEE floats (32 and 64 bit). The list container is represented by a 64-bit size integer and a pointer to an unboxed vector. The tuple is represented as a set of values in contiguous memory. These basic types are listed below:

Table 1. Morloc primitives

| Type | Domain | Schema | Width (bytes) |
|---|---|---|---|
| Unit | () | z | 1 |
| Bool | True \| False | b | 1 |
| UInt8 | \([0,2^{8})\) | u1 | 1 |
| UInt16 | \([0,2^{16})\) | u2 | 2 |
| UInt32 | \([0,2^{32})\) | u4 | 4 |
| UInt64 | \([0,2^{64})\) | u8 | 8 |
| Int8 | \([-2^{7},2^{7})\) | i1 | 1 |
| Int16 | \([-2^{15},2^{15})\) | i2 | 2 |
| Int32 | \([-2^{31},2^{31})\) | i4 | 4 |
| Int64 | \([-2^{63},2^{63})\) | i8 | 8 |
| Float32 | IEEE float | f4 | 4 |
| Float64 | IEEE double | f8 | 8 |
| List x | homogeneous lists | a{x} | \(16 + n \Vert x \Vert\) |
| Tuple2 x1 x2 | 2-tuples | t2{x1}{x2} | \(\Vert x1 \Vert + \Vert x2 \Vert\) |
| TupleX \(t_1\ ...\ t_k\) | k-tuples | \(tkt_1\ ...\ t_k\) | \(\sum_{i=1}^{k} \Vert t_i \Vert\) |
| \(\{ f_1 :: t_1,\ ... \ , f_k :: t_k \}\) | records | \(mk \Vert f_1 \Vert f_1 t_1\ ...\ \Vert f_k \Vert f_k t_k \) | \(\sum_{i=1}^{k} \Vert t_i \Vert\) |

All basic types may be written to a schema that is used internally to direct conversions between Morloc binary and native basic types. The schema values are shown in the table above. For example, the type [(Bool, [Int8])] would have the schema at2bai1. You will not usually have to worry about these schemas, since they are mostly used internally. They are worth knowing, though, since they appear in low-level tests, generated source code, and binary data packets.
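To make the schema rule concrete, here is a minimal Python sketch that builds schema strings for primitives, lists, and tuples from the codes in the table above. The nested-tuple encoding of types and the schema function are hypothetical conveniences for this example and are not part of Morloc. Records are omitted, and tuples are assumed to have fewer than ten elements so that the element count is a single digit.

```python
# Schema codes for the primitive types, taken from Table 1
PRIMITIVES = {
    "Unit": "z", "Bool": "b",
    "UInt8": "u1", "UInt16": "u2", "UInt32": "u4", "UInt64": "u8",
    "Int8": "i1", "Int16": "i2", "Int32": "i4", "Int64": "i8",
    "Float32": "f4", "Float64": "f8",
}


def schema(t):
    """Return the schema string for a type description.

    Hypothetical encoding used only in this sketch:
      "Bool"                    a primitive, by name
      ("List", x)               a list with element type x
      ("Tuple", [x1, ..., xk])  a k-tuple
    """
    if isinstance(t, str):
        return PRIMITIVES[t]
    tag, arg = t
    if tag == "List":
        return "a" + schema(arg)
    if tag == "Tuple":
        if not 2 <= len(arg) <= 9:
            raise ValueError("this sketch only handles tuples of 2 to 9 elements")
        return "t" + str(len(arg)) + "".join(schema(x) for x in arg)
    raise ValueError(f"unsupported type: {t!r}")


# The type [(Bool, [Int8])] from the text
print(schema(("List", ("Tuple", ["Bool", ("List", "Int8")]))))  # at2bai1
```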

Here is an example of how the type ([UInt8], Bool), with the value ([3,4,5],True), might be laid out in memory:

---
03 00 00 00 00 00 00 00 -- first tuple element, specifies list length (little-endian)
30 00 00 00 00 00 00 00 -- first tuple element, pointer to list
01 00 00 00 00 00 00 00 -- second tuple element, with 0-padding
03 04 05                -- 8-bit values of 3, 4, and 5
---
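This layout can be reproduced with Python's struct module. The following is a minimal sketch, assuming little-endian fields, a 64-bit length followed by a 64-bit pointer for the list, and a Bool padded out to 8 bytes. The helper name and its data_ptr argument are hypothetical. The pointer value 0x30 is simply copied from the listing, since the real value depends on where the element data is placed.

```python
import struct


def layout_u8list_bool(values, flag, data_ptr=0x30):
    """Lay out a ([UInt8], Bool) value as in the listing above (a sketch)."""
    # List slot: 64-bit element count, then 64-bit pointer to the elements
    head = struct.pack("<QQ", len(values), data_ptr)
    # Bool slot: one byte for the value, seven zero bytes of padding
    tail = struct.pack("<?7x", flag)
    # Element data: one byte per UInt8 value
    return head + tail + bytes(values)


blob = layout_u8list_bool([3, 4, 5], True)
print(blob.hex(" "))
```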

Records and tables are represented in Morloc binary as tuples. The field names are stored only in the type schemas.

A record is a named, heterogeneous list such as a struct in C, a dict in Python, or a list in R. The type of the record exactly describes the data stored in the record (in contrast to parameterized types like [a] or Map a b).

A table is like a record whose field types are the column types and whose fields are all equal-length lists. But a table is not just syntactic sugar for a record of lists. The table annotation is passed with the record through the compiler all the way to the translator, where the language-specific serialization functions may have special handling for tables.

Records and tables may be defined as shown below:

record Person = Person { name :: Str, age :: UInt8 }
table People = People { name :: Str, age :: Int }

alice = { name = "Alice", age = 27 }
students = { name = ["Alice", "Bob"], age = [27, 25] }

The Morloc type signatures can be translated to schema strings that may be parsed by a foundational Morloc C library into a type structure. Every supported language in the Morloc ecosystem must provide a library that wraps this Morloc C library and translates to/from Morloc binary given the Morloc type schema.

5.12. Defining non-primitive types

Types that are composed entirely of Morloc primitives, lists, tuples, records and tables may be directly and unambiguously translated to Morloc binary forms and thus shared between languages. But what about types that do not break down cleanly into these forms? For example, consider the parameterized Map k v type that represents a collection with keys of generic type k and values of generic type v. This type may have many representations, including a list of pairs, a pair of columns, a binary tree, and a hashmap. In order for Morloc to know how to convert all Map types in all languages to one form, it must know how to express the Map type in terms of more primitive types. The user can provide this information by defining instances of the Packable typeclass for Map. This typeclass defines two functions, pack and unpack, that construct and deconstruct a complex type.

class Packable a b where
    pack a b :: a -> b
    unpack a b :: b -> a

The Map type for Python and C++ may be defined as follows:

type Py => Map key val = "dict" key val
type Cpp => Map key val = "std::map<$1,$2>" key val
instance Packable ([a],[b]) (Map a b) where
    source Cpp from "map-packing.hpp" ("pack", "unpack")
    source Py from "map-packing.py" ("pack", "unpack")

The Morloc user never needs to directly apply the pack and unpack functions. Rather, these are used by the compiler within the generated code. The compiler constructs a serialization tree from the general type and from this tree generates the native code needed to (un)pack types recursively until only primitive types remain. These may then be directly translated to Morloc binary using the language-specific binding libraries.
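For illustration, the Python side of map-packing.py might look like the sketch below. The actual contents of that file are not shown in this document, so the function bodies are an assumption. Only the names pack and unpack, and the direction of each conversion, come from the instance declaration above.

```python
def pack(pair):
    """([a],[b]) -> Map a b: build a dict from parallel key and value lists."""
    keys, vals = pair
    if len(keys) != len(vals):
        raise ValueError("keys and values must have the same length")
    return dict(zip(keys, vals))


def unpack(m):
    """Map a b -> ([a],[b]): split a dict into parallel key and value lists."""
    return (list(m.keys()), list(m.values()))


print(unpack(pack((["a", "b"], [1, 2]))))
```

The unpacked form, a pair of lists, consists only of types that map directly to Morloc binary, which is what lets the compiler finish the translation without further help.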

In some cases, the native type may not be as generic as the general type. Or you may want to add specialized (un)packers. In such cases, you can define more specialized instances of Packable. For example, if the R Map type is defined as an R list, then keys can only be strings. Any other type should raise an error. So we can write:

type R => Map key val = "list" key val
instance Packable ([Str],[b]) (Map Str b) where
    source R from "map-packing.R" ("pack", "unpack")

Now whenever the key generic type of Map is inferred to be anything other than a string, all R implementations will be pruned.

5.13. The universal library

A module may export types, typeclasses, and function signatures but no implementations. Such a module would be completely language agnostic. A powerful approach to building libraries in the Morloc ecosystem is to write one module that defines all types, then \(n\) modules for language-specific implementations that import the type module, and then one module to import and merge all implementations. This is the approach taken by the base module and by other core libraries.

In the future, when hundreds of languages are supported, and when possibly some functions may even have many implementations per language, it will be desirable to have finer control over what functions are used. One solution would be to add filters to the import statement. Thus the import expressions would be a sort of query. Alternatively, constraints could be added at the function level, and thus the entire Morloc script would be a query over the universal library. This would be especially powerful when imported types are expressed as unknowns to be inferred by usage.

6. Internals

6.1. Packet protocols

A Morloc program compiles into a single "nexus" file and a "pool" file for each language. The nexus program accepts user input, dispatches to the pools, and formats results. The pools "pool" all functions from each specific language. When a command is sent to the nexus, it initializes the pools as background processes, pool daemons. Then the nexus sends the given information to the pool daemon that contains the top function in the composition.

Data is passed between the nexus and the pool daemons, and between the daemons themselves, using a combination of UNIX domain socket messages and shared memory storage.

Packets follow a binary protocol that is defined in the data/morloc.h file in the main Morloc repository.

Each packet has a 32-byte header with the following fields:

Table 2. Packet header specification

| Field | Type | Width (bytes) | Description |
|---|---|---|---|
| magic | unsigned int | 4 | Morloc-specific constant: 6D F8 07 07 (mo-ding-ding) |
| plain | unsigned int | 2 | Morloc "plain" (see below for description) |
| version | unsigned int | 2 | Packet version |
| flavor | unsigned int | 2 | Metadata convention |
| mode | unsigned int | 2 | Evaluation mode (e.g., debug or verbose) |
| command | 8-byte union | 8 | Packet description |
| offset | unsigned int | 4 | Length of the metadata block |
| length | unsigned int | 8 | Length of the data payload |
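The field widths above add up to 32 bytes, which can be checked with a short Python sketch. It assumes little-endian integers, no padding between fields, and that the magic constant is stored as the four bytes in the order shown. These are assumptions made for illustration; the authoritative definition is the one in data/morloc.h. The helper names are hypothetical.

```python
import struct

# Assumption: the magic bytes appear in the packet in this order
MAGIC = bytes([0x6D, 0xF8, 0x07, 0x07])

# magic(4) plain(2) version(2) flavor(2) mode(2) command(8) offset(4) length(8)
HEADER = struct.Struct("<4sHHHH8sIQ")


def make_header(command, offset, length, plain=0, version=0, flavor=0, mode=0):
    """Build a 32-byte header around an 8-byte command field."""
    if len(command) != 8:
        raise ValueError("the command field must be exactly 8 bytes")
    return HEADER.pack(MAGIC, plain, version, flavor, mode, command, offset, length)


def read_header(raw):
    """Parse the first 32 bytes of a packet into a dict of fields."""
    fields = HEADER.unpack(raw[:HEADER.size])
    names = ("magic", "plain", "version", "flavor", "mode",
             "command", "offset", "length")
    header = dict(zip(names, fields))
    if header["magic"] != MAGIC:
        raise ValueError("not a Morloc packet")
    return header


raw = make_header(command=bytes(8), offset=0, length=3)
print(len(raw))  # 32
```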

The Morloc plain specifies membership in a special set of Morloc libraries that follows a certain set of conventions or requirements. This is broader than a namespace. An example of a plain would be a "Safe" plain where all functions are verified in some fashion. In this case, it may be required that any packets that are read into a program should also have been created by a member of the Safe plain. Handling for plains is not yet implemented.

The command specifies the type of packet. There are currently three: data, call, and ping.

A data packet represents data and how it is stored. These packets have three main uses. First, they are used to store arguments that are passed between the nexus and pool daemons. Second, they transport data within the pools that has not been transformed into native structures. This allows data to be efficiently transferred between languages without being "naturalized" unless needed. Third, data packets may be written by the nexus as the output of a program.

Table 3. Data packet command field specification

| Field | Type | Width (bytes) | Description |
|---|---|---|---|
| type | char | 1 | Constant 0 - for "data" type |
| source | char | 1 | Source type (e.g., file, message, pointer to shared memory) |
| format | char | 1 | Data format (e.g., JSON, MessagePack, Text, Voidstar) |
| compression | char | 1 | Compression algorithm |
| encryption | char | 1 | Encryption algorithm |
| status | char | 1 | Pass or fail status |
| padding | char[2] | 2 | Zero-padding |

The source field states "where" the data is. It might be stored literally in the packet itself. It might be stored in a file, in which case the packet stores only the filename. It might be stored in the shared memory volumes, in which case the packet stores a relative pointer to the memory location. The format field stores the data format. It might be JSON data, MessagePack data, literal text (e.g., for error messages), or the Morloc binary (which I sometimes call Voidstar). The compression and encryption fields are not currently used. But in the future they will be needed to support packet-specific compression/encryption of payloads. The status field represents whether the producing computation failed. If so, then the packet may contain an error message.
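As a sketch of how the 8-byte data command could be assembled, the Python below packs the six one-byte fields followed by two padding bytes. Only the type constant 0 comes from the table. The numeric codes for source, format, and status are hypothetical placeholders, since the real values are defined in data/morloc.h.

```python
import struct

# type(1) source(1) format(1) compression(1) encryption(1) status(1) padding(2)
DATA_COMMAND = struct.Struct("<6B2x")

PACKET_TYPE_DATA = 0  # from the table: constant 0 marks a data packet

# Hypothetical codes, for illustration only
SOURCE_MESSAGE = 0
FORMAT_JSON = 0
STATUS_PASS = 0


def make_data_command(source, fmt, compression=0, encryption=0, status=STATUS_PASS):
    """Build the 8-byte command field of a data packet."""
    return DATA_COMMAND.pack(PACKET_TYPE_DATA, source, fmt,
                             compression, encryption, status)


cmd = make_data_command(SOURCE_MESSAGE, FORMAT_JSON)
print(len(cmd))  # 8
```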

A call packet is sent from the nexus to a pool daemon or between daemons (for foreign calls). These packets specify the function to call and contain a contiguous list of data packets representing positional arguments as their payload.

Table 4. Call packet command field specification

| Field | Type | Width (bytes) | Description |
|---|---|---|---|
| type | char | 1 | Constant 1 - for "call" type |
| entrypoint | char | 1 | Stores whether the call is local or remote (or something else) |
| padding | char[2] | 2 | Zero-padding |
| midx | unsigned int | 4 | ID for the function to call in the pool daemon |

The entrypoint field is currently used when a call packet is sent to a remote machine for execution.
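A call command can be sketched in the same way. The example below assumes little-endian layout and a hypothetical entrypoint code of 0 for a local call. It also shows the payload as a plain concatenation of complete data packets, which is one reading of "a contiguous list of data packets" and not a confirmed detail of the protocol.

```python
import struct

# type(1) entrypoint(1) padding(2) midx(4)
CALL_COMMAND = struct.Struct("<BB2xI")

PACKET_TYPE_CALL = 1  # from the table: constant 1 marks a call packet
ENTRYPOINT_LOCAL = 0  # hypothetical code, for illustration only


def make_call_command(midx, entrypoint=ENTRYPOINT_LOCAL):
    """Build the 8-byte command field of a call packet."""
    return CALL_COMMAND.pack(PACKET_TYPE_CALL, entrypoint, midx)


def call_payload(data_packets):
    """Join complete data packets, one per positional argument, in order."""
    return b"".join(data_packets)


cmd = make_call_command(midx=7)
print(CALL_COMMAND.unpack(cmd))  # (1, 0, 7)
```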

A ping packet is a header-only packet that is sent to check if some resource (such as a pool daemon) is up and running. The nexus will ping all pool daemons until it gets a response before sending them call instructions.

Table 5. Ping packet command field specification

| Field | Type | Width (bytes) | Description |
|---|---|---|---|
| type | char | 1 | Constant 2 - for "ping" type |
| padding | char[7] | 7 | Zero-padding |
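Since all three command layouts begin with a one-byte type constant, a receiver can decide how to interpret the remaining seven bytes by inspecting the first byte. A minimal sketch, with hypothetical helper names:

```python
# Type constants from the data, call, and ping command tables
PACKET_TYPES = {0: "data", 1: "call", 2: "ping"}


def make_ping_command():
    """Build the 8-byte ping command: the type byte plus seven zero bytes."""
    return bytes([2]) + bytes(7)


def packet_type(command):
    """Name the packet kind from the first byte of its 8-byte command field."""
    if len(command) != 8:
        raise ValueError("the command field must be exactly 8 bytes")
    try:
        return PACKET_TYPES[command[0]]
    except KeyError:
        raise ValueError(f"unknown packet type: {command[0]}") from None


print(packet_type(make_ping_command()))  # ping
```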

Metadata is stored in blocks that start with an 8-byte metadata header.

Table 6. Metadata block specification

| Field | Type | Width (bytes) | Description |
|---|---|---|---|
| magic | char[3] | 3 | Constant "mmh" string (Morloc Metadata Header) |
| type | char | 1 | Metadata type (e.g., schema string or data hash) |
| size | uint | 4 | Metadata size |

Currently Morloc uses the metadata section to store data type schemas and to cache data hashes. In the future, these sections could be extended to store provenance data, benchmark data, environment info, runtime dags, or even raw inputs and code. The nexus can write data packets as a final output.
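The 8-byte metadata header can be sketched as follows. This example assumes a little-endian size field that counts the bytes following the header, and it uses a hypothetical type code for schema strings. Neither detail is confirmed by this document.

```python
import struct

# magic(3) type(1) size(4)
METADATA_HEADER = struct.Struct("<3sBI")
METADATA_MAGIC = b"mmh"

METADATA_TYPE_SCHEMA = 0  # hypothetical code, for illustration only


def make_metadata_block(meta_type, payload):
    """Prefix a metadata payload with its 8-byte header."""
    return METADATA_HEADER.pack(METADATA_MAGIC, meta_type, len(payload)) + payload


def read_metadata_block(raw):
    """Return (type, payload) for the metadata block at the start of raw."""
    magic, meta_type, size = METADATA_HEADER.unpack(raw[:METADATA_HEADER.size])
    if magic != METADATA_MAGIC:
        raise ValueError("not a metadata block")
    start = METADATA_HEADER.size
    return meta_type, raw[start:start + size]


block = make_metadata_block(METADATA_TYPE_SCHEMA, b"at2bai1")
print(read_metadata_block(block))  # (0, b'at2bai1')
```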

7. Q&A

7.1. I only use one language, is Morloc still useful?

Yes, Morloc remains useful even if you only use one programming language.

While Morloc is designed to allow polyglot development, its core benefits also apply to single-language projects. In the Morloc ecosystem, you may continue working in your preferred language, but the focus shifts to writing libraries instead of standalone applications.

Morloc lets you compose these functions and automatically generate applications from them, offering several advantages:

  • Broader usability: Your functions can be easily reused and easily accessed by other language communities.

  • Improved testing and benchmarking: Functions can be integrated into language-agnostic testing and benchmarking frameworks.

  • Future-proofing: If you ever need to migrate to a new language, Morloc’s type annotations and documentation carry over—only the implementation needs to change. And if you want to leave the Morloc ecosystem, your implementation does not need to change.

  • Better workflows: Especially in fields like bioinformatics, Morloc shifts workflows from chaining applications and files to composing typed functions and native data structures, making pipelines more robust and easier to validate.

  • No more format parsing: Morloc data structures replace bespoke file formats and offer efficient serialization.

While language interop is a major feature of Morloc, it is not the main purpose. The very first version of Morloc was not polyglot at all. The original focus was just to have a simple composition language that separated pure code from associated effects, conditions, caching, etc.

The primary goal of Morloc is to support the development of composable, typed universal libraries. Support for many languages is required for this goal, since no one language is best for all cases. Most Morloc users would continue to program in their favorite language, but gain the ability to compose, share, and extend functionality more easily.

7.2. Is this just a bioinformatics workflow language?

No. The pre-released Morloc paper is focused on bioinformatics applications. As discussed at length in the paper, Morloc addresses systematic flaws in the traditional approaches to building bioinformatics workflows. Given the need, and also given my personal background, bioinformatics is a good place to start. However, Morloc can be more broadly applied to any functional problem.

7.3. Do you really want to deprecate all the bioinformatics formats?

Yes, with the possible exception of specialized binary formats, which may offer performance benefits.

For human-readable semi-structured formats, I think only two are necessary: a tabular format (e.g., CSV) and a tree format (e.g., JSON).

7.4. Do you really want to deprecate all the bioinformatics applications?

Yes, with the exception of interactive graphical applications.

7.5. Do you want to deprecate the conventional workflow languages?

Not entirely. They do offer good scaling support that Morloc cannot yet match. Some also support GUIs which offer an intuitive and valuable way to visualize and create workflows from coarse components.

Hybrid solutions are possible. Conventional workflow languages can wrap Morloc compiled applications and pass Morloc generated data in place of bespoke bioinformatics formats.

7.6. Does Morloc allow function-specific environments?

No, unlike workflow managers such as Snakemake and Nextflow, Morloc does not offer function-specific environments. This is a deliberate design choice.

Dependency resolution is a hard and heavily researched problem. The general goal of dependency solvers is to find one set of dependencies that satisfies the entire program. The bioinformatics community often gives up on finding unified environments and instead runs each function in its own independent environment. With every function running in its own container, all dependency issues are encapsulated and all functions may be executed from one manager. But this comes at a heavy cost. Each application must be wrapped in a script, the script must be executed via an expensive system call into the container, and data must be serialized and sent to the container. This approach is reasonable for workflows with a small number of heavy components. But from a programming language perspective, wrapping every function call in its own environment is inefficient and opaque.

Morloc is designed not to hide problems in boxes, but rather to solve the root problem. Conventional workflow languages attempt to simplify workflow design by layering frameworks over the functions. The Morloc approach is the exact opposite. First delete everything unnecessary from all applications and lift their light algorithmic cores into clean, well-typed libraries. Then build upwards through composition of these pure functions, with judicious use of impure ones, to create efficient, reliable, and composable tools.

Now, if you really do need to run something in a container, you can just make a function that wraps a call to a container and then use it just as you would any other function. You could even write a wrapper function that takes a record with all the metadata needed for a conda environment and executes a function within that environment. We can do this through libraries, so there is no need to hardcode this pattern into the Morloc language itself.

The reproducibility of Morloc workflows may be ensured by running the entire Morloc program in an environment or container, with a single set of dependencies. The specific Morloc compiler version can be specified and modules may be imported using their git hashes. This is done in the current Morloc examples (see the Dockerfile in the workflow-comparisons folder of https://github.com/morloc-project/examples).

7.7. What about recursion?

Recursion is not directly supported in Morloc. It may be in the future, but the implementation is complicated by inconsistent support for recursion in different languages. For example, Python has a recursion limit that can cause runtime crashes. Instead, recursive algorithms should be written as control functions in foreign languages.

For example, the Morloc bio module contains many generic tree traversal algorithms. One is the foldTree function:

foldTree n e l a
  :: (l -> a -> a)
  -> (n -> e -> a -> a)
  -> a
  -> RootedTree n e l
  -> a

Here the RootedTree type represents a phylogenetic tree with generic node (n), edge (e), and leaf (l) types. The foldTree function accepts two functions as arguments. The first, (l → a → a), reduces leaf values into the accumulator value. The second, (n → e → a → a), reduces a branch given the parent node, the edge, and the current accumulator value. The strategy for implementing this function is decided in the foreign source code. The current C++ implementation is recursive, but iterative alternatives are possible.
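To illustrate the iterative alternative, here is a minimal Python sketch of a fold with the same argument order as foldTree. It is not the bio module's implementation. The Leaf and Node classes are a hypothetical tree representation, and the traversal order (depth-first, visiting each branch before the subtree below it) is an assumption. An explicit stack replaces recursion, so deep trees do not hit Python's recursion limit.

```python
from dataclasses import dataclass, field


@dataclass
class Leaf:
    value: object


@dataclass
class Node:
    value: object
    # Each child is a pair of (edge value, subtree)
    children: list = field(default_factory=list)


def fold_tree(f_leaf, f_branch, init, tree):
    """Fold a rooted tree into one accumulator without recursion.

    f_leaf(leaf, acc)         plays the role of (l -> a -> a)
    f_branch(node, edge, acc) plays the role of (n -> e -> a -> a)
    """
    acc = init
    stack = [("visit", tree)]
    while stack:
        tag, item = stack.pop()
        if tag == "branch":
            node_value, edge = item
            acc = f_branch(node_value, edge, acc)
        elif isinstance(item, Leaf):
            acc = f_leaf(item.value, acc)
        else:
            # Push children in reverse so they are processed left to right
            for edge, child in reversed(item.children):
                stack.append(("visit", child))
                stack.append(("branch", (item.value, edge)))
    return acc


tree = Node("root", [(1.0, Leaf("a")), (2.0, Node("x", [(0.5, Leaf("b"))]))])
# Sum all edge lengths and ignore the leaves
print(fold_tree(lambda l, a: a, lambda n, e, a: a + e, 0.0, tree))  # 3.5
```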

7.8. What about object-oriented programming?

An "object" is a somewhat loaded term in the programming world. As far as Morloc is concerned, an object is a thing that contains data and possibly other unknown stuff, such as hidden fields and methods. All types in Morloc must have forms that are transferable between languages. Methods do not easily transfer; at least they cannot be written to Morloc binary. However, it is possible to convey class-like APIs through typeclasses. Hidden fields are more challenging since, by design, they are not accessible. So objects cannot generally be directly represented in the Morloc ecosystem.

Objects that have a clear "plain old data" representation can be handled by Morloc. These objects, and their component types, must have no vital hidden data, no vital state, and no required methods. Examples of these are all the basic Python types (int, float, list, dict, etc.) and many C++ types such as the standard vector and tuple types. When these objects are passed between languages, they are reduced to their pure data.

7.9. Is Morloc still relevant when AI can program and translate?

Maybe. Morloc may serve as a system for functional composition, verification, and automation even when most functions are generated by machines.

I’ll lay out an argument for this below, starting with a few propositions:

  1. Adversaries exist. AIs may themselves be adversarial or there might be adversarial code in the ecosystem around the AIs (for example, prompt injection). Humans can’t trust humans, humans can’t trust AIs, AIs can’t trust humans, and AIs can’t trust AIs. Depending on their architecture, AIs may not even be able to trust their own memories.

  2. Stupid is fast. Narrow intelligence outperforms general intelligence for narrow problems. A vast AGI system with deep understanding of physics and Shakespeare will not be the fastest tool for sorting a list of integers. There will always be a need for programs across the intelligence spectrum — from classical functions, to statistical models, to general intelligences.

  3. Creating functions is expensive. Designing high-performance algorithms is not trivial. Even simple functions, like sorting algorithms, require deep thought to optimize for a given use case. But there is a further combinatorial explosion of more complex physical simulations, graphics engines, and statistical algorithms. While simple functions might be created in seconds, others may take years of CPU time to optimize.

  4. Reproducibility is important. Future AIs may serve as nearly perfect oracles, but they are complex entities and future AIs will likely be capable of evolving over time as persons. So they will likely not give equivalent answers day to day. It is valuable to be able to crystallize a thought process into something that will behave the same every time it is invoked on a given input. So again, functions are important.

  5. Correctness is important. If functions are being composed by AIs to create new programs, any function that does not behave in the way the AI expects can cause cascading errors. It doesn’t matter how intelligent the AI is, if it is building programs from functions that it cannot verify, then the programs may not be safe.

A few things follow from these propositions.

First, AI will benefit from writing functions. Even in a world with no humans, they will need functions for efficiently solving narrow problems. They will likely generate libraries of billions of specialized functions. Some may be classical functions and others may be small statistical models. By caching these functions, compute time can be saved. Rather than generating entire programs from first principles, they can build them logically through composition of prior functions. The same forms of abstractions that help humans reason will also be of value to AIs. Yes, they have far larger working memories than we do, but that does not change the fact that abstraction and composition reduce the costs of re-derivation.

Time can also be saved if different AIs share functions they have written (both with each other and with humans). Since adversaries exist, shared functions must be verified. But verification is hard, especially if a godlike super-intelligence were trying to hide adversarial features in the binary. The problem can be simplified by using a controlled language that can be formally verified by a trusted classical computer program — a compiler. So rather than share functions as binary, it would make sense to share them in strict controlled languages. For this reason, I believe that something resembling current programming languages will exist far into the future. Their main purpose will be as easily verifiable and human readable specifications for languages that can be compiled into high-performance code.

So in this imagined future, there are billions of functions in databases that are written in verifiable languages readable by humans, classical machines, and AIs. But what language is used? Maybe the AIs can converge on one standard. But even for AIs, and perhaps especially for them, I don’t think a single language is optimal. Rather, just as in human mathematics, there will likely be many languages for many domains. Languages make trade-offs. In general, the more complex a language is, the more difficult it is to parse, verify and optimize. So even if we ignore human factors, multi-lingual ecosystems are still likely to appear. Adding in human factors, we are again likely to see a spectrum of languages that accept different trade-offs in rigor, ease of use, and domain specificity.

I predict a future where humans and AIs use libraries of functions written in specialized languages. All the functions need to be easily verifiable by an outside actor and verified functions need to be composed into more complex programs using a well-verified composer. Since we don’t trust any agent to verify, we need a classical program. Morloc is a potential candidate for this role. It would serve as a classical composition tool, function verification ecosystem, automation engine, and conceptual framework for organizing and using billions of mostly machine-generated functions.

Of course, the future is impossible to predict, especially where AI is concerned. It is possible that AIs will converge on a single universal representation for computation. It is possible that the need for human readability and curation may disappear. It is possible that classical computer functions could be entirely replaced by discrete mathematical constructs that are composable and machine verifiable but entirely incomprehensible to humans.

7.10. Why is it named after Morlocks, weren’t they, like, bad?

While the Morlocks of Wellsian fame are best known for their culinary preferences, I think Wells misrepresented them. And even if he didn’t, we don’t treat our own Eloi any better. Meat choices aside, the Morlocks worked below to maintain the machines that simplified life above. That’s why the Morloc language adapts their name.

7.11. Wait! I have more questions!

Great! Look me up on Discord (link below) and we can chat.

8. Contact

This is a young project and any brave early users are highly valued. Feel free to contact me for any reason!