# Whitepaper (v0.4)

# Overview

# Revolutions

Every major social change in the modern age has had an "ab incunabulis" moment, i.e., an origin. The Internet of today is the result of the move from circuit-switched to packet-switched networks [1]. The modern mobile revolution is a direct result of compact transistors [2] and wireless networking [3].

We believe that the next revolution of computing is upon us. Although this might seem like wishful thinking, the technical building blocks are already here: high computing power [4], embedded sensors [5], and AI [6].

The culmination of these technical milestones will usher in a new era of personal computing.

  • On-Context will replace On-Demand (i.e., information will appear based on the environment, instead of requiring the user to manually open an app).
  • Near-zero-latency requirements (see below) will change our wide-scale network topology [7].
  • Digital-Real-Estate will create new markets and legal frameworks.

# A Next-Generation Decentralized Spatial Web Apps Platform

The emergence of the Web [1a] and later Web 2.0 [1b] has transformed the way we consume services. On our workstations, desktop applications have mostly been replaced by web applications. The same phenomenon is currently taking place on our mobile platforms, mostly due to diminishing bandwidth costs, increased computing power, and improved energy density.

Alongside the flow from offline to online, there is also a major force of decentralization. The ability to share, change, merge, and redeploy code (e.g., GitHub), data (e.g., Sketchfab), information (e.g., Reddit), and ownership (e.g., Bitcoin) is a huge innovative force. As our digital services (e.g., apps) constantly evolve, they become less constrained, both in time (i.e., always-on online services) and space (i.e., cloud infrastructure). The same changes can already be seen in the Mixed Reality applications market [2].

However, another, arguably more important, aspect has emerged: context. Usually we need to 'launch' an application to use its services. This is changing. Mobile devices already use contextual information (e.g., location, time, inertial sensors) in our daily use. These applications have neither a beginning nor an end: information is pushed instead of pulled. Contextual applications are emerging [3] because the technology has matured; machine learning, high-performance mobile compute devices, and accurate sensors are the building blocks.

ReSight intends to provide the computing platform for the next wave of contextual applications. We imagine a future where applications are fused to the physical space around us. They push visual and auditory information to next-generation mobile devices; they interact with their surroundings and share information with other applications; they are persistent over time and space; they are shared [4] and perceived by multiple users in real time and consistently; and as such, they are part of the next web, independent of any specific provider or manufacturer.

# ReSight

# Hardware & Software

ReSight is building the next platform for spatial computing. While the tech giants (FAAMG) are building the hardware platforms for mixed reality, ReSight is building the software solution.

We believe there will be a massive industry need for a cross-platform, decentralized, near-real-time, non-walled-garden solution, and we are working hard to make this future possible.

# Differentiation

All current (public) solutions are based on the common Web 2.0 practice of a centralized service. While this model is the right solution for today's apps, it breaks down in the spatial computing era.

When two or more people are in the vicinity of each other, they see the same physical environment and expect the digital layer to match. As digital information becomes more immersive and realistic (e.g., AR glasses), this expectation breaks without a near-zero-latency guarantee.

Also, without standardization of the computer vision algorithms, two competing platforms (e.g., Apple and Google) could not "merge" their spatial digital content, leaving users in totally separate networks. This is akin to having two entirely separate "Internets". We believe game theory prevents this future from materializing.

In the past, this key game-theoretic problem was indeed solved with industry-wide convergence/standardization. The World Wide Web, the videotape format war, and the protocol wars are all fine examples.

However, spatial computing requires collaboration of a different kind, one that has never been attempted before: collaboration on a very complex computer vision pipeline/algorithm, most of which, if not all, consists of deep learning models. Failing to agree on a shared model leads to separated networks.

Will Big Tech collaborate, losing significant market share in the process, or, as has happened again and again in the past, will we experience a period of "platform wars"?

We are already seeing (as of 2020) a need from developers for a more robust, less platform-bound solution. Many of the current obstacles are related to limitations of the centralized-service model.

We aim to remedy this problem by building a decentralized solution from the ground up. The next sections elaborate on the technical endeavor to achieve this goal.

# ReSight's Platform

# Mirror SDK

Our cross-platform[1] SDK provides automatic, scalable, accurate, fast, and simple-to-use spatial computation capabilities:

  • Indoor localization is based on computer-vision-generated "maps" that are shared between individual users.
  • Applications built on our platform can utilize the new mesh-network stack to share app/user data across network boundaries, without the need for a centralized service/database.
  • Infinite scale: there is no "room" or "map" that must be known before use. This It-Just-Works behavior is critical for rapid development and useful products.
  • Cold start: there is no need to "pre-scan" the environment before use. Multiple users can simply enter a new space and start a shared experience simultaneously.
  • GIS: using GPS signals (both spatially and temporally, across users/sessions), we can triangulate the entire graph (see Factor Graphs & NLLS) and provide much more accurate WGS84 coordinates.

# Indoor Localization & Shared Mapping

For localization, the outdoor solution is GPS, which is accurate to about 5 m and is therefore useful for large-scale outdoor navigation.

For indoor use, the available solutions to this problem are:

  • Manual: The user needs to share a special code/name with other users.
  • QR Code: The user needs to point her device at a QR code, and only then can the experience start.
  • GPS: Using the GPS as-is.
  • Map-Merge: A 3D scan is built and merged remotely.

Drawbacks:

  • Manual: Similar to requiring the user to type the exact street address before navigation can start. Not usable.
  • QR Code: A QR code must be printed and placed by an "admin". Not usable.
  • GPS: GPS accuracy indoors is low and the service is severely degraded. Not usable for AR.
  • Map-Merge: Still requires a pre-known "map/room" name. It also requires a centralized server, which usually takes minutes to perform a merge, is not private, is not applicable for highly sensitive and secure locations due to privacy concerns, and cannot work in offline scenarios.

All these solutions lack robustness and usability. In contrast, we propose a hybrid decentralized edge-computation approach:

  • Mapping is sparse and is done on the edge device as a local service [2].
  • Each device finds, connects to, and queries other devices for mapping information (see Mesh Network).
  • Localization is done on the edge device against the acquired mapping information (note that E2E encryption is now possible, with no possibility of MITM attacks).
  • There is no destructive "merge" of the low-level mapping information, so no information is lost when views are combined.
  • There is no "source of truth": each device has its own view of the environment, and is therefore fully independent.
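
A minimal sketch of this flow, assuming a simple fragment-request exchange over the mesh (all names, and the overlap-based "matching", are hypothetical stand-ins for the real computer vision pipeline):

```python
# Illustrative sketch of decentralized localization: a device queries the
# peers it is connected to for sparse map fragments and scores them locally.
# Every peer's view stays separate; nothing is destructively merged.
from dataclasses import dataclass, field

@dataclass
class MapFragment:
    peer_id: str
    landmark_ids: set          # ids of sparse visual features in the fragment

@dataclass
class Peer:
    peer_id: str
    fragments: list = field(default_factory=list)

    def request_fragments(self):
        # Stands in for an E2E-encrypted request over a direct P2P channel.
        return self.fragments

def localize(observed_ids: set, peers: list):
    """Score every peer's fragments independently; no central source of truth."""
    best, best_overlap = None, 0
    for peer in peers:
        for frag in peer.request_fragments():
            overlap = len(observed_ids & frag.landmark_ids)
            if overlap > best_overlap:
                best, best_overlap = frag, overlap
    return best   # fragment to localize against (None if nothing matches)

alice = Peer("alice", [MapFragment("alice", {1, 2, 3, 4})])
print(localize({2, 3, 9}, [alice]).peer_id)   # -> "alice"
```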

# Mesh Network

Spatial computing implies physical closeness. When Alice and Bob are in the same room, they expect their spatial application to understand that. However, what happens if Alice is connected to the local Wi-Fi while Bob is using his cellular 3G connection? And what happens when Charlie arrives, who is on the same Wi-Fi as Alice?

The main problems with today's server-oriented networking model are:

  • Latency: As long as the speed of light is a physical constant, information traveling across routers, and sometimes continents, takes time [3]. Although Alice and Bob are a few meters apart, they exchange information as if they were kilometers apart. This is acceptable for traditional applications, but breaks when a digital object manipulated in real time by Alice is seen by Bob only after a perceivable visual delay.
  • Privacy: Requiring a central server means that the mapping data can only be encrypted in flight, not end-to-end. This might be an acceptable compromise for traditional applications, but breaks when we contemplate the serious implications of personal indoor spaces being automatically scanned and sent to a remote server.
  • Offline: Only P2P solutions can handle network outages.
  • Enterprise/Defense: Players in these specialized industries will need either an in-house solution (high cost and time) or a private-network-capable solution. On-premise servers are possible for some scenarios, while others, if not most, will benefit from a lower-watermark approach.

There are no good solutions today. Consequently, a spatial computing platform should provide a P2P networking stack.

# How does Mirror Mesh Networking work?

  • ID
    • A peer generates a public/private key pair and saves it, encrypted, in the local key store.
    • The PeerID is defined as the hash of the peer's public key (i.e., PeerID = SHA2(PeerPK); see the sketch after this list).
    • The public key is used for authentication and encryption during the DTLS handshake.
  • Bootstrap: A client finds other candidate clients and tries to establish a connection via:
    • mDNS (for local Wi-Fi connectivity)
    • Bluetooth (for nearby peers)
    • A signaling server (using a computer-vision sketch, see Hints below)
    • Shared worlds (high-level information, see Worlds below)
  • Connectivity: We want to use the simplest algorithm that maximizes connectivity in the graph. (Using the Erdős–Rényi model from random graph theory, each peer needs roughly at least ln(n) edges for the graph to surpass the critical connectivity threshold.)
  • Routing: Once connected, a mesh networking model is utilized:
    • Datagram-based messaging.
    • A gossip protocol to broadcast messages to nearby peers.
    • A local-table, source-oriented routing algorithm.
    • Flood-before-drop: a custom-made adaptive algorithm that prioritizes low latency and high robustness, detects redundant paths, and prunes them using local-only routing messages.
  • Channels: Once connected, clients can effectively communicate with remote, non-directly-connected peers. These indirect paths can then be lifted to direct channels using ICE (NAT hole punching), UDP (low latency), DTLS (encryption), and SCTP (reliability and multiplexing).
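
A minimal sketch of the ID scheme and the connectivity target, assuming an Ed25519 key pair and Python's `cryptography` package (the actual key type and key store are not specified here):

```python
# PeerID = SHA2(PeerPK), as defined above. Ed25519 is an assumption for
# illustration; any public-key scheme usable in a DTLS handshake would do.
import hashlib
import math
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def generate_identity():
    private_key = Ed25519PrivateKey.generate()    # stored encrypted locally
    public_bytes = private_key.public_key().public_bytes(
        encoding=serialization.Encoding.Raw,
        format=serialization.PublicFormat.Raw,
    )
    peer_id = hashlib.sha256(public_bytes).hexdigest()
    return private_key, peer_id

def target_degree(n_peers: int) -> int:
    # Erdős–Rényi: ~ln(n) edges per peer to pass the connectivity threshold.
    return max(1, math.ceil(math.log(n_peers)))

_, peer_id = generate_identity()
print(peer_id[:16], target_degree(1000))    # ln(1000) ≈ 6.9 -> 7 edges
```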

# Consensus

We aim for availability with eventual consistency. There is no consensus mechanism [4], and each peer is free to advance. When a conflict is detected, a peer applies a deterministic conflict-resolution algorithm that ensures that, once there are no partitions, all peers converge on the same order of events.
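
As a concrete (but hypothetical) example of such a rule, ordering commits by a key derived only from the commits themselves guarantees that every peer, given the same set, derives the same order:

```python
# A minimal sketch of deterministic conflict resolution. The key
# (timestamp, peer_id, commit_hash) is an illustrative assumption, not the
# actual algorithm: it only demonstrates that the final order is a pure
# function of the commit set, independent of arrival order.
from dataclasses import dataclass

@dataclass(frozen=True)
class Commit:
    timestamp: int    # logical clock at the authoring peer
    peer_id: str      # unique, so ties break identically on every peer
    commit_hash: str

def resolve(commits):
    return sorted(commits, key=lambda c: (c.timestamp, c.peer_id, c.commit_hash))

a = Commit(5, "alice", "f3a1")
b = Commit(5, "bob", "09cc")
assert resolve([a, b]) == resolve([b, a])   # same order on every peer
```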

Each peer, independently, is responsible for:

  • Switch between worlds (as a function of distance)
  • Push commits to worlds
  • Merge worlds
  • Add commits to worlds (maps data)
  • Refine geometric consistency (e.g., global bundle adjustment (BA))
  • Solve conflicts for changes to the same world

# Metro: Execution as a Computation Graph {#Metro}

One of the key components is an on-demand computational execution-graph module. Most computation can easily be broken down and described as a graph (more precisely, a DAG). A node in this graph (a Unit) takes a tuple of inputs, performs a computation, and sends out the output.

This is similar to a message-oriented design (see: the Actor Model), but connections are usually pre-determined, and strict ordering is enforced using a tag attached to each message (i.e., the Clock). Although a bit abstract, this design allows us to run a complicated computer vision pipeline very efficiently.

Let's explain with an example:

An image is acquired by the device camera and pushed for processing. If we treat the processing pipeline as a black box that takes 100 ms on a single system thread, what do we do when the next image arrives? We could use a lock (i.e., process one image at a time), which drastically lowers the FPS. We can do better: switch the system to multi-threading and launch N pipelines concurrently. Great, no? No. What happens when, inside our pipeline, we need the results of the previously processed image? We must wait. Not great. Even worse, today's systems include many specialized hardware devices (e.g., GPU, neural network accelerator); if our pipeline uses one of these, even partially, we must synchronize access across all of the concurrent pipelines. In the best case, as before, this lowers the FPS.

Our solution is called Metro: a data-flow execution framework [5] that fully utilizes the available computation resources, frees the developer to concentrate on the computer vision algorithm itself, allows us to describe a deep learning model as sub-networks (see UDNN), enforces strict ordering over the computed results, and makes cross-platform porting very easy (a platform-specific Unit can simply be replaced when porting to new hardware).
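
A minimal sketch of the Unit-and-Clock idea, assuming push-based wiring between Units (the names `Unit` and `Packet` are illustrative, not the actual Metro API):

```python
# Illustrative Metro-style data-flow graph: each Unit computes on tagged
# messages, and the clock tag enforces strict ordering of results.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Packet:
    clock: int    # strict ordering tag attached to every message
    data: Any

class Unit:
    def __init__(self, fn: Callable[[Any], Any]):
        self.fn = fn
        self.downstream: list = []
        self._next_clock = 0    # reject out-of-order input

    def connect(self, other: "Unit") -> "Unit":
        self.downstream.append(other)
        return other

    def push(self, packet: Packet) -> None:
        assert packet.clock == self._next_clock, "out-of-order packet"
        self._next_clock += 1
        out = Packet(packet.clock, self.fn(packet.data))
        for unit in self.downstream:
            unit.push(out)

# Example pipeline: camera -> feature extraction -> pose estimation.
camera = Unit(lambda frame: frame)
features = Unit(lambda frame: f"features({frame})")
pose = Unit(lambda feats: f"pose({feats})")
camera.connect(features).connect(pose)
camera.push(Packet(clock=0, data="frame0"))
```

In a real pipeline each Unit would run concurrently (e.g., on a lightweight thread, per footnote 5) with a bounded input queue; the clock tag is what lets several frames be in flight while results still come out in order.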

# Computer Vision

# UDNN

Built on top of Metro, we can take a deep learning model and break it into sub-models.

In computer vision, many models are built with an encoder and a decoder head; a more complicated model will involve a multi-head decoder. When one has many models that all share the same encoder (as happens quite frequently), the only way to utilize them together is to re-train a combined model (or to manually re-combine the layers, if the encoder layers were frozen in all of the models).

The impact is particularly challenging on embedded hardware (where resources are always limited and costly):

  • If we have N different models, the cost to run them all in parallel is high (and perhaps not even possible).
  • If we have a combined model, we are forced to always compute all of the "heads", even when some are not needed.

Using Metro, we can remedy this problem. Taking a combined model, we break it into separate components: the encoder and the heads. Although we now have N+1 models, we can dynamically disable any model that is not needed for a specific image/input.
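
A minimal sketch of this idea, assuming PyTorch-style modules (the layer shapes and head names are illustrative; in practice each sub-model would run as its own Metro Unit):

```python
# Illustrative UDNN decomposition: one shared encoder plus N heads, where
# heads that are not requested for a given input are simply never run.
import torch
import torch.nn as nn

class UDNN(nn.Module):
    def __init__(self, encoder: nn.Module, heads: dict):
        super().__init__()
        self.encoder = encoder
        self.heads = nn.ModuleDict(heads)

    def forward(self, x: torch.Tensor, wanted: set) -> dict:
        z = self.encoder(x)                   # computed once per input
        return {name: head(z)                 # only the requested heads run
                for name, head in self.heads.items() if name in wanted}

model = UDNN(
    encoder=nn.Conv2d(3, 8, kernel_size=3),
    heads={"depth": nn.Conv2d(8, 1, 1), "segmentation": nn.Conv2d(8, 5, 1)},
)
out = model(torch.randn(1, 3, 64, 64), wanted={"depth"})   # skips segmentation
```

Running the shared encoder once and gating the heads per input is exactly the utilization win described above: we pay for one encoder pass plus only the heads that are actually needed.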

As the industry matures, we believe that many features and solutions will be implemented using deep learning. The ability to switch any part of the model on or off, on demand, while maintaining high processing utilization, might become a very strong competitive advantage.

# Factor Graphs & NLLS

As described before, each peer builds its own view of the world. Internally, the peer builds a factor graph using information it receives from previous mappings, connected real-time peers, and its own computer vision module.

An edge in this graph describes the prior distribution of a relative transformation. Some observations are very distinctive, with tight distributions (e.g., a QR code); some degrade temporally (e.g., 6DoF tracking); others are very noisy, with wide distributions (e.g., GPS).

Taking all of this into account, each peer periodically tries to optimize this graph using a nonlinear least-squares (NLLS) solver. This makes the estimate much more robust to outliers and allows us to detect and switch off possible outliers.
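
For concreteness, this periodic optimization can be written as a standard robust pose-graph NLLS objective (a generic textbook formulation, not necessarily the exact cost used here):

$$
\hat{X} = \arg\min_{X} \sum_{(i,j)\in\mathcal{E}} \rho\!\left( \left\lVert \log\!\left( Z_{ij}^{-1}\, X_i^{-1} X_j \right) \right\rVert_{\Sigma_{ij}}^{2} \right)
$$

where $X_i$ is the pose of node $i$, $Z_{ij}$ is the observed relative transformation on edge $(i,j)$, $\Sigma_{ij}$ is its covariance (tight for a QR code, wide for GPS), and $\rho$ is a robust kernel (e.g., Huber) that down-weights suspected outliers.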

Once a solution is approved (after a filtering and validation process), a commit is created that describes the changes made to the entities/nodes (see Consensus).

A shared factor graph that accumulates observations from past sessions and over a growing spatial extent enables:

  • Accuracy: as we merge more observations, the MAP solution is more tightly bounded, i.e., higher accuracy.
  • Robustness: as any observation is considered a possible outlier, and we use nonlinear solvers, we are much more robust than naive solutions.
  • Locality: as each peer is independent and all computations are done on-device, contradicting views of the world can co-exist at the same time. Consistency is achieved over time, instead of through (limited) resources or a centralized oracle (a single point of failure).

# Worlds

  • TODO:
    • explain the non-boundary of world
    • explain inter and intra connections
    • explain why a world is a repository, and describe commits, rebases, and conflict resolution
    • explain staging and real-time user-data
    • explain the blockchain structure
    • describe the consistency algorithm (or not, patentable)

(Figure: Users-Worlds, illustration)

# Hints

  • TODO:
    • examples of hints (gps, sketches, bow, ssid, mdns)
    • explain how hints are used to find candidate worlds from other peers
    • explain how worlds are used to find candidate peers
    • explain

# User Data

TODO

# ECS Service

TODO

# Consensus

TODO

# Zero-Latency Sync Service

TODO

# CRDT

TODO



  1. Internally, the SDK is platform-independent. However, we currently provide only an iOS API; an Android API will be released in the near future.

  2. We believe that this piece of technology will become a commodity in future OSs, most likely through a cheap SoC. We have already seen this starting to happen in the last few years.

  3. The Internet at the Speed of Light

  4. We are currently also evaluating a more traditional consensus algorithm under a high frequency of join/leave events. More on that in the future.

  5. Even better, behind the scenes we use lightweight threads instead of system threads.