Some of the most elegant computer science concepts

Distributed Thoughts
10 min read · Mar 29, 2022


Have you ever read an algorithm and thought, “Wow! This is so brilliant and elegant!”? This happens to me quite often, and there are some precious, elegant algorithms and concepts that are, in my opinion, a must-know for every computer scientist.

This article shares my enthusiasm for some of the widely known key concepts of computer science; it aims to be neither complete nor exhaustive.

Photo by Markus Spiske on Unsplash

The Chomsky Hierarchy and Automata

We start with a core concept of computer science, related to computability.

The Chomsky Hierarchy refers to a hierarchy of classes of formal grammars. A formal grammar defines production rules, which are used to generate formal languages. A production rule has the form “left-hand side → right-hand side”. Among the symbols appearing on the left-hand side (lhs) and the right-hand side (rhs), we distinguish:

  • non-terminal symbols, usually indicated with uppercase letters, meaning that some production rule can still be applied;
  • terminal symbols, usually indicated with lowercase letters, meaning that no production rule can be applied any further;
  • a start symbol, a distinguished non-terminal symbol used to indicate where the production of the words belonging to the language defined by the grammar starts.

For example, we can have production rules of the form S → AB, S → ε, A → aS, B → b, where {S, A, B} are non-terminals, {a, b} are terminals, and S is the start symbol. A possible derivation is S ⇒ AB ⇒ aSB ⇒ aSb ⇒ ab, which produces the word “ab”.
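
To make this concrete, here is a tiny Python sketch (my own toy illustration, not from any library) that repeatedly applies the rules above starting from S. This particular grammar generates the language {aⁿbⁿ : n ≥ 0}, which, notably, no regular grammar can describe.

```python
import random

# Production rules of the toy grammar: S -> AB | ε, A -> aS, B -> b.
RULES = {
    "S": [["A", "B"], []],   # the empty list encodes S -> ε
    "A": [["a", "S"]],
    "B": [["b"]],
}

def derive(symbol="S"):
    """Expand non-terminals (uppercase) until only terminals remain."""
    if symbol.islower():                       # terminal: nothing left to apply
        return symbol
    production = random.choice(RULES[symbol])  # pick an applicable rule at random
    return "".join(derive(s) for s in production)

print(derive())  # e.g. "ab", "aabb", or "" (the empty word)
```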

A very high-level perspective on grammars: let {A, B} be non-terminals, a be a terminal, and {α, β, γ} be strings of terminals and/or non-terminals, with α and β possibly empty and γ never empty. The Chomsky hierarchy classifies grammars as follows:

  • type-0 (recursively enumerable) grammars: no restriction on the production rules;
  • type-1 (context-sensitive) grammars: rules of the form αAβ → αγβ;
  • type-2 (context-free) grammars: rules of the form A → γ;
  • type-3 (regular) grammars: rules of the form A → a or A → aB.

The Chomsky hierarchy underlines the relation between the different grammars: each type is a restriction of the one above it, so the corresponding classes of languages are contained in one another (type-3 ⊆ type-2 ⊆ type-1 ⊆ type-0). Besides the beauty of the hierarchy itself, we know that different automata can be used for recognizing the words generated by the different grammars. The different types of grammars are related to different automata, namely finite-state automata (for type-3 grammars), pushdown automata (for type-2), linear bounded automata (for type-1), and Turing machines (for type-0).

Let’s consider the task of recognizing the words belonging to a grammar. A finite-state automaton has a set of inner states and performs transitions from one state to another according to the character read on the (read-only) input tape. As you may guess, the pushdown automaton adds a stack to the finite-state automaton, so the automaton can behave differently, when the same input character is observed, according to the symbol on top of the stack. Note that we can neither move back and forth on the input tape nor freely access the stack (only push/pop operations are allowed). The linear bounded automaton (LBA) has a tape on which it can move left and right until a left or right end marker is met; in other words, computation is restricted to the portion of the tape containing the input plus the two tape squares holding the end markers (which cannot be crossed). Finally, the Turing machine has no limit on the tape length.
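
To see how little machinery a finite-state automaton needs, here is a minimal sketch (my own example): a two-state automaton accepting the regular language of strings over {a, b} containing an even number of a’s.

```python
# Transition table of a two-state finite automaton.
TRANSITIONS = {
    ("even", "a"): "odd",   # reading an 'a' flips the parity
    ("even", "b"): "even",  # reading a 'b' leaves it unchanged
    ("odd", "a"): "even",
    ("odd", "b"): "odd",
}

def accepts(word, start="even", accepting=("even",)):
    """Run the automaton over the input, one transition per character."""
    state = start
    for ch in word:
        state = TRANSITIONS[(state, ch)]
    return state in accepting

print(accepts("abba"))  # True: two a's
print(accepts("ab"))    # False: one a
```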

The beauty here is to realize that the different automata differ by just a few details and yet, importantly, their expressive power changes a lot!

Radix Sort

When we want to sort elements, we usually need to compare each element to some (if not all) of the other elements in the set to be sorted. This is how most sorting algorithms work, and any algorithm that relies on comparisons between elements cannot do better than O(n log n) running time in the worst case, where n is the number of elements to sort. Can we sort elements without using comparisons?

This is the very idea of bucket sorting algorithms: we have multiple buckets, among which an ordering relation exists, and we just put each element into the right bucket. As soon as we finish, we obtain the sorted list by simply reading the elements out of the buckets, in order. A special implementation is radix sort, specifically designed to sort numbers digit by digit; two versions exist, starting either from the most or from the least significant digit. Importantly, radix sort works as expected because of its stability property: a sorting algorithm is “stable” if two objects with equal keys appear in the sorted output in the same order as they appear in the input.
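
Here is a minimal sketch of the least-significant-digit variant for non-negative integers (my own toy implementation):

```python
def radix_sort(nums, base=10):
    """LSD radix sort: one stable bucket pass per digit."""
    max_val = max(nums, default=0)
    exp = 1                                        # 1, 10, 100, ... selects the digit
    while exp <= max_val:
        buckets = [[] for _ in range(base)]
        for n in nums:
            buckets[(n // exp) % base].append(n)   # stable: preserves input order
        nums = [n for bucket in buckets for n in bucket]
        exp *= base
    return nums

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
# [2, 24, 45, 66, 75, 90, 170, 802]
```

Each pass is stable, so the ordering established by the previous (less significant) digits survives, and after the last pass the list is fully sorted.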

The video below shows how quickly different sorting algorithms perform.

Visualizing different sorting algorithms.

If you are interested in sorting algorithms and popular data structures, I suggest reading this article.

Binary Search

As soon as we have an array of sorted elements, we can speed up the retrieval of an element. How? The idea here is very well-known: binary search.

Binary search is a search algorithm that compares the target value to the middle element of a sorted array of elements. If they are equal, the element is found. Otherwise, it splits the array in half and continues the search on the first or second half only: on the first half, if the target value is lower than the middle element; on the second half, otherwise. The search continues until the target value is found; if the search ends with an empty half, the target is not in the array.
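
A textbook iterative sketch of the algorithm:

```python
def binary_search(arr, target):
    """Return the index of `target` in the sorted list `arr`, or -1 if absent."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:                # the search interval is non-empty
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1           # continue on the second half
        else:
            hi = mid - 1           # continue on the first half
    return -1                      # the half became empty: target not in array

print(binary_search([2, 5, 8, 12, 16, 23, 38], 23))  # 5
print(binary_search([2, 5, 8, 12, 16, 23, 38], 7))   # -1
```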

In the worst case, binary search runs in logarithmic time: it makes O(log n) comparisons, with n the number of elements in the array. The idea behind binary search is so brilliant that we can find applications of it in many fields of computer science (e.g., databases, information retrieval).

Distributed Hash Table

To further improve retrieval performance, we can resort to hash tables, which can retrieve elements with O(1) comparisons (on average). Data is stored as key-value pairs, where the key is used to index the data (e.g., to implement a cache, user-related values can be stored using the userId as the key). A hash table applies a hash function to the key to identify the bucket where the corresponding value is stored.
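
A bare-bones sketch of the idea (toy code: collisions are handled by chaining, and real implementations also resize as they fill up):

```python
class HashTable:
    """Key-value store: the hash of the key selects the bucket."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:              # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))   # colliding pairs chain in the same bucket

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

cache = HashTable()
cache.put("userId42", {"name": "Ada"})
print(cache.get("userId42"))  # {'name': 'Ada'}
```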

Distributed hash tables (DHTs) are very popular nowadays, and they are often adopted in peer-to-peer services or in the emerging decentralized cloud storage services. A DHT is a distributed system that stores key-value pairs on distributed nodes, so that any participating node can efficiently retrieve the value associated with a given key. The responsibility for maintaining the mapping from keys to values is distributed among the nodes. This improves the scalability of the DHT, which can consist of a large number of nodes and can handle failures as well as node arrivals and departures. A good DHT implementation allows nodes to be added or removed with minimum work for re-distributing the keys.

Notably, two solutions are worth mentioning: Chord and Pastry. Here, we only overview Chord.

Chord specifies how keys are assigned to nodes. Nodes and keys are assigned m-bit identifiers using consistent hashing. Nodes are organized in a (logical) ring, so that each node has a successor and a predecessor on the ring. The successor node of a key k is the first node whose identifier equals k or follows k on the identifier circle. Every key is stored at its successor node (successor(k)).

To quickly retrieve data, Chord requires each node to keep a finger table containing up to m records. The i-th entry of node n contains the address of successor((n + 2^{i-1}) mod 2^m). When a node wants to look up a key k, it forwards the request to its finger that most closely precedes k, and the process repeats until a node discovers that the key is stored at its immediate successor. Such a finger table gives each node a detailed knowledge of close-by nodes and a rough knowledge of faraway nodes. Hence, a lookup contacts at most O(log n) nodes.
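
The toy sketch below simulates a Chord lookup on a single machine (the ring size, the node identifiers, and the `between` helper are my own illustrative choices; real Chord is a distributed protocol that also handles joins, departures, and stabilization):

```python
M = 6                                         # identifiers live in [0, 2^6)
NODES = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])

def successor(k):
    """First node whose identifier equals or follows k on the ring."""
    k %= 2 ** M
    return next((n for n in NODES if n >= k), NODES[0])

def between(x, a, b):
    """True if x lies in the ring interval (a, b], possibly wrapping past 0."""
    return a < x <= b if a < b else (x > a or x <= b)

def finger_table(n):
    """i-th finger of node n: successor((n + 2^(i-1)) mod 2^m)."""
    return [successor((n + 2 ** (i - 1)) % 2 ** M) for i in range(1, M + 1)]

def lookup(n, k, hops=0):
    nxt = successor((n + 1) % 2 ** M)         # n's immediate successor
    if between(k % 2 ** M, n, nxt):
        return nxt, hops                      # k is stored at n's successor
    for f in reversed(finger_table(n)):       # closest preceding finger first
        if f != n and between(f, n, k % 2 ** M):
            return lookup(f, k, hops + 1)
    return lookup(nxt, k, hops + 1)

print(lookup(1, 54))  # (56, 3): node 56 stores key 54, found in a few hops
```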

Probabilistic K Nearest Neighbor

Another elegant application of hash functions can be found in machine learning. The k-nearest neighbor algorithm (kNN) is a popular non-parametric method used in machine learning for solving classification and regression problems. This method classifies (or predicts) an object by the plurality vote of its neighbors in a multi-dimensional space.

Lookup is the most expensive operation in kNN: it can take O(N) time with a sequential table, O(log N) with a binary tree, and O(1) with a hash table, where N is the number of sample objects. We are interested in hash tables here.

A generic hash function randomly distributes values among bins (identified by the hash codes). Here, instead, we want near points to be grouped together in the same bin, so we need locality-sensitive hash (LSH) functions. An LSH should have the property that, for any two objects, the closer the objects, the higher the probability that they end up in the same bin. To improve the probability of finding the nearest objects, multiple LSH functions can be used: each LSH creates a random projection, and the projections are then combined to find an approximate solution to the kNN problem. When a new object q should be classified, we simply compute the LSH codes of q to identify many (but not all) of the objects near q; the plurality vote on these objects is used to classify q.
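
A minimal sketch of one popular LSH family, signed random projections (random hyperplanes), assuming NumPy is available:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_lsh(dim, n_planes=8):
    """Hash code = the pattern of signs of `n_planes` random projections."""
    planes = rng.normal(size=(n_planes, dim))
    return lambda x: tuple(planes @ x > 0)

# Nearby vectors cross few of the random hyperplanes, so they tend to
# agree on most signs and land in the same bin.
lsh = make_lsh(dim=2)
a, b, c = np.array([1.0, 0.9]), np.array([0.9, 1.0]), np.array([-1.0, 0.2])
print(lsh(a) == lsh(b))  # likely True: a and b point in similar directions
print(lsh(a) == lsh(c))  # likely False: c points in a very different direction
```

In practice, several such tables with independent hyperplanes are built, and the candidates retrieved from the matching bins are then ranked by their exact distance.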

Note that it does not matter how many objects we store: probabilistic kNN retrieves the neighbor points of an object with O(1) comparisons. Of course, because of the probabilistic nature of LSH, probabilistic kNN only finds an approximate solution to the original problem.

Reinforcement Learning

Reinforcement Learning (RL) is a branch of machine learning. It refers to a collection of trial-and-error methods by which an agent can learn how to make good decisions through a sequence of interactions with the environment.

In RL, an agent observes the system state and can perform an action, which usually depends on the current system state. The execution of the action a in the state s leads to a new state s’ and to the payment of an immediate cost c(s, a, s’). The experienced cost represents the agent’s experience, and it is used by the RL agent to learn a policy, which aims to select the best actions so as to minimize the long-term cost. One of the main challenges in RL is to find a good trade-off between the exploration and exploitation phases. To minimize the incurred cost, an RL agent must prefer actions that it found to be effective in the past (exploitation). However, to discover such actions in the first place, it has to try new actions (exploration).

Importantly, the RL agent stores an action-value function Q(s, a), known as the “Q-function”, which is the expected infinite-horizon discounted cost incurred by taking action a in state s and then following the policy. Different approaches exist for estimating the Q-function, including solutions that exploit different degrees of available knowledge about the system (e.g., Q-learning, Dyna-Q, model-based methods) or that approximate the Q-function (e.g., Deep Q-learning). We cannot provide further details in this article, but the interested reader can refer to [1, 2].
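
As a small taste, here is a minimal tabular Q-learning sketch; the `env` interface (`reset()` returning a state, `step(action)` returning the next state, the immediate cost, and a termination flag) is an assumption of mine, mirroring the cost-minimization formulation above:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)                     # Q[(state, action)], 0 by default
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:          # exploration: try a random action
                a = random.choice(actions)
            else:                              # exploitation: cheapest known action
                a = min(actions, key=lambda act: Q[(s, act)])
            s2, cost, done = env.step(a)       # assumed environment interface
            best_next = min(Q[(s2, act)] for act in actions)
            # move Q(s, a) toward the immediate cost plus the discounted future cost
            Q[(s, a)] += alpha * (cost + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```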

This remarkably simple idea allows an agent to automatically learn a policy for performing a task, e.g., navigating an unknown environment, performing system reconfigurations, or playing games; see the video below, where a Deep RL agent learns to play Super Mario. The video shows the remarkable results RL can achieve.

Public Key Cryptography

Public-key cryptography has deeply changed the history of cryptography. In contrast to symmetric encryption, public-key cryptography is asymmetric as it involves the use of two separate keys. It is based on the hardness of some mathematical problems (e.g., discrete logarithm, prime number factorization) rather than on the complexity introduced by substitutions and permutations. Due to its higher computational overhead, asymmetric encryption will not replace symmetric encryption but is instead currently used to solve other, specific tasks, such as key distribution and digital signature [3].

In asymmetric encryption, we have a public and a private key. If the public key is used for encryption, then the related private key is used for decryption; if the private key is used for encryption, then the related public key is used for decryption. Encrypting with a private key is used to digitally sign (a hash of) the message. Since the public key is publicly available, the signature can be easily verified by anyone knowing it.

Asymmetric encryption relies on mathematical properties, often related to modular arithmetic and prime numbers. These ingredients were used by Rivest, Shamir, and Adleman to design the RSA encryption algorithm. RSA revolves around the following equation:

(m^e)^d ≡ m (mod n)

where e, d, and n are very large positive integers and m (the message) is an integer with 0 ≤ m < n. The core principle is that, knowing e and n, or even m, it can be extremely difficult to find d. Hence, the public key is the pair {e, n}, whereas the private key is {d, n}. In RSA, {e, d, n} are related: n is the product of two large prime numbers, say p and q, and ed ≡ 1 (mod φ(n)), with φ(n) = (p-1)(q-1). The robustness of RSA revolves around the difficulty of factoring n into its two prime factors (this would enable the calculation of φ(n) and, from it, the derivation of d).
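
A toy numeric walkthrough with the classic textbook primes (utterly insecure: real keys use primes of hundreds of digits; the modular inverse via pow needs Python 3.8+):

```python
p, q = 61, 53
n = p * q                  # 3233, shared by both keys
phi = (p - 1) * (q - 1)    # 3120
e = 17                     # public exponent, coprime with phi
d = pow(e, -1, phi)        # 2753, the modular inverse: e*d ≡ 1 (mod phi)

m = 65                     # the message, with 0 <= m < n
cipher = pow(m, e, n)      # encrypt with the public key {e, n} -> 2790
plain = pow(cipher, d, n)  # decrypt with the private key {d, n} -> 65
print(cipher, plain)       # (m^e)^d ≡ m (mod n), so we get the message back
```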

The Diffie-Hellman (DH) key exchange algorithm, instead, leverages the difficulty of computing discrete logarithms. Diffie and Hellman devised a way to enable two users to securely agree on a key that can then be used for the subsequent symmetric encryption of messages.
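
Again as a toy sketch with tiny public parameters (real deployments use groups of 2048+ bits or elliptic curves):

```python
import secrets

p, g = 23, 5                        # public: a prime modulus and a generator
a = secrets.randbelow(p - 2) + 1    # Alice's secret exponent
b = secrets.randbelow(p - 2) + 1    # Bob's secret exponent
A = pow(g, a, p)                    # Alice sends g^a mod p in the clear
B = pow(g, b, p)                    # Bob sends g^b mod p in the clear
shared_alice = pow(B, a, p)         # (g^b)^a = g^(ab) mod p
shared_bob = pow(A, b, p)           # (g^a)^b = g^(ab) mod p
assert shared_alice == shared_bob   # both sides derived the same key
```

An eavesdropper sees p, g, A, and B, but recovering a or b from them requires computing a discrete logarithm.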

Nowadays, these hard problems are threatened by quantum computing, which could solve them far more easily than traditional computers. We are thus witnessing the definition of post-quantum cryptographic algorithms, which should be secure against cryptanalytic attacks run on a quantum computer. NIST has initiated a process to evaluate and standardize quantum-resistant public-key cryptographic algorithms, which will probably replace today’s asymmetric encryption algorithms.

References

[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Cambridge: MIT Press, 1998.

[2] Silver, David. Introduction to reinforcement learning. DeepMind. https://deepmind.com/learning-resources/-introduction-reinforcement-learning-david-silver

[3] Stallings, William. Cryptography and Network Security: Principles and Practice. Pearson Education, 2020.
