Ricardo Pallás

Optimizing a Math Expression Parser in Rust

2025-06-30T08:00:00+00:00

Baseline implementation (43.1 s)
Optimizations for speed and memory
Conclusion

In a previous post I explored how to optimize file parsing for max speed. This time, we’ll look at a different, self-contained problem: writing a math expression parser in Rust, and making it as fast and memory-efficient as possible.

Let’s say we want to parse simple math expressions with addition, subtraction, and parentheses. For example:

4 + 5 + 2 - 1       => 10
(4 + 5) - (2 + 1)   => 6
(1 + (2 + 3)) - 4   => 2

We’ll start with a straightforward implementation and optimize it step by step.

YOU CAN FIND THE FULL CODE ON: https://github.com/RPallas92/math_parser

Baseline implementation (43.1 s)

Here’s the first version of our parser:

use std::fs;
use std::io::Result;
use std::{iter::Peekable, time::Instant};

fn main() -> Result<()> {
    let total_start = Instant::now();
    let mut step_start = Instant::now();

    let input = read_input_file()?;
   
    println!("Step 1: Input file read in {:?}", step_start.elapsed());

    step_start = Instant::now();
    
    let result = eval(&input);
    
    println!(
        "Step 2: Calculation completed in {:?}",
        step_start.elapsed()
    );

    let total_duration = total_start.elapsed();
    
    println!("--- Summary ---");
    println!("Result: {}", result);
    println!("Total time: {:?}", total_duration);

    Ok(())
}

fn read_input_file() -> Result<String> {
    fs::read_to_string("data/input.txt")
}

fn eval(input: &str) -> u32 {
    let mut tokens = tokenize(input).into_iter().peekable();
    parse_expression(&mut tokens)
}

fn tokenize(input: &str) -> Vec<Token> {
    input
        .split_whitespace()
        .map(|s| match s {
            "+" => Token::Plus,
            "-" => Token::Minus,
            "(" => Token::OpeningParenthesis,
            ")" => Token::ClosingParenthesis,
            n => Token::Operand(n.parse().unwrap()),
        })
        .collect()
}

fn parse_expression(tokens: &mut Peekable<impl Iterator<Item = Token>>) -> u32 {
    let mut left = parse_primary(tokens);

    while let Some(Token::Plus) | Some(Token::Minus) = tokens.peek() {
        let operator: Option<Token> = tokens.next();
        let right = parse_primary(tokens);
        left = match operator {
            Some(Token::Plus) => left + right,
            Some(Token::Minus) => left - right,
            other => panic!("Expected operator, got {:?}", other),
        }
    }

    left
}

fn parse_primary(tokens: &mut Peekable<impl Iterator<Item = Token>>) -> u32 {
    match tokens.peek() {
        Some(Token::OpeningParenthesis) => {
            tokens.next(); // consume '('
            let val = parse_expression(tokens);
            tokens.next(); // consume ')'
            val // value of the parenthesized sub-expression
        }
        _ => parse_operand(tokens),
    }
}

fn parse_operand(tokens: &mut Peekable<impl Iterator<Item = Token>>) -> u32 {
    match tokens.next() {
        Some(Token::Operand(n)) => n,
        other => panic!("Expected number, got {:?}", other),
    }
}

#[derive(Debug, Clone, PartialEq)]
enum Token {
    Operand(u32),
    Plus,
    Minus,
    OpeningParenthesis,
    ClosingParenthesis,
}

How it works

Let’s break it down.

The program reads from a file called input.txt, which contains a math expression in a single line. That expression is passed to the eval() function.

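The tokenize() function processes the input string, splitting it by whitespace and converting each segment into a token. This means that in the input file every token, parentheses included, must be separated by whitespace; the examples in this post omit some of those spaces for readability. For example, this input: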

7 - 3 + 1

…is turned into this list of tokens:

[Operand(7), Minus, Operand(3), Plus, Operand(1)]

The parser then uses a simple recursive strategy:

parse_expression() handles sequences of additions and subtractions.
parse_primary() handles numbers and expressions inside parentheses.
parse_operand() handles the actual integer values.

The recursive call to parse_expression() inside parse_primary() allows us to evaluate nested expressions (parentheses).

Parser Example: (1 + 2) - 3

Let’s walk through parsing the expression (1 + 2) - 3 using our functions:

fn eval(input: &str) -> u32 {
    let mut tokens = tokenize(input).into_iter().peekable();
    parse_expression(&mut tokens)
}

Input string: (1 + 2) - 3

Tokens with index:

Index	Token
0	OpeningParenthesis `(`
1	Operand(1)
2	Plus `+`
3	Operand(2)
4	ClosingParenthesis `)`
5	Minus `-`
6	Operand(3)

We begin by calling parse_expression at depth 1:

parse_expression (depth 1)
- Calls parse_primary to get the first value.
parse_primary (depth 2)
- Sees OpeningParenthesis at index 0.
- Consumes ( and calls parse_expression (depth 3) for the parenthesized subexpression.
parse_expression (depth 3)
- Calls parse_primary (depth 4).
parse_primary (depth 4)
- Sees Operand(1) at index 1.
- Calls parse_operand (depth 5), which consumes index 1 and returns 1.
parse_expression (depth 3) (resuming)
- Sees Plus at index 2.
- Consumes + and calls parse_primary (depth 4) again.
parse_primary (depth 4)
- Sees Operand(2) at index 3.
- Calls parse_operand (depth 5), which consumes index 3 and returns 2.
parse_expression (depth 3)
- Combines 1 + 2 = 3.
- Returns 3 to the caller at depth 2.
parse_primary (depth 2) (resuming)
- Now at index 4 sees ClosingParenthesis.
- Consumes ) and returns the inner value 3.
parse_expression (depth 1) (resuming)
- Left side is 3.
- Sees Minus at index 5.
- Consumes - and calls parse_primary (depth 2).
parse_primary (depth 2)
- Sees Operand(3) at index 6.
- Calls parse_operand (depth 3), which consumes index 6 and returns 3.
parse_expression (depth 1)
- Computes 3 - 3 = 0.
- No more operators, so it returns 0 as the final result.

It works! But we can do better

This baseline parser works well, but it’s not optimized.

If we compile it in release mode and run it on the 1.5 GB test file, it takes 43.07 seconds on my laptop:

Step 1: Input file read in 1.189915008s
Step 2: Calculation completed in 41.876205675s

--- Summary ---
Result: 2652
Total time: 43.06795088s

While the current parser is correct, its 43-second runtime shows there is room for improvement. Our goal is to make it faster and more memory-efficient.

We will improve the parser’s performance through several key optimizations:

Eliminate unnecessary allocations: First, we’ll change the tokenizer to avoid creating a list of tokens in memory.
Process bytes directly: We’ll modify the parser to read raw bytes instead of string slices, reducing overhead.
Parallelize the work: We’ll use multithreading and SIMD to perform calculations in parallel.
Optimize file I/O: Finally, we’ll use memory-mapped files to speed up file reading.

Let’s get started.

Optimizations for speed and memory

Optimization 1: Do not allocate a Vector when tokenizing (43.1 s → 6.45 s, –85% improvement)

Let’s use cargo flamegraph to visualize the call stack of the current solution and identify areas for optimization.

cargo flamegraph --dev --bin parser

We get the following flame graph:

We can see that the majority of the time is spent in the tokenizer function, which reads the input string and allocates a vector of tokens.

To profile memory usage, we can use dhat to generate a profile JSON file and view it at https://nnethercote.github.io/dh_view/dh_view.html:

Notice how 4 GB of RAM is used just to allocate the token vector!

I made a mistake in my initial implementation. Why does the tokenize function return a vector if we’re converting it into an iterator later anyway? Let’s just return a lazy iterator directly instead of allocating a vector:

fn eval(input: &str) -> u32 {
    let mut tokens = tokenize(input).peekable();
    parse_expression(&mut tokens)
}

fn tokenize(input: &str) -> impl Iterator<Item = Token> + '_ {
    input.split_whitespace().map(|s| match s {
        "+" => Token::Plus,
        "-" => Token::Minus,
        "(" => Token::OpeningParenthesis,
        ")" => Token::ClosingParenthesis,
        n => Token::Operand(n.parse().unwrap()),
    })
}

If we run the parser again after this small change, the speed improves significantly:

Step 1: Input file read in 1.249408413s
Step 2: Calculation completed in 5.204344393s

--- Summary ---
Result: 2652
Total time: 6.45377661s

Wow! From 43 seconds down to just 6.45. What an improvement. A small mistake can have a huge impact on performance. Fortunately, the flamegraph pointed us straight to the bottleneck!

Optimization 2: Zero allocations — parse directly from the input bytes (6.45 s → 3.68 s, –43% improvement)

After removing the initial Vec allocation, performance improved significantly. But we can still do better.

If we analyze the flamegraph again, we notice that although we no longer allocate a vector of tokens, we’re still splitting the input string by whitespace. This iterator-based approach is a huge improvement, but there is still overhead in processing the string and creating &str slices for each token:

The pink/violet boxes correspond to the split_whitespace function used by our tokenizer:

fn tokenize(input: &str) -> impl Iterator<Item = Token> + '_ {
    input.split_whitespace().map(|s| match s {
        "+" => Token::Plus,
        "-" => Token::Minus,
        "(" => Token::OpeningParenthesis,
        ")" => Token::ClosingParenthesis,
        n => Token::Operand(n.parse().unwrap()),
    })
}

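We’re paying a cost for every item split_whitespace yields. The slices it returns borrow from the input, so nothing is allocated on the heap, but each one requires Unicode-aware whitespace detection, and n.parse() then scans the same bytes a second time. This burns CPU cycles.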

Let’s dive deeper.

The idea: Use &[u8]

Instead of working with UTF-8 strings and &str, we can use raw bytes (&[u8]) and manually scan for digits and operators. This skips the intermediate &str slices as well as the UTF-8 validation that read_to_string performs.

Here is our new zero-allocation tokenizer:

fn read_input_file() -> Result<Vec<u8>> {
    fs::read("data/input.txt")
}

struct Tokenizer<'a> {
    input: &'a [u8],
    pos: usize,
}

impl<'a> Iterator for Tokenizer<'a> {
    type Item = Token;

    fn next(&mut self) -> Option<Self::Item> {
        if self.pos >= self.input.len() {
            return None;
        }

        let byte = self.input[self.pos];

        self.pos += 1;

        let token = match byte {
            b'+' => Some(Token::Plus),
            b'-' => Some(Token::Minus),
            b'(' => Some(Token::OpeningParenthesis),
            b')' => Some(Token::ClosingParenthesis),
            b'0'..=b'9' => {
                // Widen to u32 so multi-digit operands do not overflow a u8.
                let mut value = (byte - b'0') as u32;
                while self.pos < self.input.len() && self.input[self.pos].is_ascii_digit() {
                    value = 10 * value + (self.input[self.pos] - b'0') as u32;
                    self.pos += 1;
                }

                Some(Token::Operand(value))
            }
            other => panic!("Unexpected byte: '{}'", other as char),
        };

        self.pos += 1; // skip whitespace

        return token;
    }
}

The only heap allocation occurs when the file is read into a vector. The tokenizer operates on references to that vector of bytes and does not perform any intermediate allocations.

If we execute the program again, we get:

Step 1: Input file read in 1.212080967s
Step 2: Calculation completed in 2.471639289s

--- Summary ---
Result: 2652
Total time: 3.683753465s

A great improvement! From 6.45 to 3.68 seconds, almost 2.8 seconds faster!
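The tokenizer above assumes exactly one separator byte after every token. As a hedged variant (not the version benchmarked in this post), the sketch below skips any run of ASCII whitespace before each token instead. LenientTokenizer is a hypothetical name, and Token mirrors the enum from the baseline.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Operand(u32),
    Plus,
    Minus,
    OpeningParenthesis,
    ClosingParenthesis,
}

// Hypothetical variant of the tokenizer: tolerates zero or more
// whitespace bytes between tokens.
struct LenientTokenizer<'a> {
    input: &'a [u8],
    pos: usize,
}

impl<'a> LenientTokenizer<'a> {
    fn new(input: &'a [u8]) -> Self {
        Self { input, pos: 0 }
    }
}

impl<'a> Iterator for LenientTokenizer<'a> {
    type Item = Token;

    fn next(&mut self) -> Option<Token> {
        // Skip leading ASCII whitespace (spaces, tabs, newlines).
        while self.pos < self.input.len() && self.input[self.pos].is_ascii_whitespace() {
            self.pos += 1;
        }
        // End of input: no more tokens.
        let byte = *self.input.get(self.pos)?;
        self.pos += 1;

        Some(match byte {
            b'+' => Token::Plus,
            b'-' => Token::Minus,
            b'(' => Token::OpeningParenthesis,
            b')' => Token::ClosingParenthesis,
            b'0'..=b'9' => {
                // Accumulate digits into a u32; still no heap allocation.
                let mut value = (byte - b'0') as u32;
                while let Some(&d) = self.input.get(self.pos) {
                    if !d.is_ascii_digit() {
                        break;
                    }
                    value = 10 * value + (d - b'0') as u32;
                    self.pos += 1;
                }
                Token::Operand(value)
            }
            other => panic!("Unexpected byte: '{}'", other as char),
        })
    }
}
```

The trade-off is one extra loop per token, so this is worth considering only if the input format is not under our control.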

Optimization 3: Do not use Peekable (3.68 s → 3.21 s, –13% improvement)

The new flamegraph shows several samples related to Peekable:

core::iter::adapters::peekable::Peekable::peek::_
core::iter::adapters::peekable::Peekable::peek

This is because we wrap our tokenizer in Rust’s Peekable adapter, which allows us to inspect the next token without consuming it. We initially used it for lookahead when parsing expressions like 1 + (2 - 3) to determine whether to continue parsing or return early.

However, in our use case, peek() isn’t necessary. We can restructure the algorithm to work directly with a plain iterator.

Here’s the old version:

fn parse_expression(tokens: &mut Peekable<impl Iterator<Item = Token>>) -> u32 {
    let mut left = parse_primary(tokens);

    while let Some(Token::Plus) | Some(Token::Minus) = tokens.peek() {
        let operator = tokens.next();
        let right = parse_primary(tokens);
        left = match operator {
            Some(Token::Plus) => left + right,
            Some(Token::Minus) => left - right,
            other => panic!("Expected operator, got {:?}", other),
        };
    }

    left
}

And here’s the new version that eliminates Peekable:

fn parse_expression(tokens: &mut impl Iterator<Item = Token>) -> u32 {
    let mut left = parse_primary(tokens);

    while let Some(token) = tokens.next() {
        if token == Token::ClosingParenthesis {
            break;
        }

        let right = parse_primary(tokens);
        left = match token {
            Token::Plus => left + right,
            Token::Minus => left - right,
            other => panic!("Expected operator, got {:?}", other),
        };
    }

    left
}

We replaced the peek() logic with a match on the current token. If it’s a + or -, we consume the right-hand operand and compute the result. If it’s a closing parenthesis, we break (this is an important point: we no longer manually skip the closing parenthesis after parsing a sub-expression).

Previously, with peekable, we consumed the (, parsed the sub-expression, and then had to explicitly next() again to discard the ) after the recursive call. Now, since we’re using a flat iterator, we simply let the closing ) token be returned by next(), and our while let Some(token) loop handles it. If the token is a ), we break out of the loop, and the recursive call returns.

We also simplified parse_primary in a similar way:

fn parse_primary(tokens: &mut impl Iterator<Item = Token>) -> u32 {
    match tokens.next() {
        Some(Token::OpeningParenthesis) => {
            let val = parse_expression(tokens);
            val
        }
        Some(Token::Operand(n)) => n as u32,
        other => panic!("Expected number, got {:?}", other),
    }
}

By avoiding peek() and handling the tokens linearly, we improve the performance:

Step 1: Input file read in 1.116952011s
Step 2: Calculation completed in 2.094806113s

--- Summary ---
Result: 2652
Total time: 3.21178544s

From 3.68 to 3.21 seconds. We are getting faster. Let’s continue optimizing!
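To make the peek-free control flow easy to try in isolation, here is a minimal sketch that fuses scanning and parsing over a byte cursor and returns i64 so that negative results are representable. It is not the code from the repository: Cursor, next_byte and eval_i64 are hypothetical names, and the Token enum is skipped entirely for brevity.

```rust
/// Cursor over the input bytes (hypothetical helper for this sketch).
struct Cursor<'a> {
    input: &'a [u8],
    pos: usize,
}

impl<'a> Cursor<'a> {
    /// Returns the next non-whitespace byte, or None at end of input.
    fn next_byte(&mut self) -> Option<u8> {
        while let Some(&b) = self.input.get(self.pos) {
            self.pos += 1;
            if !b.is_ascii_whitespace() {
                return Some(b);
            }
        }
        None
    }
}

fn eval_i64(input: &[u8]) -> i64 {
    let mut cursor = Cursor { input, pos: 0 };
    parse_expression(&mut cursor)
}

fn parse_expression(c: &mut Cursor) -> i64 {
    let mut left = parse_primary(c);
    // Same shape as the peek-free loop: ')' or end of input ends the expression.
    while let Some(op) = c.next_byte() {
        match op {
            b')' => break,
            b'+' => left += parse_primary(c),
            b'-' => left -= parse_primary(c),
            other => panic!("Expected operator, got '{}'", other as char),
        }
    }
    left
}

fn parse_primary(c: &mut Cursor) -> i64 {
    match c.next_byte() {
        // The matching ')' is consumed by the loop in parse_expression.
        Some(b'(') => parse_expression(c),
        Some(d @ b'0'..=b'9') => {
            let mut value = (d - b'0') as i64;
            while let Some(&n) = c.input.get(c.pos) {
                if !n.is_ascii_digit() {
                    break;
                }
                value = 10 * value + (n - b'0') as i64;
                c.pos += 1;
            }
            value
        }
        other => panic!("Expected number or '(', got {:?}", other),
    }
}
```

Calling eval_i64(b"(1 + 2) - 3") walks the same path as the trace earlier in the post: the inner call returns when it reads the closing parenthesis, and the outer loop then handles the subtraction.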

Optimization 4: Multithreading and SIMD (3.21 s → 2.21 s, –31% improvement)

The next logical step is to parallelize the computation. Ideally, if we have a CPU with 8 cores, we want to split the input file into 8 equal chunks and have each core work on one chunk simultaneously. This should, in theory, make our program up to 8 times faster.

However, this is not as simple as just splitting the file into 8 equal chunks. We are bound by the rules of math and syntax, which introduce two restrictions:

We cannot split inside parentheses. A split can only happen at the “top level” of the expression. For example, splitting ((2 + 1)| - 2) is invalid, and this applies to nested parentheses as well.
We cannot split at a - operator. Addition is associative, meaning (a + b) + c is equivalent to a + (b + c). This property allows us to group additions freely. Subtraction, however, is not associative: (a - b) - c is not the same as a - (b - c). Splitting on a - would alter the order of operations and lead to an incorrect result.

These restrictions mean we cannot simply split the file at (total_size / 8). We need a way to find the closest valid split point (a + sign at the top level) to that ideal boundary.

To find these points, we would need to scan the entire input to identify where all current parentheses are closed. A naive scan for this would be slow, requiring a full pass over the data just to find the split points before the actual work begins. So, is this solution slower (2 passes vs 1 pass)? Not necessarily. We can make the first pass blazing fast by using SIMD.
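Before reaching for SIMD, it helps to have a plain scalar version of this search to test against. The sketch below is such a reference, assuming well-formed input; find_split_indices_scalar is a hypothetical name, not a function from the repository.

```rust
/// Scalar reference: returns up to `num_splits` indices of top-level '+'
/// bytes, each one the first top-level '+' at or after its ideal boundary.
fn find_split_indices_scalar(input: &[u8], num_splits: usize) -> Vec<usize> {
    let mut indices = Vec::with_capacity(num_splits);
    if num_splits == 0 {
        return indices;
    }
    let chunk_size = input.len() / (num_splits + 1);
    let mut depth: i32 = 0;

    for (i, &byte) in input.iter().enumerate() {
        if indices.len() == num_splits {
            break;
        }
        match byte {
            b'(' => depth += 1,
            b')' => depth -= 1,
            // Only a '+' outside all parentheses is a valid split point.
            b'+' if depth == 0 => {
                let ideal_pos = (indices.len() + 1) * chunk_size;
                if i >= ideal_pos {
                    indices.push(i);
                }
            }
            _ => {}
        }
    }
    indices
}
```

A '-' is never selected, and a '+' inside parentheses is ignored because depth is greater than zero there. The function may return fewer indices than requested when the input has too few top-level '+' signs.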

The algorithm at a high level

Before diving into the code, let’s look at the high-level plan. The entire process is started by our parallel_eval function, which follows this data flow:

[ Input File ]
      |
      v
.-----------------------.
|     parallel_eval     |
'-----------------------'
      |
      | 1. Find Splits
      v
.----------------------------------.
| find_best_split_indices_simd     |------> [ Split Indices ]
'----------------------------------'             |
      |                                          |
      | 2. Create Chunks                         |
      v                                          |
[ Chunk 1 ] [ Chunk 2 ] ... [ Chunk N ] <--------+
      |         |               |
      |         |               | 3. Process in Parallel
      v         v               v
.------------------------------------.
|          Thread Pool               |
|                                    |
|  eval(c1)  eval(c2) ...  eval(cN)  |
'------------------------------------'
      |
      | 4. Collect Results
      v
[ Result 1, Result 2, ... Result N ]
      |
      | 5. Sum Results
      v
[ Final Answer ]

What is SIMD?

SIMD stands for Single Instruction, Multiple Data. It’s a powerful feature built into modern CPUs. At its core, SIMD allows the CPU to perform the same operation on multiple pieces of data at the same time, with a single instruction.

Consider a cashier at a grocery store. A traditional CPU core operates like a cashier scanning items one by one. This is a scalar operation, where one instruction processes one piece of data.

Scalar Operation (One by one)
Instruction: Is this byte '+'?
      |
      V
[ H | e | l | l | o |   | + |   | W | o | r | l | d ]
  ^--- Processed sequentially --->

A SIMD-enabled CPU is like a cashier with a wide scanner that can read the barcodes of an entire row of items in the cart simultaneously. This is a vector operation.

SIMD Operation (All at once)
Instruction: For all 64 of these bytes, tell me which ones are '+'?
      |
      V
[ H | e | l | l | o |   | + |   | W | o | r | l | d | ... (up to 64 bytes) ]
[ 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... (result mask)  ]
\_______________________________________________________________________/
                       Processed by a single instruction

For repetitive tasks, such as searching for a specific character in a long string, the performance gain is great.

SIMD example: Finding +

In our project, we need to locate all + characters.

The Scalar Way: Without SIMD, we would need a simple for loop to check every single byte:

let mut positions = Vec::new();
for (i, &byte) in input.iter().enumerate() {
    if byte == b'+' {
        positions.push(i);
    }
}

This approach is simple and correct, but for a 1.5GB file like ours, this loop would execute 1.5 billion times.

The SIMD Way: With SIMD (specifically, using AVX-512 instructions), the process is different:

Load: We load a big chunk of our input string (64 bytes at a time) into a wide 512-bit CPU register.
Compare: We use a single instruction (_mm512_cmpeq_epi8_mask) to compare all 64 bytes in our register against a template register that contains 64 copies of the + character.
Get Mask: The CPU returns a single 64-bit integer (u64) as a result. This is a bitmask. If the 5th bit of this integer is 1, it indicates that the 5th byte of our input chunk was a +.

In a single instruction, we have done the work of 64 loop iterations. While SIMD requires more complex code, the performance gains are worth it.
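The same load, compare and get-mask steps can be tried on any x86-64 machine with 128-bit SSE2 instead of AVX-512, since SSE2 is part of the x86-64 baseline. The sketch below is that narrower variant, 16 bytes at a time instead of 64; plus_mask16 is a hypothetical name, and a scalar fallback is included so the example also builds on other architectures.

```rust
/// Returns a 16-bit mask where bit N is set if byte N of `chunk` is b'+'.
#[cfg(target_arch = "x86_64")]
fn plus_mask16(chunk: &[u8; 16]) -> u16 {
    use std::arch::x86_64::*;
    // SAFETY: SSE2 is always available on x86_64, and `chunk` points to
    // 16 readable bytes (the load does not require alignment).
    unsafe {
        // 1. Load 16 bytes into a 128-bit register.
        let data = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
        // 2. Compare every byte against a register filled with '+'.
        let pluses = _mm_set1_epi8(b'+' as i8);
        let eq = _mm_cmpeq_epi8(data, pluses);
        // 3. Collapse the per-byte results into one bit per byte.
        _mm_movemask_epi8(eq) as u16
    }
}

/// Scalar fallback with the same contract, for non-x86_64 targets.
#[cfg(not(target_arch = "x86_64"))]
fn plus_mask16(chunk: &[u8; 16]) -> u16 {
    let mut mask = 0u16;
    for (i, &b) in chunk.iter().enumerate() {
        if b == b'+' {
            mask |= 1u16 << i;
        }
    }
    mask
}
```

For the 16 bytes of "1 + 2 + 3 + 4   " the '+' signs sit at indices 2, 6 and 10, so bits 2, 6 and 10 of the mask are set. AVX-512 follows the same pattern with 64 lanes and returns the mask directly from the compare instruction.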

The Code

Here are the two key functions that implement our parallel strategy: parallel_eval orchestrates the process, and find_best_split_indices_simd uses SIMD to find the valid split points.

use rayon::prelude::*;
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

fn parallel_eval(input: &[u8], num_threads: usize) -> i64 {
    if num_threads <= 1 || input.len() < 1000 {
        return eval(input);
    }

    // 1. Find the best places to split the input.
    let split_indices = unsafe { find_best_split_indices_simd(input, num_threads - 1) };

    if split_indices.is_empty() {
        return eval(input);
    }

    // 2. Create the chunks based on the indices.
    let mut chunks = Vec::with_capacity(num_threads);
    let mut last_idx = 0;
    for &idx in &split_indices {
        // Slice from the last index to just before the operator's space.
        chunks.push(&input[last_idx..idx - 1]);
        // The next chunk starts after the operator and its space.
        last_idx = idx + 2;
    }
    chunks.push(&input[last_idx..]);

    // 3. Process all chunks in parallel with Rayon.
    let chunk_results: Vec<i64> = chunks.par_iter().map(|&chunk| eval(chunk)).collect();

    // 4. Since we only split on '+', the final result is the sum of all parts.
    chunk_results.into_iter().sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx512f")]
#[target_feature(enable = "avx512bw")]
unsafe fn find_best_split_indices_simd(input: &[u8], num_splits: usize) -> Vec<usize> {
    // Explanation of this function in the next section.
    let mut final_indices = Vec::with_capacity(num_splits);
    if num_splits == 0 {
        return final_indices;
    }

    let chunk_size = input.len() / (num_splits + 1);
    let mut target_idx = 1;
    let mut last_op_at_depth_zero = 0;
    let mut depth: i32 = 0;
    let mut i = 0;
    let len = input.len();

    let open_parens = _mm512_set1_epi8(b'(' as i8);
    let close_parens = _mm512_set1_epi8(b')' as i8);
    let pluses = _mm512_set1_epi8(b'+' as i8);

    'outer: while i + 64 <= len {
        if final_indices.len() >= num_splits {
            break;
        }
        let chunk = _mm512_loadu_si512(input.as_ptr().add(i) as *const _);
        let open_mask = _mm512_cmpeq_epi8_mask(chunk, open_parens);
        let close_mask = _mm512_cmpeq_epi8_mask(chunk, close_parens);
        let plus_mask = _mm512_cmpeq_epi8_mask(chunk, pluses);

        let mut all_interesting_mask = open_mask | close_mask | plus_mask;

        while all_interesting_mask != 0 {
            let j = all_interesting_mask.trailing_zeros() as usize;
            let current_idx = i + j;
            if (open_mask >> j) & 1 == 1 {
                depth += 1;
            } else if (close_mask >> j) & 1 == 1 {
                depth -= 1;
            } else { // Is a '+' operator
                if depth == 0 {
                    last_op_at_depth_zero = current_idx;
                    let ideal_pos = target_idx * chunk_size;
                    if current_idx >= ideal_pos {
                        final_indices.push(current_idx);
                        target_idx += 1;
                        if final_indices.len() >= num_splits {
                            break 'outer;
                        }
                    }
                }
            }
            all_interesting_mask &= all_interesting_mask - 1;
        }
        i += 64;
    }

    // ... scalar remainder and fill logic ...
    while i < len && final_indices.len() < num_splits {
        let char_byte = *input.get_unchecked(i);
        if char_byte == b'(' { depth += 1; }
        else if char_byte == b')' { depth -= 1; }
        else if char_byte == b'+' && depth == 0 {
            last_op_at_depth_zero = i;
            let ideal_pos = target_idx * chunk_size;
            if i >= ideal_pos {
                final_indices.push(i);
                target_idx += 1;
            }
        }
        i += 1;
    }
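    // Fallback: if we are still short on splits, use the last top-level '+' once,
    // and only if it lies after the last index already chosen. Pushing the same
    // index twice would make parallel_eval slice with start > end and panic.
    if final_indices.len() < num_splits
        && last_op_at_depth_zero > final_indices.last().copied().unwrap_or(0)
    {
        final_indices.push(last_op_at_depth_zero);
    }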
    final_indices
}

Algorithm Breakdown: `find_best_split_indices_simd`

This function’s purpose is to identify the optimal + signs for splitting.

Step 1: The SIMD Scan

The code enters a main loop, processing the input in 64-byte chunks. Within this loop, it uses _mm512_cmpeq_epi8_mask to generate bitmasks. This instruction compares all 64 bytes of the current chunk against a target character and returns a 64-bit integer (u64) where the N-th bit is 1 if the N-th byte was a match.

Step 2: The serial scan

Next, we combine these masks and iterate only through the “interesting” bits. This is a key step:

let mut all_interesting_mask = open_mask | close_mask | plus_mask; // This means we are looking for '(', ')' and '+' characters.

while all_interesting_mask != 0 {
    let j = all_interesting_mask.trailing_zeros() as usize; // j is the index of the next found interesting character.
    let current_idx = i + j; // Bit j of the mask corresponds to byte j of the chunk, so i + j is its absolute index in the input.
    if (open_mask >> j) & 1 == 1 { // If that char is a '(' we increase the depth (we enter in a sub expression)
        depth += 1;
    } else if (close_mask >> j) & 1 == 1 { // If that char is a ')' we decrease the depth (we exit from a sub expression)
        depth -= 1;
    } else {
        if depth == 0 { // If the depth is 0, we are at a top level, outside of parentheses. And it is a '+' sign.
            last_op_at_depth_zero = current_idx;
            let ideal_pos = target_idx * chunk_size; // The boundary we are aiming for, where chunk_size = input.len() / NUM_THREADS.
            if current_idx >= ideal_pos { // We have reached or passed the ideal position, so we add this '+' sign to the splitting indices.
                final_indices.push(current_idx);
                target_idx += 1;
                if final_indices.len() >= num_splits {
                    break 'outer;
                }
            }
        }
    }
    all_interesting_mask &= all_interesting_mask - 1; // Clears the lowest set 1 bit from the mask, as we have processed it already
}

This loop does not run 64 times. It only runs once per set bit in all_interesting_mask. To understand how it processes characters from left to right, we need to look at three key details:

trailing_zeros() and Little-Endian: While you might assume trailing_zeros starts from the end of the string, it’s actually the opposite. x86-64 CPUs are little-endian, so when the 64 bytes are loaded into the register, the first byte in memory (chunk[0]) lands in the lowest lane. _mm512_cmpeq_epi8_mask then sets bit N of the mask when byte N matches, which makes bit 0, the least significant bit (LSB), stand for chunk[0]. trailing_zeros() counts from this LSB, meaning it always finds the set bit corresponding to the character with the lowest index in our chunk.
```
Memory (Bytes in a chunk):
  Byte Index:   0   1   2   3   ...   63
  Content:     '(' '1' '+' '2'  ...   'X'

      |
      |  SIMD compare: one mask bit per byte
      v

Resulting u64 Bitmask:
  Bit Position:  63  ...   3   2   1   0   <-- LSB (The "trailing" end)
  Corresponds to: 'X' ...  '2' '+' '1' '('
```
As you can see, trailing_zeros starts from the right of the integer, which corresponds to the left of our string chunk.
if (open_mask >> j) & 1 == 1 {: This is just to check if there is an open parenthesis at position j. If so, we increment our counter depth.
all_interesting_mask &= all_interesting_mask - 1: This is a trick that clears the lowest set 1 bit we just found. On the next iteration, trailing_zeros finds the new lowest set bit, which corresponds to the character at the next lowest index.

This combination allows us to visit every interesting character in the correct, forward order, but without a slow byte-by-byte scan. Inside the loop, we just update our depth counter to know if we are at a top level position, and if so, we check if we can add a splitting point.
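The bit-walking idiom is easy to check on its own. Here is a minimal sketch, independent of SIMD, that collects the positions of the set bits of a mask in ascending order; set_bit_positions is a hypothetical helper name.

```rust
/// Returns the positions of the set bits in `mask`, lowest first.
fn set_bit_positions(mut mask: u64) -> Vec<u32> {
    let mut positions = Vec::with_capacity(mask.count_ones() as usize);
    while mask != 0 {
        // Index of the lowest set bit, i.e. the lowest matching byte index.
        positions.push(mask.trailing_zeros());
        // Clear that bit: subtracting 1 flips it and every bit below it,
        // so the AND keeps only the bits above it.
        mask &= mask - 1;
    }
    positions
}
```

For the mask 0b1010_0100 the loop runs three times and yields 2, 5 and 7, regardless of how many zero bits sit in between.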

Full example

Let’s trace the entire flow with a small, concrete example:

Input String: (1-2) + (3-4) + (5-6) (Length is 21 bytes)
Goal: Find 1 split point (num_splits = 1) to create 2 chunks.
Ideal Split Position: 1 * (21 / 2) = 10, using integer division. We are looking for the first + at depth 0 at or after byte 10.

Part 1: `find_best_split_indices_simd` runs

The function will scan the input to find the best split point. An input this short is handled by the scalar remainder loop, and parallel_eval would not split anything below 1000 bytes, but the logic is the same.

Input:        ( 1 - 2 )   +   ( 3 - 4 )   +   ( 5 - 6 )
Index:        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0
                                  1 1 1 1 1 1 1 1 1 1 2
Depth:        1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0
Ideal Split ->                    ^

The code starts scanning. It finds the first + at index 6.
It checks the depth. The ( at index 0 increased depth to 1, and the ) at index 4 decreased it back to 0. So, at index 6, depth == 0.
It checks the splitting logic: is current_idx (6) >= ideal_pos (10)? The answer is No. The code continues scanning.
The code finds the next + at index 14.
It checks the depth. The ( at index 8 raised the depth to 1 and the ) at index 12 brought it back to 0.
It checks the splitting logic: is current_idx (14) >= ideal_pos (10)? The answer is Yes!
Action: The code pushes 14 into final_indices and immediately stops scanning because it has found the 1 split it was looking for.
The function returns [14].

Part 2: `parallel_eval` receives the result

split_indices is now [14].
The for loop runs once for the index 14.
- It creates the first chunk by slicing from 0 up to, but not including, 14 - 1 = 13. The chunk is (1-2) + (3-4).
- It updates last_idx to 14 + 2 = 16.

The loop finishes. It creates the final chunk by slicing from 16 to the end. The chunk is (5-6).

Original:     (1-2) + (3-4)   +   (5-6)
              <-- chunk 1 -->   <-- chunk 2 -->

Split Index:                    ^ (14)

The two chunks, (1-2) + (3-4) and (5-6), are sent to the Rayon thread pool.
Thread 1 gets (1-2) + (3-4), calls eval, and gets the result -2.
Thread 2 gets (5-6), calls eval, and gets the result -1.
collect() gathers the results into a vector: [-2, -1].
Finally, sum() adds them together: -2 + -1 = -3.

The final answer is -3, which is correct. The entire process worked perfectly.
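The chunking and summing steps can also be sketched with only the standard library. In the hedged example below, std::thread::scope stands in for Rayon’s par_iter, and the per-chunk evaluator is passed in as a closure; split_chunks and parallel_sum are hypothetical names, and one thread per chunk is an assumption that only makes sense for a handful of chunks.

```rust
use std::thread;

/// Cuts `input` around each split index, dropping the " + " separator
/// (the '+' at `idx` plus one space on each side).
fn split_chunks<'a>(input: &'a [u8], split_indices: &[usize]) -> Vec<&'a [u8]> {
    let mut chunks = Vec::with_capacity(split_indices.len() + 1);
    let mut last_idx = 0;
    for &idx in split_indices {
        chunks.push(&input[last_idx..idx - 1]);
        last_idx = idx + 2;
    }
    chunks.push(&input[last_idx..]);
    chunks
}

/// Runs `eval_chunk` on every chunk in its own scoped thread and sums the results.
fn parallel_sum<F>(chunks: &[&[u8]], eval_chunk: F) -> i64
where
    F: Fn(&[u8]) -> i64 + Sync,
{
    let f = &eval_chunk;
    thread::scope(|s| {
        // Spawn all threads first, then join them, so the chunks run concurrently.
        let handles: Vec<_> = chunks
            .iter()
            .map(|&chunk| s.spawn(move || f(chunk)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .sum::<i64>()
    })
}
```

Scoped threads may borrow the input slices directly, so no chunk is copied. Because the input was split only on top-level + signs, adding the per-chunk results gives the same value as evaluating the whole expression.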

Result

In summary, by using SIMD, we can perform an initial, extremely fast pass to identify optimal split points in the input, and then process each chunk in parallel. I believe a similar technique is employed by the popular simdjson library.

I executed the code on my Surface laptop, and here are the results:

Step 1: Input file read in 1.199915008s

Step 2: Calculation completed in 1.010507822s

--- Summary ---
Result: 2652
Total time: 2.210422830s

From 3.21 to 2.21 seconds. 1 second faster, for an already optimized program. Good!

Optimization 5: Memory-Mapped I/O (2.21 s → 0.98 s, –56% improvement)

After profiling the memory usage of our parallel solution, we can see that we’re still allocating a very large buffer on the heap to hold the entire file’s contents.

mmap (memory-mapped files) can be more efficient than standard file I/O because it avoids extra copying from kernel to user space, and allows the operating system to manage memory for us.

When I initially tried mmap with the single-threaded version of the code, the performance gain was negligible. However, now that our program is multithreaded, let’s re-evaluate its impact.

Kernel Space vs. User Space

Kernel space: The privileged area where the operating system runs, managing hardware, I/O, and the page cache.
User space: The unprivileged area where our application code executes, including our heap buffers, stacks, and other program data.
Page cache: A kernel-managed buffer that temporarily stores file data in memory to speed up subsequent access.

Cost of fs::read

Double Memory Footprint
```
[ Disk ] → [ Page Cache ] (1.5 GB)  
          → [ Heap Vec ] (1.5 GB)  
```
This process involves loading the file into kernel space and then copying it to user space.
False Sharing Contention
Modern CPUs transfer data between main memory and CPU caches in 64-byte blocks called “cache lines.” False sharing occurs when multiple threads access different variables that happen to reside on the same cache line. If one thread modifies its variable, the entire cache line is invalidated for all other threads, forcing them to re-fetch it from memory even though their own data hasn’t changed.
```
// Thread 1 writes to data at byte 8
// Thread 2 writes to data at byte 40
// Both bytes are in the same 64-byte cache line (0-63).
// The cache line "bounces" between the cores, causing delays.
```
With a single large Vec, the boundaries of the chunks processed by each thread could easily fall in a way that causes this contention.

mmap improvement

Instead of reading the entire file into a Vec, we can map it directly into memory with mmap. This gives us:

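use memmap2::Mmap;
use std::fs::File;

fn read_input_file() -> std::io::Result<Mmap> {
    let file = File::open("data/input.txt")?;
    // SAFETY: mapping is unsafe because the file must not be modified or
    // truncated by another process while the map is alive.
    unsafe { Mmap::map(&file) }
}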

This approach avoids the extra copy performed by fs::read and does not allocate the file’s content in user space memory.

[ Disk ] → [ Page Cache (1.5 GB) ] ↔ [ mmap view in user space ]

I also think it is faster because we don’t have false sharing with mmap. It hands us the file in 4 KB pages. Threads get whole pages:

Thread 1 works on data starting at Page 0 (byte 0)
Thread 2 works on data starting at Page N (byte N*4096)

Since pages (4 KB) are much larger than cache lines (64 B), threads operate on memory regions that are physically far apart, preventing them from contending over the same cache lines. I’m not entirely certain about this, but it’s my conclusion after reading a lot about the topic and consulting several LLMs.

Code Changes

The change is minimal. The read_input_file function now returns an Mmap object, which is passed directly to the parallel_eval function. This allows the operating system to efficiently map the file directly into our process memory on demand:

use memmap2::Mmap;
use std::fs::File;
use std::io::Result;

fn read_input_file() -> Result<Mmap> {
    let file = File::open("data/input.txt")?;
    unsafe { Mmap::map(&file) }
}

fn main() -> Result<()> {
    let mmap = read_input_file()?;
    let result = parallel_eval(&mmap, NUM_THREADS);
    println!("Result: {}", result);
    Ok(())
}

Performance Results

Step 1: Input file read in 18.8 µs  
Step 2: Calculation completed in 981.2 ms  
**Total time:** 981.3 ms

From 2.21s to 981ms. Less than a second!!

Conclusion

YOU CAN FIND THE FULL CODE ON: https://github.com/RPallas92/math_parser

We started with a simple math parser that took 43 seconds to run. By making a series of changes, we made it run in under one second. Here is a summary of what we did:

Stopped creating a list of all tokens at once. Instead of reading the whole file and creating a big list of tokens, we processed them one by one. This was the biggest improvement, bringing the time down from 43 to 6.4 seconds. (To be honest, I made this mistake on purpose just to see the difference.)
Worked with bytes instead of text. Instead of treating the input as text, we worked with the raw bytes. This avoided extra work and brought the time down to 3.7 seconds.
Simplified the code by removing Peekable. We changed the logic to avoid peeking at the next token, which made the code faster, reducing the time to 3.2 seconds.
Used multiple threads and modern CPU features. We used Rayon to run calculations in parallel and SIMD to find split points faster. This brought the time down to 2.2 seconds.
Used memory-mapped files. Instead of reading the file into memory ourselves, we let the operating system handle it. This was the final optimization, bringing the time down to just 0.98 seconds.

If you have any corrections or comments, please contact me on LinkedIn or via email. Thank you very much for reading!

Pingora async runtime and threading model

2024-09-07T08:00:00+00:00

Pingora async runtime and threading model

Introduction

Cloudflare open-sourced their Rust framework for building programmable network services called Pingora. They developed it to replace NGINX to overcome its limitations, as explained in their blog:

Today we are excited to talk about Pingora, a new HTTP proxy we’ve built in-house using Rust that serves over 1 trillion requests a day, boosts our performance, and enables many new features for Cloudflare customers, all while requiring only a third of the CPU and memory resources of our previous proxy infrastructure.

…

Over the years, our usage of NGINX has run up against limitations. For some limitations, we optimized or worked around them. But others were much harder to overcome.

I am starting a series of posts where I will dive deep into the internals of Pingora. Each post will cover a different aspect of Pingora’s architecture. In this post, we will discuss its runtime for running async Rust and its threading model.

Pingora’s async runtime

Pingora’s async runtime is based on Tokio. Tokio is the de facto runtime for writing asynchronous, non-blocking code in Rust:

Tokio is scalable, built on top of the async/await language feature, which itself is scalable. When dealing with networking, there’s a limit to how fast you can handle a connection due to latency, so the only way to scale is to handle many connections at once. With the async/await language feature, increasing the number of concurrent operations becomes incredibly cheap, allowing you to scale to a large number of concurrent tasks.

Pingora offers two multi-threaded runtimes (with and without work stealing) that we will discuss later. But first, let’s explain how Tokio’s runtime works, as it forms the basis of Pingora.

Tokio’s runtime

The Tokio runtime is composed of three main components:

An I/O event loop, called the driver, which drives I/O resources and dispatches I/O events to tasks that depend on them.
A scheduler to execute tasks that use these I/O resources.
A timer for scheduling work to run after a set period of time.

The tokio::main macro provides a default-configured runtime:

use futures::future::join_all;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    println!("Executing main");

    let mut handles = Vec::new();

    for i in 0..5 {
        let handle = tokio::spawn(async move {
            println!("Executing task {}", i);
        });
        handles.push(handle);
    }

    join_all(handles).await;
    Ok(())
}

In the code above, the main async function is scheduled on the Tokio runtime. The function also schedules five additional async tasks on the same runtime by calling tokio::spawn.

Tokio’s runtime provides two main task scheduling strategies:

Multi-Thread Scheduler
Current-Thread Scheduler

Multi-Thread Scheduler

The multi-thread scheduler executes futures on a thread pool, using a work-stealing strategy. By default, it will start a worker thread for each CPU core available on the system.

At its most basic level, a runtime has a collection of tasks that need to be scheduled. It will repeatedly remove a task from that collection and schedule it (by calling poll). When the collection is empty, the thread will go to sleep until a task is added to the collection.

The multi-thread runtime maintains one global queue, and a local queue for each worker thread. The runtime will prefer to choose the next task to schedule from the local queue, and will only pick a task from the global queue if the local queue is empty.

If both the local queue and global queue are empty, the worker thread will attempt to steal tasks from the local queue of another worker thread. Stealing is done by moving half of the tasks in one local queue to another local queue.

 Multi-Thread Scheduler with 2 threads                        
┌────────────────────────────────────────────────────────────┐
│                                                            │
│   Tokio runtime                                            │
│                                                            │
│   ┌────────────────────────────────────────────────────┐   │
│   │ Global queue                                       │   │
│   │  (empty)                                           │   │
│   └──────────┬─────────────────────────────┬───────────┘   │
│              │                             │               │
│              │                             │               │
│              ▼                             ▼               │
│   ┌──────────────────────┐      ┌──────────────────────┐   │
│   │ Thread 1             │      │ Thread 2             │   │
│   │                      │      │                      │   │
│   │ ┌──────────────────┐ │      │ ┌──────────────────┐ │   │
│   │ │ Local queue      │ │      │ │ Local queue      │ │   │
│   │ │ Task 1           │ │      │ │ (empty)          │ │   │
│   │ │ Task 2           │ │      │ │                  │ │   │
│   │ │ Task 3           │ │      │ │                  │ │   │
│   │ │ Task 4           │ │      │ │                  │ │   │
│   │ └──────────────────┘ │      │ └──────────────────┘ │   │
│   │                      │      │                      │   │
│   │  Executing Task 1    │      │  Idle                │   │
│   │                      │      │                      │   │
│   └──────────────────────┘      └──────────────────────┘   │
│                                                            │
└────────────────────────────────────────────────────────────┘

In the diagram above, the Tokio runtime has two threads. Thread 1 has four tasks in its local queue, while Thread 2 has no tasks. Since the global queue is also empty, Thread 2 will steal half of the tasks from Thread 1. This ensures that tasks are balanced across CPU cores, maximizing the use of available hardware resources and increasing throughput.

 Multi-Thread Scheduler with 2 threads                        
┌────────────────────────────────────────────────────────────┐
│                                                            │
│   Tokio runtime                                            │
│                                                            │
│   ┌────────────────────────────────────────────────────┐   │
│   │ Global queue                                       │   │
│   │  (empty)                                           │   │
│   └──────────┬─────────────────────────────┬───────────┘   │
│              │                             │               │
│              │                             │               │
│              ▼                             ▼               │
│   ┌──────────────────────┐      ┌──────────────────────┐   │
│   │ Thread 1             │      │ Thread 2             │   │
│   │                      │      │                      │   │
│   │ ┌──────────────────┐ │      │ ┌──────────────────┐ │   │
│   │ │ Local queue      │ │      │ │ Local queue      │ │   │
│   │ │ Task 1           │ │      │ │ Task 3           │ │   │
│   │ │ Task 2           │ │      │ │ Task 4           │ │   │
│   │ └──────────────────┘ │      │ └──────────────────┘ │   │
│   │                      │      │                      │   │
│   │  Executing Task 1    │      │  Executing Task 3    │   │
│   │                      │      │                      │   │
│   └──────────────────────┘      └──────────────────────┘   │
│                                                            │
└────────────────────────────────────────────────────────────┘
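The stealing step shown in the two diagrams can be sketched in a few lines. This is a simplified model, not Tokio's implementation (Tokio uses fixed-size lock-free queues). The `steal_half` function and the use of `VecDeque<u32>` for task ids are assumptions for illustration, and which end of the queue tasks are taken from is chosen only to match the diagram.

```rust
use std::collections::VecDeque;

/// Moves half of the tasks (rounded down) from the back of `victim` into
/// `thief`. Returns the number of tasks moved.
fn steal_half(victim: &mut VecDeque<u32>, thief: &mut VecDeque<u32>) -> usize {
    let count = victim.len() / 2;
    // `split_off(at)` keeps [0, at) in `victim` and returns [at, len).
    let stolen = victim.split_off(victim.len() - count);
    thief.extend(stolen);
    count
}

fn main() {
    // Thread 1 holds Tasks 1-4, Thread 2 is idle, as in the first diagram.
    let mut thread_1: VecDeque<u32> = VecDeque::from(vec![1, 2, 3, 4]);
    let mut thread_2: VecDeque<u32> = VecDeque::new();

    let moved = steal_half(&mut thread_1, &mut thread_2);

    println!("moved {moved} tasks");
    println!("thread 1: {thread_1:?}, thread 2: {thread_2:?}");
}
```

After the call, each queue holds two tasks, which is the state shown in the second diagram.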

Current-Thread Scheduler

The current-thread (single-thread) scheduler in Tokio runs every task on one thread and uses a global queue and a local queue to manage them. This scheduler is designed to handle all asynchronous tasks on a single core. In this model, there is no need for task migration or work stealing, as all tasks are handled by the single thread:

 Single-Thread Tokio Scheduler                        
┌──────────────────────────────┐
│                              │
│   Tokio Runtime              │
│                              │
│   ┌──────────────────────┐   │
│   │ Thread               │   │
│   │                      │   │
│   │ ┌──────────────────┐ │   │
│   │ │ Global Queue     │ │   │
│   │ │ (empty)          │ │   │
│   │ └──────────────────┘ │   │
│   │                      │   │
│   │ ┌──────────────────┐ │   │
│   │ │ Local Queue      │ │   │
│   │ │ Task 1           │ │   │
│   │ │ Task 2           │ │   │
│   │ │ Task 3           │ │   │
│   │ │ Task 4           │ │   │
│   │ └──────────────────┘ │   │
│   │                      │   │
│   │  Executing Task 1    │   │
│   │                      │   │
│   └──────────────────────┘   │
│                              │
└──────────────────────────────┘

Pingora’s threading model

As mentioned earlier, Pingora’s threading model is based on Tokio’s asynchronous runtime, and it provides two main flavors of runtime: a stealing one (multi-threaded with work stealing) and a no-steal one (multi-threaded without work stealing).

The stealing flavour is just the standard Tokio multi-thread runtime without any customizations that we already discussed.

On the other hand, the no-steal flavour is a set of OS threads each with its own single-threaded Tokio runtime.

Pingora’s multi-thread runtime without work stealing

The no-steal runtime model in Pingora is an alternative where each thread runs its own independent Tokio runtime with no task migration between threads. This means each thread operates as a single-threaded runtime, but the overall system can still utilize multiple cores by spawning multiple such runtimes:

 Multi-Threaded No Stealing Pingora Runtime with 2 Threads                            
┌────────────────────────────────────────────────────────────┐
│                                                            │
│   ┌────────────────────────────────────────────────────┐   │
│   │   Single-thread Tokio Runtime (Thread 1)           │   │
│   └────────────────────────────────────────────────────┘   │
│                                                            │
│   ┌────────────────────────────────────────────────────┐   │
│   │   Single-thread Tokio Runtime (Thread 2)           │   │
│   └────────────────────────────────────────────────────┘   │
│                                                            │
│   ┌──────────────────────┐     ┌──────────────────────┐    │
│   │ Thread 1             │     │ Thread 2             │    │
│   │                      │     │                      │    │
│   │ ┌──────────────────┐ │     │ ┌──────────────────┐ │    │
│   │ │ Global Queue     │ │     │ │ Global Queue     │ │    │
│   │ │ (Tasks 1, 2, 3)  │ │     │ │ (Tasks 4, 5, 6)  │ │    │
│   │ └──────────────────┘ │     │ └──────────────────┘ │    │
│   │                      │     │                      │    │
│   │ ┌──────────────────┐ │     │ ┌──────────────────┐ │    │
│   │ │ Local Queue      │ │     │ │ Local Queue      │ │    │
│   │ │ (empty)          │ │     │ │ (empty)          │ │    │
│   │ └──────────────────┘ │     │ └──────────────────┘ │    │
│   │                      │     │                      │    │
│   │  Executing Task 1    │     │  Executing Task 5    │    │
│   │                      │     │                      │    │
│   └──────────────────────┘     └──────────────────────┘    │
│                                                            │
│   ┌────────────────────────────────────────────────────┐   │
│   │               Multi-Core Utilization               │   │
│   └────────────────────────────────────────────────────┘   │
│                                                            │
└────────────────────────────────────────────────────────────┘

As we can see in the diagram above, each thread has its own Tokio runtime and task queues, and tasks scheduled on one thread remain on that thread throughout their lifetime. There is no work stealing: each thread is responsible for its own work, and no tasks are stolen or migrated across threads. Pingora ensures that any new task spawned in this runtime is randomly assigned to one of the threads.

This runtime flavour allows a thread-per-core thread model: it spawns multiple OS threads, with one thread typically mapped to each available CPU core.
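As a rough illustration of the no-steal idea, here is a sketch that uses plain OS threads and channels instead of Tokio runtimes. It is not Pingora's code. The `run_without_stealing` function is a hypothetical name, and it assigns tasks round-robin rather than randomly so that the result is deterministic.

```rust
use std::sync::mpsc;
use std::thread;

/// Runs `tasks` on `num_workers` threads (must be at least 1). Each worker
/// owns a private queue (an mpsc channel), and a task stays on the worker it
/// was assigned to. Returns, per worker, the task ids it executed, in order.
fn run_without_stealing(num_workers: usize, tasks: &[u32]) -> Vec<Vec<u32>> {
    let mut senders = Vec::new();
    let mut handles = Vec::new();

    for _ in 0..num_workers {
        let (tx, rx) = mpsc::channel::<u32>();
        senders.push(tx);
        handles.push(thread::spawn(move || {
            // The worker only ever drains its own queue; "executing" a task
            // is modelled here as recording its id.
            rx.iter().collect::<Vec<u32>>()
        }));
    }

    // Assumption: round-robin assignment instead of random assignment.
    for (i, task) in tasks.iter().enumerate() {
        senders[i % num_workers].send(*task).unwrap();
    }
    // Closing the channels lets every worker finish its loop.
    drop(senders);

    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let executed = run_without_stealing(2, &[1, 2, 3, 4, 5, 6]);
    for (worker, ids) in executed.iter().enumerate() {
        println!("worker {worker} ran tasks {ids:?}");
    }
}
```

If one worker receives slow tasks, the other cannot help it, which is the trade-off discussed in the pros and cons later in this post.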

Alternative thread-per-core runtimes

One improvement that Pingora could add to its non-stealing runtime is the use of LocalSet (see docs). Since the runtime is composed of single-threaded Tokio runtimes, its futures shouldn’t need to be Send and Sync, as they always run on the same thread. LocalSet provides a way to spawn and manage non-Send tasks within the context of a single-threaded runtime.

There are also other existing alternative runtimes that lend themselves to thread-per-core architectures: glommio from DataDog and monoio from ByteDance.

I recommend reading this post about async Rust without Send + Sync + 'static.

Which Pingora runtime should I use?

Let’s discuss the pros and cons of each alternative:

Work stealing

Pros:

It dynamically balances the load across threads. If one thread is overwhelmed with tasks, other threads can steal work from the busy thread’s queue.
By distributing tasks, it helps ensure that all CPU cores are utilized efficiently.

Cons:

Some overhead is introduced due to the need for coordination and synchronization when threads steal tasks from each other. This can cause additional latency in certain contexts.

Thread-Per-Core

Pros:

Since threads do not need to steal work from one another, there is less synchronization overhead.
Tasks are bound to specific threads, which can be beneficial for tasks that are stateful or have significant initialization costs.

Cons:

If some tasks require more time to execute than others, this could lead to some CPU cores being busy while others remain idle.

In my opinion, in the particular case of an HTTP proxy like Pingora, if the incoming traffic is predictable and each request requires a similar amount of time to be processed, a thread-per-core model might provide consistent performance with reduced thread contention. I opened a discussion here.

Conclusion

In summary, Pingora offers two async runtimes: one with work stealing and one without.

The work-stealing runtime is good for handling varying loads by balancing tasks across threads, but it introduces some overhead.
The thread-per-core model, where each thread runs its own Tokio runtime, can provide more consistent performance with less contention, especially if workloads are predictable.

In my opinion, Pingora should consider improving its non-stealing policy by allowing the use of non-Send and non-Sync futures with a LocalSet. On top of that, they could even explore supporting other runtimes like glommio and monoio.

Rust 1 Billion Row Challenge without Dependencies

2024-06-28T08:00:00+00:00

Rust 1 Billion Row Challenge without Dependencies

Introduction
Base Naive Implementation (90 seconds)
Multithreading Solution (17.96 secs - 80% improvement)
Custom Number Parsing (8.1 seconds - 54.9% improvement)
Custom Key Parsing (6.76 seconds - 16.5% improvement)
Custom Hash Function (5.85 seconds - 13.5% improvement)
Unsafe String Parsing (5.16 seconds - 11.8% improvement)
Edit 1: Custom Line Splitting (4.82 seconds - 6.59% improvement)
Conclusion

Introduction

On January 1st, 2024, Gunnar Morling announced the 1 Billion Row Challenge (1BRC). The challenge is to write a Java program to read temperature data from a text file and find the minimum, average, and maximum temperatures for each weather station. The file has 1,000,000,000 rows.

The text file has a simple structure with one measurement value per row:

Graus;12.0
Zaragoza;8.9
Madrid;38.8
Paris;15.2
London;12.6
...

The program should print out the min, mean, and max values per station, ordered alphabetically like so:

{Graus=5.0/18.0/27.4, Madrid=15.7/26.0/34.1, New York=12.1/29.4/35.6, ...}

I was curious and read several implementations in Rust. They were really good and optimized, and I learned a lot from them. However, they did not follow one of the rules of the 1BRC: no external dependencies may be used.

I tried running the official Rust solution with my 1 billion rows test file on my SER5 MAX mini PC. It took 5.7 seconds to execute.

I decided to write my own solution in Rust without using any external crates. My goal was to achieve similar performance to the official solution while keeping the code simple and short.

The code is available on this GitHub repository.

Base Naive Implementation (90 seconds)

I started by writing a simple, naive and unoptimized first version to use it as a base implementation for further improvements.

use std::{
    collections::BTreeMap,
    fmt::Display,
    fs::File,
    io::{BufRead, BufReader, Result},
    time::Instant,
};

fn main() {
    /*
    The release build is executed in around 90 seconds on SER5 MAX:
       - CPU: AMD Ryzen 7 5800H with Radeon Graphics (16) @ 3.200GHz
       - GPU: AMD ATI Radeon Vega Series / Radeon Vega Mobile Series
       - Memory: 28993MiB
    */
    let start = Instant::now();

    let reader = get_file_reader().unwrap();
    let station_to_metrics = build_map(reader).unwrap();
    print_metrics(station_to_metrics);

    let duration = start.elapsed();
    println!("\n Execution time: {:?}", duration);
}

fn get_file_reader() -> Result<BufReader<File>> {
    let file: File = File::open("./data/weather_stations.csv")?;
    Ok(BufReader::new(file))
}

fn build_map(file_reader: BufReader<File>) -> Result<BTreeMap<String, StationMetrics>> {
    let mut station_to_metrics = BTreeMap::<String, StationMetrics>::new();
    for line in file_reader.lines() {
        let line = line?;
        let (city, temperature) = line.split_once(';').unwrap();
        let temperature: f32 = temperature.parse().expect("Incorrect temperature");
        station_to_metrics
            .entry(city.to_string())
            .or_default()
            .update(temperature);
    }
    Ok(station_to_metrics)
}

// BTreeMap already sorts keys in ascending order.
fn print_metrics(station_to_metrics: BTreeMap<String, StationMetrics>) {
    for (i, (name, state)) in station_to_metrics.into_iter().enumerate() {
        if i == 0 {
            print!("{name}={state}");
        } else {
            print!(", {name}={state}");
        }
    }
}

#[derive(Debug)]
struct StationMetrics {
    sum_temperature: f64,
    num_records: u32,
    min_temperature: f32,
    max_temperature: f32,
}

impl StationMetrics {
    fn update(&mut self, temperature: f32) {
        self.max_temperature = self.max_temperature.max(temperature);
        self.min_temperature = self.min_temperature.min(temperature);
        self.num_records += 1;
        self.sum_temperature += temperature as f64;
    }
}

impl Default for StationMetrics {
    fn default() -> Self {
        StationMetrics {
            sum_temperature: 0.0,
            num_records: 0,
            min_temperature: f32::MAX,
            max_temperature: f32::MIN,
        }
    }
}

impl Display for StationMetrics {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        let avg_temperature = self.sum_temperature / (self.num_records as f64);
        write!(
            f,
            "{:.1}/{avg_temperature:.1}/{:.1}",
            self.min_temperature, self.max_temperature
        )
    }
}

As simple as this:

It opens the CSV file ./data/weather_stations.csv and creates a buffered reader for efficient file reading.
It reads each line of the file, splitting each line into a city name and a temperature value. It uses a BTreeMap to store and update temperature statistics (min, mean, and max) for each city. The BTreeMap automatically keeps the city names sorted.
For each city, it has a StationMetrics struct that tracks the sum, count, min, and max temperatures. The implementation updates these metrics as it processes each line of the file.
Once all data is processed, we print the temperature statistics for each city.

It runs in 90 seconds on my mini PC. That is a long way from the 5.7 seconds of the official implementation. Let’s get started!

Link to commit

I also wrote this script to create a sample test file to try against. In the repository, you can execute it by running cargo run --bin create_data_file.

Multithreading Solution (17.96 secs - 80% improvement)

The first thing that came to mind to improve performance was to introduce multithreading, as my mini PC has a CPU with 8 cores and 16 threads. We can follow this strategy: create as many threads as the number of CPU threads (N). Then, each thread will process 1/N of the file in parallel. Finally, we will merge the results from all threads to calculate the final result.

Here are the changes:

The main function determines the number of available CPU cores to decide how many threads to spawn.

fn main() {
    /*
    The release build is executed in around 17.96 seconds on SER5 PRO MAX:
       - CPU: AMD Ryzen 7 5800H with Radeon Graphics (16) @ 3.200GHz
       - GPU: AMD ATI Radeon Vega Series / Radeon Vega Mobile Series
       - Memory: 28993MiB
    */
    let start = Instant::now();

    let n_threads: usize = std::thread::available_parallelism().unwrap().into();

    ...

Then it divides the file into intervals based on n_threads. Each interval represents a chunk of the file to be processed by a thread in parallel.

Note that the function that calculates the intervals ensures that no lines are split between chunks by adjusting the end positions to the end of the line.

/// Splits the file into intervals based on the number of CPUs.
/// Each interval is determined by dividing the file size by the number of CPUs
/// and adjusting the intervals to ensure lines are not split between chunks.
///
/// Example:
///
/// Suppose the file size is 1000 bytes and `cpus` is 4.
/// The file will be divided into 4 chunks, and the intervals might be as follows:
///
/// ```text
/// Interval { start: 0, end: 249 }
/// Interval { start: 250, end: 499 }
/// Interval { start: 500, end: 749 }
/// Interval { start: 750, end: 999 }
/// ```
fn get_file_intervals_for_cpus(
    cpus: usize,
    file_size: u64,
    reader: &mut BufReader<File>,
) -> Vec<Interval> {
    let chunk_size = file_size / (cpus as u64);
    let mut intervals = Vec::new();
    let mut start = 0;
    let mut buf = String::new();

    for _ in 0..cpus {
        let mut end: u64 = (start + chunk_size).min(file_size);
        _ = reader.seek(SeekFrom::Start(end));
        let bytes_until_end_of_line = reader.read_line(&mut buf).unwrap();
        end = end + (bytes_until_end_of_line as u64) - 1; // -1 because read_line() also reads the '\n'

        intervals.push(Interval { start, end });

        start = end + 1;
        buf.clear();
    }
    intervals
}
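The same boundary adjustment can be tried on an in-memory byte slice, which makes it easier to experiment with. The sketch below is a simplified variant, not the code from the repository. The `line_aligned_ranges` function is a hypothetical name, and it returns half-open `(start, end)` ranges that end just after a newline instead of the `Interval` struct used above.

```rust
/// Splits `data` into at most `parts` half-open byte ranges `(start, end)`,
/// moving each boundary forward to just past the next b'\n' so that no line
/// is cut in two.
fn line_aligned_ranges(data: &[u8], parts: usize) -> Vec<(usize, usize)> {
    let parts = parts.max(1);
    // Ceiling division, so we never produce more than `parts` ranges.
    let chunk_size = ((data.len() + parts - 1) / parts).max(1);
    let mut ranges = Vec::new();
    let mut start = 0;

    while start < data.len() {
        // Tentative end of this chunk, clamped to the end of the data.
        let tentative = (start + chunk_size).min(data.len());
        // Extend to the end of the line the tentative boundary falls in.
        let end = match data[tentative..].iter().position(|&b| b == b'\n') {
            Some(offset) => tentative + offset + 1,
            None => data.len(),
        };
        ranges.push((start, end));
        start = end;
    }
    ranges
}

fn main() {
    let data = b"Graus;12.0\nZaragoza;8.9\nMadrid;38.8\nParis;15.2\n";
    for (start, end) in line_aligned_ranges(data, 2) {
        // Every chunk contains only whole lines.
        print!("[{start}, {end}):\n{}", String::from_utf8_lossy(&data[start..end]));
    }
}
```

Because boundaries only ever move forward, chunks can be uneven on tiny inputs like this one. With a file of a billion rows the imbalance is at most one line per chunk.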

For each interval, a new thread is spawned to process the corresponding file chunk. The process_chunk function reads the assigned file chunk and builds its own StationsMap for that chunk.

fn process_chunk(file_path: &Path, interval: Interval) -> StationsMap {
    let mut reader = get_file_reader(file_path).unwrap();
    // Starts from the interval start.
    _ = reader.seek(SeekFrom::Start(interval.start));
    // The reader only reads the number of bytes for that interval.
    let chunk_reader = reader.take(interval.end - interval.start);
    build_map(chunk_reader).unwrap()
}

After all threads complete, their results are merged into a single map using the merge_maps function.

fn merge_maps(a: StationsMap, b: &StationsMap) -> StationsMap {
    let mut merged_map = a;
    for (k, v) in b {
        merged_map.entry(k.into()).or_default().merge(v);
    }
    merged_map
}
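The `merge` method called above is not shown in this post. A plausible version, assuming `StationsMap` is an alias for `BTreeMap<String, StationMetrics>` and reusing the struct from the naive implementation, could look like the following sketch. The body of `merge` is my assumption about what such a method needs to do, not necessarily the code in the repository.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
struct StationMetrics {
    sum_temperature: f64,
    num_records: u32,
    min_temperature: f32,
    max_temperature: f32,
}

impl Default for StationMetrics {
    fn default() -> Self {
        StationMetrics {
            sum_temperature: 0.0,
            num_records: 0,
            min_temperature: f32::MAX,
            max_temperature: f32::MIN,
        }
    }
}

impl StationMetrics {
    /// Hypothetical `merge`: folds another thread's partial result into this one.
    fn merge(&mut self, other: &StationMetrics) {
        self.sum_temperature += other.sum_temperature;
        self.num_records += other.num_records;
        self.min_temperature = self.min_temperature.min(other.min_temperature);
        self.max_temperature = self.max_temperature.max(other.max_temperature);
    }
}

// Assumed alias; the post does not show its definition at this stage.
type StationsMap = BTreeMap<String, StationMetrics>;

fn merge_maps(a: StationsMap, b: &StationsMap) -> StationsMap {
    let mut merged_map = a;
    for (k, v) in b {
        merged_map.entry(k.clone()).or_default().merge(v);
    }
    merged_map
}

fn metrics(sum: f64, n: u32, min: f32, max: f32) -> StationMetrics {
    StationMetrics { sum_temperature: sum, num_records: n, min_temperature: min, max_temperature: max }
}

fn main() {
    let mut a = StationsMap::new();
    a.insert("Madrid".to_string(), metrics(30.0, 2, 10.0, 20.0));
    let mut b = StationsMap::new();
    b.insert("Madrid".to_string(), metrics(5.0, 1, 5.0, 5.0));
    b.insert("Paris".to_string(), metrics(15.0, 1, 15.0, 15.0));

    println!("{:#?}", merge_maps(a, &b));
}
```

Merging sums and counts rather than averages is what keeps the final mean correct: the average is only computed once, at print time.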

Here is the final main function to get a better overview of all parts.

fn main() {
    let start = Instant::now();

    // Number of threads available.
    let n_threads: usize = std::thread::available_parallelism().unwrap().into();

    let file_path = Path::new("./data/weather_stations.csv");
    let file_size = fs::metadata(&file_path).unwrap().size();
    let mut reader = get_file_reader(file_path).unwrap();

    // Divide the file into n_threads intervals.
    let intervals = get_file_intervals_for_cpus(n_threads, file_size, &mut reader);

    // Vector that contains all partial results of each thread.
    let results = Arc::new(Mutex::new(Vec::new()));
    let mut handles = Vec::new();

    for interval in intervals {
        let results = Arc::clone(&results);

        // Each thread processes a chunk in parallel.
        let handle = thread::spawn(move || {
            let station_to_metrics = process_chunk(file_path, interval);
            results.lock().unwrap().push(station_to_metrics);
        });
        handles.push(handle);
    }

    for handle in handles {
        handle.join().expect("Thread panicked");
    }

    // Combines all partial results into the final result.
    let result = results
        .lock()
        .unwrap()
        .iter()
        .fold(StationsMap::default(), |a, b| merge_maps(a, &b));

    print_metrics(&result);
    println!("\n Execution time: {:?}", start.elapsed());
}

This solution improves the execution time from 90 seconds to 17.96 seconds. Great achievement! But we need to do better to get closer to the 5.7 seconds of the official solution. Let’s continue optimizing!

Link to commit

Custom number parsing (8.1 seconds - 54.9% improvement)

Let’s use cargo flamegraph to visualize the stack of the current solution to know what we can start optimizing. Since I am using Fedora, it uses perf under the hood:

cargo flamegraph -b one-billion-row

We get the following flame graph:

If we zoom in on the right side of the image, we will see that almost 10% of the samples are for parsing the temperature into an f32 number:

The image corresponds to this part of the code:

let temperature: f32 = temperature.parse().expect("Incorrect temperature");

We know that our test file contains all temperatures in one of the following two formats:

ab.c (e.g., 12.5)
b.c (e.g., 5.4)

Also, in case of negative numbers it has a - right before.

Knowing this, we update our code to read each line of the file chunk as bytes instead of strings, and write a function that manually parses the bytes corresponding to the temperature into a fixed-precision i32 signed integer. This should be faster than converting the file bytes to a string and then parsing that string into an f32.

// Assumes the file always has 1-2 integer digits and exactly 1 decimal digit.
// V is the integer type used for temperatures (i32 in the text above).
fn parse_temperature(mut s: &[u8]) -> V {
    let neg = if s[0] == b'-' {
        s = &s[1..];
        true
    } else {
        false
    };

    let (a, b, c) = match s {
        [a, b, b'.', c] => (a - b'0', b - b'0', c - b'0'),
        [b, b'.', c] => (0, b - b'0', c - b'0'),
        _ => panic!("Unknown pattern {:?}", std::str::from_utf8(s).unwrap()),
    };

    let v = (a as V) * 100 + (b as V) * 10 + (c as V);

    if neg {
        -v
    } else {
        v
    }
}

This is how the function works:

It first checks if the first byte (character) in the input slice s is a minus sign (-). If so, it removes it.
It uses pattern matching to handle the two formats of the temperature:
- [a, b, b'.', c]: This pattern matches when the input has two digits before the decimal point and one digit after it. For example, 23.4.
- [b, b'.', c]: This pattern matches when the input has one digit before the decimal point and one digit after it. For example, 3.4.
For both patterns it extracts the numeric values of the digits by subtracting the ASCII value of 0 from each byte. This converts the ASCII byte representation of a digit to its actual integer value.
It calculates the temperature by combining the digits:
- a is multiplied by 100.
- b is multiplied by 10.
- c is used as is.
The sum of these products gives the temperature. But notice that we need to divide it by 10 at the printing step.
If the number is negative it returns -v.

Note that if we want to support other formats, we need to update the function by adding a new branch to the match statement.
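Here is a small self-contained sketch that exercises the function and shows the division by 10 at the printing step. The `type V = i32` alias follows the text above, while the `format_tenths` helper is a hypothetical name added for illustration.

```rust
// Assumed alias: the text describes the value as a fixed-precision i32.
type V = i32;

// Same parsing logic as above: 1-2 integer digits and exactly 1 decimal digit.
fn parse_temperature(mut s: &[u8]) -> V {
    let neg = if s[0] == b'-' {
        s = &s[1..];
        true
    } else {
        false
    };

    let (a, b, c) = match s {
        [a, b, b'.', c] => (a - b'0', b - b'0', c - b'0'),
        [b, b'.', c] => (0, b - b'0', c - b'0'),
        _ => panic!("Unknown pattern {:?}", std::str::from_utf8(s).unwrap()),
    };

    let v = (a as V) * 100 + (b as V) * 10 + (c as V);
    if neg { -v } else { v }
}

/// Hypothetical helper: turns tenths of a degree back into text for printing.
fn format_tenths(v: V) -> String {
    format!("{:.1}", v as f64 / 10.0)
}

fn main() {
    for input in ["12.5", "5.4", "-3.4", "-23.0"] {
        let tenths = parse_temperature(input.as_bytes());
        println!("{input:>6} -> {tenths:>5} tenths -> {}", format_tenths(tenths));
    }
}
```

Keeping the value as an integer number of tenths means min, max, and sum can all use integer arithmetic, with floating point only appearing when the results are printed.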

After this change, we have improved the execution time from 17.96 seconds to 8.1 seconds. We are getting closer!

Link to commit

Custom key parsing (6.76 seconds - 16.5% improvement)

Let’s now generate another flame graph of the current solution to see what our next improvement could be:

If we zoom in on the right side of the image, we will see that more than 10% of the samples are for parsing the city from a byte slice to a string:

Let’s do the same we did for parsing the temperature. We are going to write a custom parser to make the program faster.

Our current StationsMap maps from city names (String) to StationMetrics. What if we change it to BTreeMap<u64, StationMetrics>? We can write a fast function that converts a byte slice to a u64, and a u64 (8 bytes) should be enough to identify a city (e.g. the first 8 characters of its name).

Having a u64 instead of a string as key also comes with the advantage of having the hash map keys inlined.

How hash maps work

When you use String as keys in a HashMap, each key owns a heap-allocated, dynamically sized buffer. Hashing has to walk every byte of that buffer, and comparing keys means following a pointer to the heap, which affects both speed and memory usage.

When you use u64 as keys in a HashMap, each key is a fixed-size integer stored inline in the table itself. This is typically more efficient in terms of both hashing and memory usage.

The difference is that a u64 key always takes 8 bytes and lives directly in the map's bucket, with no separate allocation to create, follow, or free. That is why by using u64 keys, we can achieve better performance for lookups, insertions, and deletions in the hash map.

See how a value is retrieved in both cases.

HashMap with String keys:

HashMap with u64 keys (inlined):

As we can see in the graphs, the differences are that:

u64 keys are stored inline, which is faster and doesn't need the additional step of heap-allocating the key.
Hashing a fixed-size integer is faster.
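
To make the comparison concrete, here is a small sketch that looks up the same value through both kinds of key and reports how many bytes each key occupies inline. The station name, the id 42 and the value 123 are made up for illustration:

```rust
use std::collections::HashMap;
use std::mem::size_of;

/// Returns the inline size in bytes of a `String` key and of a `u64` key.
/// The `String` size excludes the heap buffer that holds the characters.
fn key_sizes() -> (usize, usize) {
    (size_of::<String>(), size_of::<u64>())
}

/// Looks up the same (hypothetical) station through both kinds of key.
fn lookup_both() -> (Option<i32>, Option<i32>) {
    let mut by_name: HashMap<String, i32> = HashMap::new();
    let mut by_id: HashMap<u64, i32> = HashMap::new();
    by_name.insert("Madrid".to_string(), 123);
    by_id.insert(42, 123);
    // The String lookup hashes every byte of the name; the u64 lookup hashes 8 bytes.
    (by_name.get("Madrid").copied(), by_id.get(&42).copied())
}
```

Both lookups return the same value; the difference is only in how much work is done per key.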

Code changes

Let’s change StationsMap to a BTreeMap and write a function that converts the city to a u64. The downside is that we need to add a string field (city) to StationMetrics to store the actual name of the city. Notice that this string is only computed once for each city, not for each of the one billion rows.

fn to_key(data: &[u8]) -> u64 {
    let mut hash = 0u64;
    let len = data.len();
    unsafe {
        if len >= 8 {
            // The slice is not guaranteed to be 8-byte aligned, so use an unaligned read.
            hash = (data.as_ptr() as *const u64).read_unaligned();
        } else {
            for i in 0..len {
                hash |= (*data.get_unchecked(i) as u64) << (i * 8);
            }
        }
    }

    hash ^= len as u64;
    hash
}

As we already said, the to_key function converts a slice of bytes into a u64 integer that is used to identify cities in our StationsMap. This is how it works:

It starts by creating a variable called hash and sets it to 0.
It then gets the length of the input data (number of bytes).
If the data length is 8 or more bytes:
- It directly reads the first 8 bytes and interprets them as a u64 integer. This is a fast way to create a key.
If the data length is less than 8 bytes:
- It processes each byte individually in a loop.
- For each byte, it shifts the byte’s value by its position (multiplied by 8) and combines it with the hash using the bitwise OR operator.
Finally, it adjusts the final hash by XOR-ing with the length of the data. This is to ensure that cities that start with the same 8 bytes have a different hash (assuming they have different length).
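
The steps above can also be written without unsafe, which makes the behaviour easy to check. This is a sketch, not the version used in the post: to_key_safe is a hypothetical name, and it matches the unsafe version only on little-endian machines. The city names in the comments are made up to show the remaining collision case:

```rust
/// Safe sketch of the key function: first 8 bytes (zero-padded) XOR length.
/// `to_key_safe` is a hypothetical name; byte order is fixed to little-endian.
fn to_key_safe(data: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    let n = data.len().min(8);
    // Copy at most 8 bytes; shorter names leave the high bytes as zero.
    buf[..n].copy_from_slice(&data[..n]);
    // Mix in the length so names sharing an 8-byte prefix can still differ.
    u64::from_le_bytes(buf) ^ data.len() as u64
}

// Names with the same first 8 bytes but different lengths get different keys:
//   to_key_safe(b"Kingston") != to_key_safe(b"Kingston-A")
// Names with the same first 8 bytes AND the same length still collide:
//   to_key_safe(b"Kingston-A") == to_key_safe(b"Kingston-B")
```

The last case is the limitation to keep in mind: the key is only unique as long as no two cities in the file share both their first 8 bytes and their length.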

Running this solution now improves the execution time from 8.1 seconds to 6.76 seconds. We still have some work to do!

Link to commit

Custom hash function (5.85 seconds - 13.5% improvement)

In the previous section, we explained how hash maps work. In the diagrams, we saw that retrieving a value for a given key from the hash map requires first hashing the key, then looking up that hash in the hash table, and finally retrieving the data for that hash.

A question that naturally arises now is: why do we need to hash our keys if we are already using a u64 key that identifies the city? Why not use that u64 key directly as the hash, so we avoid that extra step?

Let’s use a custom hasher that just returns the u64 without applying any hash function.

One catch is that BTreeMap orders its keys by comparison and does not use a hasher at all, so we have to switch to a HashMap. This means we will need to sort the values before printing them.

#[derive(Default)]
struct NoOpHasher {
    hash: u64,
}

impl Hasher for NoOpHasher {
    fn write(&mut self, _bytes: &[u8]) {
        panic!("NoOpHasher only supports u64 values");
    }

    fn write_u64(&mut self, i: u64) {
        self.hash = i;
    }

    fn finish(&self) -> u64 {
        self.hash
    }
}

struct NoOpBuildHasher;

impl BuildHasher for NoOpBuildHasher {
    type Hasher = NoOpHasher;

    fn build_hasher(&self) -> NoOpHasher {
        NoOpHasher::default()
    }
}

type StationsMap = HashMap<u64, StationMetrics, NoOpBuildHasher>;

There is not much to comment on in this piece of code. It does what we already mentioned: it passes the u64 value through without applying any hash function.
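
Since a HashMap has no defined iteration order, the results must be sorted before printing. The sketch below repeats the hasher so that it is self-contained and shows one way to do that sorting. The map values are plain city names here instead of StationMetrics, and the keys 1 to 3 are made up:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasher, Hasher};

#[derive(Default)]
struct NoOpHasher {
    hash: u64,
}

impl Hasher for NoOpHasher {
    fn write(&mut self, _bytes: &[u8]) {
        panic!("NoOpHasher only supports u64 values");
    }
    fn write_u64(&mut self, i: u64) {
        self.hash = i;
    }
    fn finish(&self) -> u64 {
        self.hash
    }
}

struct NoOpBuildHasher;

impl BuildHasher for NoOpBuildHasher {
    type Hasher = NoOpHasher;
    fn build_hasher(&self) -> NoOpHasher {
        NoOpHasher::default()
    }
}

/// Returns the city names in alphabetical order, as needed for the final output.
fn sorted_cities(map: &HashMap<u64, String, NoOpBuildHasher>) -> Vec<&str> {
    let mut names: Vec<&str> = map.values().map(String::as_str).collect();
    names.sort_unstable();
    names
}

/// Builds a small map with hypothetical keys and returns its cities sorted.
fn demo() -> Vec<String> {
    let mut map = HashMap::with_hasher(NoOpBuildHasher);
    map.insert(3u64, "Valencia".to_string());
    map.insert(1u64, "Zaragoza".to_string());
    map.insert(2u64, "Bilbao".to_string());
    sorted_cities(&map).into_iter().map(str::to_string).collect()
}
```

Sorting happens once at the end over a few hundred cities, so its cost is negligible next to the per-row lookups it makes cheaper.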

After applying these changes and executing the code, we see an improvement from 6.76 seconds to 5.85 seconds. This is almost the same as the official Rust solution (5.7 seconds)!

Link to commit

Unsafe string parsing (5.16 seconds - 11.8% improvement)

Our current solution is already optimized and has a performance similar to the official Rust solution. But can we do a final optimization? Let’s see what the flamegraph has to say:

If we zoom in, we see that 11% of the samples are calls to core::str::converts::from_utf8:

How is this possible if we are parsing the city names as u64? It is because we still convert them to strings to store them in the StationMetrics struct and print them at the end. Even though we only do this once per city and per thread, it represents 11% of the samples.

How can we improve this? We know that we are calling std::str::from_utf8(city).unwrap().to_string(), which is safe and checks whether the byte slice is valid UTF-8. Since we know that our file contains valid UTF-8 strings, we can replace it with:

city: unsafe { std::str::from_utf8_unchecked(city).to_string() },

This way, we skip the UTF-8 validation. Note that it should only be used when you are certain that the byte slice is valid UTF-8; otherwise the behavior is undefined.
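
One way to keep a safety net while developing is to validate only in debug builds. This is a sketch under that assumption; city_name is a hypothetical helper, not a function from the original code:

```rust
/// Converts a city byte slice to an owned String, skipping UTF-8 validation
/// in release builds. `city_name` is a hypothetical helper name.
fn city_name(bytes: &[u8]) -> String {
    // In debug builds, still verify the assumption that the input is valid UTF-8.
    debug_assert!(std::str::from_utf8(bytes).is_ok());
    // SAFETY: the caller guarantees that `bytes` is valid UTF-8.
    let name = unsafe { std::str::from_utf8_unchecked(bytes) };
    name.to_string()
}
```

With this shape, tests and debug runs catch a malformed file, while the release build pays nothing for the check.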

Now the program takes 5.16 seconds. This time, it is faster than the official Rust solution!

Link to commit

Edit 1: Custom line splitting (4.82 seconds - 6.59% improvement)

After reading this post, my friend Kyriacos suggested an optimization to me. He said:

Why are you using line.split_once(|&c| c == b';') to split each line if you know that the separator is almost always at the same position?

He was right. I was using the split_once function, which performs a linear search over the line. This means it scans through the byte slice from the beginning to the end until it finds the separator character. This can clearly be optimized given the file format.

We know that the separator can be at one of these three positions:

line_length - 4: for lines like “city_name;2.3”
line_length - 5: for lines like “city_name;12.3” or “city_name;-2.3”
line_length - 6: for lines like “city_name;-12.3”

Instead of scanning the whole line (linear time complexity), we only need to check these three positions (constant time complexity). Let’s update the code:

let line_length = line.len();
let separator_pos = if line[line_length - 4] == b';' {
    line_length - 4
} else if line[line_length - 5] == b';' {
    line_length - 5
} else {
    line_length - 6
};

let (city, temperature) = line.split_at(separator_pos);
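
Wrapped in a function, the same idea can be tested in isolation. This is a sketch that assumes every line follows the challenge format (one decimal digit, at most two integer digits); split_line is a hypothetical name. Note that split_at keeps the separator at the start of the second half, so the sketch skips it explicitly:

```rust
/// Splits a line such as b"Madrid;-12.3" into (city, temperature) without
/// scanning the whole line. `split_line` is a hypothetical helper and it
/// assumes the line is well formed; malformed lines may panic.
fn split_line(line: &[u8]) -> (&[u8], &[u8]) {
    let len = line.len();
    let separator_pos = if line[len - 4] == b';' {
        len - 4
    } else if line[len - 5] == b';' {
        len - 5
    } else {
        len - 6
    };
    // split_at leaves the ';' at the start of `rest`, so drop the first byte.
    let (city, rest) = line.split_at(separator_pos);
    (city, &rest[1..])
}
```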

After executing the program again, the time decreased from 5.16 seconds to 4.82 seconds. Thanks, Kyriacos!

Link to commit

Conclusion

In this blog post, we explored various optimizations for the 1 Billion Row Challenge in Rust, aiming to improve performance without external dependencies. We started from a basic solution that took 90 seconds and implemented several enhancements:

Multithreading: Reduced execution time to 17.96 seconds.
Custom Number Parsing: Improved performance to 8.1 seconds.
Custom Key Parsing: Further optimized to 6.76 seconds.
Custom Hash Function: Achieved a time of 5.85 seconds.
Unsafe String Parsing: Reached a final time of 5.16 seconds, which is faster than the official Rust solution!

While our final solution is faster than the official Rust implementation, other approaches using libraries like memmap or hashmap might be even more efficient for real-world scenarios.

Keep in mind that these results are specific to my machine and test file, so performance may vary for others. I invite you to find more optimizations and share them in the comments!

Thanks for reading, and happy optimizing!

Gas stations finder in Golang

2023-11-30T08:00:00+00:00

Hi all,

I wanted to share a project I created a couple of years ago, during winter holidays, with the purpose of learning Go.

In December 2021, gas prices were more expensive than ever in Spain, and I drove a lot. That’s why I wrote the Gas Station Finder in Go; I wanted to find all gas stations between 2 cities and sort them by price. This way, I could fill the tank at a cheaper gas station without having to deviate from the route.

Let me break down how it all works.

The project relies on the OpenRouteService API to find the route between two coordinates. It also uses its geocoding service to convert city names into coordinates. This makes it possible to expose an API that, given 2 city names, returns both the route and all nearby gas stations with their prices.

The REST API is built with the Gin web framework. I liked it as it was straightforward and simple to use.

The app is split into different components: one for figuring out where you are, one for planning routes, another for getting directions, one for getting gas station prices, and another to retrieve gas stations nearby.

In order to improve performance, I did two things:

Use a KDTree to store gas stations. For a given route, the app queries the KDTree to find nearby gas stations really fast (O(log n)).
Keep all prices in memory and refresh them in the background every 4 hours. Since the prices don't change that often, caching them was a good idea to improve performance, because retrieving the prices from the Spanish government website takes quite a while.

As an improvement, instead of having all the gas station prices in memory, we could use an embedded database like GrausDB. It could also be used to store routes between cities instead of calling OpenRouteService to calculate them each time. These routes are not going to change often, so we could refresh them every 2 weeks, for example.

I enjoyed coding it since it only took a few hours, and it proved to be a useful app for myself. Also, I learned a little bit of Go!

You can find the code on GitHub: GasPrices.

Thanks for reading!

Ricardo.

Essential algorithms for the coding interview

2023-01-04T08:00:00+00:00

Hello everyone,

I am excited to announce the release of my first book, “Essential algorithms for the coding interview”, which has been self-published through Amazon KDP.

Who is this book for?

This book is designed to help software engineers and computer science students prepare for technical interviews by providing a comprehensive guide to essential algorithms and data structures.

What is the structure of the book?

This book is a concise, informative resource that contains 11 main chapters, each covering a different type of coding problem. For each problem, it provides detailed explanations of the algorithms and data structures needed to solve it, followed by a detailed solution with expositions, graphs, and step-by-step executions to help readers understand and apply the concepts.

Here is the table of contents:

Introduction.
First Bad Version (Binary Search).
Valid Parentheses (Stack).
Valid Palindrome (2 pointers).
House Robber (Recursion).
Combination Sum (Backtracking).
Lowest Common Ancestor (Binary Trees).
Binary Tree Level Order Traversal (Tree and BFS).
Course Schedule (Graphs).
Minimum Window Substring (Sliding Window).
Merge k Sorted Lists (Heaps).
Sudoku Solver (Backtracking).
Afterword.

Where to find the book?

You can find it on all Amazon marketplaces like:

US: https://www.amazon.com/dp/B0BRJPFT54.
ES: https://www.amazon.es/dp/B0BRJPFT54.
DE: https://www.amazon.de/dp/B0BRJPFT54.
UK: https://www.amazon.co.uk/dp/B0BRJPFT54.
FR: https://www.amazon.fr/dp/B0BRJPFT54.
IT: https://www.amazon.it/dp/B0BRJPFT54.
JP: https://www.amazon.co.jp/dp/B0BRJPFT54.
CA: https://www.amazon.ca/dp/B0BRJPFT54.

Thank you for your interest, and I hope you find this book helpful!

Ricardo.

Promised Architecture Kit V2

2018-12-26T14:00:00+00:00

PromisedArchitectureKit V2

The simplest architecture for PromiseKit, now V2, even simpler and easier to reason about.

I have published a new version of PromisedArchitectureKit, fully redesigned and simplified.

V2 Goal

PromisedArchitectureKit V2 has been designed to impose constraints that enforce correctness and simplicity.

Introduction

PromisedArchitectureKit is a library that tries to enforce correctness and simplify the state management of applications and systems. It helps you write applications that behave consistently, and are easy to test. It’s inspired by Redux and RxFeedback.

Motivation

I have been looking for an architecture that reduces the complexity of managing the state of mobile applications while also being easy to test.

I started with Model-View-Controller (MVC), then Model-View-ViewModel (MVVM) and also Model-View-Presenter (MVP) along with Clean architecture. MVC is not as easy to test as MVVM and MVP. MVVM and MVP are easy to test, but the UI state can become a mess, since there is no centralized way to update it, and you can end up with lots of methods scattered across the code that change the state.

Then Elm and Redux appeared, along with other Redux-like architectures such as Redux-Observable, RxFeedback, Cycle.js, ReSwift, etc. The main difference between these architectures (including PromisedArchitectureKit) and MVP is that they introduce constraints on how the UI state can be updated, in order to enforce correctness and make apps easier to reason about.

What makes PromisedArchitectureKit different from these Redux-like architectures is that it uses async reducers (built on PromiseKit) to wrap the effects; it then runs the side effects for you and calls the UI with the result.

PromisedArchitectureKit runs side effects for you. Your code stays 100% pure.

Quick start

Installation

PromisedArchitectureKit is available through CocoaPods. To install it, simply add the following line to your Podfile:

pod 'PromisedArchitectureKit'

PromisedArchitectureKit

PromisedArchitectureKit itself is very simple. This is how it looks:

self.system = System.pure(
    initialState: State.start,
    reducer: State.reduce,
    uiBindings: [view?.updateUI]
)

The core concept

Each screen of your app (and the whole app) has a state of its own. In PromisedArchitectureKit, this state is represented as an enum. For example, the state of the product detail page (PDP) of an e-commerce app might look like this:

enum State {
    case start
    case loading
    case productLoaded(Product)
    case addedToCart(Product, CartResponse)
    case error(Error)
}

In this screen, the app loads the product, then it can show the product or an error. After the product is loaded, the user can add it to the basket.

This State enum represents the state of the “PDP screen” in the e-commerce app. With this approach of having an enum that actually represents the state of a screen, views are a direct mapping of state:

view = f(state).

That “f” function will be the UI binding function that we will see later on.

To change something in the state, you need to dispatch an Event. An event is an enum that describes what happened. Here are a few example events:

enum Event {
    case loadProduct
    case addToCart
}

Enforcing that every change is described as an event lets us have a clear understanding of what’s going on in the app. If something changed, we know why it changed.

Events are like breadcrumbs of what has happened. Finally, to tie state and events together, we write a function called a reducer. A reducer is just a function that takes the state and an event as arguments, and returns the next state of the app (asynchronously):

(State, Event) -> AsyncResult<State>

AsyncResult is just a wrapper around Promise.

We write a reducer function for the state of every screen. For the PDP screen:

    static func reduce(state: State, event: Event) -> AsyncResult<State> {
        switch event {

        case .loadProduct:
            let productResult = getProduct(cached: false)
            
            return productResult
                .map { State.productLoaded($0) }
                .stateWhenLoading(State.loading)
                .mapErrorRecover { State.error($0) }
            
        case .addToCart:
            let productResult = getProduct(cached: true)
            let userResult = getUser()
            
            return AsyncResult<(Product, User)>.zip(productResult, userResult).flatMap { pair -> AsyncResult<State> in
                let (product, user) = pair
                
                return addToCart(product: product, user: user)
                    .map { State.addedToCart(product, $0) }
                    .mapErrorRecover{ State.error($0) }
            }
            .stateWhenLoading(State.loading)
        }
    }

Notice that the reducer is a pure function in terms of referential transparency: for a state S and an event E, it always returns the same state description and has no side effects (it only returns descriptions of the effects; the library will run them for you).

This is basically the whole idea of PromisedArchitectureKit. Note that we haven’t used any PromisedArchitectureKit APIs. It comes with a few utilities to facilitate this pattern, but the main idea is that you describe how your state is updated over time in response to events, and 90% of the code you write is just plain Swift, so the UI logic can be tested with ease.

But what about asynchronous code and side effects such as API calls, DB calls, logging, and reading and writing files?

Using PromiseKit as a time abstraction

A Promise is used for handling asynchronous operations. PromisedArchitectureKit uses them in order to trigger reactions to some states. Example of Promise:

    // For this standalone snippet, assume `typealias Product = String`.
    func getProduct() -> Promise<Product> {
        return Promise { seal in
            DispatchQueue.main.asyncAfter(deadline: .now() + 5) {
                seal.fulfill("Yeezy 500")
            }
        }
    }

That function returns a Promise that will return a product. It waits for 5 seconds and then returns the product. It simulates a network call.

Don’t fear the AsyncResult

AsyncResult is just a wrapper over Promise that gives it more power. It is like a Promise on steroids.

But don’t worry. If your whole app uses Promises, that is fine. You can keep using Promises and convert them to AsyncResults in the reducer function with ease.

How do you get an AsyncResult from a Promise?

let asyncResult = AsyncResult(promise)

And that’s it!

What if I want to make network calls, DB calls, and so on?

If we want to load the product from the backend, we need a network call, which is an asynchronous side effect.

To achieve this, we use Promises to handle async code. As the reducer function returns the new state asynchronously, we can map Promises to new states.

For example, we are in the start state, and we want to load a product and move to the productLoaded state when a loadProduct event is triggered. In the reducer we do:

    static func reduce(state: State, event: Event) -> AsyncResult<State> {
        switch event {

        case .loadProduct:
            let productResult = getProduct(cached: false)
            
            return productResult
                .map { State.productLoaded($0) }
                .stateWhenLoading(State.loading)
                .mapErrorRecover { State.error($0) }
                
        (...)

What is this doing? Step by step:

When a loadProduct event is triggered

    switch event {
        case .loadProduct:

We get the product (AsyncResult)

	let productResult = getProduct(cached: false)

If the product is retrieved successfully, we return a productLoaded state:

return productResult
 	.map { State.productLoaded($0) }

We want to send the UI a loading state while the Promise is executing, until it gets resolved, so the UI can show a loading indicator:

.stateWhenLoading(State.loading)

If the product cannot be retrieved, we return an error state:

	.mapErrorRecover { State.error($0) }

Pretty easy and neat.

There is no side effect here: there is only a description of it. Actually, the side effect will be executed by the library.

Update the view

After a state change, the View’s updateUI function is called with the new state. The view is then in charge of updating its UI components.

Example:

    func updateUI(state: State) {
        showLoading()
        addToCartButton.isEnabled = false
        refreshButton.isHidden = false

    
        switch state {
        case .start:
            productTitleLabel.text = ""
            descriptionLabel.text = ""
            imageView.image = nil
        case .loading:
            refreshButton.isHidden = true
            showLoading()
            
        case .productLoaded(let product):
            productTitleLabel.text = product.title
            descriptionLabel.text = product.description
            updateImage(with: product.imageUrl)
            addToCartButton.isEnabled = true
            hideLoading()
            
        case .error(let error):
            descriptionLabel.text = error.localizedDescription
            hideLoading()
            
        case .addedToCart(_, let cartResponse):
            hideLoading()
            addToCartButton.isEnabled = true
            showAddedToCartAlert(cartResponse)
        }

        print(state)
    }

So, the presenter will compute the next state, and will send it to the view. The view will draw itself accordingly.

What does the library do under the hood?

The library’s core is small enough to be pasted here:

//
//  System.swift
//  PromisedArchitectureKit
//
//  Created by Pallas, Ricardo on 7/3/18.
//

import Foundation
import PromiseKit

public final class System<State, Event> {

    internal var eventQueue = [Event]()
    internal var callback: ((State) -> ())? = nil

    internal var initialState: State
    internal var reducer: (State, Event) -> AsyncResult<State>
    internal var uiBindings: [((State) -> ())?]
    internal var currentState: State

    private init(
        initialState: State,
        reducer: @escaping (State, Event) -> AsyncResult<State>,
        uiBindings: [((State) -> ())?]
        ) {
        self.initialState = initialState
        self.reducer = reducer
        self.uiBindings = uiBindings
        self.currentState = initialState
    }

    public static func pure(
        initialState: State,
        reducer: @escaping (State, Event) -> AsyncResult<State>,
        uiBindings: [((State) -> ())?]
        ) -> System {
        
        let system = System<State,Event>(initialState: initialState, reducer: reducer, uiBindings: uiBindings)
        system.bindUI(initialState)
        return system
    }

    public func addLoopCallback(callback: @escaping (State)->()){
        self.callback = callback
    }

    var actionExecuting = false

    public func sendEvent(_ action: Event) {
        assert(Thread.isMainThread)
        if actionExecuting {
            self.eventQueue.append(action)
        } else {
            actionExecuting = true
            let _ = doLoop(action).done { state in
                assert(Thread.isMainThread, "PromisedArchitectureKit: Final callback must be run on main thread")
                if let callback = self.callback {
                    callback(state)
                }
                self.actionExecuting = false
                if let nextEvent = self.eventQueue.first {
                    self.eventQueue.removeFirst()
                    self.sendEvent(nextEvent)
                }
            }
        }
    }

    private func doLoop(_ event: Event) -> Promise<State> {
        return Promise.value(event)
            .then { event -> Promise<State> in

                let asyncResultState = self.reducer(self.currentState, event)

                if let stateWhenLoading = asyncResultState.loadingResult {
                    self.bindUI(stateWhenLoading)
                }

                return asyncResultState.promise
            }
            .map { state in
                self.currentState = state
                self.bindUI(state)
                return state
            }
    }

    private func bindUI(_ state: State) {
        self.uiBindings.forEach { uiBinding in
            uiBinding?(state)
        }
    }
}

It executes loops in the doLoop function. What is a loop? It is the whole cycle in which an event is triggered, a new state is calculated, and the UI is updated accordingly.

Following the load product example:

A loadProduct event is sent by the view. The sendEvent function is called, which in turn calls the doLoop function.
The doLoop function runs the side effects described by the reducer and gets the new state asynchronously. If a loading state was specified, it notifies the UI before the side effects complete. After that, it updates the current state and calls the UI with the new state.

To sum up: The system listens to events, runs side effects to get the new state and notifies the UI that the state has changed.

Why should I use PromisedArchitectureKit V2?

As said before, the goal of the library is to impose constraints that enforce correctness and make the architecture easier to read and reason about. These constraints are: there is a finite number of states for each screen, there is a finite number of events that can change the state, and the library decides when to update the UI.

Those restrictions come with advantages, and the trade-off is worth it. The main advantages the library provides are:

The library executes all side effects for you so your code stays pure.
It updates the view when needed, so you don’t have to take care of it.
You can tell what the screen is about by reading the State enum.
You know at compile time that your view handles all states.
You know what actions can be done on the screen by reading the Event enum.
You know at compile time that all events are handled by the presenter.
A single function will be called on every state change. That can be useful to have good analytics, for example.

Example

To run the example project, clone the repo, and run pod install from the Example directory first.

ViewController’s code:

import UIKit
import PromisedArchitectureKit

class ViewController: UIViewController, View {
    
    @IBOutlet weak var productTitleLabel: UILabel!
    @IBOutlet weak var imageView: UIImageView!
    @IBOutlet weak var descriptionLabel: UILabel!
    @IBOutlet weak var addToCartButton: UIButton!
    @IBOutlet weak var refreshButton: UIButton!
    
    var presenter: Presenter! = nil
    var indicator: UIActivityIndicatorView! = nil
    
    override func viewDidLoad() {
        super.viewDidLoad()
        addLoadingIndicator()
        
        presenter = Presenter(view: self)
        presenter.controllerLoaded()
    }
    
    override func viewWillAppear(_ animated: Bool) {
        super.viewWillAppear(animated)
        presenter.sendEvent(Event.loadProduct)
    }
    
    private func addLoadingIndicator() {
        indicator = UIActivityIndicatorView(style: UIActivityIndicatorView.Style.gray)
        indicator.frame = CGRect(x: 0, y: 0, width: view.frame.width, height: view.frame.height)
        indicator.center = view.center
        view.addSubview(indicator)
        view.bringSubviewToFront(indicator)
        UIApplication.shared.isNetworkActivityIndicatorVisible = true
    }
    
    // MARK: - User Actions
    @IBAction func didTapRefresh(_ sender: Any) {
        presenter.sendEvent(Event.loadProduct)
    }
    
    @IBAction func didTapAddToCart(_ sender: Any) {
        presenter.sendEvent(Event.addToCart)
    }

    // MARK: - User Outputs
    func updateUI(state: State) {
        showLoading()
        addToCartButton.isEnabled = false
        refreshButton.isHidden = false

    
        switch state {
        case .start:
            productTitleLabel.text = ""
            descriptionLabel.text = ""
            imageView.image = nil
        case .loading:
            refreshButton.isHidden = true
            showLoading()
            
        case .productLoaded(let product):
            productTitleLabel.text = product.title
            descriptionLabel.text = product.description
            updateImage(with: product.imageUrl)
            addToCartButton.isEnabled = true
            hideLoading()
            
        case .error(let error):
            descriptionLabel.text = error.localizedDescription
            hideLoading()
            
        case .addedToCart(_, let cartResponse):
            hideLoading()
            addToCartButton.isEnabled = true
            showAddedToCartAlert(cartResponse)
        }

        print(state)
    }
    
    private func showLoading() {
        indicator.startAnimating()
    }
    
    private func hideLoading() {
        indicator.stopAnimating()
    }
    
    private func showAddedToCartAlert(_ message: String) {
        let alertController = UIAlertController(title: "Added to cart", message:
            message, preferredStyle: UIAlertController.Style.alert)
        alertController.addAction(UIAlertAction(title: "Dismiss", style: UIAlertAction.Style.default,handler: nil))
        self.present(alertController, animated: true, completion: nil)
    }
    
    private func updateImage(with urlPath: String) {
        if let url = URL(string: urlPath), let data = try? Data(contentsOf: url) {
            let image = UIImage(data: data)
            imageView.image = image
        }
    }

}

Presenter’s code:

import Foundation
import PromisedArchitectureKit
import PromiseKit

typealias CartResponse = String
typealias User = String

struct Product: Equatable {
    let title: String
    let description: String
    let imageUrl: String
}

protocol View: class {
    func updateUI(state: State)
}

// MARK: - Events
enum Event {
    case loadProduct
    case addToCart
}

// MARK: - State
enum State {
    case start
    case loading
    case productLoaded(Product)
    case addedToCart(Product, CartResponse)
    case error(Error)
    
    static func reduce(state: State, event: Event) -> AsyncResult<State> {
        switch event {

        case .loadProduct:
            let productResult = getProduct(cached: false)
            
            return productResult
                .map { State.productLoaded($0) }
                .stateWhenLoading(State.loading)
                .mapErrorRecover { State.error($0) }
            
        case .addToCart:
            let productResult = getProduct(cached: true)
            let userResult = getUser()
            
            return AsyncResult<(Product, User)>.zip(productResult, userResult).flatMap { pair -> AsyncResult<State> in
                let (product, user) = pair
                
                return addToCart(product: product, user: user)
                    .map { State.addedToCart(product, $0) }
                    .mapErrorRecover{ State.error($0) }
            }
            .stateWhenLoading(State.loading)
        }
    }
}

fileprivate func getProduct(cached: Bool) -> AsyncResult<Product> {
    let delay: DispatchTime = cached ? .now() : .now() + 3
    let product = Product(
        title: "Yeezy Triple White",
        description: "YEEZY Boost 350 V2 “Triple White,” aka “Cream”. \n adidas Originals has officially announced its largest-ever YEEZY Boost 350 V2 release. The “Triple White” iteration of one of Kanye West’s most popular silhouettes will drop again on September 21 for a retail price of $220. The sneaker previously dropped under the “Cream” alias.",
        imageUrl: "https://static.highsnobiety.com/wp-content/uploads/2018/08/20172554/adidas-originals-yeezy-boost-350-v2-triple-white-release-date-price-02.jpg")
    
    let promise = Promise { seal in
        DispatchQueue.main.asyncAfter(deadline: delay) {
            seal.fulfill(product)
        }
    }

    return AsyncResult<Product>(promise)
}

fileprivate func addToCart(product: Product, user: User) -> AsyncResult<CartResponse> {
    let randomNumber = Int.random(in: 1..<10)

    let failedPromise = Promise<CartResponse>(error: NSError(domain: "Error adding to cart", code: 15, userInfo: nil))
    let promise = Promise<CartResponse>.value("Product: \(product.title) added to cart for user: \(user)")

    if randomNumber < 5 {
        return AsyncResult<CartResponse>(failedPromise)
    } else {
        return AsyncResult<CartResponse>(promise)
    }
}

fileprivate func getUser() -> AsyncResult<User> {
    let promise = Promise { seal in
        DispatchQueue.main.asyncAfter(deadline: .now() + 1) {
            seal.fulfill("Richi")
        }
    }

    return AsyncResult<User>(promise)
}

// MARK: - Presenter
class Presenter {
    
    var system: System<State, Event>?
    weak var view: View?
    
    init(view: View) {
        self.view = view
    }
    
    func sendEvent(_ event: Event) {
        system?.sendEvent(event)
    }
    
    func controllerLoaded() {
        system = System.pure(
            initialState: State.start,
            reducer: State.reduce,
            uiBindings: [view?.updateUI]
        )
    }
}

Bonus: analytics

If you want to add analytics to your app, you will end up with lots of calls to some TrackingService.trackEvent method scattered through the code, which can sometimes become a mess.

Luckily, PromisedArchitectureKit includes the “addLoopCallback(callback: @escaping (State)->())” function, which is called every time a state change occurs. The function receives the new state as a parameter, which can be used for analytics.

Analytics Example

func handleAnalytics(state: State) {
    switch state {
    case .start:
        EventTracker.trackEvent(event: .pdpShown)
        
    case .loading:
        EventTracker.trackEvent(event: .pdpLoading)

    case .productLoaded(let product):
        EventTracker.trackEvent(event: .productLoaded, attr: product)

    case .error(let error):
        EventTracker.trackEvent(event: .pdpError, attr: error)

        
    case .addedToCart(let product, _):
        EventTracker.trackEvent(event: .pdpAddedToCart, attr: product)

    }
}


func controllerLoaded() {
    system = System.pure(
        initialState: State.start,
        reducer: State.reduce,
        uiBindings: [view?.updateUI]
    )
        
    system?.addLoopCallback(callback: handleAnalytics)
}
    

By adding the handleAnalytics method as a system’s loop callback, we have all analytics in the same place, centralized.

Disclaimer: This will only work with analytics related to logic. If you need to track things like “User did scroll”, you will need to do it the same way as without the library.

Author

Ricardo Pallás

License

PromisedArchitectureKit is available under the MIT license. See the LICENSE file for more info.

Functional architecture for Swift

2018-01-07T23:48:00+00:00

In this article I am going to introduce a library for architecting iOS apps, called ArchitectureKit:

“Simplest architecture for FunctionalKit”

Introduction
Motivation
ArchitectureKit
Dependency Injection
Full example
Conclusion

1. Introduction

ArchitectureKit is a library that tries to enforce correctness and simplify the state management of applications and systems. It helps you write applications that behave consistently, and are easy to test. It’s strongly inspired by Redux and RxFeedback.

2. Motivation

I have been trying to find an architecture that simplifies managing and handling the state of mobile applications and that is also easy to test.

I started with Model-View-Controller (MVC), then Model-View-ViewModel (MVVM) and also Model-View-Presenter (MVP) along with Clean Architecture. MVC is not as easy to test as MVVM and MVP. MVVM and MVP are easy to test, but the UI state can become a mess, since there is no centralized way to update it and many methods scattered through the code can change the state.

Then Elm, Redux and other Redux-like architectures appeared, such as Redux-Observable, RxFeedback, Cycle.js and ReSwift. The main difference between these architectures (including ArchitectureKit) and MVP is that they introduce constraints on how the UI state can be updated, in order to enforce correctness and make apps easier to reason about.

What makes ArchitectureKit different from these Redux-like architectures is that it uses feedback loops to run effects, encodes those effects as part of the state (we will see this in point 3), and uses monads from FunctionalKit to wrap the effects.

ArchitectureKit runs side effects for you. Your code stays 100% pure.

3. ArchitectureKit

ArchitectureKit itself is very simple.

The core concept

Each screen of your app (and the whole app) has its own state. In ArchitectureKit, this state is represented as an object (i.e. a struct). For example, the state of a TO-DO app might look like this:

This State object represents the state of the “List of TO-DOs” screen in a TO-DO app. The “todos” var contains all the to-dos that might be drawn on the screen, and “visibilityFilter” tells which to-dos should appear in the list.
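A minimal sketch of such a state, assuming a hypothetical Todo model and a VisibilityFilter enum with three cases (neither definition appears in this article, so the names are illustrative only):

```swift
// Hypothetical model types, assumed for illustration.
struct Todo: Equatable {
    let text: String
    var completed: Bool
}

enum VisibilityFilter {
    case showAll
    case showCompleted
    case showActive
}

// The whole screen state lives in a single value type.
struct State {
    var todos: [Todo]
    var visibilityFilter: VisibilityFilter

    // What the list should draw is derived from the state alone.
    var visibleTodos: [Todo] {
        switch visibilityFilter {
        case .showAll: return todos
        case .showCompleted: return todos.filter { $0.completed }
        case .showActive: return todos.filter { !$0.completed }
        }
    }
}

let state = State(
    todos: [
        Todo(text: "Buy milk", completed: true),
        Todo(text: "Write post", completed: false)
    ],
    visibilityFilter: .showActive
)

print(state.visibleTodos.map { $0.text }) // prints ["Write post"]
```

Because the state is a plain value, the UI can be redrawn from it at any time without asking any other object what to show.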

With this approach of having an object that actually represents the state of a screen, views are a direct mapping of state:

view = f(state)

That “f” function will be the UI binding function that we will see later on.

To change something in the state, you need to dispatch an Event. An event is an enum that describes what happened. Here are a few example events:

Enforcing that every change is described as an event lets us have a clear understanding of what’s going on in the app. If something changed, we know why it changed. Events are like breadcrumbs of what has happened. Finally, to tie state and events together, we write a function called a reducer. A reducer is just a function that takes a state and an event as arguments and returns the next state of the app:

(State, Event) -> State

We write a reducer for every state of every screen. For the TO-DO list screen:

Notice that the reducer is a pure function in terms of referential transparency: for a state S and an event E it always returns the same state, and it has no side effects.
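As a rough sketch of what events and a reducer for the TO-DO list screen could look like (the event names addTodo and toggleTodo and the trimmed-down State are assumptions made for this example, not the library's API):

```swift
struct Todo: Equatable {
    let text: String
    var completed: Bool
}

// Trimmed-down state, assumed for this sketch.
struct State: Equatable {
    var todos: [Todo]
}

// Hypothetical events describing what happened.
enum Event {
    case addTodo(String)
    case toggleTodo(index: Int)
}

// Pure reducer: (State, Event) -> State, no side effects.
func reduce(state: State, event: Event) -> State {
    var next = state // State is a value type, so this is a copy.
    switch event {
    case .addTodo(let text):
        next.todos.append(Todo(text: text, completed: false))
    case .toggleTodo(let index):
        // Ignore events that point outside the list.
        guard next.todos.indices.contains(index) else { return state }
        next.todos[index].completed.toggle()
    }
    return next
}

// Replaying a list of events over an initial state.
let events: [Event] = [.addTodo("Buy milk"), .addTodo("Write post"), .toggleTodo(index: 0)]
let finalState = events.reduce(State(todos: []), reduce(state:event:))

print(finalState.todos.map { "\($0.text): \($0.completed)" })
// prints ["Buy milk: true", "Write post: false"]
```

Since the reducer only depends on its arguments, testing it comes down to comparing the returned state with the expected one.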

This is basically the whole idea of ArchitectureKit. Note that we haven’t used any ArchitectureKit APIs. It comes with a few utilities to facilitate this pattern, but the main idea is that you describe how your state is updated over time in response to events, and 90% of the code you write is just plain Swift, so the UI logic can be tested with ease.

But what about asynchronous code and side effects such as API calls, DB calls, logging, and reading and writing files?

AsyncResult and FunctionalKit

The AsyncResult data structure is used for handling asynchronous operations. AsyncResult is just a typealias for a monad stack of Reader, Future and Result. These monads (and their monad transformers) are available in FunctionalKit, which is the only dependency of ArchitectureKit.

FunctionalKit provides basic functions and combinators for functional programming in Swift, and it can be considered an extension to Foundation. We mainly use the Reader monad along with Future and Result.

Reader monad: it is used at the top of the monad stack to provide a way to inject dependencies. We will see it in depth later.
Future monad: it is used to represent async values.
Result monad: it represents whether a computation was successful or there was an error.

We use monad transformers to create AsyncResult as a stack of these three monads. An AsyncResult is a monad that represents an asynchronous operation that returns either a successful value or an error, and also provides a mechanism for dependency injection.

We can see in the following snippet an example of Facebook login using AsyncResult:

To create an AsyncResult we use its static method unfoldTT (the TT stands for transformer, since it is a monad transformer). It expects a function with two inputs as its parameter: an environment or context, and a continuation or callback. The environment parameter comes from the Reader monad and is an object that contains the injected dependencies. The continuation parameter is a callback function that must be called with the Result value returned from the async operation. In the example, the Result holds a string when it succeeds. When the login succeeds, we call the continuation with a successful Result containing the token from Facebook. If the login fails, we call the continuation with a failure Result that contains the error.

The AsyncResult must be parameterized with three types. The first is the Environment type (which contains the dependencies), the second is the type of the value expected from the async operation (in the example we use a string because we expect the Facebook login to return the login token), and the last is the error type the Result will carry if something goes wrong.
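To make the shape of that stack concrete, here is a simplified, self-contained stand-in. It is not FunctionalKit's actual type and it does not use unfoldTT; the AsyncResult struct, AppEnvironment, LoginError and the facebookLogin closure below are all assumptions for illustration, built on Swift's standard Result type:

```swift
// Simplified stand-in for the Reader + Future + Result stack (not the library's type).
struct AsyncResult<Environment, Value, Failure: Error> {
    // Reader: needs an environment. Future: answers through a continuation.
    // Result: the answer is either a success or an error.
    let run: (Environment, @escaping (Result<Value, Failure>) -> Void) -> Void

    // Transforms the successful value and leaves errors untouched.
    func map<NewValue>(_ transform: @escaping (Value) -> NewValue) -> AsyncResult<Environment, NewValue, Failure> {
        let run = self.run
        return AsyncResult<Environment, NewValue, Failure> { environment, continuation in
            run(environment) { result in continuation(result.map(transform)) }
        }
    }
}

enum LoginError: Error, Equatable {
    case cancelled
}

// Hypothetical dependency container injected through the environment.
struct AppEnvironment {
    // Calls back with a token, or nil if the login did not succeed.
    let facebookLogin: (@escaping (String?) -> Void) -> Void
}

func login() -> AsyncResult<AppEnvironment, String, LoginError> {
    return AsyncResult { environment, continuation in
        environment.facebookLogin { token in
            if let token = token {
                continuation(.success(token))
            } else {
                continuation(.failure(.cancelled))
            }
        }
    }
}

// Nothing runs until an environment is supplied, so a fake one is enough for tests.
let fakeEnvironment = AppEnvironment(facebookLogin: { completion in completion("fake-token") })
var received: Result<Int, LoginError>?
login().map { $0.count }.run(fakeEnvironment) { received = $0 }
```

The three type parameters line up with the description above: the environment that carries the dependencies, the value the operation yields, and the error it may fail with.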

Every asynchronous operation and side effect must be performed using the AsyncResult monad, and we will use Feedbacks from ArchitectureKit to execute their side effects. We will also see how to work with AsyncResults in the full example.

Design feedback loops

Let’s add a new feature to our previous TO-DO app! We want to let users save their TO-DOs to the cloud. That requires a network call, which is an asynchronous side effect, so we will use feedback loops to achieve it. The way to deal with effects in ArchitectureKit is to encode them as part of the state and then design the feedback loop.

A feedback loop is just a computation that is triggered in some cases, depending on the current state of the system, that launches a new event, and produces a new state.

A whole ArchitectureKit loop begins with a UserAction that triggers an event. Then the reducer function computes a new state from the event and the previous state. ArchitectureKit checks whether the new state must trigger any feedback loop. If so, the feedback produces a new event asynchronously (by executing side effects), and a new state is computed from the feedback’s event.

So, we can see a whole ArchitectureKit loop as the following sequence:

1. UserAction produces an event.
2. reducer(current state, event) -> new state.
3. Query the new state to check whether a feedback loop must be triggered.
4. If so, a new event is triggered (side effects executed).
5. reducer(new state, new event) -> newer state.
6. Repeat from step 3 until there is no more feedback (or a maximum of 5 feedback loops).

ArchitectureKit whole loop

In the following code snippet, we can see a Feedback example of how to store the user’s TO-DOs in the cloud:

To implement this feature, two new events have been added, storeTodos() and todosStored(Bool), and there is a new Bool variable in the state: mustStoreTodos. The storeUserTodos(todos:[Todo]) function is the one executed in the feedback loop. It is in charge of storing the user’s TO-DOs and returns an AsyncResult monad that yields the todosStored(Bool) event once the side effects have been executed.

A Feedback object is composed of two functions that receive the current state as a parameter. The first function is the actual AsyncResult to be executed, and the second one decides, depending on the state, when the feedback loop must be executed. In the example, the feedback that stores the user’s TO-DOs is executed when the mustStoreTodos variable is true.

In the new reducer, the storeTodos() event sets mustStoreTodos to true, and todosStored(Bool) sets it back to false. The storeTodos() event will be triggered by a UserAction, like tapping a button.
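The pieces described above can be sketched in plain Swift. This is only an approximation under stated assumptions: the Feedback struct here uses a callback instead of an AsyncResult, storeUserTodos is a synchronous stand-in for the network call, the lastStoreSucceeded field and the run function are invented for the example, and the loop driver is not the library's implementation:

```swift
struct Todo {
    let text: String
}

struct State {
    var todos: [Todo]
    var mustStoreTodos: Bool
    var lastStoreSucceeded: Bool? // Assumed field, kept only to observe the outcome.
}

enum Event {
    case storeTodos
    case todosStored(Bool)
}

func reduce(state: State, event: Event) -> State {
    var next = state
    switch event {
    case .storeTodos:
        next.mustStoreTodos = true // Encodes "an effect is pending" in the state.
    case .todosStored(let succeeded):
        next.mustStoreTodos = false
        next.lastStoreSucceeded = succeeded
    }
    return next
}

// Simplified Feedback: an effect plus a predicate, both driven by the current state.
struct Feedback {
    let effect: (State, @escaping (Event) -> Void) -> Void
    let shouldRun: (State) -> Bool
}

// Stand-in for the network call; it reports success when there is something to store.
func storeUserTodos(todos: [Todo], completion: @escaping (Event) -> Void) {
    completion(.todosStored(!todos.isEmpty))
}

let storeFeedback = Feedback(
    effect: { state, dispatch in storeUserTodos(todos: state.todos, completion: dispatch) },
    shouldRun: { $0.mustStoreTodos }
)

// Reduces the incoming event, then keeps running feedbacks while any of them applies.
func run(event: Event, from state: State, feedbacks: [Feedback], maxLoops: Int = 5) -> State {
    var current = reduce(state: state, event: event)
    for _ in 0..<maxLoops {
        guard let feedback = feedbacks.first(where: { $0.shouldRun(current) }) else { break }
        var produced: Event?
        feedback.effect(current) { produced = $0 }
        guard let next = produced else { break }
        current = reduce(state: current, event: next)
    }
    return current
}

let initial = State(todos: [Todo(text: "Buy milk")], mustStoreTodos: false, lastStoreSucceeded: nil)
let finalState = run(event: .storeTodos, from: initial, feedbacks: [storeFeedback])
```

After the run, mustStoreTodos is back to false and lastStoreSucceeded is true: the storeTodos event raised the flag, the feedback saw it and produced todosStored(true), and the reducer cleared the flag again.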

The following diagram illustrates the steps for storing the user’s TO-DOs:

How the feedback loop is executed after a UserAction

Who dispatches events? UserActions

UserAction is the object from ArchitectureKit that represents any action from the User or the iOS framework that triggers an event that changes the state (and from that state change it could trigger a feedback loop).

It has two methods:

init: creates the UserAction and specifies which event will be triggered when the user action is executed
execute: executes the user action.

Simple example

Here is a simple example of what ArchitectureKit code looks like:

It’s a simple counter with increment and decrement buttons. The State is just an integer that contains the current count.
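A self-contained approximation of that counter follows. CounterSystem and this UserAction are minimal stand-ins written for the example (the real library types have their own signatures, which are not shown in this article); they only mirror the ideas of a reducer, a UI binding, and a user action with init and execute:

```swift
// The state is just the current count.
typealias State = Int

enum Event {
    case increment
    case decrement
}

func reduce(state: State, event: Event) -> State {
    switch event {
    case .increment: return state + 1
    case .decrement: return state - 1
    }
}

// Minimal stand-in for the system: holds the state, applies the reducer, notifies the UI.
final class CounterSystem {
    private(set) var state: State
    private let uiBinding: (State) -> Void

    init(initialState: State, uiBinding: @escaping (State) -> Void) {
        self.state = initialState
        self.uiBinding = uiBinding
        uiBinding(initialState) // view = f(state), starting with the initial state.
    }

    func send(_ event: Event) {
        state = reduce(state: state, event: event)
        uiBinding(state)
    }
}

// Mirrors the two methods described above: init fixes the event, execute dispatches it.
struct UserAction {
    private let event: Event
    private let dispatch: (Event) -> Void

    init(trigger event: Event, dispatch: @escaping (Event) -> Void) {
        self.event = event
        self.dispatch = dispatch
    }

    func execute() {
        dispatch(event)
    }
}

// A label would normally be updated here; the strings are collected instead.
var rendered: [String] = []
let system = CounterSystem(initialState: 0) { rendered.append("Count: \($0)") }

let incrementAction = UserAction(trigger: .increment, dispatch: system.send)
let decrementAction = UserAction(trigger: .decrement, dispatch: system.send)

// In an app these calls would come from the buttons' tap handlers.
incrementAction.execute()
incrementAction.execute()
decrementAction.execute()

print(rendered) // prints ["Count: 0", "Count: 1", "Count: 2", "Count: 1"]
```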

Dependency injection

see Jorge Castillo article in Kotlin using Kategory

Full example

also add a diagram

key: view = f(state) direct mapping between state and view

ArchitectureKit runs side effects for you. So your code stays 100% pure

Conclusion

(when I would use ArchitectureKit)

difference: ArchitectureKit uses feedback loops; I recommend using it with Functional Clean Architecture; it runs the side effects

but Functional Clean Architecture (functions and no objects, except objects for dependency injection and protocols)

next steps: create user actions for every UIKit control

Functional data validation in Swift

2017-09-24T22:48:00+00:00

I am going to talk about a little library I created in Swift to be used either standalone or with the Swiftz lib. It is called Swiftz-Validation.

What is Swiftz-Validation?

It’s a data structure that typically models form validations, and other scenarios where you want to aggregate all failures, rather than short-circuit if an error happens (for which Swiftx’s Either is better suited). A Validation may either be a Success(value), which contains a successful value, or a Failure(value), which contains an error.

A Validation is a data structure that implements the Applicative interface (.ap), and does so in a way that if a failure is applied to another failure, then it results in a new validation that contains the failures of both validations. In other words, Validation is a data structure made for errors that can be aggregated, and it makes sense in the contexts of things like form validations, where you want to display to the user all of the fields that failed the validation rather than just stopping at the first failure.
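The accumulating behaviour of ap can be sketched with a tiny stand-in type. This is not the Swiftz-Validation source: the enum below uses lowercase cases and stores failures as an array to keep the example self-contained, and makeGreeting is a made-up function for the demonstration:

```swift
// Hypothetical stand-in for the library's type; failures are collected in an array.
enum Validation<E, A> {
    case success(A)
    case failure([E])

    // Applicative-style application: when both sides failed, both sets of errors are kept.
    func ap<B>(_ f: Validation<E, (A) -> B>) -> Validation<E, B> {
        switch (f, self) {
        case let (.success(fn), .success(value)):
            return .success(fn(value))
        case let (.failure(first), .failure(second)):
            return .failure(first + second)
        case let (.failure(errors), .success):
            return .failure(errors)
        case let (.success, .failure(errors)):
            return .failure(errors)
        }
    }
}

// A failed validation holding a function, and a failed validation holding its argument.
let makeGreeting: Validation<String, (String) -> String> = .failure(["Name is too short."])
let password: Validation<String, String> = .failure(["Password is too short."])

if case let .failure(errors) = password.ap(makeGreeting) {
    print(errors) // prints ["Name is too short.", "Password is too short."]
}
```

An Either-style type would have stopped at the first failure, whereas here both messages survive, which is what a form needs in order to show every invalid field at once.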

Validations can’t be as easily used for sequencing operations because the .ap method takes two validations, so the operations that create them must have been executed already. While it is possible to use Validations in a sequential manner, it’s better to leave the job to Either, a data structure made for that.

Validating data example

In the following example we are going to validate a password: it should contain at least 8 characters, it should contain a special character and it has to be different from the user name.

    
    //Check if the password is long enough
    func isPasswordLongEnough(_ password:String) -> Validation<[String], String> {
        if password.count < 8 {
            return Validation.Failure(["Password must have more than 8 characters."])
        } else {
            return Validation.Success(password)
        }
    }
    
    //Check if the password contains a special character
    func isPasswordStrongEnough(_ password:String) -> Validation<[String], String> {
        if (password.range(of:"[\\W]", options: .regularExpression) != nil){
            return Validation.Success(password)
        } else {
            return Validation.Failure(["Password must contain a special character."])
        }
    }
    
    //Check if the user is different from password, by Jlopez
    func isDifferentUserPass(_ user:String, _ password:String) -> Validation<[String], String> {
        if (user == password){
            return Validation.Failure(["Username and password MUST be different."])
        } else {
            return Validation.Success(password)
        }
    }
    

    //Concatenating all validations into one that checks all rules
    func isPasswordValid(user: String, password:String) -> Validation<[String], String> {
        return isPasswordLongEnough(password)
            .sconcat(isPasswordStrongEnough(password))
            .sconcat(isDifferentUserPass(user, password))
    }


    //Example with invalid password
    let result = isPasswordValid(user: "Richi", password: "Richi")
    /* ▿ Validation<Array<String>, String>
           ▿ Failure : 3 elements
                - 0 : "Password must have more than 8 characters."
                - 1 : "Password must contain a special character."
                - 2 : "Username and password MUST be different."
    */

    //Example with valid password
    let validResult = isPasswordValid(user: "Richi", password: "Ricardo$")
    /*
       ▿ Validation<Array<String>, String>
           - Success : "Ricardo$"
    */

Advantages of using Swiftz-Validation

Things like form and schema validation are pretty common in programming, but we end up either using branching or designing very specific solutions for each case.

By branching I mean using if-else conditions. Things quickly get out of hand: it doesn’t scale, because it’s difficult to abstract over it and hard to reason about each rule. Let’s see the same validation as before, written with branching:

func validatePassword(username: String, password:String) -> [String]{
        var errors:[String] = []
        
        if password.count < 8 {
            errors.append("Password must have more than 8 characters.")
        }
        
        if (password.range(of:"[\\W]", options: .regularExpression) == nil){
            errors.append("Password must contain a special character.")
        }
        
        if (username == password){
            errors.append("Username and password MUST be different.")
        }
        
        return errors
    }
    
    validatePassword(username: "Richi", password: "Richi")
    /*
     * Array 3 elements:
     - 0: "Password must have more than 8 characters."
     - 1: "Password must contain a special character."
     - 2: "Username and password MUST be different."
     */
    
    validatePassword(username: "Richi", password: "Ricardo$")
    /*
     * Array 0 elements
     */

Because this function uses if conditions and modifies a local variable it’s not very modular. This means it’s not possible to split these checks in smaller pieces that can be entirely understood by themselves — they modify something, and so you have to understand how they modify that thing, in which context, etc. For very simple things it’s not too bad, but as complexity grows it becomes unmanageable.

Advantages

The main advantages of Swiftz-Validation are:

Easy to understand and reason about each validation on its own
Easy to compose validation rules
Easy to reuse validation rules and compose more complex validations (DRY principle)
It has a well-known interface or abstraction to work with (it is a functor, pointed, applicative and a semigroup), so you can combine validations with sconcat (Semigroup), apply functions with ap (Applicative), transform results with fmap (Functor) and react to results with a kind of pattern matching using a switch statement.

In the following example, you can see how the Validation structure gives you a tool on which to base validation libraries and functions in a way that’s reusable (DRY) and composable:

    //Validate min length
    func minLength(_ value:String, min:Int, fieldName:String) -> Validation<[String], String>{
        if(value.count < min){
            return Validation.Failure(["\(fieldName) must have more than \(min) characters"])
        } else {
            return Validation.Success(value)
        }
    }
    
    //Validate match a regular expression
    func matches(_ value:String, regex:String, errorMessage:String) -> Validation<[String], String>{
        if(value.range(of:regex, options: .regularExpression) == nil){
            return Validation.Failure([errorMessage])
        } else {
            return Validation.Success(value)
        }
    }
    
    //Validate password: concatenation of matches and minLength
    func isPasswordValid(_ password:String) -> Validation<[String], String> {
        return matches(password, regex: "[\\W]", errorMessage: "Password must contain an special character")
            .sconcat(minLength(password, min: 8, fieldName: "Password"))
    }
    
    //Validate name: minLength
    func isNameValid(_ name: String) -> Validation<[String], String> {
        return minLength(name, min: 3, fieldName: "Name")
    }
    
    //Validate form: concatenation of isPasswordValid and isNameValid
    func validateForm(name: String, password: String) -> Validation<[String], String> {
        return isNameValid(name)
            .sconcat(isPasswordValid(password))
    }
    
    
    //Examples
    let result = validateForm(name: "FP", password: "Ricardo$")
    /*▿ Validation<Array<String>, String>
      ▿ Failure : 1 element
        - 0 : "Name must have more than 3 characters"
    */
    let result1 = validateForm(name: "FP", password: "A")
    /*  Validation<Array<String>, String>
      ▿ Failure : 3 elements
        - 0 : "Name must have more than 3 characters"
        - 1 : "Password must contain an special character"
        - 2 : "Password must have more than 8 characters"
    */
    let result2 = validateForm(name: "FPZ", password: "A")
    /* ▿ Validation<Array<String>, String>
       ▿ Failure : 2 elements
        - 0 : "Password must contain an special character"
        - 1 : "Password must have more than 8 characters"
    */
    let result3 = validateForm(name: "FPZ", password: "A$k34k21!!")
    /* ▿ Validation<Array<String>, String>
         - Success : "A$k34k21!!"
    */

How to use the library

The Validation lib is implemented as an enum with two cases:

Success(successValue) — represents a successful value.
Failure(failureValue) — represents an unsuccessful value.

Validation functions just return one of these two cases instead of throwing errors or mutating other variables. The keys of working with Validations are:

Combining validations: sometimes we want to create very complex validation rules. The key is to create simple validations that are reusable and composable on their own and combine them into a complex validation structure.
Transforming validation values: sometimes we get a Validation value that is not what we are looking for. We don’t really want to change anything about the status of the validation (whether it passed or failed), but we’d like to tweak the value a little bit. This is the equivalent of applying functions in an expression.
Reacting to validation results: once we have the validation results, we need a way to react accordingly, depending on whether the value is a success or a failure.

Now, we are going to see some examples:

Combining validations

    //Validate min length
    func minLength(_ value:String, min:Int, fieldName:String) -> Validation<[String], String>{
        if(value.count < min){
            return Validation.Failure(["\(fieldName) must have more than \(min) characters"])
        } else {
            return Validation.Success(value)
        }
    }
    
    //Validate match a regular expression
    func matches(_ value:String, regex:String, errorMessage:String) -> Validation<[String], String>{
        if(value.range(of:regex, options: .regularExpression) == nil){
            return Validation.Failure([errorMessage])
        } else {
            return Validation.Success(value)
        }
    }
    
    //Validate password: concatenation of matches and minLength
    func isPasswordValid(_ password:String) -> Validation<[String], String> {
        return matches(password, regex: "[\\W]", errorMessage: "Password must contain an special character")
            .sconcat(minLength(password, min: 8, fieldName: "Password"))
    }
    
 
    
    //isPasswordValid is a more complex validation created by combining the minLength and matches validations
    let result = isPasswordValid("A")
    /*  Validation<Array<String>, String>
      ▿ Failure : 2 elements
        - 0 : "Password must contain an special character"
        - 1 : "Password must have more than 8 characters"
    */
   

Transforming validation values

    
    //The fmap function is only applied on Success values.
    
    let success: Validation<String, Int> = Validation.Success(1)
    success.fmap{ $0 + 1 }
    // ==> Validation.Success(2)
    
    let failure: Validation<String, Int> = Validation.Failure("error")
    failure.fmap{$0 + 1}
    // ==> Validation.Failure("error")

Reacting to validation results

        //You can react to the validation result value, either it's a success or a failure
        
        let success: Validation<String, Int> = Validation.Success(1)
        switch(success){
        case .Success(let value):
            print(value)
        case .Failure(let error):
            print(error)
        }
        // ==> Print(1)
        
        let failure: Validation<String, Int> = Validation.Failure("error")
        switch(failure){
        case .Success(let value):
            print(value)
        case .Failure(let error):
            print(error)
        }
        // ==> Print("error")

Conclusion

I wrote this lib as a personal experiment, since the core Swiftz library doesn’t include a similar data structure and I think it is a very important one, because validation is pretty common in every software program. The lib is still a work in progress, but it can be used with Swiftz or standalone. I would like to add more operations, like liftA3 and similar.

Feel free to send a pull request to the repo and improve it, thanks!

The lib is inspired by the Validation package for Haskell: https://hackage.haskell.org/package/Validation

Acknowledgements

Thanks to Jose Luis Alcala for helping me with Swift and SwiftZ.
Thanks to @jlopez_rz for helping me with test cases.
Thanks to Jorge Aznar for helping me write this article.

Awesome functional programming in JavaScript (Spanish version)

2017-08-31T23:48:00+00:00

JavaScript is a multi-paradigm programming language, almost always used in an object-oriented style, although given its huge popularity it could be said to be the most widely used functional programming (FP) language.

Disclaimer: This article does not aim to teach or introduce functional programming in depth. It is a guide to libraries and resources for using most of the tools and capabilities that the functional programming style offers in JavaScript.

Lambda: represents the lambda calculus, the basis of functional programming.

What is functional programming?

Let’s start with a brief introduction to functional programming to set the context.

Functional programming is a programming paradigm based on functions modeled as mathematical functions. The essence of functional programming is that programs are a combination of expressions. Expressions can be concrete values, variables or functions. Functions can be defined more specifically: they are expressions that are applied to an argument or input and, once applied, can be reduced or evaluated. In functional languages and modern languages, functions are first-class citizens: they can be used as values or passed as arguments, or inputs, to other functions.

It is worth noting that purely functional languages are all based on the lambda calculus.

Why JavaScript?

It is certainly not the best language for FP, given that the mainstream tends to write JavaScript imperatively. Besides, it is not a pure functional programming language, and it is weakly and dynamically typed 😥.

However, it has these points in its favor:

It is one of the most widely used languages in the industry, and you probably work with it.
Most likely you already know how to program in JavaScript. You don’t have to learn a new language.
With the help of libraries you can use many of the tools of FP.

Libraries (that make functional programming easier)

In this section I am going to talk about a few libraries that I have chosen and used for FP in JavaScript.

Sanctuary

The first one (and my favorite) is Sanctuary, whose motto is “Refuge from unsafe JavaScript”, meaning that it helps eliminate many runtime errors, especially those caused by null values.

Sanctuary is inspired by the Haskell and PureScript programming languages. It provides a set of functions similar to those of Ramda and lodash-fp, but many of them are safe and work directly with data types, such as the Maybe type.

It provides two basic data types, Maybe and Either, which comply with the Fantasy Land specification, the de facto specification for ADTs in JavaScript.

It promotes a safer programming style, without the hateful null checks, and reduces possible runtime errors.

A great advantage is its ad hoc run-time type system, defined in sanctuary-def. With this system we can detect type errors immediately, which spares us the classic JavaScript surprises…

It is worth reading the interview with its creator, which explains the reasoning behind the library.

Fluture

Fluture is a library that provides a monad for running asynchronous code, similar to a promise. As such, it represents a success or failure value resulting from an asynchronous I/O operation. The difference is that a Fluture is a monad (it complies with the monad interface and laws) and is evaluated lazily, so it does not trigger side effects when it is created.

The similarities between a Promise and a monad can cause confusion. One could say that .then is a bind, that resolve is a pure, and so on. But keep in mind that a promise does not offer the interface specified by Fantasy Land, let alone obey the monad laws. Promises also run the asynchronous operation (side effects) as soon as they are created. This is explained more clearly in this article.

Daggy

Daggy is a small but very useful library for creating ADTs (also called union types by the JS/Elm community). They make it possible to represent complex data naturally and even to emulate pattern matching.

The following example shows a brilliant way of using ADTs created with Daggy in React components.

Others

Also worth mentioning is the RamdaJS library, which I once used together with Sanctuary. It is a utility library similar to Underscore or lodash, but Sanctuary is now much more mature than it was at the beginning and has incorporated most of the functions Ramda provides, so I stopped using it. The main differences between the two are how they handle invalid inputs and the run-time type system. Ramda is less safe (its functions can throw exceptions) because its creators do not want to use data types (they believe it would cost them users), and it only provides typical functional programming functions, such as map, without exploiting the full potential of this programming paradigm.

And FolktaleJS, one of the pioneers. Its version 1.0 has always been highly respected, and the team is now doing great work on 2.0, rewriting it completely. It follows the same concept as Sanctuary: functions such as map, curry and chain, and data types such as Maybe, Either and Validation. One could say that Folktale leans toward a Java style and Sanctuary toward Haskell. Folktale also includes the Task type for asynchronous tasks, very similar to what the Fluture library offers.

Books and articles

Professor Frisby’s Mostly Adequate Guide to Functional Programming: the quintessential book on FP in JS, written by Brian Lonsdorf. It introduces the functional programming paradigm in general using JavaScript. It is a practical introduction that builds up real examples starting from intuition. It is an essential book if you have no previous experience with FP.
Functional-Light JavaScript: this book explores the basic principles of FP that can be applied in JS. It stands out for its practical approach, without all the terminology that puts many people off.
Why Curry Helps: an overview of how currying helps write more reusable and declarative code.
Functional Mumbo Jumbo, ADTs: an introduction to algebraic data types for beginners.

Examples

In the repository for the workshop I gave at Congreso Web in 2016 on functional programming in JS, you will find the slides and the example project, a YouTube video search app (built with React and FP).
Escape from Callback Mountain: Refactoring, design, and good practices.
Design & refactoring tips for Promise-based Functional JavaScript. Key benefits include better readability, testability, and reusability. MIT.

Other resources

The Awesome FP JS repository provides many resources beyond those covered here.

Bonus: TypeScript

One more note: Giulio Canti is in the process of creating the FP-TS library, which will enable functional programming in TypeScript, with the big improvement over JavaScript of types, which help you write more correct code. Thanks to types, a certain degree of pattern matching can be emulated in TypeScript without additional libraries, as explained here.

Conclusion

By combining the Sanctuary, Fluture, and Daggy libraries, we can program in the functional paradigm in a way very similar to how we would in any other functional language, allowing for the differences.

With this article I wanted to show that it is possible to put the concepts and tools of FP into practice without resorting to a specialized language.

On the other hand, doing FP alone will not make your programs perfect; you are exposed to the same problems as without FP. Learning FP will, however, teach you new concepts that make you a better programmer, as discussed here.

I also want to recommend other typed languages such as PureScript, Elm, and ReasonML for frontend development. They help you write more correct programs, especially in larger and more complex projects.

Before finishing, note that the links placed throughout the article lead to further information.

Finally, in the next post we will see a simple example of how to safely parse JSON from a server with imperative programming versus functional programming.
