Starting a Rust learning journey with a simple parser

Before diving into learning some Rust, let’s have some background on the motivation why I want to learn Rust in the first place. I’ve written Java professionally for a few years, using different Java versions, as the language grows, more and more recent Java version features aim at increasing the expressiveness of the language and reducing the verbosity of the existing syntax, such as some features like records, switch expressions or pattern matching. However, I personally think the motivation for building such features was highly influenced by other languages such as Rust or Scala.

Rust is one of the most loved and popular languages in the world right now because of its memory safety and performance. Not only that, it also allows me to write a different type of programming than what I was doing in Java, in which I mainly write OOP code and don’t have low-level memory access. So why not start learning some new language like Rust to think differently when writing a program (since Rust doesn’t enforce any programming paradigm), have a deeper understanding of memory management, be exposed to new vocabulary and concepts, and figure out why it’s more performant compared to Java?

In this very first post, we will start with something simple but non-trivial. The outcome should be that we can get familiarized with some of the most important Rust syntax, such as standard libraries, some of its data structures, primitive types, and preludes. Some of the vocabulary might be new to us, but we will get along with them after finishing this post.

Let’s start our first coding challenge in Rust by writing a really simple parser. I’m really glad that my friend suggested this idea instead of jumping straight to more complicated ones in the first place. Here is what we need to do:

Given a text file containing words in each line, output all unique words in an aphabetically order and listing off all the lines that they appear in. If any of them appears in a single line more than one then only print the line number once for it. Also, line numbers each word appeared in should also be sorted in an increasing order.

Let’s sketch some high-level ideas on what should we do to complete this task:

Read a whole text file to get the content
Use suitable data structures to conveniently store and process the input
Output the result when all the file content is processed

Starting off by reading a file, we’re introduced to the concept of modules in Rust. Which is a way to organize and group related functions. For example, the fs module is used for file system manipulation functions such as reading or writing files, and the os module is for OS-specific functions such as file permissions and process management. To access a function inside a module, we can use the :: (double dots), they are the path separators used to navigate through modules, structs, enums, and other items within a namespace. Here we have the fs module existing in the std namespace. We need to import the fs module into our code:

use std::fs;
fn read_file(file_path: &str) -> std::io::Result<String> { 
  fs::read_to_string(file_path)
}

The naming convention in Rust is the snake case, that is why we have read_file and read_to_string. We take a file path as an argument and return the content of this file. In Rust, the return value of a function is specified by the last expression within the function without a semicolon. The &str parameter is a type that represents a reference to a string slice. The & we can think as a pointer in other languages such as C but with other characteristics such as borrowing (you can access the data without being responsible for its memory), immutable by default (cannot modify the referenced value unless you explicitly use a mutable reference (&mut), and also safety (the reference cannot live longer than the data it points to). The Result type represents either success (Ok(T)) or failure (Err(E)). In our case, a string will be returned if our reading operation is succeeded.

The prelude is a collection of commonly used types, traits, and functions that are automatically imported into every Rust program. Similar to the Java program where the java.lang is automatically imported.

Next, we want to go through each line and lowercase all the words:

fn get_lines_lower_case(text: &str) -> Vec<String> {
    text.to_lowercase()
        .lines()
        .map(|line| line.to_string())
        .collect()
}

We see some new syntax |line| line.to_string() which is a closure. That is how we create an anonymous function in Rust. Inside the || will be the function parameters, and after that is the function body, and the whole function get_lines_lower_case returns a list of strings as a vector.

Now, let’s go to the main part in which we take a text, and then return a data structure holding our sorted output:

fn process_text(text: &str) -> BTreeMap<String, BTreeSet<usize>> {
    let mut word_lines: BTreeMap<String, BTreeSet<usize>> = BTreeMap::new();
    let lines = get_lines_lower_case(text);
...
}

We use the let keyword to declare a variable in Rust. The BTreeMap is a data structure similar to a HashMap, but also includes another property in which its elements are sorted by the key and the keys will be unique. That’s exactly what we need for storing our word index. We use BTreeSet as the value of our map since we also want to uniquely store the lines where words appear and also want them to be sorted. The usize is an unsigned integer type and platform dependent, I use macOS so the size of usize will be 64 bits (8 bytes). As we want to mutate the word_lines directly, we need to prefix it with the mut keyword. Now we have a full function body:

fn process_text(text: &str) -> BTreeMap<String, BTreeSet<usize>> {
    let mut word_lines: BTreeMap<String, BTreeSet<usize>> = BTreeMap::new();
    let lines = get_lines_lower_case(text);
    for (line_number, line) in lines.iter().enumerate() {
        let words: HashSet<String> = line
            .split_whitespace()
            .map(|word| word.to_string())
            .collect();
        for word in words {
            word_lines
                .entry(word)
                .or_insert_with(BTreeSet::new)
                .insert(line_number + 1);
        }
    }
    word_lines
}

In Rust, a tuple is a collection of multiple values that can be of different types. What is returned lines.iter().enumerate() is a tuple having the first element as an integer, and the second element as a string representing the line content.

For each line, we split words by white space and stored them in a set. Then we iterate through each word in our set and update our word_lines. First, we look up the word in our map, if we cannot find it, then the entry function will insert that key into our map, also creating a new BTreeSet value for this key. The insert function will add a new line number to the set of lines that a word appeared. We need to add 1 to it because the line_number starts with 0. Finally, the last expression word_lines returns our data structure wrapping the result.

We now create another function that takes a command line argument and returns a path where our input file is stored:

fn read_file_from_args() -> io::Result<String> {
    let args: Vec<String> = env::args().collect();

    if args.len() < 2 {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "Usage: <file_path>"));
    }

    let file_path = &args[1];
    read_file(file_path)
}

The args.len() >= 2 will make sure that we have at least 2 arguments when trying to run the program since the first argument can be a path to the Cargo executable, and our input path will be on the second argument, and we access it with &args[1].

Finally, we are now ready to glue all functions together and execute the program:

use std::collections::{BTreeMap, BTreeSet, HashSet};
use std::env;
use std::fs;
use std::io;

fn read_file(file_path: &str) -> io::Result<String> {
    fs::read_to_string(file_path)
}

fn get_lines_lower_case(text: &str) -> Vec<String> {
    text.to_lowercase()
        .lines()
        .map(|line| line.to_string())
        .collect()
}

fn process_words(content: &str) -> BTreeMap<String, BTreeSet<usize>> {
    let mut word_map: BTreeMap<String, BTreeSet<usize>> = BTreeMap::new();
    let lines = get_lines_lower_case(content);

    for (line_number, line) in lines.iter().enumerate() {
        let words: HashSet<&str> = line.split_whitespace().collect(); 
        for word in words {
            let word = word.to_string();
            word_map
                .entry(word)
                .or_insert_with(BTreeSet::new)
                .insert(line_number + 1);
        }
    }

    word_map
}

fn read_file_from_args() -> io::Result<String> {
    let args: Vec<String> = env::args().collect();

    if args.len() < 2 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "Usage: <file_path>",
        ));
    }

    let file_path = &args[1];
    read_file(file_path)
}

fn main() {
    let content = match read_file_from_args() {
        Ok(text) => text,
        Err(e) => {
            eprintln!("Error: {}", e);
            std::process::exit(1);
        }
    };

    let word_map = process_words(&content);

    for (word, lines) in word_map {
        println!("Word: '{}', Lines: {:?}", word, lines);
    }
}

Similar to Maven, or Gradle in the Java world, Cargo is a package manager and build tool in Rust. To run our program, we can navigate inside our project, and run the cargo command:

cargo run -- /path/to/the/input/file.txt

We have a simple input file like this:

Apple banana cherry
Banana dog cat
Rust is a programming language
Hello world from Rust
The quick brown fox jumps
The lazy dog is cute
This is a test file
For testing random purposes
Banana and apple are fruits
Goodbye and hello

The final output we are expected to see:

Word: 'a', Lines: {3, 7}
Word: 'and', Lines: {9, 10}
Word: 'apple', Lines: {1, 9}
Word: 'are', Lines: {9}
Word: 'banana', Lines: {1, 2, 9}
Word: 'brown', Lines: {5}
Word: 'cat', Lines: {2}
Word: 'cherry', Lines: {1}
Word: 'cute', Lines: {6}
Word: 'dog', Lines: {2, 6}
Word: 'file', Lines: {7}
Word: 'for', Lines: {8}
Word: 'fox', Lines: {5}
Word: 'from', Lines: {4}
Word: 'fruits', Lines: {9}
Word: 'goodbye', Lines: {10}
Word: 'hello', Lines: {4, 10}
Word: 'is', Lines: {3, 6, 7}
Word: 'jumps', Lines: {5}
Word: 'language', Lines: {3}
Word: 'lazy', Lines: {6}
Word: 'programming', Lines: {3}
Word: 'purposes', Lines: {8}
Word: 'quick', Lines: {5}
Word: 'random', Lines: {8}
Word: 'rust', Lines: {3, 4}
Word: 'test', Lines: {7}
Word: 'testing', Lines: {8}
Word: 'the', Lines: {5, 6}
Word: 'this', Lines: {7}
Word: 'world', Lines: {4}

In conclusion, this simple challenge taught us quite a few different concepts in Rust. I hope that you find the content useful. Also, if you’re interested in more posts like this, where we try to learn new languages through some sort of coding challenges, then please consider subscribing to my newsletter. I expect as more time we spend with this language, we will then gravitate toward more complex problems where more fundamental computer science knowledge is required. At that time, we’ll be likely concerned with the problems themselves, rather than dealing with the language syntax. Happy coding and see you soon!

[UPDATED] The code is now available at rust-coding-challenges

5 2 votes

Article Rating

Facebook Tweet LinkedIn

Starting a Rust learning journey with a simple parser9 min read