Tackling the wordcount challenge with Guile Scheme
Table of Contents
The wordcount challenge pits programming languages against each other in a battle of counting words in unicode text. It provides a simple language comparison in the spirit of the benchmarks game: split a text and count each word's frequency, then print the list sorted by frequency in decreasing order. Ties are printed in alphabetical order.
Together with Linus Björnstam I set out to tackle it with Guile. We produced two versions:
- One written for maximum clarity and
- one written for speed.
We offered both in a pull-request.
Reference for clarity
#!/usr/bin/env bash # -*- mode: scheme -*- exec guile $0 # !# ;;; wordcount.scm --- meeting the challenge with Guile, clear version ;; Copyright (C) 2017 ;; Arne Babenhauserheide <arne_bab@web.de> and Linus Björnstam ;; License: LGPL ;;; Commentary: ;; The wordcount challenge pits programming languages against each ;; other in a battle of counting words in unicode text. ;; ;; See https://github.com/juditacs/wordcount ;; ;; wordcount.scm holds an alternative version written for ;; speed but not clarity. ;;; Code: (import (ice-9 rdelim) (ice-9 hash-table)) (define (count-or-unicode>? a b) (let ((Na (cdr a))(Nb (cdr b))) (cond [(> Na Nb) #t] [(= Na Nb) (string<? (car a) (car b))] [else #f]))) (define (count-words port) (define delimiters " \n\t") (define count (make-hash-table #e1e6)) (define (add-word next) (hash-set! count next (1+ (hash-ref count next 0)))) (let loop ((next (read-delimited delimiters port))) (unless (eof-object? next) (unless (string-null? next) (add-word next)) (loop (read-delimited delimiters port)))) (sort! ;; destructive sort to save memory (hash-map->list cons count) count-or-unicode>?)) (for-each (lambda (x) (format #t "~a\t~a\n" (car x) (cdr x))) (count-words (current-input-port)))
Optimized for speed
;;; wordcount.scm --- meeting the challenge with Guile, fast version ;; Copyright (C) 2017 ;; Arne Babenhauserheide <arne_bab@web.de> and Linus Björnstam ;; License: LGPL ;;; Commentary: ;;; Code: (import (ice-9 rdelim) (ice-9 hash-table)) (define (count-or-unicode>? a b) (let ((Na (cdr a))(Nb (cdr b))) (cond [(> Na Nb) #t] [(= Na Nb) (string<? (car a) (car b))] [else #f]))) (define (skip-whitespace port) (let ([ch (peek-char port)]) (when (and (not (eof-object? ch)) (char-whitespace? ch)) (read-char port) (skip-whitespace port)))) (define (count-words port) (define (read-word port) (skip-whitespace port) (let ((line (read-delimited "\t" port))) (if (eof-object? line) #f (map (λ (x) (string-split x #\space)) (string-split line #\newline))))) (define count (make-hash-table 100000)) (define (add-word next) (unless (string-null? next) (hash-set! count next (1+ (hash-ref count next 0))))) (let loop ((next (read-word port))) (when next (for-each (lambda (wl) (for-each add-word wl)) next) (loop (read-word port)))) (sort! ;; destructive sort to save memory (hash-map->list cons count) count-or-unicode>?)) (for-each (lambda (x) (format #t "~a\t~a\n" (car x) (cdr x))) (count-words (current-input-port)))
Lessons Learned
- Hashmaps are fast.
- Refactoring made this code much easier to understand.
- The performance of Guile could be golfed by factor 2.
- Using skip lists almost halved the required memory, but increased runtime by factor 6. Hashmaps are fast.
- We’re on the level of Python3 in this string-benchmark, a task for which Python is heavily optimized.
- Working together is fun, even if only exchanging a few lines in IRC per hour!
License: cc by-sa