Tackling the wordcount challenge with Guile Scheme

Table of Contents

The wordcount challenge pits programming languages against each other in a battle of counting words in unicode text. It provides a simple language comparison in the spirit of the benchmarks game: split a text and count each word's frequency, then print the list sorted by frequency in decreasing order. Ties are printed in alphabetical order.

Together with Linus Björnstam I set out to tackle it with Guile. We produced two versions:

We offered both in a pull-request.

Reference for clarity

#!/usr/bin/env bash
# -*- mode: scheme -*-
exec guile $0
# !#

;;; wordcount.scm --- meeting the challenge with Guile, clear version

;; Copyright (C) 2017
;; Arne Babenhauserheide <arne_bab@web.de> and Linus Björnstam
;; License: LGPL

;;; Commentary:

;; The wordcount challenge pits programming languages against each
;; other in a battle of counting words in unicode text.
;;
;; See https://github.com/juditacs/wordcount
;;
;; wordcount.scm holds an alternative version written for
;; speed but not clarity.

;;; Code:

(import (ice-9 rdelim)
        (ice-9 hash-table))

(define (count-or-unicode>? a b)
  (let ((Na (cdr a))(Nb (cdr b)))
    (cond
     [(> Na Nb) #t]
     [(= Na Nb) (string<? (car a) (car b))]
     [else #f])))

(define (count-words port)
  (define delimiters " \n\t")
  (define count (make-hash-table #e1e6))
  (define (add-word next)
    (hash-set! count next
               (1+ (hash-ref count next 0))))
  (let loop ((next (read-delimited delimiters port)))
    (unless (eof-object? next)
      (unless (string-null? next)
        (add-word next))
      (loop (read-delimited delimiters port))))
  (sort! ;; destructive sort to save memory
   (hash-map->list cons count)
   count-or-unicode>?))

(for-each
 (lambda (x) (format #t "~a\t~a\n" (car x) (cdr x)))
 (count-words (current-input-port)))

Optimized for speed

;;; wordcount.scm --- meeting the challenge with Guile, fast version

;; Copyright (C) 2017
;; Arne Babenhauserheide <arne_bab@web.de> and Linus Björnstam
;; License: LGPL

;;; Commentary:

;;; Code:

(import (ice-9 rdelim)
        (ice-9 hash-table))

(define (count-or-unicode>? a b)
  (let ((Na (cdr a))(Nb (cdr b)))
    (cond
     [(> Na Nb) #t]
     [(= Na Nb) (string<? (car a) (car b))]
     [else #f])))

(define (skip-whitespace port)
  (let ([ch (peek-char port)])
    (when (and (not (eof-object? ch)) (char-whitespace? ch))
      (read-char port)
      (skip-whitespace port))))

(define (count-words port)
  (define (read-word port)
    (skip-whitespace port)
    (let ((line (read-delimited "\t" port)))
      (if (eof-object? line)
          #f
          (map (λ (x) (string-split x #\space))
                (string-split line #\newline)))))
  (define count (make-hash-table 100000))
  (define (add-word next)
    (unless (string-null? next)
      (hash-set! count next
                 (1+ (hash-ref count next 0)))))
  (let loop ((next (read-word port)))
    (when next
      (for-each (lambda (wl) (for-each add-word wl)) next)
      (loop (read-word port))))
  (sort! ;; destructive sort to save memory
   (hash-map->list cons count)
   count-or-unicode>?))

(for-each
 (lambda (x) (format #t "~a\t~a\n" (car x) (cdr x)))
 (count-words (current-input-port)))

Lessons Learned

  • Hashmaps are fast.
  • Refactoring made this code much easier to understand.
  • The performance of Guile could be golfed by factor 2.
  • Using skip lists almost halved the required memory, but increased runtime by factor 6. Hashmaps are fast.
  • We’re on the level of Python3 in this string-benchmark, a task for which Python is heavily optimized.
  • Working together is fun, even if only exchanging a few lines in IRC per hour!

With Guise and Guile

License: cc by-sa

Date: 2017-05-28 Sun 00:00

Author: Arne Babenhauserheide

Created: 2020-05-06 Wed 09:35

Emacs 24.5.1 (Org mode 8.2.6)

Validate