This site is under active development. In the near future, we plan to provide a means for users of this site to give feedback, at which point we will welcome any and all comments and suggestions for improvement. For the moment, however, the website is made available purely for testing purposes.
The primary purpose of this site is to promote the use of, and research into, the Noongar language. At its core, the site is a database containing Noongar words and their glosses, collated from several publicly available wordlists. In this respect, it is similar both in spirit and in content to Bindon and Chadwick's A Nyoongar wordlist: from the south-west of Western Australia; however, the words in this database have been transcribed from the original sources, not from Bindon and Chadwick's compilation. So far, this database contains words from the following sources:
The main innovation that this site offers is a search utility with a customisable "fuzziness" feature. The need for a so-called "fuzzy search" arises because historical Noongar wordlists show great variation in spelling. This is a natural consequence of the facts that
Fuzzy searches are, of course, not new; however, the particular algorithm used on this website has been developed specifically for the spelling variations commonly found in Noongar wordlists. For example, the sound ⟨d͡ʒ⟩ (like the j in jump), which occurs frequently in Noongar, is variously spelt ch, d, dj, g, j, tj, and tch. Similarly, there is wide variation in the letters used to denote vowel sounds. The search has therefore been designed so that jet and chitt count as a much closer match than, say, jet and met. For the curious, a full description of the algorithm is given below.
I am neither a fluent speaker of Noongar nor a trained linguist. Although I have taken care to enter the wordlists into the database accurately, I will no doubt have made mistakes along the way. If you notice any transcription errors, they can be reported by clicking on the icon next to the word in question. Your help and feedback are greatly appreciated!
The algorithm consists of two stages: (1) tokenisation, and (2) Levenshtein distance calculation. There is also an initial "cleaning" stage in which (Unicode) strings are converted to lower-case (ASCII) characters and any diacritic marks are removed (e.g. Gurdăk becomes gurdak).
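For readers who want something concrete, the cleaning stage could be implemented roughly as in the Python sketch below. This is illustrative only; the function name and details are not the site's actual code.

```python
import unicodedata

def clean(word):
    """Lower-case a word and strip diacritics, e.g. 'Gurdăk' -> 'gurdak'."""
    # Decompose accented characters into base letter + combining mark,
    # then drop the combining marks and any other non-ASCII characters.
    decomposed = unicodedata.normalize('NFKD', word.lower())
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(clean('Gurdăk'))  # gurdak
```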
Tokenisation means splitting a word up into smaller units called tokens. A token is either a trigraph (group of three letters) or a digraph (group of two letters) drawn from a pre-defined list (see below) that has been tailored to the spelling variations of similar sounds found across the wordlists, or else a single letter. For example, the word djook is tokenised as ['dj', 'oo', 'k'], that is, two digraphs and one single letter.
The algorithm for tokenising a string of characters is as follows. Starting at the beginning of the string, the next three characters, then the next two, are checked against the pre-defined lists of allowable trigraphs and digraphs (in that order). If a match is found, the matching group of characters becomes the first token of the string; if no match is found, the single next character becomes the first token. The token is then appended to the list of tokens for the string in question and removed from the start of the string. The process is repeated, each time adding the matching trigraph, digraph, or single letter to the list of tokens, until the end of the string is reached.
The above is undoubtedly made clearer by walking through an example. Suppose the string to be tokenised is Bărdănitch. The initial cleaning stage first converts this to bardanitch. Next, the following steps are taken:
Only alphabetic characters are considered for tokens, but non-alphabetic characters (e.g. spaces, hyphens) do play a role. If a non-alphabetic character occurs in a word, digraphs and trigraphs are prohibited from spanning that character. For example, the word bun-gal is tokenised as ['b', 'u', 'n', 'g', 'a', 'l'], and not as ['b', 'u', 'ng', 'a', 'l'], even though ng is in the list of allowed digraphs.
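A rough Python sketch of this greedy tokeniser is given below. The token inventories shown are illustrative only; the full trigraph and digraph lists used by the site are longer and tailored to the wordlists.

```python
import re

# Illustrative inventories only; the real lists are longer.
TRIGRAPHS = {'tch'}
DIGRAPHS = {'dj', 'tj', 'ch', 'ng', 'ny', 'oo', 'rd', 'tt'}

def tokenise(word):
    """Split a cleaned word into trigraph/digraph/single-letter tokens."""
    tokens = []
    # Digraphs and trigraphs may not span non-alphabetic characters
    # (spaces, hyphens), so each alphabetic run is tokenised separately.
    for chunk in re.findall(r'[a-z]+', word):
        i = 0
        while i < len(chunk):
            if chunk[i:i + 3] in TRIGRAPHS:    # longest match first
                tokens.append(chunk[i:i + 3])
                i += 3
            elif chunk[i:i + 2] in DIGRAPHS:
                tokens.append(chunk[i:i + 2])
                i += 2
            else:
                tokens.append(chunk[i])
                i += 1
    return tokens

print(tokenise('djook'))    # ['dj', 'oo', 'k']
print(tokenise('bun-gal'))  # ['b', 'u', 'n', 'g', 'a', 'l']
```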
The Levenshtein distance is a method for measuring how different two strings are. It does this by finding the most efficient way to convert the first string into the second using character insertions, deletions, and substitutions. Each time one of these three operations is used, a penalty is incurred; essentially, the Levenshtein distance is the total penalty accrued. See the Wikipedia article on Levenshtein distance for more detail.
The algorithm used in this website differs from the standard Levenshtein distance in two important ways. First, it is calculated on the basis of tokens instead of characters. Thus, the traditional Levenshtein distance between tch and dj is 3 (two character substitutions and one deletion), whereas here the distance would be 1 (a single token substitution). Second, not all substitutions are equally penalised. Pairs of tokens that are more likely to represent similar underlying sounds are penalised less than those that typically represent very different sounds.
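The sketch below shows the general shape of this token-level, weighted Levenshtein calculation. The penalty values here are placeholders chosen for illustration; the actual substitution penalties are given in the tables below.

```python
MAX_PENALTY = 10  # also the cost of an insertion or deletion

# Placeholder substitution penalties for token pairs that often
# represent the same underlying sound (illustrative values only).
PENALTIES = {
    frozenset(('dj', 'tch')): 1,
    frozenset(('dj', 'j')): 1,
    frozenset(('j', 'ch')): 1,
    frozenset(('e', 'i')): 2,
    frozenset(('t', 'tt')): 1,
}

def substitution_cost(a, b):
    if a == b:
        return 0
    return PENALTIES.get(frozenset((a, b)), MAX_PENALTY)

def token_distance(source, target):
    """Weighted Levenshtein distance over token lists, not characters."""
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * MAX_PENALTY          # deletions
    for j in range(1, cols):
        d[0][j] = j * MAX_PENALTY          # insertions
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + MAX_PENALTY,  # delete a source token
                d[i][j - 1] + MAX_PENALTY,  # insert a target token
                d[i - 1][j - 1]
                + substitution_cost(source[i - 1], target[j - 1]),
            )
    return d[-1][-1]
```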
The tables of penalties are given below. The grid cells represent the "penalty" of substituting the tokens given in the corresponding row and column headers, and are coloured accordingly. Any pair of tokens not represented in one of the following tables incurs the maximum penalty of 10. (Note that 10 is also the penalty assigned to insertions and deletions.)
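As a worked illustration, combining the clean, tokenise, and token_distance sketches above (with their placeholder token lists and penalties), the jet/chitt/met comparison mentioned earlier comes out as follows; the exact numbers depend on the penalty tables actually used.

```python
jet = tokenise(clean('jet'))      # ['j', 'e', 't']
chitt = tokenise(clean('chitt'))  # ['ch', 'i', 'tt']
met = tokenise(clean('met'))      # ['m', 'e', 't']

print(token_distance(jet, chitt))  # 4: three low-penalty substitutions
print(token_distance(jet, met))    # 10: j and m get the maximum penalty
```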
The algorithm described above has drawbacks: