Website version: v0.5 (beta)

This site is under active development. In the near future we plan to provide a means for users of this site to give feedback, at which point we will welcome any and all comments and suggestions for improvement. For the moment, however, the website is made available for testing purposes only.

About this site

The primary purpose of this site is to promote the use of, and research into, the Noongar language. At its core, the site is a database containing Noongar words and their glosses, collated from several publicly available wordlists. In this respect, it is similar both in spirit and in content to the work of Bindon and Chadwick, A Nyoongar wordlist: from the south-west of Western Australia; however, the words in this database have been transcribed from the original sources, and not from Bindon and Chadwick's compilation. So far, this database contains words from the following sources:

Why use this site?

The main innovation that this site offers is a search utility with a customisable "fuzziness" feature. The need for a so-called "fuzzy search" arises because historical Noongar wordlists show great variation in spelling. This is a natural consequence of the facts that

  1. Noongar was not a written language,
  2. the people compiling the wordlists (who were necessarily coming up with their own bespoke spelling systems) were by and large not trained linguists, and
  3. there is dialectal variation across the South West of Western Australia, where Noongar was spoken.
Even though there is now a widely adopted orthographic standard for Noongar as it is used today, the problem of finding a particular word in a historical document persists.

Fuzzy searches are, of course, not new; however, the particular algorithm used in this website has been developed specifically for common spelling variations found in Noongar. For example, the sound ⟨d͡ʒ⟩ (like the j in jump), which occurs frequently in Noongar, is variously spelt ch, d, dj, g, j, tj, and tch. Similarly, there is a wide variation in which letters were used to denote vowel sounds. Thus, the search has been designed so that jet and chitt would count as a much closer match than, say, jet and met. For the curious, a full description of the algorithm is given below.

Caveats and disclaimers

I am neither a fluent speaker of Noongar nor a trained linguist. Although I have taken care to enter the wordlists into the database accurately, I will no doubt have made some mistakes along the way. If you notice any transcription errors, you can report them by clicking on the icon next to the word in question. Your help and feedback are greatly appreciated!


Details of the search algorithm

The algorithm consists of two stages: (1) tokenisation, and (2) Levenshtein distance calculation. There is also an initial "cleaning" stage in which (Unicode) strings are converted to lower-case (ASCII) characters and any diacritic marks are removed (e.g. Gurdăk becomes gurdak).
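
As a rough illustration, the cleaning stage could be implemented along the following lines (a minimal Python sketch under my own assumptions, not the site's actual code):

    import unicodedata

    def clean(word: str) -> str:
        """Lower-case a word and strip diacritic marks (e.g. 'Gurdăk' -> 'gurdak')."""
        # Decompose accented characters into a base character plus combining marks,
        # then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", word.lower())
        return "".join(c for c in decomposed if not unicodedata.combining(c))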

Tokenisation

Tokenisation means splitting up a word into smaller units called tokens. A token is either a trigraph (group of three letters), a digraph (group of two letters), or a single letter drawn from a pre-defined list (see below) that has been tailored specifically for the spelling variations of similar sounds found across the wordlists. For example, the word djook is tokenised as ['dj', 'oo', 'k'], that is, two digraphs and one single letter.

The algorithm for tokenising a string of characters is as follows. Starting at the beginning of the string, the next three characters, and then the next two, are checked against the pre-defined lists of allowable trigraphs and digraphs (in that order). If a match is found, the matching group of characters becomes the first token of the string; if no match is found, the single character at the start becomes the first token. The token is then appended to the list of tokens for the string in question and removed from the start of the string. The process is repeated, each time adding the matching trigraph, digraph, or single letter to the list of tokens, until the end of the string is reached.
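
A greedy tokeniser along these lines might look as follows (an illustrative Python sketch; the token lists shown are abbreviated examples of my own, not the site's full lists):

    # Abbreviated, hypothetical token lists; the site's full lists are longer.
    TRIGRAPHS = {"tch"}
    DIGRAPHS = {"aa", "ch", "dj", "ng", "oo", "rd", "tj"}

    def tokenise(word: str) -> list[str]:
        """Greedily split a cleaned word into trigraph, digraph, or single-letter tokens."""
        tokens = []
        i = 0
        while i < len(word):
            if not word[i].isalpha():
                i += 1                            # spaces, hyphens etc. are skipped
                continue
            if word[i:i + 3] in TRIGRAPHS:        # try a trigraph first...
                tokens.append(word[i:i + 3])
                i += 3
            elif word[i:i + 2] in DIGRAPHS:       # ...then a digraph...
                tokens.append(word[i:i + 2])
                i += 2
            else:                                 # ...otherwise a single letter
                tokens.append(word[i])
                i += 1
        return tokens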

The above is undoubtedly made clearer by walking through an example. Suppose the string to be tokenised is Bărdănitch. The initial cleaning stage first converts this to bardanitch. Next, the following steps are taken:

...and so on. The final list of tokens in this case would be ['b', 'a', 'rd', 'a', 'n', 'i', 'tch'].

Only alphabetic characters are considered for tokens, but non-alphabetic characters (e.g. spaces, hyphens) do play a role. If a non-alphabetic character occurs in a word, digraphs and trigraphs are prohibited from spanning that character. For example, the word bun-gal is tokenised as ['b', 'u', 'n', 'g', 'a', 'l'], and not as ['b', 'u', 'ng', 'a', 'l'], even though ng is in the list of allowed digraphs.
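
With the hypothetical tokenise sketch above, this behaviour falls out naturally: a chunk containing a hyphen or space can never match an entry in the (purely alphabetic) trigraph and digraph lists, so multi-letter tokens cannot span such characters.

    >>> tokenise("djook")
    ['dj', 'oo', 'k']
    >>> tokenise("bun-gal")
    ['b', 'u', 'n', 'g', 'a', 'l']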

Levenshtein distance calculation

The Levenshtein distance is a method for measuring how different two strings are. It does this by finding the most efficient way to convert the first string to the second string using character insertions, deletions, and substitutions. Each time one of these three operations must be used, a penalty is incurred; essentially, the Levenshtein distance is equal to the total penalty points accrued. See the Wikipedia article on Levenshtein distance for more detail.

The algorithm used in this website differs from the standard Levenshtein distance in two important ways. First, it is calculated on the basis of tokens instead of characters. Thus, the traditional Levenshtein distance between tch and dj is 3 (two character substitutions and one deletion), whereas here, the distance would be 1 (a single token substitution). Second, not all substitutions are equally penalised. Pairs of tokens that are more likely to represent similar underlying sounds are penalised less than those that typically represent very different sounds.
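
A minimal sketch of such a token-based, weighted Levenshtein distance is given below (illustrative Python only; the PENALTIES dictionary is a hypothetical, abbreviated stand-in for the tables that follow, and insertions, deletions, and unlisted substitutions all cost the maximum penalty of 10):

    # Hypothetical, abbreviated penalty lookup; the full values come from the tables below.
    PENALTIES = {("dj", "tch"): 0, ("a", "aa"): 1, ("e", "ee"): 3}
    MAX_PENALTY = 10

    def substitution_cost(t1: str, t2: str) -> int:
        if t1 == t2:
            return 0
        return PENALTIES.get((t1, t2), PENALTIES.get((t2, t1), MAX_PENALTY))

    def weighted_distance(a: list[str], b: list[str]) -> int:
        """Dynamic-programming Levenshtein distance over tokens with weighted costs."""
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            d[i][0] = i * MAX_PENALTY                     # deletions
        for j in range(1, len(b) + 1):
            d[0][j] = j * MAX_PENALTY                     # insertions
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                d[i][j] = min(
                    d[i - 1][j] + MAX_PENALTY,            # delete a token
                    d[i][j - 1] + MAX_PENALTY,            # insert a token
                    d[i - 1][j - 1] + substitution_cost(a[i - 1], b[j - 1]),
                )
        return d[len(a)][len(b)]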

The tables of penalties are given below. Each cell gives the penalty for substituting the token in its row header with the token in its column header, and is coloured accordingly. Any pair of tokens not represented in one of the following tables incurs the maximum penalty of 10. (Note that 10 is also the penalty assigned to insertions and deletions.)

Legend

Penalties range from 0 (most similar) to 10 (the maximum penalty).

Vowels

      a   aa  ah  e   ee  i   o   oo  u   y   ya
a     0   1   1   5   5   5   5   5   3   5   2
aa    1   0   0   5   5   5   5   5   4   5   2
ah    1   0   0   5   5   5   5   5   4   5   2
e     5   5   5   0   3   5   5   5   5   5   5
ee    5   5   5   3   0   1   5   5   5   2   3
i     5   5   5   5   1   0   5   5   5   2   3
o     5   5   5   5   5   5   0   1   5   5   5
oo    5   5   5   5   5   5   1   0   2   5   5
u     3   4   4   5   5   5   5   2   0   4   4
y     5   5   5   5   2   2   5   5   4   0   5
ya    2   2   2   5   3   3   5   5   4   5   0

Nasals

      m   mm  n   ng  nn  ny  rn
m     0   0   8   6   8   6   9
mm    0   0   8   6   8   6   9
n     8   8   0   2   0   1   2
ng    6   6   2   0   2   1   2
nn    8   8   0   2   0   1   2
ny    6   6   1   1   1   0   2
rn    9   9   2   2   2   2   0

Liquids

      l   ll  ly  rl  r   rh  rr
l     0   0   1   1   10  10  10
ll    0   0   1   1   10  10  10
ly    1   1   0   5   10  10  10
rl    1   1   5   0   10  10  10
r     10  10  10  10  0   1   1
rh    10  10  10  10  1   0   1
rr    10  10  10  10  1   1   0

Bilabial stops

      b   bb  bw  p   pb  pp  pw
b     0   0   3   0   0   0   3
bb    0   0   3   0   0   0   3
bw    3   3   0   3   0   3   0
p     0   0   3   0   0   0   3
pb    0   0   0   0   0   0   3
pp    0   0   3   0   0   0   3
pw    3   3   0   3   3   3   0

Dental stops

      ch  d   dd  dg  dj  dt  dw  j   t   tch tj  tt  tw
ch    0   6   6   1   0   6   2   0   6   0   0   6   2
d     6   0   0   9   8   0   3   8   0   8   8   0   3
dd    6   0   0   9   8   0   3   8   0   8   8   0   3
dg    1   9   9   0   1   9   9   1   9   1   1   9   9
dj    0   8   8   1   0   8   8   0   8   0   0   8   8
dt    6   0   0   9   8   0   3   8   0   8   8   0   3
dw    0   3   3   9   8   3   0   10  8   0   0   8   8
j     0   8   8   1   0   8   10  0   8   0   0   8   8
t     6   0   0   9   8   0   8   8   0   8   8   0   3
tch   0   8   8   1   0   8   0   0   8   0   0   8   8
tj    0   8   8   1   0   8   0   0   8   0   0   8   8
tt    6   0   0   9   8   0   8   8   0   8   8   0   3
tw    2   3   3   9   8   3   8   8   3   8   8   3   0

Velar stops

      c   ck  g   gw  k   kk  kw  qu
c     0   0   0   3   0   0   3   3
ck    0   0   0   3   0   0   3   3
g     0   0   0   3   0   0   3   3
gw    3   3   3   0   3   3   0   0
k     0   0   0   3   0   0   3   3
kk    0   0   0   3   0   0   3   3
kw    3   3   3   0   3   3   0   0
qu    3   3   3   0   3   3   0   0

Known issues

The algorithm described above has drawbacks:

  1. The tokenisation is not contextual. For example, Bussell (1930) writes the Noongar word for stomach (koboorl) as "Cobble", which is tokenised as ['c', 'o', 'bb', 'l', 'e']. In reality, the Noongar word ends with the liquid 'l', just as an English speaker would pronounce "cobble" if they thought they were reading an English word rather than a Noongar one. However, the current algorithm does not take the context of the final 'e' into account when producing the tokenisation. Consequently, Cobble and koboorl are more "distant" than they would be if context were taken into account.
  2. The penalties are subjective. The penalties in the above tables are based on an entirely subjective judgement of whether two tokens could plausibly "sound similar". There is probably an optimal set of penalties that minimises the distance between different spellings of the same word in a given set of wordlists, but the improvement that would be gained by finding those optimal penalties is expected to be marginal compared to the initial gain of using a flexible penalty system in the first place. Having said that, I welcome any and all suggestions on how to improve this algorithm.