About a year ago, I wrote about some vocabulary statistics I’d put together around various texts. This included a subcorpus of Plato I wanted to put together for the Greek Learner Texts Project. Based on the “core” works list I’d put together for https://vocab.perseus.org, this included:
At the time, I used Giuseppe Celano’s experimental lemmatization as the basis for the vocabulary counts.
In the intervening period, I went further with Crito and restructured the citation scheme to be based on units of dialogue and sentences. You can see HTML generated from the result at:
https://jtauber.github.io/plato-texts/
but the underlying data is available at https://github.com/jtauber/plato-texts/blob/master/text/crito.txt.
I also took a first pass at aligning an English translation at the sentence level and the raw data for that is available at https://github.com/jtauber/plato-texts/blob/master/analysis/crito_aligned.txt.
My plan was always to return to the other four texts and last weekend I started on that, freshly bringing in the texts (in both Greek and English) from Perseus, the Diorisis corpus tagging, and the treebanks from AGLDT (Euthyphro) and Vanessa Gorman (the Apology).
I also added:
and may add others if they seem appropriate for the Greek Learner Texts Project.
The first thing I did was produce a stripped-down tokenized version of the Greek texts from Perseus with minimal markdown. In this process, I found a small number of issues with the Perseus XML which I’ll submit corrections for shortly (mostly some stray gammas).
I then wrote a script to extract similar tokens from Diorisis for alignment. As I’ve written about before, the Diorisis corpus made the odd choice to use betacode for the tokens so I had to do a conversion. Then the real fun began.
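The conversion itself is mostly a character-by-character mapping. Here is a minimal sketch of the idea, not the script I actually used: it assumes lowercase Beta Code with diacritics written after the letter they modify, leaves the `*` uppercase convention untouched, and the name `beta_to_unicode` is purely illustrative.

```python
import unicodedata

# Beta Code letters and their Greek equivalents (lowercase only in this sketch).
LETTERS = dict(zip("abgdezhqiklmncoprstufxyw", "αβγδεζηθικλμνξοπρστυφχψω"))

# Beta Code diacritic symbols mapped to Unicode combining characters.
DIACRITICS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex (perispomeni)
    "|": "\u0345",   # iota subscript
    "+": "\u0308",   # diaeresis
}


def beta_to_unicode(token):
    """Convert one lowercase Beta Code token to precomposed Unicode Greek."""
    text = "".join(LETTERS.get(c) or DIACRITICS.get(c) or c for c in token)
    # A sigma at the end of a token is written as final sigma.
    if text.endswith("σ"):
        text = text[:-1] + "ς"
    # NFC composes base letter + combining marks into single code points.
    return unicodedata.normalize("NFC", text)


print(beta_to_unicode("lo/gos"))
print(beta_to_unicode("a)/nqrwpos"))
```

Normalising to NFC at the end matters for alignment: tokens only compare equal if both sides use the same normalisation form.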
Firstly, the Perseus text, based on the Burnet edition, has various editorial markup like `<add>`, `<del>`, `<corr>`, `<sic>`. I quickly discovered that the Diorisis text drops the `<del>` and `<sic>` elements. That’s fine although I might seek the advice of people more familiar with Burnet and the text scholarship of Plato as to what the Greek Learner Texts edition should do.
Secondly, in Phaedo at least, named entities are marked up in the Perseus TEI XML. People and places are all appropriately tagged. I don’t happen to need that right now although it’s potentially useful information. But the Diorisis corpus drops those elements. I don’t just mean it drops the tags, it drops the elements. So if the sentence was `<persName>John</persName> loves <persName>Mary</persName>`, Diorisis would just give the sentence as `loves` (at least in Phaedo). Fairly easy to work around for alignment purposes, though.
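To make the difference between dropping tags and dropping elements concrete, here is a small sketch using Python’s standard `xml.etree.ElementTree`. The wrapping `<s>` element and the function names are my own for illustration, and `drop_elements` only handles direct children, which is enough for this flat example.

```python
import xml.etree.ElementTree as ET

SENTENCE = "<s><persName>John</persName> loves <persName>Mary</persName></s>"


def strip_tags(xml):
    """Discard the markup but keep the text inside every element."""
    return "".join(ET.fromstring(xml).itertext())


def drop_elements(xml, tag):
    """Discard direct child elements named `tag` together with their text."""
    root = ET.fromstring(xml)
    parts = [root.text or ""]
    for child in root:
        if child.tag != tag:
            parts.append("".join(child.itertext()))
        # The tail is the text that follows the element, so it is always kept.
        parts.append(child.tail or "")
    return "".join(parts)


print(strip_tags(SENTENCE).split())
print(drop_elements(SENTENCE, "persName").split())
```

For alignment purposes the workaround amounts to running the second function over the Perseus side too, so both token streams are missing the same names.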
The more time-consuming aspect is the odd way Diorisis handles quotations. It seems to repeat the tokens of each quotation, once in context and then once in a sentence of its own. Except sometimes the repetition is incorporated in an unrelated sentence.
For example, the Homeric quotation in 408a (Republic Book 3) is analyzed inline but then also repeated in another sentence where it’s part of the first sentence of 409a (“δικαστὴς δέ γε…”) which, unless I’m missing something considerable, is just completely wrong.
I’m manually correcting all this (it comes up as an alignment mismatch and I’m going in and editing the Diorisis XML to remove the duplication). But even without the bad sentence merges, this also means that the vocabulary counts I’ve previously generated from Diorisis (and in `vocabulary-tools`) may have doubled up on any words appearing in quotations.
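The kind of alignment check that surfaces these duplications can be sketched with the standard library’s `difflib.SequenceMatcher`. This is only an illustration under the assumption that both texts have already been reduced to comparable token lists; the function name and the toy tokens are hypothetical and it is not the alignment script I am actually using.

```python
from difflib import SequenceMatcher


def alignment_mismatches(reference, other):
    """Return the spans where two token lists disagree.

    Each result is (operation, reference_tokens, other_tokens), where the
    operation is "insert", "delete", or "replace" relative to the reference.
    """
    matcher = SequenceMatcher(a=reference, b=other, autojunk=False)
    return [
        (tag, reference[i1:i2], other[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Toy data: the second list repeats the "quotation" tokens b and c.
perseus_tokens = ["a", "b", "c", "d"]
diorisis_tokens = ["a", "b", "c", "b", "c", "d"]

for mismatch in alignment_mismatches(perseus_tokens, diorisis_tokens):
    print(mismatch)
```

A repeated quotation shows up as an `insert` on the Diorisis side, which is the cue to go and inspect the XML by hand.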
So there’s lots more to do with Plato, not least of all the manual curation of lemmatization. But the goal, like that of the Greek Learner Texts Project as a whole, is to have a set of openly-licensed, high-quality, lemmatized texts for extensive reading by language learners.
Collaboration always welcome. Just ping me on the Greek Learner Texts Project Slack workspace.
Back in Lexical Dispersion in the Greek New Testament Via Gries’s DP I wrote:
My sense is that dispersion might be a useful input to deciding what vocabulary to learn. For example διδαχή or σκότος might be better to learn before ἀρνίον because, even though they all have the same frequency, you are more likely to encounter διδαχή or σκότος in a random book or chapter.
Egbert’s plenary (available here after free signup) encouraged me to try a very simple metric instead of frequency: what proportion of text units in the corpus does the word appear in? Egbert emphasises using linguistically meaningful units of text (definitely not fixed-length windows) and pericopes seem perfect for this. There are dispersion measures that allow for varying sizes of text unit (like Gries’s DP) but it seemed to me that just seeing what proportion of pericopes the item appears in might be a good measure of the importance to learn (instead of frequency).
This downplays words that might get repeated a lot in just a handful of pericopes and favours those that appear in lots of pericopes even if only one or two times in that pericope. Intuitively this makes sense: a word that appears 10 times in one passage in the New Testament (and nowhere else) isn’t as generally useful to learn as a word that appears once in ten different passages. Overall corpus frequency can therefore be misleading because it treats these two cases as the same.
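The metric is simple enough to sketch in a few lines. This is an illustration of the idea rather than the code in the gist: it assumes each pericope is already available as a list of lemmas, and `pericope_dispersion` is a name made up for the example.

```python
from collections import Counter


def pericope_dispersion(pericopes):
    """Map each lemma to the proportion of pericopes it appears in."""
    containing = Counter()
    for lemmas in pericopes:
        # set() so a lemma counts once per pericope however often it repeats.
        containing.update(set(lemmas))
    return {lemma: n / len(pericopes) for lemma, n in containing.items()}


# Toy corpus of ten pericopes: A occurs ten times in the first one only,
# B occurs exactly once in every pericope. Both have a frequency of 10.
pericopes = [["A"] * 10 + ["B"]] + [["B"] for _ in range(9)]

frequency = Counter(lemma for p in pericopes for lemma in p)
dispersion = pericope_dispersion(pericopes)
print(frequency["A"], frequency["B"])
print(dispersion["A"], dispersion["B"])
```

Sorting lemmas by this value in descending order gives the kind of learning order described above.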
With `vocabulary-tools` it was trivial to produce a list of all the New Testament lemmas sorted by pericope dispersion.
This gist contains the code and the list:
https://gist.github.com/jtauber/fc4b0476a4c4a94d7cb01d068161892e
Eyeballing the resultant list, it seems a very promising ordering although I welcome comments on anything interesting people notice.
Next steps are:
For this first chart, I haven’t just shown the GNT 100% and 80% but also the 98%, 95%, and 90% levels. The chart shows, assuming you’ve learned a certain % of GNT lemmas, how many tokens in the works of Plato are from those lemmas plotted against the length of the Plato work. All the plots here are log-log because of the Zipfian nature of word distributions (although it is more important in subsequent plots than this one).
As mentioned in the previous post, I was actually surprised at how little coverage drops off as a function of the length of the Plato work. A 100,000 token work has very similar token coverage to a 5,000 token work.
Visually this can be seen in how horizontal the best-fit lines are above.
However, when it comes to lemma coverage rather than token coverage, the story is very different:
The drop-off above as the Plato work gets longer is quite dramatic (especially when you consider this is a log-log plot). The points fit quite well to a line, though, indicating how Zipfian the distribution is. This demonstrates the clear relationship between the length of the work and how many lemmas you’re likely familiar with. The longer a work is, the more distinct lemmas it will use, although they tend to be low frequency within the work (hence how horizontal the lines in the first chart are).
Notice there are some outliers—some works that seem to have higher coverage than their length would suggest given the best-fit line. I’ve called out one here, showing just the GNT 80% points and best-fit line (although it’s an outlier on the others too):
This suggests that this work might be, in some sense, easier for a GNT reader to read compared with other works of Plato. It suggests that perhaps the vocabulary of that particular work is closer to that of the GNT. The data was all there in the previous post but it’s a lot easier to spot the outliers graphically.
The work indicated above is Parmenides. I started to wonder what it was about that work that made it more “GNT like”.
Then I took a step back because I realised there may be a confounding factor here. The statement “this work might be easier for a GNT reader to read compared with other works of Plato” stands but note this might not be a property of any GNT/Parmenides shared vocabulary but rather just the word distribution in Parmenides itself. In other words, Parmenides might just be easier compared with other works of Plato and that might have nothing to do with any vocabulary similarity to the GNT.
So I decided to just plot the token-to-lemma counts in the works of Plato. This doesn’t involve the GNT at all, just how many tokens each work in Plato has versus how many unique lemmas that work has.
Here is the result with Parmenides called out:
In other words, a large part (and maybe all) of why Parmenides stands off the line in the coverage after GNT is because it simply has fewer lemmas for its overall token count. Its vocabulary is just smaller for its length.
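One way to put a number on “stands off the line” is to fit a straight line in log-log space and look at each work’s residual. The sketch below is only illustrative: it uses a handful of the token and lemma counts from the table in my previous post rather than all 36 works, an ordinary least-squares fit written out by hand, and function names invented for the example.

```python
import math

# (tokens, lemmas) for a subset of works, taken from the previous post's table.
WORKS = {
    "Euthyphro": (5181, 690),
    "Apology": (8745, 1112),
    "Crito": (4172, 712),
    "Phaedo": (21825, 1921),
    "Parmenides": (15155, 805),
    "Symposium": (17461, 1949),
    "Gorgias": (26337, 1938),
    "Republic": (88878, 4846),
    "Laws": (103193, 5227),
}


def loglog_fit(points):
    """Least-squares line through (log10 tokens, log10 lemmas)."""
    xs = [math.log10(tokens) for tokens, _ in points]
    ys = [math.log10(lemmas) for _, lemmas in points]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def residuals(works):
    """Vertical distance of each work from the fitted line, in log10 units."""
    slope, intercept = loglog_fit(list(works.values()))
    return {
        title: math.log10(lemmas) - (intercept + slope * math.log10(tokens))
        for title, (tokens, lemmas) in works.items()
    }


for title, r in sorted(residuals(WORKS).items(), key=lambda item: item[1]):
    print(f"{title:12} {r:+.3f}")
```

A negative residual means fewer distinct lemmas than the line predicts for a work of that length. Because only a subset of works is used here, the fitted slope will not match the charts exactly.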
In fact, visually you can see that most of the deviations of works from the line in the earlier charts map to corresponding deviations in this chart (which, remember, has nothing to do with the GNT).
This is just some visual comparison. There are more quantitative ways of actually measuring how much the deviations in the first three charts can be explained by those in the last chart. But I’ll save that for another post.
The important takeaway for now is that, to the extent some works of Plato might be easier to read after the GNT than others, this probably has little to do with any relationship between their vocabularies, and is more to do with the inherent token-to-lemma ratio of the target work of Plato. It is possible to separate out the effects of each, though, and I will explore that in the future.
Note all the caveats I listed in my previous post about this data. Better lemmatization and richer vocabulary models are still needed.
It turned out to be very simple to do with `vocabulary-tools` and you can now see the script in the repo as examples3.py.
But here let me share the results and give some caveats.
In the table below:
id | title | lemmas | tokens | GNT lemmas | GNT tokens | % GNT lemmas | % GNT tokens |
---|---|---|---|---|---|---|---|
001 | Euthyphro | 690 | 5,181 | 441 | 4,274 | 63.91% | 82.49% |
002 | Apology | 1,112 | 8,745 | 631 | 7,357 | 56.74% | 84.13% |
003 | Crito | 712 | 4,172 | 433 | 3,429 | 60.81% | 82.19% |
004 | Phaedo | 1,921 | 21,825 | 1,000 | 18,033 | 52.06% | 82.63% |
005 | Cratylus | 1,607 | 17,944 | 781 | 14,701 | 48.6% | 81.93% |
006 | Theaetetus | 2,072 | 22,489 | 966 | 17,962 | 46.62% | 79.87% |
007 | Sophist | 1,598 | 16,024 | 788 | 12,932 | 49.31% | 80.7% |
008 | Statesman | 2,013 | 16,953 | 937 | 13,384 | 46.55% | 78.95% |
009 | Parmenides | 805 | 15,155 | 478 | 12,738 | 59.38% | 84.05% |
010 | Philebus | 1,567 | 17,668 | 800 | 14,076 | 51.05% | 79.67% |
011 | Symposium | 1,949 | 17,461 | 961 | 13,806 | 49.31% | 79.07% |
012 | Phaedrus | 2,266 | 16,645 | 1,027 | 12,935 | 45.32% | 77.71% |
013 | Alcibiades 1 | 1,138 | 10,264 | 628 | 8,356 | 55.18% | 81.41% |
014 | Alcibiades 2 | 711 | 4,268 | 420 | 3,449 | 59.07% | 80.81% |
015 | Hipparchus | 431 | 2,256 | 281 | 1,890 | 65.2% | 83.78% |
016 | Lovers | 473 | 2,391 | 284 | 1,923 | 60.04% | 80.43% |
017 | Theages | 627 | 3,485 | 374 | 2,811 | 59.65% | 80.66% |
018 | Charmides | 919 | 8,311 | 534 | 6,875 | 58.11% | 82.72% |
019 | Laches | 960 | 7,674 | 559 | 6,100 | 58.23% | 79.49% |
020 | Lysis | 911 | 6,980 | 524 | 5,729 | 57.52% | 82.08% |
021 | Euthydemus | 1,268 | 12,453 | 686 | 10,015 | 54.1% | 80.42% |
022 | Protagoras | 1,753 | 17,795 | 869 | 14,306 | 49.57% | 80.39% |
023 | Gorgias | 1,938 | 26,337 | 951 | 21,467 | 49.07% | 81.51% |
024 | Meno | 961 | 9,791 | 534 | 8,066 | 55.57% | 82.38% |
025 | Hippias Major | 958 | 8,448 | 528 | 6,730 | 55.11% | 79.66% |
026 | Hippias Minor | 698 | 4,360 | 396 | 3,387 | 56.73% | 77.68% |
027 | Ion | 721 | 4,024 | 382 | 3,012 | 52.98% | 74.85% |
028 | Menexenus | 958 | 4,808 | 571 | 3,985 | 59.6% | 82.88% |
029 | Cleitophon | 418 | 1,549 | 284 | 1,293 | 67.94% | 83.47% |
030 | Republic | 4,846 | 88,878 | 1,782 | 71,377 | 36.77% | 80.31% |
031 | Timaeus | 2,666 | 23,662 | 1,122 | 18,644 | 42.09% | 78.79% |
032 | Critias | 1,130 | 4,950 | 638 | 3,997 | 56.46% | 80.75% |
033 | Minos | 528 | 2,859 | 309 | 2,333 | 58.52% | 81.6% |
034 | Laws | 5,227 | 103,193 | 1,804 | 82,652 | 34.51% | 80.09% |
035 | Epinomis | 1,014 | 6,309 | 590 | 5,135 | 58.19% | 81.39% |
036 | Epistles | 2,026 | 16,964 | 1,015 | 13,768 | 50.1% | 81.16% |
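For anyone wanting to see how the two percentage columns relate to the counts, here is a rough sketch of the calculation. It is not the `vocabulary-tools` script: it assumes a work is simply a list of lemmas (one per token) and that the learner’s vocabulary is a set of lemmas, and the function name is hypothetical.

```python
from collections import Counter


def coverage(work_lemmas, known):
    """Return (lemma coverage, token coverage) of a work for a known vocabulary."""
    counts = Counter(work_lemmas)
    known_in_work = [lemma for lemma in counts if lemma in known]
    known_tokens = sum(counts[lemma] for lemma in known_in_work)
    return len(known_in_work) / len(counts), known_tokens / sum(counts.values())


# Toy work of 10 tokens over 3 lemmas, of which the reader knows 2.
work = ["καί"] * 6 + ["λέγω"] * 3 + ["δικαστής"]
lemma_cov, token_cov = coverage(work, known={"καί", "λέγω"})
print(f"{lemma_cov:.2%} of lemmas, {token_cov:.2%} of tokens")
```

The Euthyphro row works the same way: 441 of 690 lemmas is 63.91% and 4,274 of 5,181 tokens is 82.49%.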
It’s encouraging how many works are above the 80% level. Here are some caveats, though:
Favouring shorter works isn’t necessarily a bad thing if the goal is to find the most readable (by vocabulary) works of Plato post-GNT.
Here’s a run of the code only assuming the 80% level of GNT vocabulary rather than the whole thing.
id | title | lemmas | tokens | GNT lemmas | GNT tokens | % GNT lemmas | % GNT tokens |
---|---|---|---|---|---|---|---|
001 | Euthyphro | 690 | 5,181 | 149 | 3,135 | 21.59% | 60.51% |
002 | Apology | 1,112 | 8,745 | 165 | 5,551 | 14.84% | 63.48% |
003 | Crito | 712 | 4,172 | 150 | 2,581 | 21.07% | 61.86% |
004 | Phaedo | 1,921 | 21,825 | 214 | 13,647 | 11.14% | 62.53% |
005 | Cratylus | 1,607 | 17,944 | 192 | 11,208 | 11.95% | 62.46% |
006 | Theaetetus | 2,072 | 22,489 | 215 | 13,416 | 10.38% | 59.66% |
007 | Sophist | 1,598 | 16,024 | 183 | 9,644 | 11.45% | 60.18% |
008 | Statesman | 2,013 | 16,953 | 194 | 9,577 | 9.64% | 56.49% |
009 | Parmenides | 805 | 15,155 | 140 | 9,852 | 17.39% | 65.01% |
010 | Philebus | 1,567 | 17,668 | 187 | 10,209 | 11.93% | 57.78% |
011 | Symposium | 1,949 | 17,461 | 208 | 10,437 | 10.67% | 59.77% |
012 | Phaedrus | 2,266 | 16,645 | 212 | 9,395 | 9.36% | 56.44% |
013 | Alcibiades 1 | 1,138 | 10,264 | 177 | 6,296 | 15.55% | 61.34% |
014 | Alcibiades 2 | 711 | 4,268 | 142 | 2,566 | 19.97% | 60.12% |
015 | Hipparchus | 431 | 2,256 | 111 | 1,339 | 25.75% | 59.35% |
016 | Lovers | 473 | 2,391 | 104 | 1,427 | 21.99% | 59.68% |
017 | Theages | 627 | 3,485 | 124 | 2,129 | 19.78% | 61.09% |
018 | Charmides | 919 | 8,311 | 158 | 5,277 | 17.19% | 63.49% |
019 | Laches | 960 | 7,674 | 165 | 4,632 | 17.19% | 60.36% |
020 | Lysis | 911 | 6,980 | 150 | 4,204 | 16.47% | 60.23% |
021 | Euthydemus | 1,268 | 12,453 | 181 | 7,640 | 14.27% | 61.35% |
022 | Protagoras | 1,753 | 17,795 | 195 | 10,973 | 11.12% | 61.66% |
023 | Gorgias | 1,938 | 26,337 | 205 | 16,301 | 10.58% | 61.89% |
024 | Meno | 961 | 9,791 | 159 | 6,042 | 16.55% | 61.71% |
025 | Hippias Major | 958 | 8,448 | 154 | 5,123 | 16.08% | 60.64% |
026 | Hippias Minor | 698 | 4,360 | 134 | 2,446 | 19.2% | 56.1% |
027 | Ion | 721 | 4,024 | 133 | 2,236 | 18.45% | 55.57% |
028 | Menexenus | 958 | 4,808 | 161 | 2,877 | 16.81% | 59.84% |
029 | Cleitophon | 418 | 1,549 | 113 | 966 | 27.03% | 62.36% |
030 | Republic | 4,846 | 88,878 | 252 | 53,090 | 5.2% | 59.73% |
031 | Timaeus | 2,666 | 23,662 | 210 | 13,555 | 7.88% | 57.29% |
032 | Critias | 1,130 | 4,950 | 171 | 2,872 | 15.13% | 58.02% |
033 | Minos | 528 | 2,859 | 121 | 1,776 | 22.92% | 62.12% |
034 | Laws | 5,227 | 103,193 | 250 | 58,891 | 4.78% | 57.07% |
035 | Epinomis | 1,014 | 6,309 | 165 | 3,700 | 16.27% | 58.65% |
036 | Epistles | 2,026 | 16,964 | 211 | 10,229 | 10.41% | 60.3% |
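A sketch of how a vocabulary level like this can be derived, under the assumption that “the 80% level” means the smallest set of most frequent lemmas that together account for 80% of the tokens of the source corpus. The function name is made up for the example and the real script may differ in how it breaks ties.

```python
from collections import Counter


def lemmas_for_coverage(lemma_tokens, target):
    """Most frequent lemmas needed to cover `target` (0-1) of the tokens."""
    counts = Counter(lemma_tokens)
    total = sum(counts.values())
    known, covered = set(), 0
    for lemma, n in counts.most_common():
        if covered / total >= target:
            break
        known.add(lemma)
        covered += n
    return known


# Toy corpus of 10 tokens: "a" covers 50%, "a" and "b" together cover 80%.
corpus = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
print(sorted(lemmas_for_coverage(corpus, 0.75)))
```

The resulting set is what gets passed in as the known vocabulary when computing coverage of a Plato work.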
The Plato coverage generally drops from around 80% to 60% which suggests it might be worth “topping up” one’s vocabulary with some common Plato words not in the GNT before embarking on a specific work. It would be easy to generate such a list with `vocabulary-tools`.
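As a hedged illustration of what such a top-up list involves (again assuming a work is a list of lemmas and the learner’s vocabulary is a set, with a hypothetical function name rather than anything from the library):

```python
from collections import Counter


def top_up_list(work_lemmas, known, n=10):
    """The n most frequent lemmas in the work that are not already known."""
    unknown = Counter(lemma for lemma in work_lemmas if lemma not in known)
    return [lemma for lemma, _ in unknown.most_common(n)]


work = ["a", "x", "x", "y", "a", "x"]
print(top_up_list(work, known={"a"}))
```

Ranking by frequency within the target work is the simplest choice; ranking by how many of Plato’s works a lemma appears in would be another.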
But it was quite striking to me in both tables just how little the token % drops due to length (in contrast to the lemma %).
This just goes to show that longer works introduce a lot of new words but very sparsely (probably with only one occurrence in many cases).
I might explore that graphically in a follow-up post.
We previously introduced the (θ)η-aorists. In this post, we’ll mention the stem variants and then go over some counts.
In terms of stem variants, we first of all have δέω, where we find the infinitive δεθῆναι alongside the 1SG ἐδεήθην and 3SG ἐδεήθη. The infinitive form suggests a stem of δε-θη whereas the finite forms suggest a stem of ἐ-δεη-θη with an extra η.
Secondly, we have two 3SG forms of ἁρπάζω: ἡρπάγη and ἡρπάσθη.
Finally we have ἀνοίγω with its confused augmentation (which we’ve seen in other aorists) and also both a θ and non-θ form. Putting aside the ἠνοι- vs ἀνεῳ- vs ἠνεῳ- variation, we have 3SG ἠνοίχθη alongside ἠνοίγη and 3PL ἠνοίχθησαν alongside ἠνοίγησαν.
Notice that in both the ἁρπάζω and ἀνοίγω cases, we have a non-θ form with γ before the η. We’ll look at the letters we find before η and θη later in this post.
But first let’s do our usual counts of tokens and lemmas.
class | # lemmas | # tokens | # hapakes |
---|---|---|---|
-θη- | 250 | 954 | 130 |
-η- | 34 | 79 | 19 |
As one can see, the non-θ forms are more rare lexically and the lexemes that do take them occur less frequently. They both, however, seem productive.
cell | -θη- | -η- |
---|---|---|
INF | 166 | 4 |
1SG | 29 | 6 |
2SG | 8 | 2 |
3SG | 489 | 43 |
1PL | 30 | 5 |
2PL | 44 | 3 |
3PL | 188 | 16 |
The distribution above is what we might expect except for the INF which are disproportionately -θη-. This is not due to a single lexical item (unlike the 3SG where ἀπεκρίθη dominates).
This will be worth further investigation but we have other things to cover first. For example, is there any phonological reason why a non-θ form might be used rather than a θ-form? We saw previously, for example, that the existence or absence of the sigma in the alphathematic aorists was largely (although not entirely) predicted by the preceding letter.
It turns out, at least in our text (we’ll look more broadly later) there’s quite a strong correlation between whether a θ is found or not and what the preceding letter is.
For example, if the preceding letter is any of the vowels ε η ι ο υ ω, then we always find the θη form in the SBLGNT. α is the only exception and even then only in one lexical item out of 14, the κατακαίω form κατεκάη. (Notably κατεκαύ(σ)θη is more common elsewhere but we’ll have to wait a little to discuss καίω forms in general)
If the preceding letter is σ, then we always find the θη form. This is actually the most likely letter to precede θ by far, followed by η.
ξ ψ and ζ don’t appear in (θ)η aorists in the SBLGNT. Nor do δ τ or θ.
Amongst the velars: κ doesn’t appear in (θ)η aorists in the SBLGNT but γ and χ both do. γ is always followed directly by η (and in fact the bigram γθ never appears in the SBLGNT at all). In contrast, χ always takes the θ form (which might be explained by an underlying κ or γ becoming χ because of the following θ but this doesn’t explain why the θ would be absent in the -γ-η- instances).
Amongst the bilabials: both π and β are always followed directly by η (and neither πθ nor βθ appear as bigrams in the SBLGNT). φ however is found both in θη and η forms with a slight preference for φθη over φη.
This leaves our resonants: the liquids λ and ρ, and the nasals μ and ν. The bigram λθ is definitely allowed in Greek but we only find -λ-η- aorists, not -λ-θη-. With ν and ρ we find both θ and non-θ forms. There are no μ examples in the SBLGNT, nor do we find the bigram μθ.
Here’s a summary with lexeme counts:
preceding letter | -θη- | -η- |
---|---|---|
α | 13 | 1 |
ε | 21 | - |
η | 80 | - |
ι | 17 | - |
ο | 11 | - |
υ | 26 | - |
ω | 52 | - |
σ | 108 | - |
ξ | - | - |
ψ | - | - |
ζ | - | - |
τ | - | - |
δ | - | - |
θ | - | - |
κ | - | - |
γ | - | 12 |
χ | 37 | - |
β | - | 2 |
π | - | 3 |
φ | 16 | 10 |
λ | - | 6 |
ρ | 6 | 8 |
μ | - | - |
ν | 16 | 3 |
Clearly there are some patterns here. Vowels, σ, and the aspirated stops strongly (or even entirely) favour -θη-. The non-aspirated stops seem to entirely favour a plain η. The resonants are a mixed bag.
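The classification behind a table like this can be sketched as follows. This is an illustration, not the script that produced the counts: it assumes the input is already known to be a (θ)η-aorist indicative or infinitive, strips diacritics, removes the personal ending, and then reads off whether a θ precedes the η. The function names are invented for the example.

```python
import unicodedata

# Personal endings of the paradigm, longest first so "σαν" is tried before "ν".
ENDINGS = ["σαν", "μεν", "ναι", "τε", "ν", "ς", ""]


def strip_diacritics(form):
    """Remove accents, breathings, and other combining marks."""
    decomposed = unicodedata.normalize("NFD", form)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def classify(form):
    """Return (class, preceding letter) for a (θ)η-aorist form."""
    plain = strip_diacritics(form)
    for ending in ENDINGS:
        if plain.endswith(ending):
            stem = plain[: len(plain) - len(ending)]
            if stem.endswith("θη"):
                return "θη", stem[-3]
            if stem.endswith("η"):
                return "η", stem[-2]
    raise ValueError(f"{form} does not look like a (θ)η-aorist")


for form in ["ἠνοίχθη", "ἠνοίγη", "ἐχάρησαν", "κατεκάη", "γενηθῆναι"]:
    print(form, classify(form))
```

Note that a stem-final θ followed by plain η would be misread as a θη form by this approach. That doesn’t arise in the SBLGNT data but would need handling for other texts.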
There are definitely some correlations but it’s unclear what the causal relationship is. And it raises the important question of where the letter before the θ (or η) comes from in the first place. This relates more broadly to the question of the aorist stem. What is the relationship between the aorist stems used in the active, middle and (θ)η forms? In the next post, we’ll start to explore that. Then, after reviewing all our endings so far, we’ll move on to the even bigger question: what’s the relationship between the aorist stem and the present stem?
We now turn to the (θ)η-aorists. These are often called aorist ‘passives’ but this is an unhelpful and confusing term. When talking about the form, it’s better to give a label that simply refers to the form itself rather than to one of the functions that form may or (often) may not be used for. Naming the form for one of its functions (especially when other forms can be used for the same function) runs the risk of overemphasizing that function and somehow treating other functions as anomalies.
We must be clear, though, that “(θ)η-aorist” is not a category like “root aorist” or “thematic aorist” or “sigmatic aorist” where different lexemes fall (in most cases exclusively) into just one of those categories without there necessarily being a morphosyntactic distinction. The (θ)η-aorist is a new paradigm available to verbs for expressing a certain voice in contrast to the active and middle forms that we’ve already seen.
A lot more could be said about all this but that’s outside the scope of a tour of morphological forms. The main point is that aorists can come in three voice-contrasting paradigms.
Three of the most common (θ)η-aorists in the New Testament, with broad coverage across personal endings are γενηθῆναι, ἀποκριθῆναι, and χαρῆναι.
cell | γίνομαι | ἀποκρίνομαι | χαίρω |
---|---|---|---|
INF | γενηθῆναι | ἀποκριθῆναι | χαρῆναι |
1SG | ἐγενήθην | ἀπεκρίθην | ἐχάρην |
2SG | | ἀπεκρίθης | |
3SG | ἐγενήθη | ἀπεκρίθη | ἐχάρη |
1PL | ἐγενήθημεν | | ἐχάρημεν |
2PL | ἐγενήθητε | | ἐχάρητε |
3PL | ἐγενήθησαν | ἀπεκρίθησαν | ἐχάρησαν |
All the above forms appear in the SBLGNT.
The “vertical” distinguishers are our familiar endings seen in the root aorist actives:
INF | -ναι |
1SG | -ν |
2SG | -ς |
3SG | - |
1PL | -μεν |
2PL | -τε |
3PL | -σαν |
The “horizontal” distinguishers, however, look like this:
INF | -(θ)ῆναι |
1SG | -(θ)ην |
2SG | -(θ)ης |
3SG | -(θ)η |
1PL | -(θ)ημεν |
2PL | -(θ)ητε |
3PL | -(θ)ησαν |
The whole category always has a -η- before the ending and most often a -θη-, hence the name (θ)η-aorist.
By far the most common form in the SBLGNT is ἀπεκρίθη (82 tokens). The plural ἀπεκρίθησαν is the third most common form (19 tokens). The second most common form is ἐδόθη (31 tokens).
The next post will look at further counts of these (θ)η-aorists and then we’ll look at the relationship between aorist active, middle and (θ)η forms before moving on to the large question of the relationship between perfective and imperfective forms.
We saw in Part 42 that the aorist middle endings were:
either:
which correspond to our classes:
Note again what we said in the previous post: this is just a classification based on the distinguisher paradigms and there are other ways of categorizing aorist forms.
Does this cover all aorist middle indicatives and infinitives in SBLGNT? Are there any words in more than one class or with more than one stem? And what are the counts for the different classes and dominant lemmas or forms within each class?
We’ll cover that here.
There is one form that doesn’t match our distinguisher patterns and that’s ξυρᾶσθαι in 1 Cor 11.6. This seems to be an error in the MorphGNT tagging, though. It’s clearly a present (imperfective) infinitive not an aorist (perfective) infinitive and so is not relevant here.
Now in terms of multiple stems, we do have an augmentation difference with ἐργάζομαι. We find both ἠργασ- and εἰργασ-.
And in terms of words that seem to appear in more than one class, we have these forms of ἐξαιρέω:
We also have these forms of ἀποδίδωμι (which we’ve brought up before):
We would expect the forms to follow the root pattern. ἀπέδοσθε is unambiguously root, ἀπέδετο is unambiguously thematic. ἀπέδοντο could be taken to be root or thematic. For the purposes of the counts below, we’ll take ἀπέδοντο to be root.
Note that ἐκδίδωμι only appears as ἐξέδετο which is also thematic so there’s definitely some reanalysis going on with the δίδωμι compounds.
Here are the total counts across classes for aorist middles in SBLGNT:
class | # lemmas | # tokens | # hapakes |
---|---|---|---|
alphathematic | 109 | 393 | 49 |
thematic | 20 | 320 | 10 |
root | 14 | 39 | 3 |
(yes, “hapakes” is a running joke, equivalent to calling them “the onces”)
ἀπο-δο is the only root ending with ο and the rest of the root endings are θε and compounds:
The thematics come from ten families and are:
But note that γίνομαι alone makes up 269 out of the 320 tokens!
Now by person/number:
cell | alphathematic | thematic | root |
---|---|---|---|
INF | 63 | 45 | 3 |
1SG | 23 | 16 | 3 |
2SG | 8 | 2 | 1 |
3SG | 208 | 227 | 16 |
1PL | 12 | 0 | 0 |
2PL | 22 | 4 | 2 |
3PL | 57 | 26 | 14 |
Because μι verbs have root forms throughout the middle (not just in the infinitive like in the active in Hellenistic Greek) we don’t get the disproportionately high INF root counts that we did in the active.
The 3SG expectedly dominates. This is particularly true in the thematic, in large part due to ἐγένετο. But in addition to dominance of the 3SG by ἐγένετο, all the 2SG and 2PL thematic aorist middles are γεν and most of the INF, 1SG and 3PL are too, as shown here in our table showing dominant forms:
cell | alphathematic | thematic | root |
---|---|---|---|
INF | | γενέσθαι 36/45 | καταθέσθαι 2/3 ἀποθέσθαι 1/3 |
1SG | | ἐγενόμην 12/16 | προεθέμην 1/3 προσανεθέμην 1/3 ἀνεθέμην 1/3 |
2SG | κατηρτίσω 2/8 ἠρνήσω 2/8 | ἐγένου 2/2 | ἔθου 1/1 |
3SG | | ἐγένετο 201/227 | ἔθετο 7/16 |
1PL | | | |
2PL | | ἐγένεσθε 4/4 | ἔθεσθε 1/2 ἀπέδοσθε 1/2 |
3PL | ἤρξαντο 19/57 | ἐγένοντο 14/26 | ἔθεντο 4/14 |
As with the actives, there is greater lexical variety amongst the alphathematics than amongst the thematics.
class | token-lemma ratio | % hapakes |
---|---|---|
alphathematic | 3.61 | 45.0 % |
thematic | 16.00 | 50.0 % |
root | 2.79 | 21.4 % |
The top 5% of alphathematic lemmas make up 32.1% of the tokens whereas the top 5% of thematic lemmas make up a whopping 84.1% of tokens. For the actives, recall the numbers were 44.1% and 60.7% respectively.
In the next couple of posts we’ll look at the (θ)η aorists (often misleadingly called aorist “passives”).
I took MorphGNT SBLGNT and wrote a script that made a list of words from it as follows:
So up to 8 potential “words” from each token in the SBLGNT, but then with duplicates removed. This led to 55,496 unique “words”.
I grouped every individual Greek character (209 of them) found in the above word list into 30 “chapter” buckets. For example, I put “κ” in chapter 1 and “ξ” in chapter 4 and “έ” in chapter 8 and “ἤ” in chapter 14 and so on. This wasn’t done computationally, just manually. Each chapter has a theme: something new that gets introduced and, other than chapter 5 which covers the uppercase letters, there are no more than 9 new characters in each chapter and usually 5–8.
I then wrote a script that went through all 55,496 “words” from Step 1 and, for each character, looked up which chapter from Step 2 that character was introduced in. Then, for each word, the script noted the earliest chapter needed for all the characters in that word.
In other words, if `chapter` is a mapping from a character to what chapter number it is in, calculate `max(chapter[character] for character in word)`.
At this point the script has built a table of 55,496 words each with the “target chapter” they can be introduced in.
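Here is that step as a small runnable sketch. The four character-to-chapter assignments are the ones mentioned above; everything else (the toy “words”, the function names, and the NFC normalisation) is an assumption for the sake of illustration rather than the site’s actual code.

```python
import unicodedata
from collections import defaultdict

# A tiny fragment of the character-to-chapter mapping (the real one has 209 entries).
CHAPTER = {
    unicodedata.normalize("NFC", character): number
    for character, number in {"κ": 1, "ξ": 4, "έ": 8, "ἤ": 14}.items()
}


def target_chapter(word, chapter):
    """Earliest chapter by which every character in the word has been introduced."""
    # Normalise so that each letter plus its diacritics is a single character.
    word = unicodedata.normalize("NFC", word)
    return max(chapter[character] for character in word)


def words_by_chapter(words, chapter):
    """Group words under the chapter in which they first become typeable."""
    table = defaultdict(list)
    for word in words:
        table[target_chapter(word, chapter)].append(word)
    return dict(table)


# Toy strings built only from the four mapped characters, not real Greek words.
toy_words = ["κκ", "κξ", "ξέκ", "ἤκ"]
print(words_by_chapter(toy_words, CHAPTER))
```

Taking the maximum is what makes a word wait for its “latest” character: a word is only usable once the last of its characters has been taught.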
When a user on greektyping.com is doing a particular chapter, here’s what happens:
So that’s how it works. It would be fairly easy to apply to other Greek texts (they don’t have to be analysed to the extent MorphGNT is). But even with just the MorphGNT there’s a lot of “replayability”. Chapter 8 alone has 16,704 words you could be tested on.
We’ll probably add some richer statistics at some point and also typing of longer units of text but for now our focus is on adding instructions for more keyboard layouts (the drills themselves will stay the same, though).
We’ve classified aorist active endings into three classes:
It’s important to stress that this is a classification of distinguisher paradigms. It is related to but distinct from other ways of classifying aorists based on the properties of the stem and how it relates to the imperfective (present) stem. We’ll get to those other ways in a few posts’ time but for now, our classification is just based on the distinctive set of endings.
As we’ve done before, we’ll now take this classification and look at various counts in the SBLGNT. How many times do we encounter tokens of each class? How many different lemmas are in each class? Which paradigm cells are most common for each class?
Let’s start with just the lemma and token counts as well as the number of lemmas that only occur once in the SBLGNT text.
class | # lemmas | # tokens | # hapakes |
---|---|---|---|
alphathematic | 661 | 2973 | 326 |
thematic | 103 | 2082 | 33 |
root | 36 | 262 | 15 |
There is more lexical variety in the alphathematic class, especially when compared with the thematic class. This can be seen in the token-lemma ratio and in the percentage of lemmas that are hapakes.
class | token-lemma ratio | % hapakes |
---|---|---|
alphathematic | 4.50 | 49.3 % |
thematic | 20.21 | 32.0 % |
root | 7.28 | 41.7 % |
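For reference, both derived columns come straight from the counts in the first table. A minimal sketch, assuming the tokens of one class are available as a list of lemmas and counting a hapax as a lemma with exactly one token in that list (the function name is hypothetical):

```python
from collections import Counter


def class_stats(lemma_tokens):
    """Lemma, token, and hapax counts plus the two derived measures."""
    counts = Counter(lemma_tokens)
    lemmas = len(counts)
    tokens = sum(counts.values())
    hapaxes = sum(1 for n in counts.values() if n == 1)
    return {
        "lemmas": lemmas,
        "tokens": tokens,
        "hapaxes": hapaxes,
        "token_lemma_ratio": tokens / lemmas,
        "pct_hapaxes": 100 * hapaxes / lemmas,
    }


# Toy class with 4 lemmas and 10 tokens, two of the lemmas occurring once.
stats = class_stats(["a"] * 6 + ["b"] * 2 + ["c", "d"])
print(stats)

# The alphathematic row above: 2973 tokens over 661 lemmas, 326 hapaxes.
print(round(2973 / 661, 2), round(100 * 326 / 661, 1))
```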
Another way to see this is what % of tokens are forms of the top % of lemmas.
class | 5% | 10% | 25% | 50% |
---|---|---|---|---|
alphathematic | 44.1% | 57.4% | 76.1% | 88.7% |
thematic | 60.7% | 76.8% | 89.6% | 96.4% |
root | 21.8% | 47.7% | 80.5% | 92.0% |
This table is saying that the top 5% of lemmas with alphathematic forms make up 44.1% of alphathematic tokens but the top 5% of lemmas with thematic forms make up 60.7% of thematic tokens.
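A sketch of how such a figure can be computed, again assuming a class is represented as a list of lemmas, one per token. How “the top 5% of lemmas” is rounded when it isn’t a whole number is my assumption here (rounding up), and the function name is made up for the example.

```python
import math
from collections import Counter


def top_share(lemma_tokens, fraction):
    """Share of tokens belonging to the top `fraction` of lemmas by frequency."""
    counts = Counter(lemma_tokens)
    # Assumption: round the number of lemmas up, so at least one is included.
    k = math.ceil(fraction * len(counts))
    top_tokens = sum(n for _, n in counts.most_common(k))
    return top_tokens / sum(counts.values())


# Toy class: 4 lemmas, 10 tokens, dominated by one lemma.
tokens = ["a"] * 7 + ["b", "c", "d"]
print(top_share(tokens, 0.25))
print(top_share(tokens, 0.50))
```

The more a class is dominated by a few lexical items, the faster this share approaches 100% as the fraction grows.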
In other words, the thematic aorist active tokens are drawn from a smaller set of lemmas than the alphathematic. In fact, a third of thematic aorist active tokens in SBLGNT are forms of εἶπον (and, as we’ll see in a moment, mostly 3SG).
One interesting anomaly perhaps worth coming back to at some stage (I wasn’t aware of it until now) is that at the top 5% and top 10% lemma level, the root aorists token % is lower than the alphathematic but at the 25% and 50% level is above.
Okay, that’s distribution across the three classes of ending. What about individual paradigm cell counts?
cell | alphathematic | thematic | root |
---|---|---|---|
INF | 509 | 351 | 95 |
1SG | 224 | 163 | 12 |
2SG | 88 | 30 | 3 |
3SG | 1244 | 1143 | 94 |
1PL | 80 | 45 | 3 |
2PL | 119 | 40 | 13 |
3PL | 709 | 310 | 42 |
In all cases, the infinitive and third person dominate.
It is interesting that in the alphathematics, 3SG dominates with 3PL next and then INF. In the thematics, 3SG dominates even more followed by INF with 3PL not far behind. In the root aorists, the INF is actually up with the 3SG with 3PL a distant third. Recall the μι verbs have a root form in the INF but nowhere else. This likely explains why the INF makes up such a large proportion of root form tokens.
Within the 1st and 2nd person cells, the 1SG dominates in the alphathematic and especially the thematic. In the root, the 2PL is actually on par with the 1SG.
Again this is worthy of closer inspection but there are definitely individual lexical items at work here.
As we’ve done before, let’s look at which lemmas (if any) dominate particular cell paradigm counts.
cell | alphathematic | thematic | root |
---|---|---|---|
INF | | | δοῦναι 33/95 |
1SG | | εἶδον 54/163 | ἔγνων 6/12 ἀνέβην 3/12 |
2SG | | εἶδες 8/30 | ἔγνως 3/3 |
3SG | | εἶπε(ν) 610/1143 | |
1PL | | | ἐνέβημεν 1/3 ἐπέγνωμεν 1/3 ἐξέστημεν 1/3 |
2PL | | ἐλάβετε 13/40 | ἀνέγνωτε 10/13 |
3PL | | | ἔγνωσαν 17/42 |
Consistent with its greater lexical variety, the alphathematic cells are not dominated by any one lexical item at all.
In the thematics, though, we see the disproportionate occurrence of εἶδον in the 1SG and 2SG and especially of εἶπον in the 3SG where it makes up more than half the occurrences of thematic 3SG aorist actives.
Note that no root aorist lemma dominates the 3SG cell but all the other cells have a small set of lemmas covering a lot of occurrences. ἔγνως is the only root 2SG form in the SBLGNT, and ἀνέγνωτε makes up 77% of root 2PL occurrences.
One thing that might be slightly misleading about the lemma numbers for the thematic and (especially) root aorists is inclusion of compound verbs with preverbs. The 103 thematic aorist active lemmas actually come from 27 base verbs (there are 16 lexical items just from ἔρχομαι/ἦλθον for example). The 36 root aorist active lemmas actually come from just 7 base verbs and 3 of those (δίδωμι, τίθημι, and ἀφ-ίημι) only have a root ending in the infinitive.
So the only fully root verbs in the SBLGNT are the γνω family, the βη family, the στη family, and δυ. With the exception of δυ which has only one instance, the rest have reasonable token counts (82 for γνω, 71 for βη, 55 for στη).
The thematic aorist base verbs with the highest token counts are: the εἰπ family (689), the ἐλθ family (538), εἰδ (178), the λαβ family (125), the ἀγαγ family (71), the βαλ family (70), the ἀπο-θαν family (67), εὑρ (58), the πεσ family (57).
Next up we’ll look at the aorist middles again.
Being able to type Greek fluently, diacritics and all, is an often neglected skill for classical and biblical language students but it’s one that is increasingly important whether you’re doing searches, writing essays, editing electronic editions, or just chatting about (or even better, in) Greek online.
A few years ago, I wrote a simple web application to help me practice typing using the built-in Greek - Polytonic input source on macOS. I grouped all the characters (including with full diacritics) into an ordered sequence with 30 stages then wrote a script to find Greek words in the New Testament that only used letters and diacritics appropriate for each stage in the sequence.
Talking to a classics lecturer a couple of weeks ago, he brought up the increasing need for students to be able to type Greek, and I said: “oh, I have a web app for that”. But I realised it needed a bit of polish.
That polish is now done (with some help from my colleague Patrick Altman) and we’ve now launched
The instructions are currently still just for the macOS Greek - Polytonic input source but we’ve put in place some of the framework to support different instructions depending on your operating system and keyboard layout setup. We’ll add new instructions for new keyboard layouts over time.
Even with missing instructions, it should mostly be possible to actually do the timed exercises with any keyboard layout as you are just assessed on the Unicode characters you are producing, not how you produced them on your particular keyboard.
Hopefully, though, this will be a helpful resource to all those who want to be able to type Ancient Greek faster. And we’ll continue to improve it over time, both in terms of instructions for other layouts but also some more features, interesting stats, and fun games.
And there is no reason we can’t extend it to other writing systems too.