Glossary Updated

This post describes a recent process to update the glossary found on this blog. I believe a reader should know how a glossary is assembled in order to know how much to trust its accuracy, so I’m trying to be as transparent about the process as possible. Furthermore my glossary has two “biases”: 1) it is aimed at terms found in Spain, not any Spanish term from anywhere, and, 2) I (mostly) only include terms I’ve actually found on the hundreds of menus from restaurants in Spain I’ve collected and analyzed to create a highly curated corpus. So while considerable effort went into constructing the glossary, it naturally still has errors as it was manually compiled. But I believe it is one of the better and more exhaustive glossaries you’ll find, at least for free on the Net.

After eight more days of work since my post about this effort I decided to call it “done” and update my glossary page as version 4.0. The glossary gained about 150 items, had numerous errors corrected (especially spelling, especially accents), had some definitions changed or enhanced, and adopted my “syntax” to show all the forms of a word under a single “lemma” (a term I just learned from linguistics).

Despite all the work I did there are still mistakes, omissions, inconsistencies in the lemma representations and other errors. This is the challenge of manually editing a large amount of material, even while trying to be very careful. Each time I do this manually I learn a bit more about how I’ll have to create the software to create and manage a properly curated corpus which I’ll need for my translation application.

Not every term in this glossary is really a “translation” to English as often there is no translation. So instead, based on terms I have found in the many menus from restaurants in Spain that I’ve analyzed as the “raw” data, I have sometimes had to supply a description instead of either a “definition” or a translation. For instance, I researched and added most of the names of grapes used in Spanish wines, olives used in tapas and cheeses used in various dishes. While one might translate Cabrales as “blue cheese” this isn’t that helpful so descriptions work better.

So almost every term in my glossary I have found in menus. There are more terms in the various glossaries I’ve found and assembled but unless I actually see a term used in a menu in Spain I can’t be certain some term from some other glossary actually applies to Spain. Or, of course, Spanish food terms in other parts of the world may mean something entirely different than they do in Spain and so I’m trying (as best I can) to focus on the vocabulary one would encounter in Spain.

I may do some more “fixes” or additions to this glossary but I don’t expect to do another major revision. As it is this is now one of the largest glossaries you’ll find anywhere on the Net (and perhaps the easiest to access, just a single, albeit long, webpage, not some more complex access scheme). So while this glossary, like anything you find on the Net, is easily available one should ALWAYS be somewhat skeptical as the editor is human and makes mistakes, so check with authoritative sources for any terms that might really matter for you.


A look at my drill application

Since I’ve mentioned this in multiple posts I thought I’d provide a little more detail. Here’s a screen shot with some food terms.

Ugh, WordPress is hard to get images right, hope this looks OK after saving. Good, for some reason the image looks bad in WordPress’ post editor but I chopped the screenshot to fit and it looks OK after posting.

BTW: Spanish readers out there will note kokotxa in this list which is really Basque, not Castilian which would be cococha.

Anyway, the basic idea is to load a random (though biased to get most effective drilling) set of words and then I visually examine them. Most drills do some sort of “quiz” but this is for me so I just scan the list.

If I don’t instantly know the translation I click the word. That gives me a score of -1 (otherwise if I don’t click a word it gets a score of 0, for appearing but “known”). I don’t “cheat”, since this is just for me, so I don’t need a quiz.

But if I have the least bit of doubt I click and then I see the translation. Then I decide: a) if it was a mistake that I clicked, I click the Ignore button, b) if I thought I knew the answer but was wrong, I click the Wrong button and my score becomes -3, and, c) if I really didn’t know at all (or my “guess” was wildly wrong) I click the “no clue” button and get a score of -10.
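For readers who think in code, here is a minimal sketch of that scoring scheme in Python. It is not my actual drill code: the names (Outcome, SCORES, score_session) are made up for illustration, and I'm assuming that Ignore simply puts the word back to 0.

```python
from enum import Enum


class Outcome(Enum):
    """What happened to one word during a drill (hypothetical names)."""
    NOT_CLICKED = "not clicked"   # scanned the word and knew it
    CLICKED = "clicked"           # had some doubt, looked at the translation
    IGNORE = "ignore"             # clicked by mistake
    WRONG = "wrong"               # thought I knew it but didn't
    NO_CLUE = "no clue"           # didn't know at all


# Scores as described in the text; IGNORE = 0 is an assumption.
SCORES = {
    Outcome.NOT_CLICKED: 0,
    Outcome.CLICKED: -1,
    Outcome.IGNORE: 0,
    Outcome.WRONG: -3,
    Outcome.NO_CLUE: -10,
}


def score_session(outcomes):
    """Turn {word: Outcome} for one drill into {word: score}."""
    return {word: SCORES[outcome] for word, outcome in outcomes.items()}


session = {
    "carne": Outcome.NOT_CLICKED,
    "kokotxa": Outcome.NO_CLUE,
    "pulpo": Outcome.WRONG,
}
print(score_session(session))
```

Keeping the numbers in one table means the penalties can be tuned later without touching the rest of the program.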

After I’ve looked at all the words I click Done to record the results. Then I click Drill to get a new set of words (which is more likely to repeat words with scores other than 0). I continue as long as I can stand and then click Save (unless I’m just testing code) and the scores are then added to the XML database.
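The “more likely to repeat” part can be sketched as a weighted random draw. This is only an illustration under my own assumptions, not the real program: the function name pick_drill_words and the weight formula (1 plus the size of the accumulated penalty) are invented here.

```python
import random


def pick_drill_words(scores, count, rng=None):
    """Pick `count` distinct words, favoring those with worse scores.

    `scores` maps word -> accumulated score (0 or negative).
    The weight 1 + abs(score) is an assumed formula: every word keeps
    some chance of appearing, but missed words appear more often.
    """
    rng = rng or random.Random()
    pool = dict(scores)
    chosen = []
    while pool and len(chosen) < count:
        words = list(pool)
        weights = [1 + abs(pool[word]) for word in words]
        word = rng.choices(words, weights=weights, k=1)[0]
        chosen.append(word)
        del pool[word]  # no word appears twice in one drill
    return chosen


scores = {"carne": 0, "agua": 0, "kokotxa": -10, "pulpo": -3}
print(pick_drill_words(scores, 3))
```

With these sample scores kokotxa is eleven times as likely as carne to be drawn first, yet the known words still show up now and then.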

And if I’m sure I want to record the results then I can use the File menu item to save a new copy of the XML. The XML Editor and XML Update are what I use to fix issues in the database itself.

All the drill results are saved in another part of the XML (eventually making it very large, hurrah for having lots of RAM to have all this in memory – I come from the days when RAM was scarce and had to do lots of programming tricks, now I just brute force all this).

Then I have an analysis routine (WIP) to consolidate all the scores over all the drill sessions to find out which words are worst (lots of mistakes, therefore drill more) and which are best (few or no mistakes, so only drill after some time has passed).
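Since that routine is still a work in progress, here is just one simple way the consolidation could look: average each word's score over every drill it appeared in and sort worst first. The data layout (a flat list of word and score pairs) and the name rank_words are assumptions for this sketch.

```python
from collections import defaultdict


def rank_words(results):
    """Average each word's score over all drills, worst words first.

    `results` is an iterable of (word, score) pairs gathered from
    every saved drill session (an assumed layout, not my real XML).
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for word, score in results:
        totals[word] += score
        counts[word] += 1
    averages = {word: totals[word] / counts[word] for word in totals}
    return sorted(averages.items(), key=lambda item: item[1])


results = [
    ("kokotxa", -10), ("kokotxa", -3),
    ("carne", 0), ("carne", 0),
    ("pulpo", -1),
]
for word, average in rank_words(results):
    print(word, average)
```

A plain average ignores how long ago each mistake happened, so a real version would probably also weigh recent drills more heavily.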

While I intend to create other types of drills this is “good enough” to have me looking at a fair portion of my vocabulary every day (todos los días) and thus keep refreshing my wetware memory. I can’t do this very long (so the magenta number on the screen shot is a timer of how long I’ve been doing drills, rarely do I exceed 20 minutes) because I’ll start having “short-term” memory (since my mistakes are more likely to repeat in the drill, by design) and so I begin to “know” them, but not really.

I’m focusing the drill (really the way I’ve created the XML database) on recognizing the Spanish, since, again, my goal is reading menus, not writing them. So my database is (now) poorly structured for doing English drills, which is harder than the Spanish drills, but more useful if I need to be able to ask questions about the menus.

And of course this is all “written” rather than spoken drills and to be really helpful I actually need to know how hablar a camarero but I’m getting there.

Back to menus; a big project

My primary purpose for this blog is to record my progress in developing an application to translate menus in Spain. I worked diligently on this for about nine months but then got into some side-trips in other projects. But now I’m trying to get back to that primary objective.

For 78 days now I’ve also been trying to actually learn Spanish via the nice online application, Duolingo. While this diverted me from my primary task it has been useful. My sister always thought my idea was silly and that instead I should just learn the language. That’s not a bad idea but it looked harder (and more time consuming) than my primary limited work just to read menus, based on the assumption I’d soon be heading to Spain to tour along the route of the Camino de Santiago. Therefore I needed results sooner than I could learn the language.

To build my application I’d first need a large corpus of terms from menus with accurate English equivalents. To do that I’d import the text from websites into a working document and crunch through all the terms. Often that gave me some interesting observations that I was converting to posts, hopefully also interesting to my readers. Obviously there are going to be mistakes in manually collating data so my corpus needed to be carefully curated, with the terms and my “guesses” at translation with a “confidence” factor. Then via the large corpus I could extract the accurate equivalent Spanish to English translations I’d need for the application.

That’s a long slog so a couple of times I went ahead and created a minimally curated “glossary” which I have as a page here at this site. In my searches I found a number of glossaries, or even dictionaries in Spanish, covering food. Years ago when I first got interested in these I just extracted all the glossaries I could find and manually collated them into a single glossary. It was a mess!

The trouble is that food terms in Spanish (my searches) yield results that either don’t apply to Spain’s food dialect or were just wrong. After all any other person who compiles glossaries makes mistakes too. Or I’d make mistakes extracting and collating them. And my lack of any fluency in Spanish meant I often misinterpreted the raw material I was attempting to organize. That previous experience convinced me I needed to be very precise about collating material AND focused on Spain as the source of the raw material and so my idea about creating a corpus evolved.

But in nearly a year I still don’t have that corpus. And without it I can’t build my application. And in the meantime I needed to get some “drill” code done since I reached the point where I was forgetting more than I was learning. And while Duolingo is fairly good for learning Spanish it’s not as good for repeating previous lessons (and their vocabulary). And repetition is the key to learning a language. So I found myself forgetting vocabulary I’d once before acquired.

So I set out to build a drill application, which has some of the same elements I’d need in the translation application. And like compiling glossaries I’ve done this also, in the past – the first time for Italian food terms. So I’ve built drill programs before with only limited success.

The key to a drill program is to be efficient and force me to do repetitions of the vocabulary I know the least well. That’s harder than it sounds. Plus most of the types of drill I did (glorified flashcards, a common language learning technique) took so much time that as my vocabulary grew my repetition, of any particular word, got less and less frequent. Even with an hour a day I could only repeat a fraction of the vocabulary I’d acquired.

So I had some ideas how to improve this and make the drill more efficient. But I needed data even to do the programming. So I fairly quickly assembled the glossary I posted at this blog without being too concerned about its accuracy.

So with that lengthy background now I can describe what I’ve more recently done and the “big project” I’m now doing. I built my first version of the drill application centered around the Duolingo vocabulary. As I’d do each lesson I would fairly carefully assemble the “database” (a complex XML) to feed the drill program. For my Duo vocabulary that now contains about 1100 “terms” and 1400 “forms” of those terms. By forms I mean the usual four spellings of adjectives (in Spanish both gender and number) and the first set of conjugations for verbs. Getting all that going for Duo vocabulary drills got me a fairly useful and efficient drill program which is helpful as a supplement to Duolingo.

So then using that code and crunching the glossary I’d assembled here I started on the food terms. And that was a bit of a mess because the glossary sucked.

So to fix this I went back to my 30 or so working documents of all the menus I’d processed. Rather than the more difficult chore of extracting material for a well curated corpus I just quickly (in a couple of days) extracted all the accumulated Spanish. That’s a tedious chore but it does reveal some of the problems of getting “raw” material from the websites. Naturally I found lots of spelling mistakes (easier for me to recognize now that I know a little Spanish) but also inconsistencies in gender and sometimes number. Also many instances of words are very inconsistent in the use of accents. My Duolingo study also let me learn the rule that accents sometimes change (for real, not typos) in certain circumstances.

So once I’d compiled all my “words” from all menus I had about 10,000 “raw” bits that I was able to clean up, de-duplicate and consolidate (like all the forms of adjectives under a single “term”) and ended up with about 5500 lines.
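I did most of this cleanup by hand, but the mechanical part can be sketched in a few lines. This is an illustration only: the functions clean and consolidate are hypothetical, and the lemma_of lookup (form to term) stands in for the judgment I applied manually. Accents are kept, since they matter in Spanish; the only normalization is Unicode NFC so that the same accented letter typed two different ways compares equal.

```python
import unicodedata


def clean(raw):
    """Normalize one raw menu word: unify accent encoding, trim, lowercase."""
    return unicodedata.normalize("NFC", raw).strip().lower()


def consolidate(raw_words, lemma_of):
    """De-duplicate raw words and group their forms under a single term.

    `lemma_of` maps a form to its term (hand-maintained, hypothetical);
    a word that is not listed is treated as its own term.
    """
    grouped = {}
    for raw in raw_words:
        form = clean(raw)
        if not form:
            continue  # skip blank lines from the raw extraction
        lemma = lemma_of.get(form, form)
        grouped.setdefault(lemma, set()).add(form)
    # Plain code-point sorting, not true Spanish alphabetical order.
    return {lemma: sorted(forms) for lemma, forms in sorted(grouped.items())}


raw_words = ["Frito", "frita ", "fritos", "fritas", "frito", "", "jam\u00f3n", "jamo\u0301n"]
lemma_of = {"frita": "frito", "fritos": "frito", "fritas": "frito"}
print(consolidate(raw_words, lemma_of))
```

Here eight raw bits collapse to two terms, which is the same kind of shrinkage as going from 10,000 bits to 5500 lines.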

Then in a separate process I took the latest (v3.3) copy of my glossary and then combined that with about six other glossaries. That was a chore and resulted in about 4000 entries.

So then I combined these, all the glossary “words” and all the menu “words”, and started going through all that by hand. I’m now done with everything through M (since I sort all 9000 or so lines into alphabetic order). I’ve done a few hundred “fixes” to my glossary and about 100 additions. But more importantly all those changes are in my XML “database” for the drill program. With a bit of code I can then extract from that XML to create text I can paste into the glossary page here.
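That “bit of code” might look something like the sketch below, using Python's standard xml.etree.ElementTree module. The XML layout here (a vocabulary element holding term elements with a lemma attribute, form children and a definition child) is invented for the example; my real database is more complex, and the output format is likewise just an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified layout; the real XML database differs.
SAMPLE_XML = """<vocabulary>
  <term lemma="frito">
    <form>frita</form>
    <form>fritos</form>
    <definition>fried</definition>
  </term>
  <term lemma="cabrales">
    <definition>blue cheese from Asturias</definition>
  </term>
</vocabulary>"""


def glossary_lines(xml_text):
    """Return one 'term (forms): definition' line per term, sorted by term."""
    root = ET.fromstring(xml_text)
    terms = sorted(root.findall("term"), key=lambda t: t.get("lemma", ""))
    lines = []
    for term in terms:
        lemma = term.get("lemma", "")
        forms = [f.text.strip() for f in term.findall("form") if f.text]
        definition = term.findtext("definition", default="").strip()
        label = f"{lemma} ({', '.join(forms)})" if forms else lemma
        lines.append(f"{label}: {definition}")
    return lines


print("\n".join(glossary_lines(SAMPLE_XML)))
```

The point of generating the page text from the XML is that the glossary and the drill database can't drift apart.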

So when I’m finally done with all that tedious manual work I can update my glossary and it will be a big change so I’ll make that the v4.0 version which I believe will be quite a bit better than my current v3.3 but not as good as a curated corpus needs to be. And, really my glossary will then mostly contain words that exist in reference sources (several online dictionaries I use) and/or reconciliation with the other glossaries I found.

Please note, therefore, that my work product is fully derived from many sources plus my own editorial work and thus constitutes “original” work. I’m quite conscious of never (almost never) posting anything in this blog that would violate copyright, i.e. the wholesale use of someone else’s glossary.

And now all my material is synchronized – my XML database for the drill program, my derived glossary with reconciliation to other glossaries or reference sources, and I’m only including terms in either place that I’ve found in menus so my product is more closely aligned with Spain dialect and I can exclude other Spanish food terms.

Now, while that isn’t done, I’m back into the code for my drill program. With the Duolingo vocabulary I feed into the drill program, I (mostly) know that vocabulary by memory. Duolingo is divided into lessons (aka skills) that require 40 actual drills (to pass the skill and unlock the next one) which means about 800 individual drills. At Duolingo I’ve now done 16,843 “XPs” over 31 skills. On average each skill introduces around 30 words (forms actually). So when I do my “refresh my memory” drills with that vocabulary I have relatively few words I ever mark as uncertain, or worse, “I’m wrong” or “I’m clueless” (really forgot). That means all the scoring I’ve done with that vocabulary has relatively few “errors” and my aggregate score on most terms is 100%.

In contrast I’m much worse on my new food vocabulary. As I’d work on menus I’d “learn” many words, but I had almost no repetition of those (only the most common words appear on many menus, so that was my repetition) and I’d done none of my own drills. Now that I have something to feed my drill program I’m getting a lot more “bad” scores. That’s good and bad. It’s bad because it means I don’t know those words very well, by memory. It’s good because now all the scoring of the drills I record in the XML has a lot more data than the drills on Duolingo vocabulary.

So that means back to programming. How do I consolidate tens of thousands of individual drills into some sort of metric that rates each word in the vocabulary as to how well I know it (and/or don’t confuse similar terms)? Because I want to drill myself on what I know the least. I don’t very much need to drill on carne or agua or cerveza or a few hundred other food words and I don’t want to waste the limited time I have for drills (even less than my free time because drill is tedious and I can only tolerate a certain amount each day). So those are now the algorithms I’m trying to develop so my drill program is even more efficient and therefore more useful.

So while I thought I’d be done with this by now I have probably another week to finish cleaning up my food vocabulary and enhancing my drill program. But once I’m done with that I can spend 15-30 minutes every day (or most days) so I get more of the food vocabulary into longer-term memory along with a growing Duolingo vocabulary. Thus I’d hope to have reasonable fluency within a few months so soon I may need to head to some Spanish speaking country to test myself.

Now, note, all this is “reading” (and less “writing”) Spanish. Hearing or speaking is an entirely different problem. But without mastery over much of the vocabulary actual conversation is pretty hopeless. I’d originally assumed I’d have no more audible Spanish than a few phrases and the rest I’d do through reading (plenty of time to study a menu, have to be fast to have conversation).

Now, finally, all this I’m just doing for myself, other than relating some hopefully “interesting” tidbits here in the blog, even though I’ve built many software products over my working life. But at least, as a derivative from this work, I do hope to end up with the best glossary for food terms in Spain here at this blog as my contribution to others who might need this.


More trail photos; < 100 miles to go

I was close in the previous post when I declared I’d crossed the border into Galicia, but now I do have less than 100 miles to go on my virtual hike. At the slow pace I’m doing on machines that is a couple of months.

But this post is mostly about photos in my continuing series of photos I’m finding in my personal archive of trails (or crude roads). As it’s said in the movie, “the road is among our oldest tropes”. There is something about a path that holds us, compels us to move forward on that path. So here’s the first of this series:

This is a short trail along a river we found on the way to the Natchez Trace in Mississippi. It was a pleasant walk through the woods. I don’t much like photos that include me but in this case I relented. But from behind it could be anybody.

So let’s get something a bit more visible:

This was an unexpected and quite beautiful hike in Guadalupe Mountains National Park just across the border of New Mexico into western Texas. While this photo doesn’t show the fantastic fall color we encountered, totally unexpected for such a dry place, it is one of the few pictures of me on the trail, taking photos of course. Here the trail crosses a dry riverbed that probably experiences the classic rapid flooding when there are rains. This is along the route through McKittrick Canyon which I can highly recommend, especially in the fall.

And as, I hope, the last time I do this here is another hike, this time across country on no trail at all:

This couldn’t be in a more different location. Here we’re hiking overland in the Big Snowy Mountains of Wyoming. I’ve visited this area multiple times (the nearest big mountains to my home in flat Nebraska). The interpretative signs there claim that at one point in Earth’s history these were the highest mountains on the planet.

This shot is late fall and there is even a bit of snow falling. The purpose of going cross country to “nowhere” is indicated by the invisible object I’m holding, a Garmin eTrex GPSr. We’re headed to a “dashpoint”, a completely arbitrary coordinate on the earth to try to reach if you can. Usually we reach these points with a car but this was a case where the dashpoint was on public land and thus a place where we could hike.

Actually this was a tough hike because much of the area was even more rocky than you see in this photo. Without an actual trail scrambling over rocks can be very tiring. But we found the dashpoint and returned to the car (had to drive to civilization to file our reports) and escaped the snow that closed in just after we were there.

Looking at all the photos of the Camino, the closest I’ve come to actually trekking there, it’s very pleasant, but if one seeks some beautiful country off the beaten path it’s hard to beat the USA. This isn’t some patriotic chauvinism, just a simple statement of geography. When I see the area around the Camino and realize how long people have been there, with terrain altering technology, part of the beauty of the “nowhere” in USA (or even more so in our neighbor to the north) is simply that people, at least with much technology, have been here such a brief time and thus so much of the land is only slightly altered.

In the trail I showed in the previous post, a very symbol of “civilization” (the railroad) has retreated and disappeared, and nature has reclaimed with greenery the narrow corridor where steam once prevailed, which is now fortunately a place of respite for trekkers.

So enjoy these photos because of the 30,000 I have (with a few worth posting) these are probably the only ones where I’ll be in the shot.

Made it to Galicia; Another trail picture

I’ve now pushed through 393.3 miles on my virtual Camino (i.e. treadmill in the basement) thus putting me just past O Cebreiro which is just over the border into Galicia, the last autonomous community before reaching Santiago in just about 100 miles. By “reached”, of course, I mean I’ve done the distance (from Saint-Jean-Pied-de-Port) along a GPS track of the Camino. I’d love to be doing this for real, but at least I get to “experience” some of what this trek is like, checking out restaurant menus along the way, which is the primary topic of this blog.

And at least I’ve gotten some idea of what the trek looks like. That is, by converting treadmill miles into locations and then using Google Street View to “look around”, I’ve also decided that most of the trek is in pretty boring country, not much different than what is around here. However, since reaching Ponferrada from the East where the way begins to enter the mountains the country has been much prettier. But also, interestingly, it seems that lodging and restaurants are a bit higher quality as well. I deduce that’s because most of the escorted trips along the Camino occur in this area, as only about 100km are required in order to qualify for the Compostela (diploma) and so trekkers who want a bit more luxury and a lot less walking start much closer to Santiago. Which, of course, is a “cheat” as getting there (as opposed to being there) is the whole point of the trek.

But here’s another of my trail photos, one of my favorites:

OK, so it’s a pretty ordinary looking spot and not at all spectacular. So why is it one of my favorites?

Well, it’s accessible and pleasant walking, that’s why. This is one of many bridges on the Wabash Trail, which goes from the south side of Council Bluffs, Iowa to the Missouri border. It’s one of the Rails-to-Trails recreation projects that are sponsored all over the USA.

You see when railroads were first built in the US the land was granted by either the state or the Federal government, often with a provision that if the railroad is abandoned the land reverts back to government (thus public) ownership. Now Iowa is the most intensely farmed state in the USA which means very little land is in its natural condition (and it’s all private, so no access for recreation). So this tiny corridor of “wild” for the Wabash Trail is a real jewel.

Also, though it may just be urban legend, the original trains that used this route burned coal (or even wood) and so burning embers escaped their smokestacks. As a result the railroad had a wide buffer of land to avoid setting stuff on fire. Today, given that the entire right of way is abandoned, woods have reclaimed that area, except for the trail itself.

So, even though there are farms and houses everywhere along this trail it does a good job of pretending to be wilderness. And all that plant growth creates enough shade that the trail is much cooler for walking than out in the sun (one of the obvious drawbacks of so much of the Camino, exposed to intense sunshine).

This particular photo is where I stopped for a brief rest (that’s my stuff on the bridge). The bridges were for the trains and have been reclaimed and converted for foot and bicycle traffic, which is handy, not having to ford creeks. In addition to the buffer of woods along this trail, often it is cut into the hills so the train had a level grade and that also increases the isolation.

I’ve walked almost all of this trail, although only in intervals, never end-to-end. The problem, compared to the Camino, is there are no accommodations along this trail. Even on a bike it would be hard to cover all of it in a single day and walking is a multi-day trek. While there is some access to food and drink along the trail the only way to walk all of it would be to have someone drive to meet you and take you to some overnight lodging. That kinda defeats the point of it.

This bridge is on the longest stretch I’ve done in one trip, about 15 miles, where I had someone drop me off and then meet me in the town of Malvern where we had a pleasant lunch with a couple of craft brews. I wanted to push for 20 miles but my ride wasn’t going to wait for another two hours, so this was the best I could do. Of course one other approach would be to get my ride to haul my bike down to my turnaround spot and so walk one way and bike back, but that’s a lot of trouble. So while I like hiking on this trail, a) I have to drive 30 miles to get to it, and b) the logistics of a long hike are impossible, which is part of the reason the Camino, just from the POV of hiking, is attractive.

When the rails were removed the rock bed under the rails was left and then covered with a crushed limestone aggregate. So actually the walking surface is quite pleasant. The trail is well drained so rarely muddy but it’s much “softer” walking than paved roads would be. Again, with all my StreetView studies of the Camino much of that route is NOT very good walking and certainly walking on streets and dodging cars is not my idea of a good trek.

So while this Wabash Trail may not have the history or significance or the experience of a different country I’m grateful it exists and provides some opportunity to move on foot outside instead of always in the basement on a treadmill. Of course, right now it’s buried in snow and it’s nearly 0F outside so I’ve got a month or two before I set foot on this trail again.