Back to work – lists

As I don’t have any more travel planned I can get back to work, perhaps with renewed effort. So I returned to looking at lists; I’ve found at least three, with more to go. Lists come in three kinds: plain translations of terms between Spanish and English, glossaries, and dictionaries. Dictionaries supply an actual definition, while glossaries sometimes just provide a translation (where a literal one is possible) and a definition otherwise. The Net is full of these but using them can be a challenge. Also, I’ve usually looked only at lists where the terms are Spanish but the translation or definition is in English. It’s more interesting, although more work, to get lists entirely in Spanish, and ideally ones that apply to Spain rather than anywhere Spanish is used.

So in my first attempt to build up a translation dictionary I only used lists I could find. It never dawned on me to use sources purely in Spanish, and in particular menus. But of course machine translation has advanced a lot since my V1.0 attempt years ago, so now sources entirely in Spanish, and especially ones that apply to Spain, are my primary sources.

But lists provide a lot of information in a hurry. And despite the issues they often provide terms that are unlikely to be found elsewhere. The biggest issue is that whole thing of Spanish throughout the world versus Spanish gastronomy terms for Spain. As I’ve mentioned, tortilla is common in the Western Hemisphere but is something entirely different from what you’d get in Spain, even if the menu does say tortilla patatas. Now where lists include New World terms not used in Spain, processing them is just a waste of time, at least for my purpose. But when a term’s meaning conflicts between Spain and elsewhere, that is a problem.

So I’ve been crunching through three lists. Finding more lists is a lot easier (at least until I’ve found most of them) than processing the lists, especially when the lists are entirely in Spanish. Plus some types of webpages are hard to “mine” (also known as scraping when code is doing it). Web authors design pages to be most useful for their intended audience and not for someone accumulating a corpus. And even when I’ve processed lists I have to be careful with the whole copyright issue. If I published any substantial portion of any list I find (beyond the fair use case, i.e. a small sample with attribution), that would be improper. But since my real notion is accumulating a large corpus from many sources and then basing my final translation vocabulary on a meta-analysis of those sources, I think I should be OK. Also, whenever I have a term translation from only a single source I need to be suspicious of its accuracy as well.

So thus far I’ve looked at: 1) the Gallina Blanca Diccionario, from the website of a food company in Spain that produces packaged products for the Spanish market and supplies the diccionario to aid users of the recetas it also provides; this has Spanish terms and definitions in Spanish but does not apply exclusively to Spain; 2) Nitty Grits, a glossary with Spanish terms and English definitions, not exclusive to Spain, but as I learned after crunching through most of it, each term is clickable and often (not always) then indicates where the term is used; Nitty Grits is a large list and allows me to get fairly unambiguous definitions (since they’re in English) and avoid the often incorrect machine translations (such as occurred with Gallina Blanca); and 3) ARecetas, a complex recipe website I have now returned to after doing some work on it back in May, which has multiple glossaries, especially the largest and most directly useful, Glosario de Alimentos. There are more I’ve found but haven’t yet crunched through at all. Of these, the ARecetas glosario is the hardest to process, so I only briefly looked at it in May and instead focused on Nitty Grits. But for several months Nitty Grits was not operational (at first I thought they might have blocked me but that was not the case).

Anyway, now I have more issues, having finished two of these sources and resumed work on the third. First, the way I’ve extracted information (often a tedious process) is inconsistent between the three lists (meaning the tables I created manually in MSWord). Second, my notation system was inconsistent, i.e. I annotated much of what I found with nothing to mark what is original source text and what is my annotation. These issues mean I can’t possibly consolidate the three lists manually. So I had started some code to create a consistent format across all lists (in XML, which is more robust than just text in MSWord with a few fonts and colors). I was able to do Nitty Grits fairly easily but ARecetas and GallinaBlanca are tougher, i.e. it’s not just code I need; I have to go back to the manually compiled lists and use consistent inline markup so the code can parse all entries into the common XML I want for all three lists.
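A minimal sketch of the idea in Python, not the actual code used for this project. It assumes a hypothetical inline markup where each manually compiled entry is one line of the form `term | source text | translation | annotation`, so that original source text and annotation are kept apart. The element names (`list`, `entry`, `term` and so on) are likewise invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical field order for one marked-up line:
#   term | source text | translation | annotation
FIELDS = ("term", "source_text", "translation", "annotation")

def parse_entry(line):
    """Split one pipe-delimited line into a dict keyed by FIELDS."""
    parts = [p.strip() for p in line.split("|")]
    parts += [""] * (len(FIELDS) - len(parts))  # pad missing trailing fields
    return dict(zip(FIELDS, parts))

def entries_to_xml(source_name, lines):
    """Build one <list> element holding an <entry> per non-blank line."""
    root = ET.Element("list", {"source": source_name})
    for line in lines:
        if not line.strip():
            continue
        entry = ET.SubElement(root, "entry")
        for field, value in parse_entry(line).items():
            ET.SubElement(entry, field).text = value
    return root

sample = [
    "AGUACATE | Árbol originario de América, cultivado por su fruto "
    "| Native tree of America, cultivated by its fruit | avocado",
]
root = entries_to_xml("GallinaBlanca", sample)
print(ET.tostring(root, encoding="unicode"))
```

Keeping the annotation in its own element is the point: once every list is in this shape, the source text and the later commentary can no longer be confused.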

Now I need to finish ARecetas (and perhaps also some other smaller sites I found, and also do a thorough job of searching) before moving on to the real world. Once I can convert each list, with my annotations and markup, to a consistent XML structure, I can attempt a “merge”. Once that is done I can look for agreement or disagreement between the sources (as I processed them) and start fixing errors or doing more searching to get more accurate answers (although without wasting much time on non-Spain terms).
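The merge and agreement check could be sketched roughly as below. This is only an illustration under stated assumptions: each source is reduced to (term, translation) pairs, terms are compared after lowercasing, and the sample data is invented, not taken from the real lists.

```python
from collections import defaultdict

def merge_sources(sources):
    """Group translations by lowercased term: {term: {source: translation}}."""
    merged = defaultdict(dict)
    for source, entries in sources.items():
        for term, translation in entries:
            merged[term.strip().lower()][source] = translation.strip().lower()
    return merged

def classify(merged):
    """Label each term as 'single-source', 'agree' or 'conflict'."""
    report = {}
    for term, by_source in merged.items():
        if len(by_source) == 1:
            report[term] = "single-source"
        elif len(set(by_source.values())) == 1:
            report[term] = "agree"
        else:
            report[term] = "conflict"
    return report

# Invented sample data, for illustration only.
sources = {
    "list_a": [("Tortilla", "omelette"), ("Aguacate", "avocado")],
    "list_b": [("tortilla", "flatbread"), ("aguacate", "avocado")],
    "list_c": [("arepa", "corn bread")],
}
for term, status in sorted(classify(merge_sources(sources)).items()):
    print(term, status)
```

Terms marked "conflict" are the ones worth more searching, and "single-source" terms are the ones to treat with suspicion.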

People who compile lists usually have some other work. They usually want to get their list with minimal effort to achieve their purpose. Simply put, this means they make mistakes, sometimes even blatantly obvious to simple analysis, sometimes more subtle. I’m well familiar with this from my career, a concept of “good enough”. No compilation of information is ever perfect anyway so it’s more a question of how good does it need to be for the intended purpose versus how much work (usually measured as cost since some paid person is doing the work). So online lists have many flaws. And it’s not just online lists. I’ve bought a few books about food in Spain back in my V1.0 effort and these books have inconsistencies and errors (where error means they disagree with other sources). I’ve looked and I’ve never found a “best” or even highly accurate and comprehensive source.

And that’s part of why I’m even doing this project. Unlike the other people creating materials, either free on the Net or in for-sale published works I don’t have a cost issue with my work. As I’m retired and unlikely to ever even be a temporary consultant the marginal value of my time, measured in money, is zero. Therefore I can spend an infinite amount of it trying to be as accurate and comprehensive as I can be, even (and that would be fun) doing original field research, i.e. actually going to lots of restaurants in Spain with some consultant I could hire who’d be fluent in Spanish and cooking (then the bills do add up). So at least my “free” effort is just a question of how much work I wish to put in it.

So I do believe, despite my lack of fluency in the Spanish language, that it is feasible I could compile the best list, meaning the most comprehensive and accurate. Of course my list would have mistakes too but I think it could be better than any I’ve seen. AND, if I write good code that does the bulk of the work of consolidating the raw materials for my corpus and then extracting from it, I should have an easier time making corrections, especially as my targeted application is either machine-generated webpages or a smartphone app, i.e. updates should be possible once I actually get feedback (too many sites or apps fail to take advantage of the knowledge of their users, who can provide very valuable feedback to constantly improve the product, either its usability or its underlying database of Spain culinary terminology).

So I hope to get back into it and finishing these three lists would be a critical milestone because then I can really get down to designing my corpus and the code for importing and consolidating and proofing the information in the corpus.


Quesos de España – A Great Source

I took a break from decoding menus from restaurants in Spain to look at cheeses that originate in Spain. I’ve done this type of investigation before (previously for Italy) and it’s a challenging task. Names of cheeses can be very inconsistent from different sources. Even with DOP names now more common there can still be inconsistencies.

And, of course, using any online source for raw material has the challenge that its author may be wrong, may have misspelled names or may have introduced other errors. Consolidating all the names found in different sources is difficult to automate, and at the same time it is a large quantity of information to attempt to collate mentally, especially when one is not conversant in the language.

I’ll explain my process below but in case you just want the excellent source I found I’ll describe it first, even though it was after a lot of searching I discovered it.


While it’s entirely in Spanish and as a PDF not subject to Google Translate when accessed through the web browser this is a very nice document: CATÁLOGO ELECTRÓNICO DE QUESOS DE ESPAÑA (slow to download but worth the wait).

It has pictures of the cheeses and even some of the animals for the milk plus standardized descriptions including items like: Zona de Elaboración (processing area), Ingredientes (ingredients), Tipo de Queso (cheese type), Aspecto Exterior (outward appearance) and Aspecto Interior (interior appearance).

And then even more helpful is this section, Características Organolépticas (Organoleptic  characteristics, I had to look up the English definition on this which is “acting on or involving the use of the sense organs”), which then includes: Textura al Tacto (texture to touch), Olor (odor), Textura en Boca (texture in mouth), Aroma (aroma), Sabor (flavor), Otras Sensaciones (other sensations), Gusto Residual (residual taste), Persistencia (persistence). In case you’re not sure what Gusto Residual means here it is for Gamonedo cheese (from  Principado de Asturias):

El gusto después de ser tragado es: a avellana, con predominio suave de humo (The taste after being swallowed is: a hazelnut, with soft predominance of smoke.)

And here is an example of Persistencia for Curado (cured/aged) Mahón-Menorca cheese:

Media-elevada, presencia de mantequilla fundida, aceite de oliva y caldo de carne. Entre quince y treinta segundos  (Medium-high, presence of melted butter, olive oil and meat broth. Between fifteen and thirty seconds)

In addition to this extensive, informative and attractive PDF there is another part of this site where you can filter the list of cheeses, i.e. Buscador de quesos (Cheese Finder (aka Search Engine)). The filters are: Seleccione (Select): Comunidad Autónoma (Autonomous Community), tipo de leche (milk type), calidad diferenciada, régimen de calidad (differentiated quality, quality regime).  So for example I did search for cow’s milk (leche de vaca) cheeses from Cantabria and all (todas) quality regimes and got:

Marca (mark or brand) | Tipo (type) | Procedencia Leche (Origin of milk) | Comunidad Autónoma (Autonomous Community)
Picón-Bejes-Tresviso | D.O.P. | Leche de vaca | CANTABRIA
Queso Nata de Cantabria | D.O.P. | Leche de vaca | CANTABRIA
Queso Pasiego | Sin figura de calidad comunitaria reconocida (No recognized community quality figure) | Leche de vaca | CANTABRIA

After finding the list you can click on the cheese name for the full information page equivalent to the CATÁLOGO pages. You could either use the search tool to find a cheese you might want to try (some Spanish cheeses can be obtained online) or browse the CATÁLOGO.


Back to my process for compiling a list of cheeses

But undaunted by these challenges, from past experience, I decided it was time to assemble a complete and accurate list. This only slightly matters for reading menus at restaurants and more likely would be useful for purchases at retail establishments but again knowing what you’re eating in another country is the inspiration for my project.

So I proceeded with the usual suspects, first doing several Google searches (to get the terms right to provide the best source materials) and then following several promising sources. As usual Wikipedia had a useful page List of Spanish cheeses with a fairly long list (fortunately tagged by region) with some links to pages for the more common cheeses. Having processed this list I immediately assumed the Spanish language version of Wikipedia would possibly have an even better list and it did – Quesos de España. Another seemingly authoritative source, Spanish Cheese Guide, covers all (?) of the DOP names.

From all these sources I generated a single list, which required picking a “canonical” name and then finding all the variations in the sources. For example this cheese, Arzúa-Ulloa, appeared in all my sources (compiled thus far) but, as you can see, under quite different names, even including a misspelling.

Queso Arzúa-Ulloa (P.D.O.) Galicia 1 link
Arzula Illoa 2 link
Arzúa Galicia 3
Arzúa-Ulloa Galicia 5 link
Arzúa-Ulloa Galicia 6 link
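Part of this matching can be roughed out in code. The sketch below is an assumption about how one might do it, not the process actually used here: it strips accents, the word “queso” and parenthesized tags such as (P.D.O.), then uses Python’s standard `difflib.get_close_matches` to find the nearest canonical name. The 0.6 cutoff is an arbitrary starting point, and any match it proposes still needs a human check.

```python
import difflib
import re
import unicodedata

def normalize(name):
    """Lowercase, drop accents, '(...)' tags, hyphens and the word 'queso'."""
    name = re.sub(r"\([^)]*\)", " ", name)
    name = unicodedata.normalize("NFKD", name)
    name = "".join(ch for ch in name if not unicodedata.combining(ch))
    name = name.lower().replace("-", " ")
    return " ".join(w for w in name.split() if w != "queso")

def match_canonical(variant, canonical_names, cutoff=0.6):
    """Return the closest canonical name, or None if nothing is close enough."""
    by_key = {normalize(c): c for c in canonical_names}
    hits = difflib.get_close_matches(
        normalize(variant), list(by_key), n=1, cutoff=cutoff
    )
    return by_key[hits[0]] if hits else None

canonical = ["Arzúa-Ulloa", "Cabrales", "Tetilla"]
for variant in ["Queso Arzúa-Ulloa (P.D.O.)", "Arzula Illoa", "Arzúa"]:
    print(variant, "->", match_canonical(variant, canonical))
```

A short form like “Arzúa” only just clears the cutoff, which is a reminder that fuzzy matching can suggest candidates but cannot decide whether two names are really the same cheese.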

So after consolidating the list from five sources and choosing what appears to be the “standard” name (for those cheeses that appear on more than one list) here is what I believe is a fairly comprehensive list:

Abredo, Acehúche, Afuega’l Pitu, Ahumado de Pría, Alhama de Granada, Alpujarras, Andalucía de cabra, Ansó-Hecho, Aracena, Arribes de Salamanca, Arzúa-Ulloa, Babia y Laciana, Barros, Benasque, Beyos, Buelles, Burgos, Cabrales, Cáceres, Cádiz, Camerano, Campo Real, Campoo-Los Valles, Casín, Cassoleta, Castellano, Cebreiro, Colmenar Viejo, Flor de Guía, Fresnedillas de la Oliva, Gamonedo, Garrotxa, Gata-Hurdes, Gaztazarra, Genestoso, Gran Canaria, Grazalema, Guriezo, Herreño, Ibores, Idiazábal, L’alt Urgell y La Cerdanya, La Adrada, La Bureba, La Calahorra, La Gomera, La Montaña de León, La Nucía, La Peral, La Serena, La Siberia, La Sierra de Espadán, La Vera, Lanzarote, Letur, Los Montes de Toledo, Mahón-Menorca, Majorero, Málaga, Mallorquí, Manchego, Mató, Miraflores, Montsec, Murcia, Murcia al vino, Nata de Cantabria, Oropesa, Oscos, Ossera, Palmero, Pasiego, Pastor, Pata de mulo, Pedroches, Peñamellera, Picón Bejes-Tresviso, Pido, Quesaílla, Quesucos de Liébana, Requeixo, Roncal, San Simón da Costa, Serrat, Servilleta, Sierra Morena, Tenerife, Teruel, Tetilla, Tiétar, Torremocha del Jarama, Torta del Casar, Trapo, Tronchón, Tupí, Urbiés, Valdeón, Valle de Alcudia, Valle del Narcea, Vidiago, Villalón, Zamorano

There are around 30 more where I’ve found at least one mention but I’ll have to search for each of these individually (once I have the complete list) to see if these cheeses really exist (at least currently) or are just a spurious mention in some online list.

A few terms from ensaladas

I’m continuing to extract terms from a large set of recetas, having switched from postres (desserts) to ensaladas (salads).  Now thinking about salads there is a lot more diversity than merely leafy green stuff with some dressing so this is another lode of terms to find and add to my corpus. So here are a few fragments I’ve found:

Ensalada de verdinas con perdiz escabechada, receta fácil (Salad of verdinas with pickled partridge, easy recipe)

As usual terms that Google Translate doesn’t translate or has silly answers catch my attention, so what are verdinas? Oxford has an entry that translates to ‘moss’ and it’s plausible a salad might include moss. But this is what makes this source so useful, it’s not just titles of dishes, but the full recipe (ingredients and instructions) and a photo of the dish. In this case the photo reveals the clue to verdinas, showing a bag of alubia verdina which are called Verdinas De Nuestra Tierra in the ingredient list. IOW, since I’ve seen alubia often this is just a specific type of bean (visible in the photo) described here.

So moving on:

Remojón granadino, receta fácil para el Verano (Remodo granadino, easy recipe for summer)

Why Google Translate translates remojón to ‘remodo’ remains a mystery as I can’t find any association. Oxford literally translates remojón to ‘soaking’ and granadino to ‘of Granada’, which doesn’t help much. Fortunately this has no English equivalent but is

a specific recipe with oranges, cod, onions, tomatoes and olives, soaked in olive oil for at least four hours.

so an item like this has to be entered in my corpus with a “description” rather than a translation.

And finally:

Salpicón de bogavante con vinagreta de su coral (Lobster salty with vinaigrette of its coral)

So we have two mysteries here: 1) what is a ‘salty’ (presumably the translation of salpicón), and, 2) what is ‘its coral’ (untranslated from coral in the Spanish)?

salpicón is the easier one since it’s a particular preparation of “chopped seafood or meat with onion, tomato and peppers” described here so ‘salty’ is a mysterious translation and inaccurate.

Salpicon (or salpicón, meaning “hodgepodge” or “medley” in Spanish) is a dish of one or more ingredients diced or minced and bound with a sauce or liquid.

But to figure out coral required looking at the recipe which fortunately describes it thusly:

the contents of the inside of the head (of the lobster) and the dark colored matter that is full of flavor

While I couldn’t find any English equivalent for coral (or any definition that matches the recipe) I believe this is a delicacy that some adventuresome foodies like. Now I’ve used the heads of shrimp and their shells to make stock so I suppose this is the same, but this sounds pretty yucky to me, which means if I had this salad and quite possibly enjoyed it I’d rather not know what coral is.

As the last tidbit the recipe text also includes two interesting terms:

  1. brutal bogavante which Google translated to ‘brutal lobster’. What’s this, some lobster with monster claws that fights back? Actually Oxford did explain that brutal has a colloquial meaning of ‘incredible’ or ‘amazing’, which is a lot more appealing (and a reasonable guess at the translation).
  2. and un platazo which didn’t appear in any dictionary but was found by search in an obscure (scanned) old text as ‘great dish’ which does fit the rest of the context so also is a likely translation.

These “guesses” I sometimes make have some amount of likelihood of being correct. I’m fairly certain of something like verdinas as a type of bean, but it is a guess and therefore has to be entered in my corpus with some uncertainty. And brutal and platazo have even less authoritative evidence and so would have higher uncertainty. The English that Google Translate gives for the Spanish also cannot be viewed as “certain”. Probably only translations appearing in one of the authoritative dictionaries can be entered as p=0.999 in the corpus. So getting as much volume as possible, so that every term in the corpus has multiple instances, will be key to getting the best possible translation dataset.
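One way to picture how multiple instances could raise confidence is the small sketch below. Everything in it is an assumption made for illustration: only the 0.999 for authoritative dictionaries comes from the text above, the other per-evidence values are invented, and the combining rule (treating each agreeing observation as independent, so the chance that all are wrong is the product of their individual error chances) is just one simple choice, not a method this project has settled on.

```python
from math import prod

# Only the dictionary value comes from the post; the rest are invented.
CONFIDENCE = {
    "dictionary": 0.999,
    "photo_or_context": 0.8,
    "machine_translation": 0.7,
    "guess": 0.5,
}

def combined_confidence(evidence_kinds):
    """Probability that at least one agreeing observation is right,
    assuming (strongly) that the observations are independent."""
    return 1.0 - prod(1.0 - CONFIDENCE[kind] for kind in evidence_kinds)

# verdinas: seen in a photo and in an ingredient list (two context clues)
print(combined_confidence(["photo_or_context", "photo_or_context"]))
# platazo: a single guess from an obscure scanned text
print(combined_confidence(["guess"]))
```

Sources that copy from each other are not independent, so a rule like this overstates confidence whenever several lists share one origin.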

 

Made it to Logroño; and new sources

In this blog I’m reporting on my progress (and challenges) of assembling a complete database of all food terms needed to read menus in restaurants in Spain. Once I have that database I’ll convert it to several form factors, especially something usable like a smartphone app one could carry while in Spain.

I need a geographical focus to attempt this and so I’m doing a virtual hike along the Camino de Santiago (mostly in northern Spain, the so-called “French Route”). This pilgrimage route, which is also simply a popular trekking route, has been made famous in several milestone books and especially by my favorite, The Way, a wonderful movie.

While I’m focusing on learning about food I also have the challenge of losing some of the weight that my interest in food has put on my body, so I do a lot of workouts in my basement. In particular I do miles on a treadmill (not very many per day, as I do far more on a stationary bike, but still fairly regularly). Anyone who has tried this knows it’s quite boring and hard to stick with the routine. So I learned a long time ago to turn this into something “fun”, i.e. carefully record mileage on the treadmill, obtain a detailed GPS track and mileage log for some trail (my first version was the Pacific Coast Trail) and convert my treadmill distance into a location along the trail, THEN, using the wonderful Google map tools (and especially StreetView, which does (mostly) exist for the Camino, unlike a wilderness trail), actually “see” where I’m visiting. And for this project the points of interest on Google maps include restaurants, food stores, bars and bodegas (as well as hotels and albergues where the peregrinos might stay). For any of those food POIs that have web pages with any form of information, e.g. menus or recipes, I process these to get side-by-side Iberian Spanish and English equivalents (noting interesting translation issues in posts in this blog), which I’ll be using as a corpus to feed my software to build my food terminology database (not purely a literal word-by-word translation dictionary but far more).
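The treadmill-to-trail conversion can be sketched in a few lines. This is an illustration under assumptions, not the spreadsheet or code actually used: the track is taken to be a list of (cumulative miles, latitude, longitude) waypoints in increasing mileage order, the position between two waypoints is found by straight-line interpolation, and the coordinates in the example are rough made-up values.

```python
from bisect import bisect_right

def locate(track, miles_walked):
    """Return (lat, lon) for a cumulative distance along the track.

    track: list of (cumulative_miles, lat, lon), strictly increasing in miles.
    """
    miles = [m for m, _, _ in track]
    if miles_walked <= miles[0]:
        return track[0][1:]
    if miles_walked >= miles[-1]:
        return track[-1][1:]
    i = bisect_right(miles, miles_walked)  # first waypoint beyond the distance
    m0, lat0, lon0 = track[i - 1]
    m1, lat1, lon1 = track[i]
    f = (miles_walked - m0) / (m1 - m0)    # fraction of the way to the next one
    return (lat0 + f * (lat1 - lat0), lon0 + f * (lon1 - lon0))

# Made-up waypoints, for illustration only.
track = [(0.0, 42.80, -1.65), (12.0, 42.67, -1.81), (25.0, 42.67, -2.03)]
print(locate(track, 18.5))
```

With a dense GPS track the straight-line step between neighbouring waypoints is short enough that the resulting point can be pasted straight into a map search.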

So that’s a review of what I’m doing, and this post is about a distance milestone along my trek. According to my guide the total trek is 494.86 miles (starting at the common starting location, St.-Jean-Pied-de-Port in France). I started my detailed log (and thus virtual trek) just past Pamplona. I’ve previously noted a few milestones but today I reached (and passed through) one of the larger towns, Logroño (the capital of the province (also the autonomous community) of La Rioja, the premium wine region of Spain). Rick Steves calls this city “forgettable” but I’m pleased to have reached it.

The city is large enough that I’ve found numerous restaurants with web pages and menus so now that will be another new source to extract.

This also means I have left Navarra and am thus gradually getting farther from the Basque influence that has caused some interesting translation issues as I was examining restaurants influenced by Basque cuisine and vocabulary. But I’m not done hitting unique languages, as at least two more regional languages lie ahead that will surely provide some new challenges.

Too bad my trek is only virtual as I would have liked to have done this trek for real. Although I’d want to skip high summer since most of the Camino is in open country with little shade and I suspect that could be dreary in Spain’s summer heat.

Adventures with sources

Note: I’ve been involved in a particular part of this project for nearly a week and have material for numerous posts but due to holiday activities haven’t had the time to make any posts, so now I’ll just provide the background for some future posts when I catch up.

My project involves building up a large corpus of Spanish food items and descriptions (aka preparations, recipes)  from menus of actual restaurants in Spain and then from that corpus deducing (ideally with some AI-ish code) a Spanish-to-English “vocabulary” of food terms/phrases/concepts, not just for Spanish in general, but specific to Iberian Spanish (thus avoiding many terms only found in Western Hemisphere, or worse, that mean something different in Spain than elsewhere (e.g. tortilla)).

So I was doing my virtual walk along the Camino de Santiago and finding restaurants via POIs on Google Maps and then grabbing those that had websites and online menus to extract for my corpus, a tedious process but one that produces good “raw” data plus gives me a chance to actually internalize some Spain food knowledge (vs pure AI approach that Google uses). In doing this I went off on two levels of distraction. First, I noticed a grocery store (supermercado) on the map that then had a website and as it turns out, also, online ordering. This was a great source since it had photos of the available products and thus the opportunity to get equivalent English words for the Spanish names of food items PLUS this would be oriented to Spain, not general Spanish. Second, in doing this I stumbled on to another business that is selling food online, Gallina Blanca, which I learned was a rather large multinational supplier to restaurants and homecooks. On their website they have a large number of recipes (recetas) which if I get decent translations would be a large source for my corpus, not just ingredients, but also preparation techniques.

But then I discovered an interesting item at the bottom of the webpage, a link to a diccionario.  I thought this would be a real bonanza of food terms but quickly learned: a) it is a real pain to extract information just due to the mechanics of how the webpage is built (i.e. mostly javascript, not HTML, therefore nothing to “copy” (with mouse selection) despite seeing it on the screen), and, b) then as I tediously did begin to extract some information and began to work out a process I realized ‘dictionary’, in this case, meant something different than I thought. I’m used to finding “translation dictionaries” online and referring to these just as ‘dictionary’. But at Gallina Blanca’s site they have the classic notion of a dictionary, i.e. a word (or term) and its definition, in Spanish. What I was expecting, naively, was the English equivalent for Spanish food terms and it turns out I have to do a lot more work to get that. So, for example,

AGUACATE Árbol originario de América, cultivado por su fruto, de pulpa espesa y perfumada. Muy usado para ensaladas, salsas y sopas. Native tree of America, cultivated by its fruit, of thick and fragrant pulp. Widely used for salads, sauces and soups.

Note: This is a good example of how there are enough cognates in the Spanish definition that it is possible, plus knowing only a few rules, to assign the word-by-word correspondence between Spanish and English for the corpus even without any significant knowledge of Spanish.

The actual dictionary gives me the definition in Spanish of aguacate. I used spanishdict.com’s translate function to get the English from the Spanish definition.  BUT, this didn’t quite get me the same thing as finding that aguacate literally translates to avocado (which is itself a loanword in English) which then gives me a much better notion of what aguacate really means (especially in context of a restaurant menu) than the dictionary definition.

But in other cases having the definition is handy, especially in comparison to the literal translations in some dictionaries. For instance,

ABRILLANTAR Dar brillo a cualquier preparado con jalea, gelatina, grasa, o pintando con huevo la superficie de un manjar antes de meter al horno o de presentarlo. Give shine to any preparation with jelly, jelly, fat, or by painting with egg the surface of a delicacy before putting it in the oven or presenting it.

Brighten any preparation with jelly, jelly, grease, or painting with egg the surface of a delicacy before putting it in the oven or presenting it.

For this word I got both the spanishdict.com translation (first one) and the Google translation (second one). abrillantar is obviously a verb and literally means ‘to polish’  (which one might guess) but its meaning in the cooking sense is better explained by the definition provided from this website, which I’d probably translate simply as ‘glaze’ even though glasear is the Spanish verb for that.

So after some time I have steadily refined my process (and streamlined it a bit, actually learning a semi-hidden feature of MSWord to reduce number of manual steps per entry) and begun to realize what I can really learn from the tedious process of crunching through a large number of terms (gracias, GallinaBlanca). And several of those entries will be the basis of some future posts.

But I have been curious about attempting to discover the source and range of this dictionary. GallinaBlanca doesn’t say anything (that I can find) about where they obtained this dictionary. So I face a classic problem I had in earlier versions of this project of getting terms that really are for Spain, not somewhere else in Spanish-speaking world where those terms might not be used or understood in a restaurant in Spain. But without any explanation of this dictionary this is a guessing process for me, but occasionally I get clues. For example,

AREPA Pan de maíz amasado con huevos y manteca. Corn bread kneaded with eggs and lard.

Wikipedia has a good article on arepa that makes it fairly clear this is something common in Colombia or Venezuela but doesn’t even mention Spain. This is just one clue (I’ve had a few others in my work thus far but none as clear as this one, yet) and so I expect to find more as I work through and thus, hopefully, determine whether this dictionary fully applies to Spain (whose Spanish, btw, I learned (I think) is best referred to as ‘Iberian’ or ‘peninsular’, and not Castilian or castellano, since that irritates some people).

Also given there is no explanation of the source of this dictionary at the website I also have questions whether it is accurate. There is a clear error in the javascript – each page of the dictionary (by letter) is obtained by clicking on the letter in an A B …Z bar except that starting at ‘O’ it’s off by one position (‘O’ gets you words starting with ‘N’ and so forth). IOW, a simple human error. And one error makes me question if there are others. So, for example

ACIDELAR Poner zumo de limón o vinagre en el agua para cocinar huevos escalfados o verduras, para que no ennegrezcan. Put lemon juice or vinegar in the water to cook poached eggs or vegetables, so that they do not blackened.

Put lemon juice or vinegar in the water to cook poached eggs or vegetables, so they do not blacken.

I couldn’t find acidelar in any online dictionary but did find this

ACIDULAR (literally: acidulate, make sour) Rociar con un líquido ácido frutas, verduras u hortalizas, con el fin de que conserven su blancura o color. Sprinkle with an acidic liquid fruit, vegetables or vegetables so that they retain their whiteness or colour.

Spray fruit, vegetables or vegetables with an acidic liquid, in order to preserve their whiteness or color.

so since the definitions seem mostly the same is the ‘e’ really supposed to be ‘u’ and whoever composed this dictionary just made a typo?

But the most interesting mystery (thus far) along these lines was

ALBARICO Especie de Palma (Bractis setulosa). Species of palm (Bractis setulosa).

I couldn’t find albarico in any online dictionary BUT there is the very similar term (in the GallinaBlanca dictionary and elsewhere) which is

ALBARICOQUE (literally: apricot, probably the fruit, not the tree itself) Fruto de albaricoquero, de hueso liso y piel y carne amarillas. Albaricoquero. Apricot fruit, smooth bone and yellow skin and flesh. Apricot.

So while the descriptions are quite different I thought, perhaps, albarico was simply some short form of albaricoque. But this is wrong.

In some cases words in Spanish, especially for plants, are derived from the Latin scientific nomenclature so I tried to look up Bractis setulosa, assuming that (from the definition) was the Latin name. No results. So I tried bractis alone (found nothing that made sense) and then setulosa. A-ha! There was a valid entry, that seemed to match the definition, for bactris setulosa AND this is a tree, according to Wikipedia a “spiny palm which is found in Colombia, Venezuela, Ecuador, Peru, Trinidad and Tobago and Suriname” (also evidence this dictionary is not for Iberian Spanish). So that looks like a very plausible human typo, moving the ‘r’ from the second syllable where it should be to the first syllable where it is wrong (in the sense nothing can be found for that spelling). I felt this was a good piece of detective work on my part to spot this AND it represents fairly good proof there are mistakes in this dictionary. And where there is one there may be more.
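This kind of detective work can be partly mechanized. The sketch below is only an illustration: it assumes a reference word list is available (here a tiny hand-made set standing in for a real dictionary or species list) and uses Python’s standard `difflib.get_close_matches` to propose near-spellings for any headword that is not found. The 0.8 cutoff is an arbitrary choice.

```python
import difflib

def suggest(word, reference, cutoff=0.8):
    """Return near-spellings from reference, or [] if the word is already known."""
    word = word.lower()
    if word in reference:
        return []
    return difflib.get_close_matches(word, sorted(reference), n=3, cutoff=cutoff)

# Tiny stand-in for a real reference list.
reference = {"bactris", "acidular", "albaricoque"}

for headword in ["bractis", "acidelar", "albarico", "albaricoque"]:
    print(headword, "->", suggest(headword, reference))
```

Note that albarico is offered albaricoque as a candidate even though that turned out to be a false lead, so a suggestion is only a place to start looking and never a correction by itself.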

So I’m still learning all sorts of interesting things from my slow plodding through all these entries (and will do some more posts) but all this work shows the challenge of trying to get an accurate (and even harder, complete) translation dictionary for Iberian Spanish.

A couple of interesting new sources

For the most part I started collecting my corpus of dual Spanish (Spain) / English words or phrases from menus I find online of restaurants that are definitely in Spain (so avoiding other variations of Spanish in other parts of the world). It’s a tedious process to dig out the menus and create side-by-side tables in MSWord. But the slow and tedious process also allows me to learn (i.e. actual human intelligence vs Google’s AI approach) something that I’d miss with a more automated process.

And as I’ve mentioned my choice of restaurants to research comes from my virtual tour of the Camino de Santiago where I plot my cumulative mileage on a treadmill in my basement to actual waypoints along the trail. Given Google does a nice job of annotating various points of interest, esp. restaurants, I can find those that have menus online.

Fine, but recently I realized I can expand my sources for the corpus a bit more. Just out of curiosity I explored a link to a large grocery chain (BM SUPERMERCADOS) in Spain that happened to have an outlet in Estella. Exploring that website I found the Compra Online link (Google translates to ‘online shopping’). And that part of the website has a large list of products one can purchase online (usually with pictures; and in categories) so a side-by-side translation corpus can be created. Some brand names can also be learned, to subtract out of other menus where the brand name doesn’t translate and is therefore confusing.

But then I found something even more interesting, again by accident. This is a real jewel, https://www.gallinablanca.es/recetas/. This is a large collection of recipes (recetas) which means lots of instructions full of cooking terms plus lots about ingredients. I’ve only just begun to explore this site but I also found it has a Diccionario (I think you can guess this as a cognate), truly a dictionary in that you click a word and a definition pops up, in Spanish (no English and Google Translate doesn’t work in these popups, so lots of fun to copy-and-paste the definition into a translation site). The website is produced by Gallina Blanca, which appears to be the maker (or brand) of various packaged food products which are also on sale at this site. There is a lot of food information here – too bad they don’t do an English version of the site so I’d get a better translation than Google. It’s a huge site as witnessed by its search results for ‘huevos’, 7,909 results!

And finally (and I’ll do a separate post on this) I found some food terminology that isn’t directly related to menus but can be used to supplement my corpus. Juice&World in Villatuerta is the manufacturer and distributor of various bottled drinks and they have their product list in both Spanish and English so I can obtain their translations (which, btw, doesn’t guarantee they do it any better than Google but hopefully they do). But you get things like this to cut up to put in the side-by-side corpus:

De la mezcla de zumo de lima, naranja y limón, con un toque de hierbabuena y menta, hemos creado esta bebida sin alcohol dando un estilo personal a la tradicional bebida cubana
We have created this non-alcoholic drink from a mixture of lime, orange and lemon juice with a touch of spearmint and mint to give a personal style to the traditional Cuban drink

Now even though I don’t know Spanish I’ve done enough fiddling to figure out how to associate bits of the Spanish with their connected bit of English, like (easy) lima (lime, obvious cognate), naranja (orange, I happen to remember that) and limón (lemon, obvious cognate). Less obvious is hierbabuena, which translates to spearmint here even though spanishdict.com merely has its translation as mint. The clue is y menta, which ties to and mint in the translation, so the word before y must be the spearmint.

Interestingly it took a little remembering that adjectives follow nouns (often) and thus non-alcoholic drink is bebida sin alcohol.

This muddling through pieces of text with some sort of translation and with lookups, plus at least short-term memory, is actually part of my learning experience. If I had the time to do this all day long (and I have tons of source material for that, already way behind on my inventory of links just from Estella alone and I really haven’t had the chance to do Pamplona, an even bigger list) I probably would know a lot of Spanish just from all the repetitive work that does help to burn words (plus a little structure of the language) into one’s brain.

Note: Added after original post. I was trying to locate the grocery I mentioned above on Google maps and instead ended up with this one, Dia, also in Estella. This gave me another interesting idea about confusing translations. Their online shopping is in categories so I was looking at pescado y marisco (fish and shellfish (or sometimes just a generic term for any seafood)). On that page there are images and everything is either fish or some seafood, except tubo de pota which Google amusingly translated as ‘potato tube’. Since I’d just earlier been looking at potato options I wondered what a tube of potato might be (there is more to this story). In the image associated with this item it sure looks like the body of a squid and it is labeled tubo de pota on the package. spanishdict.com fairly quickly resolves the silliness of Google’s translation by indicating pota is cuttlefish (the reverse lookup for ‘squid’ yields calamar, an obvious cognate to Italian, but I have a hard time seeing any difference).

But based on only a single source is this information (Google translated):

They are selling a cephalopod of lesser gastronomic value than the squid that we appreciate,

The squid or giant squid, also known as luras in Galicia or cuttlefish in South America (although the cuttlefish is actually cuttlefish in our country, and is called choco when its size is like that of the palm of the hand), it constitutes several species, such as the common pota (Todarodes sagittatus), the flying squid (Illex coindetii), which is small in size, or the Argentine squid (Illex argentinus), which is granted greater quality.

Amusingly Google translated this article as “difference between squid and squid” given my query was ‘difference between pota and calamar’. It’s hard to say from a single source this is a correct distinction but it sounds good. Which then raises another issue – mislabeling of ingredients on menus. If one were concerned about this I suppose this is another reason to actually learn to speak and hear Spanish so one can query the server whether your menu item is the lesser cuttlefish or superior squid.

Note2: My other story was another stab at attempting to determine what patata fritas are (mentioned in earlier post). So, this grocery store has a convenient search so in went patata fritas and I got multiple pages of hits: mostly potato chips (including good old Lays) but also frozen potato wedges (kinda like steak fries, probably the closest to the literal translation) and also numerous frozen French fries (some with English on the packages, e.g. ‘frites’, ‘golden long’, and ‘wedges’). So this didn’t help any but it seems clear that if you want fries with your lunch you need to ask the server whether you’ll get chips or fries and I have no idea how to do that with minimal Spanish fluency.

Starting a new collection today

Even though I have a backlog of ideas for posts I need to make here to introduce this project, time marches on and I have something on my mind right now. That’s part of what I expect to do with this blog (Dear Reader, you’ll love it) – use it for fleshing out ideas or reviewing work I’m already doing to improve it, with your patient feedback, of course. Now actually I don’t need feedback since I learned long ago that merely writing up details about what I’m doing gives me a moment to emerge from the work and pause and think about what I’m doing.

I’ve probably lost you with what I just said because I haven’t gotten around to a post on my methodology for this project. So here it is. Basically I’m trying to collect a statistically significant corpus of side-by-side Spanish and English menu items directly from real menus in restaurants in Spain. Once I have that, with some annotation, I will feed that data into a program (I’ve done this before but I’ll write a newer and improved one) to generate the final answer, a list of words, phrases, “concepts” (meaning generic foodie info) with some confidence level that I have a proper translation. And, as per another post I haven’t yet written, Spain requires tagging all the vocabulary with regional information since different words (or even the different languages in use in Spain) are used to describe the same thing, so reading a menu requires knowing some of these regional differences.
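To make the “confidence level” idea a little more concrete, here is a minimal sketch. It is not my actual program: the record layout, the source names and the summarize function are all assumptions made up for illustration, and “confidence” is reduced to how many independent sources agree on a translation.

```python
from collections import defaultdict

# Hypothetical corpus records: (spanish_term, english_translation, source, region)
RECORDS = [
    ("naranja", "orange", "menu-estella-1", "Navarra"),
    ("naranja", "orange", "juice-producer", "Navarra"),
    ("hierbabuena", "spearmint", "juice-producer", "Navarra"),
    ("hierbabuena", "mint", "online-dictionary", "unknown"),
    ("limón", "lemon", "menu-estella-1", "Navarra"),
]


def summarize(records):
    """For each term, pick the best-supported translation and score it."""
    # term -> translation -> set of sources that gave that translation
    support = defaultdict(lambda: defaultdict(set))
    for term, translation, source, _region in records:
        support[term][translation].add(source)

    summary = {}
    for term, by_translation in support.items():
        all_sources = set().union(*by_translation.values())
        # Most sources wins; ties are broken alphabetically (arbitrary).
        best = max(by_translation, key=lambda t: (len(by_translation[t]), t))
        summary[term] = {
            "translation": best,
            "sources": len(all_sources),
            "agreement": len(by_translation[best]) / len(all_sources),
            "single_source": len(all_sources) == 1,
        }
    return summary


if __name__ == "__main__":
    for term, info in summarize(RECORDS).items():
        print(term, info)
```

Anything flagged single_source, or with low agreement, is a term to be suspicious of until more menus confirm it.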

So to create the corpus I basically use some approach to finding websites for restaurants and manually clipping the information (tedious) into an MSWord file with a consistent format and structure. Ideally my translations will come from the site but most of the restaurants I’m using for source material are local without much concern for tourists; besides, often the English menu in a Spanish restaurant is not the same anyway, certainly not line-by-line. As I do any particular menu most of it has a simple translation, word-by-word, that I can understand despite not speaking Spanish (not entirely true, this is the third time I’ve done this and a little of the vocabulary and a tiny bit of the structure of Spanish have stuck with me).

So with all that introduction let me move on with the main point of this post.

Where am I finding restaurant menus to study? There are lots of ways but I’m also doing another crazy thing which is converting miles I spend on boring exercise on a treadmill to a virtual hike along the Camino de Santiago (which I’ve dreamed about actually doing). I found a good GPS trace of that walking route and a good list of mileage (or kilomete-age) of towns along The Way. So when I “get” to a town I check the Google map for lodging (as though I needed a bed for the night, but also to see what food albergues might have) and restaurants. Most of these, in small towns, are clearly for the locals, but also the tourist traffic along the Camino, which is entirely a different crowd than other tourist hotspots in Spain. Just looking at the photos many of these restaurants are too modern, too influenced by the homogenizing culture of TV and the Net, to be called “traditional” but it does seem that the people who run them and especially pick the menus are more traditional than some fancy FoodTV (or any best of in cuisine lists) I might find in Barcelona or Madrid. So they’re good enough to be source material for my corpus.

I started this project (and my virtual hike) just before reaching Pamplona along the Camino. On the virtual hike I had fun “exploring” Pamplona and began investigating restaurants, just for fun (especially to look at the photos and reviews). Then it dawned on me (on the outskirts of Pamplona, at Cizur Menor) to start v3.0 of my Spanish Food Terms project, which, obviously, I’ve done twice before. Each time I did that I learned things that have refined my methodology, and the idea of needing the largest possible corpus of actual Spanish menu terms (not lists I find on the Net) is a superior approach to creating my v3.0 vocabulary raw data.

So I’ve been doing this for a few days and now have exhausted all the restaurants between Cizur Menor and Puente la Reina along (or near) the Camino. My virtual hike isn’t going to take me to many restaurants in the next few weeks so I’m backtracking to extract menus from those in Pamplona (thus staying with Basque-influenced terminology rather than picking some other tourist city).

There are quite a few restaurants in Pamplona (big surprise, it is a major tourist destination) but I worry about the authenticity of their menus. Clearly any international destination is influenced by the globalist foodie explosion and, like any other place, menus (cuisine) are becoming more “fusion” and thus, while influenced by traditional cuisine, they are also becoming heavily modified, modernized, globalist, and innovative. Fun for dining, but this is one of those biases, in creating a corpus, that my software must recognize. Things found in menus in Pamplona may not exist elsewhere in Spain, or may be contradictory in their vocabulary, or may not include more “common” food found in local (e.g. rural) restaurants. IOW, the menus of many of these places may be written to appeal to foodie tourists and thus represent inaccurate information. But knowing this (and segregating this data) provides the opportunity for my software, working on a large annotated corpus, to make some decisions about each term, i.e. xyz is only used in Asian-influenced fusion restaurants in Pamplona and is something else entirely in Obano.

So that’s the main reason I’m starting a new file but there is also a practical one. I use quite a few MSWord features in creating my corpus (then use the RTF to feed my software so I can distinguish metatext and annotation from raw data) and consequently the files get large (with only a relatively small amount of menu data) and especially slow – not to mention often they seem to break and I lose stuff (due to some interesting architectural issues in Office, which I happen to know from my professional career). So starting a new file, just for Pamplona, works faster and easier than continuing to grow my first file. I’m going to have to export all this information from MSWord and then import it into my software, where undoubtedly errors will be detected since my accumulation of data is entirely manual and it will take several passes to get true consistency in my data (which ideally would be carefully created XML, but that’s more difficult to create in MSWord than a “document”, mostly using styles to tag context).

So this is just some techie digression of what I’m about to start today. Dear Reader, this may bore you even more than my other posts, but it’s useful to me to have a “lab notebook” as I developed that discipline, long ago, to improve my process in any project.

Undoubtedly this extraction of information from restaurants in Pamplona is bound to generate ideas for new posts, so you’ll be hearing more about this.

btw: I’ve used WordPress before and the ability to have multiple pages so one of these days I’ll get around to more fully setting up this blog and add some pages to provide information outside the flow of blog posts. One in particular I plan to do is Word of the Day (probably not every day) to explain some of the more interesting terms I find while doing this raw data extraction (and partial annotation) phase of this project.

btw2: At times it will be worth it (to me, maybe to you) to break out of my pattern of creating the menu corpus and actually create some “lists”, like all the terms for vegetables. This was my older approach and it became very problematic for Spain due to: a) regional differences in Spain, and, b) far worse, “Spanish” is of course the language used in much of the world and guess what – food term word lists from Puerto Rico are way, way different than those in Spain and basically my insufficient tagging (during v2.0) eventually made that project useless (now I know better, sources are critical to retain all the way to the final compilation of the list to resolve these differences).
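As a rough illustration of why the tagging matters, here is a small sketch, again with made-up entries and a hypothetical regional_conflicts function rather than anything from my v2.0 or v3.0 code. Each entry keeps its region and source, and the function reports terms whose meaning differs depending on where the entry came from, using the tortilla case as the example.

```python
from collections import defaultdict

# Hypothetical tagged entries: (term, meaning, region, source)
ENTRIES = [
    ("tortilla", "omelette", "Spain", "menu-pamplona-1"),
    ("tortilla", "omelette", "Spain", "menu-estella-2"),
    ("tortilla", "flatbread", "Mexico", "word-list-a"),
    ("queso", "cheese", "Spain", "menu-estella-2"),
    ("queso", "cheese", "Mexico", "word-list-a"),
]


def regional_conflicts(entries):
    """Return terms that carry more than one meaning, grouped by region."""
    # term -> region -> set of meanings seen for that region
    meanings = defaultdict(lambda: defaultdict(set))
    for term, meaning, region, _source in entries:
        meanings[term][region].add(meaning)

    conflicts = {}
    for term, by_region in meanings.items():
        distinct = set().union(*by_region.values())
        if len(distinct) > 1:
            conflicts[term] = {r: sorted(m) for r, m in by_region.items()}
    return conflicts


if __name__ == "__main__":
    print(regional_conflicts(ENTRIES))
```

Without the region and source columns the two tortilla meanings would simply collide, which is more or less what sank v2.0.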

So onward, my friends.