Back to work – lists

As I don’t have any more travel planned I can get back to work, perhaps with a renewed effort. So I returned to looking at lists, at least three I’ve found and with more to go. Lists come as: just translation of terms in English and Spanish, glossaries and dictionaries where dictionaries supply an actual definition and glossaries sometimes just provide translation (where literal is possible) or definition otherwise. The Net is full of these but using them can be a challenge. Also I’ve usually looked only at these lists where the terms are Spanish but the translation or definition is in English. It’s more interesting, although more work, to get the lists entirely in Spanish. And ideally as apply to Spain rather than anywhere Spanish is used.

So in my first attempt to build up a translation dictionary I only used lists I could find. It never dawned on me to use purely sources in Spanish and in particular menus, but of course machine translation has advanced a lot since my V1.0 attempt years ago so now sources entirely in Spanish and especially as applied to Spain are my primary sources.

But lists provide a lot information in a hurry. And despite the issues they often provide terms that are unlikely to be found elsewhere. But the biggest issue is that whole thing of Spanish throughout the world versus Spanish gastronomy terms for Spain. As I’ve mentioned tortilla is common in western hemisphere but something entirely different than you’d get in Spain even if the menu does say tortilla patatas. Now where lists might include New World terms not used in Spain it’s just a waste of time, at least for my purpose to process them. But when they conflict in meaning between Spain and elsewhere that is a problem.

So I’ve been crunching through three lists. Finding more lists is a lot easier (at least until I’ve found most of them) than processing the lists, especially when the lists are entirely in Spanish. Plus some types of webpages are hard to “mine” (also known as scraping when code is doing it). Web authors design pages to be most useful for their intended audience and not for someone accumulating a corpus. And even when I’ve processed lists I have to be careful with the whole copyright issue. If I published (except in the fair use case, i.e. a small sample with attribution) any substantial portion of any list I find that is improper. But since my real notion is accumulating a large corpus from many sources and then basing my final translation vocabulary on a meta-analysis of many sources I think I should be OK. Also whenever I only have a term translation from a single source I need to be suspicious of the accuracy of that as well.

So thus far I’ve looked at: 1) the Gallina Blanca Diccionario which is from a website in Spain representing a food company producing packaged products for Spain markets and supplied the diccionario to aid their users of the recetas they also provide; this has Spanish terms and definitions in Spanish but does not apply, at least exclusively to Spain; 2) Nitty Grits, a glossary with Spanish terms and English definitions, not exclusive to Spain, but as I learned after crunching through most of it each term is clickable and often (not always) then indicates where this term is used; Nitty Grits is a large list and allows me to get fairly unambiguous definitions (since they’re in English) and avoid the often incorrect machine translations (such as occurred in Gallina Blanca); and, 3) now I’ve return, since doing some work by in May to a complex website, ARecetas, a recipe site that then has multiple glossaries especially the largest and most directly useful, Glosario de Alimentos.  And there are more I’ve found but haven’t yet crunched through at all. Of these ARecetas glosario is the hardest to process so I only briefly looked at in May and instead focused on Nitty Grits. But for several months Nitty Grits was not operational (at first I thought they might have blocked me but that was not the case).

Anyway now I have more issues having finished two of these sources and now resumed work on the third. First, the way I’ve extracted information (often a tedious process) is inconsistent between the three lists (meaning the tables I created in MSWord manually). Second, my notion system was inconsistent, i.e. I annotated much of what I found with no particular notation as to what is original source text and then my annotation. These issues meant I can’t possibly consolidate the three lists manually. So I had started some code to create a consistent format across all lists (in XML which is more robust than just text in MSWord with a few fonts and colors). I was able to do Nitty Grits fairly easily but ARecetas and GallinaBlanca are toughter, i.e. it’s not just code I need, but I have to go back to the manually compiled lists and use consistent inline markup so the code can parse all entries to the common XML I want for all three lists.

Now I need to finish ARecetas (and perhaps also some other smaller sites I found and also do a thorough job of searching) before moving on to the real world. Once I can convert each list, with my annotations and markup, to a consistent XML structure then I can attempt a “merge”. Once that is done I can then look for agreement or disagreement between the sources (as I processed them) and start fixing errors or doing more searching to get more accurate answers (although without wasting much time on non Spain terms).

People who compile lists usually have some other work. They usually want to get their list with minimal effort to achieve their purpose. Simply put, this means they make mistakes, sometimes even blatantly obvious to simple analysis, sometimes more subtle. I’m well familiar with this from my career, a concept of “good enough”. No compilation of information is ever perfect anyway so it’s more a question of how good does it need to be for the intended purpose versus how much work (usually measured as cost since some paid person is doing the work). So online lists have many flaws. And it’s not just online lists. I’ve bought a few books about food in Spain back in my V1.0 effort and these books have inconsistencies and errors (where error means they disagree with other sources). I’ve looked and I’ve never found a “best” or even highly accurate and comprehensive source.

And that’s part of why I’m even doing this project. Unlike the other people creating materials, either free on the Net or in for-sale published works I don’t have a cost issue with my work. As I’m retired and unlikely to ever even be a temporary consultant the marginal value of my time, measured in money, is zero. Therefore I can spend an infinite amount of it trying to be as accurate and comprehensive as I can be, even (and that would be fun) doing original field research, i.e. actually going to lots of restaurants in Spain with some consultant I could hire who’d be fluent in Spanish and cooking (then the bills do add up). So at least my “free” effort is just a question of how much work I wish to put in it.

So I do believe, despite my lack of fluency in Spanish language, it is feasible that I could compile the best list, meaning the most comprehensive and accurate. Of course my list would have mistakes too but I think it could be better than any I’ve seen. AND, if I write good code to does the bulk of the work consolidating the raw materials for my corpus and then extracting I should have an easier time making corrections, especially as my targeted application is either machine-generated webpages or a smartphone app, i.e. updates should be possible once I actually get feedback (too many sites or apps fail to take advantage of the knowledge of their users to provide very valuable feedback to constantly improve the product, either its usability or its underlying database of Spain culinary terminology.

So I hope to get back into it and finishing these three lists would be a critical milestone because then I can really get down to designing my corpus and the code for importing and consolidating and proofing the information in the corpus.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.