orgtheory.net

stata bleg: text matching

Attention Stata people (esp. Sr. Rossman): Let’s say I have a data base of articles. I have a variable with the author’s name. Then I want to match the author’s name with other data (e.g., Fabio Rojas is matched with height 5′ 8″).

Merge 1:m is the command, but there’s a problem. Let’s say that my author data base doesn’t use the same spelling (e.g., Fabio G. Rojas or fabio rojas). Then the merged data set will have missing data.

Is there a way in Stata to offer the programmer a choice of possible matches to minimize missing data caused by variations in spelling? If not, what program or language has an easy to use tool box for this sort of stuff?

Adverts: From Black Power/Grad Skool Rulz

About these ads

Written by fabiorojas

July 23, 2012 at 12:01 am

Posted in fabio, mere empirics

15 Responses

Subscribe to comments with RSS.

  1. I’d port it to a SQL database first. The fuzzy logic is easier to write there. You’d have something like:

    SELECT
    n.FirstName
    ,n.LastName
    ,Paper
    FROM tbl_Names n
    LEFT JOIN tblPapers p on n.ID = p.AuthorID
    WHERE n.FirstName LIKE ‘%fabio%’
    AND n.LastName LIKE ‘%roja%’

    or something like that. T-SQL isn’t case sensitive and you can specify the details of the fuzzy logic pretty easily. Make sure you design your tables using good database management rules and you’ll save yourself hours (or days, depending on the size of the data) of work.

    Like

    Sam Wilson

    July 23, 2012 at 12:09 am

  2. 1. take a look at -reclink- from SSC for this type of fuzzy matching. 2. -strgroup- from SSC is useful for finding these near matches before the -reclink- merge and cleaning up near matches to increase the match rate. 3. convert all your merge var to lower case and trim it to reduce matching errors (e.g., ” replace var = trim(lower(var)) “) 4. Use regex* and strpos() functions to pre-process vars. So, if you know you want anything that contains “fabio” to be “Fabio Rojas”, you can:

    replace var1 = “fabio rojas” if strpos(var1, “fabio”)>0
    **or
    replace var1 = “fabio rojas” if index(var1, “fabio”)

    or do something more complex with regular expressions. Also, in this type of situation, you’ll probably want to use a -foreach- loop to make many of these kinds of changes to your data.

    Like

    EBooth

    July 23, 2012 at 12:29 am

  3. I bet people who work with address fields — fundraisers and political list people for instance — know a lot of good techniques for this.

    Like

    Unlearner

    July 23, 2012 at 12:47 am

  4. I can’t speak to the code in Stata, but this is precisely what regular expressions are for; here’s one that would match your name with upper or lowercase first letters and with or without a middle initial and/or full middle name:

    [Ff]abio \w?.? [Rr]ojas

    Then, using whatever Stata code you would need to find strings that match that, you could then reassign them to be “Fabio Rojas”.

    Like

    Trey

    July 23, 2012 at 12:48 am

  5. Whoops, make that

    [Ff]abio\s?\w?\.?\s[Rr]ojas

    Like

    Trey

    July 23, 2012 at 12:50 am

  6. If you have lots of variant spellings that are difficult to capture in a regular expression, Google Refine can be useful, though there’s a bit of a learning curve. (Once you learn how to use it, however, it can be quite powerful).

    Like

    Xavier

    July 23, 2012 at 12:53 am

  7. this might help: (main problem: does not distinguish between “Fabio Rojas”, “Fabio G. Rojas” and “Fabio X. Rojas”. Considers all three as one person.)

    1. convert the string valuable to lowercase and trim
    “Fabio G. Rojas ” -> “fabio g. rojas”

    2. remove all characters like “.” or “,”
    “fabio g. rojas” -> “fabio g rojas”

    3. then split the string
    “fabio g rojas” -> “fabio” + “g” + “rojas”

    4. if any of the new string variables, created (after split), has a length of 1, replace them with an empty string.
    “fabio” + “g” + “rojas” – > “fabio” + “” + “rojas”

    5. concat all the string variables and merge them back into one string.
    “fabio” + “” + “rojas” -> “fabio rojas”

    6. remove all the spaces from the string.
    “fabio rojas” -> “fabiorojas”

    then merge.

    Like

    PT

    July 23, 2012 at 3:52 am

  8. Are you asking the best way to do this or how to do it in Stata? And how messy is the input? I’ve had quite a bit of experience with Stata string commands and macros to parse text, although I doubt very much that Stata is the best way to handle a bibliography. For one thing, its text fields are too small. What you want, in your example, is a strategy to end with with a unique identifier for each author and get that into any record you will want to merge. I recommend a unique text identifier because that is actually a lot easier to check for errors and reconcile than a bunch of large numbers.

    Like

    olderwoman

    July 23, 2012 at 3:59 am

  9. I think you can use regular expressions using AWK , this is typically used on Unix/Linux systems. But the variants are available on Windows platform too. But need to search for it….

    Like

    Santosh Sali

    July 23, 2012 at 10:40 am

  10. Let me add to the chorus suggesting that Stata is the wrong tool for this. Perl, python, even awk or sed would let you do a better job more quickly and easily. And storing data in a SQL database as Sam Wilson suggests is probably the best practice.

    Like

    andrewperrin

    July 23, 2012 at 1:49 pm

  11. Agreed with all the people who said this is not best done in Stata. That said, here’s how I do it in Stata. First I create a new variable called “mergevar.” It’s based on the variable I’d like to use except simplified, all caps, no spaces, no special characters, or anything else. If you’re using names you might wish to drop the middle initials. If it gets an exact match with this simplified version I accept that. Then I drop the exact matches, sort by the variable, and export to a spreadsheet for handmerging. Finally, I reimport the hand merged spreadsheet and use it as a merge key.

    (Why do this instead of just hand-editing the actual data? Because it’s more replicable this way as the spreadsheet itself serves as a change log. Always treat your raw data as read-only.)

    Also, regarding Trey’s comment, you do want to use regular expressions but keep in mind that Stata has a relatively impoverished regular expression syntax which is as good a reason as anything to do this with something else.

    Like

    gabriel rossman

    July 23, 2012 at 2:17 pm

  12. Thanks for the comments. I think this means I will have to learn SQL.

    o.w. asked how “messy” the data is. Let me explain the specific problem. I am analyzing a list of biomedical journal articles in one specialty. Each article has key words called “MeSH” terms. I scraped the MeSH terms (e.g., AIDS) and now I want to translate into MeSH numbers (e.g. D02.03.45).

    I downloaded the NIH’s list of MeSH terms and numbers and tried merge in Stata. There was a substantial list of MeSH terms without numbers. I suspect it has to do with descrepancies of spelling. That’s why I need to do this.

    I want a standard solution that can be replicated.

    Like

    fabiorojas

    July 23, 2012 at 7:55 pm

  13. Regarding Gabriel’s most recent comment, my strategy is similar to his — you always have to hand match the difficult cases. But I prefer to document the hand-matching in a Stata do file. I agree about never messing up the original data file. When I recently cleaned up our historical grad program records (which spelled students’ names differently in different databases and involved both a change in the system of student ID numbers so that wasn’t a reliable match AND data entry errors in the student IDs including a few cases in which students had actually been assigned duplicate IDs or had their ID numbers scrambled in official records) , I created a variable called name_unique.

    People in this thread have not even come to terms with the fact that the same human being may well change names. In our data, both surnames and personal names have changed, in a few cases three or more times in the space of a graduate career. ALSO even among our grad students, some students had almost the same name, and among publishing academics, there are many publication names that are attached to multiple people. Fabio, you have to decide whether you want to worry about these problems in your data.

    If the only fields you are working with are name fields, they should be short enough for Stata to handle. You can do it with string functions and a program of hand-matching the hard-to-match cases.

    Like

    olderwoman

    July 23, 2012 at 8:57 pm

  14. I think my data is *somewhat* easier. My real data is not human names, but article descriptors (e.g., MeSH terms). So you don’t have name changes or multiple people/same name issues. A key word has only one corresponding number in the NIH list. There isn’t AIDS (version 1) and AIDS (version 2).

    The issue is that the website that provides MeSH terms for articles produces them in a format that is slightly different than the standardized list with numbers. For example, in my scraped data, you might have AIDS and then in the standardized list you have AIDS/HIV. So, merging results in an empty cell for “AIDS” when it should be matched to the number for “AIDS/HIV.”

    I investigated Google Refine this afternoon, which looks promising. It has tools for grouping together terms that are similar. The underlying code is JSON and can be retrieved, so you can create log files for replication.

    Like

    fabiorojas

    July 23, 2012 at 9:11 pm

  15. OW,

    I’ve also come across cases of substantive name changes rather than just irregular spellings. In my case it’s that IMDb sometimes renames a film, for instance, when a film was rereleased under a new title and this is how it’s recorded but some IMDb editor later discovers and restores the original title.

    I think the take home of both of our cases is that at a certain point you hit the limits of fuzzy algorithms and need to rely on hand-coding (as documented through a merge table file or a “replace newvar=x if oldvar=y” script). This is especially important if these hard-to-merge cases are not random, but systematic, whether it be women who are more likely to change surnames in a database of people or foreign films that are more likely to change title in a database of films.

    Like

    gabriel rossman

    July 24, 2012 at 12:00 am


Comments are closed.

Follow

Get every new post delivered to your Inbox.

Join 1,125 other followers

%d bloggers like this: