orgtheory.net

stata bleg: substring search

Let’s say each case has a paragraph of text. How can I tell Stata to search the paragraph and tell me if a given string is in there? In my case, I have data on articles within a scientific specialty. The abstract is stored as a string variable for each case. I would like to see if the abstract mentions a research method (e.g., did they do an experiment?).

Written by fabiorojas

April 28, 2010 at 8:30 pm

Posted in fabio, research

14 Responses

  1. I’m sure there’s a fancier way, but:

    split paragraph, gen(new_var) p("string1" "string2")
    gen is_txt_there = (paragraph ~= new_var1)

    Should do it.

    Thorfinn

    April 28, 2010 at 8:40 pm

  2. Use regular expressions. Something like:

    gen methodmentioned=1 if regexm(abstracts, "research method")

    Look up “regular expressions in stata” for more nuance.

    shane

    April 28, 2010 at 8:45 pm

  3. wait up a second. Stata can only hold up to 244 characters in a string. most abstracts are 1200 words, or about 1500 characters. thus it sounds like you’re only searching the first fifth or so of each abstract and so have a lot of false negatives. if i were you i’d check whether you’re truncating the abstracts, something like “list abstract in 1/5”. if you are, i suggest that you code the abstracts for keywords before getting it into Stata. if you tell me more about your data i can be more specific.

    gabrielrossman

    April 28, 2010 at 10:13 pm

  4. correction, about “200 words” — doesn’t change the point though

    gabrielrossman

    April 28, 2010 at 10:25 pm
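
The truncation check suggested above can also be run before the data ever reaches Stata. Here is a minimal sketch, assuming the abstracts have been exported to a tab-delimited file with a key in column 1 and the abstract in column 2. The file names and sample rows are hypothetical, and 244 is the string limit quoted in the comment above.

```shell
# Hypothetical sample: key <TAB> abstract, one article per line.
workdir=$(mktemp -d)
long_abstract=$(awk 'BEGIN { for (i = 0; i < 300; i++) printf "x" }')
printf 'a1\tshort abstract about an experiment\n' > "$workdir/raw.txt"
printf 'a2\t%s\n' "$long_abstract" >> "$workdir/raw.txt"

# Print the key and length of every abstract longer than 244 characters.
awk -F '\t' 'length($2) > 244 { print $1 "\t" length($2) }' \
    "$workdir/raw.txt" > "$workdir/too_long.txt"
cat "$workdir/too_long.txt"
```

Any key listed in too_long.txt is an abstract that would not fit under that limit. Note that some awk implementations count bytes rather than characters, so the lengths can be overstated for non-ASCII text.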

  5. Why not use perl for the data-searching/generation part? It’s free and incredibly good at text matching.

    andrewperrin

    April 29, 2010 at 1:02 am

  6. Whoa, hadn’t thought of that. The abstracts are saved in Excel, and I just imported them into Stata. And, I just haven’t gotten around to Perl…

    fabiorojas

    April 29, 2010 at 2:28 am

  7. As Gabriel says, Stata has a pretty low limit on string fields. FYI SPSS takes much longer strings. A couple of summers ago, a student and I tested various programs for doing what it sounds like you are doing (coding abstracts for the presence of particular key words or phrases), and SPSS was the best statistical package for this application because it could handle whole abstracts, although its search choices were fairly limited. I don’t know Perl, and it sounds like Andrew is right about it, but don’t forget to check for lost text in a Stata import. Maybe the latest version increased the string size?

    olderwoman

    April 29, 2010 at 1:00 pm

  8. here’s my tentative solution, which should work for mac or linux. first, create a csv file where the first column is some kind of key variable and the second column is the abstract. then run this Unix command

    grep 'experiment' raw.csv | cut -d "," -f 1 > experiment.csv

    finally, insheet “experiment.csv” and merge it with the main dataset. any _merge==3 is an article with the word “experiment” in the abstract.
    i think i can work the whole thing into a do-file and i’ll probably post it to my own blog in a few hours

    gabrielrossman

    April 29, 2010 at 1:39 pm

  9. If your data is already in Stata, why not use the –strpos– function? For example:

    gen exists=strpos(abstract,"thestringyouwant")

    will give you a variable with the starting location of the string in the abstract; 0 if the string is not present.

    pll

    April 29, 2010 at 2:05 pm
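
For abstracts that have not been imported yet, the same position-or-zero logic can be sketched outside Stata with awk's index() function. This assumes a tab-delimited file with a key in column 1 and the abstract in column 2. The file names, sample rows, and the `needle` variable are hypothetical.

```shell
workdir=$(mktemp -d)
# Hypothetical sample data: key <TAB> abstract.
printf 'a1\tWe ran an experiment.\n' > "$workdir/raw.txt"
printf 'a2\tA survey of firms.\n'   >> "$workdir/raw.txt"

# For every row, write the key and the starting position of the
# search string within the abstract (0 if it does not occur).
awk -F '\t' -v needle='experiment' \
    '{ print $1 "\t" index($2, needle) }' \
    "$workdir/raw.txt" > "$workdir/positions.txt"
cat "$workdir/positions.txt"
```

Because every row gets a value, the resulting file can be merged one-to-one on the key, and any position greater than 0 marks an abstract containing the string.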

  10. If they’re already in excel, maybe it’s easier to search the abstracts in excel. Add an extra column and use the FIND function.

    David Chen

    April 29, 2010 at 2:30 pm

  11. […] at the Orgtheory mothership, Fabio asked how to do a partial string match in Stata, specifically to see if certain keywords appear in scientific abstracts. This turns out to be hard, […]

  12. Gabriel, I don’t think that saving the file as a .csv is wise in this case because the abstract field is likely to have commas in it. It would probably be better to save the file in tab-delimited format and change the elegant grep code to reflect that.

    mike

    May 1, 2010 at 8:33 pm

  13. mike,
    the embedded commas in a csv will be escaped if it’s properly formatted, but that’s a big if. so basically, i agree that tab-delimited is better and in fact i usually use tab text for that reason. (i used csv this time because i converted the sample data from xlsx to text with Numbers 08). here’s the code with tab-delimited text:

    grep 'experiment' raw.txt | cut -f 1 > experiment.txt

    gabrielrossman

    May 4, 2010 at 9:19 pm
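
One caveat with the grep pipeline above: grep matches anywhere on the line, including the key column, and it is case-sensitive by default. Below is a sketch of a variant that tests only the abstract column and ignores case, assuming the same layout (key in column 1, abstract in column 2, tab-delimited). The sample rows are hypothetical.

```shell
workdir=$(mktemp -d)
# Hypothetical sample data: key <TAB> abstract.
printf 'a1\tWe ran a lab Experiment on trust.\n'  > "$workdir/raw.txt"
printf 'a2\tA survey of 500 firms.\n'            >> "$workdir/raw.txt"
printf 'experiment7\tArchival case study.\n'     >> "$workdir/raw.txt"

# Match only the abstract column ($2), ignoring case, and keep the key ($1).
awk -F '\t' 'tolower($2) ~ /experiment/ { print $1 }' \
    "$workdir/raw.txt" > "$workdir/experiment.txt"
cat "$workdir/experiment.txt"
```

On this sample the awk version keeps only a1, whereas the plain grep pipeline would return experiment7 (a match in the key) and miss a1 (capitalized "Experiment").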

  14. Thanks for the clarification, Gabriel. Also, for those who might be interested, I posted some code that can do this “internal” to Stata (i.e., not needing to call to the operating system). Using grep is probably a lot faster than using the code that I posted, but it is an option for those who can’t or don’t want to call to grep.

    mike

    May 5, 2010 at 12:05 am

