stata bleg: substring search

Let’s say each case has a paragraph of text. How can I tell Stata to search the paragraph and tell me if a given string is in there? In my case, I have data on articles within a scientific specialty. The abstract is stored as a string variable for each case. I would like to see if the abstract mentions a research method (e.g., did they do an experiment?).

Written by fabiorojas

April 28, 2010 at 8:30 pm

Posted in fabio, research

14 Responses


I’m sure there’s a fancier way, but:

split paragraph, gen(new_var) p("string1" "string2")
gen is_txt_there = (paragraph ~= new_var1)

Should do it.


Thorfinn

April 28, 2010 at 8:40 pm
Use regular expressions. Something like:

gen methodmentioned=1 if regexm(abstracts, "research method")

Look up “regular expressions in stata” for more nuance.


shane

April 28, 2010 at 8:45 pm
wait up a second. Stata can only hold up to 244 characters in a string. most abstracts are 1200 words, or about 1500 characters. thus it sounds like you’re only searching the first fifth or so of each abstract and so have a lot of false negatives. if i were you i’d check whether you’re truncating the abstracts, something like “list abstract in 1/5”. if you are, i suggest that you code the abstracts for keywords before getting it into Stata. if you tell me more about your data i can be more specific.
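One way to run that truncation check before the import, sketched in shell. The file name raw.txt, its two-column tab-delimited layout (key, then abstract), and the sample rows are all assumptions made up for illustration; 244 is the limit cited above. Note that some awk implementations count bytes rather than characters, so non-ASCII abstracts may be overcounted.

```shell
#!/bin/sh
# Work in a scratch directory so nothing real is overwritten.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Hypothetical sample data: key <TAB> abstract, one article per line.
long_abstract=$(awk 'BEGIN { for (i = 0; i < 300; i++) printf "x" }')
printf '1\tWe ran an experiment on firms.\n' > raw.txt
printf '2\t%s\n' "$long_abstract" >> raw.txt

# Count abstracts (field 2) longer than 244 characters.
too_long=$(awk -F '\t' 'length($2) > 244 { n++ } END { print n + 0 }' raw.txt)
echo "abstracts over 244 characters: $too_long"

# Save the keys of the affected rows so they can be inspected by hand.
awk -F '\t' 'length($2) > 244 { print $1 }' raw.txt > too_long_keys.txt
```

If the count is zero, nothing would be cut off at that limit; otherwise too_long_keys.txt lists the articles to look at.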


gabrielrossman

April 28, 2010 at 10:13 pm
correction, about “200 words” — doesn’t change the point though


gabrielrossman

April 28, 2010 at 10:25 pm
Why not use perl for the data-searching/generation part? It’s free and incredibly good at text matching.


andrewperrin

April 29, 2010 at 1:02 am
Whoa, hadn’t thought of that. The abstracts are saved in Excel, and I just imported them into Stata. And, I just haven’t gotten around to Perl…


fabiorojas

April 29, 2010 at 2:28 am
As Gabriel says, Stata has a pretty low limit on string fields. FYI SPSS takes much longer strings. A couple of summers ago, a student and I tested various programs for doing what it sounds like you are doing (coding abstracts for the presence of particular key words or phrases), and SPSS was the best statistical package for this application because it could handle whole abstracts, although its search choices were fairly limited. I don’t know Perl, it sounds like Andrew is right about it, but don’t forget to check about lost text in a Stata import. Maybe the latest version increased the string size?


olderwoman

April 29, 2010 at 1:00 pm
here’s my tentative solution, which should work for mac or linux. first, create a csv file where the first column is some kind of key variable and the second column is the abstract. then run this Unix command
```
grep 'experiment' raw.csv | cut -d "," -f 1 > experiment.csv
```
finally, insheet “experiment.csv” and merge it with the main dataset. any _merge==3 is an article with the word “experiment” in the abstract.
i think i can work the whole thing into a do-file and i’ll probably post it to my own blog in a few hours
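A small end-to-end run of that pipeline, as a sketch on made-up data. The three sample rows and the file experiment_anycase.csv are assumptions for illustration, and the `-i` flag is an addition to the command above, not part of it: plain `grep 'experiment'` is case-sensitive, so an abstract that only says "Experimental" would be missed.

```shell
#!/bin/sh
# Work in a scratch directory so nothing real is overwritten.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Hypothetical sample: key in column 1, abstract in column 2.
cat > raw.csv <<'EOF'
101,"We report a field experiment on hiring."
102,"A survey of citation practices."
103,"Experimental evidence, with replication."
EOF

# Same pipeline as above: keep matching rows, then keep only the key column.
grep 'experiment' raw.csv | cut -d "," -f 1 > experiment.csv

# Case-insensitive variant: also catches "Experimental" in row 103.
grep -i 'experiment' raw.csv | cut -d "," -f 1 > experiment_anycase.csv

echo "case-sensitive matches:";   cat experiment.csv
echo "case-insensitive matches:"; cat experiment_anycase.csv
```

Because the key is the first field, `cut -f 1` still returns it correctly for row 103 even though that abstract contains a comma. Abstracts with embedded line breaks would still break the one-row-per-line assumption.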

gabrielrossman

April 29, 2010 at 1:39 pm
If your data is already in Stata, why not use the –strpos– function? For example:

gen exists=strpos(abstract,"thestringyouwant")

will give you a variable with the starting location of the string in the abstract; 0 if the string is not present.


pll

April 29, 2010 at 2:05 pm
If they’re already in excel, maybe it’s easier to search the abstracts in excel. Add an extra column and use the FIND function.


David Chen

April 29, 2010 at 2:30 pm
[…] at the Orgtheory mothership, Fabio asked how to do a partial string match in Stata, specifically to see if certain keywords appear in scientific abstracts. This turns out to be hard, […]


Grepmerge « Code and Culture

April 29, 2010 at 4:46 pm
Gabriel, I don’t think that saving the file as a .csv is wise in this case because the abstract field is likely to have commas in it. It would probably be better to save the file in tab-delimited format and change the elegant grep code to reflect that.


mike

May 1, 2010 at 8:33 pm
mike,
the embedded commas in a csv will be escaped if it’s properly formatted, but that’s a big if. so basically, i agree that tab-delimited is better and in fact i usually use tab-delimited text for that reason. (i used csv this time because i converted the sample data from xlsx to text with Numbers 08). here’s the code with tab-delimited text:
```
grep 'experiment' raw.txt | cut -f 1 > experiment.txt
```
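If several methods need coding, the same tab-delimited command can be wrapped in a loop, one key file per keyword. This is only a sketch: the keyword list, the sample rows, and the `-i` flag (case-insensitive matching) are assumptions added for illustration.

```shell
#!/bin/sh
# Work in a scratch directory so nothing real is overwritten.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Hypothetical sample: key <TAB> abstract, one article per line.
printf '101\tWe report a field experiment on hiring.\n' > raw.txt
printf '102\tA survey of citation practices, with interviews.\n' >> raw.txt
printf '103\tInterview data and a lab experiment.\n' >> raw.txt

# Write one file of matching keys per keyword, e.g. experiment.txt.
for word in experiment survey interview; do
    grep -i "$word" raw.txt | cut -f 1 > "$word.txt"
done

# Show how many articles matched each keyword.
wc -l experiment.txt survey.txt interview.txt
```

Each output file can then be read into Stata and merged on the key, as described earlier in the thread. grep searches the whole line, so a key that happens to contain a keyword would also match.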
gabrielrossman

May 4, 2010 at 9:19 pm
Thanks for the clarification, Gabriel. Also, for those who might be interested, I posted some code that can do this “internal” to Stata (i.e., not needing to call to the operating system). Using grep is probably a lot faster than using the code that I posted, but it is an option for those who can’t or don’t want to call to grep.


mike

May 5, 2010 at 12:05 am


orgtheory.net

