Indexing PDF documents with Sunspot/Solr seemed like a difficult task initially. There are a number of tutorials on the topic, but they seemed overly complicated. In the end, I decided the simplest way to provide search over PDF documents was to:
1) parse the documents locally,
2) store the raw PDF text in a field on an ActiveRecord object,
3) index that raw text field and make it searchable.
This lets us skip the unnecessary step of having Solr index the PDF document itself and instead index the raw PDF text, which is really the only thing we need.
The configuration is as follows:
Gemfile
gem 'sunspot_rails'
gem 'pdf-reader'
group :development do
  gem 'sunspot_solr'
end # we're using websolr on Heroku
Model Configuration
searchable do
  text :pdf_contents
end
Parsing and automatic indexing.
require 'open-uri'
# If you are downloading the file; otherwise pass a local path or IO to PDF::Reader.new.
# URI.open is needed on Ruby 3.0+, where a bare open(url) no longer fetches URLs.
file = URI.open(url_to_pdf_document)
reader = PDF::Reader.new(file) # accepts a file name or an IO-like object
contents = ""
reader.pages.each do |page|
  # collapse newlines and runs of whitespace into single spaces
  contents += page.text.gsub(/\s+/, " ").strip + " "
end
object.pdf_contents = contents.strip
object.save
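The whitespace cleanup in the loop above can also be pulled into a small helper, which makes it easy to check without a PDF at hand. This is only a minimal sketch: `normalize_pdf_text` is a hypothetical name of my own, not part of pdf-reader, and it assumes the page texts have already been extracted as plain strings.

```ruby
# Hypothetical helper (not part of pdf-reader): collapses whitespace inside
# each page's text, drops pages that end up empty, and joins the rest with
# a single space so words at page boundaries do not run together.
def normalize_pdf_text(page_texts)
  page_texts
    .map { |text| text.gsub(/\s+/, " ").strip }
    .reject(&:empty?)
    .join(" ")
end

puts normalize_pdf_text(["Hello\n  world ", "\n", "second\tpage"])
# prints "Hello world second page"
```

With pdf-reader this would presumably be called as `normalize_pdf_text(reader.pages.map(&:text))`.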
The PDF text is now indexed and searchable within the Rails app, since sunspot_rails re-indexes a searchable model by default whenever it is saved.
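Querying the indexed text could then look roughly like the sketch below. It relies on the `search` class method and the `fulltext` query method that Sunspot adds to searchable models. The helper name `search_pdfs` and the idea of passing the model class in as an argument are my own assumptions for illustration, not something from the setup above.

```ruby
# Hypothetical helper: runs a Sunspot full-text query against any model
# that declares a `searchable` block and returns the matching records.
# `model` is assumed to be an ActiveRecord class set up as shown above.
def search_pdfs(model, query)
  model.search { fulltext query }.results
end
```

In a controller this might be called as `search_pdfs(Document, params[:q])`, where `Document` stands in for whatever model holds the `pdf_contents` field.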
This post does not cover making the PDFs accessible through your app; that will depend on your specific situation. The steps above are storage agnostic and will work with any configuration (S3, local, another cloud service, etc.).
I ended up doing the same thing. I tried using Yomu (an Apache Tika wrapper), but 50% of my PDFs didn’t parse well. pdf-reader seemed to be fine with any PDF I threw at it. So I’ll probably use pdf-reader for PDFs and Tika for Word and other docs.
Where are you putting the pdf_reader code? I can’t get this to work