NY Times doesn't work #229

abhigenie92 · 2015-05-20T21:57:01Z

from goose import Goose
extractor = Goose()
article = extractor.extract(url='http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general')
text = article.cleaned_text

MaxwellRebo · 2015-06-22T15:31:31Z

NYT does a ton of redirecting, which is incredibly annoying. The workaround is to set the user agent to look like a browser and continue from there (learned from a colleague at Factr). If NYT doesn't like the user agent, it will sometimes put you in an infinite redirect loop. This is partly due to their paywall.
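A minimal sketch of that workaround, assuming python-goose's `browser_user_agent` config key as shown in its README. The user-agent string and the helper names `build_goose_config` and `extract_text` are illustrative, not part of the library, and a browser-like user agent may still not get past the paywall redirects.

```python
# Illustrative browser-like user agent; any realistic browser string could be used.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/43.0.2357.65 Safari/537.36"
)


def build_goose_config(user_agent=BROWSER_USER_AGENT):
    """Return a Goose config dict that overrides the default user agent."""
    return {"browser_user_agent": user_agent}


def extract_text(url):
    """Fetch `url` with a browser-like user agent and return the cleaned text."""
    # Imported lazily so the config helper can be used without goose installed.
    from goose import Goose

    extractor = Goose(build_goose_config())
    article = extractor.extract(url=url)
    return article.cleaned_text
```

Calling `extract_text(...)` with the URL from the original report would exercise the same code path as above, with only the user agent changed.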

