Scraping HTML in Lisp

May 15, 2015

My question is related to the question found here: Scraping HTML table in Common Lisp?

I am trying to extract data from a webpage in Common Lisp. I am using Drakma to send the HTTP request, and I'm trying to use CHTML to extract the data I am looking for. The webpage I'm trying to scrape is http://erg.delph-in.net/logon. Here is my code:

(defun send-request (sentence)
  "Sends SENTENCE in an HTTP request to LOGON for parsing, and receives
a webpage containing the MRS output."
  (drakma:http-request "http://erg.delph-in.net/logon"
                       :method :post
                       :parameters `(("input" . ,sentence)
                                     ("task" . "analyze")
                                     ("roots" . "sentences")
                                     ("output" . "mrs")
                                     ("exhaustivep" . "best")
                                     ("nresults" . "1"))))

And here's the function I am having trouble with:

(defun get-mrs (sentence)
  (let* ((str (send-request sentence))
         (document (chtml:parse str (cxml-stp:make-builder))))
    (stp:filter-recursively (stp:of-name "mrsfeaturetop") document)))

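Basically, all the data I need to extract is in an HTML table, but it's too big to paste here. In the get-mrs function, I was just trying to get the tag with the name mrsfeaturetop. I'm not sure if this is correct, though, since I am getting the error: not an NCName 'onclick. Any help with scraping the table would be greatly appreciated. Thank you.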

An ancient question, I know, but one that defeated me for a long time. It's true that a lot of webpages are rubbish, but the entire Web 2.0 is built upon screen scraping and integrating heterogeneous websites with hack upon hack. This should be an ideal application for Lisp!

The key (in addition to Drakma) is lquery, which allows you to access a page's contents using a lispy transliteration of CSS selectors (what jQuery uses).

Let's get the links from the media strip on Google's news page! If you open https://news.google.com in a browser and view the source, you'll be overwhelmed by the complexity of the page. But if you view the page in the browser's development panel (Firefox: F12, Inspector), you'll see that the page has some logic to it. Use the search box to find the .media-strip-table element that contains the images we want. Now open your favourite REPL. (Well, let's be honest here, Emacs: M-x slime)

(ql:quickload :drakma)
(ql:quickload :lquery)

;;; Get the links from the media strip on Google's news page.
(defparameter response (drakma:http-request "https://news.google.com/"))

;;; lquery parses the page and gets it ready to be queried.
;;; Note: this must name the variable defined above (RESPONSE).
(lquery:$ (initialize response))

Now let's explore the results:

;;; A package-qualified '$' operator, how barbaric!
;;; Use (use-package :lquery) to omit the package prefix.
(lquery:$ ".media-strip-table" (html))

Wow! That's just a tiny section of the page? OK, how about just the first element?

(elt (lquery:$ ".media-strip-table" (html)) 0)

OK, that's a little more manageable. Let's see if there's an image tag in there somewhere (Emacs: C-s img). Yay! There is.

(lquery:$ ".media-strip-table img" (html))

Hmmm... it's finding something, but just returning empty text... Oh yeah, image tags are supposed to be empty!

(lquery:$ ".media-strip-table img" (attr :src))

Holy crap! GIFs aren't just used for unfunny, grainy animations?
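Google's markup changes often, so the selectors above may no longer match anything. To try the same calls offline, here is a minimal sketch against a small inline fragment. The HTML string and the variable names (*sample-html*, *doc*, *sources*, *labels*) are made up for illustration, and passing the parsed document explicitly as the first argument to $ is an assumption based on the form shown in the lquery README, so it may need adjusting for the lquery version you have installed.

```lisp
(ql:quickload :lquery)

;; Hypothetical HTML fragment standing in for a downloaded page.
(defparameter *sample-html*
  "<table class=\"media-strip-table\">
     <tr><td><img src=\"a.gif\"><span>First</span></td></tr>
     <tr><td><img src=\"b.gif\"><span>Second</span></td></tr>
   </table>")

;; INITIALIZE parses the string; keep the result so that later
;; queries can name the document explicitly.
(defparameter *doc* (lquery:$ (initialize *sample-html*)))

;; Collect the SRC attribute of every IMG inside the table.
(defparameter *sources*
  (lquery:$ *doc* ".media-strip-table img" (attr :src)))

;; Collect the text content of every SPAN inside the table.
(defparameter *labels*
  (lquery:$ *doc* ".media-strip-table span" (text)))
```

The same pattern (a CSS selector followed by (text) or (attr ...)) should carry over to the cells of an HTML table like the one in the question, once you know which classes or tags wrap the data.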
