Scraping HTML in Lisp
My question is related to the question found here: Scraping HTML table in Common Lisp?
I am trying to extract data from a webpage in Common Lisp. I am using drakma to send the HTTP request, and I'm trying to use chtml to extract the data I am looking for. The webpage I'm trying to scrape is http://erg.delph-in.net/logon, and here is my code:
    (defun send-request (sentence)
      "Sends SENTENCE in an HTTP request to LOGON for parsing, and receives
       a webpage containing the MRS output."
      (drakma:http-request "http://erg.delph-in.net/logon"
                           :method :post
                           :parameters `(("input" . ,sentence)
                                         ("task" . "analyze")
                                         ("roots" . "sentences")
                                         ("output" . "mrs")
                                         ("exhaustivep" . "best")
                                         ("nresults" . "1"))))

And here's the function I am having trouble with:
    (defun get-mrs (sentence)
      (let* ((str (send-request sentence))
             (document (chtml:parse str (cxml-stp:make-builder))))
        (stp:filter-recursively (stp:of-name "mrsfeaturetop") document)))

Basically, all the data I need to extract is in an HTML table, but it's too big to paste here. In the get-mrs function, I was trying to get the tag with the name mrsfeaturetop. I'm not sure that is correct, though, since I am getting the error: not an NCName 'onclick. Any help scraping the table would be greatly appreciated. Thank you.
An ancient question, I know, but one that defeated me for a long time. It's true that a lot of webpages are rubbish, but the entire Web 2.0 is built upon screen scraping and integrating heterogeneous websites with hack upon hack. That should be an ideal application for Lisp!
The key (in addition to drakma) is lquery, which allows you to access a page's contents using a lispy transliteration of CSS selectors (what jQuery uses).
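To make the selector idea concrete before touching a live site, here is a minimal sketch that runs lquery against an inline HTML string. The markup, the class names, and the variable names (`*sample-html*`, `*sample-doc*`, `*names*`) are made up for illustration, and it assumes Quicklisp is installed. Unlike the REPL session in this answer, it passes the parsed document explicitly as the first argument to `$`, which is the pattern I understand recent lquery releases to recommend.

```lisp
;; Assumes Quicklisp is available; the HTML below is invented for this example.
(ql:quickload :lquery)

(defparameter *sample-html*
  "<html><body>
     <table class=\"results\">
       <tr><td class=\"name\">alpha</td><td>1</td></tr>
       <tr><td class=\"name\">beta</td><td>2</td></tr>
     </table>
   </body></html>")

;; INITIALIZE parses the string and yields a vector holding the document root.
(defparameter *sample-doc* (lquery:$ (initialize *sample-html*)))

;; A CSS selector narrows the working set; (text) is then applied to each match,
;; giving a vector with one string per matching element.
(defparameter *names*
  (lquery:$ *sample-doc* "table.results td.name" (text)))
```

The selector string reads exactly as it would in a stylesheet or in jQuery, so a selector found with the browser's inspector can usually be pasted in unchanged.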
Let's get the links from the media strip on Google's news page! If you open https://news.google.com in a browser and view source, you'll be overwhelmed by the complexity of the page. But if you view the page in the browser's development panel (Firefox: F12, Inspector), you'll see that the page has some logic to it. Use the search box to find the .media-strip-table element, which contains the images we want. Now open your favourite REPL. (Well, let's be honest here, Emacs: M-x slime)
    (ql:quickload :drakma)
    (ql:quickload :lquery)

    ;;; Get the links from the media strip on Google's news page.
    (defparameter http-response (drakma:http-request "https://news.google.com/"))

    ;;; lquery parses the page and gets it ready to be queried.
    (lquery:$ (initialize http-response))

Now let's explore the results:
    ;;; A package-qualified '$' operator, barbaric!
    ;;; Use (use-package :lquery) to omit the package prefix.
    (lquery:$ ".media-strip-table" (html))

Wow! That's a tiny section of the page? OK, how about just the first element?
    (elt (lquery:$ ".media-strip-table" (html)) 0)

OK, that's a little more manageable. Let's see if there's an image tag in there somewhere (Emacs: C-s img). Yay! There is.
    (lquery:$ ".media-strip-table img" (html))

Hmmm... it's finding something, but returning empty text... Oh yeah, image tags are supposed to be empty!
    (lquery:$ ".media-strip-table img" (attr :src))

Holy crap! GIFs aren't just used for unfunny, grainy animations?
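If the `src` values turn out to be a mix of inline `data:` URIs and ordinary URLs, you may want to keep only the ones you can actually fetch. The following is a rough sketch under that assumption: the markup is a made-up stand-in for the media strip, and `data-uri-p`, `*strip-html*`, `*strip-doc*` and `*image-urls*` are hypothetical names, not part of lquery.

```lisp
;; Assumes Quicklisp is available; the markup is an invented stand-in for the real page.
(ql:quickload :lquery)

(defparameter *strip-html*
  "<table class=\"media-strip-table\"><tr>
     <td><img src=\"data:image/gif;base64,R0lGODlhAQABAAAAACw=\"></td>
     <td><img src=\"https://example.com/photo.jpg\"></td>
   </tr></table>")

(defun data-uri-p (src)
  "True when SRC is an inline data: URI rather than a fetchable URL."
  (and (>= (length src) 5)
       (string-equal "data:" src :end2 5)))

(defparameter *strip-doc* (lquery:$ (initialize *strip-html*)))

;; (attr :src) returns a vector of attribute values; coerce it to a list
;; and drop the inline images.
(defparameter *image-urls*
  (remove-if #'data-uri-p
             (coerce (lquery:$ *strip-doc* ".media-strip-table img" (attr :src))
                     'list)))
```

On a real page you would replace `*strip-html*` with the string returned by `drakma:http-request`.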