saaaam's comments | Hacker News

I did something similar years ago for Twitter ads, using autogenerated videos. They didn't block me, although I'm not sure it would work today:

https://thenewinquiry.com/taxonomy-of-humans-according-to-tw...


Hello, it's me, the pearl-clutching, not sane, proselytizing, pseudo-intellectual author of this blog post. Happy to field your many insults/inquiries!


> and more importantly to coerce us into accepting the need for their existence in the first place.

What do you think should take the place of the police force?

Genuine question, no judgement. I'm from a country that doesn't have such a divisive and militarised police force, so I find it hard to understand this point of view.


I'm not sure how useful the overlay data is for live analysis, but I'm glad it's overlaid on the video as a record of state. For instance, the latitude and longitude are of no use to anyone who already knows what this is a video of, but they stop the video being misrepresented as being of another location.

Style-wise, I think it might actually be beneficial to emulate sci-fi films to an extent. It depends on the film of course, but the designers of those user interfaces are actual designers with a cohesive idea of what they want to represent. The alternative can often be a programmer who has to conjure the opinions of a committee into a user interface without the input of a design specialist.


Can you extract the overlay data to give a geographical way to search the footage?


You can now add "bitter and defensive" to that list


You could have just said artist.


There's a link at the bottom to the code (https://github.com/antiboredom/camera-motion-detector/). I use optical flow and then just count the percentage of pixels that appear to be moving away from the center. If that's bigger than ARBITRARY_THRESHOLD, it's a zoom-in.


It is I, the pearl clutcher.


End of an era.


Thanks for sharing this - I'll check it out!


Hi Simon! I'll definitely consider adding that in. Also, I love Datasette!


Hi - I'd be interested to hear more details about what approaches you suggest!


Taking the examples from https://www.youtube-nocookie.com/embed/hA1ZsxE8VJg, I am sharing how I approach the simple problems in the video without using Python or any knowledge of CSS selectors.

Retrieving the HTML

   echo https://www.nytimes.com|yy025|nc -vv proxy 80 > 1.htm
yy025 is a flexible utility I wrote to generate custom HTTP from URLs. It is controlled through environment variables. nc is a TCP client, such as netcat. proxy is a HOSTS file entry for a localhost TLS proxy. The sequence "yy025|tcpclient" is normally contained in a shell script that adds a <base href> tag, something like

   #! /bin/sh
   # yy025 writes the host name to fd 5 and the HTTP request to stdout
   yy025 5>.1 >.2
   read x < .1;
   echo "<base href=https://$x />";
   # send the request through the local proxy; yy045 strips chunking
   nc -vv proxy 80 < .2|yy045;
yy045 is a utility that removes chunked transfer encoding.

The benefit of using separate, small programs that do one thing will be illustrated in the solution for Problem 3.


Problem 2 - Extract href value from <a> tags in NYT front page

Create a file called 2.l containing

    int fileno(FILE *);
    #define jmp (yy_start) = 1 + 2 *
    #define echo do {if(fwrite(yytext,(size_t)yyleng,1,yyout)){}}while(0)
   
   %s xa xb
   %option noyywrap noinput nounput
   %%
   \<a jmp xa;
   <xa>\40href=\" jmp xb;
   <xb>\" jmp 0;
   <xb>[^\"]* echo;putchar(10);
   .|\n
   %%
   int main(){ yylex();exit(0);}
Compile

    flex -8iCrf 2.l
    cc  -std=c89 -Wall -pedantic -I$HOME -pipe lex.yy.c -static -o yy2 
And finally,

    yy2 < 1.htm
This is faster than Python and requires fewer resources.
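
If you want to check that claim on your own machine, a rough comparison (the second command measures Python interpreter start-up alone):

    time yy2 < 1.htm > /dev/null
    time python3 -c pass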


It's hard to imagine an environment where the speed/resource difference between that approach and Python would matter.

Can't see reaching for something like that instead of something like

    curl -s url | htmlq a --attribute href


Problem 1 - Extract the values of <h2> tags from NYT front page

NB. In 1.htm, NYT is using the <h3> tag for headlines, not <h2> as in the 2020 video.

Solution A - Use UNIX utilities

    grep -o "<h3[^\>]*>[^\<]*" 1.htm |sed -n '/indicate-hover/s/.*\">//p'
The grep utility is ubiquitous, but the -o option is not.

https://web.archive.org/web/20201202103125/https://pubs.open...

For example, Plan9 grep does not have an -o option.

This solution is fast and flexible, but not portable.

There are myriad other portable solutions using POSIX UNIX utilities such as sh, tr and sed. For small tasks like those in "web scraping" tutorials, these can still be faster than Python (due to Python's start-up time alone).
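
As one quick sketch of such a portable variant (it assumes, as in 1.htm, that the headline text immediately follows the opening tag):

    tr '<' '\012' < 1.htm | sed -n 's/^h3[^>]*>\(..*\)/\1/p'

tr splits the minified HTML at every tag, so each opening tag and its text begin a new line; the sed then prints only the text that follows an opening <h3>. Both tr's octal escape and sed's BRE capture are POSIX.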

Solution B - Use flex to make small, fast, custom utilities

Create a file called 1.l that contains

    int fileno(FILE *);
    #define jmp (yy_start) = 1 + 2 *
    #define echo do {if(fwrite(yytext,(size_t)yyleng,1,yyout)){}}while(0)

   %s xa xb
   %option noyywrap noinput nounput
   %%
   \<h3 jmp xa;
   <xa>\> jmp xb;
   <xb>\< jmp 0;
   <xb>[^<]* echo;putchar(10);
   .|\n
   %%
   int main(){ yylex();exit(0);}
Then compile with something like

    flex -8iCrf 1.l 
    cc  -std=c89 -Wall -pedantic -I$HOME -pipe lex.yy.c -static -o yy1 
And finally,

    yy1 < 1.htm
This is faster than Python.

Solution C - Extract values from JSON instead of HTML

The file 1.htm contains a large proportion of what appears to be JSON.

I wrote a quick-and-dirty, work-in-progress JSON reformatter called yy059 that takes web pages as input: https://news.ycombinator.com/item?id=31174088

   yy059 < 1.htm|sed -n '/promotionalHeadline\":\"[^\"]/p'|cut -d\" -f4
Sure enough, the JSON contains the headlines. One could rewrite Solution B to extract from the JSON instead of the HTML.
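
Following that suggestion, here is a sketch of Solution B rewritten against the JSON. The file name 5.l and output name yy5 are my own choices; the rules mirror 1.l.

    int fileno(FILE *);
    #define jmp (yy_start) = 1 + 2 *
    #define echo do {if(fwrite(yytext,(size_t)yyleng,1,yyout)){}}while(0)

   %s xa
   %option noyywrap noinput nounput
   %%
   \"promotionalHeadline\":\" jmp xa;
   <xa>\" jmp 0;
   <xa>[^\"]* echo;putchar(10);
   .|\n
   %%
   int main(){ yylex();exit(0);}
Compile and run as before

    flex -8iCrf 5.l
    cc  -std=c89 -Wall -pedantic -I$HOME -pipe lex.yy.c -static -o yy5
    yy5 < 1.htm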


Problem 3 - Extract totalcount value from <span> tag in Craigslist job pages

Create a file called 3.l containing

    int fileno(FILE *);
    #define jmp (yy_start) = 1 + 2 *
   %s xa xb xc
   %option noyywrap noinput nounput
   %%
   \<ul\40id=\"jjj0\" jmp xa;
   <xa>"</ul>" yyterminate();
   <xa><a\40href=\" jmp xb;
   <xb>\" putchar(10);jmp xa;
   <xb>[^\"]* fprintf(stdout,"%s%s","https://newyork.craigslist.org",yytext);
   .|\n
   %%
   int main(){ yylex();exit(0);}
Compile

   flex -8iCrf 3.l
   cc  -std=c89 -Wall -pedantic -I$HOME -pipe lex.yy.c -static -o yy3 
yy3 extracts and prints the URLs for the job pages.

Create a file called 4.l containing

    int fileno(FILE *);
    #define jmp (yy_start) = 1 + 2 *
    #define echo do{if(fwrite(yytext,(size_t)yyleng,1,yyout)){}}while(0)
   %s xa xb xc xd xe
   %option noyywrap noinput nounput
   %%
   \<h1\40class=\"cattitle\" jmp xa;
   <xa>\<a\40href jmp xb;
   <xb>\"\> jmp xc;
   <xc>[^<]* fprintf(stdout,"%s ",yytext);jmp xd;
   <xd>\<span\40class=\"totalcount\"\> jmp xe;
   <xe>\< jmp 0;
   <xe>[0-9]* echo;putchar(10);
   .|\n
   %%
   int main(){ yylex();exit(0);}
Compile

   flex -8iCrf 4.l
   cc  -std=c89 -Wall -pedantic -I$HOME -pipe lex.yy.c -static -o yy4 
yy4 extracts and prints the job category name and totalcount.

We can either solve this in steps where we create files, or we can do it as a single pipeline. I personally find that breaking a problem into discrete steps is easier.

In steps

    echo http://newyork.craigslist.org|yy025|nc -vv proxy 80|yy045 > 1.htm;
    ka;yy3 < 1.htm|yy025|nc -vv proxy 80|yy045 > 2.htm;ka-;
    yy4 < 2.htm; 
As a single pipeline

    echo http://newyork.craigslist.org|yy025|nc -vv proxy 80|yy045|yy3|(ka;yy025)|nc -vv proxy 80|yy045|yy4;ka-
Shortened further by using a shell script called nc0 for the yy025|nc|yy045 sequence

    echo https://newyork.craigslist.org|nc0|yy3|(ka;nc0)|yy4
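A minimal nc0 can be nothing more than the stated sequence wrapped in a script (a sketch; a real version might also handle errors):

    #! /bin/sh
    yy025|nc -vv proxy 80|yy045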
Thanks to yy025, we are using HTTP/1.1 pipelining. This is a feature of HTTP that almost 100% of httpds support (I cannot name one that doesn't), yet neither "modern" browsers nor cURL can take advantage of it. Multiple HTTP requests are made over a single TCP connection. Unlike the Python tutorial in the video, we are not "hammering" a server with multiple TCP connections at the same time, nor are we making a number of successive TCP connections that could "trigger a block". We are following the guidance of the RFCs, which historically recommended that clients not open many connections to the same host at the same time. Here we open only one for retrieving all the job pages. Adding a delay between requests is unnecessary. We allow the server to return the results at its own pace. For most websites, this is remarkably fast. Craigslist is an anomaly and is rather slow.
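
The mechanism itself is easy to demonstrate with stock tools. Here example.com is only a stand-in, both requests are written before any response is read, and some nc variants may need an option such as -q to linger for the replies:

    { printf 'GET / HTTP/1.1\r\nHost: example.com\r\n\r\n';
      printf 'GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n'; } | nc example.com 80

Both responses come back, in order, over the single TCP connection.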

What are ka and ka-? yy025 sets HTTP headers according to environment variables. For example, the value of Connection is set to "close" by default. To change it,

    Connection=keep-alive yy025 < url-list|nc -vv proxy 80 >0.htm
Another way is to use aliases

    alias ka="export Connection=keep-alive;set|sed -n /^Connection/p";
    alias ka-="export Connection=close;set|sed -n /^Connection/p";
    ka;yy025 < url-list|nc -vv proxy 80 >0.htm;ka-
yy025 is intended to be used with djb's envdir. Custom sets of headers can thus be defined in a directory.
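For example, with daemontools' envdir (the directory name hdrs here is only an illustration; envdir sets one variable per file, named after the file, valued from its first line):

    mkdir hdrs
    echo keep-alive > hdrs/Connection
    envdir hdrs yy025 < url-list|nc -vv proxy 80|yy045 > 0.htm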

This solution uses fewer resources, both on the client side and on the server side, than a Python approach. It is probably faster, too.


Hi! This is a guide that I started during the pandemic but never quite finished. I’m in the process of re-writing/re-recording some parts of it to bring it back up to date, and adding in the bits that are still missing.


Thank you! I will definitely take a look at that - looks great.

