Felk's comments | Hacker News

I see that the author took a 'heuristic' approach for retrying tasks (having a predetermined amount of time a task is expected to take, and considering it failed if it wasn't updated in time) and uses SQS. If the solution is homemade anyway, I can only recommend leveraging your database's transactionality for this, which is a common pattern I have often seen recommended and also successfully used myself:

- At processing start, open a new transaction, mark the schedule entry as 'executing', and lock it while skipping tasks that are already locked (`SELECT ... FOR UPDATE SKIP LOCKED`).

- At the end of processing, set it to 'completed' and commit. This also releases the lock.

This has the following nice characteristics:

- You can have parallel processors polling tasks directly from the database without another queueing mechanism like SQS, and have no risk of them picking the same task.

- If you find an unlocked task in 'executing', you know for sure that the processor died. No heuristic needed.
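The two steps above can be sketched in SQL (Postgres syntax; the `tasks` table and its column names are illustrative, not from the original post):

```sql
BEGIN;

-- Claim one pending task; rows locked by other workers are skipped entirely
SELECT id FROM tasks
 WHERE status = 'pending'
 ORDER BY scheduled_at
 LIMIT 1
 FOR UPDATE SKIP LOCKED;

UPDATE tasks SET status = 'executing' WHERE id = <claimed id>;

-- ... do the actual work while the transaction (and thus the row lock) is held ...

UPDATE tasks SET status = 'completed' WHERE id = <claimed id>;
COMMIT;  -- releases the lock
```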


This introduces long-running transactions, which at least in Postgres should be avoided.


Depends what else you’re running on it; it’s a little expensive, but not prohibitively so.


Long-running transactions interfere with vacuuming and increase lock contention. Everything depends on your workload, but a long-running transaction holding an important lock is an easy way to bring down production.


If the system is already using SQS, DynamoDB has this locking library, which is lighter weight for this use case:

https://github.com/awslabs/amazon-dynamodb-lock-client

> The AmazonDynamoDBLockClient is a general purpose distributed locking library built on top of DynamoDB. It supports both coarse-grained and fine-grained locking.


I read too many "use Postgres as your queue (pgkitchensink is in beta)", now I'm learning listen/notify is a strain, and so are long transactions. Is there a happy medium?


Just stop worrying and use it. If and when you actually bump into the limitations, then it's time to sit down and think and find a supplement or replacement for the offending part.


Excellent advice across many domains/techs here.


t1: select for update where status=pending, set status=processing

t2: update, set status=completed|error

these are two independent, very short transactions? or am i misunderstanding something here?

--

edit:

i think i'm not seeing what the 'transaction at start of processor' logic is; i'm thinking more of a polling logic

    while true:
      r := select for update
      if r is None:
        sleep a bit
        continue
      process r
this obviously has the drawback of knowing how long to sleep for; and tasks not getting "instantly" picked up, but eh, tradeoffs.


Your version makes sense. I understood the OP's approach as being different.

Two short transactions (very short, if indexed properly) at start and end are a good solution. One caveat is that the worker can die after t1 but before t2 - hence jobs need a timeout concept and should be idempotent for safe retrying.

This gets you "at least once" processing.

> this obviously has the drawback of knowing how long to sleep for; and tasks not getting "instantly" picked up, but eh, tradeoffs.

Right. I've had success with exponential backoff sleep. In a busy system, that means sleeps remain either zero or very short.
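For illustration, one shape such a backoff can take (all names and values here are made up, not from any particular library):

```python
import itertools

def backoff_delays(base=0.05, factor=2.0, cap=5.0):
    """Yield exponentially growing poll delays, capped at `cap` seconds.

    A worker recreates the generator whenever it finds a job, so a busy
    system effectively polls with near-zero delay, while an idle one
    backs off to `cap` seconds between polls.
    """
    delay = base
    while True:
        yield delay
        delay = min(delay * factor, cap)

# First eight delays: 0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 5.0
print(list(itertools.islice(backoff_delays(), 8)))
```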

Another solution is Postgres LISTEN/NOTIFY: workers listen for events and PG wakes them up. On the happy path, this gets instant job pickup. This should be allowed to fail open and understood as a happy path optimization.

As delivery can fail, this gets you "at most once" processing (which is why this approach by itself is not enough to drive a persistent job queue).
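For reference, the mechanism itself is tiny (Postgres syntax; the channel name is illustrative):

```sql
-- In each worker's session: subscribe to a channel
LISTEN job_queue;

-- In the producer, after committing the INSERT of a new job:
NOTIFY job_queue, 'job ready';
-- equivalently: SELECT pg_notify('job_queue', 'job ready');
```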

A caveat with LISTEN/NOTIFY is that it doesn't scale due to locking [1].

[1]: https://www.recall.ai/blog/postgres-listen-notify-does-not-s...


What are your thoughts on using Redis Streams, or using a table instead of LISTEN/NOTIFY (either a table per topic, or a table with a compound primary key that includes a topic - possibly a temporary table)?


I've not used Redis Streams, but it might work. I've seen folks advise against PG, in favor of Redis for job queues.

> using a table instead of LISTEN/NOTIFY

What do you mean? The job queue is backed by a PG table. You could optionally layer LISTEN/NOTIFY on top.

I've had success with a table with compound, even natural primary keys, yes. Think "(topic, user_id)". The idea is to allow for PARTITION BY should the physical tables become prohibitively large. The downsides of PARTITION BY don't apply for this use case, the upsides do (in theory - I've not actually executed on this bit!).

Per "topic", there's a set of workers which can run under different settings (e.g. number of workers to allow horizontal scaling - under k8s, this can be automatic via HorizontalPodAutoscaler and dispatching on queue depth!).


They're proposing doing it in one transaction as a heartbeat.

> - If you find an unlocked task in 'executing', you know the processor died for sure. No heuristic needed


Yes, and that cannot work: if a task is unlocked but in 'executing' state, how was it unlocked but its state not updated?

If a worker/processor dies abruptly, it will neither unlock nor set the state appropriately. It won't have the opportunity. Conceptually, this failure mode can always occur (think, power loss).

If such a disruption happened, yet you later find tasks unlocked, they must have been unlocked by another system. Perhaps Postgres itself, with a killer daemon to kill long-running transactions/locks. At which point we are back to square one: the job scheduling should be robust against this in the first place.


You don't have to keep a transaction open. What I do is:

1. Select next job

2. Update status to executing where jobId = thatJob and status is pending

3. If previous affected 0 rows, you didn't get the job, go back to select next job

If "time to select" <<< "time to do", this works great. But if the two are closer together, you can see how this is mostly going to cause contention, and you shouldn't do it.
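The claim step works with any database that reports affected rows; a minimal sketch with sqlite3 (table name and statuses are illustrative):

```python
import sqlite3

# The UPDATE only succeeds if the job is still 'pending', so among
# concurrent workers exactly one wins; the rest see rowcount == 0
# and move on to the next job.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO jobs (id, status) VALUES (1, 'pending')")

def try_claim(conn, job_id):
    cur = conn.execute(
        "UPDATE jobs SET status = 'executing' "
        "WHERE id = ? AND status = 'pending'",
        (job_id,),
    )
    return cur.rowcount == 1  # 0 rows means another worker got it first

first = try_claim(conn, 1)   # wins the job
second = try_claim(conn, 1)  # loses: status is no longer 'pending'
```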


This is exactly what we're doing. Works like a charm.


Funny seeing this here now, as I _just_ finished archiving an old MyBB PHP forum. Though I used `wget` and it took 2 weeks and 260GB of uncompressed disk space (12GB compressed with zstd), and the process was not interruptible and I had to start over each time my hard drive got full. Maybe I should have given HTTrack a shot to see how it compares.

If anyone wants to know the specifics of how I used wget, I wrote it down here: https://github.com/SpeedcubeDE/speedcube.de-forum-archive

Also, if anyone has experience archiving similar websites with HTTrack and maybe know how it compares to wget for my use case, I'd love to hear about it!


I've tried both in order to archive EOL websites, and I've had better luck with wget; it seems to recognize more links/resources and do a better job, so it was probably not a bad choice.


Conversely, httrack was the only tool that could archive the JS-heavy microsite my realtor made to sell our old house. The command-line interface is horrendous, but it does handle rewriting complex sites better than wget does.


> it took 2 weeks and 260GB of uncompressed disk space

Is most of that data because of there being like a zillion different views and sortings of the same posts? That’s been the main difficulty for me when wanting to crawl some sites. There’s like an infinite number of permutations of URLs with different parameters because every page has a bunch of different link with auto-generated URL parameters for various things, that results in often retrieving the same data over and over and over again throughout an attempted crawl. And sometimes URL parameters are needed and sometimes not so it’s not like you can just strip all URL parameters either.

So then you start adding things to your crawler like, starting with shortest URLs first, and then maybe you make it so whenever you pick the next URL to visit it will take one that is most different from what you’ve seen so far. And after that you start adding super specific rules for different paths of a specific site.


The slowdown wasn't due to a lot of permutations, but mostly because a) wget just takes a considerable amount of time to process large HTML files with lots of links, and b) MyBB has a "threaded mode", where each post of a thread gets a dedicated page with links to all other posts of that thread. The largest thread had around 16k posts, so that's 16k² URLs to parse.

In terms of possible permutations, MyBB is pretty tame, thankfully. Only the forums are sortable, and posts only have the regular and the aforementioned threaded mode to view them. Even the calendar widget only goes from 1901-2030, otherwise wget might have crawled forever.

I originally considered excluding threaded mode using wget's `--reject-regex`, and then just adding an nginx rule later to redirect any incoming such links to the normal view mode. Basically just saying "fuck it, you only get this version". That might be worth a try for your case.


Is there a friendly way to do this? I'd feel bad burning through hundreds of gigabytes of bandwidth for a non-corporate site. Would a database snapshot be as useful?


MyBB PHP forums have a web interface through which one can download the database as a single .sql file. It will most likely be a mess, depending on the addons that were installed on the forum.


Downloading a DB dump and crawling locally is possible, but it had two gnarly showstoppers for me using wget: the forum's posts often link to other posts, and those links are absolute. Getting wget to crawl those links through localhost is hardly easy (a local reverse proxy with content rewriting?). Second, the forum and its server were really unmaintained. I didn't want to spend a lot of time replicating it locally, and just archived it as-is while it was still barely running.


If you want to customize the scraping, there's the scrapy Python framework. You would still need to download the HTML, though.


Isn't bandwidth mostly dirt cheap/free these days?


It's inexpensive, but sometimes not free. For example, Google Cloud Hosting is $0.14 / GB so 260 GB would be around $36.


It's essentially free on non-extortionate hosts. Use Hetzner + Cloudflare and you'll essentially never pay for bandwidth.


wget2 has an option for parallel downloading: https://github.com/rockdaboot/wget2


Thankfully, some service providers noticed the awful convenience and are trying to differentiate themselves by offering a better service. For example, mop.la lets you buy the ticket via credit/debit card, gives you a digital ticket, and lets you pause or cancel the ticket a day before the next month. (I'm not affiliated with them.)


You can use the Home key to skip cutscenes. Click on "show help" for more alternate controls


Can someone explain to me what the main differences to jq are, besides the syntax?


jq seems to have more focus on the generator and pipe abstractions. In jq you say "foo | map(bar)"; foo and map(bar) are both generators, and bar refers to each element of foo as ".". Here you say "for $x in foo return bar"; foo and bar are both JSON objects, and bar refers to each element of foo as "$x", so the iteration is more explicit.

Likewise, compare "sum($element.response_time)" with "map(.response_time) | add" in jq. Processing in JSONiq goes inside to outside while jq goes left to right.


My first thought also - would be a good entry for a FAQ or blog post.


jq is XPath; this looks to be XQuery. In fact, it specifically works as an XQuery embed.


You can write the example below in jq as

    def avg: add / length;
    group_by(.url) | map({
      "url": .[0].url,
      "hits": length, 
      "avg": map(.response_time) | avg
    })
so jq should be (at least roughly) as powerful as JSONiq.


... although it seems neither JSONiq nor jq contains a "parent" operator, as far as I can tell.


This might be too restricting regarding the storage.

But we have a function: https://github.com/sirixdb/brackit


Not that not having it makes either of them any less powerful. If you descend into an inner context, you can refer to the parent/ancestor via a variable you set beforehand.


For one thing, modifying data I guess.


What do you mean? jq can modify data:

  $ jq '.foo += 1' <<< '{"foo": 2}'
  {
    "foo": 3
  }


looks nothing like jq

  let $stats := collection("stats")
  for $access in $stats
  group by $url := $access.url
  return
  {
    "url": $url,
    "avg": avg($access.response_time),
    "hits": count($access)
  }


> besides the syntax


isn't that question like "cinema besides the movies"?


the user experience. jq is often a one-liner, terse and expressive. This JSONiq language looks almost like a scripting language, requiring multiple lines to write an expression.


I have multiple 30+ line jq scripts in my current project. So "often a one liner" is true, but it is not a requirement, so I'm still not sure why I'd use this instead.


no, language is not just syntax.


Nor is the cinema just movies.


so if cinema isn't just movies then what's the problem with asking about the difference in "cinema besides the movies"?


it's senseless besides nit-picking.


that's just like your opinion man.


> JSONiq borrows a large numbers of ideas from XQuery

So basically grep or even SQL -> XQuery. No thank you!


Wow, that looks extremely useful. I typically use docopt for all my CLI needs, but it looks like this could be really nice as well. I will have to try it the next time I'm building a CLI.


It is. As far as I'm aware, issues like these are only problematic if you either manually run a workflow (it uses your credentials) or have a workflow with the "pull_request_target" trigger (which uses a token with write access). The latter has a plethora of potential pitfalls and should be avoided if you can.


Indeed, pull_request_target should be avoided.

The better model to use here is "pull_request" to do the work of building/testing a PR, and then a separate workflow that triggers on "workflow_run" to collect the results and attach them to the PR.

It's really not a lot of fun to implement though :/
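A sketch of that two-workflow split (all names here are illustrative; the fiddly part, fetching the first run's artifact from the trusted workflow, is omitted):

```yaml
# ci.yml - untrusted half: runs the PR's code with no secrets,
# publishes results as an artifact
name: ci
on: pull_request
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./build-and-test.sh   # illustrative
      - uses: actions/upload-artifact@v4
        with:
          name: results
          path: results/
---
# comment.yml - trusted half: runs in the base repo's context once
# the "ci" run finishes, downloads its artifact, attaches results
name: comment-results
on:
  workflow_run:
    workflows: ["ci"]
    types: [completed]
```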


GitHub badly needs to add an abstraction for passing an artifact between workflows. The official recommendation for how to use workflow_run is comically messy (20+ lines of JavaScript-in-YAML, because actions/download-artifact doesn't support fetching artifacts across repos):

https://securitylab.github.com/research/github-actions-preve...

Kinda hard to expect average users to grok this; running a follow-up workflow in a secure context with some carried-over artifacts should be trivial to do declaratively.


I wonder if GH could/should make it a lot more convenient to implement with some additional abstractions, to encourage the secure approach by making it as easy as the insecure one.


I believe you are confusing dark net with deep net


Evidently so, although one is a subset of the other. I think saying "it requires special software or authentication" merely adds to the confusion, here.

For example, a site may require tor, or it may require a VPN connection to the same network the site lives on - is there a functional difference? And gating content behind authentication would be a good definition for "deep web" too.

However i can see the appeal of having "dark web" or "dark net" signify illicit things, but we also have "dark fiber", so something will have to give.


Yes. Here's an excerpt from their documentation on <https://docs.github.com/en/github/authenticating-to-github/m...>:

> GitHub will automatically use GPG to sign commits you make using the GitHub web interface


It's even worse: if somebody rebase-merges a pull request that you authored (thereby creating a new commit that you did not author), GitHub will show you as the author (without a separate committer, like it normally does when author and committer differ) and put "verified" next to it. That usually means they verified it was signed by your GPG key, but in this case it means the commit was created by GitHub.

https://twitter.com/vmulps/status/1386717970458677250


Says it signs the commit with its own key. I guess you have to trust GitHub.


Well, yes. The question was whether you can sign _on GitHub_, so your private key has to be available to GitHub. You can always sign locally if you don't trust GitHub.


What else would they be signing with? They don’t have your key obviously


Well that was my point - I wonder why we haven't set up a system that lets me sign the merge commit. Otherwise it's a commit purported to be authored by me but when you look it's actually signed by someone else.


> Please delete user 01FEY4XQ988G9BYFH45JKQZF5R .. .

It's one thing to make a valid complaint about there not being an automated account deletion feature yet, requiring you to send an email. And even then it's not that bad, since it's literally a "mailto:contact@revolt.chat?Subject=Delete my account" link. It's another thing to throw a temper tantrum about it: https://imgur.com/U2DJwm7

Please don't be the guy who's technically right but an ass about it.


Not being able to delete your own account yourself is a valid complaint. Having to send an email is 100% not acceptable.


They could literally create a #DeleteAccount room on the Revolt "server", or a specific user/bot that could handle deletion requests, instead of using emails. (Revolt is literally a communication platform!)

I created my account from an email address that could not send emails. (Unless I manually create an alias.)

They told me: "Well, did you sign up just to delete your account afterward? You could look at the screenshot on the homepage."

I just wanted to test the platform, I don't need a specific reason to have my account deleted.

It reminds me of these businesses where you can subscribe through the internet but need to call them over the phone to cancel the subscription.

It was simpler to create a curl loop to get my account deleted than sending an email.

