
Notes on sed regex and bash: HTML DOCTYPE, html and meta tags

Using bash tools (sed, awk, grep, vim, ...) to edit multiple files. Case study: the HTML tags <!DOCTYPE>, <html>, <meta>, charset, ...

https://unix.stackexchange.com/questions/26284/how-can-i-use-sed-to-replace-a-multi-line-string
Sed cheatsheet
https://gist.github.com/asenchi/2291903

HTML meta tags (recommended way?)
https://www.quackit.com/html_5/tags/html_meta_tag.cfm


Example
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml " xml:lang="ja" lang="ja">
<head>
<meta http-equiv="Content-Type" content="text/html" charset="EUC-JP" />
<meta http-equiv="Content-Script-Type" content="text/javascript" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta name="description" content="content here" />
<title>hoge</title>
</head>
<body>
</body>
</html>

Expected:

<!DOCTYPE html>
<html lang="ja">
<head>
<meta charset="EUC-JP">
<title>hoge</title>
<meta name="description" content="content here">
</head>
<body>
</body>
</html>

1. <!DOCTYPE html ...>: my idea is to delete everything on the line after the match "DOCTYPE html" and append the closing ">".
$ sed -i 's/DOCTYPE html.*/DOCTYPE html>/' filename

For filename, I use a simple way: find . -name "*.html" | xargs grep -li "DOCTYPE" to get the list of HTML files that contain DOCTYPE and need editing.
Then I use column editing (Vim or Sublime Text) to add the prefix sed ... to each file in the list.
I think a "one line" command (pipes, chaining, etc.) could do the same, but working it out seemed to take time.
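
A possible one-line version is sketched below. It assumes GNU grep, xargs and sed; the temporary directory and the file names a.html and b.html are made up for the demo.

```shell
# Sandbox with two sample files (hypothetical names, demo only).
workdir=$(mktemp -d)
printf '%s\n' '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">' '<html>' > "$workdir/a.html"
printf '%s\n' '<html>' '<body></body>' > "$workdir/b.html"

# grep: -r recurse, -l print only file names, -Z end each name with NUL.
# xargs: -0 reads NUL-separated names, -r skips sed when nothing matched.
grep -rlZ --include='*.html' 'DOCTYPE html' "$workdir" |
  xargs -0 -r sed -i 's/DOCTYPE html.*/DOCTYPE html>/'

head -n 1 "$workdir/a.html"
```

Only files that actually contain "DOCTYPE html" are passed to sed, so b.html is left untouched.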

Special case:

-<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

=> +<!DOCTYPE html>
Because the tag spans two lines, the sed regex has an issue here. Fortunately there are only a few of them, so I could fix them manually. As with many other automation tasks, there are always exceptions, so be careful to double-check the input and output.
And sed, awk and regex end up being used many times. I also mentioned this in a previous post (should be linked here).
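
The two-line case could probably be automated as well. The sketch below assumes GNU sed and that the DOCTYPE is split over exactly two lines; it is an illustration, not the command used above.

```shell
input='<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="ja">'

# If a DOCTYPE line has no closing ">" yet, N appends the next line to the
# pattern space; [^>]* can then match across the embedded newline.
fixed=$(printf '%s\n' "$input" |
  sed '/<!DOCTYPE html[^>]*$/{N;s/<!DOCTYPE html[^>]*>/<!DOCTYPE html>/;}')
printf '%s\n' "$fixed"
```

A DOCTYPE that already closes on its own line does not match the address, so it still needs the single-line command.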


2. <html xmlns="http://www.w3.org/1999/xhtml " xml:lang="ja" lang="ja"> =>
<html lang="ja">

In this case, similar to the previous one, I use sed to remove the part of the line in front of the match (everything before lang=...). I could not find a good working answer on the Internet, so I worked around it:
- Append the keyword 'delete-me' (it should be unique) to the first part of the line, before the match. We don't need this part and can remove it later.
- Prepend <html to the second part (starting at the matched words) and keep the tail of the line.

$ sed -i 's/ lang=/delete-me\n<html&/g'  filename.html

& here stands for the matched text, i.e. the keyword ' lang='.
Then run sed again to remove the temporary lines:
$ sed -i '/delete-me/d'  filename.html
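
To see what the two passes do, here is a small sketch on a single sample line (GNU sed assumed, because of \n in the replacement). The last command is a one-pass alternative, shown only as an assumption of how the same result could be reached.

```shell
line='<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ja" lang="ja">'

# Pass 1: '&' is replaced by the matched text ' lang=', so the line is
# split in two right before ' lang=' and the first half gets marked.
split=$(printf '%s\n' "$line" | sed 's/ lang=/delete-me\n<html&/')
printf '%s\n' "$split"

# Pass 2: drop the marked first half.
two_step=$(printf '%s\n' "$split" | sed '/delete-me/d')
printf '%s\n' "$two_step"

# One-pass alternative: swallow every attribute in front of ' lang='.
one_pass=$(printf '%s\n' "$line" | sed 's/<html[^>]* lang=/<html lang=/')
printf '%s\n' "$one_pass"
```

Note that xml:lang= is not touched by the keyword ' lang=', because it is preceded by a colon rather than a space.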

3. Removing the CSS and JS meta tags is straightforward
$ sed -i '/meta http-equiv="Content-Style/d' filename

Replace Content-Style with Content-Script to remove the JS meta tag.
Note the keyword 'meta http-equiv': without 'meta', a keyword like "Content-Style" alone could also hit a <script> tag by mistake...
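
Both meta tags could also be removed in one pass. A minimal sketch, assuming a sed that supports -E (extended regex); the sample lines, including app.js, are made up:

```shell
sample='<meta http-equiv="Content-Script-Type" content="text/javascript" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta name="description" content="content here" />
<script type="text/javascript" src="app.js"></script>'

# -E enables extended regex, so (Style|Script) needs no backslashes.
cleaned=$(printf '%s\n' "$sample" |
  sed -E '/<meta http-equiv="Content-(Style|Script)-Type"/d')
printf '%s\n' "$cleaned"
```

The description meta tag and the <script> line survive because the pattern is anchored on 'meta http-equiv'.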

4. <meta http-equiv="Content-Type" content="text/html" charset="EUC-JP" />
or
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />  (most common case).

=> <meta charset="EUC-JP">

This case is similar to the <html lang="ja"> conversion above.

$  sed -i 's/; charset=UTF/delete-me&\n<meta charset=\"UTF/g' filename

Because the keyword "charset=" could also appear in <script> or another tag:
<script ... charset="...">, we have to include '; ' in the keyword.

The most common case is content="...; charset=...".
I think I made a small mistake here: UTF is not needed in the keyword. It ended up there because the command I first copied from another tutorial placed '&' at the end of the replacement ('&/g'). In sed, '&' stands for the matched text, so its position decides whether the new text lands before or after the keyword.
So I will try to run it without UTF later. (But I think I am too lazy to try again :)

Anyway, it works.
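
For reference, a variant without UTF in the keyword might look like the sketch below (GNU sed assumed). It also drops '&', since the matched text ends up on the line that is deleted anyway. This is an untested-on-real-files assumption, not the command that was actually run.

```shell
sample='<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=EUC-JP" />'

# Pass 1: cut each line before "; charset=" and start a new
# <meta charset=" line with whatever follows the keyword.
# Pass 2: delete the marked first halves.
result=$(printf '%s\n' "$sample" |
  sed 's/; charset=/delete-me\n<meta charset="/' |
  sed '/delete-me/d')
printf '%s\n' "$result"
```

The closing ' />' of the original tag is kept as-is, so the output is <meta charset="UTF-8" /> rather than <meta charset="UTF-8">.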

A second sed pass is needed.
$ sed -i 's/; charset=\"/delete-me&\n<meta charset=/g'  file2edit.html

This time it applies to all remaining "; charset=" matches. Then run sed again to remove every temporary line that contains 'delete-me'.


At first I thought a single sed would be enough for this case, applied to the keyword '; charset=' or simply ' charset=' if we ignore <script> tags, since those appear only a few times and can be edited manually.
But the real reason the second sed run is required is a case like this:
<meta http-equiv="Content-Type" content="text/html" charset="EUC-JP" />

As you can see, there is no content="...; charset=..." here. Focus on the '; ' characters: they are missing, yet we still have to remove the whole part of the line in front of the keyword. So the second sed run is needed.
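
Both variants of the Content-Type meta tag could probably be handled by one extended-regex substitution. The sketch below assumes a sed with -E, that the whole tag sits on one line, and that charset names only use letters, digits, '_' and '-'; the <script> line and app.js are made up to show that other tags are left alone.

```shell
sample='<meta http-equiv="Content-Type" content="text/html" charset="EUC-JP" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<script src="app.js" charset="UTF-8"></script>'

# Capture the charset name whether or not it is wrapped in its own quotes,
# then rebuild the whole tag in the short HTML5 form.
converted=$(printf '%s\n' "$sample" |
  sed -E 's/<meta http-equiv="Content-Type"[^>]*charset="?([A-Za-z0-9_-]+)"?[^>]*>/<meta charset="\1">/')
printf '%s\n' "$converted"
```

Because the pattern starts at <meta http-equiv="Content-Type", no temporary 'delete-me' line is needed in this variant.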



Sed change multiple files
https://stackoverflow.com/questions/10445934/change-multiple-files

Diff two text files and keep only the lines that differ. The idea is to sort both files first and then compare them.
https://stackoverflow.com/questions/10708300/compare-two-files-ignoring-order
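
A minimal sketch of the sort-then-compare idea, using comm (the file names old.txt and new.txt are made up for the demo):

```shell
diffdir=$(mktemp -d)
printf '%s\n' b a c > "$diffdir/old.txt"
printf '%s\n' c a d > "$diffdir/new.txt"

# comm expects sorted input, so sort both files first.
sort "$diffdir/old.txt" > "$diffdir/old.sorted"
sort "$diffdir/new.txt" > "$diffdir/new.sorted"

# comm -3 hides lines common to both files; lines found only in the
# second file are printed with a leading tab.
only_diff=$(comm -3 "$diffdir/old.sorted" "$diffdir/new.sorted")
printf '%s\n' "$only_diff"
```

Here "b" exists only in old.txt and "d" only in new.txt, so these are the two lines printed.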
